# Automatic Meaning Discovery Using Google On Lossy Compression Paul Vitanyi CWI, University of Amsterdam, National ICT Australia Joint work with Kolya Vereshchagin You can import music in a variety of formats, such as MP3 or AAC, and at whatever quality level youd prefer. Lossy Compression You can even choose the new Apple Lossless

encoder. Music encoded with that option offers sound quality indistinguishable from the original CDs at about half the file size of the original. Lossless Compression Lossy Compression drives the Web Pictures: JPEG Sound: MP3 Video: MPEG

Majority of Web transfers are Lossy Compressed Data--- http traffic was exceeded by peer-to-peer music and video sharing in 2002. Lena Compressed by JPEG Original Lena Image (256 x 256 Pixels, 24-Bit RGB) JPEG Compressed (Compression Ratio 43:1) JPEG2000 Compressed (Compression Ratio 43:1) As can be seen from the comparison images above, at compression ratios above 40:1 the JPEG algorithm begins to lose

, its effectiveness while the JPEG2000 compressed image shows very little distortion. Rate Distortion Theory Underlies Lossy Compression Claude Elwood Shannon, 1948 & 1959, Defines Rate Distortion (With learning mouse Theseus in the picture) Rate Distortion X

is a set of source words Y is a set of code words If |X| < |Y|, then no code is faithfull Distortion Distortion Choose a distortion measure d: X Y Real Numbers X Source words x

Distortion = d(x,y) coding Y Code words y Distortion = fidelity of the coded version versus the source data. Example Distortion Measures List Distortion for bit rate R:

x x coding y Hamming Source word x is a finite binary string; Code word y is a finite set of source words containing x, and y is described in R bits. Distortion d(x,y) = log |y| (rounded up to integer value)

Distortion for bit rate R : x= y= Bit flips y can be described in R bits coding

Source word x and code word y are binary strings of length n. Distortion d(x,y) = number of flipped bits between x and y. Example Distortion Measures Euclidean Distortion for parameter R : X is a real number

0 1 coding 0 y is a rational number that can be described in R bits 1

Distortion d(x,y) = |x-y| Distortion-rate function Minimal distortion as function of given rate R: x_1 Random source: x_2 y_1 = c_1(x_1) x_n

y_2 = c_2(x_2) y_n = c_n(x_n) Code length | y_1 y_2 ... y_n | nR bits Distortion-rate function: D(R)= lim n min

over all code x_1x_2...x_n sequences c_1,c_2,...,c_n satisfying rate constraints n p(x_1x_2...x_n) 1/n d(x_i,y_i) i=1

Coding using a sequence of codes c_1.c_2,...,c_n from prescribed code class Rate-distortion function Minimal rate as function of maximal allowed distortion D: x_1 Random source: x_2

y_1 = c_1(x_1) x_n y_2 = c_2(x_2) Expected distortion y_n = c_n(x_n)

Coding using a sequence of codes c_1.c_2,...,c_n from prescribed code class p(x_1x_2...x_n) 1/n d(x_i,y_i) D x_1x_2...x_n Rate-distortion function: R(D)= lim n

min x_1x_2...x_n over all code sequences c_1,c_2,...,c_n satisfying distortion constraints n

p(x_1x_2...x_n) 1/n |y_i| i=1 Since D(R) is convex and nonincreasing, R(D) is its inverse. Function graphs Rate-distortion graph List distortion: R(D) = n D |x_i|=n, D = expected log-cardinality of list (set). Rate-distortion graph Hamming distortion: R(D)=n(1-H(D/n))

|x_i|=n, D= expected # bit flips Rate-distortion graph Euclidean distortion: R(D)= log 1/D x_i is a real in [0,1], D = expected distance between x_i and rational code word y_i Problems with this approach

Functions give expectations or at best rate-distortion relation for a high-probability set of typical sequences It is often assumed that the random source is stationary ergodic (to be able to determine the curve) This is fine for data that satisfy simple statistical properties, But not for complex data that satisfy global relations like images, music, video Such complex pictures are usually atypical. Just like lossless compression requires lots of tricks to be able to compress meaningful data, so does lossy compression.

There is a wealth of ad hoc theories and solutions for special application fields and problems. Can we find a general theory for lossy compression of individual data? Andrey Nikolaevich Kolmogorov (1903-1987, Tambov, Russia) Measure Theory Probability Analysis Intuitionistic Logic

Cohomology Dynamical Systems Hydrodynamics Kolmogorov complexity Background Kolmogorov complexity: Randomness of individual objects. First: A story of Dr. Samuel Johnson Dr. Beattie observed, as something remarkable which had happened to him, that he chanced to see both No.1 and No.1000 hackney-coaches. Why sir, said Johnson there is an equal chance

for ones seeing those two numbers as any other two. Boswells Life of Johnson Defining Randomness: Precursor Ideas Von Mises: A sequence is random if it has about same # of 1s and 0s, and this holds for its reasonably selected subsequences. P. LaPlace: A sequence is extraordinary (nonrandom) because it contains rare regularity.

But what is reasonable? A. Wald: Countably many selection functions A. Church: Recursive functions

J. Ville: von Mises-Wald-Church randomness no good. Kolmogorov Complexity Solomonoff (1960)-Kolmogorov (1965)-Chaitin (1969): The amount of information in a string is the size of the smallest program generating that string. K x min p : U p x p Invariance Theorem: It does not matter which

universal Turing machine U we choose. I.e. all encoding methods are ok. Kolmogorov complexity K(x)= length of shortest description of x K(x|y)=length of shortest description of x given y. A string is random if K(x) |x|. K(x)-K(x|y) is information y knows about x. Theorem (Mutual Information).

K(x)-K(x|y) = K(y)-K(y|x) Applications of Kolmogorov complexity Mathematics --- probability theory, logic. Physics --- chaos, thermodynamics. Computer Science Biology: complex systems Philosophy randomness. Information theory Todays topic.

Individual Rate-Distortion Given datum x, class of models Y={y}, distortion d(x , y): Rate-distortion function: x y r (d) = min {K(y): d(x,y) d} Distortion-rate

function: x y d (r) = min {d(x,y): K(y) r} Individual characteristics: More detail, especially for meaningful (nonrandom) Data Example list distortion: data x,y,z of length n, with K(y) = n/2, K(x)= n/3, K(z)= n/9. All >(1-1/n)2^n data strings u of complexity n- log n K(u) n +O(log n) have individual rate-distortion curves approximately coinciding with Shannons single curve . Therefore, the expected individual rate-distortion

curve coincides with Shannons single curve (up to small error). Those data are typical data that are essentially random (noise) and have no meaning. Data with meaning we may be interested in, music, text, picture, are extraordinary (rare) and have regularities expressing that meaning, And hence small Kolmogorov complexity, and rate-distortion curves differing in size and shape from Shannons. Upper bound Rate-Distortion graph

For all data x the rate-distortion function is monotonic non-increasing and: Ball of all data Cardinality ball x within distortion d of rx (dmax ) K(y0) is B_y(d) = |{x: d(x,y) d}| Set of source words X is a ball of radius d_max with center y_0

code word (`center) y. We often dont write the center if it is understood r x (d) r x(d)+ log [ B(d)/B(d)] + O(small) [all d d ] This means the funxtion r_x(d)+log B(d) is monotonic non-increasing up to fluctuations of size O(log ). For all d d such that B(d)>0, every ball of radius d in X can be covered by at most B(d)/B(d) balls of radius d Ball of radius d For this to be usefull, we require that be polynomial Covering by balls

of radius d d in nthe number of bits in data x. This is satisfied for many distortion measures B(d) B(d) B(d) B(d) Lower Bound Rate-Distortion Graph r x

(d) K(x) log B(d) + O(small) If we have the center of the ball in r x(d) bits, together with value d in O(log d) bits, then we can enumerate all B(d) elements and give the index of x in log B(d) bits. Rate-distortion functions of every shape Lemma: Let r(d)+log B(d) be monotonic non-decreasing, and r(d_max) =0. Then there is datum x such that |r(d)-r_x(d)| O(small)

That is, for very code and distortion, every function between lower bound and upper bound is realized by some datum x (up to some small error and provided the function decreases at at least the proper slope) Hamming Distortion Lemma: For n-bit strings, = O(n^4) B(d) D is Hamming distance, and radius d=D/n. There is a cover of a ball of Hamming radius d

with O(n^4) B(d)/B(d)) balls of Hamming radius d, for every d d. B(d) New result (as far as we know) of sparse covering large Hamming balls by small Hamming balls. Lemma: i) B(d) = n H(d)+O(log n) with d = D/n and H(d)= d log 1/d + (1-d) log (1-d) ; ii) d_max = with D = n/2. Every string is within n/2 bit flips of either center

00...0 or center 11...1 Hamming Distortion, Continued r_x (d): rate n Upper bound: n(1-H(d)) Every monotonic non-increasing function r(d), At K(x) rate we can Describe data x Perfectly: no distortion:

D=D/n=0 Minumum sufficient statistic K(x) Actual curve: r_x(d) with r(d)+log B(d) is monotonic non-decreasing, and r( )=0 That is, in between the lower- and upper bounds and

descending at at least the proper slope, can be realized as rate-distortion function of some datum x, with precision |r(d)-r_x(d)| O(n log n)+K(r) Lower bound: K(x)-nH(d) log n With distortion d=n/D = we only need to specify number of bits of data x

In O(log n) bits d = D/n: distortion , Theory to practice, using real compressorswith Steven de Rooij Rate Distortion 16.076816 20.000000 0000100100110011000111 16.491853 19.000000 00000100000100010100100 16.813781 18.000000 000001001100100010000101 17.813781 17.000000 0101010010001010100101000

18.076816 16.000000 00101101110111001111011101 18.299208 15.000000 001011011101110011101011100 19.299208 14.000000 0101010010001001010011010010 19.884171 13.000000 00001010010101010010100010101 20.299208 12.000000 001011010010101010010101010100 20.621136 11.000000 0010100100010101010010100010101 21.621136 10.000000 01010100100010101010010101010100 22.106563 9.000000 0010110110011010110100110110110101 23.106563 8.000000 01010110110011010110100110110110101 24.106563 7.000000 1101011001010101011010101010111010101 24.691525 6.000000 110101101010100101011010101010111010101 26.621136 5.000000 010101001010001010010101010101001010101 29.206099 4.000000 010101001010001011010100101010101110101

32.469133 3.000000 0101010010101010010101010101101010111110101 33.884171 2.000000 0101010110100010101010010101010111110101 38.130911 1.000000 01010100101000101010101010101010111110101 42.697952 0.000000 010101010101000101010101010101010111110101 Distortion Datum x Rate Mouse: Original Picture Mouse: Increasing Rate of Codes

Mouse: MDL code-length Penguin: Original (Linux) Penguin: Rate of code-lengths Euclidean Distortion Lemma: d=|x-y| (Euclidean distance between real data x and rational code y) = 2; d_max = ; r_x() =O(1); r_x(d) r_x(d)+log d/d [all 0

non-decreasing, and r( )=0, can be realized as rate-distortion function of some real x, with precision |r(d)-r_x(d)| O(log 1/d) [all 0

some string x of length n, with precision |r(d)-r_x(d)| O(log n +K(r)) [all 1

List distortion continued: Positive and negative randomness |x|=|x| K(x)=K(x) d_x(r) log |y| d_x(r) x r

K(x)=K(x) X List distortion continued: Precision of following given function d(r) d(r) Distortion log |y| d_x(r)

Rate r d Expected individual rate-distortion equals Shannons rate-distortion Lemma: Given m repetitions of an i.i.d. random variable with probability f(x) of obtaining outcome x, and f is a total recursive function (K(f) is finite), lim p(x^m) (1/m) d_x^m (mR) = D(R)

m x^m where x^m = x_1 ... x_m, and p(.) is the extension of f to m repetitions of the random variable. Algorithmic Statistics Paul Vitanyi CWI, University of Amsterdam, National ICT Australia Joint work with Kolya Vereshchagin

Kolmogorovs Structure function Non-Probabilistic Statistics Classic Statistics--Recalled Sufficient Statistic Sufficient Statistic, Contnd Kolmogorov Complexity--Revisited

Kolmogorov complexity and Shannon Information Randomness Deficiency Algorithmic Sufficient Statistic Maximum Likelihood Estimator, Best-Fit Estimator Minimum Description Length estimator, Relations between estimators

Primogeniture of ML/MDL estimators ML/MDL estimators can be approximated from above; Best-fit estimator cannot be approximated Either from above or below, up to any Precision. But the approximable ML/MDL estimators yield the best-fitting models, even though we dont know the quantity of goodnessof-fit ML/MDL estimators implicitly optimize goodness-of-fit. Positive- and Negative Randomness,

and Probabilistic Models List distortion continued Recapitulation Selected Bibliography N.K. Vereshchagin, P.M.B. Vitanyi, A theory of lossy compression of individual data, http://arxiv.org/abs/cs.IT/0411014, Submitted. P.D. Grunwald, P.M.B. Vitanyi, Shannon Information and Kolmogorov complexity, IEEE Trans. Information Theory, Submitted. N.K. Vereshchagin and P.M.B. Vitanyi, Kolmogorov's Structure functions and model selection, IEEE Trans. Inform. Theory, 50:12(200

4), 3265- 3290. P. Gacs, J. Tromp, P. Vitanyi , Algorithmic statistics, IEEE Trans. Inform. Theory, 47:6(2001), 2443-2463. Q. Gao, M. Li and P.M.B. Vitanyi , Applying MDL to learning best model granularity, Artificial Intelligence, 121:1-2( 2000), 1--29. P.M.B. Vitanyi and M. Li, Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity, IEEE Trans. Inform. Theory, IT-46:2(2000), 446--464.