Theory of Data Compression
Contents
 Introduction and Background
 Source Modeling
 Entropy Rate of a Source
 Shannon Lossless Source Coding Theorem
 Rate-Distortion Theory
 The Gap Between Theory and Practice
 FAQs (Frequently Asked Questions)
 References
I. Introduction and Background
In his 1948 paper, “A Mathematical Theory of Communication,” Claude E. Shannon formulated the theory of data compression. Shannon established that there is a fundamental limit to lossless data compression. This limit, called the entropy rate, is denoted by H. The exact value of H depends on the information source — more specifically, the statistical nature of the source. It is possible to compress the source, in a lossless manner, with compression rate close to H. It is mathematically impossible to do better than H.
Shannon also developed the theory of lossy data compression. This is better known as rate-distortion theory. In lossy data compression, the decompressed data does not have to be exactly the same as the original data. Instead, some amount of distortion, D, is tolerated. Shannon showed that, for a given source (with all its statistical properties known) and a given distortion measure, there is a function, R(D), called the rate-distortion function. The theory says that if D is the tolerable amount of distortion, then R(D) is the best possible compression rate.
When the compression is lossless (i.e., no distortion or D=0), the best possible compression rate is R(0)=H (for a finite alphabet source). In other words, the best possible lossless compression rate is the entropy rate. In this sense, rate-distortion theory is a generalization of lossless data compression theory, where we went from no distortion (D=0) to some distortion (D>0).
Lossless data compression theory and rate-distortion theory are known collectively as source coding theory. Source coding theory sets fundamental limits on the performance of all data compression algorithms. The theory, in itself, does not specify exactly how to design and implement these algorithms. It does, however, provide some hints and guidelines on how to achieve optimal performance. In the following sections, we will describe how Shannon modeled an information source in terms of a random process, we will present Shannon's lossless source coding theorem, and we will discuss Shannon's rate-distortion theory. A background in probability theory is recommended.
II. Source Modeling
Imagine that you go to the library. This library has a large selection of books — say there are 100 million books in this library. Each book in this library is very thick — say, for example, that each book has 100 million characters (or letters) in them. When you get to this library, you will, in some random manner, select a book to check out. This chosen book is the information source to be compressed. The compressed book is then stored on your zip disk to take home, or transmitted directly over the internet into your home, or whatever the case may be.
Mathematically, the book you select is denoted by

X = (X_1, X_2, X_3, ...)

where X represents the whole book, X_1 represents the first character in the book, X_2 represents the second character, and so on. Even though in reality the length of the book is finite, mathematically we assume that it has infinite length. The reasoning is that the book is so long we can just imagine that it goes on forever. Furthermore, the mathematics turn out to be surprisingly simpler if we assume an infinite-length book.

To simplify things a little, let us assume that all the characters in all the books are either a lowercase letter (`a’ through `z’) or a SPACE (e. e. cummings style of writing, shall we say). The source alphabet, A, is defined to be the set of all 27 possible values of the characters:

A = {a, b, c, ..., z, SPACE}

Now put yourself in the shoes of the engineer who designs the compression algorithm. She does not know in advance which book you will select. All she knows is that you will be selecting a book from this library. From her perspective, the characters X_1, X_2, X_3, ... in the book are random variables which take values on the alphabet A. The whole book, X, is just an infinite sequence of random variables; that is, X is a random process. There are several ways in which this engineer can model the statistical properties of the book.
 Zero-Order Model: Each character is statistically independent of all other characters, and the 27 possible values in the alphabet are equally likely to occur. If this model is accurate, then a typical opening of a book would look like this (all of these examples came directly from Shannon’s 1948 paper):

xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd qpaamkbzaacibzlhjqd

This does not look like the writing of an intelligent being. In fact, it resembles the writing of a “monkey sitting at a typewriter.”
 First-Order Model: We know that in the English language some letters occur more frequently than others. For example, the letters `a’ and `e’ are more common than `q’ and `z’. Thus, in this model, the characters are still independent of one another, but the probability distribution of each character follows the first-order statistical distribution of English text. A typical text for this model looks like this:

ocroh hli rgwr nmielwis eu ll nbnesebya th eei alhenhttpa oobttva nah brl
 Second-Order Model: The previous two models assumed statistical independence from one character to the next. This does not accurately reflect the nature of the English language. For exa#ple, some letters in thi# sent#nce are missi#g. However, we are still able to figure out what those letters should have been by looking at the context. This implies that there is some dependency between the characters. Naturally, characters which are in close proximity are more dependent than those that are far from each other. In this model, the present character X_n depends on the previous character X_{n-1}, but it is conditionally independent of all earlier characters X_1, ..., X_{n-2}. According to this model, the probability distribution of the present character varies according to what the previous character is. For example, the letter `u’ rarely occurs (probability=0.022). However, given that the previous character is `q’, the probability of a `u’ in the present character is much higher (probability=0.995). For a complete description, see the second-order statistical distribution of English text. A typical text for this model would look like this:

on ie antsoutinys are t inctore st be s deamy achin d ilonasive tucoowe at teasonare fuso tizin andy tobe seace ctisbe
 Third-Order Model: This is an extension of the previous model. Here, the present character X_n depends on the previous two characters X_{n-2} and X_{n-1}, but it is conditionally independent of all characters before those. In this model, the distribution of X_n varies according to what X_{n-2} and X_{n-1} are. See the third-order statistical distribution of English text. A typical text for this model would look like this:

in no ist lat whey cratict froure birs grocid pondenome of demonstures of the reptagin is regoactiona of cre

The resemblance to ordinary English text increases quite noticeably at each of the above steps.
 General Model: In this model, the book is an arbitrary stationary random process. The statistical properties of this model are too complex to be deemed practical. This model is interesting only from a theoretical point of view.
Each model above is a special case of the one that follows it: the zero-order model is a special case of the first-order model, the first-order model is a special case of the second-order model, the second-order model is a special case of the third-order model, and the third-order model is a special case of the general model.
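These character models are easy to experiment with. The sketch below (an illustration, not from the original text) samples from the zero-order model, and from a first-order model driven by an arbitrary, made-up letter distribution; the zero-order output is exactly the sort of "monkey at a typewriter" gibberish shown above.

```python
import random
import string

# The 27-letter alphabet from the text: 'a'-'z' plus SPACE.
ALPHABET = list(string.ascii_lowercase) + [" "]

def zero_order_sample(n, rng):
    """Zero-order model: characters independent, all 27 values equally likely."""
    return "".join(rng.choice(ALPHABET) for _ in range(n))

def first_order_sample(n, weights, rng):
    """First-order model: characters independent, drawn from a given
    single-letter distribution (weights need not be normalized)."""
    return "".join(rng.choices(ALPHABET, weights=weights, k=n))

rng = random.Random(42)
print(zero_order_sample(60, rng))   # uniform gibberish
```

A second- or third-order sampler would work the same way, except that the weights passed to the sampler would depend on the previous one or two characters.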
III. Entropy Rate of a Source
The entropy rate of a source is a number which depends only on the statistical nature of the source. If the source has a simple model, then this number can be easily calculated. Here, we consider an arbitrary source

X = (X_1, X_2, X_3, ...)

while paying special attention to the case where X is English text.
 Zero-Order Model: The characters are statistically independent of each other, and all the letters of the alphabet are equally likely to occur. Let m be the size of the alphabet. In this case, the entropy rate is given by

H = log_2 m bits/character

For English text, the alphabet size is m = 27. Thus, if this had been an accurate model for English text, then the entropy rate would have been H = log_2 27 = 4.75 bits/character.
 First-Order Model: The characters are statistically independent. Let m be the size of the alphabet and let p_i be the probability of the ith letter in the alphabet. The entropy rate is

H = -Σ_{i=1}^{m} p_i log_2 p_i bits/character

Using the first-order distribution, the entropy rate of English text would have been 4.07 bits/character had this been the correct model.
 Second-Order Model: Let p_{j|i} be the conditional probability that the present character is the jth letter in the alphabet given that the previous character is the ith letter. The entropy rate is

H = -Σ_i p_i Σ_j p_{j|i} log_2 p_{j|i}

where p_i is the (first-order) probability of the ith letter. Using the second-order distribution, the entropy rate of English text would have been 3.36 bits/character had this been the correct model.
 Third-Order Model: Let p_{k|ij} be the conditional probability that the present character is the kth letter in the alphabet given that the previous character is the jth letter and the one before that is the ith letter. The entropy rate is

H = -Σ_{i,j} p_{ij} Σ_k p_{k|ij} log_2 p_{k|ij}

where p_{ij} is the probability of the consecutive pair (i, j). Using the third-order distribution, the entropy rate of English text would have been 2.77 bits/character had this been the correct model.
 General Model: Let X^n = (X_1, X_2, ..., X_n) represent the first n characters. The entropy rate in the general case is given by

H = lim_{n→∞} -(1/n) Σ p(x^n) log_2 p(x^n)

where the sum is over all possible values of x^n = (x_1, ..., x_n). It is virtually impossible to calculate the entropy rate directly from the above equation. Using a prediction method, Shannon was able to estimate that the entropy rate of 27-letter English text is 2.3 bits/character (see Shannon: Collected Papers).
We emphasize that there is only one entropy rate for a given source. All of the above definitions for the entropy rate are consistent with one another.
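As a sanity check on the formulas above, here is a minimal sketch that evaluates the zero-order entropy H = log_2 m and the second-order (Markov) entropy rate; the function names are ours, not Shannon's.

```python
from math import log2

def entropy(p):
    """H = -sum p_i log2 p_i, in bits (terms with p_i = 0 contribute 0)."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def markov_entropy_rate(p, P):
    """Second-order model: H = sum_i p_i * H(p_{.|i}), where p is the
    stationary letter distribution and P[i] is the conditional row p_{.|i}."""
    return sum(pi * entropy(row) for pi, row in zip(p, P))

# Zero-order model of English: 27 equally likely characters.
m = 27
H0 = entropy([1 / m] * m)    # equals log2(27)
print(round(H0, 2))          # → 4.75
```

Plugging in the measured first-, second-, and third-order English distributions would reproduce the 4.07, 3.36, and 2.77 bits/character figures quoted above.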
IV. Shannon Lossless Source Coding Theorem
Shannon's lossless source coding theorem is based on the concept of block coding. To illustrate this concept, we introduce a special information source in which the alphabet consists of only two letters:

A = {a, b}
Here, the letters `a’ and `b’ are equally likely to occur. However, given that `a’ occurred in the previous character, the probability that `a’ occurs again in the present character is 0.9. Similarly, given that `b’ occurred in the previous character, the probability that `b’ occurs again in the present character is 0.9. This is known as a binary symmetric Markov source.
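A hypothetical simulation of this source: the sketch below generates a long character sequence and checks that roughly 90% of characters repeat their predecessor. The function name and seed are illustrative choices, not part of the original text.

```python
import random

def binary_markov(n, p_stay=0.9, rng=None):
    """Generate n characters from the binary symmetric Markov source:
    each character repeats the previous one with probability p_stay."""
    rng = rng or random.Random()
    seq = [rng.choice("ab")]              # 'a' and 'b' equally likely to start
    for _ in range(n - 1):
        prev = seq[-1]
        if rng.random() < p_stay:
            seq.append(prev)              # stay in the same state
        else:
            seq.append("b" if prev == "a" else "a")
    return "".join(seq)

sample = binary_markov(100000, rng=random.Random(1))
stays = sum(x == y for x, y in zip(sample, sample[1:]))
print(stays / (len(sample) - 1))          # empirically close to 0.9
```

Note the long runs of repeated letters such a source produces; this is exactly the redundancy the block codes below exploit.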
An nth order block code is just a mapping which assigns to each block of n consecutive characters a sequence of bits of varying length. The following examples illustrate this concept.
 First-Order Block Code: Each character is mapped to a single bit.

Block  Probability  Codeword
a      0.5          0
b      0.5          1

R = 1 bit/character. In a typical example, 24 bits are used to represent 24 characters, an average of 1 bit/character.

 Second-Order Block Code: Pairs of characters are mapped to either one, two, or three bits.

Block  Probability  Codeword
aa     0.45         0
bb     0.45         10
ab     0.05         110
ba     0.05         111

R = 0.825 bits/character. In a typical example, 20 bits are used to represent 24 characters, an average of 0.83 bits/character.

 Third-Order Block Code: Triplets of characters are mapped to bit sequences of lengths one through six.

Block  Probability  Codeword
aaa    0.405        0
bbb    0.405        10
aab    0.045        1100
abb    0.045        1101
bba    0.045        1110
baa    0.045        11110
aba    0.005        111110
bab    0.005        111111

R = 0.68 bits/character. In a typical example, 17 bits are used to represent 24 characters, an average of 0.71 bits/character.
Note that:
 The rates shown in the tables are calculated from

R = (1/n) Σ_i p(b_i) l(b_i)

where p(b_i) is the probability of block b_i and l(b_i) is the length of its codeword.
 The higher the order, the lower the rate (better compression).
 The codes used in the above examples are Huffman codes.
 We are only interested in lossless data compression codes. That is, given the code table and the compressed data, we should be able to re-derive the original data. All of the examples given above are lossless.
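The rates in the three tables can be reproduced mechanically from R = (1/n) Σ p(b_i) l(b_i). The sketch below stores the tables as plain dictionaries (our own representation) and recovers the 1, 0.825, and 0.68 bits/character figures.

```python
def code_rate(table, n):
    """R = (1/n) * sum_i p(b_i) * l(b_i), in bits/character."""
    return sum(p * len(cw) for p, cw in table.values()) / n

# The three Huffman code tables from the text: block -> (probability, codeword).
first  = {"a": (0.5, "0"), "b": (0.5, "1")}
second = {"aa": (0.45, "0"), "bb": (0.45, "10"),
          "ab": (0.05, "110"), "ba": (0.05, "111")}
third  = {"aaa": (0.405, "0"),      "bbb": (0.405, "10"),
          "aab": (0.045, "1100"),   "abb": (0.045, "1101"),
          "bba": (0.045, "1110"),   "baa": (0.045, "11110"),
          "aba": (0.005, "111110"), "bab": (0.005, "111111")}

print(code_rate(first, 1), code_rate(second, 2), code_rate(third, 3))
# approximately 1.0, 0.825, and 0.68 bits/character
```

The block probabilities come straight from the Markov source, e.g. p(aa) = p(a) p(a|a) = 0.5 × 0.9 = 0.45, and the decreasing rates approach the source's entropy rate from above.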
Theorem: Let R_n be the rate of an optimal nth order lossless data compression code (in bits/character). Then

(1/n) H(X_1, ..., X_n) ≤ R_n ≤ (1/n) H(X_1, ..., X_n) + 1/n

Since both the upper and lower bounds of R_n approach the entropy rate, H, as n goes to infinity, we have

lim_{n→∞} R_n = H

Thus, the theorem establishes that the entropy rate is the rate of an optimal lossless data compression code. The limit exists as long as the source is stationary.
V. Rate-Distortion Theory
In lossy data compression, the decompressed data need not be exactly the same as the original data. Often, it suffices to have a reasonably close approximation. A distortion measure is a mathematical entity which specifies exactly how close the approximation is. Generally, it is a function which assigns to any two letters x and y in the alphabet a nonnegative number, denoted

d(x, y)
Here, x is the original data, y is the approximation, and d(x, y) is the amount of distortion between x and y. The most common distortion measures are the Hamming distortion measure

d(x, y) = 0 if x = y, and 1 if x ≠ y

and the squared-error distortion measure (which is only applicable when the alphabet is a set of numbers)

d(x, y) = (x - y)^2
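Both distortion measures are one-liners. The sketch below also averages a single-letter measure over a whole string, which is how the per-letter distortion of a block is computed; the helper names are ours.

```python
def hamming_distortion(x, y):
    """0 if the letters agree, 1 if they differ."""
    return 0 if x == y else 1

def squared_error_distortion(x, y):
    """(x - y)^2; only meaningful when the alphabet is numeric."""
    return (x - y) ** 2

def average_distortion(original, approximation, d):
    """Per-letter average of a single-letter distortion measure d."""
    return sum(d(x, y) for x, y in zip(original, approximation)) / len(original)

print(average_distortion("banana", "bonanz", hamming_distortion))
# 2 mismatches out of 6 letters -> 1/3
```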
Rate-distortion theory says that for a given source and a given distortion measure, there exists a function, R(D), called the rate-distortion function. The R(D) curve is typically convex and decreasing: it starts at R(0) = H for zero distortion and falls to zero at some maximum distortion D_max.
If the source samples are independent of one another, the rate-distortion function can be obtained by solving the constrained minimization problem

R(D) = min over test channels q(j|i) of Σ_i Σ_j p_i q(j|i) log_2 [ q(j|i) / Σ_k p_k q(j|k) ]

subject to the constraints that

q(j|i) ≥ 0 and Σ_j q(j|i) = 1 for every i

and

Σ_i Σ_j p_i q(j|i) d_{ij} ≤ D

where d_{ij} is the distortion between the ith and jth letters in the alphabet. The rate-distortion function can be calculated numerically using Blahut's algorithm.
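A minimal sketch of Blahut's algorithm for an i.i.d. source, in the usual formulation with a Lagrange parameter s ≤ 0 that selects one point on the R(D) curve. For a uniform binary source with Hamming distortion the curve is known in closed form, R(D) = 1 - h(D), which gives a convenient check; everything else here (names, iteration count) is our own choice.

```python
from math import exp, log2

def blahut(p, d, s, iters=200):
    """One Lagrange-parameter slice (s <= 0) of Blahut's algorithm.
    p[i]: source distribution; d[i][j]: distortion matrix.
    Returns a (distortion, rate) point on the rate-distortion curve."""
    m, k = len(p), len(d[0])
    q = [1.0 / k] * k                     # output distribution, start uniform
    A = [[exp(s * d[i][j]) for j in range(k)] for i in range(m)]
    for _ in range(iters):
        c = [sum(q[j] * A[i][j] for j in range(k)) for i in range(m)]
        q = [q[j] * sum(p[i] * A[i][j] / c[i] for i in range(m)) for j in range(k)]
    c = [sum(q[j] * A[i][j] for j in range(k)) for i in range(m)]
    Q = [[q[j] * A[i][j] / c[i] for j in range(k)] for i in range(m)]  # test channel
    D = sum(p[i] * Q[i][j] * d[i][j] for i in range(m) for j in range(k))
    R = sum(p[i] * Q[i][j] * log2(Q[i][j] / q[j])
            for i in range(m) for j in range(k) if Q[i][j] > 0)
    return D, R

# Uniform binary source, Hamming distortion: the returned (D, R) point
# should lie on the closed-form curve R(D) = 1 - h(D).
D, R = blahut([0.5, 0.5], [[0, 1], [1, 0]], s=-2.0)
print(D, R)
```

Sweeping s from 0 down to large negative values traces out the whole R(D) curve, from (D_max, 0) to (0, H).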
Rate-distortion theory is based on the concept of block coding (similar to the lossless case above). A lossy block code is known as a vector quantizer (VQ). The block length n of the code is known as the VQ dimension.
Theorem: Assume that the source is stationary and the source samples are independent of one another. For every distortion level D ≥ 0 and for every ε > 0, there exists a VQ of (sufficiently large) dimension n with distortion no greater than D and rate no more than R(D) + ε. Furthermore, there does not exist a VQ with distortion D and rate less than R(D).
In essence, the theorem states that R(D) is the rate-distortion performance of an optimal VQ. The above theorem can be generalized to the case where the source is stationary and ergodic.
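To make the VQ idea concrete, here is a toy dimension-2 quantizer with a hypothetical 4-codeword codebook (so the rate is log_2(4)/2 = 1 bit/sample): each source block is simply mapped to the index of the nearest codeword under squared-error distortion.

```python
def vq_encode(blocks, codebook):
    """Map each source block to the index of the nearest codeword
    (squared-error distortion), as a lossy block code would."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda k: dist(b, codebook[k]))
            for b in blocks]

# A hypothetical dimension-2 VQ with 4 codewords.
codebook = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
blocks = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)]
print(vq_encode(blocks, codebook))  # → [0, 3, 1]
```

The decoder holds the same codebook and replaces each index with its codeword; designing a good codebook (e.g., by the Lloyd or LBG algorithm) is what VQ design is about.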
VI. The Gap Between Theory and Practice
 The theory holds when the block length n approaches infinity. In real-time compression, the compression algorithm must wait for n consecutive source samples before it can begin the compression. When n is large, this wait (or delay) may be too long. For example, in real-time speech compression, the speech signal is sampled at 8000 samples/second. If n is, say, 4000, then the compression delay is half a second. In a two-way conversation, this long delay may not be desirable.
 The theory does not take into consideration the complexities associated with the compression and decompression operations. Typically, as the block length n increases, the complexities also increase. Often, the rate of increase is exponential in n.
 The theory assumes that the statistical properties of the source are known. In practice, this information may not be available.
 The theory assumes that there is no error in the compressed bit stream. In practice, due to noise in the communication channel or imperfection in the storage medium, there are errors in the compressed bit stream.
Researchers in the field have developed practical solutions to each of these problems.
VII. FAQs (Frequently Asked Questions)
 What is the difference between lossless and lossy compression?
 What is the difference between compression rate and compression ratio?
 What is the difference between “data compression theory” and “source coding theory”?
 What does “stationary” mean?
 What does “ergodic” mean?
 What is the difference between lossless and lossy compression?
In lossless data compression, the compressed-then-decompressed data is an exact replica of the original data. In lossy data compression, on the other hand, the decompressed data may be different from the original data; typically, there is some distortion between the original and the reproduced signal. The popular WinZip program is an example of lossless compression. JPEG is an example of lossy compression.
 What is the difference between compression rate and compression ratio?
Historically, there are two main types of applications of data compression: transmission and storage. An example of the former is speech compression for real-time transmission over digital cellular networks. An example of the latter is file compression (e.g., DriveSpace). The term “compression rate” comes from the transmission camp, while “compression ratio” comes from the storage camp.
Compression rate is the rate of the compressed data (which we imagine to be transmitted in “real time”). Typically, it is in units of bits/sample, bits/character, bits/pixel, or bits/second. Compression ratio is the ratio of the size or rate of the original data to the size or rate of the compressed data. For example, if a grayscale image is originally represented by 8 bits/pixel (bpp) and it is compressed to 2 bpp, we say that the compression ratio is 4-to-1. Sometimes this is instead expressed as a 75% reduction in size.
Compression rate is an absolute term, while compression ratio is a relative term.
We note that there are current applications which can be considered as both transmission and storage. For example, the above photograph of Shannon is stored in JPEG format. This not only saves storage space on the local disk, it also speeds up the delivery of the image over the internet.
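The rate/ratio arithmetic from this answer, as a tiny sketch (the helper names are ours):

```python
def compression_ratio(original_size, compressed_size):
    """Ratio of original size (or rate) to compressed size (or rate)."""
    return original_size / compressed_size

def size_reduction(original_size, compressed_size):
    """The same comparison expressed as a percentage reduction."""
    return 100.0 * (1 - compressed_size / original_size)

# Grayscale image compressed from 8 bits/pixel down to 2 bits/pixel:
print(compression_ratio(8, 2))   # → 4.0, i.e. "4-to-1"
print(size_reduction(8, 2))      # → 75.0, i.e. "75% smaller"
```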
 What is the difference between “data compression theory” and “source coding theory”?
There is no difference; they both mean the same thing. The term “coding” is a general term which could mean either “data compression” or “error control coding”.
 What does “stationary” mean?
Mathematically, a random process X = (X_1, X_2, X_3, ...) is stationary (or strict-sense stationary) if, for every pair of positive integers n and k, the vectors

(X_1, X_2, ..., X_n) and (X_{k+1}, X_{k+2}, ..., X_{k+n})

have the same probability distribution.
We can think of “stationarity” in terms of our library example. Suppose that we look at the first letter of all 100 million books and see how often the first letter is `a’, how often it is `b’, and so on. We will then get a probability distribution of the first letter of the (random) book. If we repeat this experiment for the fifth and 105th letters, we will get two other distributions. If all three distributions are the same, we are inclined to believe that the book process is stationary. To be sure, we should look at the joint distribution of the first and second letters and compare it with the joint distribution of the 101st and 102nd letters. If these two joint distributions are roughly the same, we will be more convinced that the book process is stationary.
In reality, the book process cannot be stationary, because the first character cannot be a SPACE.
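Stationarity can be checked exactly for the binary symmetric Markov source introduced earlier (unlike the book process): starting from the uniform distribution, the marginal distribution of every character stays uniform, because the uniform distribution is the stationary distribution of that chain. A small sketch:

```python
def next_marginal(p, P):
    """Marginal distribution of the next character: p' = p P."""
    n = len(p)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

# Binary symmetric Markov source from the text: start uniform, stay with prob 0.9.
P = [[0.9, 0.1], [0.1, 0.9]]
p = [0.5, 0.5]
for _ in range(100):
    p = next_marginal(p, P)
print(p)  # → [0.5, 0.5]: every character has the same marginal distribution
```

A chain started from a non-uniform distribution (say, always beginning with `a’, like a book that never starts with a SPACE) is not stationary, although its marginals converge toward the uniform distribution as the position grows.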
 What does “ergodic” mean?
The strict mathematical definition of ergodicity is too complex to explain here. However, Shannon offered the following intuitive explanation: “In an ergodic process every sequence produced by the process is the same in statistical properties. Thus the letter frequencies, (the pairwise) frequencies, etc., obtained from particular sequences, will, as the lengths of the sequences increase, approach definite limits independent of the particular sequence. Actually this is not true of every sequence but the set for which it is false has probability zero. Roughly the ergodic property means statistical homogeneity.”
VIII. References
 C. E. Shannon, “A Mathematical Theory of Communication.”
 C. E. Shannon and W. Weaver, The Mathematical Theory of Communication.
 C. E. Shannon, N. J. A. Sloane, and A. D. Wyner (eds.), Claude Elwood Shannon: Collected Papers.
 C. E. Shannon, “Prediction and Entropy of Printed English,” available in Shannon: Collected Papers.
 C. E. Shannon, “Coding Theorems for a Discrete Source with a Fidelity Criterion,” available in Shannon: Collected Papers.
 T. M. Cover and J. A. Thomas, Elements of Information Theory.
 R. Gallager, Information Theory and Reliable Communication.
 A. Gersho and R. M. Gray, Vector Quantization and Signal Compression.
 K. Sayood, Introduction to Data Compression.
 M. Nelson and J.-L. Gailly, The Data Compression Book.
 A. Leon-Garcia, Probability and Random Processes for Electrical Engineering.
 A. Papoulis, Probability, Random Variables, and Stochastic Processes.
 M. R. Schroeder, Computer Speech: Recognition, Compression, Synthesis.
 G. Held and T. R. Marshall, Data and Image Compression: Tools and Techniques.
 D. Hankerson, P. D. Johnson, and G. A. Harris, Introduction to Information Theory and Data Compression.
Note: To anyone interested in information theory, I can highly recommend two books: Shannon’s Collected Papers and Cover and Thomas’ Elements of Information Theory. Collected Papers includes a 1987 interview of Shannon by OMNI Magazine, almost all of his papers on information theory and cryptography, his celebrated Master’s thesis on A Symbolic Analysis of Relay and Switching Circuits, his Ph.D. dissertation on An Algebra for Theoretical Genetics, his paper on the Scientific Aspects of Juggling, his paper on A Mind Reading Machine, and much more. Elements of Information Theory is now the standard textbook on information theory. I have been using this textbook in my graduate-level information theory course for the past seven years. I have had nothing but positive feedback from students who have studied this book.
