AntiCyberForensics: Theory of Data Compression


Contents

  1. Introduction and Background
  2. Source Modeling
  3. Entropy Rate of a Source
  4. Shannon Lossless Source Coding Theorem
  5. Rate-Distortion Theory
  6. The Gap Between Theory and Practice
  7. FAQs (Frequently Asked Questions)
  8. References

I. Introduction and Background

[Photograph: Claude E. Shannon]

In his 1948 paper, “A Mathematical Theory of Communication,” Claude E. Shannon formulated the theory of data compression. Shannon established that there is a fundamental limit to lossless data compression. This limit, called the entropy rate, is denoted by H. The exact value of H depends on the information source — more specifically, the statistical nature of the source. It is possible to compress the source, in a lossless manner, with compression rate close to H. It is mathematically impossible to do better than H.

Shannon also developed the theory of lossy data compression. This is better known as rate-distortion theory. In lossy data compression, the decompressed data does not have to be exactly the same as the original data. Instead, some amount of distortion, D, is tolerated. Shannon showed that, for a given source (with all its statistical properties known) and a given distortion measure, there is a function, R(D), called the rate-distortion function. The theory says that if D is the tolerable amount of distortion, then R(D) is the best possible compression rate.

When the compression is lossless (i.e., no distortion or D=0), the best possible compression rate is R(0)=H (for a finite alphabet source). In other words, the best possible lossless compression rate is the entropy rate. In this sense, rate-distortion theory is a generalization of lossless data compression theory, where we went from no distortion (D=0) to some distortion (D>0).

Lossless data compression theory and rate-distortion theory are known collectively as source coding theory. Source coding theory sets fundamental limits on the performance of all data compression algorithms. The theory, in itself, does not specify exactly how to design and implement these algorithms. It does, however, provide some hints and guidelines on how to achieve optimal performance. In the following sections, we describe how Shannon modeled an information source as a random process, present Shannon’s lossless source coding theorem, and discuss his rate-distortion theory. A background in probability theory is recommended.

II. Source Modeling

Imagine that you go to the library. This library has a large selection of books; say there are 100 million books in this library. Each book in this library is very thick; say, for example, that each book has 100 million characters (or letters) in it. When you get to this library, you will, in some random manner, select a book to check out. This chosen book is the information source to be compressed. The compressed book is then stored on your zip disk to take home, or transmitted directly over the internet into your home, or whatever the case may be.

Mathematically, the book you select is denoted by

    $X = (X_1, X_2, X_3, \ldots),$

where $X$ represents the whole book, $X_1$ represents the first character in the book, $X_2$ represents the second character, and so on. Even though in reality the length of the book is finite, mathematically we assume that it has infinite length. The reasoning is that the book is so long that we can imagine it goes on forever. Furthermore, the mathematics turns out to be surprisingly simpler if we assume an infinite-length book.

To simplify things a little, let us assume that every character in every book is either a lower-case letter (`a’ through `z’) or a SPACE (an e. e. cummings style of writing, shall we say). The source alphabet, $\mathcal{A}$, is defined to be the set of all 27 possible values of the characters:

    $\mathcal{A} = \{a, b, c, \ldots, z, \text{SPACE}\}.$

Now put yourself in the shoes of the engineer who designs the compression algorithm. She does not know in advance which book you will select. All she knows is that you will be selecting a book from this library. From her perspective, each character $X_n$ in the book is a random variable which takes values in the alphabet $\mathcal{A}$. The whole book, $X$, is just an infinite sequence of random variables; that is, $X$ is a random process. There are several ways in which this engineer can model the statistical properties of the book.

  1. Zero-Order Model: Each character is statistically independent of all other characters, and the 27 possible values in the alphabet are equally likely to occur. If this model is accurate, then a typical opening of a book would look like this (all of these examples come directly from Shannon’s 1948 paper):

    xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd qpaamkbzaacibzlhjqd

    This does not look like the writing of an intelligent being. In fact, it resembles the writing of a “monkey sitting at a typewriter.”

  2. First-Order Model: We know that in the English language some letters occur more frequently than others. For example, the letters `a’ and `e’ are more common than `q’ and `z’. Thus, in this model, the characters are still independent of one another, but their probability distribution follows the first-order statistical distribution of English text. A typical text for this model looks like this:

    ocroh hli rgwr nmielwis eu ll nbnesebya th eei alhenhttpa oobttva nah brl

  3. Second-Order Model: The previous two models assumed statistical independence from one character to the next. This does not accurately reflect the nature of the English language. For exa#ple, some letters in thi# sent#nce are missi#g. However, we are still able to figure out what those letters should have been by looking at the context. This implies that there is some dependency between the characters. Naturally, characters which are in close proximity are more dependent than those that are far from each other. In this model, the present character $X_n$ depends on the previous character $X_{n-1}$, but it is conditionally independent of all earlier characters $X_{n-2}, X_{n-3}, \ldots$. According to this model, the probability distribution of the character $X_n$ varies according to what the previous character $X_{n-1}$ is. For example, the letter `u’ rarely occurs (probability=0.022). However, given that the previous character is `q’, the probability of a `u’ in the present character is much higher (probability=0.995). For a complete description, see the second-order statistical distribution of English text. A typical text for this model would look like this:

    on ie antsoutinys are t inctore st be s deamy achin d ilonasive tucoowe at teasonare fuso tizin andy tobe seace ctisbe

  4. Third-Order Model: This is an extension of the previous model. Here, the present character $X_n$ depends on the previous two characters $X_{n-1}$ and $X_{n-2}$, but it is conditionally independent of all earlier characters $X_{n-3}, X_{n-4}, \ldots$. In this model, the distribution of $X_n$ varies according to what $X_{n-1}$ and $X_{n-2}$ are. See the third-order statistical distribution of English text. A typical text for this model would look like this:

    in no ist lat whey cratict froure birs grocid pondenome of demonstures of the reptagin is regoactiona of cre

    The resemblance to ordinary English text increases quite noticeably at each of the above steps.

  5. General Model: In this model, the book is an arbitrary stationary random process. The statistical properties of this model are too complex to be deemed practical. This model is interesting only from a theoretical point of view.

Each model above is a special case of the one that follows it: the zero-order model is a special case of the first-order model, which is a special case of the second-order model, which is a special case of the third-order model, which in turn is a special case of the general model.
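
To make these models concrete, here is a minimal Python sketch that generates “typical text” from a zero-order, a first-order, or a second-order model. The frequency tables it consumes are whatever you supply; the toy table at the bottom is made up for illustration and is not Shannon’s published distribution of English text.

    import random
    import string

    ALPHABET = list(string.ascii_lowercase) + [" "]  # 27-letter alphabet: a-z plus SPACE

    def zero_order(n):
        """Zero-order model: independent, equally likely characters."""
        return "".join(random.choice(ALPHABET) for _ in range(n))

    def first_order(n, freq):
        """First-order model: independent characters drawn from a
        single-letter frequency table `freq` (a dict letter -> probability)."""
        letters, weights = zip(*freq.items())
        return "".join(random.choices(letters, weights=weights, k=n))

    def second_order(n, cond):
        """Second-order model: each character depends only on the previous one.
        `cond[prev]` is a dict mapping the next letter to its conditional probability."""
        text = [random.choice(ALPHABET)]
        for _ in range(n - 1):
            letters, weights = zip(*cond[text[-1]].items())
            text.append(random.choices(letters, weights=weights, k=1)[0])
        return "".join(text)

    # Illustrative (made-up) first-order table: SPACE five times as likely as any letter.
    toy_freq = {c: 1.0 for c in ALPHABET}
    toy_freq[" "] = 5.0
    total = sum(toy_freq.values())
    toy_freq = {c: p / total for c, p in toy_freq.items()}

    print(zero_order(60))
    print(first_order(60, toy_freq))

With tables estimated from a large body of real English text, the outputs of these three functions would resemble the three sample texts quoted above.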

III. Entropy Rate of a Source

[Photograph: Claude Shannon in 1948]

The entropy rate of a source is a number which depends only on the statistical nature of the source. If the source has a simple model, then this number can be easily calculated. Here, we consider an arbitrary source

    $X = (X_1, X_2, X_3, \ldots),$

while paying special attention to the case where $X$ is English text.

  1. Zero-Order Model: The characters are statistically independent of each other, and every letter of the alphabet is equally likely to occur. Let $m$ be the size of the alphabet. In this case, the entropy rate is given by $H = \log_2 m$ bits/character. For English text, the alphabet size is $m = 27$. Thus, if this had been an accurate model for English text, then the entropy rate would have been $H = \log_2 27 = 4.75$ bits/character.
  2. First-Order Model: The characters are statistically independent. Let $m$ be the size of the alphabet and let $p_i$ be the probability of the i-th letter in the alphabet. The entropy rate is $H = -\sum_{i=1}^{m} p_i \log_2 p_i$ bits/character. Using the first-order distribution, the entropy rate of English text would have been 4.07 bits/character had this been the correct model.
  3. Second-Order Model: Let $p_{j|i}$ be the conditional probability that the present character is the j-th letter in the alphabet given that the previous character is the i-th letter, and let $p_i$ be the (stationary) probability of the i-th letter. The entropy rate is $H = -\sum_{i=1}^{m} \sum_{j=1}^{m} p_i\, p_{j|i} \log_2 p_{j|i}$ bits/character. Using the second-order distribution, the entropy rate of English text would have been 3.36 bits/character had this been the correct model.
  4. Third-Order Model: Let $p_{k|ij}$ be the conditional probability that the present character is the k-th letter in the alphabet given that the previous character is the j-th letter and the one before that is the i-th letter, and let $p_{ij}$ be the (stationary) probability of that pair of preceding letters. The entropy rate is $H = -\sum_{i,j,k} p_{ij}\, p_{k|ij} \log_2 p_{k|ij}$ bits/character. Using the third-order distribution, the entropy rate of English text would have been 2.77 bits/character had this been the correct model.
  5. General Model: Let $X^n = (X_1, X_2, \ldots, X_n)$ represent the first $n$ characters. The entropy rate in the general case is given by $H = \lim_{n \to \infty} \left( -\frac{1}{n} \sum_{x^n} P(X^n = x^n) \log_2 P(X^n = x^n) \right)$, where the sum is over all possible values $x^n$ of $X^n$. It is virtually impossible to calculate the entropy rate according to the above equation. Using a prediction method, Shannon was able to estimate that the entropy rate of 27-letter English text is about 2.3 bits/character (see Shannon: Collected Papers).

We emphasize that there is only one entropy rate for a given source. All of the above definitions for the entropy rate are consistent with one another.
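
As a rough sanity check on these formulas, the following Python sketch implements the zero-, first-, and second-order entropy rate calculations for whatever distributions you pass in. The small example at the bottom uses the two-letter Markov source introduced in Section IV (stay with the same letter with probability 0.9), purely as an illustration.

    import math

    def h0(m):
        """Zero-order model: m equally likely, independent letters."""
        return math.log2(m)

    def h1(p):
        """First-order model: independent letters with probabilities p[i]."""
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def h2(p, p_cond):
        """Second-order (Markov) model: p[i] is the stationary probability of
        letter i, and p_cond[i][j] is the probability of letter j given that
        the previous letter was letter i."""
        return -sum(
            p[i] * p_cond[i][j] * math.log2(p_cond[i][j])
            for i in range(len(p))
            for j in range(len(p))
            if p_cond[i][j] > 0
        )

    print(h0(27))                 # 4.75 bits/character (27-letter alphabet)

    # Illustration: the binary symmetric Markov source of Section IV
    # (letters `a' and `b', stay with the same letter with probability 0.9).
    p = [0.5, 0.5]
    p_cond = [[0.9, 0.1], [0.1, 0.9]]
    print(h1(p))                  # 1.0 bit/character if the memory is ignored
    print(h2(p, p_cond))          # about 0.47 bits/character with the memory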

IV. Shannon Lossless Source Coding Theorem

Shannon’s lossless source coding theorem is based on the concept of block coding. To illustrate this concept, we introduce a special information source in which the alphabet consists of only two letters:

    $\mathcal{A} = \{a, b\}.$

Here, the letters `a’ and `b’ are equally likely to occur. However, given that `a’ occurred in the previous character, the probability that `a’ occurs again in the present character is 0.9. Similarly, given that `b’ occurred in the previous character, the probability that `b’ occurs again in the present character is 0.9. This is known as a binary symmetric Markov source.
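
This source is easy to simulate. The short Python sketch below generates a sample sequence from it; the 0.9 “stay” probability is the one given above.

    import random

    def binary_symmetric_markov(n, stay=0.9):
        """Generate n characters from the binary symmetric Markov source:
        the first character is `a' or `b' with equal probability, and each
        subsequent character repeats the previous one with probability `stay`."""
        seq = [random.choice("ab")]
        for _ in range(n - 1):
            prev = seq[-1]
            if random.random() < stay:
                seq.append(prev)                         # stay with the same letter
            else:
                seq.append("b" if prev == "a" else "a")  # switch letters
        return "".join(seq)

    print(binary_symmetric_markov(24))   # e.g. 'aaaaaabbbbbbbbbaaaaabbbb'

Typical outputs contain long runs of repeated letters, which is exactly the redundancy that the block codes below exploit.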

An n-th order block code is just a mapping which assigns to each block of n consecutive characters a sequence of bits of varying length. The following examples illustrate this concept.

  1. First-Order Block Code: Each character is mapped to a single bit.
    Block   Probability   Codeword
    a       0.5           0
    b       0.5           1
    R=1 bit/character

    An example:

    [Figure: example of a 1st-order block code applied to a 24-character sequence]
    Note that 24 bits are used to represent 24 characters — an average of 1 bit/character.

  2. Second-Order Block Code: Pairs of characters are mapped to either one, two, or three bits.
    Block   Probability   Codeword
    aa      0.45          0
    bb      0.45          10
    ab      0.05          110
    ba      0.05          111
    R=0.825 bits/character

    An example:

    [Figure: example of a 2nd-order block code applied to the same 24-character sequence]
    Note that 20 bits are used to represent 24 characters — an average of 0.83 bits/character.

  3. Third-Order Block Code: Triplets of characters are mapped to bit sequences of one to six bits.
    Block   Probability   Codeword
    aaa     0.405         0
    bbb     0.405         10
    aab     0.045         1100
    abb     0.045         1101
    bba     0.045         1110
    baa     0.045         11110
    aba     0.005         111110
    bab     0.005         111111
    R=0.68 bits/character

    An example:

    [Figure: example of a 3rd-order block code applied to the same 24-character sequence]
    Note that 17 bits are used to represent 24 characters — an average of 0.71 bits/character.

Note that:

  • The rates shown in the tables are calculated from $R_n = \frac{1}{n} \sum_{b} P(b)\, \ell(b)$, where the sum is over all blocks $b$ of $n$ characters, $P(b)$ is the probability of block $b$, and $\ell(b)$ is the length of the codeword assigned to block $b$.
  • The higher the order, the lower the rate (better compression).
  • The codes used in the above examples are Huffman codes.
  • We are only interested in lossless data compression codes. That is, given the code table and the compressed data, we should be able to recover the original data exactly. All of the examples given above are lossless.
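
The pattern in these examples can be checked numerically. The following Python sketch (plain Python, no external libraries) enumerates all n-character blocks of the binary symmetric Markov source, builds a Huffman code for them, and reports the resulting rate. As n grows, the rate drifts down toward the entropy rate of this source, which is about 0.47 bits/character.

    import heapq
    import itertools
    import math

    STAY = 0.9  # probability of repeating the previous letter

    def block_probabilities(n):
        """Probability of every n-character block of the binary symmetric Markov source."""
        probs = {}
        for block in itertools.product("ab", repeat=n):
            p = 0.5  # first character is equally likely to be `a' or `b'
            for prev, cur in zip(block, block[1:]):
                p *= STAY if cur == prev else 1 - STAY
            probs["".join(block)] = p
        return probs

    def huffman_code_lengths(probs):
        """Codeword length of each block under a Huffman code for `probs`."""
        heap = [(p, i, {block: 0}) for i, (block, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p1, _, d1 = heapq.heappop(heap)
            p2, _, d2 = heapq.heappop(heap)
            merged = {b: length + 1 for b, length in {**d1, **d2}.items()}
            counter += 1
            heapq.heappush(heap, (p1 + p2, counter, merged))
        return heap[0][2]

    for n in range(1, 9):
        probs = block_probabilities(n)
        lengths = huffman_code_lengths(probs)
        rate = sum(probs[b] * lengths[b] for b in probs) / n
        print(f"n = {n}: R = {rate:.3f} bits/character")

    entropy_rate = -(STAY * math.log2(STAY) + (1 - STAY) * math.log2(1 - STAY))
    print(f"entropy rate H = {entropy_rate:.3f} bits/character")

For n = 1, 2, 3 the printed rates reproduce the values in the tables above (1, 0.825, and 0.68 bits/character).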

Theorem: Let $R_n$ be the rate of an optimal n-th order lossless data compression code (in bits/character), and let $H(X_1, X_2, \ldots, X_n)$ denote the entropy of a block of n characters. Then

    $\frac{1}{n} H(X_1, X_2, \ldots, X_n) \le R_n \le \frac{1}{n} H(X_1, X_2, \ldots, X_n) + \frac{1}{n}.$

Since both the upper and the lower bound of $R_n$ approach the entropy rate, H, as n goes to infinity, we have

    $\lim_{n \to \infty} R_n = H.$

Thus, the theorem establishes that the entropy rate is the rate of an optimal lossless data compression code. The limit exists as long as the source is stationary.

V. Rate-Distortion Theory

In lossy data compression, the decompressed data need not be exactly the same as the original data. Often, it suffices to have a reasonably close approximation. A distortion measure is a mathematical entity which specifies exactly how close the approximation is. Generally, it is a function which assigns to any two letters $x$ and $y$ in the alphabet a non-negative number, denoted

    $d(x, y) \ge 0.$

Here, $x$ is the original data, $y$ is the approximation, and $d(x, y)$ is the amount of distortion between $x$ and $y$. The most common distortion measures are the Hamming distortion measure,

    $d(x, y) = 0$ if $x = y$, and $d(x, y) = 1$ if $x \ne y$,

and the squared-error distortion measure (which is only applicable when the alphabet is a set of numbers),

    $d(x, y) = (x - y)^2.$

Rate-distortion theory says that for a given source and a given distortion measure, there exists a function, R(D), called the rate-distortion function. The typical shape of R(D) looks like this:

[Figure: typical shape of the rate-distortion function R(D)]

If the source samples are independent of one another, the rate-distortion function can be obtained by solving the constrained minimization problem

    $R(D) = \min_{q_{j|i}} \sum_{i} \sum_{j} p_i\, q_{j|i} \log_2 \frac{q_{j|i}}{\sum_k p_k\, q_{j|k}},$

subject to the constraints that

    $\sum_{i} \sum_{j} p_i\, q_{j|i}\, d_{ij} \le D$ and $\sum_{j} q_{j|i} = 1$ for every $i$ (with $q_{j|i} \ge 0$),

where $p_i$ is the probability of the i-th letter of the alphabet, $q_{j|i}$ is the conditional probability that the reproduction is the j-th letter given that the source letter is the i-th letter, and $d_{ij}$ is the distortion between the i-th and j-th letters in the alphabet. Using Blahut’s algorithm, the rate-distortion function can be calculated numerically.
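
For the curious, here is a compact sketch of the Blahut algorithm in Python (using NumPy). It is written from the textbook description rather than from any particular library: for each value of a slope parameter s it alternates the two standard update steps until the reproduction distribution settles, and each converged point gives one (D, R) pair on the rate-distortion curve. The binary source with Hamming distortion at the bottom is only an illustration; for that source the exact answer is known to be R(D) = 1 - Hb(D), where Hb is the binary entropy function.

    import numpy as np

    def blahut(p, d, s, n_iter=500):
        """One point on the R(D) curve via the Blahut algorithm.

        p : source probabilities, shape (m,)
        d : distortion matrix d[i, j], shape (m, k)
        s : slope parameter (s < 0); more negative s gives smaller D
        Returns (D, R) with R in bits.
        """
        m, k = d.shape
        q = np.full(k, 1.0 / k)          # reproduction distribution, start uniform
        A = np.exp(s * d)                # precomputed exponentials
        for _ in range(n_iter):
            # conditional distribution of the reproduction given the source letter
            cond = q * A
            cond /= cond.sum(axis=1, keepdims=True)
            # new reproduction marginal
            q = p @ cond
        D = float(np.sum(p[:, None] * cond * d))
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(cond > 0, cond / q, 1.0)
            R = float(np.sum(p[:, None] * cond * np.log2(ratio)))
        return D, R

    # Illustration: binary source, equally likely letters, Hamming distortion.
    p = np.array([0.5, 0.5])
    d = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
    for s in (-8.0, -4.0, -2.0, -1.0):
        D, R = blahut(p, d, s)
        print(f"s = {s:5.1f}:  D = {D:.3f},  R = {R:.3f} bits")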

Rate-distortion theory is based on the concept of block coding (similar to above). A lossy block code is known as a vector quantizer (VQ). The block length n of the code is known as the VQ dimension.

Theorem: Assume that the source is stationary and that the source samples are independent. For every $D \ge 0$ and every $\epsilon > 0$, there exists a VQ of (sufficiently large) dimension n with distortion no greater than $D + \epsilon$ and rate no more than $R(D) + \epsilon$. Furthermore, there does not exist a VQ with distortion $D$ and rate less than $R(D)$.

In essence, the theorem states that R(D) is the rate-distortion performance of an optimal VQ. The above theorem can be generalized to the case where the source is stationary and ergodic.
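
To see a vector quantizer in action, the sketch below trains one with plain Lloyd (k-means) iterations, using NumPy only, on an i.i.d. Gaussian source and compares the measured distortion with the Gaussian rate-distortion bound D(R) = sigma^2 * 2^(-2R) for squared error. The dimension, rate, and sample sizes are arbitrary choices for illustration, and a real design would use a better codebook initialization.

    import numpy as np

    rng = np.random.default_rng(0)

    def train_vq(blocks, num_codewords, n_iter=50):
        """Train a vector quantizer with Lloyd (k-means) iterations.
        blocks: array of shape (num_blocks, dim)."""
        # start from randomly chosen training blocks
        codebook = blocks[rng.choice(len(blocks), num_codewords, replace=False)]
        for _ in range(n_iter):
            # nearest-codeword assignment
            dists = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # centroid update (keep the old codeword if a cell is empty)
            for k in range(num_codewords):
                members = blocks[labels == k]
                if len(members) > 0:
                    codebook[k] = members.mean(axis=0)
        return codebook

    dim = 4                    # VQ dimension n
    rate = 1.0                 # bits per sample
    num_codewords = int(2 ** (rate * dim))
    sigma2 = 1.0

    # training and test data: i.i.d. Gaussian samples grouped into blocks
    train = rng.normal(0.0, np.sqrt(sigma2), size=(20000, dim))
    test = rng.normal(0.0, np.sqrt(sigma2), size=(20000, dim))

    codebook = train_vq(train, num_codewords)
    dists = ((test[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    mse_per_sample = dists.min(axis=1).mean() / dim

    print(f"measured distortion : {mse_per_sample:.3f} per sample")
    print(f"Shannon bound D(R)  : {sigma2 * 2 ** (-2 * rate):.3f} per sample")

The measured distortion sits above the Shannon bound, and the gap narrows (slowly) as the dimension grows, which is exactly what the theorem predicts.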

VI. The Gap Between Theory and Practice

  • The theory holds when the block length n approaches infinity. In real-time compression, the compression algorithm must wait for n consecutive source samples before it can begin the compression. When n is large, this wait (or delay) may be too long. For example, in real-time speech compression, the speech signal is sampled at 8000 samples/second. If n is say 4000, then the compression delay is half a second. In a two-way conversation, this long delay may not be desirable.
  • The theory does not take into consideration the complexities associated with the compression and decompression operations. Typically, as the block length n increases, the complexities also increase. Often, the rate of increase is exponential in n.
  • The theory assumes that the statistical properties of the source are known. In practice, this information may not be available.
  • The theory assumes that there is no error in the compressed bit stream. In practice, due to noise in the communication channel or imperfections in the storage medium, errors do occur in the compressed bit stream.

Researchers in the field have developed practical techniques that successfully address each of these problems.

VII. FAQs (Frequently Asked Questions)

  1. What is the difference between lossless and lossy compression?
  2. What is the difference between compression rate and compression ratio?
  3. What is the difference between “data compression theory” and “source coding theory”?
  4. What does “stationary” mean?
  5. What does “ergodic” mean?

  1. What is the difference between lossless and lossy compression? In lossless data compression, the compressed-then-decompressed data is an exact replica of the original data. On the other hand, in lossy data compression, the decompressed data may be different from the original data. Typically, there is some distortion between the original and the reproduced signal. The popular WinZip program is an example of lossless compression; JPEG is an example of lossy compression.

  2. What is the difference between compression rate and compression ratio? Historically, there are two main types of applications of data compression: transmission and storage. An example of the former is speech compression for real-time transmission over digital cellular networks. An example of the latter is file compression (e.g. Drivespace). The term “compression rate” comes from the transmission camp, while “compression ratio” comes from the storage camp.

    Compression rate is the rate of the compressed data (which we imagine to be transmitted in “real time”). Typically, it is in units of bits/sample, bits/character, bits/pixel, or bits/second. Compression ratio is the ratio of the size or rate of the original data to the size or rate of the compressed data. For example, if a gray-scale image is originally represented by 8 bits/pixel (bpp) and it is compressed to 2 bpp, we say that the compression ratio is 4-to-1. Sometimes this is expressed by saying that the compressed image is 75% smaller than the original.

    Compression rate is an absolute term, while compression ratio is a relative term.

    We note that there are current applications which can be considered as both transmission and storage. For example, the above photograph of Shannon is stored in JPEG format. This not only saves storage space on the local disk, it also speeds up the delivery of the image over the internet.

  3. What is the difference between “data compression theory” and “source coding theory”? There is no difference. They both mean the same thing. The term “coding” is a general term which could mean either “data compression” or “error control coding”.
  4. What does “stationary” mean? Mathematically, a random process $X = (X_1, X_2, X_3, \ldots)$ is stationary (or strict-sense stationary) if for every pair of positive integers $n$ and $k$, the vectors $(X_1, X_2, \ldots, X_n)$ and $(X_{1+k}, X_{2+k}, \ldots, X_{n+k})$ have the same probability distribution. We can think of “stationarity” in terms of our library example. Suppose that we look at the first letter of all 100 million books and see how often the first letter is `a’, how often it is `b’, and so on. We will then get a probability distribution of the first letter of the (random) book. If we repeat this experiment for the fifth and 105th letters, we will get two other distributions. If all three distributions are the same, we are inclined to believe that the book process is stationary. To be sure, we should look at the joint distribution of the first and second letters and compare it with the joint distribution of the 101st and 102nd letters. If these two joint distributions are roughly the same, we will be more convinced that the book process is stationary.

    In reality, the book process cannot be stationary, because the first character is never a SPACE while later characters sometimes are.

  5. What does “ergodic” mean? The strict mathematical definition of ergodicity is too complex to explain. However, Shannon offered the following intuitive explanation: “In an ergodic process every sequence produced by the process is the same in statistical properties. Thus the letter frequencies, (the pairwise) frequencies, etc., obtained from particular sequences, will, as the lengths of the sequences increase, approach definite limits independent of the particular sequence. Actually this is not true of every sequence but the set for which it is false has probability zero. Roughly the ergodic property means statistical homogeneity.”

VIII. References

  1. C. E. Shannon, “A Mathematical Theory of Communication.”
  2. C. E. Shannon and W. Weaver, The Mathematical Theory of Communication.
  3. C. E. Shannon, N. J. Sloane, and A. D. Wyner, Claude Elwood Shannon: Collected Papers.
  4. C. E. Shannon, “Prediction and Entropy of Printed English,” available in Shannon: Collected Papers.
  5. C. E. Shannon, “Coding Theorems for a Discrete Source with a Fidelity Criterion,” available in Shannon: Collected Papers.
  6. T. M. Cover and J. A. Thomas, Elements of Information Theory.
  7. R. Gallager, Information Theory and Reliable Communication.
  8. A. Gersho and R. M. Gray, Vector Quantization and Signal Compression.
  9. K. Sayood, Introduction to Data Compression.
  10. M. Nelson and J.-L. Gailly, The Data Compression Book.
  11. A. Leon-Garcia, Probability and Random Processes for Electrical Engineering (with Student Solution Manual).
  12. A. Papoulis, Probability, Random Variables, and Stochastic Processes.
  13. M. R. Schroeder, Computer Speech: Recognition, Compression, Synthesis.
  14. G. Held and T. R. Marshall, Data and Image Compression: Tools and Techniques.
  15. D. Hankerson, P. D. Johnson, and G. A. Harris, Introduction to Information Theory and Data Compression.

Note: To anyone interested in information theory, I can highly recommend two books: Shannon’s Collected Papers and Cover and Thomas’ Elements of Information Theory. Collected Papers includes a 1987 interview of Shannon by OMNI Magazine, almost all of his papers on information theory and cryptography, his celebrated Master’s thesis on A Symbolic Analysis of Relay and Switching Circuits, his Ph.D. dissertation on An Algebra for Theoretical Genetics, his paper on the Scientific Aspects of Juggling, his paper on A Mind Reading Machine, and much more. Elements of Information Theory is now the standard textbook on information theory. I have been using this textbook in my graduate-level information theory course for the past seven years. I have had nothing but positive feedback from students who have studied this book.
