Data compression. Question from the Perito Criminal (criminal forensics expert) 2012 exam, CESPE/UnB

8

Hello, I have looked in several places for an answer to this question, but so far I have not found one. The question is to be judged as Right or Wrong.

  • Question
  

Consider a file composed of a large number of independent characters belonging to an alphabet with four distinct elements. Also, consider that the probability of occurrence of each element is equal to 1/2, 1/4, 1/8 and 1/8, and that each character is mapped to 2 bits. In this case, since the compression ratio is equal to the ratio of the compressed file size to the original file size, it will not be possible to compress this file without loss at a compression rate of 80%.

Feedback: Right

    
asked by anonymous 11.06.2014 / 23:23

2 answers

7

I would say the correct answer is Wrong.

Two points to consider:

  • Probability is not certainty; and
  • Lossless compression is closely tied to two factors: noise, and the ability of the compression algorithm to encode a pattern dictionary.

Let's assume that the 4-character alphabet is composed of the following symbols:

  • A (binary representation 00)
  • B (binary representation 01)
  • C (binary representation 10)
  • D (binary representation 11)
    Simulation 1: a file containing 1024 [A] characters

    The probability of a file containing a single distinct character repeated N times is low ((1/2)^1024 in this simulation), but it is not impossible. Its total size is 2048 bits (256 bytes).

    Assume a compression algorithm that uses the following rule as an escape code:

    If the AD pair is found, check the next 2 characters.

    • If they are AD: the output is a single literal AD pair.
    • If they are different from AD: repeat the first character after the escape code the number of times given by the following 12 bits (read as a decimal value).

    The compressed representation of the Simulation 1 file would be as follows:

    A  D  A  B  A  A  A  A  A
    00 11 00 01 00 00 00 00 00
    |   | |  |_______________|_ Hex value &H400, i.e. 1024 in decimal
    |   | |____________________ Character to be repeated (A)
    |___|______________________ Escape code
    

    Result of parsing this instruction: 1024 A characters are generated, a lossless reconstruction of the original file.

    That is, you can represent the file with only 18 bits, or roughly 0.9% of the original size.
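
    To make the scheme concrete, here is a minimal sketch in Python. The function names (encode_run, decode), the 12-bit count limit and the pass-through handling of literal characters are my own reading of the rule above, not something given in the original scheme:

    SYMBOLS = "ABCD"                                 # A=00, B=01, C=10, D=11
    BITS = {c: format(i, "02b") for i, c in enumerate(SYMBOLS)}

    def encode_run(char, count):
        """Emit one escape block: 'AD' + char + the count as 12 bits (6 symbols)."""
        assert 0 < count < 4096                      # the count must fit in 12 bits
        count_bits = format(count, "012b")
        count_syms = "".join(SYMBOLS[int(count_bits[i:i+2], 2)]
                             for i in range(0, 12, 2))
        return "AD" + char + count_syms

    def decode(stream):
        """Expand a stream of escape blocks and literal symbols."""
        out, i = [], 0
        while i < len(stream):
            if stream[i:i+2] == "AD":
                if stream[i+2:i+4] == "AD":          # escaped literal AD pair
                    out.append("AD")
                    i += 4
                else:                                # run block: char + 12-bit count
                    char = stream[i+2]
                    count = int("".join(BITS[c] for c in stream[i+3:i+9]), 2)
                    out.append(char * count)
                    i += 9
            else:                                    # anything else passes through
                out.append(stream[i])
                i += 1
        return "".join(out)

    # Simulation 1: 1024 'A' characters collapse to one 9-symbol block = 18 bits.
    block = encode_run("A", 1024)
    print(block, len(block) * 2, "bits")             # ADABAAAAA 18 bits
    assert decode(block) == "A" * 1024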

    But let's assume that the statement, instead of:

      

    [...] Consider, also, that the probability of occurrence of each   element is equal to 1/2, 1/4, 1/8 and 1/8 [...]

    state this sentence as follows:

      

    [...] Consider, also, that the occurrence of each   element is equal to 1/2, 1/4, 1/8 and 1/8 [...]

    That is no longer a probability; it is a statement of fact that every character of the alphabet appears in the file, in those exact proportions. Let's move on to the next simulation:

    Simulation 2: a file containing, in order: 512 [A], 256 [B], 128 [C], 128 [D]

    Let's use the same algorithm as in the previous example:

    A  D  A  A  C  A  A  A  A
    00 11 00 00 10 00 00 00 00   = 512 A characters
    A  D  B  A  B  A  A  A  A
    00 11 01 00 01 00 00 00 00   = 256 B characters
    A  D  C  A  A  C  A  A  A
    00 11 10 00 00 10 00 00 00   = 128 C characters
    A  D  D  A  A  C  A  A  A
    00 11 11 00 00 10 00 00 00   = 128 D characters
    

    Final representation:

    ADAACAAAAADBABAAAAADCAACAAAADDAACAAA
    

    Total: 36 characters, or 72 bits. Compression ratio: about 3.5% of the original size.
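
    Running the same sketch on Simulation 2 reproduces the four blocks and the 72-bit total (again, encode_run and decode come from my illustrative code above, not from the original scheme):

    stream = (encode_run("A", 512) + encode_run("B", 256)
              + encode_run("C", 128) + encode_run("D", 128))
    print(stream, len(stream) * 2, "bits")           # 36 symbols = 72 bits
    assert decode(stream) == "A"*512 + "B"*256 + "C"*128 + "D"*128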

    Considerations

    In both cases I used an algorithm whose pattern dictionary has length N = 1 (repetition of a single character), and files with as little noise (pattern variation) as possible. A noisier file gains nothing: for example, a file consisting of the sequence ABCD repeated 256 times contains no runs for this scheme to encode, and input containing many literal AD pairs would even produce an output larger than the original file.

    It is a terrible algorithm for any real application, but it serves as a proof of concept that, given the characteristics stated in the problem and under ideal conditions, it is possible to compress the file to well below 80% of its original size.

    -x-

    Pay attention to the end of the statement:

      

    [...] it will not be possible to compress this file without loss at a compression rate of 80%. [...]

    I'm assuming the statement actually means 'with a compression ratio of less than 80%'. If the ratio had to be exactly 80%, the algorithm could be modified to add "padding" at the end of the content, for example.

        
    12.06.2014 / 08:14
    2

    The problem with the statement is that it does not specify whether it refers to the best case, the average case or the worst case. Using common sense, however, you can see that the question only makes sense when considering the average case:

    Worst case

    No lossless compression algorithm performs well in the worst case. The best it can achieve is zero compression (i.e. every file stays the same size). By the pigeonhole principle, if all 4^n files of size n are possible and there is no "waste" in the original format (4 symbols, 2 bits per symbol, already a minimal representation), then for every file that shrinks, some other file must necessarily grow. Only if the algorithm reduces nothing at all (keeps everything as it is, or merely permutes the files) will its worst-case performance not be negative - it will be zero.
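
    A small counting check (my own sketch, not part of the original answer) makes the pigeonhole argument concrete: a file of n symbols occupies 2n bits, and there are fewer non-empty bit strings shorter than 2n bits than there are files, so no lossless code can shrink every file.

    for n in range(1, 8):
        files = 4 ** n                                   # all possible files of n symbols
        shorter = sum(2 ** k for k in range(1, 2 * n))   # bit strings of length 1 .. 2n-1
        print(n, files, shorter, shorter < files)        # always True: not enough short codes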

    Best case

    Consider the following "encoding":

    • Take all the possible files and sort them according to some criterion (for example, first order them lexicographically, then apply the nth permutation).
    • Encode each file using the natural number that represents its index in the generated list, discarding leading zeros.

    In this encoding, one file will be represented by just 0, another by 10, another by 11, another by 100, and so on. If we choose n (the permutation number) so that your particular file is the first one on the list, its compressed size will be close to zero - a reduction of almost 100%! (Though it can be shown that, on average, this encoding is very bad.)

    Not only is this an absurd premise (a kludge, so to speak), it is also useless for determining the efficiency of the compression process.
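
    Still, as a toy illustration of the idea (my own sketch, using tiny files of n = 4 symbols so that the full list fits in memory):

    from itertools import product

    n = 4
    all_files = ["".join(t) for t in product("ABCD", repeat=n)]

    def encode(target, favourite):
        # Rotate the lexicographic order so that `favourite` comes first
        # (one particular choice of "nth permutation"), then emit the
        # target's index in binary, with leading zeros discarded.
        start = all_files.index(favourite)
        order = all_files[start:] + all_files[:start]
        return format(order.index(target), "b")

    print(encode("ABCD", favourite="ABCD"))   # '0'        -> 1 bit instead of 8
    print(encode("DDDD", favourite="ABCD"))   # '11100100' -> 8 bits, no gain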

    Average case

    I do not have a reference at hand to confirm this, but if I remember my college classes correctly, when the only information available is the relative frequency of each symbol, the best average-case strategy is to assign shorter codes to the more frequent symbols, for example:

    Character   Frequency    Binary repr.
    ---------------------------------------
    A           0.5          0
    B           0.25         10
    C           0.125        110
    D           0.125        111
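
    This table is exactly the Huffman code for these frequencies. As a sketch (mine, using the standard textbook construction rather than anything given in the answer), the following reproduces it:

    import heapq
    from itertools import count

    def huffman(freqs):
        """Build a prefix code by repeatedly merging the two least frequent subtrees."""
        tiebreak = count()                    # keeps the heap from comparing dicts
        heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + b for s, b in c1.items()}
            merged.update({s: "1" + b for s, b in c2.items()})
            heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
        return heap[0][2]

    print(huffman({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}))
    # {'A': '0', 'B': '10', 'C': '110', 'D': '111'} (up to swapping 0s and 1s)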
    

    The same file - which previously used 2 bits per character - is now represented with an average of:

    0.5 * 1 + 0.25 * 2 + 0.125 * 3 + 0.125 * 3 = 1.75 bits per character
    

    Since 1.75 is 87.5% of 2, this is the average case whenever the set of files to be compressed follows this distribution. Note that not every file encoded this way will be smaller - files in which the less frequent symbols appear more often than their stated probabilities will grow in size. But, by the statement, such files must occur less often in the set than the others.*

    * Without these probabilities, average-case performance also drops to zero: under a uniform distribution, the total growth of the documents that grow equals the total reduction of the documents that shrink (again by the pigeonhole principle).
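
    A quick numeric check of this figure (my own sketch): because the probabilities are all powers of 1/2, the entropy of the source coincides with the average length of the code above, so no lossless code can do better than 1.75 bits per symbol on average.

    from math import log2

    probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
    code  = {"A": "0", "B": "10", "C": "110", "D": "111"}     # the table above

    entropy = sum(-p * log2(p) for p in probs.values())
    avg_len = sum(p * len(code[s]) for s, p in probs.items())

    print(entropy, avg_len)        # both 1.75 bits per symbol
    print(avg_len / 2)             # 0.875 -> 87.5% of the original 2 bits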

        
    19.09.2015 / 02:09