Lossless data compression Flashcards
What is the interpretation of entropy in data compression?
The entropy H(X) = H is the data compression limit as well as the number of bits needed in random number generation, and codes achieving H turn out to be optimal from many points of view.
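As a quick illustration, here is a minimal Python sketch (using a made-up four-symbol source) that computes H(X); the printed value is the smallest achievable average number of bits per symbol for lossless coding of that source.

```python
import math

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum p(x) log p(x) for a distribution given as a dict."""
    return -sum(px * math.log(px, base) for px in p.values() if px > 0)

# Hypothetical four-symbol source
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(entropy(p))  # 1.75 bits per symbol: no lossless code can do better on average
```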
Source code
A source code C for a random variable X is a mapping from the range of X to D*, the set of finite-length strings of symbols from a D-ary alphabet. Let C(x) denote the codeword corresponding to x and let l(x) denote the length of C(x).
Codebook
A codebook is a reference table or mapping used in encoding and decoding processes to associate input symbols with their corresponding codewords.
Expected codelength
The expected length L(C) of a source code C(x) for a random variable X with probability mass function p(x) is given by
L(C) = \sum_{x∈X} p(x) * l(x),
where l(x) is the length of the codeword associated with x.
The expected length L of an optimal code for a single symbol is bounded by
H(X) ≤ L < H(X) + 1,
and when encoding blocks of n source symbols the expected length per symbol L_n satisfies
H(X) ≤ L_n < H(X) + 1/n.
Hence, by using large block lengths we can achieve an expected code length per symbol arbitrarily close to the entropy.
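A small worked example, assuming a toy dyadic source and the familiar prefix code {a: 0, b: 10, c: 110, d: 111}; here the expected length L(C) equals the entropy exactly.

```python
# Toy source and prefix code (both made up for illustration)
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Expected length L(C) = sum over x of p(x) * l(x)
L = sum(p[x] * len(code[x]) for x in p)
print(L)  # 1.75 bits, matching H(X) for this dyadic source
```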
What is the expected length of a code, if the assumed distribution is incorrect?
The expected length under p(x) of the code assignment l(x) = ⌈log(1 / q(x))⌉ satisfies
H(p) + D(p||q) ≤ E_p[ l(X)] < H(p) + D(p||q) + 1.
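A numerical sketch of this penalty with invented distributions p and q: the code lengths are designed for the wrong distribution q and then averaged under the true p, and the result lands between H(p) + D(p||q) and H(p) + D(p||q) + 1.

```python
import math

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # true distribution
q = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}    # assumed (incorrect) distribution

H_p = -sum(px * math.log2(px) for px in p.values())
D_pq = sum(px * math.log2(px / q[x]) for x, px in p.items())

# Shannon code lengths l(x) = ceil(log 1/q(x)), evaluated under the true p
lengths = {x: math.ceil(math.log2(1 / q[x])) for x in q}
E_l = sum(p[x] * lengths[x] for x in p)

print(H_p, D_pq, E_l)  # 1.75, 0.25, 2.0 -> H(p) + D(p||q) <= E_l < H(p) + D(p||q) + 1
```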
What are the classes of codes?
All codes > Nonsingular codes > Uniquely decodable codes > Instantaneous codes
Nonsingular if every element of the range of X maps into a different string in D*
Uniquely decodable if its extension (concatenation of codewords) is nonsingular
Prefix code or instantaneous code if no codeword is a prefix of any other codeword
An instantaneous code can be decoded without reference to future codewords since the end of a codeword is immediately recognizable.
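To make the "no reference to future codewords" point concrete, here is a toy decoder for the assumed prefix code {a: 0, b: 10, c: 110, d: 111}; it emits a symbol the moment the buffered bits match a codeword, without any lookahead.

```python
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
decode_table = {cw: sym for sym, cw in code.items()}

def decode_prefix(bits):
    """Instantaneous decoding: the end of each codeword is recognizable immediately."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode_table:
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

print(decode_prefix("010110111"))  # "abcd"
```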
Kraft's inequality
For any instantaneous code (prefix code) over an alphabet of size D, the codeword lengths l_1, l_2, . . . , l_m must satisfy the inequality
\sum_i D^{−l_i} ≤ 1
Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.
It also holds for codes with a countably infinite number of codewords, and for uniquely decodable codes (McMillan's inequality).
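A minimal check of the inequality; the codeword-length lists below are illustrative.

```python
def kraft_sum(lengths, D=2):
    """Kraft sum: an instantaneous D-ary code with these lengths exists iff the sum is <= 1."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> a binary prefix code exists (e.g. 0, 10, 110, 111)
print(kraft_sum([1, 1, 2]))     # 1.25 -> no prefix (or uniquely decodable) code exists
```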
McMillan's inequality
It is the Kraft inequality extended to uniquely decodable D-ary codes.
The codeword lengths of any uniquely decodable D-ary code must satisfy the Kraft inequality
\sum D^{−l_i} ≤ 1.
Conversely, given a set of codeword lengths that satisfy this inequality, it is possible to construct a uniquely decodable code with these codeword lengths.
It also holds for codes with a countably infinite number of codewords.
Huffman code
The Huffman code is built as a tree. Initially, all nodes are leaf nodes; each contains the symbol itself, its weight (frequency of appearance), and optionally a link to a parent node, which makes it easy to read the codeword (in reverse) starting from a leaf node. Internal nodes contain a weight, links to two child nodes, and an optional link to a parent node.
The process begins with the leaf nodes containing the probabilities of the symbols they represent. It then takes the two nodes with the smallest probability and creates a new internal node having these two nodes as children; the weight of the new node is set to the sum of the weights of its children. The process is then applied again to the new internal node and the remaining nodes (the two nodes just merged are excluded), and this is repeated until only one node remains, which is the root of the Huffman tree.
Huffman coding generates prefix codes, meaning no codeword is a prefix of another. No other code for the same alphabet can have a lower expected length than the code constructed by this algorithm.
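A compact Python sketch of the construction using a min-heap; the symbol probabilities are made up, and the 0/1 labelling of branches is arbitrary, so other valid Huffman codes with the same codeword lengths exist.

```python
import heapq
from itertools import count

def huffman_code(p):
    """Build a binary Huffman code from a dict of symbol -> probability by repeatedly
    merging the two lowest-probability nodes. The counter breaks ties in the heap."""
    tie = count()
    heap = [(prob, next(tie), {sym: ""}) for sym, prob in p.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(huffman_code(p))  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```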
If we know an approximation X’ of the source X, how many bits do we on average need to send to know X?
We need H(X|X’) bits, which can be upper bounded using Fano’s inequality. This is called conditional lossless source coding.
Arithmetic coding
Arithmetic coding is a method of lossless data compression that encodes an entire message into a single fractional value between 0 and 1.
Range Division:
Each symbol in the input is assigned a probability (or frequency). The range [0,1) is divided into subranges proportional to these probabilities.
Encoding:
The message is encoded by successively narrowing the range based on the sequence of symbols in the input.
Precision:
Longer messages result in narrower ranges, requiring higher precision to represent.
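A floating-point sketch of the encoding loop; the three-symbol model is invented, and real coders use integer arithmetic with renormalization instead of Python floats to avoid the precision problem noted above.

```python
def arithmetic_encode(message, p):
    """Narrow [low, high) once per symbol; any number in the final interval encodes the message."""
    # Cumulative distribution: symbol -> (cum_low, cum_high)
    cum, acc = {}, 0.0
    for sym, prob in p.items():
        cum[sym] = (acc, acc + prob)
        acc += prob

    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        c_lo, c_hi = cum[sym]
        low, high = low + width * c_lo, low + width * c_hi
    return (low + high) / 2

p = {"a": 0.5, "b": 0.25, "c": 0.25}
print(arithmetic_encode("aab", p))  # a single fraction in [0, 1) representing the whole message
```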
Binary fraction to decimal coding
Start from the rightmost bit
Add the bit's value to the running sum
Divide the sum by 2
Repeat with the next bit to the left
Example:
1100
1/2(0+0) = 0
1/2(0+0) = 0
1/2(1+0) = 0.5
1/2(1+0.5) = 0.75
So the binary fraction 0.1100 equals 0.75 in decimal.
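The same right-to-left folding in a few lines of Python (the function name is only for illustration):

```python
def binary_fraction_to_decimal(bits):
    """Evaluate 0.b1 b2 ... bn by folding from the rightmost bit: x <- (bit + x) / 2."""
    x = 0.0
    for b in reversed(bits):
        x = (int(b) + x) / 2
    return x

print(binary_fraction_to_decimal("1100"))  # 0.75, i.e. 0.1100 in binary = 1/2 + 1/4
```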
Operational rate
The operational rate, R, refers to the actual rate at which information is transmitted, stored, or processed in a system, taking into account practical constraints such as system design, coding schemes, and channel conditions. It is often expressed in terms of bits per symbol and is constrained to
H(X) ≤ R < H(X) + 1/N
It is also known as the bitrate.
Fano's inequality
It is used to bound how many bits are needed on average to know X based on an approximation X’.
For any estimator X̂ such that X → Y → X̂, with P_e = Pr(X ≠ X̂), we have
H(P_e) + P_e * log |X| ≥ H(X|X̂) ≥ H(X|Y).
This inequality can be weakened to
1 + P_e * log |X| ≥ H(X|Y)
or
P_e ≥ (H(X|Y) − 1) / log |X|.
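A one-line numerical use of the weakened form; the conditional entropy and alphabet size below are made-up values.

```python
import math

def fano_lower_bound(H_X_given_Y, alphabet_size):
    """Weakened Fano bound: P_e >= (H(X|Y) - 1) / log |X|, with log base 2."""
    return (H_X_given_Y - 1) / math.log2(alphabet_size)

print(fano_lower_bound(2.5, 16))  # 0.375 -> any estimator errs at least 37.5% of the time
```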
Expected code length
H(X) ≤ L_n < H(X) + 1/n
where L_n is the expected codeword length per input symbol