L15 - Data Compression: Entropy and Huffman Coding Flashcards
What is the purpose of compression?
To remove / reduce redundancy in data being transmitted.
Define Entropy…
- The probability of an outcome of a set of events, each element of which is associated with a probability P.
- We can’t compress better than the entropy of the data.
What does Entropy provide a benchmark for?
- Benchmark for how well data can be compressed.
- The number of bits or symbols of a decodable data source must be greater than or equal to the entropy.
What is the formula for Entropy H?
- Sum( Pi Log(1/Pi) ) for all M where M is the number of symbol / probability pairs.
How would you calculate then entropy for written plain english?
- Sum ( probability of word * log(1/prob.of word) ) for every word in the alphabet.
What effect does the increase or decrease of entropy have on compression ability of the data?
- Increase in entropy -> Decrease in compression ability.
- Decrease in entropy -> Increase in compression ability.
The limit to which data can be compressed will always be…
The entropy of the data source.
What is the purpose of the Huffman Coding Algorithm?
An algorithm used to compress data from various mediums e.g Text, Audio etc.
What are the steps of the Huffman Coding Algorithm?
- Sort symbols in data by their proabilities in decreasing order (highest at top)
- Recursively sum the probabilities of the lowest probability events.
- Terminate if the sum of the 2 symbol probabilities == 1.
Do a Huffman Coding example on the data { A:0.16, B:0.51, C:0.09, D:0.13, E:0.11 } and calculate the average number of bits per symbol and compare this to the entropy…
- { B -> 1, A -> 011, D -> 010, E -> 001, C -> 000 }
- Bits per symbol -> 10.51 + 30.16 + 30.13 + 30.11 + 3*0.09 = 1.98 bits/symbol.
- The entropy is 1.964
- Thus our calculated bits/symbol is close to entropy, and can be considered a good compression algorithm.
What type of algorithm is Huffman Coding?
Greedy
What is the time complexity of Huffman Coding?
O( n log n )
Give the pros and cons of Huffman Coding…
Pros:
- Prefix free -> Instantly and uniquely decodable.
- Use in many compression standards -> JPEG, MPEG etc.
Cons:
- Need to know probabilities of symbols
- Error propagation -> If error occurs it will cause error overflow to next probability calculation.