Lecture 4 - Huffman Encoding Flashcards
What kind of text compression method is Huffman encoding?
Statistical
What is each character replaced by?
A variable-length code
What are characters represented by?
Unique codewords of varying lengths
What is special about frequently occurring characters?
They are represented by shorter codewords than less frequent characters.
Can a codeword be a prefix of another codeword? Why?
No
- this would make decompression ambiguous: e.g. with codewords 0, 01 and 1, the input 01 could decode as one character or two
What is this method based on?
The Huffman Tree.
What is a Huffman tree?
A binary tree where each character is represented by a leaf node, and the codeword for a character is given by the path from the root to that leaf.
Bit code for a Huffman tree.
left = 0
right = 1
- the prefix property follows because characters sit only at leaves, so no codeword's root-to-leaf path can continue on to another character
Steps for building a Huffman tree? (a code sketch follows this list)
- add leaves (one per char), weighted by that char's frequency
- repeatedly add a parent to the two parentless nodes of smallest weight
- the weight of the new node equals the sum of the weights of its child nodes
- stop when only one parentless node (the root) remains
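A minimal sketch of the build in Python, using heapq as the priority queue; the function name build_huffman_tree and the nested-tuple tree representation are illustrative (not the lecture's notation), and it assumes at least two distinct characters:

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_tree(text):
    # Leaves are single chars; internal nodes are (left, right) tuples.
    freqs = Counter(text)
    ids = count()  # tie-breaker so the heap never has to compare two trees
    heap = [(weight, next(ids), ch) for ch, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                  # m - 1 merges for m distinct chars
        w1, _, a = heapq.heappop(heap)    # the two parentless nodes ...
        w2, _, b = heapq.heappop(heap)    # ... of smallest weight
        heapq.heappush(heap, (w1 + w2, next(ids), (a, b)))  # weight = sum of children
    return heap[0][2]                     # the last parentless node is the root

tree = build_huffman_tree("abracadabra")
```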
What is the weighted path length of a Huffman tree?
sum of (weight * distance to root) for each leaf
What is special about a Huffman tree in relation to WPL?
Huffman trees have a minimum WPL over all binary trees with the given leaf nodes
Does a Huffman tree need to be unique?
No, there can be many solutions that are optimal.
Why do we care about WPL in relation to compression?
Because the WPL is exactly the number of bits in the compressed file (taking each leaf's weight to be its character's frequency)
- bits = sum over chars (frequency of char × code length of char)
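A worked check in Python: below is one optimal Huffman tree for "abracadabra", hard-coded as nested tuples (other optimal shapes exist); the bit count it gives is the tree's WPL:

```python
from collections import Counter

def code_lengths(tree, depth=0):
    # A leaf's depth is the length of its codeword (tree = nested tuples, chars at leaves).
    if isinstance(tree, str):
        return {tree: depth}
    left, right = tree
    return {**code_lengths(left, depth + 1), **code_lengths(right, depth + 1)}

tree = ('a', (('c', 'd'), ('b', 'r')))  # one optimal tree for "abracadabra"
freqs = Counter("abracadabra")          # a:5, b:2, r:2, c:1, d:1
bits = sum(freqs[ch] * d for ch, d in code_lengths(tree).items())
print(bits)                             # 23 = the WPL of this tree
```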
What is the complexity of building a Huffman tree?
O(n + m log m) overall, where n is the length of the text and m is the number of distinct characters
- O(n) to find the frequencies
- O(m log m) to construct the code: O(m) to build the heap, then O(log m) per insertion/removal, with m - 1 merge iterations before a single node remains
What is the complexity of building a Huffman tree if m (the number of distinct chars) is treated as a constant?
O(n)
- the O(n) frequency count dominates, since the O(m log m) tree construction becomes constant
What does compression use?
A code table (an array of codewords indexed by character); each codeword is found by tracing the path from the root to that character's leaf in the built tree.
- O(m log m) to build the table: m characters, so m paths of length <= log m
- O(n) to compress: n characters in the text, so n O(1) table lookups
Compression is O(m log m + n)
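A sketch of the table build and the compression pass, reusing the illustrative nested-tuple tree from the worked example above:

```python
def make_code_table(tree, prefix=""):
    # One walk over the tree: a left edge appends '0', a right edge appends '1'.
    if isinstance(tree, str):             # leaf: the path so far is the codeword
        return {tree: prefix}
    left, right = tree
    table = make_code_table(left, prefix + "0")
    table.update(make_code_table(right, prefix + "1"))
    return table

tree = ('a', (('c', 'd'), ('b', 'r')))
table = make_code_table(tree)             # {'a': '0', 'c': '100', 'd': '101', 'b': '110', 'r': '111'}
compressed = "".join(table[ch] for ch in "abracadabra")  # n O(1) lookups
print(compressed)                         # 01101110100010101101110 (23 bits)
```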
What does decompression use?
The tree directly, which makes decompression O(n log m).
- each codeword is decoded by following its bits from the root down to a leaf (a path of length <= log m) and replacing it with that leaf's character; this happens once per character of the original text
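A sketch of the decoding walk over the same illustrative tree; the bit string below is the output of the compression sketch above:

```python
def decompress(bits, tree):
    # Follow each bit from the root; emit the char on reaching a leaf, then restart.
    out, node = [], tree
    for b in bits:
        node = node[0] if b == "0" else node[1]  # 0 = go left, 1 = go right
        if isinstance(node, str):                # reached a leaf
            out.append(node)
            node = tree                          # next codeword starts at the root
    return "".join(out)

tree = ('a', (('c', 'd'), ('b', 'r')))
print(decompress("01101110100010101101110", tree))  # abracadabra
```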
If we assume m is a constant, what is the time complexity of compression and decompression?
O(n)
What is a problem with Huffman Encoding?
The Huffman tree must be stored with the compressed file.
Why must the Huffman tree be stored with the compressed file?
Because the decompressor needs the same tree to map bit paths back to characters; without it, decompression would be impossible.
What are the alternatives to Huffman Encoding?
- use a fixed set of frequencies based on typical values for text (this will likely reduce the compression ratio)
- use adaptive Huffman coding: the same tree is built and adapted by both the compressor and the decompressor as characters are encoded/decoded (this is likely to slow down compression and decompression)