Lecture 4 - Huffman Encoding Flashcards
What kind of text compression method is Huffman encoding?
Statistical
What is each character replaced by?
A variable length code
What are characters represented by?
Unique codewords of varying lengths
What is special about frequently occurring characters?
They are represented by shorter codewords than less frequent characters.
Can a codeword be a prefix of another codeword? Why?
No
- This would make decompression ambiguous
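A minimal sketch of why a prefix causes ambiguity, assuming codes are stored as a dict from character to bit string. The helper name `is_prefix_free` and both example codes are made up for illustration and are not from the lecture.

```python
def is_prefix_free(codes):
    """Return True if no codeword is a prefix of a different codeword."""
    words = sorted(codes.values())
    # In sorted order, a codeword that is a prefix of any other codeword
    # is also a prefix of its immediate successor, so checking neighbours
    # is enough.
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))


# "0" is a prefix of "01", so the bits "01" could mean "b" or "ac".
bad = {"a": "0", "b": "01", "c": "1"}
# No codeword starts with another one, so decoding is unambiguous.
good = {"a": "0", "b": "10", "c": "11"}

print(is_prefix_free(bad), is_prefix_free(good))
```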
What is this method based on?
The Huffman Tree.
What is a Huffman tree?
A binary tree where each character is represented by a leaf node and the codeword for a character is given by the path from the root to the leaf.
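How are bits assigned to the branches of a Huffman tree?
left = 0
right = 1
- the prefix property follows because characters appear only at leaves, so no codeword's path continues on to another character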
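The left = 0, right = 1 rule can be sketched as a short tree walk. The representation here is an assumption for illustration: a leaf is a single-character string and an internal node is a `(left, right)` tuple. The function name `codewords` and the example tree are hypothetical.

```python
def codewords(tree, prefix=""):
    """Map each leaf character to the bit string of its root-to-leaf path."""
    if isinstance(tree, str):
        # A tree with a single leaf has an empty path; give it one bit.
        return {tree: prefix or "0"}
    left, right = tree
    codes = codewords(left, prefix + "0")          # going left appends 0
    codes.update(codewords(right, prefix + "1"))   # going right appends 1
    return codes


# Hypothetical tree: "a" is the left child of the root,
# "b" and "c" share a parent on the right.
tree = ("a", ("b", "c"))
codes = codewords(tree)
print(codes)
```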
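What are the steps for building a Huffman tree?
- create one leaf per character, weighted by its frequency
- repeatedly pick the two parentless nodes of smallest weight and give them a new parent
- the weight of the new node is the sum of the weights of its two children
- stop when only one parentless node (the root) remains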
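The steps above can be sketched with Python's `heapq` module as the priority queue. This is one possible implementation under stated assumptions, not necessarily the lecture's: leaves are single-character strings, internal nodes are `(left, right)` tuples, and the name `build_huffman_tree` is made up for illustration.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_tree(text):
    """Return (total_weight, tree) for the characters of a non-empty text."""
    if not text:
        raise ValueError("text must not be empty")
    freq = Counter(text)
    tie = count()  # unique tie-breaker so nodes themselves are never compared
    # One leaf per character, weighted by its frequency.
    heap = [(w, next(tie), ch) for ch, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two parentless nodes of smallest weight...
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # ...and give them a parent whose weight is the sum of theirs.
        heapq.heappush(heap, (w1 + w2, next(tie), (left, right)))
    weight, _, tree = heap[0]
    return weight, tree


weight, tree = build_huffman_tree("aaaabbc")
print(weight, tree)
```

The weight of the root always equals the length of the text, since it is the sum of all character frequencies.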
What is the weighted path length of a Huffman tree?
sum of (weight * distance to root) for each leaf
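As a worked example of the formula, here is a small sketch that computes the WPL recursively. The tree representation (strings for leaves, `(left, right)` tuples for internal nodes), the function name, and the example weights are all assumptions made for illustration.

```python
def weighted_path_length(tree, weights, depth=0):
    """Sum of weight * distance-to-root over all leaves of the tree."""
    if isinstance(tree, str):
        return weights[tree] * depth
    left, right = tree
    return (weighted_path_length(left, weights, depth + 1)
            + weighted_path_length(right, weights, depth + 1))


tree = ("a", ("b", "c"))
weights = {"a": 4, "b": 2, "c": 1}
# a is at depth 1, b and c are at depth 2: 4*1 + 2*2 + 1*2
wpl = weighted_path_length(tree, weights)
print(wpl)
```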
What is special about a Huffman tree in relation to WPL?
Huffman trees have a minimum WPL over all binary trees with the given leaf nodes
Does a Huffman tree need to be unique?
No, there can be many solutions that are optimal.
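A small illustration of non-uniqueness, using hypothetical weights a=1, b=1, c=2, d=2. After a and b are merged there are three parentless nodes of weight 2, so the algorithm may legitimately pick any two of them. The two resulting trees have different shapes but the same WPL. The tree representation and function name are assumptions for this sketch.

```python
def weighted_path_length(tree, weights, depth=0):
    """Sum of weight * distance-to-root over all leaves of the tree."""
    if isinstance(tree, str):
        return weights[tree] * depth
    left, right = tree
    return (weighted_path_length(left, weights, depth + 1)
            + weighted_path_length(right, weights, depth + 1))


weights = {"a": 1, "b": 1, "c": 2, "d": 2}

# Tie broken by merging (a, b) with c first, then with d.
skewed = ((("a", "b"), "c"), "d")
# Tie broken by merging c with d first, then with (a, b).
balanced = (("a", "b"), ("c", "d"))

wpl_skewed = weighted_path_length(skewed, weights)      # 1*3 + 1*3 + 2*2 + 2*1
wpl_balanced = weighted_path_length(balanced, weights)  # 1*2 + 1*2 + 2*2 + 2*2
print(wpl_skewed, wpl_balanced)
```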
Why do we care about WPL in relation to compression?
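Because, with character frequencies as the leaf weights, the WPL is the number of bits in the compressed text (not counting the space needed to store the code itself)
- bits = sum over chars (frequency of char × code length of char)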
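The formula can be checked against an actual encoding. The code table and text below are hypothetical examples chosen for illustration.

```python
from collections import Counter

codes = {"a": "0", "b": "10", "c": "11"}  # hypothetical prefix-free code
text = "aaaabbc"

# Encode by concatenating the codeword of every character.
encoded = "".join(codes[ch] for ch in text)

# Predict the size from frequencies and code lengths alone.
freq = Counter(text)
bits = sum(freq[ch] * len(codes[ch]) for ch in freq)

print(encoded, len(encoded), bits)
```

For comparison, a fixed-length code for three distinct characters needs 2 bits per character, so the same text would take 14 bits.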
What is the complexity of building a Huffman tree?
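O(n + m log m) overall, where n is the length of the text and m is the number of distinct characters
- O(n) to find the frequencies
- O(m log m) to construct the code
- building the heap takes O(m) and each heap insertion/removal takes O(log m); there are m-1 iterations before only the root is left in the heap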
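To make the m-1 iteration count concrete, here is a sketch that only tracks node weights on a heap and counts the merges. It assumes Python's `heapq` as the heap; the function name `count_merges` is made up for illustration.

```python
import heapq
from collections import Counter


def count_merges(text):
    """Count the merge iterations needed to reduce the heap to one node."""
    weights = list(Counter(text).values())  # one O(n) pass over the text
    heapq.heapify(weights)                  # O(m) to build the heap
    merges = 0
    while len(weights) > 1:
        # Two removals and one insertion, each O(log m).
        a = heapq.heappop(weights)
        b = heapq.heappop(weights)
        heapq.heappush(weights, a + b)
        merges += 1
    return merges


print(count_merges("aaaabbc"))  # 3 distinct characters
print(count_merges("abcde"))    # 5 distinct characters
```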
What is the complexity of building a Huffman tree if m (the number of distinct chars) is treated as a constant?
O(n)
- the O(m log m) term becomes a constant, leaving the O(n) frequency count