Text Compression Flashcards
why is text compression different from other forms of data compression?
text compression must be exactly reconstructable
requires lossless coding
lossless compression
class of algorithms allowing original data to be perfectly reconstructed from compressed data
lossy compression
achieve data reduction by discarding information
this is suitable for certain media types such as images, video, and audio
dictionary methods
work by replacing a word or text fragment with an index to an entry in a dictionary
digram coding is an example
symbolwise methods
work by estimating the probabilities of symbols and coding one symbol at a time, using shorter codewords for the more likely symbols
relies on a modelling step and a coding step
modeling step
estimation of the probability for the symbols in the text (the better the estimates, the higher compression can be achieved)
coding step
conversion of probabilities from a model into a bitstream
static method
uses a fixed model or fixed dictionary derived in advance of any text to be compressed
semi-static
uses the current text to build a model or dictionary during one pass, then applies it in a second pass
adaptive method
builds the model/dictionary adaptively during a single pass
information content
the number of bits in which a symbol s should be coded is its information content I(s)
I(s) = -log(base 2) P[s], where P[s] is the predicted probability of s
entropy
the average information per symbol over the whole alphabet is given by the following formula:
H = the sum over all symbols s of P[s] x I(s) = -(sum of P[s] log(base 2) P[s])
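The two formulas above can be sketched directly; a minimal Python example (function names are my own, not from the cards):

```python
import math

def information_content(p):
    """I(s) = -log2 P[s]: the number of bits in which a symbol
    of predicted probability p should be coded."""
    return -math.log2(p)

def entropy(probs):
    """H = sum of P[s] * I(s) over the whole alphabet:
    the average information per symbol."""
    return sum(p * information_content(p) for p in probs)

# a uniform four-symbol alphabet needs 2 bits per symbol on average
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
```

Note that a skewed distribution has lower entropy than a uniform one, which is why better probability estimates allow higher compression.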
zero-order models
ignore preceding context
what does preceding context influence
probability of encountering a given symbol at a particular place in a text
static modeling
derives and then uses a single model for all texts.
performs poorly on texts that differ from those used to construct the model.
adaptive modeling
derives model during encoding
- begins with a base probability distribution
- refines the model as more symbols are encountered
semi-static modeling
derives a model for the file in the first pass
better suited to the text than a static model, but inefficient because it must make two passes over the text and transmit the model
coding
the task of determining the output representation of a symbol given the probability distributions
Huffman coding
uses a code tree for encoding and decoding, with branches labelled 0 and 1
each leaf node is a symbol of the alphabet
The Algorithm:
- code tree constructed bottom-up from the probabilistic model according to the following algorithm
1. probabilities are associated with leaf nodes
2. identify two nodes with smallest probabilities -> join under a parent node whose probability is their sum
3. repeat step 2 until only one node remains
4. 0s and 1s are assigned to each branch
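The four steps above can be sketched with a heap; a minimal Python illustration (not a production encoder — the tie-breaking counter is only there to keep the heap comparisons valid):

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Build a Huffman code tree bottom-up: repeatedly join the two
    nodes with the smallest probabilities under a parent whose
    probability is their sum, then assign 0/1 to each branch."""
    tiebreak = count()
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                       # step 3: repeat until one node
        p1, _, left = heapq.heappop(heap)      # step 2: two smallest nodes
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):            # step 4: label branches 0/1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10})
print(codes)  # more probable symbols get shorter codewords
```

As expected, the most probable symbol ("a") receives the shortest codeword and the least probable symbols the longest.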
prefix free
no codeword is the prefix of another
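This property is easy to check mechanically; a small sketch (function name is my own), relying on the fact that after sorting, a codeword and any codeword it prefixes end up adjacent:

```python
def is_prefix_free(codes):
    """A code is instantaneously decodable iff no codeword
    is a prefix of another codeword."""
    cs = sorted(codes)
    return all(not cs[i + 1].startswith(cs[i]) for i in range(len(cs) - 1))

print(is_prefix_free(["0", "10", "110", "111"]))  # True
print(is_prefix_free(["0", "01", "11"]))          # False: "0" prefixes "01"
```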
canonical Huffman coding
addresses the storage issue (high memory cost) by generating the codes in a standardized format
all codes of a given code length are assigned values sequentially
advantages over Huffman coding:
- efficient storage of codebooks
- more efficient coding algorithm
arithmetic coding
a coding mechanism, used primarily for images, that can code arbitrarily close to the entropy and is therefore optimal in terms of compression; common in adaptive modeling
output: a stream of bits
disadvantages of arithmetic coding
- slower than Huffman coding
- not easy to start decoding in middle of the compressed stream
- therefore less suitable for text retrieval
LZ77
example of decoding
the encoder output is a series of triples
1st: indicates how far back in the decoded output to look for the next phrase
2nd: length of that phrase
3rd: next character from the input
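A minimal decoder following the triple layout above (Python, names my own); note the inner copy loop appends one symbol at a time, so a phrase may legally overlap the output it is being copied from:

```python
def lz77_decode(triples):
    """Decode LZ77 output: each triple is
    (distance back in the decoded output, phrase length, next character).
    distance 0 / length 0 means no match: just emit the character."""
    out = []
    for dist, length, ch in triples:
        start = len(out) - dist
        for i in range(length):          # copy may overlap its own output
            out.append(out[start + i])
        out.append(ch)                   # 3rd element: next input character
    return "".join(out)

print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (2, 3, "c")]))  # "ababac"
```

The third triple copies 3 characters starting 2 back, even though only 2 characters have been decoded so far, which is what makes runs compress well.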
synchronisation points
assume the smallest unit of random access in the compressed archive is the document
either store the bit offset of each document or ensure each document ends on a byte boundary
what is the algorithm of canonical Huffman coding
- calculate the code length for each symbol
a. lengths come from the symbol probabilities (e.g. an ordinary Huffman computation); group symbols with the same code length into blocks and order them within each block (e.g. alphabetically)
- assign codes by counting up
a. code length of 2: 00, 01
b. code length of 3: 100, 101, 110
c. code length of 4: 1110, 1111
- store/transmit the code
a. the ordered sequence of symbols
b. the number of symbols at each code length
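The counting-up step can be sketched as follows (Python, function name my own); it reproduces the 00/01, 100/101/110, 1110/1111 pattern from the card, and shows why only the symbol order and the length counts need to be stored:

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from code lengths alone:
    sort symbols by (length, symbol), then count up, shifting the
    code left by one bit each time the code length increases."""
    code = 0
    prev_len = 0
    codes = {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)              # extend when length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

print(canonical_codes({"a": 2, "b": 2, "c": 3, "d": 3, "e": 3, "f": 4, "g": 4}))
# {'a': '00', 'b': '01', 'c': '100', 'd': '101', 'e': '110',
#  'f': '1110', 'g': '1111'}
```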