Text Compression Flashcards
why is text compression different from other forms of data compression?
text compression must be exactly reconstructable
requires lossless coding
lossless compression
class of algorithms allowing original data to be perfectly reconstructed from compressed data
lossy compression
achieve data reduction by discarding information
this is suitable for certain media types such as images, video, and audio
dictionary methods
work by replacing a word or text fragment with an index to an entry in a dictionary
digram coding is an example
symbolwise methods
work by estimating the probabilities of symbols and coding one symbol at a time, using shorter codewords for the more likely symbols
relies on a modelling step and a coding step
modeling step
estimation of the probability for the symbols in the text (the better the estimates, the higher compression can be achieved)
coding step
conversion of probabilities from a model into a bitstream
static method
uses a fixed model or fixed dictionary derived in advance of any text to be compressed
semi-static
uses the current text to build a model or dictionary during one pass, then applies it in a second pass
adaptive method
builds the model/dictionary adaptively during a single pass
information content
the number of bits in which a symbol s should be coded is its information content I(s)
I(s) = -log(base 2) P[s], where P[s] is the predicted probability of s
entropy
the average information per symbol over the whole alphabet is given by the following formula:
H = the sum over all symbols s of P[s] x I(s) = -(sum of P[s] log(base 2) P[s])
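The two formulas above can be sketched directly; a minimal Python example (function names are my own, not from the cards):

```python
import math

def information_content(p):
    """I(s) = -log2 P[s]: the number of bits in which a symbol
    of predicted probability p should be coded."""
    return -math.log2(p)

def entropy(probs):
    """H = sum of P[s] * I(s) over the whole alphabet:
    the average information per symbol."""
    return sum(p * information_content(p) for p in probs)

# a uniform four-symbol alphabet needs 2 bits per symbol on average
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
```

Note that a skewed distribution has lower entropy than a uniform one, which is why better probability estimates allow higher compression.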
zero-order models
ignore preceding context
what does preceding context influence
probability of encountering a given symbol at a particular place in a text
static modeling
derives and then uses a single model for all texts.
performs poorly on texts that differ from those used to construct the model.
adaptive modeling
derives model during encoding
- begins with a base probability distribution
- refines the model as more symbols are encountered
semi-static modeling
derives a model for the file in the first pass
better suited to the text than a static model, but inefficient because it must make two passes over the text and transmit the model
coding
the task of determining the output representation of a symbol given the probability distributions
Huffman coding
uses a code tree for encoding and decoding, with branches labelled 0 and 1
each leaf node is a symbol of the alphabet
The Algorithm:
- code tree constructed bottom-up from the probabilistic model according to the following algorithm
1. probabilities are associated with leaf nodes
2. identify two nodes with smallest probabilities -> join under a parent node whose probability is their sum
3. repeat step 2 until only one node remains
4. 0s and 1s are assigned to each branch
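The four steps above can be sketched with a heap; a minimal Python illustration (not a production encoder — the tie-breaking counter is only there to keep the heap comparisons valid):

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Build a Huffman code tree bottom-up: repeatedly join the two
    nodes with the smallest probabilities under a parent whose
    probability is their sum, then assign 0/1 to each branch."""
    tiebreak = count()
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                       # step 3: repeat until one node
        p1, _, left = heapq.heappop(heap)      # step 2: two smallest nodes
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):            # step 4: label branches 0/1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10})
print(codes)  # more probable symbols get shorter codewords
```

As expected, the most probable symbol ("a") receives the shortest codeword and the least probable symbols the longest.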
prefix free
no codeword is the prefix of another
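This property is easy to check mechanically; a small sketch (function name is my own), relying on the fact that after sorting, a codeword and any codeword it prefixes end up adjacent:

```python
def is_prefix_free(codes):
    """A code is instantaneously decodable iff no codeword
    is a prefix of another codeword."""
    cs = sorted(codes)
    return all(not cs[i + 1].startswith(cs[i]) for i in range(len(cs) - 1))

print(is_prefix_free(["0", "10", "110", "111"]))  # True
print(is_prefix_free(["0", "01", "11"]))          # False: "0" prefixes "01"
```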
canonical Huffman coding
addresses the storage issue (high memory cost) by generating the codes in a standardized format
all codes of a given code length are assigned values sequentially
advantages over Huffman coding:
- efficient storage of codebooks
- more efficient coding algorithm
arithmetic coding
a coding mechanism, used primarily for images, that can code arbitrarily close to the entropy and is therefore optimal in terms of compression; common in adaptive modeling
output: a stream of bits
disadvantages of arithmetic coding
- slower than Huffman coding
- not easy to start decoding in middle of the compressed stream
- therefore less suitable for text retrieval
LZ77
example of decoding
the encoder output is a series of triples
1st: indicates how far back in the decoded output to look for the next phrase
2nd: length of that phrase
3rd: next character from the input
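A minimal decoder following the triple layout above (Python, names my own); note the inner copy loop appends one symbol at a time, so a phrase may legally overlap the output it is being copied from:

```python
def lz77_decode(triples):
    """Decode LZ77 output: each triple is
    (distance back in the decoded output, phrase length, next character).
    distance 0 / length 0 means no match: just emit the character."""
    out = []
    for dist, length, ch in triples:
        start = len(out) - dist
        for i in range(length):          # copy may overlap its own output
            out.append(out[start + i])
        out.append(ch)                   # 3rd element: next input character
    return "".join(out)

print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (2, 3, "c")]))  # "ababac"
```

The third triple copies 3 characters starting 2 back, even though only 2 characters have been decoded so far, which is what makes runs compress well.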
synchronisation points
assume the smallest unit of random access in the compressed archive is the document
either store the bit offset of each document or ensure each document ends on a byte boundary
what is the algorithm of canonical Huffman coding
- calculate the code length for each symbol
a. lengths come from the symbol probabilities (e.g. an ordinary Huffman computation); group symbols with the same code length into blocks and order them within each block (e.g. alphabetically)
- assign codes by counting up
a. code length of 2: 00, 01
b. code length of 3: 100, 101, 110
c. code length of 4: 1110, 1111
- store/transmit the code
a. the ordered sequence of symbols
b. the number of symbols at each code length
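The counting-up step can be sketched as follows (Python, function name my own); it reproduces the 00/01, 100/101/110, 1110/1111 pattern from the card, and shows why only the symbol order and the length counts need to be stored:

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from code lengths alone:
    sort symbols by (length, symbol), then count up, shifting the
    code left by one bit each time the code length increases."""
    code = 0
    prev_len = 0
    codes = {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)              # extend when length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

print(canonical_codes({"a": 2, "b": 2, "c": 3, "d": 3, "e": 3, "f": 4, "g": 4}))
# {'a': '00', 'b': '01', 'c': '100', 'd': '101', 'e': '110',
#  'f': '1110', 'g': '1111'}
```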