Text Compression Flashcards

1
Q

why is text compression different from other forms of data compression?

A

the original text must be exactly reconstructible

so it requires lossless coding

2
Q

lossless compression

A

class of algorithms allowing original data to be perfectly reconstructed from compressed data

3
Q

lossy compression

A

achieves data reduction by discarding information

this is suitable for certain media types such as images, video, and audio

4
Q

dictionary methods

A

work by replacing a word or text fragment with an index to an entry in a dictionary
digram coding is an example (see the sketch below)
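
A minimal sketch of the digram idea (the helper names and the use of byte values above the ASCII range as dictionary indices are illustrative assumptions, not a standard format):

```python
from collections import Counter

def build_digram_dict(text, max_entries=8):
    # pick the most frequent character pairs as dictionary entries
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return [pair for pair, _ in pairs.most_common(max_entries)]

def digram_encode(text, dictionary):
    # greedily replace dictionary digrams with a single stand-in symbol
    # whose value encodes the index of the dictionary entry
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in dictionary:
            out.append(chr(128 + dictionary.index(pair)))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

d = build_digram_dict("the theatre then")
print(digram_encode("the theatre then", d))
```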

5
Q

symbolwise methods

A

work by estimating the probabilities of symbols and coding one symbol at a time, using shorter codewords for the more likely symbols
rely on a modeling step and a coding step

6
Q

modeling step

A

estimation of the probabilities of the symbols in the text (the better the estimates, the higher the compression that can be achieved)

7
Q

coding step

A

conversion of probabilities from a model into a bitstream

8
Q

static method

A

uses a fixed model or fixed dictionary derived in advance of any text to be compressed

9
Q

semi-static

A

uses the current text to build a model or dictionary in a first pass and then applies it in a second pass

10
Q

adaptive method

A

builds the model/dictionary adaptively during a single pass

11
Q

information content

A

the number of bits in which a symbol s should be coded is its information content I(s)
I(s) = -log₂ P[s], where P[s] is the predicted probability of s

12
Q

entropy

A

the average information per symbol over the whole alphabet is given by the following formula:
H = Σ P[s] · I(s) = -Σ P[s] log₂ P[s] (computed in the sketch below)
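
A short sketch computing both I(s) and H for a zero-order model estimated from symbol counts (function names are illustrative):

```python
import math
from collections import Counter

def information_content(p):
    # I(s) = -log2 P[s]: bits in which a symbol of probability p should be coded
    return -math.log2(p)

def entropy(text):
    # H = sum over symbols of P[s] * I(s), with P[s] estimated from counts
    counts = Counter(text)
    n = len(text)
    return sum((c / n) * information_content(c / n) for c in counts.values())

print(entropy("abracadabra"))  # average bits per symbol under a zero-order model
```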

13
Q

zero-order models

A

ignore preceding context

14
Q

what does preceding context influence?

A

probability of encountering a given symbol at a particular place in a text

15
Q

static modeling

A

derives and then uses a single model for all texts.

will perform poorly on texts different from those used in constructing the model.

16
Q

adaptive modeling

A

derives model during encoding

  • begins with a base probability distribution
  • refines the model as more symbols are encountered
17
Q

semi-static modeling

A

derives a model for the file in the first pass

better suited than static modeling, but inefficient because it must make two passes over the text and transmit the model

18
Q

coding

A

the task of determining the output representation of a symbol given the probability distributions

19
Q

Huffman coding

A

uses a code tree for encoding and decoding (branches labeled 0 or 1)
each leaf node is a symbol of the alphabet
The Algorithm:
- the code tree is constructed bottom-up from the probabilistic model (see the sketch below):
1. probabilities are associated with leaf nodes
2. identify the two nodes with the smallest probabilities -> join them under a parent node whose probability is their sum
3. repeat step 2 until only one node remains
4. 0s and 1s are assigned to each branch
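
A compact sketch of this bottom-up construction, using a heap to find the two smallest-probability nodes (an illustrative implementation, not a reference one):

```python
import heapq
import itertools

def huffman_codes(probs):
    # build codewords by repeatedly joining the two least probable nodes
    tie = itertools.count()  # tie-breaker so heap entries never compare dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # the two smallest probabilities
        p2, _, right = heapq.heappop(heap)
        # joining under a parent: prefix 0 on one branch, 1 on the other
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

print(huffman_codes({"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}))
```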

20
Q

prefix free

A

no codeword is a prefix of another (so a compressed stream can be decoded unambiguously)

21
Q

canonical Huffman coding

A

addresses the storage issue (the high memory cost of Huffman code trees): the codes are generated in a standardized format
all codes of a given code length are assigned values sequentially

advantages over Huffman coding:

  • efficient storage of codebooks
  • more efficient coding algorithm
22
Q

arithmetic coding

A

a coding mechanism, primarily used for images, that can code arbitrarily close to the entropy and is thus optimal in terms of compression; common in adaptive modeling
output: a stream of bits (see the sketch below)
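
A toy float-based sketch of the interval-narrowing idea (real coders use integer arithmetic with renormalization and emit bits incrementally; the helper names are illustrative):

```python
def cumulative_ranges(probs):
    # assign each symbol a slice [lo, hi) of [0, 1) in a fixed order
    ranges, c = {}, 0.0
    for s in sorted(probs):
        ranges[s] = (c, c + probs[s])
        c += probs[s]
    return ranges

def arithmetic_encode(text, probs):
    # narrow [low, high) by each symbol's slice; any number inside the
    # final interval identifies the whole message
    ranges = cumulative_ranges(probs)
    low, high = 0.0, 1.0
    for ch in text:
        lo, hi = ranges[ch]
        span = high - low
        low, high = low + span * lo, low + span * hi
    return (low + high) / 2

def arithmetic_decode(value, n, probs):
    ranges = cumulative_ranges(probs)
    out = []
    for _ in range(n):
        for s, (lo, hi) in ranges.items():
            if lo <= value < hi:
                out.append(s)
                value = (value - lo) / (hi - lo)  # rescale into the chosen slice
                break
    return "".join(out)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
print(arithmetic_decode(arithmetic_encode("abca", p), 4, p))  # "abca"
```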

23
Q

disadvantages of arithmetic coding

A
  • slower than Huffman coding
  • not easy to start decoding in the middle of the compressed stream
  • therefore less suitable for text retrieval
24
Q

LZ77

A

the encoder output is a series of triples; decoding is shown in the sketch below
1st: how far back in the decoded output to look for the next phrase (the offset)
2nd: the length of that phrase
3rd: the next character from the input
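
A minimal decoder for this triple format (the sample triples are an illustrative input, not from the source):

```python
def lz77_decode(triples):
    # rebuild the text from <offset, length, char> triples
    out = []
    for offset, length, char in triples:
        start = len(out) - offset        # how far back to look
        for i in range(length):          # copy the phrase; it may overlap itself
            out.append(out[start + i])
        out.append(char)                 # then emit the explicit next character
    return "".join(out)

print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]))
# -> "aacaacabcabaaac"
```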

25
Q

synchronisation points

A

assume the smallest unit of random access in the compressed archive is the document
either store the bit offset of each document or ensure each document ends on a byte boundary

26
Q

what is the algorithm of canonical Huffman coding

A
  1. calculate the code length for each symbol
    a. derived from the symbol's probability (more probable symbols get shorter codes)
  2. group symbols with the same code length into blocks and order them (e.g. alphabetically)
  3. assign codes by counting up (see the sketch below)
    a. code length of 2: 00, 01
    b. code length of 3: 100, 101, 110
    c. code length of 4: 1110, 1111
  4. store/transmit the code
    a. the ordered sequence of symbols
    b. the number of symbols at each code length
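
A sketch of steps 2-3, assigning codes from precomputed code lengths (a minimal version; per step 4, a real codebook stores only the symbol order and the count of symbols at each length):

```python
def canonical_codes(lengths):
    # group by code length, order alphabetically within a group,
    # and assign codes by counting up
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    codes, code, prev_len = {}, 0, 0
    for s in symbols:
        code <<= lengths[s] - prev_len   # left-shift when the length grows
        codes[s] = format(code, "0{}b".format(lengths[s]))
        code += 1
        prev_len = lengths[s]
    return codes

# code lengths matching the example above:
print(canonical_codes({"a": 2, "b": 2, "c": 3, "d": 3, "e": 3, "f": 4, "g": 4}))
# {'a': '00', 'b': '01', 'c': '100', 'd': '101', 'e': '110', 'f': '1110', 'g': '1111'}
```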