05 Index Compression Flashcards
Lossless compression
All information is preserved
Lossy compression
Not all information is preserved, but better compression rates can be accomplished. Case folding, stemming and stop word elimination are forms of lossy compression.
Heaps' law
M = kT<sup>b</sup>, where
- M = number of distinct terms in a collection
- T = number of tokens in the collection
- k = constant, usually 30 ≤ k ≤ 100. Varies because vocabulary growth depends a lot on the nature of the collection and how it is processed (case folding and stemming reduce the growth rate, while including numbers and spelling errors increases it).
- b = constant, usually around 0.5.

The law is linear in log-log space.
Suggests:
- The dictionary size continues to increase with more documents in the collection rather than maximum vocabulary size being reached.
- The size of the dictionary is quite large for large collections.
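A minimal sketch of the estimate; the constants k = 44 and b = 0.49 below are just typical assumed values, not universal:

```python
def heaps_vocab_size(num_tokens, k=44, b=0.49):
    # Estimated number of distinct terms: M = k * T^b
    return k * num_tokens ** b

# For a collection of 100 million tokens these constants predict
# a vocabulary of roughly 3.7 * 10^5 distinct terms.
print(round(heaps_vocab_size(1e8)))
```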
Zipf's law
Helps us to understand how terms are distributed across documents, which helps us characterize the properties of the algorithms for compressing postings lists in a later section.
CF<sub>i</sub> = c · i<sup>k</sup>
The collection frequency CF<sub>i</sub> of the i-th most common term is c (some constant) times i raised to the power of k, with k = -1; that is, CF<sub>i</sub> is proportional to 1/i.
So if the most frequent term occurs CF<sub>1</sub> times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many occurrences, and so on.
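A quick numeric illustration of the 1/i behaviour; the starting frequency of 1,000,000 is made up for the example:

```python
# If the most frequent term occurred CF_1 = 1_000_000 times, Zipf's law with
# k = -1 predicts CF_i ≈ CF_1 / i for the i-th most frequent term.
cf_1 = 1_000_000
for i in range(1, 5):
    print(i, round(cf_1 / i))   # 1000000, 500000, 333333, 250000
```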
Front coding
Compression technique where the dictionary is stored as one long string and a common prefix of a subsequence of the term list is identified and then referred to with a special character.
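A rough sketch of the idea, using '*' to mark the end of the shared prefix and '<>' as a stand-in for the special character before each remaining suffix (the exact markers and layout vary between descriptions):

```python
import os

def front_code_block(terms):
    # Store the first term's length and the block's common prefix once,
    # marked with '*'; every following term then contributes only the length
    # of its remaining suffix plus the suffix itself, preceded by '<>'.
    prefix = os.path.commonprefix(terms)
    encoded = f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"
    for term in terms[1:]:
        suffix = term[len(prefix):]
        encoded += f"{len(suffix)}<>{suffix}"
    return encoded

print(front_code_block(["automata", "automate", "automatic", "automation"]))
# 8automat*a1<>e2<>ic3<>ion
```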
Variable byte codes
Technique for compressing PLs. First, store the gap from each docID to the next instead of the full docID. The first bit of every byte is a continuation bit.
Example
As long as there are more bytes to be read (from left to right), the first bit of each byte is 0; on the last byte of a number it is 1.
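A small sketch of the scheme; the gap values are just an example:

```python
def vb_encode_number(n):
    # Encode one gap as bytes; the continuation (first) bit is 0 on every
    # byte except the last byte of the number, where it is 1.
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128          # set the continuation bit on the last byte
    return bytes(chunks)

def vb_decode(bytestream):
    numbers, n = [], 0
    for byte in bytestream:
        if byte < 128:          # continuation bit 0: more bytes follow
            n = 128 * n + byte
        else:                   # continuation bit 1: last byte of this gap
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]        # example gaps
encoded = b"".join(vb_encode_number(g) for g in gaps)
print(vb_decode(encoded))      # [824, 5, 214577]
```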
Gamma codes
Technique for compressing PLs. Implements variable-length encoding by splitting the representation of a gap G into a pair of length and offset.
Example
The offset is the binary representation of the number with the leading 1 removed. The length is the length of the offset in unary code.
A gamma code is decoded by reading the unary code up to the 0 that terminates it; now we know how long the offset is. The offset can then be read correctly and the 1 that was chopped off during encoding is re-attached.
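A small sketch, using the convention that the unary length is written as 1s terminated by a 0:

```python
def gamma_encode(g):
    # offset: binary representation of g with the leading 1 removed
    # length: length of the offset, written in unary (1s terminated by a 0)
    offset = format(g, "b")[1:]
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":   # read the unary length up to its terminating 0
            length += 1
            i += 1
        i += 1                  # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        numbers.append(int("1" + offset, 2))  # re-attach the chopped-off leading 1
    return numbers

print(gamma_encode(13))                                  # 1110101
print(gamma_decode(gamma_encode(13) + gamma_encode(5)))  # [13, 5]
```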
Simple9 (S9) coding
Tries to produce a word-aligned code. Basic unit: 32 bits (one word). Several numbers are packed into one word. Each word is split into 4 control bits and 28 data bits; the 4 control bits say which layout of numbers is stored in the other 28 bits.
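The nine selectable layouts of the 28 data bits (hence the name) are commonly listed as:
- 28 numbers of 1 bit each
- 14 numbers of 2 bits
- 9 numbers of 3 bits (1 bit wasted)
- 7 numbers of 4 bits
- 5 numbers of 5 bits (3 bits wasted)
- 4 numbers of 7 bits
- 3 numbers of 9 bits (1 bit wasted)
- 2 numbers of 14 bits
- 1 number of 28 bits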
Universal code
A code, like the gamma code, with the property of being within a factor of optimal for an arbitrary distribution P is called universal.
Prefix free
Gamma codes, for instance, are prefix-free because no gamma code is the prefix of another. This means that there is always a unique decoding of a sequence of gamma codes and we do not need delimiters between them.
Parameter free
Gamma codes, for instance, are parameter-free. With codes that contain parameters, these must be stored and retrieved, and in dynamic indexing the distribution of gaps can change, so that the original parameters are no longer appropriate.
Rice coding
Take the average of all gaps in the PLs and round it down to the nearest smaller power of two. If the average is 113, the nearest smaller power of two is 2<sup>6</sup> = 64; therefore k = 6.
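A sketch of one common convention (encode gap − 1 as a unary quotient followed by a k-bit binary remainder; other descriptions flip the roles of 0s and 1s or skip the − 1):

```python
def rice_encode(gap, k):
    # Quotient in unary (1s terminated by a 0), then remainder in k binary bits.
    value = gap - 1                      # so that gaps >= 1 map to values >= 0
    q, r = divmod(value, 1 << k)
    return "1" * q + "0" + format(r, f"0{k}b")

# With k = 6 (average gap around 113, rounded down to 2^6 = 64):
print(rice_encode(113, 6))   # 10110000 -> quotient 1, remainder 48
```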
Entropy
The characteristic of a discrete probability distribution P that determines its coding properties (including whether a code is optimal) is its entropy H(P) = -∑<sub>x∈X</sub> P(x) log<sub>2</sub> P(x).
X is the set of all possible numbers we need to be able to encode (and therefore ∑<sub>x∈X</sub> P(x) = 1.0).
Entropy is a measure of uncertainty
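A tiny sketch of the formula; the entropy is the lower bound (in bits per value) on the expected code length of any code for the distribution:

```python
import math

def entropy(P):
    # H(P) = -sum over x in X of P(x) * log2 P(x)
    return -sum(p * math.log2(p) for p in P if p > 0)

print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits on average
print(entropy([1.0]))              # 0.0 -> no uncertainty, nothing to encode
```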