05 Index Compression Flashcards

1
Q

Lossless compression

A

All information is preserved

2
Q

Lossy compression

A

Not all information is preserved, but better compression rates can be accomplished. Case folding, stemming and stop word elimination are forms of lossy compression.

3
Q

Heaps’ law

A
M = kT^b, where
M = number of distinct terms (the vocabulary size) in a collection
T = number of tokens in the collection
k = constant, usually 30 <= k <= 100. It varies because vocabulary growth depends a lot on the nature of the collection and how it is processed (case folding and stemming reduce the growth rate, while including numbers and spelling errors increases it).
b = constant, usually about 0.5. The law is linear in log-log space.

Suggests:

  1. The dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached.
  2. The size of the dictionary is quite large for large collections.
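
As a rough illustration, a minimal sketch of the law in Python (the constants k = 44 and b = 0.49 are assumed values within the ranges above, often quoted for the Reuters-RCV1 collection; real values must be fitted to the collection at hand):

```python
# Minimal sketch of Heaps' law: estimate vocabulary size M from token count T.
# k and b are assumptions picked from the ranges on this card; real values
# have to be fitted to the collection in question.

def heaps_vocabulary_size(num_tokens: int, k: float = 44.0, b: float = 0.49) -> float:
    """Estimate the number of distinct terms M = k * T^b."""
    return k * num_tokens ** b

# The estimate keeps growing with T instead of saturating (point 1 above).
for tokens in (10_000, 1_000_000, 100_000_000):
    print(tokens, round(heaps_vocabulary_size(tokens)))
```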
4
Q

Zipf’s law

A

Helps us understand how terms are distributed across documents, which in turn helps characterize the properties of the algorithms for compressing postings lists in a later section.

CF_i = c * i^k, with k = -1.
The collection frequency CF_i of the i-th most common term is c (some constant) times i raised to the power of k.

So if the most frequent term occurs CF_1 times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many occurrences, and so on.
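
A minimal sketch of what the law predicts (the constant c, here the frequency of the most common term, is an assumed value for illustration):

```python
# Minimal sketch of Zipf's law: predicted collection frequency of the i-th
# most common term, CF_i = c * i^(-1).

def zipf_frequency(rank: int, c: float = 1_000_000) -> float:
    """Predicted collection frequency of the term with the given rank."""
    return c * rank ** -1

# The 2nd term occurs half as often as the 1st, the 3rd a third as often, ...
for rank in (1, 2, 3, 4):
    print(rank, zipf_frequency(rank))
```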

5
Q

Front coding

A

Dictionary compression technique where the dictionary is stored as one long string; a common prefix shared by a subsequence of the term list is identified once, and subsequent terms refer to it with a special character.
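
A minimal sketch of the idea for one block of a sorted dictionary (the '*' and '<>' marker characters follow a common textbook convention and are an assumption here, as is the helper name):

```python
# Minimal sketch of front coding for one block of a sorted dictionary.
# '*' ends the shared prefix in the first term; '<>' stands in for the
# prefix in the following terms. Both markers are illustrative choices.
import os

def front_code_block(terms: list[str]) -> str:
    """Encode a block of sorted terms by factoring out their common prefix."""
    prefix = os.path.commonprefix(terms)
    encoded = [f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"]
    for term in terms[1:]:
        suffix = term[len(prefix):]
        encoded.append(f"{len(suffix)}<>{suffix}")
    return "".join(encoded)

# Example: a block of terms sharing the prefix "automat".
print(front_code_block(["automata", "automate", "automatic", "automation"]))
# -> 8automat*a1<>e2<>ic3<>ion
```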

6
Q

Variable byte codes

A

Technique for compressing postings lists (PLs). First, store the gap from each docID to the next instead of the docIDs themselves. The first bit of every byte is a continuation bit.

As long as there are more bytes to be read (from left to right), the first bit of each byte is 0; in the last byte of a number it is 1.
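
A minimal sketch of the scheme (function names are my own; the convention that the last byte of a number has its first bit set to 1 matches the card):

```python
# Minimal sketch of variable byte (VB) encoding of a gap list.
# Each number is split into 7-bit chunks; the continuation bit is 0 for all
# bytes of a number except the last, where it is 1.

def vb_encode_number(n: int) -> bytes:
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128                   # set the continuation bit on the final byte
    return bytes(chunks)

def vb_encode(gaps: list[int]) -> bytes:
    return b"".join(vb_encode_number(g) for g in gaps)

def vb_decode(data: bytes) -> list[int]:
    numbers, n = [], 0
    for byte in data:
        if byte < 128:                  # continuation bit 0: more bytes follow
            n = n * 128 + byte
        else:                           # continuation bit 1: last byte of this number
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

# Example: docIDs 824, 829, 215406 stored as the gaps 824, 5, 214577.
print(vb_decode(vb_encode([824, 5, 214577])))   # -> [824, 5, 214577]
```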

7
Q

Gamma codes

A

Technique for compressing PLs. Implements variable length encoding by splitting the representation of a gap G into a pair of length and offset.

The offset is the binary representation of the number with the leading 1 removed. The length is the length of the offset, encoded in unary.

Decoding reads the unary code up to the 0 that terminates it; this tells us how long the offset is. The offset can then be read, and the leading 1 that was chopped off during encoding is prepended.
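
A minimal sketch over bit strings (function names are my own; gaps are assumed to be >= 1, since 0 has no gamma code):

```python
# Minimal sketch of gamma coding of a single gap G >= 1.
# offset = binary representation of G without its leading 1
# length = len(offset) in unary (that many 1s followed by a terminating 0)

def gamma_encode(gap: int) -> str:
    offset = bin(gap)[3:]               # strip '0b1', i.e. drop the leading 1
    length = "1" * len(offset) + "0"    # unary code for len(offset)
    return length + offset

def gamma_decode(bits: str) -> int:
    length = bits.index("0")            # read the unary part up to its terminating 0
    offset = bits[length + 1 : length + 1 + length]
    return int("1" + offset, 2)         # prepend the chopped-off leading 1

# 1 -> '0', 5 -> '11001', 13 -> '1110101', 1025 -> '111111111100000000001'
for g in (1, 5, 13, 1025):
    code = gamma_encode(g)
    print(g, code, gamma_decode(code))
```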

8
Q

Simple9 (S9) coding

A

Tries to produce a word-aligned code. The basic unit is 32 bits, and the goal is to pack several numbers into one 32-bit word. Each word is split into 4 control bits and 28 data bits. The 4 control bits describe how numbers are laid out in the other 28 bits (e.g. 28 1-bit numbers, 14 2-bit numbers, ..., one 28-bit number).
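
A minimal sketch of the packing (the nine (count, bits) layouts are the standard Simple9 selectors; the greedy choose-the-densest-layout strategy and the function names are assumptions for illustration):

```python
# Minimal sketch of Simple9 (S9): pack numbers into 32-bit words consisting of
# 4 selector (control) bits plus 28 data bits. The nine layouts give
# (how many numbers, bits per number), with count * bits <= 28.
LAYOUTS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def s9_encode(numbers: list[int]) -> list[int]:
    words, i = [], 0
    while i < len(numbers):
        for selector, (count, bits) in enumerate(LAYOUTS):
            chunk = numbers[i : i + count]
            # a layout fits if enough numbers remain and each fits in `bits` bits
            if len(chunk) == count and all(n < (1 << bits) for n in chunk):
                word = selector << 28
                for j, n in enumerate(chunk):
                    word |= n << (j * bits)
                words.append(word)
                i += count
                break
    return words

def s9_decode(words: list[int]) -> list[int]:
    out = []
    for word in words:
        count, bits = LAYOUTS[word >> 28]   # control bits say how to read the rest
        mask = (1 << bits) - 1
        out.extend((word >> (j * bits)) & mask for j in range(count))
    return out

gaps = [1, 3, 2, 7, 120, 5, 9000, 1]
print(s9_decode(s9_encode(gaps)) == gaps)   # -> True
```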

9
Q

Universal code

A

A code, like the gamma code, with the property of being within a factor of optimal for an arbitrary distribution P is called universal.

10
Q

Prefix free

A

Gamma codes, for instance, are prefix free because no gamma code is the prefix of another. This means that a sequence of gamma codes always has a unique decoding, and we do not need delimiters between them.

11
Q

Parameter free

A

Gamma codes, for instance, are parameter free. With codes that have parameters, these must be stored and retrieved. And in dynamic indexing, the distribution of gaps can change, so the original parameters may no longer be appropriate.

12
Q

Rice coding

A

Take the average of all gaps in the postings lists and round it down to the nearest power of two, 2^k. If the average is 113, then 2^6 = 64, so k = 6. Each gap is then encoded as the quotient (gap divided by 2^k) in unary followed by the remainder in k binary bits.
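
A minimal sketch (function names are my own; the quotient-in-unary plus k-bit-remainder form is the standard Rice scheme, and rounding the average gap down to a power of two follows the card):

```python
# Minimal sketch of Rice coding for a list of gaps.
# k is chosen so that 2^k is the largest power of two <= the average gap;
# each gap is then coded as quotient (unary) + remainder (k binary bits).

def choose_k(gaps: list[int]) -> int:
    avg = sum(gaps) / len(gaps)
    return int(avg).bit_length() - 1          # e.g. average 113 -> 2^6 = 64 -> k = 6

def rice_encode(gap: int, k: int) -> str:
    quotient, remainder = divmod(gap, 1 << k)
    return "1" * quotient + "0" + format(remainder, f"0{k}b")

gaps = [120, 99, 113, 120]                    # average 113
k = choose_k(gaps)                            # -> 6
print(k, [rice_encode(g, k) for g in gaps])
```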

13
Q

Entropy

A

The characteristic of a discrete probability distribution P that determines its coding properties (including whether a code is optimal) is its entropy:

H(P) = -∑x∈X P(x) log2 P(x)

X is the set of all possible numbers we need to be able to encode (and therefore ∑x∈X P(x) = 1.0).

Entropy is a measure of uncertainty.
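
A minimal sketch of computing H(P) (the example distributions are assumptions for illustration):

```python
# Minimal sketch: entropy H(P) = -sum over x in X of P(x) * log2 P(x).
import math

def entropy(probabilities: list[float]) -> float:
    assert abs(sum(probabilities) - 1.0) < 1e-9     # P(x) must sum to 1.0
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A uniform distribution maximizes uncertainty; a skewed one has lower entropy
# and is therefore easier to compress.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # -> 2.0 bits
print(entropy([0.9, 0.05, 0.03, 0.02]))    # -> ~0.62 bits
```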
