05 Index Compression Flashcards
Lossless compression
All information is preserved
Lossy compression
Not all information is preserved, but better compression rates can be accomplished. Case folding, stemming and stop word elimination are forms of lossy compression.
Heaps' law
M = kT<sup>b</sup>, where
- M = number of distinct terms in a collection
- T = number of tokens in the collection
- k = constant, usually 30 ≤ k ≤ 100. Varies because vocabulary growth depends a lot on the nature of the collection and how it is processed (case folding and stemming reduce the growth rate, while including numbers and spelling errors increases it).
- b = constant, usually around 0.5.

The law is linear in log-log space.
Suggests:
- The dictionary size continues to increase with more documents in the collection rather than maximum vocabulary size being reached.
- The size of the dictionary is quite large for large collections.
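A minimal sketch of the estimate; the constants k = 44 and b = 0.49 below are just typical assumed values, not universal:

```python
def heaps_vocab_size(num_tokens, k=44, b=0.49):
    # Estimated number of distinct terms: M = k * T^b
    return k * num_tokens ** b

# For a collection of 100 million tokens these constants predict
# a vocabulary of roughly 3.7 * 10^5 distinct terms.
print(round(heaps_vocab_size(1e8)))
```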
Zipf's law
Helps us to understand how terms are distributed across documents, which helps us characterize the properties of the algorithms for compressing postings lists in a later section.
CF<sub>i</sub> = c · i<sup>k</sup>
The collection frequency CF<sub>i</sub> of the i-th most common term is c (some constant) times i raised to the power of k, with k = -1; that is, CF<sub>i</sub> is proportional to 1/i.
So if the most frequent term occurs CF<sub>1</sub> times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many occurrences, and so on.
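A quick numeric illustration of the 1/i behaviour; the starting frequency of 1,000,000 is made up for the example:

```python
# If the most frequent term occurred CF_1 = 1_000_000 times, Zipf's law with
# k = -1 predicts CF_i ≈ CF_1 / i for the i-th most frequent term.
cf_1 = 1_000_000
for i in range(1, 5):
    print(i, round(cf_1 / i))   # 1000000, 500000, 333333, 250000
```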
Front coding
Compression technique where the dictionary is stored as one long string and a common prefix of a subsequence of the term list is identified and then referred to with a special character.
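A rough sketch of the idea, using '*' to mark the end of the shared prefix and '<>' as a stand-in for the special character before each remaining suffix (the exact markers and layout vary between descriptions):

```python
import os

def front_code_block(terms):
    # Store the first term's length and the block's common prefix once,
    # marked with '*'; every following term then contributes only the length
    # of its remaining suffix plus the suffix itself, preceded by '<>'.
    prefix = os.path.commonprefix(terms)
    encoded = f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"
    for term in terms[1:]:
        suffix = term[len(prefix):]
        encoded += f"{len(suffix)}<>{suffix}"
    return encoded

print(front_code_block(["automata", "automate", "automatic", "automation"]))
# 8automat*a1<>e2<>ic3<>ion
```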
Variable byte codes
Technique for compressing PLs. First, store the gap from each docID to the next instead of the full docID. The first bit of every byte is a continuation bit.
Example
As long as there are more bytes to be read (from left to right), the first bit of each byte is 0; on the last byte of a number it is 1.
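A small sketch of the scheme; the gap values are just an example:

```python
def vb_encode_number(n):
    # Encode one gap as bytes; the continuation (first) bit is 0 on every
    # byte except the last byte of the number, where it is 1.
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128          # set the continuation bit on the last byte
    return bytes(chunks)

def vb_decode(bytestream):
    numbers, n = [], 0
    for byte in bytestream:
        if byte < 128:          # continuation bit 0: more bytes follow
            n = 128 * n + byte
        else:                   # continuation bit 1: last byte of this gap
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]        # example gaps
encoded = b"".join(vb_encode_number(g) for g in gaps)
print(vb_decode(encoded))      # [824, 5, 214577]
```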
Gamma codes
Technique for compressing PLs. Implements variable-length encoding by splitting the representation of a gap G into a pair of length and offset.
Example
The offset is the binary representation of the number with the leading 1 removed. The length is the length of the offset in unary code.
A gamma code is decoded by reading the unary code up to the 0 that terminates it; now we know how long the offset is. The offset can then be read correctly and the 1 that was chopped off during encoding is re-attached.
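A small sketch, using the convention that the unary length is written as 1s terminated by a 0:

```python
def gamma_encode(g):
    # offset: binary representation of g with the leading 1 removed
    # length: length of the offset, written in unary (1s terminated by a 0)
    offset = format(g, "b")[1:]
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":   # read the unary length up to its terminating 0
            length += 1
            i += 1
        i += 1                  # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        numbers.append(int("1" + offset, 2))  # re-attach the chopped-off leading 1
    return numbers

print(gamma_encode(13))                                  # 1110101
print(gamma_decode(gamma_encode(13) + gamma_encode(5)))  # [13, 5]
```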
Simple9 (S9) coding
Tries to produce a word-aligned code. Basic unit: 32 bits (one word). Several numbers are packed into one word. Each word is split into 4 control bits and 28 data bits; the 4 control bits say which layout of numbers is stored in the other 28 bits.
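The nine selectable layouts of the 28 data bits (hence the name) are commonly listed as:
- 28 numbers of 1 bit each
- 14 numbers of 2 bits
- 9 numbers of 3 bits (1 bit wasted)
- 7 numbers of 4 bits
- 5 numbers of 5 bits (3 bits wasted)
- 4 numbers of 7 bits
- 3 numbers of 9 bits (1 bit wasted)
- 2 numbers of 14 bits
- 1 number of 28 bits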
Universal code
A code, like the gamma code, with the property of being within a factor of optimal for an arbitrary distribution P is called universal.
Prefix free
Gamma codes, for instance, are prefix-free because no gamma code is the prefix of another. This means that there is always a unique decoding of a sequence of gamma codes and we do not need delimiters between them.
Parameter free
Gamma codes, for instance, are parameter-free. With codes that contain parameters, these must be stored and retrieved, and in dynamic indexing the distribution of gaps can change, so that the original parameters are no longer appropriate.
Rice coding
Take the average of all gaps in the PLs and round it down to the nearest smaller power of two. If the average is 113, the nearest smaller power of two is 2<sup>6</sup> = 64; therefore k = 6.
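A sketch of one common convention (encode gap − 1 as a unary quotient followed by a k-bit binary remainder; other descriptions flip the roles of 0s and 1s or skip the − 1):

```python
def rice_encode(gap, k):
    # Quotient in unary (1s terminated by a 0), then remainder in k binary bits.
    value = gap - 1                      # so that gaps >= 1 map to values >= 0
    q, r = divmod(value, 1 << k)
    return "1" * q + "0" + format(r, f"0{k}b")

# With k = 6 (average gap around 113, rounded down to 2^6 = 64):
print(rice_encode(113, 6))   # 10110000 -> quotient 1, remainder 48
```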
Entropy
The characteristic of a discrete probability distribution P that determines its coding properties (including whether a code is optimal) is its entropy H(P) = -∑<sub>x∈X</sub> P(x) log<sub>2</sub> P(x).
X is the set of all possible numbers we need to be able to encode (and therefore ∑<sub>x∈X</sub> P(x) = 1.0).
Entropy is a measure of uncertainty
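A tiny sketch of the formula; the entropy is the lower bound (in bits per value) on the expected code length of any code for the distribution:

```python
import math

def entropy(P):
    # H(P) = -sum over x in X of P(x) * log2 P(x)
    return -sum(p * math.log2(p) for p in P if p > 0)

print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits on average
print(entropy([1.0]))              # 0.0 -> no uncertainty, nothing to encode
```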