06 Index Compression Flashcards
Compression
Why do we compress dictionaries ?
- less disk space (=save money)
- keep more stuff in memory (=increase speed)
- increase speed of transfering data
Decompression algorithms are fast
Compression
Lossy compression vs Lossless compression
- Lossy compression: discard some information (MP3, JPEG, some preprocessing steps)
- Lossless compression: all information is preserved
e.g. downcasing, stop words, stemming, number elimination
Term statistics
What is Heaps’ Law ?
emperical laws that describe the growth of
the vocabulary (terms) in collections
M = kT^b given that 30 <= k <= 100, b = 0.5
Term statistics
What is Zipf’s law ?
- describe the frequency of the vocabulary
- if the most frequent term (the) occurs cf1 times
the second most frequent term (of) has half as many occurrences
cf2 = 1/2 * cf1, cf3 = 1/3 * cf1
Few frequent terms, many rare terms
Dictionary compression
Why are Fixed-width entries bad?
- most of the bytes in the term column are wasted
- cannot handle long words (too big for fixed-width)
Dictionary compression
How does a dictionary as a string loook like?
store all of terms nect to each other and have pointers indicating word boundary
jump to positions using pointers (don’t have to read the whole string)
Dictionary compression
How does a dictionary as a string
with blocking loook like?
- has fewer pointers
- have to go through a part of a string until the right position is found
- tradeoff: runtime
Dictionary compression
What is front coding ?
store prefix once and store suffixes separately
Postings compression
What is the main key of posting compression ?
we store gaps insted og docIDs
gaps will never be larger than docIDs
Postings compression
Compute the VB code of 130
130(10) = 1000,0010(2)
0000 0001 1000 0010
Compute the Gamma code of 130
130(10) = 1000,0010(2)
11111110 0000010
gamma code = length in unary code + offset
what are properties of Gamma code ?
- prefix-free
- universal
- parameter-free
What is more efficient? Why?
a) gamma code, b) variable byte encoding
b) variable byte encoding
* conceptually simpler
* VB is aligned (hardware works in byte level)