06 Index Compression Flashcards

1
Q

Compression

Why do we compress dictionaries ?

A
  • less disk space (=save money)
  • keep more stuff in memory (=increase speed)
  • increase speed of transfering data

Decompression algorithms are fast

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Compression

Lossy compression vs Lossless compression

A
  • Lossy compression: discard some information (MP3, JPEG, some preprocessing steps)
  • Lossless compression: all information is preserved

e.g. downcasing, stop words, stemming, number elimination

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Term statistics

What is Heaps’ Law ?

A

emperical laws that describe the growth of
the vocabulary
(terms) in collections

M = kT^b given that 30 <= k <= 100, b = 0.5

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Term statistics

What is Zipf’s law ?

A
  • describe the frequency of the vocabulary
  • if the most frequent term (the) occurs cf1 times
    the second most frequent term (of) has half as many occurrences
    cf2 = 1/2 * cf1, cf3 = 1/3 * cf1

Few frequent terms, many rare terms

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Dictionary compression

Why are Fixed-width entries bad?

A
  • most of the bytes in the term column are wasted
  • cannot handle long words (too big for fixed-width)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Dictionary compression

How does a dictionary as a string loook like?

A

store all of terms nect to each other and have pointers indicating word boundary

jump to positions using pointers (don’t have to read the whole string)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Dictionary compression

How does a dictionary as a string
with blocking
loook like?

A
  • has fewer pointers
  • have to go through a part of a string until the right position is found
  • tradeoff: runtime
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Dictionary compression

What is front coding ?

A

store prefix once and store suffixes separately

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Postings compression

What is the main key of posting compression ?

A

we store gaps insted og docIDs

gaps will never be larger than docIDs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Postings compression

Compute the VB code of 130

130(10) = 1000,0010(2)

A

0000 0001 1000 0010

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Compute the Gamma code of 130

130(10) = 1000,0010(2)

A

11111110 0000010

gamma code = length in unary code + offset

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

what are properties of Gamma code ?

A
  • prefix-free
  • universal
  • parameter-free
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is more efficient? Why?

a) gamma code, b) variable byte encoding

A

b) variable byte encoding
* conceptually simpler
* VB is aligned (hardware works in byte level)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly