Advanced Indexing Flashcards

1
Q

What is the difference between a term-partitioned and a document-partitioned index?

A

For term-partitioned, one machine handles a subrange of terms
For document-partitioned, one machine handles a subrange of documents

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is the simplest approach to dynamic indexing to handle inserts?

A

For insert, maintain a main index on disk. New docs go to an auxiliary index in memory. Search across both then merge results

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is the simplest approach to dynamic indexing to handle deletes?

A

Use an invalidation bit-vector for deleted docs. Filter searches using this bit vector

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is the issue with using a main and auxiliary index?

A
  1. The problem of frequent merges
  2. Poor performance during merge
  3. Collection-wide statistics are hard to maintain
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How can we make the main and auxiliary index approach more efficient?

A

Merging is made efficient if we keep a separate file for each postings list as we would then just need to append the lists together.
However, we then need a lot of files which is inefficient for the OS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are some reasons for using file compression?

A
  1. Uses less disk space
  2. Able to keep more stuff in memory
  3. Increase speed of data transfer from disk to memory
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Why should we use file compression for a dictionary structure?

A
  1. Allows us to keep it small enough to fit in main memory
  2. Small enough to keep some postings lists in main memory as well
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

How do we solve the size issue of a postings list using gap encoding?

A

Instead of recording the docID of every document in the postings list, we instead record the size of the gap between doc IDs where the term appears. The hope is that gaps can be encoded with fewer bits than the doc IDs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How can we use gamma codes for gap encoding?

A
  1. Represent a gap as a length and offset
  2. The offset is the gap in binary with the leading bit cut off
  3. Length is the length of offset
  4. Encode length as a unary code
  5. Concatenate the length and offset
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Show how 13 is gamma encoded

A
  1. Offset: 13 -> 1101 -> 101
  2. Length: 3 -> 1110
  3. Output: 1110101
How well did you know this?
1
Not at all
2
3
4
5
Perfectly