Advanced Indexing Flashcards

Question 1

Q

What is the difference between a term-partitioned and a document-partitioned index?

Answer

A

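In a term-partitioned index, each machine handles a subrange of terms and stores the full postings lists for those terms
In a document-partitioned index, each machine handles a subset of documents and stores a complete index over just those documents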
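
A minimal sketch of the two schemes, assuming a toy collection held in memory. The names (docs, term_partitioned, document_partitioned) and the routing rules (alphabetical boundaries for terms, doc ID modulo machine count for documents) are illustrative assumptions, not a prescribed design.

```python
import bisect
from collections import defaultdict

# Toy collection (illustrative data only): doc ID -> text.
docs = {
    1: "new home sales",
    2: "home prices rise",
    3: "sales rise in july",
    4: "july home sales",
}

def term_partitioned(docs, boundaries):
    """Route each term to a machine by alphabetical range.

    boundaries=["m"] means terms below "m" go to machine 0, the rest to machine 1.
    """
    shards = [defaultdict(list) for _ in range(len(boundaries) + 1)]
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].split())):
            machine = bisect.bisect_right(boundaries, term)
            shards[machine][term].append(doc_id)
    return shards

def document_partitioned(docs, num_machines):
    """Route each document to a machine; that machine indexes all of its terms."""
    shards = [defaultdict(list) for _ in range(num_machines)]
    for doc_id in sorted(docs):
        shard = shards[doc_id % num_machines]
        for term in sorted(set(docs[doc_id].split())):
            shard[term].append(doc_id)
    return shards
```

With term partitioning, a one-term query touches only the machine that owns the term. With document partitioning, every machine must be asked and the partial results combined.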

Question 2

Q

What is the simplest approach to dynamic indexing to handle inserts?

Answer

A

Maintain a large main index on disk. New docs go into a small auxiliary index in memory. Each query searches both indexes and merges the results. When the auxiliary index grows too large, merge it into the main index
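
A rough sketch of the idea, assuming the on-disk main index can be stood in for by a plain dict from term to sorted doc IDs. The class name DynamicIndex and its methods are hypothetical.

```python
class DynamicIndex:
    """Main index plus in-memory auxiliary index (illustrative sketch)."""

    def __init__(self, main_index):
        # Stand-in for the on-disk main index: term -> sorted list of doc IDs.
        self.main = main_index
        # Auxiliary index for newly inserted documents, kept in memory.
        self.aux = {}

    def add_document(self, doc_id, text):
        # Inserts only ever touch the auxiliary index.
        for term in set(text.split()):
            self.aux.setdefault(term, []).append(doc_id)

    def search(self, term):
        # Query both indexes and merge the two postings lists.
        merged = set(self.main.get(term, [])) | set(self.aux.get(term, []))
        return sorted(merged)
```

For example, after idx = DynamicIndex({"home": [1, 2]}) and idx.add_document(3, "home sales"), a search for "home" sees documents from both indexes.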

Question 3

Q

What is the simplest approach to dynamic indexing to handle deletes?

Answer

A

Use an invalidation bit vector to mark deleted docs. Filter search results against this bit vector before returning them
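
One possible way to sketch this, assuming doc IDs are integers in the range 0 to num_docs - 1. The class and function names are made up for illustration.

```python
class InvalidationBitVector:
    """One bit per document; a set bit means the document was deleted."""

    def __init__(self, num_docs):
        # Round up to a whole number of bytes.
        self.bits = bytearray((num_docs + 7) // 8)

    def delete(self, doc_id):
        self.bits[doc_id // 8] |= 1 << (doc_id % 8)

    def is_deleted(self, doc_id):
        return bool(self.bits[doc_id // 8] & (1 << (doc_id % 8)))

def filter_results(postings, deleted):
    # Drop invalidated doc IDs from a postings list before returning results.
    return [d for d in postings if not deleted.is_deleted(d)]
```

The index itself is left untouched on delete; the deleted documents are physically removed only when the index is next rebuilt or merged.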

Question 4

Q

What are the issues with using a main and auxiliary index?

Answer

A

Merges are needed frequently
Search performance is poor while a merge is running
Collection-wide statistics are hard to maintain

Question 5

Q

How can we make the main and auxiliary index approach more efficient?

Answer

A

Merging is efficient if we keep a separate file for each postings list, since a merge is then just an append to each list.
However, this requires a very large number of files, which the OS file system handles inefficiently

Question 6

Q

What are some reasons for using file compression?

Answer

A

Uses less disk space
Lets us keep more data in memory
Increases the speed of data transfer from disk to memory

Question 7

Q

Why should we use file compression for a dictionary structure?

Answer

A

Keeps the dictionary small enough to fit in main memory
Leaves enough room to keep some postings lists in main memory as well

Question 8

Q

How does gap encoding reduce the size of a postings list?

Answer

A

Instead of storing the doc ID of every document in the postings list, store the gap between consecutive doc IDs where the term appears. The hope is that most gaps can be encoded with fewer bits than the doc IDs themselves
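
A small sketch of the conversion, assuming the postings list is a sorted list of positive integer doc IDs. The function names to_gaps and from_gaps are illustrative, and the first entry is kept as a gap from 0.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Turn sorted doc IDs into gaps; the first gap is measured from 0."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def from_gaps(gaps):
    """Recover the doc IDs by taking a running sum of the gaps."""
    return list(accumulate(gaps))

postings = [283042, 283043, 283044, 283045]
print(to_gaps(postings))             # [283042, 1, 1, 1]
print(from_gaps(to_gaps(postings)))  # [283042, 283043, 283044, 283045]
```

For a frequent term most gaps are small, so a variable-length code can store them in far fewer bits than a fixed-width doc ID.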

Question 9

Q

How can we use gamma codes for gap encoding?

Answer

A

Represent a gap as a pair of length and offset
The offset is the gap in binary with the leading 1 bit cut off
The length is the number of bits in the offset
Encode the length in unary code (that many 1s followed by a 0)
Concatenate the length and the offset
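
The steps above can be sketched as a short function. This is an illustrative version that returns the code as a string of '0' and '1' characters rather than packed bits; the name gamma_encode is an assumption.

```python
def gamma_encode(gap):
    """Return the gamma code of a positive integer as a bit string."""
    if gap < 1:
        raise ValueError("gamma codes are only defined for gaps >= 1")
    # bin(gap) looks like '0b1...'; slicing from index 3 drops '0b' and the leading 1.
    offset = bin(gap)[3:]
    # Unary code for the offset length: that many 1s, then a terminating 0.
    length = "1" * len(offset) + "0"
    return length + offset

print(gamma_encode(1))  # 0
print(gamma_encode(5))  # 11001
```

A gap of 1 has an empty offset, so its code is just the single terminating 0 bit.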

Question 10

Q

Show how 13 is gamma encoded

Answer

A

Offset: 13 -> 1101 -> 101 (leading 1 removed)
Length: 3 -> 1110 (unary)
Output: 1110101
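
To check the answer, a decoder can read the code back. This is a minimal sketch that assumes a well-formed string of '0' and '1' characters holding one or more concatenated gamma codes; gamma_decode is a hypothetical name.

```python
def gamma_decode(bits):
    """Decode a string of concatenated gamma codes into a list of gaps."""
    gaps, i = [], 0
    while i < len(bits):
        # Count the 1s of the unary length, then skip the terminating 0.
        length = 0
        while bits[i] == "1":
            length += 1
            i += 1
        i += 1
        # Read the offset and put the leading 1 bit back.
        offset = bits[i:i + length]
        i += length
        gaps.append(int("1" + offset, 2))
    return gaps

print(gamma_decode("1110101"))  # [13]
```

Because the unary length says exactly how many offset bits follow, codes can be concatenated with no delimiter between them.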

(10 cards)