Advanced Indexing Flashcards
What is the difference between a term-partitioned and a document-partitioned index?
For term-partitioned, one machine handles a subrange of terms
For document-partitioned, one machine handles a subrange of documents
What is the simplest approach to dynamic indexing to handle inserts?
For insert, maintain a main index on disk. New docs go to an auxiliary index in memory. Search across both then merge results
What is the simplest approach to dynamic indexing to handle deletes?
Use an invalidation bit-vector for deleted docs. Filter searches using this bit vector
What is the issue with using a main and auxiliary index?
- The problem of frequent merges
- Poor performance during merge
- Collection-wide statistics are hard to maintain
How can we make the main and auxiliary index approach more efficient?
Merging is made efficient if we keep a separate file for each postings list as we would then just need to append the lists together.
However, we then need a lot of files which is inefficient for the OS
What are some reasons for using file compression?
- Uses less disk space
- Able to keep more stuff in memory
- Increase speed of data transfer from disk to memory
Why should we use file compression for a dictionary structure?
- Allows us to keep it small enough to fit in main memory
- Small enough to keep some postings lists in main memory as well
How do we solve the size issue of a postings list using gap encoding?
Instead of recording the docID of every document in the postings list, we instead record the size of the gap between doc IDs where the term appears. The hope is that gaps can be encoded with fewer bits than the doc IDs
How can we use gamma codes for gap encoding?
- Represent a gap as a length and offset
- The offset is the gap in binary with the leading bit cut off
- Length is the length of offset
- Encode length as a unary code
- Concatenate the length and offset
Show how 13 is gamma encoded
- Offset: 13 -> 1101 -> 101
- Length: 3 -> 1110
- Output: 1110101