04 Indexing Flashcards
Indexer
The machine that performs the index construction
Index construction techiques
With not enough main memory (MM), we need to use an external sorting algortihm, that is, one that uses disk. Must minimize the number of random disk seeks during sorting.
- Blocked sort-based indexing
- Single-pass in-memory indexing
- Distributed indexing (MapReduce)
- Dynamic indexing
Blocked sort-based indexing (BSBI)
Steps
- Accumulate terms and docIDs into MM and ÒconvertÓ terms into termID until a fixed block size is reached.
- Invert it (convert the block into an inverted index)
- Store indermediate sorted results in disk
- Merge all intermediate results into the final index
Single-pass in-memory indexing (SPIMI)
BSBI has excellent scaling properties, but needs a data structure for mapping terms to termIDs. For a very large collection, this data structure does not fit into memory. Therefore, use SPIMI
Steps: Best to look at the pseudo-code on page 67. Essentially, it adds terms to PLs. Because we do not know how large the PL will be when we create it, we allocate space for a short PL initially, and then double the space each time it is full. When there is no more free memory availiable, we write the index of the block to disk, after we sort the terms. Then, as in BSBI we merge blocks into the final index.
+ No sorting involved
+ Need not store termIDs
Distributed indexing
(MapReduce) When collections are to large to fit on a single machine (Web search). A mas- ter node directs the process of assigning and reassigning tasks to individuals worker nodes (machines).
Steps
- The input data are split into n splits where the size of the split is chosen to ensure that the work can be distibuted evenly and efficiently.
- Map phase: Consists of mapping splits of the input data to key- value pairs. (termID, docID). Each parser writes its output to a local intermediate file, the segment files.
- Reduce phase: Collecting all values (here: docID) for a given key (termID), into one list is the task of the inverters in the reduce phase.
- The list of values is sorted for each key and written to the final sorted postings lists
Remember, parsers and inverters are not seperate sets of machines. The master identifies idle machines and assigns tasks to them. This was the design of the Google indexing system in 2004, however it was much more complex.
Dynamic indexing
What if the document collection is dynamic? Auxiliary index: Store two indexes, a large main index and a small auxiliary index that stores new docs. The aux. index is kept in MM. Searches are run across both indexes and results are merged. Deletions are stored in a bit vector, we can then filter out deleted docs before returning the search results. When the aux. index is to large, we merge it with the main index.