04 Indexing Flashcards

Question 1

Q

Indexer

Answer

A

The machine that performs the index construction

Question 2

Q

Index construction techiques

Answer

A

With not enough main memory (MM), we need to use an external sorting algortihm, that is, one that uses disk. Must minimize the number of random disk seeks during sorting.

Blocked sort-based indexing
Single-pass in-memory indexing
Distributed indexing (MapReduce)
Dynamic indexing

Question 3

Q

Blocked sort-based indexing (BSBI)

Answer

A

Steps

Accumulate terms and docIDs into MM and ÒconvertÓ terms into termID until a fixed block size is reached.
Invert it (convert the block into an inverted index)
Store indermediate sorted results in disk
Merge all intermediate results into the final index

Question 4

Q

Single-pass in-memory indexing (SPIMI)

Answer

A

BSBI has excellent scaling properties, but needs a data structure for mapping terms to termIDs. For a very large collection, this data structure does not fit into memory. Therefore, use SPIMI

Steps: Best to look at the pseudo-code on page 67. Essentially, it adds terms to PLs. Because we do not know how large the PL will be when we create it, we allocate space for a short PL initially, and then double the space each time it is full. When there is no more free memory availiable, we write the index of the block to disk, after we sort the terms. Then, as in BSBI we merge blocks into the final index.

+ No sorting involved
+ Need not store termIDs

Question 5

Q

Distributed indexing

Answer

A

(MapReduce) When collections are to large to fit on a single machine (Web search). A mas- ter node directs the process of assigning and reassigning tasks to individuals worker nodes (machines).

Steps

The input data are split into n splits where the size of the split is chosen to ensure that the work can be distibuted evenly and efficiently.
Map phase: Consists of mapping splits of the input data to key- value pairs. (termID, docID). Each parser writes its output to a local intermediate file, the segment files.
Reduce phase: Collecting all values (here: docID) for a given key (termID), into one list is the task of the inverters in the reduce phase.
The list of values is sorted for each key and written to the final sorted postings lists

Remember, parsers and inverters are not seperate sets of machines. The master identifies idle machines and assigns tasks to them. This was the design of the Google indexing system in 2004, however it was much more complex.

Question 6

Q

Dynamic indexing

Answer

A

What if the document collection is dynamic? Auxiliary index: Store two indexes, a large main index and a small auxiliary index that stores new docs. The aux. index is kept in MM. Searches are run across both indexes and results are merged. Deletions are stored in a bit vector, we can then filter out deleted docs before returning the search results. When the aux. index is to large, we merge it with the main index.

04 Indexing Flashcards

(6 cards)