05 Index Construction Flashcards
Access to data is much faster in ____ ?
a) memory, b) disk
a) memory
faster in memory than on disk
What is meant by:
Disk seeks are “idle” time
No data is transferred from disk
while the disk head is being positioned
To optimize transfer time from disk to memory:
____ is faster. Why?
a) many small chunks, b) one large chunk
b) one large chunk
because disk I/O is block-based: entire blocks are read or written at a time,
so one large contiguous read avoids the per-chunk seek overhead
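As a rough illustration, a minimal Python sketch (with a hypothetical file name) comparing many small reads with one large read:

```python
import time

def read_many_small(path, chunk=512):
    """Read the whole file in many small chunks."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk):
            pass
    return time.perf_counter() - start

def read_one_large(path):
    """Read the whole file with a single large read."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        f.read()
    return time.perf_counter() - start

# "collection.bin" is a hypothetical file; on a cold cache the single large
# read is typically faster, because the OS transfers whole disk blocks either
# way and the many small reads add per-call and per-seek overhead.
# print(read_many_small("collection.bin"), read_one_large("collection.bin"))
```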
How big are block sizes?
8 KB to 256 KB
Which option is cheaper?
a) many regular machines
b) one fault tolerant machine
a) many regular machines
How does the external sorting algorithm work ?
BSBI : Blocked sort-based indexing
- segments the collection into parts of equal size (= a block)
- sorts the termID–docID pairs of each part in memory
- stores intermediate sorted results on disk
- merges all intermediate results into the final index
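A minimal BSBI-style sketch in Python; the block size, the toy termIDs, and keeping the sorted runs in memory (instead of writing them to disk) are simplifications for illustration:

```python
import heapq
from itertools import groupby

def bsbi_index(doc_stream, block_size=1_000_000):
    """Blocked sort-based indexing sketch.
    doc_stream yields (doc_id, [term_id, ...]) pairs; sorted runs are kept
    in memory here, whereas BSBI stores them on disk."""
    runs = []      # each run: one block's termID-docID pairs, sorted
    buffer = []
    for doc_id, term_ids in doc_stream:
        for term_id in term_ids:
            buffer.append((term_id, doc_id))
            if len(buffer) >= block_size:
                runs.append(sorted(buffer))   # sort one block in memory
                buffer = []
    if buffer:
        runs.append(sorted(buffer))

    # Merge all intermediate sorted runs into the final index.
    index = {}
    for term_id, group in groupby(heapq.merge(*runs), key=lambda p: p[0]):
        index[term_id] = sorted({doc_id for _, doc_id in group})
    return index

# Toy usage with made-up termIDs:
docs = [(1, [3, 1, 2]), (2, [2, 3]), (3, [1])]
print(bsbi_index(docs, block_size=3))
# {1: [1, 3], 2: [1, 2], 3: [1, 2]}
```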
Why do we need BSBI ?
BSBI : Blocked sort-based indexing
- large collections are too large to hold and sort all postings in memory
- sorting the postings directly on disk is too slow
so we need an external sorting algorithm
What is a problem with the sort-based algorithm ?
The dictionary is the bottleneck: the term-to-termID mapping must be kept in memory and grows with the collection
What is SPIMI ?
SPIMI: Single-pass in-memory indexing
- a scalable alternative to BSBI
- uses terms instead of termIDs
- writes each block’s dictionary to disk, and then starts a new dictionary for the next block
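A minimal SPIMI-style sketch in Python; here each finished block is returned as an in-memory dictionary, whereas the real algorithm writes each block's sorted dictionary and postings to disk and merges the blocks afterwards:

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings=1_000_000):
    """Single-pass in-memory indexing sketch.
    token_stream yields (term, doc_id) pairs; terms are used directly,
    so no global term-to-termID mapping is needed."""
    blocks = []                      # stand-ins for on-disk block files
    dictionary = defaultdict(list)   # current block: term -> postings list
    postings_count = 0
    for term, doc_id in token_stream:
        postings = dictionary[term]           # new term gets a fresh postings list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            postings_count += 1
        if postings_count >= max_postings:    # block full: "write" it, start anew
            blocks.append({t: p for t, p in sorted(dictionary.items())})
            dictionary = defaultdict(list)
            postings_count = 0
    if dictionary:
        blocks.append({t: p for t, p in sorted(dictionary.items())})
    return blocks   # a final merge of the blocks would yield the full index

# Toy usage:
tokens = [("new", 1), ("york", 1), ("new", 2), ("home", 2)]
print(spimi_invert(tokens, max_postings=3))
# [{'new': [1, 2], 'york': [1]}, {'home': [2]}]
```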
In which case do we need Distributed indexing ?
Collections are often so large that we cannot perform
index construction efficiently on a single machine.
Distributed indexing exploits a large cluster of fault-prone commodity machines
How does Distributed indexing work ?
- a master machine directs the indexing job
- indexing is broken up into sets of parallel tasks
- the master assigns each task to an idle machine
e.g. MapReduce
What are two parallel tasks that two types of
machines have to solve?
1. Parsers: read documents and split their (term, docID) pairs into j term-partitions
2. Inverters: collect the pairs of one term-partition, sort them, and write its postings lists
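A single-process sketch of the two phases in Python; the hash-based partition function and j = 2 partitions are illustrative assumptions (term ranges such as a-f, g-p, q-z are also common), and in a real MapReduce run parsers and inverters are separate machines:

```python
from collections import defaultdict

J = 2  # number of term-partitions (assumed here)

def parse(doc_id, text):
    """Parser (map phase): emit (term, doc_id) pairs into j term-partitions."""
    partitions = defaultdict(list)
    for term in text.lower().split():
        partitions[hash(term) % J].append((term, doc_id))
    return partitions

def invert(pairs):
    """Inverter (reduce phase): sort one term-partition, build postings lists."""
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)
    return dict(index)

# Toy run: collect parser output per partition, then invert each partition.
segment_files = defaultdict(list)
for doc_id, text in [(1, "caesar came"), (2, "caesar conquered")]:
    for j, pairs in parse(doc_id, text).items():
        segment_files[j].extend(pairs)
final_index = {j: invert(pairs) for j, pairs in segment_files.items()}
print(final_index)
```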
Why do we need Dynamic indexing ?
Because documents are dynamic:
they are inserted, deleted, and modified
What is the naive approach for dynamic indexing?
rebuild the index from scratch from time to time
What is the simplest approach for dynamic indexing ?
- maintain big main index on disk
- new docs go into small auxiliary index in memory
- search across both, merge results
can cause poor search performance during index merge
Logarithmic merge is better: it maintains a series of indexes of sizes n, 2n, 4n, … and only merges indexes of the same size, which reduces the total merging cost
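A minimal sketch of the main-plus-auxiliary scheme in Python; plain dictionaries stand in for the on-disk main index and the in-memory auxiliary index, the merge threshold is an assumption, and deletions are omitted:

```python
class DynamicIndex:
    """Big main index ('on disk') plus a small in-memory auxiliary index."""

    def __init__(self, aux_limit=2):
        self.main = {}        # term -> postings list (stand-in for the on-disk index)
        self.aux = {}         # term -> postings list for newly added docs
        self.aux_limit = aux_limit

    def add(self, doc_id, terms):
        # New docs go into the small auxiliary index in memory.
        for term in terms:
            self.aux.setdefault(term, []).append(doc_id)
        if len(self.aux) >= self.aux_limit:
            self._merge()     # the expensive step that logarithmic merge amortizes

    def _merge(self):
        # Merge the auxiliary index into the main index, then start a new one.
        for term, postings in self.aux.items():
            self.main.setdefault(term, []).extend(postings)
        self.aux = {}

    def search(self, term):
        # Search across both indexes and merge the results.
        return sorted(set(self.main.get(term, []) + self.aux.get(term, [])))

# Toy usage:
idx = DynamicIndex()
idx.add(1, ["brutus", "caesar"])
idx.add(2, ["caesar"])
print(idx.search("caesar"))   # [1, 2]
```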