05 Index Construction Flashcards
Access to data is much faster in ____ ?
a) memory, b) disk
a) memory
faster in memory than on disk
What is meant by:
Disk seeks are “idle” time
No data is transferred from disk
while the disk head is being positioned
To optimize transfer time from disk to memory:
____ is faster. Why?
a) many small chunks, b) one large chunk
b) one large chunk
because disk I/O is block-based: entire blocks are read or written at a time,
so one large contiguous read avoids the per-chunk seek overhead
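As a rough illustration, a minimal Python sketch (with a hypothetical file name) comparing many small reads with one large read:

```python
import time

def read_many_small(path, chunk=512):
    """Read the whole file in many small chunks."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk):
            pass
    return time.perf_counter() - start

def read_one_large(path):
    """Read the whole file with a single large read."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        f.read()
    return time.perf_counter() - start

# "collection.bin" is a hypothetical file; on a cold cache the single large
# read is typically faster, because the OS transfers whole disk blocks either
# way and the many small reads add per-call and per-seek overhead.
# print(read_many_small("collection.bin"), read_one_large("collection.bin"))
```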
How big are block sizes?
8 KB to 256 KB
Which option is cheaper?
a) many regular machines
b) one fault tolerant machine
a) many regular machines
How does the external sorting algorithm work ?
BSBI : Blocked sort-based indexing
- segments the collection into parts of equal size (= a block)
- sorts the termID–docID pairs of each part in memory
- stores intermediate sorted results on disk
- merges all intermediate results into the final index
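A minimal BSBI-style sketch in Python; the block size, the toy termIDs, and keeping the sorted runs in memory (instead of writing them to disk) are simplifications for illustration:

```python
import heapq
from itertools import groupby

def bsbi_index(doc_stream, block_size=1_000_000):
    """Blocked sort-based indexing sketch.
    doc_stream yields (doc_id, [term_id, ...]) pairs; sorted runs are kept
    in memory here, whereas BSBI stores them on disk."""
    runs = []      # each run: one block's termID-docID pairs, sorted
    buffer = []
    for doc_id, term_ids in doc_stream:
        for term_id in term_ids:
            buffer.append((term_id, doc_id))
            if len(buffer) >= block_size:
                runs.append(sorted(buffer))   # sort one block in memory
                buffer = []
    if buffer:
        runs.append(sorted(buffer))

    # Merge all intermediate sorted runs into the final index.
    index = {}
    for term_id, group in groupby(heapq.merge(*runs), key=lambda p: p[0]):
        index[term_id] = sorted({doc_id for _, doc_id in group})
    return index

# Toy usage with made-up termIDs:
docs = [(1, [3, 1, 2]), (2, [2, 3]), (3, [1])]
print(bsbi_index(docs, block_size=3))
# {1: [1, 3], 2: [1, 2], 3: [1, 2]}
```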
Why do we need BSBI ?
BSBI : Blocked sort-based indexing
- large collections are too large to hold and sort all postings in memory
- sorting the postings directly on disk is too slow
so we need an external sorting algorithm
What is a problem with the sort-based algorithm ?
The dictionary is the bottleneck: the term-to-termID mapping must be kept in memory and grows with the collection
What is SPIMI ?
SPIMI: Single-pass in-memory indexing
- a scalable alternative to BSBI
- uses terms instead of termIDs
- writes each block’s dictionary to disk, and then starts a new dictionary for the next block
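A minimal SPIMI-style sketch in Python; here each finished block is returned as an in-memory dictionary, whereas the real algorithm writes each block's sorted dictionary and postings to disk and merges the blocks afterwards:

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings=1_000_000):
    """Single-pass in-memory indexing sketch.
    token_stream yields (term, doc_id) pairs; terms are used directly,
    so no global term-to-termID mapping is needed."""
    blocks = []                      # stand-ins for on-disk block files
    dictionary = defaultdict(list)   # current block: term -> postings list
    postings_count = 0
    for term, doc_id in token_stream:
        postings = dictionary[term]           # new term gets a fresh postings list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            postings_count += 1
        if postings_count >= max_postings:    # block full: "write" it, start anew
            blocks.append({t: p for t, p in sorted(dictionary.items())})
            dictionary = defaultdict(list)
            postings_count = 0
    if dictionary:
        blocks.append({t: p for t, p in sorted(dictionary.items())})
    return blocks   # a final merge of the blocks would yield the full index

# Toy usage:
tokens = [("new", 1), ("york", 1), ("new", 2), ("home", 2)]
print(spimi_invert(tokens, max_postings=3))
# [{'new': [1, 2], 'york': [1]}, {'home': [2]}]
```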
In which case do we need Distributed indexing ?
Collections are often so large that we cannot perform
index construction efficiently on a single machine.
Distributed indexing exploits a large cluster of fault-prone commodity machines
How does Distributed indexing work ?
- a master machine directs the indexing job
- indexing is broken up into sets of parallel tasks
- the master assigns each task to an idle machine
e.g. MapReduce
What are two parallel tasks that two types of
machines have to solve?
1. Parsers: read documents and split their (term, docID) pairs into j term-partitions
2. Inverters: collect the pairs of one term-partition, sort them, and write its postings lists
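A single-process sketch of the two phases in Python; the hash-based partition function and j = 2 partitions are illustrative assumptions (term ranges such as a-f, g-p, q-z are also common), and in a real MapReduce run parsers and inverters are separate machines:

```python
from collections import defaultdict

J = 2  # number of term-partitions (assumed here)

def parse(doc_id, text):
    """Parser (map phase): emit (term, doc_id) pairs into j term-partitions."""
    partitions = defaultdict(list)
    for term in text.lower().split():
        partitions[hash(term) % J].append((term, doc_id))
    return partitions

def invert(pairs):
    """Inverter (reduce phase): sort one term-partition, build postings lists."""
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)
    return dict(index)

# Toy run: collect parser output per partition, then invert each partition.
segment_files = defaultdict(list)
for doc_id, text in [(1, "caesar came"), (2, "caesar conquered")]:
    for j, pairs in parse(doc_id, text).items():
        segment_files[j].extend(pairs)
final_index = {j: invert(pairs) for j, pairs in segment_files.items()}
print(final_index)
```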
Why do we need Dynamic indexing ?
Because documents are dynamic:
they are inserted, deleted, and modified
What is the naive approach for dynamic indexing?
rebuild the index from scratch from time to time
What is the simplest approach for dynamic indexing ?
- maintain big main index on disk
- new docs go into small auxiliary index in memory
- search across both, merge results
can cause poor search performance during index merge
Logarithmic merge is better: it maintains a series of indexes of sizes n, 2n, 4n, … and only merges indexes of the same size, which reduces the total merging cost
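A minimal sketch of the main-plus-auxiliary scheme in Python; plain dictionaries stand in for the on-disk main index and the in-memory auxiliary index, the merge threshold is an assumption, and deletions are omitted:

```python
class DynamicIndex:
    """Big main index ('on disk') plus a small in-memory auxiliary index."""

    def __init__(self, aux_limit=2):
        self.main = {}        # term -> postings list (stand-in for the on-disk index)
        self.aux = {}         # term -> postings list for newly added docs
        self.aux_limit = aux_limit

    def add(self, doc_id, terms):
        # New docs go into the small auxiliary index in memory.
        for term in terms:
            self.aux.setdefault(term, []).append(doc_id)
        if len(self.aux) >= self.aux_limit:
            self._merge()     # the expensive step that logarithmic merge amortizes

    def _merge(self):
        # Merge the auxiliary index into the main index, then start a new one.
        for term, postings in self.aux.items():
            self.main.setdefault(term, []).extend(postings)
        self.aux = {}

    def search(self, term):
        # Search across both indexes and merge the results.
        return sorted(set(self.main.get(term, []) + self.aux.get(term, [])))

# Toy usage:
idx = DynamicIndex()
idx.add(1, ["brutus", "caesar"])
idx.add(2, ["caesar"])
print(idx.search("caesar"))   # [1, 2]
```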