05 Index Construction Flashcards

1
Q

Access to data is much faster in …. ?

a) memory, b) disk

A

a) memory

faster in memory than on disk

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What does it mean by:
Disk seeks are “idle” time

A

No data is transferred from disk
while the disk head is being positioned

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

To optimize transfer time from disk to memory:
____ is faster. Why?

a) many small chunks, b) one large chunk

A

b) one large chunk

because Disk I/O is block-based

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

How big are block sizes?

A

8KB to 256 KB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Which option is cheaper?
a) many regular machines
b) one fault tolerant machine

A

a) many regular machines

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How does External sorting algorithm works?

BSBI : Blocked sort-based indexing

A
  1. segments the collection into parts of equal size (= a block)
  2. sorts the termID–docID pairs of each part in memory
  3. stores intermediate sorted results on disk
  4. merges all intermediate results into the final index
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Why do we need BSBI ?

BSBI : Blocked sort-based indexing

A
  • large collections are too large to store all postings in memory and sort
  • sorting on disk is too slow

we need external sorting algorithm

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is a problem with sort-based algorithm ?

A

Bottleneck

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What i s SPIMI ?

SPIMI: Single-pass in-memory indexing

A
  • a scalable alternative of BSBI
  • uses terms instead of termIDs
  • writes each block’s dictionary to disk, and then starts a new dictionary for the next block
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

In which case do we need Distributed indexing ?

A

Collections are often so large that we cannot perform
index construction efficiently on a single machine.

Distribute indexing exploit a pool of fault-prone computer cluster

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

How does Distribute indexing work ?

A
  1. master machine directs the indexing job
  2. break up indexing into sets of pararell tasks
  3. master assign task to an idle machine

e.g. MapReduce

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What are two parallel tasks that two types of
machines have to solve?

A

1. Parsers: split documents into different patitions (j term-partitions)
2. Inverters: sort and write to posting list for one term-partition

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Why do we need Dynamic indexing ?

A

Becasue documents are dynamic:
they are inserted, deleted, and modified

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is the naive approach for dynamic indexing?

A

keep rebuilding index from scratch from time to time

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is the simplest approach for dynamic indexing ?

A
  • maintain big main index on disk
  • new docs go into small auxiliary index in memory
  • search across both, merge results

can cause poor search performance during index merge

Logarithmic Merge is better

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is Lucene ?

A
  • Open source Java library for indexing and searching
  • Lets you add search to your application