03 English Terms, Skip Pointers, Phrase Queries, Dictionaries Flashcards

Question 1

Q

How do skip pointers improve performance?

Answer

A

They make intersecting postings lists more efficient
by skipping postings that will not figure in the search results.

Question 2

Q

What is the tradeoff between placing more and fewer skips?

Answer

A

Tradeoff:
number of items skipped vs. how often a skip can be taken
More skips: each skip covers only a few items, but we can use skips frequently.
Fewer skips: each skip covers many items, but we cannot use skips very often.

Question 3

Q

Where do we place skips?

Answer

A

Simple heuristic: for postings list of length P, use √P evenly-spaced skip pointers.
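
A minimal sketch of the heuristic, assuming postings lists are sorted Python lists of distinct docIDs. The helper names `skip_targets` and `intersect_with_skips` are hypothetical, and the skip pointers are kept in a side dictionary rather than inside the list itself.

```python
import math

def skip_targets(postings):
    """Map index -> index reached by its skip pointer (about sqrt(P) evenly spaced skips)."""
    n = len(postings)
    step = math.isqrt(n)
    if step < 2:
        return {}  # list too short for skips to be useful
    return {i: i + step for i in range(0, n - step, step)}

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, taking a skip whenever it cannot overshoot."""
    s1, s2 = skip_targets(p1), skip_targets(p2)
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow skips on p1 while the skip target is still <= the current docID in p2.
            while i in s1 and p1[s1[i]] <= p2[j]:
                i = s1[i]
            if p1[i] < p2[j]:
                i += 1
        else:
            # Symmetric case: follow skips on p2.
            while j in s2 and p2[s2[j]] <= p1[i]:
                j = s2[j]
            if p2[j] < p1[i]:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 51, 60, 71]))  # [2, 8]
```

A skip is only taken when its target is still less than or equal to the docID on the other list, so no possible match is jumped over.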

Question 4

Q

Is there a case where skip pointers are not helpful?

Answer

A

Skip pointers are easy to build if the index is static; they are harder to maintain in a dynamic environment, because updates keep changing the postings lists.

Question 5

Q

How much do skip pointers help?

Answer

A

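It is a memory/computation tradeoff: skip pointers take extra space in the postings lists in exchange for fewer comparisons during intersection.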

Question 6

Q

How do we deal with phrase queries?

Answer

A

biword index (phrase index)
positional index

Question 7

Q

How do we extend an inverted index into a biword index?

Answer

A

Index every consecutive pair of terms

Friends, Romans, Countrymen:
“friends romans” and “romans countrymen”
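
A small sketch of how the biwords could be generated, assuming a simple tokenizer that lowercases the text and keeps runs of word characters. The function name `biwords` is hypothetical.

```python
import re

def biwords(text):
    """Return every consecutive pair of terms as a single vocabulary entry."""
    terms = re.findall(r"\w+", text.lower())
    return [f"{first} {second}" for first, second in zip(terms, terms[1:])]

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```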

Question 8

Q

How do we deal with phrases of more than two words?

Answer

A

“stanford university palo alto” can be
represented as the Boolean query

“stanford university” AND “university palo” AND “palo alto”
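
A rough sketch of that decomposition, assuming a biword index stored as a dictionary from biword to a set of docIDs. All names and the two sample documents are made up for illustration. Document 2 contains all three biwords but not the phrase itself, so the result is only a set of candidates.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def biwords(terms):
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

def build_biword_index(docs):
    """Map each biword to the set of docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(tokenize(text)):
            index[biword].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """AND together the postings of every biword in the phrase (two or more words)."""
    parts = biwords(tokenize(phrase))
    if not parts:
        return set()  # single-word queries are out of scope for this sketch
    result = set(index.get(parts[0], set()))
    for biword in parts[1:]:
        result &= index.get(biword, set())
    return result

docs = {
    1: "Stanford University Palo Alto campus map",
    2: "Stanford University press, the university Palo Verde office, Palo Alto news",
}
index = build_biword_index(docs)
print(phrase_candidates(index, "stanford university palo alto"))  # {1, 2}
```

Document 2 matches the Boolean query without containing the phrase, so the candidates still have to be checked against the documents.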

Question 9

Q

Why are biword indexes rarely used?

Answer

A

False positives
Index blowup due to very large term vocabulary

Question 10

Q

What is a positional index?

Answer

A

An inverted index in which each posting is a docID plus a list of the positions where the term occurs

However, queries on it are more expensive than regular Boolean queries
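
A minimal sketch of such a structure, assuming whitespace tokenization and positions counted from 0. The name `build_positional_index` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {docID -> sorted list of positions of the term in that document}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index

docs = {1: "to be or not to be", 2: "be quick"}
index = build_positional_index(docs)
print(dict(index["to"]))  # {1: [0, 4]}
print(dict(index["be"]))  # {1: [1, 5], 2: [0]}
```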

Question 11

Q

What is proximity search?

Answer

A

Use the positional index, as for phrase search,
to find documents that contain w1 and w2
within n words of each other

Question 12

Q

How does a positional index work?

Answer

A

Run the intersection algorithm at two levels: 1) on docIDs, 2) on positions within each matching document
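
A sketch of the two-level intersection for a proximity query, assuming the index is a dictionary of the form term -> {docID -> sorted positions}. The function names and the tiny index are hypothetical. Level 1 uses a set intersection on docIDs for brevity instead of a merge over sorted lists.

```python
def positions_within_k(pos1, pos2, k):
    """Level 2: pairs (p1, p2) from two sorted position lists with |p1 - p2| <= k."""
    hits = []
    start = 0
    for p1 in pos1:
        # Drop positions of the second term that are too far to the left of p1.
        while start < len(pos2) and pos2[start] < p1 - k:
            start += 1
        j = start
        while j < len(pos2) and pos2[j] <= p1 + k:
            hits.append((p1, pos2[j]))
            j += 1
    return hits

def proximity_search(index, w1, w2, k):
    """Level 1: intersect docIDs; level 2: intersect positions inside each shared document."""
    docs1, docs2 = index.get(w1, {}), index.get(w2, {})
    result = {}
    for doc_id in sorted(docs1.keys() & docs2.keys()):
        hits = positions_within_k(docs1[doc_id], docs2[doc_id], k)
        if hits:
            result[doc_id] = hits
    return result

index = {
    "to": {1: [0, 4], 2: [3]},
    "be": {1: [1, 5], 2: [10]},
}
print(proximity_search(index, "to", "be", 1))  # {1: [(0, 1), (4, 5)]}
```

Document 2 passes the docID level but fails the position level, which is why both levels are needed.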

Question 13

Q

What are the pros and cons of proximity search?

Answer

A

Pro: important for dynamic summaries
Con: inefficient for frequent words

Question 14

Q

In what case are biwords efficient?

Answer

A

when used for extremely frequent biwords (e.g., Britney Spears)

Question 15

Q

What is a combination scheme?

Answer

A

a combination of biword indexes and positional indexes
include frequent biwords as vocabulary terms in the index

Question 16

Q

Which data structure do we use to locate the entry (row)
in the array where q (a query term) is stored?

(assume that we store the term vocabulary in fixed-length entries)

Answer

A

hashes
trees

Question 17

Q

What are the pros and cons of hashes?

Answer

A

Pros:
* faster than a tree
* lookup time is constant (assuming no collisions)

Cons:
* cannot find variants (resume vs. résumé)
* lookup time can be higher, depending on the hash function
* no prefix search

Question 18

Q

What problem does a tree structure fix?

Answer

A

the prefix problem (finding all terms that begin with a given prefix)

but lookup is slightly slower than with hashes
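
A small sketch contrasting the two lookups. A Python `dict` plays the role of the hash, and a sorted list searched with `bisect` stands in for a search tree (an assumption made for brevity, since both keep terms in order). The vocabulary and the name `prefix_search` are hypothetical.

```python
from bisect import bisect_left

vocabulary = sorted(["auto", "automata", "automatic", "automobile", "banana", "resume"])

# Hash: exact-match lookup only, term -> row number in the vocabulary array.
term_to_row = {term: row for row, term in enumerate(vocabulary)}

def prefix_search(sorted_terms, prefix):
    """Ordered structure: all terms sharing a prefix sit next to each other."""
    matches = []
    i = bisect_left(sorted_terms, prefix)  # first term >= prefix
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches

print(term_to_row.get("automatic"))         # 2
print(term_to_row.get("autom"))             # None, the hash cannot answer prefix queries
print(prefix_search(vocabulary, "autom"))   # ['automata', 'automatic', 'automobile']
```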

Question 19

Q

What is case folding?

Answer

A

reduce all letters to lower case

often best to lowercase everything

Question 20

Q

What are stopwords?

Answer

A

extremely common words which would appear to be of little value in helping select documents matching a user need

a case where we need to keep stopwords: “King of Denmark”
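
A minimal sketch of case folding plus stopword removal, assuming whitespace tokenization. The tiny stop list and the name `normalize` are made up for illustration.

```python
STOPWORDS = {"a", "an", "and", "of", "the", "to"}  # hypothetical, far shorter than a real stop list

def normalize(text, drop_stopwords=True):
    """Lowercase every term, then optionally drop the stopwords."""
    terms = text.lower().split()
    if drop_stopwords:
        terms = [term for term in terms if term not in STOPWORDS]
    return terms

print(normalize("King of Denmark"))                        # ['king', 'denmark']
print(normalize("King of Denmark", drop_stopwords=False))  # ['king', 'of', 'denmark']
```

With the stopword removed, the phrase “King of Denmark” can no longer be matched exactly.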

Question 21

Q

What is lemmatization?

Answer

A

Reduce inflectional/variant forms to base form

Question 22

Q

What is stemming?

Answer

A

Crudely chops off
the ends of words
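
A toy illustration of the idea, not any of the published stemmers. The suffix list and the minimum stem length of 3 are arbitrary assumptions, and `crude_stem` is a hypothetical name.

```python
def crude_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Chop the first matching suffix, as long as at least 3 letters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["jumping", "jumped", "jumps", "operating", "sing"]:
    print(word, "->", crude_stem(word))
# jumping -> jump
# jumped -> jump
# jumps -> jump
# operating -> operat
# sing -> sing
```

The result need not be a real word ("operat"), which is one way stemming differs from lemmatization.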

Question 23

Q

Name 3 stemmers

Answer

A

Porter stemmer
Lovins stemmer
Paice stemmer

Question 24

Q

What is SoundEx?

Answer

A

a phonetic hashing scheme: the basis for finding phonetic
(as opposed to orthographic) alternatives
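
A sketch of a simplified Soundex variant: keep the first letter, map the remaining letters to digits, collapse runs of identical digits, drop the zeros, and pad to four characters. Standard Soundex adds further rules (for example around the first letter and around H and W) that are left out here. The name `soundex` is just a label for this sketch.

```python
CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        CODES[letter] = digit

def soundex(term):
    """Return a 4-character code: one letter followed by three digits."""
    letters = [ch for ch in term.upper() if ch in CODES]  # ignore anything outside A-Z
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]
    collapsed = []
    for digit in (CODES[ch] for ch in rest):
        if not collapsed or collapsed[-1] != digit:  # remove consecutive duplicates
            collapsed.append(digit)
    kept = [digit for digit in collapsed if digit != "0"]  # zeros stand for vowel-like letters
    return (first + "".join(kept) + "000")[:4]

print(soundex("Hermann"), soundex("Herman"))  # H655 H655
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

Names that sound alike map to the same code, so the codes can be indexed and matched like ordinary terms.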

(24 cards)