03 English Terms, Skip Pointers, Phrase Queries, Dictionaries Flashcards

1
Q

How do skip pointers improve the performance?

A
  • makes intersecting postings lists more efficient
  • by skip postings that will not figure in the search results
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Whate is the tradeoff of placing more and fewer skips?

A

Tradeoff:
number of items skipped vs. frequency skip can be taken
More: skips only a few items, but we can frequently use it
Fewer: skips many items, but we can not use it very often.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Where do we place skips?

A

Simple heuristic: for postings list of length P, use √P evenly-spaced skip pointers.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Is there a case where skip pointers are not helpful?

A

Easy if the index is static; harder in a dynamic environment because of updates.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How much do skip pointers help?

A

Memory/Computation tradeoff

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How do we deal with phrase query?

A
  • biword index (phrase index)
  • positional index
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

How to extend inverted index into biword index?

A

Index every consecutive pair of terms

Friends, Romans, Countrymen:
“friends romans” and “romans countrymen”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

How do we deal with phrase with more than two words?

A

“stanford university palo alto” can be
represented as the Boolean query

“stanford university” AND “university palo” AND “palo alto”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Why are biword indexes rarely used?

A
  • False positives
  • Index blowup due to very large term vocabulary
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What are positional index?

A

Inverted index but each posting is a docID and a list of positions

however, it is more expensive than regular boolean queries

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is proximity search?

A

use the positional index to for phrase search
where we can find documents contain w1 and w2
within n words from each other

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

How do positional index work?

A

run the intersection algorithm twice, 1) docID 2) position

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are pos an con of proximity search?

A

Pos: important for dynamic summaries
Con: inefficeint for frequent words

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

In what case where biwords are efficient?

A
  • when used with extremely frequent biword (e.g., Britney Spears)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is combination scheme?

A
  • a combination of biword indexes and positional indexes
  • include frequent biwords as vocabulary terms in the index
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Which data structure do we use to locate the entry (row)
in the array where q (a query term) is stored?

(assume that we store term vocabulary in fixed-length)

A
  • hashes
  • trees
17
Q

What are pros and cons for hashes?

A

Pros:
* faster than tree
* looking up time is consistant (assuming no collision)

Con:
* can not find variants (resume vs. r´esum´e)
* lookup time could be higher depending on hash functions
* no prefix search

18
Q

What problem does tree structure fix?

A

the prefix problem

but slightly slower than in hashes

19
Q

What is casefolding?

A

reduce all letters to lower case

often best to lowercase everything

20
Q

What are stopwords?

A

extremely common words which would appear to be of little value in helping select documents matching a user need

a case where we need stopwords : “King of Denmark”

21
Q

What is lemmatization?

A

Reduce inflectional/variant forms to base form

22
Q

What is stemming?

A

Crudely chops off
the ends of words

23
Q

Name 3 stemmers

A
  1. Porter stemmer
  2. Lovins stemmer
  3. Paice stemmer
24
Q

What is SoundEx?

A

the basis for finding phonetic
(as opposed to orthographic) alternatives