03 English Terms, Skip Pointers, Phrase Queries, Dictionaries Flashcards
How do skip pointers improve the performance?
- makes intersecting postings lists more efficient
- by skip postings that will not figure in the search results
Whate is the tradeoff of placing more and fewer skips?
Tradeoff:
number of items skipped vs. frequency skip can be taken
More: skips only a few items, but we can frequently use it
Fewer: skips many items, but we can not use it very often.
Where do we place skips?
Simple heuristic: for postings list of length P, use √P evenly-spaced skip pointers.
Is there a case where skip pointers are not helpful?
Easy if the index is static; harder in a dynamic environment because of updates.
How much do skip pointers help?
Memory/Computation tradeoff
How do we deal with phrase query?
- biword index (phrase index)
- positional index
How to extend inverted index into biword index?
Index every consecutive pair of terms
Friends, Romans, Countrymen:
“friends romans” and “romans countrymen”
How do we deal with phrase with more than two words?
“stanford university palo alto” can be
represented as the Boolean query
“stanford university” AND “university palo” AND “palo alto”
Why are biword indexes rarely used?
- False positives
- Index blowup due to very large term vocabulary
What are positional index?
Inverted index but each posting is a docID and a list of positions
however, it is more expensive than regular boolean queries
What is proximity search?
use the positional index to for phrase search
where we can find documents contain w1 and w2
within n words from each other
How do positional index work?
run the intersection algorithm twice, 1) docID 2) position
What are pos an con of proximity search?
Pos: important for dynamic summaries
Con: inefficeint for frequent words
In what case where biwords are efficient?
- when used with extremely frequent biword (e.g., Britney Spears)
What is combination scheme?
- a combination of biword indexes and positional indexes
- include frequent biwords as vocabulary terms in the index