03 English Terms, Skip Pointers, Phrase Queries, Dictionaries Flashcards
How do skip pointers improve the performance?
- makes intersecting postings lists more efficient
- by skip postings that will not figure in the search results
Whate is the tradeoff of placing more and fewer skips?
Tradeoff:
number of items skipped vs. frequency skip can be taken
More: skips only a few items, but we can frequently use it
Fewer: skips many items, but we can not use it very often.
Where do we place skips?
Simple heuristic: for postings list of length P, use √P evenly-spaced skip pointers.
Is there a case where skip pointers are not helpful?
Easy if the index is static; harder in a dynamic environment because of updates.
How much do skip pointers help?
Memory/Computation tradeoff
How do we deal with phrase query?
- biword index (phrase index)
- positional index
How to extend inverted index into biword index?
Index every consecutive pair of terms
Friends, Romans, Countrymen:
“friends romans” and “romans countrymen”
How do we deal with phrase with more than two words?
“stanford university palo alto” can be
represented as the Boolean query
“stanford university” AND “university palo” AND “palo alto”
Why are biword indexes rarely used?
- False positives
- Index blowup due to very large term vocabulary
What are positional index?
Inverted index but each posting is a docID and a list of positions
however, it is more expensive than regular boolean queries
What is proximity search?
use the positional index to for phrase search
where we can find documents contain w1 and w2
within n words from each other
How do positional index work?
run the intersection algorithm twice, 1) docID 2) position
What are pos an con of proximity search?
Pos: important for dynamic summaries
Con: inefficeint for frequent words
In what case where biwords are efficient?
- when used with extremely frequent biword (e.g., Britney Spears)
What is combination scheme?
- a combination of biword indexes and positional indexes
- include frequent biwords as vocabulary terms in the index
Which data structure do we use to locate the entry (row)
in the array where q (a query term) is stored?
(assume that we store term vocabulary in fixed-length)
- hashes
- trees
What are pros and cons for hashes?
Pros:
* faster than tree
* looking up time is consistant (assuming no collision)
Con:
* can not find variants (resume vs. r´esum´e)
* lookup time could be higher depending on hash functions
* no prefix search
What problem does tree structure fix?
the prefix problem
but slightly slower than in hashes
What is casefolding?
reduce all letters to lower case
often best to lowercase everything
What are stopwords?
extremely common words which would appear to be of little value in helping select documents matching a user need
a case where we need stopwords : “King of Denmark”
What is lemmatization?
Reduce inflectional/variant forms to base form
What is stemming?
Crudely chops off
the ends of words
Name 3 stemmers
- Porter stemmer
- Lovins stemmer
- Paice stemmer
What is SoundEx?
the basis for finding phonetic
(as opposed to orthographic) alternatives