04 Tolerant Retrieval & Spelling Correction Flashcards
What is Permuterm index?
a time-efficient and elegant solution to the string dictionary problem in which pattern queries may possibly include one wild-card symbol
hello$, ello$h, llo$he, lo$hel, o$hell, and $hello
What is a problem about Permuterm index?
space inefficient: because it quadruples the dictionary size
What is k-gram index?
the dictionary which contains sequence of k characters (k-gram) occuring in a term
“etr” : beetroot, metric, petrify
k-gram index vs. permuterm index
- k-gram index is more space efficient
- Permuterm index doesn’t require postfiltering
What are problems with processing wildcard queries in the term-document index?
- potentially execute a large number of Boolean queries
- if users use them a lot, the cost of answering query will increase
What are two methods for spelling correction ?
1. Isolated word: check each word on its own for misspelling
2. Contex-sensitive: look at surrounding words
How does isolated word spelling correction works?
- get correct spellings from a list of correct words (e.g. dictionaries)
- return the correct word that has the smallest distance to misspelled word
What is edit distance ?
edit distance is the minimum number of basic operations that convert string1 to string2
What are basic operations in Levenshtein distance ?
- insert
- delete
- replace
What are basic operations in Damerau-Levenshtein distance ?
- insert
- delete
- replace
- transposition
What is weighted edit distance ?
Weight of an operation depends on the characters involved
Application example: capture keyboard errors
How to use edit distance for spelling correction ?
- Compare to all terms, but stop after specific edit distance has been reached
- Generate all writing variations of query and compare to dictionary
- Add (some) writing variations of the dictionary to the dictionary
How does k-gram indexes for spelling correction works ?
- Use the k-gram index to retrieve “correct” words that match query term k-grams
- Threshold by number of matching k-grams
Bigrams: bo, or, rd, dr, ro, oo, om
What is an issue with k-gram for spelling correction ?
- measure of overlap is not normalized
- fixed number of treshold does not work for word of different length
What is Jaccard coefficient ?
- A commonly used measure of overlap of two sets
- A and B don’t have to be the same size