04 Tolerant Retrieval & Spelling Correction Flashcards

1
Q

What is Permuterm index?

A

a time-efficient and elegant solution to the string dictionary problem in which pattern queries may possibly include one wild-card symbol

hello$, ello$h, llo$he, lo$hel, o$hell, and $hello

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is a problem about Permuterm index?

A

space inefficient: because it quadruples the dictionary size

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is k-gram index?

A

the dictionary which contains sequence of k characters (k-gram) occuring in a term

“etr” : beetroot, metric, petrify

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

k-gram index vs. permuterm index

A
  • k-gram index is more space efficient
  • Permuterm index doesn’t require postfiltering
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What are problems with processing wildcard queries in the term-document index?

A
  1. potentially execute a large number of Boolean queries
  2. if users use them a lot, the cost of answering query will increase
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are two methods for spelling correction ?

A

1. Isolated word: check each word on its own for misspelling
2. Contex-sensitive: look at surrounding words

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

How does isolated word spelling correction works?

A
  1. get correct spellings from a list of correct words (e.g. dictionaries)
  2. return the correct word that has the smallest distance to misspelled word
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is edit distance ?

A

edit distance is the minimum number of basic operations that convert string1 to string2

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What are basic operations in Levenshtein distance ?

A
  1. insert
  2. delete
  3. replace
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What are basic operations in Damerau-Levenshtein distance ?

A
  1. insert
  2. delete
  3. replace
  4. transposition
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is weighted edit distance ?

A

Weight of an operation depends on the characters involved

Application example: capture keyboard errors

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

How to use edit distance for spelling correction ?

A
  1. Compare to all terms, but stop after specific edit distance has been reached
  2. Generate all writing variations of query and compare to dictionary
  3. Add (some) writing variations of the dictionary to the dictionary
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

How does k-gram indexes for spelling correction works ?

A
  • Use the k-gram index to retrieve “correct” words that match query term k-grams
  • Threshold by number of matching k-grams

Bigrams: bo, or, rd, dr, ro, oo, om

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is an issue with k-gram for spelling correction ?

A
  • measure of overlap is not normalized
  • fixed number of treshold does not work for word of different length
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is Jaccard coefficient ?

A
  • A commonly used measure of overlap of two sets
  • A and B don’t have to be the same size
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

How does context-sensitive spelling correction work ?

A
  • list all possible phrases as queries
  • the one with most hits = “correct”

not very efficient