03 Dictionaries and tolerant retrieval Flashcards
Wildcard query
Ex: a*e*i*o*u, which seeks documents containing any term that includes all the five vowels in sequence. * indicates any (possibly empty) string of characters.
Binary tree
A data structures for storing a dictionary. (Must be balanced).
B-tree
A search tree in which every internal node usually has between 2 and 4 children. In binary tree and b-tree, there has to be an internal ordering of words. Some Asian languague do not always have a unique ordering.
Search tree vs HashMap as data structure for the dictionary
Governed by 3 or 4 questions:
- How many keys (entrie in the vocabulary) are we likely to have
- Is this number likely to remain static or dynamic?
* If dynamic, will keys often only be deleted or added? - What are the relative frequencies with which various keys will be accessed.
Hashing
Each term is a key and hashed into an integer over a large enough space that hash collisions are unlikely.
Cons:
- Collisions are resolved by auxiliary strucutes that demand care to maintain.
- At query time, we hash each query term separately, and, following a pointer to the corresponging postings, taking into account any logic for resolving hash collision.
- We cannot seeks to find minor variants of terms, as these would be hased to a very different integer.
Trees
Overcome many of these issues. For instance, they permit us to enumerate all vocabulary terms beginning with automat. Best known tree: Binary tree
Trailing wildcard query
A query where the * is at the end of the query word. Ex: mon*. These queries are easy to handle with a tree structure. We walk down the tree following the symbols m, o, n in turn at which we can enumerate the set of terms in the dictionary with the prefix mon.
Leading wildcard query
Ex: *mon.
Consider a reverse B-tree, one in which each root-to-leaf path of the tree corresponds to a term in the dictionary written backwards. Thus the term lemon woud be represented by the path root n o m e l. As before, you walk down the reverse B-tree then enumerate all terms in the vocabulary.
Infix wildcard query
Ex: se*mon.
First, seach for se* with a non-empty suffix. Then, the reverse B-tree enumerates the set of terms ending with the suffix mon. Next, we take the intersection of the results to arrive at the set of terms that begin with the prefix se and end with the suffix mon. We scan and filter out any terms that match the prefix as well as the suffix because these two strings overlap.
Permuterm index
Technique for handling general wildcard queries. A form of II. $ indicates the end of a term. To construct a permuterm index, we link all the rotations of a term to the original term. Therefore hello$, ello$h, llo$he, lo$hel all link to hello
Permuterm vocabulary
Set of rotated terms in a permuterm index
k-Gram index
Technique for handling general wildcard queries. In a k-gram index, the dictionary contains all k-grams that occur in any term in the vocabulary. Each PL points from a k-gram to all vocabulary terms containing that k- gram. For instance, the 3-gram etr would point to terms such as metric and retrieval
Edit distance/Levenshtein distance
The edit distance between them is the minumum number of edit operations required to transform a string to a different string.
Implementing spelling correction
There are two basic principles underlying most spelling correction algorithms:
- Of various alternative correct spellings for a misspelled query, choose the “nearest” one. This demands that we have a notion of nearness or proximity between a pair of queries. (Edit distance)
- When to correctly spelled queries are tied, select the one that is more common. For instance, if grunt and grant both both seem equally plausible as correction for grnt. Then the algorithm should choose the more common of grunt and grant as the correction.
Jaccard coefficient
A forumla for measuring the overlap between to sets A and B.
|A snitt B| / |A union B|
The two sets we consider are the set of k-grams in the query q and the set of k-grams in a vocabulary term. As we proceed from one voacabulary term to next next, computing the Jaccard coefficient on the fly. If the coefficient exceeds a preset threshold, we add the term to the output.