03 Dictionaries and tolerant retrieval Flashcards

1
Q

Wildcard query

A

Ex: a*e*i*o*u, which seeks documents containing any term that includes all the five vowels in sequence. * indicates any (possibly empty) string of characters.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Binary tree

A

A data structures for storing a dictionary. (Must be balanced).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

B-tree

A

A search tree in which every internal node usually has between 2 and 4 children. In binary tree and b-tree, there has to be an internal ordering of words. Some Asian languague do not always have a unique ordering.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Search tree vs HashMap as data structure for the dictionary

A

Governed by 3 or 4 questions:

  1. How many keys (entrie in the vocabulary) are we likely to have
  2. Is this number likely to remain static or dynamic?
    * If dynamic, will keys often only be deleted or added?
  3. What are the relative frequencies with which various keys will be accessed.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Hashing

A

Each term is a key and hashed into an integer over a large enough space that hash collisions are unlikely.

Cons:

  • Collisions are resolved by auxiliary strucutes that demand care to maintain.
  • At query time, we hash each query term separately, and, following a pointer to the corresponging postings, taking into account any logic for resolving hash collision.
  • We cannot seeks to find minor variants of terms, as these would be hased to a very different integer.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Trees

A

Overcome many of these issues. For instance, they permit us to enumerate all vocabulary terms beginning with automat. Best known tree: Binary tree

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Trailing wildcard query

A

A query where the * is at the end of the query word. Ex: mon*. These queries are easy to handle with a tree structure. We walk down the tree following the symbols m, o, n in turn at which we can enumerate the set of terms in the dictionary with the prefix mon.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Leading wildcard query

A

Ex: *mon.

Consider a reverse B-tree, one in which each root-to-leaf path of the tree corresponds to a term in the dictionary written backwards. Thus the term lemon woud be represented by the path root n o m e l. As before, you walk down the reverse B-tree then enumerate all terms in the vocabulary.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Infix wildcard query

A

Ex: se*mon.

First, seach for se* with a non-empty suffix. Then, the reverse B-tree enumerates the set of terms ending with the suffix mon. Next, we take the intersection of the results to arrive at the set of terms that begin with the prefix se and end with the suffix mon. We scan and filter out any terms that match the prefix as well as the suffix because these two strings overlap.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Permuterm index

A

Technique for handling general wildcard queries. A form of II. $ indicates the end of a term. To construct a permuterm index, we link all the rotations of a term to the original term. Therefore hello$, ello$h, llo$he, lo$hel all link to hello

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Permuterm vocabulary

A

Set of rotated terms in a permuterm index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

k-Gram index

A

Technique for handling general wildcard queries. In a k-gram index, the dictionary contains all k-grams that occur in any term in the vocabulary. Each PL points from a k-gram to all vocabulary terms containing that k- gram. For instance, the 3-gram etr would point to terms such as metric and retrieval

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Edit distance/Levenshtein distance

A

The edit distance between them is the minumum number of edit operations required to transform a string to a different string.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Implementing spelling correction

A

There are two basic principles underlying most spelling correction algorithms:

  1. Of various alternative correct spellings for a misspelled query, choose the “nearest” one. This demands that we have a notion of nearness or proximity between a pair of queries. (Edit distance)
  2. When to correctly spelled queries are tied, select the one that is more common. For instance, if grunt and grant both both seem equally plausible as correction for grnt. Then the algorithm should choose the more common of grunt and grant as the correction.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Jaccard coefficient

A

A forumla for measuring the overlap between to sets A and B.

|A snitt B| / |A union B|

The two sets we consider are the set of k-grams in the query q and the set of k-grams in a vocabulary term. As we proceed from one voacabulary term to next next, computing the Jaccard coefficient on the fly. If the coefficient exceeds a preset threshold, we add the term to the output.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Tries

A

Is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string.

17
Q

Permuterm index: Query: m*n.

A

Rotate it so that the * appears at the end of the string. Therefore: n$m*. Next, we look up this string in the permuterm index, where seeking n$m* (via a search tree) leads to rotations of (among others) the terms man and moron. Now that the permuterm index enables us to indentify the original vocabulary terms matching a wildcary query, we look up these terms in the standard II to retrieve matching docs.

18
Q

Permuterm index: Query:: fi*mo*er.

A

First enrumerate the terms in the dictionary that are in the permuterm index of er$fi*. Not all terms have mo in the middle so we do an exhaustive search to filter these out. We then run the surviving terms through the standard II for doc retrieval.

19
Q

k-gram index

Ex. query: re*ve.

A

Run the Boolean query reANDve. This is looked up in the 3-gram index and yields a list of matching terms such as relive, remove and retreive. Each of these mathing terms is then looked up in the standard II index to yield documents matching the query.

20
Q

k-gram index

Ex Difficulty: Consider using the 3-gram index described for the query red*.

A

We first issue the boolean query $re AND RED to the 3-gram index. This leads to match terms such as retired, which is not a match. To cope with this, we introduce a postfiletring step, in which the terms are checked individually against the original query.

21
Q

Tries: Pros

A
  1. Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
  2. There are no collisions of different keys in a trie.
  3. Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
  4. There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
  5. A trie can provide an alphabetical ordering of the entries by key.
22
Q

Tries: Cons

A
  1. Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory
  2. Some keys, such as floating point numbers, can lead to long chains and prefixes that are not particularly meaningful. Nevertheless a bitwise trie can handle standard IEEE single and double format floating point numbers.
    ​3. Some tries can require more space than a hash table, as memory may be allocated for each character in the search string, rather than a single chunk of memory for the whole entry, as in most hash tables.