07 Computing scores in a complete search system Flashcards

1
Q

Computing scores in a complete search system

A

Given the cosine similarity from each doc vector V(d) to the query vector V(q), we want to find the K highest-scoring docs and present them to the user. This chapter considers schemes by which we produce K docs that are likely to be among the K highest-scoring docs for a given query, while lowering the cost of computing them. The following procedures share this two-step scheme (step 2 is sketched in code after the list):

  1. Find a set A of docs that are contenders, where K < |A| << N. A does not necessarily contain the K top-scoring docs for the query, but is likely to contain many docs with scores near those of the top K.
  2. Return the K top-scoring docs in A.
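
A minimal sketch of step 2, assuming a candidate set A produced by one of the pruning schemes below and a cosine_score(query, d) callable (both illustrative names, not from the text):

```python
import heapq

def top_k_from_candidates(query, A, K, cosine_score):
    """Step 2: score only the contenders in A and return the K best."""
    # Scoring |A| docs instead of all N docs is the point of step 1.
    scored = ((cosine_score(query, d), d) for d in A)
    # heapq.nlargest keeps a K-sized heap: O(|A| log K) time.
    return heapq.nlargest(K, scored)
```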
2
Q

Index elimination

A
  1. We only consider docs containing terms whose idf exceeds a preset threshold. Thus, in the postings traversal, we only traverse the postings for terms with high idf. Additional benefit: the postings lists of low-idf terms are generally long, so skipping them saves the most work.
  2. We only consider docs that contain many (say, at least three out of four) of the query terms. Both heuristics are sketched after this list.
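
A minimal sketch of both heuristics, assuming postings maps each term to a list of doc IDs and idf maps each term to its idf value (illustrative names):

```python
def candidate_docs(query_terms, postings, idf, idf_threshold, min_match):
    """Index elimination: (1) traverse only high-idf terms' postings;
    (2) keep only docs containing at least min_match of those terms."""
    # (1) Drop low-idf terms (frequent words with long postings lists).
    kept = [t for t in query_terms if idf.get(t, 0.0) > idf_threshold]
    counts = {}
    for t in kept:
        for d in postings.get(t, ()):
            counts[d] = counts.get(d, 0) + 1
    # (2) Require the doc to contain many of the query terms.
    return {d for d, c in counts.items() if c >= min_match}
```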
3
Q

Champion lists/fancy lists/top docs

A

Precompute, for each term t in the dictionary, the set of the r docs with the highest weights for t; the value of r is chosen in advance. We call this set of r docs the champion list for term t. Given a query q, we create a set A by taking the union of the champion lists for each of the terms comprising q, and we restrict the cosine computation to the docs in A. It is important to choose a good value for r; note that r is fixed at index-build time, whereas K is query dependent, so A may end up with fewer than K docs. A sketch follows.
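
A minimal sketch, assuming postings_weights maps each term to (doc_id, weight) pairs (an illustrative name):

```python
def build_champion_lists(postings_weights, r):
    """Precompute, for each term, the r docs with the highest weight."""
    champions = {}
    for term, doc_weights in postings_weights.items():
        best = sorted(doc_weights, key=lambda dw: dw[1], reverse=True)[:r]
        champions[term] = [d for d, _ in best]
    return champions

def candidate_set(query_terms, champions):
    """At query time, A is the union of the query terms' champion lists."""
    A = set()
    for t in query_terms:
        A.update(champions.get(t, ()))
    return A
```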

4
Q

Static quality scores and ordering

A

Further development of champion lists. Several ideas:

  1. Incorporate a static quality score g(d) that is query independent. The score for a doc can then be some combination of the quality score and a tf-idf score; the precise combination may be determined by machine learning.
  2. One can also order the postings by static quality score. The normal PL intersection algorithm still works, as long as there is a single common ordering across all PLs.
  3. One can further maintain, for each term t, a global champion list of the r docs with the highest values of g(d) + tf-idf. At query time, we then compute net scores only for docs in the union of these global champion lists.
  4. We maintain for each term t two PLs consisting of disjoint sets of docs, each sorted by static quality score. The first list, called high, contains the m docs with the highest tf values for t. The second list, called low, contains all other docs containing t. When we process a query, we first scan only the high lists; if we do not obtain K docs, we continue by scanning the low lists. A sketch of this idea follows the list.
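
A minimal sketch of idea 4, assuming high and low map each term to doc-ID lists and score is the full net-score callable (illustrative names):

```python
def query_high_low(query_terms, high, low, K, score):
    """Scan the high lists first; fall back to the low lists only
    if they yield fewer than K contenders."""
    A = set()
    for t in query_terms:
        A.update(high.get(t, ()))
    if len(A) < K:                       # not enough contenders yet
        for t in query_terms:
            A.update(low.get(t, ()))
    # Rank the contenders with the full scoring function.
    return sorted(A, key=score, reverse=True)[:K]
```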
5
Q

Impact ordering

A

With impact ordering, PLs are no longer sorted by a single common ordering. Instead, each PL is sorted in decreasing order of tf. Two ideas (idea 1 is sketched after the list):

  1. When traversing a PL, we stop after considering only a prefix of it: either after a fixed number of docs, or once tf drops below some threshold.
  2. When accumulating scores in the outer loop, we consider the query terms in decreasing order of idf. If the idf of the next term is very low, so that the changes to the accumulated scores would be minimal, we simply terminate the accumulation.
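
A minimal sketch of idea 1, assuming postings_by_tf is a list of (doc_id, tf) pairs already sorted by decreasing tf (an illustrative name):

```python
def traverse_prefix(postings_by_tf, max_docs, tf_threshold):
    """Walk a tf-sorted PL and stop early: after max_docs entries,
    or once tf falls below tf_threshold."""
    prefix = []
    for i, (doc_id, tf) in enumerate(postings_by_tf):
        if i >= max_docs or tf < tf_threshold:
            break                        # early termination
        prefix.append((doc_id, tf))
    return prefix
```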
6
Q

Cluster pruning

A

Incorporate a preprocessing step during which we cluster the doc vectors: pick about sqrt(N) docs at random as leaders, and attach each remaining doc (a follower) to its nearest leader. At query time, we find the leader closest to the query and compute cosine scores only for docs in a small number of clusters, typically just that leader's. Using randomly chosen leaders is fast and is likely to reflect the distribution of the doc vectors in the vector space. A sketch follows.
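
A minimal sketch, assuming docs is a list of hashable doc IDs and similarity(a, b) is a cosine-similarity callable over them (both illustrative):

```python
import math
import random

def build_clusters(docs, similarity):
    """Preprocessing: pick ~sqrt(N) random leaders; each doc joins
    the cluster of its most similar leader."""
    leaders = random.sample(docs, max(1, int(math.sqrt(len(docs)))))
    clusters = {l: [] for l in leaders}
    for d in docs:
        nearest = max(leaders, key=lambda l: similarity(d, l))
        clusters[nearest].append(d)
    return leaders, clusters

def prune(query, leaders, clusters, similarity):
    """Query time: restrict scoring to the nearest leader's cluster."""
    nearest = max(leaders, key=lambda l: similarity(query, l))
    return clusters[nearest]
```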

7
Q

Tiered indexes

A

What if the set A of contenders contains fewer than K docs? One can use tiered indexes: for example, tier 1 might contain only postings entries with tf above some threshold. If we fail to get K results from tier 1, query processing falls back to tier 2, and so on, as sketched below.
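
A minimal sketch, assuming tiers is a list of indexes ordered from highest to lowest quality and retrieve(query, index) returns matching doc IDs (illustrative names):

```python
def tiered_query(query, tiers, K, retrieve):
    """Fall through tiers until we have gathered at least K results."""
    results, seen = [], set()
    for index in tiers:
        for d in retrieve(query, index):
            if d not in seen:            # don't double-count across tiers
                seen.add(d)
                results.append(d)
        if len(results) >= K:
            break                        # enough results; stop descending
    return results[:K]
```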

8
Q

Query term proximity

A

Let ω be the width of the smallest window in a doc d that contains all the query terms, measured in the number of words in the window. The smaller ω is, the better d matches the query. A sketch of the window computation follows.
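
A minimal sliding-window sketch for computing ω, assuming doc_tokens is the doc as a list of words (an illustrative name):

```python
from collections import Counter

def smallest_window(doc_tokens, query_terms):
    """Width, in words, of the smallest window containing every query term."""
    need = set(query_terms)
    have = Counter()
    covered = 0                          # distinct query terms in window
    best = float("inf")
    left = 0
    for right, tok in enumerate(doc_tokens):
        if tok in need:
            have[tok] += 1
            if have[tok] == 1:
                covered += 1
        while covered == len(need):      # window covers all terms: shrink it
            best = min(best, right - left + 1)
            t = doc_tokens[left]
            if t in need:
                have[t] -= 1
                if have[t] == 0:
                    covered -= 1
            left += 1
    return best                          # inf if some term never appears
```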
