Chapter 6 Retrieval Model Flashcards
Two main information retrieval models
the vector space retrieval model and the probabilistic model
4 major retrieval models
pivoted length normalization
Okapi BM25
query likelihood with JM smoothing
PL2
Common form of a Retrieval Function
First, these models are all based on the assumption of a bag-of-words representation of text. Term frequency (TF), document length, and document frequency (DF) capture the main ideas used in pretty much all state-of-the-art retrieval models.
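As a sketch of that common form (the symbols c(w,q) and weight(w,d) are my notation, not from the flashcards), each model scores a document by summing a per-term weight over the query terms it matches:

f(q,d) = sum over w in (q ∩ d) of c(w,q) * weight(w,d)

where c(w,q) is the count of w in the query and weight(w,d) typically grows with TF(w,d), grows with IDF(w), and shrinks with document length |d|.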
vector space retrieval model
The VS model is a framework in which we make several assumptions. One assumption is that we represent each document and query by a term vector. Here, a term can be any basic concept, such as a word, a phrase, or any other feature representation.
Each term is assumed to define one dimension. Since we have |V| terms in our vocabulary, we define a |V|-dimensional space.
We place all documents in our collection in this vector space, where they point in all kinds of directions. We also place the query in this space as another vector. We can then measure the similarity between the query vector and every document vector.
How to instantiate the vector space model so that we can get a specific ranking function
Bag-of-words instantiation: we use each word in our vocabulary to define a dimension, and each vector element is a bit indicating the word's presence or absence:
1: present
0: absent
Sim(q,d) = q · d = x1*y1 + x2*y2 + … + xN*yN, where q = (x1, …, xN), d = (y1, …, yN), and each xi, yi ∈ {0, 1}
Now we can finally implement this ranking function in a programming language and rank the documents in our corpus for a given query.
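A minimal Python sketch of this bit-vector ranking (the tokenizer and the toy documents below are assumptions for illustration, not from the chapter); since the vectors are bits, the dot product reduces to counting shared unique terms:

```python
# Toy sketch of bit-vector retrieval: vectors are 0/1 per word, so the
# dot product Sim(q,d) equals the number of unique query words in d.

def bit_vector_score(query: str, doc: str) -> int:
    """Count the unique query words that appear in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Hypothetical mini-corpus for illustration.
docs = [
    "news about the stock market",
    "news about organic food campaign",
    "news of presidential campaign",
]
query = "news about presidential campaign"

# Rank documents by descending score; note the last two tie at 3.
for d in sorted(docs, key=lambda d: bit_vector_score(query, d), reverse=True):
    print(bit_vector_score(query, d), d)
```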
Behavior of the Bit Vector Representation
The bit vector scoring function counts the number of unique query terms matched in each document: the more unique query terms a document matches, the more relevant it is assumed to be. The only problem is that three documents (d2, d3, and d4) are tied with the same score.
improved instantiation
A natural improvement is to consider multiple occurrences of a term in a document, as opposed to a binary representation.
TF(w,d) = count(w,d)
A query term that occurs twice in d4 now weights the corresponding dimension as two instead of one, so the score for d4 is higher. This means that by using TF we can now rank d4 above d2 and d3, as we had hoped.
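As a sketch, the TF-weighted dot product can be implemented like this (toy code; the example documents are assumptions):

```python
from collections import Counter

def tf_score(query: str, doc: str) -> int:
    """Dot product with raw term counts on both sides."""
    q_tf = Counter(query.lower().split())
    d_tf = Counter(doc.lower().split())
    return sum(q_tf[w] * d_tf[w] for w in q_tf if w in d_tf)

# A document that repeats "presidential" now breaks the earlier tie.
print(tf_score("news about presidential campaign",
               "news of presidential campaign presidential candidate"))  # 4
print(tf_score("news about presidential campaign",
               "news about organic food campaign"))                      # 3
```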
stop word
A word like "about" doesn't carry much content, so we should be able to ignore it. We call such a word a stop word.
Stop words are generally very frequent and occur everywhere, so matching them doesn't have any significance.
inverse document frequency (IDF)
We want to reward a word that doesn't occur in many documents. We can penalize common words, which generally have a low IDF, and reward informative words, which have a higher IDF.
IDF(w) = log((M+1)/df(w)), where M is the total number of documents in the collection and df(w) is the number of documents containing w
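Putting TF and IDF together, a minimal sketch (the toy corpus and helper names are assumptions; the log form follows the IDF definition above):

```python
import math
from collections import Counter

def idf(word: str, docs: list[str]) -> float:
    """IDF(w) = log((M+1)/df(w)); 0 if the word appears in no document."""
    df = sum(1 for d in docs if word in d.lower().split())
    return math.log((len(docs) + 1) / df) if df else 0.0

def tfidf_score(query: str, doc: str, docs: list[str]) -> float:
    """TF on both sides, with each matched term scaled by its IDF."""
    q_tf = Counter(query.lower().split())
    d_tf = Counter(doc.lower().split())
    return sum(q_tf[w] * d_tf[w] * idf(w, docs) for w in q_tf if w in d_tf)

# Frequent words like "about" get a low IDF and contribute little;
# rare, informative words like "presidential" dominate the score.
docs = [
    "news about the stock market",
    "news about organic food campaign",
    "news of presidential campaign",
]
print(tfidf_score("news about presidential campaign", docs[2], docs))
```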