Chapter 6 Retrieval Model Flashcards
Two main information retrieval models
the vector space retrieval model and the probabilistic model
4 major retrieval models
pivoted length normalization
Okapi BM25
query likelihood with JM smoothing
PL2
Common form of a Retrieval Function
First, these models are all based on the assumption of a bag-of-words representation of text. Term frequency (TF), document length, and document frequency (DF) capture the main ideas used in pretty much all state-of-the-art retrieval models.
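As a sketch of that common form (the symbols c(w,q) and weight(w,d) are my notation, not from the flashcards), each model scores a document by summing a per-term weight over the query terms it matches:

f(q,d) = sum over w in (q ∩ d) of c(w,q) * weight(w,d)

where c(w,q) is the count of w in the query and weight(w,d) typically grows with TF(w,d), grows with IDF(w), and shrinks with document length |d|.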
vector space retrieval model
The VS model is a framework in which we make several assumptions. One assumption is that we represent each document and query by a term vector. Here, a term can be any basic concept, such as a word, a phrase, or any other feature representation.
Each term is assumed to define one dimension. Since we have |V| terms in our vocabulary, we define a |V|-dimensional space.
We place all documents in our collection in this vector space, where they point in all kinds of directions. We also place the query in this space as another vector. We can then measure the similarity between the query vector and every document vector.
How to instantiate the vector space model so that we can get a specific ranking function
Bag-of-words instantiation: we use each word in our vocabulary to define a dimension, and each vector element is a bit indicating the word's presence or absence:
1: present
0: absent
Sim(q,d) = q · d = x1*y1 + x2*y2 + … + xN*yN, where q = (x1, …, xN), d = (y1, …, yN), and each xi, yi ∈ {0, 1}
Now we can finally implement this ranking function in a programming language and rank the documents in our corpus for a given query.
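A minimal Python sketch of this bit-vector ranking (the tokenizer and the toy documents below are assumptions for illustration, not from the chapter); since the vectors are bits, the dot product reduces to counting shared unique terms:

```python
# Toy sketch of bit-vector retrieval: vectors are 0/1 per word, so the
# dot product Sim(q,d) equals the number of unique query words in d.

def bit_vector_score(query: str, doc: str) -> int:
    """Count the unique query words that appear in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Hypothetical mini-corpus for illustration.
docs = [
    "news about the stock market",
    "news about organic food campaign",
    "news of presidential campaign",
]
query = "news about presidential campaign"

# Rank documents by descending score; note the last two tie at 3.
for d in sorted(docs, key=lambda d: bit_vector_score(query, d), reverse=True):
    print(bit_vector_score(query, d), d)
```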
Behavior of the Bit Vector Representation
The bit vector scoring function counts the number of unique query terms matched in each document: the more unique query terms a document matches, the more relevant it is assumed to be. The only problem is that three documents (d2, d3, and d4) are tied with the same score.
improved instantiation
A natural improvement is to consider multiple occurrences of a term in a document, as opposed to a binary representation.
TF(w,d) = count(w,d)
A query term that occurs twice in d4 now weights the corresponding dimension as two instead of one, so the score for d4 is higher. This means that by using TF we can now rank d4 above d2 and d3, as we had hoped.
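As a sketch, the TF-weighted dot product can be implemented like this (toy code; the example documents are assumptions):

```python
from collections import Counter

def tf_score(query: str, doc: str) -> int:
    """Dot product with raw term counts on both sides."""
    q_tf = Counter(query.lower().split())
    d_tf = Counter(doc.lower().split())
    return sum(q_tf[w] * d_tf[w] for w in q_tf if w in d_tf)

# A document that repeats "presidential" now breaks the earlier tie.
print(tf_score("news about presidential campaign",
               "news of presidential campaign presidential candidate"))  # 4
print(tf_score("news about presidential campaign",
               "news about organic food campaign"))                      # 3
```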
stop word
A word like "about" doesn't carry much content, so we should be able to ignore it. We call such a word a stop word.
Stop words are generally very frequent and occur everywhere, so matching them doesn't have any significance.
inverse document frequency (IDF)
We want to reward a word that doesn't occur in many documents. We can penalize common words, which generally have a low IDF, and reward informative words, which have a higher IDF.
IDF(w) = log((M+1)/df(w)), where M is the total number of documents in the collection and df(w) is the number of documents containing w
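Putting TF and IDF together, a minimal sketch (the toy corpus and helper names are assumptions; the log form follows the IDF definition above):

```python
import math
from collections import Counter

def idf(word: str, docs: list[str]) -> float:
    """IDF(w) = log((M+1)/df(w)); 0 if the word appears in no document."""
    df = sum(1 for d in docs if word in d.lower().split())
    return math.log((len(docs) + 1) / df) if df else 0.0

def tfidf_score(query: str, doc: str, docs: list[str]) -> float:
    """TF on both sides, with each matched term scaled by its IDF."""
    q_tf = Counter(query.lower().split())
    d_tf = Counter(doc.lower().split())
    return sum(q_tf[w] * d_tf[w] * idf(w, docs) for w in q_tf if w in d_tf)

# Frequent words like "about" get a low IDF and contribute little;
# rare, informative words like "presidential" dominate the score.
docs = [
    "news about the stock market",
    "news about organic food campaign",
    "news of presidential campaign",
]
print(tfidf_score("news about presidential campaign", docs[2], docs))
```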