Retrieval models Flashcards
What is term frequency?
How many times a word/term occurs in a document.
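For illustration, a minimal Python sketch of counting c(word, d), assuming a simple whitespace tokenizer (the function name term_frequency is hypothetical, not from the flashcards):

```python
from collections import Counter

def term_frequency(document: str) -> Counter:
    """Return c(w, d) for every word w in document d (whitespace-tokenized)."""
    return Counter(document.lower().split())

doc = "to be or not to be"
tf = term_frequency(doc)
print(tf["to"])  # 2 -- "to" occurs twice in this document
```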
What can be inferred if a term occurs many times in a document?
The value of the count function c(word, d) is high.
What are the three factors of the scoring function?
Term frequency, document length, document frequency.
What is document frequency?
DF is the count of documents that contain a particular term.
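A hypothetical sketch of computing df(w) over a toy collection; each document contributes at most once per term:

```python
from collections import Counter

def document_frequency(collection):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in collection:
        df.update(set(doc.lower().split()))  # set() -> each document counted once per term
    return df

docs = ["news about politics", "news about sports", "campaign news"]
df = document_frequency(docs)
print(df["news"], df["sports"])  # 3 1 -- "news" is common, "sports" is rare
```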
What is the difference between matching a rare and a common term?
Matching a rare term probably contributes more to the value of the ranking (score) function.
What are the characteristics of state of the art retrieval models?
Bag of words representation, TF, DF. These features are used for determining a ranking (score).
What do we assume with similarity based models?
We assume that relevance is roughly correlated with similarity between a document and a query.
What are the dimensions in the vector space model?
Each term from the query defines a dimension.
What do we ignore with the representation in the vector space model?
For example, the order of the words.
Which document has the highest ranking in the vector space model?
The document vector which is closest to the query vector.
How do we represent documents and the query in the vector space model?
With term vectors.
What is the bag of words instantiation?
Every word represents a dimension.
What is the bit vector representation?
1 if the word is present, otherwise 0.
How can we measure similarity in vector space model?
With the dot product:
sim(q, d) = sum_i q_i * d_i
What does the simplest form of the vector space model look like?
Bit vector representation, dot product, bag of words instantiation.
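A sketch of this simplest instantiation, assuming whitespace tokenization and a small fixed vocabulary (the names bit_vector and score are illustrative):

```python
def bit_vector(text, vocabulary):
    """0/1 vector: 1 if the vocabulary term appears in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def score(query, doc, vocabulary):
    """Dot product of the two bit vectors = number of matched unique query terms."""
    q = bit_vector(query, vocabulary)
    d = bit_vector(doc, vocabulary)
    return sum(qi * di for qi, di in zip(q, d))

vocab = ["news", "about", "presidential", "campaign"]
print(score("news about presidential campaign",
            "news about organic food campaign", vocab))  # 3 matched terms
```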
What are the problems with the bit vector representation?
The bit vector representation does not reward multiple occurrences of a term in a document; the score just counts how many unique query terms the document contains.
What does the improved form of the vector space model look like?
Term frequency instead of bit vector representation. Dot product and bag of words.
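The same sketch with raw term frequency in place of the 0/1 bit (again assuming whitespace tokenization; names are illustrative):

```python
from collections import Counter

def tf_vector(text, vocabulary):
    """Vector of raw term counts c(w, d) over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def score(query, doc, vocabulary):
    q = tf_vector(query, vocabulary)
    d = tf_vector(doc, vocabulary)
    return sum(qi * di for qi, di in zip(q, d))

vocab = ["news", "about", "presidential", "campaign"]
doc = "the presidential campaign news and more campaign news about the campaign"
print(score("news about presidential campaign", doc, vocab))  # 7 -- repeated terms now count
```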
What is the problem of the improved form (just TF replaced) of the vector space model?
Stop words are treated as being just as important as the other words in the query.
What is inverse document frequency (IDF) and what is it used for?
It is used to reward less common terms and penalize popular terms.
IDF(w) = log[(M + 1) / df(w)]
where M is the total number of documents.
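A small sketch of the IDF formula on a toy collection (collection and function names are made up for illustration):

```python
import math
from collections import Counter

docs = ["news about politics", "news about sports", "campaign news"]
M = len(docs)  # total number of documents

df = Counter()
for doc in docs:
    df.update(set(doc.lower().split()))

def idf(word):
    """IDF(w) = log[(M + 1) / df(w)]"""
    return math.log((M + 1) / df[word])

print(round(idf("news"), 3))      # low: "news" appears in every document
print(round(idf("campaign"), 3))  # higher: "campaign" appears in only one document
```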
How effective is the TF-IDF weighting model?
The results are reasonable. However, it can also rank totally non-relevant documents high if one particular term occurs many times.
How can the problem of the TF-IDF weighting model be mitigated?
By transforming TF. The best transformation to date is BM25 TF where BM stands for best matching.
What is the upper bound of BM25 TF?
K + 1, where K controls the upper bound. K should be higher for longer documents.
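A minimal sketch of the BM25 TF transformation, (K + 1) * TF / (TF + K): the value grows with TF but never exceeds K + 1 (the default k = 1.2 below is just an illustrative choice):

```python
def bm25_tf(tf, k=1.2):
    """BM25 TF transformation; bounded above by k + 1."""
    return (k + 1) * tf / (tf + k)

for tf in [1, 2, 5, 20, 100]:
    print(tf, round(bm25_tf(tf), 3))  # approaches k + 1 = 2.2 as tf grows
```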