Retrieval models Flashcards
What is term frequency?
How many times a word/term occurs in a document.
What can be inferred if a term occurs many times in a document?
The count c(word, d) is high; the document is then more likely to be about that term.
What are the three factors of the scoring function?
Term frequency, document length, document frequency.
What is document frequency?
DF is the count of documents that contain a particular term.
What is the difference between matching a rare and a common term?
Matching a rare term should contribute more to the value of the ranking (score) function than matching a common term.
What are the characteristics of state of the art retrieval models?
Bag of words representation, TF, DF. These features are used for determining a ranking (score).
What do we assume with similarity based models?
We assume that relevance is roughly correlated with the similarity between a document and a query.
What are the dimensions in the vector space model?
Each term in the vocabulary defines a dimension (for scoring, only the query's terms matter, since all other dimensions contribute zero to the dot product).
What do we ignore with the representation in the vector space model?
For example, the order of the words.
Which document has the highest ranking in the vector space model?
The document whose vector is closest to the query vector.
How do we represent documents and the query in the vector space model?
With term vectors.
What is the bag of words instantiation?
Every word represents a dimension.
What is the bit vector representation?
A term's weight is 1 if the word is present in the document, otherwise 0.
How can we measure similarity in vector space model?
With the dot product.
sim(q, d) = sum_i q_i * d_i
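A minimal sketch in Python of this dot-product similarity over bit vectors (the vocabulary and texts are illustrative, not from the source):

```python
# Sketch: dot-product similarity in the simplest vector space model.
# Vocabulary, query, and document texts are illustrative.

def bit_vector(text, vocabulary):
    """Bit vector representation: 1 if the word occurs in the text, otherwise 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def sim(q, d):
    """sim(q, d) = sum_i q_i * d_i (dot product)."""
    return sum(q_i * d_i for q_i, d_i in zip(q, d))

vocabulary = ["news", "about", "presidential", "campaign"]
q = bit_vector("news about presidential campaign", vocabulary)
d = bit_vector("campaign news spread quickly", vocabulary)
print(sim(q, d))  # 2: the document matches two unique query terms
```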
What does the simplest form of the vector space model look like?
Bit vector representation, dot product, bag of words instantiation.
What are the problems with the bit vector representation?
More occurrences of a term in a document are not rewarded by the bit vector representation; it only counts how many unique query terms the document contains.
What does the improved form of the vector space model look like?
Term frequency instead of the bit vector representation; dot product and bag of words stay the same.
What is the problem of the improved form (just TF replaced) of the vector space model?
Stop words are weighted as heavily as the other words in the query.
What is inverse document frequency (IDF) and what is it used for?
It is used for rewarding less common terms; it penalizes popular terms.
IDF(w) = log[(M+1) / df(w)]
where M is the total number of documents in the collection.
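A minimal sketch of TF-IDF scoring with the IDF formula above, assuming a toy collection and whitespace tokenization (all names are illustrative):

```python
import math

# Sketch: TF-IDF scoring with IDF(w) = log[(M + 1) / df(w)].
# The toy collection and whitespace tokenization are illustrative.

docs = [
    "news about presidential campaign",
    "campaign news spread quickly",
    "weather news for today",
]
M = len(docs)  # total number of documents in the collection

def df(word):
    """Document frequency: how many documents contain the word."""
    return sum(1 for d in docs if word in d.split())

def idf(word):
    """Rewards less common terms, penalizes popular ones."""
    return math.log((M + 1) / df(word))

def score(query, doc):
    """Sum over query words of c(w, d) * IDF(w)."""
    return sum(doc.split().count(w) * idf(w) for w in query.split() if df(w) > 0)

print(score("presidential news", docs[0]))  # "presidential" (rare) outweighs "news" (common)
```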
How effective is the TF-IDF weighting model?
The results are reasonable. However, it can also rank totally non-relevant documents high if one particular term occurs many times.
How can the problem of the TF-IDF weighting model be mitigated?
By transforming TF. The best transformation to date is BM25 TF, where BM stands for "best matching".
What is the upper bound of BM25 TF?
k + 1, where k controls the upper bound. k should be higher for longer documents.
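A small sketch of the BM25 TF transformation, illustrating the upper bound k + 1 and the limiting behaviors covered in the next cards (parameter values are illustrative):

```python
# Sketch: the BM25 TF transformation; k controls the upper bound k + 1.
# Parameter values are illustrative.

def tf_bm25(c, k):
    """Sublinear transformation of the raw count c(w, d), bounded above by k + 1."""
    return (k + 1) * c / (c + k) if c > 0 else 0.0

for c in [1, 2, 5, 100]:
    print(c, round(tf_bm25(c, k=1.2), 3))  # saturates toward k + 1 = 2.2

print(tf_bm25(5, k=0))       # k = 0: always 1 for c > 0, i.e. the 0/1 bit transformation
print(tf_bm25(5, k=1000.0))  # very large k: approximately linear in c
```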
What is the difference between the BM25 TF function and the logarithm transformation?
The logarithm function does not have an upper bound.
What happens when k = 0 in BM25 TF?
It is equivalent to the 0/1 bit vector transformation.
What happens if k is very large in BM25 TF?
It looks more like a linear transformation function.
Why can BM25 TF be considered flexible?
It allows us to control the shape of the TF curve quite easily.
Why is the upper bound useful in BM25 TF?
It is useful for controlling the influence of a particular term. It ensures that all terms are counted when we aggregate the weights to compute a score.
Why do we need some sublinearity in the TF function? Give 2 reasons.
It allows us to represent the intuition of diminishing returns from high term counts. It also avoids the dominance of one term over the others.
What is the problem with long documents?
They have a higher chance to match any query.
Why do we have to be careful with penalizing long documents?
Long documents might simply be longer because they have more content.
What is the pivot in the pivoted length normalization and what is its meaning?
The average document length.
Documents longer than the pivot are penalized; documents shorter than it are rewarded.
What is pivoted length normalization?
It is used for document length penalization.
normalizer = 1 - b + b * |d| / avdl
What is parameter ‘b’ used for in pivoted length normalization?
The degree of penalization is controlled by ‘b’. Its value ranges from 0 to 1.
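A sketch of pivoted length normalization plugged into the BM25 TF transformation (document lengths, avdl, and parameter values are illustrative):

```python
# Sketch: pivoted length normalization inside the BM25 TF transformation.
# doc_len, avdl, and parameter values are illustrative.

def pivoted_normalizer(doc_len, avdl, b):
    """normalizer = 1 - b + b * |d| / avdl; b in [0, 1] controls the penalization."""
    return 1 - b + b * doc_len / avdl

def tf_bm25_normalized(c, k, doc_len, avdl, b):
    """Length-normalized BM25 TF: longer-than-average documents are penalized."""
    norm = pivoted_normalizer(doc_len, avdl, b)
    return (k + 1) * c / (c + k * norm) if c > 0 else 0.0

avdl = 100
print(tf_bm25_normalized(3, k=1.2, doc_len=200, avdl=avdl, b=0.75))  # penalized (longer than pivot)
print(tf_bm25_normalized(3, k=1.2, doc_len=50,  avdl=avdl, b=0.75))  # rewarded (shorter than pivot)
```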
Why do we need a double logarithm transformation?
To achieve sublinearity.
Which representation is considered best in practice?
“Bag-of-phrases” representation.
What other kinds of representations can you think of?
Stemmed words, stop words removal, character n-grams.
What is BM25-F useful for?
For documents with structure (title, abstract, etc.). It applies BM25 on each field and then combines the scores, but keeps global frequency counts. This has the advantage of avoiding over-counting the first occurrence of a term.
What is BM25+ useful for?
It addresses the problem of over-penalization of long documents by BM25.
How does BM25+ fix the problem of over-penalizing long documents?
It adds a constant to the TF normalization formula.
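A minimal sketch of that fix, assuming the commonly cited BM25+ formulation (Lv & Zhai) where a constant delta is added to the normalized TF; the default delta = 1.0 and all names are illustrative:

```python
# Sketch of BM25+: add a constant delta to the normalized TF so a long document
# that contains a query term always outscores one that does not contain it.
# Parameter names and the default delta = 1.0 are illustrative.

def tf_bm25_plus(c, k, doc_len, avdl, b, delta=1.0):
    if c == 0:
        return 0.0
    norm = 1 - b + b * doc_len / avdl  # pivoted length normalization
    return (k + 1) * c / (c + k * norm) + delta
```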
How can R (relevance) be estimated?
It can be approximated using clickthrough data.
What is our assumption in query likelihood?
That the probability of relevance can be approximated by the probability of a query given a document and relevance. p(q | d, R = 1)
What happens if one term is not present in any of the documents in query likelihood model?
It would cause all these documents to have zero probability of generating this query even though the document might be relevant.
What happens if one term is not present in any of the documents under the unigram language model?
With smoothing, it does not necessarily assign zero probability to any word.
What is the form of unigram language model?
P(t_1,t_2,t_3,t_4) = P(t_1) * P(t_2) * P(t_3) * P(t_4)
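A minimal sketch of a maximum-likelihood unigram document model, illustrating the zero-probability problem from the earlier card (texts are illustrative):

```python
from collections import Counter

# Sketch: maximum-likelihood unigram language model of a single document.
# Without smoothing, a query word absent from the document zeroes the whole product.

def mle_lm(doc):
    counts = Counter(doc.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, lm):
    p = 1.0
    for w in query.split():
        p *= lm.get(w, 0.0)  # unseen word -> probability 0 for the whole query
    return p

lm = mle_lm("news about presidential campaign news")
print(query_likelihood("presidential news", lm))    # > 0
print(query_likelihood("presidential debate", lm))  # 0.0: "debate" never occurs
```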
What is the form of general language model?
P(t_1, t_2, t_3, t_4) = P(t_1) * P(t_2 | t_1) * P(t_3 | t_1, t_2) * P(t_4 | t_1, t_2, t_3)
What is the idea of smoothing in the query likelihood model?
It assigns non-zero probabilities to words that are not present in the data.
What is the interpolation method?
It smooths the probabilities estimated from the document with probabilities from the whole collection. The interpolation is typically linear.
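A minimal sketch of linear interpolation (Jelinek-Mercer) smoothing over a toy collection (the texts and the lambda value are illustrative):

```python
from collections import Counter

# Sketch: linear interpolation (Jelinek-Mercer) smoothing of the document model
# with the collection model. The toy collection and lambda value are illustrative.

docs = ["news about presidential campaign news", "campaign news spread quickly"]
collection = Counter(w for d in docs for w in d.split())
collection_total = sum(collection.values())

def smoothed_query_likelihood(query, doc, lam=0.5):
    """p(w | d) = (1 - lam) * c(w, d) / |d| + lam * p(w | C), multiplied over query words."""
    doc_counts = Counter(doc.split())
    doc_len = sum(doc_counts.values())
    p = 1.0
    for w in query.split():
        p_doc = doc_counts[w] / doc_len
        p_col = collection[w] / collection_total
        p *= (1 - lam) * p_doc + lam * p_col
    return p

# "spread" never occurs in docs[0], yet the query still gets a non-zero probability:
print(smoothed_query_likelihood("spread news", docs[0]))
```

Words missing from the document get all their probability mass from the collection model, and common words get little extra credit, which is where the IDF-like behavior in the next card comes from.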
What is smoothing’s behavior similar to?
IDF.
What is language modeling?
It assigns a probability to a sequence of words drawn from some vocabulary.
What is the probability of relevance given a document and the query?
p(R=1 | d,q) = count(R = 1, d,q) / count(d,q)
What do we do when we have a lot of unseen documents or queries?
We have to approximate it in some way; the query likelihood model uses p(q | d, R = 1).
What assumption do we have with the query likelihood model?
That a user formulates the query based on an imaginary document.