Retrieval models Flashcards

1
Q

What is term frequency?

A

How many times a word/term occurs in a document.
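
A minimal sketch of this count in Python (the whitespace tokenization, lowercasing, and the helper name `term_frequency` are illustrative assumptions, not from the cards):

```python
from collections import Counter

def term_frequency(term: str, document: str) -> int:
    # Count occurrences of a term in a whitespace-tokenized, lowercased document.
    return Counter(document.lower().split())[term.lower()]

print(term_frequency("retrieval", "Retrieval models rank documents for retrieval"))
# prints 2
```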

2
Q

What can be inferred if a term occurs many times in a document?

A

The value of the count function c(word, d) is high.

3
Q

What are the three factors of the scoring function?

A

Term frequency, document length, document frequency.

4
Q

What is document frequency?

A

DF is the count of documents that contain a particular term.

5
Q

What is the difference between matching a rare and a common term?

A

Matching a rare term probably contributes more to the value of the ranking (score) function.

6
Q

What are the characteristics of state of the art retrieval models?

A

Bag of words representation, TF, DF. These features are used for determining a ranking (score).

7
Q

What do we assume with similarity based models?

A

We assume that relevance is roughly correlated to the similarity between a document and a query.

8
Q

What are the dimensions in the vector space model?

A

Each term from the query defines a dimension.

9
Q

What do we ignore with the representation in the vector space model?

A

For example, the order of the words.

10
Q

Which document has the highest ranking in the vector space model?

A

The document vector which is closest to the query vector.

11
Q

How do we represent documents and the query in the vector space model?

A

With term vectors.

12
Q

What is the bag of words instantiation?

A

Every word represents a dimension.

13
Q

What is the bit vector representation?

A

The component is 1 if the word is present in the document, otherwise 0.

14
Q

How can we measure similarity in vector space model?

A

With the dot product:

sim(q,d) = sum_i (q_i * d_i)
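
A sketch of the dot product over sparse term-weight vectors (the dict representation and variable names are assumptions for illustration):

```python
def dot_product_sim(q: dict, d: dict) -> float:
    # sim(q, d) = sum over terms of q_i * d_i; only shared terms contribute.
    return sum(w * d.get(t, 0.0) for t, w in q.items())

q = {"presidential": 1.0, "campaign": 1.0}
d = {"campaign": 2.0, "senate": 1.0}
print(dot_product_sim(q, d))  # 2.0: only "campaign" is shared
```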

15
Q

What does the simplest form of the vector space model look like?

A

Bit vector representation, dot product, bag of words instantiation.

16
Q

What are the problems with the bit vector representation?

A

More occurrences of a term in a document are not rewarded by the bit vector representation; it just counts how many unique terms a document has.

17
Q

What does the improved form of the vector space model look like?

A

Term frequency instead of bit vector representation. Dot product and bag of words.

18
Q

What is the problem of the improved form (just TF replaced) of the vector space model?

A

Stop words are treated as being as important as other words in the query.

19
Q

What is inverse document frequency (IDF) and what is it used for?

A

It is used for rewarding less common terms and penalizing popular ones.
IDF(w) = log[(M+1) / df(w)]
where M is the total number of documents and df(w) is the document frequency of w.
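
A sketch of this IDF formula; the natural log and the toy numbers are assumptions (the card does not specify a log base):

```python
import math

def idf(df_w: int, M: int) -> float:
    # IDF(w) = log((M + 1) / df(w)): rarer terms get a higher weight.
    return math.log((M + 1) / df_w)

# With M = 99 documents, a rare term (df = 1) outweighs a common one (df = 50):
print(idf(1, 99) > idf(50, 99))  # True
```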

20
Q

How effective is the TF-IDF weighting model?

A

The results are reasonable. However, it can also rank totally non-relevant documents high if one particular term occurs many times.

21
Q

How can the problem of the TF-IDF weighting model be mitigated?

A

By transforming TF. The best transformation to date is BM25 TF where BM stands for best matching.

22
Q

What is the upper bound of BM25 TF?

A

k + 1, where k controls the upper bound. k should be higher for longer documents.
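
A sketch of the BM25 TF transformation with its k + 1 upper bound; the formula (k + 1) * tf / (tf + k) is the standard one, while the guard for tf = 0 is an added assumption to keep k = 0 well-defined:

```python
def bm25_tf(tf: float, k: float) -> float:
    # (k + 1) * tf / (tf + k): grows sublinearly and never exceeds k + 1.
    if tf == 0:
        return 0.0  # avoids 0/0 when k = 0
    return (k + 1) * tf / (tf + k)

print(bm25_tf(1000, 1.2) < 1.2 + 1)  # True: bounded by k + 1
print(bm25_tf(5, 0))                 # 1.0: k = 0 acts like the zero-one bit vector
```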

23
Q

What is the difference between the BM25 TF function and the logarithm transformation?

A

The logarithm function does not have an upper bound.

24
Q

What happens when k = 0 in BM25 TF?

A

It’s equivalent to the zero-one (bit vector) transformation.

25
Q

What happens if k is very large in BM25 TF?

A

It looks more like a linear transformation function.

26
Q

Why can BM25 TF be considered flexible?

A

It allows us to control the shape of the TF curve quite easily.

27
Q

Why is the upper bound useful in BM25 TF?

A

It is useful to control the influence of a particular term. It ensures that all terms will be counted when we aggregate the weights to compute a score.

28
Q

Why do we need some sublinearity in the TF function? Give 2 reasons.

A

This allows us to represent the intuition of diminishing return from high term counts. It also avoids the dominance of one term over the others.

29
Q

What is the problem with long documents?

A

They have a higher chance of matching any query.

30
Q

Why do we have to be careful with penalizing long documents?

A

Long documents might simply be longer because they have more content.

31
Q

What is the pivot in the pivoted length normalization and what is its meaning?

A

The average document length.

Documents above this length are penalized, documents below this length are rewarded.

32
Q

What is pivoted length normalization?

A

It is used for document length penalization.

normalizer = 1 - b + b * |d| / avdl

where |d| is the document length and avdl is the average document length.
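
A sketch of this normalizer (the function name and the default b = 0.75 are assumed choices for illustration):

```python
def pivoted_norm(doc_len: float, avdl: float, b: float = 0.75) -> float:
    # normalizer = 1 - b + b * |d| / avdl
    # > 1 for documents longer than average (penalized), < 1 for shorter ones.
    return 1 - b + b * doc_len / avdl

print(pivoted_norm(100, 100))      # 1.0: average-length document is untouched
print(pivoted_norm(200, 100) > 1)  # True: longer than average, penalized
```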

33
Q

What is parameter ‘b’ used for in pivoted length normalization?

A

The degree of penalization is controlled by ‘b’. Its value ranges from 0 to 1.

34
Q

Why do we need double logarithm transformation?

A

To achieve sublinearity.

35
Q

Which representation is considered best in practice?

A

“Bag-of-phrases” representation.

36
Q

What kind of representation can you think of?

A

Stemmed words, stop words removal, character n-grams.

37
Q

What is BM25-F useful for?

A

For documents with structure (title, abstract, etc.). It applies BM25 on each field and then combines the scores, but keeps global frequency counts. This has the advantage of avoiding over-counting the first occurrence of a term.

38
Q

What is BM25+ useful for?

A

It addresses the problem of over-penalization of long documents by BM25.

39
Q

How does BM25+ fix the problem of over-penalizing long documents?

A

It adds a constant to the TF normalization formula.

40
Q

How can R (relevance) be estimated?

A

It can be approximated using clickthrough data.

41
Q

What is our assumption in query likelihood?

A

That the probability of relevance can be approximated by the probability of a query given a document and relevance. p(q | d, R = 1)

42
Q

What happens if one term is not present in any of the documents in query likelihood model?

A

It would cause all these documents to have zero probability of generating this query even though the document might be relevant.

43
Q

What happens if one term is not present in any of the documents in the unigram language model?

A

It does not necessarily assign zero probability to any word.

44
Q

What is the form of unigram language model?

A

P(t_1,t_2,t_3,t_4) = P(t_1) * P(t_2) * P(t_3) * P(t_4)
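
Under the unigram independence assumption, the query probability is just a product of per-term probabilities; working in log space avoids floating-point underflow. A sketch with an assumed toy distribution:

```python
import math

def query_log_likelihood(terms, p):
    # log P(t_1, ..., t_n) = sum of log P(t_i) under the unigram model.
    return sum(math.log(p[t]) for t in terms)

p = {"news": 0.5, "campaign": 0.25}  # toy unigram distribution, an assumption
print(math.exp(query_log_likelihood(["news", "campaign"], p)))  # ≈ 0.125
```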

45
Q

What is the form of general language model?

A

P(t_1,t_2,t_3,t_4) = P(t_1) * P(t_2 | t_1) * P(t_3 | t_1,t_2) * P(t_4 | t_1,t_2,t_3)

46
Q

What is the idea of smoothing in the query likelihood model?

A

It assigns non-zero probabilities to words that are not present in the data.

47
Q

What is the interpolation method?

A

It smooths the probability coming from the document with probabilities coming from the whole collection. Interpolation is typically done in a linear fashion.
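
A sketch of linear interpolation (Jelinek-Mercer style) smoothing; the parameter name `lam` and the toy counts are assumptions:

```python
def smoothed_prob(w, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # Mix the document's maximum-likelihood estimate with the collection's:
    # words unseen in the document still get non-zero probability.
    p_doc = doc_tf.get(w, 0) / doc_len
    p_coll = coll_tf.get(w, 0) / coll_len
    return (1 - lam) * p_doc + lam * p_coll

doc = {"campaign": 2}                 # document of length 10
coll = {"campaign": 50, "news": 100}  # collection of length 1000
print(smoothed_prob("news", doc, 10, coll, 1000) > 0)  # True: smoothed, not zero
```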

48
Q

What is smoothing’s behavior similar to?

A

IDF.

49
Q

What is language modeling?

A

It assigns a probability to a sequence of words drawn from some vocabulary.

50
Q

What is the probability of relevance given a document and the query?

A

p(R=1 | d,q) = count(R = 1, d,q) / count(d,q)

51
Q

What do we do when we have a lot of unseen documents or queries?

A

We have to approximate in some way. In the query likelihood model: p(q | d, R = 1).

52
Q

What assumption do we have with the query likelihood model?

A

That a user formulates the query based on an imaginary document.
