W6 BIM, BM25, Probabilistic Models Flashcards

Question 1

Q

the idea of document ranking

Answer

A

rank documents by their probability of relevance to the information need
P(R=1 | document, query)
R=1 means the document is relevant (the user is happy with the doc)
R=0 means the document is not relevant

Question 2

Q

p(A, B) =
P(A,B|C) =

Answer

A

p(A, B) = p(B|A) p(A) = p(A|B) p(B)
(dividing by p(B) gives Bayes' rule: p(A|B) = p(B|A) p(A) / p(B))

P(A,B|C) = p(B|A,C) p(A|C)
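
A small numeric check of the product rule p(A, B) = p(B|A) p(A) = p(A|B) p(B). The joint table below is made up purely for illustration:

```python
# Hypothetical joint distribution over two binary events A and B,
# keyed as (a, b). The numbers are illustrative only.
joint = {(1, 1): 0.2, (1, 0): 0.3, (0, 1): 0.1, (0, 0): 0.4}

p_a = joint[(1, 1)] + joint[(1, 0)]   # p(A=1), marginalising over B
p_b = joint[(1, 1)] + joint[(0, 1)]   # p(B=1), marginalising over A
p_b_given_a = joint[(1, 1)] / p_a     # p(B=1 | A=1)
p_a_given_b = joint[(1, 1)] / p_b     # p(A=1 | B=1)

# Both factorisations recover the same joint probability p(A=1, B=1).
via_a = p_b_given_a * p_a
via_b = p_a_given_b * p_b
```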

Question 3

Q

p(R=1|x)

Answer

A

the probability that a document with term vector x is relevant; this is the quantity we want to estimate

Question 4

Q

Binary independence model (BIM)

Answer

A

documents are represented as binary incidence vectors of terms (x_t = 1 if term t occurs in the document, 0 otherwise)

independence means the terms occur in documents independently

Question 5

Q

BIM: model design

Answer

A

Question 6

Q

RSV

Answer

A

Retrieval Status Value
RSV = sum over terms t with x_t = q_t = 1 of log [ p_t (1 - u_t) / (u_t (1 - p_t)) ] = sum of c_t

where p_t = p(x_t = 1|R=1, q)
and u_t = p(x_t = 1|R=0, q)
c_t is the term weight we want to estimate
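
A minimal sketch of the RSV sum, assuming p_t and u_t are already known. The function name `rsv` and the probability values are hypothetical, chosen only to illustrate the formula:

```python
import math

def rsv(doc_terms, query_terms, p, u):
    """Sum c_t over the terms present in both the document and the query."""
    score = 0.0
    for t in set(doc_terms) & set(query_terms):
        # c_t: log odds ratio of the term occurring in relevant
        # versus non-relevant documents
        score += math.log(p[t] * (1 - u[t]) / (u[t] * (1 - p[t])))
    return score

# Hypothetical estimates, for illustration only.
p = {"bm25": 0.8, "ranking": 0.6}   # p_t = p(x_t = 1 | R=1, q)
u = {"bm25": 0.1, "ranking": 0.3}   # u_t = p(x_t = 1 | R=0, q)

score = rsv({"bm25", "ranking", "poisson"}, {"bm25", "ranking"}, p, u)
```

Terms that occur only in the document or only in the query ("poisson" here) do not contribute to the sum.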

Question 7

Q

the goal of BM25

Answer

A

be sensitive to term frequency and document length while not adding too many parameters

Question 8

Q

how to estimate c_t?

Answer

A

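p_t = s / S = number of relevant docs that contain term t / total number of relevant docs
u_t = (n - s) / (N - S) = number of non-relevant docs that contain term t / total number of non-relevant docs
(N = total number of docs, n = number of docs that contain term t)

c_t = log [ (s / (S - s)) / ((n - s) / (N - n - S + s)) ]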
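
A sketch of the estimate from counts, with c_t taken as the log odds ratio used in the RSV. The `smooth` argument is an addition not on the card: adding 0.5 to each count is a common way to avoid zero counts, and is an assumption here:

```python
import math

def term_weight(s, S, n, N, smooth=0.5):
    """c_t from counts.

    s: relevant docs containing the term, S: relevant docs,
    n: docs containing the term,          N: all docs.
    """
    # `smooth` is added to every cell so that zero counts do not cause
    # division by zero or log(0). 0.5 is an assumed, commonly used value.
    relevant_odds = (s + smooth) / (S - s + smooth)
    nonrelevant_odds = (n - s + smooth) / (N - n - S + s + smooth)
    return math.log(relevant_odds / nonrelevant_odds)

# Hypothetical counts: 8 of 10 relevant docs and 20 of 100 docs overall
# contain the term.
unsmoothed = term_weight(8, 10, 20, 100, smooth=0)
smoothed = term_weight(8, 10, 20, 100)
```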

Question 9

Q

Poisson distribution

Answer

A

models probability of k, the number of events occurring in a fixed interval of time/space, with known average rate lambda

e.g. number of cars arriving at a toll booth per minute

conditions:
- occurrences are independent, do not occur simultaneously
- rate is independent of any occurrence

if T (the interval) is large and p is small, we can approximate a binomial distribution with a Poisson where lambda = T*p
p(k) = (lambda^k / k!) * e^(-lambda)
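
A quick sketch comparing the binomial with its Poisson approximation for large T and small p. The values T = 1000 and p = 0.003 are arbitrary choices for illustration:

```python
import math

def poisson_pmf(k, lam):
    """p(k) = lambda^k / k! * e^(-lambda)"""
    return lam ** k / math.factorial(k) * math.exp(-lam)

def binomial_pmf(k, T, p):
    """Probability of exactly k successes in T independent trials."""
    return math.comb(T, k) * p ** k * (1 - p) ** (T - k)

T, p = 1000, 0.003   # illustrative values: many trials, small probability
lam = T * p          # rate of the approximating Poisson

for k in range(6):
    # Print the two probabilities side by side for comparison.
    print(k, round(binomial_pmf(k, T, p), 4), round(poisson_pmf(k, lam), 4))
```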

Question 10

Q

Poisson model

Answer

A

assume that term frequency for a term in a document follows a Poisson distribution

flaw: it works for general terms, but is a poor fit for topic-specific terms => define 2 classes of terms?

Question 11

Q

eliteness

Answer

A

a term is elite in a document if the document is about the concept denoted by the term
- eliteness is binary: a term is either elite in a document or not
- documents are composed of topical elite terms and supportive, more common terms

Question 12

Q

2-Poisson model

Answer

A

the term frequency follows a Poisson distribution with a different rate, depending on whether the term is elite in the document or not
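
One way to sketch this is as a mixture of two Poisson distributions. The parameter names (`pi_elite`, `lam_elite`, `lam_non_elite`) and the example values are assumptions for illustration:

```python
import math

def poisson_pmf(k, lam):
    return lam ** k / math.factorial(k) * math.exp(-lam)

def two_poisson_pmf(k, pi_elite, lam_elite, lam_non_elite):
    """P(tf = k) when a fraction pi_elite of documents is elite for the term."""
    # Elite documents draw the term frequency from a Poisson with a higher
    # rate; the remaining documents use the lower, background rate.
    return (pi_elite * poisson_pmf(k, lam_elite)
            + (1 - pi_elite) * poisson_pmf(k, lam_non_elite))

# Hypothetical setting: 10% of documents are about the term's topic.
probs = [two_poisson_pmf(k, 0.1, 6.0, 0.3) for k in range(40)]
```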

Question 13

Q

BM25

Answer

A

takes into account:
- tf
- idf
- document length
- parameter k (usually written k1) controls the term frequency scaling (how the relevance changes with the term frequency)
- parameter b controls the document length normalization (b = 0: no normalization, b = 1: full normalization)
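
A minimal sketch of how these pieces can combine into a score. It uses one common BM25 variant; the idf form with "+ 1", the defaults k1 = 1.2 and b = 0.75, and the toy documents are all assumptions, not taken from the cards:

```python
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Score one tokenised document against a tokenised query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N        # average document length
    score = 0.0
    for t in set(query):
        tf = doc.count(t)                        # term frequency in this doc
        if tf == 0:
            continue
        n = sum(1 for d in docs if t in d)       # docs containing the term
        # idf variant with "+ 1" inside the log, which keeps it positive
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)
        # length normalisation: b = 0 ignores length, b = 1 normalises fully
        norm = 1 - b + b * len(doc) / avgdl
        # tf component saturates as tf grows; k1 controls how quickly
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score

# Toy collection, for illustration only.
docs = [
    ["bm25", "ranking", "bm25"],
    ["poisson", "model"],
    ["ranking", "model", "terms", "length"],
]
scores = [bm25_score(["bm25", "ranking"], d, docs) for d in docs]
```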

(13 cards)