C6: BIM & BM25 Flashcards
the idea of document ranking
rank documents by their probability of relevance to the information need: P(R=1 | document, query)
R=1: the document is relevant (the user is happy with it)
R=0: the document is not relevant
p(A, B)
p(A ∩ B) = p(A|B) p(B) = p(B|A) p(A)
p(A|B) = p(B|A) p(A) / p(B)
p(A, B|C) = p(B|A,C) p(A|C)
p(R=1|x): the probability that document x is relevant; this is what we want to estimate
Binary independence model (BIM)
- documents are represented as binary incidence vectors of terms (x_t = 1 if term t occurs in the document, else 0)
- independence means that terms occur in documents independently of one another
BIM: model design
given query q, compute for each document p(R|q,x) where x is the binary term incidence vector representing the doc
odds and Bayes: O(R|q,x) = p(R=1|q,x) / p(R=0|q,x)
= [p(R=1|q) p(x|R=1,q)] / [p(R=0|q) p(x|R=0,q)]
= O(R|q) * prod_{t=1..n} [p(x_t|R=1,q) / p(x_t|R=0,q)]
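A sketch of the step that connects this product to a sum of per-term weights. It writes p_t = p(x_t = 1|R=1, q) and u_t = p(x_t = 1|R=0, q), and assumes (as is usual for BIM, though not stated in these notes) that p_t = u_t for terms that do not occur in the query:

```latex
\begin{align*}
O(R \mid q, x)
  &= O(R \mid q) \prod_{t:\, x_t = 1} \frac{p_t}{u_t}
     \prod_{t:\, x_t = 0} \frac{1 - p_t}{1 - u_t} \\
  &= O(R \mid q) \prod_{t:\, x_t = q_t = 1} \frac{p_t}{u_t}
     \prod_{t:\, x_t = 0,\, q_t = 1} \frac{1 - p_t}{1 - u_t} \\
  &= O(R \mid q) \prod_{t:\, x_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)}
     \prod_{t:\, q_t = 1} \frac{1 - p_t}{1 - u_t}
\end{align*}
```

O(R|q) and the last product are the same for every document, so only the middle product affects the ranking; taking its logarithm turns it into a sum over the terms shared by query and document.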
Retrieval Status Value
RSV = sum_{t: x_t = q_t = 1} log [ p_t (1 - u_t) / (u_t (1 - p_t)) ] = sum_{t: x_t = q_t = 1} c_t
where p_t = p(x_t = 1|R=1, q)
and u_t = p(x_t = 1|R=0, q)
c_t is the term weight we want to estimate
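A minimal sketch of the RSV as a sum of term weights over the terms that occur in both the query and the document. The function name `rsv`, the `term_weights` dictionary, and the example weights are hypothetical and only serve as illustration:

```python
def rsv(doc_terms, query_terms, term_weights):
    """Sum the weights c_t over terms with x_t = q_t = 1.

    doc_terms / query_terms: iterables of terms (binary incidence, so
    duplicates are ignored). term_weights: hypothetical dict term -> c_t.
    """
    shared = set(doc_terms) & set(query_terms)
    # Terms without a known weight contribute nothing (an assumption).
    return sum(term_weights.get(term, 0.0) for term in shared)


# Made-up weights, for illustration only.
weights = {"bm25": 2.0, "ranking": 1.5, "the": 0.1}
score = rsv({"bm25", "the", "poisson"}, {"bm25", "ranking"}, weights)
```

Only "bm25" occurs in both the document and the query here, so the score is just its weight.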
the goal of BM25
be sensitive to term frequency and document length while not adding too many parameters
how to estimate c_t?
p_t = p(x_t = 1|R=1, q) = s / S = nr of relevant docs that contain term t / total nr of relevant docs
u_t = p(x_t = 1|R=0, q) = (n - s) / (N - S) = nr of non-relevant docs that contain term t / total nr of non-relevant docs
(N = total nr of docs, n = nr of docs that contain term t)
c_t = log [ (s / (S - s)) / ((n - s) / (N - n - S + s)) ]
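A small sketch of this estimate from the four counts, where N is the total number of docs, n the number of docs containing the term, S the number of relevant docs, and s the number of relevant docs containing the term. The function name and the optional `smooth` argument are assumptions; adding 0.5 to every cell is a common way to avoid division by zero, but it is not part of the notes above:

```python
import math


def term_weight(s, S, n, N, smooth=0.0):
    """Estimate c_t = log[(s / (S - s)) / ((n - s) / (N - n - S + s))].

    smooth is added to each of the four cells (assumption; 0.5 is a
    common choice). With smooth=0 the plain formula is used, which
    fails when any cell is zero.
    """
    relevant_odds = (s + smooth) / (S - s + smooth)                # p_t / (1 - p_t)
    non_relevant_odds = (n - s + smooth) / (N - n - S + s + smooth)  # u_t / (1 - u_t)
    return math.log(relevant_odds / non_relevant_odds)


# Made-up counts: 100 docs, 10 relevant, term in 20 docs, 8 of them relevant.
c_t = term_weight(s=8, S=10, n=20, N=100)
```

With these counts the relevant odds are 8/2 and the non-relevant odds are 12/78, so c_t is the logarithm of their ratio.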
Poisson distribution
models probability of k, the number of events occurring in a fixed interval of time/space, with known average rate lambda
eg. number of cars arriving at a toll booth per minute
- occurrences are independent, do not occur simultaneously
- rate is independent of any occurrence
if T (the interval) is large and p is small, we can approximate a binomial distribution with a Poisson where lambda = T*p
p(k) = (lambda^k / k!) * e^-lambda
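A short sketch of the Poisson probability and of the binomial approximation mentioned above. The function names and the example values (T = 1000, p = 0.003) are chosen only for illustration:

```python
import math


def poisson_pmf(k, lam):
    """p(k) = (lam^k / k!) * e^(-lam)."""
    return lam ** k / math.factorial(k) * math.exp(-lam)


def binomial_pmf(k, T, p):
    """Probability of k successes in T independent trials with success rate p."""
    return math.comb(T, k) * p ** k * (1 - p) ** (T - k)


# Large T and small p: the two should be close when lam = T * p.
T, p = 1000, 0.003
gap = abs(binomial_pmf(2, T, p) - poisson_pmf(2, T * p))
```

For these values the two probabilities agree to roughly three decimal places, which is the sense in which the Poisson approximates the binomial.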
Poisson model
assume that term frequency for a term in a document follows a Poisson distribution
flaw: it works for general terms, but is a poor fit for topic-specific terms => define 2 classes of terms?
a term is elite in a document if the document is about the concept denoted by the term
- eliteness is binary: a term is either elite in a document or not
- documents are composed of topical elite terms and supportive, more common terms
2-Poisson model
the distribution is different, depending on whether the term is elite or not
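One way to write this down is as a mixture of two Poisson distributions, one for documents where the term is elite and one for the rest. The sketch below assumes that formulation; the parameter names `pi_elite`, `lam_elite`, and `lam_non_elite` and the example values are hypothetical:

```python
import math


def poisson_pmf(k, lam):
    """p(k) = (lam^k / k!) * e^(-lam)."""
    return lam ** k / math.factorial(k) * math.exp(-lam)


def two_poisson_pmf(k, pi_elite, lam_elite, lam_non_elite):
    """Probability of term frequency k under a two-component mixture.

    pi_elite: probability that the term is elite in a document.
    lam_elite / lam_non_elite: average term frequency in each class.
    """
    return (pi_elite * poisson_pmf(k, lam_elite)
            + (1 - pi_elite) * poisson_pmf(k, lam_non_elite))


# Made-up parameters: the term is elite in 20% of documents.
p_tf5 = two_poisson_pmf(5, pi_elite=0.2, lam_elite=5.0, lam_non_elite=0.2)
```

With these made-up values, a term frequency of 5 is far more likely under the mixture than under the non-elite Poisson alone, which is the kind of topic-specific behaviour a single Poisson fits poorly.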
BM25 approximates the 2-Poisson model and takes into account:
- tf
- idf
- document length
- parameter k (usually written k_1) controls the term frequency scaling (how the relevance changes with the term frequency)
- parameter b controls the document length normalization
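A minimal sketch of a BM25 scoring function that combines the ingredients listed above. It uses the commonly cited form of the formula; `k1` plays the role of the parameter k, and the idf variant, the defaults k1 = 1.2 and b = 0.75, and all argument names are assumptions rather than something given in these notes:

```python
import math


def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len,
               doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one document for a query.

    doc_tf: dict term -> frequency in this document.
    doc_freq: dict term -> number of documents containing the term.
    k1, b: typical default values (assumption).
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        n = doc_freq.get(term, 0)
        # One common smoothed idf variant (assumption); never negative.
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        # b = 0 switches length normalization off, b = 1 applies it fully.
        length_norm = 1 - b + b * doc_len / avg_doc_len
        # The tf component saturates as tf grows; k1 sets how fast.
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Made-up collection statistics, for illustration only.
df = {"poisson": 10}
short_doc = bm25_score(["poisson"], {"poisson": 3}, 50, 100, df, 1000)
long_doc = bm25_score(["poisson"], {"poisson": 3}, 200, 100, df, 1000)
```

With the same term frequency, the shorter document scores higher than the longer one; setting b = 0 removes that difference.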