C6: BIM & BM25 Flashcards

1
Q

the idea of document ranking

A

rank documents by their probability of relevance to the information need: P(R=1 | document, query)
R=1 means the document is relevant (the user is happy with it)
R=0 means the document is not relevant

2
Q

p(A, B)

A

p(A ∩ B) = p(A|B) p(B) = p(B|A) p(A)

3
Q

p(A|B)

A

p(B|A) p(A) / p(B)

4
Q

P(A,B|C)

A

p(B|A,C) p(A|C)

5
Q

p(R=1|x)

A

the probability that a document with representation x is relevant; this is the quantity we want to estimate

6
Q

Binary independence model (BIM)

A
  • documents are represented as binary term incidence vectors (x_t = 1 if term t occurs in the document, 0 otherwise)
  • independence means that terms are assumed to occur in documents independently of each other
7
Q

BIM: model design

A

given a query q, compute p(R=1|q,x) for each document, where x is the binary term incidence vector representing the document

odds and Bayes:
O(R|q,x) = p(R=1|q,x) / p(R=0|q,x)
= [p(R=1|q) p(x|R=1,q)] / [p(R=0|q) p(x|R=0,q)]
= O(R|q) * product over t = 1 to n of [p(x_t|R=1,q) / p(x_t|R=0,q)]

the last step uses the term independence assumption
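
A minimal sketch of the odds computation above, assuming the per-term probabilities are already known. The function name `bim_odds` and the inputs `p` (probability that each term occurs in a relevant doc) and `u` (same for non-relevant docs) are illustrative names, not part of the card.

```python
from math import prod

def bim_odds(x, p, u, prior_odds=1.0):
    """Odds of relevance O(R|q,x) under the term independence assumption.

    x: binary term incidence vector of the document
    p: p[t] = p(x_t = 1 | R=1, q)  (assumed given)
    u: u[t] = p(x_t = 1 | R=0, q)  (assumed given)
    prior_odds: O(R|q), constant for a given query
    """
    ratio = prod(
        (p_t if x_t else 1 - p_t) / (u_t if x_t else 1 - u_t)
        for x_t, p_t, u_t in zip(x, p, u)
    )
    return prior_odds * ratio

# Term 1 is present and favours relevance; term 2 is absent and uninformative.
odds = bim_odds(x=[1, 0], p=[0.8, 0.5], u=[0.2, 0.5])
```

Because O(R|q) is the same for every document under one query, it can be dropped when only the ranking matters.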

8
Q

RSV

A

Retrieval Status Value
RSV = sum over terms t with x_t = q_t = 1 of log [p_t (1 - u_t)] / [u_t (1 - p_t)] = sum of c_t

where p_t = p(x_t = 1|R=1, q)
and u_t = p(x_t = 1|R=0, q)
c_t is the term weight we want to estimate
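
As a rough illustration, the RSV sum could be computed like this. The function name `rsv` and the dictionary inputs are hypothetical; p and u are assumed to be known here, although in practice they have to be estimated.

```python
from math import log

def rsv(doc_terms, query_terms, p, u):
    """Sum the term weights c_t over terms that occur in both doc and query.

    p[t] = p(x_t = 1 | R=1, q), u[t] = p(x_t = 1 | R=0, q), both assumed given.
    """
    score = 0.0
    for t in set(doc_terms) & set(query_terms):
        c_t = log(p[t] * (1 - u[t]) / (u[t] * (1 - p[t])))
        score += c_t
    return score

p = {"bm25": 0.8, "the": 0.5}
u = {"bm25": 0.2, "the": 0.5}
score = rsv(["bm25", "the", "model"], ["bm25", "the"], p, u)
```

A term with p_t = u_t contributes log 1 = 0, so it does not affect the ranking.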

9
Q

the goal of BM25

A

be sensitive to term frequency and document length while not adding too many parameters

10
Q

how to estimate c_t?

A

p_t = the probability that term t occurs in a doc, given q and R=1 = s / S = number of relevant docs that contain t / total number of relevant docs

u_t = the probability that term t occurs in a doc, given q and R=0 = (n-s) / (N-S) = number of non-relevant docs that contain t / total number of non-relevant docs

c_t = log [ (s/(S-s)) / ((n-s)/(N-n-S+s)) ]

where N = total number of docs and n = number of docs that contain t
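
A small sketch of this estimate, assuming relevance judgements are available so that the counts s, S, n and N are known. The `smooth` argument is an addition not stated on the card: adding 0.5 to each count is a common way to avoid division by zero, and `smooth=0` gives the unsmoothed log odds ratio.

```python
from math import log

def term_weight(s, S, n, N, smooth=0.5):
    """Estimate c_t as a log odds ratio from document counts.

    s: relevant docs containing the term
    S: relevant docs in total
    n: docs containing the term
    N: docs in total
    smooth: constant added to each count (assumption, not from the card)
    """
    odds_relevant = (s + smooth) / (S - s + smooth)
    odds_non_relevant = (n - s + smooth) / (N - n - S + s + smooth)
    return log(odds_relevant / odds_non_relevant)

# 8 of 10 relevant docs contain the term, but only 12 of 90 non-relevant docs do.
c_t = term_weight(s=8, S=10, n=20, N=100)
```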

11
Q

Poisson distribution

A

models the probability of k events occurring in a fixed interval of time or space, given a known average rate lambda

e.g. the number of cars arriving at a toll booth per minute

conditions:
- occurrences are independent, do not occur simultaneously
- rate is independent of any occurrence

if T (the number of trials, e.g. the length of the interval) is large and p (the probability of an event per trial) is small, we can approximate a binomial distribution with a Poisson where lambda = T*p
p(k) = (lambda^k / k!) * e^(-lambda)
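
The formula and the binomial approximation can be checked numerically with a short sketch. The function names and the example values T = 1000, p = 0.002 are chosen only for illustration.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """p(k) = (lambda^k / k!) * e^(-lambda)"""
    return lam**k / factorial(k) * exp(-lam)

def binomial_pmf(k, T, p):
    """Probability of exactly k successes in T independent trials."""
    return comb(T, k) * p**k * (1 - p) ** (T - k)

T, p = 1000, 0.002           # many trials, small per-trial probability
lam = T * p                  # lambda = T * p = 2.0
for k in range(5):
    print(k, round(binomial_pmf(k, T, p), 4), round(poisson_pmf(k, lam), 4))
```

With T large and p small, the two columns printed by the loop should agree closely.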

12
Q

Poisson model

A

assume that the term frequency of a term in a document follows a Poisson distribution

flaw: it is a reasonable fit for general terms, but a poor fit for topic-specific terms => define 2 classes of terms?

13
Q

eliteness

A

a term is elite in a document if the document is about the concept denoted by the term
- eliteness is binary: a term is either elite in a document or not
- documents are composed of topical elite terms and supportive, more common terms

14
Q

2-Poisson model

A

the term frequency follows one of two different Poisson distributions, depending on whether the term is elite in the document or not
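
One way to picture this is as a mixture of two Poisson distributions. The sketch below assumes a mixing weight `pi` (the probability that the term is elite in a document) and two rates `lam_elite` and `lam_non_elite`; all three names and the example values are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Poisson probability of observing k events at average rate lam."""
    return lam**k / factorial(k) * exp(-lam)

def two_poisson_pmf(k, pi, lam_elite, lam_non_elite):
    """Probability of term frequency k when eliteness is unobserved.

    pi: probability that the term is elite in the document (assumed)
    lam_elite: average term frequency when the term is elite (assumed)
    lam_non_elite: average term frequency when it is not (assumed)
    """
    return pi * poisson_pmf(k, lam_elite) + (1 - pi) * poisson_pmf(k, lam_non_elite)

# A topic-specific term: rare in most docs, frequent in the few docs about it.
probs = [two_poisson_pmf(k, pi=0.1, lam_elite=6.0, lam_non_elite=0.2) for k in range(10)]
```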

15
Q

BM25

A

takes into account:
- tf
- idf
- document length
- parameter k (usually written k_1) controls the term frequency scaling (how quickly the score saturates as the term frequency grows)
- parameter b (between 0 and 1) controls the document length normalization
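
A minimal sketch of how these ingredients can be combined, using one common BM25 formulation with idf = log(N / df). The card does not give the formula, so the exact form, the function name `bm25_score`, and the defaults k1 = 1.2 and b = 0.75 are assumptions; other variants use a different idf or add a query term frequency component.

```python
from math import log

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Score one document against a query with a basic BM25 variant.

    doc_freq[t]: number of docs containing term t
    n_docs: total number of docs in the collection
    avg_len: average document length in the collection
    k1: term frequency scaling (k1 = 0 ignores tf beyond presence)
    b: length normalization (b = 0 ignores document length)
    """
    doc_len = len(doc_terms)
    score = 0.0
    for t in set(query_terms):
        tf = doc_terms.count(t)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        idf = log(n_docs / doc_freq[t])
        length_norm = k1 * ((1 - b) + b * doc_len / avg_len)
        score += idf * (k1 + 1) * tf / (length_norm + tf)
    return score

doc = ["bm25", "ranks", "documents", "bm25"]
score = bm25_score(["bm25"], doc, doc_freq={"bm25": 10}, n_docs=100, avg_len=4)
```

In this form a second occurrence of a term raises the score by less than the first one did, which is the saturation effect that k1 controls.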
