C7: generative language models
what is a language model?
a simplified statistical model of text
- data-driven as opposed to rule-based
- local context predicts the following words
- can be used to compute the probability of observing a sentence under a model of a language (fragment), as opposed to judging the syntactic well-formedness of that string
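A minimal sketch of this idea, assuming a hypothetical toy corpus and maximum-likelihood unigram estimates (memory = 0, i.e. terms are independent):

```python
from collections import Counter

def unigram_lm(text):
    """Maximum-likelihood unigram model: P(term) = count / total."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: c / total for term, c in counts.items()}

def sentence_probability(sentence, lm):
    """P(sentence) with memory = 0: a product of per-term probabilities."""
    p = 1.0
    for term in sentence.lower().split():
        p *= lm.get(term, 0.0)  # unseen terms get probability 0 (see smoothing)
    return p

lm = unigram_lm("the cat sat on the mat")   # hypothetical toy corpus
print(sentence_probability("the cat", lm))  # (2/6) * (1/6) ~= 0.056
```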
language model application in IR
- each document is represented by a language model
- rank documents according to P(D|Q) = P(Q|D) P(D) / P(Q)
- a simple model with memory = 0 (a unigram model: terms are chosen independently) works surprisingly well
query-likelihood model
rank documents by the probability that the query was generated by the document's language model
$RSV(Q,D) = \prod_{i=1}^{n} P(q_i \mid D)$
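A sketch of query-likelihood ranking with hypothetical toy document models; scores are computed in log space to avoid floating-point underflow. Note that with a uniform prior $P(D)$, ranking by $P(Q \mid D)$ gives the same order as ranking by $P(D \mid Q)$ from the previous card.

```python
import math

def query_log_likelihood(query_terms, doc_lm):
    """log RSV(Q,D) = sum_i log P(q_i|D).
    Any query term unseen in the document zeroes the whole score,
    which is exactly why smoothing is applied (next cards)."""
    score = 0.0
    for q in query_terms:
        p = doc_lm.get(q, 0.0)
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score

# hypothetical toy document models: term -> probability
doc_lms = {
    "d1": {"cat": 0.5, "mat": 0.5},
    "d2": {"cat": 0.25, "dog": 0.75},
}
query = ["cat", "mat"]
print(sorted(doc_lms, reverse=True,
             key=lambda d: query_log_likelihood(query, doc_lms[d])))
# ['d1', 'd2'] -- d2 lacks "mat" and scores -inf
```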
why do we apply smoothing?
document texts are only a sample from the language model => words missing from the text should not have zero probability of occurring
smoothing: technique for estimating probabilities for missing words
- lower (or discount) the probability estimates for words that are seen in the document text
- assign that “left-over” probability to the estimates for the words that are not seen in the text
what is the problem with discounting probability estimates?
all unseen terms are assigned an equal probability
new estimate for unseen terms: $\lambda\, P(q_i \mid C)$
this is the background probability: the probability of query term $q_i$ under the collection language model
JM smoothing
$P(q_1 \ldots q_n \mid D) = \prod_{j=1}^{n} \left[ (1-\lambda)\, P(q_j \mid D) + \lambda\, P(q_j \mid C) \right]$
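A sketch of the smoothed score under the toy models from the earlier cards; $\lambda = 0.5$ is an arbitrary choice here (in practice it is tuned):

```python
import math

def jm_log_likelihood(query_terms, doc_lm, coll_lm, lam=0.5):
    """Jelinek-Mercer smoothing: mix the document model with the
    collection (background) model so unseen terms keep nonzero mass."""
    score = 0.0
    for q in query_terms:
        p = (1 - lam) * doc_lm.get(q, 0.0) + lam * coll_lm.get(q, 0.0)
        if p == 0.0:      # term absent from the whole collection
            continue      # one common choice: simply skip such terms
        score += math.log(p)
    return score

coll_lm = {"cat": 0.2, "mat": 0.1, "dog": 0.3}   # toy collection model
print(jm_log_likelihood(["cat", "mat"], {"cat": 0.5, "mat": 0.5}, coll_lm))
print(jm_log_likelihood(["cat", "mat"], {"cat": 0.25, "dog": 0.75}, coll_lm))
# the second document now gets a finite score despite missing "mat"
```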
CLIR
Cross-Language Information Retrieval: the query and the documents are written in different languages => their language models are instances of different feature spaces
solution:
1. translate the documents or the query (see the sketch after this list)
2. map one language model onto the other
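A toy sketch of solution 1 as dictionary-based query translation; the bilingual dictionary entries below are invented for illustration:

```python
def translate_query(query_terms, translation_table):
    """Dictionary-based query translation: replace each source-language
    term by its target-language candidates, then retrieve as usual."""
    translated = []
    for term in query_terms:
        translated.extend(translation_table.get(term, [term]))  # keep unknowns
    return translated

# invented toy bilingual dictionary (Dutch -> English)
table = {"kat": ["cat"], "huis": ["house", "home"]}
print(translate_query(["kat", "huis"], table))  # ['cat', 'house', 'home']
```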
relevance model
a language model representing the information need
1. first-pass ranking
2. estimate a relevance model from the query and the top-ranked documents (sketched below)
3. (re)rank documents by the similarity of each document model to the relevance model
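A rough sketch of step 2, roughly in the spirit of RM1-style pseudo-relevance feedback; weighting documents by their (normalized) first-pass scores is one plausible choice, not the only one:

```python
from collections import defaultdict

def relevance_model(top_doc_lms, doc_weights):
    """Estimate a relevance model as a weighted mixture of the
    top-ranked documents' language models; doc_weights could be
    (exponentiated) first-pass retrieval scores."""
    total = sum(doc_weights.values())
    rm = defaultdict(float)
    for doc, lm in top_doc_lms.items():
        w = doc_weights[doc] / total
        for term, p in lm.items():
            rm[term] += w * p
    return dict(rm)

top = {"d1": {"cat": 0.5, "mat": 0.5},
       "d2": {"cat": 0.25, "dog": 0.75}}
print(relevance_model(top, {"d1": 0.7, "d2": 0.3}))
# {'cat': 0.425, 'mat': 0.35, 'dog': 0.225} -- still sums to 1
```

Step 3 would then compare each document model to this relevance model, e.g. with a divergence measure such as KL divergence.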