C7: generative language models
what is a language model?
a simplified statistical model of text
- data-driven as opposed to rule-based
- local context predicts the following words
- can be used to compute the probability of observing a sentence under a model of a language (fragment), as opposed to judging the syntactic well-formedness of that string
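A minimal sketch of this idea, assuming a hypothetical toy corpus and maximum-likelihood unigram estimates (memory = 0, i.e. terms are independent):

```python
from collections import Counter

def unigram_lm(text):
    """Maximum-likelihood unigram model: P(term) = count / total."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: c / total for term, c in counts.items()}

def sentence_probability(sentence, lm):
    """P(sentence) with memory = 0: a product of per-term probabilities."""
    p = 1.0
    for term in sentence.lower().split():
        p *= lm.get(term, 0.0)  # unseen terms get probability 0 (see smoothing)
    return p

lm = unigram_lm("the cat sat on the mat")   # hypothetical toy corpus
print(sentence_probability("the cat", lm))  # (2/6) * (1/6) ~= 0.056
```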
language model application in IR
- each document is represented by a language model
- rank documents according to P(D|Q) = P(Q|D) P(D) / P(Q)
- a simple model with memory = 0 (a unigram model: terms are chosen independently) works surprisingly well
query-likelihood model
rank documents by the probability that the query was generated by the document's language model
$RSV(Q,D) = \prod_{i=1}^{n} P(q_i \mid D)$
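A sketch of query-likelihood ranking with hypothetical toy document models; scores are computed in log space to avoid floating-point underflow. Note that with a uniform prior $P(D)$, ranking by $P(Q \mid D)$ gives the same order as ranking by $P(D \mid Q)$ from the previous card.

```python
import math

def query_log_likelihood(query_terms, doc_lm):
    """log RSV(Q,D) = sum_i log P(q_i|D).
    Any query term unseen in the document zeroes the whole score,
    which is exactly why smoothing is applied (next cards)."""
    score = 0.0
    for q in query_terms:
        p = doc_lm.get(q, 0.0)
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score

# hypothetical toy document models: term -> probability
doc_lms = {
    "d1": {"cat": 0.5, "mat": 0.5},
    "d2": {"cat": 0.25, "dog": 0.75},
}
query = ["cat", "mat"]
print(sorted(doc_lms, reverse=True,
             key=lambda d: query_log_likelihood(query, doc_lms[d])))
# ['d1', 'd2'] -- d2 lacks "mat" and scores -inf
```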
why do we apply smoothing?
document texts are only a sample from the language model => words missing from the text should not have zero probability of occurring
smoothing: technique for estimating probabilities for missing words
- lower (or discount) the probability estimates for words that are seen in the document text
- assign that “left-over” probability to the estimates for the words that are not seen in the text
what is the problem with discounting probability estimates?
all unseen terms are assigned an equal probability
new estimate for unseen terms: $\lambda\, P(q_i \mid C)$
this is the background probability: the probability of query term $q_i$ under the collection language model
JM smoothing
$P(q_1 \ldots q_n \mid D) = \prod_{j=1}^{n} \left[ (1-\lambda)\, P(q_j \mid D) + \lambda\, P(q_j \mid C) \right]$
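A sketch of the smoothed score under the toy models from the earlier cards; $\lambda = 0.5$ is an arbitrary choice here (in practice it is tuned):

```python
import math

def jm_log_likelihood(query_terms, doc_lm, coll_lm, lam=0.5):
    """Jelinek-Mercer smoothing: mix the document model with the
    collection (background) model so unseen terms keep nonzero mass."""
    score = 0.0
    for q in query_terms:
        p = (1 - lam) * doc_lm.get(q, 0.0) + lam * coll_lm.get(q, 0.0)
        if p == 0.0:      # term absent from the whole collection
            continue      # one common choice: simply skip such terms
        score += math.log(p)
    return score

coll_lm = {"cat": 0.2, "mat": 0.1, "dog": 0.3}   # toy collection model
print(jm_log_likelihood(["cat", "mat"], {"cat": 0.5, "mat": 0.5}, coll_lm))
print(jm_log_likelihood(["cat", "mat"], {"cat": 0.25, "dog": 0.75}, coll_lm))
# the second document now gets a finite score despite missing "mat"
```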
CLIR
Cross-Language Information Retrieval: the query and the documents are written in different languages => their language models are instances of different feature spaces
solution:
1. translate the documents or the query (see the sketch after this list)
2. map one language model onto the other
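A toy sketch of solution 1 as dictionary-based query translation; the bilingual dictionary entries below are invented for illustration:

```python
def translate_query(query_terms, translation_table):
    """Dictionary-based query translation: replace each source-language
    term by its target-language candidates, then retrieve as usual."""
    translated = []
    for term in query_terms:
        translated.extend(translation_table.get(term, [term]))  # keep unknowns
    return translated

# invented toy bilingual dictionary (Dutch -> English)
table = {"kat": ["cat"], "huis": ["house", "home"]}
print(translate_query(["kat", "huis"], table))  # ['cat', 'house', 'home']
```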
relevance model
a language model representing the information need
1. first-pass ranking
2. estimate a relevance model from the query and the top-ranked documents (sketched below)
3. (re)rank documents by the similarity of each document model to the relevance model
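A rough sketch of step 2, roughly in the spirit of RM1-style pseudo-relevance feedback; weighting documents by their (normalized) first-pass scores is one plausible choice, not the only one:

```python
from collections import defaultdict

def relevance_model(top_doc_lms, doc_weights):
    """Estimate a relevance model as a weighted mixture of the
    top-ranked documents' language models; doc_weights could be
    (exponentiated) first-pass retrieval scores."""
    total = sum(doc_weights.values())
    rm = defaultdict(float)
    for doc, lm in top_doc_lms.items():
        w = doc_weights[doc] / total
        for term, p in lm.items():
            rm[term] += w * p
    return dict(rm)

top = {"d1": {"cat": 0.5, "mat": 0.5},
       "d2": {"cat": 0.25, "dog": 0.75}}
print(relevance_model(top, {"d1": 0.7, "d2": 0.3}))
# {'cat': 0.425, 'mat': 0.35, 'dog': 0.225} -- still sums to 1
```

Step 3 would then compare each document model to this relevance model, e.g. with a divergence measure such as KL divergence.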