Probabilistic Model Flashcards
What is a sample space?
The set of all possible outcomes of a random experiment
What is an event?
A subset of the sample space, i.e. a collection of outcomes
When is an event said to occur?
If the outcome of the random experiment is a member of the event set
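A minimal Python sketch of the three cards above, assuming one roll of a six-sided die as the random experiment. The die and the names sample_space, even, and occurred are illustrative choices, not part of the cards.

```python
# Sample space: every possible outcome of one die roll (illustrative assumption).
sample_space = {1, 2, 3, 4, 5, 6}

# Event: a subset of the sample space, here "the roll is even".
even = {2, 4, 6}


def occurred(event, outcome):
    """An event occurs when the observed outcome is a member of the event set."""
    return outcome in event


print(even <= sample_space)  # is the event a subset of the sample space?
print(occurred(even, 4))     # outcome 4 belongs to the event
print(occurred(even, 3))     # outcome 3 does not
```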
How is relevance determined by a probabilistic retrieval model?
Using the probability that a user who likes document d would enter query q
Relevance(q,d) = p(q|d)
What is the assumption made with the probabilistic model?
A user is formulating their query based on an imaginary relevant document
What is a statistical language model?
Represents a probability distribution over word sequences.
Ex: p(“Today is Wednesday”) = 0.001 but p(“Today Wednesday is”) = 0.000000001
What is a language model?
A probabilistic model that estimates the likelihood of a sequence of words based on patterns observed in training data. Higher probabilities are given to more likely word sequences
What is the unigram language model?
A language model that generates text one word at a time, with each word being chosen independently from a distribution of words
How is probability of a phrase generated using a unigram language model?
Multiply the probabilities of the individual words
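The multiplication can be sketched in Python as follows. The word probabilities in unigram are made-up values for illustration, and phrase_probability is a hypothetical helper name.

```python
import math

# Hypothetical unigram distribution; the numbers are invented for illustration.
unigram = {"today": 0.02, "is": 0.05, "wednesday": 0.001}


def phrase_probability(words, model):
    """Multiply per-word probabilities; each word is treated as independent."""
    # Words missing from the model get probability 0.0 in this sketch.
    return math.prod(model.get(w, 0.0) for w in words)


p_ordered = phrase_probability(["today", "is", "wednesday"], unigram)
p_shuffled = phrase_probability(["today", "wednesday", "is"], unigram)
print(p_ordered, p_shuffled)
```

Because each word is drawn independently, a unigram model ignores word order: both orderings receive the same probability, up to floating-point rounding.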
How is the probability of a word determined with a unigram language model?
Based on frequencies within a corpus of text that is relevant to the topic in question
What is the maximum likelihood estimator?
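A method for estimating the probabilities of words in a unigram LM from their relative frequencies in a document
p(w|d) = c(w,d) / |d|, where c(w,d) is the number of times word w occurs in document d and |d| is the total number of words in d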
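A minimal sketch of this estimate in Python, assuming the document is already split into tokens. The function name mle_unigram and the example sentence are illustrative.

```python
from collections import Counter


def mle_unigram(doc_tokens):
    """Return p(w|d) = c(w,d) / |d| for every word that occurs in the document."""
    counts = Counter(doc_tokens)   # c(w,d) for each distinct word
    total = len(doc_tokens)        # |d|, the document length in words
    return {w: c / total for w, c in counts.items()}


doc = "the cat sat on the mat".split()
model = mle_unigram(doc)
print(model["the"])   # 2 occurrences out of 6 words
print(model["cat"])   # 1 occurrence out of 6 words
```

Words that never occur in the document are simply absent from the returned dictionary, which corresponds to a probability of zero.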
What is the issue with the maximum likelihood estimator?
It assigns zero probability to unseen words. These are words which may be relevant but do not appear in the doc
What is topic modelling?
A technique used to identify the main themes or topics present in a collection of documents. Can be solved using language models. If you have a new document that frequently uses words with high probabilities under a certain topic, it probably belongs to that topic
What is association analysis?
Determining which words are semantically related to others. It analyzes the probabilities and patterns of word occurrences.
Uses the probability of a searched word to find other words with similar occurrence patterns
How can we use the maximum likelihood estimator on a multi-word query?
Get the maximum likelihood estimate for each word and multiply them together
What is the issue with using maximum likelihood estimator on a multi-word query?
If any query word doesn’t appear in the document, the entire likelihood goes to 0
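The zero-probability problem can be seen in a short Python sketch. The helper query_probability_mle and the example document are assumptions made for illustration.

```python
import math
from collections import Counter


def query_probability_mle(query_tokens, doc_tokens):
    """Product of per-word maximum likelihood estimates c(w,d) / |d|."""
    counts = Counter(doc_tokens)   # a Counter returns 0 for unseen words
    n = len(doc_tokens)
    return math.prod(counts[w] / n for w in query_tokens)


doc = "the cat sat on the mat".split()
seen = query_probability_mle(["cat", "mat"], doc)    # both words occur in the doc
unseen = query_probability_mle(["cat", "dog"], doc)  # "dog" never occurs
print(seen, unseen)
```

A single factor of zero wipes out the whole product, no matter how well the other query words match the document.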
How can we solve the issue with the maximum likelihood estimator?
By computing the query likelihood instead, which is how likely we are to observe a specific query from a doc model
How do you compute query likelihood?
Multiply the probabilities of each query word under the document's model, then rank documents from highest to lowest likelihood
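A minimal ranking sketch in Python, assuming whitespace tokenization and unsmoothed maximum likelihood estimates, so a document missing any query word still scores 0. The document names, texts, and function names are hypothetical.

```python
import math
from collections import Counter


def query_likelihood(query_tokens, doc_tokens):
    """p(q|d) as the product of p(w|d) = c(w,d) / |d| over the query words."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    return math.prod(counts[w] / n for w in query_tokens)


def rank(query, docs):
    """Score every document for the query and sort by descending likelihood."""
    q = query.lower().split()
    scored = [
        (name, query_likelihood(q, text.lower().split()))
        for name, text in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the cat chased the other cat",
    "d3": "dogs bark at night",
}
ranking = rank("the cat", docs)
for name, score in ranking:
    print(name, score)
```

In this toy collection, d2 ranks first because "cat" makes up a larger share of its words, and d3 ranks last because it contains neither query word.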