Probability and Language Modelling Flashcards
How can you generally obtain better language models? (3)
More training data: This is limited by resources/money/availability
Better models: This is limited by either one’s own knowledge or one’s resources (hardware for example)
Better estimation method: This is limited by either one’s own knowledge or one’s resources (hardware for example)
What is a language model?
A model that assigns a probability to the occurrence of sentences.
How many parameters does a mixture of two Gaussians have?
5: 2 for the means, 2 for the variances, and 1 for the mixing weight (the probability that a value is generated by one of the two Gaussians).
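A minimal Python sketch of the two-component mixture density (the names mu1, sigma1, mu2, sigma2, lam are illustrative); the five parameters are visible in the signature:

```python
import math

def mixture_density(x, mu1, sigma1, mu2, sigma2, lam):
    """Two-component Gaussian mixture: 2 means, 2 spread parameters, 1 mixing weight."""
    def gaussian(v, mu, sigma):
        return math.exp(-(v - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    # lam is the probability that a value comes from the first Gaussian
    return lam * gaussian(x, mu1, sigma1) + (1 - lam) * gaussian(x, mu2, sigma2)
```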
What is a model’s structure (or form)? What is a model’s parameters?
Structure: the form of the equations, usually determined by knowledge about the problem. Parameters: specific values set to variables in the model’s formulas that affect its behaviour/shape.
Relative Frequency Estimation formula
P(x) = count(x)/N where N is the total number of items in the dataset.
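A short Python sketch of relative frequency estimation over a toy corpus (the example data is made up):

```python
from collections import Counter

def relative_frequencies(items):
    """P(x) = count(x) / N, where N is the total number of items."""
    counts = Counter(items)
    n = sum(counts.values())
    return {item: c / n for item, c in counts.items()}

print(relative_frequencies("the cat sat on the mat".split()))
# {'the': 0.333..., 'cat': 0.166..., 'sat': 0.166..., 'on': 0.166..., 'mat': 0.166...}
```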
What is the likelihood?
The probability of observing a given set of data x under a distribution with parameters θ: L(θ|x) = P(x|θ).
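For independent observations the likelihood factorises into a product of per-item probabilities. A toy Python sketch for a Bernoulli parameter θ (the coin-flip example is an illustrative assumption, not from the cards):

```python
def bernoulli_likelihood(theta, data):
    """L(theta | x) = product over observations x_i of P(x_i | theta)."""
    likelihood = 1.0
    for x in data:
        likelihood *= theta if x == 1 else (1 - theta)
    return likelihood

# Three heads and one tail: maximised at theta = 0.75 (the relative frequency)
print(bernoulli_likelihood(0.75, [1, 1, 1, 0]))  # ~0.105
```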
Why is trying to use maximum-likelihood estimation to produce probabilities for each item in a large dataset difficult?
Each probability will be close to 0 since the normalisation term will be so large. This creates a sparse data problem.
What is a “bag-of-words” model?
A unigram model.
What assumption is necessary for the validity of an n-gram model?
The Markov assumption: “the next word depends only on the previous n-1 words”.
What is the noisy-channel framework?
A framework that assumes some input Y passes through a noisy/errorful encoding P(X|Y) to produce an output X. Given an observed output X, we want to recover the Y most likely to be the original input: argmax_Y P(Y|X).
How does the entropy of a non-uniform distribution over N outcomes compare to that of a uniform distribution?
Any non-uniform distribution over N outcomes has lower entropy than the corresponding uniform distribution
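A quick numerical check in Python (the example distributions are made up):

```python
import math

def entropy(probs):
    """H(p) = -sum_i p_i * log2(p_i), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform over 4 outcomes (the maximum)
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits: non-uniform, so lower
```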
Disadvantages of Language Models (3)
1) No way to share “information” about similar words. e.g.: the probabilities for “She bought a car” and “She bought a bicycle” would have to be calculated from scratch despite their obvious similarities.
(Solved by class-based language models - model classes of words instead of the words themselves and calculate probabilities of classes instead of words)
2) No way to condition on context with intervening words. e.g. Dr. Gertrude Smith vs Dr. Jane Smith: you would have to estimate “Dr. Gertrude”, “Dr. Jane”, “Gertrude Smith” and “Jane Smith” separately, so the model never learns that “Dr.” predicts “Smith”.
(Skip-Grams are a solution to this)
3) Cannot handle long-distance dependencies, since we limit the context to n-1 words.
What’s a parameter?
A variable in the equation that specifies a distribution - it is often determined from the observed data during probability estimation.
Given that X is an output sequence and Y is an input sequence.
In language modelling,
1) how is the language model defined?
2) how is the noise model defined?
3) what do we want to obtain?
Since there is noise added to the input sequence, we know that P(X) ≠ P(Y); the noise model is defined as P(X|Y)
The language model is estimated as P(Y)
We want to know the most likely input sequence, i.e. the Y that maximises P(Y|X)
For a particular input y and output x, how do we find P(y|x)?
Using Bayes’ rule, P(y|x) = P(x|y)P(y)/P(x); since P(x) does not depend on y, we find the value of y that maximises P(x|y)P(y)
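A minimal sketch of the resulting decoding rule, assuming we already have a candidate set and the two probability functions (the spelling-correction numbers below are invented for illustration):

```python
def noisy_channel_decode(x, candidates, noise_model, language_model):
    """Return the y maximising P(x|y) * P(y); P(x) is constant over y, so it is dropped."""
    return max(candidates, key=lambda y: noise_model(x, y) * language_model(y))

# Toy spelling correction: noise_model ~ P(observed typo | intended word),
# language_model ~ P(intended word); all numbers are made up
noise = {("teh", "the"): 0.6, ("teh", "tea"): 0.1}
lm = {"the": 0.05, "tea": 0.001}
print(noisy_channel_decode("teh", ["the", "tea"],
                           lambda x, y: noise.get((x, y), 0.0),
                           lambda y: lm.get(y, 0.0)))  # "the"
```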
Does the noise-model P(x|y) depend on the task?
Does the language model P(y) depend on the task?
The noise model depends on the task given
The language model is task independent.
What is the noise-model trained on?
What is the language model trained on?
Corpus data for both
If we can train P(X|Y), why can’t we just train P(Y|X)? (2)
- We can often do so
- Training P(X|Y) and P(Y|X) requires input/output pairs, which are often limited.
e.g. a spellchecking system needs correctly spelt words paired with incorrectly spelt words,
a translation system needs translations of words or sentences;
both of these need annotations, and someone to do the annotating.
- Sometimes we have this data, sometimes we don’t!
Language models can be trained on unannotated data - raw text is easy to get, unlike annotated data
Give a con and pro of a uni-gram language model
- Gives the same probability to sentences that contain the same words regardless of the order.
- Good for finding the general “aboutness” of an entire document: since it produces the most likely words from that document, you can guess roughly what the document is about.
What tasks are bag-of-words (uni-gram) language models good for?
Tasks like lexical disambiguation, text-classification and information retrieval
What is the independence assumption in a tri-gram and bi-gram model?
tri-gram: P(w_i | w_{i-1}, w_{i-2}), the probability of a word w_i depends only on the previous two words
bi-gram: P(w_i | w_{i-1}), the probability of a word w_i depends only on the previous word
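A small Python sketch of the maximum-likelihood bigram estimate implied by this assumption (toy corpus, no smoothing):

```python
from collections import Counter

def bigram_probability(tokens, w_prev, w):
    """MLE estimate: P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

tokens = "the cat sat on the mat".split()
print(bigram_probability(tokens, "the", "cat"))  # 0.5: "the" is followed by "cat" once out of 2
```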
How do we fix the problem of indicating whether a word makes sense at the beginning or end of a sentence?
Use start-of-sentence and end-of-sentence tags, e.g. <s> and </s>.
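A one-function sketch of the padding step, assuming the conventional <s> and </s> tags:

```python
def pad_sentence(tokens, n):
    """Add n-1 start-of-sentence tags and an end-of-sentence tag for an n-gram model."""
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

print(pad_sentence("the cat sat".split(), 2))
# ['<s>', 'the', 'cat', 'sat', '</s>'] -- "the" can now be conditioned on <s>
```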