Probability and Language Modelling Flashcards
How can you generally obtain better language models? (3)
More training data: This is limited by resources/money/availability
Better models: This is limited by one’s own knowledge or one’s resources (hardware, for example)
Better estimation methods: This is also limited by one’s own knowledge or one’s resources
What is a language model?
A model that assigns probabilities to sentences (sequences of words).
How many parameters does a mixture of two Gaussians have?
Five: two means, two variances, and one mixing weight (the probability of a value being drawn from one of the two Gaussians; the other component’s weight is 1 minus this, so it is not a free parameter).
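A minimal Python sketch of the density such a mixture defines, assuming univariate Gaussians; the parameter values in the example are purely illustrative:

```python
import math

# A two-component Gaussian mixture has five parameters:
# two means, two variances, and one mixing weight w
# (the second component's weight is 1 - w, so it is not free).

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x, mean1, var1, mean2, var2, w):
    """Density of the mixture w * N(mean1, var1) + (1 - w) * N(mean2, var2)."""
    return w * gaussian_pdf(x, mean1, var1) + (1 - w) * gaussian_pdf(x, mean2, var2)

# Evaluate the mixture at x = 0.5 with illustrative parameter values.
print(mixture_pdf(0.5, mean1=0.0, var1=1.0, mean2=3.0, var2=0.5, w=0.7))
```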
What is a model’s structure (or form)? What is a model’s parameters?
Structure: the form of the equations, usually determined by knowledge about the problem. Parameters: the specific values assigned to the variables in the model’s equations, which determine its behaviour/shape.
Relative Frequency Estimation formula
P(x) = count(x)/N where N is the total number of items in the dataset.
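A minimal sketch of this estimator in Python over a toy corpus (the sentence used is invented for illustration):

```python
from collections import Counter

# Relative frequency estimation: P(x) = count(x) / N.
tokens = "the cat sat on the mat".split()
counts = Counter(tokens)          # count(x) for each item x
N = len(tokens)                   # total number of items in the dataset

probs = {word: count / N for word, count in counts.items()}
print(probs["the"])  # count("the") / N = 2/6
```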
What is the likelihood?
The probability of observing a given dataset x under a distribution with parameters θ: L(θ|x) = P(x|θ)
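A small worked sketch, assuming a sequence of independent coin flips where θ is the probability of heads:

```python
# The likelihood L(theta | x) = P(x | theta) for a sequence of
# independent coin flips, with theta the probability of heads.
def likelihood(theta, flips):
    p = 1.0
    for flip in flips:
        p *= theta if flip == "H" else (1 - theta)
    return p

flips = ["H", "H", "T"]
print(likelihood(0.5, flips))   # 0.125
print(likelihood(2/3, flips))   # ~0.148, the maximum-likelihood value of theta
```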
Why is trying to use maximum-likelihood estimation to produce probabilities for each item in a large dataset difficult?
With a large dataset the normalisation term N is huge, so every estimated probability is tiny; worse, most items occur rarely or never, so their estimates are unreliable, and unseen items get probability exactly 0. This is the sparse data problem.
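A toy illustration of the problem, reusing relative-frequency (maximum-likelihood) estimation: anything unseen in the data gets probability exactly 0:

```python
from collections import Counter

# Sparse data: MLE assigns probability 0 to any item never seen in training.
tokens = "the cat sat on the mat".split()
counts = Counter(tokens)
N = len(tokens)

def mle_prob(word):
    return counts[word] / N  # Counter returns 0 for unseen words

print(mle_prob("the"))  # 2/6
print(mle_prob("dog"))  # 0.0 -- "dog" never occurred, so MLE rules it out
```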
What is a “bag-of-words” model?
A unigram model: word order is ignored, so a sentence is treated as an unordered “bag” of words.
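A minimal sketch of a unigram model over invented training data, showing that word order carries no information:

```python
from collections import Counter

# Unigram ("bag-of-words") model: a sentence's probability is the product
# of independent word probabilities, ignoring order entirely.
training = "the cat sat on the mat the dog sat".split()
counts = Counter(training)
N = len(training)

def unigram_prob(sentence):
    p = 1.0
    for word in sentence.split():
        p *= counts[word] / N
    return p

# Any reordering gets the same probability -- word order carries no information.
print(unigram_prob("the cat sat"))
print(unigram_prob("sat the cat"))
```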
What assumption is necessary for the validity of an n-gram model?
The Markov assumption of order n−1: “the next word depends only on the previous n−1 words”.
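A minimal bigram (n = 2) sketch over invented training data, with `<s>` and `</s>` as assumed sentence-boundary markers:

```python
from collections import Counter

# Bigram model under the Markov assumption: each word is conditioned
# only on the single preceding word.
training = "<s> the cat sat </s> <s> the dog sat </s>".split()
bigrams = Counter(zip(training, training[1:]))
unigrams = Counter(training)

def bigram_prob(sentence):
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigrams[(prev, word)] / unigrams[prev]  # P(word | prev)
    return p

print(bigram_prob("<s> the cat sat </s>"))  # (2/2)*(1/2)*(1/1)*(2/2) = 0.5
```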
What is the noisy-channel framework?
A framework that assumes some input Y passes through a noisy/errorful encoding P(X|Y), producing an output X. Given an observed output X, we want to recover the input most likely to have produced it: argmax_Y P(Y|X)
How does the entropy of a non-uniform distribution over N outcomes compare to that of a uniform distribution?
Any non-uniform distribution over N outcomes has lower entropy than the uniform distribution over the same N outcomes, whose entropy, log N, is the maximum possible.
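A quick numerical check in Python for N = 4 outcomes:

```python
import math

# Entropy of a uniform vs. a non-uniform distribution over N = 4 outcomes;
# the uniform distribution attains the maximum, log2(4) = 2 bits.
def entropy(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits (uniform)
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits (non-uniform, lower)
```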
Disadvantages of n-gram Language Models (3)
1) No way to share “information” about similar words. e.g.: The probabilities for “She bought a car” and “She bought a bicycle” would have to be calculated from scratch despite their obvious similarities.
(Solved by class-based language models - model classes of words instead of the words themselves and calculate probabilities of classes instead of words)
2) No way to condition on context with intervening words. e.g. “Dr. Gertrude Smith” vs “Dr. Jane Smith”: with bigrams you would have to estimate “Dr. Gertrude”, “Dr. Jane”, “Gertrude Smith” and “Jane Smith” separately, and “Smith” can never be conditioned on “Dr.” across the intervening first name.
(Skip-Grams are a solution to this)
3) Cannot handle long-distance dependencies, since the context is limited to the previous n−1 words.
What’s a parameter?
A variable in the equation that specifies a distribution; it is often determined from the observed data during probability estimation.
Given that X is an output sequence and Y is an input sequence.
In language modelling,
1) how is the language model defined?
2) how is the noise model defined?
3) what do we want to obtain?
1) The language model is the prior over input sequences, P(Y).
2) The noise model describes the errorful encoding of input into output, P(X|Y); because of this noise, P(X) ≠ P(Y).
3) We want the most likely original input given the observed output: the Y that maximises P(Y|X).
For a particular input y and output x, how do we find P(y|x)?
Using Bayes’ rule, P(y|x) = P(x|y)P(y)/P(x). Since P(x) does not depend on y, we can drop it and find the y that maximises P(x|y)P(y).
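A minimal sketch of this decoding rule; the candidate set and both probability tables below are invented purely for illustration:

```python
# Noisy-channel decoding: among candidate inputs y, pick the one
# maximising P(x | y) * P(y). All probabilities here are made up.
language_model = {"the cat": 0.6, "the cab": 0.4}          # P(y)
noise_model = {("the cat", "teh cat"): 0.3,                # P(x | y)
               ("the cab", "teh cat"): 0.1}

def decode(x, candidates):
    return max(candidates,
               key=lambda y: noise_model.get((y, x), 0) * language_model[y])

print(decode("teh cat", ["the cat", "the cab"]))  # "the cat" (0.18 > 0.04)
```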