Topic 2: N-gram modeling Flashcards
N-gram
An N-gram is a sequence of N tokens (words). Example: a bigram is a two-word sequence such as "please turn", and a trigram is a three-word sequence such as "please turn your".
N-gram model
An N-gram model is a statistical language model: it predicts the next word from the previous N-1 words.
Applications of n-gram models
Examples:
- spelling correction
- speech recognition
- augmentative communication
- machine translation
Simple n-grams
The task is to compute the probability of a word w given its history h: P(w | h)
Relative frequency counts
Estimate P(w | h) from counts: out of the times the history h appears in the corpus, how often is it followed by w? Example: estimate P(transparent | its water was so) as C(its water was so transparent) / C(its water was so).
Corpus based estimation
Take the counts from a very large corpus (e.g. the web). Even a huge corpus is too sparse for long histories, because language is productive and many perfectly good sentences never occur in it.
Easier Estimation
This utilizes the chain rule of probability: decompose the joint probability into a product of conditional probabilities. Example: P(its water was so transparent) = P(its) * P(water | its) * P(was | its water) * P(so | its water was) * P(transparent | its water was so)
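In general the chain rule (standard form, consistent with the example above) is:
P(w_1 w_2 ... w_n) = \prod_{k=1}^{n} P(w_k | w_1 ... w_{k-1})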
Intuition of n-gram model
Approximate the history by just the last few words, instead of conditioning on the entire word history.
Markov assumption
The N-gram model makes the Markov (independence) assumption: the probability of a future unit can be predicted without looking too far into the past, i.e. only the last N-1 words of the history matter.
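Written out in standard notation, the approximation is:
P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-1})                  (bigram)
P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-N+1} ... w_{n-1})    (general N-gram)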
Exercise: bigram model with maximum likelihood estimation (MLE)
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Calculate bigram probabilities from this corpus.
P(I|<s>) = 2/3 = 0.67   P(Sam|<s>) = ?   P(am|I) = ?
P(</s>|Sam) = ?   P(Sam|am) = ?   P(do|I) = ?
Answer: P(I|<s>) = 2/3 = 0.67, P(Sam|<s>) = 1/3 = 0.33, P(am|I) = 2/3 = 0.67, P(</s>|Sam) = 1/2 = 0.5, P(Sam|am) = 1/2 = 0.5, P(do|I) = 1/3 = 0.33
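A minimal Python sketch of this MLE bigram estimation (the corpus and expected values come from the exercise above; the function name bigram_prob and the use of Counter are just illustrative choices):

from collections import Counter

# Toy corpus from the exercise; <s> and </s> mark sentence boundaries.
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(tokens[:-1])            # history counts (</s> is never a history)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(word, prev):
    # Maximum likelihood estimate: P(word | prev) = C(prev word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("I", "<s>"))     # 2/3 ≈ 0.67
print(bigram_prob("am", "I"))      # 2/3 ≈ 0.67
print(bigram_prob("</s>", "Sam"))  # 1/2 = 0.5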
Relative Frequency
The MLE probability is obtained by dividing the observed frequency (count) of a sequence by the observed frequency of its prefix.
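For a bigram, this relative-frequency (MLE) estimate takes the standard form:
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})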
Bi-gram count exercise
Refer to the slides: we have a table of unigram counts and a table of bigram counts.
Each bigram probability is calculated by
dividing the value in the bigram table by the unigram count of the first word (the history).
What knowledge can be captured by N-gram probabilities?
- world knowledge
- syntax
- discourse
Evaluating language models: what are the 2 types of evaluation?
Extrinsic evaluation - embed the language model in an application and measure how much the application improves (end-to-end performance); this is expensive.
Intrinsic evaluation - train on a training set and measure the quality of the model on a test set, independent of any application.
Training and testing paradigm
- used to evaluate and compare different model architectures
- the corpus is divided into training, development, and test sets
- a common split is 80:20 (training : held-out)
Intuition of perplexity
Given the task of predicting the next word, the better model is the one that assigns higher probability to the word that actually occurs.
Perplexity as evaluation metric
The best language model is the one that best predicts an unseen test set.
Perplexity is the inverse probability of the test set, normalized by the number of words.
Perplexity calculation:
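For a test set W = w_1 w_2 ... w_N, the standard formula (matching the definition above) is:
PP(W) = P(w_1 w_2 ... w_N)^{-1/N}
and with a bigram model:
PP(W) = ( \prod_{i=1}^{N} 1 / P(w_i | w_{i-1}) )^{1/N}
Lower perplexity means the model assigns higher probability to the test set.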
Generalization
The statistical model is highly dependent on the training corpus.
It is of little use if the training set and the test set come from different genres,
e.g. a model trained on business-meeting transcripts but tested on movie dialogue.
Challenges in language modeling
- dynamically adapting to different genres
Zeros
One kind of generalization problem: zeros -
n-grams that occur in the test set but never occur in the training set.
Incorrect estimation
This is the problem caused by zeros:
the model underestimates the probability of all sorts of words that could occur.
Worse, if any word in the test set has probability 0, the probability of the entire test set is 0,
so perplexity cannot be computed (it would require dividing by zero).
Smoothing (Laplace)
Introduced to overcome the zero problem.
How is it done?
Add 1 to all the n-gram counts; the denominator is adjusted as well by adding V extra observations, where V is the vocabulary size.
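In the standard add-one (Laplace) bigram form:
P_Laplace(w_n | w_{n-1}) = ( C(w_{n-1} w_n) + 1 ) / ( C(w_{n-1}) + V )
where V is the number of word types in the vocabulary.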
Laplace smoothing exercise
Recompute each probability with the added count:
take the value in the bigram table plus 1,
and divide by the unigram count of the history plus V (the vocabulary size).
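A minimal Python sketch of add-one smoothing, reusing bigram_counts and unigram_counts from the MLE sketch above (the name laplace_bigram_prob and the way V is computed here are illustrative assumptions; the slides may define the vocabulary slightly differently):

def laplace_bigram_prob(word, prev, bigram_counts, unigram_counts, vocab_size):
    # Add-one estimate: (C(prev word) + 1) / (C(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

# One common convention: V = all distinct word types seen, including </s>.
V = len(set(unigram_counts) | {w for _, w in bigram_counts})
print(laplace_bigram_prob("do", "I", bigram_counts, unigram_counts, V))    # (1 + 1) / (3 + V)
print(laplace_bigram_prob("Sam", "do", bigram_counts, unigram_counts, V))  # unseen bigram now gets a small nonzero probability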