Statistical Language Modelling Flashcards
what is NLP
Natural Language Processing builds systems that use computational techniques to model and process natural language in an automated way.
what is word-level processing
Before doing any text processing, we need to prepare our input data by splitting it into sentences, then words, then tokens.
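The sentence, word, and token steps above can be sketched with two regular expressions. This is a minimal illustration only: the function names `split_sentences` and `tokenize` are hypothetical, and the splitting rules are deliberately naive (they would mishandle abbreviations such as "Dr.").

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Lowercase, then treat each run of word characters and each
    # single punctuation mark as one token.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

text = "I like tea. Do you like tea?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```

Real pipelines usually rely on a tested tokenizer rather than hand-written patterns, but the order of the steps is the same.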
What are three applications of predicting the probability of a sequence
- Spellchecking
- Grammatical error correction
- Autocomplete/suggestions
what are the 4 n-grams
Unigrams
Bigrams
Trigrams
Quadrigrams
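As a quick illustration, the four n-gram sizes can all be extracted with one sliding window over a token list. The helper name `ngrams` and the example sentence are assumptions made for this sketch.

```python
def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "like", "green", "tea"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
quadrigrams = ngrams(tokens, 4)

print(bigrams)
print(trigrams)
```

A sentence with m tokens yields m - n + 1 n-grams, so longer n-grams are rarer in the same amount of text.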
what is smoothing
Techniques that assign a small, non-zero probability to unseen combinations without distorting the overall statistics of the training set
what are the 3 types of smoothing
- Laplace smoothing
- add-k smoothing
- Kneser-Ney smoothing
what is laplace smoothing
Adds one to every count before normalizing, so that unseen combinations receive a non-zero probability.
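For bigrams, the add-one rule can be written out as below, where C(.) is a count in the training set and V is the vocabulary size. The numbers in the worked example are hypothetical counts chosen only to show the effect.

```latex
P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}

% Hypothetical counts: C(\text{like tea}) = 2,\; C(\text{like}) = 3,\; V = 5
\frac{2 + 1}{3 + 5} = 0.375 \quad \text{(seen bigram, down from } 2/3 \text{)}
\qquad
\frac{0 + 1}{3 + 5} = 0.125 \quad \text{(unseen bigram, up from } 0 \text{)}
```

Adding V to the denominator keeps the probabilities for each context summing to 1.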
what is add-k smoothing
Rather than adding 1 to all counts, we add an arbitrary constant k (typically between 0 and 1).
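A minimal sketch of add-k smoothing for bigrams follows. The function name `add_k_prob` and the toy corpus are assumptions for illustration, and the corpus is treated as one continuous stream with no sentence boundaries.

```python
from collections import Counter

def add_k_prob(prev, word, bigram_counts, unigram_counts, vocab_size, k=1.0):
    # (count(prev, word) + k) / (count(prev) + k * V); k = 1 is Laplace smoothing.
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

# Toy corpus (hypothetical), read as a single token stream.
tokens = "i like tea i like coffee you like tea".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)

seen_k1 = add_k_prob("like", "tea", bigram_counts, unigram_counts, V, k=1.0)
unseen_k1 = add_k_prob("like", "you", bigram_counts, unigram_counts, V, k=1.0)
seen_k01 = add_k_prob("like", "tea", bigram_counts, unigram_counts, V, k=0.1)
unseen_k01 = add_k_prob("like", "you", bigram_counts, unigram_counts, V, k=0.1)

print(seen_k1, unseen_k1)
print(seen_k01, unseen_k01)
```

With a smaller k, less probability mass moves from seen bigrams to unseen ones, so the smoothed estimates stay closer to the raw training counts.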
Which n-gram order is better?
- For higher n we capture more context and so we can make better predictions
- But for higher n we also need more data, and the counts will inevitably be sparse
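The sparsity problem can be made concrete by comparing how many distinct n-grams a corpus contains with how many are possible for its vocabulary. The toy corpus below is an assumption for illustration; real corpora show the same trend at a much larger scale.

```python
tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab_size = len(set(tokens))

coverage = {}
for n in (1, 2, 3):
    # Distinct n-grams actually observed in the corpus.
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    # Every combination of n vocabulary items is a possible n-gram.
    possible = vocab_size ** n
    coverage[n] = len(seen) / possible
    print(f"n={n}: {len(seen)} distinct n-grams seen out of {possible} possible")
```

The number of possible n-grams grows as V to the power n, while the number observed is bounded by the corpus length, so the observed fraction shrinks quickly as n increases.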
How can we maximize probabilities
Given a training set and a test set, we want the model to maximize the probability of the test set. For bigrams this means we want to maximize:
P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})
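A minimal sketch of scoring a test sentence with an unsmoothed bigram model is shown below: the sentence probability is the product of the conditional bigram probabilities, computed as a sum of logs to avoid underflow. The function names, the `<s>` and `</s>` boundary markers, and the toy training sentences are assumptions made for this example.

```python
import math
from collections import Counter

def train_bigram(sentences):
    # Count unigrams and bigrams, padding each sentence with boundary markers.
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        toks = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def sentence_log_prob(sentence, unigrams, bigrams):
    # log P(sentence) = sum over i of log P(w_i | w_{i-1}), maximum-likelihood estimates.
    toks = ["<s>"] + sentence.split() + ["</s>"]
    log_p = 0.0
    for prev, word in zip(toks, toks[1:]):
        count = bigrams[(prev, word)]
        if count == 0:
            # Without smoothing, one unseen bigram makes the whole product zero.
            return float("-inf")
        log_p += math.log(count / unigrams[prev])
    return log_p

unigrams, bigrams = train_bigram(["i like tea", "i like coffee"])
seen_lp = sentence_log_prob("i like tea", unigrams, bigrams)
unseen_lp = sentence_log_prob("i like milk", unigrams, bigrams)

print(math.exp(seen_lp))
print(unseen_lp)
```

The second sentence scores negative infinity because "like milk" never occurs in the training data, which is the case smoothing is meant to handle.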