Class 10 Flashcards
grammar
defines the syntax of legal sentences
language model
probability distribution describing the likelihood of any string – no two people share exactly the same language model
tokenization
process of dividing a text into a sequence of words (tokens)
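A minimal sketch of word tokenization using a simple regex (the pattern and example are illustrative, not from the source):

```python
import re

def tokenize(text):
    # Match runs of word characters, or any single non-space punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

# Splits words from punctuation; contractions come apart at the apostrophe.
tokens = tokenize("Don't panic!")
```

Real tokenizers handle contractions, hyphenation, and URLs with many more rules; this shows only the basic idea.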
n gram model
Markov chain model that considers only the dependence between n adjacent words; works well for spam detection, sentiment analysis, etc.
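A minimal bigram (n=2) sketch, estimating P(word | previous word) by maximum likelihood from counts (the toy corpus is invented for illustration):

```python
from collections import Counter

# Toy corpus; a real model would be trained on a large text collection.
corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)                    # single-word counts
bigrams = Counter(zip(corpus, corpus[1:]))    # adjacent-pair counts

def bigram_prob(prev, word):
    # Maximum-likelihood estimate: count(prev word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev]
```

For example, `bigram_prob("the", "cat")` is 2/3 here, since "the" occurs three times and is followed by "cat" twice.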
character level model
alternative to the n-gram model in which the probability of each character is determined by the n-1 previous characters
skip gram model
alternative to the n-gram model: count words that are near each other but skip one or more words between them
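A minimal 1-skip bigram sketch: count pairs of words separated by exactly one intervening word (the toy sentence is invented for illustration):

```python
from collections import Counter

corpus = "the quick brown fox jumps".split()

# Pair each word with the word two positions later, skipping the one between.
skip_bigrams = Counter(
    (corpus[i], corpus[i + 2]) for i in range(len(corpus) - 2)
)
```

Here ("the", "brown") and ("quick", "fox") are counted even though the words are not adjacent, which helps capture relationships an ordinary bigram model misses.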
smoothing
process of reserving some probability mass for never-before-seen n-grams
backoff model
estimates n-gram counts, but for low (or zero) counts we back off to (n-1)-grams
linear interpolation smoothing
backoff model that combines trigram, bigram, and unigram models by linear interpolation
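A minimal sketch of linear interpolation: the trigram, bigram, and unigram maximum-likelihood estimates are mixed with fixed weights (the corpus and the lambda values are illustrative; in practice the weights are tuned on held-out data):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat sat".split()
N = len(corpus)
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_interp(w1, w2, w3, lambdas=(0.6, 0.3, 0.1)):
    # P(w3 | w1, w2) as a weighted mix of trigram, bigram, unigram estimates.
    l3, l2, l1 = lambdas  # weights sum to 1; hand-picked here
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p1 = uni[w3] / N
    return l3 * p3 + l2 * p2 + l1 * p1
```

Because the unigram term is always nonzero for known words, the interpolated probability never collapses to zero just because a particular trigram was unseen.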
wordnet
open-source, hand-curated dictionary in machine-readable format that has proven useful for many natural language applications
penn treebank
corpus of over 3M words of text annotated with part of speech (POS) tags
beam search
compromise between a fast greedy search and the slower but more accurate Viterbi algorithm
hidden markov model
common model for part of speech (POS) tagging – combined with the Viterbi algorithm it can achieve accuracy of around 97%
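A minimal Viterbi sketch over a toy HMM for POS tagging. All probabilities below (start, transition, emission) are invented for illustration; a real tagger would estimate them from a corpus such as the Penn Treebank:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (probability of best path ending in state s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            V[t][s] = (V[t - 1][prev][0] * trans_p[prev][s]
                       * emit_p[s].get(obs[t], 0.0), prev)
    # Backtrack from the most probable final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Toy model: made-up probabilities for a three-tag example.
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans_p = {"DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
           "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
           "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2}}
emit_p = {"DET": {"the": 0.9},
          "NOUN": {"cat": 0.5, "sat": 0.1},
          "VERB": {"sat": 0.5, "cat": 0.1}}

tags = viterbi(["the", "cat", "sat"], states, start_p, trans_p, emit_p)
```

The algorithm keeps only the best path into each state at each step, which is what makes it exact yet efficient; beam search approximates this by keeping only the top few paths overall.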
discriminative model
learns a conditional probability distribution P(C|W), meaning it can assign categories given a sequence of words but can’t generate random sentences – ex: logistic regression
language
set of sentences that follow the rules laid out by a grammar