Class 10 Flashcards
grammar
defines the syntax of legal sentences
language model
probability distribution describing the likelihood of any string – no two people have exactly the same language model
tokenization
process of dividing a text into a sequence of words
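A minimal sketch of tokenization, assuming English text and a simple regular expression (the pattern and the function name `tokenize` are illustrative choices, not a standard):

```python
import re

def tokenize(text):
    # A token is either a run of letters/digits (optionally followed by an
    # apostrophe and more letters, as in "it's") or one punctuation mark.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)

tokens = tokenize("Don't panic, it's fine.")
print(tokens)
```

Real tokenizers handle many more cases (hyphens, URLs, non-ASCII text); this only shows the basic idea of turning a string into a token sequence.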
n-gram model
Markov chain model that considers only the dependence between n adjacent words; works well for spam detection, sentiment analysis, etc.
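As a rough illustration, a bigram model (n = 2) can be estimated from raw counts. The helper names `train_bigram` and `bigram_prob` and the toy sentence are assumptions made for this sketch:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    # counts[prev][word] = number of times `word` directly follows `prev`
    counts = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        counts[prev][word] += 1
    return counts

def bigram_prob(counts, prev, word):
    # Maximum-likelihood estimate of P(word | prev); 0.0 if `prev` was never seen
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

model = train_bigram("the cat sat on the mat".split())
p_cat = bigram_prob(model, "the", "cat")
p_dog = bigram_prob(model, "the", "dog")
```

Here "the" is followed once by "cat" and once by "mat", so each gets half of the probability, while an unseen pair such as "the dog" gets zero.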
character-level model
alternative to n-gram model, probability of each character determined by n-1 previous characters
skip-gram model
alternative to n-gram model, count words that are near each other but skip a word (or more) between them
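A small sketch of collecting skip-bigrams, assuming a fixed number of skipped words (the function name `skip_bigrams` and the `skip` parameter are illustrative):

```python
def skip_bigrams(tokens, skip=1):
    # Pair each word with the word that comes after exactly `skip` skipped words.
    return [(tokens[i], tokens[i + skip + 1])
            for i in range(len(tokens) - skip - 1)]

pairs = skip_bigrams("je ne comprends pas".split(), skip=1)
print(pairs)
```

With one skipped word, the French phrase yields the pairs "je comprends" and "ne pas", which an ordinary bigram model would never count.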
smoothing
process of reserving some probability for previously unseen n-grams
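One simple way to do this is add-one (Laplace) smoothing, shown here for unigrams as an example of the idea rather than the only method. The function name and the toy vocabulary are assumptions:

```python
from collections import Counter

def add_one_prob(word, counts, vocab_size):
    # Pretend every vocabulary word was seen one extra time, so unseen
    # words get a small nonzero probability.
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

counts = Counter("a a b".split())
vocab = {"a", "b", "c"}  # "c" never occurs in the training text
probs = {w: add_one_prob(w, counts, len(vocab)) for w in vocab}
```

The unseen word "c" receives probability 1/6 instead of 0, and the three probabilities still sum to 1.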
backoff model
estimates n-gram counts, but for sequences with low (or zero) counts, backs off to (n-1)-grams
linear interpolation smoothing
backoff model that combines trigram, bigram, and unigram models by linear interpolation
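A sketch of the combination step, assuming the three component probabilities have already been estimated. The weights in `lambdas` are hypothetical values; in practice they would be tuned on held-out data:

```python
def interpolated_prob(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    # Weighted sum of trigram, bigram, and unigram estimates.
    l3, l2, l1 = lambdas
    if abs(l3 + l2 + l1 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# An unseen trigram (probability 0) still gets a nonzero estimate.
p = interpolated_prob(p_tri=0.0, p_bi=0.2, p_uni=0.05)
```

Because the weights sum to 1, the result is still a valid probability, and the lower-order models fill in when the trigram count is zero.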
wordnet
open-source, hand-curated dictionary in machine-readable format that has proven useful for many natural language applications
penn treebank
corpus of over 3M words of text annotated with part-of-speech (POS) tags
beam search
compromise between a fast greedy search and the slower but more accurate Viterbi algorithm
hidden markov model
common model for part-of-speech (POS) tagging – combined with the Viterbi algorithm, it can achieve accuracy of around 97%
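A compact sketch of Viterbi decoding for an HMM tagger. The two-tag model and all probabilities below are made-up toy values for illustration, not estimates from any corpus:

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[t][tag] = (probability of the best path ending in `tag` at
    # position t, previous tag on that path)
    best = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None)
             for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            column[tag] = max(
                (best[-1][p][0] * trans_p[p][tag] * emit_p[tag].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

tags = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"dogs": 0.5, "bark": 0.1}, "V": {"dogs": 0.05, "bark": 0.6}}
tagged = viterbi(["dogs", "bark"], tags, start_p, trans_p, emit_p)
```

A practical tagger would work with log probabilities to avoid underflow on long sentences and would smooth the emission table for unknown words.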
discriminative model
learns a conditional probability distribution P(C|W), meaning it can assign categories given a sequence of words but can’t generate random sentences – ex: logistic regression
language
set of sentences that follow the rules laid out by a grammar
syntactic categories
help to constrain the probable words at each point within a sentence – ex: noun phrase or verb phrase
phrase structure
provides a framework for the meaning, or semantics, of the sentence
overgenerate
when a grammar produces sentences that are not grammatical
undergenerate
when a grammar rejects valid sentences
lexicon
list of allowable words
parsing
process of analyzing a string of words to uncover its phrase structure according to the rules of grammar
cyk algorithm
chart parser that requires a grammar in Chomsky normal form
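A minimal CYK recognizer, sketched under the assumption of a non-probabilistic grammar already in Chomsky normal form. The tiny grammar and the names `lexical` and `binary` are illustrative only:

```python
def cyk_recognize(words, lexical, binary, start="S"):
    n = len(words)
    # table[i][j] holds every category that can span words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = {cat for cat, w in lexical if w == word}
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            j = i + length - 1                # span end
            for k in range(i, j):             # split point
                for parent, left, right in binary:
                    if left in table[i][k] and right in table[k + 1][j]:
                        table[i][j].add(parent)
    return n > 0 and start in table[0][n - 1]

lexical = [("NP", "dogs"), ("V", "chase"), ("NP", "cats")]  # X -> word
binary = [("S", "NP", "VP"), ("VP", "V", "NP")]             # X -> Y Z
ok = cyk_recognize("dogs chase cats".split(), lexical, binary)
bad = cyk_recognize("chase dogs cats".split(), lexical, binary)
```

Chomsky normal form matters here because every rule is either X -> word or X -> Y Z, so each span only needs to be split into two parts. This version only answers whether a parse exists; a full parser would also store back-pointers (and, for a probabilistic grammar, the best probability per cell).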
shift-reduce parsing
popular deterministic approach: go through the sentence word by word, choosing at each point whether to shift the word onto a stack of constituents or to reduce the top constituent(s) on the stack according to a grammar rule
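A simplified sketch of the shift/reduce loop. It assumes a greedy policy (reduce whenever a rule matches the top of the stack, otherwise shift) and a lexicon that maps each word straight to one category; real parsers usually learn when to shift or reduce. The grammar and names are illustrative:

```python
def shift_reduce(words, lexicon, rules, start="S"):
    stack, queue = [], list(words)
    while True:
        for parent, rhs in rules:
            # Reduce: replace the top of the stack with the rule's parent.
            if len(rhs) <= len(stack) and tuple(stack[-len(rhs):]) == rhs:
                del stack[-len(rhs):]
                stack.append(parent)
                break
        else:
            if not queue:
                break  # nothing left to reduce or shift
            # Shift: push the category of the next word onto the stack.
            stack.append(lexicon[queue.pop(0)])
    return stack == [start]

lexicon = {"dogs": "NP", "chase": "V", "cats": "NP"}
rules = [("S", ("NP", "VP")), ("VP", ("V", "NP"))]  # right-hand sides are non-empty
parsed = shift_reduce("dogs chase cats".split(), lexicon, rules)
failed = shift_reduce("chase dogs".split(), lexicon, rules)
```

For "dogs chase cats" the stack goes NP, then NP V, then NP V NP, which reduces to NP VP and finally to S. The greedy policy can fail on grammars where an early reduction blocks a later one, which is one reason practical systems choose actions with a trained classifier.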
dependency grammar
assumes that syntactic structure is formed by binary relations between lexical items, without the need for syntactic constituents