NLP Flashcards
Summarization
idea?
morpheme
a meaningful morphological unit of a language that cannot be further divided (e.g. in, come, -ing, forming incoming).
morphological
relating to the form or structure of things.
Zipf’s law
Zipf’s law states that, given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table, so the word with rank N has a frequency proportional to 1/N.
Thus the most frequent word will occur about twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
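A quick sketch of what the law predicts; the word list and normalization are invented purely for illustration:

```python
# Hypothetical illustration of Zipf's law: expected relative frequencies by rank.
# The word list is invented for this example.
words_by_rank = ["the", "of", "and", "to", "a"]

# Under Zipf's law, frequency is proportional to 1/rank,
# so the rank-1 word occurs ~2x as often as the rank-2 word, etc.
norm = sum(1.0 / r for r in range(1, len(words_by_rank) + 1))
for rank, word in enumerate(words_by_rank, start=1):
    expected_share = (1.0 / rank) / norm
    print(f"rank {rank}: {word:>4}  expected share of tokens = {expected_share:.3f}")
```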
arg max or argmax
the arguments of the maxima are the points of the domain of some function at which the function values are maximized
In other words, arg max is the set of points, x, for which f(x) attains the function’s largest value (if it exists). Arg max may be the empty set, a singleton, or contain multiple elements.
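A small illustration over a finite domain (plain Python, invented example functions) of arg max as a set of maximizing points:

```python
# arg max over a finite domain: the set of x values where f(x) is maximal.
def argmax_set(domain, f):
    best = max(f(x) for x in domain)
    return {x for x in domain if f(x) == best}

# f(x) = -(x - 2)^2 attains its maximum at x = 2
print(argmax_set(range(-5, 6), lambda x: -(x - 2) ** 2))  # {2}

# a function can attain its maximum at several points
print(argmax_set(range(-3, 4), lambda x: x * x))  # {-3, 3}
```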
Language model?
- Models that assign probabilities to sequences of words
- the simplest model that assigns probabilities to sentences and sequences of words is the N-gram LM
- Goal: assign useful probabilities P(x) to sentences x
- related task: assign the probability P(w_i+1 | w_1, ..., w_i) of an upcoming word
-> probabilities should broadly indicate plausibility of sentences
Noisy Channel Model
- Goal: recover the intended sentence given the observed acoustics (e.g. in speech recognition), by combining a channel model P(acoustics | sentence) with a language model P(sentence)
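A toy decoding sketch of the noisy-channel idea, argmax_x P(acoustics | x) · P(x); the candidate sentences and probabilities below are invented:

```python
# Toy noisy-channel decoding sketch: pick the candidate sentence that maximizes
# channel model * language model. Candidates and probabilities are invented.
candidates = {
    # sentence: (P(acoustics | sentence), P(sentence))
    "recognize speech":   (0.40, 0.010),
    "wreck a nice beach": (0.45, 0.001),
}

def noisy_channel_score(channel_prob, lm_prob):
    # argmax_x P(x | acoustics) = argmax_x P(acoustics | x) * P(x)
    return channel_prob * lm_prob

best = max(candidates, key=lambda s: noisy_channel_score(*candidates[s]))
print(best)  # "recognize speech"
```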
Unigram, bigram, trigram
n-gram models of order 1, 2, 3: a unigram model uses P(w_i), a bigram model P(w_i | w_i-1), a trigram model P(w_i | w_i-2, w_i-1)
stupid backoff
- a simple backoff scheme for very large LMs: use the relative frequency of the n-gram if it has been seen, otherwise back off to the lower-order score multiplied by a fixed weight (commonly 0.4); the resulting scores are not normalized probabilities
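A minimal sketch of stupid-backoff scoring, assuming simple count dictionaries with toy values and the commonly cited 0.4 backoff weight:

```python
# Stupid backoff (sketch): relative frequency if the higher-order n-gram was seen,
# otherwise 0.4 times the lower-order score. Counts below are toy values.
ALPHA = 0.4
trigram_counts = {("i", "like", "cats"): 2}
bigram_counts  = {("i", "like"): 5, ("like", "cats"): 3, ("like", "dogs"): 1}
unigram_counts = {"i": 10, "like": 6, "cats": 4, "dogs": 2}
total_tokens = sum(unigram_counts.values())

def stupid_backoff(w3, w2, w1):
    """Score of w3 given the history (w1, w2); not a normalized probability."""
    if trigram_counts.get((w1, w2, w3), 0) > 0 and bigram_counts.get((w1, w2), 0) > 0:
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
    if bigram_counts.get((w2, w3), 0) > 0 and unigram_counts.get(w2, 0) > 0:
        return ALPHA * bigram_counts[(w2, w3)] / unigram_counts[w2]
    return ALPHA * ALPHA * unigram_counts.get(w3, 0) / total_tokens

print(stupid_backoff("cats", "like", "i"))  # trigram seen: 2/5
print(stupid_backoff("dogs", "like", "i"))  # backs off to bigram: 0.4 * 1/6
```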
Prior probability
a probability as assessed before making reference to certain relevant observations, especially subjectively or on the assumption that all possible outcomes be given the same probability.
the prior of an uncertain quantity is the probability distribution that would express one’s beliefs about this quantity before some evidence is taken into account
Channel model probability
- in the noisy channel model: the probability P(observation | intended sentence), e.g. P(acoustics | sentence), which is combined with the prior / language model P(sentence)
Markov Model / Markov Chain
A finite state automaton with probabilistic state transitions
Makes the Markov assumption that the next state depends only on the current state and is independent of the previous history
Higher-order Markov Language Models
k-gram models are equivalent to bigram models whose state space consists of (k-1)-tuples of words, i.e. each state is the (k-1)-word history
Maximum likelihood estimator
P_MLE(w_i | w_i-1) = count(w_i-1, w_i) / count(w_i-1)
- it is called the maximum likelihood estimate because it is the estimate under which the observed training data are most likely (e.g. that the word "w_i" occurs x times in a y-million-word corpus)
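A minimal sketch of the bigram MLE from counts, on an invented toy corpus:

```python
# Bigram MLE: P_MLE(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
# Toy corpus for illustration.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("the", "cat"))  # 2/3: "the" occurs 3 times and is followed by "cat" twice
```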
Back-off Models
- use trigram if you have good evidence, otherwise bigram, otherwise unigram
related idea: interpolation -> mix unigram, bigram and trigram (mix all three all the time)
-> most of the time interpolation works better than backoff
Laplace Smoothing
- add-one: pretend we saw each word one more time than we did -> add one to all counts
- also: add-x smoothing
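A minimal add-one sketch over bigram counts; the counts and vocabulary size V are toy values:

```python
# Add-one (Laplace) smoothed bigram estimate:
# P_add1(w_i | w_{i-1}) = (count(w_{i-1}, w_i) + 1) / (count(w_{i-1}) + V)
# Toy counts; V is the vocabulary size.
bigram_count = {("the", "cat"): 2}
unigram_count = {"the": 3}
V = 1000  # hypothetical vocabulary size

def p_add_one(prev, word):
    return (bigram_count.get((prev, word), 0) + 1) / (unigram_count.get(prev, 0) + V)

print(p_add_one("the", "cat"))    # seen bigram: (2 + 1) / (3 + 1000)
print(p_add_one("the", "zebra"))  # unseen bigram still gets non-zero mass: 1 / 1003
```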
Back-off: Linear Interpolation
related idea to backoff: interpolation -> mix unigram, bigram and trigram (mix all three all the time)
-> most of the time interpolation works better than backoff
-> how to set the lambdas: tune them on a held-out corpus (separate from the training set and test set)
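A minimal sketch of the interpolated trigram estimate with lambdas that sum to 1; the component probabilities passed in are placeholders that would normally come from MLE counts:

```python
# Linearly interpolated trigram probability:
# P_interp(w | u, v) = l3 * P(w | u, v) + l2 * P(w | v) + l1 * P(w)
# Lambdas must be non-negative and sum to 1; in practice they are tuned on a held-out corpus.
def interpolate(p_trigram, p_bigram, p_unigram, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l3 * p_trigram + l2 * p_bigram + l1 * p_unigram

# Toy component estimates (would normally come from counts):
print(interpolate(p_trigram=0.5, p_bigram=0.2, p_unigram=0.01))  # 0.361
```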
interpolate?
interpolation is a method of constructing new data points within the range of a discrete set of known data points
Extrapolation
extrapolation is the process of estimating, beyond the original observation range, the value of a variable on the basis of its relationship with another variable. It is similar to interpolation, which produces estimates between known observations, but extrapolation is subject to greater uncertainty and a higher risk of producing meaningless results
gradient descent
iterative optimization technique: repeatedly take a step in the direction of the negative gradient in order to minimize a function (the minimization counterpart of gradient ascent)
perplexity
=> the most common intrinsic evaluation of a language model
(vs. extrinsic evaluation, e.g. plugging the LM into a speech recognition or MT system)
=> the inverse probability of the test set, normalized by the number of words: PP(W) = P(w_1 ... w_N)^(-1/N)
=> perplexity is a function of the probability of the sentence
=> also: perplexity is the "weighted equivalent branching factor" (also "average branching factor")
=> in general: the lower the perplexity, the better the model
=> minimizing perplexity is the same as maximizing probability
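A minimal sketch of computing perplexity from the per-word probabilities a model assigns to a test sequence; the probabilities are toy values:

```python
# Perplexity: PP(W) = P(w_1 ... w_N)^(-1/N),
# computed in log space for numerical stability.
import math

def perplexity(word_probs):
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Toy per-word probabilities assigned by some language model to a test sentence:
print(perplexity([0.1, 0.2, 0.05, 0.1]))  # 10.0
# A uniform model over a vocabulary of size 1000 has perplexity ~1000 (branching factor):
print(perplexity([1 / 1000] * 50))
```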
a better model of a text
is one which assigns a higher probability to the word that actually occurs
the best language model
is one that best predicts an unseen test set
-> gives the highest P(sentence)
chain rule
permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities, e.g. P(w_1, w_2, w_3) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2)
long-distance dependencies/information
N-gram models cannot capture long-distance dependencies, but in most applications they are sufficient
how do we deal with bigrams that have zero probability?
simplest idea: Add-one (Laplace) smoothing
smoothing
Methods for assigning probability mass to rare events
backoff
use trigram if you have good evidence, otherwise back off and use bigram, otherwise unigram
in practice, interpolation works better than backoff
interpolation
mix unigram, bigram, trigram
in practice, interpolation works better than backoff
Add-one (Laplace) Smoothing
how do we deal with bigrams that have zero probability?
-> ok for text classification, not for language modeling
log-likelihood
the logarithm of the likelihood of the data under the model; maximizing log-likelihood is equivalent to maximizing likelihood, and sums of log probabilities are numerically more stable than products of probabilities
dot-product
the sum of the element-wise products of two equal-length vectors, e.g. w · phi(x, y) = sum_j w_j * phi_j(x, y); used in log-linear models to score (x, y) pairs
Kneser-Ney smoothing
- an n-gram smoothing method that subtracts a fixed discount from observed counts and redistributes the mass using a continuation probability (how many distinct contexts a word has appeared in) rather than its raw unigram frequency; one of the best-performing n-gram smoothing methods
Open vs Closed classes
tbd
generative vs discriminative model
- generative: models the joint probability P(x, y) of the inputs x and the labels y
- discriminative: models the conditional probability P(y | x)
- discriminative models are usually harder to train (we have to solve an optimization problem, e.g. with gradient descent); with generative models we usually just count
log-linear model
- a conditional probability model
- sometimes also called logistic regression model
- a discriminative model
- they have a feature function phi(x, y)
- and a weight vector w
1) model: … cf 03-19
2) learning: find the weight vector w that maximizes the (log-)likelihood of the training data (argmax over w)
3) prediction: for a given input x, choose the label y with the highest probability (argmax over y)
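A minimal sketch of a log-linear model with a hypothetical indicator feature function phi(x, y) and hand-picked weights (both invented for illustration):

```python
# Log-linear model sketch: P(y | x) = exp(w . phi(x, y)) / sum_y' exp(w . phi(x, y')).
# The feature function and weights below are invented for illustration.
import math

LABELS = ["NOUN", "VERB"]

def phi(word, label):
    # Binary "indicator" features asking questions about the (x, y) pair.
    return [
        1.0 if word.endswith("ing") and label == "VERB" else 0.0,
        1.0 if word[0].isupper() and label == "NOUN" else 0.0,
        1.0 if label == "NOUN" else 0.0,  # label bias feature
    ]

w = [2.0, 1.5, 0.3]  # hypothetical learned weights

def p(label, word):
    def score(y):
        return math.exp(sum(wi * fi for wi, fi in zip(w, phi(word, y))))
    return score(label) / sum(score(y) for y in LABELS)

print(p("VERB", "running"))  # high: the -ing/VERB feature fires
print(p("NOUN", "running"))
```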
Sequence Methods
e.g. HMM -> a generative model (we model the joint probability of input and output)
named entity recognition (NER)
a sequence labeling task
e.g. twoDigitNum, allCaps, lowercase
see 03-24
Part of speech tags
- a sequence of syntactic categories, one per word
- e.g. verb noun conj. verb adverb
transition & emission in HMM?
transition: y_i -> y_i+1
emission: y_i -> x_i
transition probabilities
-> HMM: Pr(y_i | y_i-1), the probability of the next tag/state given the previous one
two sequences in HMM
• In a hidden Markov model there are two sequences:
- sequence 1: a Markov chain (the labels, i.e. the y_i's)
- sequence 2: the emissions from the Markov chain (the observed x_i's)
joint probability
A joint probability is a statistical measure of the likelihood of two events occurring together: the probability of event Y occurring at the same time as event X.
HMM
An HMM is a probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences, whatever), it computes a probability distribution over possible sequences of labels and chooses the best label sequence.
- > with HMM we model the joint probability of the input and the output (we also model the probability of seeing a certain sequence of words (input), not only tags)
- > HMM makes strong assumptions
e(x_i,y_i)
emission probability: prob of word x_i given category y_i
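A minimal sketch of the joint probability an HMM assigns to a word/tag pair, as a product of transition probabilities q(y_i | y_i-1) and emission probabilities e(x_i | y_i); the tables are toy values:

```python
# HMM joint probability (sketch):
# P(x, y) = prod_i q(y_i | y_{i-1}) * e(x_i | y_i), with y_0 = START.
# Transition and emission tables below are toy values.
q = {  # transition probabilities q[(prev_tag, tag)]
    ("START", "DET"): 0.6, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.5,
}
e = {  # emission probabilities e[(tag, word)]
    ("DET", "the"): 0.5, ("NOUN", "dog"): 0.1, ("VERB", "barks"): 0.05,
}

def joint_prob(words, tags):
    prob = 1.0
    prev = "START"
    for word, tag in zip(words, tags):
        prob *= q.get((prev, tag), 0.0) * e.get((tag, word), 0.0)
        prev = tag
    return prob

print(joint_prob(["the", "dog", "barks"], ["DET", "NOUN", "VERB"]))  # 0.000675
```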
smoothed estimation (trigram HMM)
1) for transition probabilities: smooth the transition parameters just like in n-gram models, using linear interpolation with lambda_1, lambda_2, lambda_3
2) for emission probabilities: usually pseudowords are used
- use the training corpus to compute the different counts (for the linear interpolation)
HMM: Smoothing with Pseudowords
finesse the problem of low-frequency words, or words never seen in training, by closing the vocabulary: map those low-frequency words to a much smaller, finite set of perhaps 20 "word classes" which preserve information about their spellings
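A minimal sketch of mapping rare words to pseudoword classes like those mentioned in the NER card (twoDigitNum, allCaps, lowercase); the exact class inventory and ordering here are a hypothetical subset:

```python
# Map rare/unseen words to a small set of pseudoword classes that preserve
# spelling information. The class names and order are a hypothetical subset.
def pseudoword(word):
    if word.isdigit() and len(word) == 2:
        return "twoDigitNum"
    if word.isdigit() and len(word) == 4:
        return "fourDigitNum"
    if word.isupper():
        return "allCaps"
    if word[0].isupper():
        return "initCap"
    if word.islower():
        return "lowercase"
    return "other"

for w in ["07", "1984", "IBM", "Smith", "table", "vice-president"]:
    print(w, "->", pseudoword(w))
```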
HMM: Inference
HMM: training, smoothing
next: inference
problem statement: for a given input sentence, find the most likely sequence of tags (searching over all possible tag sequences)
-> argmax over sequences of y's
-> but brute-force enumeration is exponential in the sentence length (|tag set|^n candidate sequences)
=> cf. Viterbi algorithm -> tractable (dynamic programming) solution
Viterbi algorithm
a dynamic programming algorithm for finding the most likely sequence of hidden states (sequence of tags y’s) that results in a sequence of observed events
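A compact Viterbi sketch, reusing the same kind of toy transition/emission tables as in the HMM sketch above; missing table entries get a tiny probability just to keep the example short:

```python
# Viterbi (sketch): dynamic programming over positions and tags to find the
# most likely tag sequence. q/e are toy transition/emission tables.
TAGS = ["DET", "NOUN", "VERB"]
q = {("START", "DET"): 0.6, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.5}
e = {("DET", "the"): 0.5, ("NOUN", "dog"): 0.1, ("VERB", "barks"): 0.05}
EPS = 1e-8  # stand-in for smoothed probabilities of unseen events

def viterbi(words):
    # pi[i][t] = probability of the best tag sequence for words[:i+1] ending in tag t
    pi = [{t: q.get(("START", t), EPS) * e.get((t, words[0]), EPS) for t in TAGS}]
    back = [{}]
    for i in range(1, len(words)):
        pi.append({})
        back.append({})
        for t in TAGS:
            best_prev = max(TAGS, key=lambda s: pi[i - 1][s] * q.get((s, t), EPS))
            pi[i][t] = pi[i - 1][best_prev] * q.get((best_prev, t), EPS) * e.get((t, words[i]), EPS)
            back[i][t] = best_prev
    # follow back-pointers from the best final tag
    last = max(TAGS, key=lambda t: pi[-1][t])
    tags = [last]
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```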
Dynamic programming
dynamic programming (also known as dynamic optimization) is a method for solving a complex problem by breaking it down into a collection of simpler subproblems, solving each of those subproblems just once, and storing their solutions. The next time the same subproblem occurs, instead of recomputing its solution, one simply looks up the previously computed solution, thereby saving computation time at the expense of a (hopefully) modest expenditure in storage space
HMM makes strong assumptions
- > about how the different predictions in the sequence relate to one another:
1) The word (emitted value) x_i is independent of the rest of the variables given y_i, i.e. for each x_i it is sufficient to look at y_i alone; we don't condition on the full sequence of labels
=> BUT: words are dependent on other words in the sentence, even given their POS
=> ideally we would like to model more dependence between the words!
2) a POS tag is independent of past tags given the previous tag
=> BUT: POS tags have dependencies that cannot be bounded in distance
feature in log-linear model
- > the features in a log-linear model serve the same function as the emission probabilities in HMM
- > the higher the (weighted) feature values, the higher the probability of a certain y
prediction
= tagging
- model
- inference: given the observed variables, how do we find the most likely y? (cf. Viterbi algorithm)
- learning: how do we find the parameters?
first-order model
t
multinomial
you have a chart / lookup table for each word and label and for each pair of labels (?)
log linear model
you have a feature function
what’s the difference between a feature-based classifier and a classifier which just has a multinomial?
feature-based classifier: even completely different sequences of words can share similar features; the shared features create a connection between the sequences, so they influence each other even though they are not the same
multinomial: you have a table over labels/words whose entries are independent; it only tells you that if you see this word, it has this value; you don't model any relation between words
feature vector
a feature vector consists of, say, m features, many of which can be binary features ("indicator functions") that answer some question about the input/output pair
gradient ascent
hill-climbing technique for concave functions (for which any local maximum is also the global maximum)
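A minimal gradient-ascent sketch on an invented concave function, f(x) = -(x - 3)^2 with its maximum at x = 3:

```python
# Gradient ascent (sketch): repeatedly step in the direction of the gradient.
# For the concave f(x) = -(x - 3)^2, the gradient is f'(x) = -2 * (x - 3).
def gradient_ascent(grad, x0=0.0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x += lr * grad(x)
    return x

x_max = gradient_ascent(grad=lambda x: -2 * (x - 3))
print(round(x_max, 4))  # ~3.0, the global maximum of the concave function
```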
co-reference resolution
t
relation extraction
t
apposition
t
ontology
t
supervision
t
precision, recall
precision: % of selected items that are correct
recall: % of correct items that are selected (trades off against precision)
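A minimal sketch computing precision, recall and F1 from selected vs. gold item sets; the two sets are toy examples:

```python
# Precision = |selected & correct| / |selected|
# Recall    = |selected & correct| / |correct|
# F1        = harmonic mean of precision and recall.
def precision_recall_f1(selected, gold):
    tp = len(selected & gold)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

selected = {"Obama", "Paris", "IBM", "Tuesday"}  # items the system selected
gold = {"Obama", "Paris", "IBM", "Google"}       # items that are actually correct
print(precision_recall_f1(selected, gold))  # (0.75, 0.75, 0.75)
```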
open information extraction system
should be able to identify the ontology (i.e. the tables) as well as populate the database (it does not have a predefined ontology, e.g. through Wikipedia infoboxes; it only has text!)
how does it define relations? just as strings/pieces of text
how to cluster semantically equivalent relations? cf. 11-16: these would be 3 different relation instances but they represent the same relation
syntax does not align with semantics!
a first step: Semantic role labeling (SRL)
idea: take a sentence, split it into events and propositions, and be able to identify that a given relation is expressed and that it is very similar to other instances of the same type of relation
- > idea: the roles are invariant across different sentences and stay the same for equivalent arguments, i.e. a role is the same regardless of which syntactic position it takes; it is a kind of abstraction over the sentence-level paraphrasing which you can do
Semantic Roles cf. 11-24
- An underlying relationship an argument has with the main verb in a clause
- This relation is (ideally) invariant to paraphrases
- do not necessarily align with syntactic roles
3 main representation schemes for SRL
1) FrameNet scheme
judgment frame; roles: judge, evaluee, reason; verb “blame” invokes the judgment frame
-> very accurate, but coverage is very limited (only a few thousand frames)
- semantic roles are defined by the frame
- a frame does a clustering of verbs (blame, hail, etc have the same frame)
- is more difficult to parse
- frames are manually defined
- a big problem: frame coverage
2) PropBank scheme
idea: semantic roles are more fine-grained than in FrameNet
- idea: try to abstract over the different instantiations of the same verb
- more frequently used than FrameNet; PropBank is more agnostic and does not generalize over verbs at all
- more fine-grained: e.g. She blames… or She praises… would be put in same frame with FrameNet but use 2 different semantic roles in PropBank
3) VerbNet scheme
- takes max. coarse-grained approach
- 22 roles only
-> very generalized; sometimes it is difficult to pick the right role because the roles are defined very abstractly
- the crux: can the same set of roles be used across the different verb classes?
semantic role labeling (SRL)
you want to abstract away from the syntax to directly represent the semantic relations, e.g. with FrameNet (a big dictionary of frames); several verbs can invoke the same frame (e.g. the judgment frame); frames are manually defined
- SRL is a simplified form of semantic representation
different type of dictionary: semantic fields
clustering of words that share a meaning component
syntactic parse
t
error propagation
usually when you separate one prediction task into 2 stages, a situation called "error propagation" arises: if you make a mistake at the first stage then, no matter what you do at the second stage, you are already doomed
-> that's a reason why people avoid pipelines
sparsity
tbd
classifier training vs feature?
t
precision, recall, F-score
t
NLP
- information retrieval
- machine translation
- question answering
- summarization
information extraction
map text to database entries
Ambiguity
- key challenge
- Language has much underlying structure, which is ambiguously expressed
- Ambiguity is at all levels:
- > words can sound the same but not mean the same (bank vs. bank)
- > Sentences may look the same, but have different meanings
- Alongside ambiguity, there is also redundancy: many different ways to say the same thing
morpheme
is the smallest grammatical unit in a language. In other words, it is the smallest meaningful unit of a language. The linguistics field of study dedicated to morphemes is called morphology. A morpheme is not identical to a word, and the principal difference between the two is that a morpheme may or may not stand alone, whereas a word, by definition, is freestanding. When it stands by itself, it is considered as a root because it has a meaning of its own (e.g. the morpheme cat) and when it depends on another morpheme to express an idea, it is an affix because it has a grammatical function (e.g. the –s in cats to indicate that it is plural). Every word comprises one or more morphemes.