NLP Flashcards
Summarization
idea?
Ambiguity
t
morpheme
a meaningful morphological unit of a language that cannot be further divided (e.g. in, come, -ing, forming incoming ).
morphological
relating to the form or structure of things.
Zipf’s law
Zipf’s law states that given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table. So the word of rank N has a frequency proportional to 1/N.
Thus the most frequent word will occur about twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
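A quick way to eyeball Zipf's law on any corpus (a minimal sketch; assumes the corpus is a plain whitespace-tokenized string):
```python
from collections import Counter

def zipf_check(text, top=10):
    """Rank words by frequency; Zipf's law predicts freq * rank stays roughly constant."""
    ranked = Counter(text.lower().split()).most_common(top)
    for rank, (word, freq) in enumerate(ranked, start=1):
        print(f"{rank:2d}  {word:15s}  freq={freq:6d}  freq*rank={freq * rank}")
```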
arg max or argmax
the arguments of the maxima are the points of the domain of some function at which the function values are maximized
In other words, arg max is the set of points, x, for which f(x) attains the function’s largest value (if it exists). Arg max may be the empty set, a singleton, or contain multiple elements.
Language model?
- Models that assign probabilities to sequences of words
- the simplest model that assigns probabilities to sentences and sequences of words is the N-gram
- Goal: assign useful probabilities P(x) to sentences x
- related task: assign probability P(w_i+1 | w_i) of an upcoming word
-> Probabilities should broadly indicate plausibility of sentences
Noisy Channel Model
- Goal: predict sentence given acoustics
Unigram, bigram, trigram
t
stupid backoff
t
Prior probability
a probability as assessed before making reference to certain relevant observations, especially subjectively or on the assumption that all possible outcomes are given the same probability.
the prior of an uncertain quantity is the probability distribution that would express one’s beliefs about this quantity before some evidence is taken into account
Channel model probability
t
Markov Model / Markov Chain
A finite state automaton with probabilistic state transitions
Makes the Markov assumption that the next state depends only on the current state and is independent of the previous history
Higher-order Markov Language Models
a k-gram model is equivalent to a bigram (first-order Markov) model whose states are (k-1)-tuples of words, i.e. tuples over the bigram model’s state space (the individual words)
Maximum likelihood estimator
P(w_i | w_i-1)_mle = count(w_i-1,w_i) / count(w_i-1)
- it is the estimate that maximizes the likelihood of the training data, i.e. the estimate under which it is most likely that the word w_i occurs exactly x times in a corpus of y million words
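A minimal sketch of the MLE bigram estimate above (assumes sentences come as lists of tokens; <s> and </s> are sentence-boundary padding symbols, as in the bigram example further down):
```python
from collections import Counter

def bigram_mle(sentences):
    """P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])                  # histories w_{i-1}
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # pairs (w_{i-1}, w_i)
    return {(prev, w): c / unigrams[prev] for (prev, w), c in bigrams.items()}
```
e.g. bigram_mle([["I", "want", "english", "food"]])[("want", "english")] gives 1.0 on this one-sentence corpus.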
Back-off Models
- use trigram if you have good evidence, otherwise bigram, otherwise unigram
related idea: interpolation -> mix unigram, bigram and trigram (mix all three all the time)
-> most of the time interpolation works better than backoff
Laplace Smoothing
- add-one: pretend we saw each word one more time than we did -> add one to all counts
- also: add-x smoothing
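A minimal sketch of the add-one estimate (counts as dicts and a vocabulary size V are assumed to be given):
```python
def laplace_bigram(bigram_counts, unigram_counts, V, prev, word):
    """Add-one: P(word | prev) = (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts.get((prev, word), 0) + 1) / (unigram_counts.get(prev, 0) + V)
```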
Back-off: Linear Interpolation
related idea to backoff: interpolation -> mix unigram, bigram and trigram (mix all three all the time)
-> most of the time interpolation works better than backoff
-> how to set the lambdas: use a held-out corpus (separate from the training set and test set)
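A minimal sketch of linear interpolation (the default lambda values are placeholders; in practice they are tuned on the held-out corpus and must sum to 1):
```python
def interpolated_trigram(w, prev2, prev1, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """Mix unigram, bigram and trigram estimates all the time."""
    l1, l2, l3 = lambdas
    return (l3 * p_tri.get((prev2, prev1, w), 0.0)
            + l2 * p_bi.get((prev1, w), 0.0)
            + l1 * p_uni.get(w, 0.0))
```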
interpolate?
interpolation is a method of constructing new data points within the range of a discrete set of known data points
Extrapolation
extrapolation is the process of estimating, beyond the original observation range, the value of a variable on the basis of its relationship with another variable. It is similar to interpolation, which produces estimates between known observations, but extrapolation is subject to greater uncertainty and a higher risk of producing meaningless results
gradient descent
tbd!
perplexity
=> intrinsic evaluation of a language model
- is based on the inverse probability of the test set
-> perplexity is most common
=> it is the inverse probability of the test set, normalized by the number of words
vs. extrinsic evalution (e.g. using speech recognition system, MT system)
=> perplexity is a function of the probability of the sentence
=> also: perplexity is “weighted equivalent branching factor” (also “average branching factor”)
=> in general: the lower the perplexity, the better the model
=> minimizing perplexity is the same as maximizing probability
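A minimal sketch of computing perplexity from the per-token probabilities a model assigns to a test set (base-2 logs; any base works as long as it matches the exponent):
```python
import math

def perplexity(token_probs):
    """Inverse probability of the test set, normalized by the number of words:
    2 ** (-(1/N) * sum(log2 p_i)). Lower is better."""
    n = len(token_probs)
    return 2 ** (-sum(math.log2(p) for p in token_probs) / n)
```
e.g. a model that assigns every token probability 0.1 has perplexity 10 (the branching-factor reading).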
a better model of a text
is one which assigns a higher probability to the word that actually occurs
the best language model
is one that best predicts an unseen test set
-> gives the highest P(sentence)
chain rule
permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities
long-distance dependencies/information
in practice, N-gram models are often sufficient, even though they cannot capture long-distance dependencies
how do we deal with bigrams that have zero probability?
simplest idea: Add-one (Laplace) smoothing
smoothing
Methods for assigning probability mass to rare events
backoff
use trigram if you have good evidence, otherwise back off and use bigram, otherwise unigram
in practice, interpolation works better than backoff
interpolation
mix unigram, bigram, trigram
in practice, interpolation works better than backoff
Add-one (Laplace) Smoothing
how do we deal with bigrams that have zero probability?
-> ok for text classification, not for language modeling
how do we deal with bigrams that have zero probability?
simplest idea: Add-one (Laplace) smoothing
smoothing
Methods for assigning probability mass to rare events
backoff
use trigram if you have good evidence, otherwise back off and use bigram, otherwise unigram
interpolation
mix unigram, bigram, trigram
log-likelihood
tbd
dot-product
tbd
Kneser-Ney smoothing
tbd
Open vs Closed classes
tbd
generative vs discriminative model
- generative: models the joint probability P(x, y) of the inputs and the labels
- discriminative: models the conditional probability P(y | x)
- discriminative models are usually harder to train (we have to solve an optimization problem, e.g. with gradient descent); with generative models we usually just count
log-linear model
- a conditional probability model
- sometimes also called logistic regression model
- a discriminative model
- they have a feature function phi of x,y
- a weight vector
1) model: … cf 03-19
2) learning: argmax of w
3) prediction: argmax of y
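A minimal sketch of the three parts above (phi is assumed to return a dict of feature name -> value, and w is a dict of weights; both representations are illustrative, not the course's exact notation):
```python
import math

def loglinear_prob(x, y, labels, phi, w):
    """Model: P(y | x) = exp(w . phi(x, y)) / sum_y' exp(w . phi(x, y'))."""
    def dot(label):
        return sum(w.get(f, 0.0) * v for f, v in phi(x, label).items())
    scores = {label: dot(label) for label in labels}
    Z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[y]) / Z

def predict(x, labels, phi, w):
    """Prediction: argmax_y w . phi(x, y) (the normalizer cancels out)."""
    return max(labels, key=lambda y: sum(w.get(f, 0.0) * v for f, v in phi(x, y).items()))
```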
Sequence Methods
e.g. HMM -> a generative model (we model the joint probability of input and output)
named entity recognition (NER)
a sequence labeling task
e.g. twoDigitNum, allCaps, lowercase
see 03-24
Part of speech tags
- t
a sequence of categories
e.g. verb noun conj.verb adverb
transition & emission in HMM?
transition: y_i -> y_i+1
emission: y_i -> x_i
transition probabilities
-> HMM: Pr(y_i | y_i-1), i.e. the probability of the next hidden state/tag given the current one
two sequences in HMM
• In a hidden Markov model there are two sequences. One is a Markov chain (the y_i’s) and one is emitted from the Markov chain (x_i’s)
- sequence: Markov chain (the labels i.e. y_i’s)
- sequence: emission of Markov chain (the emitted x_i’s)
joint probability
A joint probability is a statistical measure where the likelihood of two events occurring together and at the same point in time are calculated. Joint probability is the probability of event Y occurring at the same time event X occurs.
HMM
An HMM is a probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences, whatever), they compute a probability distribution over possible sequences of labels
and choose the best label sequence.
- > with HMM we model the joint probability of the input and the output (we also model the probability of seeing a certain sequence of words (input), not only tags)
- > HMM makes strong assumptions
e(x_i,y_i)
emission probability: prob of word x_i given category y_i
smoothed estimation (trigram HMM)
1) for transition probs: smooth the transition parameters just like in n-gram models, using lambda_1,2,3
2) for emission prob: usually pseudowords are used
- use training corpus to do different counts (linear interpolation)
HMM: Smoothing with Pseudowords
finesse the problem of low-frequency words, or words never seen in training, by closing the vocabulary: map those low-frequency words to a much smaller, finite set of perhaps 20 “word classes” which preserve information about their spelling
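A minimal sketch of such a mapping (the first three classes follow the examples on the NER card above; the last two are illustrative additions):
```python
import re

def pseudoword(word):
    """Map a low-frequency/unseen word to a small word class that keeps spelling info."""
    if re.fullmatch(r"\d{2}", word):
        return "twoDigitNum"
    if word.isupper():
        return "allCaps"
    if word.islower():
        return "lowercase"
    if word and word[0].isupper():
        return "initCap"   # illustrative extra class
    return "other"         # illustrative fallback
```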
HMM: Inference
HMM: training, smoothing
next: inference
problem statement: for a given input sentence, find the most likely sequence of tags (searching over all possible tag sequences)
-> argmax over sequences of y’s
-> but the number of possible tag sequences is exponential in the sentence length
=> cf. Viterbi algorithm -> tractable solution
Viterbi algorithm
a dynamic programming algorithm for finding the most likely sequence of hidden states (sequence of tags y’s) that results in a sequence of observed events
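A minimal sketch for a bigram HMM (transition and emission probabilities are assumed to be plain dicts, "<s>" is an assumed start symbol, and real implementations would work in log space to avoid underflow):
```python
def viterbi(words, tags, q, e):
    """Most likely tag sequence: q[(prev_tag, tag)] = transition prob, e[(word, tag)] = emission prob."""
    # V[i][t] = (best score of a tag sequence ending in tag t at position i, backpointer)
    V = [{t: (q.get(("<s>", t), 0.0) * e.get((words[0], t), 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        V.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: V[i - 1][p][0] * q.get((p, t), 0.0))
            score = (V[i - 1][best_prev][0] * q.get((best_prev, t), 0.0)
                     * e.get((words[i], t), 0.0))
            V[i][t] = (score, best_prev)
    # follow backpointers from the best final tag
    last = max(tags, key=lambda t: V[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(V[i][path[-1]][1])
    return list(reversed(path))
```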
Dynamic programming
dynamic programming (also known as dynamic optimization) is a method for solving a complex problem by breaking it down into a collection of simpler subproblems, solving each of those subproblems just once, and storing their solutions. The next time the same subproblem occurs, instead of recomputing its solution, one simply looks up the previously computed solution, thereby saving computation time at the expense of a (hopefully) modest expenditure in storage space
HMM makes strong assumptions
- > about how the different predictions in the sequence relate to one another:
1) The word (emitted value) x_i is independent of the rest of the variables given y_i, i.e. for each x_i it is sufficient to look at y_i; we don’t need to condition on the full sequence of labels -> ideally we would want to assume more dependence between the words
=> BUT: Words are dependent on other words in the sentence, even given their POS
=> we want to assume more dependence between the words!
2) a POS tag is independent of past tags given the previous tag
=> BUT: POS tags have dependencies that cannot be bounded in distance
feature in log-linear model
- > the features in a log-linear model serve the same function as the emission probabilities in HMM
- > the higher the feature value (weighted by w), the higher the probability of a certain y
prediction
= tagging
- model
- inference: given the observed variables, how do we find the most likely y (cf. Viterbi algorithm)
- learning: how to find the parameters
first-order model
t
multinomial
you have a chart / lookup table for each word and label and for each pair of labels (?)
log linear model
you have a feature function
what’s the difference between a feature-based classifier and a classifier which just has a multinomial?
feature-based classifier: even if you have completely different sequences of words, if they share similar features there is a connection between them through those features -> they influence each other even though they are not the same
multinomial: you have a table of labels/words whose entries are independent; it only tells you that if you see this word, it has this value -> you don’t model any relation between words
feature vector
a feature vector consists of, say, m features, many of which can be binary features or “indicator functions” that ask some yes/no question
gradient ascent
hill-climbing technique for concave functions (for which any local maximum is also the global maximum)
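A minimal sketch of plain gradient ascent (the fixed learning rate and step count are illustrative; grad(w) is assumed to return the gradient as a list):
```python
def gradient_ascent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step in the direction of the gradient to climb a concave objective."""
    w = list(w0)
    for _ in range(steps):
        w = [wi + lr * gi for wi, gi in zip(w, grad(w))]
    return w

# e.g. maximize f(w) = -(w - 3)^2, whose gradient is -2(w - 3):
# gradient_ascent(lambda w: [-2 * (w[0] - 3)], [0.0])  ->  approx. [3.0]
```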
named entity recognition (NER)
tbd
co-reference resolution
t
relation extraction
t
apposition
t
ontology
t
supervision
t
precision, recall
precision: % of selected items that are correct
recall: % of correct items that are selected (complementary to precision)
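A minimal sketch, treating the selected and correct items as sets of ids (F-score added since it appears on a later card):
```python
def precision_recall_f1(selected, correct):
    tp = len(selected & correct)                       # items both selected and correct
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```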
open information extraction system
should be able to identify the ontology (i.e. the tables) as well as populate the database (it does not have a predefined ontology, e.g. through Wikipedia infoboxes; it only has text!)
how does it define relations? just as strings/pieces of text
how to cluster semantically equivalent relations? cf. 11-16: three different relation instances that nonetheless represent the same relation
syntax does not align with semantics!
a first step: Semantic role labeling (SRL)
idea: take a sentence, split it into events and propositions, and be able to identify that a relation is expressed and that it is very similar to other instances of the same type of relation
- > idea: the roles are invariant across different sentences and stay the same for equivalent values, i.e. the role is the same regardless of what syntactic position it takes - it is a kind of abstraction over sentence-level paraphrasing
Semantic Roles cf. 11-24
- An underlying relationship an argument has with the main verb in a clause
- This relation is (ideally) invariant to paraphrases
- do not necessarily align with syntactic roles
3 main representation schemes for SRL
1) FrameNet scheme
judgment frame; roles: judge, evaluee, reason; verb “blame” invokes the judgment frame
-> very accurate, but coverage is very limited (only a few thousand frames)
- semantic roles are defined by the frame
- a frame does a clustering of verbs (blame, hail, etc have the same frame)
- is more difficult to parse
- frames are manually defined
- a big problem: frame coverage
2) PropBank scheme
idea: semantic roles are more fine-grained than in FrameNet
- idea: try to abstract over the different instantiations of the same verb
- more frequently used than FrameNet; PropBank is more agnostic and does not generalize over verbs at all
- more fine-grained: e.g. “She blames…” and “She praises…” would be put in the same frame by FrameNet but use 2 different semantic roles in PropBank
3) VerbNet scheme
- takes max. coarse-grained approach
- 22 roles only
-> very generalized, sometimes very difficult to pick the right role because very abstractly defined
- the crux: can the same set of roles be used across different verb classes?
semantic role labeling (SRL)
you want to abstract away from the syntax to directly represent the semantic relations, e.g. FrameNet (a big dictionary of frames); several verbs can invoke the same frame (e.g. the judgment frame); frames are manually defined
- SRL is a simplified form of semantic representation
different type of dictionary: semantic fields
clustering of words that share a meaning component
syntactic parse
t
error propagation
usually when you separate one prediction task into 2 stages, a situation called “error propagation” arises: if you make a mistake at the first stage, then no matter what you do you are already doomed when you get to the second step
-> that’s a reason why people don’t use pipelines
sparsity
tbd
classifier training vs feature?
t
precision, recall, F-score
t
NLP
- information retrieval
- machine translation
- question answering
- summarization
information extraction
map text to database entries
Ambiguity
- key challenge
- Language has much underlying structure, which is ambiguously expressed
- Ambiguity is at all levels:
- > words can sound the same but not mean the same (bank vs. bank)
- > Sentences may look the same, but have different meanings
- Alongside ambiguity, there is also redundancy: many different ways to say the same thing
morpheme
is the smallest grammatical unit in a language. In other words, it is the smallest meaningful unit of a language. The linguistics field of study dedicated to morphemes is called morphology. A morpheme is not identical to a word, and the principal difference between the two is that a morpheme may or may not stand alone, whereas a word, by definition, is freestanding. When it stands by itself, it is considered as a root because it has a meaning of its own (e.g. the morpheme cat) and when it depends on another morpheme to express an idea, it is an affix because it has a grammatical function (e.g. the –s in cats to indicate that it is plural). Every word comprises one or more morphemes.
Shannon’s game (Language Modeling Problem)
given a sequence of words, what’s the distribution of the next word
N-gram
An N-gram is a sequence of N words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”. We’ll see how to use N-gram models to estimate the probability of the last word of an N-gram given the previous words, and also to assign probabilities to entire sequences.
=> N-grams work well only for word prediction if the test corpus looks like the training corpus!
unigram model
=> simplest Markov model: unigram model -> simply estimate the probability of a sequence of words as the product of the probabilities of the individual words (words are independent in this model)
bigram model
- Condition on previous single word
- slightly more intelligent than unigram model
extend further to trigram, 4-gram, 5-gram
=> in general, insufficient model of language because language has long-distance dependencies
markov assumption
a simplifying assumption (because we cannot count how many times every long sequence of words occurs):
-> the next state only depends on the current state and independent of previous history
-> P(the | its water is so transparent that) =~ P(the | that)
or
P(the | its water is so transparent that) =~ P(the | transparent that)
more formally:
P(w_i | w_1 w_2 … w_i-1) =~ P(w_i | w_i-k … w_i-1)
=> estimate by just the last few words
=> simplest Markov model: unigram model -> simply estimate prob of a sequence of words by product of probs by individual words
Markov Model
The assumption that the probability of a word depends only on the previous word is called a Markov assumption. => Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past => Markov models assume the probability of a sequence can be given as a product of the transition probabilities
estimate probability of a sentence using bigram
P(<s> I want english food </s>) = P(I | <s>) x P(want | I) x P(english | want) x P(food | english) x P(</s> | food)
why log space
- to avoid arithmetic underflow
- adding is faster than multiplying
log(p_1 x p_2 x p_3) = log(p_1) + log(p_2) + log(p_3)
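A minimal sketch (the token probabilities are assumed to come from some model):
```python
import math

def sentence_logprob(token_probs):
    """log(p_1 * ... * p_n) = sum of log p_i; sums don't underflow the way long products do."""
    return sum(math.log(p) for p in token_probs)
```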
a held-out corpus
A held-out corpus is an additional training corpus that we use to set hyperparameters like these λ values, by choosing the λ values that maximize the likelihood of the held-out corpus
how to deal with bigrams with zero probability?
- simplest idea: add-one smoothing
text classification
t
Discounting Methods
-> Maximum likelihood estimates tend to be too high for low count items
=> One simple way to handle this is to discount all counts by a constant term: (say, 0.5)
But what do we do with the missing probability mass? -> collect the mass freed by the discounts and redistribute it to bigrams with count 0
Kneser-Ney Smoothing
cf P_continuation
classification: 3 main ideas
- Representation in a feature space
- Scoring by linear functions
- Learning by optimizing
joint vs. conditional probability
tbd
con·di·tion·al prob·a·bil·i·ty
the probability of an event ( A ), given that another ( B ) has already occurred
Naive Bayes
- Simplest Generative Model
Naive Bayes is a probabilistic classifier, meaning that for a document d, out of all classes c ∈ C the classifier returns the class c^ which has the maximum posterior probability given the document
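A minimal sketch of the decision rule (log priors and log likelihoods are assumed to be precomputed dicts; the tiny floor for unseen words stands in for proper smoothing such as add-one):
```python
import math

def naive_bayes_predict(doc_tokens, classes, log_prior, log_likelihood):
    """Return argmax_c [ log P(c) + sum_w log P(w | c) ] (bag of words: positions ignored)."""
    def score(c):
        return log_prior[c] + sum(log_likelihood[c].get(w, math.log(1e-10)) for w in doc_tokens)
    return max(classes, key=score)
```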
feature space
t
bag of words model
the bag of words assumption: the position of the words is ignored
=> we make use of the frequency of each word
supervised learning
for the training data, you know the labels
feature function
t
likelihood
or log likelihood
the prob of seeing the labels given the inputs
regularization
to combat overfitting with log-linear model, we add a regularization term
-> idea: penalize large weights w, especially for features that appear rarely
neural net
non-linear, non-convex architecture
vs.
log-linear models: linear, convex
feed forward neural network
t
neural network
t
backpropagation algorithm
t
one-hot vector
cf. Feed-forward Neural Network
softmax layer
- > a standard way of forming a multi-class classifier
- Softmax layers turn vector outputs into a probability distribution
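A minimal sketch of a softmax layer's final step (subtracting the max is a standard numerical-stability trick):
```python
import math

def softmax(scores):
    """Turn a vector of raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [x / Z for x in exps]
```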
morphological paradigms
t
POS tagging problem
The POS tagging problem is to determine the (single) POS tag for a particular word token (instance)
main sources of information for POS tagging
1. Knowledge of word probabilities (e.g. using “man” as a verb, as in “to man”, is rare)
2. Knowledge of neighboring words
Sequence Labeling as Classification
aim: predict category of each word
Approach 1: classify each token independently, but use as input features information about the surrounding tokens (sliding window), without the categories of neighboring words
Approach 2: using outputs as inputs: better input features are usually the categories of the surrounding tokens (not easy to integrate), but these are not available yet -> use the category of either the preceding or succeeding tokens by going forward or backward and using the previous output
- > forward classification
- > backward classification
probabilistic sequence model
- build a model that gives you a prob for the entire sequence
- allows integrating uncertainty over multiple, interdependent classifications and collectively determine the most likely global assignment
=> classic solution: Hidden Markov Model (HMM)
-> a global model: means it maximizes over sequences rather than local words only i.e. gives a prob for the entire sequence of words
hidden markov model (HMM)
an HMM allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model
HMM - why hidden?
we didn’t observe part-of-speech tags in the world; we saw words and had to infer the correct tags from the word sequence. We call the part-of-speech tags hidden
because they are not observed
Trigram HMM
t
conditional independence assumption
given x and y_i-1, y_i is conditionally independent of everything that preceded it
(simliar to Markov assumptions)
-> in MEMMs, unlike HMMs, the words are always conditioned on; there is no assumption regarding the words
words are conditionally independent of each other given the class
Markov chain
A Markov chain is a special case of a weighted automaton in which weights are probabilities (the
probabilities on all arcs leaving a node must sum to 1) and in which the input sequence uniquely determines which states the automaton will go through
-> A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world
first-order Markov chain
A Markov chain embodies an important assumption about these probabilities. In a first-order Markov chain, the probability of a particular state depends only on the previous state
discriminative models
- allow us to model more dependencies between different words (than e.g. an HMM) and also allow the use of feature representations
Maximum Entropy Markov Models (MEMM)
- here, I assume that p(y_i|y_i-1,x_1….x_m) is a log-linear model
- is the product of transitions
- unlike HMM, here I don’t assume anything about the independence of the words: I condition on the words, and there is no assumption that words are independent or conditionally independent
- 2nd assumption: the structure of each of the factors in the product forms a log-linear model
- f = feature function
- w = weight vector
- the denominator is there to normalize the expression
why exponentiate? (e.g. in log-linear model)
t
multinomial vs. feature function based model?
- multinomial: you have a chart/table/lookup-table for each word and label: coarse-grained
- feature function: fine-grained
difference:
cf. 20171119_120054[1].mp4 26:00ff
history (context: feature function)
why history?
- > y_i is conditioned on y_i-1 and all the words = this is called the history
- > so it includes everything you condition on, not just observed values but also labels (e.g. the preceding tag)
analogy MEMM and HMM (re transition and emission prob)
MEMM: with the feature function we’re trying to find relevant information for defining which y’s are more probable given everything that I condition on
- the features with word/tag pairs in the feature function f in MEMM serve the same purpose as the emission probability in HMM (of course! -> if the emission prob is high for a certain y given an x, then the feature function together with the weight vector should also yield a high probability for this y)
- some features in f correspond to transition probabilities in HMM: e.g. f_103 considers the two previous tags. You can have both trigram and bigram transition probabilities in f, you don’t have to restrict yourself to one of them
block notation
t
Ratnaparkhi’s Tagger
05-05
- f_100 corresponds to emission prob e(x_i | y_i) in HMM
- f_101 etc. are additional features that HMM could not cover
- you can use trigram, bigram, unigram model in the same feature function set
- transition feature f_103 corresponds to transition prob q(y_i | y_i-1) in HMM (trigram)
and transition feature f_104 same but bigram
and transition feature f_105 same but unigram
=> AND therefore: it corresponds to backoff model smoothing method: if f_103 does not hold then maybe f_104 holds and so forth
-> 103 would have very high weight, 104 bit less, 105 again bit less (like a backoff model)
=> you can combine any number of features! -> becomes more effective
log-linear model: when we have probabilistic classifiers, we also need to care about 2 other things aside the model?
- train it and do inference
- given training data, we want to estimate the parameters (=the weights w), and we need to know how to do inference: given weight values and x, what is the most likely y
MLE in MEMM
sl 05-05
differentiate the log likelihood of a log-linear model
sl 05-05
inference in MEMMs
pretty much the same as for HMM:
use Viterbi
sl 05-10/11
difference between inference and learning? (HMM, MEMM)
- inference (ie finding the max y’s): finding a maximum path on the grid representation
- > we have w and want to find the most likely y’s
- learning: estimating the parameters (the weights w) from training data
label bias
- problem with MEMM, HMM
-> occurs with any locally-normalized model - the model favors labels that predict the next state well over labels that predict it poorly; it will be biased towards them because of their high probability
=> your model may predict an unlikely/infrequent sequence of labels (e.g. long book names) just because their probability does not decay exponentially, even though that sequence is very rare - this can be harmful, but not always
- e.g. in POS tagging it is usually not a problem
local normalization
a normalization term as denominator makes sure that the sum of probabilities of all possible values of y_j is 1
equivalent on a grid representation: the sum of all outgoing edges of a node is 1
prediction vs inference vs learning?
inference := the process of inferring something. a conclusion reached on the basis of evidence and reasoning.
Sequence Conditional Random Fields (CRFs)
- CRFs are similar to MEMMs, but the normalization is global
- this solves the label bias: the scores of the next state given the current one need not sum up to 1
- the normalization is not done for each transition but rather the sum of all paths given the x sums up to 1
- each transition is not a probability anymore but rather a score (or potential)
- you still normalize, but over the sum of all the paths
- we assume all different paths for a sentence sum up to 1; it might be that the outgoing edges of a node do not sum up to 1
- CRF has been standard NLP model until NN came
why normalize at all?
- it makes sure that if you increase the prob of some path you will decrease the prob of others
- it serves as a regularization, because otherwise you could assign infinite scores; there would be no limit, no boundaries
- > normalization makes you work with a fixed budget, then the problem has a unique maximum that you can find
how do we train w for CRF?
- how to estimate/train w, how to find the most likely weights?
- no closed-form solution
- write conditional log likelihood and maximize it
- > compute the gradient (05-17), but it has an intractable subtrahend term; transform it so that it contains a term Pr(…) that is polynomial (quadratic in the number of tags), and compute this term Pr(…) using dynamic programming for all values y_j and y_j-1 and over all j’s
gradient computation CRF
t
forward-backward algorithm (CRF)
- > similar to Viterbi but instead of max I take the sum
- M_i: unnormalized transition weight (not a probability!) to transition from y to y’
alpha: is the sum over all paths of length i-1 that end with y (forward term)
beta: same but start with y and go all the way to the end (backward term) - how is it helpful? I can compute it (tractable)
domain adaptation
- > e.g. trained on the Wall Street Journal, then applied to Wikipedia
- Accuracies degrade outside of domain
- e.g. try to find instances you’re most confident in
- e.g. use lexicons
Sentiment Analysis: Definition
- Sentiment analysis is the detection of attitudes
• Simplest task:
Is the attitude of this text positive or negative?
• More complex:
Rank the attitude of this text from 1 to 5
• Advanced:
Detect the target, source, or complex attitude types
- a simple classifier:
Log-linear or Naïve Bayes classifier
Negation in Sentiment Analysis
- One practice is to add NOT_ to every word between a negation and the following punctuation (useful if you want to stick to the bag-of-words model; without it, “didn’t like” would still get a positive polarity)
- with this heuristic, instead of “like” you would have a “NOT_like”
- > but it is still a very crude solution
Sentiment Analysis as Sequence Labeling
- This construal of sentiment analysis attempts to capture the meaning of a word in context by encoding (parts of) the sentence as features
- Recently: using Recurrent Neural Networks (RNNs)
- RNNs are FSAs:
- > no longer finite in the same sense
- > the state is an embedding of the history in a continuous space
Recurrent Neural Networks
- Map from dense sequence to dense representation
- > idea RNN: save the history as a state
06-17
- R is parametrized and parameters are shared across steps
-> the parameters of R are learned; the state is the history, everything we’ve seen so far
~ representation learning
-> it obviates the need to create a feature function -> let the system find the features -> that is the NN dogma
- with RNN, you don’t do any global optimization (you compute a y and move to the next)
- RNN: no dynamic programming for inference
- > the RNN computes a classification for each word on its own
- RNN models are able to take the context (both preceding and following) into account, as well as the linear order between the words (bag of words cannot)
sequence labeling
t
Bi-RNN
- When using RNNs for sentence classification (such as sentiment analysis), it is often practical to use Bi-RNNs
- Bi-RNN: 2 RNNs, one going forward (left to right) and the other backward (right to left); the output function is a function of both states
- This allows the states associated with each word to encode relevant information from the words following them and preceding them
- difference from bag-of-words: the bag-of-words result is based only on which words occur; with a Bi-RNN it is based on each word plus a state which encodes both the words preceding it and the words following it
- > it is kind of a modified/enhanced bag-of-words
sentiment analysis with Bi-RNN
- bottom line: encode the state/output for each word and somehow aggregate them
- one simple way to do sentiment analysis (or other sentence classification) with Bi-RNNs is to average the output sequence and train a binary logistic classifier for predicting the sentiment
Neural Language Models
t
Character-level Language Models
t
what is grammar?
- e.g. the rules and principles that guide the structure of sentences in a language
why do we need grammar?
Two main motivations:
- Structural Analysis: decomposing complex units into simpler sub-structures can assist learning and generalization
- Semantic Information: grammatical structure often reflects semantic structure and distinctions
- > Grammatical analysis in NLP involves breaking the text into simpler parts and categorizing them
- > Categorization allows information to propagate from one type to another (important!) -> allow insight that plain sequences (either a word is close or distant) do not (whereas with categorization you have a tree where words can be in all sorts of configurations to one another even if they are distant)
a word
- Practical: anything that is between two spaces
- What about languages that don’t have spaces?
- Phonological words
- Grammatical words
morphology
is the study of words, how they are formed, and their relationship to other words in the same language. It analyzes the structure of words and parts of words, such as stems, root words, prefixes, and suffixes. Morphology also looks at parts of speech, intonation and stress, and the ways context can change a word’s pronunciation and meaning
clause (grammar)
- basic syntactic unit
- difficult to define
- some kind of description of some activity, state or property
- more formally: simplest kind of sentence
- a sentence can have several clauses
- simple clauses have a predicate, core arguments (usually subjects and objects), and non-core arguments (e.g., location, manner, instrument)
- > predicate = some main relation
Inter-Clause Relations (3x) (grammar)
- basic element of grammar
Three main inter-relations between clauses within a sentence:
-> complement clauses:
Clauses that serve as an argument in another clause
•“John said he will kick the ball”
-> relative clauses:
Clauses that modify an argument of another clause
•“John, who has studied the language for years, speaks German well”
-> linked clauses:
• After he had studied it for years, John could speak German well
Basic Units (grammar)
- basic element of grammar:
1) noun phrase:
- an argument of a clause is generally a noun phrase
Two major classes:
1) Proper nouns or names
2) Common nouns
- Noun phrases generally have a head: a word the phrase centers around, which determines its category/type (usually a noun)
-> a word that determines the semantic type of the whole phrase, and is an essential component of its meaning:
The man who wasn’t there
2) Syntactic Heads:
- the (main) verb is considered the head of a clause (in English at least) (take it as a convention)
- Arguments and modifiers are considered its dependents:
John ran home (“ran” is the head of the clause, “John” and “home” are its dependents)
-> Some schemes view all syntactic relations as head-dependent relations
Dependency parse
- syntax is represented as head-dependent relations between words (cf tree structure)
- Nodes are words, edges are relations
- constituents are subtrees, but the words are all the nodes
- The arguments and modifiers of a word are its children
- Edge labels mark the type of relation
- The verb (=predicate) of the main clause is the root
=> the assumption: every phrase has a head
-> this assumption allows me to build a syntactic tree where edges between words represent bi-lexical dependencies
predicate
- the thing asserted (usually the verb)
- the property that you are asserting
Phrase-based (constituency) parse
- represent the sentence as a collection of nested phrases
- > it brackets words into phrases, parent phrases and so on
- Words are at the leaves, phrases/constituents are subtrees
- Non-terminal labels represent the phrase type. This is determined by the type of the headword (e.g., Noun Phrases (NP) are phrases headed by a noun)
what information can you get from syntactic structure?
why do we want to use hierarchy, or a syntactic representation? what’s wrong with just representing sentences as sequences?
- linear structure might lead us astray: two words might be far away from each other but might have a direct dependency between them (“war horses which were bought by the police like to drink”)
- the same goes for POS tagging or sentiment analysis (e.g. negation scope)
- > this syntactic structure helps us achieve all kinds of useful things: better language modeling, better disambiguation, negation scope
- > linear structure is just an approximation
- > in order to really model what’s going on we need some kind of hierarchy
information from syntactic structure we can get?
- phrases (e.g. for machine translation)
- > use a phrase-based parse / constituency tree of the English sentence and then move subtrees around in order to get the word order of Japanese (tree transformation) (word permutation alone won’t help here)
- > phrases move around in the tree, not single words
- > same would be possible with dependency tree (then you’d move subtrees)
- extract grammatical relations
- > extract subject, object or all events from a sentence
- allows to do syntactic disambiguation
- > some are detectable directly from the syntactic tree
- a proxy to semantics
- > by knowing the syntactic tree I already know quite a bit of information and can answer semantic questions
dependency vs constituency tree
dependency:
- constituents are subtrees, but the words are all the nodes
- Verbs are heads of clauses, nouns are heads of noun phrases
constituency:
- Words are at the leaves, phrases/constituents are subtrees
- does not have the notion of a headword, it just puts brackets and has categories but does not tell you the headword
- > however, you can extract the headword based on the categories in the tree
- > basically it is possible to convert constituency trees into dependency trees (vice versa is trickier)
head
the head is the word that determines the type of the phrase
examples dependency trees (universal dependency format)
cf
08-syntax_von_pptx_1210.pdf, sl. 30,31
Graph-Based Dependency Parsing
- task of taking a piece of text and generating a dependency tree for it
- > what learning algorithm do we use?
sources of information for dependency parsing
- Bi-lexical affinities
- Dependency distance (mostly nearby words)
- Intervening material (e.g. a verb is unlikely to appear between two dependent words)
- valency (how many dependents a word has, e.g. give has two objects)
dependency parsing evaluation (evaluation measures)
- accuracy = #correct edges / #number of words
- labeled/unlabeled attachment score
Graph-based Parsing (context: dependency tree)
- a “structured prediction problem”
- MST Parser: minimum spanning tree (same algorithm as maximum spanning tree)
MST parser
MST: minimum spanning tree (same algorithm as maximum spanning tree)
1) build complete directed graph (except ROOT)
- > each spanning tree rooted in ROOT is a possible parse tree
- > gives the space of possible parses
- > first assign a weight to each edge indicating how likely it is to appear in the MST
- how do I score the edges?
- find model that gives weight to edges (whether it is likely or not) based on information that tells us if this edge is relevant
- we use feature function phi (usually binary function) and a weight vector to assign scores to edges
- usually POS tagging is part of preprocessing for MST parser (then you can use POS tags in the feature function)
MST parser: inference and learning
inference: finding the maximum directed spanning tree
- > score each edge based on its features, then find the maximum spanning tree over the scored edges (Chu-Liu/Edmonds algorithm)
learning:
- use log-linear model: Pr(T) for all possible trees
- using the gradient of the log-likelihood I can find the optimal theta that maximizes the log-likelihood
- > but computing the gradient is computationally expensive (we have to sum over all possible trees)
- > so the common way to find the most likely theta is not by computing the gradient but by using the averaged perceptron algorithm
averaged/structured perceptron algorithm
- sl 08-13
- > a (supervised) learning algorithm! (for learning theta)
- > the learned theta is used in Pr(T)
- > Pr(T) is used to find the (best/most likely) dependency tree for a sentence
- instead of taking the expected value, take the maximum value/tree (done by inference: you compute T’, the maximum spanning tree, using inference; T_i is the correct tree from the training data)
- learning of theta is done as follows: take the sum over all the edges in the correct tree (gold standard) minus the sum over all the edges in the predicted tree
higher-order features (graph-based parsing)
- normally features are based on single edge alone
- higher-order features are types of features that cannot be encoded on edge level alone
- MST inference works well when the model factorizes over edges
e.g.
sibling relations:
-> features defined over two edges that have the same parent
grandparent features:
-> features defined over two edges, where one’s child is the parent of the other
valency:
-> the number of children a node has. Defined over all edges sharing a parent
=> formally, the score function factorizes over sets of edges
problem: you cannot run the MST algorithm anymore
- > if you use a scoring function over edge pairs or triplets, you cannot run exact inference anymore (it becomes intractable), but approximate inference methods exist
transition-based parsers
- In transition-based parsing, the problem is decomposed to finding a sequence of transitions
- we don’t assume any grammar or context in transition based parsing
arc-standard parsers
- you have a buffer and a stack and a transition set, you build edges and nodes
- transition set T: SHIFT, LEFT-ARC_l, RIGHT-ARC_l
- can only parse projective trees
- > straightforward way to define a transition system
- bottom-up approach
arc-eager transition system
- you have a buffer and a stack and a transition set
- more incremental
~ it is more eager to create arcs
- transition set T: SHIFT, LEFT-ARC_l, RIGHT-ARC_l, REDUCE
- complexity at most 2n
- Builds left-dependents bottom-up and right-dependents top-down
- arc-standard has to build a whole subtree before it can attach it; arc-eager does not
- we don’t know yet how to choose the next transition
- arc-eager does top-down and bottom-up parsing
projectivity
- intuitively: a tree in which there are no crossing arcs
- > i.e. all sub-trees are sub-strings
- projective trees are much easier to handle (crossing edges make parsing much more difficult)
transition systems
- Transitions operate on the parser configuration (or state):
c = (E,B,A)
A transition system is defined as:
S = (C, T , c_s , C_t )
- transition sequence
- > we move from one configuration to the next by applying a transition
- transition: a function from configuration to configuration
greedy-transition based parsing
- want to find the most probable sequence of transitions
- in greedy parsing, instead of modeling this joint probability (the whole sequence 1…m that maximizes this product, sl 09-21), we make the assumption that it’s enough to choose the best transition each time given what we have so far (you make that decision and can’t change it - that’s why it’s greedy)
- a score s(t, c) estimates this probability where
t= a candidate transition
c= current configuration
inference in greedy-transition based parsing?
argmax_{t_1…t_m} PI_{i=1}^{m} P(t_i | c_i-1)
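A minimal sketch of the greedy loop (the configuration object and the helper callbacks legal_transitions, apply_transition, score and is_terminal are placeholders for an arc-standard or arc-eager system):
```python
def greedy_parse(config, legal_transitions, apply_transition, score, is_terminal):
    """At each configuration, commit to the highest-scoring legal transition (no backtracking)."""
    while not is_terminal(config):
        t = max(legal_transitions(config), key=lambda cand: score(cand, config))
        config = apply_transition(t, config)
    return config   # the final configuration holds the predicted arcs
```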
oracle for arc-standard/arc-eager
- a rule-based algorithm, does not need training
- it gets as input a correct tree
feedforward vs recurrent neural networks
t
transition classifiers (kinds of score functions)
- linear transition classifier
- NN transition classifier
- Bi-RNN encoder
error propagation in transition-based parsing
- once a wrong transition is picked, recovery is very difficult since parsing is greedy
beam search algorithm
- compromise between greedy and exhaustive search
- k-beam
why is parsing hard?
- An exponential number of options per sentence
- Ambiguity
- Many many rules to learn (exceptions)
- Language speakers have no explicit knowledge of the rules of grammar
Constituency Parsing and Context-free Grammars (CFGs)
- you get annotated trees (e.g. from a treebank) and you derive the parsing rules from them
Probabilistic CFG (PCFG)
- not a deterministic parse
- think of it as a multinomial (each symbol has 1-n possible derivations)
- each tree has a probability
- independence assumption: the problem is that the independence assumptions are very strong: each production is independent of any other production
e.g.: all clauses have the same distribution
A sequence of rules is the same as a parse tree
- the two representations are equivalent
what’s the idea in context free grammar?
- rules are context-free: they ignore context
- each rule is independent of the others
Grammar-based vs. Discriminative Approaches to Parsing
t
model - learning - inference
t
constituency parsing problem
you want to find the tree with the highest score
Chomsky Normal Form (CNF)
- A common binarization is the Chomsky normal form, a grammar where each production is either of the form A -> B C or A -> w, where A, B and C are non-terminals and w is a terminal
- > often useful to treat all CFG rules as binary
- > every CFG grammar can be converted to a CNF that generates the same sentences
Cocke-Kasami-Younger (CKY) Constituency Parsing (algorithm)
- input assumed to be in CNF
- > a recognizer that will only determine if a sentence is in the language
- > It is simple to extend it into a parser that also constructs a parse tree
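A minimal sketch of the CKY recognizer (the grammar representation, a dict of lexical rules plus a list of binary rules, is an assumption made for illustration):
```python
def cky_recognize(words, lex_rules, bin_rules, start="S"):
    """lex_rules: word -> set of non-terminals (A -> w); bin_rules: list of (A, B, C) for A -> B C."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lex_rules.get(w, set()))
    for span in range(2, n + 1):                 # widths of constituents
        for i in range(n - span + 1):            # left boundary
            j = i + span
            for k in range(i + 1, j):            # split point
                for A, B, C in bin_rules:
                    if B in chart[i][k] and C in chart[k][j]:
                        chart[i][j].add(A)
    return start in chart[0][n]                  # sentence is in the language iff S spans it all
```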
Head Lexicalization
- to overcome attachment ambiguity
- trees are only different in terms of different rules applied
-> nothing relates the words its only unlexicalized rules
=> a deficiency of the vanilla PCFG
-> head lexicalization is a standard way to overcome this
-> assume each constituent has a headword
-> add headwords to each phrasal node
-> now you have more information at each rule node because the headword is the most informative word of that subtree/constituent (*) -> but I’ll need to retrain my PCFG and it will cause huge sparsity problems: instead of only, say, 15 non-terminals, I now have 15 times the size of the vocabulary
-> has horrible sparsity
=> what to do now? use eg the Collins parser
bottom line: Lexicalization can resolve ambiguities and lead to huge boost in performance
- the cost is huge sparsity
(*) The head word of a phrase gives a good representation of the phrase’s structure and meaning
headword
the most informative word/element in the subtree
the Collins parser
- keep the head lexicalization
- to get over sparsity (introduced by head lexicalization), let’s make more (conditional) independence assumptions
- > each rule is independent of the other
- > each rule is not just a multinomial, it is a product of several terms
- > no dependency between different siblings
=> This model still has many parameters, but it reduces each of the terms in the product to depend on 2-3 lexicalized categories, instead of n+m+2 for a regular lexicalized PCFG (cf. 10-36)
learning and inference in constituency parsing
tbd
Unsupervised Learning
- paradigm of supervised learning: you have labeled data (you know the right answer)
- clustering: the main technique for generalization without having a label set
- we want to cluster words according to their part of speech without having annotated data
- discriminative models would not make sense: we don’t have labels (so the relation between labels and observed features is not possible to work with)
- generative model can be useful
POS Tagging: a structured prediction task
t
unsupervised POS induction
Given a large corpus of raw text, partition the words into categories with respect to their grammatical function
POS tags are context-dependent -> makes the prediction much more difficult (e.g. the word ‘back’)
-> each word will receive one POS tag only (type-level clustering), but if the most likely tag is taken, the main tag accounts for more than 90% of the probability mass
Supervised vs. (Completely) Unsupervised Learning
t
Distributional Clustering
-> each word will receive one POS tag only (type-level clustering), but if the most likely tag is taken, the main tag accounts for more than 90% of the probability mass
Unsupervised learning for Hidden Markov Models
- adjectives have similar distribution (approx. definition of a POS tag)
- nouns also
-> POS Induction through PCA over word co-occurrence Matrices
type vs token (contexts POS)
tokens are instances of type
The Prototype Tagger
- another type-level POS induction algorithm
-
structured prediction
we label the entire sequence
1-to-1 accuracy measurement
each class is assigned exactly one cluster
m-to-1 accuracy measurement
a class may be assigned to multiple clusters
Markov Chain
- simplest Markov model
- It models the state of a system with a random variable that changes through time
conditional likelihood
In discriminative models we often talk about the conditional likelihood