Part-of-Speech Tagging Flashcards
Tagset
Set of all POS tags used by some model.
PoS tags: open class vs closed class
- Open class tags: nouns, verbs, adjectives, and adverbs. New words entering the language are usually added to these classes.
- Closed class tags: determiners, prepositions, conjunctions, etc. Closed word classes rarely receive new members.
PoS tagging task intro
The task of part-of-speech tagging involves the assignment of the proper (unique) POS tag to each word in an input sentence.
- POS tagging must be done in the context of an entire sentence, on the basis of the grammatical relationship of a word with its neighboring words
PoS tagging problem definition
The input is a sequence x1, x2, . . . , xn of (tokenized) words, and the output is a sequence y1, y2, . . . , yn of tags, with yi the tag assigned to xi.
POS tagging is therefore a structured prediction task, not a classification task.
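As a concrete illustration (a minimal sketch; the sentence and the Penn-Treebank-style tags are an assumed example, not from the slides):

```python
# Input: a tokenized sentence; output: one tag per token (same length).
words = ["Book", "that", "flight"]   # x1 ... xn
tags  = ["VB",   "DT",   "NN"]       # y1 ... yn, with yi the tag assigned to xi

# Structured prediction: the whole tag sequence is predicted jointly,
# because the best tag for "Book" (verb vs. noun) depends on its context.
assert len(words) == len(tags)
```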
Evaluation for PoS tagging
The accuracy of a part-of-speech tagger is the percentage of test set tags that match human gold labels.
One can compare a model with:
- Human ceiling: the highest accuracy achievable by a human annotator manually assigning the correct POS tags to the words of a natural-language text (about 97% accuracy on average).
- Most-Frequent-Class baseline: assign each token the class it occurred in most often in the training set. This baseline has an accuracy of about 92%.
One can visualize model performance with:
Qualitative evaluation: generate a confusion matrix on the development set. This is a record of how often a word with gold tag yi was mistagged as yj.
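A minimal sketch of how accuracy and a confusion matrix could be computed from gold and predicted tags (pure Python; the toy data and variable names are assumptions):

```python
from collections import Counter

# Gold and predicted tags for a toy test set, flattened token by token.
gold = ["DT", "NN", "VBZ", "JJ", "NN"]
pred = ["DT", "NN", "VBZ", "NN", "NN"]

# Accuracy: percentage of predicted tags that match the human gold labels.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Confusion matrix: how often a token with gold tag yi was (mis)tagged as yj.
confusion = Counter(zip(gold, pred))

print(f"accuracy = {accuracy:.2%}")      # 80.00%
print(confusion[("JJ", "NN")])           # count of JJ tokens mistagged as NN
```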
Hidden Markov Model
An HMM is a generative model: it models how a class could generate the input data, so the model can also be used to generate examples.
Find the HMM formula for computing the most probable sequence of tags given a sentence
Let w1:n = w1w2 … wn be an input sequence of words.
The goal of POS tagging is to choose the most probable tag sequence among all tag sequences t1:n = t1t2 … tn for w1:n
1. Write the formula (hint: argmax over P(t1:n|w1:n)):
- P(t1:n|w1:n) is referred to as the posterior probability and can be difficult to model/compute.
2. Break down the posterior probability in the Hidden Markov model:
- Write the formula (use Bayes' rule)
- The term P(t1:n) is referred to as the prior or marginal probability. The term P(w1:n|t1:n) is the likelihood of the words given the tags.
3. Markov assumptions:
- the probability of a word depends only on its own POS tag and is independent of neighboring words and tags (write the formula). The term P(wi|ti) is referred to as the emission probability.
- the probability of a tag depends only on the previous tag, rather than the entire tag sequence (write the formula). The term P(ti|ti-1) is referred to as the transition probability.
4. PUT IT ALL TOGETHER (write the full formula)
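For reference, one standard way to write the formulas the four steps above ask for (the usual HMM tagging derivation):

```latex
% 1. Most probable tag sequence (posterior probability):
\hat{t}_{1:n} = \operatorname*{argmax}_{t_{1:n}} P(t_{1:n} \mid w_{1:n})

% 2. Bayes' rule (the denominator P(w_{1:n}) does not affect the argmax):
\hat{t}_{1:n} = \operatorname*{argmax}_{t_{1:n}}
    \frac{P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})}{P(w_{1:n})}
  = \operatorname*{argmax}_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})

% 3. Markov assumptions (emission and transition):
P(w_{1:n} \mid t_{1:n}) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
\qquad
P(t_{1:n}) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})

% 4. Putting it all together:
\hat{t}_{1:n} = \operatorname*{argmax}_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```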
PoS tagging: Probability estimation
Assume a tagged corpus, where each word has been tagged with its gold label. We implement supervised learning using the relative frequency estimator.
- For the transition probability we obtain: P(ti|ti-1) = C(ti-1, ti) / C(ti-1)
- Similarly, for the emission probability we obtain: P(wi|ti) = C(ti, wi) / C(ti)
Examples at slide 26 pdf 7…
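A minimal sketch of the relative frequency estimation from a tagged corpus (the toy corpus and function names are assumptions, not from the slides):

```python
from collections import defaultdict

# Toy tagged corpus: a list of sentences, each a list of (word, tag) pairs.
corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

transition_counts = defaultdict(int)   # C(t_{i-1}, t_i)
emission_counts = defaultdict(int)     # C(t_i, w_i)
tag_counts = defaultdict(int)          # C(t_i)

for sentence in corpus:
    prev = "<s>"                       # start-of-sentence pseudo-tag
    tag_counts[prev] += 1
    for word, tag in sentence:
        transition_counts[(prev, tag)] += 1
        emission_counts[(tag, word)] += 1
        tag_counts[tag] += 1
        prev = tag

# Relative frequency (maximum likelihood) estimates:
def p_transition(tag, prev_tag):
    return transition_counts[(prev_tag, tag)] / tag_counts[prev_tag]

def p_emission(word, tag):
    return emission_counts[(tag, word)] / tag_counts[tag]

print(p_transition("NN", "DT"))   # C(DT, NN) / C(DT) = 2/2 = 1.0
print(p_emission("dog", "NN"))    # C(NN, dog) / C(NN) = 1/2 = 0.5
```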
Hidden Markov model as automata, structure, how it works and components
We can formally define an HMM as a special type of probabilistic finite state automaton which generates sentences (instead of accepting sentences).
STRUCTURE:
- The states represent ‘hidden’ information, that is, the POS tags which are not observed.
- The transition function is defined according to the transition probabilities.
HOW IT WORKS:
Each state generates a word according to the emission probabilities. The generated output is an observable word sequence.
COMPONENTS:
- finite set of output symbols V
- finite set of states Q, with initial state q0 and final state qf
- transition probabilities α(q, q’) (for each state q, the sum over all q’ is 1)
- emission probabilities β(q, a) (for each state q, the sum over all a is 1)
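A minimal sketch of how such an automaton generates a word sequence (the toy states, vocabulary, and probabilities are assumptions):

```python
import random

# Toy HMM: states Q = {DT, NN} plus q0/qf, output symbols V = {"the", "dog", "cat"}.
transitions = {                # α(q, q'): each row sums to 1
    "q0": {"DT": 1.0},
    "DT": {"NN": 1.0},
    "NN": {"qf": 1.0},
}
emissions = {                  # β(q, a): each row sums to 1
    "DT": {"the": 1.0},
    "NN": {"dog": 0.5, "cat": 0.5},
}

def generate():
    """Walk from q0 to qf, emitting one word per visited (hidden) state."""
    state, words = "q0", []
    while True:
        nxt = random.choices(list(transitions[state]),
                             weights=transitions[state].values())[0]
        if nxt == "qf":
            return words
        words.append(random.choices(list(emissions[nxt]),
                                    weights=emissions[nxt].values())[0])
        state = nxt

print(generate())   # e.g. ['the', 'dog'] -- the observable word sequence
```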
Viterbi algorithm for decoding
Given a sequence of observations, find the most probable sequence of states/tags.
Let w = w1 … wn be the input sequence.
In what follows:
- q denotes a state/tag of the HMM
- i denotes an input position, 0 <= i <= n+1
We use a two-dimensional table viterbi[q,i], denoting the probability of the best path that reaches state q after scanning w1:i.
We use a two-dimensional table backpointer[q,i] for retrieving the best path.
Initialisation step (for all q):
- viterbi[q,1] = α(q0, q) β(q, w1)
- backpointer[q,1] = 0
Recursive step (for all i=2, …, n and for all q):
- viterbi[q,i] = max_q’ viterbi[q’,i-1] α(q’, q) β(q, wi)
- backpointer[q,i] = argmax_q’ viterbi[q’,i-1] α(q’, q) β(q, wi)
Termination step:
- viterbi[qf,n+1] = max_q’ viterbi[q’,n] α(q’, qf)
- backpointer[qf,n+1] = argmax_q’ viterbi[q’,n] α(q’, qf)
After execution of the algorithm, viterbi[qf, n+1] is the probability of the most likely sequence of tags for w.
The sequence of tags q1, . . . , qn can be reconstructed starting with backpointer[qf, n+1] and following the backpointers.
https://www.youtube.com/watch?v=IqXdjdOgXPM
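A minimal sketch of the Viterbi recurrence above in Python (the toy HMM and variable names are assumptions; probabilities are multiplied directly, so a real implementation would typically use log probabilities):

```python
def viterbi(words, states, alpha, beta, q0="q0", qf="qf"):
    """alpha[q][q2] = transition probability, beta[q][w] = emission probability."""
    n = len(words)
    V = [{} for _ in range(n)]        # V[i-1][q]    = viterbi[q, i]
    back = [{} for _ in range(n)]     # back[i-1][q] = backpointer[q, i]

    # Initialisation step (i = 1)
    for q in states:
        V[0][q] = alpha[q0].get(q, 0.0) * beta[q].get(words[0], 0.0)
        back[0][q] = q0

    # Recursive step (i = 2 .. n)
    for i in range(1, n):
        for q in states:
            best_prev, best_score = None, 0.0
            for q2 in states:
                score = V[i-1][q2] * alpha[q2].get(q, 0.0) * beta[q].get(words[i], 0.0)
                if score >= best_score:
                    best_prev, best_score = q2, score
            V[i][q], back[i][q] = best_score, best_prev

    # Termination step (i = n + 1): transition into the final state qf
    best_last = max(states, key=lambda q: V[n-1][q] * alpha[q].get(qf, 0.0))
    best_prob = V[n-1][best_last] * alpha[best_last].get(qf, 0.0)

    # Follow the backpointers to reconstruct the tag sequence
    tags = [best_last]
    for i in range(n - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags)), best_prob

# Toy example (assumed probabilities):
states = ["DT", "NN", "VB"]
alpha = {"q0": {"DT": 0.6, "VB": 0.4},
         "DT": {"NN": 1.0},
         "NN": {"qf": 1.0},
         "VB": {"DT": 0.5, "qf": 0.5}}
beta = {"DT": {"the": 1.0},
        "NN": {"flight": 1.0},
        "VB": {"book": 1.0}}

print(viterbi(["book", "the", "flight"], states, alpha, beta))
# (['VB', 'DT', 'NN'], 0.2)
```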
Conditional random fields (CRF)
Conditional random fields (CRF) are discriminative sequence models based on log-linear models. Discriminative classifiers learn what features from the input are most useful to discriminate between the different possible classes.
The model can only distinguish the classes, perhaps without learning much about them: it is unable to generate observations/examples.
- Let x1:n = x1, x2, … , xn be the input word sequence, and let Y(x1:n) be the set of all possible tag sequences y1:n = y1, y2, … , yn for x1:n.
SOLVES THE PROBLEM:
ŷ1:n = argmax_{y1:n in Y(x1:n)} P(y1:n|x1:n)
HMM vs CRFs
- In contrast with HMMs, CRFs directly model the posterior probability
- In contrast to HMM, CRFs do not compute a probability for each tag at each time step. Instead, at each time step the CRF computes log-linear functions over a set of relevant features, and these local features are aggregated and normalised to produce a global probability.
How CRFs use global features, partition function
- Let us assume we have K global feature functions Fk(x1:n, y1:n), and weights wk for each feature.
define the probability P(y1:n|x1:n) using K global features…
The denominator is the so-called partition function Z(x1:n), a normalization term that only depends on the input sequence x1:n.
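One standard way to write the probability the card asks for, with the partition function made explicit:

```latex
P(y_{1:n} \mid x_{1:n})
  = \frac{\exp\!\Big(\sum_{k=1}^{K} w_k F_k(x_{1:n}, y_{1:n})\Big)}
         {\sum_{y'_{1:n} \in Y(x_{1:n})} \exp\!\Big(\sum_{k=1}^{K} w_k F_k(x_{1:n}, y'_{1:n})\Big)}

% Partition function (depends only on the input sequence):
Z(x_{1:n}) = \sum_{y'_{1:n} \in Y(x_{1:n})} \exp\!\Big(\sum_{k=1}^{K} w_k F_k(x_{1:n}, y'_{1:n})\Big)
```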
How CRFs decompose global features, linear chain CRF
We compute the global features Fk(x1:n, y1:n) by decomposing into a sum of local features, for each position i in y1:n: write the formula…
Each local feature depends only on the current and previous output tags, yi and yi-1 respectively.
This specific constraint characterises linear-chain CRFs. The limitation makes it possible to use the Viterbi algorithm for decoding.
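The decomposition the card asks for, in the usual linear-chain form:

```latex
F_k(x_{1:n}, y_{1:n}) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, x_{1:n}, i)
```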
How CRFs compute local features, types
To avoid writing features by hand, specific features are automatically instantiated from feature templates.
Example of templates: <yi, xi>, <yi, yi-1>, <yi, xi-1, xi-2>
TYPES of FEATURES:
- Word shape features are used to represent letter patterns (e.g. “DC10-30”: write a feature for this word, see the sketch below…)
- Prefix / suffix features are used to represent word morphological patterns (e.g. “well-dressed”: write a feature for this word, see the sketch below…)
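A minimal sketch of how such features could be instantiated (the feature names and the X/x/d shape alphabet follow a common convention and are assumptions, not taken from the slides):

```python
def word_shape(word):
    """Map uppercase letters to 'X', lowercase to 'x', digits to 'd'; keep other characters."""
    return "".join("X" if c.isupper() else "x" if c.islower() else
                   "d" if c.isdigit() else c for c in word)

def affix_features(word, max_len=3):
    """Prefix/suffix features capturing morphological patterns."""
    feats = [f"prefix={word[:k]}" for k in range(1, max_len + 1)]
    feats += [f"suffix={word[-k:]}" for k in range(1, max_len + 1)]
    return feats

print(word_shape("DC10-30"))          # XXdd-dd
print(affix_features("well-dressed"))
# ['prefix=w', 'prefix=we', 'prefix=wel', 'suffix=d', 'suffix=ed', 'suffix=sed']
```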