Part-of-Speech Tagging Flashcards
Tagset
Set of all POS tags used by some model.
PoS tags: open class vs closed class
- Open class tags: nouns, verbs, adjectives, and adverbs. New words entering the language are usually added to these classes.
- Closed class tags: determiners, prepositions, conjunctions, etc. Closed word classes rarely receive new members.
PoS tagging task intro
The task of part-of-speech tagging involves the assignment of the proper (unique) POS tag to each word in an input sentence.
- POS tagging must be done in the context of an entire sentence, on the basis of the grammatical relationship of a word with its neighboring words
PoS tagging problem definition
The input is a sequence x1, x2, . . . , xn of (tokenized) words, and the output is a sequence y1, y2, . . . , yn of tags, with yi the tag assigned to xi.
POS tagging is therefore a structured prediction task, not a classification task.
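As a concrete illustration (a minimal sketch; the sentence and the Penn-Treebank-style tags are an assumed example, not from the slides):

```python
# Input: a tokenized sentence; output: one tag per token (same length).
words = ["Book", "that", "flight"]   # x1 ... xn
tags  = ["VB",   "DT",   "NN"]       # y1 ... yn, with yi the tag assigned to xi

# Structured prediction: the whole tag sequence is predicted jointly,
# because the best tag for "Book" (verb vs. noun) depends on its context.
assert len(words) == len(tags)
```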
Evaluation for PoS tagging
The accuracy of a part-of-speech tagger is the percentage of test set tags that match human gold labels.
One can compare a model with:
- Human ceiling: the highest accuracy achievable by a human annotator manually assigning the correct POS tags to the words of a natural-language text (about 97% accuracy on average).
- Most-Frequent-Class baseline: assign each token the class it occurred in most often in the training set. This baseline has an accuracy of about 92%.
One can visualize model performance with:
Qualitative evaluation: generate a confusion matrix on the development set. This is a record of how often a word with gold tag yi was mistagged as yj.
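A minimal sketch of how accuracy and a confusion matrix could be computed from gold and predicted tags (pure Python; the toy data and variable names are assumptions):

```python
from collections import Counter

# Gold and predicted tags for a toy test set, flattened token by token.
gold = ["DT", "NN", "VBZ", "JJ", "NN"]
pred = ["DT", "NN", "VBZ", "NN", "NN"]

# Accuracy: percentage of predicted tags that match the human gold labels.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Confusion matrix: how often a token with gold tag yi was (mis)tagged as yj.
confusion = Counter(zip(gold, pred))

print(f"accuracy = {accuracy:.2%}")      # 80.00%
print(confusion[("JJ", "NN")])           # count of JJ tokens mistagged as NN
```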
Hidden Markov Model
An HMM is a generative model: it models how a class could generate the input data, so the model can also be used to generate examples.
Find the HMM formula for computing the most probable sequence of tags given a sentence
Let w1:n = w1w2 … wn be an input sequence of words.
The goal of POS tagging is to choose the most probable tag sequence among all tag sequences t1:n = t1t2 … tn for w1:n
1. Write the formula (hint: argmax over P(t1:n|w1:n)):
- P(t1:n|w1:n) is referred to as the posterior probability and can be difficult to model/compute.
2. Break down the posterior probability in the Hidden Markov model:
- Write the formula (use Bayes' rule)
- The term P(t1:n) is referred to as the prior or marginal probability. The term P(w1:n|t1:n) is the likelihood of the words given the tags.
3. Markov assumptions:
- the probability of a word depends only on its own POS tag and is independent of neighboring words and tags (write the formula). The term P(wi|ti) is referred to as the emission probability.
- the probability of a tag depends only on the previous tag, rather than the entire tag sequence (write the formula). The term P(ti|ti-1) is referred to as the transition probability.
4. PUT IT ALL TOGETHER (write the full formula)
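For reference, one standard way to write the formulas the four steps above ask for (the usual HMM tagging derivation):

```latex
% 1. Most probable tag sequence (posterior probability):
\hat{t}_{1:n} = \operatorname*{argmax}_{t_{1:n}} P(t_{1:n} \mid w_{1:n})

% 2. Bayes' rule (the denominator P(w_{1:n}) does not affect the argmax):
\hat{t}_{1:n} = \operatorname*{argmax}_{t_{1:n}}
    \frac{P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})}{P(w_{1:n})}
  = \operatorname*{argmax}_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})

% 3. Markov assumptions (emission and transition):
P(w_{1:n} \mid t_{1:n}) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
\qquad
P(t_{1:n}) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})

% 4. Putting it all together:
\hat{t}_{1:n} = \operatorname*{argmax}_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```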
PoS tagging: Probability estimation
Assume a tagged corpus, where each word has been tagged with its gold label. We implement supervised learning using the relative frequency estimator.
- For the transition probability we obtain: P(ti|ti-1) = C(ti-1, ti) / C(ti-1)
- Similarly, for the emission probability we obtain: P(wi|ti) = C(ti, wi) / C(ti)
Examples at slide 26 pdf 7…
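A minimal sketch of the relative frequency estimation from a tagged corpus (the toy corpus and function names are assumptions, not from the slides):

```python
from collections import defaultdict

# Toy tagged corpus: a list of sentences, each a list of (word, tag) pairs.
corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

transition_counts = defaultdict(int)   # C(t_{i-1}, t_i)
emission_counts = defaultdict(int)     # C(t_i, w_i)
tag_counts = defaultdict(int)          # C(t_i)

for sentence in corpus:
    prev = "<s>"                       # start-of-sentence pseudo-tag
    tag_counts[prev] += 1
    for word, tag in sentence:
        transition_counts[(prev, tag)] += 1
        emission_counts[(tag, word)] += 1
        tag_counts[tag] += 1
        prev = tag

# Relative frequency (maximum likelihood) estimates:
def p_transition(tag, prev_tag):
    return transition_counts[(prev_tag, tag)] / tag_counts[prev_tag]

def p_emission(word, tag):
    return emission_counts[(tag, word)] / tag_counts[tag]

print(p_transition("NN", "DT"))   # C(DT, NN) / C(DT) = 2/2 = 1.0
print(p_emission("dog", "NN"))    # C(NN, dog) / C(NN) = 1/2 = 0.5
```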
Hidden Markov model as automata, structure, how it works and components
We can formally define an HMM as a special type of probabilistic finite state automaton which generates sentences (instead of accepting sentences).
STRUCTURE:
- The states represent ‘hidden’ information, that is, the POS tags which are not observed.
- The transition function is defined according to the transition probabilities.
HOW IT WORKS:
Each state generates a word according to the emission probabilities. The generated output is an observable word sequence.
COMPONENTS:
- finite set of output symbols V
- finite set of states Q, with initial state q0 and final state qf
- transition probabilities α(q, q’) (for each state q, the sum over all q’ is 1)
- emission probabilities β(q, a) (for each state q, the sum over all a is 1)
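A minimal sketch of how such an automaton generates a word sequence (the toy states, vocabulary, and probabilities are assumptions):

```python
import random

# Toy HMM: states Q = {DT, NN} plus q0/qf, output symbols V = {"the", "dog", "cat"}.
transitions = {                # α(q, q'): each row sums to 1
    "q0": {"DT": 1.0},
    "DT": {"NN": 1.0},
    "NN": {"qf": 1.0},
}
emissions = {                  # β(q, a): each row sums to 1
    "DT": {"the": 1.0},
    "NN": {"dog": 0.5, "cat": 0.5},
}

def generate():
    """Walk from q0 to qf, emitting one word per visited (hidden) state."""
    state, words = "q0", []
    while True:
        nxt = random.choices(list(transitions[state]),
                             weights=transitions[state].values())[0]
        if nxt == "qf":
            return words
        words.append(random.choices(list(emissions[nxt]),
                                    weights=emissions[nxt].values())[0])
        state = nxt

print(generate())   # e.g. ['the', 'dog'] -- the observable word sequence
```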
Viterbi algorithm for decoding
Given a sequence of observations, find the most probable sequence of states/tags.
Let w = w1 … wn be the input sequence.
In what follows:
- q denotes a state/tag of the HMM
- i denotes an input position, 0 <= i <= n+1
We use a two-dimensional table viterbi[q,i], denoting the probability of the best path that reaches state q after scanning w1:i.
We use a two-dimensional table backpointer[q,i] for retrieving the best path.
Initialisation step (for all q):
- viterbi[q,1] = α(q0, q) β(q, w1)
- backpointer[q,1] = 0
Recursive step (for all i=2, …, n and for all q):
- viterbi[q,i] = max_q’ viterbi[q’,i-1] α(q’, q) β(q, wi)
- backpointer[q,i] = argmax_q’ viterbi[q’,i-1] α(q’, q) β(q, wi)
Termination step:
- viterbi[qf,n+1] = max_q’ viterbi[q’,n] α(q’, qf)
- backpointer[qf,n+1] = argmax_q’ viterbi[q’,n] α(q’, qf)
After execution of the algorithm, viterbi[qf, n+1] is the probability of the most likely sequence of tags for w.
The sequence of tags q1, . . . , qn can be reconstructed starting with backpointer[qf, n+1] and following the backpointers.
https://www.youtube.com/watch?v=IqXdjdOgXPM
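A minimal sketch of the Viterbi recurrence above in Python (the toy HMM and variable names are assumptions; probabilities are multiplied directly, so a real implementation would typically use log probabilities):

```python
def viterbi(words, states, alpha, beta, q0="q0", qf="qf"):
    """alpha[q][q2] = transition probability, beta[q][w] = emission probability."""
    n = len(words)
    V = [{} for _ in range(n)]        # V[i-1][q]    = viterbi[q, i]
    back = [{} for _ in range(n)]     # back[i-1][q] = backpointer[q, i]

    # Initialisation step (i = 1)
    for q in states:
        V[0][q] = alpha[q0].get(q, 0.0) * beta[q].get(words[0], 0.0)
        back[0][q] = q0

    # Recursive step (i = 2 .. n)
    for i in range(1, n):
        for q in states:
            best_prev, best_score = None, 0.0
            for q2 in states:
                score = V[i-1][q2] * alpha[q2].get(q, 0.0) * beta[q].get(words[i], 0.0)
                if score >= best_score:
                    best_prev, best_score = q2, score
            V[i][q], back[i][q] = best_score, best_prev

    # Termination step (i = n + 1): transition into the final state qf
    best_last = max(states, key=lambda q: V[n-1][q] * alpha[q].get(qf, 0.0))
    best_prob = V[n-1][best_last] * alpha[best_last].get(qf, 0.0)

    # Follow the backpointers to reconstruct the tag sequence
    tags = [best_last]
    for i in range(n - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags)), best_prob

# Toy example (assumed probabilities):
states = ["DT", "NN", "VB"]
alpha = {"q0": {"DT": 0.6, "VB": 0.4},
         "DT": {"NN": 1.0},
         "NN": {"qf": 1.0},
         "VB": {"DT": 0.5, "qf": 0.5}}
beta = {"DT": {"the": 1.0},
        "NN": {"flight": 1.0},
        "VB": {"book": 1.0}}

print(viterbi(["book", "the", "flight"], states, alpha, beta))
# (['VB', 'DT', 'NN'], 0.2)
```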
Conditional random fields (CRF)
Conditional random fields (CRF) are discriminative sequence models based on log-linear models. Discriminative classifiers learn what features from the input are most useful to discriminate between the different possible classes.
The model can only distinguish the classes, perhaps without learning much about them: it is unable to generate observations/examples.
- Let x1:n = x1, x2, … , xn be the input word sequence, and let Y(x1:n) be the set of all possible tag sequences y1:n = y1, y2, … , yn for x1:n.
SOLVES THE PROBLEM:
ŷ1:n = argmax_{y1:n in Y(x1:n)} P(y1:n|x1:n)
HMM vs CRFs
- In contrast with HMMs, CRFs directly model the posterior probability
- In contrast to HMM, CRFs do not compute a probability for each tag at each time step. Instead, at each time step the CRF computes log-linear functions over a set of relevant features, and these local features are aggregated and normalised to produce a global probability.
How CRFs use global features, partition function
- Let us assume we have K global feature functions Fk(x1:n, y1:n), and weights wk for each feature.
define the probability P(y1:n|x1:n) using K global features…
The denominator is the so-called partition function Z(x1:n), a normalization term that only depends on the input sequence x1:n.
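One standard way to write the probability the card asks for, with the partition function made explicit:

```latex
P(y_{1:n} \mid x_{1:n})
  = \frac{\exp\!\Big(\sum_{k=1}^{K} w_k F_k(x_{1:n}, y_{1:n})\Big)}
         {\sum_{y'_{1:n} \in Y(x_{1:n})} \exp\!\Big(\sum_{k=1}^{K} w_k F_k(x_{1:n}, y'_{1:n})\Big)}

% Partition function (depends only on the input sequence):
Z(x_{1:n}) = \sum_{y'_{1:n} \in Y(x_{1:n})} \exp\!\Big(\sum_{k=1}^{K} w_k F_k(x_{1:n}, y'_{1:n})\Big)
```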
How CRFs decompose global features, linear chain CRF
We compute the global features Fk(x1:n, y1:n) by decomposing into a sum of local features, for each position i in y1:n: write the formula…
Each local feature depends only on the current and previous output tags, yi and yi-1 respectively.
This specific constraint characterises linear-chain CRFs. The limitation makes it possible to use the Viterbi algorithm for decoding.
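The decomposition the card asks for, in the usual linear-chain form:

```latex
F_k(x_{1:n}, y_{1:n}) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, x_{1:n}, i)
```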
How CRFs compute local features, types
To avoid writing features by hand, specific features are automatically instantiated from feature templates.
Example of templates: <yi, xi>, <yi, yi-1>, <yi, xi-1, xi-2>
TYPES of FEATURES:
- Word shape features are used to represent letter patterns (e.g. “DC10-30”: write a feature for this word, see the sketch below…)
- Prefix / suffix features are used to represent word morphological patterns (e.g. “well-dressed”: write a feature for this word, see the sketch below…)
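A minimal sketch of how such features could be instantiated (the feature names and the X/x/d shape alphabet follow a common convention and are assumptions, not taken from the slides):

```python
def word_shape(word):
    """Map uppercase letters to 'X', lowercase to 'x', digits to 'd'; keep other characters."""
    return "".join("X" if c.isupper() else "x" if c.islower() else
                   "d" if c.isdigit() else c for c in word)

def affix_features(word, max_len=3):
    """Prefix/suffix features capturing morphological patterns."""
    feats = [f"prefix={word[:k]}" for k in range(1, max_len + 1)]
    feats += [f"suffix={word[-k:]}" for k in range(1, max_len + 1)]
    return feats

print(word_shape("DC10-30"))          # XXdd-dd
print(affix_features("well-dressed"))
# ['prefix=w', 'prefix=we', 'prefix=wel', 'suffix=d', 'suffix=ed', 'suffix=sed']
```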