Part-of-Speech Tagging Flashcards

1
Q

Tagset

A

The set of all POS tags used by a given model.

2
Q

PoS tags: open class vs closed class

A
  • Open class tags: nouns, verbs, adjectives, and adverbs. New words in the language are usually added to these classes.
  • Closed class tags: determiners, prepositions, conjunctions, etc. Closed word classes rarely receive new members.
3
Q

PoS tagging task intro

A

The task of part-of-speech tagging involves the assignment of the proper (unique) POS tag to each word in an input sentence.

  • POS tagging must be done in the context of an entire sentence, on the basis of the grammatical relationship of a word with its neighboring words
4
Q

PoS tagging problem definition

A

The input is a sequence x1, x2, . . . , xn of (tokenized) words, and the output is a sequence y1, y2, . . . , yn of tags, with yi the tag assigned to xi.

POS tagging is therefore a structured prediction task, not a classification task.
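
A tiny illustration of one input/output pair (the tag names are just a common convention, e.g. Universal Dependencies style):

  # "Book" is usually a noun, but the sentence context forces the VERB reading here.
  words = ["Book", "that", "flight"]
  tags  = ["VERB", "DET", "NOUN"]      # y_i is the tag assigned to x_i
  assert len(words) == len(tags)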

5
Q

Evaluation for PoS tagging

A

The accuracy of a part-of-speech tagger is the percentage of test set tags that match human gold labels.

One can compare a model with:

  • Human ceiling: the highest accuracy achievable by a human annotator manually assigning POS tags to the words of a natural language text (about 97% accuracy on average).
  • Most-Frequent-Class baseline: assign each token the tag it occurred with most often in the training set. This baseline has an accuracy of about 92%.

One can visualize model performance with:

Qualitative evaluation: generate a confusion matrix on the development set. This is a record of how often a word with gold tag yi was mistagged as yj.
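
A minimal sketch of both evaluations, assuming gold and predicted tags are given as flat, aligned lists (the helper below is hypothetical, not from the slides):

  from collections import Counter

  def evaluate(gold_tags, predicted_tags):
      """Return token-level accuracy and a confusion Counter over (gold, predicted) errors."""
      assert len(gold_tags) == len(predicted_tags)
      correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
      confusion = Counter((g, p) for g, p in zip(gold_tags, predicted_tags) if g != p)
      return correct / len(gold_tags), confusion

  # e.g. evaluate(["DET", "NOUN", "VERB"], ["DET", "VERB", "VERB"])
  # -> (0.666..., Counter({("NOUN", "VERB"): 1}))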

6
Q

Hidden Markov Model

A

An HMM is a generative model: it models how a class could generate some input data. One could use the model to generate examples.

7
Q

Find the HMM formula for computing the most probable sequence of tags given a sentence

A

Let w1:n = w1w2 … wn be an input sequence of words.

The goal of POS tagging is to choose the most probable tag sequence among all tag sequences t1:n = t1t2 … tn for w1:n

1. write the formula (hint: argmax over P(t1:n|w1:n)):

  • P(t1:n|w1:n) is referred to as the posterior probability and can be difficult to model/compute.

2. break down the posterior probability in the Hidden Markov model:

  • Write the formula (use the Bayes rule)
  • The term P(t1:n) is referred to as the prior or marginal probability. The term P(w1:n|t1:n) is the likelihood of the words given the tags.

3. Markov assumptions:

  1. the probability of a word depends only on its own POS tag and is independent of neighboring words and tags (write the formula)
    The term P(wi|ti) is referred to as the emission probability
  2. the probability of a tag is dependent only on the previous tag, rather than the entire tag sequence: (write the formula)
    The term P(ti|ti-1) is referred to as the transition probability

4. PUT IT ALL TOGETHER (a worked version of the derivation follows below)
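
The standard HMM derivation, in LaTeX notation (this just fills in the formulas the steps above ask for):

  \hat{t}_{1:n} = \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n})
               = \arg\max_{t_{1:n}} \frac{P(w_{1:n} \mid t_{1:n}) \, P(t_{1:n})}{P(w_{1:n})}
               = \arg\max_{t_{1:n}} P(w_{1:n} \mid t_{1:n}) \, P(t_{1:n})

  P(w_{1:n} \mid t_{1:n}) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
  P(t_{1:n}) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})

  \hat{t}_{1:n} \approx \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i) \, P(t_i \mid t_{i-1})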

8
Q

PoS tagging: Probability estimation

A

Assume a tagged corpus, where each word has been tagged with its gold label. We implement supervised learning using the relative frequency estimator.

  • For the transition probability we obtain: P(ti|ti-1) = C(ti-1, ti) / C(ti-1)
  • Similarly, for the emission probability we obtain: P(wi|ti)=C(ti, wi)/C(ti)

Examples at slide 26 pdf 7…
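
A minimal counting sketch of both estimators (assuming the corpus is a list of sentences, each a list of (word, tag) pairs; the start pseudo-tag "<s>" is an implementation choice, not from the slides):

  from collections import Counter

  def estimate(tagged_sentences):
      transition_counts, emission_counts, tag_counts = Counter(), Counter(), Counter()
      for sentence in tagged_sentences:
          previous = "<s>"                                    # pseudo-tag for the sentence start
          tag_counts[previous] += 1
          for word, tag in sentence:
              transition_counts[(previous, tag)] += 1         # C(t_{i-1}, t_i)
              emission_counts[(tag, word)] += 1               # C(t_i, w_i)
              tag_counts[tag] += 1                            # C(t_i)
              previous = tag
      transition = {pair: c / tag_counts[pair[0]] for pair, c in transition_counts.items()}
      emission = {pair: c / tag_counts[pair[0]] for pair, c in emission_counts.items()}
      return transition, emission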

9
Q

Hidden Markov model as automata, structure, how it works and components

A

We can formally define an HMM as a special type of probabilistic finite state automaton which generates sentences (instead of accepting sentences).

STRUCTURE:

  • The states represent ‘hidden’ information, that is, the POS tags which are not observed.
  • The transition function is defined according to the transition probabilities.

HOW IT WORKS:

Each state generates a word according to the emission probabilities. The generated output is an observable word sequence.

COMPONENTS:

  • finite set of output symbols V
  • finite set of states Q, with initial state q0 and final state qf
  • transition probabilities α(q, q’) (for each q, the values over all q’ sum to 1)
  • emission probabilities β(q, a) (for each q, the values over all a sum to 1)
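
A toy instance of these components, sketched as plain dictionaries, with a generation routine that walks the automaton from q0 to qf (all probabilities are made up for illustration):

  import random

  alpha = {"<q0>": {"DET": 0.8, "NOUN": 0.2},          # transition probabilities
           "DET": {"NOUN": 1.0},
           "NOUN": {"VERB": 0.6, "<qf>": 0.4},
           "VERB": {"<qf>": 1.0}}
  beta = {"DET": {"the": 0.7, "a": 0.3},               # emission probabilities
          "NOUN": {"dog": 0.5, "cat": 0.5},
          "VERB": {"barks": 0.6, "sleeps": 0.4}}

  def generate():
      """Sample one (tag sequence, word sequence) pair from the HMM."""
      state, tags, words = "<q0>", [], []
      while True:
          state = random.choices(list(alpha[state]), weights=list(alpha[state].values()))[0]
          if state == "<qf>":
              return tags, words
          tags.append(state)
          words.append(random.choices(list(beta[state]), weights=list(beta[state].values()))[0])
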
10
Q

Viterbi algorithm for decoding

A

Given a sequence of observations, find the most probable sequence of states/tags.

Let w = w1 … wn be the input sequence.

In what follows:

  • q denotes a state/tag of the HMM
  • i denotes an input position, 0 <= i <= n+1

We use a two-dimensional table viterbi[q,i], denoting the probability of the best path that reaches state q after scanning w1:i.

We also use a two-dimensional table backpointer[q,i], used for retrieving the best path.

  1. Initialisation step (for all q):
    - viterbi[q,1] = α(q0, q) β(q, w1)
    - backpointer[q,1]=0
  2. Recursive step (for all i=2, …, n and for all q):
    - viterbi[q,i] = max q’ viterbi[q’,i-1] α(q’, q) β(q, wi)
    - backpointer[q,i] = argmax q’ viterbi[q’,i-1] α(q’, q) β(q, wi)
  3. Termination step:
    - viterbi[qf,n+1] = max q’ viterbi[q’,n] α(q’, qf)
    - backpointer[qf,n+1] = argmax q’ viterbi[q’,n] α(q’, qf)

After execution of the algorithm, viterbi[qf, n+1] is the probability of the most likely sequence of tags for w.

The sequence of tags q1, . . . , qn can be reconstructed starting with backpointer[qf, n+1] and following the backpointers.

https://www.youtube.com/watch?v=IqXdjdOgXPM
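
A minimal implementation of the three steps above, reusing the nested-dictionary alpha/beta representation from the toy HMM sketched earlier (it assumes at least one tag sequence has nonzero probability):

  def viterbi(words, tags, alpha, beta, q0="<q0>", qf="<qf>"):
      n = len(words)
      V = [{} for _ in range(n)]                  # V[i][q] = viterbi[q, i+1]
      back = [{} for _ in range(n)]
      for q in tags:                              # 1. initialisation step
          V[0][q] = alpha.get(q0, {}).get(q, 0.0) * beta.get(q, {}).get(words[0], 0.0)
          back[0][q] = None
      for i in range(1, n):                       # 2. recursive step
          for q in tags:
              prev = max(tags, key=lambda q2: V[i - 1][q2] * alpha.get(q2, {}).get(q, 0.0))
              V[i][q] = V[i - 1][prev] * alpha.get(prev, {}).get(q, 0.0) * beta.get(q, {}).get(words[i], 0.0)
              back[i][q] = prev
      last = max(tags, key=lambda q2: V[n - 1][q2] * alpha.get(q2, {}).get(qf, 0.0))   # 3. termination
      best_prob = V[n - 1][last] * alpha.get(last, {}).get(qf, 0.0)
      path = [last]
      for i in range(n - 1, 0, -1):               # follow the backpointers
          path.append(back[i][path[-1]])
      return list(reversed(path)), best_prob

  # With the toy alpha/beta above:
  # viterbi(["the", "dog", "barks"], ["DET", "NOUN", "VERB"], alpha, beta)
  # -> (["DET", "NOUN", "VERB"], ~0.1008)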

11
Q

Conditional random fields (CRF)

A

Conditional random fields (CRF) are discriminative sequence models based on log-linear models. Discriminative classifiers learn what features from the input are most useful to discriminate between the different possible classes.

The model can only distinguish the classes, perhaps without learning much about them: it is unable to generate observations/examples.

  • Let x1:n = x1, x2, … , xn be the input word sequence, and let Y(x1:n) be the set of all possible tag sequences y1:n = y1, y2, … , yn for x1:n.

SOLVES THE PROBLEM:

ŷ1:n = argmax (y1:n in Y(x1:n)) P(y1:n|x1:n)

12
Q

HMM vs CRFs

A
  • In contrast with HMMs, CRFs directly model the posterior probability P(y1:n|x1:n).
  • In contrast to HMMs, CRFs do not compute a probability for each tag at each time step. Instead, at each time step the CRF computes log-linear functions over a set of relevant features, and these local features are aggregated and normalised to produce a global probability.
13
Q

How CRFs use global features, partition function

A
  • Let us assume we have K global feature functions Fk(x1:n, y1:n), and weights wk for each feature.

Define the probability P(y1:n|x1:n) using the K global features (the formula is spelled out below)…

The denominator is the so-called partition function Z(x1:n), a normalization term that only depends on the input sequence x1:n.
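
The standard log-linear form, in LaTeX notation:

  P(y_{1:n} \mid x_{1:n}) = \frac{\exp\left( \sum_{k=1}^{K} w_k F_k(x_{1:n}, y_{1:n}) \right)}{\sum_{y'_{1:n} \in Y(x_{1:n})} \exp\left( \sum_{k=1}^{K} w_k F_k(x_{1:n}, y'_{1:n}) \right)}
                          = \frac{1}{Z(x_{1:n})} \exp\left( \sum_{k=1}^{K} w_k F_k(x_{1:n}, y_{1:n}) \right)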

14
Q

How CRFs decompose global features, linear chain CRF

A

We compute the global features Fk(x1:n, y1:n) by decomposing them into a sum of local features, one for each position i in y1:n: write the formula (see below)…

Each local feature depends on the current and previous output tokens, yi and yi-1 respectively.

The specific constraint above characterises a linear-chain CRF. This limitation makes it possible to use the Viterbi algorithm.
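
The standard linear-chain decomposition, in LaTeX notation:

  F_k(x_{1:n}, y_{1:n}) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, x_{1:n}, i)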

15
Q

How CRFs compute local features, types

A

To avoid writing features by hand, specific features are automatically instantiated from feature templates.

Example of templates: <yi, xi>, <yi, yi-1>, <yi, xi-1, xi-2>

TYPES of FEATURES:

  • Word shape features are used to represent letter patterns (e.g. write a word-shape feature for “DC10-30”; see the sketch below…)
  • Prefix / suffix features are used to represent word morphological patterns (e.g. write prefix/suffix features for “well-dressed”…)
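
A possible instantiation for the word-shape prompt, using the common convention uppercase → ‘X’, lowercase → ‘x’, digit → ‘d’, other characters kept as they are (a sketch, not the exact feature set from the slides):

  def word_shape(token):
      shape = []
      for ch in token:
          if ch.isupper():
              shape.append("X")
          elif ch.islower():
              shape.append("x")
          elif ch.isdigit():
              shape.append("d")
          else:
              shape.append(ch)              # keep hyphens, punctuation, etc.
      return "".join(shape)

  word_shape("DC10-30")                     # -> "XXdd-dd"
  "well-dressed".endswith("ed")             # a simple suffix feature: True
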
16
Q

Inference problem for linear-chain CRF

A
  1. ŷ1:n = argmax (y1:n in Y) P(y1:n|x1:n)
  2. define the probability P(y1:n|x1:n) using the global features
  3. remove constant terms
  4. sum of local features: write the final formula (the full derivation is spelled out below)…
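
The full derivation in LaTeX notation (Z(x_{1:n}) is constant in y_{1:n} and exp is monotone, so both can be dropped under the argmax):

  \hat{y}_{1:n} = \arg\max_{y_{1:n} \in Y(x_{1:n})} P(y_{1:n} \mid x_{1:n})
               = \arg\max_{y_{1:n}} \frac{1}{Z(x_{1:n})} \exp\left( \sum_{k=1}^{K} w_k F_k(x_{1:n}, y_{1:n}) \right)
               = \arg\max_{y_{1:n}} \sum_{k=1}^{K} w_k F_k(x_{1:n}, y_{1:n})
               = \arg\max_{y_{1:n}} \sum_{i=1}^{n} \sum_{k=1}^{K} w_k f_k(y_{i-1}, y_i, x_{1:n}, i)
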
17
Q

Decoding for CRF using Viterbi algorithm

A

For CRFs we need to replace the prior and the likelihood probabilities with the CRF features.

write the formula (see the recursion below)…
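
The resulting Viterbi recursion in LaTeX notation (the HMM products of probabilities become sums of local log-linear scores; N is the number of tags):

  v_t(j) = \max_{i=1}^{N} \left[ v_{t-1}(i) + \sum_{k=1}^{K} w_k f_k(y_{t-1} = i,\ y_t = j,\ x_{1:n},\ t) \right]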

18
Q

Limitations of HMMs and CRFs

A

One limitation with HMM and CRF architectures is that the models are exclusively run left-to-right.

Bidirectional models are quite standard for deep learning, as we will see with the BiLSTM models to be introduced later.

19
Q

Neural PoS taggers, main approaches

A

In neural network approaches to POS tagging, we construct distributed feature representations (dense vectors) for each tagging decision, based on the word and its context.

Two main approaches:

  • Local search: Neural networks can perform POS tagging as a per-token classification decision (Fixed-window neural model, RNN, RBM)
  • Global search: Feature representations can be combined with the Viterbi algorithm to tag the entire sequence globally (joint tagging).
20
Q

Fixed-window neural model (local search)

A

Fixed-window models use a feed-forward neural network to implement local search.

  • The fixed-window model is very efficient, since it limits the context from which information can be extracted.
  • The sliding window makes it difficult for the network to learn systematic patterns arising from phenomena like constituency, since such patterns are shifted to different positions within the window.
21
Q

Recurrent neural model (local search)

A
  • Let x1, x2, . . . , xn be the input word sequence and y1, y2, . . . , yn be the associated output POS tags.

We assume word embeddings e1, e2, . . . , en for x1, x2, . . . , xn.

Recurrent neural networks can be generally described as implementing the following recursive relation, for t = 1, . . . , n:

ht = g(et, ht-1) = f(W(h) ht-1 + W(e) et + b)

  • We score each POS tag y by means of a linear function of the hidden state vector ht, and then retrieve the highest-scoring tag for xt:

ψ(y, ht) = βht
ŷt = argmax (y) ψ(y, ht)
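
A minimal NumPy sketch of this recurrence and the per-token argmax (all weights are random toy values, purely to show the computation; a trained tagger would learn W_h, W_e, b and β):

  import numpy as np

  rng = np.random.default_rng(0)
  d_e, d_h, n_tags = 4, 5, 3                                # toy dimensions
  W_h, W_e = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_e))
  b, beta = rng.normal(size=d_h), rng.normal(size=(n_tags, d_h))

  def tag(embeddings):
      """embeddings: list of e_t vectors; returns one tag index per position (local search)."""
      h, predictions = np.zeros(d_h), []
      for e_t in embeddings:
          h = np.tanh(W_h @ h + W_e @ e_t + b)              # h_t = f(W(h) h_{t-1} + W(e) e_t + b)
          predictions.append(int(np.argmax(beta @ h)))      # argmax_y psi(y, h_t)
      return predictions

  tag([rng.normal(size=d_e) for _ in range(3)])             # three tag indices (meaningless here)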

22
Q

Recurrent bidirectional model (local search)

A

Hidden state vector ht encodes left context up to position t but it ignores subsequent tokens, which might be relevant to yt as well.

Bidirectional RNN are used to address this problem.

ht(fwd) = g(et, ht-1(fwd))   (left-to-right pass: encodes the left context)
ht(bwd) = g’(et, ht+1(bwd))   (right-to-left pass: encodes the right context)
ht = [ht(fwd), ht(bwd)]

We score each POS tag y by means of a linear function of the hidden state vector ht, and then retrieve the highest-scoring tag for xt:

ψ(y, ht) = βht
ŷt = argmax (y) ψ(y, ht)
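
A self-contained NumPy sketch of the bidirectional variant: a second RNN runs right-to-left and the two hidden states are concatenated before scoring (again with random toy weights, for illustration only):

  import numpy as np

  rng = np.random.default_rng(1)
  d_e, d_h, n_tags = 4, 5, 3
  fwd_params = [rng.normal(size=s) for s in [(d_h, d_h), (d_h, d_e), (d_h,)]]
  bwd_params = [rng.normal(size=s) for s in [(d_h, d_h), (d_h, d_e), (d_h,)]]
  beta = rng.normal(size=(n_tags, 2 * d_h))                 # scores the concatenated state

  def run_rnn(embeddings, params):
      W_h, W_e, b = params
      h, states = np.zeros(d_h), []
      for e_t in embeddings:
          h = np.tanh(W_h @ h + W_e @ e_t + b)
          states.append(h)
      return states

  def tag_bidirectional(embeddings):
      fwd = run_rnn(embeddings, fwd_params)                                    # left context
      bwd = list(reversed(run_rnn(list(reversed(embeddings)), bwd_params)))    # right context
      return [int(np.argmax(beta @ np.concatenate([f, b]))) for f, b in zip(fwd, bwd)]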

23
Q

Named entity recognition and other sequence labelling tasks

A

Named entity recognition (NER) seeks to locate multi-word expressions referring to entities such as person names, organizations, locations, time expressions, quantities, monetary values, etc.

Besides tagging, in NER we also need to find the proper span of the target expression.

The standard approach to span recognition is BIO tagging.

In BIO tagging:

  • tokens that begin a span are marked with label B
  • tokens that occur inside a span are marked with I
  • tokens outside of any span of interest are marked with O
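
A small illustrative BIO-tagged example (entity types PER and LOC; the sentence is invented):

  tokens = ["Barack", "Obama", "visited", "New",   "York",  "yesterday"]
  bio    = ["B-PER",  "I-PER", "O",       "B-LOC", "I-LOC", "O"]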