Sequence Labelling for Parts of Speech and Named Entities Flashcards
Part-of-speech tagging
Taking a sequence of words and assigning each word a part of speech like NOUN or VERB
Named entity recognition
Assigning words or phrases tags like PERSON, LOCATION or ORGANIZATION.
Sequence labelling tasks
Tasks in which we assign each item in an input sequence, xᵢ, a label yᵢ, so that the output sequence Y has the same length as the input sequence X.
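A tiny illustration in Python (the sentence and UD-style tags are just an assumed example):

```python
# Input sequence X (words) and output sequence Y (tags) must have the same length.
X = ["Janet", "will", "back", "the", "bill"]
Y = ["PROPN", "AUX", "VERB", "DET", "NOUN"]   # one label per input item
assert len(X) == len(Y)
for word, tag in zip(X, Y):
    print(f"{word}/{tag}")
```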
2 Categories of Parts of Speech
Closed class
POS with relatively fixed membership, such as prepositions. New prepositions are rarely coined.
2 Categories of Parts of Speech
Open class
Nouns and verbs. New nouns are continually being created or borrowed.
Categories of Parts of Speech
4 Major open classes
- nouns
- verbs
- adjectives
- adverbs
- (smaller open class of) interjections
Count nouns vs mass nouns
Count nouns can occur in the singular and plural (goat / goats, relationship / relationships) and can be counted.
Mass nouns are used when something is conceptualized as a homogeneous group. (snow, salt, communism).
Proper nouns
names of specific persons or entities
Verbs
Refer to actions and processes.
Adjectives
Often describe properties or qualities of nouns.
Adverbs
Adverbs generally modify something (often verbs).
Directional adverbs or locative adverbs specify the direction or location of some action. (home, here, downhill)
Degree adverbs specify the extent of some action, process or property (extremely, very, somewhat).
Manner adverbs describe the manner of some action or process (slowly, slinkily, delicately)
Temporal adverbs describe the time that some action took place (yesterday, Monday)
Particle
Resembles a preposition or an adverb and is used in combination with a verb (e.g., over in "she turned the paper over").
Conjunctions
Join two phrases, clauses or sentences.
Coordinating conjunctions - and, or, but - join two elements of equal status.
Subordinating conjunctions are used when one of the elements has some embedded status.
4 Common Named Entity types
PER (person)
LOC (location)
ORG (organization)
GPE (geo-political entity)
Markov chain
A model that tells us something about the probabilities of sequences of random variables - states - each of which can take on values from some set.
These sets can be words, or tags, or symbols representing anything, e.g. the weather.
A Markov chain makes a very strong assumption that if we want to predict the future in the sequence, all that matters is the current state. All states before the current state have no impact on the future except via the current state.
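Stated as an equation in the same notation, the Markov assumption is:

P(qᵢ = a | q₁ … qᵢ₋₁) = P(qᵢ = a | qᵢ₋₁)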
3 Components of a Markov model
- Q = q₁, q₂, …, qₙ: a set of n states
- A = a₁₁, a₁₂, …, aₙ₁, …, aₙₙ: a transition probability matrix, each aᵢⱼ representing the probability of moving from state i to state j
- π = π₁, π₂, …, πₙ: an initial probability distribution over states; πᵢ is the probability that the Markov chain will start in state i
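A minimal sketch of these three components in Python, assuming a toy two-state weather chain (states and probabilities invented for illustration):

```python
import numpy as np

Q = ["HOT", "COLD"]                     # set of n states
A = np.array([[0.6, 0.4],               # a_ij = P(next state j | current state i)
              [0.3, 0.7]])
pi = np.array([0.8, 0.2])               # pi_i = P(chain starts in state i)

# Generate a sequence: the next state depends only on the current state
# (the Markov assumption), via the corresponding row of A.
rng = np.random.default_rng(0)
state = rng.choice(len(Q), p=pi)
sequence = [Q[state]]
for _ in range(4):
    state = rng.choice(len(Q), p=A[state])
    sequence.append(Q[state])
print(sequence)
```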
Hidden Markov Model
A hidden Markov model (HMM) allows us to talk about both observed events (like words in the input) and hidden events (like POS tags) that we think of as causal factors in our probabilistic model.
Components of an HMM:
- Q = q₁, q₂, …, qₙ: a set of n states
- A = a₁₁, a₁₂, …, aₙ₁, …, aₙₙ: a transition probability matrix, each aᵢⱼ representing the probability of moving from state i to state j
- O = o₁, o₂, …, oₜ: a sequence of t observations, each drawn from a vocabulary V
- B = bᵢ(oₜ): a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation oₜ being generated from a state qᵢ
- π = π₁, π₂, …, πₙ: an initial probability distribution over states; πᵢ is the probability that the Markov chain will start in state i
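A minimal sketch in Python, assuming a toy two-tag HMM over a three-word vocabulary (all probabilities invented); it computes the joint probability of a hidden tag sequence and an observed word sequence:

```python
import numpy as np

Q = ["NOUN", "VERB"]                         # hidden states (tags)
V = ["flies", "like", "flowers"]             # observation vocabulary
A = np.array([[0.3, 0.7],                    # transition probabilities a_ij
              [0.8, 0.2]])
B = np.array([[0.4, 0.1, 0.5],               # emission probabilities b_i(o)
              [0.5, 0.4, 0.1]])
pi = np.array([0.9, 0.1])                    # initial distribution

def joint_prob(tags, words):
    """P(Q, O): product of transition and emission probabilities."""
    q = [Q.index(t) for t in tags]
    o = [V.index(w) for w in words]
    p = pi[q[0]] * B[q[0], o[0]]
    for i in range(1, len(q)):
        p *= A[q[i - 1], q[i]] * B[q[i], o[i]]
    return p

print(joint_prob(["NOUN", "VERB", "NOUN"], ["flies", "like", "flowers"]))
```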
HMM tagging as decoding
Given as input an HMM λ = (A, B) and a sequence of observations O = o₁, o₂, …, oₜ, find the most probable sequence of states Q = q₁, q₂, …, qₜ.
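A minimal Viterbi decoding sketch in Python; it assumes the toy A, B and π arrays from the HMM sketch above and works with state/word indices rather than names:

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Return the most probable state sequence (as indices) for observation indices obs."""
    n_states, T = A.shape[0], len(obs)
    v = np.zeros((n_states, T))               # v[s, t]: best path probability ending in s at t
    bp = np.zeros((n_states, T), dtype=int)   # backpointers

    v[:, 0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for s in range(n_states):
            scores = v[:, t - 1] * A[:, s] * B[s, obs[t]]
            bp[s, t] = np.argmax(scores)
            v[s, t] = scores[bp[s, t]]

    # Follow backpointers from the best final state.
    best = [int(np.argmax(v[:, T - 1]))]
    for t in range(T - 1, 0, -1):
        best.append(int(bp[best[-1], t]))
    return best[::-1]

# e.g. viterbi([V.index(w) for w in ["flies", "like", "flowers"]], A, B, pi)
```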
2 common approaches to sequence modelling
- A generative approach: HMM tagging
- A discriminative approach: CRF tagging
How are the probabilities in HMM taggers estimated
By maximum likelihood estimation on tag-labeled training corpora.
The Viterbi algorithm is used for decoding, finding the most likely tag sequence.
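A minimal sketch of those maximum likelihood estimates, assuming a toy tag-labeled corpus (counts only, no smoothing):

```python
from collections import Counter

# Toy tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [[("Janet", "NOUN"), ("will", "AUX"), ("back", "VERB"),
           ("the", "DET"), ("bill", "NOUN")]]

tag_count, trans_count, emit_count = Counter(), Counter(), Counter()
for sent in corpus:
    tags = [t for _, t in sent]
    for word, tag in sent:
        tag_count[tag] += 1
        emit_count[(tag, word)] += 1
    for prev, curr in zip(tags, tags[1:]):
        trans_count[(prev, curr)] += 1

def p_trans(prev, curr):   # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    return trans_count[(prev, curr)] / tag_count[prev]

def p_emit(tag, word):     # P(w_i | t_i) = C(t_i, w_i) / C(t_i)
    return emit_count[(tag, word)] / tag_count[tag]

print(p_trans("DET", "NOUN"), p_emit("NOUN", "bill"))
```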
Conditional Random Fields
Train a log-linear model that can choose the best tag sequence given an observation sequence, based on features that condition on the output tag, the previous output tag, the entire input sequence, and the current timestep.
They use the Viterbi algorithm for inference, to choose the best sequence of tags, and a version of the Forward-Backward algorithm for training.
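A minimal sketch of the log-linear scoring a linear-chain CRF performs, with hand-written feature functions and invented weights (training and decoding omitted); each feature sees the previous tag, the current tag, the whole input X and the position t:

```python
# Each feature function conditions on (previous tag, current tag, whole input X, position t).
def f_cap_propn(y_prev, y, X, t):
    return 1.0 if X[t][0].isupper() and y == "PROPN" else 0.0

def f_det_noun(y_prev, y, X, t):
    return 1.0 if y_prev == "DET" and y == "NOUN" else 0.0

features = [f_cap_propn, f_det_noun]
weights = [1.5, 0.8]                 # learned in practice; fixed here for illustration

def score(X, Y):
    """Unnormalized log-linear score of tag sequence Y for input X."""
    total = 0.0
    for t in range(len(X)):
        y_prev = Y[t - 1] if t > 0 else "<s>"
        total += sum(w * f(y_prev, Y[t], X, t) for w, f in zip(weights, features))
    return total

X = ["Janet", "will", "back", "the", "bill"]
print(score(X, ["PROPN", "AUX", "VERB", "DET", "NOUN"]))
```

A full CRF turns these scores into P(Y | X) by normalizing exp(score) over all possible tag sequences.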