Topic 3: Part of Speech Tagging Flashcards
What is part of speech?
word classes / syntactic categories; knowing a word's class reveals a lot about the word and its neighbours
example:
a noun is likely to be preceded by determiners and adjectives
a verb is likely to be preceded by a noun
syntactic structure: nouns are part of noun phrases
Applications of POS tagging
- parsing
- labelling named entities in IR
- Co-reference resolution..example
- speech recognition or synthesis
What are the 2 categories of POS?
closed class and open class
Closed class
relatively fixed membership, e.g. new prepositions are rarely coined
generally function words
occur frequently, are short, and often have structuring uses in grammar
E.g.: of, it, and, or, you
Open Class
e.g. nouns and verbs: new members can be added
new noun: iPhone
new verb: to fax
new words are continually being created or borrowed
In English: there are 4 (nouns, verbs, adjectives, adverbs)
many languages have all 4, but not every language does
Open Class: Noun
Give examples:
proper noun (Penang, Samsung, IBM, Intel)
common noun (cat, pencil)
count noun (cat, cats) - can be enumerated grammatically
mass noun (salt, snow)
Open Class: Verb
have inflections (changes in the form of a word)
non-third-person-sg (eat), third-person-sg (eats)
progressive (eating), past participle (eaten)
Open Class: Adjective
terms for properties and qualities
e.g. concepts of color (blue, yellow), age, value
Open Class: Adverb
viewed as modifying something
directional/locative (here, downhill)
degree (very, extremely)
manner (slowly, delicately)
temporal (yesterday, Monday)
Closed class
List and examples:
- prepositions
- particles
- determiners
- pronouns
- conjunctions
- auxiliary verbs
- numerals
Closed class: Prepositions
occur before noun phrases; semantically they often indicate spatial or temporal relations
Closed class: Particles
resembles a preposition or adverb but is used in combination with a verb for extended meanings
e.g. over: she turned the paper over
Closed class: Phrasal verb
verb and particle that act as a single syntactic and/or semantic unit
examples: turn down, find out, go on
Closed class: Determiners
the closed class that occurs with nouns, marking the beginning of a noun phrase
e.g. a, an, the, this, that
Closed class: Conjunction
words that join 2 phrases, clauses or sentences
e.g. and, or, but, that
Closed class: Pronoun
forms that often act as a kind of shorthand for referring to a noun phrase, entity or event
e.g. you, she, I, it, me (personal pronouns)
my, your, his, her, its, one’s, our, their (possessive pronouns)
Penn Treebank Tagset
45-tag Penn Treebank tagset, used to label many corpora
in labelling, the POS is generally represented by placing the tag after each word, delimited by a slash
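A minimal sketch of reading the slash-delimited format described above. The function name `parse_tagged` and the sample string are illustrative assumptions, not part of any particular corpus reader.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits at the LAST slash, so a word that itself
        # contains a slash (e.g. "1/2/CD") keeps its own slash.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


# Illustrative sentence (assumed example, tagged with Penn Treebank style tags).
tagged = parse_tagged("Book/VB that/DT flight/NN ./.")
print(tagged)
```

Splitting on the last slash is a design choice for this sketch; real corpus readers may handle escaping differently.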
Choosing tagset
Coarse tagsets:
8 parts of speech: noun, verb, pronoun, preposition, adverb, conjunction, participle, article
Finer-grained tagset:
45-tag “Penn Treebank Tagset”
Even finer-grained tagset:
87-tag Brown corpus tagset
Tagged corpora
corpora labelled with POS tags; crucial for training statistical tagging algorithms
created by running an automated POS tagger on texts, after which human annotators hand-corrected the tags
words are generally tokenized before tagging
Three main tagged corpora commonly used
Brown corpus
WSJ corpus
Switchboard corpus
Role of tokenization in tagging
the Treebank tagset assumes the text has already been tokenized, including multipart words, so that tokens are separated by whitespace
Recap: what is POS tagging, and what is the model?
process of assigning a POS marker to each word in an input text
input: sequence of tokenized words and a tagset
function: tagging algorithm
output: sequence of tags, one per token
Challenges in tagging, with an example
- words are ambiguous
tagging is a task to disambiguate
ambiguity: a word can have more than one possible POS; the goal is to find the correct tag for the context
For example, "book" can be a verb (book that flight) or a noun (hand me that book)
Highly ambiguous words
that, back, down, put and set
example: there are 6 different POS tags for "back": JJ, NN, VBP, VB, RP, RB
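One way to see this kind of ambiguity in a tagged corpus is to collect the set of tags observed for each word. The sketch below assumes a tiny hand-made corpus (`TOY_SENTENCES` is invented for illustration, not taken from a real treebank).

```python
from collections import defaultdict


def tags_per_word(tagged_sentences):
    """Map each lower-cased word to the set of tags it was seen with."""
    seen = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            seen[word.lower()].add(tag)
    return dict(seen)


# Hypothetical toy data showing "back" in six different roles.
TOY_SENTENCES = [
    [("the", "DT"), ("back", "JJ"), ("seat", "NN")],
    [("in", "IN"), ("the", "DT"), ("back", "NN")],
    [("senators", "NNS"), ("back", "VBP"), ("the", "DT"), ("bill", "NN")],
    [("to", "TO"), ("back", "VB"), ("away", "RB")],
    [("buy", "VB"), ("back", "RP"), ("debt", "NN")],
    [("back", "RB"), ("then", "RB")],
]

observed = tags_per_word(TOY_SENTENCES)
print(sorted(observed["back"]))
```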
Methods for POS tagging
rule-based: tag using hand-written disambiguation rules
probabilistic/stochastic tagger
resolves tagging ambiguities using a training corpus.
computes the probability of a given word having a given tag in a given context
HMM tagger
Simple baseline algorithm
idea: pick the more likely POS
for example, “a” can be a determiner or the letter “a”,
but it is more likely a determiner.
simplistic baseline algorithm for POS tagging: given an ambiguous word, choose the tag that is most frequent for that word in the training corpus.
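The most-frequent-tag baseline can be sketched as follows. The toy corpus, the function names, and the choice of "NN" as the fallback tag for unseen words are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn, for each word, the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_baseline(words, model, default_tag="NN"):
    """Tag each word with its most frequent training tag.

    default_tag is an assumed fallback for words never seen in training.
    """
    return [model.get(word.lower(), default_tag) for word in words]


# Hypothetical toy training corpus.
TOY_CORPUS = [
    [("a", "DT"), ("book", "NN")],
    [("a", "DT"), ("flight", "NN")],
    [("book", "VB"), ("a", "DT"), ("flight", "NN")],
    [("read", "VB"), ("a", "DT"), ("book", "NN")],
    [("the", "DT"), ("letter", "NN"), ("a", "NN")],
]

model = train_baseline(TOY_CORPUS)
print(tag_baseline(["book", "a", "flight"], model))
```

In this toy data "a" is tagged DT because DT outnumbers NN, and "book" is always tagged NN even when it is used as a verb, which shows why the baseline ignores context.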
POS Accuracy
standard performance measure
percentage of tags in the test set that are labelled correctly, i.e. that match the human labels.
always compare a classifier against a baseline; it should be at least as good as the most frequent class baseline (MFCB).
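Tag accuracy as described above is a simple ratio. A minimal sketch, assuming predicted and gold tags are given as two parallel lists (the sample tags are made up):

```python
def accuracy(predicted, gold):
    """Percentage of predicted tags that match the human (gold) tags."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Hypothetical test sentence "book that flight": 2 of 3 tags are correct.
gold_tags = ["VB", "DT", "NN"]
predicted_tags = ["NN", "DT", "NN"]
print(accuracy(predicted_tags, gold_tags))
```

The same function can score both a tagger and the most frequent class baseline on the same test set, so the two numbers are directly comparable.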
Rule based POS tagger
assign a list of potential POS tags to each word based on a dictionary
manual rules for out-of-vocabulary words
apply hand-written constraints until each word has only one possible POS
example
- a DT cannot immediately precede a verb
- no verb can immediately precede a tensed verb
- eliminate VBN if VBD is an option
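A rough sketch of the rule-based procedure above, implementing two of the example constraints in a single left-to-right pass. The lexicon, the fallback tag for unknown words, and the function name are assumptions; a real rule-based tagger would use a much larger dictionary and rule set.

```python
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

# Hypothetical dictionary listing the potential tags of each word.
LEXICON = {
    "the": {"DT"},
    "book": {"NN", "VB"},
    "arrived": {"VBD", "VBN"},
}


def rule_tag(words, lexicon, unknown_tags=("NN",)):
    """Return the candidate tag set left for each word after applying rules.

    unknown_tags is an assumed stand-in for out-of-vocabulary rules.
    """
    # Copy each entry so the lexicon itself is never modified.
    candidates = [set(lexicon.get(w.lower(), unknown_tags)) for w in words]
    for i, tags in enumerate(candidates):
        # Rule: eliminate VBN if VBD is an option.
        if "VBD" in tags and "VBN" in tags:
            tags.discard("VBN")
        # Rule: a DT cannot immediately precede a verb, so drop verb
        # readings after an unambiguous DT (unless nothing else remains).
        if i > 0 and candidates[i - 1] == {"DT"}:
            non_verbs = tags - VERB_TAGS
            if non_verbs:
                candidates[i] = non_verbs
    return candidates


print(rule_tag(["The", "book", "arrived"], LEXICON))
```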
A probabilistic method for POS
consider all possible sequences of classes (tags)
choose the tag sequence that is most probable given the observed sequence of n words
Estimating probability
word likelihood prob P(word | tag) * tag transition prob P(tag | previous tag)
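The two probabilities can be estimated by counting in a tagged corpus, and a tag sequence scored by multiplying them token by token. The sketch below assumes a bigram model with a hypothetical start symbol `<s>`, no smoothing, and a made-up toy corpus; it enumerates every tag sequence by brute force, which is only feasible for very short sentences (HMM taggers normally use the Viterbi algorithm instead).

```python
from collections import Counter
from itertools import product

START = "<s>"  # assumed sentence-start pseudo-tag


def train(tagged_sentences):
    """Count emissions, tag bigrams and tags (maximum-likelihood counts)."""
    emit, trans, tag_count = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = START
        tag_count[START] += 1
        for word, tag in sentence:
            emit[(tag, word)] += 1
            trans[(prev, tag)] += 1
            tag_count[tag] += 1
            prev = tag
    return emit, trans, tag_count


def sequence_prob(words, tags, model):
    """Product over tokens of P(word | tag) * P(tag | previous tag)."""
    emit, trans, tag_count = model
    prob, prev = 1.0, START
    for word, tag in zip(words, tags):
        likelihood = emit[(tag, word)] / tag_count[tag]
        transition = trans[(prev, tag)] / tag_count[prev]
        prob *= likelihood * transition
        prev = tag
    return prob


def best_sequence(words, model):
    """Brute force: score every possible tag sequence and keep the best."""
    tagset = [t for t in model[2] if t != START]
    return max(product(tagset, repeat=len(words)),
               key=lambda tags: sequence_prob(words, tags, model))


# Hypothetical toy training corpus.
TOY_CORPUS = [
    [("book", "VB"), ("that", "DT"), ("flight", "NN")],
    [("hand", "VB"), ("me", "PRP"), ("that", "DT"), ("book", "NN")],
    [("the", "DT"), ("book", "NN")],
]

hmm = train(TOY_CORPUS)
print(best_sequence(["book", "that", "flight"], hmm))
```

Unlike a per-word frequency count, this picks VB for sentence-initial "book" in the toy data, because no training sentence starts with NN and the transition probability from the start symbol to NN is therefore zero.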