POS Tagging Flashcards
What is POS tagging?
Identifying the part of speech (syntactic category) of each word in a given string (or audio transcription) for further processing.
Give some examples of other tagging tasks
Case restoration, Named Entity Recognition, Information Field Segmentation, Prosodic marking
What is Case restoration?
If some text has been converted entirely to lower case or upper case, case restoration tries to reverse this. That is, for a string like “this is not a drill. i repeat. this is not a drill”, we would try to restore: “This is not a drill. I repeat. This is not a drill”.
What is Information Field Segmentation?
Trying to find words that fit under a certain category (field) within a body of text
What is prosodic marking?
Determining which words carry particular intonation/stress/tone. E.g. “He’s going”, “He’s going!”, and “He’s going?” would all have different intonations that change their meaning.
What are open-class words?
Verbs, adjectives, adverbs, and nouns. They carry most of the content of a sentence. Open classes are constantly changing, with new additions being made all the time (for example, “googling” as a verb).
What are closed-class words?
Pronouns, determiners, prepositions, and connectives. These are mostly functional: there is a limited number of them, and they act to tie the concepts of a sentence together.
What type of tags would you expect morphologically rich languages to have?
compound morphosyntactic tags
What is a homograph?
A single written form that can take different parts of speech depending on context, so the same word X appears in different sentences with different POS tags (e.g. “record” as a noun vs. as a verb).
What makes POS tagging difficult?
Open question. One answer is homographs: resolving them requires knowledge of the words and their context.
How do we define a probabilistic model for tagging?
Say ti is the tag at position i, and we begin a sentence with the start tag t0 = <s>.
If we say that the probability of the next tag depends only on the previous tag, then this probability is given by P(ti | ti−1).
If we say that each tag can be realised as one of many different words, and each word is conditioned only on its tag, this probability is given by P(wi | ti).
To generate sentence of length n:
Let t0 = <s>
For i = 1 to n:
    Choose a tag conditioned on the previous tag: P(ti | ti−1)
    Choose a word conditioned on its tag: P(wi | ti)
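The generation procedure above can be sketched as a small Python program. The transition and emission tables here are made-up toy probabilities for illustration, not from any real corpus:

```python
import random

# Toy bigram HMM. All probabilities are illustrative assumptions.
# Transition table: P(t_i | t_{i-1}); "<s>" is the start-of-sentence tag.
TRANSITIONS = {
    "<s>":  {"DET": 0.7, "NOUN": 0.3},
    "DET":  {"NOUN": 1.0},
    "NOUN": {"VERB": 0.6, "NOUN": 0.4},
    "VERB": {"DET": 0.5, "NOUN": 0.5},
}

# Emission table: P(w_i | t_i).
EMISSIONS = {
    "DET":  {"the": 0.8, "a": 0.2},
    "NOUN": {"dog": 0.5, "drill": 0.5},
    "VERB": {"runs": 0.6, "repeats": 0.4},
}

def sample(dist):
    """Draw one outcome from a {value: probability} distribution."""
    r = random.random()
    total = 0.0
    for value, p in dist.items():
        total += p
        if r < total:
            return value
    return value  # guard against floating-point rounding

def generate(n):
    """Generate a tagged sentence of length n from the bigram model."""
    prev_tag, sentence = "<s>", []
    for _ in range(n):
        tag = sample(TRANSITIONS[prev_tag])   # choose tag: P(t_i | t_{i-1})
        word = sample(EMISSIONS[tag])         # choose word: P(w_i | t_i)
        sentence.append((word, tag))
        prev_tag = tag
    return sentence
```

Running `generate(5)` produces a random (word, tag) sequence; a real tagger would instead invert this generative story to find the most likely tags for observed words.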
What are the assumptions the probabilistic model for a tagged sentence of length n makes?
Each tag depends only on the previous tag: a bigram model over tags.
Words are conditionally independent given tags
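Under these two assumptions, the joint probability of a tagged sentence factorises as (standard bigram-HMM form, with t0 = <s>):

```latex
P(t_1,\dots,t_n,\, w_1,\dots,w_n) \;=\; \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)
```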
What’s a “balanced corpus”?
One that contains data from different genres and on different topics.