Prediction and Part-Of-Speech Tagging Flashcards
Corpus
A body of text that has been collected for some purpose.
Balanced Corpus
Contains texts which represent different genres.
Prediction
Given a sequence of words, we want to determine what’s most likely to come next.
N-gram Model
A type of Markov chain in which the sequence of the prior n-1 words is used to predict the next.
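For illustration, a minimal sketch in Python of collecting n-gram counts by sliding a window over a token list (the toy corpus and names here are invented, not from the notes):

from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

# Toy corpus, invented for illustration.
tokens = "the cat sat on the mat".split()
counts = Counter(ngrams(tokens, 2))
print(counts[("the", "cat")])  # 1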
Trigram
Uses the preceding two words.
Bigram
Uses the preceding word.
Unigram
Uses no context at all.
Bigrams Model
Assigns a probability to a word based on the previous word alone.
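A minimal sketch of estimating bigram probabilities by relative frequency, P(word | prev) = count(prev, word) / count(prev); the toy corpus is invented for illustration:

from collections import Counter

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))  # 0.5: "the" occurs twice, once followed by "cat"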
Viterbi Algorithm
A dynamic programming technique for efficiently applying n-grams in speech recognition and other applications to find the highest-probability sequence. It is usually described in terms of an FSA.
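A minimal sketch of Viterbi decoding over a hand-built HMM; the tags, words, and probabilities below are invented for illustration, and a real system would estimate them from data:

def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[t] holds the probability of the best path ending in tag t.
    best = {t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}
    back = []
    for w in words[1:]:
        prev_best, best, pointers = best, {}, {}
        for t in tags:
            # Choose the predecessor tag that maximises the path probability.
            p, prev = max((prev_best[s] * trans_p[s][t], s) for s in tags)
            best[t] = p * emit_p[t].get(w, 0.0)
            pointers[t] = prev
        back.append(pointers)
    # Follow the backpointers from the best final tag.
    path = [max(best, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Invented toy model for illustration.
tags = ["DET", "NOUN"]
start_p = {"DET": 0.8, "NOUN": 0.2}
trans_p = {"DET": {"DET": 0.1, "NOUN": 0.9}, "NOUN": {"DET": 0.5, "NOUN": 0.5}}
emit_p = {"DET": {"the": 0.9}, "NOUN": {"cat": 0.5, "mat": 0.5}}
print(viterbi(["the", "cat"], tags, start_p, trans_p, emit_p))  # ['DET', 'NOUN']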
Smoothing
A way of allowing for sparse data: we make some assumption about the probability of unseen or very infrequently seen events and distribute that probability appropriately.
Add-one Smoothing
Add one to all counts; not theoretically sound, but simple to implement.
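A minimal sketch of add-one smoothing for bigrams: every count is incremented by one, and the vocabulary size V is added to the denominator so the probabilities still sum to one (toy corpus invented for illustration):

from collections import Counter

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
vocab_size = len(unigram_counts)

def smoothed_bigram_prob(prev, word):
    # P(word | prev) = (count(prev, word) + 1) / (count(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

print(smoothed_bigram_prob("cat", "the"))  # unseen bigram gets a small nonzero probability (1/6)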
Backoff
If an n-gram has not been observed (or is too rare to estimate reliably), back off to the corresponding lower-order n-gram probability.
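A simplified sketch of backoff: use the bigram estimate when the bigram has been seen, otherwise fall back to the unigram estimate (real schemes such as Katz backoff also discount and renormalise the probabilities, which is omitted here; toy corpus invented for illustration):

from collections import Counter

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
total = len(tokens)

def backoff_prob(prev, word):
    if bigram_counts[(prev, word)] > 0:
        # Seen bigram: use the bigram estimate.
        return bigram_counts[(prev, word)] / unigram_counts[prev]
    # Unseen bigram: back off to the unigram estimate.
    return unigram_counts[word] / total

print(backoff_prob("cat", "the"))  # unseen bigram backs off to P("the") = 2/6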
Part of Speech Tagging
Associating words in a corpus with a tag indicating some syntactic information that applies to that particular use of the word. POS tagging makes it easier to extract some types of information.
Stochastic POS-tagging
POS tagging using probabilities estimated from a tagged training corpus, typically a hidden Markov model decoded with the Viterbi algorithm. Too complex to cover fully on a flashcard; see pages 22-24 in the notes.
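As a hedged sketch of the general idea (the tiny tagged corpus below is invented; real taggers are trained on large corpora): an HMM tagger needs tag-transition probabilities P(tag | previous tag) and word-emission probabilities P(word | tag), both estimated by counting over a tagged training corpus and then decoded with the Viterbi algorithm as above:

from collections import Counter

# Tiny tagged corpus, invented for illustration.
tagged = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
          ("on", "PREP"), ("the", "DET"), ("mat", "NOUN")]

tag_counts = Counter(tag for _, tag in tagged)
trans_counts = Counter((t1, t2) for (_, t1), (_, t2) in zip(tagged, tagged[1:]))
emit_counts = Counter(tagged)

def trans_prob(prev_tag, tag):
    # P(tag | prev_tag), estimated from adjacent tag pairs.
    return trans_counts[(prev_tag, tag)] / tag_counts[prev_tag]

def emit_prob(word, tag):
    # P(word | tag), estimated from word/tag counts.
    return emit_counts[(word, tag)] / tag_counts[tag]

print(trans_prob("DET", "NOUN"))  # 1.0 in this toy corpus
print(emit_prob("the", "DET"))    # 1.0 in this toy corpus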
Evaluation of POS tagging
POS tagging algorithms are evaluated in terms of the percentage of correct tags. Success rates of 95% can be misleading, since the baseline of choosing each word's most common tag (based on the training set) already gives about 90% accuracy.
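A minimal sketch of the most-common-tag baseline; the train and test data are invented for illustration:

from collections import Counter, defaultdict

# Invented toy data for illustration.
train = [("the", "DET"), ("cat", "NOUN"), ("run", "VERB"),
         ("run", "NOUN"), ("run", "VERB")]
test = [("the", "DET"), ("run", "VERB"), ("run", "NOUN")]

# For each word, pick its most frequent tag in the training set.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1
most_common = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

correct = sum(most_common.get(word) == tag for word, tag in test)
print(correct / len(test))  # 2/3 on this toy data; around 0.9 on real corpora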