Prediction and Part-Of-Speech Tagging Flashcards
Corpus
A body of text that has been collected for some purpose.
Balanced Corpus
Contains texts which represent different genres.
Prediction
Given a sequence of words, we want to determine what’s most likely to come next.
N-gram Model
A type of Markov chain where the sequence of the prior n−1 words is used to predict the next.
Trigram
Uses the preceding two words.
Bigram
Uses the preceding word alone.
Unigram
Uses no context at all.
Bigram Model
Assigns a probability to a word based on the previous word alone.
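A minimal sketch of bigram estimation, using an illustrative toy corpus (the words and sentence markers are assumptions, not from the notes):

```python
from collections import Counter

# Toy corpus with sentence-boundary markers (illustrative only).
tokens = ["<s>", "the", "cat", "sat", "</s>",
          "<s>", "the", "dog", "sat", "</s>"]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # 1/2 = 0.5
```

Note the zero-probability problem: any bigram unseen in training gets probability 0, which is what smoothing (below) addresses.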
Viterbi Algorithm
A dynamic programming technique for efficiently applying n-grams in speech recognition and other applications to find the highest-probability sequence. It is usually described in terms of a finite-state automaton (FSA).
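A sketch of the Viterbi recursion for an HMM tagger; the states, transition and emission probabilities below are illustrative assumptions, and real systems work in log space to avoid underflow:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the highest-probability state sequence for obs,
    computed by dynamic programming over states at each position."""
    # V[t][s] = probability of the best path ending in state s at time t.
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs[t], 0.0), p)
                for p in states)
            V[t][s] = prob
            back[t][s] = prev
    # Trace back pointers from the best final state.
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy HMM with two tags (all numbers are made up for illustration):
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"can": 0.3, "fish": 0.7}, "V": {"can": 0.6, "fish": 0.4}}
tags = viterbi(["can", "fish"], states, start_p, trans_p, emit_p)
print(tags)  # ['V', 'N']
```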
Smoothing
To allow for sparse data, we use smoothing. This means that we make some assumption about the probability of unseen or very infrequently seen events and redistribute probability mass to them appropriately.
Add-one Smoothing
Add one to all counts. Not theoretically sound, but simple to implement.
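A minimal sketch of add-one (Laplace) smoothing for unigram probabilities, using assumed toy counts; the key point is that unseen words get a small non-zero probability:

```python
from collections import Counter

# Toy counts (illustrative); the vocabulary includes words never seen.
counts = Counter({"the": 3, "cat": 1})
vocab = {"the", "cat", "dog", "sat"}
total = sum(counts.values())

def add_one_prob(word):
    """Add-one estimate: (count + 1) / (N + V), where V is vocabulary size."""
    return (counts[word] + 1) / (total + len(vocab))

print(add_one_prob("dog"))  # unseen word: (0 + 1) / (4 + 4) = 0.125
print(add_one_prob("the"))  # (3 + 1) / (4 + 4) = 0.5
```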
Backoff
If an n-gram has not been seen, back off to the lower-order (n−1)-gram probability, e.g. estimate an unseen trigram from the corresponding bigram.
Part of Speech Tagging
Associating words in a corpus with a tag indicating some syntactic information that applies to that particular use of the word. POS tagging makes it easier to extract some types of information.
Stochastic POS-tagging
Too complex to capture on a flashcard; see pages 22-24 in the notes.
Evaluation of POS tagging
POS tagging algorithms are evaluated in terms of the percentage of correct tags. Success rates of 95% can be misleading, since the baseline of choosing each word's most common tag (based on the training set) already gives about 90% accuracy.
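A sketch of the most-common-tag baseline described above, on assumed toy data (the words, tags, and the fallback tag for unknown words are illustrative):

```python
from collections import Counter, defaultdict

# Toy tagged data as (word, tag) pairs (illustrative only).
train = [("the", "DET"), ("can", "NOUN"), ("can", "VERB"), ("can", "NOUN")]
test = [("the", "DET"), ("can", "VERB")]

# Count how often each tag occurs with each word in training data.
tag_counts = defaultdict(Counter)
for word, tag in train:
    tag_counts[word][tag] += 1

def baseline_tag(word):
    """Tag each word with its most frequent training tag;
    the NOUN fallback for unknown words is an assumption."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return "NOUN"

accuracy = sum(baseline_tag(w) == t for w, t in test) / len(test)
print(accuracy)  # 0.5: "the" is right, "can" is tagged NOUN not VERB
```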
Training data and test data
The assumption in NLP is that a system should work on novel data. The test data must therefore be kept unseen.
Baselines
Report evaluation with respect to a baseline, which is normally what could be achieved with a very basic approach, given the same training data.
Ceiling
The ceiling for performance of an application. This is usually taken to be human performance on that task, where the ceiling is the percentage agreement found between two annotators.
Error Analysis
The error rate of a program will be distributed very unevenly across error types. Some errors may also be more important than others, e.g. treating an incoming order as junk is much worse than the converse.
Reproducibility
Evaluation should be done on a generally available corpus so that other researchers can replicate the experiments.