Evaluation and Linguistic Resources Flashcards
What are extrinsic evaluations?
They evaluate the performance of an NLP component by embedding it in an application and measuring how much the whole application improves
What is intrinsic evaluation?
It measures the quality of an NLP component independent of any application
How does classic AI differ from modern AI?
Classic AI is based on hand-written patterns, prescriptive grammars and symbolic rules, whereas modern AI infers statistical patterns and rules by examining large quantities of text
Why must we use a test and training split of data?
It lets us distinguish signal from noise. A model can fit noise in its training data, so we test on held-out data to check whether the model works outside of training
What type of data splits can you have?
Training Dataset - used to fit the model
Validation Dataset - used to evaluate a model that was fit on the training dataset while tuning hyperparameters
Test Dataset - used to provide an unbiased evaluation of the final model, which was fit on the training dataset
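The three splits above can be sketched in a few lines. This is a minimal illustration only: the function name `train_val_test_split`, the 80/10/10 proportions and the fixed seed are assumptions made for the example, not values from the flashcards.

```python
import random

def train_val_test_split(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the items and cut them into train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                # held back until the very end
    val = shuffled[n_test:n_test + n_val]   # used while tuning hyperparameters
    train = shuffled[n_test + n_val:]       # used to fit the model
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Every item lands in exactly one split, so the test set never overlaps with the data used for fitting or tuning.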
What size of corpus is better?
The bigger the corpus, the more varied the language and the more word types it contains, so the better the training data
What is cross-validation?
It is a method of partitioning data for training and testing.
Explain how cross-validation works
You divide the data into k folds. On each run we train on all but one of the folds and test on the held-out fold. This is repeated k times, with a different fold used for testing each time. The k error rates are then averaged
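The procedure above can be sketched as follows. The helper names `k_fold_splits` and `cross_validate`, and the `error_fn` callback that stands in for "train a model and return its error rate", are hypothetical names introduced for this example. The sketch assumes the data has already been shuffled.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_items))
    # Deal indices into k folds like cards: fold i gets items i, i+k, i+2k, ...
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n_items, k, error_fn):
    """Average the error returned by error_fn(train_idx, test_idx) over k folds."""
    errors = [error_fn(train, test) for train, test in k_fold_splits(n_items, k)]
    return sum(errors) / k

for train_idx, test_idx in k_fold_splits(10, 5):
    print("test:", test_idx, "train:", train_idx)
```

Each item is used for testing exactly once and for training k - 1 times.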
What are some pros and cons of cross-validation?
Pros - less biased error measure compared to a single test set
Cons - can be time consuming and computationally expensive when the dataset or the number of folds is large, because the model is retrained once per fold
What is the random subsampling variation on cross-validation?
It is similar to k-fold, but on each iteration a randomly chosen proportion of the dataset becomes the test set. The pros are that the size of the train/test split does not depend on the number of iterations, and it may be more robust to selection bias. The cons are that some points may never be selected for testing, while others may be selected multiple times
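A minimal sketch of random subsampling, under the assumption that the test proportion and the number of iterations are chosen independently. The function name `random_subsampling_splits` and its parameters are hypothetical.

```python
import random
from collections import Counter

def random_subsampling_splits(n_items, test_frac, n_iterations, seed=0):
    """Yield (train_indices, test_indices) with a fresh random test set each time."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    n_test = max(1, round(n_items * test_frac))
    for _ in range(n_iterations):
        test_idx = set(rng.sample(indices, n_test))  # drawn without replacement
        train_idx = [i for i in indices if i not in test_idx]
        yield train_idx, sorted(test_idx)

# Count how often each point ends up in a test set across iterations.
times_tested = Counter()
for _, test_idx in random_subsampling_splits(20, test_frac=0.25, n_iterations=4):
    times_tested.update(test_idx)
print(times_tested)
```

Printing the counts shows the drawback mentioned above: because each test set is drawn independently, some points can appear several times and others not at all.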
What is the ROUGE metric?
It is a text summarisation metric. It measures machine summaries against a gold standard set of summaries written by humans. It looks for common sequences of words (n-grams) between the two summaries. It is recall-oriented: it asks how much of the human summary also appears in the machine summary
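As a rough illustration of the recall-oriented idea, the sketch below computes a simplified ROUGE-N recall against a single reference summary. It is not a full ROUGE implementation: whitespace tokenisation, the single reference and the function names are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Share of the reference's n-grams that also occur in the candidate."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # matches, clipped to the smaller count
    total = sum(ref.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 reference unigrams matched
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 reference bigrams matched
```

The denominator is the number of n-grams in the reference, which is what makes the score a recall rather than a precision.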
What is the BLEU metric?
It is a metric used for machine translation (MT). The idea is that a good machine translation will share the same sequences of words as a human-generated translation. It is precision-based, as it measures how much of the machine output also appears in the reference translation. It also penalises translations that are much shorter than the reference translations (the brevity penalty)
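The precision and length-penalty ideas can be sketched with a simplified, single-reference BLEU. This is an illustration under stated assumptions (whitespace tokenisation, one reference, n-grams up to `max_n=2`, no smoothing), and the function names are hypothetical. Real evaluations would normally use an established toolkit instead.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(cand_tokens, ref_tokens, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand = ngram_counts(cand_tokens, n)
    ref = ngram_counts(ref_tokens, n)
    clipped = sum((cand & ref).values())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def brevity_penalty(cand_len, ref_len):
    """1.0 for candidates longer than the reference, below 1.0 for shorter ones."""
    if cand_len == 0:
        return 0.0
    if cand_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)

def simple_bleu(candidate, reference, max_n=2):
    cand, ref = candidate.split(), reference.split()
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing: any zero precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(len(cand), len(ref)) * geo_mean

reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))  # identical to the reference
print(simple_bleu("the cat", reference))                 # precise but far too short
```

In the second call every n-gram of the candidate appears in the reference, so precision alone would be perfect; only the brevity penalty pulls the score down.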
What is Perplexity?
It is used to evaluate language models. It measures how well a language model predicts a target text, based on the probability the model assigns to all the words in the text appearing in that order. Lower perplexity means a better model, so the aim is to minimise the perplexity score
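A minimal sketch of the calculation, assuming we already have the probability the model assigned to each word of the target text given its context. The probability values below are made up for illustration and do not come from a real model.

```python
import math

def perplexity(word_probs):
    """exp of the average negative log-probability per word."""
    n = len(word_probs)
    log_prob_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob_sum / n)

# Hypothetical per-word probabilities from two models on the same 4-word text.
uncertain_model = [0.25, 0.25, 0.25, 0.25]
confident_model = [0.5, 0.5, 0.5, 0.5]
print(perplexity(uncertain_model))
print(perplexity(confident_model))
```

A model that assigns probability 1/4 to every word has a perplexity of about 4, as if it were choosing uniformly among four words at each step. The model that assigns higher probabilities to the observed words gets the lower score.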
What does recall measure?
It measures the proportion of relevant items that were selected from all the relevant items
(Items that were correctly classified compared to all items that should have been classified)
What does precision measure?
It measures the proportion of relevant items that were selected compared to all the items that were selected
(Items that were correctly classified compared to all the items that were classified)
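Recall and precision share a numerator (the relevant items that were selected) and differ only in the denominator. A small sketch, using made-up item labels and a hypothetical `precision_recall` helper:

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a set of selected items."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)  # relevant items that were selected
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

selected = {"a", "b", "c", "d"}  # what the system picked
relevant = {"a", "b", "e"}       # what it should have picked
precision, recall = precision_recall(selected, relevant)
print(precision)  # 2 hits out of 4 selected
print(recall)     # 2 hits out of 3 relevant
```

Selecting more items can only keep or raise recall, but it tends to lower precision, which is why the two are usually reported together.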