Evaluation and Linguistic Resources Flashcards

Question 1

Q

What are extrinsic evaluations?

Answer

A

They evaluate the performance of an NLP component by embedding it in an application and measuring how much the whole application improves

Question 2

Q

What is intrinsic evaluation?

Answer

A

It measures the quality of an NLP component independent of any application

Question 3

Q

How does classic AI differ from modern AI?

Answer

A

Classic AI is based on patterns, prescriptive grammars and symbolic rules, whereas modern AI infers statistical patterns and rules by examining large quantities of text

Question 4

Q

Why must we use a test and training split of data?

Answer

A

It lets us check whether the model has learned the signal rather than the noise in the training data, i.e. whether it still works on data it was not trained on

Question 5

Q

What types of data split can you have?

Answer

A

Training Dataset - used to fit the model

Validation Dataset - used to evaluate a model fitted on the training data while tuning hyperparameters

Test Dataset - used to evaluate the final model fitted on the training dataset
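
A minimal sketch of the three-way split, assuming a simple random shuffle and illustrative 80/10/10 proportions. The function name, the fractions and the seed are assumptions made for this example, not part of the original card.

```python
import random

def train_val_test_split(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle items and cut them into train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                # held back for the final evaluation
    val = shuffled[n_test:n_test + n_val]   # used while tuning hyperparameters
    train = shuffled[n_test + n_val:]       # used to fit the model
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # prints: 80 10 10
```

The three lists never overlap, so the test score is computed on items the model has not seen during fitting or tuning.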

Question 6

Q

What size of corpus is better?

Answer

A

The bigger the corpus, the more varied the language and the more word types it contains, so the better it is as training data

Question 7

Q

What is cross-validation?

Answer

A

It is a method of partitioning data for training and testing.

Question 8

Q

Explain how cross validation works

Answer

A

You divide the data into k folds. For training we use all but one of the folds, and we test on the remaining fold. This is repeated k times, with a different fold used for testing each time. The average error rate across the k runs can then be calculated
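
The procedure can be sketched as follows. This is an illustration only: `k_fold_indices`, `cross_validate` and the `error_fn` callback are hypothetical names, and `error_fn` stands in for whatever "train on these items, test on those, return the error rate" step a real experiment would use.

```python
def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs, one pair per fold."""
    indices = list(range(n_items))
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n_items, k, error_fn):
    """Average the error returned by error_fn(train_idx, test_idx) over k folds."""
    errors = [error_fn(train_idx, test_idx)
              for train_idx, test_idx in k_fold_indices(n_items, k)]
    return sum(errors) / len(errors)

for train_idx, test_idx in k_fold_indices(6, 3):
    print(train_idx, test_idx)
```

Every item appears in exactly one test fold, so each data point is used for testing once and for training k - 1 times.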

Question 9

Q

What are some pros and cons of cross-validation?

Answer

A

Pros - less biased error measure compared to a single test set

Cons - can be time consuming and computationally expensive, because the model is retrained for every fold, especially when the number of folds or the dataset is large

Question 10

Q

What is the random subsampling variation on cross validation?

Answer

A

It is similar to k-fold, but on each iteration a randomly chosen proportion of the dataset is used as the test set. The pros are that the size of the test set does not depend on the number of iterations, and that it may be more robust to selection bias. The cons are that some points may never be selected for testing, while others may be selected multiple times
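
A minimal sketch of random subsampling, assuming a 20% test proportion and a fixed seed (both illustrative choices). The function name is hypothetical.

```python
import random

def random_subsampling_splits(n_items, n_iterations, test_frac=0.2, seed=0):
    """Yield (train_idx, test_idx) pairs with a freshly sampled test set each time."""
    rng = random.Random(seed)
    n_test = max(1, int(n_items * test_frac))
    for _ in range(n_iterations):
        # Each test set is drawn independently of the previous ones,
        # so an index can show up in several test sets or in none.
        test_idx = sorted(rng.sample(range(n_items), n_test))
        test_set = set(test_idx)
        train_idx = [i for i in range(n_items) if i not in test_set]
        yield train_idx, test_idx

for train_idx, test_idx in random_subsampling_splits(10, 3):
    print(test_idx)
```

Unlike k-fold, the number of iterations and the test proportion are set independently of each other.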

Question 11

Q

What is the ROUGE metric?

Answer

A

It is a text summarisation metric. It measures machine summaries against a gold standard set of summaries from a set of humans. It looks for common sequences of words between the two summaries. It is recall oriented
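
As a rough illustration of the recall-oriented idea, the sketch below computes a simplified ROUGE-N recall against a single reference summary. It assumes lowercased whitespace tokenisation and is not a replacement for an official ROUGE implementation; the function names and example sentences are made up for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of the reference n-grams that also occur in the candidate."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    if not ref:
        return 0.0
    # Clip each match by the candidate count so repeated n-grams are not over-credited.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # 5 of the 6 reference unigrams are matched
print(rouge_n_recall(candidate, reference, n=2))  # 3 of the 5 reference bigrams are matched
```

The denominator is the number of n-grams in the reference, which is what makes the score recall oriented.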

Question 12

Q

What is the BLEU metric?

Answer

A

It is a metric used for machine translation. The idea is that a good MT output will contain the same sequences of words as a human-generated translation. It is precision based, as it measures how many of the word sequences in the machine translation also appear in the reference translation. It also penalises translations that are much shorter than the reference translations
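
A simplified sketch of the two ingredients: clipped n-gram precision and a brevity penalty. It assumes a single reference, whitespace tokenisation and n-grams up to length 2, with no smoothing, so it is an illustration rather than a full BLEU implementation. The function names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clipped precision: a candidate n-gram is credited at most as often
        # as it occurs in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: only candidates shorter than the reference are penalised.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))
print(simple_bleu("the cat", "the cat sat on the mat"))
```

In the second call every n-gram of the candidate appears in the reference, so precision alone would be perfect; the low score comes entirely from the brevity penalty.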

Question 13

Q

What is Perplexity?

Answer

A

It is used to evaluate language models. It measures how well a language model predicts a target text, based on the probability the model assigns to all the words in the text appearing in that order. The aim is to minimise the perplexity score
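
One common formulation is the inverse probability of the text normalised by its length, PP = exp(-(1/N) * sum of log p(w_i | history)). The sketch below assumes the language model has already supplied a probability for each word in order; the probabilities in the example are invented for illustration.

```python
import math

def perplexity(word_probs):
    """Perplexity of a text, given the model's probability for each word in order."""
    n = len(word_probs)
    # Sum log-probabilities rather than multiplying raw probabilities,
    # which avoids numerical underflow on long texts.
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.25, 0.25, 0.25, 0.25]
print(perplexity(confident_model))
print(perplexity(uncertain_model))
```

A model that assigns every word probability 1/k has perplexity k, so the score can be read as the average number of equally likely choices the model is hesitating between at each word.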

Question 14

Q

What does recall measure?

Answer

A

It measures the proportion of relevant items that were selected from all the relevant items

(Items that were correctly classified compared to all the items that should have been classified)

Question 15

Q

What does precision measure?

Answer

A

It measures the proportion of relevant items that were selected compared to all the items that were selected

(Items that were correctly classified compared to all the items that were classified)
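
Recall and precision share the same numerator and differ only in the denominator. A small sketch using sets, where `selected` is what the system returned and `relevant` is the gold standard (both names and the example items are illustrative):

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a set of selected items against the relevant ones."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)  # relevant items that were selected
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

selected = {"a", "b", "c", "d"}
relevant = {"a", "b", "e"}
precision, recall = precision_recall(selected, relevant)
print(precision, recall)
```

Here two of the four selected items are relevant (precision 2/4), and two of the three relevant items were selected (recall 2/3).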

Question 16

Q

How do we balance Precision and Recall?

Answer

A

We use the F-measure which is the weighted harmonic mean of precision and recall.

F1 balances both factors equally (F1 = 2PR / (P + R))
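
A short sketch of the weighted form, assuming the standard F-beta parameterisation in which beta > 1 weights recall more heavily and beta < 1 weights precision more heavily. The function name and example values are illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 1.0))          # F1 = 2PR / (P + R)
print(f_measure(0.5, 1.0, beta=2))  # recall counts for more than precision
```

With beta = 1 the expression reduces to 2PR / (P + R). Because the mean is harmonic, a low value on either measure pulls the score down more than an arithmetic mean would.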

Question 17

Q

What do we do when multiple classes exist?

Answer

A

We can calculate the measure separately for each class and then average the results (macroaveraging) - this treats every class equally, so it is more balanced than microaveraging

Or we can calculate the measure once on data pooled across all the classes (microaveraging) - this is dominated by the frequent classes
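
The difference shows up clearly with one frequent and one rare class. The sketch below uses precision as the measure and assumes every class has at least one prediction; the counts are invented for illustration and the function name is hypothetical.

```python
def macro_micro_precision(counts):
    """counts maps each class to a (true_positives, false_positives) pair."""
    # Macroaverage: compute precision per class, then average the per-class scores.
    per_class = [tp / (tp + fp) for tp, fp in counts.values()]
    macro = sum(per_class) / len(per_class)
    # Microaverage: pool the counts over all classes, then compute precision once.
    total_tp = sum(tp for tp, _ in counts.values())
    total_fp = sum(fp for _, fp in counts.values())
    micro = total_tp / (total_tp + total_fp)
    return macro, micro

counts = {"frequent": (90, 10), "rare": (1, 9)}
macro, micro = macro_micro_precision(counts)
print(macro, micro)
```

The per-class precisions are 0.9 and 0.1, so the macroaverage is 0.5, while the microaverage is 91/110 (about 0.83) because the frequent class contributes most of the pooled counts.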

(17 cards)