Lecture 7 - Model Selection, Evaluation Flashcards

1
Q

What are the two goals of evaluating the prediction performance?

A
  1. Model selection: almost all methods have parameters that need to be chosen
  2. Model evaluation: After selecting your hypothesis, estimate how well it generalizes to new data.
2
Q

What does the loss function tell us?

A

The price paid for predicting y’ when the true value is y

3
Q

What is misclassification rate?

A

The average zero-one loss over the data set, aka the fraction of misclassified instances: 10% misclassified => misclassification rate 0.1

4
Q

What is classification accuracy?

A

The fraction of correctly classified instances, aka 1 - misclassification rate
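
As a minimal sketch of how these two quantities relate (pure Python; the label lists below are made up for illustration):

    # zero-one loss: 1 for a wrong prediction, 0 for a correct one
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # hypothetical true labels
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]   # hypothetical predictions

    losses = [0 if p == t else 1 for p, t in zip(y_pred, y_true)]
    misclassification_rate = sum(losses) / len(losses)   # 2 wrong out of 10 => 0.2
    accuracy = 1 - misclassification_rate                 # => 0.8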

5
Q

What is the generalization error?

A

Expected loss over the probability distribution of our data, aka the average loss for a random new instance

6
Q

What is the standard approach for generalization error?

A

Having separate training and test data sets

7
Q

What are some estimators of generalization error?

A
  • Training set error
  • Test set error
  • CV error
  • Bootstrap error
8
Q

What is bias?

A

Expected difference between estimated and true error, ideally close to zero

9
Q

What is optimistic/pessimistic bias?

A

Optimistic bias: systematically estimates error to be smaller than it is
Pessimistic bias: systematically estimates error to be larger than it is

10
Q

What is variance?

A

How much the estimate fluctuates, i.e. how large the magnitudes of the differences between the estimated and the true error tend to be on average.

11
Q

Can a method with zero bias have large variance?

A

Yes: negative and positive estimation errors cancel each other out on average, even if individual estimates vary a lot

12
Q

What is training set error (resubstitution error)?

A

The average loss on the training examples.
Has a high optimistic bias, since the model was chosen because of its good fit to this particular data set

13
Q

What is testing set error (holdout estimate)?

A

Splitting the data into training and test sets, then computing the average loss on the test set.
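
A hedged scikit-learn sketch of the holdout estimate; the iris data set, the k-NN classifier and the 80/20 split are arbitrary stand-ins, not from the lecture:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = KNeighborsClassifier().fit(X_train, y_train)
    test_error = 1 - model.score(X_test, y_test)   # average zero-one loss on the test set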

14
Q

Is test set error unbiased?

A

Only if a single hypothesis is tested; testing multiple hyperparameter values etc. on the same test set leads to optimistic bias

15
Q

What is the solution to avoid optimistic bias for test set?

A

Splitting the data into training, validation and test sets

16
Q

What is stratification?

A

Splitting the data so that the distribution of classes in each subset is similar to the distribution in the whole data set.
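
A small sketch, assuming scikit-learn's train_test_split and a made-up imbalanced label vector; stratify=y keeps the class proportions roughly the same in both subsets:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(100).reshape(100, 1)          # toy features
    y = np.array([0] * 80 + [1] * 20)           # imbalanced classes: 80% vs 20%

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)
    print(y_tr.mean(), y_te.mean())             # both about 0.2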

17
Q

What is Cross-Validation and why is it used?

A

It is used because too small test sets can give unreliable estimates. We want as much data as possible for both testing and training, so we use CV to make “multiple” training and testing sets from the same data.

18
Q

What is leave-one-out cross-validation?

A

Given a data set of n instances, we use the i-th instance as the test set and the remaining n-1 instances as the training set, then loop i through all n instances.
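
A sketch of this loop, assuming scikit-learn (LeaveOneOut only handles the index bookkeeping; the data set and classifier are arbitrary choices):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    losses = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
        # zero-one loss on the single held-out instance
        losses.append(int(model.predict(X[test_idx])[0] != y[test_idx][0]))
    loo_error = sum(losses) / len(losses)   # leave-one-out estimate of the error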

19
Q

Can the leave-one-out CV overfit?

A

No, because test instances are not part of the training set

20
Q

What is the benefit of leave-one-out CV?

A

It allows us to use the whole data set for both training and testing simultaneously.

21
Q

What is K-fold CV?

A

Same idea as leave-one-out CV, but instead of leaving out a single instance, the data is split into K folds: one fold is used for testing and the remaining K-1 folds for training, repeated for each fold.

22
Q

What is the K in K-fold cross validation?

A

It is the number of folds the data is split into, NOT the number of data points in one fold.

So for example, 100 data points with K=5 => each test fold has 20 data points
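
A sketch matching this example, assuming scikit-learn and a made-up data set of 100 points (the logistic regression model is an arbitrary choice):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                                    # 100 toy data points
    y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

    errors = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        errors.append(1 - model.score(X[test_idx], y[test_idx]))     # error on this fold
    print(len(test_idx), np.mean(errors))   # each test fold has 20 points; CV error estimate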

23
Q

What is the benefit of K-fold CV?

A

Much faster than leave-one-out.

24
Q

What is the disadvantage of K-fold CV?

A

Can have a small pessimistic bias, because the models being evaluated are trained on only part of the data and can be slightly less accurate than the final model trained on all data

25
Q

What’s the benefit of M-times repeated K-fold CV?

A

The variance of the K-fold estimate can be reduced by repeating the CV M times with differently randomized folds and averaging the results.
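
A sketch, assuming scikit-learn's RepeatedKFold (M=10 repeats of 5-fold CV; the iris data and k-NN classifier are placeholders):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import RepeatedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)   # M=10 repeats of K=5 folds
    scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)  # 50 accuracy values
    repeated_cv_error = 1 - scores.mean()   # averaging over repeats reduces the variance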

26
Q

Can we do model selection with CV?

A

Yes: using the training folds we can choose hyperparameter values, feature sets etc., and then test the final model on independent test data. A separate validation set is not needed.
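
One possible way to do this with scikit-learn (a sketch; the parameter grid, classifier and data set are arbitrary examples): GridSearchCV runs the CV over the training data only, and the held-out test set is used once at the end.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
    search.fit(X_train, y_train)              # hyperparameter selection via CV on training data
    print(search.best_params_)                # chosen hyperparameter value
    print(1 - search.score(X_test, y_test))   # error of the final model on independent test data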

27
Q

What is nested CV?

A

CV within CV. We split the data into K folds, take one fold as test data, and run an inner CV on the remaining folds to compare different parameter values; the selected model is then evaluated on the held-out test fold. This is repeated for every fold.

Computationally very expensive
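
A sketch of nested CV under the same assumptions as above (scikit-learn, arbitrary grid and fold counts): the inner CV picks the hyperparameter inside each outer training fold, and the outer CV estimates the error of the whole selection procedure.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    inner = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
    outer_scores = cross_val_score(inner, X, y, cv=5)   # each outer fold reruns the inner CV
    nested_cv_error = 1 - outer_scores.mean()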

28
Q

What are some challenges in performance evaluation?

A

The data isn’t always iid (independent, identically distributed).

29
Q

What’s the issue with grouped data?

A

With grouped data it is easy to make predictions if part of a group is in the training set and part in the test set. The solution is to randomize the split at the whole-group level, so that all data from a group is either in the test set or in the training set.
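
A sketch of group-level splitting, assuming scikit-learn's GroupKFold and a made-up group id array (e.g. one id per patient or subject):

    import numpy as np
    from sklearn.model_selection import GroupKFold

    X = np.arange(24).reshape(12, 2)                            # 12 toy instances
    y = np.array([0, 1] * 6)
    groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])     # hypothetical group ids

    for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups):
        # every group ends up entirely in either the training or the test set
        assert set(groups[train_idx]).isdisjoint(groups[test_idx])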

30
Q

Why is time series data problematic?

A

A random split is not suitable: for example, it is easy to predict a stock price for a given date if we know the prices of the previous and the next day
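
A sketch of time-ordered splitting, assuming scikit-learn's TimeSeriesSplit and a made-up series; every training set contains only observations that come before its test set:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    series = np.arange(10).reshape(-1, 1)        # 10 time-ordered toy observations
    for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(series):
        print(train_idx, test_idx)               # test indices always follow the train indices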

31
Q

What is Mean Squared Error? What is its baseline?

A

(y’ - y)^2, averaged over the data set.
Zero if the prediction is correct
Sensitive to outliers
Baseline: is the MSE smaller than that of always predicting the mean of the training set y values?

32
Q

What is mean absolute error? What is its baseline?

A

|y’ - y|, averaged over the data set
Zero if correct, otherwise grows linearly
Not as sensitive to outliers
Baseline: is the MAE smaller than that of always predicting the median of the training set y values?
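
A sketch comparing both losses with their baselines (scikit-learn metric functions; all numbers are made up):

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error

    y_train = np.array([1.0, 2.0, 2.5, 4.0, 10.0])   # training targets
    y_test = np.array([2.0, 3.0, 5.0])               # test targets
    y_pred = np.array([2.5, 2.5, 4.0])               # model predictions on the test set

    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)

    # Baselines: constant predictions of the training mean (MSE) and training median (MAE)
    mse_base = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))
    mae_base = mean_absolute_error(y_test, np.full_like(y_test, np.median(y_train)))
    print(mse < mse_base, mae < mae_base)            # a useful model should beat both baselines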

33
Q

What is baseline for misclassification rate?

A

Predicting the majority class (a majority-class classifier)

34
Q

What is cost matrix?

A

A matrix of misclassification costs: the consequence of an error on one class might be different than on another class.

For example, a broken tea cup that has to be produced again -> c1

Mailing a broken cup leads to mailing a new one, reputation costs etc. -> c2

If c1 is a lot cheaper, it is better to predict “broken” incorrectly more often than “not broken”

35
Q

What is a confusion matrix?

A

Classification results represented in a confusion matrix, where rows are true classes and the columns are predicted classes

36
Q

What does a binary confusion matrix look like?

A

Predicted values on the rows, actual values on the columns:

TP FP
FN TN
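
A quick check with scikit-learn (toy labels). Note that confusion_matrix uses the opposite layout to this card: true classes on the rows and predicted classes on the columns (as in the previous card), so the binary matrix comes out as [[TN, FP], [FN, TP]]:

    from sklearn.metrics import confusion_matrix

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
    print(confusion_matrix(y_true, y_pred))
    # [[3 1]    <- TN, FP
    #  [1 3]]   <- FN, TP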

37
Q

How is the precision calculated? What does it tell us?

A

Precision = TP / (TP+FP)

Tells the proportion of relevant documents among those returned. It can be maximized by classifying only the instances we are most certain are positive, which leads to bad recall

38
Q

How is recall calculated and what does it tell us?

A

Recall = TP / (TP + FN)

Tells us the proportion of relevant documents that were found.

Maximized when there are no false negatives, so it is easy to maximize: just classify everything as positive.

39
Q

How is F-score calculated?

A

2 * (Precision * Recall) / (Precision + Recall)
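
A sketch computing all three measures on the same toy labels (scikit-learn functions; the label lists are made up):

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

    p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3 / 4
    r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3 / 4
    f = f1_score(y_true, y_pred)          # 2 * p * r / (p + r) = 0.75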

40
Q

What is TPR?

A

True Positive Rate (recall)
TPR = TP / (TP + FN)

Proportion of positives correctly identified

41
Q

What is FPR?

A

False Positive Rate
FPR = FP / (FP + TN)

Proportion of negatives incorrectly classified as positive

42
Q

What is the ROC curve?

A

Receiver Operating Characteristic curve. It plots the true positive rate against the false positive rate as the classification threshold is varied

43
Q

What is AUC?

A

The area under the ROC curve. It’s an indicator of how well the classifier solves the problem: the larger the area, the better the solution.

44
Q

What values can AUC take?

A

Between 0 and 1. AUC = 1 means a perfect ranking: all instances of the positive class receive higher predicted values than those of the negative class. AUC = 0.5 is the same as random guessing, so AUC should be better than 0.5
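
A sketch computing the ROC curve and AUC from continuous prediction scores (scikit-learn; the labels and scores below are made up):

    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = [0, 0, 1, 1, 0, 1]
    scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # e.g. predicted probabilities of the positive class

    fpr, tpr, thresholds = roc_curve(y_true, scores)   # points of the ROC curve
    auc = roc_auc_score(y_true, scores)                # area under that curve
    # auc is 1.0 only if every positive instance is scored above every negative one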

45
Q

What does it mean that AUC is invariant to relative class distributions?

A

The AUC should be the same for an imbalanced data set with a 90-10 split of positives and negatives as for a balanced one with a 50-50 split

46
Q

What is a classical mistake with AUC?

A

Computing the curve from the discrete predicted labels 1 and 0 instead of from continuous prediction scores that give intermediate TPR and FPR values like 0.6 and 0.4
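
A sketch of the pitfall, reusing the made-up data above: passing thresholded 0/1 predictions instead of continuous scores collapses the ranking to a single threshold and changes the AUC.

    from sklearn.metrics import roc_auc_score

    y_true = [0, 0, 1, 1, 0, 1]
    scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]                # continuous prediction scores
    hard_labels = [1 if s >= 0.5 else 0 for s in scores]    # thresholded 0/1 predictions

    print(roc_auc_score(y_true, scores))        # AUC from the full ranking
    print(roc_auc_score(y_true, hard_labels))   # generally a different, less informative value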