Lecture 7 - Model Selection, Evaluation Flashcards
What are the two goals of evaluating the prediction performance?
- Model selection: almost all methods have parameters that need to be chosen
- Model evaluation: After selecting your hypothesis, estimate how well it generalizes to new data.
What does the loss function tell?
The price of predicting y’ when the true value is y
What is misclassification rate?
The average zero-one loss over the data set, i.e. the fraction of incorrect predictions: 10% wrong predictions => misclassification rate 0.1
What is classification accuracy?
The fraction of correct instances, aka 1 - misclassification rate
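A minimal sketch of both metrics with NumPy, using made-up toy labels:

```python
import numpy as np

# Toy labels (not from the lecture), just to show the computation
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

misclassification_rate = np.mean(y_true != y_pred)  # average zero-one loss
accuracy = 1.0 - misclassification_rate             # fraction of correct predictions
print(misclassification_rate, accuracy)             # 0.2 0.8
```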
What is the generalization error?
Expected loss over the probability distribution of our data, aka the loss on average for random new instance
What is the standard approach for generalization error?
Having separate training and test data sets
What are some estimators of generalization error?
- Training set error
- Test set error
- CV error
- Bootstrap error
What is bias?
Expected difference between estimated and true error, ideally close to zero
What is optimistic/pessimistic bias?
Optimistic bias: systematically estimates error to be smaller than it is
Pessimistic bias: systematically estimates error to be larger than it is
What is variance?
How large the magnitudes of the differences between estimated and true error tend to be on average, i.e. how much the estimate fluctuates.
Can method with zero bias have large variance?
Yes: positive and negative estimation errors can cancel out on average, so the bias is zero even though individual estimates vary a lot
What is training set error (resubstitution error)
The average loss on the training examples.
Has a strong optimistic bias, since the model was chosen because it fits this particular data set well
What is testing set error (holdout estimate)?
Splitting the data into training and test sets, then computing the average loss on the test set.
Is test set error unbiased?
Only if we evaluate a single fixed hypothesis; trying multiple hyperparameter values etc. and picking the best one leads to optimistic bias
What is the solution to avoid optimistic bias for test set?
Splitting the data into training, validation and test sets
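A sketch of a train-validation-test split, assuming scikit-learn and synthetic data (the 60/20/20 proportions are just an example):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 20% as the final test set, then split the rest into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
# -> 60% training, 20% validation, 20% test
```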
What is stratification?
Splitting the data so that the class distribution in each subset is (approximately) the same as in the full data set.
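A stratified-split sketch with scikit-learn, assuming imbalanced synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data with a 90/10 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Class proportions in the test set roughly match the full data
print(np.bincount(y) / len(y), np.bincount(y_te) / len(y_te))
```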
What is Cross-Validation and why is it used?
It is used because a too small test set can give unreliable estimates. We want as much data as possible for both training and testing, so CV creates “multiple” training and test sets from the same data.
What is leave-one-out cross-validation?
With a data set of n instances, we use the i-th instance as the test set and the remaining n-1 instances as the training set, then repeat this for all n instances and average the losses.
Can the leave-one-out CV overfit?
No, because test instances are not part of the training set
What is the benefit of leave-one-out CV?
Allows us to use the whole data set for both training and testing simultaneously.
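A leave-one-out sketch with scikit-learn; the classifier and data set are placeholders, not the lecture's:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# One prediction per instance, each made by a model trained on the other n-1 instances
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print(scores.mean())  # LOO estimate of accuracy
```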
What is K-fold CV?
Same idea as leave-one-out CV, but instead of leaving out a single instance, one fold is used for testing and the remaining K-1 folds for training.
What is the K in K-fold cross validation?
It is the number of folds the data is split into, NOT how many data points there are in one fold.
So for example, with 100 data points and k=5, each test fold contains 20 points.
What is the benefit of K-fold CV?
Much faster than leave-one-out.
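A 5-fold CV sketch with scikit-learn (model and data are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# Each fold is used once for testing and K-1 = 4 times for training
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
print(scores, scores.mean())
```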
What is the disadvantage of K-fold CV?
Can have a small pessimistic bias, because each evaluated model is trained on only part of the data and can be slightly less accurate than the final model trained on all data
What’s the benefit of M-times repeated K-fold CV?
Some of the variance in K-fold CV can be reduced by repeating the CV M times with different randomly sampled folds and averaging the results.
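A repeated K-fold sketch, assuming scikit-learn's RepeatedKFold and a placeholder regression task:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
# M = 10 repetitions of 5-fold CV, each with a different random fold split
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="neg_mean_squared_error")
print(-scores.mean())  # error averaged over 5 x 10 = 50 train/test splits
```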
Can we do model selection with CV?
Yes: using the training folds we can choose hyperparameter values, feature sets etc., and then test the final model on independent test data. This way a separate validation set is not needed.
What is nested CV?
CV within CV. In the outer loop the data is split into K folds and one fold at a time is held out as test data; an inner CV on the remaining folds is used to compare different parameter values, and the selected model is then evaluated on the held-out test fold.
Computationally very expensive
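A nested CV sketch, assuming scikit-learn; the estimator and parameter grid are made up for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # selects hyperparameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates generalization error

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(nested_scores.mean())
```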
What are some challenges in performance evaluation?
The data isn’t always iid (independent, identically distributed).
What’s the issue with grouped data?
With grouped data it is easy to make predictions if part of a group is in the training set and part in the test set. The solution is to randomize the split at the whole-group level, so that all data of a group is either in the test set or in the training set
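A group-aware split sketch with scikit-learn's GroupKFold (toy groups assumed):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # e.g. one group per patient

for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # No group appears in both the training and the test set
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```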
Why is time series data problematic?
A random split is not suitable: it is easy to predict a stock price if we know its value on the previous and the next day of the predicted date, so the split must respect the time order.
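A time-ordered split sketch with scikit-learn's TimeSeriesSplit:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # ten time steps
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    # The training indices always come before the test indices in time
    print("train:", train_idx, "test:", test_idx)
```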
What is Mean Squared Error? What is its baseline?
(y’ - y)^2.
Zero if correct
Sensitive to outliers
Baseline: the MSE should be smaller than that of always predicting the mean of the training set y values
What is mean absolute error? What is its baseline?
|y’ - y|
Zero if correct, else linearly growing
Not as sensitive to outliers
Baseline: the MAE should be smaller than that of always predicting the median of the training set y values
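A sketch of both losses and their baselines, assuming NumPy/scikit-learn and made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_train = np.array([1.0, 2.0, 2.5, 4.0, 10.0])
y_test = np.array([1.5, 3.0, 2.0])
y_pred = np.array([1.4, 2.7, 2.2])  # hypothetical model predictions

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

# Baselines: always predict the training mean (MSE) or the training median (MAE)
mse_baseline = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))
mae_baseline = mean_absolute_error(y_test, np.full_like(y_test, np.median(y_train)))
print(mse < mse_baseline, mae < mae_baseline)  # a useful model should beat both
```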
What is baseline for misclassification rate?
A majority classifier that always predicts the most common class of the training set
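A majority-class baseline sketch with scikit-learn's DummyClassifier (the data set is a placeholder):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print(baseline.score(X_te, y_te))  # accuracy any real classifier should beat
```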
What is cost matrix?
The cost of a misclassification can be different for different classes. For example:
- a tea cup incorrectly predicted as broken is simply produced again -> cost c1
- mailing a broken cup to a customer leads to mailing a new one, reputation costs etc. -> cost c2
If c1 is a lot cheaper, it is better to predict “broken” too often than “not broken” too often
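A cost-sensitive evaluation sketch; the confusion counts and the costs c1, c2 below are made-up numbers, not from the lecture:

```python
import numpy as np

# Rows = true class (not broken, broken), columns = predicted class
confusion = np.array([[90, 5],
                      [2, 3]])
cost = np.array([[0.0, 1.0],    # intact cup predicted broken -> c1 = 1 (produce again)
                 [20.0, 0.0]])  # broken cup predicted intact -> c2 = 20 (mailing, reputation)

total_cost = np.sum(confusion * cost)  # 5*1 + 2*20 = 45
print(total_cost)
```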
What is confusion matrix?
Classification results summarized in a matrix where the rows are the true classes and the columns are the predicted classes
What does a binary confusion matrix look like?
With the same convention as above (rows = true classes, columns = predicted classes):
TP FN
FP TN
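A confusion matrix sketch with scikit-learn on toy labels; note that sklearn orders the classes as [0, 1], so the negative class comes first:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# Rows are true classes, columns are predicted classes:
# [[TN FP]
#  [FN TP]]
```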
How is the precision calculated? What does it tell us?
Precision = TP / (TP+FP)
Tells the proportion of truly relevant (positive) instances among those predicted positive; can be maximized by classifying only the instances we are most certain are positive -> bad recall
How is recall calculated and what does it tell us?
Recall = TP / (TP + FN)
Maximized if there are no false negatives, so it is easy to maximize: classify everything as positive.
Tells us the proportion of relevant (truly positive) instances that were found
How is F-score calculated?
2 * (Precision * Recall) / (Precision + Recall)
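A precision/recall/F-score sketch with scikit-learn, reusing the toy labels above:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred),  # TP / (TP + FP) = 3/4
      recall_score(y_true, y_pred),     # TP / (TP + FN) = 3/4
      f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```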
What is TPR?
True Positive Rate (recall)
TPR = TP / (TP + FN)
Proportion of positives correctly identified
What is FPR?
False Positive Rate
FPR = FP / (FP + TN)
Proportion of negatives incorrectly identified
What is the ROC curve?
Receiver operating characteristic curve. Plots the true positive rate against the false positive rate as the classification threshold is varied
What is AUC?
The area under the ROC curve. It is an indicator of how well the classifier solves the problem: the larger the area, the better the solution.
What values AUC has?
Between 0 and 1. 1 = perfect ranking: all instances of the positive class receive higher predicted values than those of the negative class. 0.5 is the same as random guessing, so AUC should be better than 0.5
What does it mean that AUC is invariant to relative class distributions?
AUC is (approximately) the same for an imbalanced data set with 90-10 positives/negatives as for a balanced 50-50 data set
What is a classical mistake with AUC?
Computing the curve from discrete 0/1 predictions instead of the classifier’s continuous scores, which give TPR and FPR values such as 0.6 and 0.4
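A ROC/AUC sketch with scikit-learn, illustrating that the curve is computed from continuous scores rather than 0/1 predictions (the labels and scores are made up):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # e.g. predicted probabilities of the positive class

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, scores))              # AUC, between 0 and 1
```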