Lecture 7 - Model Selection, Evaluation Flashcards
What are the two goals of evaluating the prediction performance?
- Model selection: almost all methods have parameters that need to be chosen
- Model evaluation: After selecting your hypothesis, estimate how well it generalizes to new data.
What does the loss function tell?
The price of predicting y’ when the true value is y
What is misclassification rate?
The average zero-one loss over the data set, i.e. the fraction of incorrect predictions: 10% wrong predictions => misclassification rate 0.1
What is classification accuracy?
The fraction of correct instances, aka 1 - misclassification rate
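A minimal sketch of both metrics with NumPy, using made-up toy labels:

```python
import numpy as np

# Toy labels (not from the lecture), just to show the computation
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

misclassification_rate = np.mean(y_true != y_pred)  # average zero-one loss
accuracy = 1.0 - misclassification_rate             # fraction of correct predictions
print(misclassification_rate, accuracy)             # 0.2 0.8
```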
What is the generalization error?
Expected loss over the probability distribution of our data, aka the loss on average for random new instance
What is the standard approach for generalization error?
Having separate training and test data sets
What are some estimators of generalization error?
- Training set error
- Test set error
- CV error
- Bootstrap error
What is bias?
Expected difference between estimated and true error, ideally close to zero
What is optimistic/pessimistic bias?
Optimistic bias: systematically estimates error to be smaller than it is
Pessimistic bias: systematically estimates error to be larger than it is
What is variance?
How large the magnitudes of the differences between estimated and true error tend to be on average, i.e. how much the estimate fluctuates.
Can method with zero bias have large variance?
Yes: positive and negative estimation errors can cancel out on average, so the bias is zero even though individual estimates vary a lot
What is training set error (resubstitution error)
The average loss on the training examples.
Has a strong optimistic bias, since the model was chosen because it fits this particular data set well
What is testing set error (holdout estimate)?
Splitting the data into training and test sets, then computing the average loss on the test set.
Is test set error unbiased?
Only if we evaluate a single fixed hypothesis; trying multiple hyperparameter values etc. and picking the best one leads to optimistic bias
What is the solution to avoid optimistic bias for test set?
Splitting the data into training, validation and test sets
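A sketch of a train-validation-test split, assuming scikit-learn and synthetic data (the 60/20/20 proportions are just an example):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 20% as the final test set, then split the rest into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
# -> 60% training, 20% validation, 20% test
```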
What is stratification?
Splitting the data so that the class distribution in each subset is (approximately) the same as in the full data set.
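A stratified-split sketch with scikit-learn, assuming imbalanced synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data with a 90/10 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Class proportions in the test set roughly match the full data
print(np.bincount(y) / len(y), np.bincount(y_te) / len(y_te))
```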
What is Cross-Validation and why is it used?
It is used because a too small test set can give unreliable estimates. We want as much data as possible for both training and testing, so CV creates “multiple” training and test sets from the same data.
What is leave-one-out cross-validation?
With a data set of n instances, we use the i-th instance as the test set and the remaining n-1 instances as the training set, then repeat this for all n instances and average the losses.
Can the leave-one-out CV overfit?
No, because test instances are not part of the training set
What is the benefit of leave-one-out CV?
Allows us to use the whole data set for both training and testing simultaneously.
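A leave-one-out sketch with scikit-learn; the classifier and data set are placeholders, not the lecture's:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# One prediction per instance, each made by a model trained on the other n-1 instances
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print(scores.mean())  # LOO estimate of accuracy
```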
What is K-fold CV?
Same idea as leave-one-out CV, but instead of leaving out a single instance, one fold is used for testing and the remaining K-1 folds for training.
What is the K in K-fold cross validation?
It is the number of folds the data is split into, NOT how many data points there are in one fold.
So for example, with 100 data points and k=5, each test fold contains 20 points.
What is the benefit of K-fold CV?
Much faster than leave-one-out.
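A 5-fold CV sketch with scikit-learn (model and data are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# Each fold is used once for testing and K-1 = 4 times for training
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
print(scores, scores.mean())
```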
What is the disadvantage of K-fold CV?
Can have a small pessimistic bias, because each evaluated model is trained on only part of the data and can be slightly less accurate than the final model trained on all data
What’s the benefit of M-times repeated K-fold CV?
Some of the variance in K-fold CV can be reduced by repeating the CV M times with different randomly sampled folds and averaging the results.
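A repeated K-fold sketch, assuming scikit-learn's RepeatedKFold and a placeholder regression task:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
# M = 10 repetitions of 5-fold CV, each with a different random fold split
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="neg_mean_squared_error")
print(-scores.mean())  # error averaged over 5 x 10 = 50 train/test splits
```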
Can we do model selection with CV?
Yes: using the training folds we can choose hyperparameter values, feature sets etc., and then test the final model on independent test data. This way a separate validation set is not needed.
What is nested CV?
CV within CV. In the outer loop the data is split into K folds and one fold at a time is held out as test data; an inner CV on the remaining folds is used to compare different parameter values, and the selected model is then evaluated on the held-out test fold.
Computationally very expensive
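A nested CV sketch, assuming scikit-learn; the estimator and parameter grid are made up for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # selects hyperparameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates generalization error

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(nested_scores.mean())
```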
What are some challenges in performance evaluation?
The data isn’t always iid (independent, identically distributed).
What’s the issue with grouped data?
With grouped data it is easy to make predictions if part of a group is in the training set and part in the test set. The solution is to randomize the split at the whole-group level, so that all data of a group is either in the test set or in the training set
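A group-aware split sketch with scikit-learn's GroupKFold (toy groups assumed):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # e.g. one group per patient

for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # No group appears in both the training and the test set
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```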
Why is time series data problematic?
A random split is not suitable: it is easy to predict a stock price if we know its value on the previous and the next day of the predicted date, so the split must respect the time order.
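A time-ordered split sketch with scikit-learn's TimeSeriesSplit:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # ten time steps
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    # The training indices always come before the test indices in time
    print("train:", train_idx, "test:", test_idx)
```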
What is Mean Squared Error? What is its baseline?
(y’ - y)^2.
Zero if correct
Sensitive to outliers
Baseline: the MSE should be smaller than that of always predicting the mean of the training set y values
What is mean absolute error? What is its baseline?
|y’ - y|
Zero if correct, else linearly growing
Not as sensitive to outliers
Baseline: the MAE should be smaller than that of always predicting the median of the training set y values
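A sketch of both losses and their baselines, assuming NumPy/scikit-learn and made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_train = np.array([1.0, 2.0, 2.5, 4.0, 10.0])
y_test = np.array([1.5, 3.0, 2.0])
y_pred = np.array([1.4, 2.7, 2.2])  # hypothetical model predictions

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

# Baselines: always predict the training mean (MSE) or the training median (MAE)
mse_baseline = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))
mae_baseline = mean_absolute_error(y_test, np.full_like(y_test, np.median(y_train)))
print(mse < mse_baseline, mae < mae_baseline)  # a useful model should beat both
```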
What is baseline for misclassification rate?
A majority classifier that always predicts the most common class of the training set
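A majority-class baseline sketch with scikit-learn's DummyClassifier (the data set is a placeholder):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print(baseline.score(X_te, y_te))  # accuracy any real classifier should beat
```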
What is cost matrix?
The cost of a misclassification can be different for different classes. For example:
- a tea cup incorrectly predicted as broken is simply produced again -> cost c1
- mailing a broken cup to a customer leads to mailing a new one, reputation costs etc. -> cost c2
If c1 is a lot cheaper, it is better to predict “broken” too often than “not broken” too often
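A cost-sensitive evaluation sketch; the confusion counts and the costs c1, c2 below are made-up numbers, not from the lecture:

```python
import numpy as np

# Rows = true class (not broken, broken), columns = predicted class
confusion = np.array([[90, 5],
                      [2, 3]])
cost = np.array([[0.0, 1.0],    # intact cup predicted broken -> c1 = 1 (produce again)
                 [20.0, 0.0]])  # broken cup predicted intact -> c2 = 20 (mailing, reputation)

total_cost = np.sum(confusion * cost)  # 5*1 + 2*20 = 45
print(total_cost)
```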
What is confusion matrix?
Classification results summarized in a matrix where the rows are the true classes and the columns are the predicted classes
What does a binary confusion matrix look like?
With the same convention as above (rows = true classes, columns = predicted classes):
TP FN
FP TN
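A confusion matrix sketch with scikit-learn on toy labels; note that sklearn orders the classes as [0, 1], so the negative class comes first:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# Rows are true classes, columns are predicted classes:
# [[TN FP]
#  [FN TP]]
```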
How is the precision calculated? What does it tell us?
Precision = TP / (TP+FP)
Tells the proportion of truly relevant (positive) instances among those predicted positive; can be maximized by classifying only the instances we are most certain are positive -> bad recall
How is recall calculated and what does it tell us?
Recall = TP / (TP + FN)
Maximized if there are no false negatives, so it is easy to maximize: classify everything as positive.
Tells us the proportion of relevant (truly positive) instances that were found
How is F-score calculated?
2 * (Precision * Recall) / (Precision + Recall)
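A precision/recall/F-score sketch with scikit-learn, reusing the toy labels above:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred),  # TP / (TP + FP) = 3/4
      recall_score(y_true, y_pred),     # TP / (TP + FN) = 3/4
      f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```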
What is TPR?
True Positive Rate (recall)
TPR = TP / (TP + FN)
Proportion of positives correctly identified
What is FPR?
False Positive Rate
FPR = FP / (FP + TN)
Proportion of negatives incorrectly identified
What is the ROC curve?
Receiver operating characteristic curve. Plots the true positive rate against the false positive rate as the classification threshold is varied
What is AUC?
The area under the ROC curve. It is an indicator of how well the classifier solves the problem: the larger the area, the better the solution.
What values AUC has?
Between 0 and 1. 1 = perfect ranking: all instances of the positive class receive higher predicted values than those of the negative class. 0.5 is the same as random guessing, so AUC should be better than 0.5
What does it mean that AUC is invariant to relative class distributions?
AUC is (approximately) the same for an imbalanced data set with 90-10 positives/negatives as for a balanced 50-50 data set
What is a classical mistake with AUC?
Computing the curve from discrete 0/1 predictions instead of the classifier’s continuous scores, which give TPR and FPR values such as 0.6 and 0.4
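A ROC/AUC sketch with scikit-learn, illustrating that the curve is computed from continuous scores rather than 0/1 predictions (the labels and scores are made up):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # e.g. predicted probabilities of the positive class

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, scores))              # AUC, between 0 and 1
```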