Section 3 Evaluation of a classifier Flashcards
Name 3 measures to evaluate how a linear regression model performs
Root mean squared error
Mean absolute error
R^2
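These three measures can be computed directly from the residuals. A minimal sketch, using made-up target values and predictions for illustration:

```python
import math

# Hypothetical targets and model predictions, for illustration only.
y = [3.0, 5.0, 2.5, 7.0]
y_hat = [2.8, 5.4, 2.0, 6.5]

n = len(y)
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

# Root mean squared error: penalises large errors more heavily.
rmse = math.sqrt(sum(r ** 2 for r in residuals) / n)

# Mean absolute error: average magnitude of the errors.
mae = sum(abs(r) for r in residuals) / n

# R^2: proportion of the variance in y explained by the model.
y_bar = sum(y) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot

print(rmse, mae, r2)
```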
What is the MAP rule
A popular choice of the threshold τ ∈ [0, 1] is 0.5, the “maximum a posteriori (MAP) rule”, which assigns each observation to its most likely class, i.e. the class with the largest predicted probability.
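As a sketch, the MAP rule amounts to thresholding the predicted probabilities at τ = 0.5 (the probabilities below are made up for illustration):

```python
# MAP rule: assign class 1 whenever the predicted probability of
# class 1 is at least tau = 0.5 (hypothetical probabilities).
probs = [0.1, 0.45, 0.5, 0.8, 0.95]

tau = 0.5
y_hat = [1 if p >= tau else 0 for p in probs]
print(y_hat)
```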
When might the MAP rule not be a correct choice to make?
This might not be a sensible choice, as it assumes the two classes are balanced.
It is also not sensible if different misclassification costs are associated with false positives and false negatives. Ex: medical applications, where the category of interest is rare.
How to check balance of data
First thing you should do with data is have a look at the proportions of the sample.
If the data are imbalanced (skewed), then using the MAP rule will produce a skewed distribution of predicted classes.
Explain threshold classifier evaluation
If τ = 0, then every observation is predicted as y_hat_i = 1.
If τ = 1, then every observation is predicted as y_hat_i = 0.
Explain meaning of TN, TP,FN,FP
TN – True negatives, i.e. number of 0s correctly classified as 0.
TP – True positives, i.e. number of 1s correctly classified as 1.
FN – False negatives, i.e. number of 1s wrongly classified as 0.
FP – False positives, i.e. number of 0s wrongly classified as 1.
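The four counts can be tallied directly from true labels and predictions; a sketch with toy labels:

```python
# Counting TN, TP, FN, FP from true labels and predictions
# (toy labels, for illustration).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
print(tn, tp, fn, fp)
```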
What is sensitivity
Sensitivity/recall focuses on the positive cases, assessing P(y_hat=1|y=1): of those truly positive, how many are classified as positive.
What is recall?
Sensitivity/recall focuses on the positive cases, assessing P(y_hat=1|y=1): of those truly positive, how many are classified as positive. Also, the false negative rate is 1 − sensitivity.
What is accuracy
Accuracy would be assessing: P(y_hat=y)
What is specificity
Specificity focuses on the negative cases, assessing P(y_hat=0|y=0): of those truly negative, how many are classified as negative. Also, the false positive rate is 1 − specificity.
What is precision
Precision would be assessing: P(y=1|y_hat=1).
Of those classified as positive, how many are truly positive.
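The metrics above all follow from the four confusion-matrix counts; a sketch with hypothetical counts:

```python
# Sensitivity, specificity, precision and accuracy from the four
# confusion-matrix counts (toy counts, for illustration).
tn, tp, fn, fp = 50, 30, 10, 10

sensitivity = tp / (tp + fn)   # P(y_hat=1 | y=1), a.k.a. recall
specificity = tn / (tn + fp)   # P(y_hat=0 | y=0)
precision = tp / (tp + fp)     # P(y=1 | y_hat=1)
accuracy = (tp + tn) / (tp + tn + fp + fn)  # P(y_hat = y)
fpr = 1 - specificity          # false positive rate

print(sensitivity, specificity, precision, accuracy, fpr)
```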
Describe the relationship of precision vs recall
There is a trade-off between the two: as the threshold increases, recall typically decreases while precision increases (and vice versa).
If positive class is rare what metrics would we focus on?
Important case is when a positive class (for example) is rare, we want to focus more on the positive class being predicted well. In this scenario we would want to have a good balance between precision and recall.
Express recall in terms of precision
Recall = (Precision × prevalence from the model) / (prevalence from the data), where the prevalence from the model is the proportion of observations predicted positive and the prevalence from the data is the proportion truly positive.
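This identity can be verified numerically; a sketch with toy labels:

```python
# Numerical check of: recall = precision * prevalence(model) / prevalence(data),
# where prevalence(model) is the proportion predicted positive and
# prevalence(data) the proportion truly positive (toy labels).
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

n = len(y_true)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = tp / sum(y_true)
precision = tp / sum(y_pred)
prev_model = sum(y_pred) / n
prev_data = sum(y_true) / n

print(abs(recall - precision * prev_model / prev_data) < 1e-12)
```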
What is the ROC curve
The receiver operating characteristic (ROC) curve plots sensitivity (true positive rate) versus 1 − specificity (false positive rate). The curve illustrates the diagnostic ability of a binary classifier as the discrimination threshold τ is varied.
How can we evaluate a classifier by the ROC curve
The AU-ROC answers: how often will a randomly chosen true 1 receive a higher predicted probability of being a 1 than a randomly chosen true 0?
A perfect classifier would have AU-ROC = 1, with the ROC curve pushed to the top left corner. This would imply large sensitivity and large specificity.
A classifier not better than random guessing would have AU-ROC = 0.5
A common way of choosing τ in relation to the ROC curve is to maximise the sum of sensitivity and specificity, balancing true positives and true negatives.
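Both ideas can be sketched directly: the AU-ROC as the pairwise-comparison probability described above, and the threshold chosen to maximise sensitivity + specificity. The scores and labels below are made up for illustration:

```python
# AU-ROC as the probability that a randomly chosen positive receives a
# higher score than a randomly chosen negative (ties count 1/2), plus a
# threshold choice maximising sensitivity + specificity (toy scores).
scores = [0.1, 0.3, 0.35, 0.8, 0.7, 0.9]
labels = [0,   0,   1,    1,   0,   1]

pos = [s for s, l in zip(scores, labels) if l == 1]
neg = [s for s, l in zip(scores, labels) if l == 0]

# Pairwise-comparison estimate of the area under the ROC curve.
wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
auc = wins / (len(pos) * len(neg))

# Choose tau maximising sensitivity + specificity over candidate thresholds.
def sens_plus_spec(tau):
    tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= tau)
    tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s < tau)
    return tp / len(pos) + tn / len(neg)

best_tau = max(scores, key=sens_plus_spec)
print(auc, best_tau)
```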
When does ROC curve not work
Sensitivity and specificity, in conjunction with the ROC curve, work well in mildly imbalanced situations.
However, because sensitivity and specificity are equally weighted, these metrics and ROC curve can be misleading in some very imbalanced applications and provide an optimistic view.
What is the PR curve
The precision/recall – PR curve plots precision versus recall, as a function of the classification threshold τ.
The area under the PR curve - AU-PR is related to the average precision for varying threshold τ.
The larger the area under the curve, the better is the ability of the classifier at correctly identifying the positive(rare) class.
How to evaluate a classifier based on the PR curve
The larger the area under the curve, the better is the ability of the classifier at correctly identifying the positive(rare) class.
A good classifier will have both a high precision and high recall, and the PR curve pushed to the top right corner.
Precision is lower-bounded by the prevalence of the positive class; a classifier whose precision equals this bound is no better than random guessing.
What is the F1 score
A quantity to quantify and compare a classifier’s ability to predict positive cases is the F1 score. This is the harmonic mean of precision and recall: F1 = 2 · precision · recall / (precision + recall).
Interpretation: Model’s balanced ability to both detect positive cases (recall) and be accurate with the cases it detects (precision).
The score satisfies 0 ≤ F1 ≤ 1, with F1 = 1 denoting perfection.
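A minimal sketch of the harmonic-mean computation, with made-up precision and recall values:

```python
# F1 score: harmonic mean of precision and recall (toy values).
precision, recall = 0.6, 0.75

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.6667
```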
Summarise why we use ROC / sensitivity + specificity
Mild to no imbalance
“Cost” of false positive vs false negative
Interest also in the negative (0) cases
Summarise why we use Precision, recall and F1
Positive cases are rare
Want to maintain low false positive rate
Does not consider (True) negative cases
Summarise the reasons for using accuracy measure
Easy to interpret and use
Popular
Not always appropriate
What measure is used typically for multinomial logistic regression
Typically, simple accuracy is used, but possibly also:
Class-specific sensitivity, i.e. the proportion of correctly classified observations for class k.
Class-specific false positive rate, i.e. the proportion of instances incorrectly assigned to class k.
Why do we not make model very complex to fit perfectly?
In theory, a model can be made arbitrarily complex, as to perfectly fit the data it was estimated on.
High complexity will fit the model perfectly to the data but will generalise poorly, while low complexity may not fit the data well.
Although many training metrics improve as complexity increases, the variance of the predictions will increase. The variance of predictions is linked to whether you can trust your model out of sample, in reality.
What happens to variance when overfitting occurs
Although many training metrics improve as complexity increases, the variance of the predictions will increase. The variance of predictions is linked to whether you can trust your model out of sample, in reality.
Explain why we use test data
The data used to learn the model provide an optimistic view of the predictive performance, since the model can be made arbitrarily complex and the same data would be used both to learn the model and to assess it.
The data used to fit the model must therefore not be used to evaluate predictive performance; they can only be used to assess goodness-of-fit and the quality of the model.
What does test data look like - define it
Data points not used as target cases in the fitting procedure. The test data set is used to estimate the generalisation error of the fitted model. Hence, these data points are used to test the fitted model.
Test data should be either external data with similar characteristics to data used for fitting or should be out of sample and split off from the training data.
Define training data
Training set: Data points whose target variable values are used in the model fitting procedure, that is are used to learn the parameters of the model. These data points are employed to train the model.
What is a key assumption about test and training data
Crucial assumption is that both training and test data are generated by the same data generating process.
What is data leakage and how do we avoid it
Data leakage is contaminating the test data with training data information.
When we split the data into training and test sets, standardisation is carried out after the split (rather than standardising the full dataset beforehand), so that test-set information does not contaminate training.
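One common convention for avoiding leakage is to estimate the standardisation parameters on the training split only and apply them to the test split; a sketch with toy one-feature data:

```python
import statistics

# Avoiding leakage when standardising: estimate mean/sd on the training
# split only, never on the full dataset before splitting. (A common
# convention; toy one-feature data for illustration.)
data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
train, test = data[:4], data[4:]

mu = statistics.mean(train)
sd = statistics.stdev(train)

train_std = [(x - mu) / sd for x in train]
test_std = [(x - mu) / sd for x in test]   # test uses training statistics
print(train_std, test_std)
```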
In reality how do the loss functions for test and training data compare? Similarly for predictive performance measures
The estimated out-of-sample loss is never smaller than the estimated training loss. Similarly, (in expectation) the estimated out-of-sample predictive performance is not greater than the estimated training performance.
Define the validation set
Validation set: Data points used to evaluate and compare the models to see which is best. Each model in turn is evaluated on these data. Surrogate of the test data
What is the summary of steps for evaluating models
Split the available data into: Training data – Validation data – Test data.
Model training: Use the training data to estimate the model parameters.
Model validation: Use the validation data to perform model selection.
Find the model which maximises the validation performance.
Model testing: Use the test data to assess the predictive performance of the model and the ability to generalise to unseen inputs in real world problems.
Estimate the generalisation predictive performance.
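The steps above can be sketched end to end. Here the two candidate “models” (predict the training mean vs. the training median) are stand-ins for real candidates, and the data are simulated:

```python
import random
import statistics

# Sketch of the train/validate/test workflow with simulated data.
random.seed(1)
data = [random.gauss(10, 2) for _ in range(200)]

train = data[:100]          # 50% training
val = data[100:150]         # 25% validation
test = data[150:]           # 25% test

# Model training: "fit" two toy candidate models on the training data.
candidates = {
    "mean": statistics.mean(train),
    "median": statistics.median(train),
}

def mse(pred, sample):
    return sum((x - pred) ** 2 for x in sample) / len(sample)

# Model validation: pick the candidate with the best validation performance.
best = min(candidates, key=lambda m: mse(candidates[m], val))

# Model testing: estimate the generalisation error on held-out test data.
test_error = mse(candidates[best], test)
print(best, test_error)
```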
What are resampling methods
Resampling methods involve repeatedly drawing random samples from a dataset and refitting and testing a model of interest on each sample in order to obtain additional information about a model.
Resampling is replicated a number of times to account for the sampling variability of the process
Explain cross validation
Cross validation is a class of resampling methods that estimate the performance by holding out a subset of the training observations from the fitting process.
The model is trained on the “kept” observations (in-sample).
The model is applied to those held out observations (out-of-sample) to evaluate the predictive performance.
The process is replicated a number of times to account for the sampling variability
The estimate of the predictive performance is the average predictive performance computed across the replications.
Explain hold out sample cross validation
The data are randomly split into training and hold-out samples.
The model is fitted on the training set, and then the predictive performance of the model is evaluated on the hold-out sample.
In the case of a single model the hold-out sample corresponds to the test set.
In the case of multiple models, the hold-out sample is split into validation and test set.
What are common splits of data for hold out sample cross validation
50% training, 25% validation, 25% test.
Why do we recombine the validation data into training data after model selection?
We combine the training data back with the validation data and retrain the selected model, since more data is always better. The test data are kept completely separate.
Explain leave-one-out cross validation
The idea is that only one observation in turn is used for validation, while the remaining ones are employed for training; the data are split in a deterministic way.
The cross-validation estimate of the performance is the average performance over the N iterations
This method requires training N models
Explain k-fold cross validation
Where leave-one-out can be computationally expensive and time-consuming, this may be a better approach: the k-fold approach can save time and computational resources.
In k-fold cross validation, chunks (folds) of data are used to evaluate the performance of a model instead of single observations.
The cross-validation estimate of the performance is the average performance over the K folds
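A minimal k-fold sketch, where the toy “model” simply predicts the mean of the training folds:

```python
# K-fold cross-validation sketch: split the data into K folds, hold each
# fold out in turn, and average the out-of-fold performance. The toy
# "model" predicts the training-fold mean.
data = [float(i) for i in range(20)]
K = 5
fold_size = len(data) // K

fold_errors = []
for k in range(K):
    held_out = data[k * fold_size:(k + 1) * fold_size]
    train = data[:k * fold_size] + data[(k + 1) * fold_size:]
    pred = sum(train) / len(train)          # "fit" on the training folds
    mse = sum((x - pred) ** 2 for x in held_out) / len(held_out)
    fold_errors.append(mse)

cv_estimate = sum(fold_errors) / K          # average over the K folds
print(cv_estimate)
```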
how does the size of the data determine which cross validation to use
If a large number of observations is available, usually a simple hold-out procedure works fine; with smaller samples, leave-one-out or k-fold should be used.
What are the downsides of each cross validation procedure
The simple hold-out procedure can overestimate the generalisation error.
Leave-one-out overcomes this issue, as prediction is evaluated on a single held-out observation (in turn), so it tends not to overestimate the error. The estimated error is also less variable, since it does not depend on a random choice of subsets, but the procedure is computationally intensive.
K-fold cross-validation reduces the computational complexity of leave-one-out, trading it for some bias in the estimated error.