Section 3 Evaluation of a classifier Flashcards

1
Q

Name 3 measures to evaluate how a linear regression model performs

A

Root mean squared error
Mean absolute error
R^2
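
As a minimal sketch (numpy assumed available; y and y_hat are hypothetical arrays of observed and predicted values), these three measures can be computed as:

import numpy as np

y = np.array([3.0, 5.0, 7.5, 10.0])       # hypothetical observed values
y_hat = np.array([2.8, 5.4, 7.0, 10.3])   # hypothetical predictions

rmse = np.sqrt(np.mean((y - y_hat) ** 2))                        # root mean squared error
mae = np.mean(np.abs(y - y_hat))                                 # mean absolute error
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)  # R^2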

2
Q

What is the MAP rule

A

The threshold τ can take any value in the continuous set [0, 1]; a popular choice is τ = 0.5, the “maximum a posteriori (MAP) rule”, which assigns each observation to the most likely class, i.e. the class with the largest estimated probability.
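
As a minimal sketch (p is a hypothetical array of estimated probabilities P(y = 1 | x)), the MAP rule is just thresholding at τ = 0.5:

import numpy as np

p = np.array([0.2, 0.7, 0.55, 0.4])   # hypothetical estimated probabilities of class 1
tau = 0.5                             # MAP rule threshold
y_hat = (p >= tau).astype(int)        # assign the most likely class: array([0, 1, 1, 0])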

3
Q

When might the MAP rule not be a correct choice to make?

A

The MAP rule implicitly assumes the two classes are balanced, so it may not be a sensible choice when they are not.
It is also not a sensible choice if different misclassification costs are associated with false positives and false negatives, e.g. medical applications, where the category of interest is rare.

4
Q

How to check balance of data

A

The first thing to do with the data is to look at the class proportions in the sample.
If the data are imbalanced (skewed), then applying the MAP rule will produce predictions skewed towards the majority class.

5
Q

Explain threshold classifier evaluation

A

If τ = 0, then every observation is predicted as y_hat_i = 1.
If τ = 1, then every observation is predicted as y_hat_i = 0.

6
Q

Explain the meaning of TN, TP, FN, FP

A

TN – True negatives, i.e. the number of 0s correctly classified as 0.
TP – True positives, i.e. the number of 1s correctly classified as 1.
FN – False negatives, i.e. the number of 1s wrongly classified as 0.
FP – False positives, i.e. the number of 0s wrongly classified as 1.
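
A minimal sketch of these counts, assuming y and y_hat are hypothetical 0/1 arrays of true and predicted labels:

import numpy as np

y = np.array([0, 0, 1, 1, 1, 0])       # hypothetical true labels
y_hat = np.array([0, 1, 1, 0, 1, 0])   # hypothetical predicted labels

TN = np.sum((y == 0) & (y_hat == 0))   # 0s correctly classified as 0
TP = np.sum((y == 1) & (y_hat == 1))   # 1s correctly classified as 1
FN = np.sum((y == 1) & (y_hat == 0))   # 1s wrongly classified as 0
FP = np.sum((y == 0) & (y_hat == 1))   # 0s wrongly classified as 1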

7
Q

What is sensitivity

A

Sensitivity (recall) focuses on the positive cases, assessing P(y_hat = 1 | y = 1): of those that are truly positive, how many are classified as positive.
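In terms of the confusion matrix counts: Sensitivity = TP / (TP + FN).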

8
Q

What is recall?

A

Recall is the same quantity as sensitivity: it focuses on the positive cases, assessing P(y_hat = 1 | y = 1): of those that are truly positive, how many are classified as positive. The false negative rate is 1 - recall.

9
Q

What is accuracy

A

Accuracy assesses P(y_hat = y): the overall proportion of observations that are classified correctly.
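In terms of the confusion matrix counts: Accuracy = (TP + TN) / (TP + TN + FP + FN).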

10
Q

What is specificity

A

Specificity focuses on the negative cases, assessing P(y_hat = 0 | y = 0): of those that are truly negative, how many are classified as negative. The false positive rate is 1 - specificity.
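In terms of the confusion matrix counts: Specificity = TN / (TN + FP), so the false positive rate is FP / (FP + TN) = 1 - Specificity.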

11
Q

What is precision

A

Precision assesses P(y = 1 | y_hat = 1): of those classified as positive, how many are truly positive.
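In terms of the confusion matrix counts: Precision = TP / (TP + FP).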

12
Q

Describe the relationship of precision vs recall

A

Precision and recall are in tension: as the threshold τ increases, recall typically decreases while precision typically increases, and vice versa.

13
Q

If positive class is rare what metrics would we focus on?

A

An important case is when the positive class is rare; then we want to focus on the positive class being predicted well. In this scenario we want a good balance between precision and recall.

14
Q

Express recall in terms of precision

A

Recall = (Precision × prevalence from the model) / (prevalence from the data)
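This identity can be checked with the confusion matrix counts: Precision = TP / (TP + FP), the prevalence from the model is (TP + FP) / N and the prevalence from the data is (TP + FN) / N, so (Precision × model prevalence) / (data prevalence) = (TP / N) / ((TP + FN) / N) = TP / (TP + FN) = Recall.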

15
Q

What is the ROC curve

A

The receiver operating characteristic (ROC) curve plots sensitivity (the true positive rate) against 1 - specificity (the false positive rate). The curve illustrates the diagnostic ability of a binary classifier as the discrimination threshold τ is varied.

16
Q

How can we evaluate a classifier by the ROC curve

A

The area under the ROC curve (AU-ROC) answers: how often will a randomly chosen true 1 receive a higher predicted probability of being a 1 than a randomly chosen true 0?
A perfect classifier would have AU-ROC = 1, with the ROC curve pushed to the top left corner. This would imply both large sensitivity and large specificity.
A classifier no better than random guessing would have AU-ROC = 0.5.
A common way of choosing τ in relation to the ROC curve is to maximise the sum of sensitivity and specificity, balancing true positives and true negatives.
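
A minimal sketch of the ROC curve and AU-ROC, assuming scikit-learn is available and that y and p are hypothetical arrays of true 0/1 labels and predicted probabilities:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # hypothetical true labels
p = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])   # hypothetical predicted probabilities

fpr, tpr, thresholds = roc_curve(y, p)   # false/true positive rate at each threshold
auc = roc_auc_score(y, p)                # area under the ROC curve

# One common choice of tau: maximise sensitivity + specificity, i.e. maximise tpr - fpr
best_tau = thresholds[np.argmax(tpr - fpr)]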

17
Q

When does ROC curve not work

A

Sensitivity and specificity, in conjunction with the ROC curve, work well in mildly imbalanced situations.
However, because sensitivity and specificity are equally weighted, these metrics and the ROC curve can be misleading in very imbalanced applications and provide an overly optimistic view.

18
Q

What is the PR curve

A

The precision/recall (PR) curve plots precision versus recall as a function of the classification threshold τ.
The area under the PR curve (AU-PR) is related to the average precision over varying thresholds τ.
The larger the area under the curve, the better the classifier's ability to correctly identify the positive (rare) class.

19
Q

How to evaluate a classifier based on the PR curve

A

The larger the area under the curve, the better the classifier's ability to correctly identify the positive (rare) class.
A good classifier will have both high precision and high recall, with the PR curve pushed to the top right corner.
Precision is lower bounded by the prevalence of the positive class, which corresponds to performance no better than random guessing.
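
A minimal sketch of the PR curve and its area, again assuming scikit-learn and hypothetical y (true 0/1 labels) and p (predicted probabilities):

import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # hypothetical true labels
p = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])   # hypothetical predicted probabilities

precision, recall, thresholds = precision_recall_curve(y, p)   # precision/recall over thresholds
au_pr = average_precision_score(y, p)                          # average precision, related to AU-PR
baseline = y.mean()   # prevalence of the positive class: the lower bound for precision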

20
Q

What is the F1 score

A

A single quantity used to summarise and compare a classifier's ability to predict positive cases is the F1 score, the harmonic mean of precision and recall.
Interpretation: the model's balanced ability both to detect positive cases (recall) and to be accurate with the cases it does detect (precision).
The score satisfies 0 ≤ F1 ≤ 1, with F1 = 1 denoting perfection.
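Explicitly: F1 = 2 × (Precision × Recall) / (Precision + Recall).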

21
Q

Summarise why we use ROC/ sensitivity+specificity

A

Mild to no class imbalance
“Cost” of false positive vs false negative
Interest also in the negative (0) cases

22
Q

Summarise why we use Precision, recall and F1

A

Positive cases are rare
Want to maintain low false positive rate
Does not consider (True) negative cases

23
Q

Summarise the reasons for using accuracy measure

A

Easy to interpret and use
Popular
Not always appropriate

24
Q

What measure is used typically for multinomial logistic regression

A

Typically simple accuracy is used but maybe also:
Class-specific sensitivity, i.e. the proportion of correctly classified observations for class k.
Class-specific false positive rate, i.e. the proportion of instances incorrectly assigned to class k.

25
Q

Why do we not make the model very complex so that it fits the data perfectly?

A

In theory, a model can be made arbitrarily complex, so as to fit perfectly the data it was estimated on.
High complexity will fit the model perfectly to the data but will generalise poorly, while low complexity may not fit the data well.
Although many error metrics will decrease as complexity increases, the variance of the predictions will increase. The variance of predictions is linked to whether you can trust your model out of sample, in reality.

26
Q

What happens to variance when overfitting occurs

A

Although many error metrics will decrease as complexity increases, the variance of the predictions will increase. The variance of predictions is linked to whether you can trust your model out of sample, in reality.

27
Q

Explain why we use test data

A

The data used to learn the model provide an optimistic view of the predictive performance, since the model can be made arbitrarily complex and the same data are used both to learn the model and to assess its predictive performance.
The data used to fit the model must not be used to evaluate its predictive performance; they can only be used to assess goodness-of-fit and the quality of the model.

28
Q

What does test data look like - define it

A

Data points not used as target cases in the fitting procedure. The test data set is used to estimate the generalisation error of the fitted model. Hence, these data points are used to test the fitted model.

Test data should be either external data with similar characteristics to data used for fitting or should be out of sample and split off from the training data.

29
Q

Define training data

A

Training set: data points whose target variable values are used in the model fitting procedure, that is, to learn the parameters of the model. These data points are employed to train the model.

30
Q

What is a key assumption about test and training data

A

The crucial assumption is that both the training and test data are generated by the same data generating process.

31
Q

What is data leakage and how do we avoid it

A

Data leakage occurs when information leaks between the training and test data, contaminating the evaluation (for example, computing preprocessing statistics on the full dataset before splitting).
When we split the data into training and test sets, we standardize the sets separately, after the split, to avoid data leakage.
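
A minimal sketch of splitting before standardizing, assuming scikit-learn; this particular variant scales the test set with training-set statistics (one common way to keep test-set information out of the fit):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.randn(100, 3)             # hypothetical feature matrix
y = np.random.randint(0, 2, size=100)   # hypothetical 0/1 labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

scaler = StandardScaler().fit(X_train)   # statistics computed on the training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # no full-dataset statistics, so nothing crosses the split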

32
Q

In reality how do the loss functions for test and training data compare? Similarly for predictive performance measures

A

In expectation, the estimated out-of-sample loss is not smaller than the estimated training loss.
Similarly, in expectation, the estimated out-of-sample predictive performance is not greater than the estimated training performance.

33
Q

Define the validation set

A

Validation set: data points used to evaluate and compare the models, to see which is best. Each model in turn is evaluated on these data. The validation set acts as a surrogate for the test data.

34
Q

What is the summary of steps for evaluating models

A

Split the available data into: training data – validation data – test data.
Model training: use the training data to estimate the model parameters.
Model validation: use the validation data to perform model selection, i.e. find the model that maximises validation performance.
Model testing: use the test data to assess the predictive performance of the selected model and its ability to generalise to unseen inputs in real-world problems, i.e. estimate the generalisation predictive performance.

35
Q

What are resampling methods

A

Resampling methods involve repeatedly drawing random samples from a dataset and refitting and testing a model of interest on each sample in order to obtain additional information about a model.
Resampling is replicated a number of times to account for the sampling variability of the process

36
Q

Explain cross validation

A

Cross validation is a class of resampling methods that estimate the performance by holding out a subset of the training observations from the fitting process.
The model is trained on the “kept” observations (in-sample).
The model is applied to those held out observations (out-of-sample) to evaluate the predictive performance.
The process is replicated a number of times to account for the sampling variability
The estimate of the predictive performance is the average predictive performance computed over the replications.

37
Q

Explain hold out sample cross validation

A

The data are randomly split into training and hold-out samples.
The model is fitted on the training set, and then the predictive performance of the model is evaluated on the hold-out sample.
In the case of a single model the hold-out sample corresponds to the test set.
In the case of multiple models, the hold-out sample is split into validation and test set.
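
A minimal sketch of a hold-out split with scikit-learn, assuming X and y are a hypothetical feature matrix and label vector; the hold-out sample is further split into validation and test sets (giving the common 50/25/25 split):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(200, 4)             # hypothetical features
y = np.random.randint(0, 2, size=200)   # hypothetical labels

# 50% training, 50% hold-out; the hold-out is then split evenly into validation and test
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.5, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=1)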

38
Q

What are common splits of data for hold out sample cross validation

A

50% training, 25% validation and 25% test

39
Q

Why do we recombine the validation data into training data after model selection?

A

After model selection, we combine the validation data back with the training data and retrain the chosen model: more data is always better. The test data are kept completely separate.

40
Q

Explain leave-one-out cross validation

A

The idea is that only one observation at a time is used for validation, while the remaining observations are employed for training; the data are split in a deterministic way.
The cross-validation estimate of the performance is the average performance over the N iterations
This method requires training N models
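
A minimal sketch of leave-one-out with scikit-learn and a hypothetical logistic regression classifier; N models are fitted, each evaluated on the single held-out observation:

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression

X = np.random.randn(30, 2)   # hypothetical features
y = np.tile([0, 1], 15)      # hypothetical balanced 0/1 labels

scores = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])   # train on N - 1 observations
    scores.append(model.score(X[test_idx], y[test_idx]))           # evaluate on the one held out
loo_accuracy = np.mean(scores)   # cross-validation estimate: average over the N iterations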

41
Q

Explain k-fold cross validation

A

Leave-one-out can be computationally expensive and time consuming, so k-fold cross-validation may be a better approach: it can save time and computational resources.
In k-fold cross-validation, chunks (folds) of data are used to evaluate the performance of a model instead of single observations.

The cross-validation estimate of the performance is the average performance over the K folds
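
A minimal sketch of K-fold cross-validation with scikit-learn (hypothetical data and classifier, K = 5); the estimate is the average performance over the K folds:

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.random.randn(100, 3)   # hypothetical features
y = np.tile([0, 1], 50)       # hypothetical balanced 0/1 labels

cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)   # one score per fold
kfold_accuracy = scores.mean()   # cross-validation estimate: average over the K folds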

42
Q

How does the size of the data determine which cross-validation to use

A

If a large number of observations is available, a simple hold-out procedure usually works fine; with smaller samples, leave-one-out or k-fold cross-validation should be used.

43
Q

What are the downsides of each cross validation procedure

A

The simple hold-out procedure can overestimate the generalization error.
Leave-one-out overcomes this issue, as prediction is evaluated on a single observation (in turn), so it tends not to overestimate the error. The estimated error is also less variable, since it is not affected by the randomness of the subsets, but the method is computationally intensive.
K-fold cross-validation reduces the computational complexity of leave-one-out, but trades this for some bias in the estimated error.