Chapter 8: Evaluation Flashcards

1
Q

What is the big idea?

A

The most important part of designing an evaluation experiment for a predictive model is ensuring that the data used to evaluate the model is not the same as the data used to train the model

2
Q

What is the purpose of evaluation?

A
  • To determine which model is the most suitable for a task
  • To estimate how each model will perform
  • To convince the users that the model will meet their needs
3
Q

What do the first and second purposes do?

A

Measure and compare the performance of a group of models to determine which model best performs the prediction task the models have been built to perform
- no free lunch: no single model type is best for every task, so candidate models must be compared empirically

4
Q

What do the second and third purposes do?

A

They have a strong link to deployment and convince users that new decisions made based on the chosen model will improve the current state of affairs

5
Q

What are some priorities for analytical models?

A
  • Medical: the model should never incorrectly predict that a sick patient is healthy (false negatives are extremely costly)
  • Financial: the model only needs to be slightly better than the current approach (the norm) to be valuable
6
Q

What is the primary evaluation metric that computer science focuses on?

A

Measuring model execution performance

7
Q

What are some issues to consider for a model to be successfully deployed?

A
  • How accurate it is
  • How accurate it remains despite drift in data
  • How quickly it makes predictions
  • How easy it is for human analysts to understand or explain the predictions made by the model
  • How much human experts can learn from the model’s actions
  • How easy it is to retrain the model if it goes stale over time
8
Q

What is the basic way to evaluate the effectiveness of a model?

A

Take a dataset we know the expected prediction for (test set) and present it to the trained model. Record the predictions the model makes and compare with the expected predictions. Use a performance measure to numerically capture how well the predictions match the expected ones.

9
Q

What is the training set used for?

A

Model construction (training); typically about 2/3 of the available data

10
Q

What is the test set used for?

A

Accuracy estimation (evaluation); typically about 1/3 of the available data

11
Q

What is misclassification rate?

A

number of incorrect predictions / total number of predictions

12
Q

What is the hold-out test set?

A
  • The simplest way to construct a test set from a dataset
  • It is created by randomly sampling a portion of the data in the ABT
13
Q

What is the benefit of using a hold-out test set?

A

It avoids peeking

14
Q

What is peeking?

A
  • Occurs when a model is evaluated on the same data used to train it
15
Q

Why is using the same data an issue?

A

Since the data was used in training, the model has already seen it so it will probably perform well when evaluated on that same data

16
Q

Why is evaluating with a test set better?

A
  • It is a better measure of how the model is likely to perform when actually deployed
  • Shows how well the model can generalize beyond the instances used to train it
17
Q

What does the misclassification rate show?

A
  • Its values fall in the range [0, 1]
  • Lower values indicate better performance
18
Q

What is a confusion matrix aka truth table?

A
  • A useful tool to capture what happened in an evaluation test in a little more detail
  • It is the basis for calculating many other performance measures
19
Q

How does a confusion matrix work?

A

It calculates the frequency of each possible outcome of predictions made by a model for a test dataset to show how the model is performing

20
Q

What are the possible outcomes for a prediction problem with a binary target feature?

A
  • True Positive (TP): an instance with a positive target feature value that was predicted to have a positive target feature value
  • True Negative (TN): a negative instance predicted to have a negative value
  • False Positive (FP): a negative instance predicted to have a positive value
  • False Negative (FN): a positive instance predicted to have a negative value
21
Q

What is the structure of a confusion matrix?

A

                Prediction
               +ve    -ve
Target   +ve   TP     FN
         -ve   FP     TN

Assume spam = +ve, ham = -ve
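
A minimal Python sketch (mine, not from the book) of how these four counts could be tallied from lists of actual and predicted labels; the example labels are illustrative:

# Hypothetical example data; "spam" is the positive level, "ham" the negative level.
actual    = ["spam", "ham", "spam", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "spam" and p == "spam")
tn = sum(1 for a, p in zip(actual, predicted) if a == "ham" and p == "ham")
fp = sum(1 for a, p in zip(actual, predicted) if a == "ham" and p == "spam")
fn = sum(1 for a, p in zip(actual, predicted) if a == "spam" and p == "ham")

print(tp, fn)  # row of the matrix for the positive target level -> 2 1
print(fp, tn)  # row of the matrix for the negative target level -> 1 2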

22
Q

What is the misclassification rate for truth tables?

A

(FP + FN) / (TP + TN + FP + FN)

23
Q

What is the classification accuracy for truth tables?

A

(TP + TN) / (TP + TN + FP + FN)
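
A small self-contained sketch (not from the book) computing both the misclassification rate and the classification accuracy from the four confusion matrix counts; the counts used are illustrative:

def misclassification_rate(tp, tn, fp, fn):
    # fraction of all predictions that were incorrect
    return (fp + fn) / (tp + tn + fp + fn)

def classification_accuracy(tp, tn, fp, fn):
    # fraction of all predictions that were correct
    return (tp + tn) / (tp + tn + fp + fn)

print(misclassification_rate(tp=2, tn=2, fp=1, fn=1))   # 0.333...
print(classification_accuracy(tp=2, tn=2, fp=1, fn=1))  # 0.666...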

24
Q

What is a common tension that arises with evaluation?

A

The tension between the need to fully understand the performance of the model and the need to reduce model performance to a single measure that can be used to rank models

25
Q

What is hold-out sampling?

A
  • randomly sampling the data into non-overlapping samples (e.g., a training set and a test set)
  • most appropriate for very large datasets from which we can take samples
  • sometimes extended to include a validation set
  • there are no fixed recommendations on how large the different partitions should be
26
Q

Why do we use a validation set?

A

Used when data outside the training set is required in order to tune particular aspects of a model
- for example, in the wrapper-based feature selection technique

27
Q

What is the most common use of a validation set?

A

Avoiding overfitting when using algorithms that iteratively build more and more complex models
- the ID3 algorithm for decision trees and the gradient descent algorithm are two examples of this approach

28
Q

How do we combat overfitting with validation sets?

A

Track performance on the validation set while training, and allow the algorithm to keep training past the point at which validation performance peaks, saving the model generated at each iteration. After training is done, find the iteration at which validation performance began to disimprove and revert to the model from that iteration.

29
Q

What are issues that arise when using hold-out sampling?

A
  • There may not be enough data to make suitably large training and test sets; this results in small partitions and a poor evaluation
  • We may make a "lucky split" that puts the difficult instances into the training set and the easy ones into the test set, which makes the model appear more accurate than it actually is
30
Q

What happens with k-fold cross validation?

A

The available data is divided into k equal-sized folds (k = 10 is most popular), and k separate evaluations are performed: in each one, a different fold is used as the test set while the remaining k - 1 folds are used for training, and the k results are then aggregated
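
A minimal sketch of 10-fold cross-validation; the use of scikit-learn, the decision tree model, and the iris dataset are illustrative assumptions, not prescribed by the text:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Each of the 10 folds serves once as the test set while the other 9 train the model.
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean())  # aggregate performance estimate across the 10 evaluations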

31
Q

What is leave-one-out?

A
  • also known as jackknifing
  • k-fold cross-validation where k = the number of tuples (instances); used for small datasets
  • each time, the single left-out tuple is used as the test set
  • the number of folds is the same as the number of instances in the dataset
  • each fold contains only one instance, and the training set contains the remainder (see the sketch below)
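
A minimal sketch of leave-one-out cross-validation under the same illustrative assumptions (scikit-learn, a k-nearest neighbor model, the iris dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# One fold per instance: each instance is used exactly once as the test set.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(scores.mean())  # fraction of instances classified correctly
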
32
Q

When is leave-one-out cross-validation useful?

A

when the amount of data available is too small to allow big enough training sets in a k-fold cross validation

33
Q

What is stratified cross-validation?

A

Folds are stratified so that the class distribution in each fold is approximately the same as it was in the original data
- this helps reduce the variance of the performance estimate

34
Q

How can cross-validation be improved?

A
  • repeated k-fold cross-validation (repeat the partitioning several times and take the average)
  • stratified cross-validation
35
Q

What is bootstrapping?

A
  • sample the data uniformly with replacement
  • methods: the ε0 bootstrap and the .632 bootstrap
36
Q

When is bootstrapping best used?

A

In contexts with very small datasets (fewer than 300 instances)

37
Q

How does ε0 bootstrapping work?

A
  • it iteratively performs multiple evaluation experiments using slightly different training sets each time to evaluate the expected performance of the model
  • to generate the partitions, it takes a random selection of m instances from the full dataset to generate a test set; the remaining instances are used as the training set
  • using the training set to train a model and the test set to evaluate it, a performance measure is calculated for the iteration
  • the process is repeated for k iterations
  • the overall performance measure is the average across the k iterations (see the sketch below)
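
A sketch of the iterated sampling-and-averaging procedure described above; the values of m and k, the model, and the dataset are illustrative assumptions:

import random

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
m, k = 30, 50
scores = []
for _ in range(k):
    # Randomly select m instances for the test set; the rest form the training set.
    test_idx = random.sample(range(len(X)), m)
    train_idx = [i for i in range(len(X)) if i not in set(test_idx)]

    model = DecisionTreeClassifier()
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy for this iteration

print(sum(scores) / k)  # overall estimate: the average across the k iterations
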
38
Q

What is out-of-time sampling?

A

A form of hold-out sampling in which the sampling is done in a targeted manner rather than randomly
- be careful to ensure that the times from which the training and test sets are taken do not introduce a bias into the evaluation process (which can happen if the two time samples are not really representative)
- it is important when choosing the periods for out-of-time sampling that the time spans are large enough to take into account any cyclical behavioral patterns, or that other approaches are used to account for these

39
Q

What are confusion matrix-based performance measures?

A

A convenient way to fully describe the performance of a predictive model when applied to a test set
- they are also the basis for a whole range of different performance measures that can highlight different aspects of the performance of a predictive model

40
Q

What are the basic measures?

A

true positive rate (TPR), true negative rate (TNR), false negative rate (FNR), and false positive rate (FPR)
- they convert the raw numbers from the confusion matrix into percentages

41
Q

What are the relationships between these measures?

A
  • FNR = 1 − TPR
  • FPR = 1 − TNR
42
Q

What are precision, recall and F1 measure?

A

Another frequently used set of performance measures that can be calculated directly from the confusion matrix

43
Q

What is recall?

A
  • TPR
  • tells us how confident we can be that all instances with the positive target level have been found by the model
44
Q

What is precision?

A
  • captures how often a prediction is correct when a model makes a positive prediction
  • tells us how confident we can be that an instance predicted to have the positive target level actually has the positive target level
45
Q

What is the range of precision and recall?

A
  • range [0, 1]
  • higher values in both cases indicate better model performance
46
Q

What is a single performance measure precision and recall can be collapsed into?

A

F1 measure
- it offers a useful alternative to the simpler misclassification rate

47
Q

What is F1 measure or F score?

A

it is the harmonic mean of precision and recall
- f1 = 2 * [(precision * recall) / (precision + recall)]
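
A tiny illustration (the precision and recall values are assumptions, not from the book):

def f1_measure(precision, recall):
    # harmonic mean of precision and recall
    return 2 * (precision * recall) / (precision + recall)

print(f1_measure(0.8, 0.6))  # 0.6857..., slightly below the arithmetic mean of 0.7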

48
Q

Why harmonic mean?

A

it is less sensitive to large outliers than the arithmetic mean so it does not get skewed by one of the measures being much better than the other

49
Q

Why is harmonic mean useful?

A

We prefer measures that highlight shortcomings in our models rather than hide them
- the F1 measure has the same [0, 1] range as precision and recall, with higher values indicating better performance

50
Q

Which problems do precision, recall, and F1 work best with?

A
  • prediction problems with binary target features
  • they place an emphasis on capturing the performance of a prediction model on the positive, or most important, level
51
Q

What issue is average class accuracy used to solve?

A

Imbalanced datasets, where one target level is much more frequent than the other

52
Q

Why is it preferred to use harmonic mean over arithmetic mean for average class accuracy?

A

The arithmetic mean is susceptible to large outliers, which inflate the apparent performance of a model
- the harmonic mean emphasizes the importance of smaller values and so gives a slightly more realistic measure of how well a model is performing (see the comparison below)
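
An illustrative comparison of the two means (the per-class recall values below are assumptions):

recalls = [0.95, 0.50]  # illustrative per-class recalls on an imbalanced dataset

arithmetic_mean = sum(recalls) / len(recalls)
harmonic_mean = len(recalls) / sum(1 / r for r in recalls)

print(arithmetic_mean)  # 0.725  -- flattered by the strong majority-class recall
print(harmonic_mean)    # ~0.655 -- pulled toward the weaker minority-class recall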

53
Q

What results in a more pessimistic view of model performance?

A

harmonic mean

54
Q

Why is it not correct to treat all outcomes equally?

A

Because different outcomes (e.g., false positives vs. false negatives) can have very different costs, and these costs have to be taken into account when evaluating models

55
Q

what is the structure of a profit matrix?

A

It has basically the same structure as a confusion matrix, except that each cell contains the profit (or cost) associated with that outcome rather than a count

56
Q

How do our classification models work?

A

They produce a prediction score and a threshold process is used to convert the score into one of the levels of the target feature

57
Q

What is ROC index?

A
  • receiver operating characteristic index
  • based on the ROC curve
  • widely used performance measure calculated using prediction scores
  • TPR and TNR are intrinsically tied to the threshold used to convert prediction scores into target levels
  • this threshold can be changed which leads to different predictions and a different confusion matrix
58
Q

What happens as the threshold increases?

A
  • TPR decreases
  • TNR increases
59
Q

How can the ROC index be interpreted?

A

It can be interpreted probabilistically as the probability that a model will assign a higher rank to a randomly selected positive instance than to a randomly selected negative instance
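
A small sketch of this interpretation (the prediction scores below are made up): counting, over all positive/negative pairs, how often the positive instance is ranked higher gives an estimate of the ROC index:

from itertools import product

pos_scores = [0.9, 0.8, 0.6]  # scores assigned to positive instances (illustrative)
neg_scores = [0.7, 0.4, 0.3]  # scores assigned to negative instances (illustrative)

pairs = list(product(pos_scores, neg_scores))
# Ties are commonly counted as half a "win".
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
print(wins / len(pairs))  # ~0.889, an estimate of the ROC index (AUC)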

60
Q

What is Gini coefficient?

A

A commonly used performance measure that is just a linear rescaling of the ROC index: Gini coefficient = (2 x ROC index) - 1

61
Q

What is the Kolmogorov-Smirnov statistic (K-S statistic)

A

a performance measure that captures the separation between the distribution of prediction scores for the different target levels in a classification problem

62
Q

How do you calculate the K-S statistic?

A
  • first determine the cumulative probability distributions of the prediction scores for the positive and negative target levels
  • plot the distributions on a K-S chart
  • the K-S statistic is the maximum distance between the two cumulative distributions
63
Q

When are measuring gain and lift useful?

A
  • when we have a positive target level we are especially interested in
  • it can often be useful to focus in on how well a model is making predictions for just those instances, rather than how well the model is distinguishing between the two target levels
64
Q

What is the basic assumption behind both gain and lift?

A

If we were to rank the instances in a test set in descending order of the prediction scores assigned to them by a well-performing model, we would expect the majority of the positive instances to be toward the top of this ranking

65
Q

What is gain?

A
  • a measure of how many of the positive instances in the overall test set are found in a particular decile
  • we count the number of positive instances (based on the known target values) found in each decile and divide by the total number of positive instances in the test set
  • the gain in a particular decile can be interpreted as a measure of how much better than random guessing the predictions made by a model are
66
Q

How do you know if a model is performing well using gain?

A

The gain is higher for the lower deciles, which contain the instances with the highest scores

67
Q

How do you calculate cumulative gain?

A

Cumulative gain is calculated as the fraction of the total number of positive instances in a test set identified up to a particular decile (i.e., in that decile and all deciles below it)
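
A sketch of the gain and cumulative gain calculations (the helper function, scores, and labels are illustrative assumptions; it assumes the test set divides evenly into deciles):

def gain_by_decile(scores, labels, positive=1, n_deciles=10):
    # Rank instances by descending prediction score, then compute, for each decile,
    # the fraction of all positive instances that falls in that decile.
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda t: -t[0])]
    total_pos = sum(1 for lab in ranked if lab == positive)
    size = len(ranked) // n_deciles
    return [sum(1 for lab in ranked[d * size:(d + 1) * size] if lab == positive) / total_pos
            for d in range(n_deciles)]

gains = gain_by_decile([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05],
                       [1,   1,   0,   1,   0,   0,   0,   0,   0,   0])
cumulative = [sum(gains[:d + 1]) for d in range(len(gains))]
print(gains)       # fraction of all positives found in each decile
print(cumulative)  # fraction of all positives found up to and including each decile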

68
Q

What is lift?

A

Lift tells us how much higher the actual percentage of positive instances in a decile is than the rate expected (the overall percentage of positives in the test set)
- cumulative lift is calculated in the same way as cumulative gain

69
Q

When is cumulative gain especially useful?

A

In customer relationship management (CRM) applications such as cross-sell and upsell models

70
Q

When are multinomial targets used?

A

When there are multiple target levels

71
Q

How is performance measures for continuous targets different from categorical?

A
  • the basic process is the same
  • but instead of counting correct and incorrect predictions, we measure how closely the predicted values match the correct target values
72
Q

What are the basic measures of error?

A
  • sum of squared errors
  • mean squared error
  • root mean squared error
  • mean absolute error
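
A short sketch computing each of these from predicted and actual values (the numbers are illustrative; note that some texts include a factor of 1/2 in the sum of squared errors):

import math

actual    = [12.0, 15.5, 9.0, 20.0]
predicted = [11.0, 16.0, 10.5, 18.0]

errors = [p - a for p, a in zip(predicted, actual)]
n = len(errors)

sse  = sum(e ** 2 for e in errors)       # sum of squared errors
mse  = sse / n                           # mean squared error
rmse = math.sqrt(mse)                    # root mean squared error, in the target's units
mae  = sum(abs(e) for e in errors) / n   # mean absolute error
print(sse, mse, rmse, mae)
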
73
Q

How do you evaluate models after deployment?

A
  • measure the performance of the model using appropriate performance measures
  • monitor the distributions of the outputs of the model
  • monitor the distributions of the descriptive features in the query instances presented to the model
74
Q

What is concept drift?

A

The phenomenon in which the characteristics of the data, and the relationship between the descriptive features and the target feature, change over time; it is the reason almost all the predictive models we build will go stale at some point

75
Q

What is the simplest way to get a signal that concept drift has occurred?

A

repeatedly evaluate models with the same performance measures used to evaluate them before deployment
- we can then compare the performance before and after deployment

76
Q

What is an alternative to using changing model performance as a signal for concept drift?

A

use changes in the distribution of model outputs

77
Q

What is stability index?

A

A measure that calculates the difference between the distribution of model outputs collected after deployment and the distribution from the original test set
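
A sketch of one common formulation of this calculation (sometimes called a population stability index); the binning of the model outputs and the counts below are assumptions:

import math

original_counts = [400, 300, 200, 100]  # instances per output bin in the original test set
new_counts      = [250, 300, 250, 200]  # instances per output bin after deployment

orig_props = [c / sum(original_counts) for c in original_counts]
new_props  = [c / sum(new_counts) for c in new_counts]

stability_index = sum((n - o) * math.log(n / o)
                      for o, n in zip(orig_props, new_props))
print(stability_index)  # ~0.15: some change has occurred, further investigation useful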

78
Q

What do the results of the stability index mean?

A
  • SI < 0.1: the distribution of the newly collected test set is broadly similar to the distribution in the original test set
  • SI between 0.1 and 0.25: some change has occurred, and further investigation may be useful
  • SI > 0.25: significant change has occurred, and corrective action is required
79
Q

How do we measure differences in descriptive features before and after deployment?

A
  • any appropriate measure that captures the difference between two distributions
  • e.g., the stability index, the χ² statistic, or the K-S statistic
80
Q

What is the challenge with monitoring descriptive feature distribution changes?

A

There are usually a large number of descriptive features for which measures need to be calculated and tracked
- it is also unlikely that a change in the distribution of one descriptive feature in a multi-feature model will have a large impact on model performance

81
Q

When do we use monitoring descriptive feature distribution changes?

A

when a model uses a very small number of descriptive features, usually fewer than 10

82
Q

What do we use control groups for?

A

To evaluate how good deployed models actually are at helping with the business problem, by comparing outcomes for cases where the model's predictions are used against outcomes for a control group where they are not