Chapter 8: Evaluation Flashcards

1
Q

What is the big idea?

A

The most important part of designing an evaluation experiment for a predictive model is ensuring that the data used to evaluate the model is not the same as the data used to train the model

2
Q

What is the purpose of evaluation?

A
  • To determine which model is the most suitable for a task
  • To estimate how each model will perform
  • To convince the users that the model will meet their needs
3
Q

What do the first and second purposes do?

A

Measure and compare the performance of a group of models to determine which model best performs the prediction task the models have been built to perform
- no free lunch: no single model type is best for every task, so candidate models must be compared empirically

4
Q

What do the second and third purposes do?

A

They have a strong link to deployment and convince users that new decisions made based on the chosen model will improve the current state of affairs

5
Q

What are some priorities for analytical models?

A
  • Medical: the model should never incorrectly predict that a sick patient is healthy (false negatives are extremely costly)
  • Financial: the model only needs to be slightly better than the current approach (the norm) to be valuable
6
Q

What is the primary evaluation metric that computer science focuses on?

A

Measuring model execution performance

7
Q

What are some issues to consider for a model to be successfully deployed?

A
  • How accurate it is
  • How accurate it remains despite drift in data
  • How quickly it makes predictions
  • How easy it is for human analysts to understand or explain the predictions made by the model
  • How much human experts can learn from the model’s actions
  • How easy it is to retrain the model if it goes stale over time
8
Q

What is the basic way to evaluate the effectiveness of a model?

A

Take a dataset we know the expected prediction for (test set) and present it to the trained model. Record the predictions the model makes and compare with the expected predictions. Use a performance measure to numerically capture how well the predictions match the expected ones.

9
Q

What is the training set used for?

A

Model construction (training); typically about 2/3 of the available data

10
Q

What is the test set used for?

A

Accuracy estimation (evaluation); typically about 1/3 of the available data

11
Q

What is misclassification rate?

A

number of incorrect predictions / total number of predictions

12
Q

What is the hold-out test set?

A
  • The simplest way to construct a test set from a dataset
  • It is created by randomly sampling a portion of the data in the ABT
13
Q

What is the benefit of using a hold-out test set?

A

It avoids peeking

14
Q

What is peeking?

A
  • Occurs when a model is evaluated on the same data used to train it
15
Q

Why is using the same data an issue?

A

Since the data was used in training, the model has already seen it so it will probably perform well when evaluated on that same data

16
Q

Why is evaluating with a test set better?

A
  • It is a better measure of how the model is likely to perform when actually deployed
  • Shows how well the model can generalize beyond the instances used to train it
17
Q

What does the misclassification rate show?

A
  • Its values fall in the range [0, 1]
  • Lower values indicate better performance
18
Q

What is a confusion matrix aka truth table?

A
  • A useful tool to capture what happened in an evaluation test in a little more detail
  • It is the basis for calculating many other performance measures
19
Q

How does a confusion matrix work?

A

It calculates the frequency of each possible outcome of predictions made by a model for a test dataset to show how the model is performing

20
Q

What are the possible outcomes for a prediction problem with a binary target feature?

A
  • True Positive (TP): an instance with a positive target feature value that was predicted to have a positive target feature value
  • True Negative (TN): a negative instance predicted to have a negative value
  • False Positive (FP): a negative instance predicted to have a positive value
  • False Negative (FN): a positive instance predicted to have a negative value
21
Q

What is the structure of a confusion matrix?

A

                Prediction
               +ve    -ve
Target   +ve   TP     FN
         -ve   FP     TN

Assume spam = +ve, ham = -ve
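
A minimal Python sketch (mine, not from the book) of how these four counts could be tallied from lists of actual and predicted labels; the example labels are illustrative:

# Hypothetical example data; "spam" is the positive level, "ham" the negative level.
actual    = ["spam", "ham", "spam", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "spam" and p == "spam")
tn = sum(1 for a, p in zip(actual, predicted) if a == "ham" and p == "ham")
fp = sum(1 for a, p in zip(actual, predicted) if a == "ham" and p == "spam")
fn = sum(1 for a, p in zip(actual, predicted) if a == "spam" and p == "ham")

print(tp, fn)  # row of the matrix for the positive target level -> 2 1
print(fp, tn)  # row of the matrix for the negative target level -> 1 2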

22
Q

What is the misclassification rate for truth tables?

A

(FP + FN) / (TP + TN + FP + FN)

23
Q

What is the classification accuracy for truth tables?

A

(TP + TN) / (TP + TN + FP + FN)
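
A small self-contained sketch (not from the book) computing both the misclassification rate and the classification accuracy from the four confusion matrix counts; the counts used are illustrative:

def misclassification_rate(tp, tn, fp, fn):
    # fraction of all predictions that were incorrect
    return (fp + fn) / (tp + tn + fp + fn)

def classification_accuracy(tp, tn, fp, fn):
    # fraction of all predictions that were correct
    return (tp + tn) / (tp + tn + fp + fn)

print(misclassification_rate(tp=2, tn=2, fp=1, fn=1))   # 0.333...
print(classification_accuracy(tp=2, tn=2, fp=1, fn=1))  # 0.666...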

24
Q

What is a common tension that arises with evaluation?

A

The tension between the need to fully understand the performance of the model and the need to reduce model performance to a single measure that can be used to rank models

25
Q

What is hold-out sampling?

A
  • randomly sampling the data into non-overlapping samples (e.g., a training set and a test set)
  • most appropriate for very large datasets from which we can take samples
  • sometimes extended to include a validation set
  • there are no fixed recommendations on how large the different partitions should be
26
Q

Why do we use a validation set?

A

Used when data outside the training set is required in order to tune particular aspects of a model
- for example, in the wrapper-based feature selection technique

27
Q

What is the most common use of a validation set?

A

Avoiding overfitting when using algorithms that iteratively build more and more complex models
- the ID3 algorithm for decision trees and the gradient descent algorithm are two examples of this approach

28
Q

How do we combat overfitting with validation sets?

A

Track performance on the validation set while training, and allow the algorithm to keep training past the point at which validation performance peaks, saving the model generated at each iteration. After training is done, find the iteration at which validation performance began to disimprove and revert to the model from that iteration.

29
Q

What are issues that arise when using hold-out sampling?

A
  • There may not be enough data to make suitably large training and test sets; this results in small partitions and a poor evaluation
  • We may make a "lucky split" that puts the difficult instances into the training set and the easy ones into the test set, which makes the model appear more accurate than it actually is
30
Q

What happens with k-fold cross validation?

A

The available data is divided into k equal-sized folds (k = 10 is most popular), and k separate evaluations are performed: in each one, a different fold is used as the test set while the remaining k - 1 folds are used for training, and the k results are then aggregated
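
A minimal sketch of 10-fold cross-validation; the use of scikit-learn, the decision tree model, and the iris dataset are illustrative assumptions, not prescribed by the text:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Each of the 10 folds serves once as the test set while the other 9 train the model.
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean())  # aggregate performance estimate across the 10 evaluations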

31
Q

What is leave-one-out?

A
  • also known as jackknifing
  • k-fold cross-validation where k = the number of tuples (instances); used for small datasets
  • each time, the single left-out tuple is used as the test set
  • the number of folds is the same as the number of instances in the dataset
  • each fold contains only one instance, and the training set contains the remainder (see the sketch below)
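
A minimal sketch of leave-one-out cross-validation under the same illustrative assumptions (scikit-learn, a k-nearest neighbor model, the iris dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# One fold per instance: each instance is used exactly once as the test set.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(scores.mean())  # fraction of instances classified correctly
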
32
Q

When is leave-one-out cross-validation useful?

A

when the amount of data available is too small to allow big enough training sets in a k-fold cross validation

33
Q

What is stratified cross-validation?

A

Folds are stratified so that the class distribution in each fold is approximately the same as it was in the original data
- this helps reduce the variance of the performance estimate

34
Q

How can cross-validation be improved?

A
  • repeated k-fold cross-validation (repeat the partitioning several times and take the average)
  • stratified cross-validation
35
Q

What is bootstrapping?

A
  • sample the data uniformly with replacement
  • methods: the ε0 bootstrap and the .632 bootstrap
36
Q

When is bootstrapping best used?

A

In contexts with very small datasets (fewer than 300 instances)

37
Q

How does ε0 bootstrapping work?

A
  • it iteratively performs multiple evaluation experiments using slightly different training sets each time to evaluate the expected performance of the model
  • to generate the partitions, it takes a random selection of m instances from the full dataset to generate a test set; the remaining instances are used as the training set
  • using the training set to train a model and the test set to evaluate it, a performance measure is calculated for the iteration
  • the process is repeated for k iterations
  • the overall performance measure is the average across the k iterations (see the sketch below)
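
A sketch of the iterated sampling-and-averaging procedure described above; the values of m and k, the model, and the dataset are illustrative assumptions:

import random

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
m, k = 30, 50
scores = []
for _ in range(k):
    # Randomly select m instances for the test set; the rest form the training set.
    test_idx = random.sample(range(len(X)), m)
    train_idx = [i for i in range(len(X)) if i not in set(test_idx)]

    model = DecisionTreeClassifier()
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy for this iteration

print(sum(scores) / k)  # overall estimate: the average across the k iterations
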
38
Q

What is out-of-time sampling?

A

A form of hold-out sampling in which the sampling is done in a targeted manner rather than randomly
- be careful to ensure that the times from which the training and test sets are taken do not introduce a bias into the evaluation process (which can happen if the two time samples are not really representative)
- it is important when choosing the periods for out-of-time sampling that the time spans are large enough to take into account any cyclical behavioral patterns, or that other approaches are used to account for these

39
Q

What are confusion matrix-based performance measures?

A

A convenient way to fully describe the performance of a predictive model when applied to a test set
- they are also the basis for a whole range of different performance measures that can highlight different aspects of the performance of a predictive model

40
Q

What are the basic measures?

A

true positive rate (TPR), true negative rate (TNR), false negative rate (FNR), and false positive rate (FPR)
- they convert the raw numbers from the confusion matrix into percentages

41
Q

What are the relationships between these measures?

A
  • FNR = 1 − TPR
  • FPR = 1 − TNR
42
Q

What are precision, recall and F1 measure?

A

Another frequently used set of performance measures that can be calculated directly from the confusion matrix

43
Q

What is recall?

A
  • TPR
  • tells us how confident we can be that all instances with the positive target level have been found by the model
44
Q

What is precision?

A
  • captures how often a prediction is correct when a model makes a positive prediction
  • tells us how confident we can be that an instance predicted to have the positive target level actually has the positive target level
45
Q

What is the range of precision and recall?

A
  • range [0, 1]
  • higher values in both cases indicate better model performance
46
Q

What is a single performance measure precision and recall can be collapsed into?

A

F1 measure
- it offers a useful alternative to the simpler misclassification rate

47
Q

What is F1 measure or F score?

A

it is the harmonic mean of precision and recall
- f1 = 2 * [(precision * recall) / (precision + recall)]
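
A tiny illustration (the precision and recall values are assumptions, not from the book):

def f1_measure(precision, recall):
    # harmonic mean of precision and recall
    return 2 * (precision * recall) / (precision + recall)

print(f1_measure(0.8, 0.6))  # 0.6857..., slightly below the arithmetic mean of 0.7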

48
Q

Why harmonic mean?

A

it is less sensitive to large outliers than the arithmetic mean so it does not get skewed by one of the measures being much better than the other

49
Q

Why is harmonic mean useful?

A

We prefer measures that highlight shortcomings in our models rather than hide them
- the F1 measure has the same [0, 1] range as precision and recall, with higher values indicating better performance

50
Q

Which problems do precision, recall, and F1 work best with?

A
  • prediction problems with binary target features
  • they place an emphasis on capturing the performance of a prediction model on the positive, or most important, level
51
Q

What issue is average class accuracy used to solve?

A

Imbalanced datasets, where one target level is much more frequent than the other

52
Q

Why is it preferred to use harmonic mean over arithmetic mean for average class accuracy?

A

The arithmetic mean is susceptible to large outliers, which inflate the apparent performance of a model
- the harmonic mean emphasizes the importance of smaller values and so gives a slightly more realistic measure of how well a model is performing (see the comparison below)
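
An illustrative comparison of the two means (the per-class recall values below are assumptions):

recalls = [0.95, 0.50]  # illustrative per-class recalls on an imbalanced dataset

arithmetic_mean = sum(recalls) / len(recalls)
harmonic_mean = len(recalls) / sum(1 / r for r in recalls)

print(arithmetic_mean)  # 0.725  -- flattered by the strong majority-class recall
print(harmonic_mean)    # ~0.655 -- pulled toward the weaker minority-class recall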

53
Q

What results in a more pessimistic view of model performance?

A

harmonic mean

54
Q

Why is it not correct to treat all outcomes equally?

A

Because different outcomes (e.g., false positives vs. false negatives) can have very different costs, and these costs have to be taken into account when evaluating models

55
Q

what is the structure of a profit matrix?

A

It has basically the same structure as a confusion matrix, except that each cell contains the profit (or cost) associated with that outcome rather than a count

56
Q

How do our classification models work?

A

They produce a prediction score and a threshold process is used to convert the score into one of the levels of the target feature

57
Q

What is ROC index?

A
  • receiver operating characteristic index
  • based on the ROC curve
  • widely used performance measure calculated using prediction scores
  • TPR and TNR are intrinsically tied to the threshold used to convert prediction scores into target levels
  • this threshold can be changed which leads to different predictions and a different confusion matrix
58
Q

What happens as the threshold increases?

A
  • TPR decreases
  • TNR increases
59
Q

How can the ROC index be interpreted?

A

It can be interpreted probabilistically as the probability that a model will assign a higher rank to a randomly selected positive instance than to a randomly selected negative instance
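
A small sketch of this interpretation (the prediction scores below are made up): counting, over all positive/negative pairs, how often the positive instance is ranked higher gives an estimate of the ROC index:

from itertools import product

pos_scores = [0.9, 0.8, 0.6]  # scores assigned to positive instances (illustrative)
neg_scores = [0.7, 0.4, 0.3]  # scores assigned to negative instances (illustrative)

pairs = list(product(pos_scores, neg_scores))
# Ties are commonly counted as half a "win".
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
print(wins / len(pairs))  # ~0.889, an estimate of the ROC index (AUC)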

60
Q

What is Gini coefficient?

A

A commonly used performance measure that is just a linear rescaling of the ROC index: Gini coefficient = (2 x ROC index) - 1

61
Q

What is the Kolmogorov-Smirnov statistic (K-S statistic)

A

a performance measure that captures the separation between the distribution of prediction scores for the different target levels in a classification problem

62
Q

How do you calculate the K-S statistic?

A
  • first determine the cumulative probability distributions of the prediction scores for the positive and negative target levels
  • plot the distributions on a K-S chart
  • the K-S statistic is the maximum distance between the two cumulative distributions
63
Q

When are measuring gain and lift useful?

A
  • when we have a positive target level we are especially interested in
  • it can often be useful to focus in on how well a model is making predictions for just those instances, rather than how well the model is distinguishing between the two target levels
64
Q

What is the basic assumption behind both gain and lift?

A

If we were to rank the instances in a test set in descending order of the prediction scores assigned to them by a well-performing model, we would expect the majority of the positive instances to be toward the top of this ranking

65
Q

What is gain?

A
  • a measure of how many of the positive instances in the overall test set are found in a particular decile
  • we count the number of positive instances (based on the known target values) found in each decile and divide by the total number of positive instances in the test set
  • the gain in a particular decile can be interpreted as a measure of how much better than random guessing the predictions made by a model are
66
Q

How do you know if a model is performing well using gain?

A

The gain is higher for the lower deciles, which contain the instances with the highest scores

67
Q

How do you calculate cumulative gain?

A

Cumulative gain is calculated as the fraction of the total number of positive instances in a test set identified up to a particular decile (i.e., in that decile and all deciles below it)
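
A sketch of the gain and cumulative gain calculations (the helper function, scores, and labels are illustrative assumptions; it assumes the test set divides evenly into deciles):

def gain_by_decile(scores, labels, positive=1, n_deciles=10):
    # Rank instances by descending prediction score, then compute, for each decile,
    # the fraction of all positive instances that falls in that decile.
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda t: -t[0])]
    total_pos = sum(1 for lab in ranked if lab == positive)
    size = len(ranked) // n_deciles
    return [sum(1 for lab in ranked[d * size:(d + 1) * size] if lab == positive) / total_pos
            for d in range(n_deciles)]

gains = gain_by_decile([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05],
                       [1,   1,   0,   1,   0,   0,   0,   0,   0,   0])
cumulative = [sum(gains[:d + 1]) for d in range(len(gains))]
print(gains)       # fraction of all positives found in each decile
print(cumulative)  # fraction of all positives found up to and including each decile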

68
Q

What is lift?

A

Lift tells us how much higher the actual percentage of positive instances in a decile is than the rate expected (the overall percentage of positives in the test set)
- cumulative lift is calculated in the same way as cumulative gain

69
Q

When is cumulative gain especially useful?

A

In customer relationship management (CRM) applications such as cross-sell and upsell models

70
Q

When are multinomial targets used?

A

When there are multiple target levels

71
Q

How is performance measures for continuous targets different from categorical?

A
  • the basic process is the same
  • but instead of counting correct and incorrect predictions, we measure how closely the predicted values match the correct target values
72
Q

What are the basic measures of error?

A
  • sum of squared errors
  • mean squared error
  • root mean squared error
  • mean absolute error
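
A short sketch computing each of these from predicted and actual values (the numbers are illustrative; note that some texts include a factor of 1/2 in the sum of squared errors):

import math

actual    = [12.0, 15.5, 9.0, 20.0]
predicted = [11.0, 16.0, 10.5, 18.0]

errors = [p - a for p, a in zip(predicted, actual)]
n = len(errors)

sse  = sum(e ** 2 for e in errors)       # sum of squared errors
mse  = sse / n                           # mean squared error
rmse = math.sqrt(mse)                    # root mean squared error, in the target's units
mae  = sum(abs(e) for e in errors) / n   # mean absolute error
print(sse, mse, rmse, mae)
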
73
Q

How do you evaluate models after deployment?

A
  • measure the performance of the model using appropriate performance measures
  • monitor the distributions of the outputs of the model
  • monitor the distributions of the descriptive features in the query instances presented to the model
74
Q

What is concept drift?

A

The phenomenon in which the characteristics of the data, and the relationship between the descriptive features and the target feature, change over time; it is the reason almost all the predictive models we build will go stale at some point

75
Q

What is the simplest way to get a signal that concept drift has occurred?

A

repeatedly evaluate models with the same performance measures used to evaluate them before deployment
- we can then compare the performance before and after deployment

76
Q

What is an alternative to using changing model performance as a signal for concept drift?

A

use changes in the distribution of model outputs

77
Q

What is stability index?

A

A measure that calculates the difference between the distribution of model outputs collected after deployment and the distribution from the original test set
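
A sketch of one common formulation of this calculation (sometimes called a population stability index); the binning of the model outputs and the counts below are assumptions:

import math

original_counts = [400, 300, 200, 100]  # instances per output bin in the original test set
new_counts      = [250, 300, 250, 200]  # instances per output bin after deployment

orig_props = [c / sum(original_counts) for c in original_counts]
new_props  = [c / sum(new_counts) for c in new_counts]

stability_index = sum((n - o) * math.log(n / o)
                      for o, n in zip(orig_props, new_props))
print(stability_index)  # ~0.15: some change has occurred, further investigation useful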

78
Q

What do the results of the stability index mean?

A
  • SI < 0.1: the distribution of the newly collected test set is broadly similar to the distribution in the original test set
  • SI between 0.1 and 0.25: some change has occurred, and further investigation may be useful
  • SI > 0.25: significant change has occurred, and corrective action is required
79
Q

How do we measure differences in descriptive features before and after deployment?

A
  • any appropriate measure that captures the difference between two distributions
  • e.g., the stability index, the χ² statistic, or the K-S statistic
80
Q

What is the challenge with monitoring descriptive feature distribution changes?

A

There are usually a large number of descriptive features for which measures need to be calculated and tracked
- it is also unlikely that a change in the distribution of one descriptive feature in a multi-feature model will have a large impact on model performance

81
Q

When do we use monitoring descriptive feature distribution changes?

A

when a model uses a very small number of descriptive features, usually fewer than 10

82
Q

What do we use control groups for?

A

To evaluate how good deployed models actually are at helping with the business problem, by comparing outcomes for cases where the model's predictions are used against outcomes for a control group where they are not