Chapter 8: Evaluation Flashcards
What is the big idea?
The most important part of designing an evaluation experiment for a predictive model is ensuring that the data used to evaluate the model is not the same as the data used to train the model
What is the purpose of evaluation?
- To determine which model is the most suitable for a task
- To estimate how each model will perform
- To convince the users that the model will meet their needs
What do the first and second purposes do?
Measure and compare the performance of a group of models to determine which one best performs the prediction task the models have been built for
- No free lunch theorem: no single model type is best for every task, so candidate models must be compared empirically
What do the second and third purposes do?
They have a strong link to deployment and convince users that new decisions made based on the chosen model will improve the current state of affairs
What are some priorities for analytical models?
- Medical: the model should almost never incorrectly predict that a sick patient is healthy (false negatives are especially costly)
- Financial: the model only needs to be slightly better than the norm (the current approach) to be worthwhile
What is the primary evaluation metric computer science focuses on?
Measuring model execution performance
What are some issues to consider for a model to be successfully deployed?
- How accurate it is
- How accurate it remains despite drift in data
- How quickly it makes predictions
- How easy it is for human analysts to understand or explain the predictions made by the model
- How much human experts can learn from the model’s actions
- How easy it is to retrain the model if it goes stale over time
What is the basic way to evaluate the effectiveness of a model?
Take a dataset for which we know the expected predictions (the test set) and present it to the trained model. Record the predictions the model makes and compare them with the expected predictions. Use a performance measure to numerically capture how well the predictions match the expected ones.
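A minimal sketch of this procedure in Python (the model object, its predict() method, and the test-set layout are illustrative assumptions, not a fixed API):

```python
# Minimal sketch of the basic evaluation procedure.
# Assumes a hypothetical trained `model` with a predict() method and a
# test set of (instance, expected_prediction) pairs.
def evaluate_model(model, test_set):
    expected = [label for _, label in test_set]
    predicted = [model.predict(instance) for instance, _ in test_set]
    # Performance measure: misclassification rate = incorrect / total.
    incorrect = sum(1 for e, p in zip(expected, predicted) if e != p)
    return incorrect / len(test_set)
```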
What is the training set used for?
- Model construction
- Typically about 2/3 of the available data
What is the test set used for?
- Accuracy estimation
- Typically about 1/3 of the available data
What is misclassification rate?
number of incorrect predictions / total number of predictions
What is the hold-out test set?
- The simplest way to construct a test set from a dataset
- It is created by randomly sampling a portion of the data in the ABT
What is the benefit of using the hold-out test set?
It avoids peeking
What is peeking?
- Occurs when a model is evaluated on the same data used to train it
Why is using the same data an issue?
Since the data was used in training, the model has already seen it, so it will probably perform deceptively well when evaluated on that same data
Why is evaluating with a test set better?
- It is a better measure of how the model is likely to perform when actually deployed
- Shows how well the model can generalize beyond the instances used to train it
What does the misclassification rate show?
- Its values range from 0 to 1
- Lower values indicate better performance
What is a confusion matrix aka truth table?
- A useful tool to capture what happened in an evaluation test in a little more detail
- It is the basis for calculating many other performance measures
How does a confusion matrix work?
It calculates the frequency of each possible outcome of predictions made by a model for a test dataset to show how the model is performing
What are the possible outcomes for a prediction problem with a binary target feature?
- True Positive (TP): an instance with a positive target feature value that was predicted to have a positive target feature value
- True Negative (TN): a negative instance predicted to be negative
- False Positive (FP): a negative instance predicted to be positive
- False Negative (FN): a positive instance predicted to be negative
What is the structure of a confusion matrix?
                Prediction
                +ve    -ve
Target   +ve    TP     FN
         -ve    FP     TN

Assume spam = +ve, ham = -ve
What is the misclassification rate for truth tables?
(FP + FN) / (TP + TN + FP + FN)
What is the classification accuracy for truth tables?
(TP + TN) / (TP + TN + FP + FN)
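A sketch of computing these counts and measures for a binary spam/ham task (the helper function and example data are illustrative):

```python
# Count the four confusion-matrix outcomes for a binary prediction task.
def confusion_counts(expected, predicted, positive="spam"):
    tp = sum(e == positive and p == positive for e, p in zip(expected, predicted))
    tn = sum(e != positive and p != positive for e, p in zip(expected, predicted))
    fp = sum(e != positive and p == positive for e, p in zip(expected, predicted))
    fn = sum(e == positive and p != positive for e, p in zip(expected, predicted))
    return tp, tn, fp, fn

expected  = ["spam", "ham", "spam", "ham", "spam"]
predicted = ["spam", "spam", "spam", "ham", "ham"]
tp, tn, fp, fn = confusion_counts(expected, predicted)      # 2, 1, 1, 1
misclassification_rate = (fp + fn) / (tp + tn + fp + fn)    # 0.4
classification_accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.6
```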
What is a common tension that arises with evaluation?
The tension between the need to fully understand the performance of the model and the need to reduce model performance to a single measure so that models can be ranked
What is hold-out sampling
- Randomly sampling the data to create non-overlapping training and test samples
- Most appropriate for very large datasets from which we can take samples
- Sometimes extended to include a validation set
- There are no fixed recommendations on how large the different partitions should be (see the sketch below)
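A sketch of hold-out sampling with non-overlapping training, validation, and test partitions (the 50/20/30 proportions are illustrative, since there are no fixed recommendations):

```python
import random

# Randomly partition a dataset into non-overlapping train/validation/test sets.
def holdout_split(dataset, train_frac=0.5, valid_frac=0.2, seed=0):
    data = list(dataset)
    random.Random(seed).shuffle(data)           # random sampling avoids ordering bias
    n_train = int(len(data) * train_frac)
    n_valid = int(len(data) * valid_frac)
    train = data[:n_train]                      # model construction
    valid = data[n_train:n_train + n_valid]     # tuning / avoiding overfitting
    test  = data[n_train + n_valid:]            # accuracy estimation (avoids peeking)
    return train, valid, test
```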
Why do we use a validation set?
Used when data outside the training set is required in order to tune particular aspects of a model
- For example, in the wrapper-based feature selection technique
What is the most common use of a validation set?
Avoiding overfitting when using algorithms that iteratively build more and more complex models
- The ID3 algorithm for decision tree induction and the gradient descent algorithm are two examples of this approach
How do we combat overfitting with validation sets?
Measure performance on the validation set at each training iteration. Allow the algorithm to keep training beyond the point where validation performance peaks, but save the model generated at each iteration. After training is done, find the point where validation performance begins to disimprove and revert to the model saved at that iteration (see the sketch below)
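A sketch of this early-stopping idea, assuming hypothetical train_one_iteration() and error() callables and a copyable model object:

```python
import copy

# Train past the best point, snapshot the model at every iteration, then
# revert to the iteration at which validation performance was best.
def train_with_validation(model, train_set, valid_set, iterations,
                          train_one_iteration, error):
    snapshots, valid_errors = [], []
    for _ in range(iterations):
        train_one_iteration(model, train_set)         # model grows more complex
        snapshots.append(copy.deepcopy(model))        # save this iteration's model
        valid_errors.append(error(model, valid_set))  # track validation performance
    best = min(range(len(valid_errors)), key=valid_errors.__getitem__)
    return snapshots[best]
```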
What are issues that arise when using hold-out sampling?
- There may not be enough data to make suitably large training and test sets; small partitions lead to a poor evaluation
- We may make a "lucky split" that puts the difficult instances in the training set and the easy ones in the test set, which makes the model appear more accurate than it actually is
What happens with k-fold cross validation?
The available data is divided into k equal-sized folds (k = 10 is the most popular choice) and k separate evaluations are performed; in each one, a different fold is held out as the test set and the model is trained on the remaining k-1 folds. The k results are then aggregated (see the sketch below)
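A sketch of k-fold cross validation, where build_model() and error() are hypothetical placeholders for training and evaluating a model:

```python
# Perform k separate evaluations, each holding out a different fold as the test set.
def k_fold_cross_validation(dataset, k, build_model, error):
    data = list(dataset)
    folds = [data[i::k] for i in range(k)]            # k roughly equal-sized folds
    scores = []
    for i in range(k):
        test_fold = folds[i]
        training = [x for j in range(k) if j != i for x in folds[j]]
        model = build_model(training)
        scores.append(error(model, test_fold))
    return sum(scores) / k                            # aggregate the k results
```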
What is leave-one-out?
- Also known as jackknifing
- k-fold cross validation where k equals the number of instances in the dataset, so the number of folds is the same as the number of training instances
- Used for small datasets
- Each fold contains only one instance; each time, the single left-out instance is used for testing and the training set contains all the remaining instances
When is leave-one-out cross validation useful?
When the amount of data available is too small to allow big enough training sets in a k-fold cross validation
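Leave-one-out can be expressed with the k-fold sketch above by setting k to the number of instances (dataset, build_model, and error are the same illustrative placeholders):

```python
# Each fold contains exactly one instance; every instance is left out once.
loocv_score = k_fold_cross_validation(dataset, k=len(dataset),
                                      build_model=build_model, error=error)
```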