7 - Validating and Evaluating Data Models Flashcards
How can forecasts be evaluated?
- compare the forecast to what actually happened
- compare to naive forecast
How can forecasts be evaluated?
Compare the forecast to what actually happened
- distance between observations and forecast should be minimal
- fit can change if the forecasted market changes
- careful: self-fulfilling prophecies can lead to a perfect fit that is still not helpful
How can forecasts be evaluated?
Compare to naive forecast
- naive forecast: assume what happened in the previous period will happen in this period
- any more complex forecast should be better than the naive forecast, otherwise why pay for a complex method?
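A minimal sketch of this comparison in Python (NumPy assumed; the demand figures and the helper name naive_forecast are invented for illustration):

```python
import numpy as np

def naive_forecast(history):
    """Naive forecast: predict each period with the previous period's observed value."""
    return history[:-1]

# Invented demand figures: 'actual' covers periods 0..5, the model forecast covers periods 1..5
actual   = np.array([100, 104, 99, 110, 108, 115], dtype=float)
forecast = np.array([102, 101, 107, 109, 112], dtype=float)

naive  = naive_forecast(actual)   # naive prediction for periods 1..5
target = actual[1:]               # what actually happened in periods 1..5

rmse_model = np.sqrt(np.mean((target - forecast) ** 2))
rmse_naive = np.sqrt(np.mean((target - naive) ** 2))
print(f"model RMSE: {rmse_model:.2f}  naive RMSE: {rmse_naive:.2f}")
# The complex forecast should beat the naive benchmark to justify its extra cost.
```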
Measuring forecast performance
Error measures for numerical values
Absolute: RMSE (root mean squared error)
-> depends on scale
Percentage: MAPE (mean absolute percentage error)
-> independent of scale
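Both measures are straightforward to compute; a small sketch (NumPy assumed, example numbers invented):

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error: an absolute measure, expressed on the scale of the data."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.sqrt(np.mean((actual - forecast) ** 2))

def mape(actual, forecast):
    """Mean absolute percentage error: scale-independent, but undefined when an actual value is zero."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

print(rmse([100, 110, 120], [98, 113, 118]))  # in the data's units
print(mape([100, 110, 120], [98, 113, 118]))  # in percent
```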
Self-fulfilling forecasts can have bad consequences
Example
- a firm offers several products at different prices
- customers always buy the cheapest product, substituting higher-priced products
- the firm sells fewer expensive products than expected
- the forecast predicts little demand for expensive products
- the firm stocks more cheap products
- profit spirals down
Measuring classification performance
Error measures for categorical values
- error rate per category
- error rate across categories
- comparing error rates
Measuring classification performance
Error measures for categorical values
Error rate per category
Recall = no. of instances correctly assigned to class / no. of instances that are actually in class
(starts from the true assignment)
Precision = no. of instances correctly assigned to class / no. of instances assigned to class
(starts from the predicted assignment)
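A possible implementation of both measures, counting directly from the label vectors (the function name and example labels are illustrative):

```python
import numpy as np

def recall_precision(y_true, y_pred, cls):
    """Per-class recall and precision, counted directly from true and predicted labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    correctly_assigned = np.sum((y_pred == cls) & (y_true == cls))
    actually_in_class  = np.sum(y_true == cls)   # recall starts from the true assignment
    assigned_to_class  = np.sum(y_pred == cls)   # precision starts from the predicted assignment
    recall    = correctly_assigned / actually_in_class if actually_in_class else 0.0
    precision = correctly_assigned / assigned_to_class if assigned_to_class else 0.0
    return recall, precision

y_true = ["a", "a", "b", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c", "c"]
print(recall_precision(y_true, y_pred, "b"))   # (0.666..., 0.666...)
```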
Measuring classification performance
Error measures for categorical values
Error rate across categories
- average or weighted average
- weighted according to exogenous or endogenous importance of a class
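A sketch with made-up per-class error rates and weights, showing the plain average next to a frequency-weighted (endogenous) and a cost-weighted (exogenous) variant:

```python
import numpy as np

# Invented per-class error rates
error_rates = {"a": 0.10, "b": 0.25, "c": 0.05}

frequency = {"a": 0.6, "b": 0.3, "c": 0.1}   # endogenous importance: class frequency in the data
cost      = {"a": 1.0, "b": 5.0, "c": 2.0}   # exogenous importance: business cost of an error in that class

plain_average    = np.mean(list(error_rates.values()))
weighted_by_freq = sum(error_rates[c] * frequency[c] for c in error_rates)
weighted_by_cost = sum(error_rates[c] * cost[c] for c in error_rates) / sum(cost.values())
print(plain_average, weighted_by_freq, weighted_by_cost)
```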
Measuring classification performance
Error measures for categorical values
Comparing error rates
- error on training set vs. validation set vs. test set
- expected error (probability) vs. observed error vs. error from benchmark approaches
Measuring classification performance
Benchmarking
Possible benchmarks
- statistically expected error rate, derived from the probabilistic distribution of instances
- naive rules
- expert assignment
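One way to sketch the first two benchmarks, a proportional random guesser and a majority-class naive rule, using an invented label distribution:

```python
import numpy as np

y_true = np.array(["yes"] * 70 + ["no"] * 30)   # invented label distribution
priors = {c: np.mean(y_true == c) for c in np.unique(y_true)}

# Statistically expected error when guessing classes according to their distribution
expected_error = 1.0 - sum(p * p for p in priors.values())

# Naive rule: always predict the majority class
majority = max(priors, key=priors.get)
naive_error = np.mean(y_true != majority)

print(f"expected (random-guess) error: {expected_error:.2f}")   # 1 - (0.7^2 + 0.3^2) = 0.42
print(f"majority-rule error: {naive_error:.2f}")                # 0.30
```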
Measuring classification performance
Benchmarking
Benchmark factors beyond accuracy
- effort - computational, financial, …
- reliability - over time, data sets, …
- acceptance - who gets to override the model?
Cross-Validation and Bootstrapping
Splitting the data set for evaluation
Example: Decision tree
Training set: build the tree
Validation set: prune the tree
Test set: evaluate the tree’s predictions
Cross-Validation and Bootstrapping
Splitting the data set for evaluation
Split the data set:
- training set
- validation set
- test set
Build the model:
- Training set
- Validation set
(building the model uses both sets)
Evaluate the model:
- test set
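A sketch of this workflow with scikit-learn (assumed here as the tooling; the tree depth chosen on the validation set stands in for the pruning step):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Split into 60% training, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Build on the training set, use the validation set to choose the pruning level (here: tree depth)
best_depth, best_score = None, -1.0
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

# Evaluate the chosen tree once on the untouched test set
final_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print(f"chosen depth: {best_depth}, test accuracy: {final_tree.score(X_test, y_test):.3f}")
```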
Cross-Validation and Bootstrapping
Hold out 1: k-fold Cross validation
Split the data set into k partitions of equal size
- use k-1 partitions for training (and validation)
- use the k-th partition for evaluation (“hold-out”)
- common: k=10
Repeat the cross validation k times, where the hold-out partition alternates across all k partitions
Average the result over the k repetitions for a single measure
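A sketch of 10-fold cross-validation using scikit-learn's KFold (the data set and classifier are placeholders):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

scores = []
kfold = KFold(n_splits=10, shuffle=True, random_state=0)   # k = 10 partitions
for train_idx, holdout_idx in kfold.split(X):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[holdout_idx], y[holdout_idx]))  # evaluate on the hold-out partition

print(f"mean accuracy over {len(scores)} folds: {np.mean(scores):.3f}")
```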
Cross-Validation and Bootstrapping
Hold out 2: Bootstrap
- alternative to cross validation, applicable for small data sets
- n = size of the original data set
- draw n instances with replacement from the data set to generate a training set
-> drawing with replacement: the same instance can be included multiple times, others are never drawn
- use the instances that were never drawn for the test
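A sketch of one bootstrap round (NumPy and scikit-learn assumed; the never-drawn instances are often called the "out-of-bag" set):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

n = len(X)                                 # size of the original data set
train_idx = rng.integers(0, n, size=n)     # draw n instances with replacement
oob_mask = np.ones(n, dtype=bool)
oob_mask[train_idx] = False                # instances never drawn form the test ("out-of-bag") set

model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
print(f"accuracy on {oob_mask.sum()} never-drawn instances: "
      f"{model.score(X[oob_mask], y[oob_mask]):.3f}")
```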
What’s a lift factor?
- describes the increase in response rate yielded by the learned model
- describes only the relative (percentage) increase, not an increase in the absolute number of respondents
- but assuming that every additional instance sampled costs money, computing lift factors enables cost-benefit analyses
Computing the lift factor for deterministic classification
Steps
- consider for all instances the prediction and the actual class
- compute the overall share of the desired class (e.g. “positive response to the newsletter”)
- compute the share of the desired class in those instances predicted to belong to the class
- lift factor = share within the instances predicted to be in the class / overall share
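A sketch of these steps (the function name lift_factor and the example labels are invented):

```python
import numpy as np

def lift_factor(y_true, y_pred, desired="positive"):
    """Lift of a deterministic classifier with respect to the desired class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    overall_share      = np.mean(y_true == desired)                  # overall share of the desired class
    predicted_in_class = y_pred == desired                           # instances predicted to be in the class
    share_in_predicted = np.mean(y_true[predicted_in_class] == desired)
    return share_in_predicted / overall_share

# Invented newsletter data: "positive" = positive response
y_true = np.array(["positive", "negative", "negative", "positive",
                   "negative", "negative", "negative", "positive"])
y_pred = np.array(["positive", "negative", "positive", "positive",
                   "negative", "negative", "negative", "negative"])
print(lift_factor(y_true, y_pred))   # 0.67 / 0.375 = about 1.78
```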
Computing the lift factor for probabilistic classification
Steps
- consider for all instances the predicted class probability and the actual class
- order the instances by descending probability of belonging to the desired class (e.g. “positive response to newsletter”)
- choose a sample size and take the corresponding number of instances from the top of the ordered list
- compute the share of the desired class in the selected instances
- lift factor = share within the sample / overall share
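A sketch of these steps (the function name lift_at, the labels and the probabilities are invented):

```python
import numpy as np

def lift_at(y_true, prob_desired, sample_size, desired=1):
    """Lift factor when targeting the sample_size instances with the highest predicted probability."""
    y_true = np.asarray(y_true)
    order = np.argsort(prob_desired)[::-1]     # order by descending probability of the desired class
    selected = y_true[order[:sample_size]]     # take the top of the ordered list
    overall_share = np.mean(y_true == desired)
    sample_share  = np.mean(selected == desired)
    return sample_share / overall_share

# Invented newsletter data: 1 = positive response
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 1])
probs  = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.1, 0.35, 0.25, 0.6])
print(lift_at(y_true, probs, sample_size=4))   # share in the top 4 vs. overall share
```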
Lift chart
- lift charts can be computed when classification is probabilistic
- compute the lift factor for increasing sample sizes, possibly comparing the gains to the additional cost of contacting a larger sample
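Extending the previous sketch, a cumulative lift curve over all possible sample sizes (again with invented data; plotting is left out):

```python
import numpy as np

def lift_chart(y_true, prob_desired, desired=1):
    """Cumulative lift factor for every possible sample size, instances ordered by probability."""
    y_true = np.asarray(y_true)
    order = np.argsort(prob_desired)[::-1]
    hits = (y_true[order] == desired).astype(float)
    sizes = np.arange(1, len(y_true) + 1)
    sample_share = np.cumsum(hits) / sizes             # share of the desired class in the top-n instances
    overall_share = np.mean(y_true == desired)
    return sizes, sample_share / overall_share

y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 1])
probs  = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.1, 0.35, 0.25, 0.6])
for n, lift in zip(*lift_chart(y_true, probs)):
    print(f"sample size {n:2d}: lift {lift:.2f}")      # plot sample size vs. lift (and cost) for the chart
```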