7 - Validating and Evaluating Data Models Flashcards
How can forecasts be evaluated?
- compare the forecast to what actually happened
- compare to naive forecast
How can forecasts be evaluated?
Compare the forecast to what actually happened
- distance between observations and forecast should be minimal
- fit can change if the forecasted market changes
- careful: self-fulfilling prophecies can lead to a perfect fit that is still not helpful
How can forecasts be evaluated?
Compare to naive forecast
- naive forecast: assume what happened in the previous period will happen in this period
- any more complex forecast should be better than the naive forecast, otherwise why pay for a complex method?
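A minimal Python sketch of that comparison, using made-up demand numbers and mean absolute error as an (assumed) error measure:

# Hypothetical demand figures; the naive forecast is simply the previous period's actual.
actuals = [100, 110, 105, 120, 118]   # observed demand in periods 2..6
model   = [102, 108, 109, 117, 121]   # forecast from the complex method
naive   = [ 98, 100, 110, 105, 120]   # naive forecast = actual of the previous period

def mean_abs_error(forecast, actual):
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)

print("model MAE:", mean_abs_error(model, actuals))   # 2.8
print("naive MAE:", mean_abs_error(naive, actuals))   # 6.8
# The complex method only pays off if its error is clearly below the naive error.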
Measuring forecast performance
Error measures for numerical values
Absolute: RMSE (root mean squared error)
-> depends on scale
Percentage: MAPE (mean absolute percentage error)
-> independent of scale
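A minimal Python sketch of both measures on hypothetical actuals and forecasts:

import math

actuals  = [100.0, 110.0, 105.0, 120.0]
forecast = [ 98.0, 113.0, 104.0, 125.0]

rmse = math.sqrt(sum((f - a) ** 2 for f, a in zip(forecast, actuals)) / len(actuals))
mape = 100 * sum(abs(f - a) / abs(a) for f, a in zip(forecast, actuals)) / len(actuals)

print(f"RMSE: {rmse:.2f}")    # in the unit of the data -> depends on scale
print(f"MAPE: {mape:.2f} %")  # a percentage -> independent of scale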
Self-fulfilling forecasts can have bad consequences
Example
- a firm offers several products at different prices
- customers always buy the cheapest product, substituting higher-priced products
- the firm sells fewer expensive products than expected
- the forecast predicts little demand for expensive products
- the firm stocks more cheap products
- profit spirals down
Measuring classification performance
Error measures for categorical values
- error rate per category
- error rate across categories
- comparing error rates
Measuring classification performance
Error measures for categorical values
Error rate per category
Recall = no. of instances correctly assigned to class / no. of instances that are actually in class
(starts from the true assignment)
Precision = no. of instances correctly assigned to class / no. of instances assigned to class
(starts from the predicted assignment)
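A minimal Python sketch of both definitions, applied to hypothetical true and predicted labels:

truth     = ["A", "A", "A", "B", "B", "C", "C", "C"]
predicted = ["A", "B", "A", "B", "B", "C", "A", "C"]

def recall(cls):
    # start from the true assignment: all instances actually in the class
    in_class = [p for t, p in zip(truth, predicted) if t == cls]
    return sum(p == cls for p in in_class) / len(in_class)

def precision(cls):
    # start from the predicted assignment: all instances assigned to the class
    assigned = [t for t, p in zip(truth, predicted) if p == cls]
    return sum(t == cls for t in assigned) / len(assigned)

for cls in sorted(set(truth)):
    print(cls, "recall:", round(recall(cls), 2), "precision:", round(precision(cls), 2))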
Measuring classification performance
Error measures for categorical values
Error rate across categories
- average or weighted average
- weighted according to the exogenous (e.g. business cost) or endogenous (e.g. class frequency) importance of a class
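A minimal Python sketch of the averaging options; the per-class error rates and weights are hypothetical, with class frequency as an endogenous weight and misclassification cost as an exogenous one:

error_per_class = {"A": 0.10, "B": 0.30, "C": 0.05}
frequency = {"A": 0.70, "B": 0.20, "C": 0.10}   # endogenous: share of each class in the data
cost      = {"A": 1.0,  "B": 5.0,  "C": 2.0}    # exogenous: business cost of an error in each class

simple_avg    = sum(error_per_class.values()) / len(error_per_class)
freq_weighted = sum(error_per_class[c] * frequency[c] for c in error_per_class)
cost_weighted = sum(error_per_class[c] * cost[c] for c in error_per_class) / sum(cost.values())

print(simple_avg, freq_weighted, cost_weighted)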
Measuring classification performance
Error measures for categorical values
Comparing error rates
- error on training set vs. validation set vs. test set
- expected error (probability) vs. observed error vs. error from benchmark approaches
Measuring classification performance
Benchmarking
Possible benchmarks
- statistically expected error rate - probabilistic distribution of instances
- naive rules
- expert assignment
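A minimal Python sketch comparing a classifier's observed error rate against two cheap benchmarks: a naive rule (always predict the most frequent class) and the statistically expected error of guessing along the class distribution; all labels are hypothetical:

from collections import Counter

truth     = ["A", "A", "A", "A", "B", "B", "C", "C", "C", "C"]
predicted = ["A", "A", "B", "A", "B", "B", "C", "C", "A", "C"]

model_error = sum(t != p for t, p in zip(truth, predicted)) / len(truth)

majority = Counter(truth).most_common(1)[0][0]              # naive rule
naive_error = sum(t != majority for t in truth) / len(truth)

shares = [n / len(truth) for n in Counter(truth).values()]  # class distribution
expected_error = 1 - sum(s * s for s in shares)             # error of distribution-based guessing

print(model_error, naive_error, expected_error)             # 0.2  0.6  0.64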
Measuring classification performance
Benchmarking
Benchmark factors beyond accuracy
- effort - computational, financial, …
- reliability - over time, data sets, …
- acceptance - who gets to override the model?
Cross-Validation and Bootstrapping
Splitting the data set for evaluation
Example: Decision tree
Training set: build the tree
Validation set: prune the tree
Test set: evaluate the tree’s predictions
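A minimal sketch of this workflow with scikit-learn (assumed available); the iris data, the 60/20/20 split, and pruning via cost-complexity alphas are illustrative choices, not prescribed here:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Training set: build candidate trees with increasing pruning strength.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas]

# Validation set: pick the pruning level that generalises best.
best = max(trees, key=lambda t: t.score(X_val, y_val))

# Test set: report the final, unbiased performance estimate.
print("test accuracy:", best.score(X_test, y_test))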
Cross-Validation and Bootstrapping
Splitting the data set for evaluation
Split the data set:
- training set
- validation set
- test set
Build the model:
- Training set
- Validation set
(both overlapping)
Evaluate the model:
- test set
Cross-Validation and Bootstrapping
Hold out 1: k-fold cross-validation
Split the data set into k partitions of equal size
- use k-1 partitions for training (and validation)
- use the k-th partition for evaluation (“hold-out”)
- common: k=10
Repeat the cross validation k times, where the hold-out partition alternates across all k partitions
Average the result over the k repetitions for a single measure
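A minimal plain-Python sketch of these mechanics; train_and_score is a hypothetical placeholder for whatever model building and error measure are actually used:

import random

def k_fold_cv(instances, k, train_and_score):
    data = list(instances)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]     # k partitions of (roughly) equal size
    scores = []
    for i in range(k):                         # the hold-out partition alternates
        hold_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(training, hold_out))
    return sum(scores) / k                     # average over the k repetitions

# e.g. k_fold_cv(my_instances, 10, my_train_and_score)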
Cross-Validation and Bootstrapping
Hold out 2: Bootstrap
- alternative to cross validation, applicable for small data sets
- n = size of the original data set
- draw n instances with replacement from the data set to generate a training set
-> drawing with replacement: the same instance can be included multiple times, while other instances are never drawn
- use the instances that were never drawn as the test set
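A minimal plain-Python sketch of the bootstrap split:

import random

def bootstrap_split(data):
    n = len(data)
    drawn = [random.randrange(n) for _ in range(n)]   # draw n indices with replacement
    drawn_set = set(drawn)
    training = [data[i] for i in drawn]               # the same instance can appear several times
    test = [x for i, x in enumerate(data) if i not in drawn_set]   # instances never drawn
    return training, test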