08 Data Evaluation Flashcards
Explain stratified holdout.
Keep the proportional representation of the different classes in both the training and the test set.
Explain Train-Validate-Test.
Fit the model to the training data (find parameters) <-> use the validation data to inform model design decisions (different hyperparameters / different model) (model selection).
Then train the final model using BOTH the validation and the training data. Assess performance on the holdout set (model assessment). A small sketch follows below.
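A minimal sketch of a stratified Train-Validate-Test split with scikit-learn; the 60/20/20 split proportions and the synthetic data are assumptions, not part of the card:

```python
# Stratified Train-Validate-Test split (illustrative sketch; the 60/20/20
# proportions and the synthetic data are assumptions).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

# Split off the holdout (test) set first, preserving class proportions.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Split the remainder into training and validation sets (0.25 of 80% = 20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

# Model selection: fit candidate models on X_train, compare them on X_val.
# Model assessment: retrain the chosen model on X_trainval, evaluate once on X_test.
```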
How does K-Fold cross validation work?
- fixed number of 𝑘 partitions of the data (folds)
- in turn: each partition is used for validation and the remaining instances for training
- in total, each instance is used for validation exactly once; stratification may be used
- standard practice: stratified ten-fold cross-validation
- the error rate is estimated by averaging the error rates over the folds
- select the model that performs best over all validation folds
How to compare error rates?
- need variance as well as point estimates
- paired 𝑡-test on the per-fold differences: 𝑡 = (average difference of the error rates · √𝑘) / (standard deviation of the differences in error rates); see the sketch below
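A hedged sketch of stratified ten-fold cross-validation for two candidate models, followed by the paired 𝑡-test on their per-fold error rates; the two models and the synthetic data are placeholder assumptions:

```python
# Stratified 10-fold cross-validation for two models, plus a paired t-test
# on the per-fold error rates (the two models and the data are assumptions).
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

err_a, err_b = [], []
for train_idx, val_idx in skf.split(X, y):
    for model, errs in [(LogisticRegression(max_iter=1000), err_a),
                        (DecisionTreeClassifier(random_state=0), err_b)]:
        model.fit(X[train_idx], y[train_idx])
        errs.append(1 - model.score(X[val_idx], y[val_idx]))  # error rate = 1 - accuracy

d = np.array(err_a) - np.array(err_b)        # per-fold differences in error rates
k = len(d)
t = d.mean() * np.sqrt(k) / d.std(ddof=1)    # t = (average difference * sqrt(k)) / std of differences
print(t, stats.ttest_rel(err_a, err_b))      # the same statistic computed by SciPy
```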
Explain how error measures are calculated from a confusion matrix.
False alarm rate (false positive rate) = FP / (FP + TN): “How many negative instances have been predicted to be positive?”
Specificity = TN / (FP + TN): “How many negative instances have been predicted to be negative?”
Precision = TP / (TP + FP): “How many positively predicted instances have actually been positive?”
Recall (hit rate, sensitivity) = TP / (TP + FN): “How many positive instances have been predicted to be positive?”
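A small sketch that computes these four rates from raw confusion-matrix counts; the example counts are invented:

```python
# Confusion-matrix rates computed from raw counts (the counts are invented).
def rates(tp, fp, fn, tn):
    return {
        "false alarm rate": fp / (fp + tn),  # negatives predicted as positive
        "specificity":      tn / (fp + tn),  # negatives predicted as negative
        "precision":        tp / (tp + fp),  # predicted positives that are truly positive
        "recall":           tp / (tp + fn),  # positives predicted as positive (hit rate)
    }

print(rates(tp=40, fp=10, fn=5, tn=45))
```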
Inference (statistics):
* Goal: explicitly estimate the parameters 𝛽 of the model 𝑓(𝒙) in order to
  - reason about the entire population given data from a limited sample
  - determine whether effects are significant
  - directly interpret the meaning of the parameters 𝛽
Prediction task (machine learning):
* Goal: make ‘good’ predictions ŷ for unseen data points 𝑥
  - the individual parameters/weights of the model are often not of interest unless interpretability/explainability is important
Generalization Errors
Components of generalization error
* Bias is the error from erroneous assumptions in the learning algorithm; it might be due to inaccurate assumptions/simplifications made by the model.
* Variance is the error from sensitivity to small fluctuations in the training set; high variance causes overfitting.
Underfitting: model is too “simple” to represent all relevant characteristics
* high bias and low variance
* high training error and high test error
Overfitting: model is too “complex” and fits irrelevant characteristics/noise
* low bias and high variance
* low training error and high test error
Most model-selection methods are based on a trade-off between fitting error (driving down bias) and model complexity (driving up variance): Akaike Information Criterion, Minimum Description Length, resampling methods. A small illustration follows below.
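A rough illustration of under- and overfitting using polynomial fits of increasing degree; the noisy sine data and the chosen degrees are arbitrary assumptions:

```python
# Under-/overfitting illustration with polynomial fits of increasing degree.
# The noisy sine data and the chosen degrees are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 30)
x_test = rng.uniform(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 4, 15):  # too simple, reasonable, too complex
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_mse, test_mse)

# Degree 1: high training and test error (underfitting, high bias).
# Degree 15: low training error but typically much higher test error (overfitting, high variance).
```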
Comparing Error Rates
Estimated error rate is just an estimate (random).
….
Construct a 𝑡-test statistic:
* need variance as well as point estimates
Student’s paired 𝑡-test tells us whether the means of two samples are significantly different.
𝑡 = (average difference of the error rates) / (observed standard deviation of the differences in error rates / √𝑘)
Example of bootstrapping as an alternative to cross-validation:
Re-sample 500 samples of 𝑛 = 50 with replacement, run logistic regression and examine the distribution of error rates (or other metrics), as sketched below.
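A sketch of that bootstrap example; the synthetic data is an assumption, and evaluating each resampled model on its out-of-bag instances is one reasonable (assumed) way to obtain an error rate per resample:

```python
# Bootstrap estimate of the error-rate distribution: 500 resamples of n=50 drawn
# with replacement, logistic regression fitted to each resample. Evaluating on the
# out-of-bag instances is an assumed choice; the synthetic data is also an assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=50, random_state=0)
rng = np.random.default_rng(0)

error_rates = []
for _ in range(500):
    idx = rng.integers(0, len(y), size=len(y))    # n=50 drawn with replacement
    oob = np.setdiff1d(np.arange(len(y)), idx)    # instances not drawn this round
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    if len(oob) > 0:
        error_rates.append(1 - model.score(X[oob], y[oob]))

print(np.mean(error_rates), np.std(error_rates))  # spread, not just a point estimate
```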
Cost-Sensitive Learning
Most learning schemes minimize total error rate
* Costs were not considered at training time.
* They generate the same classifier no matter what costs are assigned to the different classes.
* Example: standard decision tree learner
Simple methods for cost-sensitive learning (see the sketch below):
* Weighting of instances according to costs
* Resampling of instances according to costs
  − E.g. increase the “no” instances in training, which yields a model that is biased towards avoiding errors on “no” instances. When testing on the original test data set, there will be fewer false positives.
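A brief sketch of both simple approaches, instance weighting and cost-based oversampling; the 5:1 cost ratio, the synthetic data and the decision tree are assumptions:

```python
# Cost-sensitive learning via (1) instance weighting and (2) cost-based resampling.
# The 5:1 cost ratio, the synthetic data and the decision tree are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # let y == 0 play the role of "no"
cost = np.where(y == 0, 5.0, 1.0)   # errors on "no" instances are five times as costly

# (1) Weighting of instances according to costs.
weighted_tree = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=cost)

# (2) Resampling of instances according to costs: duplicate the costly "no" instances,
#     biasing the model towards avoiding errors on them (fewer false positives).
idx = np.repeat(np.arange(len(y)), cost.astype(int))
resampled_tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
```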
Your colleague argues:
“This is the best possible classifier. Its gain curve is above the gain curve of a random classifier, and therefore it clearly dominates a random classifier. Moreover, the lift curve of my classifier will be a constant function at 2, whereas the lift curve of your classifier is monotonically decreasing. We should employ my classifier instead of yours.”
Explain three reasons why you disagree with your colleague’s statement.
a) The colleague evaluates the model on (a subset of) the training set. This indicates overfitting. The model might perform poorly on unseen data and we should not employ it.
b) The lift curve will not be a constant function at 2. It will start at 2.5 and then decrease as soon as the first “-” instance is encountered.
c) The lift curve of my classifier is not monotonically decreasing. It is neither monotonically increasing nor monotonically decreasing.
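For reference, a minimal sketch of how cumulative gain and lift values can be computed from ranked predictions; the scores and labels are invented and do not reproduce the exercise's data:

```python
# Cumulative gain and lift computed from ranked predictions.
# The scores and labels are invented and do not reproduce the exercise's data.
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])   # classifier scores
labels = np.array([1,   1,   0,   1,   0,   1,   0,   0])      # 1 = "+", 0 = "-"

order = np.argsort(-scores)                          # rank instances by descending score
hits = np.cumsum(labels[order])                      # positives found so far
frac_examined = np.arange(1, len(labels) + 1) / len(labels)
gain = hits / labels.sum()                           # fraction of all positives found
lift = gain / frac_examined                          # gain relative to a random ranking

print(gain)   # gain curve values
print(lift)   # lift drops at "-" instances and can rise again at later "+" instances
```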