Chapter 15 Probabilistic Model Selection with AIC, BIC, and MDL Flashcards
It is common to choose a model that performs the best on a hold-out test dataset or to estimate model performance using a resampling technique, such as k-fold cross-validation. What’s an alternative approach? Give an example P 136
Using probabilistic statistical measures that attempt to quantify both the model performance on the training dataset and the complexity of the model. Examples include the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Minimum Description Length (MDL).
What’s one benefit and one limitation of the information criterion statistics? P 136
The benefit of these information criterion statistics is that they do not require a hold-out test set; a limitation is that they do not take the uncertainty of the models into account and may end up selecting models that are too simple.
The simplest reliable method of model selection involves fitting candidate models on a training set, tuning them on a validation dataset, and selecting the model that performs best on a test dataset according to a chosen metric, such as accuracy or error. What is the problem with this approach to evaluation? P 137
A problem with this approach is that only model performance is assessed, regardless of model complexity.
Probabilistic model selection (or ____) provides an analytical technique for scoring and choosing among candidate models. Models are scored both on their ____ and based on the ____. P 137
information criteria, performance on the training dataset, complexity of the model
What’s the definition of model performance and model complexity? P 137
Model Performance. How well a candidate model has performed on the training dataset.
Model Complexity. How complicated the candidate model is after training.
Model performance may be evaluated using a probabilistic framework, such as ____ under the framework of maximum likelihood estimation. Model complexity may be evaluated as ____ aka ____. P 137
Log-likelihood, the number of degrees of freedom, parameters in the model
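As a concrete illustration of evaluating log-likelihood under maximum likelihood estimation (not from the card itself), here is a minimal sketch that fits a Gaussian to data using the closed-form MLE estimates; for a Gaussian, the model complexity would be k = 2 parameters (mean and variance):

```python
import math

def gaussian_log_likelihood(data):
    """Log-likelihood of the data under a Gaussian fit by maximum likelihood.

    MLE estimates: mean = sample mean, variance = biased sample variance.
    At the MLE, the sum of squared deviations term simplifies to n/2, giving
    the closed form -n/2 * (ln(2*pi*var) + 1).
    """
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)
```

This log-likelihood is the "performance" term that AIC, BIC, and MDL all consume; the parameter count k supplies the complexity term.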
A limitation of probabilistic model selection methods is that the same general statistic cannot be calculated across a range of different types of models. Instead, the metric must be carefully derived for each model. True/False P 138
True
There are three statistical approaches to estimating how well a given model fits a dataset and how complex the model is. Each can be shown to be equivalent or proportional to the others, even though each was derived from a different framing or field of study, and each statistic can be calculated using the log-likelihood for a model and the data. What are they, and from which field is each derived? P 138
They are:
Akaike Information Criterion (AIC). Derived from frequentist probability.
Bayesian Information Criterion (BIC). Derived from Bayesian probability.
Minimum Description Length (MDL). Derived from information theory.
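The card does not give the formulas; a minimal sketch using the commonly stated definitions (AIC = 2k − 2 ln L, BIC = k ln n − 2 ln L, and the MDL form (k/2) ln n − ln L, which is proportional to BIC) might look like:

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2*ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k*ln(n) - 2*ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

def mdl(log_likelihood, k, n):
    """Minimum Description Length, in the common form (k/2)*ln(n) - ln(L).

    Note this is exactly BIC/2, which is why MDL and BIC select the
    same model when minimized over the same candidates.
    """
    return 0.5 * k * math.log(n) - log_likelihood
```

Here `log_likelihood` is the maximized log-likelihood on the training data, `k` the number of model parameters, and `n` the number of training examples.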
To use AIC/BIC/MDL for model selection, we simply choose the model giving the ____ (smallest/biggest) AIC/BIC/MDL over the set of models considered. P 139
Smallest
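Selecting by the smallest criterion value can be sketched as follows, using hypothetical candidate models (names, log-likelihoods, and parameter counts are invented for illustration):

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2*ln(L)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical candidates: (name, log-likelihood on training data, parameter count).
candidates = [
    ("linear", -120.0, 2),
    ("quadratic", -112.0, 3),
    ("degree-9 poly", -110.5, 10),
]

# Pick the model with the smallest AIC: the degree-9 polynomial fits the
# training data slightly better, but its 10 parameters outweigh the gain.
best = min(candidates, key=lambda m: aic(m[1], m[2]))
print(best[0])  # → quadratic
```

Swapping `aic` for `bic` or `mdl` changes only the scoring function; the selection rule (take the minimum) is the same.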
“Compared to the BIC method, the AIC statistic penalizes complex models less”, what does it mean? P 139
It means that AIC may put more emphasis on model performance on the training dataset and, in turn, select more complex models.
Unlike the AIC, the BIC penalizes the model more for its complexity, meaning that more complex models will have a worse (larger) score and will, in turn, be less likely to be selected. True/False P 139
True
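The difference is visible in the complexity penalties alone: AIC penalizes each parameter by 2, while BIC penalizes it by ln(n), so BIC's penalty exceeds AIC's as soon as ln(n) > 2 (i.e. n ≥ 8). A small sketch (parameter count chosen arbitrarily):

```python
import math

k = 5  # hypothetical parameter count; both penalties scale linearly in k
for n in (5, 8, 1000):
    aic_penalty = 2 * k            # AIC complexity penalty, independent of n
    bic_penalty = k * math.log(n)  # BIC complexity penalty, grows with n
    print(f"n={n}: AIC penalty {aic_penalty}, BIC penalty {bic_penalty:.2f}")
```

For any realistic training-set size, BIC's penalty is larger, which is why it favors simpler models.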
Given a family of models, including the true model, the probability that BIC will select the correct model approaches one as the sample size N → infinity. And it’s the same for AIC too. True/False P 140
False
Importantly, the derivation of BIC under the Bayesian probability framework means that if a selection of candidate models includes a true model for the dataset, then the probability that BIC will select the true model increases with the size of the training dataset. This cannot be said for the AIC score.
A downside of BIC is that for smaller, less representative training datasets, it is more likely to choose models that are too simple. True/False P 140
True
The MDL principle takes the stance that the best theory for a body of data is one that maximizes the size of the theory plus the amount of information necessary to specify the exceptions relative to the theory True/False P 141
False
The MDL principle takes the stance that the best theory for a body of data is one that minimizes the size of the theory plus the amount of information necessary to specify the exceptions relative to the theory.
The MDL calculation is very similar to BIC and can be shown to be equivalent in some situations. True/False P 141
True