Chapter 15 Probabilistic Model Selection with AIC, BIC, and MDL Flashcards

1
Q

It is common to choose a model that performs the best on a hold-out test dataset or to estimate model performance using a resampling technique, such as k-fold cross-validation. What’s an alternative approach? Give an example. P 136

A

Using probabilistic statistical measures that attempt to quantify both the model performance on the training dataset and the complexity of the model. Examples include the Akaike Information Criterion, the Bayesian Information Criterion, and the Minimum Description Length.

2
Q

What’s one benefit and one limitation of the information criterion statistics? P 136

A

The benefit of these information criterion statistics is that they do not require a hold-out test set, although a limitation is that they do not take the uncertainty of the models into account and may end up selecting models that are too simple.

3
Q

The simplest reliable method of model selection involves fitting candidate models on a training set, tuning them on the validation dataset, and selecting a model that performs the best on the test dataset according to a chosen metric, such as accuracy or error. What is the problem with this approach to evaluation? P 137

A

A problem with this approach is that only model performance is assessed, regardless of model complexity.

4
Q

Probabilistic model selection (or ____) provides an analytical technique for scoring and choosing among candidate models. Models are scored both on their ____ and based on the ____. P 137

A

information criteria, performance on the training dataset, complexity of the model

5
Q

What are the definitions of model performance and model complexity? P 137

A

- Model Performance. How well a candidate model has performed on the training dataset.
- Model Complexity. How complicated the trained candidate model is after training.

6
Q

Model performance may be evaluated using a probabilistic framework, such as ____ under the framework of maximum likelihood estimation. Model complexity may be evaluated as ____ aka ____. P 137

A

Log-likelihood; the number of degrees of freedom; the number of parameters in the model
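
As a minimal Python sketch (not taken from the book; the Gaussian model and toy data are illustrative assumptions), performance can be measured as the log-likelihood of the training data under the maximum likelihood fit, and complexity as the count of free parameters:

    # Performance = training log-likelihood; complexity = number of parameters.
    import numpy as np
    from scipy.stats import norm

    data = np.random.normal(loc=5.0, scale=2.0, size=100)  # toy training data

    # Maximum likelihood estimates for a Gaussian: sample mean and (biased) std.
    mu_hat, sigma_hat = data.mean(), data.std()

    log_likelihood = norm.logpdf(data, loc=mu_hat, scale=sigma_hat).sum()
    num_params = 2  # degrees of freedom: mean and standard deviation
    print(log_likelihood, num_params)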

7
Q

A limitation of probabilistic model selection methods is that the same general statistic cannot be calculated across a range of different types of models. Instead, the metric must be carefully derived for each model. True/False P 138

A

True

8
Q

There are three statistical approaches to estimating how well a given model fits a dataset and how complex the model is. Each can be shown to be equivalent or proportional to the others, although each was derived from a different framing or field of study, and each statistic can be calculated using the log-likelihood for the model and the data. What are they, and from which field is each derived? P 138

A

They are:
- Akaike Information Criterion (AIC). Derived from frequentist probability.
- Bayesian Information Criterion (BIC). Derived from Bayesian probability.
- Minimum Description Length (MDL). Derived from information theory.
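
As a rough sketch (using the common textbook forms of the statistics, written from general knowledge rather than copied from the book; exact constants vary between references), both AIC and BIC can be computed in Python from the log-likelihood, the number of parameters, and the sample size:

    from math import log

    # Common textbook forms; lower is better for both.
    def aic(log_likelihood, num_params):
        return 2 * num_params - 2 * log_likelihood

    def bic(log_likelihood, num_params, num_samples):
        return num_params * log(num_samples) - 2 * log_likelihood

    # MDL is usually stated as a description length; under common assumptions it is
    # proportional to BIC, so the bic() score can stand in for it when ranking models.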

9
Q

To use AIC/BIC/MDL for model selection, we simply choose the model giving ____ (smallest/biggest) AIC/BIC/MDL over the set of models considered. P 139

A

Smallest
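
For example (a self-contained sketch with made-up candidate values, assuming the common textbook BIC formula), selection reduces to taking the argmin over the candidate scores:

    from math import log

    # Hypothetical (log_likelihood, num_params) values for three candidate models.
    candidates = {"linear": (-250.0, 3), "quadratic": (-245.0, 4), "cubic": (-244.5, 5)}
    N = 100  # assumed training-set size

    def bic(log_likelihood, num_params, num_samples):
        return num_params * log(num_samples) - 2 * log_likelihood

    scores = {name: bic(ll, k, N) for name, (ll, k) in candidates.items()}
    best = min(scores, key=scores.get)  # smallest BIC wins
    print(best, scores)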

10
Q

“Compared to the BIC method, the AIC statistic penalizes complex models less.” What does this mean? P 139

A

It means that the AIC may put more emphasis on model performance on the training dataset and, in turn, select more complex models.
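
One way to see this (a standard comparison, not a quotation from the book) is to place the common forms of the two statistics side by side:

    \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln(N) - 2\ln\hat{L}

Here k is the number of parameters, N the number of training samples, and \hat{L} the maximized likelihood. Since \ln(N) > 2 once N >= 8, BIC's per-parameter penalty grows with the dataset size while AIC's stays fixed at 2, so AIC weighs training fit relatively more heavily.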

11
Q

Unlike the AIC, the BIC penalizes the model more for its complexity, meaning that more complex models will have a worse (larger) score and will, in turn, be less likely to be selected. True/False P 139

A

True

12
Q

Given a family of models, including the true model, the probability that BIC will select the correct model approaches one as the sample size N → infinity. The same is true for AIC. True/False P 140

A

False

Importantly, the derivation of BIC under the Bayesian probability framework means that if a selection of candidate models includes a true model for the dataset, then the probability that BIC will select the true model increases with the size of the training dataset. This cannot be said for the AIC score.

13
Q

A downside of BIC is that for smaller, less representative training datasets, it is more likely to choose models that are too simple. True/False P 140

A

True

14
Q

The MDL principle takes the stance that the best theory for a body of data is one that maximizes the size of the theory plus the amount of information necessary to specify the exceptions relative to the theory. True/False P 141

A

False

The MDL principle takes the stance that the best theory for a body of data is one that minimizes the size of the theory plus the amount of information necessary to specify the exceptions relative to the theory.
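
In symbols (a common information-theoretic formulation, stated as an assumption rather than a quotation), MDL selects the hypothesis h that minimizes the total description length of the model plus the data given the model:

    \mathrm{MDL}(h) = L(h) + L(D \mid h)

where L(h) is the number of bits needed to encode the model (the theory) and L(D | h) the bits needed to encode what the model fails to explain (the exceptions). Reading both terms as negative log-probabilities is what connects MDL to the likelihood-based criteria above.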

15
Q

The MDL calculation is very similar to BIC and can be shown to be equivalent in some situations. True/False P 141

A

True

16
Q

The likelihood function for a linear regression model can be shown to be identical to the least squares function. True/False P 142

A

True
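
Because of this equivalence, the information criteria for a linear regression can be computed directly from the training mean squared error. The sketch below is illustrative rather than the book's listing: it uses scikit-learn only for convenience, adopts one common formulation (AIC ≈ N * ln(MSE) + 2k), and drops additive constants, which does not change the ranking of models.

    from math import log
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy regression problem (assumed purely for illustration).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

    model = LinearRegression().fit(X, y)
    mse = np.mean((y - model.predict(X)) ** 2)

    n = len(y)
    k = X.shape[1] + 1  # coefficients plus intercept

    # The Gaussian log-likelihood reduces to a function of MSE (constants dropped).
    aic = n * log(mse) + 2 * k
    bic = n * log(mse) + k * log(n)
    print(aic, bic)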

17
Q

Summary:

- The Akaike and Bayesian Information Criteria are two ways of scoring a model based on its ____ and ____.
- Minimum Description Length provides another scoring method from information theory that can be shown to be equivalent to ____.

A

log-likelihood, complexity, BIC