MDL Flashcards
Rissanen's approach
“We never want to make the false assumption that the observed data actually were generated by a distribution of some kind, say Gaussian, and then go on to analyze the consequences and make further deductions. Our deductions may be entertaining but quite irrelevant to the task at hand, namely, to learn useful properties from the data.”
“Example 1.5 [Models that are Wrong, yet Useful] Even though the models under consideration are often wrong, they can nevertheless be very useful. Examples are the successful ‘Naive Bayes’ model for spam filtering, Hidden Markov Models for speech recognition (is speech a stationary ergodic process? probably not), and the use of linear models in econometrics and psychology. Since these models are evidently wrong, it seems strange to base inferences on them using methods that are designed under the assumption that they contain the true distribution.”
Basics of MDL
“MDL is based on the following insight: any regularity in the data can be used to compress the data, i.e. to describe it using fewer symbols than the number of symbols needed to describe the data literally.”
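As a rough illustration of this insight (a minimal sketch using Python's standard zlib module as a stand-in for a general-purpose compressor; the exact byte counts are not from the text and vary by compressor), a highly regular sequence compresses to far fewer symbols than an irregular one of the same length:

    import random
    import zlib

    random.seed(0)

    # Highly regular sequence: the pattern "ab" repeated 5000 times (10000 bytes).
    regular = b"ab" * 5000
    # Irregular sequence of the same length: 10000 pseudo-random bytes.
    irregular = bytes(random.randrange(256) for _ in range(10000))

    # The regular data shrinks to a few dozen bytes; the random data barely
    # compresses at all, since the compressor finds no regularity to exploit.
    print(len(zlib.compress(regular)))
    print(len(zlib.compress(irregular)))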
The Minimum Description Length (MDL) principle, introduced by Jorma Rissanen in the 1970s, provides a generic solution. It offers:
Occam’s Razor: MDL chooses a model that trades off goodness-of-fit on the observed data with the ‘complexity’ or ‘richness’ of the model.
No overfitting, automatically: MDL procedures automatically and inherently protect against overfitting and can be used to estimate both the parameters and the structure (e.g., number of parameters) of a model. In contrast, to avoid overfitting when estimating the structure of a model, traditional methods such as maximum likelihood must be modified and extended with additional, typically ad hoc principles.
Bayesian interpretation: MDL is closely related to Bayesian inference, but avoids some of the interpretation difficulties of the Bayesian approach.
No need for ‘underlying truth’: In contrast to other statistical methods, MDL procedures have a clear interpretation independent of whether or not there exists some underlying ‘true’ model.
Predictive interpretation: Because data compression is formally equivalent to a form of probabilistic prediction, MDL methods can be interpreted as searching for a model with good predictive performance on unseen data.
Kolmogorov Complexity
The Kolmogorov complexity of a sequence is the length of the shortest program that prints the sequence and then halts. The lower the Kolmogorov complexity of a sequence, the more regular it is. Kolmogorov complexity is uncomputable: no algorithm can calculate it for every string.
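Written out as a formula (the standard definition, supplied here rather than taken from the excerpt above): for a fixed universal computer U, the Kolmogorov complexity of a string x is

    K_U(x) = min { l(p) : U(p) = x }

where l(p) is the length of program p in bits. Uncomputability then means there is no algorithm that, given an arbitrary x, outputs K_U(x).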
Crude, two-part MDL
“To apply MDL to polynomial or other types of hypothesis and model selection, we have to make precise the somewhat vague insight ‘learning may be viewed as data compression’. This can be done in various ways. In this section, we concentrate on the earliest and simplest implementation of the idea. This is the so-called two-part code version of MDL:”
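In its usual formulation (stated here from the standard definition rather than quoted from the excerpt), crude two-part MDL says: among all candidate hypotheses H in the models under consideration, pick the one minimizing the total description length

    L(H) + L(D | H)

where L(H) is the number of bits needed to describe the hypothesis itself, and L(D | H) is the number of bits needed to describe the data D with the help of H. A richer hypothesis (say, a higher-degree polynomial) typically shortens L(D | H) by fitting the data better but lengthens L(H); the sum is what implements the trade-off described under Occam's Razor above.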
Probability Distribution ⇔ Prefix Code
This correspondence allows us to identify codelength functions with probability distributions, such that short codelengths correspond to high probabilities and vice versa.
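Concretely (this is the standard Kraft-inequality argument, summarized here rather than quoted from the text): the codelengths L(x) of any prefix code satisfy

    sum over x of 2^(-L(x)) <= 1        (Kraft inequality)

so P(x) := 2^(-L(x)) defines, up to normalization, a probability distribution. Conversely, for any distribution P there is a prefix code with codelengths L(x) = ceil(-log2 P(x)), so high-probability outcomes get short codewords and vice versa.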
PMF and codelength functions
Matching codes to a distribution
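A minimal sketch of that matching in code (assuming we work with idealized, non-integer codelengths L(x) = -log2 P(x); the example PMF is invented for illustration):

    import math

    # An illustrative probability mass function over four outcomes.
    pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    # Idealized codelength matched to the distribution: L(x) = -log2 P(x).
    codelengths = {x: -math.log2(p) for x, p in pmf.items()}
    print(codelengths)  # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}

    # The Kraft inequality holds with equality, so a prefix code with
    # exactly these lengths exists.
    print(sum(2 ** -length for length in codelengths.values()))  # 1.0

    # Expected codelength under the PMF equals its entropy (1.75 bits here),
    # the best achievable on average for data drawn from this distribution.
    expected = sum(p * codelengths[x] for x, p in pmf.items())
    print(expected)  # 1.75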