C3 Flashcards
what are the two parts of two-part MDL?
first part: use some optimal code to specify the right Turing machine, x
second part: use some optimal code to specify the right content for the tape, y
“right” means that running the specified Turing machine on the specified tape content outputs the given string
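A toy illustration of the two-part idea with a much weaker repertoire than a universal TM (a hypothetical two-decoder scheme, purely illustrative):

```python
# Toy two-part description: part one names a "machine" from an assumed,
# fixed repertoire of two decoders; part two gives its tape content.
def decode(machine, tape):
    if machine == "literal":      # tape is the string itself
        return tape
    if machine == "repeat":       # tape is a (unit, count) pair
        unit, count = tape
        return unit * count

description = ("repeat", ("ab", 8))       # a short two-part description
assert decode(*description) == "ab" * 8   # decodes to "abababababababab"
```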
3 problems with Kolmogorov complexity
- uncomputability (halting problem)
- large constants (depends on the universal TM we choose)
- the universal TM is extremely expressive, and this expressiveness itself leads to serious problems
MDL
Minimum Description Length principle:
given a set of models, the best model M in the set is the one that minimises L(M) + L(D|M)
with L(M) the length in bits of the description of M and L(D|M) the length in bits of the description of the data when encoded with M
example: describing polynomials
L(M) is the length needed to encode the coefficients at a certain precision
L(D|M) is the length needed to encode the residuals: the differences between the actual values and the values given by the model
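A minimal sketch of this selection procedure, assuming 16 bits per coefficient and an idealised Gaussian code for the residuals (both choices are illustrative, not from the card):

```python
import numpy as np

def two_part_mdl(x, y, degree, bits_per_coef=16, sigma=1.0):
    coefs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coefs, x)
    L_M = (degree + 1) * bits_per_coef     # L(M): encode the coefficients
    # L(D|M): idealised code length of the residuals, -log2 Gaussian density
    L_D_given_M = np.sum(0.5 * np.log2(2 * np.pi * sigma**2)
                         + residuals**2 / (2 * sigma**2 * np.log(2)))
    return L_M + L_D_given_M

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 3 * x**2 - x + rng.normal(0, 0.5, size=x.size)   # data from a degree-2 model
best_degree = min(range(6), key=lambda k: two_part_mdl(x, y, k))
print(best_degree)   # MDL typically recovers the generating degree, 2
```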
model class (statistics)
all polynomials
model (statistics)
all polynomials of degree k
point hypothesis
one specific polynomial (with coefficients specified)
problem with crude two-part MDL
- L(M) needs to be defined
- multiple code words for a single dataset exist, which cannot be optimal (the two-part code is incomplete)
universal coding
associate a code not with a single model, but with all m in the set of models => construct a code L(D|M) that is small whenever there is an m in M that fits the data well
refined MDL
uses universal codes (guaranteed to be minimax optimal)
developed for three tasks:
1. model selection
2. point hypothesis selection
3. prediction
Kraft’s inequality
for an alphabet S of size s, any prefix code C must satisfy sum_{x in S} s^{-L_C(x)} <= 1
=> code lengths are probabilities, and a probability distribution is a code
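A small illustration of both directions of this correspondence, for an assumed binary code alphabet (s = 2):

```python
import math

code = {"a": "0", "b": "10", "c": "110", "d": "111"}   # a prefix code
kraft_sum = sum(2 ** -len(word) for word in code.values())
assert kraft_sum <= 1   # Kraft's inequality (here it holds with equality)

# the other direction: a distribution yields idealised code lengths -log2 p
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = {x: -math.log2(p) for x, p in probs.items()}
print(lengths)          # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}
```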
Normalised Maximum Likelihood (NML) code
example of a universal code: the code given by the normalised distribution of maximum likelihoods
P_nml(D) = P_{θ̂(D)}(D) / sum_{D'} P_{θ̂(D')}(D'), where the sum runs over all possible datasets D'
the likelihood distribution P_θ is the probability of the data given parameter(s) θ, and the maximum likelihood estimator is θ̂(D) = argmax_{θ in Θ} P_θ(D)
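A minimal sketch of P_nml for an assumed Bernoulli model over binary sequences of length n; the normaliser sums over all 2^n possible datasets, so this is only feasible for small n:

```python
from itertools import product

def ml_prob(seq):
    """P_{θ̂(D)}(D): the likelihood of seq under its own ML estimate."""
    n, k = len(seq), sum(seq)
    theta = k / n
    return theta**k * (1 - theta)**(n - k)   # 0**0 == 1 covers k = 0 and k = n

def p_nml(seq):
    normaliser = sum(ml_prob(d) for d in product([0, 1], repeat=len(seq)))
    return ml_prob(seq) / normaliser

print(p_nml((1, 1, 0, 1)))   # NML probability of one length-4 sequence
```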
prequential plug-in code
we want to encode messages over some alphabet, but initially we don’t know anything
1. start with some low count and define probabilities accordingly (we don’t like probabilities of 0)
2. transmit symbols one by one and update counts, meanwhile compute probabilities
the plug-in probabilities converge to the optimal maximum likelihood estimate as more symbols are seen
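A sketch of this scheme as a code-length computation, assuming Laplace smoothing (start every count at 1) as the “low count”:

```python
import math
from collections import Counter

def prequential_code_length(message, alphabet):
    counts = Counter({a: 1 for a in alphabet})     # pseudo-counts avoid p = 0
    total_bits = 0.0
    for symbol in message:
        p = counts[symbol] / sum(counts.values())  # plug-in probability
        total_bits += -math.log2(p)                # bits to transmit this symbol
        counts[symbol] += 1                        # update counts after sending
    return total_bits

print(prequential_code_length("aababaa", alphabet="ab"))
```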
what does MDL give us?
MDL gives us the best model within the class of models considered (more modest than Kolmogorov complexity, but we can compute it). It does not assume one model to be the “true” model