Complexity/Selection/Regularization Flashcards

1
Q

What is the bias-variance decomposition?

A

E[(Y - f̂_D(x))^2] = Bias^2 + Variance + σ^2 (irreducible noise), where bias and variance are taken over random training sets D.
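
A minimal numpy sketch (illustrative data and model, not from the deck) that estimates the three terms by simulation, fitting a cubic polynomial to noisy sine data over many training sets:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                       # true function
sigma = 0.3                      # noise std, so irreducible error is sigma^2
x_test, n, degree, reps = 1.0, 30, 3, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0.0, 2 * np.pi, n)
    y = f(x) + rng.normal(0.0, sigma, n)   # draw a fresh training set D
    coef = np.polyfit(x, y, degree)        # fit f_hat_D
    preds[r] = np.polyval(coef, x_test)    # predict at a fixed test point

bias2 = (preds.mean() - f(x_test)) ** 2
var = preds.var()
print(f"bias^2={bias2:.4f}  variance={var:.4f}  noise={sigma**2:.4f}")
print(f"sum={bias2 + var + sigma**2:.4f}  # approx E[(Y - f_hat(x))^2]")
```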

2
Q

What does high bias in a model indicate?

A

A systematic preference for a particular kind of solution (e.g., a linear model), also known as inductive bias; too strong an inductive bias makes the model underfit.

3
Q

What does high variance in a model indicate?

A

The estimated model changes significantly when trained on different data sets, indicating overfitting.

4
Q

What is the VC dimension?

A

The maximum number of points that can be shattered by a set of classifiers: for every possible labeling of those points, at least one member of the set classifies them all correctly.

5
Q

What is the VC dimension of a linear classifier on R^p?

A

VC = p + 1
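
A worked example for p = 2, sketched as a math block (the shattering argument is standard; the notation is mine):

```latex
% p = 2: halfplanes in the plane.
% Three non-collinear points: all 2^3 = 8 labelings are realized by some line,
% so 3 points can be shattered.
% Four points: an XOR-style labeling (alternating classes) is never linearly
% separable, so no 4 points can be shattered.
\[
  \mathrm{VC}\bigl(\text{linear classifiers on } \mathbb{R}^2\bigr) = 2 + 1 = 3
\]
```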

6
Q

How are Degrees of Freedom (DF) defined for an estimate ŷ = f̂(X)?

A

df(ŷ) = (1/σ^2) Σ_i cov(ŷ_i, y_i) = (1/σ^2) tr(cov(ŷ, y))
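
A small simulation sketch (fixed design X, OLS fit; the data and names are mine) checking that this definition recovers df = p for linear regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, reps = 50, 5, 1.0, 4000
X = rng.normal(size=(n, p))            # fixed design, fresh noise each rep
beta = rng.normal(size=p)

Y = np.empty((reps, n))
Yhat = np.empty((reps, n))
for r in range(reps):
    y = X @ beta + rng.normal(0.0, sigma, n)
    Y[r] = y
    Yhat[r] = X @ np.linalg.lstsq(X, y, rcond=None)[0]   # OLS fit

# empirical sum_i cov(yhat_i, y_i), divided by sigma^2
cov_sum = sum(np.cov(Yhat[:, i], Y[:, i])[0, 1] for i in range(n))
print(f"estimated df ≈ {cov_sum / sigma**2:.2f}  (true value: p = {p})")
```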

7
Q

What is an intuition/interpretation of degrees of freedom in model complexity, defined for an estimate ŷ = f̂(X)?

A

Degrees of freedom measure the effective number of independent parameters the model can adjust to fit the data; the more degrees of freedom, the more flexible the model and the more closely ŷ can track the observed y.

8
Q

Relationship between complexity, bias, variance and total error

A

As model complexity increases, bias decreases and variance increases; the total error (their sum plus noise) follows a convex U-shape in between, with its minimum at an intermediate complexity.

9
Q

What does a good bias require?

A

Domain knowledge.

10
Q

What is the relationship between degrees of freedom, number of samples, number of features, and λ in ridge regression?

A

In ridge regression, the effective degrees of freedom depend on the features (through the singular values d_j of X) and on the penalty λ: df(λ) = Σ_j d_j^2 / (d_j^2 + λ). At λ = 0 this equals the number of linearly independent features; as λ increases, the degrees of freedom shrink toward 0, reducing model complexity and helping to prevent overfitting, especially when the number of samples is small relative to the number of features.
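
A quick sketch of this formula (random illustrative X; notation as above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
d = np.linalg.svd(X, compute_uv=False)      # singular values of X

for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    df = np.sum(d**2 / (d**2 + lam))
    print(f"lambda={lam:7.1f}  df={df:5.2f}")  # falls from p = 10 toward 0
```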

11
Q

What is the PRESS statistic?

A

Predicted Residual Error Sum of Squares: PRESS = Σ(y_i - ŷ_-i)^2, where ŷ_-i is the prediction for the i-th sample when the model is estimated on all but the i-th sample.
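
For linear regression, PRESS has a closed form via the hat matrix, e_{-i} = e_i / (1 - h_ii), so no refitting is needed; a minimal sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # with intercept
y = X @ rng.normal(size=p + 1) + rng.normal(0.0, 0.5, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
e = y - H @ y                            # ordinary residuals
press = np.sum((e / (1 - np.diag(H))) ** 2)
print(f"PRESS = {press:.3f}")
```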

12
Q

Higher model complexity effect on performance? What to do?

A

Higher complexity always gives a better fit on the training data but not necessarily on test data; we therefore need to select a model by estimating its performance on both training and validation/test data.

13
Q

What is a method for validation?

A

Cross-validation: estimate the generalization error by averaging over different train/test splits.

14
Q

What is Leave-one-out Cross-Validation (LOO-CV)?

A

A method where the model is trained on all but one sample and tested on the left-out sample, repeated for all samples. The average prediction error is reported.

15
Q

What is k-fold Cross-Validation?

A

A method where data is split into k subsets, the model is trained on k-1 subsets and tested on the remaining one, repeated k times. Often k=5 or k=10 is used in practice.
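
A minimal numpy sketch of k-fold CV (the ridge-style model and data here are illustrative assumptions, not from the deck):

```python
import numpy as np

def kfold_mse(X, y, k=5, lam=1.0, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errs = []
    for fold in np.array_split(idx, k):      # each fold is held out once
        train = np.setdiff1d(idx, fold)
        # ridge fit on the k-1 training folds
        A = X[train].T @ X[train] + lam * np.eye(X.shape[1])
        beta = np.linalg.solve(A, X[train].T @ y[train])
        errs.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(0.0, 0.3, 100)
print(f"5-fold CV MSE: {kfold_mse(X, y):.4f}")
```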

16
Q

What is the relationship between expected prediction error and expected training error?

A

expected prediction error = expected training error + constant (> 0) × df(model); for squared loss the constant is 2σ^2/n, so the optimism of the training error grows with the model's degrees of freedom.

17
Q

What is a better selection criterion?

A

Training error + complexity penalty. The larger the complexity (which lowers training error but tends to hurt generalization), the larger the penalty that keeps it in check!

18
Q

If two models fit the data equally well, which should we select?

A

The one with lower complexity!

19
Q

What is the problem with cross-validation?

A

If we test too many models, we end up overfitting to the validation data: the best cross-validation score becomes an optimistic estimate.

20
Q

What is the Bayes factor in model selection?

A

The ratio of marginal likelihoods: pr(x|m_i) / pr(x|m_j)

21
Q

What is the Bayes Information Criterion (BIC)?

A

BIC(x;m) = -2 log pr(x|θ̂,m) + p log(n), where θ̂ is the maximum likelihood estimate, p is the number of parameters, and n is the number of samples.
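
A sketch of BIC for a Gaussian linear model, using the profiled ML variance σ̂^2 = RSS/n (counting the noise variance as a parameter is a convention choice; data is illustrative):

```python
import numpy as np

def bic_linear(X, y):
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)  # profiled Gaussian
    k = p + 1                     # coefficients plus the noise variance
    return -2 * loglik + k * np.log(n)

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = X[:, :2] @ np.array([1.5, -2.0]) + rng.normal(0.0, 1.0, 80)
print(f"BIC, 4 features: {bic_linear(X, y):.1f}")
print(f"BIC, 2 features: {bic_linear(X[:, :2], y):.1f}")  # typically lower
```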

22
Q

What is the Fisher Information Approximation (FIA)?

A

FIA(x;m) = -log pr(x|θ̂,m) + (p/2) log(n/(2π)) + log C_m, where C_m is the geometric complexity.

23
Q

What is the objective of l_k-penalized regression?

A

Minimize ω(θ) + λ||θ||_k^k, where ω(θ) is the loss function and ||θ||_k is the l_k norm of the parameters.

24
Q

What is the double descent phenomenon?

A

As the number of parameters grows, test error first rises, peaking near the interpolation threshold, then decreases again beyond it, forming a double descent curve.
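
A rough simulation sketch (random cosine features and minimum-l_2-norm least squares, all illustrative assumptions) in which the test error typically peaks near the interpolation threshold p ≈ n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_test, sigma = 40, 500, 0.1
x = rng.uniform(-1, 1, n)
x_te = rng.uniform(-1, 1, n_test)
y = np.sin(np.pi * x) + rng.normal(0.0, sigma, n)
y_te = np.sin(np.pi * x_te)

w = rng.normal(scale=3.0, size=200)                   # shared random frequencies
feats = lambda x, p: np.cos(np.outer(x, w[:p]) + 1.0)

for p in [5, 20, 40, 80, 200]:   # p = n = 40 is the interpolation threshold
    beta = np.linalg.lstsq(feats(x, p), y, rcond=None)[0]  # min-norm when p > n
    mse = np.mean((feats(x_te, p) @ beta - y_te) ** 2)
    print(f"p={p:4d}  test MSE={mse:10.4f}")   # usually peaks near p = n
```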

25
Q

What is implicit regularization in the context of gradient descent?

A

The gradient descent optimization process itself can act as a regularizer, especially in deep neural networks.

26
Q

How can increasing the number of features reduce model complexity?

A

When using a minimum l_2-norm estimator, increasing features can lead to implicit regularization.

27
Q

Why might overparameterized models be beneficial for complex data?

A

They can capture complex patterns while still benefiting from regularization, which is part of the success behind deep learning.

28
Q

How does the l_1 norm penalty (Lasso) differ from the l_2 norm penalty (Ridge) in regularization?

A

The l_1 penalty tends to produce sparse solutions (some coefficients exactly zero), while the l_2 penalty shrinks all coefficients but rarely to exactly zero.
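
A short contrast sketch, assuming scikit-learn is available (illustrative data, two informative features out of ten):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # only the first 2 features matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.5, 100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("lasso coefficients at exactly zero:", np.sum(lasso.coef_ == 0.0))
print("ridge coefficients at exactly zero:", np.sum(ridge.coef_ == 0.0))
```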

29
Q

What is the “No Free Lunch” theorem in machine learning?

A

There is no single model that works best for every problem; the best model depends on the specific data and domain.

30
Q

What is Occam’s razor in the context of model selection?

A

If two models fit the data equally well, we should select the simpler one.

31
Q

How does the AIC (Akaike Information Criterion) differ from BIC?

A

AIC uses a fixed penalty of 2p, while BIC uses p log(n); since log(n) > 2 once n > e^2 ≈ 7.4, BIC is more conservative for large n.

32
Q

What is the marginal likelihood in Bayesian model selection?

A

pr(x|m) = ∫ pr(x|θ,m)pr(θ|m)dθ, integrating over all possible parameter values.
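
A naive Monte Carlo sketch of this integral for a toy 1-D Gaussian model (model and prior are my illustrative choices; this estimator scales poorly in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=20)       # "observed" data

# model m: x_i ~ N(theta, 1) with prior theta ~ N(0, 1)
thetas = rng.normal(0.0, 1.0, size=100_000)          # draws from the prior
loglik = (-0.5 * (x[None, :] - thetas[:, None]) ** 2).sum(axis=1) \
         - 0.5 * len(x) * np.log(2 * np.pi)
print(f"Monte Carlo pr(x|m) ≈ {np.exp(loglik).mean():.3e}")
```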

33
Q

How can gradient descent act as an implicit regularizer?

A

It biases the solution towards certain regions of the parameter space, often simpler or smoother solutions; e.g., on underdetermined least squares, gradient descent initialized at zero converges to the minimum-l_2-norm solution.
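
A sketch of that bias (illustrative sizes): plain gradient descent from a zero initialization lands on the minimum-l_2-norm interpolating solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))           # n < p: infinitely many interpolators
y = rng.normal(size=20)

beta = np.zeros(100)                     # the zero initialization matters
for _ in range(20_000):
    beta -= 1e-3 * X.T @ (X @ beta - y)  # GD on 0.5 * ||y - X beta||^2

min_norm = np.linalg.lstsq(X, y, rcond=None)[0]   # minimum-l2-norm solution
print(f"||gd - min_norm|| = {np.linalg.norm(beta - min_norm):.2e}")
```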

34
Q

What is the geometric interpretation of l_1 and l_2 regularization?

A

The l_2 penalty forms circular (2D) or spherical (higher-dimensional) constraint regions, while l_1 forms diamond-shaped regions with corners on the axes; the corners are why l_1 solutions are often exactly sparse.

35
Q

How does the effective degrees of freedom change as the ridge penalty λ increases?

A

The effective degrees of freedom decrease as λ increases, indicating a simpler model.