Weeks 5-8 Flashcards

1
Q

What is the formula for AIC?

A

AIC = -2 * log-likelihood(theta_hat_MLE) + 2d

It is the training error plus a model-complexity penalty, where d is the number of model parameters.

2
Q

What is the formula for BIC?

A

BIC = -2 * log-likelihood(theta_hat_MLE) + log(N) * d

where N is the sample size and d is the number of model parameters.
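As a quick illustration of both criteria, here is a minimal numpy sketch for a Gaussian linear model; the simulated data, and the convention of counting only the regression coefficients in d, are assumptions of the example, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3                                      # sample size N, parameter count d
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # MLE of B under Gaussian errors
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n                     # MLE of the noise variance
log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

aic = -2 * log_lik + 2 * d                         # training error + complexity penalty
bic = -2 * log_lik + np.log(n) * d                 # heavier penalty once log(N) > 2
print(aic, bic)                                    # lower is better for both
```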

3
Q

What are the differences between AIC and BIC?

A
  1. BIC puts a heavier penalty on model complexity: log(N) * d exceeds AIC's 2d whenever N >= 8.
  2. AIC tends to work better when N is small, while BIC takes the sample size into account directly through its log(N) factor.
4
Q

What is Holdout Validation? And what error does it approximate?

A

Holdout Validation is reserving a portion of the data for validation after the model is trained.

It approximates Prediction Error.
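A minimal sketch of a holdout split; the 75/25 ratio and the least-squares model are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)

idx = rng.permutation(len(y))               # shuffle before splitting
split = int(0.75 * len(y))                  # reserve the last 25% for validation
train, val = idx[:split], idx[split:]

beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # fit on the training set only
val_mse = np.mean((y[val] - X[val] @ beta) ** 2)            # approximates prediction error
print(val_mse)
```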

5
Q

How does the size of the training set vs validation set impact holdout validation?

A

Smaller training set: tends to produce simpler models (complex models overfit the limited data and validate poorly).

Smaller validation set: tends to produce more complex models and gives a noisier, poorer approximation of the error.

6
Q

What is Cross-validation? And what error does it approximate?

A

It divides the data into K sets (K >= 2); for the k-th set, fit the model on the other K-1 parts and then predict on the k-th set. Do this for each of the K sets.

It estimates Prediction Error and is a poor estimate of Expected Prediction Error (because the folds are so highly dependent on each other).
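A minimal K-fold sketch; least squares as the model, and the random splits (see the assumptions card below), are assumptions of the example.

```python
import numpy as np

def kfold_mse(X, y, K=5, seed=0):
    """K-fold cross-validation estimate of prediction error for least squares.
    Setting K = len(y) gives leave-one-out cross-validation (next card)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)   # random splits of the data
    errors = []
    for k in range(K):
        val = folds[k]                                   # hold out the k-th part
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[val] - X[val] @ beta) ** 2))
    return np.mean(errors)                               # average over the K folds
```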

7
Q

What should you consider when choosing K in cross-validation?

A

K = N (known as leave-one-out cross-validation) is the best approximation of the expected prediction error, but it can be computationally costly for large N.

8
Q

What are the assumptions made when using any validation method?

A

The data are i.i.d.: independent and identically distributed. Additionally (in cross-validation), the K splits are random.

9
Q

What are the limitations of the validation methods?

A

When choosing between many models, there is a tendency to overfit to the validation set (i.e., low bias^2 but high variance).

10
Q

What is regularisation?

A

Any modification we make to a learning algorithm that is intended to reduce the generalisation error, but not its training error.

11
Q

What is Ridge Regression? And what is its formula?

A

Ridge regression shrinks the coefficients B towards zero, with the amount of shrinkage set by a pre-specified control parameter lambda > 0. The formula optimises an error term plus a model-complexity term:

B_hat = argmin_B { sum_i( yi - B0 - sum_j(Bj*xij) )^2 + lambda * sum_j(Bj^2) }

The second term penalises model complexity, with its weight set by the pre-chosen lambda.
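A sketch of the closed-form solution; it assumes the columns of X are standardised and y is centred, so the unpenalised intercept B0 can be dropped.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: B = (X'X + lam*I)^(-1) X'y.
    Assumes standardised X and centred y, so no intercept is fitted."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

Larger lambda shrinks every Bj towards 0, but (unlike LASSO) never exactly to 0.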

12
Q

How does ridge regression compare to linear regression?

A
  1. Leads to a larger bias but smaller variance (i.e., generalises better to new data).
  2. It can be shown that there is a nonzero lambda that makes the expected prediction error smaller than that of linear regression.
13
Q

What is LASSO Regression? And what is its formula?

A

It adds a penalty on the sum of the absolute values of the coefficients to the ordinary least-squares objective.

B_hat = argmin_B { sum_i( yi - B0 - sum_j(Bj*xij) )^2 + lambda * sum_j|Bj| }

It can also be used for variable selection, since for large enough values of lambda it sets some coefficients exactly to 0.
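There is no closed form because of the absolute values; one standard approach is coordinate descent with soft-thresholding. A sketch under the same standardised-X, centred-y assumptions as the ridge example:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for sum_i(yi - xi'B)^2 + lam * sum_j|Bj|."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ b + X[:, j] * b[j]   # residual with coordinate j added back
            b[j] = soft_threshold(X[:, j] @ r, lam / 2) / (X[:, j] @ X[:, j])
    return b                                 # large lam sets some Bj exactly to 0
```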

14
Q

Compare LASSO and Ridge Regression.

A
  1. Usually Ridge gets a better prediction error than LASSO.
  2. LASSO can be used for variable selection.
  3. LASSO can outperform Ridge when many of the ‘true’ coefficients are zero.
15
Q

How do Ridge and LASSO Regression incorporate prior knowledge on the true coefficients?

A

They place a prior distribution on the coefficients (Ridge = Gaussian prior; LASSO = Laplacian prior) and combine it with the likelihood via Bayes' Theorem:

P(B|D) ~ P(D|B) * P(B)
Posterior ~ Likelihood * Prior
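For example, with Gaussian noise of variance sigma^2 and a Gaussian prior Bj ~ N(0, tau^2), maximising the posterior recovers exactly the ridge objective; a sketch of the derivation (sigma and tau are assumed quantities of the example):

```latex
\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta}\, \log P(D \mid \beta) + \log P(\beta)
  = \arg\min_{\beta}\, \sum_{i} \bigl(y_i - x_i^{\top}\beta\bigr)^{2}
      + \lambda \sum_{j} \beta_j^{2},
  \qquad \lambda = \sigma^{2}/\tau^{2}.
```

A Laplacian prior yields the sum-of-|Bj| penalty of LASSO in the same way.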

16
Q

What is the 0-1 Loss?

A

L(k, l) = { 1, k != l
          { 0, k = l

Essentially a unit loss is incurred in the case of misclassification.

17
Q

What is Bayes Classifier and how does it relate to the 0-1 loss?

A

Bayes classifier states that you should classify individual x into class c if and only if:

pc(x) >= pj(x) for all j = 1,…,C.

Bayes classifier is optimal under the 0-1 loss.
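A one-line sketch, assuming the class probabilities pc(x) have already been computed as an (n x C) array:

```python
import numpy as np

def bayes_classify(posteriors):
    """Assign each x to the class c with the largest pc(x);
    this rule is optimal under the 0-1 loss."""
    return np.argmax(posteriors, axis=1)   # ties broken by the lowest class index
```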

18
Q

What is the decision boundary in classification?

A

The boundary in the input space that separates the regions where different classes are predicted by the Bayes classifier.

19
Q

What is Logistic Regression?

A

A statistical method used to analyse the relationship between a binary outcome variable (yes/no, 0/1) and one or more predictor variables.

20
Q

What is the maximum likelihood function for Logistic Regression?

A

p(y | X, B) = prod_i [ p1(xi)^yi * (1 - p1(xi))^(1-yi) ], to be maximised over B

For this, we assume the Bernoulli distribution, which has the probability mass function:

p(y | pi) = pi^y * (1 - pi)^(1-y)
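There is no closed-form maximiser, so the log-likelihood is usually maximised numerically; a minimal gradient-ascent sketch (the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Maximise the Bernoulli log-likelihood by gradient ascent.
    X should include a leading column of ones for the intercept B0."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ b)                  # p1(xi) for every observation
        b += lr * X.T @ (y - p) / len(y)    # gradient of the log-likelihood
    return b
```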

21
Q

How do you measure accuracy for classification?

A

Use a confusion matrix with the number of false positives, false negatives, true positives, and true negatives. Accuracy is the number of true positives and true negatives over the total number of data points (N).
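A small sketch computing the matrix and accuracy for binary labels (0/1 coding is assumed):

```python
import numpy as np

def confusion_and_accuracy(y_true, y_pred):
    """2x2 confusion matrix [[TN, FP], [FN, TP]] and accuracy = (TP + TN) / N."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return np.array([[tn, fp], [fn, tp]]), (tp + tn) / len(y_true)
```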

22
Q

What is the formula for the probability of getting a True value for xi when there are binary outcomes?

A

p1(xi) = exp(B0 + B1xi1 + … + Bpxip) / (1 + exp(B0 + B1xi1 + … + Bpxip))
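A tiny numeric check of the formula with made-up coefficients (the values of B0 and B1 are illustrative assumptions):

```python
import numpy as np

beta = np.array([-1.0, 2.0])           # B0, B1 (illustrative values)
xi = np.array([1.0, 0.75])             # leading 1 stands in for the intercept
eta = beta @ xi                        # B0 + B1*xi1 = 0.5
p1 = np.exp(eta) / (1 + np.exp(eta))   # ≈ 0.622, always strictly in (0, 1)
print(p1)
```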