Weeks 5-8 Flashcards
What is the formula for AIC?
-2 * Log-likelihood(theta-hat_MLE) + 2d, where d is the number of fitted parameters.
It is the training error plus a model complexity penalty.
What is the formula for BIC?
-2 * Log-likelihood(theta-hat_MLE) + log(N) * d, where N is the sample size.
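A minimal numerical sketch of both formulas for a Gaussian linear model (the simulated data, and the convention that d counts the regression coefficients plus the noise variance, are assumptions of this sketch rather than part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)              # MLE of the coefficients
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = rss / n                                          # MLE of the noise variance
log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)     # maximised log-likelihood

d = X.shape[1] + 1            # assumed convention: coefficients + noise variance
aic = -2 * log_lik + 2 * d
bic = -2 * log_lik + np.log(n) * d
print(aic, bic)
```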
What are the differences between AIC and BIC?
- BIC puts a heavier penalty on model complexity (its penalty log(N)*d exceeds AIC's 2d once N >= 8), so it tends to select simpler models.
- AIC tends to work better when n is small, while BIC takes the sample size into account through its penalty.
What is Holdout Validation? And what error does it approximate?
Holdout Validation reserves a portion of the data for validation; the model is trained on the remaining data and then evaluated on the held-out portion.
It approximates Prediction Error.
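A minimal sketch of holdout validation with scikit-learn (the simulated data, the linear model, and the 25% split are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=200)

# Reserve 25% of the data; it is never seen during fitting.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
val_error = mean_squared_error(y_val, model.predict(X_val))   # approximates prediction error
```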
How does the size of the training set vs validation set impact holdout validation?
Smaller training set: tends to produce simpler models.
Smaller validation set: tends to produce more complex models (since more data is available for training) and gives a poorer approximation of the error.
What is Cross-validation? And what error does it approximate?
It divides the data into K folds (K >= 2); for each fold k, fit the model on the other K-1 folds and then predict on fold k. Repeat this for all K folds.
It estimates Prediction Error but is a poor estimate of Expected Prediction Error (because the K folds are highly dependent on each other).
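A sketch of K-fold cross-validation that mirrors the description above (K = 5 and the simulated data/model are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=150)

K = 5
fold_errors = []
for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # fit on the other K-1 folds
    pred = model.predict(X[test_idx])                            # predict on the held-out fold
    fold_errors.append(mean_squared_error(y[test_idx], pred))

cv_error = np.mean(fold_errors)   # cross-validation estimate of the error
```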
What should you consider when choosing K in cross-validation?
K = N (known as leave-one-out cross-validation) is the best approximation of the expected prediction error but can be computationally costly for large N. Smaller K (e.g., 5 or 10) is cheaper, but each model is trained on a smaller subset, so the error estimate can be more biased.
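For K = N, a small sketch using scikit-learn's LeaveOneOut, which is just K-fold with one observation per fold (the data and model are again illustrative):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = X[:, 0] + rng.normal(scale=0.2, size=40)

# N model fits (one per observation), so this can be slow for large N.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
loocv_error = -scores.mean()
```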
What are the assumptions made when using any validation methods?
The data is i.i.d.: independent and identically distributed. Also (in cross-validation), the K splits are random.
What are the limitations of the validation methods?
When choosing between many models, there is a tendency to overfit to the validation set (i.e., low bias^2 but high variance).
What is regularisation?
Any modification we make to a learning algorithm that is intended to reduce the generalisation error, but not its training error.
What is Ridge Regression? And what is its formula?
It shrinks the coefficients B by restricting their squared magnitude; equivalently, it adds a penalty term weighted by a pre-specified control parameter lambda > 0. The formula optimises an error term plus a model complexity term.
B_hat = argmin_B { sum_i ( yi - B0 - sum_j Bj*xij )^2 + lambda * sum_j Bj^2 }
The second term penalises model complexity, with strength set by the pre-chosen lambda.
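A sketch of the closed-form ridge solution implied by this formula (the value of lambda, the simulated data, and centring the data so that B0 is not penalised are assumptions of the illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=n)

lam = 1.0
Xc = X - X.mean(axis=0)            # centre so the intercept is not penalised
yc = y - y.mean()

# Ridge solution: (X'X + lambda*I)^(-1) X'y on the centred data
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
beta0 = y.mean() - X.mean(axis=0) @ beta_ridge   # recover the intercept
```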
How does ridge regression compare to linear regression?
- Leads to a larger bias but smaller variance (i.e., generalises better to new data).
- It can be shown there is a nonzero lambda that makes the expected prediction error smaller than linear regression.
What is LASSO Regression? And what is its formula?
It adds a penalisation term on the sum of absolute values of the coefficients to normal least squares.
B_hat = argmin_B { sum_i ( yi - B0 - sum_j Bj*xij )^2 + lambda * sum_j |Bj| }
It can also be used for variable selection, since for large enough values of lambda it sets some coefficients exactly to 0.
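A sketch of the variable-selection effect with scikit-learn's Lasso (note scikit-learn calls the penalty parameter alpha rather than lambda; the data and the alpha value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=200)   # only 2 'true' nonzero coefficients

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of the coefficients LASSO kept nonzero
```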
Compare LASSO and Ridge Regression.
- Usually Ridge gets a better prediction error than LASSO.
- LASSO can be used for variable selection.
- LASSO can outperform Ridge when many of the ‘true’ coefficients are zero.
How do Ridge and LASSO Regression incorporate prior knowledge on the true coefficients?
They incorporate prior knowledge of the coefficient distribution (Ridge = Gaussian prior; LASSO = Laplacian prior) via Bayes' Theorem.
P(B|D) ∝ P(D|B) * P(B)
Posterior ∝ Likelihood * Prior
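A sketch of that correspondence, assuming a Gaussian likelihood with noise variance sigma^2 and prior scales tau (Gaussian) and b (Laplace); taking the negative log of Posterior ∝ Likelihood * Prior gives the penalised objectives above:

```latex
\begin{align*}
-\log P(\beta \mid D) &= -\log P(D \mid \beta) - \log P(\beta) + \text{const} \\[4pt]
\text{Gaussian prior (Ridge):}\quad
  & \frac{1}{2\sigma^2}\sum_{i}\Bigl(y_i - \beta_0 - \sum_j \beta_j x_{ij}\Bigr)^2
    + \frac{1}{2\tau^2}\sum_j \beta_j^2 + \text{const} \\[4pt]
\text{Laplace prior (LASSO):}\quad
  & \frac{1}{2\sigma^2}\sum_{i}\Bigl(y_i - \beta_0 - \sum_j \beta_j x_{ij}\Bigr)^2
    + \frac{1}{b}\sum_j \lvert\beta_j\rvert + \text{const}
\end{align*}
```

Minimising these gives the Ridge objective with lambda = sigma^2 / tau^2 and the LASSO objective with lambda = 2*sigma^2 / b, so the prior scale plays the role of the control parameter.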