Weeks 1 - 4 Flashcards

1
Q

What is the Loss Function? What are the three main Loss Functions?

A

It measures the cost of making prediction y-hat when the true value is y (i.e., it is a function of the error in the prediction).
- Squared loss (L2): (y - y-hat)^2
- Absolute Error (L1): |y - y-hat|
- 0-1 loss: 1 if y != y-hat, 0 if y = y-hat
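
A minimal NumPy sketch of the three losses (the function names and example values are illustrative, not from the notes):

import numpy as np

def squared_loss(y, y_hat):
    # L2 loss: (y - y_hat)^2
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    # L1 loss: |y - y_hat|
    return np.abs(y - y_hat)

def zero_one_loss(y, y_hat):
    # 0-1 loss: 1 if the prediction is wrong, 0 if it is correct
    # (typically used with discrete class labels)
    return (y != y_hat).astype(int)

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.5, 2.0])
print(squared_loss(y, y_hat))   # [0.   0.25 1.  ]
print(absolute_loss(y, y_hat))  # [0.  0.5 1. ]
print(zero_one_loss(y, y_hat))  # [0 1 1]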

2
Q

What is the Predictive Function?

A

A function f-hat(x) estimated (trained) on a set of data; for a new input x0, the prediction is f-hat(x0), the predictive function evaluated at x0.

3
Q

What is the Expected Loss?

A

It is the average loss over the joint distribution of (X, Y), i.e., each possible loss weighted by the probability of it occurring. Under squared loss: R(f) = E_X[ E( (Y - f(X))^2 | X ) ].

4
Q

What is the optimal way to choose the predictive function?

A

Choose the function that minimises the expected loss, as it gives the model with the highest accuracy on average. However, in some cases we may care about extreme errors more than the average error, and so we may choose to minimise a different loss.

5
Q

What is the relationship between loss and probability?

A

Under the squared loss, the optimal prediction of Y at any point X = x is the conditional mean E ( Y | X = x ).
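
A small simulation (the linear true relationship 1 + 0.5x and the noise level are illustrative assumptions) showing that, at a fixed point X = x, the conditional mean gives a lower average squared loss than other constant predictions:

import numpy as np

rng = np.random.default_rng(0)
x = 2.0                                # fix a query point X = x
f_x = 1.0 + 0.5 * x                    # true conditional mean E(Y | X = x)
y = f_x + rng.normal(0, 1, 100_000)    # draws of Y given X = x

for c in [f_x, f_x + 0.5, f_x - 1.0]:
    # average squared loss of always predicting c; smallest at c = E(Y | X = x)
    print(c, np.mean((y - c) ** 2))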

6
Q

Explain the No Free Lunch Theorem.

A

There is no single model or approach that works optimally for all problems; the performance of any two models is the same when averaged across all possible problems.

7
Q

Explain the Additive Error Model.

A

It is the basic model for regression, assuming the relationship between Y and X is described as Y = f(X) + ε, where f(X) is the unknown regression function and ε is a random error with mean zero.

8
Q

Explain the difference between a Parametric and a Non-Parametric method.

A

Parametric methods assume an underlying model/pattern that generates the data (e.g., Linear Regression).
Non-Parametric methods don’t assume an underlying model (e.g., KNN Regression).

Parametric methods are faster to use and more interpretable, but make stronger assumptions about the data. Non-Parametric methods are more flexible, but have higher variance and are computationally expensive.

9
Q

What are the Assumptions of Linear Regression?

A
  1. Linearity: The data is generated from an underlying linear model
  2. Exogeneity: The expected value of the error given X is zero (E(ε | X) = 0).
  3. I.I.D. Data: The data is independent and identically distributed.
  4. 4th Moment Exists: No fat tails or large outliers.
  5. Homoskedasticity: The error variance is constant over all values of X.
  6. No Multicollinearity: Two predictors cannot be highly correlated (MLR).
10
Q

Linear Regression is _________ when the LR assumptions hold.

A

Unbiased

11
Q

Linear Regression is “optimal” when…

A

It has the minimum variance among unbiased estimators when the LR assumptions hold.

Note: You can use Linear Regression when some assumptions do not hold, but it will not be optimal.

12
Q

Explain the Maximum Likelihood principle.

A

Estimating the parameters of an assumed probability distribution by maximising a likelihood function such that, under the assumed statistical model, the observed data is most probable.

The Likelihood Function for Linear Regression is:
p(y | β0, β1, σ^2) = prod( 1/sqrt(2πσ^2) * exp( -(yi - β0 - β1*xi)^2 / (2σ^2) ) )
for i = 1,…,n

13
Q

What is the relationship between maximum likelihood (Gaussian) and least squares estimation?

A

Maximising the likelihood function (Gaussian) with respect to β0 and β1 leads to exactly the same least squares estimates β0-hat and β1-hat.
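
A quick numerical check (a sketch only: the simulated data, starting values, and the use of scipy.optimize.minimize are illustrative assumptions): maximising the Gaussian log-likelihood over β0, β1 returns the same values as the closed-form least squares estimates.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.7 * x + rng.normal(0, 1.5, 200)

# Closed-form least squares estimates
b1_ols = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0_ols = y.mean() - b1_ols * x.mean()

def neg_log_lik(params):
    # Gaussian negative log-likelihood in (β0, β1); with σ held fixed,
    # maximising the likelihood is equivalent to minimising the sum of squared residuals
    b0, b1 = params
    resid = y - b0 - b1 * x
    return 0.5 * np.sum(resid ** 2)

mle = minimize(neg_log_lik, x0=[0.0, 0.0]).x
print(b0_ols, b1_ols)   # least squares estimates
print(mle)              # maximum likelihood estimates: numerically the same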

14
Q

What is the hierarchy principle for interaction effects in Linear Regression?

A

If we include an interaction term in a model, the associated main effects should always be included in the model as well

15
Q

What are examples of non-linear transformations?

A

Log transformations
Quadratic effects
Interaction effects

16
Q

How does linear regression work with transformed variables?

A

Linear regression is linear with respect to the transformed variables and non-linear with respect to the original explanatory variables.

17
Q

What is KNN Regression and what is its prediction function?

A

KNN is a non-parametric method that uses the sample average of the k training response values whose corresponding inputs are closest to the query point x to make predictions:
f-hat(x) = 1/k * sum( yi ) over the xi in N_k(x), the set of the k training inputs closest to x.

18
Q

What happens when k = 1 or when k = N?

A

When k = 1, the model uses the closest point’s response value to find the query point’s prediction (overfitting - squiggly line).
When k = N, the model finds the average of all points’ response values for the query point’s prediction (underfitting - flat line).

19
Q

What is the Euclidean Distance Function?

A

It is: d(xi, xl) = sqrt( sum( (xij - xlj)^2 ) ) for j = 1,…,p, which is the L2 norm ||xi - xl||_2.

20
Q

How do you calculate a KNN prediction manually?

A

Find the k nearest points to the target x value using the Euclidean distance, and take the average of the response values of those points.
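
A minimal sketch of a manual KNN regression prediction with the Euclidean distance (NumPy only; the data and function name are illustrative):

import numpy as np

def knn_predict(X_train, y_train, x0, k=3):
    # Euclidean distance from the query point x0 to every training point
    dists = np.sqrt(((X_train - x0) ** 2).sum(axis=1))
    # Indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Prediction = average of their response values
    return y_train[nearest].mean()

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.2, 1.9, 3.1, 9.8])
print(knn_predict(X_train, y_train, x0=np.array([2.5]), k=3))   # averages y for x = 1, 2, 3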

21
Q

How do you rescale the Euclidean Distance to incorporate two predictors measured on different scales?

A

Use the normalised euclidean distance:
d(xi, xl) = sqrt( sum( ((xij - xlj) / Sxj)^2 ) ) for j = 1,…,p,
where Sxj is the sample standard deviation of predictor j in the training sample.

Can also use the Mahalanobis distance:
d(xi, xl) = sqrt( (xi - xl)^T S^-1 (xi - xl) ),
where S is the sample covariance matrix of the predictors.
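
A sketch of both rescaled distances (NumPy; the example training matrix and its units are illustrative assumptions):

import numpy as np

X_train = np.array([[170.0, 65.0],    # e.g. height in cm, weight in kg
                    [160.0, 70.0],
                    [180.0, 80.0],
                    [175.0, 72.0]])
xi, xl = X_train[0], X_train[2]

# Normalised Euclidean distance: divide each coordinate difference by the
# sample standard deviation of that predictor in the training sample
s = X_train.std(axis=0, ddof=1)
d_norm = np.sqrt((((xi - xl) / s) ** 2).sum())

# Mahalanobis distance: uses the full sample covariance matrix of the predictors
S_inv = np.linalg.inv(np.cov(X_train, rowvar=False))
d_maha = np.sqrt((xi - xl) @ S_inv @ (xi - xl))

print(d_norm, d_maha)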

22
Q

What is the curse of dimensionality?

A

As we increase the number of predictors, it becomes exponentially more difficult to find training observations reasonably local to the prediction point x, and as such, KNN breaks down for high p.

23
Q

What impacts the complexity of KNN?

A

The sample size N, the number of predictors p, and the chosen k value. It is computationally expensive.

24
Q

How does KNN extrapolate for values of X not seen in the training set?

A

It still uses the k nearest training points, however far away they are, and so it generally struggles to extrapolate.

25
Q

What is Prediction Error (PE)?

A

Err(M|D) = E(x,y)[ L(y, f-hat(x|D)) ]

Used when the primary goal is to assess how well a model performs in making accurate predictions on unseen data. It focuses on the immediate performance of the model without considering the model complexity or potential overfitting.

  • Model evaluation
  • Model comparison
26
Q

What is Expected Prediction Error (EPE)?

A

Err(M) = ED[Err(M|D)] = ED[E(x,y)[L(Y, f-hat(X|D))]]

It estimates how well a model is expected to perform on average over different datasets, considering both the model’s ability to capture the underlying patterns (bias) and its sensitivity to variations in the data (variance).

  • Model selection
  • Overfitting prevention
  • Understanding model behavior
27
Q

Prediction Error Decomposition (under squared loss) is:

A

Err(M|D) = (f(x0) - f-hat(x0))^2 + σ^2

28
Q

Expected Prediction Error Decomposition (under squared loss) is:

A

Variance(f-hat) + Bias^2(f-hat) + σ^2
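
A small simulation (illustrative assumptions: a sine true function, Gaussian noise, and a 1-D k-NN style estimator) that estimates the three terms at a point x0 and checks they add up to the expected prediction error:

import numpy as np

rng = np.random.default_rng(2)
f = np.sin                      # assumed true regression function
sigma = 0.5                     # noise standard deviation
x0, k, n, reps = 1.0, 5, 50, 5000

preds, errs = [], []
for _ in range(reps):
    # New training set D each repetition
    x = rng.uniform(0, 3, n)
    y = f(x) + rng.normal(0, sigma, n)
    # k-NN prediction at x0: average y over the k nearest training points
    nearest = np.argsort(np.abs(x - x0))[:k]
    f_hat_x0 = y[nearest].mean()
    preds.append(f_hat_x0)
    # Squared prediction error on a fresh test response at x0
    y0 = f(x0) + rng.normal(0, sigma)
    errs.append((y0 - f_hat_x0) ** 2)

preds = np.array(preds)
variance = preds.var()
bias_sq = (preds.mean() - f(x0)) ** 2
print(np.mean(errs))                      # simulated expected prediction error at x0
print(variance + bias_sq + sigma ** 2)    # Variance + Bias^2 + σ^2 (should be close)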

29
Q

How do bias and variance change for a linear model (under squared loss) with different numbers of predictors?

A

The more predictors p, the greater the variance and smaller the bias^2, and vice versa:

Err(M) = σ^2 * sum( x0i^2 ) + (x0^T β - f(x0))^2 + σ^2

30
Q

How do bias and variance change for KNN models (under squared loss) with different values of k?

A

The larger the k, the smaller the variance and the larger the bias^2, and vice versa:

Err(x0) = [ f(x0) - 1/k * sum( f(xl) ) ]^2 + σ^2/k + σ^2, where the sum is over the k nearest neighbours xl of x0.

31
Q

What is the training error and what is the problem for using it in model selection?

A

The average loss over the data points used to estimate the model:

err-bar = 1/n sum( L(yi, f-hat(xi)) )

The training error improves as p increases (LR) or as k decreases (KNN), so selecting models based on training error alone can lead to overfitting.

32
Q

What do overfitting and underfitting mean in the context of the Bias-Variance decomposition?

A

Overfitting means a low bias and high variance (the model fits the training set too closely).

Underfitting means a high bias and low variance (the model is too simple to capture the pattern in the training set).