Weeks 1 - 4 Flashcards
What is the Loss Function? What are the three main Loss Functions?
It measures the cost of prediction y-hat, when the true value is y (i.e., It is a function of the error in the prediction).
- Squared loss (L2): (y - y-hat)^2
- Absolute Error (L1): |y - y-hat|
- 0-1 loss: 1 if y != y-hat, 0 if y = y-hat
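A minimal Python sketch of the three losses (NumPy and these function names are illustrative choices, not part of the course material):

import numpy as np

def squared_loss(y, y_hat):
    # L2 loss: penalises large errors heavily
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    # L1 loss: penalises errors in proportion to their size
    return np.abs(y - y_hat)

def zero_one_loss(y, y_hat):
    # 0-1 loss: 1 if the prediction is wrong, 0 if it is correct
    return np.not_equal(y, y_hat).astype(int)

# e.g. y = 3, y_hat = 1 gives squared loss 4, absolute loss 2, 0-1 loss 1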
What is the Predictive Function?
A function f-hat(x) that is trained on a set of data and then used to predict at a new input x0; f-hat(x0) is the predictive function evaluated at x0.
What is the Expected Loss?
It is the sum of all possible losses, each weighted by the probability of that loss occurring, i.e. the expectation of the loss. Under squared loss: R(f) = E[ (Y - f(X))^2 ] = E[ E( (Y - f(X))^2 | X ) ].
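A rough Monte Carlo sketch of the definition (the toy data-generating process is an assumption): the expected squared loss of a candidate f can be approximated by averaging the loss over many simulated (X, Y) pairs.

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy process: Y = 2X + noise with variance 0.25
x = rng.uniform(0, 1, size=100_000)
y = 2 * x + rng.normal(0, 0.5, size=x.size)

f = lambda z: 2 * z                      # candidate predictive function
risk = np.mean((y - f(x)) ** 2)          # approximates R(f) = E[(Y - f(X))^2]
print(risk)                              # close to the noise variance, 0.25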
What is the optimal way to choose the predictive function?
Choose the function that minimises the expected loss, as this gives the model with the highest accuracy on average. However, in some cases we may care about extreme errors more than the average error, and in those cases we may choose a different loss to minimise.
What is the relationship between loss and probability?
Under the squared loss, the optimal prediction of Y at any point X = x is the conditional mean E ( Y | X = x ).
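A small simulation sketch of this fact (the conditional distribution of Y at a fixed x is an assumed toy choice): among constant predictions at that x, the conditional mean gives the lowest average squared loss.

import numpy as np

rng = np.random.default_rng(1)

# Assume Y | X = x is Normal with mean 5 and sd 2 at this particular x
y = rng.normal(5, 2, size=200_000)

for c in [3.0, 5.0, 7.0]:                # candidate predictions at this x
    print(c, np.mean((y - c) ** 2))      # smallest average squared loss at c = 5, the conditional mean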
Explain the No Free Lunch Theorem.
There is no single model or approach that works optimally for all problems; the performance of any two models is the same when averaged across all possible problems.
Explain the Additive Error Model.
It is the basic model for regression, assuming the relationship between Y and X is described as Y = f(X) + ε, where f(X) is the unknown regression function and ε is a random error with mean zero.
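A short sketch of simulating data from an additive error model (the particular f and noise level are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # the "unknown" regression function, assumed here to be a sine curve
    return np.sin(2 * np.pi * x)

x = rng.uniform(0, 1, size=200)
eps = rng.normal(0, 0.3, size=x.size)    # random error with mean zero
y = f(x) + eps                           # Y = f(X) + epsilon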
Explain the difference between a Parametric and a Non-Parametric method.
Parametric methods assume an underlying model/pattern that generates the data (e.g., Linear Regression).
Non-Parametric methods don’t assume an underlying model (e.g., KNN Regression).
Parametric methods are faster to use and more interpretable, but make stronger assumptions about the data. Non-Parametric methods are more flexible, but have higher variance and are computationally expensive.
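A sketch contrasting the two approaches on the same data, using scikit-learn's LinearRegression (parametric) and KNeighborsRegressor (non-parametric); the data-generating function is an assumption:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=200)

# Parametric: assumes an underlying linear model y = B0 + B1*x
lin = LinearRegression().fit(X, y)

# Non-parametric: averages the y-values of the k nearest training points
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

x0 = np.array([[0.25]])
print(lin.predict(x0), knn.predict(x0))  # the linear fit misses the curvature; KNN adapts to it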
What are the Assumptions of Linear Regression?
- Linearity: The data is generated from an underlying linear model
- Exogeneity: The error has mean zero conditional on the predictors (E(ε | X) = 0).
- I.I.D. Data: The data is independent and identically distributed.
- 4th Moment Exists: No fat tails or large outliers
- Homoskedasticity: The error variance is constant over all values of X.
- No Multicollinearity: No two predictors can be highly correlated (applies to Multiple Linear Regression).
Linear Regression is _________ when the LR assumptions hold.
Unbiased
Linear Regression is “optimal” when…
Its estimates have the minimum variance among unbiased estimators when the LR assumptions hold.
Note: You can use Linear Regression when some assumptions do not hold, but it will not be optimal.
Explain the Maximum Likelihood principle.
Estimating the parameters of an assumed probability distribution by maximising a likelihood function such that, under the assumed statistical model, the observed data is most probable.
The Likelihood Function for Linear Regression is:
p(y | B0, B1, σ^2) = Π_{i=1,…,n} 1/sqrt(2πσ^2) exp( -(y_i - B0 - B1*x_i)^2 / (2σ^2) )
What is the relationship between maximum likelihood (Gaussian) and least squares estimation?
Maximising the Gaussian likelihood function with respect to B0 and B1 leads to exactly the same estimates as least squares, B0-hat and B1-hat.
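A sketch of this equivalence on simulated data (the toy data and the fixed σ^2 are assumptions): minimising the Gaussian negative log-likelihood over B0 and B1 returns the same coefficients as least squares.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=100)
y = 1.5 + 2.0 * x + rng.normal(0, 0.5, size=x.size)   # assumed toy data

# Least squares estimates
X = np.column_stack([np.ones_like(x), x])
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian maximum likelihood: minimise the negative log-likelihood over (B0, B1)
def neg_log_lik(b, sigma2=0.25):
    resid = y - b[0] - b[1] * x
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + resid ** 2 / sigma2)

b_ml = minimize(neg_log_lik, x0=np.zeros(2)).x
print(b_ls, b_ml)   # the two sets of estimates agree up to optimiser tolerance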
What is the hierarchy principle for interaction effects in Linear Regression?
If we include an interaction term in a model, the associated main effects should always be included in the model as well
What are examples of non-linear transformations?
Log transformations
Quadratic effects
Interaction effects
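A sketch of using such transformations as features (the toy data-generating process is an assumption); the model stays linear in the coefficients even though the features are non-linear in the inputs:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x1 = rng.uniform(1, 10, size=300)
x2 = rng.uniform(1, 10, size=300)
y = 2 * np.log(x1) + 0.5 * x2 ** 2 + 0.3 * x1 * x2 + rng.normal(0, 1, size=300)

# Transformed design matrix: log, quadratic, and interaction terms
# (the main effects x1 and x2 are included as well, per the hierarchy principle)
X = np.column_stack([np.log(x1), x2 ** 2, x1 * x2, x1, x2])
fit = LinearRegression().fit(X, y)
print(fit.coef_)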