week 2 - regression Flashcards
what are four reasons why overfitting might occur?
The model is too complex
There is not enough training data
The testing data are drawn from a different distribution to the training data
We have highly correlated variables in multidimensional datasets.
Explain the bias-variance trade off
Bias reflects how well the model fits the training data (the training error)
Variance is how much the model's predictions vary across different datasets
Generally, lower bias (training error) = higher variance.
The goal of ML is to find the sweet spot
why is dimensionality a problem?
More dimensions = more parameters. Estimating more parameters accurately is harder and takes more time and data
As the number of dimensions increases, the volume of the data space grows exponentially, so data points become sparser relative to the available space. This gives poorer estimates of predictions, because the distances between data points, which are what we use to identify patterns in the data, shrink relative to the total volume of the space. More and more data points are needed in each dimension to provide a good estimate of predictions.
Variables will often be highly correlated. This makes it very difficult to estimate model parameters accurately.
It's hard to interpret models with vast numbers of predictors, as it becomes harder to understand what's actually meaningful within the dataset
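One way to see the sparsity point numerically is below: a minimal sketch assuming numpy is available, with made-up point counts and dimensions. As the number of dimensions grows, pairwise distances between random points become more and more alike, so the distances we rely on to find patterns carry less information.

```python
# Illustrative only (numpy assumed): as dimensionality grows, pairwise distances
# between random points become increasingly similar to one another.
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(200, d))               # 200 random points in d dimensions
    diffs = X[:, None, :] - X[None, :, :]        # all pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # Euclidean distance matrix
    pairs = dists[np.triu_indices(200, k=1)]     # each pair counted once
    spread = (pairs.max() - pairs.min()) / pairs.mean()
    print(f"d={d:4d}  (max - min) / mean distance = {spread:.2f}")
```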
at how many dimensions does it become completely impossible to estimate the model parameters?
If you have more dimensions than you have data points.
This is because there are then more parameters to estimate than there are observations to estimate them from, so the model cannot be fitted uniquely
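A hypothetical numpy illustration of this (numbers made up): with more predictors than observations, the matrix OLS has to invert is rank-deficient, so there is no single best-fitting set of slopes.

```python
# 10 observations, 50 predictors: OLS has no unique solution.
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 50
X = rng.normal(size=(n, p))

XtX = X.T @ X                       # the p x p matrix OLS has to invert
print(np.linalg.matrix_rank(XtX))   # 10, not 50 -> infinitely many exact fits
```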
what is regularisation in the context of the bias variance trade off?
Increasing bias to try to reduce variance
We weaken or remove some parameters in the model to simplify the model
This makes the model less accurate in the training dataset, but more generalisable to other datasets
when is regularisation helpful?
When we have correlated predictors in our dataset
When we have high-dimensional and noisy data
why is regularisation good for noisy data with many features?
Because in these datasets some parameters will be overestimated through fitting to noise. These then have an outsized effect on predictions, meaning our model will overfit to the training data
What is ridge (L2) regression?
OLS regression which minimises the sum of squared errors, plus a 'penalty' based on the parameter values
This penalty is essentially an extra loss on the largest predictor coefficients (meaning the predictors that have the highest impact on the dependent variable)
Penalties are higher for models that have many high value parameters
This helps us avoid the problem of the model learning very strong predictive relationships that will not generalise to the testing data
what is the ridge regression cost function formula?
SSE + (sum of squared parameters) × lambda
The penalty is the value of the squared parameters multiplied by lambda
Squaring the parameters means the penalty is always positive.
(parameters = slopes)
So you try to minimise the error as in OLS, but instead of error = SSE, you make the error SSE + (parameters² × lambda). If you plug that into the equation and recalculate the slope and intercept, the line will change slightly.
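As a hedged sketch (numpy assumed, the function and variable names are illustrative, not from the lecture), the cost described above can be written directly:

```python
# Ridge cost as described above: SSE plus lambda times the sum of squared slopes.
import numpy as np

def ridge_cost(X, y, intercept, slopes, lam):
    predictions = intercept + X @ slopes       # the OLS-style prediction line
    sse = np.sum((y - predictions) ** 2)       # ordinary sum of squared errors
    penalty = lam * np.sum(slopes ** 2)        # the intercept is typically not penalised
    return sse + penalty
```

Minimising this, rather than the SSE alone, is what pulls the fitted slopes towards zero.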
what is the OLS equation?
Dependent variable (Y) = y-intercept (B0) + Slope*Independent-variable (B1X) + Error (E)
Extra dimensions add extra IVs and slopes, so terms like B2*A and B3*B may be added (where A and B are extra independent variables)
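A small hedged example (scikit-learn assumed, data simulated) of fitting this equation with two independent variables, so there is an intercept B0 and two slopes B1 and B2:

```python
# Fit Y = B0 + B1*X1 + B2*X2 + E on simulated data and recover the coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))                               # two independent variables
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
print(ols.intercept_, ols.coef_)                            # roughly 1.0 and [2.0, -0.5]
```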
How does the slope relate to the penalty in ridge regression?
When the slope is steep, relatively small changes in X correspond to large changes in y. This means the predictions are very sensitive to small changes in X
As we increase lambda, the slope of our model decreases. This means that as lambda increases, the predictions of our model get less sensitive to changes in X
If lambda is 0, the cost function is the same as ordinary least squares. If lambda is very high, all parameters will end up close to zero.
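A minimal sketch of this shrinkage, assuming scikit-learn (whose Ridge uses alpha where these cards say lambda); the data are simulated just to show the slopes flattening as lambda grows:

```python
# As lambda (alpha) increases, the fitted slopes shrink towards zero.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(size=100)

for lam in [0.01, 1.0, 10.0, 1000.0]:
    model = Ridge(alpha=lam).fit(X, y)
    print(lam, np.round(model.coef_, 3))   # coefficients get smaller as lambda grows
```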
What is the cost function for logistic ridge regression?
Maximum likelihood. This is because the cost function for logistic regression is based on the likelihood (the sum of the log-likelihoods of the data points) rather than the SSE, so the penalty is added to that instead
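As a hedged illustration (scikit-learn assumed): its LogisticRegression applies an L2 penalty to the likelihood-based cost by default, with C playing the role of 1/lambda, so a smaller C means stronger shrinkage.

```python
# Logistic regression with an L2 (ridge) penalty; smaller C = larger lambda.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

for C in [100.0, 1.0, 0.01]:                       # lambda grows from left to right
    clf = LogisticRegression(penalty="l2", C=C).fit(X, y)
    print(C, np.round(clf.coef_, 3))               # coefficients shrink as C falls
```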
How do we choose the value of lambda?
Cross validation
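A minimal sketch of this, assuming scikit-learn's RidgeCV, which tries each candidate lambda (alpha) with cross-validation and keeps the best one; the candidate grid below is arbitrary.

```python
# Choose lambda by cross-validation over a grid of candidate values.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] + rng.normal(size=100)

model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print(model.alpha_)                     # the lambda picked by cross-validation
```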
what is the cost function equation of lasso (L1) regularisation?
SSE + |slope| * lambda
This means it takes the absolute value of the slope (instead of squaring the slope as in ridge). Because this penalty does not fade away as a coefficient approaches zero, lasso can shrink some coefficients all the way to zero.
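Mirroring the ridge sketch above (numpy assumed, names illustrative), the lasso cost just swaps the squared penalty for an absolute-value one:

```python
# Lasso cost: SSE plus lambda times the sum of absolute slopes.
import numpy as np

def lasso_cost(X, y, intercept, slopes, lam):
    predictions = intercept + X @ slopes
    sse = np.sum((y - predictions) ** 2)       # ordinary sum of squared errors
    penalty = lam * np.sum(np.abs(slopes))     # absolute values rather than squares
    return sse + penalty
```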
is lasso or ridge better?
Lasso is better at reducing variance when there are lots of noisy variables, because it excludes useless or noisy variables by shrinking their coefficients to zero
Ridge regression performs better when most variables are useful
By excluding some variables, Lasso makes the regression easier to interpret
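A hedged side-by-side (scikit-learn assumed, data simulated): only two of twenty predictors are useful, and lasso sets most of the noise coefficients exactly to zero while ridge only shrinks them.

```python
# Compare how ridge and lasso treat noisy, useless predictors.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))                       # 2 useful + 18 noise features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge coefficients at zero:", np.sum(ridge.coef_ == 0))   # usually 0
print("lasso coefficients at zero:", np.sum(lasso.coef_ == 0))   # usually most of the 18
```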