Regularization Flashcards
What happens to our linear regression model if we have three columns in our data: x, y, z — and z is a sum of x and y? ⭐️
We would not be able to perform the regression: because z is linearly dependent on x and y, the matrix we need to invert when solving the regression (XᵀX) is singular (not invertible).
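A minimal numpy sketch (with made-up numbers) showing that the matrix becomes rank-deficient when one column is the sum of two others:

```python
import numpy as np

# Toy data where the third column is the sum of the first two (z = x + y).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 5.0, 3.0])
X = np.column_stack([x, y, x + y])

gram = X.T @ X
print(np.linalg.matrix_rank(gram))  # 2, not 3: the matrix is rank-deficient
# Trying to invert gram (as the normal equations require) either fails
# or produces a numerically meaningless result.
```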
What is regularization? Why do we need it? 👶
Regularization is used to reduce overfitting in machine learning models. It helps models generalize well and makes them more robust to outliers and noise in the data.
Which regularization techniques do you know? ⭐️
There are two main types of regularization:
L1 Regularization (Lasso regularization) - adds the sum of the absolute values of the coefficients, λ Σⱼ |wⱼ|, to the cost function.
L2 Regularization (Ridge regularization) - adds the sum of the squares of the coefficients, λ Σⱼ wⱼ², to the cost function.
Here λ determines the amount of regularization.
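A minimal sketch of how each penalty is added to a cost function, assuming a mean-squared-error base cost (the function names and weight vector are illustrative, not from any library):

```python
import numpy as np

def mse(w, X, y):
    """Plain mean-squared-error cost for a linear model y ≈ X @ w."""
    return np.mean((X @ w - y) ** 2)

def lasso_cost(w, X, y, lam):
    """L1 (Lasso): MSE plus lambda times the sum of absolute weights."""
    return mse(w, X, y) + lam * np.sum(np.abs(w))

def ridge_cost(w, X, y, lam):
    """L2 (Ridge): MSE plus lambda times the sum of squared weights."""
    return mse(w, X, y) + lam * np.sum(w ** 2)
```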
What kind of regularization techniques are applicable to linear models? ⭐️
AIC/BIC, Ridge regression, Lasso, Elastic Net, Basis pursuit denoising, Rudin–Osher–Fatemi model (TV), Potts model, RLAD, Dantzig Selector, SLOPE
What does L2 regularization look like in a linear model? ⭐️
L2 regularization adds a penalty term to our cost function equal to the sum of squares of the model's coefficients, multiplied by a lambda hyperparameter. This technique shrinks the coefficients toward zero and is widely used when we have many features that might correlate with each other.
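In symbols, assuming a mean-squared-error base cost over n samples with p coefficients wⱼ (this notation is mine, not from the original card), the L2-regularized cost can be written as:

```latex
J(w) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} w_j^2
```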
How do we select the right regularization parameters? 👶
Regularization parameters can be chosen with a grid search or a random search. For example, the scikit-learn linear models documented at https://scikit-learn.org/stable/modules/linear_model.html expose the regularization strength as an alpha hyperparameter; we can evaluate a set of candidate alpha values and select the one that gives the lowest cross-validation (or validation) error.
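A minimal scikit-learn sketch of this idea; the synthetic dataset and the grid of alpha values are made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Try a log-spaced grid of alphas and keep the one with the best CV score.
param_grid = {"alpha": np.logspace(-3, 3, 13)}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```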
What’s the effect of L2 regularization on the weights of a linear model? ⭐️
L2 regularization penalizes larger weights more severely (due to the squared penalty term), which encourages weight values to decay toward zero.
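A quick sketch of this decay on a synthetic dataset (the alpha values are arbitrary): the average coefficient magnitude shrinks as the penalty grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

# Larger alpha -> stronger L2 penalty -> smaller weights on average.
for alpha in [0.01, 1.0, 100.0, 10_000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, round(float(np.abs(coef).mean()), 3))
```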
What does L1 regularization look like in a linear model? ⭐️
L1 regularization adds a penalty term to our cost function equal to the sum of the absolute values of the model's coefficients, multiplied by a lambda hyperparameter. For example, a mean-squared-error cost function with L1 regularization will look like:
J(w) = (1/n) Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |wⱼ|
What’s the difference between L2 and L1 regularization? ⭐️
Penalty terms: L1 regularization uses the sum of the absolute values of the weights, while L2 regularization uses the sum of the weights squared.
Feature selection: L1 performs feature selection by reducing the coefficients of some predictors to exactly 0, while L2 does not (see the sketch after this list).
Computational efficiency: L2 has an analytical solution, while L1 does not.
Multicollinearity: L2 addresses multicollinearity by constraining the coefficient norm.
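A small sketch of the feature-selection difference on a synthetic dataset (the dataset and alpha values are made up): Lasso typically zeroes out the uninformative coefficients, Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of the 10 features are actually informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))  # usually > 0
print("Ridge coefficients set to zero:", int(np.sum(ridge.coef_ == 0)))  # usually 0
```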
Can we have both L1 and L2 regularization components in a linear model? ⭐️
Yes, elastic net regularization combines L1 and L2 regularization.
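A minimal scikit-learn sketch of elastic net on a made-up dataset; in this API, alpha sets the overall penalty strength and l1_ratio sets the mix between the two penalties.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# alpha controls the overall penalty strength; l1_ratio controls the L1/L2 mix
# (l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is pure Ridge).
model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```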
What’s the interpretation of the bias term in linear models? ⭐️
In a linear model, the bias (intercept) term is the value the model predicts when all features are zero; it shifts the fitted line or hyperplane so it does not have to pass through the origin. Don't confuse it with statistical bias, which is the difference between the average prediction and the true value (i.e., true value minus mean of the predictions), or with accuracy.
How do we interpret weights in linear models? ⭐️
Without normalizing the weights or variables, a weight tells you how much the output changes, on average, when the corresponding predictor increases by one unit. By the way, this interpretation also works for logistic regression: if you increase the corresponding predictor by one unit, the weight represents the change in the log of the odds.
If the variables are normalized, the weights can be interpreted as the relative importance of each variable for the predicted result.
If a weight for one variable is higher than for another — can we say that this variable is more important? ⭐️
Yes - if your predictor variables are normalized.
Without normalization, the weight represents the change in the output per unit change in the predictor. If you have a predictor with a huge range and scale that is used to predict an output with a very small range - for example, using each nation’s GDP to predict maternal mortality rates - your coefficient should be very small. That does not necessarily mean that this predictor variable is not important compared to the others.
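A small sketch of this scale effect; the variable names (income_usd, income_kusd) and data are hypothetical and just illustrate that the same relationship produces very different raw coefficients depending on units:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
income_usd = rng.uniform(20_000, 200_000, size=500)   # same feature in dollars...
income_kusd = income_usd / 1000                       # ...and in thousands of dollars
target = 0.0001 * income_usd + rng.normal(0.0, 1.0, size=500)

coef_usd = LinearRegression().fit(income_usd.reshape(-1, 1), target).coef_[0]
coef_kusd = LinearRegression().fit(income_kusd.reshape(-1, 1), target).coef_[0]
print(coef_usd, coef_kusd)  # second coefficient is ~1000x larger, same relationship
```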
When do we need to perform feature normalization for linear models? When it’s okay not to do it? ⭐️
Feature normalization is necessary for L1 and L2 regularizations. The idea of both methods is to penalize all the features relatively equally. This can’t be done effectively if every feature is scaled differently.
Linear regression without regularization can be used without feature normalization. Also, L2 regularization helps make the analytical solution more stable: it adds the regularization matrix (λI) to XᵀX before inverting it.
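A minimal scikit-learn sketch (on a made-up dataset) of the usual pattern: standardize the features, then fit the regularized model, so the penalty treats all coefficients comparably.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Standardize the features so the L2 penalty treats all coefficients comparably.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)
```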