Supervised Learning: Regression Flashcards
Is Decision Tree / Random Forest a regression or a classification model?
Trick question: it can be either!
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
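A minimal sketch of both flavors, using made-up toy data (names are illustrative):
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])   # toy predictors
y_continuous = np.array([1.5, 3.2, 5.1, 7.4])    # continuous outcome -> regression
y_labels = np.array([0, 0, 1, 1])                # categorical outcome -> classification
RandomForestRegressor().fit(X, y_continuous)
RandomForestClassifier().fit(X, y_labels)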
What is MSE?
What is one downside with using it?
Mean Squared Error is the standard cost function to be minimized by linear regression models.
Downside: it is susceptible to outliers, because squaring the errors makes large errors dominate the cost.
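For reference, in plain text: MSE = (1/n) * sum((y_i - y_pred_i)^2). A minimal sketch of computing it by hand, with made-up toy values:
import numpy as np
y = np.array([3.0, 5.0, 7.0])        # true values (toy)
y_pred = np.array([2.5, 5.5, 9.0])   # predictions (toy)
mse = np.mean((y - y_pred) ** 2)     # squaring makes large (outlier) errors dominate
print(mse)                           # 1.5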
What is the syntax for calculating MSE and R-squared for a linear regression?
from sklearn import metrics
metrics.mean_squared_error(y, y_pred)  # in both calls, the first argument is the true y, the second the predictions
metrics.r2_score(y, y_pred)
What are two ways of calculating R-squared?
1.
from sklearn import metrics
metrics.r2_score(y, y_pred) # Runs instantly as long as y_pred is previously calculated
2.
my_linreg.score(X, y) # has to run .predict behind the scenes and might thus take time
What to do when two of your predictors in a linear regression model are highly correlated with each other (collinear), but you’re not sure which one to leave in the model?
1. Test the true importance of each variable by running the model with one variable at a time MISSING, recording the R^2 each time. The drop in R^2 with that variable missing vs. all variables present is the true importance of the missing variable (see the sketch below).
2. In addition, consider running a Ridge or Lasso regression (instead of standard linear regression); these penalize large coefficients and tend to shrink (Ridge) or zero out (Lasso) whichever of the two collinear predictors contributes less.
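A minimal sketch of the drop-one-variable check in step 1, using made-up toy data (all names are illustrative):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = pd.DataFrame({"x1": x1,
                  "x2": x1 + rng.normal(scale=0.1, size=100),   # highly correlated with x1
                  "x3": rng.normal(size=100)})
y = 2 * X["x1"] + 0.5 * X["x3"] + rng.normal(scale=0.2, size=100)
full_r2 = LinearRegression().fit(X, y).score(X, y)
for col in X.columns:
    X_reduced = X.drop(columns=col)
    r2_without = LinearRegression().fit(X_reduced, y).score(X_reduced, y)
    print(col, round(full_r2 - r2_without, 3))   # drop in R^2 = true importance of col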
Can linear regression fit non-linear data? If so, how?
Yes, by adding polynomial terms of power 2 or greater. To do this manually, add the higher power of EVERY predictor, as well as every interaction term of that power. For example: if the predictors are x1, x2, x3, the quadratic model's predictors are x1, x2, x3, x1^2, x2^2, x3^2, x1*x2, x1*x3, x2*x3.
To do this with the built-in tool, use the PolynomialFeatures() transformer.
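A minimal sketch of the built-in route, using made-up toy data (names are illustrative):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.array([[1, 2, 3], [4, 5, 6]])                  # columns: x1, x2, x3
poly = PolynomialFeatures(degree=2, include_bias=False)
X_quad = poly.fit_transform(X)                        # adds the squares and all pairwise interaction terms
print(poly.get_feature_names_out())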
What’s the best scaler to use before using PolynomialFeatures?
MaxAbsScaler, because it scales each feature to the range -1 to 1, so raising the values to higher powers does not blow them up.
How to add all interaction terms to a linear regression model, while keeping the model truly linear? (i.e., no quadratic or higher powers)
from sklearn.preprocessing import PolynomialFeatures
PolynomialFeatures(interaction_only=True)
You suspect a quadratic (non-linear) relationship between at least some of your many predictors and the DV. But adding polynomials and all interactions of all the predictors will result in overfitting (too many predictors for the data size). What’s a solution to this conundrum?
Use Ridge or Lasso regression instead of standard linear regression. These penalize large coefficients, and Lasso in particular tends to "zero out" many of the predictors, keeping the effective number of predictors small.
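A minimal sketch of that combination (MaxAbsScaler + PolynomialFeatures + Lasso), using made-up toy data; the alpha value is an arbitrary example:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler, PolynomialFeatures
from sklearn.linear_model import Lasso
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=50)     # quadratic in x1 only
model = make_pipeline(MaxAbsScaler(), PolynomialFeatures(degree=2), Lasso(alpha=0.1))
model.fit(X, y)
print(model.named_steps["lasso"].coef_)               # many coefficients end up exactly 0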
How do Ridge and Lasso regression combat overfitting? Be specific.
They add a penalty to the cost function that grows with the size of the betas (coefficients), so the model has to balance fitting the data with keeping coefficients small. This results in many of the betas being near-zero (Ridge) or exactly zero (Lasso), which is functionally equivalent to not having put those predictors into the model in the first place.
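Concretely, up to sklearn's internal scaling constants, the penalized cost functions are:
Ridge cost = MSE + alpha * sum(beta_j^2)    (L2 penalty: shrinks betas toward zero)
Lasso cost = MSE + alpha * sum(|beta_j|)    (L1 penalty: can set betas exactly to zero)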
What is the hyperparameter for Ridge regression and what do its values mean?
Alpha.
Alpha = 0 is equivalent to standard linear regression. The higher alpha is, the more heavily large betas are penalized, so more coefficients are pushed toward (or to) zero. Alpha has no upper limit; in fact, trying alphas that increase on a log scale (1, 10, 100) is a good idea.
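A minimal sketch of trying log-spaced alphas with built-in cross-validation (RidgeCV), using made-up toy data:
import numpy as np
from sklearn.linear_model import RidgeCV
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)
ridge = RidgeCV(alphas=[0.1, 1, 10, 100]).fit(X, y)   # candidate alphas on a log scale
print(ridge.alpha_)                                   # the alpha that cross-validated best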
What are the two broad subtypes of supervised learning?
And what are a few examples of each subtype?
Regression and classification.
Regression predicts a continuous outcome variable. Classification predicts a discrete/categorical outcome variable.
Regression examples: linear regression (can be polynomial), Ridge regression, Lasso regression, decision tree, random forest.
Classification examples: logistic regression, decision tree, random forest.
What are the 3 steps involved in training a supervised ML model?
- Choose a FAMILY of models (e.g., linear regression)
- Choose an ERROR METRIC / cost function (often implicitly chosen by default in sklearn)
- ITERATE to find the specific model in the family of models that minimizes the cost function (this is where the power of computing comes in!). For example, a linear regression model with specific Betas is a specific instance of the linear regression family of models.
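A minimal sketch of the three steps in sklearn, using made-up toy data (the cost function is chosen implicitly):
import numpy as np
from sklearn.linear_model import LinearRegression    # 1. choose the FAMILY of models
X = np.array([[1.0], [2.0], [3.0]])                  # toy data (illustrative)
y = np.array([2.0, 4.1, 5.9])
model = LinearRegression()                           # 2. cost function (squared error) is the sklearn default
model.fit(X, y)                                      # 3. find the specific betas that minimize the cost
print(model.intercept_, model.coef_)                 # the specific instance of the family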
In a nutshell, how do computers minimize the cost function without finding the exact min (aka “closed solution”) mathematically in a complex multi-dimensional space?
Gradient Descent:
1. Calculate the current value of the error/cost.
2. Move a very small step in the (multidimensional) space of the model's coefficients, in the direction where the cost is decreasing the fastest (the negative gradient).
3. Repeat steps 1-2 a fixed number of times, or until the error functionally stops decreasing.
Note: reaching the GLOBAL min of the cost function is likely but not guaranteed!
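A minimal sketch of gradient descent for simple linear regression with MSE, assuming one predictor and a fixed learning rate (all names and values are illustrative):
import numpy as np
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3 * x + 1 + rng.normal(scale=0.5, size=100)       # true relationship: slope 3, intercept 1
beta0, beta1 = 0.0, 0.0                               # start from an arbitrary point
learning_rate = 0.1
for _ in range(1000):                                 # fixed number of steps
    error = (beta0 + beta1 * x) - y
    grad0 = 2 * np.mean(error)                        # d(MSE)/d(beta0)
    grad1 = 2 * np.mean(error * x)                    # d(MSE)/d(beta1)
    beta0 -= learning_rate * grad0                    # small step in the direction of fastest decrease
    beta1 -= learning_rate * grad1
print(beta0, beta1)                                   # should end up close to 1 and 3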