Linear Regression Flashcards
DS
What training algorithms are appropriate for a linear regression on large data sets? Which should be avoided?
Appropriate: stochastic gradient descent
Avoided: normal equation (too computationally expensive)
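A minimal sketch with scikit-learn's SGDRegressor, assuming the feature matrix X and target vector y are already loaded; SGD is scale-sensitive, so the features are standardized first:
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# SGD updates the coefficients a few samples at a time, so it scales to large data sets
sgd_reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
sgd_reg.fit(X, y)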
What are some reasons that gradient descent may converge slowly, and how do you address them?
Problem: low learning rate
Solution: increase the learning rate gradually (avoid making it so high that you jump over minima)
Problem: the features have very dissimilar scales.
Solution: rescale the features (e.g. standardization or min-max scaling) so they share similar ranges.
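A quick sketch of rescaling, assuming X is a NumPy feature matrix; standardization and min-max scaling are two common options:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_std = StandardScaler().fit_transform(X)     # each feature rescaled to mean 0, std 1
X_minmax = MinMaxScaler().fit_transform(X)    # each feature rescaled to the [0, 1] range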
How do you select the right order of polynomial for polynomial regressions? What if the data is high dimensional?
This is a difficult question and there is no easy way to automate this selection.
It is suggested that you inspect the data and try to choose the order of polynomial that will best fit the data without overfitting.
If the data is high-dimensional and can’t be visualized, train models of increasing polynomial order and watch the validation error: the point where it begins to increase instead of decrease is where the model probably starts overfitting, so reduce the polynomial order to the point where validation error is minimized.
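A rough sketch of that search, assuming X and y exist; the degree range and CV settings are illustrative:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

val_mse = {}
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    val_mse[degree] = -scores.mean()  # validation MSE for this degree

best_degree = min(val_mse, key=val_mse.get)  # degree with the lowest validation MSE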
Is polynomial regression non-linear?
No, it is linear in the coefficients (the model is still a linear combination of the polynomial terms), and it can be used to fit non-linear data.
Explain how to fit a line to data with least squares to a five year old
A horizontal line drawn at the average of all y observations is usually a poor fit, but ybar is a good starting point.
Say y ranges over (0, 7) and mean(y) = 3.5; use that as the starting y intercept, b = 3.5.
Now let's compute the squared residuals (yhat - yi)^2 = (3.5 - yi)^2 for all (xi, yi) and sum them to obtain the sum of squared residuals SSR = (3.5 - y1)^2 + (3.5 - y2)^2 + … + (3.5 - yn)^2.
SSR is a measure of how well the line fits the data.
Now let's see how the fit changes when we rotate the line a bit. Say the SSR drops to 18.72, which is an improvement. Continue rotating the line and see if SSR improves further. If we over-rotate the line, SSR will begin to increase again. Somewhere in between, there is an optimal rotation where SSR is minimized.
The equation for a line is yhat = ax + b, where “a” is the slope and “b” is the y intercept (location on y-axis that the line crosses when x=0). We want to find optimal values for a,b such that we minimize SSR.
SSR = ((a*x1 + b) - y1)^2 + … + ((a*xn + b) - yn)^2
Since we want to find the line that gives minimal SSR, the method for finding best values for a and b is called “least squares”.
If we plot SSR on the y-axis against the rotation of the line on the x-axis, we get a parabola-shaped function. So how do we find the rotation that minimizes SSR?
The solution is to take the derivative of the SSR function, which gives the slope of the curve at each point. The slope is steepest far from the optimal rotation, is zero at the optimal point, and begins steepening again as we move past the optimum.
Note that the different rotations of the line are just different values for the slope "a" and the y intercept "b".
Now imagine a 3D plot with the intercept along one axis, the slope along another, and SSR along the vertical axis. Holding the intercept fixed while varying the slope gives the partial derivative with respect to a, and vice versa for b. Setting both partial derivatives to zero tells us where the optimal values, i.e. the best fit, lie.
Main concept 1: a line through the data will correspond to a SSR value.
Main concept 2: gradient descent on SSR as a function of (a, b) finds the optimal parameters, i.e. the point where the derivative of the SSR curve equals zero.
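A small numeric sketch of the same idea, assuming x and y are 1-D NumPy arrays; the learning rate and iteration count are illustrative, and the gradients are scaled by 1/n for stability:
import numpy as np

def fit_line(x, y, lr=0.1, n_iters=1000):
    # start with a horizontal line at the mean of y (slope 0, intercept ybar)
    a, b = 0.0, float(np.mean(y))
    n = len(x)
    for _ in range(n_iters):
        residuals = (a * x + b) - y               # yhat_i - yi for every point
        grad_a = (2 / n) * np.sum(residuals * x)  # d(SSR)/da, scaled by 1/n
        grad_b = (2 / n) * np.sum(residuals)      # d(SSR)/db, scaled by 1/n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b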
Explain how to evaluate performance of a regression model
Use MSE = (1/n) * Sum((yhati - yi)^2).
MSE is the average of the squared distances between the predicted and true values. The higher the MSE, the greater the total squared error and the worse the model.
Mathematical benefits of MSE: the metric penalizes a few extreme (outlier) residuals MORE heavily than many small ones, even when the total absolute error is the same. e.g. given two models A, B:
model A has errors 0 and 10 with MSE: (0^2 + 10^2)/2 = 50
model B has errors 5 and 5 with MSE: (5^2 + 5^2)/2 = 25
notice that while both models have a total absolute error of 10, model A's MSE is twice as bad because of the single extreme residual
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

reg = LinearRegression()
# cross-validate the linear regression using (negative) MSE
cv_mse_array = cross_val_score(reg, Xtrain, ytrain, scoring="neg_mean_squared_error")
A common regression metric alternative to MSE is R-squared, which measures the proportion of VARIANCE in the target vector that is explained by the model:
R2 = 1 - sum(yi - yhati)^2 / sum(yi - ybar)^2
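A quick sketch checking that formula against scikit-learn's r2_score, assuming arrays y_true and y_pred exist:
import numpy as np
from sklearn.metrics import r2_score

ss_res = np.sum((y_true - y_pred) ** 2)           # sum of squared residuals
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variation around ybar
r2 = 1 - ss_res / ss_tot
print(r2, r2_score(y_true, y_pred))               # the two values should match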
Explain how to handle feature interaction effects
Problem: you have a feature whose effect on the target DEPENDS on another feature.
Solution: Create an interaction term to capture the dependence using sklearn PolynomialFeatures.
Context for believing that interaction features exist:
Sometimes a feature's effect on our target variable is at least partially dependent on another feature. e.g. imagine a simple coffee example with two binary features, the presence of sugar (x1 = [sugar]) and whether we have stirred (x2 = [stirred]), and we want to predict whether the coffee tastes sweet (y = [sweet]). In this case we want to express an interaction effect between the features sugar and stirred. Why is an interaction effect necessary?
just stirring the coffee WITHOUT adding sugar won’t make the coffee sweet (stir=1, sugar=0).
just putting sugar in the coffee will NOT make the beverage sweet, as all the sugar falls to the bottom and does not mix into the coffee! (sugar=1, stir=0).
Instead, it is the INTERACTION of putting sugar in AND stirring (sugar=1, stir=1) that makes the beverage sweet (sweet=1).
The effects of sugar and stir on sweetness are DEPENDENT on each other, i.e. there is an INTERACTION effect.
We account for interaction effects by including a new feature as the product of the two dependent features yhat = b0 + b1x1 + b2x2 + b3 x1x2.
To create an interaction term, we simply multiply the sugar and stir features for every observation: interaction_term = np.multiply(X['sugar'], X['stir']).
If there is no substantive reason for feature interaction effects:
In case of no intuition for feature interactions, we can use sklearn PolynomialFeatures to create interaction terms for ALL combinations of feature products. We can then use model selection strategies to IDENTIFY which feature combinations, if any, produce the best model.
There are three important parameters in PolynomialFeatures:
Most importantly, interaction_only must be set to True to return only interaction terms and not polynomial terms.
By default, PolynomialFeatures adds a feature of ones called a bias, which we can prevent with include_bias=False.
Finally, the degree parameter determines the maximum number of features whose product forms an interaction term, as in the sketch below.
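A minimal sketch, assuming X is a feature matrix:
from sklearn.preprocessing import PolynomialFeatures

interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interaction = interaction.fit_transform(X)  # original features plus all pairwise products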
Explain how to fit a line to a nonlinear relationship
When assuming a linear relationship, we assume a constant relationship between the target and the features. An example of a linear relationship is the number of stories and a building's height: in linear regression we assume the effect of each additional story on height is approximately constant, meaning a 20 story building will be approximately twice as high as a 10 story building, which will be twice as high as a 5 story building. Many relationships of interest, however, are not linear.
e.g. take study time vs. test performance. The effect of studying zero hours vs. 1 hour makes a large difference, but studying 99 hrs vs. 100 hrs makes little difference.
To create a polynomial regression, convert the linear function yhat = b0 + b1x1 into a polynomial function by adding polynomial features: yhat = b0 + b1x1 + b2x1^2 + … + bdx1^d, where d is the degree of the polynomial.
How are we able to use linear regression for a nonlinear relationship? We don't change how the linear regression fits the model; the model doesn't "know" that x^2 is a quadratic transformation of x, it just treats it as one more feature.
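A minimal sketch of a polynomial regression, assuming X and y exist; the degree of 3 is illustrative:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_reg = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
poly_reg.fit(X, y)  # the linear regression just sees x, x^2, x^3 as separate features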
Explain how to fit a regression with shrinkage penalty to reduce overfitting
Problem: you want to reduce the variance (prediction variance) of your regression model.
Solution: Use a learning algo that includes a shrinkage (regularization) penalty, like ridge/lasso regression.
In standard linear regression, the model is trained to minimize the sum of squared residuals between the true values yi and the predictions yhati, i.e. min SSR = sum(yi - yhati)^2.
Regularized regression learners are similar, except they attempt to minimize the RSS AND a penalty FOR THE TOTAL SIZE OF THE COEF VALUES, called a SHRINKAGE penalty because it attempts to "shrink" the coefficients toward zero. There are two common shrinkage learners for regression: ridge and lasso. The only difference is the shrinkage penalty used.
In ridge regression, the shrinkage penalty is a tuning hyperparameter multiplied by the SUM OF THE SQUARED COEFS: RSS + alpha * sum(bhat_j^2) over the j = 1..p features.
Lasso is similar, except the shrinkage penalty is a tuning parameter multiplied by the SUM OF THE ABSOLUTE VALUES OF ALL COEFS: 1/(2n) * RSS + alpha * sum(|bhat_j|), where n is the number of observations.
So which one to use? As a general rule of thumb, ridge regression often produces slightly better predictions, while lasso produces more interpretable models. If we want a balance between ridge and lasso's penalty functions, we can use ElasticNet, which is simply a regression with both penalties included.
The hyperparameter alpha lets us control how much we penalize the coefficients, with larger alpha values creating more shrinkage (simpler, less flexible models).
NOTE: Because in linear regression the value of each coefficient is partially determined by the scale of its feature, and in regularized models ALL COEFS ARE SUMMED TOGETHER in the penalty, we must be sure to standardize the features prior to training.
sklearn has a RidgeCV method to select the ideal value for alpha.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

# standardize features so the penalty treats all coefficients on the same scale
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# RidgeCV picks the best alpha from the candidates via cross-validation
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])
model_cv = ridge_cv.fit(X_standardized, y)
display(model_cv.coef_)
display(model_cv.alpha_)  # the selected alpha
Explain how to simplify a linear regression model by reducing the number of features
One interesting characteristic of lasso regression's penalty is that it can shrink coefficients all the way to zero, effectively reducing the number of features in the model.
e.g., if we increase alpha and find that many of the coefs are 0, the corresponding features are no longer being used in the model. If we increase alpha to a much higher value, none of the features will be used at all.
The practical benefit of this effect is that we could include 100 features and then, by adjusting the hyperparameter alpha, produce a model that uses only, e.g., the 10 most important features. THIS LETS us REDUCE VARIANCE WHILE IMPROVING INTERPRETABILITY OF THE MODEL.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# LassoCV picks the best alpha from the candidates via cross-validation
lasso_cv = LassoCV(alphas=[1, 5])
model_cv = lasso_cv.fit(X_standardized, y)
display(model_cv.coef_)   # some coef values are 0.0
display(model_cv.alpha_)
Name the assumptions of linear regression
The task of regression is to find a LINEAR relationship between X,y
- there exists a linear relationship between independent and dependent variables (check with scatter)
- recall the textbook picture of individual normal distributions at each X value: all variables should be multivariate normal (check with a histogram, a Q-Q plot, or a Kolmogorov-Smirnov normality test). When the data is not normally distributed, a non-linear transformation (e.g. a log transformation) may fix the issue.
- there is no multicollinearity in the data, i.e. the independent variables are not correlated with each other (check with a correlation matrix heatmap, the tolerance T = 1 - R^2 where T < 0.1 may indicate multicollinearity, or the variance inflation factor VIF; see the sketch after this list).
- there is no autocorrelation in the data; autocorrelation means the residuals are NOT independent (e.g. stock prices exhibit autocorrelation).
- the residuals are normally distributed for small sample sets. This assumption becomes less important as sample size increases due to the CLT. Check residual normality with a residual Q-Q plot.
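A rough sketch of two of these checks, assuming X is a pandas DataFrame of features and residuals is an array of model residuals:
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

# multicollinearity check: VIF per feature (values above roughly 5-10 are a common warning sign)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)

# residual normality check: Q-Q plot of the residuals against a normal distribution
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()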