Linear Regression Flashcards
What is Linear Regression?
Predicting a quantitative response variable (Y) given a single predictor variable (X); this single-predictor case is called simple linear regression.
The coefficients are estimated using the least squares method, which minimizes the Residual Sum of Squares (RSS).
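A minimal Python sketch of the least squares fit for simple linear regression, using the closed-form estimates (synthetic data; all names are illustrative):

```python
import numpy as np

# Synthetic data: y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 100)

# Closed-form least squares estimates for simple linear regression
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# RSS: the quantity that least squares minimizes
residuals = y - (beta0 + beta1 * x)
rss = np.sum(residuals ** 2)
print(beta0, beta1, rss)   # beta0 ~ 2, beta1 ~ 3
```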
What is the Mean Squared Error (MSE)?
The RSS averaged over all n data points: MSE = RSS / n.
Which statistic do we use in the case of simple linear regression to check the significance of the predictor variable?
List out Ho and Ha.
t-Statistic
Ho - No relationship between “X” and “Y” (B1 = 0).
Ha - Some relationship between “X” and “Y” (B1 ≠ 0).
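A sketch of computing this t-statistic by hand for the slope of a simple linear regression (illustrative data; scipy is assumed for the p-value):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 1.0 + 0.5 * x + rng.normal(0, 2, 50)
n = len(x)

# Least squares estimates and RSS
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
rss = np.sum((y - beta0 - beta1 * x) ** 2)

# Standard error of the slope; the error variance is estimated by RSS / (n - 2)
se_beta1 = np.sqrt(rss / (n - 2) / np.sum((x - x.mean()) ** 2))
t_stat = beta1 / se_beta1                        # t-statistic for Ho: B1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
print(t_stat, p_value)
```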
What questions to ask before applying Linear Regression?
- Is there a relationship between the response and the predictor variable?
- If yes, how strong is the relationship?
- Is the relationship linear?
What is the error term (e) in the equation of linear regression?
It is a catch-all for what the model misses, for example:
- The true relationship may not be linear.
- Measurement error.
The error term is assumed to be independent of “X”.
What is the meaning of a zero R2?
A value near zero indicates that the model is not able to explain any of the variability in the response.
This happens when RSS = TSS (so R2 = 1 - RSS/TSS = 0): the fit is no better than simply predicting the mean value of the response variable, without using the predictor variable at all.
What is the significance of R2 in context of linear regression?
The R2 statistic is a measure of the linear relationship between X and Y, just like the correlation r = Corr(X, Y).
In simple linear regression it can be shown that R2 = r^2.
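A quick numerical check of R2 = r^2 on made-up data (a sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 4.0 - 1.5 * x + rng.normal(size=200)

# Fit by least squares and compute R2 = 1 - RSS/TSS
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
rss = np.sum((y - beta0 - beta1 * x) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1 - rss / tss

r = np.corrcoef(x, y)[0, 1]      # sample correlation Corr(X, Y)
print(r_squared, r ** 2)         # the two values agree
```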
Why can't we use several simple linear regressions instead of multiple linear regression?
- It is unclear how to combine several separate simple linear regressions into a single prediction.
- The predictor variables may be correlated with each other, but fitting separate simple linear regressions ignores this and effectively assumes they are completely independent of each other.
Which statistic do we use in the case of Multiple Linear regression to check the significance of the predictor variable?
List out Ho and Ha.
F-Statistic
Ho - B1 = B2 = … = Bp = 0
Ha - At least one of Bj ≠ 0
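A sketch of reading off the overall F-statistic with statsmodels (assuming it is installed; the data are synthetic):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))                    # three predictors
y = 2.0 + X @ np.array([1.0, 0.0, -0.5]) + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue, model.f_pvalue)              # F-statistic for Ho: B1 = B2 = B3 = 0
```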
What should be the value of the F-statistic if there is no relationship between the response and the predictor variables in multiple linear regression?
A value close to 1.
What should be the value of the F-statistic if there is a relationship between the response and the predictor variables in multiple linear regression?
A value greater than 1.
Given the individual p-values for each variable, why do we need to look at the overall F-statistic?
Because when the number of predictor variables (p) is large, about 5% of the individual p-values will fall below 0.05 purely by chance, so some predictors will look significant even when none is truly associated with the response. The F-statistic does not suffer from this problem because it adjusts for the number of predictors.
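A small simulation of this point: with 100 pure-noise predictors, a handful of individual p-values dip below 0.05 by chance, while the overall F-test stays unimpressed (a sketch; exact counts vary by seed):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, p = 200, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                       # response unrelated to every predictor

model = sm.OLS(y, sm.add_constant(X)).fit()
n_false = np.sum(model.pvalues[1:] < 0.05)   # individual t-test "hits" by chance
print(n_false, model.f_pvalue)               # roughly 5% false hits, large overall p
```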
What to do if the number of predictor variables is greater than the number of data points?
In that case we cannot fit multiple linear regression by least squares, and the F-statistic cannot be used.
What are the ways of variable selection in the context of linear regression?
- Forward Selection - Start with the null model (no predictors). Repeatedly add the predictor whose addition yields the lowest RSS. Continue until some stopping rule is reached (see the sketch after this list).
- Backward Selection - Start with all the predictors. Remove the predictor with the largest p-value, refit, and repeat until a stopping rule is reached.
- Mixed Selection - Start with the null model and add predictors as in forward selection, but after each addition remove any predictor whose p-value has grown too large; alternate these forward and backward steps until a stopping rule is reached.
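A sketch of forward selection with RSS as the criterion and a fixed model size as the stopping rule (the function name and data are illustrative):

```python
import numpy as np

def forward_selection(X, y, max_vars):
    """Greedily add the predictor that yields the lowest RSS at each step."""
    n, p = X.shape
    selected = []
    for _ in range(max_vars):
        best_rss, best_j = np.inf, None
        for j in range(p):
            if j in selected:
                continue
            # Least squares fit on the candidate predictor set (plus intercept)
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
    return selected

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 8))
y = 3 * X[:, 2] - 2 * X[:, 5] + rng.normal(size=150)
print(forward_selection(X, y, max_vars=2))   # should recover predictors 2 and 5
```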
What is the relationship between R2 and correlation in multiple linear regression?
R2 = [Corr(Y, Y_pred)]^2
Why do we need variable selection if R2 increases whenever more variables are added, even variables only weakly associated with the response?
To prevent overfitting: R2 is computed on the training data, so it never decreases when a variable is added, even a useless one.
How can RSE increase when an extra variable is added to the model, given that RSS decreases?
Models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p (the number of predictor variables), since p appears in the denominator:
RSE = sqrt[RSS / (n - p - 1)]
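A sketch illustrating this: adding a pure-noise predictor always lowers RSS a little, but the (n - p - 1) denominator shrinks too, so RSE need not fall and may rise (whether it does depends on the random draw):

```python
import numpy as np

def rse(rss, n, p):
    # Residual standard error for a model with p predictors
    return np.sqrt(rss / (n - p - 1))

rng = np.random.default_rng(6)
n = 60
x1 = rng.normal(size=n)
noise = rng.normal(size=n)                   # predictor unrelated to the response
y = 2 * x1 + rng.normal(size=n)

for cols, p in [((x1,), 1), ((x1, noise), 2)]:
    A = np.column_stack([np.ones(n), *cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    print(p, rss, rse(rss, n, p))            # RSS always drops; RSE may not
```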
What is the additive assumption of linear regression model?
The effect of a change in a predictor Xj on the response Y is independent of the values of the other predictors.
What is the linear assumption of linear regression model?
The change in response Y associated with a one-unit change in Xj is constant, regardless of the value of Xj.
What is a synergy effect or interaction effect?
The effect of one predictor variable on the response depends on the value of another predictor variable.
The combined effect of the two variables can be more (or less) than the sum of their individual effects.
If we are including an interaction term in a model, should we include the main term also, if the p-value for the main term is not significant?
Yes. By the hierarchical principle, if an interaction term is included in a model, the main effects should be included as well, even if their p-values are not significant.
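A sketch of fitting an interaction model with the statsmodels formula API (assuming statsmodels and pandas; the data are made up). The formula "x1 * x2" expands to the main effects plus the interaction, respecting the hierarchical principle:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({"x1": rng.normal(size=120), "x2": rng.normal(size=120)})
df["y"] = 1 + 2 * df["x1"] + 0.5 * df["x2"] + 3 * df["x1"] * df["x2"] + rng.normal(size=120)

# "x1 * x2" expands to x1 + x2 + x1:x2, keeping the main effects with the interaction
model = smf.ols("y ~ x1 * x2", data=df).fit()
print(model.params)
```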
Explain polynomial regression in context of multiple linear regression.
Polynomial regression is a special case of multiple linear regression in which the additional predictor variables are powers of a single original predictor (e.g., X, X^2, X^3). The model is still linear in the coefficients, so it can be fit by ordinary least squares.
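A sketch of a quadratic fit: powers of a single predictor enter as extra columns, and the fit is still ordinary least squares (illustrative data):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(-3, 3, 200)
y = 1 - 2 * x + 0.5 * x**2 + rng.normal(size=200)

# Design matrix with columns 1, x, x^2: the model is linear in the coefficients
A = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)                                  # approximately [1, -2, 0.5]
```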
How to identify nonlinearity in the data? (checking the linearity assumption)
Residual Plots
For SLR - Residuals vs Predictor
For MLR - Residuals vs Fitted Value
There should not be any pattern evident in the residual plot.
An ideal residual plot shows no trend in the residuals, no outliers, and no changing variance across the range of fitted values.
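A sketch of the diagnostic plot (assuming matplotlib): a straight line is fit to truly quadratic data, and the residuals-vs-fitted plot reveals the missed curvature:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
x = rng.uniform(0, 5, 150)
y = x**2 + rng.normal(size=150)              # truly nonlinear relationship

# Fit a (misspecified) straight line and plot residuals vs fitted values
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta

plt.scatter(fitted, y - fitted)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")                      # the U-shape here signals nonlinearity
plt.show()
```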
What to do when nonlinearity is present in the data? (In context of Regression)
Apply a nonlinear transformation to the predictors, such as log(X), X^2, or sqrt(X).
Heteroscedasticity
Non-constant variance in the residuals: the variance of the error term is not constant across observations.
How to tackle heteroscedasticity?
- Transform the response Y using a concave function such as log(Y) or sqrt(Y).
- Use weighted least squares.
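A sketch of weighted least squares with statsmodels, under the (assumed) knowledge that the error standard deviation grows with x, so inverse-variance weights of 1/x^2 are appropriate:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(scale=x)          # error spread grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1.0 / x**2).fit() # weights = 1 / Var(error_i)
print(ols.bse, wls.bse)                      # WLS standard errors are typically smaller
```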
Outlier
Outliers are data points that deviate significantly from the expected patterns or values.
In a regression setting, outliers have extreme values of the response variable (the observed y is far from the value the model predicts).
They can arise due to measurement errors, data entry mistakes, or genuine extreme values.
High Leverage Point
Leverage points are data points that have a significant impact on the estimated regression coefficients. These points can distort the regression line.
Unlike outliers, leverage points have extreme values in their predictors.
How to identify outliers?
Box plot
Residual Plots
Scatter Plots
How to detect leverage point?
Leverage statistic
Cook’s Distance
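A sketch computing the leverage statistic directly as the diagonal of the hat matrix H = X (X^T X)^-1 X^T (illustrative data with one planted high-leverage point):

```python
import numpy as np

rng = np.random.default_rng(11)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
X[0, 1] = 8.0                                # one point with an extreme predictor value

# Hat matrix diagonal: the leverage of each observation
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
print(leverage[0], leverage.mean())          # far above the average leverage (p + 1) / n
```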
Studentized Residuals
Computed by dividing each residual by its estimated standard error (standard deviation).
For the externally studentized version, each data point i is deleted in turn, the regression model is re-estimated on the remaining points, and the deleted point's residual is scaled by the standard deviation estimated from that refit.
Observations with |studentized residual| > 3 are possible outliers.
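A sketch using statsmodels' influence tools, whose resid_studentized_external follows the delete-one scheme described above (one outlier is planted in the response):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
x = rng.normal(size=80)
y = 1 + 2 * x + rng.normal(size=80)
y[0] += 10                                   # plant one outlier in the response

model = sm.OLS(y, sm.add_constant(x)).fit()
student = model.get_influence().resid_studentized_external
print(np.where(np.abs(student) > 3)[0])      # should flag observation 0
```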
What is collinearity?
A situation in which two or more predictor variables are closely related to each other.