Linear Regression Flashcards
What is the first step in a multiple regression analysis?
Compute the F-statistic and examine the associated p-value to check whether at least one predictor is related to the response.
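A minimal sketch of that computation, assuming NumPy and synthetic data. The function name f_statistic and the data are made up for illustration. It uses F = ((TSS - RSS) / p) / (RSS / (n - p - 1)).

```python
import numpy as np

def f_statistic(X, y):
    """F-statistic for H0: all slope coefficients are zero.

    X is an (n, p) predictor matrix without an intercept column.
    """
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # add the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least squares fit
    rss = np.sum((y - A @ beta) ** 2)             # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)             # total sum of squares
    return ((tss - rss) / p) / (rss / (n - p - 1))

# Synthetic data (assumption): only the first of three predictors matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(size=100)
f_stat = f_statistic(X, y)
```

A value far above 1 is evidence against the null hypothesis; the p-value comes from the F distribution with p and n - p - 1 degrees of freedom.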
What are the 3 basic types of variable selection? Describe one of them.
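- Forward selection: begin with the null model (intercept only), fit p simple linear regressions, and add the variable that gives the lowest RSS. Then add the variable that gives the lowest RSS for the new two-variable model, and repeat until a stopping rule is satisfied.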
- Backward selection
- Mixed selection
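Forward selection can be sketched as a greedy loop. This is only an illustration: the helper names rss and forward_selection, the synthetic data, and the stopping rule (stop after k variables) are assumptions, not something the cards specify.

```python
import numpy as np

def rss(X, y, cols):
    """RSS of a least squares fit on an intercept plus the given columns."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))

def forward_selection(X, y, k):
    """Greedily add the predictor that lowers the RSS most, k times."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < k:
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (assumption): only columns 2 and 4 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 2] - 2 * X[:, 4] + rng.normal(scale=0.5, size=200)
chosen = forward_selection(X, y, k=2)
```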
What are the two assumptions of a linear model and how can we remove them?
- Additive assumption: the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors. Remove: add an interaction term such as Xj*Xk as an extra predictor.
- Linear assumption: the change in the response Y due to a one-unit change in Xj is constant regardless of the value of Xj. Remove: add a transformed feature to the model with functions like Xj^n, log(Xj), sqrt(Xj) etc…
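Both fixes amount to adding columns to the design matrix. A small sketch, assuming NumPy and a synthetic response that really does contain an interaction and a squared term (names and data are illustrative):

```python
import numpy as np

def fit_rss(A, y):
    """RSS of the least squares fit of y on the columns of A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))

rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0, 5, n)
x2 = rng.uniform(0, 5, n)
# Assumed true relationship: interaction x1*x2 plus a quadratic effect of x1.
y = 1 + x1 + x2 + 2 * x1 * x2 + 0.5 * x1**2 + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
additive = np.column_stack([ones, x1, x2])                 # additive, linear model
extended = np.column_stack([ones, x1, x2, x1 * x2, x1**2]) # + interaction, + x1^2

rss_additive = fit_rss(additive, y)
rss_extended = fit_rss(extended, y)
```

The model stays linear in its coefficients, so ordinary least squares still applies; only the features change.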
How can we detect potential non-linearity in the data?
We plot the residuals yi - yi_estimated versus the fitted values yi_estimated (or versus xi in simple regression). If there is a clear pattern, such as a U-shape, it indicates the presence of non-linearity in the data.
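A sketch of the quantities that go into such a residual plot, assuming NumPy and a synthetic quadratic relationship. Instead of drawing the plot, it summarises the pattern with the mean residual in each third of the fitted values, which is only a crude stand-in for looking at the scatter.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 4, n)
y = x**2 + rng.normal(scale=0.3, size=n)      # truly quadratic relationship

A = np.column_stack([np.ones(n), x])          # straight-line model
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
residuals = y - fitted                        # what the plot shows on the y-axis

# Mean residual in the lower, middle and upper third of the fitted values.
order = np.argsort(fitted)
thirds = [residuals[idx].mean() for idx in np.array_split(order, 3)]
```

Here the means come out positive, negative, positive: the U-shape that signals a missed non-linear term.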
Give an example of a problem caused by correlated error terms. How can we detect such a correlation in time series data?
Example: if we accidentally duplicate every observation, the coefficient estimates won’t change, but the confidence intervals would be falsely narrower by a factor of sqrt(2).
We can detect it by plotting the residuals versus time and checking whether there’s a pattern, e.g. adjacent residuals having similar values (local trends in the plot).
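The duplicated-data example can be checked numerically. A sketch assuming NumPy, synthetic data, and the usual least squares standard error formula (the helper name fit_with_se is made up):

```python
import numpy as np

def fit_with_se(x, y):
    """Least squares coefficients and their standard errors."""
    n = len(y)
    A = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    sigma2 = rss / (n - 2)                      # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)       # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 1.0 + 2.0 * x + rng.normal(size=500)

beta, se = fit_with_se(x, y)
beta_dup, se_dup = fit_with_se(np.tile(x, 2), np.tile(y, 2))  # every row twice
ratio = se[1] / se_dup[1]                       # slope SE: original / duplicated
```

The coefficients are identical, while the ratio of standard errors is sqrt((2n - 2) / (n - 2)), which is close to sqrt(2) for large n.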
How to detect non-constant variance in error terms (heteroscedasticity)? What is a possible solution to that?
We can plot the residuals versus the fitted values and see if there’s a funnel shape (the spread of the residuals grows with the fitted value).
Solution: transform response using a concave function such as log(Y) or sqrt(Y).
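A sketch of the effect of the log transform, assuming NumPy and synthetic data with multiplicative noise. The spread measure (residual standard deviation in the upper half of the fitted values divided by that in the lower half) is an ad hoc stand-in for judging the funnel by eye.

```python
import numpy as np

def spread_ratio(x, y):
    """Residual std in the upper half of fitted values over the lower half."""
    n = len(y)
    A = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ beta
    residuals = y - fitted
    order = np.argsort(fitted)
    lower, upper = np.array_split(order, 2)
    return residuals[upper].std() / residuals[lower].std()

rng = np.random.default_rng(4)
x = rng.uniform(0, 4, 400)
# Multiplicative noise (assumption): the error grows with the level of y.
y = np.exp(1 + 0.5 * x + rng.normal(scale=0.3, size=400))

ratio_raw = spread_ratio(x, y)          # well above 1: funnel shape
ratio_log = spread_ratio(x, np.log(y))  # near 1: roughly constant variance
```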
What is the difference between an outlier and a high-leverage point? What are the effects of both and how can we detect them?
Outlier:
- Def: a point for which the value of yi is far from the value predicted by the model or far from the trend of the values.
- Effect: usually has little effect on the least squares line if its predictor value is not unusual, but it can inflate the RSE (which feeds into confidence intervals and p-values) and lower R^2.
- Detect: studentized residuals (divide each residual by its standard error). Observations whose studentized residuals are greater than 3 in absolute value are potential outliers.
High-leverage point:
- Def: a point with an unusual predictor value xi (or an unusual combination of predictor values)
- Effect: can have a sizable impact on the estimated regression line
- Detect: leverage statistic hi well above the average leverage (p+1)/n
A good idea is to plot the studentized residuals versus the leverage statistic.
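Both diagnostics come from the hat matrix H = A (A^T A)^-1 A^T. A sketch assuming NumPy, synthetic data with one planted outlier and one planted high-leverage point, and internally studentized residuals ei / (sigma * sqrt(1 - hi)). The cutoff of 3 times the average leverage is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 8.0, 1 + 2 * 8.0    # high-leverage point lying on the true line
y[1] += 5.0                      # outlier: response shifted far off the trend

A = np.column_stack([np.ones(n), x])
H = A @ np.linalg.inv(A.T @ A) @ A.T       # hat matrix
leverage = np.diag(H)                      # leverage statistic h_i
residuals = y - H @ y

p = 1                                      # number of predictors
sigma = np.sqrt(np.sum(residuals**2) / (n - p - 1))        # RSE
studentized = residuals / (sigma * np.sqrt(1 - leverage))  # internally studentized

outliers = np.flatnonzero(np.abs(studentized) > 3)
high_leverage = np.flatnonzero(leverage > 3 * (p + 1) / n)  # arbitrary 3x cutoff
```

Plotting studentized against leverage would put observation 1 high on the vertical axis and observation 0 far to the right.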
Why is collinearity a problem in regression?
How to detect it?
Give two solutions on how to deal with it.
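It is difficult to separate out the individual effects of collinear variables on the response, which inflates the standard errors of their coefficient estimates.
Detect: correlation matrix (pairs only), variance inflation factor (VIF = 1 / (1 - R^2 of Xj regressed on the other predictors); a VIF above 5 or 10 signals a problem)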
1st solution: drop one of the variables
2nd solution: combine the collinear variables into one predictor
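A sketch of the VIF computation and of the second solution, assuming NumPy and synthetic data in which two predictors are nearly identical. The function name vif and the use of a plain average to combine the collinear pair are illustrative choices.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        rss = np.sum((X[:, j] - A @ beta) ** 2)
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1 - rss / tss               # R^2 of X_j regressed on the others
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(6)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)       # nearly a copy of x1 (collinear)
x3 = rng.normal(size=n)                  # unrelated predictor

vifs = vif(np.column_stack([x1, x2, x3]))

# 2nd solution: merge the collinear pair into one predictor (here, their mean).
combined = (x1 + x2) / 2
vifs_combined = vif(np.column_stack([combined, x3]))
```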