Advanced Topics Flashcards
Name 4 assumptions of regression
- Linearity
- Normality of residuals
- No high influence points
- No collinearity
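How do you test the linearity assumption of regression?
Plot your fitted-Y against observed-Y, or the residuals against fitted-Y. Residuals should appear symmetrical around zero along the whole range of fitted values.
If you also run a formal test for curvature: sig p-value = probably not linear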
How do you test the regression assumption: residuals are normal?
This refers to standardised residuals. To standardise, first convert the residuals to z-scores. Then run a Shapiro-Wilk test or draw a Q-Q plot on the standardised residuals.
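A minimal sketch of the standardising step and the numbers behind a Q-Q plot, using only the Python standard library. The residual values and the helper names `standardise` and `qq_pairs` are made up for illustration, and the plotting position (i + 0.5) / n is one common choice among several.

```python
from statistics import NormalDist, mean, stdev

def standardise(residuals):
    """Convert residuals to z-scores: subtract the mean, divide by the sample SD."""
    m, s = mean(residuals), stdev(residuals)
    return [(r - m) / s for r in residuals]

def qq_pairs(z_scores):
    """Pair each sorted z-score with the normal quantile expected at its rank."""
    n = len(z_scores)
    std_normal = NormalDist()
    # (i + 0.5) / n is an assumed plotting position; other conventions exist.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(z_scores)))

residuals = [-1.2, 0.4, 0.1, -0.3, 1.5, -0.5, 0.0]  # made-up example values
z = standardise(residuals)
pairs = qq_pairs(z)
for expected, observed in pairs:
    # Points close to the line observed == expected suggest normal residuals.
    print(f"{expected:6.2f} {observed:6.2f}")
```

If SciPy is installed, `scipy.stats.shapiro(z)` returns the Shapiro-Wilk statistic and its p-value for the same z-scores.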
What is an outlier in regression?
A data point that has a large residual.
i.e. a large distance between the data point and the regression line.
What is a high leverage point?
An observation that has an extreme or unusual predictor value.
Far along the x-axis.
Which is more dangerous to one's regression, an outlier or a high leverage point?
Neither is particularly dangerous in and of itself. However, an observation that is BOTH an outlier and a high leverage point is dangerous.
What is leverage, and how is it calculated? How are outliers identified?
Leverage is calculated using the hat value, which measures for each data point how much it 'controls' the regression line. Outliers can be seen by plotting the standardised residuals.
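A rough sketch of hat values for the simplest case: one predictor plus an intercept, where the hat value reduces to 1/n + (x_i - mean(x))^2 / Sxx. The data and the function name `hat_values` are hypothetical; with several predictors the hat values come from the diagonal of the hat matrix instead.

```python
def hat_values(x):
    """Leverage of each observation in a one-predictor regression with intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Points far from the mean of x get hat values close to 1.
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

x = [1, 2, 3, 4, 20]  # made-up data; the last point sits far along the x-axis
h = hat_values(x)
for xi, hi in zip(x, h):
    print(xi, round(hi, 3))
```

The hat values always sum to the number of fitted coefficients (here 2), so one point with a value near 1 is holding most of the line's 'control'.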
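When is Cook's distance (common cutoff: 4 / n) a problem?
If a data point's Cook's distance is more than 4 / n. The 2k / n rule is the matching cutoff for hat values (leverage), where k counts the model's coefficients and n the observations.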
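A sketch of Cook's distance for a one-predictor regression, assuming the usual formula D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2 and the common 4 / n rule of thumb as the cutoff. The data and the function names are invented for illustration; the last point is deliberately both far along the x-axis and far from the trend of the others.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def cooks_distance(x, y):
    """Cook's distance of every observation in a one-predictor regression."""
    n, p = len(x), 2  # p = fitted coefficients (intercept + slope)
    intercept, slope = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    hats = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e ** 2 / (p * mse) * h / (1 - h) ** 2
            for e, h in zip(residuals, hats)]

# Made-up data: nine points near y = 2x, then one point that breaks the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 18.0, 20.0]
d = cooks_distance(x, y)
cutoff = 4 / len(x)  # assumed rule-of-thumb threshold
flagged = [i for i, di in enumerate(d) if di > cutoff]
print(flagged)
```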
What is the biggest problem with having collinear data?
It can massively inflate the variance of the coefficient estimates.
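One way to put a number on this inflation is the variance inflation factor (VIF), a term not used on the card itself. A minimal sketch for the two-predictor case, where VIF = 1 / (1 - r^2) and r is the correlation between the predictors; the data and function names are made up.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]  # made up: almost exactly 2 * x1
x3 = [1, -2, 2, -2, 1]           # made up: uncorrelated with x1

print(vif_two_predictors(x1, x2))  # very large: collinear predictors
print(vif_two_predictors(x1, x3))  # about 1: no inflation
```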
Why isn't maximising R^2 the best way to choose our model?
Because models with more predictors will always explain at least as much variance, so R^2 never goes down when a predictor is added. Some models that are too complex will overfit.
What are two ways to penalise models for additional parameters?
Adjusted R-squared.
AIC and BIC
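A short sketch of the three penalties using their textbook formulas. The function names and the example numbers are hypothetical; `k` is the number of predictors, `n_params` the number of estimated parameters, and `log_likelihood` the maximised log-likelihood of the fitted model.

```python
import math

def adjusted_r2(r2, n, k):
    """R^2 shrunk according to the number of predictors k and sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n):
    """Bayesian information criterion: penalty per parameter grows with log(n)."""
    return n_params * math.log(n) - 2 * log_likelihood

# Made-up comparison: seven extra predictors raise R^2 only from 0.80 to 0.81.
small = adjusted_r2(0.80, n=50, k=3)
large = adjusted_r2(0.81, n=50, k=10)
print(round(small, 3), round(large, 3))  # the larger model scores lower
```

For the same log-likelihood, BIC penalises each extra parameter more heavily than AIC whenever n is above about 7, because log(n) then exceeds 2.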