Week 10 day 1 Flashcards
What are the assumptions made when doing regression?
- Linearity.
- Normality of residuals (both ordinary and standardised residuals).
- No high-influence points.
- No excessive collinearity between predictors.
What is the linearity assumption made in regression analysis?
One assumption made in regression analysis is linearity of residuals. Essentially, this means that the residuals should have a similar range across the fitted values, if the relationship between the predictor and outcome variables is in fact linear.
We want to be sure that when we are trying to find a linear regression model that we are actually looking at linear relationship, and not just creating a line of best fit and superimposing it on a non-linear relationship.
What functions can you use to determine linearity in R?
plot(model, which=1) - this gives a plot of the residuals against the fitted (predicted) values.
You can also plot the data and fitted data manually.
You can then assess whether the residuals are evenly scattered around zero, that is, whether there is a similar range of residuals at each fitted value.
To quantitatively measure this, you can use a Tukey test, which can be done using the command residualPlots(model) from the car package.
This gives us a p-value for each predictor individually, as well as the model as a whole. If the p-value is <.05, then the linearity of residuals is violated.
If the Tukey test for the whole model is p>.05 but a single predictor's p-value is <.05, then you can still assume that the linearity of residuals assumption has been met - according to Dani.
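As a sketch of the linearity checks above (the data here are simulated purely for illustration, and the car package is assumed to be installed):

```r
# Simulate a small data set and fit a linear model (toy example)
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 2 + 1.5 * dat$x1 - 0.5 * dat$x2 + rnorm(100)
model <- lm(y ~ x1 + x2, data = dat)

# Residuals vs fitted values: look for an even band of residuals around zero
plot(model, which = 1)

# Tukey test for non-linearity (car package); p < .05 suggests a violation
library(car)
residualPlots(model)
```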
What can you do if the residuals are not linear?
You can apply a log transformation to the data, or use another type of regression or test.
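A minimal sketch of the log-transform idea, using simulated data with a deliberately non-linear (exponential) relationship; the variable names are hypothetical:

```r
# Simulate an exponential relationship between x and y
set.seed(2)
x <- runif(100, 1, 10)
y <- exp(0.3 * x + rnorm(100, sd = 0.2))

model_raw <- lm(y ~ x)        # linearity of residuals likely violated
model_log <- lm(log(y) ~ x)   # the relationship is linear on the log scale
```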
Another assumption of a regression is the normality of ordinary residuals and standardised residuals.
What does this mean?
Ordinary residuals are the difference between the actual data and the line of best fit.
Standardised residuals are the z-scores of the ordinary residuals.
Why do we care more about the normality of standardised residuals when checking assumptions for a regression?
Standardised residuals are important as they allow us to measure the outcome and predictor variables on the same scale, since z-scores tell us how far a given value/residual is from the mean in terms of standard deviations.
What is a way we can test for the normality of standardised residuals in R?
We can look at a QQ plot and a Shapiro-Wilk test of the standardised residuals.
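A sketch of both checks on a toy model (simulated data for illustration only):

```r
# Fit a toy model
set.seed(3)
dat <- data.frame(x = rnorm(100))
dat$y <- 1 + 2 * dat$x + rnorm(100)
model <- lm(y ~ x, data = dat)

# Standardised residuals of the fitted model
z_resid <- rstandard(model)

# Visual check: QQ plot (plot(model, which = 2) gives the same picture)
qqnorm(z_resid); qqline(z_resid)

# Formal check: Shapiro-Wilk test; p < .05 suggests non-normality
shapiro.test(z_resid)
```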
What is a high-influence point in regression?
High influence points are data points whose inclusion or exclusion substantially changes the regression model.
What is the difference between an outlier, a high leverage point, and a high influence point in regression analysis?
An outlier is a data point with a very large residual, that is, it is very far away from the predicted regression line. Including or excluding this point, however, does not have a large effect on the regression model.
A high leverage point is a data point with an unusual predictor value; it does not have a large residual, but its position gives it the potential to influence the slope of the regression.
A high influence point is a high leverage outlier. That is, it has a large residual and substantially influences the slope/model. Including such a point produces a significantly different model than excluding it.
How are high influence points identified in R?
By looking at Cook’s distance for data points.
What does Cook’s distance measure? What does a high Cook’s distance imply?
Whether a data point is high influence or not. Cook’s distance is a combined measure of a point’s residual (outlier-ness) and its leverage (the influence the point has on the model).
What is the Cook’s distance threshold for this subject?
2k/N, where N is the number of data points and k is the number of coefficients (i.e. the intercept and the slopes for the predictor/s).
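A sketch of computing Cook’s distance and applying the 2k/N threshold (simulated data for illustration only):

```r
# Fit a toy model with two predictors
set.seed(4)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 1 + dat$x1 + dat$x2 + rnorm(50)
model <- lm(y ~ x1 + x2, data = dat)

# Cook's distance for every observation
d <- cooks.distance(model)

# Threshold used in this subject: 2k/N
k <- length(coef(model))   # intercept + slopes (3 here)
N <- nobs(model)           # number of data points
threshold <- 2 * k / N

which(d > threshold)       # indices of potential high-influence points
```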
What is collinearity in regression analysis?
Collinearity refers to whether predictor variables are correlated with each other.
If predictor variables are highly correlated with each other, then their estimated coefficients in a regression that includes both are unreliable; we cannot estimate these coefficients reliably.
Why is it important to check for high influence points?
It is important because we don’t want to draw conclusions from a model that would be substantially different if that point had not been included.
How do we measure collinearity?
Using a VIF, or Variance Inflation Factor.
A VIF measures how much the correlation between predictor variables is influencing the confidence interval around the coefficients.
A high VIF means that there is substantial correlation between the predictor variables, which inflates the confidence intervals around the coefficients.
A VIF of 1 means the predictor variables are not correlated.
VIFs of 3-4 or above are of concern in general.
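A sketch of a VIF check, using simulated predictors that are deliberately collinear (the car package is assumed to be installed):

```r
library(car)

# Simulate two highly correlated predictors
set.seed(5)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # nearly a copy of x1
y  <- 1 + x1 + rnorm(100)
model <- lm(y ~ x1 + x2)

# VIF of 1 means uncorrelated predictors; 3-4 or above is of concern
vif(model)
```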