Week 10 day 1 Flashcards
What are the assumptions made when doing regression?
- Linearity.
- Normality of residuals (both ordinary and standardised residuals).
- No high-influence points.
- No excessive collinearity between predictors.
What is the linearity assumption made in regression analysis?
One assumption made in regression analysis is linearity of residuals. Essentially, this means that the residuals should have a similar range across the fitted values, if the relationship between the predictor and outcome variables is in fact linear.
We want to be sure that when we are trying to find a linear regression model that we are actually looking at linear relationship, and not just creating a line of best fit and superimposing it on a non-linear relationship.
What functions can you use to determine linearity in R?
plot(model, which=1) - this gives a plot of the residuals against the fitted (predicted) values.
You can also plot the data and fitted data manually.
You can then assess whether the residuals are evenly scattered around zero, that is, whether there is a similar range of residuals at each fitted value.
To quantitatively measure this, you can use a Tukey test, which can be done using the command residualPlots(model) from the car package.
This gives us a p-value for each predictor individually, as well as the model as a whole. If the p-value is <.05, then the linearity of residuals is violated.
If the Tukey test for the whole model is p>.05 but a single predictor's p-value is <.05, then you can still assume that the linearity of residuals assumption has been met - according to Dani.
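As a sketch of the linearity checks above (the data here are simulated purely for illustration, and the car package is assumed to be installed):

```r
# Simulate a small data set and fit a linear model (toy example)
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 2 + 1.5 * dat$x1 - 0.5 * dat$x2 + rnorm(100)
model <- lm(y ~ x1 + x2, data = dat)

# Residuals vs fitted values: look for an even band of residuals around zero
plot(model, which = 1)

# Tukey test for non-linearity (car package); p < .05 suggests a violation
library(car)
residualPlots(model)
```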
What can you do if the residuals are not linear?
You can apply a log transformation to the data, or use another type of regression or test.
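A minimal sketch of the log-transform idea, using simulated data with a deliberately non-linear (exponential) relationship; the variable names are hypothetical:

```r
# Simulate an exponential relationship between x and y
set.seed(2)
x <- runif(100, 1, 10)
y <- exp(0.3 * x + rnorm(100, sd = 0.2))

model_raw <- lm(y ~ x)        # linearity of residuals likely violated
model_log <- lm(log(y) ~ x)   # the relationship is linear on the log scale
```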
Another assumption of a regression is the normality of ordinary residuals and standardised residuals.
What does this mean?
Ordinary residuals are the difference between the actual data and the line of best fit.
Standardised residuals are the z-scores of the ordinary residuals.
Why do we care more about the normality of standardised residuals when checking assumptions for a regression?
Standardised residuals are important as they allow us to measure the outcome and predictor variables on the same scale, since z-scores tell us how far a given value/residual is from the mean in terms of standard deviations.
What is a way we can test for the normality of standardised residuals in R?
We can look at a QQ plot and a Shapiro-Wilk test of the standardised residuals.
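A sketch of both checks on a toy model (simulated data for illustration only):

```r
# Fit a toy model
set.seed(3)
dat <- data.frame(x = rnorm(100))
dat$y <- 1 + 2 * dat$x + rnorm(100)
model <- lm(y ~ x, data = dat)

# Standardised residuals of the fitted model
z_resid <- rstandard(model)

# Visual check: QQ plot (plot(model, which = 2) gives the same picture)
qqnorm(z_resid); qqline(z_resid)

# Formal check: Shapiro-Wilk test; p < .05 suggests non-normality
shapiro.test(z_resid)
```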
What is a high-influence point in regression?
High influence points are data points whose inclusion or exclusion substantially changes the regression model.
What is the difference between an outlier, a high leverage point, and a high influence point in regression analysis?
An outlier is a data point with a very large residual, that is, it is very far away from the predicted regression line. Including or excluding this point, however, does not have a large effect on the regression model.
A high leverage point is a data point with an unusual predictor value; it does not have a large residual, but its position gives it the potential to influence the slope of the regression.
A high influence point is a high leverage outlier. That is, it has a large residual and substantially influences the slope/model. Including such a point produces a significantly different model than excluding it.
How are high influence points identified in R?
By looking at Cook’s distance for data points.
What does Cook’s distance measure? What does a high Cook’s distance imply?
Whether a data point is high influence or not. Cook’s distance is a combined measure of a point’s residual (outlier-ness) and its leverage (the influence the point has on the model).
What is the Cook’s distance threshold for this subject?
2k/N, where N is the number of data points and k is the number of coefficients (i.e. the intercept and the slopes for the predictor/s).
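A sketch of computing Cook’s distance and applying the 2k/N threshold (simulated data for illustration only):

```r
# Fit a toy model with two predictors
set.seed(4)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 1 + dat$x1 + dat$x2 + rnorm(50)
model <- lm(y ~ x1 + x2, data = dat)

# Cook's distance for every observation
d <- cooks.distance(model)

# Threshold used in this subject: 2k/N
k <- length(coef(model))   # intercept + slopes (3 here)
N <- nobs(model)           # number of data points
threshold <- 2 * k / N

which(d > threshold)       # indices of potential high-influence points
```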
What is collinearity in regression analysis?
Collinearity refers to whether predictor variables are correlated with each other.
If predictor variables are highly correlated with each other, then their estimated coefficients in a regression that includes both are unreliable; we cannot estimate these coefficients reliably.
Why is it important to check for high influence points?
It is important because we don’t want to draw conclusions from a model that would be substantially different if that point had not been included.
How do we measure collinearity?
Using a VIF, or Variance Inflation Factor.
A VIF measures how much the correlation between predictor variables is influencing the confidence interval around the coefficients.
A high VIF means that there is substantial correlation between the predictor variables, which inflates the confidence intervals around the coefficients.
A VIF of 1 means the predictor variables are not correlated.
VIFs of 3-4 or above are of concern in general.
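A sketch of a VIF check, using simulated predictors that are deliberately collinear (the car package is assumed to be installed):

```r
library(car)

# Simulate two highly correlated predictors
set.seed(5)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # nearly a copy of x1
y  <- 1 + x1 + rnorm(100)
model <- lm(y ~ x1 + x2)

# VIF of 1 means uncorrelated predictors; 3-4 or above is of concern
vif(model)
```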