Week 9: Assumptions of Multivariable Linear Regression Flashcards
What is the outcome of linear regression?
Outcome is always continuous
What types of variables can the explanatory variables be in linear regression?
Continuous or categorical
How is a continuous explanatory variable interpreted?
As the explanatory variable increases by one unit, the outcome changes by the value of the coefficient
How is a categorical explanatory variable interpreted?
The coefficient is the difference in the outcome between the category of interest and the reference category
What are the assumptions of linear regression?
- Normality of residuals
- Linear relationship between outcome and explanatory variables
- Constant variance (homoskedasticity) of the outcome across values of the explanatory variables
- Data independence
What is a residual in regression?
The difference between the observed value of the outcome and the value predicted by the model (observed minus fitted)
How can residuals be checked for normality?
By plotting kernel density (kdensity resid, normal) or using pnorm and qnorm plots
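A minimal Stata sketch of these checks, assuming the model is already fitted and using hypothetical variable names (outcome, age):
  * fit the model and store the residuals
  regress outcome age
  predict resid, resid
  * kernel density of the residuals with a normal curve overlaid
  kdensity resid, normal
  * probability plots of the residuals
  pnorm resid
  qnorm resid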
What indicates heteroskedasticity in residual plots?
A fan or funnel shape in residuals vs. fitted values indicates unequal variance
How do you check the linearity and equal variance assumptions in regression?
Plot residuals against fitted values; there should be no clear pattern or funnel shape
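In Stata a quick way to do this after regress is rvfplot; a minimal sketch with hypothetical variable names:
  regress outcome age
  * residuals vs fitted values, with a reference line at zero
  rvfplot, yline(0)
  * optional formal check of constant variance (Breusch-Pagan test)
  estat hettest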
What should you do if the linearity assumption is violated?
Consider adding a quadratic term or categorising the explanatory variable
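A sketch of both options in Stata, using hypothetical names (outcome, age):
  * add a quadratic term using factor-variable notation (age and age squared)
  regress outcome c.age##c.age
  * or categorise the explanatory variable (e.g. into quartiles) and enter it as a factor
  xtile age_grp = age, nq(4)
  regress outcome i.age_grp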
How can you verify the independence assumption?
Check the study design: each observation should come from a different individual measured at a single time point (no repeated measures or clustering)
What can you do if multiple assumptions are violated (e.g., non-normality, non-linearity, heteroskedasticity)?
Transform the outcome variable; a single transformation can often address all of these issues at once, but interpretation of the coefficients becomes more complex
What transformations are commonly used for improving normality?
Logarithmic, square root, inverse, or power transformations
What is a limitation of logarithmic transformation?
It cannot be used with variables that contain zero, unless a small constant (e.g., 0.1) is added
How do transformations affect regression analysis?
Coefficients and standard errors are estimated on the transformed scale, so results are not directly comparable with those from the untransformed model
What is the interpretation of coefficients after a log transformation of the outcome variable?
Coefficients approximate the proportional (percentage) change in the outcome for a one-unit increase in the explanatory variable; the approximation holds best for small coefficients
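A hedged Stata sketch with a hypothetical right-skewed outcome (cost) and explanatory variable (age):
  * natural log of the outcome (add a small constant first if it contains zeros)
  gen log_cost = ln(cost)
  regress log_cost age
  * a coefficient b on the log scale multiplies the outcome by exp(b) per one-unit increase;
  * for small b this is roughly a 100*b percent change
  display exp(_b[age])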
What are outliers in regression analysis?
Extreme values of the outcome variable with large residuals (positive or negative)
What is leverage in regression analysis?
Observations with extreme values of the explanatory variable that can influence regression coefficients
How can leverage points affect regression results?
They can pull the regression line toward them, distorting the fit
What is collinearity in regression?
When explanatory variables are highly correlated, making it difficult to include both in the model
What should you do if collinearity is present?
Include only one of the highly correlated variables in the model
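One way to spot collinearity in Stata is to look at pairwise correlations and variance inflation factors; a minimal sketch with hypothetical variables (weight, bmi):
  * pairwise correlations between candidate explanatory variables
  pwcorr weight bmi, sig
  * variance inflation factors after fitting the model
  regress outcome weight bmi
  estat vif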
Why is it better to keep a continuous outcome variable as it is?
It allows more explanatory variables to be included in the model and improves statistical power
What is a rule of thumb for including explanatory variables?
One explanatory variable for every 10 observations; categorical variables count as the number of categories minus one (e.g., a variable with 4 categories counts as 3 variables)
What are common methods for model building?
Include all variables, manual backwards selection, or automated selection (forwards, backwards, or stepwise)
What is manual backwards selection?
Start with all variables, remove the least significant, and refit the model until all remaining variables meet the significance threshold
What are automated methods of model selection?
Backward selection, forward selection, and stepwise selection using predefined criteria for variable inclusion
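In Stata these are available through the stepwise prefix; a sketch, assuming a 0.05 significance threshold and hypothetical continuous variables x1-x3 (stepwise does not accept factor-variable notation):
  * backward selection: start with all variables, remove those with p >= 0.05
  stepwise, pr(0.05): regress outcome x1 x2 x3
  * forward selection: start with none, add variables with p < 0.05
  stepwise, pe(0.05): regress outcome x1 x2 x3
  * combined (stepwise) selection: pr() must be larger than pe()
  stepwise, pr(0.05) pe(0.049): regress outcome x1 x2 x3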
What are potential issues with automated model selection?
Different methods may lead to different final models, and some variable categories may be excluded
What pattern should residuals follow in a residual vs fitted plot?
Residuals should sit evenly around 0 with no clear pattern
What does a residual vs fitted plot reveal about non-linearity?
Residuals that are systematically positive for moderate fitted values and negative for small and large fitted values (or vice versa) indicate a curved relationship
What happens to the assumptions as the dataset gets larger?
They become less critical, particularly normality: with a large sample the central limit theorem means the coefficient estimates are approximately normally distributed even if the residuals are not
What are limitations of dichotomising a continuous variable?
It reduces statistical power and limits the ability to include explanatory variables
What is heteroskedasticity?
Unequal variance of residuals across values of the explanatory variable
How do you identify outliers and leverage points?
Use scatter plots, stem and leaf plots, or examine residual and qnorm plots
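After regress in Stata, a leverage-versus-residual-squared plot is another option; a minimal sketch with hypothetical names:
  regress outcome age
  * points high up have high leverage; points far to the right have large residuals (outliers)
  lvr2plot
  * a simple scatter plot also shows extreme values directly
  scatter outcome age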
How can we get the residuals for each person in Stata?
predict chosen_varname, resid
What’s the difference between pnorm and qnorm plots?
- pnorm is sensitive to non-normality in the middle range of the data (where most of the data usually lie)
- qnorm is sensitive to non-normality at the extremes (tails) of the data, where there are typically fewer observations; departures from normality usually show most clearly on this plot
How do we check for normality on a residual vs fitted plot?
Residuals should be densest around 0, thinning out symmetrically above and below
If they are densest elsewhere, this indicates that the residuals are not normal
What should you always do when checking the linearity assumption?
Plot a scatter plot between outcome and explanatory variable(s) if continuous
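A Stata sketch, overlaying a lowess smoother so any curvature is easier to see (hypothetical names):
  twoway (scatter outcome age) (lowess outcome age)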
If we need to transform variables, what should be transformed first?
The outcome first, then, if necessary, the explanatory variable(s)
If you are including a baseline value of the outcome as an explanatory variable, it makes sense for the two to be on the same scale, so transform them both
On what kind of data do logarithmic transformations work best?
Right (positive) skewed data
How can we address outliers and leverage points?
- First check the data source and whether data have been entered correctly
- If data are entered correctly, do not exclude data without good reason. You may want to analyse the data with and without the outlier(s) and/or leverage points to see how the models compare
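A sketch of that sensitivity analysis in Stata, assuming the suspect observations have been flagged in a hypothetical indicator variable outlier_flag:
  * model on the full data
  regress outcome age
  estimates store full
  * model excluding the flagged observation(s)
  regress outcome age if outlier_flag == 0
  estimates store excl
  * compare coefficients and standard errors side by side
  estimates table full excl, se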
How do you compile an analysis plan?
- Define research question(s) - check that the data you have can answer them; define H0 and H1
- Check data - see how variables are distributed (histograms, check for outliers, tabulations)
- Do analyses by an important categorical variable (if appropriate)
- Consider whether you need to do regression analysis - decide which variables to include and check for multicollinearity
- Model building
- Make sure assumptions for regression analysis have been met
What analytical methods can be used for two categorical variables?
Chi-square test if assumptions met; Fisher's exact test if assumptions not met
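In Stata both are available as options on tabulate; a minimal sketch with hypothetical categorical variables:
  tabulate exposure outcome_cat, chi2 exact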
What analytical methods can be used for a categorical and a continuous variable?
- Two samples t-test if two categories and assumptions met
- One-way ANOVA if > 2 categories and assumptions met
- Non-parametric equivalents if assumptions not met
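Corresponding Stata sketches, with hypothetical variable names:
  * two-sample t-test (binary group, assumptions met)
  ttest outcome, by(group2)
  * one-way ANOVA (more than two categories, assumptions met)
  oneway outcome group_many
  * non-parametric equivalents (Wilcoxon rank-sum and Kruskal-Wallis)
  ranksum outcome, by(group2)
  kwallis outcome, by(group_many)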
What is forwards selection?
Starting with no variables in the model and adding variables one by one according to pre-specified criteria
What is stepwise selection?
Using a combination of forwards and backwards selection, with pre-specified criteria for including and excluding variables