Week 9: Assumptions of Multivariable Linear Regression Flashcards
How do you check the distribution of continuous variables?
<hist varname, freq normal> to create a histogram and overlay a normal distribution
What do we call a coefficient when the dependent variable decreases as the independent variable increases?
Negative coefficient
What does the coefficient represent?
The change in the dependent variable for a one-unit change in the predictor, holding other variables constant
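A minimal sketch of reading a coefficient this way, assuming Stata's bundled auto dataset and the variables price, mpg and weight (chosen here only for illustration, they are not part of the flashcards):

```stata
* Load the example dataset that ships with Stata (assumption for illustration)
sysuse auto, clear

* Fit a multivariable linear regression
regress price mpg weight

* _b[mpg] is the estimated change in price for a one-unit increase in mpg,
* holding weight constant
display "Change in price per extra mpg, weight held constant: " _b[mpg]

* Report the sign of the coefficient
display cond(_b[mpg] < 0, "negative coefficient", "non-negative coefficient")
```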
What is the significance of residuals in regression analysis?
Residuals measure the difference between observed and predicted values, indicating model fit
How can you test if residuals are normally distributed using a kernel density plot?
<kdensity resid_varname, normal> and overlay a normal curve to check for alignment
How does a pnorm plot help in assessing normality of residuals?
It plots the cumulative distribution of the residuals against that of a normal distribution; closer alignment with the diagonal suggests normality. Pnorm plots are sensitive to deviations from normality in the middle range of the data
What does deviation from the line in a qnorm plot represent?
Deviation from the line indicates non-normality. Qnorm plots are most sensitive at the tails (the extremities), where deviations suggest potential outliers or skewness
Why is normality of residuals important in linear regression?
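Hypothesis tests and confidence intervals for the coefficients assume normally distributed residuals, so normality supports their validity (this matters most in small samples)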
What is the purpose of a residual vs fitted plot?
It checks for patterns in the residuals that indicate violations of the linearity or equal variance assumptions
How can you assess if the linearity and equal variance assumptions are met?
Look for random scatter around zero in a residual vs fitted plot; a systematic pattern (e.g., a curve) suggests non-linearity, while fanning suggests heteroscedasticity
In a residual vs fitted plot, non-linearity is shown by a pattern, whereas unequal variance is shown by a funnel shape
What does the presence of leverage points indicate?
Leverage points are observations with unusual predictor values; they can be influential and disproportionately affect model fit
How can you identify leverage points in regression analysis?
Plot residuals or fitted values against predictors and look for isolated points
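One way to draw such plots is Stata's residual-versus-predictor plot, sketched here under the assumption of the bundled auto dataset and a model of price on mpg and weight (illustrative choices only):

```stata
* Example data and model (assumptions for illustration)
sysuse auto, clear
regress price mpg weight

* Residuals against each predictor in turn; points sitting far from
* the rest along the x-axis are candidate leverage points
rvpplot weight, yline(0) msymbol(Oh)
rvpplot mpg, yline(0) msymbol(Oh)
```

Each rvpplot call uses the most recent regression, so it must follow the regress command it refers to.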
What does multicollinearity indicate in a regression model?
High correlation between predictors can distort coefficients, making them unreliable
How can multicollinearity be detected?
Calculate correlation coefficients between predictors; values near +/- 1 indicate multicollinearity. Use the command <correlate varlist> (abbreviated <cor>)
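A short sketch of this check, assuming the bundled auto dataset and three candidate predictors (mpg, weight, length) picked only for illustration:

```stata
* Example data (assumption for illustration)
sysuse auto, clear

* Correlation matrix of the candidate predictors;
* look for off-diagonal values close to +1 or -1
correlate mpg weight length

* Pairwise correlations with significance levels
pwcorr mpg weight length, sig
```

In this dataset weight and length are strongly correlated, so entering both in one model may make their individual coefficients hard to interpret.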
Why might adding two highly correlated predictors distort regression results?
The shared variance between predictors reduces the model’s ability to isolate individual effects
What does recoding or transforming a variable achieve in regression analysis?
It can improve the fit by correcting for skewness or non-linearity, or by addressing data inaccuracies (e.g., age grouping)
Why is it important to check for missing data after transformations?
Transformations can create missing values (e.g., the log of zero or of a negative number is missing), which excludes those cases, reducing sample size and potentially biasing results
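A small sketch of how cases can be lost, using a made-up four-row dataset (the values and the names x and log_x are hypothetical):

```stata
* Tiny made-up dataset containing a zero and a negative value
clear
input x
0
1
10
-5
end

* The log of zero or of a negative number is missing in Stata
generate log_x = log(x)

* Count cases that were non-missing before but are missing after
count if missing(log_x) & !missing(x)
display "Cases lost to the transformation: " r(N)
```

Here two of the four cases (0 and -5) become missing, so any regression using log_x would silently drop them.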
How can you improve the normality of residuals?
Apply transformations (e.g., log, square root) to the dependent variable to reduce skewness
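As a rough sketch, a log-transformed outcome can be refitted and its residuals re-examined. The auto dataset and the variable names below are assumptions for illustration:

```stata
* Example data (assumption for illustration)
sysuse auto, clear

* Log-transform the dependent variable
generate log_price = log(price)

* Refit the model on the transformed outcome and save its residuals
regress log_price mpg weight
predict double resid_log, residuals

* Re-check normality of the new residuals
kdensity resid_log, normal
qnorm resid_log
```

Coefficients from this model describe changes in log(price), so their interpretation changes along with the scale.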
What does a funnel shape in a residual plot suggest?
It indicates heteroscedasticity - variance of residuals increases with fitted values
What is the implication of non-linearity in residual plots?
Non-linearity suggests that the relationship between predictors and outcome may not be adequately captured by the model
Why might adding interaction terms improve model fit?
Interactions account for cases where the effect of one predictor depends on the level of another predictor
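A minimal sketch of an interaction using Stata's factor-variable notation, assuming the auto dataset with weight (continuous) and foreign (binary) as illustrative predictors:

```stata
* Example data (assumption for illustration)
sysuse auto, clear

* c. marks a continuous variable, i. a categorical one;
* ## includes both main effects and their interaction
regress price c.weight##i.foreign

* The interaction coefficient: how much the slope of weight
* differs for foreign cars compared with domestic cars
display "Difference in weight slope (foreign vs domestic): " _b[1.foreign#c.weight]
```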
What does a constant (_cons) represent?
It is the predicted value of the dependent variable when all predictors are zero
How can transformations improve model assumptions?
They can stabilise variance, reduce skewness, and make relationships more linear
What does a high peak in a histogram indicate about the distribution?
It may suggest a large concentration of data around a specific value, potentially indicating skewness or rounding
Why is assessing the distribution of predictors and outcomes crucial in regression?
Non-normality or outliers can lead to biased estimates and affect the validity of the model
How does excluding extreme observations affect model fit?
It can reduce leverage effects and improve stability, but may also remove meaningful data
What is the purpose of fitting multiple models with different predictors?
It helps to identify the best combination of variables and assess the robustness of results
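One possible way to compare several fitted models side by side is to store each set of estimates and tabulate them. The dataset, predictors and the names m1 to m3 are assumptions for illustration:

```stata
* Example data (assumption for illustration)
sysuse auto, clear

* Fit models with progressively more predictors, storing each result
regress price mpg
estimates store m1

regress price mpg weight
estimates store m2

regress price mpg weight foreign
estimates store m3

* Coefficients, standard errors, sample size and (adjusted) R-squared
estimates table m1 m2 m3, b(%9.2f) se stats(N r2 r2_a)
```

Checking that N is the same across columns helps confirm the models were fitted on the same observations.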
Why is it useful to visualise residuals after each regression model?
It allows for continuous assessment of model assumptions and fit
How do you generate residuals for a regression model?
<predict new_varname, resid>
This generates a new variable containing the residuals. Stata will not overwrite an existing variable, so residuals from a later model must be given a new name (or the old variable must be dropped first)
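A short sketch of saving residuals from two successive models, assuming the auto dataset and the hypothetical names resid_m1 and resid_m2:

```stata
* Example data and first model (assumptions for illustration)
sysuse auto, clear
regress price mpg weight
predict double resid_m1, residuals

* Second model: predict cannot reuse the name resid_m1 while it exists,
* so either choose a new name...
regress price mpg weight foreign
predict double resid_m2, residuals

* ...or drop the old variable first and then reuse its name
drop resid_m1
predict double resid_m1, residuals
```

predict always uses the most recently fitted model, so the final resid_m1 holds residuals from the second regression.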
How do you plot a pnorm and qnorm plot to examine the normality of residuals?
<pnorm resid_varname>
<qnorm resid_varname>
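Put together, the normality checks might look like the following sketch. The auto dataset, the model and the variable name resid are assumptions for illustration:

```stata
* Example data and model (assumptions for illustration)
sysuse auto, clear
regress price mpg weight

* Save the residuals from the model just fitted
predict double resid, residuals

* Kernel density of the residuals with a normal curve overlaid
kdensity resid, normal

* Standardised normal probability plot (sensitive in the middle of the data)
pnorm resid

* Quantile-normal plot (sensitive at the tails)
qnorm resid
```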
What command do you use to draw a residual vs fitted value plot to check the residuals?
<rvfplot, yline(0) msymbol(Oh) msize(tiny)>
The <yline> option draws a horizontal line at 0, where the residuals should be densest
The <msymbol> option changes the marker symbol from the default; here it is set to a hollow circle (Oh)
The <msize> option changes the size of the symbol
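Because rvfplot works on the most recent regression, it has to follow a regress command. A minimal sketch, assuming the auto dataset and an illustrative model:

```stata
* Example data and model (assumptions for illustration)
sysuse auto, clear
regress price mpg weight

* Residual vs fitted plot with a reference line at zero,
* tiny hollow-circle markers
rvfplot, yline(0) msymbol(Oh) msize(tiny)
```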
How can we make a variable containing the fitted values and why would we want to do this?
<predict fit_varname if e(sample)>
The fitted values can be plotted on a scatterplot to see where there may be issues (fitted values that lie far away from the others):
<scatter fit_varname varname condition, msymbol(Oh) msize(tiny)> Or
<twoway (scatter fit_varname varname condition, sort) (scatter fit_varname varname condition, sort)>
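A sketch of both plots, assuming the auto dataset, an illustrative model and the hypothetical variable name fit_price:

```stata
* Example data and model (assumptions for illustration)
sysuse auto, clear
regress price mpg weight

* Fitted (predicted) values, restricted to the estimation sample
predict double fit_price if e(sample)

* Fitted values against one predictor
scatter fit_price weight, msymbol(Oh) msize(tiny)

* Observed and fitted values overlaid on the same axes
twoway (scatter price weight, msymbol(Oh)) ///
       (scatter fit_price weight, msymbol(X)), ///
       legend(order(1 "Observed" 2 "Fitted"))
```

The if e(sample) qualifier keeps predictions out of observations that were not used to fit the model.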
How can you check the options for transforming a variable?
<gladder varname>
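A brief sketch, assuming the auto dataset and price as the variable under consideration (illustrative only):

```stata
* Example data (assumption for illustration)
sysuse auto, clear

* Grid of histograms, one per candidate transformation,
* each with a normal curve overlaid
gladder price

* Companion command: normality test for each transformation in the ladder
ladder price
```

The transformation whose histogram looks closest to the normal curve is a reasonable candidate to try first.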
How do you perform a log transformation on a variable?
<gen log_varname=log(varname)>