Regression diagnostics / Logistic regression Flashcards
Why do we need to consider the assumptions for a linear regression?
So that we can rely on the resulting statistics: coefficients, standard errors (SE), etc.
Which regression assumptions are there?
The exact list differs a bit between sources, but roughly:
What happens if you violate regression assumptions?
1) Coefficients become unreliable –> biased
2) SE become unreliable –> any hypothesis test becomes unreliable (including p-value, t-stat, etc.)
Linearity
Assumption: the average outcome is linearly related to each term in the model when holding all others fixed –> technically, the “linear” in “linear regression” refers to the outcome being linear in the parameters, the β’s
Problem: Biased coefficient (true form is curvilinear)
Diagnostic: Component-plus-residual plot (a clear divergence between the smoothed residual line and the straight component line indicates that the predictor does not have a linear relationship with the dependent variable)
Solution: Polynomial, Spline, Collapse into categories
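Example (a minimal sketch, not a definitive recipe): the simulated data, the seed and the helper name `ols` are assumptions made up for illustration. It fits a straight line to curvilinear data, builds the component-plus-residual values for x, and then applies the polynomial solution from the card.

```python
import numpy as np

# Hypothetical simulated data: the true form is curvilinear (quadratic in x).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(0, 0.5, 200)

def ols(X, y):
    """Return OLS coefficients and residuals for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

# Misspecified model: y ~ 1 + x (ignores the curvature).
X_lin = np.column_stack([np.ones_like(x), x])
beta_lin, resid_lin = ols(X_lin, y)

# Component-plus-residual (partial residual) values for x:
# the linear component beta * x plus the residuals. Plotted against x,
# these would show a curve rather than a straight line.
cpr = resid_lin + beta_lin[1] * x

# Numeric stand-in for the plot: leftover curvature in the residuals.
curv_lin = np.corrcoef(resid_lin, x**2)[0, 1]

# Solution from the card: add a polynomial term, y ~ 1 + x + x^2.
X_quad = np.column_stack([np.ones_like(x), x, x**2])
beta_quad, resid_quad = ols(X_quad, y)
curv_quad = np.corrcoef(resid_quad, x**2)[0, 1]

print(f"curvature left in residuals, linear model:    {curv_lin:.2f}")
print(f"curvature left in residuals, quadratic model: {curv_quad:.2f}")
```

In practice one would plot `cpr` against `x` and compare a smoother with the straight component line; the correlation with x² is only a rough numeric substitute for that visual check.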
Homoskedastic / Normally distributed Residuals
Assumption: residuals have constant variance and are normally distributed
Problem: Standard errors are usually incorrect (typically underestimated); influential observations may also be present, which can affect the coefficients as well
Diagnostic: Heteroskedasticity –> Plot residuals (e.g. against fitted values); Normality –> Histograms, Qnorm (normal quantile plot), Studentized residuals plot
Solution: Log-transformation / Power-transformation, Robust SEs, Correct coding errors
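Example (a sketch under stated assumptions): the data are simulated so that the error spread grows with x, and the robust standard errors use the basic HC0 "sandwich" formula, which is only one of several robust variants. All variable names are illustrative.

```python
import numpy as np

# Hypothetical heteroskedastic data: the error SD grows with x.
rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n) * 0.1 * x**2

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical SE: assumes one constant residual variance.
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 robust SE: (X'X)^-1 X' diag(e_i^2) X (X'X)^-1.
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

# Crude numeric stand-in for the residual plot:
# does the size of the residuals grow with x?
spread_trend = np.corrcoef(np.abs(resid), x)[0, 1]

print("slope SE, classical:", se_classical[1])
print("slope SE, robust:   ", se_robust[1])
print("corr(|resid|, x):   ", spread_trend)
```

With this particular simulated design the classical slope SE comes out smaller than the robust one, which is the "underestimated" case from the card; with other patterns of heteroskedasticity the direction can differ.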
No multicollinearity
Assumption: Predictors should not be strongly correlated with each other (cannot occur in SLR, only in MLR)
Problem: “holding constant” is not possible with highly correlated variables –> 1) interpretation becomes very difficult, because the model cannot tell which variable made the difference 2) Loss of precision (inflated standard errors)
Diagnostic: Look at correlations, Variance inflation factor (computed per variable: how much the variance of its coefficient is inflated because the variable is correlated with the other predictors –> the higher the VIF, the more of its information is already contained in the other predictors = high multicollinearity)
Solutions: Get crafty (similar but not collinear variable), Construct an index, Get more data, Mean-center interaction variables
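Example (a minimal sketch): VIF computed by hand as 1 / (1 − R²), where R² comes from regressing each predictor on all the others. The function name `vif` and the simulated predictors are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """VIF per column of X (predictors only, no intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical predictors: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # large values for x1 and x2, a value near 1 for x3
```

A VIF near 1 means the predictor carries information the others do not; the large values for x1 and x2 reflect that each is almost fully predictable from the other.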
Which diagnostic is important to consider apart from regression assumptions?
Influential observations
- pull the regression fit towards themselves –> results (predictions, parameter estimates, CIs, p-values) can be quite different with and without these cases included in the analysis
- do not necessarily violate any regression assumptions, but they can cast doubt on the conclusions drawn from your sample. If a regression model is being used to inform real-life decisions, one would hope those decisions are not overly influenced by just one or a few observations…
What is more important: Coefficient vs SE?
First the estimation, then the SE: a correct SE is of no use if the estimate is biased.
Which one is problematic?
B - unusual x-value (high leverage) combined with a large residual, hence high influence: it pulls the regression line down, and deleting it would change the regression line drastically
Are these two influential observations?
NO
1) large sample
2) no unusual x-value
When to delete an outlier?
When Cook’s D is approximately 1.
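Example (a sketch, assuming simulated data with one planted unusual point): Cook’s D computed from the residuals and the leverages (diagonal of the hat matrix). The function name `cooks_distance` and all data values are illustrative.

```python
import numpy as np

def cooks_distance(X, y):
    """Return Cook's D and leverage for each observation of an OLS fit."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Leverage h_ii: diagonal of the hat matrix X (X'X)^-1 X'.
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    s2 = resid @ resid / (n - p)
    # Cook's D combines residual size and leverage.
    d = resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2
    return d, leverage

# Hypothetical data: 30 well-behaved points on y = 1 + 2x plus noise ...
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 30)
# ... and one point with an unusual x-value that lies far below the line.
x = np.append(x, 25.0)
y = np.append(y, 10.0)

X = np.column_stack([np.ones_like(x), x])
d, h = cooks_distance(X, y)
worst = int(np.argmax(d))
print("most influential index:", worst)
print("its Cook's D:", d[worst], "its leverage:", h[worst])
```

The planted point has both an unusual x-value and a large residual, so its Cook’s D is far above the rough threshold of 1, while the ordinary points stay well below it.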
Not a problem of bias per se but of a lack of data: very little variance, plus effects that are hard to separate, makes it hard to be precise