Regression diagnostics / Logistic regression Flashcards
Why do we need to consider the assumptions for a linear regression?
So that we can rely on the estimated statistics: coefficients, SEs, etc.
Which regression assumptions are there?
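It differs a bit between sources, but roughly: linearity, homoskedastic and normally distributed residuals, no multicollinearity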
What happens if you violate regression assumptions?
1) Coefficients become unreliable –> biased
2) SEs become unreliable –> any hypothesis test becomes unreliable (including p-values, t-statistics, etc.)
Linearity
Assumption: the average outcome is linearly related to each term in the model when holding all others fixed –> technically, the “linear” in “linear regression” refers to the outcome being linear in the parameters, the β’s
Problem: Biased coefficients (e.g. when the true form is curvilinear)
Diagnostic: Component-plus-residual plot (a clear difference between the residual line and the component line indicates that the predictor does not have a linear relationship with the dependent variable)
Solution: Polynomial, Spline, Collapse into categories
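A minimal sketch of the diagnostic and the polynomial fix, assuming NumPy and noise-free synthetic data. The variable names and the quadratic true form are illustrative assumptions, not taken from the cards. The partial residuals of a straight-line fit keep a curved pattern; adding a squared term, which is still linear in the β’s, removes it.

```python
import numpy as np

# Synthetic, noise-free data whose true form is curvilinear (assumed for illustration).
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + 0.5 * x**2


def ols(X, y):
    """Return least-squares coefficients for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Misspecified model: a straight line in x.
X_lin = np.column_stack([np.ones_like(x), x])
b_lin = ols(X_lin, y)
resid_lin = y - X_lin @ b_lin

# Component-plus-residual (partial residual) values for x:
# the fitted component b1 * x plus the residuals.
# Plotted against x, these would bend away from the straight line b1 * x.
cpr = b_lin[1] * x + resid_lin

# Fix: add a squared term. The model is still linear in the parameters.
X_quad = np.column_stack([np.ones_like(x), x, x**2])
b_quad = ols(X_quad, y)
resid_quad = y - X_quad @ b_quad

print("slope from straight-line fit:", b_lin[1])
print("coefficients from quadratic fit:", b_quad)
print("max |residual|, linear vs quadratic:",
      np.abs(resid_lin).max(), np.abs(resid_quad).max())
```

The straight-line slope is far from the true linear coefficient of 2, which is the bias the card refers to; the quadratic fit recovers all three coefficients.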
Homoskedastic / Normally distributed Residuals
Assumption: residuals have constant variance and are normally distributed
Problem: Standard errors are usually incorrect (typically underestimated); influential observations may also be present, which can affect the coefficients as well
Diagnostic: Heteroskedasticity –> plot residuals; Normality –> histograms, qnorm (normal quantile plot), studentized residuals plot
Solution: Log-transformation / Power-transformation, Robust SEs, Correct coding errors
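A rough sketch of the robust-SE remedy, assuming NumPy and a deterministic synthetic data set in which the error size grows with x (both are assumptions for illustration). It compares classical standard errors with the HC0 (White) sandwich estimator, which is one common kind of robust SE and not necessarily the variant your software uses by default.

```python
import numpy as np

# Deterministic synthetic data (assumed): the error magnitude grows with x,
# so the residual variance is not constant.
n = 200
x = np.arange(1.0, n + 1.0)
signs = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
errors = 0.5 * signs * x
y = 1.0 + 2.0 * x + errors

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical SEs assume one common residual variance for all observations.
sigma2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 sandwich estimator: each observation keeps its own squared residual.
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("coefficients:", beta)
print("classical SE:", se_classical)
print("robust SE:   ", se_robust)
```

On this data the classical slope SE is smaller than the robust one, i.e. it understates the uncertainty. The coefficients themselves are identical in both cases; only the SEs change.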
No multicollinearity
Assumption: Predictors should not be highly correlated with each other (cannot occur in SLR, only in MLR)
Problem: “holding constant” is not possible with correlated variables –> 1) interpretation becomes impossible, since the model cannot tell which variable made the difference 2) Loss of precision (inflated standard errors)
Diagnostic: Look at correlations, Variance inflation factor (computed per predictor: how much the variance of its coefficient is inflated because the other predictors already explain it –> the higher the VIF, the more of its information is already contained in the others = high multicollinearity)
Solutions: Get crafty (similar but not collinear variable), Construct an index, Get more data, Mean-center interaction variables
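A small sketch of the VIF diagnostic, assuming NumPy and simulated predictors. The helper name `vif` and the variables `x1`, `x2`, `x3` are hypothetical. Each predictor is regressed on all the others, and VIF_j = 1 / (1 - R_j^2).

```python
import numpy as np


def vif(X):
    """Variance inflation factor per column of X (predictors only, no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)               # unrelated to x1
x3 = x1 + 0.05 * rng.normal(size=500)   # almost a copy of x1

vifs = vif(np.column_stack([x1, x2, x3]))
print("VIF for x1, x2, x3:", vifs)
```

Here `x1` and `x3` get very large VIFs because each is almost fully predicted by the other, while `x2` stays close to 1.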
Which diagnostic is important to consider apart from regression assumptions?
Influential observations
- pull the regression fit towards themselves –> results (predictions, parameter estimates, CIs, p-values) can be quite different with and without these cases included in the analysis
- do not necessarily violate any regression assumptions, but they can cast doubt on the conclusions drawn from your sample. If a regression model is being used to inform real-life decisions, one would hope those decisions are not overly influenced by just one or a few observations…
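To make "with and without these cases" concrete, here is a minimal sketch assuming NumPy and a made-up data set: ten points lying exactly on y = 2x plus one unusual case at x = 20, y = 0. The helper `fit_line` is hypothetical.

```python
import numpy as np

# Ten points exactly on y = 2x, plus one unusual case (assumed data).
x = np.append(np.arange(10.0), 20.0)
y = np.append(2.0 * np.arange(10.0), 0.0)


def fit_line(x, y):
    """Return (intercept, slope) of a simple least-squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return intercept, slope


with_case = fit_line(x, y)
without_case = fit_line(x[:-1], y[:-1])

print("intercept, slope with the case:   ", with_case)
print("intercept, slope without the case:", without_case)
```

A single observation drags the slope from 2 down to about 0.13, so any conclusion drawn from the full-sample fit rests almost entirely on that one case.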
What is more important: the coefficient or the SE?
Estimation first, then the SE: a correct SE is of no use if the estimate is biased.
Which one is problematic?
B - unusual x-value (leverage) and large residual, ergo influential: it pulls the regression line down, and deleting it would change the regression line drastically
Are these two influential observations?
1) large sample
2) no unusual x-value
When to delete an outlier?
Rule of thumb: at a Cook’s D of approx. 1 or more
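A sketch of how Cook’s D can be computed by hand, assuming NumPy and the same kind of made-up data (ten points on y = 2x plus one case at x = 20, y = 0). It uses the standard formula D_i = e_i^2 / (p s^2) * h_i / (1 - h_i)^2, where h_i is the leverage and p the number of coefficients.

```python
import numpy as np

# Assumed data: ten points on y = 2x plus one unusual case at the end.
x = np.append(np.arange(10.0), 20.0)
y = np.append(2.0 * np.arange(10.0), 0.0)

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Leverage: diagonal of the hat matrix X (X'X)^-1 X'.
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Residual variance estimate and Cook's distance per observation.
s2 = resid @ resid / (n - p)
cooks_d = resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

for i, (h, d) in enumerate(zip(leverage, cooks_d)):
    flag = "  <-- above 1" if d > 1 else ""
    print(f"obs {i:2d}  leverage={h:.3f}  cook_d={d:.3f}{flag}")
```

Only the last observation exceeds the threshold of 1: it combines an unusual x-value with a large residual. A high Cook’s D flags a case for inspection; whether to delete it still depends on whether it is an error or a legitimate observation.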
Not a problem of bias per se but a lack of data: very little variance and hard-to-separate effects make it hard to be precise