Regression diagnostics / Logistic regression Flashcards
Why do we need to consider the assumptions for a linear regression?
So we can rely on the estimated statistics: coefficients, standard errors (SE), etc.
Which regression assumptions are there?
Lists differ a bit, but roughly: linearity, homoskedastic & normally distributed residuals, and no multicollinearity.
What happens if you violate regression assumptions?
1) Coefficients become unreliable –> biased
2) SEs become unreliable –> any hypothesis test becomes unreliable (including p-value, t-stat, etc.)
Linearity
Assumption: the average outcome is linearly related to each term in the model when holding all others fixed –> & technically, the “linear” in “linear regression” refers to the outcome being linear in the parameters, the β’s
Problem: Biased coefficient (true form is curvilinear)
Diagnostic: Component-plus-residual plot (A significant difference between the residual line and the component line indicates that the predictor does not have a linear relationship with the dependent variable)
Solution: Polynomial, Spline, Collapse into categories
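A minimal sketch (with assumed toy data, not from the course) of why the polynomial solution works: when the true relationship is curvilinear, adding a squared term to the design matrix drops the residual sum of squares sharply.

```python
import numpy as np

# Assumed toy data: the true relationship is quadratic (curvilinear)
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=300)
y = 1 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.3, size=300)

# Misspecified linear design vs. the same design with a squared term added
X_lin = np.column_stack([np.ones_like(x), x])
X_quad = np.column_stack([np.ones_like(x), x, x**2])

rss = {}
for name, X in [("linear", X_lin), ("quadratic", X_quad)]:
    beta, res, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss[name] = res[0]  # residual sum of squares

print(rss)  # the quadratic fit has a much smaller RSS
```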
Homoskedastic / Normally distributed Residuals
Assumption: constant & normal variance of residuals
Problem: Standard errors are usually incorrect (typically underestimated); influential observations may also be present and can affect the coefficients
Diagnostic: Heteroskedasticity –> residuals-vs-fitted plot; Normality –> histograms, Q-Q plot (qnorm), studentized-residuals plot
Solution: Log-transformation / power transformation, robust SEs, correct coding errors
No multicollinearity
Assumption: Predictors should be independent of each other, or only weakly correlated (not an issue in SLR, only in MLR)
Problem: “holding constant” is not possible with highly correlated variables –> 1) interpretation becomes impossible, since the model cannot tell which variable made the difference 2) loss of precision (inflated standard errors)
Diagnostic: Look at correlations; variance inflation factor (assesses, for each variable, how much its coefficient’s variance is inflated by correlation with the other predictors –> the higher the VIF, the more of its information is already contained in the others = high multicollinearity)
Solutions: Get crafty (similar but not collinear variable), Construct an index, Get more data, Mean-center interaction variables
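The VIF diagnostic can be sketched by hand (assumed simulated predictors, not course data): regress each predictor on all the others and compute 1 / (1 − R²).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
    column j on all the other columns (plus an intercept).
    """
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

# Assumed example: x1 and x2 are nearly collinear, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
print(vif(X))  # large VIFs for x1 and x2, near 1 for x3
```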
Which diagnostic is important to consider apart from regression assumptions?
Influential observations
- pull the regression fit towards themselves –> results (predictions, parameter estimates, CIs, p-values) can be quite different with and without these cases included in the analysis
- They do not necessarily violate any regression assumption, but they can cast doubt on the conclusions drawn from your sample. If a regression model is being used to inform real-life decisions, one would hope those decisions are not overly influenced by just one or a few observations…
What is more important: Coefficient vs SE?
Estimation comes first, then the SE: a correct SE is of no use if the estimate itself is biased.
Which one is problematic?
B: an unusual x-value combined with a large residual gives it leverage, pulling the regression line down; deleting it would change the regression line drastically.
Are these two influential observations?
NO
1) large sample
2) no unusual x-value
When to delete an outlier?
a) A hard decision to make; a rough rule of thumb is a Cook’s D of approximately 1
b) Not a problem of bias per se but of a lack of data: very little variance, plus effects that are hard to separate, make it hard to be precise
Potential problems, diagnostics, potential solutions: Influential observations
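Cook's D can be computed directly from the residuals and the leverages (diagonal of the hat matrix). A sketch with assumed simulated data and one deliberately planted influential point:

```python
import numpy as np

# Assumed data: a clean linear relationship plus one planted outlier
rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 2 + 1.5 * x + rng.normal(scale=0.5, size=50)
x[0], y[0] = 4.0, -5.0  # unusual x-value AND large residual -> influential

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
p = X.shape[1]                                # number of parameters
s2 = resid @ resid / (len(y) - p)             # residual variance estimate
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
h = np.diag(H)                                # leverages

# Cook's D: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
cooks_d = resid**2 / (p * s2) * h / (1 - h) ** 2

print(cooks_d[0], cooks_d[1:].max())  # planted point dwarfs the rest
```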
Which is the link function?
OLS vs Logistic Regression
Linear = Identity link (just the mean –> model of the mean)
Logistic = Logit link (log odds of mean –> model of the log odds /also: logit)
Which distribution?
OLS vs Logistic Regression
Linear = Gaussian (continuous)
Logistic = Binomial (discrete)
What are the problems with an OLS model when it comes to binary outcomes
1) Binary outcome –> we want to model a probability, but a linear model can give negative values (and values above 1), and negative probabilities have no meaning
2) Unrealistic assumption of constant effects
3) The normal-residuals assumption is violated
Interpret OLS model
A 1-unit increase on the trust scale decreases the mean of AFD vote by 0.05 percentage points
What does the logistic regression model?
The logit
How does ML work?
Maximum likelihood iterates over candidate parameter values until it finds the set that makes the observed data most likely.
How to interpret the Log Odds?
Very hard: log odds are not intuitive.
How to interpret the Odds Ratio?
1 –> no difference; above 1 –> higher odds; below 1 –> lower odds
Limitation of Odds
We can only say what the odds are and how the odds change, BUT NOT how likely something actually is (probabilities)
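This limitation can be made concrete (with an assumed illustrative odds ratio of 2): the same odds ratio implies very different probability changes depending on the baseline probability.

```python
def odds(p):
    """Probability -> odds."""
    return p / (1 - p)

def prob(o):
    """Odds -> probability."""
    return o / (1 + o)

OR = 2.0  # illustrative odds ratio: "the odds double"
for base in (0.01, 0.50, 0.90):
    new = prob(odds(base) * OR)
    print(f"baseline p = {base:.2f} -> new p = {new:.3f}")
# Doubling the odds moves p = 0.50 to ~0.667, but p = 0.01 only to ~0.020
```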
How to get back to probabilities?
Apply the inverse logit: exponentiate the linear predictor, p = e^(Xβ) / (1 + e^(Xβ))
Example:
plug in 0 for men –> constant = 10%
plug in 1 for women –> constant - coefficient = 5%
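The example above can be reproduced with the inverse logit; the intercept and coefficient here are illustrative values chosen so the flashcard's 10% / 5% numbers come out.

```python
import math

def inv_logit(z):
    """Map a log odds value back to a probability: p = e^z / (1 + e^z)."""
    return 1 / (1 + math.exp(-z))

# Illustrative coefficients (assumed, chosen to match the example):
b0 = math.log(0.10 / 0.90)       # intercept: log odds for men (x = 0)
b1 = math.log(0.05 / 0.95) - b0  # coefficient for women (x = 1), negative

print(round(inv_logit(b0), 2))       # men:   0.1
print(round(inv_logit(b0 + b1), 2))  # women: 0.05
```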
What is the average marginal effect?
Averages the marginal effects of the explanatory variable across all observations.
How do the average marginal effect and linear regression relate to each other?
Very similar: the linear regression (linear probability model) coefficient approximates the AME.
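A sketch of that relationship under assumed simulated data: compute the AME on the probability scale from known logit coefficients (b1 · p · (1 − p), averaged over observations) and compare it with the OLS slope from a linear probability model.

```python
import numpy as np

# Assumed setup: simulate a binary outcome from known logit coefficients
rng = np.random.default_rng(1)
x = rng.normal(size=5000)
b0, b1 = -0.5, 0.8  # assumed "true" logit intercept and slope
p = 1 / (1 + np.exp(-(b0 + b1 * x)))
y = rng.binomial(1, p)

# AME of x on the probability scale: mean of dp/dx = b1 * p * (1 - p)
ame = np.mean(b1 * p * (1 - p))

# Linear probability model slope (OLS of y on x) for comparison
X = np.column_stack([np.ones_like(x), x])
beta_ols, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)

print(ame, beta_ols[1])  # the two are typically close
```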
How to model this relationship with a linear model?
Polynomial of Trust in Parliament would pick up the slowing of the decline
How to make this log reg table more interpretable?
Transform to the probability scale (Stata: margins, dydx())
–> Average marginal effects (makes it more interpretable but flattens out the S-shape, meaning it works well in the middle but less at the edges)
Polynomials vs log regression
decline in non lin models
What is the problem with interactions in logistic regressions?
Logistic regression is non-additive on the probability scale, so it already implies interactions even when none are explicitly included.
What is the problem with mediation in logistic regressions?
Adding more variables rescales the logistic coefficients, so coefficients are not comparable across nested models; no cross-country comparison is possible –> mediation is generally underestimated.
What is the solution for logistic mediation?
KHB method (can also be used for linear models) –> the difference between the total and direct effects gives you the mediation effect
Important to consider the absolute as well as the relative ratio
Logistic vs linear models
Binary variables
Why would we prefer logistic regression sometimes?
It captures non-linear (S-shaped) relationships really well; with OLS we need to model them ourselves (e.g., with polynomials).