Regression diagnostics / Logistic regression Flashcards

1
Q

Why do we need to consider the assumptions for a linear regression?

A

So we can rely on the resulting statistics: coefficients, standard errors, etc.

2
Q

What are the regression assumptions?

A

The exact list differs a bit between textbooks, but roughly: linearity, homoskedastic and normally distributed residuals, and no multicollinearity (plus checking for influential observations).

3
Q

What happens if you violate regression assumptions?

A

1) Coefficients become unreliable → biased
2) SEs become unreliable → any hypothesis test becomes unreliable (including p-values, t-statistics, etc.)

4
Q

Linearity

A

Assumption: the average outcome is linearly related to each term in the model when holding all others fixed. Technically, the “linear” in “linear regression” refers to the outcome being linear in the parameters, the β’s.

Problem: biased coefficients (when the true form is curvilinear)

Diagnostic: component-plus-residual plot (a significant difference between the residual line and the component line indicates that the predictor does not have a linear relationship with the dependent variable)

Solution: polynomials, splines, collapsing into categories
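
A minimal sketch of this diagnostic in Python with statsmodels; the data and variable names are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy data whose true relationship is curvilinear
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
df = pd.DataFrame({"x": x,
                   "y": 2 + 0.5 * x - 0.04 * x**2 + rng.normal(0, 0.5, 200)})

fit = smf.ols("y ~ x", data=df).fit()
sm.graphics.plot_ccpr(fit, "x")   # component-plus-residual plot: curvature => non-linearity

# If the plot shows curvature, one fix is a polynomial term:
fit2 = smf.ols("y ~ x + I(x**2)", data=df).fit()
```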

5
Q

Homoskedastic / Normally distributed Residuals

A

Assumption: residuals have constant variance and are normally distributed

Problem: standard errors are usually not correct (typically underestimated); influential observations may also be present, which can affect the coefficients

Diagnostic: heteroskedasticity → plot residuals against fitted values; normality → histograms, qnorm (normal Q-Q plot), studentized-residual plots

Solution: log transformation / power transformation, robust SEs, correct coding errors
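
A minimal sketch of these diagnostics and the robust-SE fix in Python with statsmodels (toy data, all names illustrative):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy data where the error variance grows with x (heteroskedastic)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
df = pd.DataFrame({"x": x, "y": 1 + 0.5 * x + rng.normal(0, 0.2 + 0.3 * x)})

fit = smf.ols("y ~ x", data=df).fit()

plt.scatter(fit.fittedvalues, fit.resid)   # fan shape => heteroskedasticity
sm.qqplot(fit.resid, line="s")             # deviations from the line => non-normality

# One remedy: heteroskedasticity-robust standard errors
fit_robust = smf.ols("y ~ x", data=df).fit(cov_type="HC1")
print(fit.bse, fit_robust.bse)
```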

6
Q

No multicollinearity

A

Assumption: predictors should be independent of each other, with very low correlation (not an issue in SLR, only in MLR)

Problem: “holding constant” is not possible with correlated variables → 1) interpretation becomes impossible, since the model cannot tell which variable made the difference; 2) loss of precision (inflated standard errors)

Diagnostic: look at correlations; variance inflation factor (assesses, for each variable, how much its coefficient’s variance is inflated by correlation with the other predictors → the higher the VIF, the more of its information is already contained in the others = high multicollinearity)

Solutions: get crafty (find a similar but not collinear variable), construct an index, get more data, mean-center interaction variables
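
A minimal sketch of the VIF check in Python with statsmodels (toy data in which x1 and x2 are nearly collinear):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# One VIF per column; a common rule of thumb flags values above ~10
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```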

7
Q

Which diagnostic is important to consider apart from regression assumptions?

A

Influential observations

  • pull the regression fit towards themselves → results (predictions, parameter estimates, CIs, p-values) can be quite different with and without these cases included in the analysis
  • do not necessarily violate any regression assumptions, but they can cast doubt on the conclusions drawn from your sample. If a regression model is being used to inform real-life decisions, one would hope those decisions are not overly influenced by just one or a few observations…
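
A minimal sketch of an influence check in Python with statsmodels (one influential point is planted in the toy data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
x[0], y[0] = 6.0, -10.0                 # plant an unusual, high-residual point

fit = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = fit.get_influence().cooks_distance
print(np.argmax(cooks_d), cooks_d.max())    # the planted point stands out

# Compare estimates with and without the suspect case
fit_wo = sm.OLS(np.delete(y, 0), sm.add_constant(np.delete(x, 0))).fit()
print(fit.params, fit_wo.params)
```
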
8
Q

Which is more important: the coefficient or the SE?

A

First the estimate, then the SE: a correct SE is of no use if the estimate itself is biased.

12
Q

Which one is problematic?

A

B: it combines an unusual x-value (high leverage) with a large residual, so it pulls the regression line down; deleting it would change the regression line drastically.

13
Q

Are these two influential observations?

A

No:
1) the sample is large, so a single point has little pull
2) neither has an unusual x-value (low leverage)

14
Q

When to delete an outlier?

A

A common rule of thumb: when its Cook’s D is approximately 1 or greater.

15
Q

Is multicollinearity a problem of bias?

A

No – it is not a problem of bias per se but a lack of data: very little independent variance plus effects that are hard to separate make it hard to be precise.

16
Q
A

a) hard decision to make

17
Q

Potential problems, diagnostics, potential solutions: Influential
observations

A

Problem: they pull the regression fit toward themselves, so predictions, parameter estimates, CIs, and p-values can differ substantially with and without them

Diagnostic: leverage vs. residual plots, Cook’s distance

Solution: correct coding errors; consider deleting a case when its Cook’s D ≈ 1; report results with and without the suspect observations
18
Q

Which is the link function?

OLS vs Logistic Regression

A

Linear = identity link (just the mean → a model of the mean)
Logistic = logit link (log odds of the mean → a model of the log odds, also called the logit)
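
For reference, the logit link written out (standard definition):

logit(p) = ln( p / (1 − p) ) = β0 + β1·x1 + … + βk·xk

so the coefficients live on the log-odds scale.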

19
Q

Which distribution?

OLS vs Logistic Regression

A

Linear = Gaussian (continuous)
Logistic = binomial (discrete)
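
Both cases are generalized linear models, so the family/link pairing can be made explicit; a minimal sketch in Python with statsmodels (toy data, all names illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 1)))
y_cont = X @ [1.0, 0.5] + rng.normal(size=200)                  # continuous outcome
y_bin = rng.binomial(1, 1 / (1 + np.exp(-(X @ [0.2, 1.0]))))    # binary outcome

# Linear regression: Gaussian family with the identity link (the default)
lin = sm.GLM(y_cont, X, family=sm.families.Gaussian()).fit()

# Logistic regression: binomial family with the logit link (the default)
log_reg = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()
```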

20
Q

What are the problems with an OLS model when it comes to binary outcomes?

A

1) With a binary outcome we want to model a probability, but a linear model can give negative values (or values above 1), and negative probabilities have no meaning
2) It makes unrealistic assumptions about constant effects across the whole range of x
3) The normal-residuals assumption is violated

21
Q

Interpret OLS model

A

A one-unit increase on the trust scale decreases the mean of AfD vote by 0.05 percentage points.

22
Q

What does the logistic regression model?

A

The logit, i.e., the log odds of the outcome probability.

23
Q

How does ML work?

A

There is no closed-form solution, so maximum likelihood iterates (e.g., via Newton-Raphson), updating the parameters until the likelihood of the observed data stops improving.
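
A minimal sketch of one such iteration scheme, Newton-Raphson for the logistic log-likelihood (illustrative, not the exact routine any particular package uses):

```python
import numpy as np

def fit_logit_newton(X, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for the logistic log-likelihood (illustrative sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ beta)))            # current predicted probabilities
        grad = X.T @ (y - p)                         # gradient (score)
        hess = X.T @ (X * (p * (1 - p))[:, None])    # information matrix
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:               # converged: estimates stop moving
            break
    return beta

# Toy usage with known true parameters (0.2, 1.0)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ [0.2, 1.0]))))
print(fit_logit_newton(X, y))   # close to (0.2, 1.0)
```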

24
Q

How to interpret the Log Odds?

A

Very hard; log odds are not intuitive, which is why they are usually converted to odds ratios or probabilities.

25
Q

How to interpret the Odds Ratio?

A

An odds ratio of 1 means no difference; values above 1 mean higher odds, values below 1 mean lower odds.
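
In code, odds ratios are just exponentiated logit coefficients; a minimal sketch with statsmodels (toy data, illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 1)))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ [0.2, 1.0]))))

fit = sm.Logit(y, X).fit()
print(np.exp(fit.params))   # odds ratios: 1 = no difference
```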

26
Q

Limitation of Odds

A

We can only say what the odds are and how the odds change, BUT NOT how likely something actually is (probabilities).

27
Q

How to get back to probabilities?

A

Apply the inverse logit (exponentiate and rescale): p = e^(Xβ) / (1 + e^(Xβ))

Example:
plug in 0 for men → the constant alone gives 10%
plug in 1 for women → constant − coefficient = 5%
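
A minimal sketch of the back-transformation; the constant and coefficient below are hypothetical, chosen only so the results match the 10% / 5% example:

```python
import numpy as np

def inv_logit(xb):
    """Map log odds back to a probability: e^xb / (1 + e^xb)."""
    return np.exp(xb) / (1 + np.exp(xb))

const, b_female = -2.197, -0.747      # hypothetical log-odds values
print(inv_logit(const))               # men (female = 0): ~0.10
print(inv_logit(const + b_female))    # women (female = 1): ~0.05
```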

28
Q

What is the average marginal effect?

A

It averages the marginal effects computed at each observation’s values of the explanatory variables, yielding one summary effect on the probability scale.

29
Q

How do the average marginal effect and linear regression relate to each other?

A

They are very similar: the linear regression (linear probability model) coefficient approximates the AME.

30
Q

How to model this relationship with a linear model?

A

A polynomial of Trust in Parliament would pick up the slowing of the decline.

31
Q

How to make this log reg table more interpretable?

A

Transform to the probability scale (in Stata: margins, dydx(*))
→ Average marginal effects make it more interpretable, but they also flatten out the S-shape: the linear summary works well in the middle of the distribution but less well at the edges.
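
A minimal sketch of the same transformation in Python with statsmodels, which computes average marginal effects via get_margeff (toy data, illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 1)))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ [0.2, 1.0]))))

fit = sm.Logit(y, X).fit()
ame = fit.get_margeff(at="overall")   # average marginal effects on the probability scale
print(ame.summary())
```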

32
Q

Polynomials vs log regression

A

Both are non-linear specifications that can capture the decline.

33
Q

What is the problem with interactions in logistic regressions?

A

Logistic regression already implies interactions even when none is specified explicitly: because of the S-shape, the effect of one variable on the probability depends on the values of the others.

34
Q

What is the problem with mediation in logistic regressions?

A

Logistic coefficients are rescaled whenever you add more variables, so coefficients differ across nested models and cannot be compared directly (the same problem blocks cross-country comparison) → mediation is generally underestimated.

35
Q

What is the solution for logistic mediation?

A

The KHB method (it can also be used for linear models) → the difference between the coefficients with and without the mediator gives you the mediation effect.

36
Q
A

It is important to consider the absolute difference (in probabilities) as well as the relative ratio.

37
Q

Logistic vs linear models

Binary variables

A
38
Q

Why would we prefer logistic regression sometimes?

A

It captures non-linear relationships by default, whereas with OLS we would need to model them ourselves (e.g., with polynomials or splines).