3A Flashcards

1
Q

correlation does not imply causation

A

Just because there is a correlation between two variables does not mean that one causes the other.

2
Q

What is meant by spurious correlation?

A

Spurious correlation refers to a connection between two variables that appears to be causal but is not.

3
Q

suppressor variables

A

A suppressor variable is a variable that increases the predictive validity of another variable when included in a regression equation.
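A minimal numpy simulation of suppression (all variable names and data are hypothetical): x2 is unrelated to Y but soaks up the noise in x1, so including it raises the coefficient on x1 toward its true value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
ability = rng.normal(size=n)      # the part of x1 that truly predicts y
noise = rng.normal(size=n)        # irrelevant contamination in x1
x1 = ability + noise              # observed predictor, contaminated with noise
x2 = noise                        # suppressor: correlates with x1, not with y
y = ability + 0.1 * rng.normal(size=n)

def ols(y, *xs):
    X = np.column_stack([np.ones_like(y), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_simple = ols(y, x1)      # coefficient on x1 is attenuated (around 0.5)
b_multi = ols(y, x1, x2)   # coefficient on x1 rises to about 1.0, x2 gets about -1.0
```

Note that x2 "suppresses" the irrelevant variance in x1 rather than predicting y itself.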

4
Q

When do we speak of omitted variable bias?

A

when a regression coefficient is either overestimated (spurious correlation) or underestimated (suppressor variable)

5
Q

What is multiple regression?

A

In multiple regression, there is always one dependent variable (y) but there can be any number of independent variables.

A multiple regression model provides the best possible prediction of Y based on not one but several independent variables
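A minimal sketch of such a model in numpy (the data and coefficient values are hypothetical): one dependent variable y is predicted from two independent variables, and OLS recovers the intercept and slopes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + 0.1 * rng.normal(size=n)

# design matrix: an intercept column plus the two predictors
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef is close to [1.0, 2.0, -0.5]
```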

6
Q

Why would you want to do multiple regression analysis?

MR/MRA

A

MR provides regression coefficients that take into account the role of other variables in the model.

We also say that we estimate the coefficient for an X variable while statistically controlling for the effect of other variables. This means that MR can be used to mitigate the problem of omitted variable bias.
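This can be illustrated with a small numpy simulation (all variables hypothetical): z confounds x and y, so the simple regression coefficient on x is overestimated, while controlling for z in a multiple regression removes the bias.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)               # confounder, e.g. education
x = z + rng.normal(size=n)           # x is driven partly by z
y = 2.0 * z + rng.normal(size=n)     # y depends on z, NOT on x

X1 = np.column_stack([np.ones(n), x])
b_naive = np.linalg.lstsq(X1, y, rcond=None)[0]   # x looks predictive (about 1.0)

X2 = np.column_stack([np.ones(n), x, z])
b_ctrl = np.linalg.lstsq(X2, y, rcond=None)[0]    # x coefficient shrinks to about 0
```

The naive coefficient is pure omitted variable bias: once z is in the model, x contributes nothing.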

7
Q

Interpretation of coefficients in MR

A

Any regression coefficient indicates the predicted increase in Y with every one-unit increase in X while keeping all other variables in the model constant.

The predicted value indicates the expected value of Y based on all X-variables in the model. The residual indicates the difference between the observed value of Y and the expected value of Y based on all X-variables.
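A short numpy sketch of predicted values and residuals (hypothetical data): the predicted value is the design matrix times the coefficients, and the residual is the observed value minus that prediction.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + x1 + x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ coef          # predicted value: expected y given all X-variables
resid = y - y_hat         # residual: observed minus predicted
# with an intercept, OLS residuals average to zero
# and are uncorrelated with every X-variable
```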

8
Q

OLS estimation of multiple regression

A

It allows us to estimate the relation between a dependent variable and a set of explanatory variables.
The OLS estimation of multiple regression coefficients does not follow exactly the same formula as for simple linear regression, but the principle is very much the same. OLS only analyzes cases with complete observations on all variables in the model (listwise deletion).
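A small numpy sketch of complete-case analysis (toy data invented for illustration): the row containing a missing value is dropped before the model is estimated.

```python
import numpy as np

# toy data with a missing value (np.nan) in one row; columns: x1, x2, y
data = np.array([
    [1.0, 2.0, 3.1],
    [2.0, np.nan, 5.2],   # incomplete case: dropped by listwise deletion
    [3.0, 1.0, 4.9],
    [4.0, 3.0, 9.0],
    [5.0, 2.0, 9.1],
])

complete = data[~np.isnan(data).any(axis=1)]   # keep complete cases only
x, y = complete[:, :2], complete[:, 2]
X = np.column_stack([np.ones(len(y)), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # fitted on 4 of the 5 rows
```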

9
Q

Why should you often include control variables?

A

We can mitigate omitted variable bias and obtain better estimates of causal relations by including control variables even when our research question is not about these variables.

In addition, you can sometimes also obtain more precise estimates by including control variables that are very strong predictors of Y.

10
Q

Why you should not include too many control variables

A
  1. interpretation
  2. overfitting
  3. multicollinearity
  4. risk of including mediators
11
Q

interpretation

A

The more variables you put in the model, the harder it becomes to interpret and explain your findings.

12
Q

overfitting

A

Overfitting occurs when you include more variables than would be reasonable based on your sample size.

Rule of thumb: try to have at least 50 cases, and at least 5-10 cases per X-variable
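The rule of thumb can be written as a one-line check (a hypothetical helper, using the stricter 10-cases-per-predictor end of the range):

```python
def sample_size_ok(n_cases, n_predictors, per_predictor=10):
    """Rule of thumb: at least 50 cases and ~5-10 cases per X-variable."""
    return n_cases >= 50 and n_cases >= per_predictor * n_predictors

sample_size_ok(200, 5)    # plenty of data for 5 predictors
sample_size_ok(60, 12)    # 12 predictors for 60 cases: risk of overfitting
```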

13
Q

multicollinearity

A

Multicollinearity refers to the correlation between different X-variables in the model

This is not in itself a problem; in fact it is often the very reason you included a control variable

In the previous example, we controlled for education precisely because we know it is correlated with income and could hence cause omitted variable bias

But if multicollinearity becomes too strong, there is not enough information (especially with a small sample size) to distinguish the effects of different variables (this happens only with very extreme correlations)

If one X variable can be predicted perfectly by another X-variable, or a combination of other X-variables, we speak of perfect multicollinearity. When this occurs, the model cannot be estimated

For example, you cannot include both age and year of birth in one model because these variables have a perfect multicollinearity (in cross-sectional data)
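The age/birth-year example can be verified in numpy (data invented, assuming the data were collected in 2024): because age + birth year is a constant, the design matrix loses a rank and X'X cannot be inverted.

```python
import numpy as np

# cross-sectional data collected in 2024: age = 2024 - birth_year exactly
birth_year = np.array([1980, 1990, 2000, 1975, 1995], dtype=float)
age = 2024.0 - birth_year

X = np.column_stack([np.ones(5), age, birth_year])
# age + birth_year = 2024, a multiple of the intercept column,
# so the three columns are linearly dependent: perfect multicollinearity
rank = np.linalg.matrix_rank(X)   # 2 instead of 3
```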

14
Q

risk of including mediators

A

A mediator is a variable that accounts for the association between X and Y.

If you include a mediator in the model, the effect of X on Y will be underestimated or removed, even if there is in fact a causal effect of X on Y
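A minimal numpy simulation of this (hypothetical chain x -> m -> y): the total effect of x on y is real, but controlling for the mediator m makes the coefficient on x vanish.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
x = rng.normal(size=n)
m = x + 0.1 * rng.normal(size=n)        # mediator: x causes m
y = 2.0 * m + 0.1 * rng.normal(size=n)  # m causes y, so x -> m -> y

def slope_on_x(y, *xs):
    X = np.column_stack([np.ones(n), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_total = slope_on_x(y, x)       # about 2.0: the real causal effect of x on y
b_with_m = slope_on_x(y, x, m)   # about 0: controlling for the mediator hides it
```

Unlike a confounder, the mediator lies on the causal path, so controlling for it removes exactly the effect you wanted to estimate.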

15
Q

include variables in your model when:

A
  1. They are associated with both your X-variable of interest and Y and hence could induce omitted variables bias
  2. They are extremely strong predictors of Y and could hence make the model more precise (e.g., smaller standard errors)
16
Q

do not include variables in your model when:

A
  1. There is no reason to expect they are relevant and they would just complicate the interpretation
  2. Your sample size is so small that it will cause overfitting
  3. They are (almost) perfectly correlated with the other X-variables in the model (e.g., perfect multicollinearity)
  4. They could be a mediator of the relation between your X-variable of interest and your Y-variable
17
Q

convention

A

Do you have anything to add to this?

18
Q

stepwise model specification

first possibility

A
19
Q

stepwise model specification

second possibility

A
20
Q

strengths of multiple regression

A

Compared to simple regression and bivariate correlations, multiple regression gives us more possibilities to say something about causal relations.

By controlling for likely alternative explanations, we can reduce omitted variable bias and we get closer to being able to interpret regression coefficients as causal effects

It is often quite easy to falsify incorrect causal claims with multiple regression by demonstrating how a correlation is spurious

21
Q

Limitations of multiple regression

A

You can never control for all variables that could theoretically induce omitted variable bias, because:
1. You never know precisely what those variables are
2. Even if you did know, you would never have data on all of those variables

Although it is a step in the right direction, the principle “correlation does not imply causation” is still true for multiple regression