3A Flashcards
correlation does not imply causation
Just because there is a correlation between two variables does not mean that one causes the other.
What is meant by spurious correlation?
Spurious correlation refers to a connection between two variables that appears to be causal but is not.
suppressor variables
A suppressor variable is a variable that increases the predictive validity of another variable when it is included in a regression equation.
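A minimal simulation sketch of suppression (Python with numpy and statsmodels; the data-generating process is invented for illustration): Z is unrelated to Y but soaks up noise in X, so adding Z to the regression strengthens X's estimated effect.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
signal = rng.normal(size=n)         # the part of X that truly predicts Y
z = rng.normal(size=n)              # suppressor: correlated with X, not with Y
x = signal + z                      # X is contaminated by Z
y = signal + rng.normal(size=n)     # Y depends only on the signal in X

simple = sm.OLS(y, sm.add_constant(x)).fit()
with_z = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
print(simple.params[1], simple.rsquared)  # X alone: slope ~0.5, R^2 ~0.25
print(with_z.params[1], with_z.rsquared)  # with Z: slope ~1.0, R^2 ~0.5
```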
when do we speak of omitted variable bias?
when a regression coefficient is either overestimated (spurious correlation) or underestimated (suppressor variable) because a relevant variable was left out of the model
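The spurious-correlation side can be sketched the same way (same illustrative assumptions: Python, invented data process): a confounder Z drives both X and Y, so leaving Z out inflates the estimated coefficient of X.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5_000
z = rng.normal(size=n)                       # confounder
x = 0.8 * z + rng.normal(size=n)             # X depends on Z
y = 0.3 * x + 0.7 * z + rng.normal(size=n)   # true effect of X is 0.3

biased = sm.OLS(y, sm.add_constant(x)).fit()
adjusted = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
print(biased.params[1])    # overestimates 0.3 because Z is omitted
print(adjusted.params[1])  # close to the true 0.3 once Z is controlled
```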
what is multiple regression
In multiple regression, there is always one dependent variable (Y), but there can be any number of independent variables.
A multiple regression model provides the best possible prediction of Y based on not one but several independent variables
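A minimal sketch of fitting such a model with statsmodels' formula API; the library choice and the tiny income/age/education dataset are illustrative assumptions, not part of the flashcards.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "income":    [2100, 2500, 3200, 2800, 4100, 3600],
    "age":       [25, 31, 40, 35, 52, 44],
    "education": [12, 14, 16, 12, 18, 16],
})
# One dependent variable (income), several independent variables
model = smf.ols("income ~ age + education", data=df).fit()
print(model.summary())
```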
Why would you want to do multiple regression analysis?
MR/MRA = multiple regression (analysis)
MR provides regression coefficients that take into account the role of other variables in the model.
We also say that we estimate the coefficient for an X variable while statistically controlling for the effect of other variables. This means that MR can be used to mitigate the problem of omitted variable bias.
interpretation of coefficients MR
Any regression coefficient indicates the predicted increase in Y with every one-unit increase in X while keeping all other variables in the model constant.
The predicted value indicates the expected value of Y based on all X-variables in the model. The residual indicates the difference between the observed value of Y and the expected value of Y based on all X-variables.
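A small sketch of this interpretation, reusing the made-up income data from above: raising age by one unit while education is held constant moves the prediction by exactly the age coefficient, and each residual is the observed Y minus the predicted Y.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "income":    [2100, 2500, 3200, 2800, 4100, 3600],
    "age":       [25, 31, 40, 35, 52, 44],
    "education": [12, 14, 16, 12, 18, 16],
})
model = smf.ols("income ~ age + education", data=df).fit()

base = model.predict(pd.DataFrame({"age": [30], "education": [14]}))
plus_one = model.predict(pd.DataFrame({"age": [31], "education": [14]}))
print(plus_one[0] - base[0], model.params["age"])  # identical by construction
print(model.resid)  # observed income minus income predicted from all X-variables
```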
OLS estimation of multiple regression
OLS allows us to estimate the relation between a dependent variable and a set of explanatory variables.
The OLS estimation of multiple regression coefficients does not follow exactly the same formula as for simple linear regression, but the principle is very much the same. OLS only analyzes cases with complete observations on all variables in the model.
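In matrix form, the OLS estimator is b = (X'X)^(-1) X'y, the many-variable generalization of the simple-regression slope formula. A numpy sketch with simulated data (the true coefficients 1.0, 0.5 and -0.3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
beta_true = np.array([1.0, 0.5, -0.3])
y = X @ beta_true + rng.normal(size=n)

# Solve the normal equations (X'X) b = X'y rather than inverting X'X
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1.0, 0.5, -0.3]
```

(In statsmodels, the complete-cases behavior corresponds to passing missing="drop" when constructing the model.)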
Why should you often include control variables?
We can mitigate omitted variable bias and obtain better estimates of causal relations by including control variables even when our research question is not about these variables.
In addition, you can sometimes obtain more precise estimates by including control variables that are very strong predictors of Y.
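A simulation sketch of that precision gain (invented data process again): Z strongly predicts Y but is independent of X, so adding Z barely changes X's coefficient yet clearly shrinks its standard error.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1_000
x = rng.normal(size=n)
z = rng.normal(size=n)                        # strong predictor of Y, unrelated to X
y = 0.3 * x + 2.0 * z + rng.normal(size=n)

without_z = sm.OLS(y, sm.add_constant(x)).fit()
with_z = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
print(without_z.params[1], without_z.bse[1])  # slope ~0.3, larger standard error
print(with_z.params[1], with_z.bse[1])        # slope ~0.3, much smaller standard error
```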
Why you should not include too many control variables
- interpretation
- overfitting
- multicollinearity
- risk of including mediators
interpretation
The more variables you put in the model, the harder it becomes to interpret and explain your findings.
overfitting
overfitting occurs when you include more variables than would be reasonable based on your sample size
Rule of thumb: try to have at least 50 cases, and at least 5-10 cases per X-variable
multicollinearity
Multicollinearity refers to the correlation between different X-variables in the model
This is not in itself a problem; in fact it is often the very reason you included a control variable
In the previous example, we controlled for education precisely because we know it is correlated with income and could hence cause omitted variable bias
But if multicollinearity becomes too strong, there is not enough information (especially with a small sample size) to distinguish the effects of different variables (this happens only with very extreme correlations)
If one X variable can be predicted perfectly by another X-variable, or a combination of other X-variables, we speak of perfect multicollinearity. When this occurs, the model cannot be estimated
For example, you cannot include both age and year of birth in one model, because these variables are perfectly multicollinear (in cross-sectional data)
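A tiny numpy illustration (assuming a hypothetical survey year of 2024): because age = 2024 - year of birth for every respondent, the design matrix loses a rank and the coefficients cannot be estimated uniquely.

```python
import numpy as np

birth_year = np.array([1960, 1975, 1990, 2000])
age = 2024 - birth_year                  # perfect linear function of birth year
X = np.column_stack([np.ones(4), age, birth_year])
print(np.linalg.matrix_rank(X))          # 2 instead of 3: perfect multicollinearity
```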
risk of including mediators
A mediator is a variable that accounts for the association between X and Y
If you include a mediator in the model, the effect of X on Y will be underestimated or removed, even if there is in fact a causal effect of X on Y
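A simulation sketch of this risk (invented coefficients, Python/statsmodels): X affects Y only through the mediator M, so the total effect of X is real, yet controlling for M makes X's coefficient vanish.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 5_000
x = rng.normal(size=n)
m = 0.9 * x + rng.normal(size=n)   # mediator on the causal path
y = 0.8 * m + rng.normal(size=n)   # X affects Y only through M

total = sm.OLS(y, sm.add_constant(x)).fit()
with_m = sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit()
print(total.params[1])   # ~0.72: the real total effect of X on Y
print(with_m.params[1])  # ~0: X's effect disappears once M is controlled
```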
include variables in your model when:
- They are associated with both your X-variable of interest and Y and hence could induce omitted variables bias
- They are extremely strong predictors of Y and could hence make the model more precise (e.g., smaller standard errors)