14. Linear regression models (Mediation and moderation) Flashcards
What does multiple regression do?
Tells us how the mean of the DV changes as a function of the IVs - we can partial out the effect of each IV
* Ceteris paribus = other things equal
* What is the effect of education on salary, keeping gender, region, industry… equal?
* Whatever is part of the regression is controlled for and held constant
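A minimal sketch of this "held constant" interpretation, using simulated data (the salary/education/gender variables and their coefficients here are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
education = rng.normal(14, 2, n)        # years of schooling
female = rng.integers(0, 2, n)          # 0/1 gender dummy
salary = 20 + 2.5 * education - 4 * female + rng.normal(0, 5, n)
df = pd.DataFrame({"salary": salary, "education": education, "female": female})

# The coefficient on education is its effect on mean salary
# holding gender constant (ceteris paribus).
model = smf.ols("salary ~ education + female", data=df).fit()
print(model.params)
```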
Name the 7 assumptions for multiple linear regression
1- linearity
2- random sampling
3- no perfect collinearity
4- zero conditional mean of error
5- homoscedasticity
6 (Normality)
7. Multicollinearity
What is linearity?
- States that the population model is linear in the parameters β0, β1, β2, …, βk
- This doesn't mean the variables themselves cannot be recoded as logarithms, squares, and other functions
- For example: log(salary), age and age² - these change the interpretation of the coefficients! (see the sketch below)
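A small sketch of such transforms, again on simulated data: the model stays linear in the betas even though the variables enter as a log and a square.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
age = rng.uniform(20, 60, 300)
salary = np.exp(2 + 0.08 * age - 0.0008 * age**2 + rng.normal(0, 0.1, 300))
df = pd.DataFrame({"salary": salary, "age": age})

# log(salary) = b0 + b1*age + b2*age^2 + u
# Still "linear in parameters"; b1 is now a semi-elasticity,
# not a raw salary effect.
model = smf.ols("np.log(salary) ~ age + I(age**2)", data=df).fit()
print(model.params)
```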
What is random sampling?
- We minimize residuals in order to estimate the betas for a specific sample, but this sample should reflect the population distribution (and it will, if it was randomly selected from the population)
What is “no perfect collinearity”?
- None of the IVs is constant, and no IV is an exact linear combination of the others
- This still allows including age and age², but would not allow including an exact linear function of age (for example, age in different units, such as decades) - see the sketch below
- Also fails if n < k + 1: we need at least as many observations as parameters we are trying to estimate
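A sketch of why an exact linear function breaks OLS (the age variables here are simulated):

```python
import numpy as np

rng = np.random.default_rng(2)
age_years = rng.uniform(20, 60, 100)
age_decades = age_years / 10      # exact linear function of age_years

# Design matrix: intercept, age in years, age in decades
X = np.column_stack([np.ones(100), age_years, age_decades])
print(np.linalg.matrix_rank(X))   # 2, not 3: perfect collinearity
# Inverting X'X would fail (singular matrix), so the betas
# cannot be separately identified.
```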
What is zero conditional mean of error?
- Conditional on any values of the IVs, the error term is expected to be 0: E(u | x1, …, xk) = 0
- Misspecifying the functional relationship or omitting an important variable that is correlated with the IVs violates this assumption (see the sketch below)
- The minimization of residuals - the calculation of the betas - depends on this assumption
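A simulated sketch of the omitted-variable case: "ability" is correlated with education and affects salary, so leaving it out pushes it into the error term and biases the education coefficient (all numbers here are invented).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5000
ability = rng.normal(0, 1, n)
education = 12 + 2 * ability + rng.normal(0, 1, n)  # correlated with ability
salary = 10 + 1.0 * education + 3.0 * ability + rng.normal(0, 1, n)

# Omitting ability: the education coefficient absorbs part of its effect.
short = sm.OLS(salary, sm.add_constant(education)).fit()
print(short.params[1])  # noticeably above the true value of 1.0
```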
What is homoscedasticity?
- Conditional on any values of the IVs, the error term has the same variance: Var(u | x1, …, xk) = σ²
- So while the values of the DV are a linear combination of the IVs, the variance of u should not depend on the values of the IVs (e.g., if errors get bigger at higher levels of X, we are more imprecise at higher X) - see the sketch below
- Also violated if the errors are autocorrelated
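One common check is the Breusch-Pagan test, available in statsmodels; a sketch on simulated data whose error spread grows with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 500)
y = 2 + 3 * x + rng.normal(0, x)  # error SD grows with x: heteroscedastic

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
print(lm_pvalue)  # small p-value -> reject homoscedasticity
```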
What is normality?
The unobserved error is normally distributed in the population: u ~ Normal(0, σ²)
* We don't know much about the sampling distribution of our OLS estimator (β̂); in order to do statistical inference we need the distribution of the betas (to calculate their standard errors)
* The distribution of the estimated betas depends on the distribution of the errors - so we assume the errors are normally distributed
* Assumption 6 is stronger than, and necessarily includes, assumptions 4 and 5
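A sketch of checking residual normality, here with the Jarque-Bera test from statsmodels (a Q-Q plot is the usual visual alternative); the data are simulated with normal errors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 500)
y = 1 + 2 * x + rng.normal(0, 1, 500)

res = sm.OLS(y, sm.add_constant(x)).fit()
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(res.resid)
print(jb_pvalue)  # large p-value here: no evidence against normality
```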
What is multicollinearity?
The variance of a beta (coefficient) should be as low as possible - that means our coefficient is precise
* Var(β̂j) = σ² / [SSTj (1 − R²j)], where SSTj is the total sample variation in xj and R²j is the R² from regressing xj on the other IVs
Elements influencing the variance of a beta:
* Error variance: the higher σ² is, the higher the variance of the beta
* The variance of the beta is lower the higher the variance of the IV in the data (SSTj)
* The variance of the beta is higher the higher the correlation of the IV with the other IVs (R²j) - MULTICOLLINEARITY
* Multicollinearity does not invalidate the assumptions, but it creates a problem: it increases the variance of the beta, possibly to such an extent that it is impossible to distinguish its separate effect (it becomes insignificant)
* Multicollinearity can be detected in a simple correlation table, but also through the variance inflation factor
* VIFj = 1 / (1 − R²j) - reflects how much of the variance in one IV can be explained by the other IVs
* Rule of thumb: a VIF above 4 is usually a concern; above 10 is serious multicollinearity (it means R²j is above 0.9)
What is VIF?
VIF is often already reported in the main regression output table
* VIFj = 1 / (1 − R²j) - reflects how much of the variance in one IV can be explained by the other IVs
* Rule of thumb: a VIF above 4 is usually a concern; above 10 is serious multicollinearity (it means R²j is above 0.9) - see the sketch below
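A sketch computing VIFs with statsmodels' variance_inflation_factor, on simulated data where x1 and x2 are deliberately made nearly collinear:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(6)
x1 = rng.normal(0, 1, 300)
x2 = x1 + rng.normal(0, 0.2, 300)  # nearly collinear with x1
x3 = rng.normal(0, 1, 300)         # independent

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
for i in range(1, X.shape[1]):     # skip the constant
    print(X.columns[i], variance_inflation_factor(X.values, i))
# x1 and x2 show large VIFs (well above 10); x3 stays near 1.
```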
What are outliers?
One extreme value (could be a typo or a mistake in the data) that pulls the regression line in the wrong direction, creating large residuals - see the sketch below
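A sketch of flagging such a point with Cook's distance from statsmodels (one simulated observation is corrupted to mimic a data-entry typo):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 50)
y = 1 + 2 * x + rng.normal(0, 1, 50)
y[0] = 100  # one extreme value, e.g. a data-entry typo

res = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = res.get_influence().cooks_distance[0]
print(np.argmax(cooks_d), cooks_d.max())  # observation 0 stands out
```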