Linear Regression Flashcards
Interpret constant & coefficient
constant = when education equals 0, the predicted mean income is 457
coefficient = with every additional year of education, the mean income increases by 104
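A minimal sketch of this card as arithmetic, using the 457 and 104 from above. The function name and the example values of 10 and 11 years are illustrative assumptions, not part of the original model output.

```python
# Numbers from the card: predicted mean income = 457 + 104 * years of education
CONSTANT = 457      # predicted mean income when education == 0
COEFFICIENT = 104   # change in predicted mean income per additional year


def predicted_mean_income(years_of_education: float) -> float:
    """Hypothetical helper: evaluates the fitted line at a given education level."""
    return CONSTANT + COEFFICIENT * years_of_education


at_zero = predicted_mean_income(0)     # this is the constant
at_ten = predicted_mean_income(10)
at_eleven = predicted_mean_income(11)
one_year_difference = at_eleven - at_ten  # this is the coefficient
```

The difference between any two adjacent years is always the coefficient, which is what "linear" means here.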
What is a prediction error?
Lin regression assumptions:
- Linear relationship
- Multivariate normality (often stated as: all variables need to be normally distributed; strictly, it is the residuals that should be approximately normal –> When the data is not normally distributed, a non-linear transformation, e.g. a log-transformation, might fix this issue)
- No or little multicollinearity (If multicollinearity is found in the data, centering the data (that is, subtracting the mean of the variable from each score) might help to solve the problem. However, the simplest way to address the problem is to remove independent variables with high VIF values)
- No auto-correlation (Autocorrelation occurs when the residuals are not independent from each other. For instance, this typically occurs in stock prices, where the price is not independent from the previous price)
- Homoscedasticity (A scatter plot is a good way to check whether the data are homoscedastic, meaning the variance of the residuals is constant along the regression line)
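The VIF check from the multicollinearity point can be sketched by hand: regress each predictor on the others and compute 1 / (1 - R²). The data below are simulated and the helper names `r_squared` and `vif` are assumptions made for illustration; this is not a replacement for a statistics package.

```python
import numpy as np


def r_squared(y, X):
    """R^2 of an OLS regression of y on X (an intercept is added here)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def vif(X):
    """Variance inflation factor for every column of X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        factors.append(1.0 / (1.0 - r_squared(X[:, j], others)))
    return np.array(factors)


# Simulated predictors: x2 is nearly a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
```

In this simulation `x1` and `x2` get very high VIF values while `x3` stays close to 1, so dropping one of the first two would be the "simplest way" named above.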
How does OLS work?
What is r^2?
How is R^2 calculated?
What is the loss function for linear regression?
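One possible way to make the OLS, R^2, and loss-function questions concrete, as a sketch on simulated data (the data-generating numbers reuse 457 and 104 from the first card; the noise level and sample size are assumptions). OLS picks the coefficients that minimise the sum of squared residuals, and R^2 compares that sum to the total variation around the mean of y.

```python
import numpy as np

# Simulated data, for illustration only
rng = np.random.default_rng(1)
n = 200
education = rng.uniform(8, 18, size=n)
income = 457 + 104 * education + rng.normal(0, 300, size=n)

# Design matrix with a column of ones for the constant
X = np.column_stack([np.ones(n), education])

# OLS via the normal equations: solve (X'X) beta = X'y
beta = np.linalg.solve(X.T @ X, X.T @ income)

fitted = X @ beta
residuals = income - fitted            # prediction errors: observed minus predicted

ss_res = np.sum(residuals ** 2)        # the loss that OLS minimises
ss_tot = np.sum((income - income.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Any other coefficient vector gives a larger sum of squared residuals
worse_beta = beta + np.array([0.0, 1.0])
worse_ss = np.sum((income - X @ worse_beta) ** 2)
```

With a single predictor, `r2` equals the squared correlation between `education` and `income`.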
What does a lin regression model predict?
The mean value of y for a given value of x
(No probabilities, it’s a model of the mean)
How can we make a constant more meaningful? (1)
centering: usually mean-centered (subtract 12.5 years from years of education)
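A small sketch of what centering does, on simulated data (the mean of 12.5 is taken from the card; everything else is assumed). The slope stays the same; only the constant changes, and it becomes the predicted income at mean education.

```python
import numpy as np

# Simulated data, for illustration only
rng = np.random.default_rng(2)
n = 300
education = rng.normal(12.5, 3, size=n)
income = 457 + 104 * education + rng.normal(0, 300, size=n)


def fit(x, y):
    """Simple OLS fit; returns [constant, slope]."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


raw_const, raw_slope = fit(education, income)
cen_const, cen_slope = fit(education - education.mean(), income)
```

After centering, the constant equals the sample mean of income, which is usually easier to interpret than the income of someone with zero years of education.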
How can we make a constant more meaningful? (2)
standardizing: subtract the mean, then divide by the SD
1) For every 1-SD increase in education, mean income rises by 402
2) For every 1-SD increase in education, mean income rises by 0.3 SD of income
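The two readings above correspond to standardizing only the predictor versus standardizing both variables. A sketch on simulated data (all numbers below are assumptions, so the results will not reproduce 402 or 0.3):

```python
import numpy as np

# Simulated data, for illustration only
rng = np.random.default_rng(4)
n = 300
education = rng.normal(12.5, 3, size=n)
income = 457 + 104 * education + rng.normal(0, 900, size=n)


def standardize(v):
    """Subtract the mean, then divide by the SD."""
    return (v - v.mean()) / v.std()


def slope(x, y):
    """Slope of a simple OLS regression of y on x."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


raw_slope = slope(education, income)
# 1) only education standardized: income units per 1 SD of education
slope_x_std = slope(standardize(education), income)
# 2) both standardized: SDs of income per 1 SD of education
slope_both_std = slope(standardize(education), standardize(income))
```

In a regression with one predictor, the fully standardized slope is the correlation coefficient.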
Why would you want to standardize?
Allows comparison
What are the standardized coefficients also called?
What is true about correlations?
1) Standardizing gets rid of scale –> whole point
3) Perfect correlation = 0 error
4) A correlation of zero does not mean no relationship, just no linear one -> correlation is only a measure for linear relationships!
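Points 3) and 4) can be checked on made-up numbers: an exact line gives a correlation of 1 with no error, while a perfect but curved relationship can give a correlation of about 0. The grid of x values is an assumption chosen to be symmetric around zero.

```python
import numpy as np

x = np.linspace(-3, 3, 101)     # symmetric grid around zero

y_linear = 2 * x + 1            # exact linear relationship
y_quadratic = x ** 2            # exact, but non-linear, relationship

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]
```

Here `y_quadratic` is fully determined by `x`, yet the correlation misses it because it only picks up the linear part.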
Why would we even need a regression, why not only calculate the conditional means?
1) reduce noise -> virtue of abstraction
2) prediction even for data that is not there
3) allows for more control i.e. mediation, moderation, controls, etc.
Why do we square residuals in r^2?
1) prevent cancelling out
2) bigger penalty for large residuals
- no threshold will do
- similar rationale with alpha value
- highly noisy data in SS: how could we possibly achieve it? Or should we even want to?
Is my R² too low?
Low R-Squared is often good BUT also a limitation
Is my R² too high?
High R-Squared is often not good
BUT can be
Why a Low R-Squared is often good
Why a Low R-Squared is also a limitation
Why a High R-Squared is often not good
Why a High R-Squared can be good
very accurate prediction if really captures the relationship
What do we need to control for?
How can Parent’s SES influence education –> income?
Confounding as well as Mediation, interlaced
Income achieved or inherited?
Reduction of FISEI much more than reduction of education –> rather achieved
When do we need to put in a control variable?
1) not a good idea –> kitchen sink approach leads to overfitting (unless you want a really good in-sample prediction)
2) not enough - not clear in which direction the correlation works
What is the collider bias?
When an exposure and an outcome independently cause a third variable –> ‘collider’. Inappropriately controlling for a collider variable can induce a distorted association between the exposure and the outcome when in fact none exists.
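A simulation sketch of collider bias. The variables are generic and the effect sizes are assumptions: exposure and outcome are generated independently, and both feed into the collider.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
exposure = rng.normal(size=n)
outcome = rng.normal(size=n)                        # truly unrelated to exposure
collider = exposure + outcome + rng.normal(size=n)  # caused by both


def coefficients(y, *predictors):
    """OLS coefficients; index 0 is the constant."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Without the collider: coefficient of exposure is close to 0 (correct)
b_without = coefficients(outcome, exposure)[1]
# Controlling for the collider: a spurious negative coefficient appears
b_with = coefficients(outcome, exposure, collider)[1]
```

The induced association is entirely an artefact of the control variable, since the data were generated with no link between exposure and outcome.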
What is happening here?
Expectation: the coefficient of a predictor will get smaller when other variables are added to the regression model BUT sometimes a coefficient gets larger when other variables are added –> Special case of confounding: Suppression
Usually occurs when there’s an inconsistency of signs:
- younger people more education (edu expansion)
- older people more income (curvilinear rel)
–> Age neg rel to edu BUT pos rel to income
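The sign pattern above can be reproduced in a simulation sketch. All variables are in standardized units and the effect sizes are assumptions: age lowers education but raises income, so leaving age out hides part of the education effect.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20000
age = rng.normal(size=n)
education = -0.6 * age + rng.normal(0, 0.8, size=n)   # younger -> more education
income = 1.0 * education + 1.0 * age + rng.normal(size=n)  # older -> more income


def coefficients(y, *predictors):
    """OLS coefficients; index 0 is the constant."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


b_bivariate = coefficients(income, education)[1]        # age omitted
b_controlled = coefficients(income, education, age)[1]  # age controlled
```

The education coefficient gets larger once age is added, which is the suppression pattern described in this card.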
Which of these var could be overcontrolled?
First three could be mediators, gender needs to be controlled bc moderator
Guidelines for selecting explanatory variables:
Gender Pay Gap
- Argumentation makes sense?
Conditioning on a mediator NOT a control
Why would you look at the gap after cancelling out reasons why it exists?
What could be other mediators for gender –> income?
How do you assess a mediator?
How can you interpret the child effect theoretically?
Having a child may relate to being in a higher income group, as it is a costly decision
also, usually older when having a child
could also be usually shared income when having a child
What does an interaction effect look like for gender –> income, with having a child or not?
Symmetrical: the child effect differs by gender OR the gender effect differs by child status
Main effects are defined for the interacting variable equalling zero
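A sketch of how the coefficients of an interaction model combine. The coefficient values are hypothetical, chosen only to make the arithmetic visible; they are not estimates from the course data.

```python
# Hypothetical coefficients for: income = b0 + b_female*female + b_child*child
#                                          + b_inter*female*child
B0 = 3000         # men without a child
B_FEMALE = -300   # gender effect when child == 0
B_CHILD = 200     # child effect when female == 0
B_INTER = -400    # how much each effect changes when the other variable is 1


def predicted_income(female: int, child: int) -> int:
    return B0 + B_FEMALE * female + B_CHILD * child + B_INTER * female * child


# Gender effect by child status
gender_gap_no_child = predicted_income(1, 0) - predicted_income(0, 0)
gender_gap_with_child = predicted_income(1, 1) - predicted_income(0, 1)

# Child effect by gender
child_effect_men = predicted_income(0, 1) - predicted_income(0, 0)
child_effect_women = predicted_income(1, 1) - predicted_income(1, 0)
```

Both ways of reading the interaction differ by the same amount, `B_INTER`, which is the symmetry mentioned above. The "main effects" `B_FEMALE` and `B_CHILD` only describe the group where the other variable is zero.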
How would this reg table look in a normal table?
Why are margins larger at the beginning/end?
fewer people in sample with 0 or 30 years of education
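One way to see the widening margins, as a sketch on simulated data (all numbers assumed): the standard error of the predicted mean is smallest at the mean of education, where most observations are, and grows towards 0 and 30 years, where there are few.

```python
import numpy as np

# Simulated data: most people cluster around 12.5 years of education
rng = np.random.default_rng(5)
n = 500
education = np.clip(rng.normal(12.5, 3, size=n), 0, 30)
income = 457 + 104 * education + rng.normal(0, 300, size=n)

X = np.column_stack([np.ones(n), education])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
residuals = income - X @ beta
s2 = residuals @ residuals / (n - 2)   # residual variance estimate
XtX_inv = np.linalg.inv(X.T @ X)


def se_of_predicted_mean(x0: float) -> float:
    """Standard error of the fitted mean at education == x0."""
    x_vec = np.array([1.0, x0])
    return float(np.sqrt(s2 * x_vec @ XtX_inv @ x_vec))


se_at_mean = se_of_predicted_mean(education.mean())
se_at_zero = se_of_predicted_mean(0.0)
se_at_thirty = se_of_predicted_mean(30.0)
```

A confidence band built from these standard errors is narrowest in the middle and fans out at both ends.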
Why is it different with standardizing?
more meaningful
Why is the effect more important than the statistical significance?
Can be not significant at 0 but highly significant at other ages
Key takeaways from visualizations
Three types of plots
- Coefficient plot
- Profile plot
- Conditional effect plot
What is a coefficient plot?
What is a profile plot?
What is a conditional effects plot?
Often used for interactions