S7 OLS Regression & Assumptions (complete) Flashcards
Correlation vs. Regression
What is the difference?
Correlation:
tells us how strongly associated two variables are
Regression:
can tell us, on average, how much a one-unit increase in the IV (independent variable) increases/decreases the predicted value of the DV (dependent variable)
-> Regression gives us more precise information on the strength of a relationship
-> Bivariate regression finds the best-fitting line through the data; the line with the best fit is the one that minimizes the vertical (Y) distance from each observation to the line -> to find the best line, use OLS
Ordinary Least Squares
OLS minimizes the sum of the squared prediction errors (y_i − ŷ_i)² across all observations
b = Covar(x,y) / Var(x) (-> see formula sheet)
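A minimal sketch of this formula in Python (NumPy assumed; the toy data are invented for illustration):

```python
import numpy as np

# Toy data, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# b = Covar(x, y) / Var(x)
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
# The intercept follows from the means: a = mean(y) - b * mean(x)
a = y.mean() - b * x.mean()

print(f"a = {a:.3f}, b = {b:.3f}")
```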
Standard error of the slope
Why does the slope have a standard error and how is it calculated? -> Hint: RMSE
How can we build a confidence interval around the slope?
coefficients (b) are also sample statistics -> random sampling error
the standard error of the slope b is given by the root mean square error (RMSE) over the spread of x: s.e.(b) = RMSE / √(Σ (x_i − x̄)²)
- RMSE is given by the root of the error sum of squares (ESS) over the adjusted sample size (the degrees of freedom): RMSE = √(ESS / (n − 2)); the RMSE is a useful measure of goodness of fit
95% confidence interval: beta = b ± 1.96 × s.e.(b) (see the sketch below)
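A hedged sketch putting the RMSE, the standard error of the slope, and the confidence interval together (NumPy assumed; same invented toy data as above):

```python
import numpy as np

# Toy data, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()
resid = y - (a + b * x)

# RMSE = sqrt(ESS / (n - 2)): root of the error sum of squares over the df
rmse = np.sqrt((resid ** 2).sum() / (n - 2))
# s.e.(b) = RMSE / sqrt(sum of squared deviations of x)
se_b = rmse / np.sqrt(((x - x.mean()) ** 2).sum())

low, high = b - 1.96 * se_b, b + 1.96 * se_b
print(f"b = {b:.3f}, s.e.(b) = {se_b:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```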
Hypothesis testing with regression
2 ways to do it
- Calculate degrees of freedom: df = n minus the # of estimated parameters (here 2: a & b)
- Form a null hypothesis: i.e. no effect -> beta = 0, the regression line is horizontal
- Evaluate: to reject the null, the confidence intervals around b should exclude zero
Alternatively: calculate a t-ratio
t = (b − beta(H0)) / s.e.(b)
with beta(H0) usually zero
-> If our t-ratio is (in absolute value) greater than roughly 2 (more precisely, 1.96), p is below .05 and we can reject the null hypothesis
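A sketch of the t-ratio route (SciPy assumed; b, s.e.(b), and n here are placeholder values, not results from real data):

```python
from scipy import stats

# Placeholder values standing in for estimates from a fitted regression
b, se_b, n = 1.97, 0.11, 5

t = (b - 0.0) / se_b            # H0: beta = 0
df = n - 2                      # n minus the 2 estimated parameters (a and b)
p = 2 * stats.t.sf(abs(t), df)  # two-sided p-value

print(f"t = {t:.2f}, df = {df}, p = {p:.4f}")
```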
What are the two general ways to measure the performance of an estimator?
Two general ways to measure the performance of an estimator:
> Bias:
- a systematic tendency to produce estimates that are too high or too low relative to the true value
- minimize the bias
> Efficiency:
- an efficient estimator yields standard errors that are as small as possible
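A small simulation sketch of both criteria (NumPy assumed; all numbers invented): for a normal population, the sample mean and the sample median are both unbiased estimators of the center, but the mean's estimates spread less, i.e. it is the more efficient estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, reps = 5.0, 10_000  # true parameter and number of simulated samples

samples = rng.normal(mu, 2.0, size=(reps, 50))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

# Bias: systematic deviation of the estimates from the true value
print("bias(mean):  ", means.mean() - mu)    # ~0 -> unbiased
print("bias(median):", medians.mean() - mu)  # ~0 -> unbiased
# Efficiency: the spread of the estimates (smaller = more efficient)
print("s.e.(mean):  ", means.std(ddof=1))
print("s.e.(median):", medians.std(ddof=1))  # larger -> less efficient
```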
What are the 5 OLS assumptions?
- Linearity: The dependent variable y is a linear function of the x’s, plus a population error term.
- Mean independence: The mean value of the error does not depend on any of the x’s.
- Homoscedasticity (constant variance): The variance of the error does not depend on the x’s; it is constant.
- Uncorrelated disturbances: The value of the error for any observation is uncorrelated with the value of the error for any other observation.
- Normal disturbance: The disturbances/errors are distributed normally.
Which OLS assumptions guarantee what?
Assumptions (1) linearity & (2) mean independence -> linear and unbiased estimates
Assumptions (3) homoscedasticity and (4) uncorrelated disturbances -> efficient model -> “best”
Together: BLUE (Best Linear Unbiased Estimator)
Adding assumption (5) normality implies that a t- or z-table can be used to calculate p-values
Mean Independence
Mean independence is the most important assumption because violations
- can generate LARGE bias in the estimates and occur often
- cannot be tested for without additional data
-> If your x’s are related to something outside of the model, they might be picking up its effect on y as well as their own!
-> This is called omitted variable bias
Dangers of Violating Mean Independence
Omitted variable bias
- can generate LARGE bias in the estimates and often occur
-> if your x’s are related to something outside of the model, they might be picking up its effect on y as well as their own!
- cannot be tested for without additional data
Endogeneity bias
-> The explanatory variable is correlated with the error term
- often arises from reverse causation or selection effects
- If y has a causal effect on any of the x’s, then the error term will indirectly affect the x’s
Measurement Error
- If the x’s are measured with error, that error becomes part of the error term
- Because the measurement error affects the measured value of the x’s, the error term is related to the x’s
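A small simulation sketch of omitted variable bias (statsmodels assumed; all coefficients are made up): x is correlated with an omitted variable z that also affects y, so leaving z out of the model biases the coefficient on x.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000

z = rng.normal(size=n)                      # the omitted variable
x = 0.8 * z + rng.normal(size=n)            # x is related to z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)  # true effect of x on y is 1.0

full = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
short = sm.OLS(y, sm.add_constant(x)).fit()  # omits z

print("b_x with z included:", round(full.params[1], 2))   # ~1.0
print("b_x with z omitted: ", round(short.params[1], 2))  # biased upward: x picks up z's effect
```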
Assumption (3) Homoscedasticity
Wanted: homoscedasticity;
its bad brother = heteroscedasticity
-> Non-constant variance (scatterplot that looks like a “joint”)
-> Biased standard errors (in either direction)
- easily fixed with “robust standard errors”
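A sketch of the “robust standard errors” fix (statsmodels assumed; the data-generating process is invented so that the error spread grows with x):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1_000
x = rng.uniform(1, 10, size=n)
# Heteroscedastic errors: the spread grows with x (the "joint"-shaped scatter)
y = 1.0 + 0.5 * x + rng.normal(scale=0.4 * x)

X = sm.add_constant(x)
classical = sm.OLS(y, X).fit()             # assumes constant error variance
robust = sm.OLS(y, X).fit(cov_type="HC1")  # heteroscedasticity-robust s.e.s

print("classical s.e.(b):", classical.bse[1])
print("robust s.e.(b):   ", robust.bse[1])
```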
(4) Uncorrelated Errors
The disturbances (errors) for any two observations must be uncorrelated.
Correlated errors can arise from connected observations (e.g. husbands and wives), causal effects (e.g. peer pressure), or serial correlation (measuring the same unit over time)
- Correlated errors do not bias the coefficient estimates
But they do
- shrink the standard errors
- observations are assumed to be more independent than they are
- DANGER: Type 1 error!! False positive
- solution depends on type of correlation in errors, e.g. “clustered standard errors”
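A sketch of “clustered standard errors” (statsmodels assumed; the cluster structure is invented): both x and the error share a within-cluster component, so the classical standard errors are too small.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n_clusters, per_cluster = 50, 10
groups = np.repeat(np.arange(n_clusters), per_cluster)
n = n_clusters * per_cluster

# Both x and the error contain a component shared within each cluster
x = rng.normal(size=n_clusters)[groups] + rng.normal(size=n)
e = rng.normal(size=n_clusters)[groups] + rng.normal(size=n)
y = 0.5 * x + e

X = sm.add_constant(x)
classical = sm.OLS(y, X).fit()  # treats all observations as independent
clustered = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": groups})

print("classical s.e.(b):", classical.bse[1])
print("clustered s.e.(b):", clustered.bse[1])  # larger: the honest uncertainty
```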
Normality
The population disturbance term must be normally distributed
Note that only disturbances, not the variables, must be normally distributed (Big misconception!!)
Normality is the least important assumption because OLS can be BLUE without it (unbiased and efficient)
Normally distributed disturbances simply enable the use of a z- or t-table for the p-values. Thus, in large samples (where the central limit theorem takes over) we don’t even care about normality of the disturbances
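If one does want to check this assumption in a small sample, a common sketch is to test the residuals, not the variables (SciPy and statsmodels assumed; the data are invented):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)

resid = sm.OLS(y, sm.add_constant(x)).fit().resid

# Test the residuals, not y or x themselves
stat, p = stats.shapiro(resid)
print(f"Shapiro-Wilk p = {p:.3f}")  # a large p gives no evidence against normality
```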
Which pitfalls can bias estimates and which can influence the standard errors? (assumptions and beyond)
Pitfalls that can bias estimates:
(1) Non-linearity (misspecification)
(2) Violation of mean independence -> omitted variable bias (misspecification)
- endogeneity (= the explanatory variable is correlated with the error term)
- measurement error
Pitfalls that can influence the standard errors:
- outliers (sometimes arising from skew)
- heteroskedasticity
- correlated errors
- multicollinearity
Consider a linear function, y = α + βx. What does the constant α signify? (Select ALL the answers that apply)
a. The value of x when the y-intercept is 0
b. The value of y when x is 0
c. The value of the residuals when x is 0
d. The Y-intercept
Correct: b & d
Which of these statements does not form part of the OLS assumptions?
Select one:
a. Mean independence. The mean value of ε does not depend on any of the x’s. Assume that E(ε) = 0.
b. Linearity. The dependent variable y is a linear function of the x’s, plus a population error term, ε.
y = α + β1 x1 + β2 x2 + ε
c. Normality. The dependent variable is approximately normally distributed around its mean.
d. Uncorrelated disturbances. The value of ε for any observation is uncorrelated with the value of ε for any other observation.
Correct: C