Chapter 1 Linear regression Flashcards
A1. Linearity, implication
The marginal effect of the regressors does not depend on their level (constant marginal effects)
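A sketch of the usual statement of A1 (standard notation, not copied verbatim from the notes): the model is linear in β, y_i = x_i'β + ε_i (in matrix form y = Xβ + ε), so the marginal effect of regressor k is the constant β_k.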
A2. Strict exogeneity
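Presumably the standard statement: E(ε_i | X) = 0 for all i = 1, …, n (each error has conditional mean zero given the regressors of all observations).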
A3. No multicollinearity
The rank of the n×k data matrix X is k with probability 1
A4. Spherical error variance
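Presumably the standard statement: E(ε_i² | X) = σ² > 0 for all i and E(ε_i ε_j | X) = 0 for i ≠ j; compactly, E(εε' | X) = σ² I_n.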
Implications of A2
Justified by economic theory, not by econometrics.
Usually not satisfied by time series data
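The usual formal implications (a sketch): E(ε_i) = 0 and E(x_jk ε_i) = 0 for all i, j, k (orthogonality between the errors and the regressors of every observation), hence Cov(x_jk, ε_i) = 0.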
Implications of A4
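Combined with A2, the usual implications (a sketch): Var(ε_i | X) = σ² (conditional homoskedasticity) and Cov(ε_i, ε_j | X) = 0 for i ≠ j (no correlation across observations), i.e., Var(ε | X) = σ² I_n.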
How does iid affect our assumptions’ restrictiveness?
Under random sampling the observations are independent across i, so strict exogeneity E(ε_i|X) = 0 reduces to the weaker condition E(ε_i|x_i) = 0, making A2 less restrictive.
E(ε_i²) remains constant across i (unconditional homoskedasticity), but E(ε_i²|x_i) may still differ across i, so A4 remains restrictive.
How do we look for the parameters in OLS? Does it make sense?
We minimize the SSR (loss function) to minimize the errors.
It makes sense if we want to predict, but not necessarily if we want to interpret causality.
SSR formula
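Presumably the standard formula: SSR(β̃) = Σ_i (y_i − x_i'β̃)² = (y − Xβ̃)'(y − Xβ̃), minimized over β̃.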
To isolate b from the first-order condition of the minimized SSR we need the inverse of X'X to exist. Is this fulfilled? (See the sketch after this list.)
Yes:
1. By A3, X has full column rank, so the determinant of X'X is different from 0
2. X'X is a square (k×k) matrix by definition
3. n > k
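A sketch of the step being referenced (standard derivation, notation assumed): the first-order condition gives the normal equations X'Xb = X'y, and since (X'X)^{-1} exists, b = (X'X)^{-1}X'y.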
Projection matrix
P=X(X’X)^{-1}X’
PY
=Xb (the fitted values)
Annihilator matrix
M=I-P
MY
=Y-Xb=e
(residuals)
Properties of M and P
They are both symmetric and idempotent (AA=A)
PX
=X
MX
=0
PM
=0
Finding the variance in OLS (sigma^2)
Since we do not observe ε, we need an estimator based on the residuals e that approximates σ².
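Presumably the standard (unbiased) estimator: s² = e'e/(n − k) = SSR/(n − k).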
Finding the R^2
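Presumably the standard (uncentered) definition: R²_uc = 1 − e'e/(y'y) = ŷ'ŷ/(y'y), using the decomposition y'y = ŷ'ŷ + e'e.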
Centered R^2
By removing the mean, the centered R² measures the explanatory power of the regressors beyond the constant term.
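A sketch of the usual formula (valid when X contains a constant): R²_c = 1 − e'e / Σ_i (y_i − ȳ)².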
Influential power of an observation
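One standard expression (a sketch; the exact form in the notes may differ): b − b_(i) = (X'X)^{-1} x_i' e_i / (1 − P_ii),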
where the subindex (i) indicates the estimator computed without the ith observation.
P_ii = x_i(X'X)^{-1}x_i' (the ith diagonal element of P)
trace(P) = k
If all observations contribute similarly, P_ii ≈ k/n; if observation i is an outlier, its P_ii is much larger.
Statistical properties of b: 1. unbiasedness (E(b)=\beta).
Which assumptions do we need?
- Linearity, to substitute y = Xβ + ε into the formula for b (see the sketch below this list)
- Strict exogeneity, so that the term involving E(ε|X) cancels out
- No multicollinearity, so that the inverse (X'X)^{-1} exists
Note: if A2 does not hold (as is common in time series), b is biased.
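A sketch of the derivation referenced above (standard steps): b = (X'X)^{-1}X'y = β + (X'X)^{-1}X'ε, so E(b|X) = β + (X'X)^{-1}X'E(ε|X) = β, and by the law of iterated expectations E(b) = β.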
Definition of conditional variance for a vector
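Presumably the standard definition for a vector z given X: Var(z|X) = E[(z − E(z|X))(z − E(z|X))' | X], a k×k matrix when z is k×1.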
Statistical properties of b: 2. BLUE.
Develop the variance of b OLS under conditional homoskedasticity
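A sketch of the usual derivation: Var(b|X) = Var((X'X)^{-1}X'ε | X) = (X'X)^{-1}X' E(εε'|X) X(X'X)^{-1} = σ²(X'X)^{-1}, using A2 and A4.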
This is the smallest conditional variance attainable by any linear unbiased estimator (Gauss-Markov theorem).
Develop the Gauss-Markov theorem
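A sketch of the usual argument (notation assumed): take any linear unbiased estimator β̂ = Cy and write C = D + (X'X)^{-1}X'. Unbiasedness for every β requires CX = I_k, hence DX = 0. Then Var(β̂|X) = σ²CC' = σ²[DD' + (X'X)^{-1}].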
Notice that DD' is positive semidefinite, so Var(β̂|X) − Var(b|X) = σ²DD' ≥ 0 (in the matrix sense): the conditional variance of β̂ is at least as large as that of b.
Covariance of b, Depsilon (part of G-M theorem)
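Presumably the standard computation: Cov(Dε, b|X) = E[Dεε'X(X'X)^{-1}|X] = σ²DX(X'X)^{-1} = 0, since DX = 0.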
Statistical properties of b: 3. Cov(b, e|X) = 0
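A sketch of the usual proof: Cov(b, e|X) = E[(b − β)e'|X] = E[(X'X)^{-1}X'εε'M|X] = σ²(X'X)^{-1}X'M = 0, because X'M = (MX)' = 0.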
Prove the unbiasedness of the variance estimator for OLS
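A sketch of the standard argument: e = Mε, so e'e = ε'Mε and E(ε'Mε|X) = σ² trace(M) = σ²(n − k); hence E(s²|X) = E(e'e|X)/(n − k) = σ², and unconditionally E(s²) = σ².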
Assumption 5
We assume normality of ε conditional on X, i.e., ε|X ~ N(0, σ²I_n), in order to perform exact finite-sample tests.
How is b − β distributed?
Conditional on X, b − β ~ N(0, σ²(X'X)^{-1})
If we want to standardize the b-beta distribution? (with sigma squared)
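Presumably the standard ratio: z_k = (b_k − β_k) / sqrt(σ²[(X'X)^{-1}]_{kk}) ~ N(0, 1).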
If we want to standardize the b-beta distribution? (with s squared)
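Presumably the standard ratio: t_k = (b_k − β_k) / sqrt(s²[(X'X)^{-1}]_{kk}) = (b_k − β_k)/SE(b_k) ~ t(n − k).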
How do we prove that the t test follows t(n-k) degrees of freedom (just steps)
- The numerator (standardized with σ²) follows a N(0,1)
- The denominator involves s²/σ² = (e'e/σ²)/(n − k), where e'e/σ² follows a chi-squared(n − k)
- The numerator and denominator are independent
How do we prove that the t test follows t(n-k) degrees of freedom (step 1)
We already showed that b − β is the sampling error (X'X)^{-1}X'ε. Imposing the normality assumption, b − β | X ~ N(0, σ²(X'X)^{-1}); standardizing its kth element with σ² gives a numerator that follows a N(0,1).
How do we prove that the t test follows t(n-k) degrees of freedom (step 2)
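A sketch of the standard argument: e'e/σ² = ε'Mε/σ², and since M is symmetric, idempotent, and has rank n − k while ε/σ | X ~ N(0, I_n), this quadratic form follows a chi-squared(n − k).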
How do we prove that the t test follows t(n-k) degrees of freedom (step 3)
Cov(b, e|X) = 0 and, given X, b and e are jointly normal, so they are independent; since the numerator is a function of b and the denominator a function of e, the two are independent as well.
Does the distribution of X affect the distribution of t?
No. We condition on X to derive the result, but the t(n − k) distribution does not depend on X, so the same distribution holds unconditionally (without conditioning in the last step).
Scalar parameter hypothesis testing
Apply a t-test. Under the null, β_k = β̄_k (the hypothesized value).
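Presumably the standard statistic: t_k = (b_k − β̄_k)/SE(b_k) ~ t(n − k) under H0.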
Confidence intervals for the t test
P(b_k − critical value·SE(b_k) ≤ β_k ≤ b_k + critical value·SE(b_k)) = 1 − α, where the critical value is t_{α/2}(n − k).
What would happen with the t-test if strict exogeneity fails?
b would be biased, so the t statistic would not be centered correctly under the null and we would reject true nulls more often than the nominal size (over-rejection).
Linear combination hypothesis testing (dimensions of the matrices in H0). What is the assumption on the hypothesis?
H0: Rβ = r
R is #r × k
β is k × 1
r is #r × 1
Assumption: R has full row rank (#r ≤ k)
Linear combination hypothesis testing: test statistic formula
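Presumably the standard Wald/F form: F = (Rb − r)'[R(X'X)^{-1}R']^{-1}(Rb − r) / (#r · s²), which follows F(#r, n − k) under H0.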
Notice F is always nonnegative (it is a quadratic form with a positive definite weighting matrix).
What is the distribution of F under Ho? (steps)
- Show the numerator is a chi-squared(m1) variable divided by its degrees of freedom m1
- Show the denominator is a chi-squared(m2) variable divided by its degrees of freedom m2
- Show 1 and 2 are independent
Then F follows F(m1, m2)
In this case, under Ho, F follows F(#r,n-k)
What is the distribution of F under Ho? (step 1)
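A sketch of the standard step: under H0, Rb − r = R(X'X)^{-1}X'ε, so given X it is N(0, σ²R(X'X)^{-1}R'); hence (Rb − r)'[σ²R(X'X)^{-1}R']^{-1}(Rb − r) ~ chi-squared(#r).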
What is the distribution of F under Ho? (step 2)
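As in the t-test derivation, presumably e'e/σ² = ε'Mε/σ² ~ chi-squared(n − k), since M is idempotent with rank n − k.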
What is the distribution of F under Ho? (step 3)
The numerator and denominator are independent because the numerator is a function of b and the denominator a function of e, and Cov(b, e|X) = 0; under normality, zero covariance between jointly normal vectors implies independence.
Formula for special case of F test where it’s the test for joint significance
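Presumably the standard special case (H0: all slope coefficients except the intercept are zero): F = [R²_c/(k − 1)] / [(1 − R²_c)/(n − k)] ~ F(k − 1, n − k) under H0.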
With MLE we obtain
The same estimator of β as in OLS: under normality, maximizing the likelihood over β is equivalent to minimizing the SSR.
log density of a multivariate normal
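Presumably the standard expression for z ~ N(μ, Σ) of dimension n: log f(z) = −(n/2)log 2π − (1/2)log|Σ| − (1/2)(z − μ)'Σ^{-1}(z − μ); for the regression model this becomes −(n/2)log(2πσ²) − (1/(2σ²))(y − Xβ)'(y − Xβ).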
MLE for sigma^2
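Presumably the standard result: σ̂²_ML = e'e/n = SSR/n.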
The ML estimator of the variance is biased because it lacks the degrees-of-freedom correction: it divides by n instead of n − k.
What is the Cramér-Rao lower bound
A lower bound on the variance of any unbiased estimator: the inverse of the Fisher information matrix. We find it by taking the log-density, differentiating it twice with respect to the parameters, applying −E[·], and inverting.
What is the Fisher information matrix
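A sketch for the normal linear model (standard result, notation assumed): I(β, σ²) = [[X'X/σ², 0], [0, n/(2σ⁴)]], so the Cramér-Rao bound is its inverse, diag(σ²(X'X)^{-1}, 2σ⁴/n),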
where the (1,1) block (a11) gives the lower bound for the variance of estimators of β and the (2,2) element (a22) the lower bound for estimators of σ², although no unbiased estimator attains the latter.
How can we prove BLUE?
- Via the Cramér-Rao lower bound, which relies heavily on the normality assumption but applies to nonlinear estimators as well
- Via Gauss-Markov, which restricts attention to linear unbiased estimators but does not rely on normality
Consequences of relaxing the spherical error assumption for the parameter estimation
- b_OLS is still unbiased
- Var(b|X) is no longer the minimum → b is NOT BLUE (see the sketch after this list)
- The t and F statistics no longer follow t(n − k) or F(#r, n − k)
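A sketch of the variance under Var(ε|X) = σ²V (assumed notation): Var(b|X) = σ²(X'X)^{-1}X'VX(X'X)^{-1}, which no longer equals σ²(X'X)^{-1}.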
Adapting the data so that relaxing the spherical error assumption does not harm the properties of the parameter estimator: steps (see the sketch below).
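A sketch of the usual steps (assuming Var(ε|X) = σ²V with V known and positive definite): factor V^{-1} = C'C (e.g., C = V^{-1/2}); transform ỹ = Cy, X̃ = CX, ε̃ = Cε, so that Var(ε̃|X) = σ²CVC' = σ²I_n and A1–A4 hold for the transformed model.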
Thus, we can apply OLS to the transformed data: the estimator is the same formula with tildes everywhere, β_GLS = (X̃'X̃)^{-1}X̃'ỹ = (X'V^{-1}X)^{-1}X'V^{-1}y → the GENERALIZED LEAST SQUARES (GLS) ESTIMATOR.
Differences between GLS and OLS
- OLS puts equal weight on all observations, while GLS weights by the error variance (more weight where the variance is smaller)
- The conditional variance of β_GLS is smaller: β_GLS is the BLUE in this model
Testing with GLS
t is the same
F is the same but with tildes
Special GLS case when V(x) is diagonal
- No serial correlation, but there is heteroskedasticity
- In this case GLS becomes weighted least squares (WLS); see the sketch below
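A sketch (assuming V = diag(v_1, …, v_n)): β_WLS minimizes Σ_i (y_i − x_i'β)²/v_i, i.e., each observation is weighted by the inverse of its error variance.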
Causality in linear regressions: ATT, ATE, CATE
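Presumably the standard definitions in potential-outcomes notation Y(1), Y(0) with treatment D: ATE = E[Y(1) − Y(0)]; ATT = E[Y(1) − Y(0) | D = 1]; CATE = E[Y(1) − Y(0) | X = x].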
Can we interpret beta ols as ATE?
NO! In the regression of Y on D, β_OLS = Cov(Y, D)/Var(D); expanding this expression shows it differs from the ATE UNLESS treatment is independent of the potential outcomes.
Interpretation: we require treatment to be randomly assigned.
How can we decompose β_OLS in terms of the ATT plus something else?
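One standard decomposition (a sketch; the notes may write it slightly differently): β_OLS = E[Y|D=1] − E[Y|D=0] = ATT + E[Y(0)|D=1] − E[Y(0)|D=0],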
where the latter two terms (the difference in untreated potential outcomes between treated and untreated) form the selection effect, and the first term is the ATT.
What happens when we have omitted variables?
Strict exogeneity is breached, so b is biased (omitted-variable bias; see the sketch after this list). Possible remedies:
1. Randomize treatment
2. Use a quasi-experiment and include the available control variables
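A sketch of the omitted-variable bias (hypothetical notation: the true model adds a regressor block Z with coefficient γ): if y = Xβ + Zγ + ε with E(ε|X, Z) = 0 but we regress y on X only, then E(b|X, Z) = β + (X'X)^{-1}X'Zγ ≠ β unless γ = 0 or X'Z = 0.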