Lecture 8 Flashcards
What is a proxy variable and why would we use it?
A proxy variable substitutes for an unobserved variable in a regression model
- the idea is to mitigate the bias caused by the unobserved variable by including a measurable variable that correlates with it
E.g., if ability is an unobserved factor affecting wages, a measurable proxy like an IQ score can be included
Example of a regression model where a proxy variable might be needed:
ln(wage) = B0 + B1educ + B2exper + B3abil + u
In practice, we can't include ability as it is unobserved
- but if we omit abil and educ is correlated with abil, OLS suffers from omitted variable bias (OVB)
Therefore we use a measured variable like an IQ score as a proxy for ability
Two key requirements for a proxy to 'work'
- The proxy x3 must not directly affect y; the unobserved x3* is the actual driver and x3 is only a substitute, so cov(x3, u) = 0
- Once x3 is included in the model, it should fully capture x3*, i.e., x1 and x2 should add nothing to the explanation of x3* beyond x3
Write x3* as a linear function of x3 and an error v3
x3* = k0 + k3x3 + v3
- this equation recognises that x3* is not perfectly related to x3
- cov(x1, v3) = 0 and cov(x2, v3) = 0, so conditional on x3, x1 and x2 are uncorrelated with x3*
OLS with proxy under the two requirements
Plug the equation x3* = k0 + k3x3 + v3 into the original equation and rearrange:
y = (B0 + B3k0) + B1x1 + B2x2 + B3k3x3 + (u + B3v3)
- the new composite error u + B3v3 has zero mean and is uncorrelated with x1, x2, and x3, so OLS consistently estimates B1, B2, and B3k3 (a simulation sketch follows)
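A minimal simulation sketch in Python (the DGP, coefficients, and variable names are invented for illustration, not from the lecture). It is built so both proxy requirements hold: abil depends on IQ plus noise v3, and educ is related to abil only through IQ. Including IQ then removes the omitted-variable bias on educ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# abil = k0 + k3*IQ + v3, with v3 uncorrelated with educ (requirement 2)
iq = rng.normal(100, 15, size=n)
v3 = rng.normal(size=n)
abil = -5.0 + 0.05 * iq + v3
educ = 5.0 + 0.07 * iq + rng.normal(size=n)  # linked to abil only via IQ
# IQ has no direct effect on lnwage (requirement 1)
lnwage = 1.0 + 0.08 * educ + 0.10 * abil + rng.normal(scale=0.3, size=n)

def ols(y, *cols):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack((np.ones(len(y)),) + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("educ coef, abil omitted: %.4f" % ols(lnwage, educ)[1])      # biased upward
print("educ coef, IQ as proxy:  %.4f" % ols(lnwage, educ, iq)[1])  # close to 0.08
```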
Measurement error, what is it?
Measurement error is when the observed value of a variable does not equal its true value
-> x_obs = x_true + measurement error
- can lead to endogeneity, and hence biased and inconsistent coefficient estimates
Measurement error in the dependent variable
In many cases we can't observe y* directly; instead we see y, a noisy or imprecise version of y* (e.g., a family might report annual savings inaccurately)
- inflates the error term, leading to less precise estimates and larger SEs
Let's say ln(produc*) = B0 + B1grant + u
But we actually see produc, where:
- ln(produc) = ln(produc*) + e0, as produc is hard to measure and is self-reported by the firm
-> ln(produc) = B0 + B1grant + u + e0
How would you mathematically define measurement error?
e0 = y - y*, i.e., y = y* + e0
Key assumption for OLS to remain consistent with this measurement error accounted for:
e0 is uncorrelated with all explanatory variables; under MLR.4 u is already uncorrelated with each xj, so the composite error u + e0 is too (a sketch follows)
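A quick sketch of this point (assumed DGP, illustrative numbers): adding classical noise e0 to y leaves the slope estimate centred on the truth but inflates the standard error:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
y_star = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)  # true model
y_obs = y_star + rng.normal(scale=2.0, size=n)          # e0, uncorrelated with x

X = np.column_stack((np.ones(n), x))
for label, y in (("true y*", y_star), ("mismeasured y", y_obs)):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    sigma2_hat = resid @ resid / (n - 2)                    # error variance estimate
    se = np.sqrt(sigma2_hat / ((x - x.mean()) ** 2).sum())  # SE of the slope
    print(f"{label}: slope = {b[1]:.3f}, SE(slope) = {se:.4f}")
# both slopes are about 0.5, but the SE is much larger with the noisy y
```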
Measurement Error in an Independent Variable
- consider: y = B0 + B1x1* + u, with E[u|x1*] = 0
- x1 is noisy, so x1 = x1* + e1
- substitute in x1* = x1 - e1
y = B0 + B1x1 + (u - B1e1)
We know plim(B1^) = B1 + ((Cov(x1, u-B1e1))/(Var(x1)))
- the covariance is generally nonzero, since x1 = x1* + e1 contains e1, so OLS is biased and inconsistent
CEV assumptions, starting with:
plim(B1^) = B1 + ((Cov(x1, u-B1e1))/(Var(x1)))
Sub in x1 = x1* + e1 and use cov(x1*, u) = 0:
Cov(x1, u - B1e1) = cov(e1, u) - B1cov(x1*, e1) - B1o^2(e1)
CEV: assume that cov(e1,u) = cov(x1*,e1) = 0
= -B1o^2(e1)
And Var(x1) = o^2(x1*) + o^2(e1)
Attenuation bias following the CEV assumptions
plim(B1^) = B1 - B1((o^2(e1))/(o^2(x1*) + o^2(e1)))
= B1((o^2(x1*))/(o^2(x1*) + o^2(e1)))
Therefore |plim(B1^)| < |B1|, so OLS is biased toward zero: it underestimates the true effect of x1* due to the measurement error in x1 (a numerical check follows)
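A numerical check of the attenuation formula under the CEV assumptions (all variances below are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta1, var_xstar, var_e1 = 1.0, 4.0, 1.0

x_star = rng.normal(scale=np.sqrt(var_xstar), size=n)
e1 = rng.normal(scale=np.sqrt(var_e1), size=n)  # CEV: e1 uncorrelated with x*, u
x1 = x_star + e1                                # observed, mismeasured regressor
y = 3.0 + beta1 * x_star + rng.normal(size=n)

slope = np.cov(x1, y, ddof=0)[0, 1] / x1.var()  # simple-regression OLS slope
print("OLS slope:    %.3f" % slope)
print("plim formula: %.3f" % (beta1 * var_xstar / (var_xstar + var_e1)))  # 0.8
```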
Why can nonrandom sampling be an issue?
It can violate the OLS sampling assumptions; two cases:
1. Exogenous sampling
2. Endogenous sampling
What is exogenous sampling and why may it be an issue?
The sample is selected on the basis of the explanatory variables rather than y or u; by itself this does not bias OLS when the model is correctly specified
- but if we leave out a variable that is correlated with the included x's, the usual omitted variable bias applies
Endogenous sampling, what is it, and why is it an issue?
The sampling is systematically related to y or u
Let’s say we had GPA = …
- but maybe people with lower GPAs are less likely to provide their GPAs in a survey
- this introduces a sample-selection problem, as the reason a unit isn't observed is systematically related to u (see the sketch below)
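A sketch of this selection problem (the GPA model below is an assumed DGP, not from the lecture): dropping low-GPA respondents flattens the estimated slope:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
study = rng.normal(size=n)
gpa = 3.0 + 0.3 * study + rng.normal(scale=0.4, size=n)

def slope(x, y):
    """Simple-regression OLS slope."""
    return np.cov(x, y, ddof=0)[0, 1] / x.var()

print("full sample slope:     %.3f" % slope(study, gpa))  # about 0.30
reported = gpa > 3.0  # only high-GPA students respond: selection on y
print("selected sample slope: %.3f" % slope(study[reported], gpa[reported]))
# selection on y biases the slope toward zero
```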
How to check that IQ is a good proxy for ability
In the following equation:
abil = k0 + k1educ + k2exper + k3IQ + v3, we could run OLS and test the hypothesis that k1 = k2 = 0 (in practice abil is unobserved, so this works only in a dataset where some ability measure is available)
Are the CEV assumptions realistic?
A simplification that works in theory but may not hold in real-world datasets
Do the CEV assumptions hold under rounding?
I.e., is the assumption sensible when e1 arises from rounding? If the observed x1 is a rounded or truncated version of x1* (e.g., reported educ), it loses variance, and the implied e1 is correlated with the true value, challenging the CEV assumption; see the sketch below
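One way to see the conflict (a made-up coarse reporting scheme for illustration): under CEV, Var(x1) = Var(x1*) + Var(e1) >= Var(x1*), so any reporting scheme that shrinks the variance of the observed variable cannot satisfy CEV, and e1 must be correlated with x1*:

```python
import numpy as np

rng = np.random.default_rng(4)
x_star = rng.uniform(0.0, 4.0, size=100_000)  # true value
x_obs = np.where(x_star < 2.0, 1.0, 3.0)      # reported as a coarse bracket midpoint
e1 = x_obs - x_star                           # implied measurement error

print("Var(x*):     %.3f" % x_star.var())     # about 1.33
print("Var(x_obs):  %.3f" % x_obs.var())      # about 1.00 < Var(x*): rules out CEV
print("cov(x*, e1): %.3f" % np.cov(x_star, e1, ddof=0)[0, 1])  # about -0.33, not 0
```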
How can outliers affect OLS?
If u's distribution has 'fat tails', OLS can break down, as formally:
Var(B1^) = o^2/SSTx, and fat tails mean o^2 is very large, so the variance of the estimator is very high
Questions raised about removing outliers?
Define an outlier
- when should we remove them?
A data point that seems fundamentally different from the rest of the sample
- it is easier to determine whether an observation is an influential observation, i.e. whether excluding it changes the OLS estimates in an important way
LAD - least absolute deviations estimators
LAD minimises the sum of absolute deviations, so it is less influenced by extreme values
y = b0 + u, where E[u] = 0
- E[y] = b0
b0^ = argmin over b0 of SUM|yi - b0| = the sample median of y, whereas OLS minimises the sum of squared residuals, which gives the sample mean of y
- if u, and therefore y, has fat tails, then LAD may be more efficient than OLS
OLS vs LAD
- OLS is more sensitive to outliers because squaring amplifies large residuals
- if u follows a normal distribution, with thin tails, the OLS performs better
- in large samples they estimate the same parameter if u is symmetric, since the mean and median then coincide; see the sketch below
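A small Monte Carlo sketch of the comparison (assumed t(2) errors, invented sample sizes), using that in y = b0 + u, OLS gives the sample mean and LAD the sample median:

```python
import numpy as np

rng = np.random.default_rng(5)
b0, reps, n = 10.0, 2_000, 200

ols_est = np.empty(reps)
lad_est = np.empty(reps)
for r in range(reps):
    y = b0 + rng.standard_t(df=2, size=n)  # symmetric, fat-tailed errors
    ols_est[r] = y.mean()                  # OLS estimate of b0
    lad_est[r] = np.median(y)              # LAD estimate of b0

print("OLS spread across samples: %.3f" % ols_est.std())
print("LAD spread across samples: %.3f" % lad_est.std())
# with fat-tailed t(2) errors the median (LAD) is far more stable than the mean (OLS)
```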