Lecture 8 Flashcards

1
Q

What is a proxy variable and why would we use it?

A

A proxy variable is a measurable variable used to substitute for an unobserved variable in a regression model.

  • it mitigates the bias caused by the unobserved variable by including an observable variable that correlates with the unobserved one
    E.g. if ability is an unobserved factor affecting wages, a measurable proxy like an IQ score can be included
2
Q

Example of a regression model where a proxy variable might have to be used:

ln(wage) = B0 + B1educ + B2exper + B3abil + u

A

In practice, we can't include ability as it is unobserved
- but if we omit abil and, say, educ and abil are correlated, OLS suffers from omitted variable bias (OVB)

Therefore we use a measured variable like an IQ score as a proxy for ability

3
Q

2 key requirements for a proxy to ‘work’

A
  1. The proxy x3 must not directly affect y; the unobserved x3* is the actual driver and x3 is only a substitute, so cov(x3, u) = 0
  2. Once x3 is controlled for, it should fully capture x3*, i.e., x3* can't be better explained by any combination of x1, x2 and x3 than by x3 alone
(both conditions are formalised in the sketch below)
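A compact formal statement of the two conditions — a sketch in the generic notation of the next cards, assuming the model y = B0 + B1x1 + B2x2 + B3x3* + u with proxy x3:

```latex
% Requirement 1 (redundancy): the proxy x3 has no direct effect on y once
% the unobserved x3* is controlled for, so u is uncorrelated with x3:
\[
  \mathrm{E}(y \mid x_1, x_2, x_3^{*}, x_3) = \mathrm{E}(y \mid x_1, x_2, x_3^{*})
\]
% Requirement 2: conditional on the proxy, the other regressors do not help
% predict the unobserved variable:
\[
  \mathrm{E}(x_3^{*} \mid x_1, x_2, x_3) = \mathrm{E}(x_3^{*} \mid x_3) = \kappa_0 + \kappa_3 x_3
\]
```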
4
Q

Write x3* as a linear function of x3 and an error v3

A

x3* = k0 + k3x3 + v3
- this equation recognises that x3* is not perfectly related to x3
- cov(x1, v3) = 0 and cov(x2, v3) = 0, so essentially, conditional on x3, x1 and x2 are not partially correlated with x3*

5
Q

OLS with proxy under the two requirements

A

Plug the x3* = k0 + k3x3 + v3 equation into the original model and rearrange (see the sketch below)
- the expected value of the new composite error (u + B3v3) is 0 and it is uncorrelated with x1, x2 and x3, so OLS consistently estimates the parameters of the rewritten equation.
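A sketch of the algebra, using x3* = k0 + k3x3 + v3 from the previous card:

```latex
% Substitute x3* = k0 + k3 x3 + v3 into y = B0 + B1 x1 + B2 x2 + B3 x3* + u:
\[
\begin{aligned}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3(\kappa_0 + \kappa_3 x_3 + v_3) + u \\
  &= \underbrace{(\beta_0 + \beta_3\kappa_0)}_{\text{new intercept}}
   + \beta_1 x_1 + \beta_2 x_2
   + \underbrace{(\beta_3\kappa_3)}_{\text{slope on } x_3}\, x_3
   + \underbrace{(u + \beta_3 v_3)}_{\text{new error, mean } 0}
\end{aligned}
\]
```

Note that OLS recovers B1 and B2 consistently, while the intercept and the slope on x3 estimate B0 + B3k0 and B3k3 rather than B0 and B3 themselves.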

6
Q

Measurement error, what is it?

A

Measurement error is when the observed value of a variable does not equal its true value:
-> x_obs = x_true + ME
- it can lead to endogeneity, and hence biased and inconsistent estimates of the coefficients.

7
Q

Measurement error in the dependent variable

A

In many cases we can't observe y* directly; instead we see y, a noisy or imprecise version of y* (e.g. a family might report its annual savings inaccurately)

  • the measurement error inflates the error term, leading to less precise estimates and larger SEs (see the sketch below)
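In symbols — a sketch, assuming the measurement error e0 = y − y* is uncorrelated with u:

```latex
% Substituting y = y* + e0 into y* = B0 + B1 x1 + ... + u gives:
\[
  y = \beta_0 + \beta_1 x_1 + \dots + (u + e_0),
  \qquad
  \operatorname{Var}(u + e_0) = \sigma_u^2 + \sigma_{e_0}^2 \;\ge\; \sigma_u^2 .
\]
```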
8
Q

Let’s say ln(produc*) = B0 + B1grant + u

A

But we actually see produc, where:
- ln(produc) = ln(produc*) + e0, as produc is hard to measure and is self-reported by the firm

-> substituting: ln(produc) = B0 + B1grant + (u + e0)

9
Q

How would you mathematically define measurement error (in y)?

A

e0 = y - y*, i.e. y = y* + e0

10
Q

Key assumption for OLS to remain consistent once this measurement error is accounted for:

A

e0 must be uncorrelated with all explanatory variables; since under MLR.4 u is already uncorrelated with each xj, the composite error (u + e0) is then also uncorrelated with each xj.

11
Q

Measurement Error in an Independent Variable
- consider: y = B0 + B1x1* + u, E[u|x1*] = 0
- x1 is noisy, so x1 = x1* + e1

A
  • substitute in x1* = x1 - e1:
    y = B0 + B1x1 + (u - B1e1)

We know plim(B1^) = B1 + Cov(x1, u - B1e1)/Var(x1)
- the covariance generally isn't 0, as x1 contains e1, so the estimator is biased and inconsistent

12
Q

CEV (classical errors-in-variables) assumptions, starting with:

plim(B1^) = B1 + Cov(x1, u - B1e1)/Var(x1)

A

Sub in x1 = x1* + e1 and use cov(x1*, u) = 0:

Cov(x1, u - B1e1) = cov(e1, u) - B1cov(x1*, e1) - B1Var(e1)

CEV: assume that cov(e1, u) = cov(x1*, e1) = 0

=> Cov(x1, u - B1e1) = -B1σ²(e1)
And Var(x1) = σ²(x1*) + σ²(e1) (step-by-step expansion below)
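The full expansion, step by step (a sketch):

```latex
% Expand Cov(x1, u - B1 e1) after substituting x1 = x1* + e1:
\[
\begin{aligned}
\operatorname{Cov}(x_1,\, u - \beta_1 e_1)
 &= \operatorname{Cov}(x_1^{*} + e_1,\, u - \beta_1 e_1) \\
 &= \underbrace{\operatorname{Cov}(x_1^{*}, u)}_{=\,0}
   + \operatorname{Cov}(e_1, u)
   - \beta_1 \operatorname{Cov}(x_1^{*}, e_1)
   - \beta_1 \operatorname{Var}(e_1) \\
 &= -\beta_1 \sigma_{e_1}^{2}
   \qquad \text{(CEV: } \operatorname{Cov}(e_1, u) = \operatorname{Cov}(x_1^{*}, e_1) = 0 \text{)}
\end{aligned}
\]
```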

13
Q

Attenuation bias following the CEV assumptions

A

plim(B1^) = B1 - B1(σ²(e1)/(σ²(x1*) + σ²(e1)))

= B1(σ²(x1*)/(σ²(x1*) + σ²(e1)))

Therefore |plim(B1^)| < |B1|, so OLS underestimates (attenuates) the true effect of x1* due to the measurement error in x1 (illustrated in the simulation sketch below).
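A minimal simulation sketch of the attenuation (assuming NumPy is available; all names are illustrative). With σ²(x1*) = 1 and σ²(e1) = 0.25, the slope should shrink toward B1 · 1/1.25 = 0.8·B1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta1 = 100_000, 2.0

x_star = rng.normal(0.0, 1.0, n)   # true regressor, variance 1
e1 = rng.normal(0.0, 0.5, n)       # classical measurement error, variance 0.25
x_obs = x_star + e1                # observed, mismeasured regressor
u = rng.normal(0.0, 1.0, n)
y = 1.0 + beta1 * x_star + u       # the true model uses x_star

# OLS slope of y on the mismeasured x1: Cov(y, x1) / Var(x1)
b1_hat = np.cov(y, x_obs)[0, 1] / np.var(x_obs, ddof=1)
print(b1_hat)  # close to 2.0 * (1 / 1.25) = 1.6, not the true 2.0
```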

14
Q

Why can nonrandom sampling be an issue?

A

It can violate the random-sampling assumption behind OLS. Two cases to distinguish:
1. Exogenous sampling
2. Endogenous sampling

15
Q

What is exogenous sampling and why may it be an issue?

A

The sampling process depends only on the explanatory variables, not on y or the error term. By itself this does not bias OLS, since E[u|x] = 0 still holds within the selected sample.

  • it becomes an issue if the model is misspecified, e.g. if we leave out a variable that is correlated with the included x's, which leads to omitted variable bias.
16
Q

Endogenous sampling, what is it, and why is it an issue?

A

The sampling is systematically related to y or u.

Let's say we had GPA = …
- but people with lower GPAs may be less likely to provide their GPAs in a survey
- this introduces a sample-selection problem, as the reason a unit can't be observed is systematically related to u

17
Q

How to ensure IQ is a good proxy for ability

A

In the following equation:
abil = k0 + k1educ + k2exper + k3IQ + v3,
IQ is a good proxy if k1 = k2 = 0, i.e. conditional on IQ, educ and exper do not help explain ability; in principle one would run this regression by OLS and test the joint hypothesis k1 = k2 = 0 (though in practice abil is unobserved, so the condition has to be argued rather than tested)

18
Q

Are the CEV assumptions realistic?

A

They are a simplification that works in theory but may not hold in real-world datasets

19
Q

Do the CEV assumptions hold under rounding?

A

I.e., is the assumption sensible when e1 arises from rounding? If the observed x (e.g. educ) is a rounded/truncated version of the true value, it loses variance, and the rounding error e1 is then mechanically correlated with educ; this challenges the CEV assumption that cov(x1*, e1) = 0

20
Q

How can outliers affect OLS?

A

If u's distribution has ‘fat tails’, OLS can break down. Formally:
Var(B1^) = σ²/SSTx; fat tails mean σ² is very large, so the variance of the estimator is very high.

This raises the question of whether outliers should be removed.

21
Q

Define an outlier
- when should we remove them?

A

A data point that seems fundamentally different from the rest of the sample
- it is easier to ask whether an observation is an influential observation, i.e. whether excluding it would change the OLS estimates in an important way

22
Q

LAD - least absolute deviations estimators

A

LAD minimises the sum of absolute deviations, so it is less influenced by extreme values.
y = b0 + u, where E[u] = 0
- E[y] = b0

b0^ = argmin Σ|yi - b0| = the sample median of y, whereas OLS minimises the sum of squared residuals, which gives the sample mean of y (see the sketch below)

  • if u, and therefore y, has fat tails, then LAD may be more efficient than OLS
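A minimal sketch of the intercept-only comparison (assuming NumPy; the t(2) errors are an illustrative fat-tailed choice):

```python
import numpy as np

rng = np.random.default_rng(1)
b0 = 5.0

# Fat-tailed, symmetric errors: Student-t with 2 degrees of freedom
y = b0 + rng.standard_t(df=2, size=500)

# For y = b0 + u, OLS minimises the sum of squared residuals  -> sample mean
# LAD minimises the sum of absolute residuals                 -> sample median
print("OLS estimate (mean):  ", y.mean())
print("LAD estimate (median):", np.median(y))
```

With regressors added, LAD corresponds to median (quantile) regression, e.g. statsmodels' QuantReg with q = 0.5.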
23
Q

OLS vs LAD

A
  • OLS is more sensitive to outliers because squaring magnifies large residuals
  • if u follows a normal distribution, with thin tails, OLS performs better
  • in large samples they estimate the same quantity if u's distribution is symmetric (so its mean equals its median)