Lecture 8 Flashcards
What is a proxy variable and why would we use it?
A proxy variable substitutes for an unobserved variable in a regression model
- the idea is to mitigate the bias caused by the unobserved variable by including a measurable variable that correlates with it
E.g., if ability is an unobserved factor affecting wages, a measurable proxy like an IQ score can be included
Example of a regression model where a proxy variable might be needed:
ln(wage) = B0 + B1educ + B2exper + B3abil + u
In practice, we can't include ability as it is unobserved
- but if we omit abil and educ is correlated with abil, OLS suffers from omitted variable bias (OVB)
Therefore we use a measured variable like an IQ score as a proxy for ability
Two key requirements for a proxy to 'work'
- The proxy x3 must not directly affect y; the unobserved x3* is the actual driver and x3 is only a substitute, so cov(x3, u) = 0
- Once x3 is included in the model, it should fully capture x3*, i.e., x1 and x2 should add nothing to the explanation of x3* beyond x3
Write x3* as a linear function of x3 and an error v3
x3* = k0 + k3x3 + v3
- this equation recognises that x3* is not perfectly related to x3
- cov(x1, v3) = 0 and cov(x2, v3) = 0, so conditional on x3, x1 and x2 are uncorrelated with x3*
OLS with proxy under the two requirements
Plug the equation x3* = k0 + k3x3 + v3 into the original equation and rearrange:
y = (B0 + B3k0) + B1x1 + B2x2 + B3k3x3 + (u + B3v3)
- the new composite error u + B3v3 has zero mean and is uncorrelated with x1, x2, and x3, so OLS consistently estimates B1, B2, and B3k3 (a simulation sketch follows)
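A minimal simulation sketch in Python (the DGP, coefficients, and variable names are invented for illustration, not from the lecture). It is built so both proxy requirements hold: abil depends on IQ plus noise v3, and educ is related to abil only through IQ. Including IQ then removes the omitted-variable bias on educ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# abil = k0 + k3*IQ + v3, with v3 uncorrelated with educ (requirement 2)
iq = rng.normal(100, 15, size=n)
v3 = rng.normal(size=n)
abil = -5.0 + 0.05 * iq + v3
educ = 5.0 + 0.07 * iq + rng.normal(size=n)  # linked to abil only via IQ
# IQ has no direct effect on lnwage (requirement 1)
lnwage = 1.0 + 0.08 * educ + 0.10 * abil + rng.normal(scale=0.3, size=n)

def ols(y, *cols):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack((np.ones(len(y)),) + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("educ coef, abil omitted: %.4f" % ols(lnwage, educ)[1])      # biased upward
print("educ coef, IQ as proxy:  %.4f" % ols(lnwage, educ, iq)[1])  # close to 0.08
```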
Measurement error, what is it?
Measurement error is when the observed value of a variable does not equal its true value
-> x_obs = x_true + measurement error
- can lead to endogeneity, and hence biased and inconsistent coefficient estimates
Measurement error in the dependent variable
In many cases we can't observe y* directly; instead we see y, a noisy or imprecise version of y* (e.g., a family might report annual savings inaccurately)
- inflates the error term, leading to less precise estimates and larger SEs
Let's say ln(produc*) = B0 + B1grant + u
But we actually see produc, where:
- ln(produc) = ln(produc*) + e0, as produc is hard to measure and is self-reported by the firm
-> ln(produc) = B0 + B1grant + u + e0
How would you mathematically define measurement error?
e0 = y - y*, i.e., y = y* + e0
Key assumption for OLS to remain consistent with this measurement error accounted for:
e0 is uncorrelated with all explanatory variables; under MLR.4 u is already uncorrelated with each xj, so the composite error u + e0 is too (a sketch follows)
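A quick sketch of this point (assumed DGP, illustrative numbers): adding classical noise e0 to y leaves the slope estimate centred on the truth but inflates the standard error:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
y_star = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)  # true model
y_obs = y_star + rng.normal(scale=2.0, size=n)          # e0, uncorrelated with x

X = np.column_stack((np.ones(n), x))
for label, y in (("true y*", y_star), ("mismeasured y", y_obs)):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    sigma2_hat = resid @ resid / (n - 2)                    # error variance estimate
    se = np.sqrt(sigma2_hat / ((x - x.mean()) ** 2).sum())  # SE of the slope
    print(f"{label}: slope = {b[1]:.3f}, SE(slope) = {se:.4f}")
# both slopes are about 0.5, but the SE is much larger with the noisy y
```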
Measurement Error in an Independent Variable
- consider: y = B0 + B1x1* + u, with E[u|x1*] = 0
- x1 is noisy, so x1 = x1* + e1
- substitute in x1* = x1 - e1
y = B0 + B1x1 + (u - B1e1)
We know plim(B1^) = B1 + ((Cov(x1, u-B1e1))/(Var(x1)))
- the covariance is generally nonzero, since x1 = x1* + e1 contains e1, so OLS is biased and inconsistent
CEV assumptions, starting with:
plim(B1^) = B1 + ((Cov(x1, u-B1e1))/(Var(x1)))
Sub in x1 = x1* + e1 and use cov(x1*, u) = 0:
Cov(x1, u - B1e1) = cov(e1, u) - B1cov(x1*, e1) - B1o^2(e1)
CEV: assume that cov(e1,u) = cov(x1*,e1) = 0
= -B1o^2(e1)
And Var(x1) = o^2(x1*) + o^2(e1)
Attenuation bias following the CEV assumptions
plim(B1^) = B1 - B1((o^2(e1))/(o^2(x1*) + o^2(e1)))
= B1((o^2(x1*))/(o^2(x1*) + o^2(e1)))
Therefore |plim(B1^)| < |B1|, so OLS is biased toward zero: it underestimates the true effect of x1* due to the measurement error in x1 (a numerical check follows)
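A numerical check of the attenuation formula under the CEV assumptions (all variances below are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta1, var_xstar, var_e1 = 1.0, 4.0, 1.0

x_star = rng.normal(scale=np.sqrt(var_xstar), size=n)
e1 = rng.normal(scale=np.sqrt(var_e1), size=n)  # CEV: e1 uncorrelated with x*, u
x1 = x_star + e1                                # observed, mismeasured regressor
y = 3.0 + beta1 * x_star + rng.normal(size=n)

slope = np.cov(x1, y, ddof=0)[0, 1] / x1.var()  # simple-regression OLS slope
print("OLS slope:    %.3f" % slope)
print("plim formula: %.3f" % (beta1 * var_xstar / (var_xstar + var_e1)))  # 0.8
```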
Why can nonrandom sampling be an issue?
It can violate the OLS sampling assumptions; two cases:
1. Exogenous sampling
2. Endogenous sampling
What is exogenous sampling and why may it be an issue?
The sample is selected on the basis of the explanatory variables rather than y or u; by itself this does not bias OLS when the model is correctly specified
- but if we leave out a variable that is correlated with the included x's, the usual omitted variable bias applies
Endogenous sampling, what is it, and why is it an issue?
The sampling is systematically related to y or u
Let’s say we had GPA = …
- but maybe people with lower GPAs are less likely to provide their GPAs in a survey
- this introduces a sample-selection problem, as the reason a unit isn't observed is systematically related to u (see the sketch below)
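A sketch of this selection problem (the GPA model below is an assumed DGP, not from the lecture): dropping low-GPA respondents flattens the estimated slope:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
study = rng.normal(size=n)
gpa = 3.0 + 0.3 * study + rng.normal(scale=0.4, size=n)

def slope(x, y):
    """Simple-regression OLS slope."""
    return np.cov(x, y, ddof=0)[0, 1] / x.var()

print("full sample slope:     %.3f" % slope(study, gpa))  # about 0.30
reported = gpa > 3.0  # only high-GPA students respond: selection on y
print("selected sample slope: %.3f" % slope(study[reported], gpa[reported]))
# selection on y biases the slope toward zero
```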
How to check that IQ is a good proxy for ability
In the following equation:
abil = k0 + k1educ + k2exper + k3IQ + v3, we could run OLS and test the hypothesis that k1 = k2 = 0 (in practice abil is unobserved, so this works only in a dataset where some ability measure is available)
Are the CEV assumptions realistic?
A simplification that works in theory but may not hold in real-world datasets
Do the CEV assumptions hold under rounding?
I.e., is the assumption sensible when e1 arises from rounding? If the observed x1 is a rounded or truncated version of x1* (e.g., reported educ), it loses variance, and the implied e1 is correlated with the true value, challenging the CEV assumption; see the sketch below
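One way to see the conflict (a made-up coarse reporting scheme for illustration): under CEV, Var(x1) = Var(x1*) + Var(e1) >= Var(x1*), so any reporting scheme that shrinks the variance of the observed variable cannot satisfy CEV, and e1 must be correlated with x1*:

```python
import numpy as np

rng = np.random.default_rng(4)
x_star = rng.uniform(0.0, 4.0, size=100_000)  # true value
x_obs = np.where(x_star < 2.0, 1.0, 3.0)      # reported as a coarse bracket midpoint
e1 = x_obs - x_star                           # implied measurement error

print("Var(x*):     %.3f" % x_star.var())     # about 1.33
print("Var(x_obs):  %.3f" % x_obs.var())      # about 1.00 < Var(x*): rules out CEV
print("cov(x*, e1): %.3f" % np.cov(x_star, e1, ddof=0)[0, 1])  # about -0.33, not 0
```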
How can outliers affect OLS?
If u's distribution has 'fat tails', OLS can break down, as formally:
Var(B1^) = o^2/SSTx, and fat tails mean o^2 is very large, so the variance of the estimator is very high
Questions raised about removing outliers?
Define an outlier
- when should we remove them?
A data point that seems fundamentally different from the rest of the sample
- it is easier to determine whether an observation is an influential observation, i.e. whether excluding it changes the OLS estimates in an important way
LAD - least absolute deviations estimators
LAD minimises the sum of absolute deviations, so it is less influenced by extreme values
y = b0 + u, where E[u] = 0
- E[y] = b0
b0^ = argmin over b0 of SUM|yi - b0| = the sample median of y, whereas OLS minimises the sum of squared residuals, which gives the sample mean of y
- if u, and therefore y, has fat tails, then LAD may be more efficient than OLS
OLS vs LAD
- OLS is more sensitive to outliers because squaring amplifies large residuals
- if u follows a normal distribution, with thin tails, the OLS performs better
- in large samples they estimate the same parameter if u is symmetric, since the mean and median then coincide; see the sketch below
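A small Monte Carlo sketch of the comparison (assumed t(2) errors, invented sample sizes), using that in y = b0 + u, OLS gives the sample mean and LAD the sample median:

```python
import numpy as np

rng = np.random.default_rng(5)
b0, reps, n = 10.0, 2_000, 200

ols_est = np.empty(reps)
lad_est = np.empty(reps)
for r in range(reps):
    y = b0 + rng.standard_t(df=2, size=n)  # symmetric, fat-tailed errors
    ols_est[r] = y.mean()                  # OLS estimate of b0
    lad_est[r] = np.median(y)              # LAD estimate of b0

print("OLS spread across samples: %.3f" % ols_est.std())
print("LAD spread across samples: %.3f" % lad_est.std())
# with fat-tailed t(2) errors the median (LAD) is far more stable than the mean (OLS)
```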