Part 2 : Dummy Variables Flashcards by lew lee

What are dummy variables?

Variables that encode qualitative information.

(So far we have only considered quantitative variables, e.g. wages, savings, etc.)

Suppose we have a model like this:

ln(wagei) = β₀ + β₁ Educi + δ₀ Femalei + εi

Assume Femalei = 1 for females (so Femalei = 0 for males).
δ₀ is the parameter on the dummy variable (same concept as a β).

Find the expected value of the log wage for
A) a female
B) a male
C) So what is the difference in expected log wage between the genders?

A) For Femalei = 1:
Expected log wage = (β₀ + δ₀) + β₁ Educi

B) For Femalei = 0:
Expected log wage = β₀ + β₁ Educi

C)
Subtracting B from A leaves δ₀.
δ₀ is the difference in expected log wage between women and men at a given level of education.
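
As a quick numerical check of the subtraction above, here is a minimal Python sketch. The parameter values are made up purely for illustration (they are not estimates from any data), and `expected_log_wage` and `gender_gap` are hypothetical helper names.

```python
def expected_log_wage(educ, female, b0, b1, d0):
    """Expected log wage from ln(wage) = b0 + b1*Educ + d0*Female (error has mean zero)."""
    return b0 + b1 * educ + d0 * female

# Hypothetical parameter values, chosen only for illustration.
B0, B1, D0 = 1.5, 0.08, -0.25

def gender_gap(educ):
    """Female minus male expected log wage at the same education level."""
    return (expected_log_wage(educ, 1, B0, B1, D0)
            - expected_log_wage(educ, 0, B0, B1, D0))

# The gap is the same at every education level: it is always delta0.
for years in (10, 12, 16):
    print(years, round(gender_gap(years), 10))
```

Whatever education level is plugged in, the β₀ and β₁ terms cancel and only δ₀ remains.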

Why can't we include indicators for both men and women, e.g. ln(wagei) = β₀ + β₁ Educi + δ₀ Femalei + δ′₀ Malei + εi?

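Dummy variable trap: the two categories are mutually exclusive (they cannot occur at the same time) and exhaustive, so including both causes perfect collinearity.

Nobody can have both Femalei = 1 and Malei = 1, so Femalei + Malei = 1 for every individual, which is exactly the constant that multiplies the intercept β₀.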
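
The perfect collinearity can be seen numerically. The sketch below uses a made-up sample of five individuals and two hypothetical helpers (`gram`, `det3`); it builds X'X for the columns [constant, female, male] and shows that its determinant is zero, so the matrix cannot be inverted.

```python
# Hypothetical sample of five individuals: 1 = female, 0 = male.
female = [1, 0, 1, 1, 0]
male = [1 - f for f in female]      # mutually exclusive and exhaustive
const = [1] * len(female)           # the regressor that multiplies the intercept

# The constant column is an exact linear combination of the two dummies.
assert all(c == f + m for c, f, m in zip(const, female, male))

def gram(columns):
    """X'X for a design matrix given as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

xtx = gram([const, female, male])
print(det3(xtx))  # a zero determinant means OLS has no unique solution
```

Dropping either dummy removes the linear dependence, which is why only one of the two is kept.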

What if qualitative variables have multiple categories? What is the general rule of thumb?

For a qualitative variable with m categories, we need to include m-1 dummy variables (in a model with an intercept).

E.g. gender has 2 categories, so m-1 = 1, which is why we only used one dummy variable!

So gender is a qualitative variable with 2 categories.

Example of a qualitative variable with multiple categories:

The sector of the economy in which individual i works can be manufacturing, services or agriculture.

So m = 3. Using the rule of thumb, how can we create a model?

Firstly, the m-1 rule of thumb means we need 2 dummy variables. So…

Manufi = 1 if i is in manufacturing, 0 otherwise
Servi = 1 if i is in services, 0 otherwise
This leaves agriculture as the baseline category.

Then estimate ln(wagei) = β₀ + β₁Educi + δ₀Manufi + δ₁Servi + εi

where δ₀ and δ₁ capture the wage gaps of manufacturing and services, respectively, relative to agriculture (since it is the baseline!)
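
The m-1 encoding can be sketched in a few lines of Python. The sample of sectors is made up, and `make_dummies` is a hypothetical helper written for this illustration, not a library function.

```python
def make_dummies(values, categories, baseline):
    """Return m-1 dummy columns, one per non-baseline category, keyed by category."""
    if baseline not in categories:
        raise ValueError("baseline must be one of the categories")
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != baseline
    }

# Hypothetical sample: the sector each individual works in.
sectors = ["manufacturing", "services", "agriculture", "services", "agriculture"]
dummies = make_dummies(
    sectors,
    categories=["manufacturing", "services", "agriculture"],
    baseline="agriculture",
)
print(dummies["manufacturing"])  # plays the role of Manuf_i
print(dummies["services"])       # plays the role of Serv_i
```

Individuals in agriculture have zeros in both columns, which is what makes agriculture the baseline.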

Assumptions: what two properties must the categories behind the m-1 dummy variables have? (2)

Mutually exclusive (cannot occur at the same time)
Exhaustive

Recall the previous model:
ln(wagei) = β₀ + β₁ Educi + δ₀ Femalei + εi

What if returns to education differ by gender?
(The 2 explanatory variables interact!)

Include a multiplicative dummy variable.

How can we capture this interaction via a multiplicative dummy variable?

A) What would our expression become?

B) Ordinary (additive) dummy variables, as before, only change the intercept (e.g. a higher intercept, i.e. a higher starting wage, for men than for women when δ₀ is negative).
What do multiplicative dummy variables change?

A)
ln(wagei) = β₀ + β₁Educi + δ₀Femalei + δ₁ Femalei × Educi + εi

B)
The multiplicative dummy variable changes the gradient. Together with the additive dummy, the model allows both the gradient AND the intercept to differ between groups.

So now we have the model

ln(wagei) = β₀ + β₁Educi + δ₀Femalei + δ₁ Femalei × Educi + εi

Interpret the coefficients now (find the log wages for females and males separately again!)

For Femalei = 1:
Expected log wage = (β₀ + δ₀) + (β₁ + δ₁) Educi

For Femalei = 0:
Expected log wage = β₀ + β₁ Educi

Subtracting one from the other leaves δ₀ + δ₁ Educi.
So there is a difference in intercepts when δ₀ ≠ 0.
But there is also a difference in gradients when δ₁ ≠ 0 (the returns to a given level of education differ between genders!)
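
Written out step by step (same symbols as the model above, assuming the error term has zero mean given education and gender), the subtraction is:

```latex
\begin{align*}
E[\ln(wage_i) \mid Female_i = 1, Educ_i] &= (\beta_0 + \delta_0) + (\beta_1 + \delta_1)\,Educ_i \\
E[\ln(wage_i) \mid Female_i = 0, Educ_i] &= \beta_0 + \beta_1\,Educ_i \\
\text{difference} &= \delta_0 + \delta_1\,Educ_i
\end{align*}
```

So the gender gap is no longer a single number: it equals δ₀ at Educi = 0 and changes by δ₁ for each extra unit of education.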

Visualising multiplicative dummy variables: 4 scenarios

A) if δ₀=0 and δ₁=0

B) if δ₀≠0 and δ₁=0

C) if δ₀=0 and δ₁≠0

D) if δ₀≠0 and δ₁≠0

A) If both deltas = 0, men and women have the same line (same slope and intercept).

B) If δ₀≠0 and δ₁=0, we have parallel regressions (same gradient, just different intercepts).

C) If δ₀=0 and δ₁≠0, we have concurrent regressions (same intercept, different slopes). So one group has a higher return to education.

D) If δ₀≠0 and δ₁≠0, we have dissimilar regressions: different intercepts and slopes.
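
The four cases can be summarised in a small Python sketch. It works on hypothetical true parameter values; with estimated coefficients one would test whether each delta differs significantly from zero rather than compare it with exact zero. `describe_regressions` is a made-up helper name.

```python
def describe_regressions(d0, d1):
    """Name the scenario implied by delta0 (intercept shift) and delta1 (slope shift)."""
    if d0 == 0 and d1 == 0:
        return "same line: same intercept and same slope"
    if d0 != 0 and d1 == 0:
        return "parallel: different intercepts, same slope"
    if d0 == 0 and d1 != 0:
        return "concurrent: same intercept, different slopes"
    return "dissimilar: different intercepts and different slopes"

# Hypothetical parameter values for the four scenarios.
for d0, d1 in [(0, 0), (-0.2, 0), (0, -0.01), (-0.2, -0.01)]:
    print(d0, d1, "->", describe_regressions(d0, d1))
```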

Just as we can multiply a dummy variable with a quantitative variable (gender x education), we can also interact two dummy variables!

Suppose we suspect a differential gender wage gap in services (the 2 dummies are services and gender, obviously).

ln(wagei) = β₀+β₁Educi + δ₀Femalei + δ₁Servi + εi

What does our model become?

ln(wagei) =
β₀ + β₁Educi + δ₀Femalei + δ₁Servi + δ₂Femalei × Servi + εi

So the interaction term lets the gender wage gap within the service sector differ from the gap elsewhere.

δ₂ Femalei × Servi is the interactive dummy variable.

So how do we interpret δ₀, δ₁ and δ₂?

ln(wagei) =
β₀ + β₁Educi + δ₀Femalei + δ₁Servi + δ₂Femalei × Servi + εi

Find the log wages of males and females in service/non-service!

Male in non-service (Femalei=0, Servi=0) (simplest one)
Expected log wage = β₀ + β₁ Educi

Female in non-service (Femalei=1, Servi=0) (2nd simplest)
Expected log wage = (β₀ + δ₀) + β₁ Educi

Male in service (Femalei=0, Servi=1)
Expected log wage = (β₀ + δ₁) + β₁ Educi

Female in service (Femalei=1, Servi=1)
Expected log wage = (β₀ + δ₀ + δ₁ + δ₂) + β₁ Educi

Male in non-service (Femalei=0, Servi=0)
Expected log wage = β₀ + β₁ Educi
Female in non-service (Femalei=1, Servi=0)
Expected log wage = (β₀ + δ₀) + β₁ Educi
Male in service (Femalei=0, Servi=1)
Expected log wage = (β₀ + δ₁) + β₁ Educi
Female in service (Femalei=1, Servi=1)
Expected log wage = (β₀ + δ₀ + δ₁ + δ₂) + β₁ Educi

What is the difference between women and men in non-service?

Subtract one from the other (female in non-service minus male in non-service):
= δ₀

What is the difference between men in service and men in non-service?

Expected value for men in service minus men in non-service
= δ₁

What is the difference between the service/non-service gap for women and the service/non-service gap for men?

[Expected value for women in service - women in non-service], which is (δ₁+δ₂), minus [expected value for men in service - men in non-service], which is δ₁.

So (δ₁+δ₂) - δ₁
= δ₂
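
The three subtractions can be checked numerically. This is a minimal sketch with made-up parameter values (not estimates); education is held fixed, so only the group intercepts matter, and `intercept` is a hypothetical helper name.

```python
def intercept(female, serv, b0, d0, d1, d2):
    """Group intercept from b0 + d0*Female + d1*Serv + d2*Female*Serv."""
    return b0 + d0 * female + d1 * serv + d2 * female * serv

# Hypothetical parameter values, for illustration only.
params = dict(b0=2.0, d0=-0.25, d1=0.5, d2=-0.125)

male_non = intercept(0, 0, **params)
female_non = intercept(1, 0, **params)
male_serv = intercept(0, 1, **params)
female_serv = intercept(1, 1, **params)

gap_non_service = female_non - male_non                      # delta0
service_premium_men = male_serv - male_non                   # delta1
service_premium_women = female_serv - female_non             # delta1 + delta2
diff_in_diff = service_premium_women - service_premium_men   # delta2

print(gap_non_service, service_premium_men, diff_in_diff)
```

The last quantity is a difference of differences, which is why δ₂ is read as the extra gender gap specific to the service sector.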

OLS estimator of β₁

Use the basic regression model
Yi = β₀ + β₁Xi + εi

Proof for the OLS estimator: not sure if this is required. Ask again to confirm whether it is needed for revision.
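
For reference, the standard OLS estimators for this simple model are β̂₁ = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)² and β̂₀ = Ȳ - β̂₁X̄ (stated here without the proof). A minimal Python sketch, using made-up data and a hypothetical helper name `ols_simple`:

```python
def ols_simple(x, y):
    """OLS estimates (b0_hat, b1_hat) for y = b0 + b1*x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x has no variation, so the slope is not identified")
    b1_hat = s_xy / s_xx
    b0_hat = y_bar - b1_hat * x_bar
    return b0_hat, b1_hat

# Made-up data lying exactly on y = 1 + 2x, so OLS should recover those values.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
b0_hat, b1_hat = ols_simple(x, y)
print(b0_hat, b1_hat)
```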

So far we have looked at dummy explanatory variables (gender, sector on wage).

But sometimes we want to model a dummy dependent variable, e.g. whether an individual is employed. (This is not a quantity, so we use a dummy, e.g. 1 = yes, 0 = no.)

What model do we use for dummy dependent variables?

Linear probability model

It is used for modelling dummy dependent variables.

Di = β₀ + β₁ X₁i + β₂ X₂i + … + βkXki + εi

Standard multiple regression model (just replace Y with D, i.e. the dummy dependent variable)

Questions arising for the linear probability model (3)

How do we interpret βj?
Can we estimate it using OLS?
Are the classical assumptions violated?

First Q: how do we interpret βj with a dummy DEPENDENT variable? E.g. do β₁ in this example.

Give the final equation for β₁ and its interpretation.

Di = 0 or Di = 1, so taking expectations:
E(Di|Xi) = 1 × P(Di = 1|Xi) + 0 × P(Di = 0|Xi) = P(Di = 1|Xi)
So, with E(εi|Xi) = 0, we can rewrite our equation for the expectation:
P(Di = 1|Xi) = β₀ + β₁X₁i + β₂X₂i + … + βkXki

Therefore we can interpret, e.g., β₁ as:
β₁ = ∂P(Di = 1|Xi) / ∂X₁i

β₁ is the change in the probability that Di = 1 following a one-unit change in X₁, holding the other regressors fixed!
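
A small numerical illustration of this interpretation, assuming a single binary regressor: in that special case the OLS slope of the linear probability model equals the difference in the sample share of Di = 1 between the two groups. The data and the variable meaning (X = 1 for a hypothetical "has a degree" indicator, D = 1 if employed) are made up, and `ols_simple` is a hypothetical helper.

```python
def ols_simple(x, y):
    """OLS estimates (b0_hat, b1_hat) for y = b0 + b1*x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1_hat = s_xy / s_xx
    return y_bar - b1_hat * x_bar, b1_hat

# Made-up data: x is a binary regressor, d is the dummy dependent variable.
x = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 1, 1, 1, 1, 0]
b0_hat, b1_hat = ols_simple(x, d)

# Share of d = 1 within each group of x.
share_x0 = sum(di for xi, di in zip(x, d) if xi == 0) / x.count(0)
share_x1 = sum(di for xi, di in zip(x, d) if xi == 1) / x.count(1)
print(b0_hat, b1_hat, share_x1 - share_x0)
```

Here the intercept is the estimated probability when X = 0, and the slope is the estimated change in that probability when X moves from 0 to 1.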

Drawbacks of using this linear probability model (4)

Non-normality of error term

Heteroscedasticity

Predicted probabilities outside [0,1]

Assumes constant marginal effects

Non-normality of error term

ε̂i = Di - D̂i
(Residual = actual dummy variable - predicted dummy variable)

And D̂i = β̂₀ + β̂₁X₁i + β̂₂X₂i + … + β̂kXki
So

ε̂i = Di - (β̂₀ + β̂₁X₁i + β̂₂X₂i + … + β̂kXki)

And since Di can only be 0 or 1, there are only 2 possible values for a given Xi:

0 - (β̂₀ + β̂₁X₁i + β̂₂X₂i + … + β̂kXki)
or
1 - (β̂₀ + β̂₁X₁i + β̂₂X₂i + … + β̂kXki)

Therefore ε is not normally distributed!

So what does this mean for ε̂ (the residual)?

Also evaluate.

It cannot be normally distributed.

Eval: in practice, with a large sample N, the OLS estimators are still approximately normally distributed, so this matters less for inference.

2nd drawback of the linear probability model: heteroscedasticity

The proof shows that the variance of the error term is not constant (not always = σ²), because it depends on Xi. However, we can deal with this using the estimators covered before, e.g. WLS.
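
A sketch of that proof, writing Pi = P(Di = 1|Xi) = β₀ + β₁X₁i + … + βkXki (the shorthand Pi is introduced here for convenience). Given Xi, the error εi = Di - Pi equals 1 - Pi with probability Pi and -Pi with probability 1 - Pi, so:

```latex
\begin{align*}
E(\varepsilon_i \mid X_i) &= P_i(1 - P_i) + (1 - P_i)(-P_i) = 0 \\
Var(\varepsilon_i \mid X_i) &= P_i(1 - P_i)^2 + (1 - P_i)(-P_i)^2 \\
&= P_i(1 - P_i)\,[(1 - P_i) + P_i] \\
&= P_i(1 - P_i)
\end{align*}
```

Because Pi depends on Xi, the variance differs across observations, which is exactly heteroscedasticity.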

3. Predicted probabilities outside [0,1]

Sometimes we get nonsensical predictions as a result (a predicted probability below 0 or above 1).
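
A minimal sketch of how this happens, assuming a one-regressor linear probability model with made-up fitted coefficients (`lpm_predict` is a hypothetical helper): because the fitted line is unbounded, extreme values of X push the prediction outside [0,1].

```python
def lpm_predict(x, b0, b1):
    """Fitted 'probability' from a one-regressor linear probability model."""
    return b0 + b1 * x

# Hypothetical fitted coefficients, for illustration only.
B0_HAT, B1_HAT = -0.2, 0.1

predictions = {x: lpm_predict(x, B0_HAT, B1_HAT) for x in (0, 5, 15)}

# Keep only the predictions that are not valid probabilities.
out_of_range = {x: p for x, p in predictions.items() if p < 0 or p > 1}
print(out_of_range)
```

Only the middle value of X gives a prediction that can be read as a probability; the other two fall below 0 and above 1.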

4. Assumes constant marginal effects

The LPM assumes this, but it is unrealistic. In real life Xi may have a larger impact in the centre of the distribution and less impact at the extremes, e.g. a larger impact for females in the middle of the wage distribution than for females at the high end!