Econometrics 7: Dummy Variables Flashcards
What are ‘dummy variables’?
Dummy variables (dummies) assume only two values: {0,1}
▶ Also known as: binary variables, indicator variables, zero-one variables, …
We use dummy variables to encode qualitative information, e.g.:
▶ Female_i = 1 if individual i is female, 0 otherwise
▶ Married_i = 1 if individual i is married, 0 otherwise
▶ Recession_t = 1 if the economy is in recession in quarter t, 0 otherwise
Things to note at this stage:
▶ We wrote down the gender dummy as Female_i, which “switches on” for
female individuals
▶ We could equally have written Male_i, switching on for male individuals
▶ In terms of econometrics, picking one or the other makes no difference
▶ In terms of economic interpretation, the choice matters and is
context-dependent
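As a minimal sketch of how qualitative information can be encoded as dummies, the snippet below builds Female, Married and Male indicators from a small made-up dataset. The records, field names and variable names are assumptions for illustration only.

```python
# Hypothetical raw data: one qualitative record per individual.
people = [
    {"name": "A", "gender": "female", "married": True},
    {"name": "B", "gender": "male", "married": False},
    {"name": "C", "gender": "female", "married": False},
]

# A dummy "switches on" (equals 1) when the condition holds, and is 0 otherwise.
female = [1 if p["gender"] == "female" else 0 for p in people]
married = [1 if p["married"] else 0 for p in people]

# With two categories, the Male dummy carries the same information as Female:
# Male_i = 1 - Female_i.
male = [1 - f for f in female]

print(female)   # [1, 0, 1]
print(married)  # [1, 0, 0]
print(male)     # [0, 1, 0]
```

Either Female or Male could be used in a regression; the choice only changes which group the dummy switches on for.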
Using an example where we’re interested in the gender pay gap, how might we use dummy variables?
Include: how δ0 would be interpreted, its visual representation, modelling choices
Suppose we are interested in the gender pay gap
We might specify a model like this:
ln(wage_i) = β0 + β1 Educ_i + δ0 Female_i + ε_i
Where ln(wage_i) is the log wage of individual i, Educ_i is years of education, and Female_i is a dummy variable taking a value of 1 for female individuals
We know how to interpret β1, but how to interpret δ0?
Let us first consider the expected value of the log wage of individuals
with Female_i = 1 and a given level of education (Educ_i)
E(ln(wage_i) | Female_i = 1, Educ_i) = (β0 + δ0) + β1 Educ_i
And now that of individuals with Female_i = 0
E(ln(wage_i) | Female_i = 0, Educ_i) = β0 + β1 Educ_i
Subtracting one from the other:
E(ln(wage_i) | Female_i = 1, Educ_i) − E(ln(wage_i) | Female_i = 0, Educ_i) = δ0
So: δ0 is the difference in expected log wage between women and men
(women minus men) for a given level of education
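The derivation above can be checked numerically. The sketch below uses noise-free, made-up data (the parameter values 1.0, 0.1 and −0.2 are assumptions, not estimates from any real dataset), fits the model by least squares with NumPy, and confirms that the gap in expected log wage at a given education level is the coefficient on the Female dummy.

```python
import numpy as np

# Hypothetical, noise-free data so that OLS recovers the parameters exactly.
beta0, beta1, delta0 = 1.0, 0.1, -0.2   # assumed values, for illustration only
educ = np.array([10, 12, 14, 16, 10, 12, 14, 16], dtype=float)
female = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
ln_wage = beta0 + beta1 * educ + delta0 * female

# Design matrix: constant, Educ_i, Female_i.
X = np.column_stack([np.ones_like(educ), educ, female])
coef, *_ = np.linalg.lstsq(X, ln_wage, rcond=None)
b0_hat, b1_hat, d0_hat = coef

# Expected log wage at a given level of education, for each group.
educ_level = 12.0
e_female = (b0_hat + d0_hat) + b1_hat * educ_level
e_male = b0_hat + b1_hat * educ_level

# The gap does not depend on educ_level: it is the coefficient on Female_i.
gap = e_female - e_male
print(d0_hat, gap)
```

Changing `educ_level` leaves `gap` unchanged, which is the "for a given level of education" part of the interpretation.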
Visual interpretation
Graph with “wage” on the y-axis and “educ” on the x-axis, showing two parallel upward-sloping lines that both start above the origin on the y-axis. Both are labelled “slope = β1”. The top line is labelled “men: wage = β0 + β1educ”. The bottom line is labelled “women: wage = (β0 + δ0) + β1educ”. The y-intercept of the bottom line is marked “β0 + δ0” and the y-intercept of the top line is marked “β0”.
A few things to note:
▶ The intercept for men is at β0, the intercept for women is at β0 + δ0
▶ β0 + δ0 < β0 ⇒ δ0 < 0
▶ The gradient on Educ_i (β1) is the same for men and women (parallel)
▶ Interpretation: δ0 is an intercept shift, with no change in gradient
Note that we chose to specify the model with a Female_i dummy
▶ Making men (Female_i = 0) the base, benchmark or reference group
▶ And therefore giving δ0 an interpretation as the difference in intercepts
for women compared to men
We could equally have used a Male_i dummy, with women as reference group
ln(wage_i) = β′0 + β1 Educ_i + δ′0 Male_i + ε_i
Econometrically, this would be equivalent, but of course δ′0 = −δ0 (and the intercept is now the one for women, β′0 = β0 + δ0)
▶ Modelling choice will depend on what makes most sense for interpretation
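To illustrate that the choice of reference group does not change the fit, the sketch below estimates both specifications on the same simulated data. The data-generating process, seed and helper name `ols` are assumptions for illustration; only the algebraic relationships between the two sets of estimates matter.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the data are simulated, not real

n = 200
educ = rng.integers(8, 21, size=n).astype(float)
female = rng.integers(0, 2, size=n).astype(float)
male = 1.0 - female
# Assumed data-generating process, for illustration only.
ln_wage = 1.0 + 0.1 * educ - 0.2 * female + rng.normal(0.0, 0.3, size=n)


def ols(X, y):
    """Return OLS coefficients computed by least squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


const = np.ones(n)
b_f = ols(np.column_stack([const, educ, female]), ln_wage)  # men as reference
b_m = ols(np.column_stack([const, educ, male]), ln_wage)    # women as reference

print("Female model (const, educ, female):", b_f)
print("Male model   (const, educ, male):  ", b_m)
```

In the output, the dummy coefficients have opposite signs and equal magnitude, the coefficient on education is identical, and the intercept of the Male model equals the intercept plus the dummy coefficient of the Female model.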
Why don’t we include indicators for both men and women?
ln(wage_i) = β0 + β1 Educ_i + δ0 Female_i + δ′0 Male_i + ε_i
This is because of the dummy variable trap:
▶ Since for each i we must have Female_i = 1 or Male_i = 1, but not both:
Female_i + Male_i = 1
▶ That is: Female_i and Male_i sum to the constant term, so the regressors are perfectly collinear
▶ We therefore cannot separately estimate β0, δ0 and δ′0
In general: for a qualitative variable with m categories, we need to
include m − 1 dummy variables (in a model that contains an intercept)
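The dummy variable trap can be seen directly in the rank of the design matrix. The following is a minimal sketch with made-up values: with a constant and both dummies the matrix has four columns but only rank three, so the coefficients are not separately identified.

```python
import numpy as np

educ = np.array([10, 12, 14, 16, 11, 13], dtype=float)  # hypothetical values
female = np.array([1, 0, 1, 0, 0, 1], dtype=float)
male = 1.0 - female
const = np.ones_like(educ)

# Including both dummies alongside the constant gives 4 columns ...
X_trap = np.column_stack([const, educ, female, male])
# ... but Female_i + Male_i reproduces the constant column exactly,
# so the rank is lower than the number of columns.
rank_trap = np.linalg.matrix_rank(X_trap)

# Dropping one dummy (m - 1 = 1 dummy for m = 2 categories) restores full rank.
X_ok = np.column_stack([const, educ, female])
rank_ok = np.linalg.matrix_rank(X_ok)

print(X_trap.shape[1], rank_trap)  # number of columns vs rank
print(X_ok.shape[1], rank_ok)
```

A design matrix whose rank is below its column count has no unique least-squares solution, which is why one category is left out as the reference group.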
Describe dummy variables for multiple categories
So far, we considered qualitative variables with 2 categories
▶ But what if we want to consider multiple categories?
Suppose we have a categorical variable, Sector_i, containing information on the
sector of the economy in which individual i works, coded with the following values
▶ Sector_i = 1 if i is in agriculture
▶ Sector_i = 2 if i is in manufacturing
▶ Sector_i = 3 if i is in services
One could use Sector_i directly, and estimate
ln(wage_i) = β0 + β1 Educ_i + β2 Sector_i + ε_i
β2 can be estimated, but it is not clear how one would interpret it
▶ Using Sector_i directly also requires the difference between agriculture and
manufacturing to be the same as that between manufacturing and services (which is unrealistic)
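The equal-spacing restriction can be illustrated with a small sketch. The data below are made up so that the true gaps between sectors differ (0.5 and 0.1 in log wage), and education is left out for brevity; both are assumptions for illustration. Regressing on the numeric sector code nevertheless forces the two fitted gaps to be the same.

```python
import numpy as np

# Hypothetical data: two workers per sector, with unequal gaps in mean log wage.
sector = np.array([1, 1, 2, 2, 3, 3], dtype=float)  # 1=agri, 2=manuf, 3=services
ln_wage = np.array([1.0, 1.0, 1.5, 1.5, 1.6, 1.6])

# Regress ln(wage) on a constant and the numeric Sector code.
X = np.column_stack([np.ones_like(sector), sector])
coef, *_ = np.linalg.lstsq(X, ln_wage, rcond=None)
b0_hat, b2_hat = coef

# Fitted log wage in each sector under the linear-in-code specification.
fitted = {s: b0_hat + b2_hat * s for s in (1, 2, 3)}
gap_manuf_agri = fitted[2] - fitted[1]
gap_serv_manuf = fitted[3] - fitted[2]

print(gap_manuf_agri, gap_serv_manuf)  # both equal b2_hat by construction
print(1.5 - 1.0, 1.6 - 1.5)            # the gaps in the data differ
```

Whatever the data look like, a single coefficient on the sector code can only describe one common step between adjacent codes.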