Week 6 Flashcards
How do we describe binary qualitative information?
- e.g. a person is either male or female
Can be captured by defining a binary variable
- e.g. 1 if female, 0 if male
SLR with dummy variable as regressor
wage = b0 + y0*female + u
- assuming SLR.4 holds: E[u|female] = 0
E[wage|female] = b0 + y0*female
= b0 if female = 0
= b0 + y0 if female = 1
What does the coefficient of the dummy variable mean in the OLS of SLR with dummy variable?
y0 = E[wage|female = 1] - E[wage|female = 0]
- the difference in average wage between women and men
- difference in average outcomes between the two groups
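A quick numerical check of this identity, using made-up wage numbers (illustrative only, not course data): OLS of wage on the female dummy recovers the base-group mean as the intercept and the difference in group means as the dummy coefficient.

```python
# Hypothetical data: first four people are men (female = 0), last four women.
female = [0, 0, 0, 0, 1, 1, 1, 1]
wage = [10.0, 12.0, 11.0, 13.0, 8.0, 9.0, 10.0, 9.0]

n = len(wage)
xbar = sum(female) / n
ybar = sum(wage) / n

# Simple OLS: b1 = cov(x, y) / var(x), b0 = ybar - b1*xbar.
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(female, wage)) / \
     sum((x - xbar) ** 2 for x in female)
b0 = ybar - b1 * xbar

# Group means: men (female = 0) and women (female = 1).
mean_men = sum(y for x, y in zip(female, wage) if x == 0) / 4
mean_women = sum(y for x, y in zip(female, wage) if x == 1) / 4

print(b0)  # intercept = average wage of the base group (men): 11.5
print(b1)  # dummy coefficient = women's mean minus men's mean: -2.5
```

The intercept equals the men's average wage and the coefficient equals the female-minus-male gap, which is exactly the interpretation above.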
How does the choice of the base group work?
We get the same fitted regression if we flip the base group; in the example above the base group was male, whose average wage is the intercept b0
- because male = 1 - female, the coefficient on the dummy changes sign but keeps the same magnitude
- the intercept changes because the base group is now female
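A minimal sketch (toy numbers, not course data) showing what flipping the base group does to the estimates:

```python
# Check that flipping the base group flips the dummy coefficient's sign and
# moves the other group's mean into the intercept.
female = [0, 0, 1, 1, 1]
wage = [12.0, 14.0, 9.0, 10.0, 11.0]
male = [1 - f for f in female]

def slr(x, y):
    """OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

b0_f, g_f = slr(female, wage)  # base group: men  -> intercept 13.0, slope -3.0
b0_m, g_m = slr(male, wage)    # base group: women -> intercept 10.0, slope +3.0

print(g_f, g_m)    # same magnitude, opposite sign
print(b0_f, b0_m)  # men's mean vs women's mean
```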
What happens if we put female AND male both in the equation?
It is redundant; this is the simplest case of the dummy variable trap, an example of perfect collinearity (female + male = 1 for every observation, identical to the intercept's column of ones).
Dummy variables for multiple categories
- female and married
Single male is the base group
marriedMale = married*(1 - female)
marriedFemale = married*female
singleMale = (1 - married)*(1 - female)
singleFemale = (1 - married)*female
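The constructions above can be sketched directly; the (married, female) tuples below are illustrative people, not data from the course:

```python
# One person per combination of the two base dummies.
people = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (married, female)

rows = []
for married, female in people:
    marriedMale = married * (1 - female)
    marriedFemale = married * female
    singleMale = (1 - married) * (1 - female)
    singleFemale = (1 - married) * female
    rows.append((marriedMale, marriedFemale, singleMale, singleFemale))

# Exactly one group dummy is 1 for each person, so the four dummies always
# sum to 1 -- including all four plus an intercept would be perfect collinearity.
for row in rows:
    print(row, sum(row))
```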
If we have, let's say, 4 groups as before, how many dummies go in the regression?
Only 3; the base group is not included
- the base group is captured by the intercept
What's the point of interaction terms among dummy variables?
Used to model conditional effects, i.e. the effect of one variable depending on another, e.g. the effect of being married on wages can differ by gender.
Chow test is for what?
To test whether two groups have the same regression functions
How to compute the chow test statistic?
- Pool the data and estimate a single regression; this is the restricted model, and it produces the restricted SSR, call it SSRp (the pooled SSR)
- Split the sample into the two groups and estimate the regression on each subsample; the unrestricted SSR is the sum of the two subsample SSRs, SSRur = SSR1 + SSR2
Fchow = ([SSRp - SSRur] / (k+1)) / (SSRur / (n - 2(k+1)))
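A minimal sketch of computing the Chow statistic for the simple-regression case (k = 1), with made-up data for the two groups; the helper name ssr_slr is my own, not from the course:

```python
def ssr_slr(x, y):
    """SSR from OLS of y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

x1, y1 = [1, 2, 3, 4], [2.0, 4.1, 5.9, 8.2]  # group 1 (steep slope)
x2, y2 = [1, 2, 3, 4], [1.0, 1.4, 2.1, 2.4]  # group 2 (flat slope)

k = 1
n = len(x1) + len(x2)
ssr_p = ssr_slr(x1 + x2, y1 + y2)            # pooled (restricted) SSR
ssr_ur = ssr_slr(x1, y1) + ssr_slr(x2, y2)   # unrestricted SSR = SSR1 + SSR2

f_chow = ((ssr_p - ssr_ur) / (k + 1)) / (ssr_ur / (n - 2 * (k + 1)))
print(f_chow)  # compare with the F(k+1, n-2(k+1)) critical value
```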
What is the linear probability model (LPM)?
LPM is a special case of regression analysis where the dependent variable y is binary:
- y = 1 if a young man is arrested for a crime, y = 0 if otherwise.
How do we interpret the LPM
y = b0 + b1x1 + b2x2 + … + bkxk + u
when y is binary?
E[y|x] = Pr(y=1|x)*1 + Pr(y=0|x)*0
= Pr(y=1|x), called the response probability
So if b1 were 0.035, one more unit of x1 would increase the probability that y = 1 by 0.035, i.e. 3.5 percentage points
Reformulated linear probability model once expected value calculations are done:
Pr(y=1|x) = b0 + b1x1 … + bkxk
Therefore, as said earlier, bj is the change in the estimated probability that y = 1 from one more unit of xj, other factors held fixed
First shortcoming of the LPM
1 - the fitted values from an OLS regression are not guaranteed to lie between 0 and 1, yet these fitted values are estimated probabilities; i.e. a fitted value can fall outside [0, 1], which invalidates that value as a probability but does not invalidate the LPM for estimating partial effects
Issue with LPM’s partial effects
The LPM assumes partial effects are constant over the whole range of the explanatory variables, but for the estimated model to truly represent a probability the effects must eventually diminish, e.g. each extra year of education should raise the probability by less and less as it approaches 1.
Second shortcoming - heteroskedasticity
Because y is binary it follows a Bernoulli distribution, so its variance is p(1 - p), where p = Pr(y=1|x) = p(x)
- var(y|x) = p(x)(1 - p(x))
- since u = y - p(x), var(u|x) = var(y|x) = p(x)(1 - p(x))
Therefore the variance of u is a function of x, implying heteroskedasticity
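A short numeric illustration of how the error variance moves with x under the LPM; the coefficient values below are made up:

```python
# Under the LPM, var(u|x) = p(x)(1 - p(x)) with p(x) = b0 + b1*x, so the
# conditional error variance changes with x -- heteroskedasticity.
b0, b1 = 0.2, 0.05  # hypothetical LPM coefficients

rows = []
for x in [0, 2, 4, 6]:
    p = b0 + b1 * x          # response probability at this x
    rows.append((x, p, p * (1 - p)))

# The variance rises toward its maximum of 0.25 as p approaches 0.5.
for x, p, var_u in rows:
    print(x, p, var_u)
```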
What happens now that the LPM violates MLR.5?
The usual t statistics rely on correct standard errors; with incorrect SEs, the t stats no longer follow their usual (asymptotically standard normal) distribution
- can't trust the p-values or CIs derived from these stats either
Goodness of fit in LPMs
- can still use R-squared and adjusted R-squared, but they are hard to interpret when y is binary
Use Percent Correctly Predicted
- let yi^ be the OLS fitted value - a probability estimate
- convert yi^ into a binary prediction yi_: yi_ = 1 if yi^ >= 0.5, and 0 otherwise, for example
- yi^ can take any real value, while yi_ is strictly binary, matching the structure of y
This assesses the classification accuracy of the LPMs
Four possible cases in the percent correctly predicted model:
(yi, yi_) = (1, 1): correct prediction
(yi, yi_) = (0, 0): correct prediction
(yi, yi_) = (1, 0): incorrect prediction
(yi, yi_) = (0, 1): incorrect prediction
Then compute the accuracy rate: the fraction of observations correctly predicted
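The whole procedure in a few lines, with hypothetical fitted probabilities yhat and outcomes y (not from any real model):

```python
# Percent correctly predicted for a fitted LPM.
y    = [1, 0, 1, 1, 0, 0, 1, 0]                  # actual binary outcomes
yhat = [0.8, 0.3, 0.6, 0.4, 0.2, 0.7, 0.9, 0.1]  # OLS fitted probabilities

# Convert each fitted probability into a binary prediction at the 0.5 cutoff.
ytilde = [1 if p >= 0.5 else 0 for p in yhat]

# Accuracy rate: fraction of observations where the prediction matches y.
correct = sum(1 for yi, yt in zip(y, ytilde) if yi == yt)
accuracy = correct / len(y)

print(ytilde)    # [1, 0, 1, 0, 0, 1, 1, 0]
print(accuracy)  # 0.75 -> 75% correctly predicted
```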