EXAM 2 Flashcards
How do we draw a line through data to estimate the population slope?
Ordinary Least Squares
What is the OLS Estimator
The OLS estimator minimizes the average squared difference b/w the actual values of Y and the predictions based on the estimated regression line
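A minimal sketch of this in code (the numbers are made up for illustration); the closed-form formulas below are the intercept and slope values that minimize the sum of squared residuals:

```python
import numpy as np

# Hypothetical data (illustration only)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.9, 4.4, 5.8, 8.4, 9.6])

# Closed-form OLS solution: these values minimize the sum of squared residuals
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

y_hat = beta0_hat + beta1_hat * x   # predictions based on the estimated line
u_hat = y - y_hat                   # OLS residuals: actual minus predicted

print(beta0_hat, beta1_hat)
```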
Regression R^2
1. R^2=0 means ESS=0
2. R^2=1 means ESS=TSS
3. 0<=R^2<=1
- for regression w/ a single X, R^2 is equal to the square of the correlation coefficient b/w X&Y
Measures the fraction of the variance of Y that is explained by X
1. is unitless
2. between 0 (no fit) and 1 (perfect fit)
Standard Error of the Regression
Measures the magnitude of a typical regression residual in the units of Y
- measures the spread of the distribution of u
- has the units of u, which are the units of Y, and measures the average size of the OLS residuals
TSS=ESS+SSR(sum of squared residuals)
R^2=ESS/TSS (explained sum of squares/total sum of squares)
R^2=1-SSR/TSS
Root mean squared error (RMSE ) is closely related to the SER
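A small sketch (hypothetical data, numpy assumed) tying TSS, ESS, and SSR to R^2, SER, and RMSE; the n - 2 degrees-of-freedom correction in the SER is for the single-regressor case:

```python
import numpy as np

# Hypothetical data (illustration only)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.9, 4.4, 5.8, 8.4, 9.6])
n = len(y)

beta1, beta0 = np.polyfit(x, y, 1)        # OLS fit of y on x
y_hat = beta0 + beta1 * x

ssr = np.sum((y - y_hat) ** 2)            # sum of squared residuals
ess = np.sum((y_hat - y.mean()) ** 2)     # explained sum of squares
tss = np.sum((y - y.mean()) ** 2)         # total sum of squares; TSS = ESS + SSR

r2   = ess / tss                          # equals 1 - SSR/TSS
ser  = np.sqrt(ssr / (n - 2))             # SER: typical residual size, in units of Y
rmse = np.sqrt(ssr / n)                   # RMSE: closely related, no df correction

print(r2, ser, rmse)
```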
The Least Squares Assumptions
1. The conditional distribution of ui, given Xi, has a mean of zero
2. (Xi,Yi), i=1,2,3,…n are iid
3. Large outliers in X and/or Y are rare
- The conditional distribution of ui, given Xi, has a mean of zero
Failure of this leads to omitted variable bias
- means that if an omitted variable is correlated with the included regressor X, the condition fails and there is OV bias
Is equivalent to assuming that the population regression line is the conditional mean of Yi given Xi
- Because X is assigned randomly, all other individual characteristics–the things that make up u– are distributed independently of X , so u and X are independent.
- Thus, in an ideal randomized controlled experiment, E [ui |Xi ] = 0
In actual experiments, or with observational data, we will need to think hard about whether E [ui |Xi] = 0 holds.
- (Xi,Yi), i=1,2,3,…n are iid
Holds automatically under simple random sampling from the same population
- entities are selected at random, so the values (Xi, Yi) are independently and identically distributed
We can expect to encounter non-i.i.d. data when information is
recorded over time for the same entity (panel data and time series
data)
- Large Outliers are rare
assuming that Xi and Yi have nonzero finite fourth moments, i.e., 0 < E[Xi^4] < ∞ and 0 < E[Yi^4] < ∞; in other words, the distributions of Xi and Yi have finite kurtosis
- outliers are often data glitches (coding or recording problems); sometimes they are observations that really shouldn't be in your data set
R^2
Measures the fraction of the variance of Y that is explained by X
SER- standard error of the regression
Measures the magnitude of a typical regression residual in the units of Y.
t = (estimator - hypothesized value) / (SE of estimator)
e.g., for the sample mean: t = (sample average of Y - hypothesized mean of Y) / (sY/sqrt(n))
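A minimal sketch of the t-statistic for a hypothesized population mean (the sample and the hypothesized value are made up for illustration):

```python
import numpy as np

# Hypothetical sample of Y and hypothesized mean under the null (illustration only)
y = np.array([12.1, 9.8, 11.3, 10.4, 12.6, 9.1, 10.9, 11.7])
mu_0 = 10.0

n = len(y)
y_bar = y.mean()              # sample average of Y (the estimator)
s_y = y.std(ddof=1)           # sample standard deviation

se = s_y / np.sqrt(n)         # SE of the estimator
t = (y_bar - mu_0) / se       # t = (estimator - hypothesized value) / SE

print(t)
```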
95% Confidence Interval
- the set of points that cannot be rejected at the 5% significance level;
- a set-valued function of the data (an interval that is a function of the
data) that contains the true parameter value 95% of the time in
repeated samples.
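A sketch of the 95% confidence interval as estimate ± 1.96 × SE (1.96 is the large-sample 5% critical value); the estimate and SE here are hypothetical:

```python
# Hypothetical coefficient estimate and its standard error (illustration only)
beta1_hat = 2.28
se_beta1 = 0.52

# 95% CI: the set of values that cannot be rejected at the 5% significance level
ci_lower = beta1_hat - 1.96 * se_beta1
ci_upper = beta1_hat + 1.96 * se_beta1

print((ci_lower, ci_upper))
```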
Homoskedastic
If variance of the conditional distribution of u given X doesn’t depend on X
Var(ui | Xi = x) is constant (does not change with x)
Heteroskedastic
If variance of the conditional distribution of u given X does depend on X
V [ui |Xi = x] changes with x
Homoskedasticity-only SEs are valid only if the errors are homoskedastic
- the homoskedasticity-only and heteroskedasticity-robust formulas differ, so they give different SEs
- in practice, the usual choice is heteroskedasticity-robust SEs, because they're valid whether or not the errors are heteroskedastic (see the sketch after this card)
Advantage of homoskedasticity-only SEs
- the formula is simpler
Disadvantage:
- the formula is correct only if the errors are homoskedastic
Homoskedasticity-only SEs
- the default setting in most regression software
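A sketch of the practical difference, assuming the statsmodels package; the plain fit() uses homoskedasticity-only SEs (the software default), while cov_type="HC1" requests heteroskedasticity-robust SEs:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data where the error variance grows with x (illustration only)
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
u = rng.normal(0, 0.5 + 0.3 * x)          # Var(u | X = x) depends on x: heteroskedastic
y = 1.0 + 2.0 * x + u

X = sm.add_constant(x)
fit_default = sm.OLS(y, X).fit()                 # homoskedasticity-only SEs
fit_robust  = sm.OLS(y, X).fit(cov_type="HC1")   # heteroskedasticity-robust SEs

print(fit_default.bse)   # valid only if the errors are homoskedastic
print(fit_robust.bse)    # valid whether or not the errors are heteroskedastic
```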
Population vs Sample Parameter
A parameter is a measure that describes the whole population. A statistic is a measure that describes the sample.
Slope in population regression line
Expected effect on Y of a unit change in X
Regression Error
consists of omitted factors that affect Y, plus error in the measurement of Y
Omitted Variable Bias
Bias in the OLS estimator that arises when an omitted variable is both (1) a determinant of Y and (2) correlated with the included regressor X
Causality
effect measured in ideal randomized controlled experiment
IDEAL
RANDOMIZED
CONTROLLED
EXPERIMENT
Ideal: subjects all follow the treatment protocol –perfect compliance,
no errors in reporting, etc.
Randomized: subjects from the population of interest are randomly
assigned to a treatment or control group (so there are no confounding
factors).
Controlled: permits measuring the differential
effect of the treatment.
Experiment: the treatment is assigned as part of the experiment: the
subjects have no choice, so there is no “reverse causality” in which
subjects choose the treatment they think will work best.
Three ways to overcome omitted variable bias
- Run a randomized controlled experiment in which the regressor of interest (e.g., STR) is randomly assigned
- Adopt the cross-tabulation approach: data tables that present results for the entire group of respondents as well as for subgroups, so the omitted factor is held approximately constant within each cell (see the sketch after this list)
- Use a regression in which the omitted variable is no longer omitted
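A sketch of the cross-tabulation idea using pandas (the data frame and column names are hypothetical): average Y is compared across groups of X within each subgroup of the would-be omitted factor W, so W is held roughly constant within each cell:

```python
import pandas as pd

# Hypothetical data; column names are made up for illustration
df = pd.DataFrame({
    "Y":       [650, 660, 640, 630, 655, 645, 635, 625],
    "X_group": ["low", "low", "high", "high", "low", "low", "high", "high"],
    "W_group": ["A", "B", "A", "B", "A", "B", "A", "B"],
})

# Mean of Y in each (X group, W group) cell: comparing across X within a W
# column holds the omitted factor approximately constant
table = df.pivot_table(values="Y", index="X_group", columns="W_group", aggfunc="mean")
print(table)
```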
Adjusted R^2
Penalizes you for including another regressor
- this is because R^2 always increases when adding another regressor
———————-
adjusted R^2 < R^2
the two values are close when n is large
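A sketch of the adjusted R^2 formula with hypothetical numbers (k is the number of regressors, excluding the intercept):

```python
# Hypothetical values (illustration only)
r2 = 0.42    # ordinary R^2
n = 420      # sample size
k = 3        # number of regressors

# Adjusted R^2 penalizes adding regressors; it is always below R^2,
# and the gap shrinks as n gets large
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(adj_r2)
```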
- There is no perfect multicollinearity
- Perfect multicollinearity is when one of the regressors is an exact linear function of the other regressors
- e.g., accidentally including the same variable twice
Imperfect multicollinearity occurs when two or more regressors are very
highly correlated.
- If two regressors are very highly correlated, their scatterplot will look almost like a straight line (they are "collinear"), but unless the correlation is exactly 1 or −1, that collinearity is imperfect.
Imperfect multicollinearity implies that one or more of the regression
coefficients will be imprecisely estimated
imperfect multicollinearity
- results in large standard errors
for one or more of the OLS coefficients
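A sketch of imperfect multicollinearity in action, assuming numpy and statsmodels; x2 is simulated to be almost (but not exactly) a copy of x1, so the coefficients on x1 and x2 get large standard errors:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.05, n)       # very highly correlated with x1, but not exactly
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(0, 1, n)

# With both (nearly collinear) regressors, the individual coefficients are
# imprecisely estimated, so their SEs are large ...
fit_both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit_both.bse)

# ... compared with a regression on x1 alone
fit_one = sm.OLS(y, sm.add_constant(x1)).fit()
print(fit_one.bse)
```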
p value > alpha
FAIL TO REJECT NULL HYPOTHESIS
pvalue<alpha
REJECT NULL HYPOTHESIS
Joint Hypothesis
specifies a value for 2 or more coefficients; imposes a restriction on 2+ coefficients
How to test a joint hypothesis
F-STATISTIC
F-statistic
- large when t1 and/or t2 is large
- when n is large and t1 and t2 are uncorrelated, F = 0.5(t1^2 + t2^2)
Chi Squared Distribution
the distribution of the sum of q squared independent standard normal random variables; q is its degrees of freedom. In large samples, q × F is distributed chi-squared with q degrees of freedom
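A sketch tying the F-statistic and chi-squared cards together, assuming scipy; the simple formula applies when the two t-statistics are uncorrelated, and in large samples q × F follows a chi-squared distribution with q degrees of freedom:

```python
from scipy.stats import chi2

# Hypothetical t-statistics for the two restrictions in a joint hypothesis
t1, t2 = 2.1, 1.6
q = 2                              # number of restrictions

F = 0.5 * (t1**2 + t2**2)          # valid when t1 and t2 are uncorrelated
p_value = chi2.sf(q * F, df=q)     # large-sample p-value: q*F ~ chi-squared with q df

print(F, p_value)
```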
Control Variable W
A variable that is correlated with, and controls for, an omitted causal factor in the regression of Y on X, but that does not itself necessarily have a causal effect on Y
Three interchangeable statements about effective control variable
1. when included in the regression, it makes the error term uncorrelated with the variable of interest
2. holding constant the control variable(s), the variable of interest is "as if" randomly assigned
3. given the control variable(s), the variable of interest is uncorrelated with the omitted determinants of Y
When control variables are included, LSA #1, E[ui | X1,i, ..., XK,i] = 0, need not hold; instead we assume conditional mean independence (next card)
Conditional mean independence
Given the control variable, the mean of ui doesn't depend on the variable of interest
E [ui |Xi , Wi ] = E [ui |Wi ]