EXAM 2 Flashcards
How do we draw a line through data to estimate the population slope?
Ordinary Least Squares
What is the OLS Estimator
The OLS estimator minimizes the average squared difference b/w the actual values of Y and the predictions based on the estimated line
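For a single regressor, the minimization has a well-known closed-form solution. A minimal sketch (the data below are made up for illustration):

```python
# OLS with one regressor: the slope and intercept that minimize the
# sum (equivalently, the average) of squared prediction errors.

def ols_fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # data lie exactly on y = 2x
b0, b1 = ols_fit(x, y)
print(b0, b1)          # 0.0 2.0
```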
Regression R^2
1. R^2=0 means ESS=0
2. R^2=1 means ESS=TSS
3. 0<=R^2<=1
- for regression w/ a single X, R^2 is equal to the square of the correlation coefficient b/w X&Y
Measures the fraction of the variance of Y that is explained by X
1. is unitless
2. between 0 (no fit) and 1 (perfect fit)
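The single-X property above can be checked numerically. A sketch with illustrative data, computing R^2 = ESS/TSS and comparing it to the squared sample correlation:

```python
# For a single-regressor OLS fit, R^2 equals the squared sample
# correlation coefficient between X and Y.
import math

x = [1, 2, 3, 4, 5]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)   # this is TSS

b1 = sxy / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

ess = sum((yh - ybar) ** 2 for yh in yhat)
r2 = ess / syy
corr = sxy / math.sqrt(sxx * syy)         # sample correlation
```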
Standard Error of the Regression
Measures the magnitude of a typical regression residual in the units of Y
- measures the spread of the distribution of u
- has the units of u, which are the units of Y, and measures the average size of the OLS residual
TSS=ESS+SSR(sum of squared residuals)
R^2=ESS/TSS (explained sum of squares/total sum of squares)
R^2=1-SSR/TSS
Root mean squared error (RMSE) is closely related to the SER
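These identities can be verified on a small fit. A sketch with illustrative data, checking TSS = ESS + SSR and computing R^2 and the SER:

```python
# Verify the decomposition TSS = ESS + SSR for an OLS fit with an
# intercept, then compute R^2 = 1 - SSR/TSS and SER = sqrt(SSR/(n-2)).
import math

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

tss = sum((yi - ybar) ** 2 for yi in y)
ess = sum((yh - ybar) ** 2 for yh in yhat)
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))

r2 = 1 - ssr / tss
ser = math.sqrt(ssr / (n - 2))   # n-2: two estimated coefficients
```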
The Least Squares Assumptions
1. The conditional distribution of ui, given Xi, has a mean of zero: E[ui|Xi] = 0
2. (Xi,Yi), i=1,2,3,…n are iid
3. Large outliers in X and/or Y are rare
- The conditional distribution of ui, given Xi, has a mean of zero
Failure of this leads to omitted variable bias
- means that if an omitted variable is correlated with the regressor, the condition fails and there is OV bias
Is equivalent to assuming that the population regression line is the conditional mean of Yi given Xi
- Because X is assigned randomly, all other individual characteristics (the things that make up u) are distributed independently of X, so u and X are independent.
- Thus, in an ideal randomized controlled experiment, E [ui |Xi ] = 0
In actual experiments, or with observational data, we will need to think hard about whether E [ui |Xi] = 0 holds.
- (Xi,Yi), i=1,2,3,…n are iid
Assumption holds automatically if the data are collected by simple random sampling from the same population
- because entities are selected at random, the values are independently distributed
We can expect to encounter non-i.i.d. data when information is recorded over time for the same entity (panel data and time series data)
- Large Outliers are rare
assuming that Xi and Yi have nonzero finite fourth moments, i.e., 0 < E[Xi^4] < ∞ and 0 < E[Yi^4] < ∞; in other words, the distributions of Xi and Yi have finite kurtosis
- outliers are often data glitches (coding or recording problems). Sometimes they are observations that really shouldn't be in your data set
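A quick sketch of why this assumption matters: one badly recorded observation (the numbers are made up) can move the OLS slope substantially.

```python
# A single large outlier (e.g., a coding glitch) pulls the OLS slope
# far from the value fit on the clean data.

def ols_slope(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))

x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]                # slope is exactly 1 without the outlier
slope_clean = ols_slope(x, y)

x_out, y_out = x + [6], y + [30]   # one badly recorded observation
slope_outlier = ols_slope(x_out, y_out)
print(slope_clean, slope_outlier)
```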
R^2
This measures the fraction of the variance of Y that is explained by X
SER- standard error of the regression
Measures the magnitude of a typical regression residual in the units of Y.
t = (estimator - hypothesized value) / (SE of estimator)
t = (sample average of Y - hypothesized mean) / (sY/sqrt(n))
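A sketch of this t-statistic for a sample mean; the data and the hypothesized mean below are illustrative:

```python
# t-statistic for H0: E[Y] = mu0, using the sample mean and the
# sample standard deviation sY (with the n-1 divisor).
import math

y = [4.0, 5.5, 6.1, 5.2, 4.8, 5.9, 5.0, 4.5]
mu0 = 5.0   # hypothesized mean (illustrative)

n = len(y)
ybar = sum(y) / n
s_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
t = (ybar - mu0) / (s_y / math.sqrt(n))
```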
95% Confidence Interval
- the set of points that cannot be rejected at the 5% significance level;
- a set-valued function of the data (an interval that is a function of the
data) that contains the true parameter value 95% of the time in
repeated samples.
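A sketch of the large-sample 95% interval for a mean, using the standard normal critical value 1.96; the data are illustrative:

```python
# Large-sample 95% confidence interval for a population mean:
# ybar +/- 1.96 * SE(ybar), where SE(ybar) = sY / sqrt(n).
import math

y = [4.0, 5.5, 6.1, 5.2, 4.8, 5.9, 5.0, 4.5]

n = len(y)
ybar = sum(y) / n
s_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
se = s_y / math.sqrt(n)
ci_low, ci_high = ybar - 1.96 * se, ybar + 1.96 * se
```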
Homoskedastic
If the variance of the conditional distribution of u given X doesn't depend on X: var(u|X) = σu^2, a constant
- E[u|X] = 0 is the separate least squares assumption and holds either way