Session 2: Bivariate Regression: Review of Ordinary Least Squares, Multiple Regression Flashcards
OLS estimator
The OLS estimator minimizes the average squared difference between the actual values of Yi and the prediction (predicted value) based on the estimated line.
The standard error of B-hat1 measures the spread of the sampling distribution of B-hat1.
OLS estimator chooses the regression coefficients so that the estimated regression line is as close as possible to the observed data, where closeness is measured by the sum of the squared mistakes made in predicting Y given X.
Ordinary Least Squares
the slope and intercept of the line relating X and Y can be estimated by a method called Ordinary Least Squares
Yi = Bo + B1Xi + ui
Yi = (independent/dependent) variable
Xi = (independent/dependent) variable
Bo + B1x =
Bo = (intercept/slope) of pop regress line
B1 = (intercept/slope) of pop regress line
ui =
Yi = dependent variable, regressand, left hand variable
Xi = independent variable, regressor, right hand variable
Bo + B1x = pop regression line or (PRF) population regression function, this is the relationship that holds between Y and X over a population
Bo = intercept of pop regress line
B1 = slope of pop regress line
ui = error term
error term
Incorporates all of the factors responsible for the difference between the ith district's average test score and the value predicted by the pop regress line. It contains all other factors besides X that determine the value of the dependent variable Y for a specific observation i.
OLS Regression Line
aka sample regression line, sample regression function
is the straight line constructed using the OLS estimates: B-hat0 + B-hat1*X.
The residual for the ith observation is the difference between Yi and its predicted value: Yi - Y-hati
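A minimal sketch of how the OLS estimates, predicted values, and residuals could be computed by hand. The data are made up for illustration, and the helper name ols_fit is hypothetical; it uses the standard closed-form solution for a single regressor.

```python
def ols_fit(x, y):
    """Return (b0_hat, b1_hat) for the line that minimizes the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sample covariance of X and Y divided by sample variance of X
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1_hat = sxy / sxx
    # Intercept: forces the line through the point (x_bar, y_bar)
    b0_hat = y_bar - b1_hat * x_bar
    return b0_hat, b1_hat


# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0_hat, b1_hat = ols_fit(x, y)
y_hat = [b0_hat + b1_hat * xi for xi in x]          # predicted values Y-hati
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # Yi - Y-hati
```

With an intercept in the model, the OLS residuals sum to zero (up to rounding), which is a quick sanity check on any hand computation.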
Test score-hat = 698.9 - 2.28*STR (student teacher ratio)
What does STR coefficient mean?
The slope of -2.28 means that an increase in the student-teacher ratio by one student per class is, on average, associated with a decline in districtwide test scores of 2.28 points on the test.
Negative slope indicates that more students per teacher (larger classes) is associated w/ poorer test performance.
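As a small worked example of reading the slope, the sketch below evaluates the estimated line at two student-teacher ratios, using the intercept 698.9 and slope -2.28 quoted in these cards. The function name and the ratios 20 and 22 are chosen only for illustration; the predicted change does not depend on the intercept.

```python
def predicted_test_score(str_ratio):
    """Predicted districtwide test score from the estimated line in these cards."""
    return 698.9 - 2.28 * str_ratio


# Raising STR by 2 students per teacher changes the prediction by 2 * (-2.28)
change = predicted_test_score(22) - predicted_test_score(20)
```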
R^2 and Standard Error measure:
R^2 and Standard Error: measure how well OLS regression line fits the data.
R^2 ranges between ___ and ___ and measures:
SE of the regression measures:
R^2 ranges between 0 and 1 and measures: the fraction of the variance of Yi that is explained by Xi.
SE of the regression measures: how far Yi typically is from its predicted value.
regression R^2
is the fraction of the sample variance of Yi explained (or predicted) by Xi.
R^2
= ESS / TSS
= Explained Sum of Squares / Total Sum of Squares
= sum of squared deviations of the predicted values of Yi, Y-hati from their avg / sum of squared deviations of Yi from its average
OR can also be written as 1 minus the fraction of the variance of Yi not explained by Xi:
R^2 = 1 - (SSR/TSS)
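The two expressions for R^2 can be checked against each other numerically. This is a sketch on made-up values; y_hat is assumed to come from an OLS fit with an intercept, which is what makes ESS + SSR = TSS hold.

```python
def sums_of_squares(y, y_hat):
    """Return (ESS, TSS, SSR) for actual values y and OLS predicted values y_hat."""
    y_bar = sum(y) / len(y)
    ess = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained sum of squares
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # sum of squared residuals
    return ess, tss, ssr


# Made-up data; y_hat is the OLS fit 2.2 + 0.6*x for x = 1..5
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

ess, tss, ssr = sums_of_squares(y, y_hat)
r2_from_ess = ess / tss
r2_from_ssr = 1 - ssr / tss
```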
ESS
TSS
SSR
SER
ESS Explained Sum of Squares
TSS Total Sum of Squares
SSR Sum of Squared Residuals
SER Standard Error of the Regression
Standard Error of the Regression
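An estimator of the standard deviation of the regression error ui. It measures how far Yi typically is from its predicted value, in the units of Y.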
R^2 of 0.051 means that….
the regressor student-teacher ratio explains 5.1% of the variance of the dependent variable testscore
SER of 18.6 means that
There is a large spread of the scatterplot around the regression line, as measured in points on the test. This means that predictions of test scores using only the STR variable will often be wrong by a large amount.
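A sketch of how the SER could be computed from the residuals, assuming the usual textbook definition for a single regressor, SER = sqrt(SSR / (n - 2)). The n - 2 degrees-of-freedom correction is an assumption not stated in these cards, and the data are made up.

```python
import math


def standard_error_of_regression(y, y_hat):
    """SER = sqrt(SSR / (n - 2)) for a regression with one regressor and an intercept."""
    n = len(y)
    ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return math.sqrt(ssr / (n - 2))


# Made-up data; y_hat is the OLS fit 2.2 + 0.6*x for x = 1..5
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

ser = standard_error_of_regression(y, y_hat)
```

The SER is in the same units as Y, so an SER of 18.6 in the test score example is read as 18.6 points on the test.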
t =
t = (estimator - hypothesized value) / SE of the estimator
test of the Ho against 2-sided alternative: steps
- compute the SE of Y-bar
- compute the t-statistic t = (Y-bar - mean Y,0) / SE(Y-bar)
- compute the p-value, which is the smallest significance level at which the Ho could be rejected, based on the observed t. ALSO, probability of obtaining a statistic, by random sampling variation, at least as different from the Ho value as is the statistic actually observed, assuming Ho is correct
p value
the smallest significance level at which the Ho could be rejected, based on the observed t.
ALSO, probability of obtaining a statistic, by random sampling variation, at least as different from the Ho value as is the statistic actually observed, assuming Ho is correct
at 5% significance level, reject the Ho if…
at 5% significance level, reject the Ho if… |t-actual| is GREATER than 1.96.
If we can reject the Ho, then we can say:
Population mean is said to be statistically significantly different from the hypothesized value at the 5% significance level.
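The steps for testing a hypothesis about a population mean can be sketched as follows. The sample and the hypothesized mean mu_0 = 10 are made up, and the p-value uses the large-sample normal approximation, so with only 8 observations this is an illustration of the mechanics rather than a reliable test.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def two_sided_mean_test(sample, mu_0):
    """Return (t, p) for Ho: mean = mu_0 against a two-sided alternative."""
    n = len(sample)
    y_bar = sum(sample) / n
    s2 = sum((yi - y_bar) ** 2 for yi in sample) / (n - 1)  # sample variance
    se = math.sqrt(s2 / n)                                  # SE of Y-bar
    t = (y_bar - mu_0) / se
    p = 2 * normal_cdf(-abs(t))                             # two-sided p-value
    return t, p


# Made-up sample, for illustration only
sample = [10, 12, 11, 13, 9, 12, 11, 10]
t_stat, p_value = two_sided_mean_test(sample, mu_0=10)
reject_at_5_percent = abs(t_stat) > 1.96
```

Rejecting when |t| > 1.96 and rejecting when the p-value is below .05 are the same rule under the normal approximation.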
testing Ho about the slope B1
Ho: B1 = B1,0 vs H1: B1 ≠ B1,0
- compute the SE of B-hat1
- compute the t-statistic
- compute the p-value. Reject the hypothesis at the 5% significance level if the p-value is less than .05 or, equivalently, if |t-actual| is greater than 1.96
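The three steps above can be sketched in code. Note the assumption: this sketch uses the simpler homoskedasticity-only formula for SE(B-hat1), namely sqrt((SSR / (n - 2)) / sum of squared deviations of X); a course that uses heteroskedasticity-robust standard errors would get a different SE. The data and function names are made up for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def slope_t_test(x, y, b1_null=0.0):
    """Return (b1_hat, se_b1, t, p) using the homoskedasticity-only SE (an assumption)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0_hat = y_bar - b1_hat * x_bar
    ssr = sum((yi - b0_hat - b1_hat * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt((ssr / (n - 2)) / sxx)   # step 1: SE of B-hat1
    t = (b1_hat - b1_null) / se_b1             # step 2: t-statistic
    p = 2 * normal_cdf(-abs(t))                # step 3: two-sided p-value
    return b1_hat, se_b1, t, p


# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b1_hat, se_b1, t_stat, p_value = slope_t_test(x, y)
```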
the only difference between a one-sided and two-sided hypothesis test is
the only difference between a one-sided and two-sided hypothesis test is… how you interpret the t statistic
95% CI for B1 definitions
(1) It is the set of values that cannot be rejected using a two-sided hypothesis test with a 5% significance level.
(2) It is an interval that has a 95% probability of containing the true value of B1 (in 95% of possible samples that may be drawn, the CI will contain the true value of B1).
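Definition (1) can be illustrated numerically: the 95% interval is the estimate plus or minus 1.96 standard errors, and any hypothesized value inside it gives |t| of at most 1.96. The estimate 2.0 and standard error 0.5 below are hypothetical inputs, not results from these cards.

```python
def confidence_interval_95(estimate, se):
    """Large-sample 95% confidence interval: estimate +/- 1.96 * SE."""
    return estimate - 1.96 * se, estimate + 1.96 * se


def t_statistic(estimate, hypothesized, se):
    return (estimate - hypothesized) / se


# Hypothetical estimate and standard error, for illustration only
lower, upper = confidence_interval_95(2.0, 0.5)

# A value inside the interval is not rejected; a value outside it is rejected
inside_rejected = abs(t_statistic(2.0, 1.5, 0.5)) > 1.96
outside_rejected = abs(t_statistic(2.0, 0.5, 0.5)) > 1.96
```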
Observational studies: advantages and disadvantages
Key advantage: Reflects the real world utilization rather than artificial designs
Key disadvantage: You must address validity concerns:
In this case, you must address the question of why are some children in small classes and others in large classes?
Interpretation of the estimated slope and intercept
Test score-hat = 698.9 - 2.28*STR
How should we interpret the estimated intercept 698.9?
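Slope: Districts with one more student per teacher on average have test scores that are 2.28 points lower.
Intercept: Taken literally, 698.9 is the predicted test score for a district with STR = 0 (no students per teacher). This extrapolates outside the range of the data, so the intercept has no meaningful real-world interpretation here.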
The error term consists of
omitted factors, or possibly measurement error in the measurement of Y or prediction limitations.
t statistic formula
t = (estimator - hypothesized value) / SE of the estimator
t statistic hypothesis testing
Reject at 5% significance level if |t| > 1.96
typically n = 30 is large enough for the approximation to be excellent.
The F-test evaluates the null hypothesis that _____________ are equal to zero versus the alternative that __________.
Rejection of this hypothesis indicates that _____ of the regression slopes is ___________.
The F-test evaluates the null hypothesis that all regression slope coefficients are equal to zero versus the alternative that at least one does not.
Rejection of this hypothesis indicates that at least one of the regression slopes is non-zero
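For reference, the overall F-statistic can be written in terms of R^2. This sketch assumes the homoskedasticity-only form, F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), where k is the number of slope coefficients; that formula and the input values are assumptions for illustration, not taken from these cards.

```python
def overall_f_statistic(r2, n, k):
    """Homoskedasticity-only F-statistic for Ho: all k slope coefficients are zero."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Hypothetical inputs: R^2 = 0.6 from a regression with n = 5 and one regressor
f_stat = overall_f_statistic(r2=0.6, n=5, k=1)
```

With a single regressor (k = 1), this F-statistic equals the square of the homoskedasticity-only t-statistic on the slope.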
R2 is the ratio of _______ to the _____ of the dependent variable y.
R2 ranges from ________.
Easiest to think of this as the _______ explained by the model
R2 is the ratio of “explained” variance to the “total” variance of the dependent variable y
R2 ranges from 0 to 1.
Easiest to think of this as the “% of variance explained by the model”
p-value is the probability of obtaining __________, given that the _________ is true.
p-value is the probability of obtaining a test statistic at least as extreme as the one actually observed, given that the null hypothesis is true.
t test: two-sided vs. one-sided critical values
In a two-sided test, if |t| > 1.96 then reject, while in a one-sided test if t
A type I error (alpha) is:
A type I error (alpha) is the incorrect rejection of a true null hypothesis (false positive).
A type II error (beta) is:
A type II error (beta) is when you fail to reject a false null hypothesis (false negative).
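A small simulation can make the type I error rate concrete: when the null hypothesis is true and we reject whenever |t| > 1.96, we should wrongly reject in roughly 5% of samples. The sample size, number of repetitions, and seed below are arbitrary choices for illustration.

```python
import math
import random


def rejects_true_null(rng, n=100, mu_0=0.0):
    """Draw one sample where Ho (mean = mu_0) is TRUE and report whether the test rejects."""
    sample = [rng.gauss(mu_0, 1.0) for _ in range(n)]
    y_bar = sum(sample) / n
    s2 = sum((yi - y_bar) ** 2 for yi in sample) / (n - 1)
    t = (y_bar - mu_0) / math.sqrt(s2 / n)
    return abs(t) > 1.96


rng = random.Random(0)   # fixed seed so the run is reproducible
reps = 2000
# Share of samples in which a true null is rejected: the simulated type I error rate
type_i_rate = sum(rejects_true_null(rng) for _ in range(reps)) / reps
```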
Counterfactual
We would like to have a counterfactual, to be able to measure what would have occurred if the action had not taken place (i.e. no treatment or no program).
We do not actually observe the counterfactual, so we must design studies and use statistical techniques to mimic the outcomes of the counterfactual.