Session 2: Bivariate Regression: Review of Ordinary Least Squares, Multiple Regression Flashcards

1
Q

OLS estimator

A

The OLS estimator minimizes the average squared difference between the actual values of Yi and the predicted values based on the estimated line.

Equivalently, the OLS estimator chooses the regression coefficients so that the estimated regression line is as close as possible to the observed data, where closeness is measured by the sum of the squared mistakes made in predicting Y given X.

(Separate concept: the standard error of B-hat1 measures the spread of the sampling distribution of B-hat1.)
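
As a rough illustration of "choosing the coefficients that minimize the sum of squared mistakes", the sketch below uses the standard closed-form OLS formulas for one regressor. The data and the helper names `ols_fit` and `sum_sq_mistakes` are made up for this example; they are not from the course materials.

```python
def ols_fit(x, y):
    """Return (b0_hat, b1_hat), the closed-form OLS estimates for one regressor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1_hat = s_xy / s_xx              # slope
    b0_hat = y_bar - b1_hat * x_bar   # intercept
    return b0_hat, b1_hat


def sum_sq_mistakes(x, y, b0, b1):
    """Sum of squared prediction mistakes for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up illustrative data (an assumption, not course data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0_hat, b1_hat = ols_fit(x, y)
best = sum_sq_mistakes(x, y, b0_hat, b1_hat)
shifted = sum_sq_mistakes(x, y, b0_hat + 0.1, b1_hat)  # any other line does worse

print("OLS intercept and slope:", b0_hat, b1_hat)
print("Sum of squared mistakes at OLS estimates:", best)
print("Sum of squared mistakes for a shifted line:", shifted)
```

Moving either coefficient away from the OLS values increases the sum of squared mistakes, which is the sense in which the OLS line is "as close as possible" to the data.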

2
Q

Ordinary Least Squares

A

The slope and intercept of the line relating X and Y can be estimated by a method called Ordinary Least Squares (OLS).

3
Q

Yi = Bo + B1Xi + ui

Yi = (independent/dependent) variable
Xi = (independent/dependent) variable
Bo + B1Xi =
Bo = (intercept/slope) of pop regression line
B1 = (intercept/slope) of pop regression line
ui =

A

Yi = dependent variable, regressand, left-hand variable
Xi = independent variable, regressor, right-hand variable
Bo + B1Xi = population regression line, or population regression function (PRF); this is the relationship that holds between Y and X on average over the population
Bo = intercept of pop regression line
B1 = slope of pop regression line
ui = error term

4
Q

error term

A

Incorporates all of the factors responsible for the difference between the ith district's average test score and the value predicted by the population regression line. It contains all factors other than X that determine the value of the dependent variable Y for a specific observation i.

5
Q

OLS Regression Line

A

Also known as the sample regression line or sample regression function.

It is the straight line constructed using the OLS estimates: B-hat0 + B-hat1*X.

The predicted value of Yi is Y-hati = B-hat0 + B-hat1*Xi. The residual for the ith observation is the difference between Yi and its predicted value: u-hati = Yi - Y-hati.
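
As a small illustration of predicted values and residuals, the sketch below fits the OLS line to made-up data and computes the residual for each observation. The numbers and the helper name `ols_fit` are hypothetical, not from the course. When the regression includes an intercept, the OLS residuals sum to zero.

```python
def ols_fit(x, y):
    """Closed-form OLS estimates (b0_hat, b1_hat) for one regressor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
              / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1_hat * x_bar, b1_hat


# Made-up illustrative data (an assumption, not course data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0_hat, b1_hat = ols_fit(x, y)
y_hat = [b0_hat + b1_hat * xi for xi in x]           # predicted values Y-hat_i
residuals = [yi - yh for yi, yh in zip(y, y_hat)]    # u-hat_i = Y_i - Y-hat_i

for xi, yi, yh, ui in zip(x, y, y_hat, residuals):
    print(f"X={xi}  Y={yi}  Y-hat={yh:.2f}  residual={ui:+.2f}")
print("Sum of residuals (zero up to rounding):", sum(residuals))
```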

6
Q

Test score-hat = 698.9 - 2.28*STR (student-teacher ratio)

What does STR coefficient mean?

A

The slope of -2.28 means that an increase in the student-teacher ratio by one student per class is, on average, associated with a decline in districtwide test scores of 2.28 points on the test.

The negative slope indicates that more students per teacher (larger classes) is associated with poorer test performance.
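
The arithmetic behind this interpretation can be sketched as follows. Only the slope -2.28 comes from the card; the function name and the example changes in STR are hypothetical.

```python
SLOPE_STR = -2.28  # estimated coefficient on STR, taken from the card


def predicted_change_in_test_score(delta_str):
    """Predicted change in test score for a change of delta_str students per teacher."""
    return SLOPE_STR * delta_str


# Hypothetical changes in the student-teacher ratio
for delta in (1, 2, -2):
    change = predicted_change_in_test_score(delta)
    print(f"Change in STR of {delta:+d} -> predicted change of {change:+.2f} points")
```

The predicted change depends only on the slope, not on the intercept, and it describes an association in the data rather than a guaranteed effect for any single district.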

7
Q

R^2 and the standard error of the regression (SER) measure:

A

How well the OLS regression line fits the data.

8
Q

R^2 ranges between ___ and ___ and measures:

SE of the regression measures:

A

R^2 ranges between 0 and 1 and measures: the fraction of the variance of Yi that is explained by Xi.

SE of the regression measures: how far Yi typically is from its predicted value.

9
Q

regression R^2

A

The fraction of the sample variance of Yi explained (or predicted) by Xi.
R^2
= ESS / TSS
= Explained Sum of Squares / Total Sum of Squares
= (sum of squared deviations of the predicted values Y-hati from their average) / (sum of squared deviations of Yi from its average)

Equivalently, R^2 is 1 minus the fraction of the variance of Yi not explained by Xi:
R^2 = 1 - (SSR/TSS)
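
The two expressions for R^2 can be checked numerically. The sketch below uses made-up data and a hypothetical helper `r_squared_two_ways`; it relies on the identity TSS = ESS + SSR, which holds for OLS with an intercept.

```python
def r_squared_two_ways(x, y):
    """Return (ESS/TSS, 1 - SSR/TSS) for an OLS fit with one regressor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
              / sum((xi - x_bar) ** 2 for xi in x))
    b0_hat = y_bar - b1_hat * x_bar
    y_hat = [b0_hat + b1_hat * xi for xi in x]

    ess = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained sum of squares
    tss = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
    ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # sum of squared residuals
    return ess / tss, 1.0 - ssr / tss


# Made-up illustrative data (an assumption, not course data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r2_from_ess, r2_from_ssr = r_squared_two_ways(x, y)
print("ESS / TSS     =", r2_from_ess)
print("1 - SSR / TSS =", r2_from_ssr)
```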

10
Q

ESS
TSS
SSR
SER

A

ESS Explained Sum of Squares
TSS Total Sum of Squares
SSR Sum of Squared Residuals
SER Standard Error of the Regression

11
Q

Standard Error of the Regression

A

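The SER is an estimator of the standard deviation of the regression error ui. It measures the spread of the observations around the regression line, that is, how far Yi typically is from its predicted value, in the units of the dependent variable.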
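
For a regression with one regressor, a standard textbook definition is SER = sqrt(SSR / (n - 2)), where the n - 2 corrects for the two estimated coefficients. The sketch below applies that formula to made-up data; the helper name `standard_error_of_regression` is hypothetical.

```python
import math


def standard_error_of_regression(x, y):
    """SER = sqrt(SSR / (n - 2)) for an OLS fit with one regressor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
              / sum((xi - x_bar) ** 2 for xi in x))
    b0_hat = y_bar - b1_hat * x_bar
    ssr = sum((yi - (b0_hat + b1_hat * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(ssr / (n - 2))


# Made-up illustrative data (an assumption, not course data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

ser_value = standard_error_of_regression(x, y)
print("SER (in units of Y):", round(ser_value, 4))
```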

12
Q

R^2 of 0.051 means that….

A

The regressor (student-teacher ratio) explains 5.1% of the variance of the dependent variable (test score).

13
Q

SER of 18.6 means that

A

SER is the standard error of the regression. An SER of 18.6 means that there is a large spread of the scatterplot around the regression line, measured in points on the test. Predictions of test scores using only the STR variable will therefore often be wrong by a large amount.

14
Q

t =

A

t = (estimator - hypothesized value) / SE of the estimator

15
Q

Steps for testing Ho against a two-sided alternative

A
  1. Compute the SE of Y-bar.
  2. Compute the t-statistic: t = (Y-bar - muY,0) / SE(Y-bar).
  3. Compute the p-value, which is the smallest significance level at which Ho could be rejected, based on the observed t-statistic (tact). Equivalently, it is the probability of obtaining a statistic, by random sampling variation, at least as different from the Ho value as the statistic actually observed, assuming Ho is correct.
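
The three steps can be sketched as below, assuming the large-sample normal approximation for the t-statistic. The sample, the hypothesized mean, and the function name `two_sided_mean_test` are made up for illustration.

```python
import math


def two_sided_mean_test(sample, mu_0):
    """Return (SE of Y-bar, t-statistic, two-sided p-value) for Ho: mean = mu_0."""
    n = len(sample)
    y_bar = sum(sample) / n
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in sample) / (n - 1))
    se_y_bar = s_y / math.sqrt(n)                     # step 1
    t_act = (y_bar - mu_0) / se_y_bar                 # step 2
    # Step 3: p = 2 * (1 - Phi(|t|)), written with the complementary error function
    p_value = math.erfc(abs(t_act) / math.sqrt(2))
    return se_y_bar, t_act, p_value


# Made-up sample of 35 observations with sample mean 13
sample = [10 + (i % 7) for i in range(35)]

se_y_bar, t_act, p_value = two_sided_mean_test(sample, mu_0=12)  # hypothetical Ho
print(f"SE(Y-bar) = {se_y_bar:.3f}, t = {t_act:.2f}, p-value = {p_value:.4f}")
print("Reject Ho at the 5% level:", p_value < 0.05)
```

Rejecting when the p-value is below 0.05 gives the same decision as rejecting when |t| exceeds 1.96, because both come from the same normal approximation.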
16
Q

p value

A

The smallest significance level at which Ho could be rejected, based on the observed t-statistic (tact).

Equivalently, the probability of obtaining a statistic, by random sampling variation, at least as different from the Ho value as the statistic actually observed, assuming Ho is correct.

17
Q

at 5% significance level, reject the Ho if…

A

At the 5% significance level, reject Ho if |tact| is greater than 1.96.

If we can reject Ho, then we can say:
The population mean is statistically significantly different from the hypothesized value at the 5% significance level.

18
Q

testing Ho about the slope B1

A

Ho: B1 = B1,0 vs H1: B1 ≠ B1,0

  1. Compute the SE of B-hat1.
  2. Compute the t-statistic: t = (B-hat1 - B1,0) / SE(B-hat1).
  3. Compute the p-value. Reject Ho at the 5% significance level if the p-value is less than 0.05 or, equivalently, if |tact| is greater than 1.96.
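
A minimal sketch of these steps is shown below. It uses the homoskedasticity-only formula for SE(B-hat1), which is an assumption made here for simplicity (a course may instead use heteroskedasticity-robust standard errors), and the normal approximation for the p-value. The data and the function name `slope_t_test` are made up, and a sample this small only illustrates the mechanics.

```python
import math


def slope_t_test(x, y, b1_null=0.0):
    """Return (b1_hat, SE(b1_hat), t, p-value) for Ho: B1 = b1_null."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0_hat = y_bar - b1_hat * x_bar
    ssr = sum((yi - (b0_hat + b1_hat * xi)) ** 2 for xi, yi in zip(x, y))

    # Step 1: homoskedasticity-only standard error (an assumption)
    se_b1 = math.sqrt((ssr / (n - 2)) / s_xx)
    # Step 2: t-statistic
    t_act = (b1_hat - b1_null) / se_b1
    # Step 3: two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(t_act) / math.sqrt(2))
    return b1_hat, se_b1, t_act, p_value


# Made-up illustrative data (an assumption, not course data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1_hat, se_b1, t_act, p_value = slope_t_test(x, y)
print(f"B-hat1 = {b1_hat:.2f}, SE = {se_b1:.3f}, t = {t_act:.2f}, p = {p_value:.3f}")
print("Reject Ho: B1 = 0 at the 5% level:", abs(t_act) > 1.96)
```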
19
Q

The only difference between a one-sided and a two-sided hypothesis test is

A

How you interpret the t-statistic: the t-statistic is computed the same way, but the p-value and critical value use one tail of the distribution instead of two.

20
Q

95% CI for B1 definitions

A

(1) It is the set of values that cannot be rejected using a two-sided hypothesis test with a 5% significance level.
(2) It is an interval that has a 95% probability of containing the true value of B1 (in 95% of possible samples that may be drawn, the CI will contain the true value of B1).
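
In large samples the 95% confidence interval is usually computed as B-hat1 ± 1.96*SE(B-hat1). The sketch below uses the slope -2.28 from the test score example together with a hypothetical standard error of 0.5, which is chosen only for illustration and is not a value from the cards.

```python
def confidence_interval_95(estimate, se):
    """Large-sample 95% confidence interval: estimate +/- 1.96 * SE."""
    return estimate - 1.96 * se, estimate + 1.96 * se


b1_hat = -2.28   # slope from the test score example
se_b1 = 0.5      # hypothetical standard error, for illustration only

lower, upper = confidence_interval_95(b1_hat, se_b1)
rejects_zero = not (lower <= 0.0 <= upper)

print(f"95% CI for B1: ({lower:.2f}, {upper:.2f})")
print("Ho: B1 = 0 rejected at the 5% level:", rejects_zero)
```

This ties the two definitions together: a hypothesized value is rejected by the two-sided 5% test exactly when it lies outside the interval.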

21
Q

Observational studies: advantages and disadvantages

A

Key advantage: reflects real-world utilization rather than an artificial design.
Key disadvantage: you must address validity concerns.
In this case, you must address the question of why some children are in small classes and others are in large classes.

22
Q

Interpretation of the estimated slope and intercept
Test score-hat = 698.9 - 2.28*STR

How should we interpret the estimated intercept 698.9?

A

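Slope: districts with one more student per teacher have, on average, test scores that are 2.28 points lower.

Intercept: taken literally, 698.9 is the predicted test score for a district with STR = 0. No district has zero students per teacher, so this is an extrapolation outside the range of the data, and the intercept has no meaningful real-world interpretation here.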

23
Q

The error term consists of

A

omitted factors, or possibly measurement error in the measurement of Y or prediction limitations.

24
Q

t statistic formula

A

t = (estimator - hypothesized value) / SE of the estimator

25
Q

t statistic hypothesis testing

A

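Reject Ho at the 5% significance level if |t| > 1.96.

The 1.96 critical value comes from the large-sample normal approximation to the distribution of the t-statistic; typically n = 30 is large enough for the approximation to be excellent.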

26
Q

The F-test evaluates the null hypothesis that _____________ are equal to zero versus the alternative that __________.

Rejection of this hypothesis indicates that _____ of the regression slopes is ___________.

A

The F-test evaluates the null hypothesis that all regression slope coefficients are equal to zero versus the alternative that at least one is not.

Rejection of this hypothesis indicates that at least one of the regression slopes is non-zero.
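
Under the added assumption of homoskedastic errors, the F-statistic for this null hypothesis can be written in terms of R^2 as F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), where k is the number of slope coefficients. The sketch below only computes the statistic; the inputs and the function name `f_statistic_from_r2` are hypothetical, and the decision still requires a critical value from the F distribution.

```python
def f_statistic_from_r2(r2, n, k):
    """Homoskedasticity-only F-statistic for Ho: all k slope coefficients are zero."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# Hypothetical inputs: R^2 = 0.6 from a regression with n = 5 and one regressor
f_stat = f_statistic_from_r2(r2=0.6, n=5, k=1)
print("F-statistic:", round(f_stat, 4))
```

With a single regressor (k = 1), this F-statistic equals the square of the homoskedasticity-only t-statistic for the slope.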

27
Q

R^2 is the ratio of _______ to the _____ of the dependent variable y.
R^2 ranges from ________.
Easiest to think of this as the “_______ explained by the model”

A

R^2 is the ratio of “explained” variance to the “total” variance of the dependent variable y.
R^2 ranges from 0 to 1.
Easiest to think of this as the “% of variance explained by the model”.

28
Q

p-value is the probability of obtaining __________, given that the _________ is true.

A

p-value is the probability of obtaining a result at least as extreme as the one actually observed, given that the null hypothesis is true.

29
Q

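t-test: two-sided vs. one-sided critical values

A

At the 5% significance level: in a two-sided test, reject Ho if |t| > 1.96; in a one-sided test, reject Ho if t > 1.645 (when the alternative is "greater than") or if t < -1.645 (when the alternative is "less than").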
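
To see how the same t-statistic can lead to different conclusions, the sketch below computes one-sided and two-sided p-values under the large-sample normal approximation. The value t = 1.8 and the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


def p_two_sided(t):
    """p-value when the alternative is 'not equal'."""
    return 2.0 * (1.0 - normal_cdf(abs(t)))


def p_one_sided_greater(t):
    """p-value when the alternative is 'greater than'."""
    return 1.0 - normal_cdf(t)


def p_one_sided_less(t):
    """p-value when the alternative is 'less than'."""
    return normal_cdf(t)


t_act = 1.8  # hypothetical t-statistic
print(f"two-sided p-value:           {p_two_sided(t_act):.4f}")
print(f"one-sided (greater) p-value: {p_one_sided_greater(t_act):.4f}")
print(f"one-sided (less) p-value:    {p_one_sided_less(t_act):.4f}")
```

Here t = 1.8 lies between 1.645 and 1.96, so the one-sided "greater than" test rejects at the 5% level while the two-sided test does not.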

30
Q

A type I error (alpha) is:

A

A type I error (alpha) is the incorrect rejection of a true null hypothesis (false positive).

31
Q

A type II error (beta) is:

A

A type II error (beta) is when you fail to reject a false null hypothesis (false negative).

32
Q

Counterfactual

A

We would like to have a counterfactual, to be able to measure what would have occurred if the action had not taken place (i.e., no treatment or no program).
We do not actually observe the counterfactual, so we must design studies and use statistical techniques to mimic the outcomes of the counterfactual.