07 Linear Regression Flashcards
σ
Population standard deviation
s
Sample standard deviation:
An estimator of the population standard deviation
s_y
An estimate of the population standard deviation of the random variable Y in the population from which the sample was drawn
SE()
Standard error of an estimator:
An estimator of the standard deviation of the estimator
SE(Ȳ) = σ̂_Ȳ = s_Y / √n
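As a quick numeric check of SE(Ȳ) = s_Y / √n, using toy numbers (illustrative only, not from any real dataset):

```python
import math

# Toy sample (illustrative numbers only)
y = [2, 4, 5, 4, 5]
n = len(y)
y_bar = sum(y) / n  # sample mean

# Sample standard deviation s_y (divides by n - 1)
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Standard error of the sample mean: SE(Y_bar) = s_y / sqrt(n)
se_y_bar = s_y / math.sqrt(n)
```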
μ
Population mean
u
All factors other than X that affect Y (the error term)
synonyms for “dependent variable” and “independent variable”
dependent variable vs. independent variable
explained variable vs. explanatory variable
predicted variable vs. predictor variable
response variable vs. control variable
regressand vs. regressor
A normally distributed variable (X) can be made standard normal by:
Z = (X - μ) / σ; for the sample average, Z = (X̄ - μ) / (σ / √n)
The sample average is normally distributed whenever:
- Xi is normally distributed
- n is large (CLT)
T variable
T = (X̄ - μ) / (s_x / √n)
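The T variable can be computed on a toy sample; the hypothesised mean μ0 = 3 below is purely illustrative:

```python
import math

# Toy sample; hypothesised population mean mu0 = 3 (illustrative only)
x = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))

mu0 = 3
# T = (X_bar - mu0) / (s_x / sqrt(n))
t = (x_bar - mu0) / (s_x / math.sqrt(n))
```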
SLRM
Simple Linear Regression Model
The sum of squared prediction mistakes over all n observations
Sum[(Yi - b0 - b1 Xi)^2]; the OLS estimators β̂0 and β̂1 are the values of (b0, b1) that minimize this sum
β̂0
β̂0 = avg(Y) - β̂1 * avg(X)
Obtained from the first-order conditions of minimizing Sum[(Yi - b0 - b1 Xi)^2]
β̂1
β̂1 = Sum[(Xi - avg(X)) (Yi - avg(Y))] /
Sum[(Xi - avg(X))^2]
Obtained from the first-order conditions of minimizing Sum[(Yi - b0 - b1 Xi)^2]
β̂1 = r_{XY} * s_Y / s_X
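Both forms of the slope estimator can be checked on toy data (the numbers are illustrative only):

```python
import math

# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# beta1_hat = Sum[(Xi - X_bar)(Yi - Y_bar)] / Sum[(Xi - X_bar)^2]
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
beta0 = y_bar - beta1 * x_bar  # beta0_hat = Y_bar - beta1_hat * X_bar

# Equivalent form: beta1_hat = r_XY * s_Y / s_X
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
r_xy = s_xy / (s_x * s_y)
beta1_alt = r_xy * s_y / s_x
```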
If û_i is positive, the line ____ Yi
If û_i is positive, the line underpredicts Yi
By the definition of û and the first OLS first-order condition, the sum of the prediction errors is …
By the definition of û and the first OLS first-order condition, the sum of the prediction errors is zero:
Sum(û_i) = 0
The sample covariance between the independent variable and the OLS residuals is …
The sample covariance between the independent variable and the OLS residuals is zero.
The point … is always on the regression line (OLS)
The point (X̄, Ȳ) is always on the regression line (OLS)
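The three algebraic OLS facts above (residuals sum to zero, zero sample covariance between X and the residuals, and the line passing through (X̄, Ȳ)) can all be verified on toy data (illustrative numbers only):

```python
# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# OLS estimates from the closed-form formulas
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
beta0 = y_bar - beta1 * x_bar
resid = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]

sum_resid = sum(resid)                                            # = 0
cov_x_resid = sum((xi - x_bar) * ui for xi, ui in zip(x, resid))  # = 0
on_line = beta0 + beta1 * x_bar                                   # = y_bar
```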
different goals of regression
Among others:
- Describe data set
- Predictions and forecasts
- Estimate causal effect
Causality
Causality is the effect measured in an ideal randomized controlled experiment
The OLS estimator is unbiased, consistent and has asymptotically normal sampling distribution if:
The OLS estimator is unbiased, consistent and has asymptotically normal sampling distribution if:
- Random sampling
- Large outliers are unlikely
- The conditional mean of u_i given X_i is 0:
E(u|X) = 0
Example (wage regression, where ability is part of u): E(abil | educ = 8) = E(abil | educ = 16)
The OLS estimator is ___, ____ and has _____ if:
- Random sampling
- Large outliers are unlikely
- The conditional mean of u_i given X_i is 0
The OLS estimator is unbiased, consistent and has asymptotically normal sampling distribution if:
- Random sampling
- Large outliers are unlikely
- The conditional mean of u_i given X_i is 0
(OLS) When dealing with outliers one may want ______
When dealing with outliers one may want to report the OLS regression both with and without the outliers
OLS is the most efficient (the one with the lowest variance) among all linear unbiased estimators whenever:
OLS is the most efficient (the one with the lowest variance) among all linear unbiased estimators (the Gauss-Markov theorem) whenever:
- The 3 OLS assumptions hold
- The error is homoskedastic
TSS
Total sum of squares:
Sum[(Yi - avg(Y))^2]
TSS = ESS + SSR
ESS
Explained sum of squares:
Sum[(Ŷi - avg(Y))^2]
SSR
Sum of squared residuals:
Sum[û_i^2]
R^2
The regression R^2 is the fraction of the sample variance of Yi explained by Xi:
R^2 = ESS / TSS = 1 - SSR / TSS
R^2 = 0: none of the variation in Yi is explained by Xi
R^2 = 1: all the variation is explained by Xi; all the data points lie on the OLS line.
A high R2 means that the regressor is good at predicting Yi (not necessarily the same as a ”good” regression)
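Both expressions for R^2, and the decomposition TSS = ESS + SSR they rely on, can be checked on toy data (illustrative numbers only):

```python
# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
beta0 = y_bar - beta1 * x_bar
fitted = [beta0 + beta1 * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
ess = sum((fi - y_bar) ** 2 for fi in fitted)           # explained sum of squares
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # sum of squared residuals

r2_a = ess / tss      # R^2 = ESS / TSS
r2_b = 1 - ssr / tss  # R^2 = 1 - SSR / TSS; equal because TSS = ESS + SSR
```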
SER
The standard error of the regression (SER) is an estimator for the standard deviation of the regression error u_i.
SER = √[SSR / (n - 2)]
It measures the spread of the observations around the regression line.
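A minimal sketch of the SER on the same kind of toy data (illustrative numbers only):

```python
import math

# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
beta0 = y_bar - beta1 * x_bar
resid = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]

ssr = sum(ui ** 2 for ui in resid)
ser = math.sqrt(ssr / (n - 2))  # SER = sqrt(SSR / (n - 2))
```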
If the independent variable is multiplied by some nonzero constant c, then the OLS slope coefficient is _____
If the independent variable is multiplied by some nonzero constant c, then the OLS slope coefficient is divided by c.
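The scaling rule can be verified directly; the dataset and the constant c = 10 below are illustrative only:

```python
# Helper computing the OLS slope from the closed-form formula
def ols_slope(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
           / sum((xi - x_bar) ** 2 for xi in x)

# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
c = 10  # arbitrary nonzero constant

slope = ols_slope(x, y)
slope_scaled = ols_slope([c * xi for xi in x], y)  # slope divided by c
```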
Homoskedasticity
The error u has the same variance given any value of the explanatory variable, in other words: Var(u|x) = σ^2
Homoskedasticity is not required for unbiased estimates, but it is an underlying assumption in the standard variance
calculation of the parameters. To keep the variance expression simple, the assumption that the errors are homoskedastic is added.
The larger the variance of X, the ____ the variance of β̂1
The larger the variance of X, the smaller the variance of β̂1
Var(β̂1)
Var(β̂1) = (1/n) * Var[(Xi - μ_X) u_i] / [Var(Xi)]^2 (Appendix 4.3)
s_xy
Sample covariance
1 / (n - 1) * sum{(Xi - avg(X))(Yi - avg(Y))}
s^2_X
Sample variance of X
1 / (n - 1) * sum{(Xi - avg(X))(Xi - avg(X))}
sample correlation coefficient
r_{XY} = s_{XY} / (s_X * s_Y)
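The three sample quantities above fit together as follows on toy data (illustrative numbers only):

```python
import math

# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sample covariance and sample variances (all divide by n - 1)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s2_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s2_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)

# Sample correlation coefficient, always in [-1, 1]
r_xy = s_xy / (math.sqrt(s2_x) * math.sqrt(s2_y))
```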
consistency
An estimator is consistent if the spread of its sampling distribution around the true parameter approaches zero as n increases
Normality assumption
The population error u is independent of explanatory variables and is Normal(0, σ^2)
- Whenever y takes on just a few values it cannot have anything close to a normal distribution.
- The exact normality of OLS depends on the normality of the error.
- If β̂ is not normally distributed, the t-statistic does not have a t-distribution.
- The normal distribution of u is the same as the distribution of Y given X.
- In large samples we can invoke the CLT to conclude that the OLS estimators satisfy asymptotic normality.
β̂1 ∼
Normal[β1, Var(β̂1)]
Thus (β̂1 − β1) / sd(β̂1) ∼ Normal(0, 1)
This comes from:
• A random variable which is a linear function of a normally distributed variable is itself normally distributed.
• If we assume that u ∼ N(0, σ^2), then Yi is normally distributed.
• Since the estimators β̂0 and β̂1 are linear functions of the Yi's, the estimators are normally distributed.
In general the t-statistics has the form:
t = (estimator - hypothesised value) / standard error of the estimator
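The general form can be written as a one-liner; the estimate 0.6 and standard error 0.25 below are made-up illustrative values, not results from any regression:

```python
# Generic t-statistic: (estimator - hypothesised value) / SE(estimator)
def t_stat(estimate, hypothesised, se):
    return (estimate - hypothesised) / se

# Illustrative values only: testing H0: beta1 = 0
# with beta1_hat = 0.6 and SE(beta1_hat) = 0.25
t = t_stat(0.6, 0.0, 0.25)
```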
A coefficient can be statistically significant either because ____
A coefficient can be statistically significant either because the coefficient is large, or because the standard error is small.
(OLS) Which standard errors are preferred?
Heteroskedasticity robust standard errors
In econometric applications the errors are rarely homoskedastic and normally distributed, but as long as n is large and we compute heteroskedasticity robust standard errors we can compute t-statistics and hence p-values and confidence intervals as normal.
(OLS) Most often the violated assumption is ___
(OLS) Most often the violated assumption is the zero conditional mean assumption, X is often correlated with the error term.
Sum[ (Xi - avg(X)) (Yi - avg(Y)) ] = Sum[ … ]
Sum[ Xi (Yi - avg(Y)) ]
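The identity holds because Sum[X̄ (Yi - Ȳ)] = X̄ * 0 = 0; a numeric check on toy data (illustrative numbers only):

```python
# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

lhs = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
rhs = sum(xi * (yi - y_bar) for xi, yi in zip(x, y))
# Equal: the dropped term sums to x_bar * sum(yi - y_bar) = 0
```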