Week 3 and Week 4 Flashcards
What is a short model?
A model in which some variables have been absorbed into a new error term: e.g., starting from Y = β0 + β1X1 + β2X2 + U, the term β2X2 + U becomes the new error V, and the coefficients are relabeled from β to α.
If the short model satisfies the exogeneity assumption E[V | X1] = 0, then (in large samples)
α̂1 ≈ α1 = β1.
Which two conditions can confirm that cov(X1, V) = 0?
One of these has to hold:
• the variable X2 does not affect the outcome, β2 = 0,
• the regressors X1 and X2 are uncorrelated, cov(X1, X2) = 0.
What is omitted variable bias (OV bias) and what does the formula look like?
The term (cov(X1, X2)/var(X1))·β2 is called the omitted variable bias of α̂1.
This bias indicates by how much the estimator α̂1 deviates systematically from its estimand β1.
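A minimal simulation sketch of this formula (hypothetical numbers, not from the course): regressing Y on X1 alone recovers β1 plus the OV-bias term.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta1, beta2 = 2.0, 1.5

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)          # cov(X1, X2) = 0.8, so X2 is "omitted badly"
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

# Short-model OLS slope: regress Y on X1 only.
c = np.cov(y, x1)
alpha1_hat = c[0, 1] / c[1, 1]

# Theory: alpha1 = beta1 + (cov(X1, X2)/var(X1)) * beta2 = 2 + 0.8 * 1.5 = 3.2
cxx = np.cov(x1, x2)
print(alpha1_hat, beta1 + cxx[0, 1] / cxx[0, 0] * beta2)
```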
When is OV bias not zero?
The bias is different from zero if
• the variable X2 does affect the outcome, β2 ≠ 0, and
• the regressors X1 and X2 are correlated, cov(X1, X2) ≠ 0.
Can we reduce OV bias by adding more regressors?
No.
When dealing with measurement error, what is the omitted variable bias called?
The OV-bias term with the measurement error W as the omitted variable, (cov(X1*, W)/var(X1*))·β2 (with β2 = −β1), is called the attenuation bias.
What does it mean that α̂1 is biased toward zero?
In large samples, α̂1 is a scaled-down version of the true effect β1: it has the same sign but is smaller in absolute value. In other words, α̂1 estimates a value that is closer to zero than the true effect. We say that α̂1 is biased toward zero.
If the variance of the measurement error is small relative to the variance of X1, the attenuation factor will be …
close to 1, and the attenuation bias will be small (the attenuation factor is var(X1)/(var(X1) + var(W))).
- Conversely, if the variance of the measurement error is large relative to the variance of X1, the attenuation factor will be close to 0 and the attenuation bias will be large.
- In particular, we may estimate the effect of X1 to be close to zero even if its true effect is substantially different from zero.
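A minimal simulation sketch of attenuation (hypothetical numbers): with var(X1) = 1 and var(W) = 0.25, the attenuation factor is 1/1.25 = 0.8.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta1 = 2.0

x1 = rng.normal(size=n)                    # true regressor, var(X1) = 1
w = rng.normal(scale=0.5, size=n)          # classical measurement error, var(W) = 0.25
x1_star = x1 + w                           # observed, mismeasured regressor
y = beta1 * x1 + rng.normal(size=n)

# OLS slope of Y on the mismeasured regressor X1*.
c = np.cov(y, x1_star)
alpha1_hat = c[0, 1] / c[1, 1]

# Attenuation factor: var(X1) / (var(X1) + var(W)) = 0.8
print(alpha1_hat)                          # approximately 0.8 * beta1 = 1.6
```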
What does exogeneity mean?
That we can’t predict U from the regressors.
Transforming a long model into a short model might create an issue. Which one?
Endogeneity: the new error V absorbs β2X2, so V can be predicted from X1 whenever X1 and X2 are correlated.
Is the covariance between the observed regressor and the measurement error ever zero in the short model?
No, it is never zero: cov(X1*, W) = var(W) > 0.
If β1 is positive, can a very negative OV bias flip the sign of α̂1?
Yes: if the bias is negative and larger than β1 in absolute value, then α1 = β1 + bias < 0.
Classical measurement error assumptions
- E[W] = 0: X1 is measured correctly on average.
- W is independent of X1 and U: no systematic mismeasurement.
- var(W) > 0: measurement error exists.
In the attenuation bias formula, can we replace β2 by −β1?
Yes: in the measurement-error model the omitted variable is W and its coefficient is −β1.
What does RCT stand for?
Randomized controlled trial
Three examples of when exogeneity isn't fulfilled
- omitted variables
- measurement error exists
- equilibrium conditions
OLS formula for β̂1 (binary X1)
β̂1 = (E^[Y|X1=1] − E^[Y|X1=0]) / (E^[X1|X1=1] − E^[X1|X1=0])
IV regression formula for β̂1 (two groups)
β̂1 = (E^[Y|group 1] − E^[Y|group 2]) / (E^[X|group 1] − E^[X|group 2])
What does endogenous sorting do?
It moves us diagonally rather than horizontally in the graph, so a simple comparison of group means does not reveal the ceteris paribus effect.
What is the instrumental variable in IV?
Z; in the simplest case a binary (dummy) variable.
Instrumental exogeneity
E[U|Z]=0
β̂IV for a binary instrumental variable
β̂1 = (E^[Y|Z=1] − E^[Y|Z=0]) / (E^[X|Z=1] − E^[X|Z=0])
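A minimal sketch of this (Wald) estimator on simulated, hypothetical data; it also checks the equivalent covariance form used in the IV characteristics card below.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta1 = 2.0

z = rng.integers(0, 2, size=n)              # binary instrument, E[U|Z] = 0
u = rng.normal(size=n)
x = 0.5 * z + 0.7 * u + rng.normal(size=n)  # X is endogenous (correlated with U)
y = beta1 * x + u

# Wald / IV estimator for a binary instrument.
beta_iv = (y[z == 1].mean() - y[z == 0].mean()) / (x[z == 1].mean() - x[z == 0].mean())

# Equivalent covariance form: cov^(Y, Z) / cov^(X, Z).
beta_cov = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]
print(beta_iv, beta_cov)                    # both close to 2; naive OLS is biased
```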
What do instrument exogeneity and instrument relevance mean and imply?
Instrument exogeneity: E[U|Z] = 0
Instrument relevance: E[X|Z=1] ≠ E[X|Z=0]
Together they ensure that we move only horizontally in the graph.
OLS characteristics
- for general X1: β̂1 = cov^(Y, X1) / var^(X1)
- for binary X1:
β̂1 = (E^[Y|X1=1] − E^[Y|X1=0]) / (E^[X1|X1=1] − E^[X1|X1=0])
- slope coefficient: β̂1 is the estimated change in Y when X1 increases by one unit.
IV characteristics
- for general X1 and Z: β̂1 = cov^(Y, Z) / cov^(X1, Z)
- for a binary instrument:
β̂1 = (E^[Y|Z=1] − E^[Y|Z=0]) / (E^[X|Z=1] − E^[X|Z=0])
- slope coefficient: β̂1 = δ̂/φ̂, where δ̂ is the reduced-form effect of Z on Y and φ̂ the first-stage effect of Z on X1; it is the estimated change in Y when X1 increases by one unit (as induced by Z).
What differs between the first stage and second stage regression in 2SLS
The regression at the second stage deviates from the OLS regressions that we have considered so far in that one of the regressors is an estimated quantity.
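A minimal 2SLS sketch on simulated, hypothetical data, showing the two stages explicitly; with a single instrument it reproduces the IV ratio.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
beta1 = 2.0

z = rng.normal(size=n)                      # instrument (here continuous)
u = rng.normal(size=n)
x = 0.5 * z + 0.7 * u + rng.normal(size=n)  # endogenous regressor
y = beta1 * x + u

# First stage: regress X on Z, keep the fitted values X_hat.
Z = np.column_stack([np.ones(n), z])
gamma_hat = np.linalg.lstsq(Z, x, rcond=None)[0]
x_hat = Z @ gamma_hat

# Second stage: regress Y on the *estimated* regressor X_hat.
# (Plain second-stage standard errors are invalid, since X_hat is estimated.)
Xh = np.column_stack([np.ones(n), x_hat])
beta_2sls = np.linalg.lstsq(Xh, y, rcond=None)[0][1]

# With a single instrument, 2SLS equals the IV ratio cov^(Y, Z)/cov^(X, Z).
print(beta_2sls, np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1])
```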
Instrument relevance assumption
cov(X1, Z) ≠ 0
Is this true: A linear model that may be suitable for causal inference may not be a good choice for prediction and vice versa.
True.
Instrument exogeneity assumption
Z cannot predict U: E[U|Z] = 0.
In the context of prediction, what do we refer to a long linear model as?
A complex model.
passing from a linear model with only a few variables to a complex model with many control variables tends to …
inflate the variance of estimated (causal) marginal effects.
By constructing an IV estimator of the short linear model, do we INCREASE or DECREASE the variance of the estimated slope coefficients compared to direct OLS estimation of the short linear model?
We increase it.
A high variance means that we will have large confidence intervals, and therefore tests of hypotheses about the true causal effect will lack power.
Does using a less complex model guarantee a well-behaved error term?
No, using a less complex model does not guarantee a well-behaved error term:
- it decreases variance but allows the possibility of systematic bias of unknown (and in general unbounded) size.
What is the bias-variance trade-off of prediction?
We want to avoid specifying a very complex model that is difficult to estimate, i.e., for which we estimate parameter values (think slope coefficients) with very large variances; at the same time, a model that is too simple produces systematically biased predictions.
Difficult or easy to find a model that is valid for causal inference?
Difficult
In the context of prediction, what is the following called: a random sample (Yi, X1,i, …, Xk,i), i = 1, …, N, from our population, from which we estimate the coefficients by the OLS estimators β̂0, …, β̂k?
The training sample; estimating model parameters from it is called model training.
Are we interested in the predictions for the training sample itself?
No; we care about how well the model predicts new, out-of-sample draws.
The training-sample version of the mean-squared error is called the
training error.
What do low/high training errors correspond to in terms of R²?
- low training error corresponds to a high R2 value (close to one),
- large training error corresponds to a low R2 value (close to zero).
Since the training error is not a good measure of predictive power, neither is the R2.
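For reference, a sketch of the link between R² and the training error, assuming the usual sum-of-squares definition of R² and the training error normalized by n:

```latex
R^2 \;=\; 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}
\;=\; 1 - \frac{\text{training error}}{\widehat{\mathrm{var}}(Y)},
\qquad
\text{training error} \;=\; \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 .
```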
How close our prediction is to the realized outcome Y^oos(ω1) depends on three factors:
- How close the estimates β̂0(ω0), …, β̂k(ω0) are to the population coefficients b*0, …, b*k. This factor exists because of the randomness of the training sample (uncertainty about ω0).
- The realized values of the predictors x1, …, xk. The linear prediction rule will typically work better for some realizations than for others. What values we see depends on the randomness of the out-of-sample draw (uncertainty about ω1).
- How the part of Y^oos that is not predictable from the predictors realizes. This is determined by how the out-of-sample draw realizes (uncertainty about ω1).
What does the EPE (expected prediction error) take into account?
Both uncertainty about the realization of the training sample and uncertainty about the realization of the out-of-sample draw.
It is an ex ante measure of the cumulative errors.
The EPE consists of three error components. Which are these?
- irreducible error (U)
- bias (approximation error)
- variance (estimation error)
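A sketch of this three-part decomposition under squared-error loss (the exact notation may differ from the lecture notes):

```latex
\mathrm{EPE}
\;=\; \underbrace{\mathrm{var}(U)}_{\text{irreducible error}}
\;+\; \underbrace{(\text{bias})^{2}}_{\text{approximation error}}
\;+\; \underbrace{\mathrm{var}\bigl(\hat{Y}\bigr)}_{\text{estimation error}}
```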
To choose a prediction model with a low EPE we have to optimally trade off two effects. Which ones?
Bias and variance, bias-variance trade-off
Why is the training error not a good estimate of the EPE?
The training error estimates only the bias component but ignores the variance.
- Making a model more complex by adding additional predictors will never increase (and in practice almost always strictly decreases) the training error. However, as we add more and more predictors, the variance component of the EPE is expected to dominate eventually and, unlike the training error, the EPE will increase.
What is sample splitting?
To make sure that there is both a training and a test sample, the common approach is to randomly split the available data (of size m + n) into a training and a test sample (of size n and m, respectively).
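A minimal sketch of a random split, with made-up sizes (m + n = 1000, n = 700):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=(1000, 3))   # placeholder data set of size m + n

# Randomly split into a training sample (size n) and a test sample (size m).
n_train = 700
perm = rng.permutation(len(data))
train, test = data[perm[:n_train]], data[perm[n_train:]]
print(train.shape, test.shape)      # (700, 3) (300, 3)
```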
What is overfitting?
OLS usually overfits: the fitted model follows the training data (including its noise) too closely rather than the true functional form.
It is usually a problem if the researcher keeps adding new predictors in order to decrease the training error (equivalently, increase the R²) even further.
Ridge regression
Ridge regression improves on our previous approach by shrinking different slope coefficients by different factors.
What does OLS only care about?
It would only care about reducing the bias component of the EPE and would tend to overfit.
Lasso regression
Lasso tends to produce models that are of low complexity
If predictors are correlated, then Ridge regression will not …
apply the same amount of shrinkage to all coefficients. This distinguishes Ridge regression from the naïve shrinkage method discussed above.
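A minimal scikit-learn sketch contrasting OLS, Ridge, and Lasso on made-up data; the penalty strengths (alpha) are arbitrary illustration values, not tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(5)
n, k = 200, 10
X = rng.normal(size=(n, k))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)  # only 2 predictors matter

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)    # sets many coefficients exactly to zero

print(np.round(ols.coef_, 2))
print(np.round(ridge.coef_, 2))       # shrunken, but generally all nonzero
print(np.round(lasso.coef_, 2))       # sparse: a low-complexity model
```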
discrete time series
Often, a time series can only be observed at pre-defined discrete points in time.
Forecast
Predictions about the future
nowcasting
Predicting the current period yt (or recent past periods such as yt−1) from the data that is available to the econometrician in period t
Can we do statistical inference on a sample of size one?
No. Observing the time series in only a single state means that we have a sample of size one.
Stationarity
Requires that any two segments of the time series (of equal length) have an identical UNCONDITIONAL distribution.
- Under stationarity, Ys1 and Ys2 have the same distribution and in particular
E[Ys1] = E[Ys2]
var(Ys1) = var(Ys2).
Weak dependence
Weak dependence restricts the information about the time series that becomes available dynamically as time passes and more and more periods of the time series are observed.
Serial- / autocorrelation
Serial correlation means that observations of the time series at different points in time are correlated. One important example of serial correlation is auto-correlation. This refers to correlation between two subsequent time periods.
Weak time dependence
Time Yt isn’t affected by Yt-1
How do stationarity and weak dependence mimic an independently and identically distributed sample?
- Weak dependence ensures that every period reveals new information ("independent").
- In a time series we observe many segments, and under stationarity they all have the same distribution ("identical").
forecasting model left and right side
(outcome on left-hand side)
(predictors on right-hand side).
cross-sectional data
Cross-sectional data can be represented in a spreadsheet format where each row represents an observed unit and each column describes a unit characteristic.
Wide vs. long format
The wide table (Figure 2) contains more columns than the long table (Figure 1) and is therefore wider. Hence the representation in Figure 2 is called the "wide" format and that in Figure 1 the "long" format.
fixed effect
Suppose that At does not change over time. In that case, we can write At = A. Such an A describes the total effect of unobserved unit characteristics that do not change over time.
Fixed-effect transformation
Average the model over time:
fr̄ = (fr1982 + fr1988)/2 (time average of fr)
tax̄ = (tax1982 + tax1988)/2 (time average of tax)
Ū = (U1982 + U1988)/2 (time average of U)
so that fr̄ = β1·tax̄ + A + Ū.
Subtracting these averages from the original model gives
frt − fr̄ = β1(taxt − tax̄) + (Ut − Ū), t = 1982, 1988,
where the fixed effect A is removed and β1 is preserved.
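A minimal sketch of the within (fixed-effect) transformation on a hypothetical two-period panel; the numbers are invented for illustration only.

```python
import numpy as np

# Hypothetical two-period panel: rows = units, columns = years 1982 and 1988.
fr = np.array([[1.8, 1.6], [2.1, 1.9], [1.5, 1.4]])
tax = np.array([[0.3, 0.5], [0.2, 0.4], [0.6, 0.7]])

# Fixed-effect (within) transformation: subtract each unit's time average.
fr_dm = fr - fr.mean(axis=1, keepdims=True)
tax_dm = tax - tax.mean(axis=1, keepdims=True)

# OLS on the demeaned data: the fixed effect A has dropped out, beta1 is preserved.
beta1_hat = (fr_dm * tax_dm).sum() / (tax_dm ** 2).sum()
print(beta1_hat)
```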
clusters
Computing standard errors under the assumption that certain blocks of observations exhibit correlation is called “computing standard errors with clustering”. The blocks of correlated observations are called clusters. For panel data, it is often sensible to assume that all the observations of one unit form a cluster.
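A minimal statsmodels sketch of clustered standard errors on simulated, hypothetical panel data, clustering on units:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n_units, n_periods = 50, 2
groups = np.repeat(np.arange(n_units), n_periods)            # each unit = one cluster

x = rng.normal(size=n_units * n_periods)
unit_shock = np.repeat(rng.normal(size=n_units), n_periods)  # within-unit correlation
y = 2.0 * x + unit_shock + rng.normal(size=n_units * n_periods)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": groups})
print(fit.bse)   # standard errors allowing correlation within each cluster
```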
Errors depend on:
Bias: improves (shrinks) as more predictors X are included.
Variance: improves (shrinks) as fewer predictors X are included.
Training error
- does not measure the EPE
- estimates only the irreducible (idiosyncratic) error plus the bias
Test error
- measures (estimates) the EPE
Cross-sectional sampling
- random sampling
- observations are identical & independent
- imposed by the sampling design
Time-series sampling
- stationarity and weak time dependence
- a property of the economic environment
- difficult to verify empirically
- these assumptions often fail
First-difference transformation
Take first differences of all variables: Δfrt = β1·Δtaxt + ΔUt. Like the fixed-effect transformation, differencing removes the fixed effect A while preserving β1.
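A short sketch of first differencing on the same hypothetical panel as in the fixed-effect example above; with T = 2 the first-difference and within estimates coincide.

```python
import numpy as np

# Same hypothetical two-period panel as in the fixed-effect sketch above.
fr = np.array([[1.8, 1.6], [2.1, 1.9], [1.5, 1.4]])
tax = np.array([[0.3, 0.5], [0.2, 0.4], [0.6, 0.7]])

# First differences across periods: the fixed effect A cancels out.
d_fr = np.diff(fr, axis=1).ravel()
d_tax = np.diff(tax, axis=1).ravel()

beta1_hat = (d_fr * d_tax).sum() / (d_tax ** 2).sum()
print(beta1_hat)   # with T = 2 this equals the within (fixed-effect) estimate
```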