Week 4 Flashcards
Predictors
Observed economic variables on which we can base predictions. This is what the regressors are called in a prediction context.
Training sample
The data we use to estimate our predictive model. Contains both outcome and predictors.
Out-of-sample observation
OOS, the test data. A new draw from the population that is independent of the training sample.
Mean squared prediction error (MSPE):
The expected squared deviation between the out-of-sample outcome and the prediction from the model estimated on the training data. This measure takes into account the uncertainty due to estimation error as well as uncertainty about the out-of-sample observation. In other words, it measures predictive ability.
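In symbols (my notation, not from the card), with f̂ estimated on the training sample and (X^oos, Y^oos) an independent out-of-sample draw:

```latex
\mathrm{MSPE} = \mathbb{E}\!\left[\big(Y^{oos} - \hat{f}(X^{oos})\big)^{2}\right]
```

The expectation is taken over both the out-of-sample draw and the training sample, which is why estimation error is part of the MSPE.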
Difference between MSPE and MSE:
While the MSE measures the estimator’s fit, i.e. how well the fitted line matches the outcome observations we already have, the MSPE measures a predictor’s fit, i.e. how well we predict new observations with this fitted line.
How can MSPE be decomposed?
Three parts:
- Irreducible error
- Approximation error (bias)
- Estimation error (variance)
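Written out (a standard sketch; the notation is mine rather than from the card), for a prediction at a point x:

```latex
\mathrm{MSPE}(x) = \underbrace{\sigma^{2}}_{\text{irreducible error}}
 + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^{2}}_{\text{approximation error (bias}^{2}\text{)}}
 + \underbrace{\operatorname{Var}\!\big(\hat{f}(x)\big)}_{\text{estimation error (variance)}}
```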
The bias/variance trade-off of prediction
The predictive model doesn’t affect the irreducible error. Complex models have low approximation error (bias) and high estimation error (variance); simple models have the reverse properties. The optimal model lies somewhere in between.
Training error
The MSE of the predictive model evaluated on the training sample. Measures the sum of the irreducible error and the bias but ignores the variance. (OLS minimizes the training error.)
Test error
The MSE of the predictive model evaluated on the test sample. The test error estimates the MSPE.
Overfitting
Choosing a model with a non-optimal (too high) MSPE because the model focuses too much on reducing the bias. OLS minimizes the training error and therefore tends to overfit.
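A minimal sketch of the training-error/test-error gap (a toy example of my own, not from the course material): as polynomial degree grows, the training MSE keeps falling while the test MSE, which estimates the MSPE, eventually rises.

```python
# Toy illustration (hypothetical data): training error vs. test error as complexity grows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=(100, 1))
y_train = np.sin(x_train).ravel() + rng.normal(0, 0.5, size=100)   # training sample
x_test = rng.uniform(-3, 3, size=(100, 1))                          # independent out-of-sample draws
y_test = np.sin(x_test).ravel() + rng.normal(0, 0.5, size=100)

for degree in (1, 3, 10, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))  # training error
    test_mse = mean_squared_error(y_test, model.predict(x_test))     # test error, estimates the MSPE
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```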
The principle of shrinkage
Changing a predictive model so that it reacts less strongly to variation in the predictors can decrease the MSPE (it increases the bias but lowers the variance). This is useful because OLS may have overfitted the regression model so that it doesn’t work well in a new dataset –> by shrinking the coefficients towards zero, the MSPE will typically decrease.
Two methods that do this are ridge and lasso regression.
Ridge regression
A version of OLS with shrunken slope coefficients (smaller in absolute value than OLS). Ridge accounts for the cost of complexity through a penalty term that is parameterized by a regularization parameter lambda. If lambda = 0, ridge regression is identical to OLS; if lambda = infinity, the ridge prediction is identical to the sample mean (no variance).
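In formula form (a standard way to write the ridge objective; notation is mine), the penalty is lambda times the sum of squared slope coefficients:

```latex
\hat{\beta}^{\,ridge} = \arg\min_{\beta}\;
 \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}
 + \lambda \sum_{j=1}^{p}\beta_j^{2}
```

With lambda = 0 this is the OLS objective; as lambda grows, the slope coefficients are pushed towards zero.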
Lasso regression
Similar to ridge but uses a different penalty term. Whereas ridge never shrinks coefficients exactly to zero, lasso typically shrinks some (or many) of them to zero. We say that some predictors are selected (still in the model) and some are not selected (shrunken to 0).
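The lasso objective differs only in the penalty, which uses absolute values instead of squares (again my notation); this is what makes exact zeros possible. The short sketch after the formula is a hypothetical example using scikit-learn’s Ridge and Lasso (not from the course) showing lasso setting many coefficients to exactly zero while ridge keeps all of them nonzero.

```latex
\hat{\beta}^{\,lasso} = \arg\min_{\beta}\;
 \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}
 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
```

```python
# Hypothetical comparison: ridge shrinks coefficients, lasso also selects (zeros some out).
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
beta = np.zeros(20)
beta[:3] = [2.0, -1.5, 1.0]                # only the first 3 predictors truly matter
y = X @ beta + rng.normal(0, 1.0, size=200)

ridge = Ridge(alpha=5.0).fit(X, y)         # alpha plays the role of lambda
lasso = Lasso(alpha=0.2).fit(X, y)

print("ridge coefficients set to zero:", int(np.sum(ridge.coef_ == 0)))  # typically 0
print("lasso coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))  # typically many
```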
Random sampling: two properties:
- Independence - Y_t and Y_{t+r} are independent when r is large, so where we end up is unpredictable; new information is added as we move forward in time.
- Identical population - stationarity: ex ante we should not be able to predict one draw better than another from rules or patterns. All draws come from the same population and share the same unconditional distribution, so at t = 0 we would make the same prediction for each observation.
Observations that are generated through random sampling are independently and identically distributed (iid). The order in which the rows are arranged doesn’t matter, but the number of rows (n) should be large.
Cross-sectional data
Rows in the data set correspond to units and the columns describe unit characteristics. Rows are randomly sampled.