Lecture 18 Flashcards
(22 cards)
Forecast is not a prediction
Forecasting is out of sample - you’re using data up to time T to predict future values like Y_{T+1}, etc.
- Ŷ_{T+h|T} is the forecast of Y at time T + h, using data up to time T
- forecast error = Y_{T+h} - Ŷ_{T+h|T} (actual value minus forecast)
- forecasts can be one-step ahead or multi-step ahead
Forecasting is about using past data to predict the future
MSFE - Mean Squared Forecast Error
Measures the average squared difference between the actual future value and your forecast
- MSFE = E[(Y_{T+1} - Ŷ_{T+1|T})²]
- squaring errors penalises big mistakes more than small ones
Decomposition shows two sources of forecast error:
1. Oracle error - due to the unpredictable future shock u_{T+1}
2. Estimation error - the estimated coefficients deviate from the true ones
RMSFE - Root MSFE
- just the square root of the MSFE - interpreted like a typical forecast error, but hard to get directly as we don’t know future values like Y_{T+1}
- if we assume stationarity and no estimation error, we can approximate it with:
RMSFE_SER = sqrt(SSR / (T - n - 1)) - if the data are stationary, forecast errors have mean zero and the RMSFE can be estimated from the regression’s residual variance; this ignores estimation error, but if the sample size is large relative to the number of predictors, that’s often okay
FPE - Final Prediction Error
Adjusts the RMSFE to include estimation error; applies only if the data are stationary and homoskedastic
- RMSFE_FPE = sqrt((SSR / T) × ((T + n + 1) / (T - n - 1)))
- T is the number of observations, n is the number of predictors
- the previous SER version understates forecast error by ignoring estimation uncertainty; this tries to fix that, but still relies on strong assumptions, like errors being homoskedastic and the data being stationary
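A minimal sketch of these two formula-based RMSFE estimates, assuming you already have the SSR from a fitted forecasting regression (the function names and the numbers plugged in are purely illustrative):

```python
import numpy as np

def rmsfe_ser(ssr: float, T: int, n: int) -> float:
    """SER-based estimate: sqrt(SSR / (T - n - 1)); ignores estimation error."""
    return np.sqrt(ssr / (T - n - 1))

def rmsfe_fpe(ssr: float, T: int, n: int) -> float:
    """FPE-based estimate: sqrt((SSR / T) * (T + n + 1) / (T - n - 1)),
    which inflates the SER version to account for estimation error."""
    return np.sqrt((ssr / T) * (T + n + 1) / (T - n - 1))

# Illustrative numbers: T = 200 observations, n = 4 predictors, SSR = 150
print(rmsfe_ser(150.0, T=200, n=4))   # ~0.88
print(rmsfe_fpe(150.0, T=200, n=4))   # slightly larger, ~0.89
```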
POOS - Pseudo Out-of-Sample explanation
- doesn’t require strong assumptions and captures both estimation and forecast error
- most honest/ realistic forecast evaluation
Avoids the unrealistic assumptions of SER and FPE and mimics real forecasting conditions: at each date you only use information that would have been available then, and by re-estimating the model each time it naturally incorporates estimation error
How POOS works
- Split the sample - use the first 90% of data for model estimation, final 10% for forecasting
- Re-estimate your model each time - for each date s, fit the model using data up to s
- Forecast one step ahead - predict Y_{s+1} using the model fitted through date s, giving Ŷ_{s+1|s}
- Compute the forecast error - Y_{s+1} - Ŷ_{s+1|s}
- Compute the POOS RMSFE: sqrt((1/P) × SUM(forecast error squared)), where P is the number of pseudo out-of-sample forecasts (see the sketch below)
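A minimal sketch of the POOS loop for a one-step-ahead AR(1) forecast, assuming y is a 1-D numpy array holding a stationary series (the AR(1) model and the 90/10 split are illustrative choices, not the only option):

```python
import numpy as np

def poos_rmsfe_ar1(y: np.ndarray, estimation_frac: float = 0.9) -> float:
    """Pseudo out-of-sample RMSFE for a one-step-ahead AR(1),
    re-estimated using only the data available at each forecast date."""
    T = len(y)
    start = int(estimation_frac * T)          # forecast the final ~10% of dates
    errors = []
    for s in range(start, T):
        past = y[:s]                                      # data available when forecasting y[s]
        X = np.column_stack([np.ones(s - 1), past[:-1]])  # intercept + first lag
        b = np.linalg.lstsq(X, past[1:], rcond=None)[0]   # re-estimate the AR(1) by OLS
        y_hat = b[0] + b[1] * past[-1]                    # one-step-ahead forecast of y[s]
        errors.append(y[s] - y_hat)                       # pseudo out-of-sample forecast error
    errors = np.asarray(errors)
    return np.sqrt(np.mean(errors ** 2))      # sqrt((1/P) * sum of squared forecast errors)

# Illustrative use on a simulated stationary AR(1) series
rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.5 * y[t - 1] + rng.normal()
print(poos_rmsfe_ar1(y))                      # close to the shock standard deviation of 1
```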
Forecast intervals - key points
If the forecast error is normally distributed, we can build a 95% forecast interval around our prediction:
- Ŷ_{T+1|T} ± 1.96 × (estimated RMSFE)
- This is NOT a confidence interval: Y_{T+1} is a future random variable, so we’re capturing outcome uncertainty, not parameter uncertainty
- Strictly valid only if u_{T+1} is normal, but in practice the approximation often works reasonably well too
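A tiny sketch of the interval itself; y_hat and rmsfe_hat stand in for whatever point forecast and RMSFE estimate (SER, FPE or POOS based) you have computed:

```python
# 95% forecast interval around a point forecast, assuming roughly normal forecast errors
y_hat = 2.4          # illustrative point forecast of Y_{T+1}
rmsfe_hat = 0.9      # illustrative estimated RMSFE

lower, upper = y_hat - 1.96 * rmsfe_hat, y_hat + 1.96 * rmsfe_hat
print((lower, upper))   # (0.636, 4.164) - an interval for the future outcome, not for a parameter
```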
Forecast intervals for transformations
- e.g. Δln(IP_t) as the dependent variable
Often in time series we model transformations rather than raw levels, but what if we want the forecast in levels rather than in growth rates?
- Forecast the change in logs, to get percentage change
- Convert this forecast back to levels
- Use the RMSFE to build a forecast interval for the transformed regression, then convert the bounds back into levels using step 2
To convert:
ÎP_{T+1} = IP_T × (1 + ΔÎP_{T+1}), i.e. scale the current level by one plus the forecast growth rate; if the % changes are not normal, then correct for the variance (see the sketch below)
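A minimal sketch of the conversion, assuming the regression forecasts the one-period change in ln(IP) (the numbers are illustrative, and exp() is used instead of the 1 + x approximation, which matters little for small changes):

```python
import numpy as np

ip_T = 120.0        # illustrative current level of the series
dlog_hat = 0.010    # forecast change in ln(IP), roughly a 1% growth rate
rmsfe_hat = 0.020   # estimated RMSFE of the log-change forecast

# Build the interval in the transformed (log-change) units first ...
lo_dlog, hi_dlog = dlog_hat - 1.96 * rmsfe_hat, dlog_hat + 1.96 * rmsfe_hat

# ... then convert the point forecast and the bounds back to levels
ip_hat = ip_T * np.exp(dlog_hat)                           # ~ ip_T * (1 + dlog_hat)
lo_level, hi_level = ip_T * np.exp(lo_dlog), ip_T * np.exp(hi_dlog)
print(ip_hat, (lo_level, hi_level))
```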
Forecasting oil prices using time series methods
- Model selection - AR, ADL, etc.; use tools like BIC or AIC to pick the lag length and variables
- Checking for breaks - Chow test if you know roughly where the break is, QLR test to detect an unknown break date
- Point forecast - forecast the log change, convert to a % change, then back to the actual level
- Forecast interval in levels - use the RMSFE, assuming small changes and normality
- Choosing the right RMSFE - SER, FPE or POOS
Prediction in a big data or high-dimensional setting
- in traditional regressions we usually have fewer predictors than observations (k < N)
- in big data we may have many predictors, sometimes even more predictors than data points, which can make OLS unreliable due to overfitting
What’s an estimation sample and what’s a holdout sample?
- formalised predictive regression setup - MLR
Estimation Sample: data used to estimate/fit your model
Holdout Sample: used for out of sample evaluation, crucial for forecasting, as in-sample fit doesn’t tell us about predictive power
assume the holdout sample comes from the same distribution as the estimation sample, otherwise out-of-sample performance isn’t meaningful
- from the holdout sample we can compute the MSPE
MSPE - Mean Squared Prediction Error
MSPE_OLS ≈ (1 + k/N) × σ² under homoskedasticity
- the more predictors you use (k), the worse your out-of-sample prediction can be, unless you have lots of data (N)
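A small simulation sketch of that formula, assuming homoskedastic errors with σ² = 1; the design (irrelevant standard-normal predictors, N = 500, k = 25) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, sigma2, reps, holdout = 500, 25, 1.0, 300, 4000

mspe = 0.0
for _ in range(reps):
    X = rng.normal(size=(N, k))
    y = rng.normal(scale=np.sqrt(sigma2), size=N)        # true coefficients set to zero for simplicity
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS on the estimation sample
    X_new = rng.normal(size=(holdout, k))                # fresh draws play the role of a holdout sample
    y_new = rng.normal(scale=np.sqrt(sigma2), size=holdout)
    mspe += np.mean((y_new - X_new @ beta_hat) ** 2) / reps

print(mspe)                    # empirical out-of-sample MSPE, roughly 1.05
print((1 + k / N) * sigma2)    # theoretical (1 + k/N) * sigma^2 = 1.05
```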
How to estimate the MSPE using cross validation
- m-fold cross validation
Simulation of an out-of-sample testing environment
1. Split data into m chunks
2. For each chunk, estimate the model on (1-1/m).N observations, then predict the remaining N/m observations
3. Rotate through all m folds so each observation is used once for testing
4. Compute prediction errors for all test predictions, average them to get your MSPE
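A minimal sketch of m-fold cross-validation for a linear regression’s MSPE, using scikit-learn’s KFold and LinearRegression on simulated data (m = 5 and the data-generating process are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
N, k = 200, 10
X = rng.normal(size=(N, k))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=N)   # only the first two predictors matter

m = 5
sq_errors = []
for train_idx, test_idx in KFold(n_splits=m, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # fit on (1 - 1/m)*N observations
    pred = model.predict(X[test_idx])                            # predict the held-out N/m observations
    sq_errors.extend((y[test_idx] - pred) ** 2)                  # squared prediction errors

print(np.mean(sq_errors))   # cross-validated MSPE estimate (close to the error variance of 1 here)
```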
Why not use SER for model fit?
- SER is in sample, only tells you how well the model fits the data it was trained on
- MSPE is out of sample, tells you how well the model predicts new data
How is m fold cross validation like POOS?
- you pretend to be in real time, re-estimating and predicting forward
- cross validation does the same, but instead of a time sequence, splits data randomly or sequentially depending on context.
Ridge regression
A regularisation technique used when you have many predictors, maybe more than observations, but at least enough that OLS starts to break down
- Ridge objective: SSR + λ·SUM(b_j²) (using λ for the penalty parameter, to avoid confusion with k, the number of predictors)
- second term is the penalty, proportional to the sum of squared coefficients
- if λ = 0 we’re back to plain OLS (just minimising the SSR); if λ is large, coefficients shrink towards 0; λ is chosen as the value which minimises the cross-validated MSPE
- discourages large coefficients
- low λ behaves like OLS; high λ means strong shrinkage, with coefficients pulled towards 0
How to choose λ for ridge
- Pick a range of values of λ to try
- Use K-fold cross-validation: split into K folds, use K-1 folds to estimate the model, use the held-out fold to predict and compute the MSPE for each candidate λ
- Average the MSPEs across the K folds
- Choose the λ with the lowest average MSPE across folds - the optimal shrinkage level (see the sketch below)
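A minimal sketch of that procedure with scikit-learn, where the penalty parameter λ is called alpha; the grid of candidate values and the simulated data are illustrative:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
N, k = 100, 60
common = rng.normal(size=(N, 1))                        # shared factor makes predictors correlated
X = common + rng.normal(size=(N, k))
y = X @ rng.normal(scale=0.2, size=k) + rng.normal(size=N)

X_std = StandardScaler().fit_transform(X)               # standardise so one lambda suits all coefficients
alphas = np.logspace(-3, 3, 25)                         # grid of candidate shrinkage levels (lambda)
ridge = RidgeCV(alphas=alphas, scoring="neg_mean_squared_error", cv=5).fit(X_std, y)

print(ridge.alpha_)                 # lambda with the lowest cross-validated MSPE
print(np.abs(ridge.coef_).min())    # coefficients are shrunk but none is exactly zero
```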
Lasso regression
The first term is the same, but the second term is λ·SUM(|b_j|)
- forces small coefficients to 0, creating sparsity
- use when you have lots of predictors and you think many may be irrelevant
- with ridge, all coefficients are shrunk smoothly towards 0, but never exact
- with lasso, the geometry of the absolute-value penalty means some coefficients are forced to be exactly 0
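A minimal sketch of the contrast with ridge, using scikit-learn’s LassoCV on simulated data where most predictors are irrelevant (the sparsity pattern is illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
N, k = 150, 40
X = rng.normal(size=(N, k))
beta = np.zeros(k)
beta[:3] = [2.0, -1.5, 1.0]            # only 3 of the 40 predictors actually matter
y = X @ beta + rng.normal(size=N)

lasso = LassoCV(cv=5).fit(X, y)        # penalty (lambda, called alpha here) chosen by 5-fold CV
print(lasso.alpha_)                    # chosen penalty
print(np.sum(lasso.coef_ != 0))        # number of non-zero coefficients - most are forced to exactly 0
```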
Principal components regression
Instead of selecting variables, PCR transforms them:
- takes linear combinations of the original variables - called PCs
- combinations chosen to maximise variance, capture as much info from X as possible
- first PC captures the largest share of variance in X, etc
- only keeps the top p components, which explain most of the variance
- run OLS regression of y on these p PCs instead
- reduces dimensionality, avoids multicollinearity and overfitting, while keeping most of the predictive power
SO: max Var(SUM_i(a_i·X_i)) over the weights a, subject to each component being uncorrelated with the previous ones and the weights being normalised (see the sketch below)
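A minimal sketch of PCR with scikit-learn, chaining standardisation, PCA and OLS in a pipeline (keeping p = 5 components is an illustrative choice; p could also be picked by cross-validation):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
N, k = 120, 30
factor = rng.normal(size=(N, 1))
X = factor + 0.3 * rng.normal(size=(N, k))   # highly correlated predictors driven by one common factor
y = factor[:, 0] + 0.5 * rng.normal(size=N)

p = 5
pcr = make_pipeline(StandardScaler(), PCA(n_components=p), LinearRegression())
pcr.fit(X, y)                                # OLS of y on the first p principal components of X

print(pcr.named_steps["pca"].explained_variance_ratio_)   # share of Var(X) captured by each kept PC
print(pcr.score(X, y))                                    # in-sample R^2 of the PCR fit
```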
Difference between MSFE and MSPE
MSFE arises in time series forecasting - it measures how well a time series model predicts the future
MSPE is the more general predictive-modelling version - it measures how well a regression model predicts NEW observations
When to use ridge vs lasso
Ridge is best when predictors are many and highly correlated, but we think most are useful in some way
Lasso is best when we think many predictors are irrelevant and the true model is sparse
Why PCR?
Firstly, OLS breaks down when you have lots of predictors, so all of these strategies are ways to tame too many predictors so that forecasts are stable and accurate
- the problem is often not just that there are too many predictors, but that these predictors are highly correlated - something ridge and lasso don’t tackle directly
- PCR solves this by replacing the predictors with a smaller set of linear combinations of them