Framework for Prediction Flashcards
What is the difference between original data and live data?
Original data is the data we have to build the model
Live data is data we do not have yet
Fill in the blanks
- If y is quantitative, then there is a \_\_\_\_ prediction and \_\_\_\_ problem.
- If y is binary, then there is a \_\_\_\_ prediction and \_\_\_\_ problem.
- quantitative prediction, regression problem
- probability prediction, classification problem
What do we care less about in a predictive analysis? What do we still care about?
- We care less about individual coefficient values (so multicollinearity matters less)
- We still care about the stability of the results
What is the ideal prediction error?
Zero. We want predictions to be as close to the actual values as possible.
When evaluating prediction errors, how do we define direction and size?
Direction: Did we over- or underpredict values
Size: The absolute value of how far off the prediction was.
Define prediction error
The difference between the predicted value of the target variable and its actual value for the target observation.
PE = yhat - y
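As a minimal sketch of the definition above, the snippet below computes a prediction error and reads off its direction and size. The function names and the numbers are illustrative assumptions, not part of the original flashcards.

```python
def prediction_error(y_hat, y):
    """Prediction error as defined above: PE = yhat - y."""
    return y_hat - y


def direction(pe):
    """Direction: did we over- or underpredict?"""
    if pe > 0:
        return "overprediction"
    if pe < 0:
        return "underprediction"
    return "exact"


def size(pe):
    """Size: the absolute value of the prediction error."""
    return abs(pe)


# Hypothetical numbers: we predicted 5, the actual value turned out to be 11.
pe = prediction_error(5, 11)
print(pe, direction(pe), size(pe))  # -6 underprediction 6
```

With this sign convention, a negative PE means the prediction was too low.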
What are the three components of a prediction error?
- Estimation error: the difference between the value estimated by our model and the true value the same model would give in the population
- Model error: the difference between the true value from our model and the value from the best possible predictor model (we might not have the best model)
- Genuine (idiosyncratic, irreducible) error: the error that remains because we cannot perfectly predict every value, even with the best model and zero estimation error
Give an example of estimation, model, and an irreducible error.
Estimation: You predict a drive from your house to the store will take 5 minutes. Due to traffic it takes 11 minutes. The estimation error would be 6 minutes.
Model: In a model, you used age instead of age^2 for predicting.
Irreducible: Even though you have collected all the data you can, there still are differences between y and yhat.
What are loss functions?
They evaluate the consequences of a prediction error, that is, how bad the error is. They can be symmetric or asymmetric, and linear or convex.
What does it mean for a loss function to be symmetric? Asymmetric? Convex?
Symmetric: Positive and negative errors of the same magnitude produce the same loss
Asymmetric: Positive and negative errors of the same magnitude produce different losses
Convex: Larger errors produce disproportionately larger losses
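The three properties above can be sketched with a few toy loss functions. This is only an illustration: the function names and the weights in the asymmetric loss are assumptions chosen for the example, and e stands for the prediction error yhat - y.

```python
def absolute_loss(e):
    """Symmetric and linear: the loss grows in proportion to the error size."""
    return abs(e)


def squared_loss(e):
    """Symmetric and convex: doubling the error quadruples the loss."""
    return e ** 2


def asymmetric_linear_loss(e, over_weight=1.0, under_weight=3.0):
    """Asymmetric: underprediction (e < 0) costs more than overprediction.

    The weights 1.0 and 3.0 are hypothetical.
    """
    if e >= 0:
        return over_weight * e
    return under_weight * -e


for e in (-2, 2, 4):
    print(e, absolute_loss(e), squared_loss(e), asymmetric_linear_loss(e))
```

Which loss is appropriate depends on the real-world cost of over- versus underpredicting in the problem at hand.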
Define bias and variance for MSE.
Bias: the average of the prediction error. A biased prediction produces nonzero errors on average.
Variance: how the prediction varies around its average value when multiple predictions are made.
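A quick numeric check of how the two pieces relate to MSE, assuming repeated predictions of a single fixed target value (the numbers are made up for illustration). Under that assumption, MSE equals squared bias plus the variance of the predictions.

```python
import math
from statistics import mean, pvariance

# Hypothetical setup: one actual value, predicted five times.
y = 10.0
predictions = [12.0, 9.0, 11.0, 13.0, 10.0]

errors = [y_hat - y for y_hat in predictions]

bias = mean(errors)                    # average prediction error
variance = pvariance(predictions)      # spread of predictions around their mean
mse = mean([e ** 2 for e in errors])   # mean squared error

print(bias, variance, mse)
print(math.isclose(mse, bias ** 2 + variance))
```

A prediction can therefore have a high MSE because it is systematically off (bias), because it is unstable (variance), or both.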
What is overfitting?
When a model fits the original data better but fits the live data worse.
It is a key external validity concern because it makes actual predictions worse. It is usually caused by including too many variables.
Ex. Adding extra variables that only slightly improve the fit on the original data.
What is underfitting?
When the model is too simple and is a worse fit for both the original and the live data.
Ex. Only using one variable in a regression instead of five
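The contrast between overfitting and underfitting can be sketched on a tiny made-up data set. As a stand-in for "too many variables", the sketch uses a cubic polynomial that passes through every original data point (equivalent to adding x^2 and x^3 terms with only four observations). All data values and function names are assumptions for illustration.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Underfit candidate: ignores x and always predicts the average y."""
    average = mean(ys)
    return lambda x: average


def fit_line(xs, ys):
    """Simple linear regression of y on x by ordinary least squares."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Overfit candidate: Lagrange polynomial through every original point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model on a data set."""
    return mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


# Hypothetical data: roughly y = 2x + 1 plus noise.
original_x = [0.0, 1.0, 2.0, 3.0]
original_y = [1.5, 2.5, 5.5, 6.5]
live_x = [0.5, 1.5, 2.5, 3.5]
live_y = [2.2, 3.9, 6.1, 7.8]

for name, fit in [("mean only", fit_mean),
                  ("line", fit_line),
                  ("cubic", fit_interpolating_polynomial)]:
    model = fit(original_x, original_y)
    print(name,
          round(mse(model, original_x, original_y), 3),
          round(mse(model, live_x, live_y), 3))
```

On this toy data the cubic fits the original data perfectly but does worse than the line on the live data, while the mean-only model does worse than the line on both.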
What are some ways of finding the best fit?
- Using the AIC/BIC
- Using training and test samples (cross-validation)
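The training/test idea can be sketched as a small k-fold cross-validation written from scratch. This is one possible setup, not a prescribed procedure: the data, the two candidate models, and the choice of 5 interleaved folds are all assumptions for illustration.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Candidate 1: always predict the average y of the training sample."""
    average = mean(ys)
    return lambda x: average


def fit_line(xs, ys):
    """Candidate 2: simple linear regression of y on x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def cv_mse(fit, xs, ys, k=5):
    """Average test-sample MSE over k folds.

    Each fold holds out every k-th observation as the test sample and
    fits the model on the remaining observations (the training sample).
    """
    fold_mses = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        model = fit(train_x, train_y)
        fold_mses.append(
            mean([(model(xs[i]) - ys[i]) ** 2 for i in test_idx]))
    return mean(fold_mses)


# Hypothetical data: y = 2x + 1 plus small fixed noise.
noise = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.5, -0.1]
xs = [float(x) for x in range(1, 11)]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

print("mean only:", cv_mse(fit_mean, xs, ys))
print("line:     ", cv_mse(fit_line, xs, ys))
```

The model with the lower cross-validated MSE is the one expected to predict better on data it has not seen.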
What is a downside to using BIC?
It penalizes more complex models even when they would do better at prediction in the live data.
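To see where the heavier penalty comes from, the sketch below uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of estimated parameters, n the number of observations, and ln L the maximized log-likelihood. The log-likelihoods, parameter counts, and sample size are hypothetical numbers chosen so that the two criteria disagree.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical fitted models on n = 1000 observations.
n = 1000
simple = {"log_likelihood": -520.0, "k": 3}
complex_ = {"log_likelihood": -512.0, "k": 8}

for name, m in [("simple", simple), ("complex", complex_)]:
    print(name,
          round(aic(m["log_likelihood"], m["k"]), 2),
          round(bic(m["log_likelihood"], m["k"], n), 2))
```

Here AIC prefers the complex model while BIC prefers the simple one, because each extra parameter costs ln(1000), roughly 6.9, under BIC but only 2 under AIC. The criteria alone cannot tell which model actually predicts better on live data.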