Framework for Prediction Flashcards

Question 1

Q

What is the difference between original data and live data?

Answer

A

Original data is the data we have and use to build the model.
Live data is the data we do not have yet, on which the model will be used to make predictions.

Question 2

Q

Fill in the blanks

If y is quantitative, then there is a \_\_\_\_ prediction and a \_\_\_\_ problem.
If y is binary, then there is a \_\_\_\_\_ prediction and a \_\_\_\_\_ problem.

Answer

A

quantitative prediction, regression problem
probability prediction, classification problem

Question 3

Q

What do we care less about in a predictive analysis? What do we still care about?

Answer

A

We care less about individual coefficient values, so multicollinearity is less of a concern.
We still care about the stability of the results.

Question 4

Q

What is the ideal prediction error?

Answer

A

Zero. We want the prediction to be as close as possible to the actual value.

Question 5

Q

When evaluating prediction errors, how do we define direction and size?

Answer

A

Direction: whether we over- or underpredicted the value.
Size: the absolute value of how far off the prediction was.

Question 6

Q

Define prediction error.

Answer

A

The difference between the predicted value of the target variable and its actual value for the target observation.

PE = yhat - y
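
As a small illustration of the definition above, the sketch below computes a prediction error and reports its direction and size. The function name `prediction_error` and the numbers are made up for this example.

```python
def prediction_error(y_hat, y):
    """Return (error, direction, size) for one prediction, with PE = y_hat - y."""
    error = y_hat - y
    if error > 0:
        direction = "overprediction"
    elif error < 0:
        direction = "underprediction"
    else:
        direction = "exact"
    return error, direction, abs(error)

# Hypothetical example: predicted value 5, actual value 11.
error, direction, size = prediction_error(5, 11)
print(error, direction, size)
```

With PE = yhat - y, a negative error means the prediction was too low and a positive error means it was too high.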

Question 7

Q

What are the three components of a prediction error?

Answer

A

Estimation error: the difference between the value estimated from the data and the true value of the model being estimated.
Model error: the difference between the true value of the chosen model and the best possible predictor model (we might not have the best model).
Genuine (idiosyncratic, irreducible) error: the error that remains because even the best model cannot predict every value perfectly, even if estimation error and model error are zero.

Question 8

Q

Give an example of estimation, model, and an irreducible error.

Answer

A

Estimation: From a few past trips, you estimate that the drive from your house to the store takes 5 minutes on average. Because of traffic, the true average is 11 minutes. The estimation error is 6 minutes.
Model: You used age as a predictor when the true relationship depends on age^2.
Irreducible: Even though you have collected all the data you can, there are still differences between y and yhat.

Question 9

Q

What are loss functions?

Answer

A

They assign a value to the consequences of a prediction error, that is, how bad the error is. A loss function can be symmetric or asymmetric, and linear or convex.

Question 10

Q

What does it mean for a loss function to be symmetric? Asymmetric? Convex?

Answer

A

Symmetric: positive and negative errors of the same magnitude produce the same loss.
Asymmetric: positive and negative errors of the same magnitude produce different losses.
Convex: losses grow more than proportionally with the size of the error, so larger errors are penalized disproportionately.
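
The three properties can be seen in a few toy loss functions. This is only a sketch: the function names and the cost parameters `over_cost` and `under_cost` are hypothetical, and errors follow the convention PE = yhat - y, so a negative error is an underprediction.

```python
def squared_loss(error):
    """Symmetric and convex: doubling the error quadruples the loss."""
    return error ** 2

def absolute_loss(error):
    """Symmetric and linear: loss grows in proportion to the error size."""
    return abs(error)

def asymmetric_linear_loss(error, over_cost=1.0, under_cost=3.0):
    """Asymmetric and linear: with these defaults, underprediction costs more per unit."""
    return over_cost * error if error >= 0 else under_cost * (-error)

# Compare an underprediction, an overprediction of the same size, and a larger error.
for e in (-2, 2, 4):
    print(e, squared_loss(e), absolute_loss(e), asymmetric_linear_loss(e))
```

The first two functions give the same loss for -2 and 2, while the third does not. Going from an error of 2 to 4 doubles the absolute loss but quadruples the squared loss.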

Question 11

Q

Define bias and variance for MSE.

Answer

A

Bias: the average of the prediction error. A biased prediction produces nonzero errors on average.

Variance: how the prediction varies around its average value when multiple predictions are made.
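
A quick numerical check can make the two definitions concrete. The sketch below assumes a single fixed target value and several repeated predictions of it (all numbers are made up), and verifies the standard identity that, in this setting, MSE equals squared bias plus variance.

```python
from statistics import fmean, pvariance

y = 10.0                                        # actual value of the target (hypothetical)
predictions = [11.0, 12.5, 10.5, 13.0, 12.0]    # repeated predictions of that value

errors = [y_hat - y for y_hat in predictions]   # PE = yhat - y for each prediction
bias = fmean(errors)                            # average prediction error
variance = pvariance(predictions)               # spread of predictions around their mean
mse = fmean([e ** 2 for e in errors])           # mean squared error

print(bias, variance, mse, bias ** 2 + variance)
```

Here every prediction is too high, so the bias is positive, and the predictions also differ from one another, so the variance is positive too. Both contribute to the MSE.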

Question 12

Q

What is overfitting?

Answer

A

When a model fits the original data better, but fits the live data worse.

It is a key aspect of external validity and can lead to worse actual predictions. It is usually caused by including too many variables.

Ex. Adding extra variables that only slightly improve the fit on the original data.

Question 13

Q

What is underfitting?

Answer

A

When the model is a worse fit for both the original and the live data than a more complex alternative would be.

Ex. Using only one variable in a regression instead of five.
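
The contrast between underfitting and overfitting can be simulated. The sketch below is an illustration under stated assumptions, not the regression-with-many-variables case from the cards: the data-generating process is made up, a "mean only" model stands in for an underfit model, and a model that memorizes the nearest training point stands in for an overly flexible, overfit model.

```python
import random

random.seed(0)

def make_data(n):
    """Simulate y = 2 + 3x + noise (a made-up data-generating process)."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 + 3 * x + random.gauss(0, 2)) for x in xs]

def fit_mean(train):
    """Underfit: ignore x and always predict the average y."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg

def fit_line(train):
    """Simple linear regression of y on x by least squares."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxy = sum((x - mx) * (y - my) for x, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def fit_memorize(train):
    """Overfit: return the y of the closest training x (reproduces the training data)."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared prediction error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

original = make_data(50)    # data used to build the models
live = make_data(500)       # stands in for data we do not have yet

results = {}
for name, fit in [("mean only", fit_mean), ("linear", fit_line), ("memorize", fit_memorize)]:
    model = fit(original)
    results[name] = (mse(model, original), mse(model, live))
    print(f"{name:10s} original MSE {results[name][0]:8.2f}   live MSE {results[name][1]:8.2f}")
```

The memorizing model fits the original data perfectly but does worse than the linear model on the simulated live data, while the mean-only model does worse than the linear model on both.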

Question 14

Q

What are some ways of finding the best fit?

Answer

A

Using information criteria such as AIC or BIC
Using training and test samples (cross-validation)

Question 15

Q

What is a downside to using BIC?

Answer

A

It penalizes more complex models even when they would do better at prediction in the live data.

Question 16

Q

T/F: The training-test split avoids overfitting the training set, but it may overfit the test set.

Question 17

Q

Define k-fold cross-validation.

Answer

A

It is a training-test split repeated k times, with each split called a fold. The prediction performance of a model is evaluated across the folds.

Question 18

Q

T/F: Machine learning is an umbrella concept for methods that use algorithms to find patterns in data and use them for prediction purposes.

Question 19

Q

Fill in the blank

To find the best fit using 5-fold cross-validation, you average \_\_\_\_.

Answer

A

The goodness of fit on the test sets across the 5 folds
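
The averaging step can be sketched in plain Python. Everything here is an assumption made for illustration: the simulated data, the two candidate models, the helper name `k_fold_cv`, and the use of MSE as the measure of fit.

```python
import random

def fit_mean(train):
    """Candidate 1: always predict the average y."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg

def fit_line(train):
    """Candidate 2: simple linear regression of y on x."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxy = sum((x - mx) * (y - my) for x, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def k_fold_cv(data, fit, k=5, seed=0):
    """Return the test-set MSE of `fit` averaged across k folds."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]      # k roughly equal parts
    fold_mses = []
    for i in range(k):
        test = folds[i]                         # fold i is held out as the test set
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = fit(train)                      # fit on the other k - 1 folds
        fold_mses.append(sum((model(x) - y) ** 2 for x, y in test) / len(test))
    return sum(fold_mses) / k                   # average fit across the folds

# Made-up original data: y = 2 + 3x + noise.
rng = random.Random(1)
data = [(x, 2 + 3 * x + rng.gauss(0, 2)) for x in (rng.uniform(0, 10) for _ in range(100))]

cv = {name: k_fold_cv(data, fit) for name, fit in [("mean only", fit_mean), ("linear", fit_line)]}
best = min(cv, key=cv.get)
print(cv, "best:", best)
```

Each observation is used for testing exactly once, and the model with the lowest average test MSE is chosen as the best fit.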

Question 20

Q

How can adding more x variables and interactions lead to over- and underfitting?

Answer

A

Adding variables and interactions can overfit the model by capturing patterns that are specific to the original data but are not present in the live data. Alternatively, the added variables may be insignificant in the regression (in effect adding noise), so the model does not accurately represent either the original or the live data.

Question 21

Q

Consider forecasting whether it will rain tomorrow. Is it likely that people will have the same loss function? Why or why not?

Answer

A

No, they will likely not have the same loss function. For example, someone who lives close to the beach will not lose as much as someone who lives in a landlocked state and is travelling for vacation. The person who lives nearby may lose one good beach day out of the year, but the travelling person loses more, as this might have been the only day they could visit the beach.