Prediction with the linear model Flashcards
week 4
The aim of prediction
The aim of prediction is to estimate the value of a numeric target variable for new observations.
What do we call the variable we're estimating?
The variable we're estimating is called the target, outcome, or response variable.
What do we call the variables we use to do the estimation?
The variables used for the estimation are called the predictors, covariates, or features.
training data
We typically have a training data set containing both the target and the predictors, which we use to build a model for prediction.
testing data
We then have a testing data set which contains only the predictors; the goal is to estimate the target for these observations.
For testing purposes we often set aside some labelled data as an artificial test set or validation data set, so that we can check how well our model is performing by comparing the predicted target against the actual target.
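A minimal NumPy sketch of holding out a validation set. The data, the column meanings, and the 80/20 split ratio are all made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))                      # predictors (illustrative)
y = 3 * X[:, 0] - X[:, 1] + rng.normal(size=n)   # target (illustrative)

# Shuffle the row indices, then hold out the last 20% as a validation set
idx = rng.permutation(n)
split = int(0.8 * n)
train_idx, val_idx = idx[:split], idx[split:]

X_train, y_train = X[train_idx], y[train_idx]    # used to fit the model
X_val, y_val = X[val_idx], y[val_idx]            # used only to check predictions
```

After fitting on the training part, predictions on `X_val` can be compared against the known `y_val` to gauge performance.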
Methods of prediction
Linear regression models
Regression trees
Random Forests
(Simple!) Neural networks
binary indicator variables
To handle factors, we introduce binary indicator variables (taking the value zero or one), and use these to
code each level of a factor
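A sketch of coding a factor with binary indicators, using a made-up three-level colour factor and dropping one level as the baseline:

```python
import numpy as np

# Hypothetical factor with three levels
colour = np.array(["red", "green", "blue", "green", "red"])
levels = ["red", "green", "blue"]

# One 0/1 indicator column per non-baseline level ("red" is the baseline)
indicators = np.column_stack(
    [(colour == lev).astype(int) for lev in levels[1:]]
)
# Row i is (1, 0) for green, (0, 1) for blue, and (0, 0) for the baseline red
```

Dropping the baseline level avoids making the indicator columns collinear with the intercept in a linear model.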
Prediction intervals
Prediction intervals give a range of values associated with a specified level of confidence, e.g. being 95% sure that the predicted value lies in the range (a, b).
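A rough sketch of a 95% prediction interval for simple linear regression, using only NumPy and a normal approximation (z = 1.96) rather than the exact t quantile; the data and the new point x0 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2 + 0.5 * x + rng.normal(scale=1.0, size=50)

# Fit y = b0 + b1*x by least squares (polyfit returns highest degree first)
b1, b0 = np.polyfit(x, y, deg=1)
resid = y - (b0 + b1 * x)
s2 = resid @ resid / (len(x) - 2)     # residual variance estimate

x0 = 5.0                               # new point to predict at
yhat = b0 + b1 * x0
# Standard error for a *new observation*: the leading "1 +" term is what
# makes a prediction interval wider than a confidence interval for the mean
se_pred = np.sqrt(
    s2 * (1 + 1 / len(x) + (x0 - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum())
)
lo, hi = yhat - 1.96 * se_pred, yhat + 1.96 * se_pred   # approx 95% PI
```

Dropping the "1 +" inside the square root gives the (narrower) confidence interval for the mean response at x0 instead.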
difference between prediction and confidence intervals
A prediction interval is wider (less certain) than the corresponding confidence interval. A prediction interval covers a single future observation, whereas a confidence interval covers the mean response. Loosely: a prediction interval is about a future individual value, whereas a confidence interval is about a parameter estimated from the data already observed.
Bias
Bias is a systematic error in prediction: it is consistent rather than random. In the shooting analogy, even if you shoot many times, your shots always miss in the same way.
Variance
Variance is a measure of the variability of a prediction: how much your shots vary from each other. Sometimes a shot goes too far left, sometimes too far right, and sometimes it's just right; variance measures how much the shots differ from one another.
Decomposition of the MSE
Mean squared error can be decomposed into squared bias and variance components
Mean squared error
The overall accuracy of a prediction can be measured by its mean squared error (MSE).
Mean squared error can be decomposed into variance and squared bias components.
Accepting some bias will be advantageous if it results in a more substantial decrease in variance.
In practice we will want to use a prediction model that gets the right balance between prediction variance
and bias so as to minimize the prediction MSE.
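A small simulation sketch of the decomposition MSE = bias² + variance, using a made-up shrinkage estimator of a known mean (shrinking trades some bias for lower variance):

```python
import numpy as np

rng = np.random.default_rng(2)
true_mean = 4.0

# Estimator: sample mean shrunk by a factor 0.8 -> biased, but lower variance
estimates = np.array([
    0.8 * rng.normal(true_mean, 2.0, size=10).mean()
    for _ in range(20000)
])

mse = np.mean((estimates - true_mean) ** 2)
bias_sq = (estimates.mean() - true_mean) ** 2   # squared bias component
variance = estimates.var()                      # variance component
# mse == bias_sq + variance (an exact algebraic identity for these sample
# quantities, mirroring the population-level decomposition)
```

Here the squared bias is about (0.8·4 − 4)² = 0.64; accepting it is worthwhile only if the variance reduction from shrinking outweighs it.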
The situation where the mean squared error (MSE) on the test data is much higher than the training MSE is a common phenomenon in machine learning. It is typically attributed to the model's inability to generalize to unseen data.
Overfitting: During training, the model tries to capture all the patterns and nuances present in the training data, including noise. If the model becomes too complex or is trained for too long, it may start to memorize the training data instead of learning general patterns. This phenomenon is called overfitting. As a result, the model performs well on the training data (low training MSE) but poorly on new, unseen data (high test MSE).
Generalization: When the model encounters new data during testing, it may encounter patterns or variations that it hasn’t seen before. If the model hasn’t learned to generalize well from the training data, it may struggle to make accurate predictions on this new data. This leads to higher MSE on the test data compared to the training data.
Data Distribution: Sometimes, the distribution of the test data may be different from the distribution of the training data. If the model is trained on one type of data but tested on another type, its performance may suffer. This emphasizes the importance of having representative training and test datasets.
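The overfitting story above can be sketched with polynomial fits to noisy data: a high-degree polynomial drives the training MSE down while the test MSE grows. The degrees, sample sizes, and sine-shaped truth are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=15)
x_test = np.linspace(0.01, 0.99, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=200)

def train_test_mse(deg):
    """Fit a degree-`deg` polynomial on the training data; return both MSEs."""
    coef = np.polyfit(x_train, y_train, deg)
    tr = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    te = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    return tr, te

tr3, te3 = train_test_mse(3)     # moderate complexity
tr14, te14 = train_test_mse(14)  # degree 14 can interpolate all 15 points
# Expect: tr14 < tr3 (memorizes the training noise) but te14 > te3
```

The degree-14 fit is the overfitting case: near-zero training error, large test error, i.e. low bias but high variance.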
The Bias-Variance trade-off
The bias-variance tradeoff implies that as we increase the complexity of a model, its variance increases and its bias decreases. Conversely, as we decrease the model's complexity, its variance decreases, but its bias increases.