Measuring Quality of Fit Flashcards

1
Q

What is the most commonly used measure of fit in regression?

A

Mean squared error (MSE): sum the squared differences between the predicted and the actual response values and divide by n, i.e. average the squared prediction errors.
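A minimal sketch of the computation in Python (NumPy assumed; y and y_hat are hypothetical names for the actual and predicted responses):

import numpy as np

def mse(y, y_hat):
    # Average of the squared prediction errors
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

# Made-up example: errors of 0.5, 0.0 and 0.5 give MSE = (0.25 + 0 + 0.25) / 3 ~= 0.167
print(mse([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))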

2
Q

What does “overfitting the data” refer to?

A

The statistical learning procedure is working too hard to find patterns in the training data and may be picking up patterns that are caused by random chance rather than by true properties of the unknown function f.

Overfitting refers to the case in which a less flexible model would have yielded a smaller test MSE.

3
Q

What is the relationship between model flexibility, the test MSE (mean squared error), and the training MSE? This is a fundamental property of statistical learning models.

A

As model flexibility increases, the training MSE will decrease. However, the relationship between model flexibility and the test MSE is U-shaped: as flexibility increases, the test MSE will initially decrease, but at a certain point it will begin to rise again. When the test MSE rises while the training MSE keeps falling, we are said to be overfitting the data.
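A minimal sketch of this behaviour in Python, assuming NumPy and scikit-learn are available (the data are simulated from a made-up nonlinear f plus noise, and polynomial degree stands in for model flexibility):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 200).reshape(-1, 1)
y = np.sin(2 * x).ravel() + rng.normal(scale=0.3, size=200)   # made-up true f plus noise
x_train, x_test, y_train, y_test = x[:100], x[100:], y[:100], y[100:]

for degree in (1, 3, 10, 20):          # increasing model flexibility
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    train_mse = mean_squared_error(y_train, model.predict(poly.transform(x_train)))
    test_mse = mean_squared_error(y_test, model.predict(poly.transform(x_test)))
    print(degree, round(train_mse, 3), round(test_mse, 3))
# Typically the training MSE keeps falling as the degree grows, while the test MSE
# falls at first and then rises once the fit starts chasing noise (overfitting).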

4
Q

What is the “bias-variance trade off” in statistical learning methods?

A

Variance: the amount by which the estimated model would change if we estimated it using a different training data set.

Bias: the error that is introduced by approximating a real-life problem by a much simpler model

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. It is fairly easy to obtain a method with high variance and low bias (very flexible model) or a method with low variance and high bias (rigid model like linear regression). The challenge is to find a model with low variance and low bias.
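The standard decomposition behind this trade-off, for the expected test MSE at a point x_0, written in LaTeX:

E\left[\big(y_0 - \hat{f}(x_0)\big)^2\right] = \operatorname{Var}\big(\hat{f}(x_0)\big) + \big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \operatorname{Var}(\varepsilon)

A good method keeps both the variance and the squared bias small; the irreducible error Var(ε) cannot be reduced by any choice of model.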

5
Q

What is the most common approach for quantifying the accuracy of classification models?

A

The training error rate: the proportion of mistakes made when we compare the model’s predicted class labels to the actual labels for the training observations.
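A minimal sketch in Python (NumPy assumed; y and y_hat are hypothetical names for the actual and predicted class labels):

import numpy as np

def training_error_rate(y, y_hat):
    # Proportion of observations the classifier labels incorrectly
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(y != y_hat)

# Made-up example: 1 mistake out of 4 predictions -> 0.25
print(training_error_rate(["a", "b", "a", "a"], ["a", "b", "b", "a"]))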

6
Q

What is the “K-Nearest Neighbors” approach?

A

KNN is a non-parametric method that can be used for both classification and regression.

For classification, an object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.

For regression, the output is a value for the object: the average of the values of its k nearest neighbors.

In short, to predict the outcome for a new observation, you look at the k most similar observations in the data you already have and take the most common class (or, for regression, the average value) among them.

Despite its simplicity, this method tends to be fairly accurate.

The selection of k matters a lot: a small k gives a very flexible (high-variance, low-bias) fit, while a large k gives a smoother (low-variance, high-bias) fit.
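A minimal sketch of the classification case from scratch in Python (NumPy assumed; the data and names are illustrative):

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    # Predict the class of x_new by a plurality vote of its k nearest training points
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    votes = Counter(np.asarray(y_train)[nearest])   # count class labels among the neighbors
    return votes.most_common(1)[0][0]

# Toy example: two made-up clusters of points
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["blue", "blue", "blue", "orange", "orange", "orange"]
print(knn_classify(X, y, [0.5, 0.5], k=3))  # -> blue
print(knn_classify(X, y, [5.5, 5.5], k=3))  # -> orange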
