Chapter 2 - Statistical Learning Flashcards
Explain typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves as we go from less flexible statistical learning methods towards more flexible approaches.
Exercise 2.4.3
Flexibility can be increased by adding parameters; the simplest model, which predicts only the mean of the response, sits at the inflexible end.
Bayes error is the irreducible error: a flat line that does not depend on flexibility. No model's expected test error can fall below it.
Training error starts relatively high when the model is inflexible and decreases steadily as flexibility increases. If the training error drops below the Bayes error, the model is overfitting.
Bias follows the training error: the more flexible the model, the more closely it fits the points and the lower the bias.
Variance behaves in the opposite way to bias: it grows with flexibility. In a high-variance model, small changes in the training data cause large changes in the fitted model.
Test error typically sits above the training error. As flexibility increases, test error first decreases, then rises again once the model starts to overfit; this U-shape is the bias-variance trade-off, sketched in the code below.
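A minimal sketch of these curves, assuming NumPy and scikit-learn are available; polynomial degree stands in for flexibility, and the data, degrees, and noise level are illustrative choices only.

```python
# Training error keeps falling as flexibility (polynomial degree) grows,
# while test error traces a U-shape above the irreducible error (0.3^2 = 0.09).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)  # noisy non-linear truth
x_train, y_train, x_test, y_test = x[:100], y[:100], x[100:], y[100:]

for degree in [1, 3, 10, 20]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training MSE shrinks steadily with degree, while test MSE bottoms out near the true complexity and then climbs back up.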
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
Exercise 2.4.6
Parametric approaches assume a functional form for the relationship in the data, for example that it is linear. The problem then reduces to estimating the parameters of that assumed form.
Non-parametric approaches make no such assumption: no functional form is fixed in advance. They therefore require a very large dataset to make accurate predictions.
Advantages of parametric approaches: computation is much simpler, the model is more interpretable, and less data is required, because only a small number of parameters needs to be estimated.
Disadvantages of parametric approaches: we might assume the wrong underlying functional form, and if we compensate by choosing a very flexible parametric model, there is a high risk of overfitting.
Advantages/Disadvantages of non-parametric approaches: no assumption about the underlying function or the relationship between the variables is needed, but estimating that function requires a large dataset. These models work well for complex relationships, which covers many real-life applications, since most things are not linear. Predictions will likely be better with a non-parametric model, but it is often a black box whose workings we cannot explain or interpret, and the risk of overfitting is high. The sketch below contrasts the two approaches.
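A hedged sketch of the contrast, on synthetic data and assuming scikit-learn: the parametric fit compresses everything into two numbers, while the non-parametric fit keeps the training data and averages nearby responses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(300, 1))
y = np.log1p(x).ravel() + rng.normal(scale=0.2, size=300)  # non-linear truth

# Parametric: assumes y = b0 + b1*x; the whole fit is two estimated numbers.
linear = LinearRegression().fit(x, y)
print("intercept:", linear.intercept_, "slope:", linear.coef_[0])

# Non-parametric: no assumed form; predictions average the 10 nearest responses.
knn = KNeighborsRegressor(n_neighbors=10).fit(x, y)
print("linear R^2:", linear.score(x, y), "KNN R^2:", knn.score(x, y))
```

Because the truth here is non-linear, the KNN fit scores higher, but it offers no compact, interpretable summary the way the two linear coefficients do.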
K-nearest neighbors - If the Bayes decision boundary in this problem is highly non linear, then would we expect the best value for K to be large or small? Why?
Exercise 2.4.7
The best value of K should be small. With a large K we average over many neighbors, which smooths the decision boundary toward linear; with a small K the KNN model is more flexible and can trace a highly non-linear boundary. → Highly non-linear boundaries call for a small value of K: more flexibility. See the sketch below.
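A minimal sketch, assuming scikit-learn; the circular class boundary and the K values are illustrative choices only.

```python
# Small K follows the highly non-linear (circular) boundary; large K
# averages over so many neighbors that the boundary over-smooths.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)  # circle = non-linear boundary

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for k in [1, 5, 50, 200]:
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"K = {k:3d}: test accuracy {acc:.2f}")
```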
Is nearest neighbor averaging good for small or large values of p?
Small p
Curse of dimensionality
In high dimensions, even the nearest neighbors tend to be far away, so we are forced to average over a large neighborhood and the estimate stops being local. See the sketch below.
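A minimal sketch of this effect, assuming scikit-learn: with the sample size fixed, the mean distance to the nearest neighbor grows rapidly with the dimension p.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
n = 1000
for p in [1, 2, 10, 50, 100]:
    X = rng.uniform(0, 1, size=(n, p))  # n points in the unit hypercube
    # Distance from each point to its nearest neighbor (column 0 is the point itself).
    dist, _ = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    print(f"p = {p:3d}: mean nearest-neighbor distance {dist[:, 1].mean():.3f}")
```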
The output is qualitative - Regression or classification?
Classification
The output is quantitative - Regression or classification?
Regression
Give examples of classification problems
Whether water is drinkable
Passing or failing an exam
Recognizing animals in pictures
Diagnosing a disease
Give examples of regression problems
Money spent yearly on medical care
House prices
Predicting profit and sales
Give examples when we can use cluster analysis
Product recommendations
Anomaly detection
Image segmentation
Advantages of very flexible models
- Can capture complex relations
- Incorporate more variables and learn more about how they relate to each other and the desired output
- Lower bias: we can fit the data more closely
- Fewer assumptions need to be made beforehand
Disadvantage of very flexible models
- Higher risk of overfitting
- Can become computationally expensive
- Needs more data to train the model
- Higher variance
- Choices must still be made beforehand, such as which variables to include and how much flexibility to allow
When to use less flexible models
- We do not have a lot of data and/or variables
- Interpretability and easy illustration are priorities
- Inference (understanding how the predictors relate to the response) is the goal, as in the sketch below
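A minimal sketch of the inference case, on synthetic data and assuming scikit-learn: each coefficient of a linear fit can be read directly as the estimated effect of one predictor, something a flexible black-box model does not offer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))  # three predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)  # x3 has no effect

model = LinearRegression().fit(X, y)
for name, coef in zip(["x1", "x2", "x3"], model.coef_):
    print(f"{name}: estimated effect {coef:+.2f}")  # roughly +2.00, -1.00, +0.00
```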
When to use more flexible models
- When a simpler model does not perform well enough
- When we think the underlying relation is complex
- High quality and detailed predictions are a priority
- When the data set is large and we have computational power
- Interpretability is not crucial