Chapter 2 - Statistical Learning Flashcards

1
Q

3 important facts about e in Y = f(X) + e

A

e is the random error term; it is assumed to be independent of X; and it has mean 0, i.e., E(e) = 0.

2
Q

define prediction

A

prediction is when a set of input variables X is known, but the response Y cannot be easily obtained. Since E(e) = 0, the formula becomes Yhat = fhat(X). fhat is a black box in prediction, because we don't care about its form as long as it predicts Y accurately. The difference between Yhat and Y comes from reducible error (fhat is an imperfect representation of f) and irreducible error (Y contains e).
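
A minimal simulation of this decomposition (the true f, the estimate fhat, and the noise level below are all invented for illustration): the gap between the overall error and Var(e) is the reducible part.

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: np.sin(2 * x)             # the "true" f (normally unknown)
fhat = lambda x: 1.8 * x - 0.9 * x**2   # an imperfect estimate of f

x = rng.uniform(0, 2, 100_000)
e = rng.normal(0, 0.5, x.size)          # error: mean 0, independent of X
y = f(x) + e                            # Y = f(X) + e

mse = np.mean((y - fhat(x)) ** 2)           # total prediction error
reducible = np.mean((f(x) - fhat(x)) ** 2)  # shrinks as fhat improves
irreducible = np.var(e)                     # Var(e): a floor we cannot beat
print(mse, reducible + irreducible)         # approximately equal
```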

3
Q

define inference

A

inference is when we try to understand how the response is affected by changes in X. In this scenario we care about estimating f, but not necessarily about predicting Y. f is not a black box; we need to know its exact form. Inference asks: which predictors are associated with the response? what is the relationship between the response and each predictor? is f linear?

4
Q

explain the tradeoff between solving for prediction and solving for inference

A

inference favors simpler models that are easier to interpret but whose predictions are not as accurate; prediction favors complex models that are harder to interpret but make better predictions.

5
Q

what are the classes of statistical learning methods?

A

parametric and non-parametric.

6
Q

facts about parametric statistical learning methods

A

parametric methods make an assumption about the form of f before you begin, and then select a procedure (e.g., OLS) to fit the model to the training data. Parametric is easier because it is simpler to fit model coefficients once you have a model assumption (rather than finding a completely new f). Unfortunately, the assumption we make about the model form is usually wrong; we can make the model more flexible, but that can lead to over-fitting. Parametric methods reduce the problem of estimating f to estimating a few coefficients.
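
A sketch of this two-step workflow under an assumed linear form for f (the data and true coefficients below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, n)

# Step 1: model assumption, f(X) = b0 + b1*X1 + b2*X2
A = np.column_stack([np.ones(n), X])

# Step 2: fit the assumed model to the training data with OLS
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # estimating f reduced to estimating 3 coefficients
```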

7
Q

why do we want to estimate f in Y = f(X) + e?

A

prediction or inference

8
Q

facts about nonparametric statistical learning methods

A

no assumption is made about the form of f, which allows a wide range of possible shapes for f. The drawback of nonparametric methods is that they require a large number of observations to get an accurate approximation of f.
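
For contrast with the parametric sketch above, a toy nonparametric estimate of f via k-nearest-neighbor averaging (the data and k are invented): no shape is assumed for f, but accurate local averages need plenty of observations.

```python
import numpy as np

def knn_regress(x0, x_train, y_train, k=10):
    # fhat(x0) is simply the mean response of the k nearest neighbors
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(2)
x_train = rng.uniform(0, 3, 500)
y_train = np.sin(x_train) + rng.normal(0, 0.2, 500)

print(knn_regress(1.0, x_train, y_train))  # roughly sin(1.0) = 0.84
```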

9
Q

unsupervised learning

A

we only have a data matrix; the goal is to find meaningful relationships between variables (correlation), low-dimensional representations of the data (PCA), or meaningful groupings of observations (clustering)
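
A minimal sketch of one of these goals, PCA via the SVD on an invented data matrix (note there is no response variable anywhere):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))   # a data matrix only: no labels
X = X - X.mean(axis=0)          # center each variable

U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt[:2].T           # project onto first 2 principal components
print(scores.shape)             # (100, 2): a low-dimensional representation
```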

10
Q

supervised learning

A

input variables + output variables. If the output variable is quantitative, this is a regression problem; if it is qualitative, it's a classification problem. Our goal is to learn f (the true function of the data) using the training set, where Y = f(X) + epsilon.

11
Q

prediction error

A

the goal in supervised learning is to minimize prediction error. For regression problems this is usually the MSE = E[(Y - fhat(X))^2]. We have to compute the training MSE because we don't know the true (test) MSE.
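
Computing the training MSE is just averaging squared residuals over the training observations (toy numbers for illustration):

```python
import numpy as np

y = np.array([3.1, 0.5, 2.2, 4.0])     # observed responses
yhat = np.array([2.9, 0.7, 2.5, 3.6])  # fitted values fhat(x_i)

train_mse = np.mean((y - yhat) ** 2)   # average squared residual
print(train_mse)                       # 0.0825
```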

12
Q

Bias Variance Decomposition

A

MSE = E[(Y - fhat(X))^2] = Var(fhat(X)) + [Bias(fhat(X))]^2 + Var(epsilon). Var(epsilon) is the irreducible error; the other terms are the variance of the estimate and its squared bias, both of which are always non-negative. More flexibility implies higher variance and lower bias (the goal is to minimize both sources of error simultaneously). You only know the bias and variance if you know the true curve. Variance measures how much the estimate fhat(x0) changes when we sample new training data.
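
A simulation sketch of this decomposition at one point x0, possible only because the true curve is chosen by us: refit a deliberately rigid linear fhat on many fresh training sets, then read off the variance and squared bias of fhat(x0).

```python
import numpy as np

rng = np.random.default_rng(4)
f = np.sin                      # the true curve (known only in simulation)
x0, sigma, reps = 1.0, 0.3, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 3, 50)                # a fresh training sample
    y = f(x) + rng.normal(0, sigma, 50)
    b = np.polyfit(x, y, deg=1)              # rigid (biased) linear fhat
    preds[r] = np.polyval(b, x0)             # fhat(x0) for this sample

variance = preds.var()                       # spread across training sets
bias_sq = (preds.mean() - f(x0)) ** 2        # squared bias at x0
print(variance + bias_sq + sigma ** 2)       # ~ expected MSE at x0
```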

13
Q

Classification problems

A

The output takes values in a discrete set. Y is not necessarily real-valued, so we use a different notation, and we use the training error rate (i.e., the misclassification rate) instead of the MSE.
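
The training error rate is just the fraction of misclassified training observations (toy labels for illustration):

```python
import numpy as np

y = np.array([0, 1, 1, 0, 1])     # true classes
yhat = np.array([0, 1, 0, 0, 0])  # predicted classes

err_rate = np.mean(y != yhat)     # misclassification rate: here 2/5 = 0.4
print(err_rate)
```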

14
Q

Bayes Classifier

A

The test error rate is minimized by a simple classifier that assigns each observation to the most likely class, given its predictor values: yhat_i = argmax_j P(Y = j | X = x_i). The error rate of the Bayes classifier (i.e., the Bayes error rate) is 1 - E[max_j P(Y = j | X)]. It is the decision rule you would use if you had an infinite amount of data.
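
A sketch of the Bayes classifier for an invented one-dimensional two-class problem where we pretend to know the distribution of (X, Y); it is exactly this knowledge that makes the Bayes classifier unattainable in practice.

```python
import numpy as np

def gauss_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

priors = np.array([0.5, 0.5])    # P(Y = j), assumed known
means, sds = np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def bayes_classify(x):
    # P(Y = j | X = x) is proportional to prior * class-conditional density
    post = priors * gauss_pdf(x, means, sds)
    return np.argmax(post)

print(bayes_classify(0.3))  # assigns the most likely class at x = 0.3
```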

15
Q

KNN (comment on decision boundary shape wrt K)

A

imagine our blue and yellow classification problem (the purple dashed line is the Bayes boundary, known because the distribution of (X, Y) is known). To assign a color to a point x0, you look at its k nearest neighbors and take a majority vote; the vote can also be restricted to a certain radius or weighted by distance. KNN has a decision boundary, and the higher the K, the smoother the decision boundary.
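
A minimal KNN classifier on invented 2-D data: a majority vote among the k nearest training points. Raising k averages each vote over more neighbors, which is what smooths the decision boundary.

```python
import numpy as np
from collections import Counter

def knn_predict(x0, X_train, y_train, k=5):
    dists = np.linalg.norm(X_train - x0, axis=1)  # distance to each point
    nearest = np.argsort(dists)[:k]               # indices of k nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

rng = np.random.default_rng(6)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

print(knn_predict(np.array([0.5, 0.5]), X_train, y_train, k=15))
```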
