Chapter 2 - Statistical Learning Flashcards
3 important facts about ε from Y = f(X) + ε
ε is the random error term; it is assumed to be independent of X and to have mean 0 (E(ε) = 0).
define prediction
prediction is when the set of input variables is known but the response cannot be easily obtained. since E(ε) = 0, the formula becomes Yhat = fhat(X). fhat is a black box in prediction, because we don't care about its form as long as it predicts Y accurately. The difference between Yhat and Y comes from reducible error (fhat is an imperfect representation of f) and irreducible error (Y also contains ε, which no estimate of f can remove).
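A minimal numpy sketch (the simulated f, fhat, and noise level are illustrative assumptions, not from the text): even the true f leaves irreducible error ≈ Var(ε), while an imperfect fhat adds reducible error on top.

```python
import numpy as np

# Minimal simulation of Y = f(X) + eps with an imperfect estimate fhat.
# The MSE of the true f is roughly Var(eps) (irreducible error);
# the MSE of fhat is larger by the reducible error.
rng = np.random.default_rng(0)

f = lambda x: np.sin(x)                       # assumed "true" f (illustration only)
fhat = lambda x: 0.9 * np.sin(0.8 * x)        # a deliberately imperfect estimate

x = rng.uniform(-2, 2, size=100_000)
y = f(x) + rng.normal(0.0, 0.5, size=x.size)  # eps ~ N(0, 0.25), independent of X

mse_true_f = np.mean((y - f(x)) ** 2)         # ~ 0.25 = Var(eps): irreducible
mse_fhat = np.mean((y - fhat(x)) ** 2)        # irreducible + reducible error
print(mse_true_f, mse_fhat)
```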
define inference
inference is when we try to understand how the response is affected by changes in X. in this scenario we care about estimating f, but not necessarily about predicting Y. f is not a black box: we need to know its exact form. typical inference questions: which predictors are associated with the response? what is the relationship between the response and each predictor? can the relationship be summarized by a linear equation, or is it more complex?
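A minimal sketch of an inference-style question (the simulated data and coefficient interpretation are illustrative assumptions; a real analysis would also look at standard errors):

```python
import numpy as np

# Which predictors are associated with the response? Simulate data where only
# X1 matters, fit a linear model by least squares, and inspect the coefficients.
rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 2))                  # two candidate predictors X1, X2
y = 3.0 * X[:, 0] + rng.normal(size=n)       # only X1 affects the response

A = np.column_stack([np.ones(n), X])         # design matrix with an intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("intercept, beta1, beta2:", coef)      # beta1 ~ 3, beta2 ~ 0
```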
explain the tradeoff between solving for prediction and solving for inference
inference = simpler models that are easier to interpret, even though their predictions may be less accurate. prediction = more complex models that are harder to interpret but make better predictions.
what are the classes of statistical learning methods.
parametric and non-parametric.
facts about parametric statistical learning methods
parametric methods make an assumption about the form of f before fitting (e.g., f is linear), then select a procedure (e.g., OLS) to fit that model to the training data. this is easier because estimating coefficients for an assumed form is simpler than finding a completely new f. unfortunately, the assumed form is usually not exactly right; we can make the model more flexible, but that can lead to over-fitting. in short, parametric methods reduce the problem of estimating f to estimating a few coefficients.
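A hedged sketch of the parametric idea (the simulated data and the linear assumption are illustrative, not from the text): assume f(X) = b0 + b1·X, so estimating f reduces to estimating two coefficients by OLS, even though the assumed form is not exactly right here.

```python
import numpy as np

# Assume f(X) = b0 + b1*X; estimating f reduces to estimating two coefficients
# by OLS. The true f here is exponential, so the assumed linear form is not
# exactly right, but the fitting problem is still just two numbers.
rng = np.random.default_rng(2)
x = rng.uniform(0, 3, size=200)
y = np.exp(0.5 * x) + rng.normal(0.0, 0.2, size=x.size)   # simulated training data

A = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"fhat(X) = {b0:.2f} + {b1:.2f} * X")
```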
why do we want to estimate f in Y = f(x) + e?
prediction or inference
facts about nonparametric statistical learning methods
no assumption is made about the form of f, which allows a wide range of possible shapes for f. the drawback of nonparametric methods is that they require a large number of observations to obtain an accurate approximation of f.
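A hedged sketch of a nonparametric fit (k-nearest-neighbor regression on made-up data; the choice k = 20 is arbitrary): no functional form is assumed, fhat(x0) is just a local average, and accuracy depends on having many observations near x0.

```python
import numpy as np

# k-nearest-neighbor regression: no assumed form for f; fhat(x0) is the average
# response of the k training points closest to x0.
rng = np.random.default_rng(3)
x_train = rng.uniform(-3, 3, size=1000)
y_train = np.sin(x_train) + rng.normal(0.0, 0.3, size=x_train.size)

def knn_regress(x0, k=20):
    nearest = np.argsort(np.abs(x_train - x0))[:k]   # indices of the k closest points
    return y_train[nearest].mean()

print(knn_regress(1.0), np.sin(1.0))   # local-average estimate vs. true f(1.0)
```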
unsupervised learning
start from a data matrix with no response variable; the goals are to find meaningful relationships between variables (correlation), find low-dimensional representations of the data (PCA), and find meaningful groupings of observations (clustering).
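A minimal PCA sketch (simulated data matrix; PCA computed via the SVD of the centered data, which is one standard way to do it): a low-dimensional representation found without any response variable.

```python
import numpy as np

# PCA via the SVD of the centered data matrix.
rng = np.random.default_rng(4)
latent = rng.normal(size=(300, 1))                            # one underlying factor
X = latent @ np.array([[2.0, -1.0, 0.5]]) + 0.1 * rng.normal(size=(300, 3))

Xc = X - X.mean(axis=0)                                       # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print(s**2 / np.sum(s**2))        # proportion of variance explained per component
scores = Xc @ Vt[0]               # 1-D scores capturing most of the 3-D structure
```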
supervised learning
input variables + an output variable. if the output variable is quantitative this is a regression problem; if it is qualitative (categorical) it is a classification problem. our goal is to learn f (the true function relating X to Y) using the training set: Y = f(X) + ε.
prediction error
the goal in supervised learning is to minimize prediction error. for regression problems this is usually measured by the MSE = E(Y - fhat(X))^2. since we cannot compute the true (test) MSE, in practice we compute the training MSE, which tends to understate the test error.
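A tiny sketch of computing the training MSE (the observed and fitted values are made-up numbers):

```python
import numpy as np

# Training MSE: mean squared difference between observed responses and fitted values.
y = np.array([3.1, 2.8, 4.0, 5.2])       # observed responses (made-up numbers)
y_hat = np.array([3.0, 3.0, 3.9, 5.0])   # fitted values fhat(x_i)
print(np.mean((y - y_hat) ** 2))
```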
Bias Variance Decomposition
MSE = E(Y - fhat(X))^2 = Var(fhat(X)) + [Bias(fhat(X))]^2 + Var(ε). Var(ε) is the irreducible error; the other two terms are the variance of the estimate fhat(X) and its squared bias. Both variance and squared bias are always nonnegative. More flexibility generally means higher variance and lower bias (the goal is to minimize both sources of error simultaneously). You can only compute bias and variance exactly if you know the true f. Variance measures how much the estimate fhat(x0) changes when we sample new training data.
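A hedged simulation sketch of the decomposition (the true f, noise level, and the deliberately inflexible linear fit are all assumptions): because f is known in the simulation, we can estimate Var(fhat(x0)) and Bias² directly by refitting on many fresh training sets.

```python
import numpy as np

# Estimate the variance and squared bias of fhat(x0) by refitting on many
# fresh training sets drawn from a known model Y = f(X) + eps.
rng = np.random.default_rng(5)
f = lambda x: np.sin(2 * x)              # assumed true f (known only in simulation)
x0, sigma, n = 1.0, 0.3, 50

fits = []
for _ in range(2000):                    # new training data on every round
    x = rng.uniform(0, np.pi, size=n)
    y = f(x) + rng.normal(0.0, sigma, size=n)
    A = np.column_stack([np.ones_like(x), x])    # deliberately inflexible linear fit
    b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]
    fits.append(b0 + b1 * x0)

fits = np.array(fits)
variance = fits.var()                    # how much fhat(x0) changes across training sets
bias_sq = (fits.mean() - f(x0)) ** 2     # systematic error of the assumed linear form
print(variance, bias_sq, sigma**2)       # last term is the irreducible error Var(eps)
```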
Classification problems
The output takes values in a discrete set, so Y is not necessarily real-valued and we use different notation. instead of MSE we use the training error rate (i.e., the misclassification rate): (1/n) Σ I(y_i ≠ yhat_i).
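A tiny sketch of the training error rate (made-up labels and predictions):

```python
import numpy as np

# Training error rate: fraction of training observations the classifier gets wrong.
y = np.array(["blue", "blue", "yellow", "blue", "yellow"])
y_hat = np.array(["blue", "yellow", "yellow", "blue", "yellow"])
print(np.mean(y != y_hat))   # 1 mistake out of 5 -> 0.2
```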
Bayes Classifier
The test error rate is minimized by a simple classifier that assigns each observation to the most likely class, given its predictor values: yhat_i = argmax_j P(Y = j | X = x_i). The error rate of the Bayes classifier (the Bayes error rate) is 1 - E(max_j P(Y = j | X)). it is the decision rule you would use if you knew the true conditional distribution of Y given X (e.g., if you had an infinite amount of data).
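A hedged sketch of the Bayes classifier (two equally likely Gaussian classes at means ±1 are an assumed toy setup, not from the text): assign x to the class with the larger posterior and estimate the Bayes error rate by Monte Carlo.

```python
import numpy as np

# Two equally likely classes with X | Y=0 ~ N(-1, 1) and X | Y=1 ~ N(1, 1).
# The Bayes classifier assigns x to the class with the larger posterior;
# its error rate (the Bayes error rate) is estimated here by Monte Carlo.
rng = np.random.default_rng(6)

def density(x, mu, sd=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * sd**2)) / (sd * np.sqrt(2 * np.pi))

def bayes_classifier(x):                 # argmax_j P(Y = j | X = x), equal priors
    return np.where(density(x, -1.0) > density(x, 1.0), 0, 1)

labels = rng.integers(0, 2, size=200_000)
x = rng.normal(np.where(labels == 0, -1.0, 1.0), 1.0)
print(np.mean(bayes_classifier(x) != labels))   # ~ 0.159, the Bayes error rate here
```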
KNN (comment on decision boundary shape wrt K)
imagine our blue and yellow classification problem (the purple dashed line is the Bayes boundary, available because the distribution of (X, Y) is known). to assign a color to a point x0, look at its K nearest neighbors and take a majority vote; variants vote over all points within a certain radius or weight the votes by distance. KNN has a decision boundary, and the higher the K, the smoother (less flexible) the decision boundary.
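A hedged KNN sketch (simulated two-class data; the test point and values of k are arbitrary choices): classify x0 by majority vote among its k nearest neighbors; larger k smooths the boundary.

```python
import numpy as np

# KNN classification: assign x0 the majority class among its k nearest training
# points. Larger k averages over more neighbors, which smooths the decision boundary.
rng = np.random.default_rng(7)
n = 200
labels = rng.integers(0, 2, size=n)              # 0 = blue, 1 = yellow
centers = np.where(labels == 0, -1.0, 1.0)[:, None]
X = rng.normal(loc=centers, scale=1.0, size=(n, 2))

def knn_predict(x0, k):
    dist = np.linalg.norm(X - x0, axis=1)        # Euclidean distance to every training point
    nearest = np.argsort(dist)[:k]               # indices of the k closest points
    return np.bincount(labels[nearest]).argmax() # majority vote among the neighbors

print(knn_predict(np.array([0.2, -0.3]), k=1))
print(knn_predict(np.array([0.2, -0.3]), k=25))  # larger k -> smoother, less flexible boundary
```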