Statistical Learning Flashcards

1
Q

Technical Definition: Statistical Learning

A

A set of approaches for estimating the relationship f between predictors (X) and an output (Y) using data.

2
Q

Synonyms for X (input)

A
  • Predictors
  • Independent variables
  • Features
  • Variables
3
Q

Synonyms for Y (output)

A

Y can be called the response or the dependent variable.

4
Q

General Form for Y

A

Y = f(X) + ε, where f is an unknown function and ε is a random error term with mean zero.

5
Q

What is systematic information in the context of Y = f(X) + ε?

A

The portion of Y explained by f(X), i.e., the non-random component driven by the predictors.

6
Q

Why Estimate f?

A

Prediction and inference

7
Q

Why Estimate f?

What does prediction focus on?

A

Obtaining an accurate Ŷ for new observations.

8
Q

Why Estimate f?

What does inference aim to achieve?

A

Understand how each predictor impacts Y.

9
Q

Definition: Reducible Error

A

Error introduced because our estimate f̂ of f is not perfect. It can potentially be reduced by improving the model.

10
Q

Equation for Reducible Error

A

[f(X) − f̂(X)]²

11
Q

Definition: Irreducible Error

A

Error that cannot be eliminated, even with a perfect model.

12
Q

Equation for Irreducible Error

A

Var(ε)

13
Q

Why is the irreducible error larger than zero?

A

Unmeasured variables that are useful in predicting Y or inherent randomness in Y.

14
Q

What are some example questions one may be interested in answering in the case of inference?

3 questions

A
  • Which predictors are associated with the response?
  • What is the relationship between the response and each predictor?
  • Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
15
Q

Questions Related to Inference

Which predictors are associated with the response?

Please explain.

A

Identifying the few important predictors among a large set of possible variables.

16
Q

Questions Related to Inference

What is the relationship between the response and each predictor?

What is the overall goal?

A

Evaluating each predictor’s effect on Y.

17
Q

Questions Related to Inference

What is the relationship between the response and each predictor?

What are some examples of the types of relationships?

A
  • Positive
  • Negative
  • More complex (e.g., it may depend on other variables via interactions)
18
Q

Questions Related to Inference

Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

Please explain.

2 points.

A
  • In some situations, assuming a linear relationship is reasonable or even desirable.
  • However, the true relationship can be non-linear or more complex, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.
19
Q

Is this a prediction or inference problem?

Consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome.

A

Prediction

20
Q

Prediction problem

Consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome.

Why is this a prediction problem?

A

Want an accurate model to predict the response using the predictors.

21
Q

Prediction problem

Consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome.

Why is this not an inference problem?

A

Not interested in obtaining a deep understanding of the relationships between each individual predictor and the response.

22
Q

Is this a prediction or inference problem?

A

Inference

23
Q

Inference problem

Why is this an inference problem?

3 points

A

The company may want to understand:
* Which media contribute to sales
* Which media generate the biggest boost in sales
* How much increase in sales is associated with a given increase in TV advertising

24
Q

Is this a prediction or inference problem?

Modeling the brand of a product that a customer might purchase based on variables such as price, store location, discount levels, competition price, and so forth. In this situation one might really be most interested in how each of the individual variables affects the probability of purchase.

A

Inference

25
Q

Inference problem

Modeling the brand of a product that a customer might purchase based on variables such as price, store location, discount levels, competition price, and so forth. In this situation one might really be most interested in how each of the individual variables affects the probability of purchase.

Why is this an inference problem?

A

Examining relationship between predictors and response.

26
Q

Is this a prediction or inference problem?

In a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth.

A

Both

27
Q

Prediction + inference problem

In a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth.

Why is this a prediction problem specifically?

A

Predicting the value of a home given its characteristics.

28
Q

Prediction + inference problem

In a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth.

Why is this an inference problem specifically?

A

Examining how individual input variables affect the prices.

29
Q

Parametric Methods (Overview)

A

They assume a specific functional form for f and estimate a finite set of parameters (e.g., via linear regression).

30
Q

Non-Parametric Methods (Overview)

A

They make fewer assumptions about f, can fit a wide range of shapes, but generally require more data to avoid overfitting.

31
Q

What are the advantages of parametric methods versus non-parametric methods?

3 points

A
  • Simplified problem: much easier to estimate a set of parameters than fit an entirely arbitrary function f.
  • Less prone to overfitting.
  • Fewer observations are necessary, given the smaller number of parameters.
32
Q

What are the disadvantages of parametric methods versus non-parametric methods?

2 points

A
  • The model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor.
  • More rigid. Parametric methods follow a fixed functional form whereas non-parametric methods do not assume a form for f, allowing for more flexible models.
33
Q

Non-Technical Definition: Overfitting

A

Modeling random noise in the training data so closely that performance on new data suffers.

34
Q

Interpretability vs. Flexibility Trade-Off

A

Highly flexible methods can fit data more accurately but are often less interpretable; simpler methods are more interpretable but may fit less accurately.

35
Q

Why might we choose a more restrictive (less flexible) model?

2 points

A
  • Restrictive models (e.g., linear regression) tend to be more interpretable; valuable when inference about how predictors affect the response is important.
  • In cases where a more flexible model yields overfitting.
36
Q

Why might we choose a more flexible model?

A

Flexible models often capture complex relationships more accurately, which can improve prediction performance but reduce interpretability.

37
Q

Technical Definition: Supervised Learning

A

A setting in which each observation has both predictors (X) and a known response (Y), allowing us to train models for prediction or inference.

38
Q

Technical Definition: Unsupervised Learning

A

A setting in which observations have predictors (X) but no associated response (Y), so we can only find structure or groupings in the data.

39
Q

Definition: Clustering

A

An unsupervised technique that groups observations into clusters based on similarities among their measured variables.

40
Q

What is Semi-Supervised Learning?

A

A scenario where some observations have a response while others do not, mixing both supervised and unsupervised elements.

41
Q

Definition: Quantitative Variables

A

Can be measured on a numeric scale.

42
Q

Definition: Qualitative Variables

A

Qualitative (categorical) variables fall into distinct classes.

43
Q

Method Selection Based on Response Type

A
  • Regression for quantitative response
  • Classification for qualitative response
44
Q

What is the ‘No Free Lunch’ principle in statistics?

A

It states that no single learning method is guaranteed to dominate all others across every possible data set; method performance depends on the specific problem.

45
Q

What is the most commonly-used goodness-of-fit measure in regression problems?

A

Mean Squared Error (MSE)

46
Q

Mean Squared Error (MSE) Equation

A

MSE = (1/n) Σᵢ (yᵢ − f̂(xᵢ))²
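A minimal R sketch of this formula, using hypothetical data and one arbitrary choice of f̂ (a degree-4 polynomial fit):

```r
# Hypothetical illustration: computing MSE for a fitted model
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)  # Y = f(X) + eps
dat <- data.frame(x = x, y = y)
train <- dat[1:70, ]
test  <- dat[71:100, ]

fit <- lm(y ~ poly(x, 4), data = train)      # one possible estimate f-hat

mse <- function(actual, predicted) mean((actual - predicted)^2)
train_mse <- mse(train$y, predict(fit, newdata = train))  # training MSE
test_mse  <- mse(test$y,  predict(fit, newdata = test))   # test MSE
```

The same `mse()` helper computes either the training MSE or the test MSE, depending on which observations it is evaluated on.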

47
Q

Difference Between Training MSE and Test MSE

A

Training MSE measures how well the model fits the data it was trained on; test MSE measures how well the model predicts new, unseen data.

48
Q

Why Does Minimizing Training MSE Not Guarantee Minimizing Test MSE?

A

Because a model might overfit the training data, capturing noise rather than the true underlying relationship, resulting in poor performance on unseen data.

49
Q

Technical Definition: Overfitting

A

A model fits the training data so closely that it fails to generalize to new data, typically showing low training error but high test error.

50
Q

Non-Technical Definition: Degrees of Freedom

A

A way to quantify model complexity or flexibility; more degrees of freedom usually means a more complex model that can fit data more closely.

51
Q

Which method (orange, blue, or green) is optimal?

52
Q

Explain why the blue method is optimal.

A

Minimizes the test MSE.

53
Q

Which method (orange, blue, or green) exhibits overfitting?

54
Q

Explain why the green method exhibits overfitting.

A

Exhibits a small training MSE but a large test MSE.

55
Q

Equation for Bias-Variance Decomposition of Expected Test MSE

A

E[(Y₀ − f̂(X₀))²] = Var(f̂(X₀)) + [Bias(f̂(X₀))]² + Var(ε)

56
Q

E[(Y₀ − f̂(X₀))²] = Var(f̂(X₀)) + [Bias(f̂(X₀))]² + Var(ε)

What does the Equation for Bias-Variance Decomposition of Expected Test MSE tell us?

2 points

A
  • We aim for low variance and low bias to minimize overall error.
  • The expected test MSE can never lie below Var(ε), the irreducible error.
57
Q

Definition: Bias (in model fitting)

A

Error introduced by approximating a potentially complex real‐world process with an overly simple model, causing systematic under‐ or over‐estimation.

58
Q

Definition: Variance (in model fitting)

A

Sensitivity of the model to the particular training set used. A high‐variance model changes drastically when the training data change.

59
Q

What happens to model variance and model bias as we use more flexible methods?

A

Variance will increase and bias will decrease.

60
Q

More flexible methods = variance increases and bias decreases

Given this is the case, what determines whether the test MSE increases or decreases?

A

The relative rates of change of variance and bias. If bias decreases faster than variance increases, test MSE declines. If bias declines slowly or not at all while variance increases significantly, test MSE increases.

61
Q

Classification

Error Rate Equation

A

(1/n) Σᵢ I(yᵢ ≠ ŷᵢ), i.e. the fraction of observations for which the predicted class differs from the true class.
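In R, the indicator-function average reduces to a one-liner; the labels below are made up for illustration:

```r
# Toy example: error rate = fraction of misclassified observations
truth <- c("A", "B", "A", "A", "B")   # true classes y_i
pred  <- c("A", "B", "B", "A", "A")   # predicted classes yhat_i
error_rate <- mean(pred != truth)     # averages I(y_i != yhat_i) over i
error_rate                            # 0.4 (2 of 5 misclassified)
```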

62
Q

Classification

Difference Between Training Error Rate and Test Error Rate

A

Training error rate measures how well the model classifies the data it was trained on; test error rate measures how well the model classifies new, unseen data.

63
Q

What is the Bayes Classifier?

A

A theoretical classifier that assigns each observation to the most likely class, given its predictor values. It achieves the lowest possible test error rate on average but is generally unachievable in practice because the true distribution of Y given X is unknown.

64
Q

Bayes Classifier Equation

A

Pr(Y = j | X = X₀)

65
Q

Pr(Y = j | X = X₀)

How would a Bayes Classifier work, based on the equation above?

A

Assign a test observation with predictor vector X₀ to the class j for which the probability is the largest.

66
Q

Definition: Bayes Decision Boundary

A

The boundary in feature space where Pr(Y = 1 | X) = 0.5 (in a two-class example). Points on one side are assigned to one class; points on the other side to the other class.

67
Q

Assuming a Bayes classifier was used, What does the purple dashed line represent here?

A

Bayes Decision Boundary

68
Q

Bayes Error Rate Equation

A

1 − E(max_j Pr(Y = j | X))

69
Q

1 − E(max_j Pr(Y = j | X))

What does the Bayes Error Rate tell us?

A

It is the lowest achievable test error rate. It is analogous to the irreducible error in regression settings.

70
Q

What is the K-Nearest Neighbors (KNN) Classifier?

A

Classifies a test observation by first finding the K closest training points and then assigning the majority class among those neighbors.

71
Q

K-Nearest Neighbors (KNN) Equation

A

Pr(Y = j | X = x₀) = (1/K) Σᵢ∈N₀ I(yᵢ = j)
* K = number of nearest neighbors
* N₀ = indices of the K training points closest to x₀
* I(condition) = 1 if the condition is true, else 0
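A small sketch using `knn()` from the `class` package (a recommended package bundled with most R installations); the training data here are made up for illustration:

```r
library(class)
set.seed(42)
train_X <- matrix(rnorm(40), ncol = 2)                  # 20 training points
train_y <- factor(rep(c("blue", "orange"), each = 10))  # their classes
test_X  <- matrix(c(0, 0), ncol = 2)                    # one test point

# Assign the test point to the majority class among its 3 nearest neighbors
pred <- knn(train = train_X, test = test_X, cl = train_y, k = 3)
```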

72
Q

What class will the black cross be assigned to in a KNN classifier (K = 3)?

73
Q

How Does a Small K Influence Bias and Variance in KNN?

A

A small K (e.g. K=1) produces a highly flexible model that can overfit (low bias, high variance).

74
Q

How Does a Large K Influence Bias and Variance in KNN?

A

A large K (e.g. K=100) yields a smoother, more biased but lower-variance model.

75
Q

With a KNN classifier, how does the training error rate vary with K?

A

As 1/K increases (K decreases), the training error rate decreases.

76
Q

With a KNN classifier, how does the test error rate vary with K?

A

As 1/K increases (K decreases), the test error rate first declines as flexibility increases before increasing as the method becomes excessively flexible and overfits.

77
Q

Why Doesn’t Low Training Error Always Mean Low Test Error in KNN?

A

Because with small K, the method can memorize training data noise (overfitting). This leads to low training error but potentially high test error.

78
Q

How Does KNN Compare with the Bayes Classifier?

A

KNN can approximate the Bayes rule if K is chosen well and we have enough data. However, the Bayes classifier remains a gold standard that typically requires full knowledge of the true distribution.

79
Q

Why Is Choosing the Correct Level of Model Flexibility Important?

A

Both in regression and classification, we want a balance between bias and variance. Too much flexibility can overfit; too little can underfit. Optimal flexibility minimizes test error, typically in a U-shaped pattern.

80
Q

What does ls() do?

A

Lists the names of objects currently stored in the R environment.

81
Q

What does rm(list=ls()) do?

A

Removes all objects in the environment, effectively clearing your workspace.

82
Q

What does matrix() do in R?

A

Creates a matrix from a given set of values, arranged in rows and columns.

83
Q

What does matrix(data=c(1,2,3,4), nrow=2, ncol=2) output in R?
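For reference, `matrix()` fills values column-by-column by default:

```r
m <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
m
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
```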

84
Q

What does matrix(c(1,2,3,4), 2, 2, byrow=TRUE) output in R?
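With `byrow=TRUE`, the values fill row-by-row instead of the default column-by-column order:

```r
m <- matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
m
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4
```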

85
Q

What does rnorm() do in R?

A

Generates random values from a normal (Gaussian) distribution.
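For example (the seed and parameters below are arbitrary):

```r
set.seed(3)                          # for reproducibility
x <- rnorm(50)                       # 50 draws from N(0, 1)
y <- rnorm(50, mean = 10, sd = 0.5)  # mean and sd can be specified
```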

86
Q

How do you create a PDF with a plot in R?

A
pdf("PlotName.pdf")
plot(...)
dev.off()
87
Q

How do you create a JPEG with a plot in R?

A
jpeg("PlotName.jpg")
plot(...)
dev.off()
88
Q

What does contour() do in R?

A

Creates a contour plot to visualize 3D data; draws contour lines of a 3D surface on a 2D plot.

89
Q

What does image() do in R?

A

Creates a heatmap to visualize 3D data; displays a 2D image of a matrix of values using different colors to represent magnitude.

90
Q

What does persp() do in R?

A

Creates a 3D perspective plot of a surface for 3D data.

91
Q

What does negative indexing (e.g., x[-1]) mean in R?

A

Selects all elements except the ones at the specified indices. For instance, x[-1] means ‘all but the first element.’
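A quick illustration with an arbitrary vector:

```r
x <- c(10, 20, 30, 40)
x[-1]        # 20 30 40  (all but the first element)
x[-c(1, 2)]  # 30 40     (all but the first two)
```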

92
Q

What does dim() do in R?

A

Returns the dimensions of an object (e.g., rows and columns for a matrix or data frame).
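For example, on a small matrix:

```r
m <- matrix(1:6, nrow = 2)
dim(m)    # 2 3  (rows, columns)
nrow(m)   # 2
ncol(m)   # 3
```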

93
Q

What does na.omit() do?

A

Removes rows or observations with missing values (NA) from an object like a data frame.
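A small example with a made-up data frame:

```r
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
clean <- na.omit(df)  # keeps only complete rows
nrow(clean)           # 1  (only the first row has no NA)
```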

94
Q

What does attach() do?

A

Makes the variables of a data set or environment accessible by name without using the $ operator or indexing.

95
Q

What does pairs() do in R?

A

Creates a matrix of scatterplots for every pair of variables in a data frame or matrix.

96
Q

What does identify() do in R plots?

A

Lets you click on plotted points to label or retrieve their coordinates (or row indices). Helpful for interactive identification.

97
Q

Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

The sample size n is extremely large, and the number of predictors p is small.

A

Better - with an extremely large sample size, a more flexible approach can fit the data more closely without overfitting, and will generally achieve a better fit than an inflexible approach.

98
Q

Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

The number of predictors p is extremely large, and the number of observations n is small.

A

Worse - a flexible method would overfit the small number of observations

99
Q

Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

The variance of the error terms, i.e. σ² = Var(ε), is extremely high.

A

Worse - flexible methods would fit the noise in the error terms, increasing variance.

100
Q

What are the disadvantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)?

A

A parametric approach may estimate f inaccurately if the assumed form of f is wrong, or may overfit the observations if an overly flexible parametric model is used.

101
Q

How do you find the range of the first seven columns in a data.table called auto in R?

A

sapply(auto[, 1:7], range)

102
Q

Do any of the suburbs of Boston appear to have particularly high crime rates? Comment on the range of this predictor.

A

Most suburbs have low crime rates, but there is a long tail: 15-20 suburbs appear to have a crime rate > 20, reaching above 80.

103
Q

Do any of the suburbs of Boston appear to have particularly high tax rates? Comment on the range of this predictor.

A

There is a large divide between suburbs with low tax rates and a cluster of suburbs with high rates, peaking at 660-680.

104
Q

Do any of the suburbs of Boston appear to have particularly high pupil-teacher ratios? Comment on the range of this predictor.

A

A skew towards high ratios, but no particularly high ratios.

105
Q

Given a data set, boston, where the median value of owner-occupied homes column is called medv, how would you answer this question?

Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

A
  1. Subset the data to find the observation where medv equals the minimum medv: boston[medv == min(medv)]
  2. Compare this to the distribution of each variable, using summary: summary(boston)
  3. Look at where each variable value falls within the variable’s distribution.
106
Q

Given a data set, boston, where the average number of rooms per dwelling is called rm, how would you answer this question?

Comment on the suburbs that average more than eight rooms per dwelling.

A
  1. Compare the distribution in variables for the subset of data where rm > 8: summary(boston[rm > 8])
  2. Compare the distribution in variables for the overall data set: summary(boston)
  3. Compare the distribution/range for (1) to (2) for each variable.