Statistical Learning Flashcards
Technical Definition: Statistical Learning
A set of approaches for estimating the relationship f between predictors (X) and an output (Y) using data.
Synonyms for X (input)
- Predictors
- Independent variables
- Features
- Variables
Synonyms for Y (output)
- Response
- Dependent variable
General Form for Y
Y = f(X) + ε, where f is an unknown function and ε is a random error term with mean zero.
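A minimal R sketch simulating data from this model; the particular f and noise level below are illustrative assumptions, not from the text.
set.seed(1)
n <- 100
X <- runif(n, min = 0, max = 10)
f <- function(x) 2 + 0.5 * x        # the systematic (normally unknown) part
eps <- rnorm(n, mean = 0, sd = 1)   # random error term with mean zero
Y <- f(X) + eps                     # observed response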
What is systematic information in the context of Y = f(X) + ε?
The portion of Y explained by f(X), i.e., the non-random component driven by the predictors.
Why Estimate f?
Prediction and inference
Why Estimate f?
What does prediction focus on?
Obtaining an accurate Ŷ for new observations.
Why Estimate f?
What does inference aim to achieve?
Understand how each predictor impacts Y.
Definition: Reducible Error
Error introduced because our estimate of f, f̂, is not perfect. It can potentially be reduced by improving the model.
Equation for Reducible Error
[f(X) − f̂(X)]²
Definition: Irreducible Error
Error that cannot be eliminated, even with a perfect model.
Equation for Irreducible Error
Var(ε)
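For a fixed X and a fixed estimate f̂, the two components combine in the expected squared prediction error (a standard decomposition):
E[(Y − Ŷ)²] = [f(X) − f̂(X)]² + Var(ε),
where the first term is the reducible error and the second is the irreducible error.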
Why is the irreducible error larger than zero?
Unmeasured variables that are useful in predicting Y or inherent randomness in Y.
What are some example questions one may be interested in answering in the case of inference?
3 questions
- Which predictors are associated with the response?
- What is the relationship between the response and each predictor?
- Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Questions Related to Inference
Which predictors are associated with the response?
Please explain.
Identifying the few important predictors among a large set of possible variables.
Questions Related to Inference
What is the relationship between the response and each predictor?
What is the overall goal?
Evaluating each predictor’s effect on Y.
Questions Related to Inference
What is the relationship between the response and each predictor?
What are some examples of the types of relationships?
- Positive
- Negative
- More complex (e.g., it may depend on other variables via interactions)
Questions Related to Inference
Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Please explain.
2 points.
- In some situations, assuming a linear relationship is reasonable or even desirable.
- However, the true relationship can be non-linear or more complex, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.
Is this a prediction or inference problem?
Consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome.
Prediction
Prediction problem
Consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome.
Why is this a prediction problem?
Want an accurate model to predict the response using the predictors.
Prediction problem
Consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome.
Why is this not an inference problem?
Not interested in obtaining a deep understanding of the relationships between each individual predictor and the response.
Is this a prediction or inference problem?
Consider a company that wants to understand how advertising spending across different media (e.g., TV, radio, newspaper) relates to product sales. In this case, the media advertising budgets serve as predictors, and sales serves as the outcome.
Inference
Inference problem
Consider a company that wants to understand how advertising spending across different media (e.g., TV, radio, newspaper) relates to product sales.
Why is this an inference problem?
3 points
The company may want to understand:
* Which media contributes to sales
* Which media generate the biggest boost in sales
* How much increase in sales is associated with a given increase in TV advertising
Is this a prediction or inference problem?
Modeling the brand of a product that a customer might purchase based on variables such as price, store location, discount levels, competition price, and so forth. In this situation one might really be most interested in how each of the individual variables affects the probability of purchase.
Inference
Inference problem
Modeling the brand of a product that a customer might purchase based on variables such as price, store location, discount levels, competition price, and so forth. In this situation one might really be most interested in how each of the individual variables affects the probability of purchase.
Why is this an inference problem?
Examining the relationship between each predictor and the response.
Is this a prediction or inference problem?
In a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth.
Both
Prediction + inference problem
In a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth.
Why is this a prediction problem specifically?
Predicting the value of a home given its characteristics.
Prediction + inference problem
In a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth.
Why is this an inference problem specifically?
Examining how individual input variables affect the prices.
Parametric Methods (Overview)
They assume a specific functional form for f and estimate a finite set of parameters (e.g., the coefficients in a linear regression).
Non-Parametric Methods (Overview)
They make fewer assumptions about the form of f and can fit a wide range of shapes, but they generally require more data to avoid overfitting.
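A minimal R sketch contrasting the two approaches on simulated data; the data-generating function and variable names are illustrative assumptions.
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)        # the true f is non-linear
fit_param    <- lm(y ~ x)                 # parametric: assumes a linear form for f
fit_nonparam <- loess(y ~ x)              # non-parametric: no fixed form assumed
ord <- order(x)
plot(x, y)
lines(x[ord], fitted(fit_param)[ord], col = "red")      # straight-line fit
lines(x[ord], fitted(fit_nonparam)[ord], col = "blue")  # flexible smooth fit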
What are the advantages of parametric methods versus non-parametric methods?
3 points
- Simplified problem: much easier to estimate a set of parameters than fit an entirely arbitrary function f.
- Less prone to overfitting.
- Fewer observations are necessary, given the smaller number of parameters.
What are the disadvantages of parametric methods versus non-parametric methods?
2 points
- The model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor.
- More rigid. Parametric methods follow a fixed functional form whereas non-parametric methods do not assume a form for f, allowing for more flexible models.
Non-Technical Definition: Overfitting
Modeling random noise in the training data so closely that performance on new data suffers.
Interpretability vs. Flexibility Trade-Off
Highly flexible methods can fit data more accurately but are often less interpretable; simpler methods are more interpretable but may fit less accurately.
Why might we choose a more restrictive (less flexible) model?
2 points
- Restrictive models (e.g., linear regression) tend to be more interpretable; valuable when inference about how predictors affect the response is important.
- When a more flexible model would overfit the data.
Why might we choose a more flexible model?
Flexible models often capture complex relationships more accurately, which can improve prediction performance but reduce interpretability.
Technical Definition: Supervised Learning
A setting in which each observation has both predictors (X) and a known response (Y), allowing us to train models for prediction or inference.
Technical Definition: Unsupervised Learning
A setting in which observations have predictors (X) but no associated response (Y), so we can only find structure or groupings in the data.
Definition: Clustering
An unsupervised technique that groups observations into clusters based on similarities among their measured variables.
What is Semi-Supervised Learning?
A scenario where some observations have a response while others do not, mixing both supervised and unsupervised elements.
Definition: Quantitative Variables
Can be measured on a numeric scale.
Definition: Qualitative Variables
Qualitative (categorical) variables fall into distinct classes.
Method Selection Based on Response Type
- Regression for quantitative response
- Classification for qualitative response
What is the ‘No Free Lunch’ principle in statistics?
It states that no single learning method is guaranteed to dominate all others across every possible data set; method performance depends on the specific problem.
What is the most commonly-used goodness-of-fit measure in regression problems?
Mean Squared Error (MSE)
Mean Squared Error (MSE) Equation
MSE = (1/n) Σᵢ (yᵢ − f̂(xᵢ))²
Difference Between Training MSE and Test MSE
Training MSE measures how well the model fits the data it was trained on; test MSE measures how well the model predicts new, unseen data.
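A minimal R sketch computing both quantities on simulated data; the train/test split and data-generating model are illustrative assumptions.
set.seed(2)
x <- runif(200); y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
dat <- data.frame(x, y)
train <- sample(200, 100)                       # indices of the training half
fit <- lm(y ~ poly(x, 4), data = dat[train, ])
train_mse <- mean((dat$y[train]  - predict(fit))^2)
test_mse  <- mean((dat$y[-train] - predict(fit, newdata = dat[-train, ]))^2)
c(train_mse, test_mse)                          # test MSE is typically the larger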
Why Does Minimizing Training MSE Not Guarantee Minimizing Test MSE?
Because a model might overfit the training data, capturing noise rather than the true underlying relationship, resulting in poor performance on unseen data.
Technical Definition: Overfitting
A model fits the training data so closely that it fails to generalize to new data, typically showing low training error but high test error.
Non-Technical Definition: Degrees of Freedom
A way to quantify model complexity or flexibility; more degrees of freedom usually means a more complex model that can fit data more closely.
Which method (orange, blue, or green) is optimal?
Blue
Explain why the blue method is optimal.
Minimizes the test MSE.
Which method (orange, blue, or green) exhibits overfitting?
Green
Explain why the green method exhibits overfitting.
Exhibits a small training MSE but a large test MSE.
Equation for Bias-Variance Decomposition of Expected Test MSE
E[(Y₀ − f̂(X₀))²] = Var(f̂(X₀)) + [Bias(f̂(X₀))]² + Var(ε)
E[(Y₀ − f̂(X₀))²] = Var(f̂(X₀)) + [Bias(f̂(X₀))]² + Var(ε)
What does the Equation for Bias-Variance Decomposition of Expected Test MSE tell us?
2 points
- We aim for low variance and low bias to minimize overall error.
- The expected test MSE can never lie below Var(ε), the irreducible error.
Definition: Bias (in model fitting)
Error introduced by approximating a potentially complex real‐world process with an overly simple model, causing systematic under‐ or over‐estimation.
Definition: Variance (in model fitting)
Sensitivity of the model to the particular training set used. A high‐variance model changes drastically when the training data change.
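In symbols, for a fixed test point X₀ (a standard formulation, with the expectation taken over training sets):
Bias(f̂(X₀)) = E[f̂(X₀)] − f(X₀)
Var(f̂(X₀)) = E[(f̂(X₀) − E[f̂(X₀)])²]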
What happens to model variance and model bias as we use more flexible methods?
Variance will increase and bias will decrease.
More flexible methods = variance increases and bias decreases
Given this is the case, what determines whether the test MSE increases or decreases?
The relative rates of change of variance and bias. If bias decreases faster than variance increases, test MSE declines. If bias declines slowly or is unchanged while variance increases significantly, test MSE increases.
Classification
Error Rate Equation
(1/n) Σᵢ I(yᵢ ≠ ŷᵢ), i.e., the fraction of observations for which the predicted class differs from the true class.
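A minimal R sketch of this calculation; the label vectors are illustrative.
y_true <- c("A", "B", "A", "A", "B")
y_pred <- c("A", "B", "B", "A", "B")
mean(y_true != y_pred)   # fraction misclassified: 0.2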
Classification
Difference Between Training Error Rate and Test Error Rate
Training error rate measures how well the model classifies the data it was trained on; test error rate measures how well the model classifies new, unseen data.
What is the Bayes Classifier?
A theoretical classifier that assigns each observation to the most likely class, given its predictor values. It achieves the lowest possible test error rate on average but is generally unachievable in practice because the true distribution of Y given X is unknown.
Bayes Classifier Equation
Pr(Y = j | X = X₀)
Pr(Y = j | X = X₀)
How would a Bayes Classifier work, based on the equation above?
Assign a test observation with predictor vector X₀ to the class j for which the probability is the largest.
Definition: Bayes Decision Boundary
The boundary in predictor space where the conditional class probabilities are equal (Pr = 0.5 in a two-class setting). Points on one side are assigned to one class, points on the other side to the other class.
Assuming a Bayes classifier was used, what does the purple dashed line represent here?
Bayes Decision Boundary
Bayes Error Rate Equation
1 − E(max_j Pr(Y = j | X))
1 − E(max_j Pr(Y = j | X))
What does the Bayes Error Rate tell us?
It is the lowest achievable test error rate. It is analogous to the irreducible error in regression settings.
What is the K-Nearest Neighbors (KNN) Classifier?
Classifies a test observation by first finding the K closest training points and then assigning the majority class among those neighbors.
K-Nearest Neighbors (KNN) Equation
Pr(Y = j | X = x₀) = (1/K) Σ_{i ∈ N₀} I(yᵢ = j)
* K = number of nearest neighbors
* N₀ = indices of the K training observations closest to x₀
* I(·) = 1 if the condition is true, else 0
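A minimal R sketch of KNN classification using knn() from the class package; the simulated data and variable names are illustrative.
library(class)
set.seed(3)
train_X <- matrix(rnorm(40), ncol = 2)            # 20 training observations, 2 predictors
train_y <- factor(rep(c("A", "B"), each = 10))    # their class labels
test_X  <- matrix(rnorm(10), ncol = 2)            # 5 new observations
knn(train = train_X, test = test_X, cl = train_y, k = 3)   # predicted class for each test point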
What class will the black cross be assigned to in a KNN classifier (K = 3)?
Blue
How Does a Small K Influence Bias and Variance in KNN?
A small K (e.g. K=1) produces a highly flexible model that can overfit (low bias, high variance).
How Does a Large K Influence Bias and Variance in KNN?
A large K (e.g. K=100) yields a smoother, more biased but lower-variance model.
With a KNN classifier, how does the training error rate vary with K?
As 1/K increases (K decreases), the training error rate decreases.
With a KNN classifier, how does the test error rate vary with K?
As 1/K increases (K decreases), the test error rate first declines as flexibility increases before increasing as the method becomes excessively flexible and overfits.
Why Doesn’t Low Training Error Always Mean Low Test Error in KNN?
Because with small K, the method can memorize training data noise (overfitting). This leads to low training error but potentially high test error.
How Does KNN Compare with the Bayes Classifier?
KNN can approximate the Bayes rule if K is chosen well and there is enough data. However, the Bayes classifier remains the gold standard: it requires full knowledge of the true conditional distribution of Y given X, which is unavailable in practice.
Why Is Choosing the Correct Level of Model Flexibility Important?
Both in regression and classification, we want a balance between bias and variance. Too much flexibility can overfit; too little can underfit. Optimal flexibility minimizes test error, typically in a U-shaped pattern.
What does ls() do?
Lists the names of objects currently stored in the R environment.
What does rm(list=ls()) do?
Removes all objects in the environment, effectively clearing your workspace.
What does matrix() do in R?
Creates a matrix from a given set of values, arranged in rows and columns.
What does matrix(data=c(1,2,3,4), nrow=2, ncol=2) output in R?
A 2×2 matrix filled column-by-column (the default):
     [,1] [,2]
[1,]    1    3
[2,]    2    4
What does matrix(c(1,2,3,4), 2, 2, byrow=TRUE) output in R?
A 2×2 matrix filled row-by-row:
     [,1] [,2]
[1,]    1    2
[2,]    3    4
What does rnorm() do in R?
Generates random values from a normal (Gaussian) distribution.
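For example (the arguments shown are the defaults made explicit):
x <- rnorm(50, mean = 0, sd = 1)   # 50 draws from a standard normal distribution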
How do you create a PDF with a plot in R?
pdf("PlotName.pdf")
plot(...)
dev.off()
How do you create a JPEG with a plot in R?
jpeg("PlotName.jpg")
plot(...)
dev.off()
What does contour() do in R?
Creates a contour plot to visualize 3D data; draws contour lines of a 3D surface on a 2D plot.
What does image() do in R?
Creates a heatmap to visualize 3D data; displays a 2D image of a matrix of values using different colors to represent magnitude.
What does persp() do in R?
Creates a 3D perspective plot of a surface for 3D data.
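A minimal R sketch using all three functions on the same surface; the surface itself is an illustrative choice.
x <- seq(-pi, pi, length = 50)
y <- x
f <- outer(x, y, function(a, b) cos(b) / (1 + a^2))   # function values on a grid
contour(x, y, f)                        # contour lines
image(x, y, f)                          # heatmap
persp(x, y, f, theta = 30, phi = 20)    # 3D perspective view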
What does negative indexing (e.g., x[-1]) mean in R?
Selects all elements except the ones at the specified indices. For instance, x[-1] means ‘all but the first element.’
What does dim() do in R?
Returns the dimensions of an object (e.g., rows and columns for a matrix or data frame).
What does na.omit() do?
Removes rows or observations with missing values (NA) from an object like a data frame.
What does attach() do?
Makes the variables of a data set or environment accessible by name without using the $ operator or indexing.
What does pairs() do in R?
Creates a matrix of scatterplots for every pair of variables in a data frame or matrix.
What does identify() do in R plots?
Lets you click on plotted points to label or retrieve their coordinates (or row indices). Helpful for interactive identification.
Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
The sample size n is extremely large, and the number of predictors p is small.
Better - with an extremely large sample size and only a small number of predictors, a more flexible approach can fit the data more closely without overfitting, so it would generally outperform an inflexible approach.
Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
The number of predictors p is extremely large, and the number of observations n is small.
Worse - a flexible method would overfit the small number of observations
Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
The variance of the error terms, i.e. σ2 = Var(e), is extremely high.
Worse - flexible methods fit to the noise in the error terms and increase variance
What are the disadvantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)?
The disadvantages of a parametric approach are the potential to estimate f inaccurately if the assumed form of f is wrong, and the potential to overfit the observations if more flexible parametric models are used.
How do you find the range of the first seven columns in a data.table called auto in R?
sapply(auto[, 1:7], range)
Do any of the suburbs of Boston appear to have particularly high crime rates? Comment on the range of this predictor.
Most suburbs have low crime rates, but there is a long tail: 15-20 suburbs appear to have a crime rate > 20, reaching above 80.
Do any of the suburbs of Boston appear to have particularly high tax rates? Comment on the range of this predictor.
There is a large divide between suburbs with low tax rates and a cluster of suburbs with very high rates, peaking at 660-680.
Do any of the suburbs of Boston appear to have particularly high pupil-teacher ratios? Comment on the range of this predictor.
The distribution is skewed towards high ratios, but no suburbs stand out with particularly extreme values.
Given a data set, boston, where the median value of owner-occupied homes column is called medv, how would you answer this question?
Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
- Subset the data to find the observation where medv equals the minimum medv: boston[medv == min(boston$medv)]
- Compare this to the distribution of each variable, using summary: summary(boston)
- Look at where each variable value falls within the variable’s distribution.
Given a data set, boston, where the average number of rooms per dwelling is called rm, how would you answer this question?
Comment on the suburbs that average more than eight rooms per dwelling.
1. Compare the distribution in variables for the subset of data where rm > 8: summary(boston[rm > 8])
2. Compare the distribution in variables for the overall data set: summary(boston)
3. Compare the distribution/range for (1) to (2) for each variable.