3.1 Introduction to Statistical Learning Flashcards
Describe each group of models below and explain how they are different from each other.
- Multiple linear regression (MLR)
- Stepwise selection and regularization models
- Generalized linear models (GLM)
- Single decision tree models
- Ensemble decision tree models
- MLR models the linear relationship between a continuous target variable and multiple predictors, assuming constant variance and a linear form between predictors and the target.
- Stepwise selection adds or removes predictors one at a time based on a selection criterion, while regularization models such as lasso and ridge shrink coefficients toward zero (lasso can set them exactly to zero). Both improve on MLR by selecting important predictors and/or preventing overfitting.
- GLMs extend MLR by allowing the target variable to follow distributions other than the normal distribution, making them suitable for binary, count, or skewed data.
- Decision trees split the data based on feature values, creating a sequence of decisions to predict the target.
- Ensemble models, like random forests or boosting, combine multiple decision trees to improve accuracy and reduce overfitting by averaging or weighting individual tree predictions.
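As a hedged illustration of how these model families differ in practice, the sketch below fits several of them to synthetic data with scikit-learn (the data, the `alpha`, and the depth settings are arbitrary assumptions, not from the flashcards):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 5 predictors, but only the first two truly drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=200)

mlr = LinearRegression().fit(X, y)            # assumes a linear form for the target
lasso = Lasso(alpha=0.2).fit(X, y)            # shrinks coefficients; can zero them out
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)          # sequence of splits
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)   # averages many trees

print("MLR coefficients:  ", np.round(mlr.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))   # noise predictors driven toward 0
```

Note how the lasso zeroes out noise predictors that MLR keeps with small nonzero coefficients; a GLM would instead change the assumed distribution of the target (e.g., scikit-learn's `PoissonRegressor` for counts).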
What is the key distinction between supervised and unsupervised learning?
Supervised learning focuses on data with a target variable, while unsupervised learning analyzes data without a target variable in order to find patterns and relationships among the variables.
What is the main difference between regression and classification problems?
Regression problems have a continuous or count target variable, while classification problems have a categorical target variable.
Describe the systematic and random components of a target variable.
- The systematic component represents the value the target gravitates toward as a function of the predictors.
- The random component accounts for the unexplained variation not captured by the predictors.
What are the main differences between parametric and non-parametric methods?
Parametric methods specify a functional form for the systematic component of the target variable, with a set of parameters to estimate, while non-parametric methods do not assume a fixed form and rely entirely on the data to model the relationship.
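A minimal sketch of the contrast, assuming scikit-learn and a made-up nonlinear dataset: linear regression is parametric (only an intercept and a slope to estimate), while k-nearest neighbors is non-parametric (no assumed form; predictions come directly from nearby data points).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)   # nonlinear truth

# Parametric: assumes y = b0 + b1*x, so it cannot capture the sine shape
param = LinearRegression().fit(X, y)

# Non-parametric: predicts with the mean of the 10 nearest observations
nonparam = KNeighborsRegressor(n_neighbors=10).fit(X, y)

print("Linear R^2:", round(param.score(X, y), 3))       # poor: wrong functional form
print("KNN R^2:   ", round(nonparam.score(X, y), 3))    # far better on this data
```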
What are model flexibility and interpretability, and how do they relate?
Model flexibility refers to the capacity of a model's predictions to follow the data points closely, while interpretability refers to our ability to understand and explain the relationship between the target and the predictors. The two generally trade off: more flexible models tend to be harder to interpret.
Explain the concepts of overfitting and underfitting, and discuss how they relate to model flexibility and the behavior of the training and test RMSE’s.
- MY TERMS:
Overfitting arises when a model generates estimations that follow the training data too closely and fail to generalize to unseen data. Underfitting occurs when a model produces overly general predictions, oversimplifying the relationship between the target and predictors.
An overly flexible model tends to overfit: the training RMSE keeps decreasing as flexibility grows, while the test RMSE traces a U-shape.
- SOLUTION CA:
Overfitting occurs when the model fits the training data too closely and fails to generalize. Underfitting occurs when the model is not flexible enough to capture patterns.
In terms of flexibility and RMSE, training RMSE decreases as flexibility increases, whereas test RMSE follows a U-shaped curve, reaching its minimum at a moderate level of flexibility. Overfitting is indicated by a large difference between the training and test RMSE.
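This U-shape can be demonstrated with a short sketch: fit polynomials of increasing degree (flexibility) to synthetic quadratic data and compare training and test RMSE. The dataset and the degrees used are illustrative assumptions, not from the flashcards.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(100, 1))
y = X[:, 0] ** 2 + rng.normal(size=100)           # quadratic truth plus noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

train_rmse, test_rmse = {}, {}
for degree in [1, 2, 15]:                         # increasing flexibility
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_rmse[degree] = mean_squared_error(y_tr, model.predict(X_tr)) ** 0.5
    test_rmse[degree] = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"degree {degree:2d}: train RMSE {train_rmse[degree]:.2f}, "
          f"test RMSE {test_rmse[degree]:.2f}")
```

Training RMSE falls as the degree rises, but the test RMSE bottoms out at the moderate degree 2 (the true form) and rises again for the overfit degree-15 model.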
What is the bias-variance trade-off, and how does it impact the test RMSE and the selection of an optimal model?
- Variance measures how much the shape of the fitted function changes with different training data. Higher-flexibility models tend to have higher variance; it is the expected error that arises from the model being too flexible or complex.
- Bias measures the average closeness between the model's estimates and the true values of the target variable. Higher-flexibility models tend to have lower bias; it is the expected error that arises from the model not being flexible or complex enough.
The bias-variance trade-off refers to the balance between bias, which causes error due to an overly simple model, and variance, which leads to error from a model that is too complex. The test RMSE depends on both bias and variance.
An optimal model minimizes test RMSE by balancing bias and variance, ensuring the model captures patterns without overfitting to noise.
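One way to see the trade-off numerically is to refit a model on many fresh training sets and study its predictions at a single point. The simulation below (synthetic sine data and arbitrary polynomial degrees, chosen as assumptions for illustration) estimates bias and variance for a rigid model and a flexible one:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
true_f = np.sin            # the true systematic component
x0 = 1.5                   # the point at which predictions are studied

def predictions(degree, n_sims=200):
    """Refit on fresh 30-point training sets; collect predictions at x0."""
    preds = []
    for _ in range(n_sims):
        X = rng.uniform(0, 3, size=(30, 1))
        y = true_f(X[:, 0]) + rng.normal(scale=0.3, size=30)
        m = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
        preds.append(m.predict([[x0]])[0])
    return np.array(preds)

for degree in (1, 9):
    p = predictions(degree)
    print(f"degree {degree}: |bias| ~ {abs(p.mean() - true_f(x0)):.3f}, "
          f"variance ~ {p.var():.3f}")
```

The rigid degree-1 model shows large bias but small variance; the flexible degree-9 model shows the reverse, which is exactly the trade-off the test RMSE balances.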
Compare and contrast the properties and use cases of the error metrics RMSE, MSE, and MAE in the context of regression problems.
- RMSE: in units of y
- MSE: in squared units of y
- MAE: robust to outliers
MSE and RMSE are sensitive to outliers because squaring a large difference greatly magnifies its impact on the overall metric. In contrast, MAE does not alter the magnitude of the differences; it simply captures their absolute values.
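A small numeric check of this outlier sensitivity (with made-up numbers): when every error has the same size, RMSE and MAE agree, but a single large miss inflates RMSE far more than MAE.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = y_true + 1.0                               # every prediction off by exactly 1

mse = float(np.mean((y_true - y_pred) ** 2))
rmse = mse ** 0.5
mae = float(np.mean(np.abs(y_true - y_pred)))
print(rmse, mae)                                    # 1.0 1.0: errors are uniform

# One outlier: the first prediction now misses by 10 instead of 1
y_out = y_pred.copy()
y_out[0] = y_true[0] + 10.0
rmse_out = float(np.mean((y_true - y_out) ** 2)) ** 0.5   # sqrt(104/5) ~ 4.56
mae_out = float(np.mean(np.abs(y_true - y_out)))          # 14/5 = 2.8
print(round(rmse_out, 2), round(mae_out, 2))
```

One outlier multiplies RMSE by about 4.6 but MAE by only 2.8, which is why MAE is described as robust to outliers.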
Describe the key accuracy metrics used in classification problems and explain how they are computed.
- Classification error rate is the percentage of observations with wrong predictions.
- Accuracy is the percentage of observations with correct predictions.
- Sensitivity / Recall / True positive rate (TPR): The percentage of positive observations with correct predictions.
- Specificity / True negative rate (TNR): The percentage of negative observations with correct predictions.
- False positive rate (FPR): The percentage of negative observations with wrong predictions. It is the complement of specificity.
FPR = 1 - Specificity
- Precision / Positive predictive value: The percentage of positive predictions that are correct.
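All of these metrics can be computed from the four cells of a confusion matrix; the counts below are hypothetical:

```python
# Hypothetical confusion-matrix counts (positive = event of interest)
TP, FP, TN, FN = 40, 10, 35, 15

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 75 / 100 = 0.75
error_rate = 1 - accuracy                    # 0.25
sensitivity = TP / (TP + FN)                 # TPR: 40 / 55 ~ 0.727
specificity = TN / (TN + FP)                 # TNR: 35 / 45 ~ 0.778
fpr = FP / (FP + TN)                         # 10 / 45 ~ 0.222 = 1 - specificity
precision = TP / (TP + FP)                   # 40 / 50 = 0.80

print(accuracy, error_rate, precision)       # 0.75 0.25 0.8
```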
Discuss the factors that are important to consider before modeling, while modeling, and after modeling.
- Before modeling, it is important to clearly define the problem, assess the suitability of predictive modeling for the given task, and understand the types of analyses that can be applied.
- During modeling, it may be necessary to refine the problem statement, consult subject matter experts for deeper insights, collect additional data if needed, and experiment with different models to find the best fit for the problem.
- After modeling, the results should be carefully evaluated to determine whether to implement the model (possibly through field testing) or abandon it if it does not meet expectations or add value.
What is stratified sampling, and why is it the preferred method for dividing a dataset into training and test sets?
Stratified sampling divides data into distinct strata and samples from each, ensuring representation of different groups. It is preferred when creating training and test sets to preserve the distribution of the target variable in each set.
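A hedged sketch with scikit-learn (the 10% positive rate is a made-up example): passing `stratify=y` to `train_test_split` keeps the target's class proportions nearly identical in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))
y = (rng.uniform(size=1000) < 0.10).astype(int)   # rare positive class (~10%)

# stratify=y samples separately within each class (stratum)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print("full:", round(y.mean(), 3),
      "train:", round(y_tr.mean(), 3),
      "test:", round(y_te.mean(), 3))
```

Without stratification, a rare class can by chance be over- or under-represented in the test set, distorting the evaluation.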
Explain the differences between training, validation, and test sets, as well as the ambiguity surrounding their usage.
The training set is used to fit the model, the validation set helps with model selection and tuning, and the test set is for final performance evaluation.
There is often ambiguity, as the terms “validation set” and “test set” are sometimes used interchangeably, especially in this exam, where only training and test sets may be used.