SRM Chapter 1 Flashcards
Response/Output/Dependent Variable
- Y
- Variable we want to predict using other (explanatory) variables
Explanatory/Input/Independent Variable
- Xj’s
- Variables we use to predict the dependent variable
- We want to study the relationship between these and the dependent variable
- Aka predictor, feature
Count Variable
- Quantitative
- Variable that takes on non-negative integers (discrete)
Continuous Variable
- Quantitative
- Takes on continuous values within an interval
Categorical Variable
- Qualitative
- Takes on different categories
- Aka class, level
- Each category is given a number
Nominal Variable
- Categorical variable that has no logical order
- The numbers don’t have any meaning, they just differentiate/label the categories
- e.g. seasons numbered 1-4 alphabetically (the numbers don’t correspond to the actual order of the seasons)
Ordinal Variable
- Categorical variable that has a logical order
- Numbers used to label the categories have meaning/there is an order
- e.g. seasons numbered 1-4 in order of the calendar year (there is a meaning to the order)
Notation: j
- Denotes the specific predictor (xj), if there is more than one
- Up to p predictors
Notation: i
- For a predictor xj, denotes the specific observation of that predictor
- Up to n observations
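Putting the two index cards together, a minimal notation sketch (consistent with the cards above):

```latex
x_{ij} = \text{observation } i \text{ of predictor } j,
\qquad i = 1, \dots, n, \quad j = 1, \dots, p
```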
Supervised Learning
- We have a y (dependent variable)
- Focus on predicting y based on x’s (predictors)
Unsupervised Learning
- No y (dependent variable)
- Focus is not on predicting but on finding and explaining patterns/relationships between the x’s and across observations
Regression Problem
- Y is quantitative
Classification Problem
- Y is qualitative (categorical)
Parametric
- A functional form for f is specified
- i.e. the relationship between the x’s (predictors) and y (the dependent variable) is assumed to follow a function with a fixed set of parameters to estimate
Non-Parametric
- There is no specified functional form of f
- i.e. no particular functional form is assumed for the relationship between the x’s and y
- f-hat is algorithmic rather than functional because there are no parameters to estimate
- Need a lot of observations
Supervised Models (13)
- SLR (Simple Linear Regression)
- MLR (Multiple Linear Regression)
- GLM (Generalized Linear Model)
- Ridge
- Lasso
- Weighted Least Squares
- Partial Least Squares
- KNN (K-Nearest Neighbours)
- Decision Trees
- Bagging
- Random Forest
- Boosting
- PCR (Principal Components Regression)
Unsupervised Models (2)
- Cluster Analysis
- PCA (Principal Components Analysis)
Parametric Models (8)
- SLR (Simple Linear Regression)
- MLR (Multiple Linear Regression)
- GLM (Generalized Linear Model)
- Ridge
- Lasso
- Weighted Least Squares
- Partial Least Squares
- PCR (Principal Components Regression)
Non-Parametric Models (5)
- KNN (K-Nearest Neighbours)
- Decision Trees
- Bagging
- Random Forest
- Boosting
Training Data
- Data (observations) used to train/formulate f-hat
f vs f-hat
For the relationship between y (dependent variable) and x’s (its predictors):
- f is the function itself (the relationship itself, that we don’t necessarily know)
- f-hat is our estimation of this function
e
- Error term variable
- Expected value of 0
Two components of an observation from the response variable
- Systematic - expected value of the response variable (our function f)
- Random - error term
- Aka signal plus noise
Signal Plus Noise
- Each observation of Y is made up of two parts:
- Systematic (our function f)
- Random (error term e)
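As an equation (matching the e and f cards above):

```latex
Y = f(X) + e, \qquad \mathbb{E}[e] = 0
```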
Bayes Classifier
- Assigns each observation to the most probable category given its predictor values
- Best decision function: using it minimizes the test error rate
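In symbols (standard form; not written out on the card):

```latex
\hat{y}_0 = \arg\max_{k} \Pr(Y = k \mid X = x_0)
```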
Decision Function
- Function (f) for classification problems that decides which category Y (dependent variable) belongs to
Objectives to supervised learning (2)
- Prediction - predicting values of y based on x’s
- Inference - understanding the impact of changes in x’s on the value of y
Flexibility
- Describes how closely f-hat can follow the data
- Related to prediction (a more flexible model can produce more accurate predictions on the training data; see Flexibility vs Accuracy)
- Rougher fit = more flexible f-hat
- Smoother fit = less flexible f-hat
Interpretability
- Ability to understand what the model is doing (components, parameters)
- Related to inference (easier to explain the specifics in the relationship between x’s and y if we understand what the model is doing)
Flexibility: Rougher Fit
- More flexible f-hat
- Often more parameters
Flexibility: Smoother Fit
- Less flexible f-hat
- Often fewer parameters (simpler function)
Flexibility vs Interpretability
- Inverse relationship
- As flexibility increases, we can make more accurate predictions (on the training data), but more parameters make the model harder to understand/interpret
Flexibility vs Accuracy
- More flexibility doesn’t always mean more accurate predictions in general
- It means more accurate predictions on the training data only
MSE
- Mean squared error
- Measures error in regression models
- We want this number to be small (smaller MSE means more accurate)
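The formula, where f-hat(x_i) is the prediction for observation i:

```latex
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \hat{f}(x_i) \bigr)^2
```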
Training MSE vs Flexibility of f-hat
- Inverse relationship
- Training MSE decreases as flexibility of f-hat increases
Overfitting
- Happens when f-hat fits the training data too closely
- Won’t carry over well to new data (test data) so predictions on the test data won’t be as accurate
- Often happens when f-hat is too flexible (modelled too closely to the training data)
- Too rough a fit
Underfitting
- f-hat is not flexible enough to capture the relationship between y and the x’s
- Too smooth a fit, not flexible enough
Training vs Test MSE
- Training MSE is not always a good indicator of model accuracy because minimizing the training MSE only means that accuracy is maximized on the training data, not the test data.
- So, test MSE is a better indicator of model accuracy
Training MSE
- Mean squared error based on the training data
- Goes down as flexibility increases
- Not the best indicator of model accuracy because based only on the training data
Test MSE
- Mean squared error based on the test data (observations not used to train f-hat)
- This makes it a better indicator of model accuracy
- U shaped as flexibility increases
- Not flexible enough means too smooth a fit (underfitting); the relationship between the x’s and y is not fully captured
- Too flexible means too rough a fit (overfitting); f-hat is fitted too closely to the training data, so accuracy declines on the test data
- So the best test MSE is usually produced by a moderately flexible model (illustrated in the sketch below)
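A minimal Python sketch of the U shape (the data, the true f, and all settings below are illustrative assumptions, not from the source): training MSE keeps falling as the polynomial degree grows, while test MSE is typically lowest at a moderate degree.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):                        # the true (unknown) f: quadratic signal
    return x**2 - x

# Simulated training and test data: signal plus noise
x_train = rng.uniform(-2, 2, 50)
x_test = rng.uniform(-2, 2, 50)
y_train = f(x_train) + rng.normal(0, 0.5, 50)
y_test = f(x_test) + rng.normal(0, 0.5, 50)

for degree in (1, 2, 10):        # increasing flexibility
    coefs = np.polyfit(x_train, y_train, degree)   # fit f-hat on training data only
    mse_train = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    mse_test = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```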
Bias-Variance Tradeoff
- We want both variance and bias to be low
- Increasing flexibility increases variance though it decreases bias.
- Decreasing flexibility decreases variance, but it increases bias.
Irreducible error
- Var(e): variance in y (the dependent variable) that can’t be explained by any f-hat, no matter how well chosen
Reducible error
- Var(f-hat) + (Bias(f-hat))^2
- The variance in y that can be reduced by choosing the best model
- Want to balance: want low variance and low bias though there is a tradeoff between the two
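Putting the three error cards together, the expected test MSE at a point x_0 decomposes as (a standard identity, matching the Var(f-hat) + (Bias(f-hat))^2 expression above):

```latex
\mathbb{E}\Bigl[\bigl(y_0 - \hat{f}(x_0)\bigr)^2\Bigr]
= \underbrace{\operatorname{Var}\bigl(\hat{f}(x_0)\bigr)
  + \bigl[\operatorname{Bias}\bigl(\hat{f}(x_0)\bigr)\bigr]^2}_{\text{reducible}}
  + \underbrace{\operatorname{Var}(e)}_{\text{irreducible}}
```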
Variance
- How f-hat changes when different training data is used
- Want this to be low (little variability between sets of training data)
- Bigger variance means f-hat changes more depending on the training data used
Bias
- How close f-hat is to the actual shape of f
- Want this to be low (close)
Flexibility-Variance-Squared Bias Relationship
- F low - V low - B high
- As flexibility decreases, variance also decreases but bias increases (underfitting)
- F high - V high - B low
- As flexibility increases, variance also increases but bias decreases
Flexibility-Variance Relationship
- As flexibility increases, so does variance
- Because as flexibility increases, the model gets more specifically fit to that particular set of training data, so there is more variance in the shape of f-hat when using different training data.
Flexibility-Bias Relationship
- As flexibility increases, bias decreases
- By increasing flexibility we are able to get f-hat closer to the actual shape of f, which means squared bias decreases.
- Bias arises/grows when f-hat is not flexible enough (too simple) to capture the patterns and shape of f (underfitting).
Test Error Rate
- Measure for classification model error
- Uses the indicator function I: 1 if the prediction is incorrect (y_i ≠ y-hat_i), 0 otherwise
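As a formula, over n predictions:

```latex
\text{Error rate} = \frac{1}{n} \sum_{i=1}^{n} I\bigl( y_i \neq \hat{y}_i \bigr)
```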
Bayes Error Rate
- The test error rate obtained when the Bayes classifier produces the predictions y-hat
- This is the minimum possible test error rate, which is why the Bayes classifier is the best decision function.
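In symbols (the standard expression, not spelled out on the card):

```latex
\text{Bayes error rate} = 1 - \mathbb{E}\Bigl[ \max_{k} \Pr(Y = k \mid X) \Bigr]
```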
k-Nearest Neighbours Steps
- Find the location of the observation in the domain of X1,…,Xp. This is the centre.
- Identify the k nearest training observations to the centre.
- The most frequent category of the k training observations is the prediction y-hat (sketched in code below).
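A minimal Python sketch of these three steps (the function name, toy data, and labels are my own assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x0, k):
    """Predict the category of x0 using the k nearest neighbours."""
    # Steps 1-2: Euclidean distance from the centre x0, then the k nearest
    distances = np.sqrt(((X_train - x0) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Step 3: the most frequent category among the k neighbours is y-hat
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage (made-up data): two clusters labelled "A" and "B"
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_classify(X, y, np.array([0.5, 0.5]), k=3))  # prints A
```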
Distance used for k-Nearest Neighbours Method
- Euclidean distance
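For a training observation x_i and centre x_0 with p predictors:

```latex
d(x_i, x_0) = \sqrt{ \sum_{j=1}^{p} \bigl( x_{ij} - x_{0j} \bigr)^2 }
```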
k-Nearest Neighbours: Size of k
- k too large: observations are too far away from the centre of the neighbourhood so predictions are too general.
- k too small: predictions are unstable/volatile (dependent on only a few observations).
- Want a middle-sized k because of the bias-variance tradeoff.
k-Nearest Neighbours: k vs. Flexibility Relationship
- k is inversely related to flexibility.
- A small k means y-hat is very dependent on a small number of observations, so flexibility is high (very tailored to those few observations).
- A large k means y-hat is very generalized, so flexibility is low.
Smooth fit = ? flexibility
Less flexibility
Rough fit = ? flexibility
More flexibility