SRM Chapter 1 Flashcards
1
Q
Response/Output/Dependent Variable
A
- Y
- Variable we want to predict using other (explanatory) variables
2
Q
Explanatory/Input/Independent Variable
A
- Xj’s
- Variables we use to predict the dependent variable
- We want to study the relationship between these and the dependent variable
- Aka predictor, feature
3
Q
Count Variable
A
- Quantitative
- Variable that takes on non-negative integers (discrete)
4
Q
Continuous Variable
A
- Quantitative
- Takes on continuous values within an interval
5
Q
Categorical Variable
A
- Qualitative
- Takes on different categories
- Categories are aka classes or levels
- Each category is given a number
6
Q
Nominal Variable
A
- Categorical variable that has no logical order
- The numbers don’t have any meaning, they just differentiate/label the categories
- e.g. seasons numbered 1-4 alphabetically (the numbers only label the categories; they say nothing about the seasons themselves)
7
Q
Ordinal Variable
A
- Categorical variable that has a logical order
- Numbers used to label the categories have meaning/there is an order
- e.g. seasons numbered 1-4 in order of the calendar year (there is a meaning to the order)
8
Q
Notation: j
A
- Denotes the specific predictor (xj), if there is more than one
- Up to p predictors
9
Q
Notation: i
A
- For a predictor xj, denotes the specific observation of that predictor
- Up to n observations
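
Combining the two notation cards above: each data value x_ij is observation i of predictor j, giving an n-by-p layout. A small LaTeX sketch of that convention (notation only; requires amsmath):

```latex
% x_{ij}: observation i (row) of predictor j (column)
X = \begin{pmatrix}
  x_{11} & x_{12} & \cdots & x_{1p} \\
  x_{21} & x_{22} & \cdots & x_{2p} \\
  \vdots & \vdots & \ddots & \vdots \\
  x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
```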
10
Q
Supervised Learning
A
- We have a y (dependent variable)
- Focus on predicting y based on x’s (predictors)
11
Q
Unsupervised Learning
A
- No y (dependent variable)
- Focus is not on predicting but on finding and explaining patterns/relationships between the x’s and across observations
12
Q
Regression Problem
A
- Y is quantitative
13
Q
Classification Problem
A
- Y is qualitative (categorical)
14
Q
Parametric
A
- A functional form of f is specified
- i.e. the relationship between the x’s (predictors) and y (the dependent variable) can be expressed as a function
15
Q
Non-Parametric
A
- There is no specified functional form of f
- i.e. there is no function that describes the relationship between the x’s and y
- f-hat is algorithmic rather than functional because there are no parameters to estimate
- Need a lot of observations
16
Q
Supervised Models (13)
A
- SLR (Simple Linear Regression)
- MLR (Multiple Linear Regression)
- GLM (Generalized Linear Model)
- Ridge
- Lasso
- Weighted Least Squares
- Partial Least Squares
- KNN (K-Nearest Neighbours)
- Decision Trees
- Bagging
- Random Forest
- Boosting
- PCR (Principal Components Regression)
17
Q
Unsupervised Models (2)
A
- Cluster Analysis
- PCA (Principal Components Analysis)
18
Q
Parametric Models (8)
A
- SLR (Simple Linear Regression)
- MLR (Multiple Linear Regression)
- GLM (Generalized Linear Model)
- Ridge
- Lasso
- Weighted Least Squares
- Partial Least Squares
- PCR (Principal Components Regression)
19
Q
Non-Parametric Models (5)
A
- KNN (K-Nearest Neighbours)
- Decision Trees
- Bagging
- Random Forest
- Boosting
20
Q
Training Data
A
- Data (observations) used to train/formulate f-hat
21
Q
f vs f-hat
A
For the relationship between y (dependent variable) and x’s (its predictors):
- f is the true function (the actual relationship, which we don’t necessarily know)
- f-hat is our estimation of this function
22
Q
e
A
- Error term variable
- Expected value of 0
23
Q
Two components of an observation of the response variable
A
- Systematic - expected value of the response variable (our function f)
- Random - error term
- Aka signal plus noise
24
Q
Signal Plus Noise
A
- Each observation of Y is made up of two parts:
- Systematic (our function f)
- Random (error term e)
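
Written as an equation, the signal-plus-noise idea from cards 23-24 (using the error term e with expected value 0, as in card 22):

```latex
% systematic part (signal) + random part (noise)
y = f(x_1, \ldots, x_p) + e, \qquad E[e] = 0
```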
25
Q
Bayes Classifier
A
- Best decision function
- When the Bayes Classifier is used, the test error rate is minimized
26
Q
Decision Function
A
- Function (f) for classification problems that decides which category Y (dependent variable) belongs to
27
Q
Objectives of supervised learning (2)
A
1. Prediction - predicting values of y based on x's
2. Inference - understanding the impact of changes in x's on the value of y
28
Q
Flexibility
A
- Describes how closely f-hat can follow the data
- Related to prediction (a more flexible f-hat can follow the training data more closely)
- Rougher fit = more flexible f-hat
- Smoother fit = less flexible f-hat
29
Q
Interpretability
A
- Ability to understand what the model is doing (components, parameters)
- Related to inference (easier to explain the specifics in the relationship between x's and y if we understand what the model is doing)
30
Q
Flexibility: Rougher Fit
A
- More flexible f-hat
- Often more parameters
31
Q
Flexibility: Smoother Fit
A
- Less flexible f-hat
- Often fewer parameters (simpler function)
32
Q
Flexibility vs Interpretability
A
- Inverse relationship
- As flexibility increases, we can make more accurate predictions on the training data, but more parameters mean the model might be harder to understand/interpret
33
Q
Flexibility vs Accuracy
A
- More flexibility doesn't always mean more accurate predictions *in general*
- It means more accurate predictions *on the training data* only
34
Q
MSE
A
- Mean squared error
- Measures error in regression models
- We want this number to be small (smaller MSE means more accurate)
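
As a quick illustration of the formula behind this card, a minimal Python sketch of MSE = (1/n) * sum((y_i - y_hat_i)^2); the y and y_hat values are hypothetical:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average squared difference between
    observed values y and predictions y_hat (smaller = more accurate)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

# Hypothetical observed and predicted values
print(mse([3.0, 5.0, 7.5], [2.5, 5.5, 7.0]))  # -> 0.25
```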
35
Q
Training MSE vs Flexibility of f-hat
A
- Inverse relationship
- Training MSE decreases as flexibility of f-hat increases
36
Q
Overfitting
A
- Happens when f-hat fits the training data too closely
- Won't carry over well to new data (test data), so predictions on the test data won't be as accurate
- Often happens when f-hat is too flexible (modelled too closely to the training data)
- Too rough a fit, too flexible
37
Q
Underfitting
A
- f-hat is not flexible enough to capture the relationships between y and the x's
- Too smooth a fit, not flexible enough
38
Q
Training vs Test MSE
A
- Training MSE is not always a good indicator of model accuracy because minimizing the training MSE only means that accuracy is maximized on the training data, not the testing data.
- So, test MSE is a better indicator of model accuracy
39
Q
Training MSE
A
- Mean squared error based on the training data
- Goes down as flexibility increases
- Not the best indicator of model accuracy because based only on the training data
40
Q
Test MSE
A
- Mean squared error based on the test data (observations not used to train f-hat)
- This makes it a better indicator of model accuracy
- U-shaped as flexibility increases
- Not flexible enough means that it's too smooth of a fit (underfitting); the relationship between x's and y is not captured enough
- Too flexible means that it's too rough of a fit (overfitting); f-hat is too closely fitted to the training data but on the test data accuracy declines
- So the best test MSE is usually produced by a moderately flexible model
41
Q
Bias-Variance Tradeoff
A
- We want both variance and bias to be low
- Increasing flexibility increases variance though it decreases bias.
- Decreasing flexibility decreases variance, but it increases bias.
42
Q
Irreducible error
A
- Variance in y (dependent variable) that can't be explained by any model, even the true f
- Equal to Var(e), the variance of the error term
43
Q
Reducible error
A
- Var(f-hat) + (Bias(f-hat))^2
- The variance in y that can be reduced by choosing the best model
- Want to balance: want low variance and low bias though there is a tradeoff between the two
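
Cards 42-43 combine into the standard decomposition of expected test MSE at a new point x_0 (reducible plus irreducible error; LaTeX, requires amsmath):

```latex
% expected test MSE = reducible error + irreducible error
E\big[(y_0 - \hat{f}(x_0))^2\big]
  = \underbrace{\operatorname{Var}(\hat{f}(x_0))
      + [\operatorname{Bias}(\hat{f}(x_0))]^2}_{\text{reducible}}
  + \underbrace{\operatorname{Var}(e)}_{\text{irreducible}}
```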
44
Q
Variance
A
- How f-hat changes when different training data is used
- Want this to be low (little variability between sets of training data)
- Bigger variance means f-hat changes more depending on the training data used
45
Q
Bias
A
- How close f-hat is to the actual shape of f
- Want this to be low (close)
46
Q
Flexibility-Variance-Squared Bias Relationship
A
- F low - V low - B high
- As flexibility decreases, variance also decreases but bias increases (underfitting)
- F high - V high - B low
- As flexibility increases, variance also increases but bias decreases
47
Q
Flexibility-Variance Relationship
A
- As flexibility increases, so does variance
- Because as flexibility increases, the model gets more specifically fit to that particular set of training data, so there is more variance in the shape of f-hat when using different training data.
48
Q
Flexibility-Bias Relationship
A
- As flexibility increases, bias decreases
- By increasing flexibility we are able to get f-hat closer to the actual shape of f, which means squared bias decreases.
- Bias grows when f-hat is not flexible enough (too simple) to capture the patterns and shape of f (underfitting).
49
Q
Test Error Rate
A
- Measure for classification model error
- Uses I (indicator function): 1 if the prediction is incorrect, 0 otherwise
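
A minimal Python sketch of the test error rate as an average of the indicator I(y_i != y_hat_i); the category labels are hypothetical:

```python
import numpy as np

def test_error_rate(y, y_hat):
    """Proportion misclassified: the indicator is 1 when a
    prediction is wrong and 0 when it is correct."""
    return np.mean(np.asarray(y) != np.asarray(y_hat))

# Hypothetical true vs. predicted categories (one of four is wrong)
y = ["spring", "summer", "fall", "winter"]
y_hat = ["spring", "summer", "winter", "winter"]
print(test_error_rate(y, y_hat))  # -> 0.25
```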
50
Q
Bayes Error Rate
A
- Using the Bayes classifier in place of y-hat in the test error rate indicator function
- When this is used, the test error rate is at a minimum and the Bayes classifier is the best decision function.
51
Q
k-Nearest Neighbours Steps
A
1. Find the location of the observation in the domain of X1,...,Xp. This is the centre.
2. Identify the k nearest training observations to the centre.
3. The most frequent category of the k training observations is the prediction y-hat.
52
Q
Distance used for k-Nearest Neighbours Method
A
- Euclidean distance
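
A bare-bones Python sketch of the three kNN steps from card 51, using the Euclidean distance from card 52; the training data is hypothetical:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x0, k):
    """k-nearest neighbours classification:
    1. treat x0 as the centre of the neighbourhood,
    2. find the k training observations closest to it (Euclidean distance),
    3. predict the most frequent category among those k neighbours."""
    dists = np.sqrt(((X_train - x0) ** 2).sum(axis=1))  # distance to each row
    nearest = np.argsort(dists)[:k]                     # indices of k closest rows
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Hypothetical training data: two predictors, two categories
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = ["A", "A", "B", "B"]
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "A"
```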
53
Q
k-Nearest Neighbours: Size of k
A
- k too large: observations are too far away from the centre of the neighbourhood so predictions are too general.
- k too small: observations are unstable/volatile (dependent on a small few).
- Want a middle-sized k because of the bias-variance tradeoff.
54
Q
k-Nearest Neighbours: k vs. Flexibility Relationship
A
- k is inversely related to flexibility.
- A small k means y-hat is very dependent on a small number of observations, so flexibility is high (very tailored to those few observations).
- A large k means y-hat is very generalized, so flexibility is low.
55
Q
Smooth fit = ? flexibility
A
- Less flexibility
56
Q
Rough fit = ? flexibility
A
- More flexibility