Topic 2: Machine Learning: Introduction to Algorithms Flashcards
Explain why we estimate a function with data
We estimate f for two main reasons: prediction and inference.
What is the role of error terms (reducible and
irreducible) and why is the irreducible error larger than zero?
Reducible error is the part of the error that can be reduced by improving the estimate of f (e.g. by using a more appropriate statistical learning technique).
Irreducible error is the part of the error that cannot be reduced, because Y is also a function of the error term ε, which cannot be predicted from X.
The irreducible error is larger than zero because ε may contain unmeasured variables that are useful in predicting Y, as well as unmeasurable variation.
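In the usual setup, where Y = f(X) + ε with E[ε] = 0, this split can be written compactly (holding the estimate f̂ and the inputs X fixed):

```latex
\[
E\big[(Y - \hat{Y})^2\big]
  = \underbrace{\big[f(X) - \hat{f}(X)\big]^2}_{\text{reducible}}
  + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible}}
\]
```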
Difference between prediction and inference
Prediction uses X to predict Y (f is treated as a black box).
Inference is about estimating f to understand how Y changes with X, not necessarily to make predictions for Y (f cannot be treated as a black box).
Difference between a parametric and non-parametric approach when applying a statistical learning method to the training data.
A parametric approach involves a two-step, model-based procedure:
- Make an assumption about the functional form, or shape, of f (e.g. that f is linear).
- Select a procedure that uses the training data to fit or train the model (e.g. OLS).
A non-parametric approach does not make an explicit assumption about the functional form or shape of f. The downside is that a very large number of observations is needed to obtain an accurate estimate of f.
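A minimal sketch of the contrast, assuming scikit-learn and simulated data (the data, the linear model, and the KNN settings are illustrative choices, not prescribed by the flashcard):

```python
# Sketch: parametric (linear) vs non-parametric (KNN) regression on simulated data.
# scikit-learn is assumed to be available; the data and settings are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # the true f is non-linear

# Parametric: first assume a linear functional form, then fit its coefficients (OLS).
linear = LinearRegression().fit(X, y)

# Non-parametric: no assumed form; predictions are averages of nearby training points.
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)

x_new = np.array([[2.5]])
print("linear prediction:", linear.predict(x_new))
print("knn prediction:   ", knn.predict(x_new))
```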
Describe the trade-offs between prediction accuracy, flexibility, and model interpretability, including the role of overfitting.
As the flexibility of a method increases, its interpretability decreases. Highly flexible methods also have a greater potential for overfitting the data.
Reason that we might prefer a more restrictive model
When we are mainly interested in inference, more restrictive models are preferable because they are much more interpretable.
When is a supervised learning model preferable to unsupervised?
Supervised learning is preferable when there is a response variable and you want to fit a model that relates the response to the predictors (e.g. for prediction or inference).
In unsupervised learning there is no response variable that can supervise the analysis (e.g. clustering).
Difference between quantitative and qualitative problems?
Quantitative response -> regression problems (numerical values)
Qualitative response -> classification problems (categorical values / classes)
Interpret the Mean Squared Error (MSE)
The MSE will be small if the predicted responses are very close to the true responses, and large if for some observations the predicted and true responses differ substantially.
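In symbols, with n observations and fitted values f̂(x_i) (the standard definition):

```latex
\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \hat{f}(x_i) \big)^2
\]
```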
Explain the goal of measuring the quality of fit by minimizing training and test mean square errors (MSEs)
Quality of fit -> how well the predictions match the observed data.
The quality of fit is measured by the MSE; we want to choose the method that gives the lowest test MSE (not the lowest training MSE), since we care about accuracy on previously unseen data.
Implications of different levels of flexibility
(degrees of freedom) for both training and test MSEs.
As model flexibility increases, training MSE will decrease, but the test MSE may not.
A small training MSE combined with a large test MSE is the hallmark of overfitting the data.
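A quick way to see this pattern, assuming simulated data and polynomial fits of increasing degree standing in for increasingly flexible methods (all settings illustrative only):

```python
# Sketch: training vs test MSE as flexibility (polynomial degree) increases.
# Illustrative only: simulated data, with polynomial fits standing in for
# increasingly flexible methods.
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-3, 3, n)
    y = x**2 + rng.normal(scale=1.0, size=n)  # the true f is quadratic
    return x, y

x_train, y_train = make_data(50)
x_test, y_test = make_data(200)

for degree in (1, 2, 5, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.2f}  test MSE={test_mse:.2f}")
```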
What does it mean when a method is overfitting the data?
The method is working too hard to find patterns in the training data and ends up picking up spurious patterns that are caused by random chance rather than by true properties of the unknown function f.
Explain the purpose of cross-validation.
Cross-validation is a method for estimating the test MSE using only the training data, by repeatedly holding out part of the training data and evaluating the model on the held-out part.
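A minimal 5-fold sketch, assuming scikit-learn and using a placeholder linear model on simulated data:

```python
# Sketch: estimating the test MSE from the training data with 5-fold cross-validation.
# scikit-learn is assumed; the linear model and simulated data are placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Each of the 5 folds is held out once while the model is trained on the other 4.
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
print(f"cross-validated MSE estimate: {-scores.mean():.3f}")
```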
Explain the bias-variance trade-off with an MSE decomposition into three fundamental quantities.
Expected test MSE can be decomposed into:
- the variance of f̂ (how much the estimate would change if it were fit on a different training set)
- the squared bias of f̂ (the error from approximating the true f with a simpler model)
- the variance of the error term ε (the irreducible error)
Lower bias (a closer fit to the training data) typically comes at the cost of higher variance, and vice versa; a good method keeps both low.
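Written out for a test point x0, the standard decomposition of the expected test MSE is:

```latex
\[
E\big[(y_0 - \hat{f}(x_0))^2\big]
  = \operatorname{Var}\big(\hat{f}(x_0)\big)
  + \big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \operatorname{Var}(\varepsilon)
\]
```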
Ideal machine learning algorithm characteristics
The algorithm has low bias (it can model the true relationship accurately) and low variance (it produces consistent predictions across different training sets).
Describe the features of a Bayes classifier (two classes)
A Bayes classifier assigns each observation to the most likely class given its predictor values.
The Bayes decision boundary is the line representing the points where the conditional probability of belonging to each class is exactly 50%.
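In symbols, the classifier assigns a test observation x0 to the class with the largest conditional probability; with two classes, the boundary is where that probability equals one half:

```latex
\[
\hat{y}_0 = \arg\max_j \; \Pr(Y = j \mid X = x_0),
\qquad
\text{boundary: } \Pr(Y = 1 \mid X = x_0) = 0.5
\]
```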
What is the Bayes error rate?
It is the lowest possible test error rate achievable in classification, and it is the error rate produced by the Bayes classifier.
Apply/calculate the Bayes error rate
At a given point X = x0, the error rate is 1 - max_j Pr(Y = j | X = x0), i.e. one minus the probability of the most likely class.
The overall Bayes error rate averages this over X: 1 - E[max_j Pr(Y = j | X)].
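A worked example with made-up numbers: suppose that at some point x0 with two classes, Pr(Y = 1 | X = x0) = 0.8. The Bayes classifier predicts class 1 there and errs with probability 0.2; averaging these values over X gives the overall Bayes error rate.

```latex
\[
\Pr(Y = 1 \mid X = x_0) = 0.8
\;\Longrightarrow\;
1 - \max_j \Pr(Y = j \mid X = x_0) = 1 - 0.8 = 0.2
\]
```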
How is K-nearest neighbor classifier related to the Bayes classifier?
KNN attempts to estimate the conditional distribution of Y given X, and then classifies a given observation to the class with the highest estimated probability (mimicking the Bayes classifier with an estimated, rather than known, conditional distribution).
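The estimate it uses, in the standard notation where N0 is the set of the K training points closest to x0 and I(·) is an indicator:

```latex
\[
\widehat{\Pr}(Y = j \mid X = x_0)
  = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)
\]
```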
What are the effects of a low K and high K in K-nearest neighbor?
A low K gives a highly flexible fit (low bias) on the training data but very high variance, so the test error can suffer.
A high K gives a less flexible fit (higher bias) but lower variance on the test data; the decision boundary becomes closer to linear as K increases.
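A small sketch of this trade-off, assuming scikit-learn and simulated two-class data (all settings illustrative):

```python
# Sketch: how the choice of K affects training and test error for a KNN classifier.
# Illustrative only: simulated two-class data, scikit-learn assumed.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)

def make_data(n):
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.7, size=n) > 0).astype(int)
    return X, y

X_train, y_train = make_data(200)
X_test, y_test = make_data(1000)

for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err = 1 - knn.score(X_train, y_train)
    test_err = 1 - knn.score(X_test, y_test)
    print(f"K={k:3d}  train error={train_err:.3f}  test error={test_err:.3f}")
```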