ML Flashcards
What is bias in ML models?
Bias is the difference between the average model prediction and the ground truth. A model with high bias makes strong assumptions about the underlying patterns in the data, and these assumptions may not accurately reflect the true relationship between the input features and the target variable.
Think of the bullseye visualization. High bias means the predictions land off center; low bias means they’re centered on the target.
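One standard way to write this, with f the true function, f̂ the fitted model, and the expectation taken over models trained on different training sets:

$$\mathrm{Bias}\big[\hat{f}(x)\big] = \mathbb{E}\big[\hat{f}(x)\big] - f(x)$$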
What is variance in ML models?
Model variance refers to the sensitivity of a model to the variations in the training data. Specifically, model variance measures how much the model’s predictions fluctuate when trained on different subsets of the training data.
Think of the bullseye visualization. High variance means the predictions are spread out; low variance means they’re tightly clustered.
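In the same notation as the bias card above:

$$\mathrm{Var}\big[\hat{f}(x)\big] = \mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^{2}\Big]$$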
Describe the bias-variance tradeoff.
Decreasing bias tends to increase variance and vice versa: low bias models tend to have high variance, and high bias models tend to have low variance. Total error is the sum of squared bias, variance, and irreducible error. To minimize total error, we need to find the right compromise between bias and variance (think of the graph with the bias and variance curves: their sum is U-shaped, and the best model sits at the bottom of the U).
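For squared-error loss, this decomposition has a standard form (using the notation from the bias and variance cards above, with σ² the irreducible noise):

$$\mathbb{E}\Big[\big(y - \hat{f}(x)\big)^{2}\Big] = \mathrm{Bias}\big[\hat{f}(x)\big]^{2} + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^{2}$$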
What are the downsides of high bias / high variance ML models?
High bias models underfit the data and tend to be too simple, while high variance models overfit the data and tend to be too complex (have too many parameters).
How do you identify high bias and high variance models?
High bias models have low accuracy on both the train and test sets. High variance models have high accuracy on the train set but low accuracy on the test set (because they overfit and fail to generalize).
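A minimal scikit-learn sketch of this diagnostic (the synthetic dataset and the choice of an unconstrained decision tree are illustrative assumptions, not part of the card):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, split into train and test sets.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained decision tree is a typical high variance model.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# Large train/test gap -> high variance (overfitting).
# Low accuracy on both  -> high bias (underfitting).
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```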
How do you select the model (high bias or high variance) based on the training data size?
For small data, I would prefer a high bias model. This avoids overfitting to the noise in the small dataset (which is more problematic than in a large dataset, see below). Your high bias model basically “says less” about the patterns in the data (in the sense that a line of best fit says less than a cubic curve of best fit).
For large data, I would prefer a high variance model. Overfitting is less of a concern than in the small data case: with more data points, there is more opportunity for noise to cancel itself out. The high variance model “says more” about the data, and it should! There’s more data!
What does the bias of an estimated model tell us about model capacity?
The bias of an estimated model tells us about the capacity of the model to capture the relationships in our data, i.e., high bias models tend to be too simple and don’t have the capacity to represent the data.
What is ensemble learning?
Combining many weak learners into a single model. It helps balance bias and variance and avoid overfitting.
The two main types are bagging and boosting.
What is bagging?
When you train multiple instances of the same weak learner on different bootstrap samples (random subsets drawn with replacement) of the training data, then make predictions by taking an average (regression) or “vote” (classification) over the outputs of the ensemble of models.
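A minimal sketch with scikit-learn’s BaggingClassifier (the dataset and hyperparameters are illustrative assumptions; the default weak learner is a decision tree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# 50 decision trees (the default base learner), each trained on a bootstrap
# sample of the data; predictions are combined by majority vote.
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())
```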
What is boosting?
When you train a sequence of weak learners of the same type so that each learner more heavily weights the training examples that the previous learner got wrong.
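AdaBoost is the classic example of this reweighting scheme; a minimal scikit-learn sketch (dataset and hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# AdaBoost fits a sequence of shallow decision trees (depth-1 "stumps" by
# default); after each round, misclassified examples are up-weighted so the
# next learner focuses on them.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

print(cross_val_score(boosting, X, y, cv=5).mean())
```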
What is stacking?
Partition the training set in two. Train a heterogeneous group of weak learners (base models) on the first partition, then train a meta-learner on their outputs: let the base models make predictions on the second partition, feed those predictions to the meta-learner, and train the meta-learner on them (using the second partition’s labels as targets).
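A minimal sketch with scikit-learn’s StackingClassifier; note that it generates the base-model predictions via internal cross-validation rather than a fixed two-way split, but the idea is the same (base models and dataset here are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Heterogeneous base models; a logistic regression meta-learner is trained
# on their out-of-fold predictions.
stacking = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
)

print(cross_val_score(stacking, X, y, cv=5).mean())
```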
What is regularization?
Regularization is the penalization of complexity in models. When models are low bias/high variance or the training set is small, they are prone to overfitting, i.e., capturing noise or patterns specific to the training data that don’t generalize well to other examples.
In short, regularization reduces model variance in order to prevent overfitting.
What is L1 regularization?
L1 regularization (Lasso) adds a penalty proportional to the sum of the absolute values of the coefficients to the loss function, encouraging sparsity by driving some coefficients to exactly zero.
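For a linear model with squared-error loss, the L1-penalized objective is commonly written as (λ is the regularization strength):

$$L(\mathbf{w}) = \sum_{i=1}^{n} \big(y_i - \mathbf{w}^{\top}\mathbf{x}_i\big)^{2} + \lambda \sum_{j=1}^{p} \lvert w_j \rvert$$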
What is L2 regularization?
L2 regularization (Ridge) adds a penalty proportional to the sum of the squared coefficients to the loss function, penalizing large weights and promoting a more even distribution of weights (coefficients are shrunk toward zero but not exactly to zero).
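The corresponding L2-penalized objective, in the same notation as the L1 card above:

$$L(\mathbf{w}) = \sum_{i=1}^{n} \big(y_i - \mathbf{w}^{\top}\mathbf{x}_i\big)^{2} + \lambda \sum_{j=1}^{p} w_j^{2}$$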
What are three general types of regularization?
1) Modify the loss function: L1 or L2 regularization
2) Modify the data sampling: data augmentation, k-fold cross-validation
3) Modify the training process: adding noise, dropout
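As an example of the third type, a minimal dropout sketch in PyTorch (the layer sizes and dropout rate are illustrative assumptions):

```python
import torch.nn as nn

# A small feed-forward network with dropout between layers.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zeroes half of the activations each step
    nn.Linear(64, 1),
)

model.train()  # dropout active during training
model.eval()   # dropout disabled at inference time
```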