Model training and tuning Flashcards
Validation data set
- The best use of a validation subset is to compare candidate models and find the best one. Once we choose our model, we then train on all the training data and test on the testing data
- Use k-fold cross-validation (rather than a single hold-out validation split) when splitting the data twice would leave the training set too small
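A minimal sketch of this workflow in scikit-learn: compare candidates with 5-fold cross-validation on the training set, then retrain the winner on all training data and score it once on the held-out test set. The dataset and the two candidate models are illustrative choices, not from the notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}
# k-fold CV scores each candidate on held-out folds of the training set
cv_scores = {name: cross_val_score(m, X_train, y_train, cv=5).mean()
             for name, m in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

# Retrain the chosen model on ALL the training data, evaluate once on test
best = candidates[best_name].fit(X_train, y_train)
test_acc = best.score(X_test, y_test)
```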
Bias - Variance tradeoff
- Total Error = Bias^2 + Variance + Irreducible Error
- Bias is difference between our estimated model (f hat) and the real (unknown) model (f)
- Variance is the variability of the estimate: high-variance models give different predictions for the same input when trained on slightly different data, i.e. a small change in the input results in a large change in the output
- Best case model is low bias and low variance
High variance = overfitting. High bias = underfitting. When we plot model error against model complexity:
- As complexity increases, the variance component of error increases
- As complexity increases, the bias component of error decreases
- We want to find the “sweet spot” in complexity where total error is minimised
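The trade-off above can be seen numerically by sweeping model complexity (here, polynomial degree) and watching validation error; the data, degrees, and noise level are illustrative assumptions. A degree-1 fit underfits (high bias), while a mid degree sits near the sweet spot.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)  # noisy true f
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

val_errors = {}
for degree in (1, 3, 12):  # increasing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    val_errors[degree] = mean_squared_error(y_val, model.predict(X_val))
# degree 1 underfits (high bias); degree 3 is near the sweet spot
```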
Using the learning curve to evaluate the model
- To detect if the model is underfitting or overfitting, and the impact of the training data size on the error
- Learning curves: plot training dataset and validation dataset error or accuracy against training set size
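scikit-learn's `learning_curve` computes exactly these numbers: training and validation scores at several training-set sizes, ready to plot. The dataset and estimator are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

train_mean = train_scores.mean(axis=1)  # mean accuracy per training size
val_mean = val_scores.mean(axis=1)      # plot both against `sizes`
```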
High bias model
- Both our training and validation accuracy are well below what we need (underfitting)
- The model is too simple, not capturing all the features of the true f
- Solutions can be to increase the number of features or decrease the degree of regularisation
High variance model
- Training accuracy is high but validation accuracy is much lower (i.e. overfitting)
- Model is too complex
- Solutions can be to add more training data, or reduce model complexity by decreasing the number of features or increasing regularisation
Regularisation
- A way to reduce overfitting by adding a penalty score for complexity to the cost function
- Allows us to find a balance between the important features and the overfitting of the model
Regularisation in Linear Regression
- Cost function = sum of squared errors (SSE)
- Add a penalty based on the size of the feature weights
- Alpha represents regularisation strength - the larger the alpha, the larger the penalty
- We minimise the SSE plus the penalty term (alpha itself is a fixed hyperparameter, chosen by tuning)
- First scale all variables (i.e. standardise, normalise)
- Weights are attributed to each of the features based on importance in the model
- Large weights correspond to higher model complexity, so we can regularise by penalising large weights
- Some implementations express the strength of regularisation as C = 1/alpha
- Small c = stronger regularisation
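A small demonstration of C = 1/alpha in scikit-learn's `LogisticRegression` (the dataset is illustrative): a smaller C means stronger regularisation, which shrinks the weight vector. Features are standardised first, as the notes recommend.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)  # scale variables before regularising

weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)   # weak penalty
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)  # strong penalty

# The norm of the coefficient vector shrinks as regularisation strengthens
weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
```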
L1 (lasso) regularisation
After fitting the model, we have a coefficient for each feature. The L1 regularisation penalty is the sum of the absolute values of the coefficients across all features
- Can reduce certain feature weights to zero (i.e. feature selection)
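A sketch of that feature-selection effect with scikit-learn's `Lasso` (data and alpha are illustrative): only the first two features carry signal, and L1 drives most of the noise-feature weights exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features carry signal; the other eight are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0))  # noise features zeroed out
```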
L2 (ridge) regularisation
As per L1, but we square each coefficient and sum them (instead of summing absolute values)
- Cannot carry out feature selection as not possible to reduce feature weights to zero (can only get very close to zero)
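The same setup with `Ridge` (illustrative data and alpha) shows the contrast: L2 shrinks all weights towards zero but leaves none exactly zero, so it cannot select features.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
# Weights get small but stay nonzero: no implicit feature selection
n_zero = int(np.sum(ridge.coef_ == 0))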
Hyperparameter tuning methods
Grid Search
- Allows you to search for the best parameter combination over a set of parameters
- If we plot parameter 1 against parameter 2 and consider each combination of parameters (i.e. the grid), then we try out every possibility to find the best
- Very computationally intensive
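A minimal `GridSearchCV` sketch (the SVM estimator and parameter values are illustrative): every combination in the grid is cross-validated, which is why cost grows multiplicatively with grid size.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)  # cross-validates 3 x 3 = 9 combinations, 3 folds each
best_params = search.best_params_
```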
Randomised Search (RandomizedSearchCV in scikit-learn)
- Automatically samples random values for each parameter from a specified distribution, trying a fixed number of combinations (cheaper than exhaustive grid search)
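A corresponding sketch with scikit-learn's `RandomizedSearchCV` (distributions and n_iter are illustrative): parameter values are drawn from distributions, and `n_iter` caps the cost regardless of how fine the search space is.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
# Draw C and gamma from log-uniform distributions rather than a fixed grid
param_dist = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e0)}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=3,
                            random_state=0)
search.fit(X, y)  # only 10 random draws are evaluated, 3 folds each
```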
Model tuning: training data too small
Sample and label more data if possible
- Consider creating synthetic data (duplication, or techniques like SMOTE)
- Synthetic Minority Over-sampling Technique:
Take a sample from the dataset and find its k nearest neighbours in feature space. To create a synthetic data point: take the vector from the current point to one of those k neighbours, multiply it by a random number x between 0 and 1, and add the result to the current data point
- Training data doesn’t need to be exactly representative but test data does
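The SMOTE step described above can be sketched in plain NumPy (the imbalanced-learn library provides a full implementation; the toy minority-class data here is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))  # toy minority-class samples

point = minority[0]
# Find k=3 nearest neighbours in feature space (index 0 is the point itself)
dists = np.linalg.norm(minority - point, axis=1)
neighbours = minority[np.argsort(dists)[1:4]]

# Pick one neighbour, take the vector to it, scale by random x in (0, 1)
neighbour = neighbours[rng.integers(3)]
x = rng.uniform(0, 1)
synthetic = point + x * (neighbour - point)
# The synthetic point lies on the segment between point and neighbour
```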
Model tuning: Feature set tuning
- Adding features that help capture patterns for classes of errors
- We may add new features if we have too few features resulting in high bias or low flexibility
- This may include different transformations of the same feature (e.g. squared, log, sqrt etc.)
- Add interaction terms
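One common way to add such transformed and interaction features is scikit-learn's `PolynomialFeatures` (the two-feature input here is illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample with two features x1, x2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_new = poly.fit_transform(X)
# Output columns: x1, x2, x1^2, x1*x2, x2^2
```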
Model tuning: Dimensionality reduction
- Too many features, high variance, too complex/flexible, tends to overfit
- We can reduce the number of features while trying to keep the majority of info in the dataset
- Curse of dimensionality: for some algorithms (such as k-nearest neighbours), with too many dimensions the data points become so sparse (distances between points grow and stop being informative) that the algorithm doesn’t give us a good grouping in this high-dimensional space
- Dimensionality reduction also occurs through L1 Regularisation (as discussed above) where the weight of the coefficient is reduced to 0
Model tuning: Feature selection
Removing some of the features from the model such that the remaining selected features give better performance; the selected features themselves are unchanged
Model tuning: Feature extraction
- Combine features to generate a new set of features to be used by the model. This new set is generally smaller than the original feature set
- Maps data into smaller feature space that captures the bulk of the information in the data (aka data compression)
- Improves computational efficiency
- Reduces the curse of dimensionality
- PCA (patterns in the correlations between features)
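A small PCA sketch (the correlated toy data is an assumption): six features that are mixtures of three underlying signals compress into three components while keeping almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
# Six features that are correlated mixtures of three underlying signals
X = base @ rng.normal(size=(3, 6)) + rng.normal(0, 0.05, size=(300, 6))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)  # 6 features -> 3 components
explained = pca.explained_variance_ratio_.sum()  # variance retained
```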