Model training and tuning Flashcards
Validation data set
- The best use of a validation subset is to compare candidate models and find the best one. Once we choose our model, we then train on all the training data and test on the testing data
- Use k-fold cross-validation (rather than a single hold-out validation split) when splitting the data twice would leave the training set too small
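A minimal sketch of this workflow in scikit-learn: compare candidates with 5-fold cross-validation on the training set, then retrain the winner on all training data and score it once on the held-out test set. The dataset and the two candidate models are illustrative choices, not from the notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}
# k-fold CV scores each candidate on held-out folds of the training set
cv_scores = {name: cross_val_score(m, X_train, y_train, cv=5).mean()
             for name, m in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

# Retrain the chosen model on ALL the training data, evaluate once on test
best = candidates[best_name].fit(X_train, y_train)
test_acc = best.score(X_test, y_test)
```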
Bias - Variance tradeoff
- Total Error = Bias^2 + Variance + Irreducible Error
- Bias is difference between our estimated model (f hat) and the real (unknown) model (f)
- Variance is the variability of the estimate: high-variance models give different predictions for the same input when trained on slightly different data, i.e. a small change in the input results in a large change in the output
- Best case model is low bias and low variance
High variance = overfitting. High bias = underfitting. When we plot model error against model complexity:
- As complexity increases, the variance component of error increases
- As complexity increases, the bias component of error decreases
- We want to find the “sweet spot” in complexity where total error is minimised
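The trade-off above can be seen numerically by sweeping model complexity (here, polynomial degree) and watching validation error; the data, degrees, and noise level are illustrative assumptions. A degree-1 fit underfits (high bias), while a mid degree sits near the sweet spot.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)  # noisy true f
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

val_errors = {}
for degree in (1, 3, 12):  # increasing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    val_errors[degree] = mean_squared_error(y_val, model.predict(X_val))
# degree 1 underfits (high bias); degree 3 is near the sweet spot
```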
Using the learning curve to evaluate the model
- To detect if the model is underfitting or overfitting, and the impact of the training data size on the error
- Learning curves: plot training dataset and validation dataset error or accuracy against training set size
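scikit-learn's `learning_curve` computes exactly these numbers: training and validation scores at several training-set sizes, ready to plot. The dataset and estimator are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

train_mean = train_scores.mean(axis=1)  # mean accuracy per training size
val_mean = val_scores.mean(axis=1)      # plot both against `sizes`
```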
High bias model
- Both our training and validation accuracy are well below what we need (underfitting)
- The model is too simple, not capturing all the features of the true f
- Solutions can be to increase the number of features or decrease the degree of regularisation
High variance model
- Training accuracy is high but validation accuracy is much lower (i.e. overfitting)
- Model is too complex
- Solutions can be to add more training data, or reduce model complexity by decreasing the number of features or increasing regularisation
Regularisation
- A way to reduce overfitting by adding a penalty score for complexity to the cost function
- Allows us to find a balance between the important features and the overfitting of the model
Regularisation in Linear Regression
- Cost function = sum of squared errors (SSE)
- Add a penalty based on the size of the feature weights
- Alpha represents regularisation strength - the larger the alpha, the larger the penalty
- We minimise the SSE plus the penalty term (alpha itself is a fixed hyperparameter, chosen by tuning)
- First scale all variables (i.e. standardise, normalise)
- Weights are attributed to each of the features based on importance in the model
- Large weights correspond to higher model complexity, so we can regularise by penalising large weights
- Some implementations express the strength of regularisation as C = 1/alpha
- Small c = stronger regularisation
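A small demonstration of C = 1/alpha in scikit-learn's `LogisticRegression` (the dataset is illustrative): a smaller C means stronger regularisation, which shrinks the weight vector. Features are standardised first, as the notes recommend.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)  # scale variables before regularising

weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)   # weak penalty
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)  # strong penalty

# The norm of the coefficient vector shrinks as regularisation strengthens
weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
```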
L1 (lasso) regularisation
After fitting the model, we have a coefficient for each feature. The L1 regularisation penalty is the sum of the absolute values of the coefficients across all features
- Can reduce certain feature weights to zero (i.e. feature selection)
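A sketch of that feature-selection effect with scikit-learn's `Lasso` (data and alpha are illustrative): only the first two features carry signal, and L1 drives most of the noise-feature weights exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features carry signal; the other eight are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0))  # noise features zeroed out
```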
L2 (ridge) regularisation
As per L1, but we square each coefficient and sum them (instead of summing absolute values)
- Cannot carry out feature selection as not possible to reduce feature weights to zero (can only get very close to zero)
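The same setup with `Ridge` (illustrative data and alpha) shows the contrast: L2 shrinks all weights towards zero but leaves none exactly zero, so it cannot select features.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
# Weights get small but stay nonzero: no implicit feature selection
n_zero = int(np.sum(ridge.coef_ == 0))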
Hyperparameter tuning methods
Grid Search
- Allows you to search for the best parameter combination over a set of parameters
- If we plot parameter 1 against parameter 2 and consider each combination of parameters (i.e. the grid), then we try out every possibility to find the best
- Very computationally intensive
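A minimal `GridSearchCV` sketch (the SVM estimator and parameter values are illustrative): every combination in the grid is cross-validated, which is why cost grows multiplicatively with grid size.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)  # cross-validates 3 x 3 = 9 combinations, 3 folds each
best_params = search.best_params_
```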
Randomised Search (RandomizedSearchCV in scikit-learn)
- Automatically samples random values for each parameter from a specified distribution, trying a fixed number of combinations (cheaper than exhaustive grid search)
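A corresponding sketch with scikit-learn's `RandomizedSearchCV` (distributions and n_iter are illustrative): parameter values are drawn from distributions, and `n_iter` caps the cost regardless of how fine the search space is.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
# Draw C and gamma from log-uniform distributions rather than a fixed grid
param_dist = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e0)}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=3,
                            random_state=0)
search.fit(X, y)  # only 10 random draws are evaluated, 3 folds each
```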
Model tuning: training data too small
Sample and label more data if possible
- Consider creating synthetic data (duplication, or techniques like SMOTE)
- Synthetic Minority Over-sampling Technique:
Take a sample from the dataset and find its k nearest neighbours in feature space. To create a synthetic data point: take the vector from the current point to one of those k neighbours, multiply it by a random number x between 0 and 1, and add the result to the current data point
- Training data doesn’t need to be exactly representative but test data does
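The SMOTE step described above can be sketched in plain NumPy (the imbalanced-learn library provides a full implementation; the toy minority-class data here is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))  # toy minority-class samples

point = minority[0]
# Find k=3 nearest neighbours in feature space (index 0 is the point itself)
dists = np.linalg.norm(minority - point, axis=1)
neighbours = minority[np.argsort(dists)[1:4]]

# Pick one neighbour, take the vector to it, scale by random x in (0, 1)
neighbour = neighbours[rng.integers(3)]
x = rng.uniform(0, 1)
synthetic = point + x * (neighbour - point)
# The synthetic point lies on the segment between point and neighbour
```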
Model tuning: Feature set tuning
- Adding features that help capture patterns for classes of errors
- We may add new features if we have too few features resulting in high bias or low flexibility
- This may include different transformations of the same feature (e.g. squared, log, sqrt etc.)
- Add interaction terms
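One common way to add such transformed and interaction features is scikit-learn's `PolynomialFeatures` (the two-feature input here is illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample with two features x1, x2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_new = poly.fit_transform(X)
# Output columns: x1, x2, x1^2, x1*x2, x2^2
```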
Model tuning: Dimensionality reduction
- Too many features, high variance, too complex/flexible, tends to overfit
- We can reduce the number of features while trying to keep the majority of info in the dataset
- Curse of dimensionality: for some algorithms (such as k-nearest neighbours), with too many dimensions the data points become so sparse (distances between points grow and stop being informative) that the algorithm doesn’t give us a good grouping in this high-dimensional space
- Dimensionality reduction also occurs through L1 Regularisation (as discussed above) where the weight of the coefficient is reduced to 0
Model tuning: Feature selection
Removing some of the features from the model such that the remaining selected features give better performance; the selected features themselves are unchanged
Model tuning: Feature extraction
- Combine features to generate a new set of features to be used by the model. This new set is generally smaller than the original feature set
- Maps data into smaller feature space that captures the bulk of the information in the data (aka data compression)
- Improves computational efficiency
- Reduces the curse of dimensionality
- PCA (patterns in the correlations between features)
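A small PCA sketch (the correlated toy data is an assumption): six features that are mixtures of three underlying signals compress into three components while keeping almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
# Six features that are correlated mixtures of three underlying signals
X = base @ rng.normal(size=(3, 6)) + rng.normal(0, 0.05, size=(300, 6))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)  # 6 features -> 3 components
explained = pca.explained_variance_ratio_.sum()  # variance retained
```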