Model training and tuning Flashcards

1
Q

Validation data set

A
  • The best use of a validation subset is to compare candidate models and find the best one. Once we have chosen our model, we then retrain it on all the training data and evaluate it on the test data
  • Use k-fold cross-validation rather than a single hold-out split to ensure the training data isn’t too small after being split twice
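A minimal sketch of this workflow, assuming scikit-learn; the toy dataset and the two candidate models here are only illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    # Toy data standing in for a real problem
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Compare candidate models with 5-fold cross-validation on the training data only
    for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=3)):
        scores = cross_val_score(model, X_train, y_train, cv=5)
        print(type(model).__name__, round(scores.mean(), 3))

    # Retrain the chosen model on all the training data, then test once on the held-out test set
    best = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("test accuracy:", best.score(X_test, y_test))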
2
Q

Bias - Variance tradeoff

A
  • Total Error = Bias^2 + Variance + Irreducible Error
  • Bias is the difference between our estimated model (f hat) and the real (unknown) model (f)
  • Variance is the variability of the model - high-variance models give different predictions for the same input when trained on slightly different data, i.e. a small change in the training data results in a large change in the output
  • The best-case model has low bias and low variance
    High variance = overfitting. High bias = underfitting
  • When we plot Model Error against Model Complexity:
    As complexity increases, the variance component of the error increases
    As complexity increases, the bias component of the error decreases
    We want to find the “sweet spot” of complexity where the total error is minimised
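A rough sketch of this trade-off, assuming scikit-learn, with polynomial degree on synthetic data standing in for model complexity:

    import numpy as np
    from sklearn.model_selection import validation_curve
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    # Noisy 1-D regression problem
    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

    # Polynomial degree acts as the "model complexity" axis
    degrees = np.arange(1, 10)
    train_scores, val_scores = validation_curve(
        make_pipeline(PolynomialFeatures(), LinearRegression()),
        X, y,
        param_name="polynomialfeatures__degree",
        param_range=degrees,
        scoring="neg_mean_squared_error",
        cv=5,
    )

    # Low degree: both errors high (bias). High degree: training error keeps falling
    # while validation error rises (variance). The sweet spot minimises validation error.
    for d, tr, va in zip(degrees, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
        print(f"degree {d}: train MSE {tr:.3f}, validation MSE {va:.3f}")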
3
Q

Using the learning curve to evaluate the model

A
  • Used to detect whether the model is underfitting or overfitting, and to see the impact of training-data size on the error
  • Learning curves: plot training dataset and validation dataset error or accuracy against training set size
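A minimal sketch of a learning curve, assuming scikit-learn and a toy dataset:

    import numpy as np
    from sklearn.model_selection import learning_curve
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Train on increasing fractions of the data; score each size with 5-fold CV
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    )

    for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
        print(f"{n:4d} samples: train acc {tr:.3f}, validation acc {va:.3f}")

    # A large persistent gap between the two curves suggests overfitting (high variance);
    # both curves plateauing at low accuracy suggests underfitting (high bias).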
4
Q

High bias model

A
  • Validation and training accuracy are both well below what we need (underfitting)
  • The model is too simple, not capturing all the features of the true f
  • Solutions can be to increase the number of features or decrease the degree of regularisation
5
Q

High variance model

A
  • Training accuracy is high but validation accuracy is much lower (i.e. overfitting)
  • Model is too complex
  • Solutions can be to add more training data, or reduce model complexity by decreasing the number of features or increasing regularisation
6
Q

Regularisation

A
  • A way to reduce overfitting by adding a penalty score for complexity to the cost function
  • Allows us to find a balance between the important features and the overfitting of the model
7
Q

Regularisation in Linear Regression

A
  • Cost function = sum of squared errors (SSE)
  • Add a penalty term based on the size of the feature weights
  • Alpha represents regularisation strength - the larger the alpha, the larger the penalty
  • We minimise the SSE plus the penalty term (alpha itself is a fixed hyperparameter)
  • First scale all variables (i.e. standardise, normalise)
  • Weights are attributed to each of the features based on their importance in the model. Large weights correspond to higher complexity, so we can regularise by penalising large weights
  • Some implementations express the strength of regularisation as C = 1/alpha
  • Small C = stronger regularisation
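A minimal sketch of regularised linear regression, assuming scikit-learn (ridge regression is used here as one example of a penalised linear model): scale the features first, then fit with a chosen alpha.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

    # Scale features first so the penalty treats all weights comparably,
    # then fit; larger alpha = stronger penalty on large weights
    for alpha in (0.01, 1.0, 100.0):
        model = make_pipeline(StandardScaler(), Ridge(alpha=alpha)).fit(X, y)
        coefs = model.named_steps["ridge"].coef_
        print(f"alpha={alpha}: largest |weight| = {abs(coefs).max():.2f}")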
8
Q

L1 (lasso) regularisation

A

After fitting the model, we have a coefficient (weight) for each feature. The L1 regularisation penalty is alpha times the sum of the absolute values of the coefficients
- Can reduce certain feature weights to exactly zero (i.e. performs feature selection)
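A small sketch of the feature-selection effect, assuming scikit-learn: with a large enough alpha, Lasso drives the weights of uninformative features to exactly zero.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Only 3 of the 10 features are actually informative
    X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                           noise=5.0, random_state=0)

    model = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)
    coefs = model.named_steps["lasso"].coef_
    print("non-zero coefficient indices:", np.flatnonzero(coefs))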

9
Q

L2 (ridge) regularisation

A

As per L1, but the penalty is the sum of the squared coefficients
- Cannot carry out feature selection, as it is not possible to reduce feature weights to exactly zero (they can only get very close to zero)
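For comparison, with w_j the feature weights and alpha the regularisation strength:

    L1 (lasso) penalty: alpha * (|w_1| + |w_2| + ... + |w_n|)
    L2 (ridge) penalty: alpha * (w_1^2 + w_2^2 + ... + w_n^2)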

10
Q

Hyperparameter tuning methods

A

Grid Search (GridSearchCV)
- Searches for the best parameter combination over a predefined set of parameter values
- If we plot parameter 1 against parameter 2 and consider each combination of values (i.e. the grid), grid search tries every possibility to find the best
- Very computationally intensive
Randomised Search (RandomizedSearchCV)
- Samples random values for each parameter from a specified distribution
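A minimal sketch of both approaches, assuming scikit-learn and a small illustrative parameter space:

    from scipy.stats import uniform
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Grid search: try every combination in the grid (exhaustive, expensive)
    grid = GridSearchCV(
        LogisticRegression(solver="liblinear"),
        {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]},
        cv=5,
    ).fit(X, y)
    print("grid best:", grid.best_params_)

    # Randomised search: sample parameter values from a distribution instead
    rand = RandomizedSearchCV(
        LogisticRegression(solver="liblinear"),
        {"C": uniform(0.01, 10)},
        n_iter=10, cv=5, random_state=0,
    ).fit(X, y)
    print("random best:", rand.best_params_)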

11
Q

Model tuning: training data too small

A

Sample and label more data if possible
- Consider creating synthetic data (duplication, or techniques like SMOTE)
- Synthetic Minority Over-sampling Technique (SMOTE):
take a sample from the dataset and consider its k nearest neighbours (in feature space). To create a synthetic data point, take the vector from the current data point to one of those k neighbours, multiply it by a random number x between 0 and 1, and add the result to the current data point
- Training data doesn’t need to be exactly representative, but test data does
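A rough numpy sketch of the SMOTE step described above, creating a single synthetic point from toy 2-D minority-class data (in practice a library such as imbalanced-learn would repeat this for many points and neighbours):

    import numpy as np

    rng = np.random.default_rng(0)

    # Minority-class samples in feature space (toy 2-D example)
    minority = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.3], [1.5, 2.1]])

    current = minority[0]
    # Pick one of the current point's nearest minority-class neighbours
    distances = np.linalg.norm(minority - current, axis=1)
    neighbour = minority[np.argsort(distances)[1]]   # index 0 is the point itself

    # The new point lies somewhere on the line segment between the two points
    x = rng.uniform(0, 1)
    synthetic = current + x * (neighbour - current)
    print("synthetic sample:", synthetic)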

12
Q

Model tuning: Feature set tuning

A
  • Add features that help capture the patterns behind particular classes of errors
  • We may add new features if we have too few, resulting in high bias / low flexibility
  • This may include different transformations of the same feature (e.g. squared, log, sqrt etc.)
  • Add interaction terms
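A short sketch of adding transformed and interaction features, assuming scikit-learn:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[2.0, 3.0],
                  [4.0, 5.0]])

    # Adds squared terms and the pairwise interaction term x1*x2
    poly = PolynomialFeatures(degree=2, include_bias=False)
    print(poly.fit_transform(X))     # columns: x1, x2, x1^2, x1*x2, x2^2

    # Other transformations of the same feature, e.g. log and square root
    print(np.log(X), np.sqrt(X))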
13
Q

Model tuning: Dimensionality reduction

A
  • Too many features, high variance, too complex/flexible, tends to overfit
  • We can reduce the number of features while trying to keep the majority of info in the dataset
  • Curse of dimensionality: for some algorithms (such as k-nearest neighbours), with too many dimensions the data points become so sparse that the algorithm doesn’t give us a good grouping in this high-dimensional space
  • Dimensionality reduction also occurs through L1 Regularisation (as discussed above) where the weight of the coefficient is reduced to 0
14
Q

Model tuning: Feature selection

A

Removing some of the features from the model such that the retained subset gives better performance; the selected features themselves are not changed (in contrast to feature extraction, which creates new features)
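One practical sketch of this, assuming scikit-learn, is univariate feature selection, which keeps the k highest-scoring original features unchanged:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                               random_state=0)

    # Keep the 3 features that score best against the target; their values are untouched
    selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
    X_selected = selector.transform(X)
    print("kept feature indices:", selector.get_support(indices=True))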

15
Q

Model tuning: Feature extraction

A
  • Combine features to generate a new set of features to use for the model. This new set generally contains fewer features than the original
  • Maps data into a smaller feature space that captures the bulk of the information in the data (aka data compression)
  • Improves computational efficiency
  • Reduces the curse of dimensionality
  • PCA (principal component analysis) - finds patterns in the correlations between features
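A minimal PCA sketch, assuming scikit-learn and the iris dataset as an example: scale first, then project into a smaller feature space that keeps most of the variance.

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = StandardScaler().fit_transform(load_iris().data)

    # Project the 4 original features onto 2 principal components
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)
    print("explained variance ratio:", pca.explained_variance_ratio_)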
16
Q

Linear Discriminant Analysis (LDA)

A
  • A supervised linear approach to feature extraction
  • Transforms data to a subspace that maximises class separability
  • Assumes data is normally (Gaussian) distributed
  • Can be used to reduce the number of features to a maximum of c - 1, where c is the number of classes
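A minimal sketch assuming scikit-learn, using the 3-class iris dataset (so at most 2 components can be kept):

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)

    # Supervised: uses the class labels to find directions that separate the classes
    lda = LinearDiscriminantAnalysis(n_components=2)   # max components = n_classes - 1
    X_reduced = lda.fit_transform(X, y)
    print(X_reduced.shape)   # (150, 2)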
17
Q

Use of Bagging/Boosting

A
  • Ways to automate the relatively manual processes of feature selection and feature extraction
  • Ensemble methods that use random subsets of the data/features, run the model several times, and then combine (vote or average over) the results
18
Q

Bagging (also known as Bootstrap Aggregating)

A
  • Motivation: generate a group of weak learners that, when combined together, generate higher accuracy
  • Create x datasets of size m by randomly sampling the original dataset with replacement
  • As we randomly choose the data subsets, the observations not chosen for a given sample (the out-of-bag samples) can form the validation data set
  • Train weak learners (i.e. different models like decision stumps, logistic regression) on varying datasets to generate predictions
  • Take a vote (classification) or average (regression) of the results from each model
  • Use bagging for instances of high variance and low bias (variance is reduced, bias stays the same)
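A minimal sketch of bagging shallow decision trees, assuming a recent scikit-learn (the out-of-bag samples are used for validation):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # 50 weak learners, each trained on a bootstrap sample drawn with replacement;
    # predictions are combined by majority vote
    bag = BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=2),   # "base_estimator" in older versions
        n_estimators=50,
        oob_score=True,       # score on the observations each learner never saw
        random_state=0,
    ).fit(X, y)
    print("out-of-bag accuracy:", bag.oob_score_)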
19
Q

Boosting

A
  • Assign a strength (weight) to each weak learner
  • Iteratively train learners, focusing each new learner on the examples misclassified by the previous one
  • The algorithm focuses on previous failures (by up-weighting them), training a sequence of weak learners to build a strong model
  • Very good for forecasting
  • Use boosting for models with high bias
  • XGBoost is a good choice for tabular datasets (but use neural networks for harder problems like image classification and language)
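A minimal boosting sketch, shown here with scikit-learn's GradientBoostingClassifier as one example (XGBoost's XGBClassifier exposes a very similar fit/score interface):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each new shallow tree is fitted to correct the errors of the ensemble so far
    boost = GradientBoostingClassifier(n_estimators=100, max_depth=2,
                                       learning_rate=0.1, random_state=0)
    boost.fit(X_train, y_train)
    print("test accuracy:", boost.score(X_test, y_test))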