Week 3 - Re-sampling and regularisation Flashcards
Cross-validation
Cross-validation works by splitting the dataset into multiple subsets, training the model on some subsets while testing it on others
(This reduces the effective size of the training data for each fit)
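A minimal sketch of the idea in Python (scikit-learn and the toy data here are my own illustration, not something specified in the notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data: y depends linearly on X plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Split into 5 subsets; each serves once as the held-out evaluation set
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean())  # average out-of-sample R^2 across the 5 folds
```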
Bootstrap
Resampling technique that generates multiple datasets by sampling with replacement from the original data; each bootstrap sample has the same size as the original dataset
- Bootstrap is preferable to cross-validation when the sample size is small or the model is complex
- Used for estimating the variance of parameter estimates, so we can get a distribution for each parameter (see the sketch below)
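A minimal sketch of bootstrapping a regression slope (the data and model are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 80
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Resample rows with replacement; each bootstrap sample has size n
B = 1000
slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)                  # sample WITH replacement
    slopes[b] = np.polyfit(x[idx], y[idx], deg=1)[0]  # refit, record the slope

# The spread of the bootstrap distribution estimates the slope's variance
print(slopes.std())                        # bootstrap standard error
print(np.percentile(slopes, [2.5, 97.5]))  # 95% percentile interval
```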
Permutation
Resampling technique that reorders the values of one or more variables to test the significance of relationships between variables
Be careful when permuting with a small sample size, because the permuted datasets may end up very similar to the original data (a sketch of a simple permutation test follows)
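A minimal permutation-test sketch, testing whether a correlation between x and y is significant (toy data, my own setup):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(size=50)    # weak true relationship

observed = np.corrcoef(x, y)[0, 1]

# Shuffling y breaks any x-y relationship, giving the null distribution
P = 5000
null_corrs = np.empty(P)
for p in range(P):
    y_perm = rng.permutation(y)      # reorder WITHOUT replacement
    null_corrs[p] = np.corrcoef(x, y_perm)[0, 1]

# p-value: how often a permuted correlation is as extreme as the observed one
p_value = np.mean(np.abs(null_corrs) >= np.abs(observed))
print(observed, p_value)
```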
K-fold cross validation
- First, we split our data and lock the test set away
- From now on we work only with the training data
Steps for k-fold cross validation:
1. Divide the training data into k parts
2. Remove one part, fit the model using the remaining k-1 parts, and evaluate it on the removed part
3. Repeat k times, taking out a different part each time, and average the results
- We cannot completely trust k-fold cross-validation on its own, because model selection can still overfit to the folds (that's why we locked the test set away)
- Another issue is that the model has to be fit k times (which can be time-consuming)
- The k fits are independent of each other, so they can be run in parallel (see the sketch below)
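A sketch of the steps above done by hand with scikit-learn's KFold (toy data, my own illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.3, size=120)

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # step 1: k parts
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # step 2: fit on the k-1 remaining parts, evaluate on the held-out part
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
print(np.mean(fold_scores))                           # step 3: average over the k folds
```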
K-fold cross validation uses
- Estimating how a model performs out-of-sample (on data that was not used during training)
- Used to select model hyperparameters
- Evaluate the cross-validated accuracy for different hyperparameter values and proceed with the one giving the best out-of-sample performance
- To choose between two models, pick the one with the better cross-validation results, then use the actual test set only for the final assessment (see the sketch below)
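A sketch of hyperparameter selection by cross-validation, here choosing a ridge penalty strength (the candidate values are arbitrary illustrations):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Evaluate cross-validated accuracy for each candidate hyperparameter
alphas = [0.01, 0.1, 1.0, 10.0]
cv_scores = {a: cross_val_score(Ridge(alpha=a), X, y, cv=5).mean() for a in alphas}

# Proceed with the value giving the best out-of-sample performance
best_alpha = max(cv_scores, key=cv_scores.get)
print(best_alpha, cv_scores[best_alpha])
```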
Leave-one-out cross validation (LOOCV)
A k-fold cross validation, where k = n. There are n CV sets, each with one observation left out
- Useful when sample size is very small
- Gives a nearly unbiased measure of performance
- But the model needs to be fit k = n times, and n may be very large (see the sketch below)
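A LOOCV sketch on a deliberately small toy dataset (mean squared error is used because R^2 is undefined on a single held-out point):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 2))    # small n, so n fits are affordable
y = X @ np.array([2.0, 1.0]) + rng.normal(scale=0.4, size=20)

# k = n: each of the n fits leaves exactly one observation out
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print(-scores.mean())           # LOOCV estimate of test MSE
```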
Cross-validation uses
- Model evaluation: Checking how well a model works on unseen data.
- Parameter tuning: Finding the best settings (e.g., number of trees, polynomial terms).
- Variable selection: Identifying important features and dropping unnecessary ones (see the sketch below).
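A small sketch of CV-based variable selection: compare the cross-validated score of the full model against a reduced one (toy data where only the first feature matters):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)  # only feature 0 matters

# Compare cross-validated performance with and without the extra features
full = cross_val_score(LinearRegression(), X, y, cv=5).mean()
reduced = cross_val_score(LinearRegression(), X[:, [0]], y, cv=5).mean()
print(full, reduced)  # similar scores suggest the extra features can be dropped
```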
Bootstrap steps
- Generate a number of bootstrap samples
- For each bootstrap sample, compute the PCA and record the loadings
- Then we can construct a 95% bootstrap confidence interval
- The 2.5th and 97.5th percentiles give the lower and upper bounds of the confidence interval for each PC loading (see the sketch below)
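A sketch of the PCA bootstrap above (the data are simulated; note that PC signs are arbitrary, so loadings are sign-aligned before taking percentiles):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n = 100
X = rng.multivariate_normal([0, 0, 0],
                            [[1, .6, .2], [.6, 1, .3], [.2, .3, 1]],
                            size=n)

B = 500
loadings = np.empty((B, X.shape[1]))
for b in range(B):
    idx = rng.integers(0, n, size=n)   # resample rows with replacement
    v = PCA(n_components=1).fit(X[idx]).components_[0]
    if v[0] < 0:                       # fix the sign ambiguity of PCs
        v = -v
    loadings[b] = v

# 2.5th and 97.5th percentiles give the 95% CI for each PC1 loading
lower, upper = np.percentile(loadings, [2.5, 97.5], axis=0)
print(lower, upper)
```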
Comparison between Cross-Validation, Bootstrap and Permutation
Cross-validation: creates smaller training and testing sets to evaluate a model's out-of-sample performance
Bootstrap: creates training data of the same size as the original to understand the distribution of the fitted parameter estimates
- Samples whole observation rows
- Samples with replacement
Permutation: creates training data of the same size to understand relationships between variables (without a model)
- Shuffles (reorders) the values of one or more variables across rows
- Samples without replacement
Regularisation
A technique used in model fitting to prevent overfitting by adding a penalty term to the model's objective function. This penalty discourages overly complex models by shrinking the parameter estimates, potentially forcing some to exactly zero
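A minimal sketch contrasting an unpenalised fit with ridge (L2 penalty, shrinks coefficients) and lasso (L1 penalty, can force coefficients to exactly zero); the penalty strengths are arbitrary illustrations:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(8)
X = rng.normal(size=(60, 8))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=60)  # only one true signal

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)   # penalised: coefficients shrink
lasso = Lasso(alpha=0.3).fit(X, y)   # penalised: some coefficients hit 0

print(np.round(ols.coef_, 2))
print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))      # most noise coefficients become 0.0
```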
Simulation
Process of generating artificial data based on known statistical distributions or models to study and analyse the behaviour of a system under controlled conditions
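A minimal simulation sketch: generate artificial data from a known model many times and study how well the fitted slope recovers the truth (the model and parameters are made up):

```python
import numpy as np

# Known model: y = 2 + 3x + noise; simulate it repeatedly and refit
rng = np.random.default_rng(9)
n_reps, n = 1000, 50
slopes = np.empty(n_reps)
for r in range(n_reps):
    x = rng.uniform(0, 1, size=n)
    y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)
    slopes[r] = np.polyfit(x, y, deg=1)[0]

print(slopes.mean(), slopes.std())   # should sit near the true slope of 3
```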