Week 3 - Re-sampling and regularisation Flashcards
Cross-validation
Cross-validation works by splitting the dataset into multiple subsets, training the model on some subsets while testing it on others
(This reduces the effective size of the training data for each fit)
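A minimal sketch of the idea in Python (scikit-learn and the toy data here are my own illustration, not something specified in the notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data: y depends linearly on X plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Split into 5 subsets; each serves once as the held-out evaluation set
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean())  # average out-of-sample R^2 across the 5 folds
```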
Bootstrap
Resampling technique that generates multiple datasets by sampling with replacement from the original data; each bootstrap sample has the same size as the original dataset
- Bootstrap is preferable to cross-validation when the sample size is small or the model is complex
- Used for estimating the variance of parameter estimates, so we can get a distribution for each parameter (see the sketch below)
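A minimal sketch of bootstrapping a regression slope (the data and model are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 80
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Resample rows with replacement; each bootstrap sample has size n
B = 1000
slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)                  # sample WITH replacement
    slopes[b] = np.polyfit(x[idx], y[idx], deg=1)[0]  # refit, record the slope

# The spread of the bootstrap distribution estimates the slope's variance
print(slopes.std())                        # bootstrap standard error
print(np.percentile(slopes, [2.5, 97.5]))  # 95% percentile interval
```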
Permutation
Resampling technique that reorders the values of one or more variables to test the significance of relationships between variables
Be careful when permuting with a small sample size, because the permuted datasets may end up very similar to the original data (a sketch of a simple permutation test follows)
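A minimal permutation-test sketch, testing whether a correlation between x and y is significant (toy data, my own setup):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(size=50)    # weak true relationship

observed = np.corrcoef(x, y)[0, 1]

# Shuffling y breaks any x-y relationship, giving the null distribution
P = 5000
null_corrs = np.empty(P)
for p in range(P):
    y_perm = rng.permutation(y)      # reorder WITHOUT replacement
    null_corrs[p] = np.corrcoef(x, y_perm)[0, 1]

# p-value: how often a permuted correlation is as extreme as the observed one
p_value = np.mean(np.abs(null_corrs) >= np.abs(observed))
print(observed, p_value)
```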
K-fold cross validation
- First, we split our data and lock the test set away
- From now on we work only with the training data
Steps for k-fold cross validation:
1. Divide the training data into k parts
2. Remove one part, fit the model using the remaining k-1 parts, and evaluate it on the removed part
3. Repeat k times, taking out a different part each time, and average the results
- We cannot completely trust k-fold cross-validation on its own, because model selection can still overfit to the folds (that's why we locked the test set away)
- Another issue is that the model has to be fit k times (which can be time-consuming)
- The k fits are independent of each other, so they can be run in parallel (see the sketch below)
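A sketch of the steps above done by hand with scikit-learn's KFold (toy data, my own illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.3, size=120)

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # step 1: k parts
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # step 2: fit on the k-1 remaining parts, evaluate on the held-out part
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
print(np.mean(fold_scores))                           # step 3: average over the k folds
```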
K-fold cross validation uses
- Estimating how a model performs out-of-sample (on data that was not used during training)
- Used to select model hyperparameters
- Evaluate the cross-validated accuracy for different hyperparameter values and proceed with the one giving the best out-of-sample performance
- To choose between two models, pick the one with the better cross-validation results, then use the actual test set only for the final assessment (see the sketch below)
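A sketch of hyperparameter selection by cross-validation, here choosing a ridge penalty strength (the candidate values are arbitrary illustrations):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Evaluate cross-validated accuracy for each candidate hyperparameter
alphas = [0.01, 0.1, 1.0, 10.0]
cv_scores = {a: cross_val_score(Ridge(alpha=a), X, y, cv=5).mean() for a in alphas}

# Proceed with the value giving the best out-of-sample performance
best_alpha = max(cv_scores, key=cv_scores.get)
print(best_alpha, cv_scores[best_alpha])
```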
Leave-one-out cross validation (LOOCV)
A k-fold cross validation, where k = n. There are n CV sets, each with one observation left out
- Useful when sample size is very small
- Gives a nearly unbiased measure of performance
- But the model needs to be fit k = n times, and n may be very large (see the sketch below)
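A LOOCV sketch on a deliberately small toy dataset (mean squared error is used because R^2 is undefined on a single held-out point):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 2))    # small n, so n fits are affordable
y = X @ np.array([2.0, 1.0]) + rng.normal(scale=0.4, size=20)

# k = n: each of the n fits leaves exactly one observation out
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print(-scores.mean())           # LOOCV estimate of test MSE
```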
Cross-validation uses
- Model evaluation: Checking how well a model works on unseen data.
- Parameter tuning: Finding the best settings (e.g., number of trees, polynomial terms).
- Variable selection: Identifying important features and dropping unnecessary ones (see the sketch below).
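A small sketch of CV-based variable selection: compare the cross-validated score of the full model against a reduced one (toy data where only the first feature matters):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)  # only feature 0 matters

# Compare cross-validated performance with and without the extra features
full = cross_val_score(LinearRegression(), X, y, cv=5).mean()
reduced = cross_val_score(LinearRegression(), X[:, [0]], y, cv=5).mean()
print(full, reduced)  # similar scores suggest the extra features can be dropped
```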
Bootstrap steps
- Generate a number of bootstrap samples
- For each bootstrap sample, compute the PCA and record the loadings
- Then we can construct a 95% bootstrap confidence interval
- The 2.5th and 97.5th percentiles give the lower and upper bounds of the confidence interval for each PC loading (see the sketch below)
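A sketch of the PCA bootstrap above (the data are simulated; note that PC signs are arbitrary, so loadings are sign-aligned before taking percentiles):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n = 100
X = rng.multivariate_normal([0, 0, 0],
                            [[1, .6, .2], [.6, 1, .3], [.2, .3, 1]],
                            size=n)

B = 500
loadings = np.empty((B, X.shape[1]))
for b in range(B):
    idx = rng.integers(0, n, size=n)   # resample rows with replacement
    v = PCA(n_components=1).fit(X[idx]).components_[0]
    if v[0] < 0:                       # fix the sign ambiguity of PCs
        v = -v
    loadings[b] = v

# 2.5th and 97.5th percentiles give the 95% CI for each PC1 loading
lower, upper = np.percentile(loadings, [2.5, 97.5], axis=0)
print(lower, upper)
```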
Comparison between Cross-Validation, Bootstrap and Permutation
Cross-validation: creates smaller training and testing sets to evaluate a model's out-of-sample performance
Bootstrap: creates training data of the same size as the original to understand the distribution of the fitted parameter estimates
- Samples whole observation rows
- Samples with replacement
Permutation: creates training data of the same size to understand relationships between variables (without a model)
- Shuffles (reorders) the values of one or more variables across rows
- Samples without replacement
Regularisation
A technique used in model fitting to prevent overfitting by adding a penalty term to the model's objective function. This penalty discourages overly complex models by shrinking the parameter estimates, potentially forcing some to exactly zero
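A minimal sketch contrasting an unpenalised fit with ridge (L2 penalty, shrinks coefficients) and lasso (L1 penalty, can force coefficients to exactly zero); the penalty strengths are arbitrary illustrations:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(8)
X = rng.normal(size=(60, 8))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=60)  # only one true signal

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)   # penalised: coefficients shrink
lasso = Lasso(alpha=0.3).fit(X, y)   # penalised: some coefficients hit 0

print(np.round(ols.coef_, 2))
print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))      # most noise coefficients become 0.0
```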
Simulation
Process of generating artificial data based on known statistical distributions or models to study and analyse the behaviour of a system under controlled conditions
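A minimal simulation sketch: generate artificial data from a known model many times and study how well the fitted slope recovers the truth (the model and parameters are made up):

```python
import numpy as np

# Known model: y = 2 + 3x + noise; simulate it repeatedly and refit
rng = np.random.default_rng(9)
n_reps, n = 1000, 50
slopes = np.empty(n_reps)
for r in range(n_reps):
    x = rng.uniform(0, 1, size=n)
    y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)
    slopes[r] = np.polyfit(x, y, deg=1)[0]

print(slopes.mean(), slopes.std())   # should sit near the true slope of 3
```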