Week 3 - Re-sampling and regularisation Flashcards

1
Q

Cross-validation

A

Cross-validation works by splitting the dataset into multiple subsets, training the model on some subsets and testing it on the others

(Each training set is therefore smaller than the full sample, since part of the data is held out for testing)
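
A minimal sketch of the idea in Python, assuming scikit-learn and a made-up regression dataset (the data, model and fold count are illustrative, not from the course):

```python
# A minimal cross-validation sketch (hypothetical data and model).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 observations, 3 predictors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Split the data into 5 folds; train on 4, test on the held-out fold, repeat.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores)          # one out-of-sample score per fold
print(scores.mean())   # average out-of-sample performance
```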

2
Q

Bootstrap

A

Resampling technique that generates multiple datasets by sampling with replacement from the original data, where each bootstrap sample has the same size as the original dataset

  • Bootstrap is preferable to cross-validation when the sample size is small or the model is complex
  • Used for estimating the variance of parameter estimates, so we can get a distribution for the parameter (see the sketch below)
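
A minimal bootstrap sketch in Python on made-up data, estimating the variance (and the distribution) of a simple parameter estimate, the sample mean; the data and the number of resamples are illustrative assumptions:

```python
# Bootstrap sketch: sampling distribution of the sample mean (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=50)   # small original sample

n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Each bootstrap sample has the same size as the original, drawn with replacement.
    sample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = sample.mean()

print(boot_means.std())                        # bootstrap estimate of the standard error
print(np.percentile(boot_means, [2.5, 97.5]))  # 95% bootstrap confidence interval
```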
3
Q

Permutation

A

Resampling technique that reorders (shuffles) the values of one or more variables to test the significance of relationships between variables

Be careful when permuting with a small sample size, because the permuted datasets may end up very similar to the original data
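
A minimal permutation-test sketch in Python on made-up data; the correlation statistic and the number of permutations are illustrative choices, not prescribed by the course:

```python
# Permutation-test sketch: is the correlation between x and y significant? (hypothetical data)
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = 0.4 * x + rng.normal(size=40)

observed = np.corrcoef(x, y)[0, 1]

n_perm = 5000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    # Shuffling y breaks any real relationship with x while keeping its values.
    perm_stats[i] = np.corrcoef(x, rng.permutation(y))[0, 1]

# p-value: how often a permuted correlation is at least as extreme as the observed one.
p_value = np.mean(np.abs(perm_stats) >= np.abs(observed))
print(observed, p_value)
```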

4
Q

K-fold cross validation

A
  • We split our data up and locked the test set away
  • We are only working with the training data now

Steps for k-fold cross-validation (sketched below):
  1. Divide the training data into 'k' different parts
  2. Remove one part, fit the model using the remaining k-1 parts, and evaluate it on the removed part
  3. Repeat k times, taking out a different part each time, and average the results

  • We cannot completely trust k-fold cross-validation, because it can still overfit (that's why we locked the test set away)
  • Another issue is that you have to fit the model 'k' times (which can be time consuming)
  • The k fits are independent, so they can be done in parallel
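
A minimal k-fold sketch in Python using scikit-learn; the dataset, model and k = 5 are illustrative assumptions:

```python
# K-fold sketch: fit the model k times, each time holding out a different fold (hypothetical data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Lock the test set away first; cross-validation only touches the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

fold_errors = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    model = LinearRegression().fit(X_train[train_idx], y_train[train_idx])
    pred = model.predict(X_train[val_idx])
    fold_errors.append(mean_squared_error(y_train[val_idx], pred))

print(np.mean(fold_errors))   # average validation error across the k folds
```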
5
Q

K-fold cross validation uses

A
  • Assessing how a model performs out-of-sample (on data that was not used during training of the model)
  • Used to select model hyperparameters
  • Evaluate the cross-validated accuracy for different hyperparameters and proceed with the one giving the 'best' out-of-sample performance (see the sketch below)
  • You would compare two models by their cross-validation results, pick the better one, and only then use the actual test set for prediction
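
A minimal sketch of hyperparameter selection by cross-validation, assuming scikit-learn; the polynomial-degree grid and the data are made up for illustration:

```python
# Hyperparameter-selection sketch: pick a polynomial degree by cross-validated score (hypothetical data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=150).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.2, size=150)

for degree in [1, 2, 3, 5, 8]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, x, y, cv=5).mean()
    print(degree, score)   # choose the degree with the best out-of-sample score
```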
6
Q

Leave-one-out cross validation (LOOCV)

A

A k-fold cross-validation where k = n. There are n CV sets, each with one observation left out (see the sketch below)

  • Useful when the sample size is very small
  • Gives the best measure of performance
  • But the model needs to be fit k = n times, and n may be very large
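
A minimal LOOCV sketch using scikit-learn's LeaveOneOut splitter; the tiny dataset and linear model are illustrative assumptions:

```python
# LOOCV sketch: k-fold with k = n, so each observation is held out once (hypothetical data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 2))      # very small sample, where LOOCV is most useful
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.1, size=25)

# n model fits, each tested on the single left-out observation.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print(len(scores), -scores.mean())   # n scores; their average squared error
```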
7
Q

Cross-validation uses

A
  • Model evaluation: Checking how well a model works on unseen data.
  • Parameter tuning: Finding the best settings (e.g., number of trees, polynomial terms)
  • Variable selection: Identifying important features and dropping unnecessary ones (see the sketch below).
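
A minimal variable-selection sketch: compare feature subsets by their cross-validated score (scikit-learn assumed; the simulated data and candidate subsets are illustrative):

```python
# Variable-selection sketch: compare feature subsets by cross-validated score (hypothetical data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Only the first two columns actually matter in this simulated example.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

for cols in [[0], [0, 1], [0, 1, 2]]:
    score = cross_val_score(LinearRegression(), X[:, cols], y, cv=5).mean()
    print(cols, score)   # the unnecessary third feature should not improve the score
```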
8
Q

Bootstrap steps

A
  1. Generate a number of bootstrap samples
  2. For each bootstrap sample, compute the PCA and record the loadings
  3. Then we can construct a 95% bootstrap confidence interval (see the sketch below)
  4. The 2.5th and 97.5th percentiles give the lower and upper bounds of the confidence interval for each PC
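
A minimal sketch of these steps in Python, assuming scikit-learn's PCA and made-up three-variable data; the number of bootstrap samples and the sign-alignment step are illustrative choices:

```python
# Bootstrap-PCA sketch: 95% confidence intervals for the first PC's loadings (hypothetical data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0],
                            [[1.0, 0.8, 0.2],
                             [0.8, 1.0, 0.3],
                             [0.2, 0.3, 1.0]], size=100)

n_boot = 1000
loadings = np.empty((n_boot, X.shape[1]))
for b in range(n_boot):
    idx = rng.integers(0, len(X), size=len(X))       # sample rows with replacement
    pc1 = PCA(n_components=1).fit(X[idx]).components_[0]
    loadings[b] = pc1 * np.sign(pc1[0])              # fix the sign so loadings are comparable

# 2.5th and 97.5th percentiles = lower and upper bounds of the 95% interval for each loading.
print(np.percentile(loadings, [2.5, 97.5], axis=0))
```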
9
Q

Comparison between Cross-Validation, Bootstrap and Permutation

A

Cross-validation: creates smaller training and testing sets to evaluate a model's out-of-sample performance

Bootstrap: creates training data of the same size to understand the distribution of parameter estimates
- Samples whole observation rows
- Samples with replacement

Permutation: creates training data of the same size to understand relationships between variables (without a model)
- Reorders (shuffles) the values of one or more variables across the rows
- Samples without replacement

10
Q

Regularisation

A

A technique used in model fitting to prevent overfitting by adding a penalty term to the model’s objective function. This penalty discourages overly complex models by shrinking some parameter estimates, potentially forcing some to zero
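
A minimal sketch of the shrinkage effect using scikit-learn's Lasso (an L1 penalty); the data, the penalty strength alpha and the comparison with plain least squares are illustrative assumptions:

```python
# Regularisation sketch: the lasso penalty shrinks coefficients, some exactly to zero (hypothetical data).
# scikit-learn's Lasso minimises ||y - Xw||^2 / (2n) + alpha * ||w||_1.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
# Only two of the six predictors truly matter in this simulated example.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

print(LinearRegression().fit(X, y).coef_)   # unpenalised: all six coefficients non-zero
print(Lasso(alpha=0.5).fit(X, y).coef_)     # penalised: irrelevant coefficients forced to zero
```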

11
Q

Simulation

A

Process of generating artificial data based on known statistical distributions or models to study and analyse the behaviour of a system under controlled conditions
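
A minimal simulation sketch in Python: generate artificial data from a known normal distribution and check how often a standard 95% confidence interval covers the true mean (all numbers here are illustrative assumptions):

```python
# Simulation sketch: generate data from a known model and study an estimator's behaviour.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 5.0
n, n_sims = 30, 2000

covered = 0
for _ in range(n_sims):
    sample = rng.normal(loc=true_mean, scale=2.0, size=n)   # data from a known distribution
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    covered += (lo <= true_mean <= hi)

# Because we know the true mean, we can check how often the 95% interval actually covers it.
print(covered / n_sims)
```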
