! S8: Under- & Oversampling Flashcards
1
Q
Sampling
A
= statistical process in which a subset of the data is taken (to estimate characteristics of the whole population)
2
Q
Undersampling
A
- taking a subset of the over-represented category in the data
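A minimal sketch of random undersampling, assuming the simplest strategy of shrinking every class to the size of the smallest one. The function name `random_undersample` and the `seed` parameter are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=0):
    """Randomly keep a subset of each larger class so all classes match the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())  # size of the rarest class
    out_x, out_y = [], []
    for y, xs in by_class.items():
        out_x.extend(rng.sample(xs, target))  # subset drawn without replacement
        out_y.extend([y] * target)
    return out_x, out_y

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2            # class 0 is over-represented
X_bal, y_bal = random_undersample(X, y)
print(Counter(y_bal))            # both classes now have 2 samples each
```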
3
Q
Ways of handling unbalanced data sets
A
- collecting more data
- modifying class weights
- oversampling
- undersampling
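As an illustration of the "modifying class weights" option, here is a sketch that assumes the common inverse-frequency heuristic, weight = n_samples / (n_classes * class_count). Other weighting schemes exist; the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# 90 majority examples and 10 minority examples
weights = balanced_class_weights([0] * 90 + [1] * 10)
print(weights)  # the minority class (1) gets the larger weight
```

Such weights are typically passed to a classifier's loss function so that errors on the minority class cost more.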
4
Q
Oversampling
A
- uses synthetic data generation to increase the number of samples in the under-represented category of the data
5
Q
SMOTE - Steps
A
- Find the k nearest neighbors (kNN) of each sample in the minority class
- Randomly select one or more of those neighbors
- Create each new sample: new sample = original sample + (neighbor - original sample) * random number between 0 and 1
- Add the new samples to the minority class
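The steps above can be sketched in plain Python. This is a simplified illustration, not a reference implementation: it assumes Euclidean distance, at least two minority samples, and hypothetical parameter names (`n_per_sample`, `k`, `seed`).

```python
import math
import random

def smote(minority, n_per_sample=1, k=5, seed=0):
    """Return the minority samples plus interpolated synthetic samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for i, base in enumerate(minority):
        # Step 1: k nearest minority neighbors of this sample (itself excluded)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        for _ in range(n_per_sample):
            # Step 2: pick one of the neighbors at random
            nb = rng.choice(neighbors)
            # Step 3: original + (neighbor - original) * random number in [0, 1)
            gap = rng.random()
            synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    # Step 4: add the new samples to the minority class
    return minority + synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
augmented = smote(minority, n_per_sample=2, k=2)
print(len(augmented))  # 4 original + 8 synthetic samples
```

Because every synthetic point lies on the line segment between a sample and one of its neighbors, the new points stay inside the region already covered by the minority class.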
6
Q
Oversampling techniques
A
- SMOTE
- ADASYN
7
Q
ADASYN
A
- extension of SMOTE
- Adaptive Synthetic Sampling
- adaptively generates minority samples according to their local density distribution, estimated with kNN
- adaptively updates the distribution
- no assumptions made for the underlying distribution of the data
- dynamically adjusts the weight given to each minority sample
8
Q
SMOTE - Application
A
- takes the entire dataset but only adds examples to the minority class
- SMOTE percentage parameter = 100 -> doubles the number of minority class examples
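The arithmetic behind the percentage parameter can be checked with a tiny helper. It assumes the percentage means "synthetic examples created per 100 existing minority examples" and that it is a whole number; `smote_counts` is a hypothetical name.

```python
def smote_counts(n_minority, percentage):
    """Return (number of synthetic examples, new minority class size)."""
    n_synthetic = n_minority * percentage // 100
    return n_synthetic, n_minority + n_synthetic

print(smote_counts(50, 100))  # (50, 100): the minority class doubles
print(smote_counts(50, 200))  # (100, 150): the minority class triples
```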
9
Q
SMOTE vs ADASYN
A
- SMOTE: generates an arbitrary number of new minority examples -> shifts the classifier's learning bias toward the minority class
- ADASYN: generates new minority examples via linear interpolation between existing minority examples -> shifts the classifier's decision boundary to focus on the examples that are difficult to learn -> improves learning performance
- ADASYN: the density distribution is the criterion for automatically deciding the number of synthetic samples for each minority sample
- SMOTE: generates the same number of synthetic samples for each minority sample
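The difference in allocation can be sketched as follows. This illustration assumes the usual ADASYN weighting, where each minority sample's share is proportional to the fraction of majority samples among its k nearest neighbors, with Euclidean distance. The function name `adasyn_allocation` and its parameters are hypothetical.

```python
import math

def adasyn_allocation(minority, majority, n_total, k=5):
    """Return how many synthetic samples to generate for each minority sample."""
    ratios = []
    for i, x in enumerate(minority):
        # Distances to all other points, flagged True if the point is majority
        pool = [(math.dist(x, p), False) for j, p in enumerate(minority) if j != i]
        pool += [(math.dist(x, p), True) for p in majority]
        nearest = sorted(pool, key=lambda t: t[0])[:k]
        # Fraction of majority points among the k nearest neighbors
        ratios.append(sum(is_maj for _, is_maj in nearest) / k)
    total = sum(ratios)
    if total == 0:  # no minority sample has any majority neighbor
        return [0] * len(minority)
    # Rounding means the counts may not add up to exactly n_total
    return [round(n_total * r / total) for r in ratios]

minority = [[0.0], [0.1], [0.2], [5.0]]
majority = [[4.8], [4.9], [5.1], [5.2], [9.0]]
print(adasyn_allocation(minority, majority, n_total=6, k=3))
# the sample at 5.0, surrounded by majority points, receives the largest share
```

Under SMOTE every minority sample would receive the same count; here the hard-to-learn sample near the majority class receives more.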