! S8: Under- & Oversampling Flashcards

1
Q

Sampling

A

= statistical process in which a subset of the data is taken (to estimate characteristics of the whole population)

2
Q

Undersampling

A
  • taking a subset of the over-represented category in the data
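A minimal sketch of one simple variant, random undersampling, assuming the goal is to shrink every class to the size of the smallest one. The function name `random_undersample` and the toy data are hypothetical and only for illustration.

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so that every class is as small as the smallest class."""
    rng = random.Random(seed)
    # Group the samples by their class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    # The smallest class sets the target size for all classes.
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        kept = rng.sample(xs, target)  # random subset without replacement
        out_x.extend(kept)
        out_y.extend([y] * target)
    return out_x, out_y

# Toy data: 8 "neg" samples (over-represented) and 2 "pos" samples.
X = list(range(10))
y = ["neg"] * 8 + ["pos"] * 2
X_bal, y_bal = random_undersample(X, y)
counts = Counter(y_bal)  # both classes now have 2 samples
```

The minority samples are all kept; only the over-represented class loses data, which is the main cost of this approach.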
3
Q

Ways of handling unbalanced data sets

A
  • collecting more data
  • modifying class weights
  • oversampling
  • undersampling
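For the "modifying class weights" option, one common heuristic (assumed here, not stated on the card) is to weight each class inversely to its frequency: weight = n_samples / (n_classes * class_count). The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: rare classes get larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# 8 majority vs 2 minority labels.
weights = balanced_class_weights(["neg"] * 8 + ["pos"] * 2)
# neg: 10 / (2 * 8) = 0.625, pos: 10 / (2 * 2) = 2.5
```

With these weights a misclassified minority example counts four times as much as a majority one, without adding or removing any data.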
4
Q

Oversampling

A
  • uses synthetic data generation to increase the number of samples in the under-represented category
5
Q

SMOTE - Steps

A
  1. Find the k nearest neighbours (kNN) of each sample in the minority class
  2. Randomly select one or more of those neighbours
  3. Create each new sample = original sample + (selected neighbour - original sample) * random number between 0 and 1
  4. Add the new samples to the minority class
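The four steps can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the function name `smote`, its parameters, and the toy points are assumptions, it uses Euclidean distance, and it expects at least two minority samples.

```python
import math
import random

def smote(minority, n_new_per_sample=1, k=3, seed=0):
    """Return the minority samples plus interpolated synthetic samples."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: k nearest minority neighbours of x (x itself excluded).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_new_per_sample):
            # Step 2: pick one of those neighbours at random.
            nb = rng.choice(neighbours)
            # Step 3: move a random fraction of the way from x to the neighbour.
            gap = rng.random()  # number in [0, 1)
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    # Step 4: add the new samples to the minority class.
    return minority + synthetic

# Four minority points on the corners of a unit square.
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
augmented = smote(minority, n_new_per_sample=1, k=2)
```

Because every synthetic point lies on a line segment between two existing minority points, the new samples stay inside the region the minority class already occupies.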
6
Q

Oversampling techniques

A
  • SMOTE
  • ADASYN
7
Q

ADASYN

A
  • extension of SMOTE
  • ADASYN = Adaptive Synthetic Sampling
  • adaptively generates minority samples according to a density distribution estimated with kNN
  • adaptively updates the distribution
  • no assumptions made for the underlying distribution of the data
  • dynamically adjusts the weights of the minority samples
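A sketch of the adaptive part only: deciding how many synthetic samples each minority sample receives. It assumes the usual ADASYN formulation, where the weight of a minority sample is the share of majority-class points among its k nearest neighbours, normalised to sum to 1. The function name `adasyn_allocation`, the `beta` balance parameter, and the 1-D toy data are assumptions for illustration.

```python
import math

def adasyn_allocation(minority, majority, k=3, beta=1.0):
    """Return the number of synthetic samples to create per minority sample."""
    # Pool all points; the flag is 1 for majority, 0 for minority.
    everything = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for i, x in enumerate(minority):
        others = [item for j, item in enumerate(everything) if j != i]
        nearest = sorted(others, key=lambda item: math.dist(x, item[0]))[:k]
        # Share of majority points among the k nearest neighbours.
        ratios.append(sum(flag for _, flag in nearest) / k)
    total = sum(ratios)
    if total == 0:
        return [0] * len(minority)  # no minority sample is near the majority
    # Total number of synthetic samples needed to reach the desired balance.
    n_needed = (len(majority) - len(minority)) * beta
    return [round(r / total * n_needed) for r in ratios]

# Four minority points far from the majority, one surrounded by it.
minority = [[0.0], [0.1], [0.2], [0.3], [5.0]]
majority = [[4.8], [4.9], [5.1], [5.2], [9.0], [9.1], [9.2], [9.3]]
allocation = adasyn_allocation(minority, majority, k=3)
# allocation == [0, 0, 0, 0, 3]
```

All three new samples go to the one minority point that sits among majority points, which is the "difficult to learn" case. Because of rounding, the allocated counts may not sum exactly to the requested total in general.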
8
Q

SMOTE - Application

A
  • takes entire dataset but just increases minority class examples
  • SMOTE percentage parameter = 100 -> doubles minority class examples
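The percentage arithmetic can be checked with a small helper. The name `minority_size_after_smote` is hypothetical, and the integer division for percentages that are not multiples of 100 is a simplifying assumption.

```python
def minority_size_after_smote(n_minority, percentage):
    """Minority class size after adding `percentage` percent synthetic samples."""
    n_synthetic = n_minority * percentage // 100
    return n_minority + n_synthetic

doubled = minority_size_after_smote(50, 100)  # 50 original + 50 synthetic
tripled = minority_size_after_smote(50, 200)  # 50 original + 100 synthetic
```

The majority class is left untouched, so only the minority count changes.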
9
Q

SMOTE vs ADASYN

A
  • SMOTE: generates an arbitrary number of new minority examples -> shifts the classifier's learning bias toward the minority class
  • ADASYN: generates new minority examples via linear interpolation between existing minority-class examples -> shifts the classifier's decision boundary to focus on examples that are difficult to learn -> improves learning performance
  • ADASYN: the density distribution is the criterion that automatically decides the number of synthetic samples for each minority sample
  • SMOTE: generates the same number of synthetic samples for each minority sample