Lecture 10: Imbalanced Data Flashcards
random undersampling
Drop samples from the majority class
Fast training (smaller dataset)
Loses data, possibly discarding informative samples
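A minimal sketch of random undersampling, assuming the imbalanced-learn (imblearn) package is available (the lecture does not name a library):

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Toy imbalanced dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# Drop majority-class samples at random until the classes match.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))
```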
random oversampling
Repeat (duplicate) samples from the minority class
Much slower training (larger dataset)
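The mirror-image sketch with imbalanced-learn's RandomOverSampler (again assuming imblearn):

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Duplicate minority-class samples (sampling with replacement)
# until both classes reach the majority count.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))
```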
class weight
Reweight the loss function
Same effect as oversampling, but not as expensive
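A sketch of loss reweighting via scikit-learn's class_weight parameter (the weight values below are illustrative, not from the lecture):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 'balanced' weights each class by n_samples / (n_classes * count(class)),
# so errors on the rare class cost proportionally more in the loss.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Explicit weights work too, e.g. a 10x penalty on minority-class errors:
clf = LogisticRegression(class_weight={0: 1.0, 1: 10.0}).fit(X, y)
```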
ensemble resampling
Random resampling done separately for each estimator in an ensemble (see the sketch below)
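One way to realize this, assuming imbalanced-learn's BalancedBaggingClassifier (an assumption; the lecture names no implementation):

```python
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Each of the 10 base estimators (decision trees by default) is trained
# on its own independently drawn, class-balanced bootstrap sample.
clf = BalancedBaggingClassifier(n_estimators=10, random_state=0).fit(X, y)
print(clf.predict(X[:5]))
```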
edited nearest neighbors
Reduces the dataset for kNN
Removes all samples that are misclassified by kNN from the training set
Cleans up outliers and noisy class boundaries
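A sketch with imbalanced-learn's EditedNearestNeighbours (assumed available; n_neighbors=3 is its default, not a lecture-specified value):

```python
from collections import Counter

from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Remove samples whose own class disagrees with the majority vote of
# their 3 nearest neighbours; only misclassified points are dropped.
X_res, y_res = EditedNearestNeighbours(n_neighbors=3).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```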
condensed nearest neighbors
Iteratively adds points that are misclassified by kNN to a kept subset
Focuses on the boundaries
Removes many samples
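A sketch with imbalanced-learn's CondensedNearestNeighbour (assumed available):

```python
from collections import Counter

from imblearn.under_sampling import CondensedNearestNeighbour
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Grow a subset by adding each sample that a 1-NN rule on the current
# subset misclassifies; interior majority points are never added.
X_res, y_res = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```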
synthetic sample generation (SMOTE)
Adds synthetic interpolated samples to the minority class
For each sample in the minority class:
Pick a random neighbor from its k nearest minority neighbors
Pick a point uniformly at random on the line connecting the two
Results in a large dataset (slower training)
Often combined with undersampling strategies (see the sketch below)
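A sketch of SMOTE with imbalanced-learn (assumed available); the SMOTEENN combination at the end illustrates one standard pairing with undersampling, not necessarily the one from the lecture:

```python
from collections import Counter

from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# For each minority sample: pick one of its 5 nearest minority neighbours,
# then create a synthetic point uniformly along the segment between them.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))

# Combined with undersampling: SMOTE followed by
# Edited Nearest Neighbours cleaning (SMOTEENN).
X_c, y_c = SMOTEENN(random_state=0).fit_resample(X, y)
print(Counter(y_c))
```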