Techniques for handling class imbalance Flashcards
What is class imbalance?
Class imbalance occurs when one class significantly outnumbers another in a dataset, leading to biased model predictions.
Why does class imbalance cause issues in classification?
It leads to models being biased toward the majority class, reducing the ability to correctly classify the minority class.
What is an example of extreme class imbalance?
A dataset with 998 negative samples and only 2 positive samples; predicting only negatives yields 99.8% accuracy but is not useful.
What is binary cross-entropy loss?
A loss function used in binary classification that measures the difference between predicted probabilities and actual class labels.
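A minimal NumPy sketch (not part of the original card) of how binary cross-entropy is computed for a batch of predictions:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Clip predictions so log(0) is never evaluated
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # BCE = -[y*log(p) + (1 - y)*log(1 - p)], averaged over the batch
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 0, 1])
y_pred = np.array([0.9, 0.2, 0.1, 0.6])
print(binary_cross_entropy(y_true, y_pred))  # ~0.236
```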
How does cross-entropy loss behave in imbalanced datasets?
It favors the majority class, making it difficult for the model to learn the minority class boundaries.
What is weighted cross-entropy?
A modification of cross-entropy loss in which errors on the minority class are given a higher weight, so the model is penalized more for misclassifying minority samples.
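A hedged sketch of the same loss with a weight applied to the positive (minority) term; the weight of 5.0 is illustrative, not prescribed by the card:

```python
import numpy as np

def weighted_binary_cross_entropy(y_true, y_pred, pos_weight=5.0, eps=1e-7):
    # Errors on positive (minority) samples are multiplied by pos_weight,
    # so missing a minority sample costs more than missing a majority one
    y_pred = np.clip(y_pred, eps, 1 - eps)
    loss = -(pos_weight * y_true * np.log(y_pred)
             + (1 - y_true) * np.log(1 - y_pred))
    return loss.mean()
```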
How do you apply class weights in TensorFlow for binary classification?
Use class_weight={0: 1.0, 1: 5.0} when calling model.fit().
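A minimal Keras sketch of this call; the model architecture and toy data below are illustrative placeholders, and the 5.0 weight simply makes each minority-class error count five times as much:

```python
import numpy as np
import tensorflow as tf

# Toy imbalanced data: 950 negatives, 50 positives (illustrative)
X = np.random.rand(1000, 20).astype("float32")
y = np.concatenate([np.zeros(950), np.ones(50)]).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Errors on class 1 (the minority) are weighted 5x compared to class 0
model.fit(X, y, epochs=5, batch_size=32, class_weight={0: 1.0, 1: 5.0})
```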
What is categorical cross-entropy?
A loss function used for multi-class classification that compares predicted probabilities with actual class labels.
What is weighted categorical cross-entropy?
A variant of categorical cross-entropy where different class weights are assigned to balance learning across classes.
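A small NumPy sketch of weighted categorical cross-entropy over one-hot labels; the per-class weights are illustrative:

```python
import numpy as np

def weighted_categorical_cross_entropy(y_true, y_pred, class_weights, eps=1e-7):
    # y_true: one-hot labels, y_pred: predicted probabilities, both (n_samples, n_classes)
    y_pred = np.clip(y_pred, eps, 1.0)
    # Each sample's log-loss is scaled by the weight of its true class
    per_sample = -np.sum(class_weights * y_true * np.log(y_pred), axis=1)
    return per_sample.mean()

y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.2, 0.2, 0.6]])
print(weighted_categorical_cross_entropy(y_true, y_pred, np.array([1.0, 1.0, 3.0])))
```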
Name three potential solutions for handling class imbalance.
Collecting more data, oversampling the minority class, and undersampling the majority class.
What is random oversampling?
Duplicating minority class samples to balance the dataset.
What is random undersampling?
Removing samples from the majority class to create a balanced dataset.
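A sketch of both techniques using the imbalanced-learn library (assumed installed as imbalanced-learn); the data are illustrative:

```python
from collections import Counter

import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X = np.random.rand(1000, 5)
y = np.array([0] * 950 + [1] * 50)  # imbalanced labels (illustrative)

# Duplicate minority samples until the classes are balanced
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
# Drop majority samples until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)

print(Counter(y_over))   # {0: 950, 1: 950}
print(Counter(y_under))  # {0: 50, 1: 50}
```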
What is SMOTE?
Synthetic Minority Over-sampling Technique, a method that generates synthetic minority samples by interpolating between existing instances.
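A sketch of SMOTE with imbalanced-learn; k_neighbors=5 is the library's default and is shown only for illustration:

```python
from collections import Counter

import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.rand(1000, 5)
y = np.array([0] * 950 + [1] * 50)  # imbalanced labels (illustrative)

# New minority points are synthesized by interpolating between a minority
# sample and one of its k nearest minority-class neighbors
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_res))  # {0: 950, 1: 950}
```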
What are the drawbacks of random oversampling and undersampling?
Oversampling can lead to overfitting, and undersampling may cause loss of valuable information.
What is data augmentation?
A technique that generates new training examples by applying transformations like rotation, flipping, and zooming.
How does data augmentation help with class imbalance?
It creates additional, varied minority-class samples from existing ones, improving balance without the exact duplication that makes random oversampling prone to overfitting.
What is an example of an image data augmentation parameter?
In Keras' ImageDataGenerator, rotation_range=40 randomly rotates images by up to 40 degrees.
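A minimal Keras sketch using rotation_range=40; the other transform values and array shapes are illustrative:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=40,     # rotate randomly by up to 40 degrees
    horizontal_flip=True,  # random left-right flips
    zoom_range=0.2,        # random zoom by up to 20%
)

# Generate augmented batches from minority-class images (illustrative shapes)
minority_images = np.random.rand(50, 64, 64, 3)
augmented_batch = next(datagen.flow(minority_images, batch_size=32))
```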
What are generative models used for handling class imbalance?
Autoencoders and GANs (generative adversarial networks) can generate synthetic minority-class samples to improve balance.
Why should synthetic data generation be done only on the training set?
To avoid data leakage and ensure the test set remains an unbiased evaluation benchmark.
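A sketch of that workflow with scikit-learn and imbalanced-learn (both assumed available): split first, then resample only the training portion:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)
y = np.array([0] * 950 + [1] * 50)  # imbalanced labels (illustrative)

# Hold out the test set first; it keeps the original, imbalanced distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Synthetic samples are generated from the training split only
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```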
What are three common performance metrics besides accuracy?
Precision, recall, and F1-score.
Why is accuracy misleading in imbalanced datasets?
A model can achieve high accuracy by predicting only the majority class while failing to classify the minority class correctly.
What is recall?
The fraction of actual positives correctly identified by the model, calculated as TP / (TP + FN).
What is precision?
The fraction of predicted positives that are actually correct, calculated as TP / (TP + FP).
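A short scikit-learn sketch computing these metrics on illustrative predictions:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]

# This toy example has 3 TP, 1 FP, and 1 FN
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall = 0.75
```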
What paper introduced SMOTE?
“SMOTE: Synthetic Minority Over-sampling Technique” by Chawla et al., 2002.