Class Imbalance Flashcards
What is class imbalance in machine learning?
A situation where a dataset’s target (response) variable contains many more instances of one outcome than another.
What are the majority and minority classes?
The majority class is the class with more instances; the minority class is the class with fewer instances.
What are two common techniques to fix potential issues with class imbalance?
Upsampling and downsampling.
What does downsampling do?
It reduces the majority class by using only part of its original observations, which produces a more even split.
What is upsampling?
It artificially increases the frequency of the minority class, typically by duplicating existing minority observations or generating synthetic ones.
How common is it for a dataset to have a perfectly balanced split of classes?
It’s extremely rare. Most datasets have some degree of imbalance, with one class having more examples than the others.
Can a dataset with some imbalance still be useful for training a machine learning model?
Yes. In many cases, a moderate imbalance such as 70/30 or 80/20 is acceptable for training, and the model can still learn effectively from the data.
When does class imbalance become a major concern for machine learning models?
Major issues tend to arise when the majority class makes up 90% or more of the dataset. The model can then become biased toward the majority class and perform poorly on the minority class.
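A quick way to see where a dataset falls relative to these thresholds is to compute each class's share of the labels. The sketch below is only an illustration: the function names and the example labels are hypothetical, and the 0.9 cutoff is simply the rule of thumb from the card above.

```python
from collections import Counter

def class_proportions(labels):
    """Return each class's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

def majority_share(labels):
    """Return the share held by the most frequent class."""
    return max(class_proportions(labels).values())

# Hypothetical labels: 95 negatives (0) and 5 positives (1).
labels = [0] * 95 + [1] * 5

print(class_proportions(labels))       # {0: 0.95, 1: 0.05}
print(majority_share(labels) >= 0.9)   # True: imbalance is likely a concern
```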
For which kind of variable should class imbalance be considered?
A categorical target variable.
Which type of model does class balancing apply to?
Classification models.
What is class balancing in machine learning?
Class balancing refers to techniques that adjust the number of samples in a dataset to make the proportions of different classes more even.
Why is class balancing important?
Imbalanced datasets can lead models to be biased towards the majority class and perform poorly on the minority class.
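One way to see this bias is to score a "model" that always predicts the majority class. The sketch below uses hypothetical labels (95 negatives, 5 positives) and small helper functions written for this illustration; it shows that overall accuracy can look high while no minority-class instance is ever identified.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of actual `positive` instances that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Hypothetical data: class 0 is the majority, class 1 the minority.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class

print(accuracy(y_true, y_pred))            # 0.95
print(recall(y_true, y_pred, positive=1))  # 0.0
```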
Which datasets is downsampling suitable for?
Downsampling is suitable for large datasets (tens of thousands of observations or more), where removing majority-class observations still leaves plenty of data. It’s important to check that model performance doesn’t suffer from the reduced data.
How is downsampling done?
By randomly selecting and removing observations from the majority class.
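Random downsampling as described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `downsample` function, its parameters, and the example data are hypothetical, and it assumes the goal is to shrink the majority class to the combined size of the remaining classes (an even split in the binary case).

```python
import random

def downsample(rows, labels, majority_label, seed=0):
    """Randomly drop majority-class rows until they match the size of the rest."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    others = [i for i, y in enumerate(labels) if y != majority_label]
    # Keep a random subset of majority rows, sampled without replacement.
    keep = rng.sample(majority, k=min(len(majority), len(others)))
    idx = sorted(keep + others)  # preserve the original row order
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Hypothetical data: 90 majority rows (label 0) and 10 minority rows (label 1).
rows = list(range(100))
labels = [0] * 90 + [1] * 10

new_rows, new_labels = downsample(rows, labels, majority_label=0)
print(new_labels.count(0), new_labels.count(1))  # 10 10
```

Resampling is usually applied to the training split only, so that evaluation data keeps the original class distribution.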
When should upsampling be used?
Upsampling is used for smaller datasets where removing data from the majority class is not feasible.
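A simple form of upsampling duplicates minority-class rows chosen at random with replacement. The sketch below illustrates that idea under stated assumptions: the `upsample` function, its parameters, and the example data are hypothetical, the minority class is assumed to be non-empty, and the target is to match the combined size of the other classes. Generating synthetic observations is another option that this sketch does not cover.

```python
import random

def upsample(rows, labels, minority_label, seed=0):
    """Duplicate random minority-class rows until they match the size of the rest."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    others = [i for i, y in enumerate(labels) if y != minority_label]
    extra_needed = max(0, len(others) - len(minority))
    # Sample with replacement, so the same minority row can be copied repeatedly.
    extra = rng.choices(minority, k=extra_needed)
    idx = list(range(len(labels))) + extra  # all original rows plus the copies
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Hypothetical data: 90 majority rows (label 0) and 10 minority rows (label 1).
rows = list(range(100))
labels = [0] * 90 + [1] * 10

new_rows, new_labels = upsample(rows, labels, minority_label=1)
print(new_labels.count(0), new_labels.count(1))  # 90 90
```

Because the copies are exact duplicates, upsampling is usually done after splitting off the test data, so that copies of a training row do not also appear in the evaluation set.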