Bias and Imbalance Flashcards
What is class imbalance?
Class imbalance is when a dataset contains many more examples of one class than of another. For example, a dataset with 988 negative examples and 12 positive examples.
What problem does class imbalance cause?
It biases the model towards the majority class and makes accuracy a misleading measure. If a dataset is severely imbalanced, for example 99% negative, a model that always predicts negative will be 99% accurate without ever identifying a positive example.
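The accuracy trap can be checked with a little arithmetic. The sketch below assumes a hypothetical dataset of 990 negative and 10 positive labels (99% negative) and a "model" that always predicts negative; the variable names are illustrative only.

```python
# Hypothetical labels: 990 negative (0) and 10 positive (1), i.e. 99% negative.
labels = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts negative.
predictions = [0] * len(labels)

# Accuracy: fraction of all predictions that match the label.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Recall: fraction of the actual positives that the model found.
true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

Accuracy looks excellent, but recall on the positive class is zero, which is why a per-class metric is usually worth checking on imbalanced data.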
What are oversampling and undersampling?
Oversampling and undersampling are techniques for reducing class imbalance in a dataset. Oversampling duplicates data points from the minority class; undersampling deletes data points from the majority class.
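A minimal sketch of random over- and undersampling, assuming a binary problem in which the minority class really is the smaller one. The function names and the 988/12 toy dataset are illustrative, not taken from any particular library.

```python
import random
from collections import Counter


def split_by_class(samples, labels, minority_label):
    """Split (sample, label) pairs into minority and majority lists."""
    pairs = list(zip(samples, labels))
    minority = [pair for pair in pairs if pair[1] == minority_label]
    majority = [pair for pair in pairs if pair[1] != minority_label]
    return minority, majority


def random_oversample(samples, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority pairs until both classes are equal in size."""
    rng = random.Random(seed)
    minority, majority = split_by_class(samples, labels, minority_label)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def random_undersample(samples, labels, minority_label, seed=0):
    """Keep a random subset of majority pairs the same size as the minority class."""
    rng = random.Random(seed)
    minority, majority = split_by_class(samples, labels, minority_label)
    kept = rng.sample(majority, len(minority))  # assumes minority is the smaller class
    return kept + minority


# Toy dataset: 988 negative (0) and 12 positive (1) examples.
labels = [0] * 988 + [1] * 12
samples = list(range(len(labels)))  # stand-in feature values

over = random_oversample(samples, labels, minority_label=1)
under = random_undersample(samples, labels, minority_label=1)

over_counts = Counter(label for _, label in over)
under_counts = Counter(label for _, label in under)
print(over_counts[0], over_counts[1])    # 988 988
print(under_counts[0], under_counts[1])  # 12 12
```

Resampling is normally applied to the training split only, so that evaluation data keeps the original class proportions.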
What are two of the problems with over/undersampling?
Undersampling deletes data from the majority class, so we may lose important information.
Oversampling duplicates data from the minority class, so the model could overfit to the same few minority examples.
What is the meaning of overfitting?
Overfitting is when a model learns its training data too closely, including its noise, and so fails to generalise to new data. A common cause is having too few distinct samples to learn from.
What is the meaning of bias?
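In this context, bias is a model's systematic tendency to favour one class, usually the majority class. It can be caused by class imbalance, and it leaves the model unable to learn a useful boundary between the classes.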
What is regularisation?
Regularisation is a method of applying ‘penalties’ or constraints to a model during training that discourage it from fitting the training data too closely, which helps it generalise better. One example is dropout.
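To make the dropout example concrete, here is a minimal sketch of "inverted" dropout on a plain list of activations. The function name, the `rate` parameter and the `seed` argument are assumptions made for illustration; deep learning frameworks provide their own implementations.

```python
import random


def dropout(activations, rate, training=True, seed=None):
    """Zero each activation with probability `rate` and scale the survivors.

    Scaling by 1 / (1 - rate) keeps the expected value of each activation
    unchanged, so nothing needs to be rescaled at inference time.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # dropout is switched off outside training
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


activations = [1.0] * 1000
train_out = dropout(activations, rate=0.5, training=True, seed=0)
eval_out = dropout(activations, rate=0.5, training=False)

dropped = sum(1 for value in train_out if value == 0.0)
print(f"dropped {dropped} of {len(train_out)} activations during training")
print(eval_out == activations)  # True
```

Because a different random subset of units is silenced on each training step, the model cannot rely on any single unit, which is the sense in which learning is made harder.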
What is data augmentation?
Data augmentation is a method of applying transformations to existing training data to create additional, modified examples (for instance, flipping or rotating images), which helps the model become more robust and generalisable.
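As a small illustration, the sketch below treats an image as a nested list of pixel values and produces two transformed copies. The helper names and the tiny 2x3 image are hypothetical, and real pipelines typically use an image library for this.

```python
def horizontal_flip(image):
    """Mirror the image left to right by reversing each row."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate the image 90 degrees clockwise (rows become columns)."""
    return [list(column) for column in zip(*image[::-1])]


# A tiny 2x3 "image" of pixel values.
image = [
    [1, 2, 3],
    [4, 5, 6],
]

flipped = horizontal_flip(image)
rotated = rotate_90_clockwise(image)

# The augmented training set holds the original plus its transformed copies.
augmented = [image, flipped, rotated]

print(flipped)  # [[3, 2, 1], [6, 5, 4]]
print(rotated)  # [[4, 1], [5, 2], [6, 3]]
```

Each transformed copy keeps the label of the original, so transformations should only be used when they do not change what the example represents.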