ML - Over/Under fitting, Bias and Variance Flashcards
What are the two major sources of error in ML?
Bias - An algorithm's error rate on the training set.
Variance - How much worse an algorithm does on the test set than on the training set.
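A minimal sketch of these working definitions in Python (the variable names and the 1%/11% figures are illustrative, taken from the worked examples later in these cards):

```python
# Rough working definitions, with error rates expressed as fractions.
train_error = 0.01
test_error = 0.11

bias = train_error                    # error on data the model has already seen
variance = test_error - train_error   # extra error from failing to generalise
print(f"bias={bias:.0%}, variance={variance:.0%}")  # bias=1%, variance=10%
```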
Explain Bias
The overall inaccuracy of the model, caused by erroneous assumptions made during training.
Why should we aim to reduce Bias?
We try to reduce bias because high bias means the algorithm building the model fails to capture the relationships between the features and the ideal output - this is underfitting.
Explain Variance
Variance is the error we get as a result of sensitivity to small, unrepresentative fluctuations (noise) in the training data set. Variance describes the case where random fluctuations in the training data become part of the model.
Why should we aim to reduce Variance?
We try to reduce variance because high variance means the algorithm is failing to generalise from the training set to the test set - a sign of overfitting.
What is Overfitting?
Overfitting occurs when a model has captured some of the random ‘noise’ in the data as well as (or instead of) the ‘real’ underlying relationships.
As a model becomes more complex, the danger of overfitting increases.
What is Underfitting?
Underfitting occurs when the algorithm is too insensitive and overlooks the underlying patterns in the data. It is likely to miss significant trends, causing the model to yield less accurate predictions on both current and future data.
As complexity increases does Bias decrease/Variance increase or Bias increase/Variance decrease?
As a general rule, as the complexity of the model increases, the bias decreases, however the variance increases.
There is a sweet spot.
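A minimal sketch of this trade-off, assuming NumPy is available; polynomial degree stands in for model complexity. Typically the training error keeps falling as the degree grows, while the test error falls and then rises again - the sweet spot is an intermediate degree:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # True signal plus noise: y = sin(2*pi*x) + Gaussian noise
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = sample(30)
x_test, y_test = sample(1000)

for degree in (1, 3, 9):  # low, moderate, and high model complexity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```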
If your training set has an error rate of 15% and you require it to be 5%, should you add more training data?
No. Adding more training data by itself will not reduce the training error - the 15% reflects bias, and more data mainly helps with variance. You should focus on other changes, such as increasing the capacity of your model.
Suppose your algorithm has error rates as follows:
Training error = 1%
Test error = 11%
Overfitting
Low Bias = 1%
High Variance = 10% (the gap between the 11% test error and the 1% training error)
Suppose your algorithm has error rates as follows:
Training error = 15%
Test error = 16%
Underfitting
High Bias = 15%
Low Variance = 1%
Suppose your algorithm has error rates as follows:
Training error = 15%
Test error = 30%
Both overfitting and underfitting at the same time - high bias and high variance, so the error is hard to attribute to a single cause.
High Bias = 15%
High Variance = 15%
Suppose your algorithm has error rates as follows:
Training error = 0.5%
Test error = 1%
It is doing very well, as both bias (0.5%) and variance (0.5%) are low.
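A small helper, sketched in Python, that classifies the four scenarios above from the training and test error rates (the function name and the 5% threshold are assumptions for illustration, not a standard rule):

```python
def diagnose(train_error, test_error, threshold=0.05):
    bias = train_error
    variance = test_error - train_error
    if bias > threshold and variance > threshold:
        return "high bias AND high variance (under- and overfitting)"
    if bias > threshold:
        return "high bias (underfitting)"
    if variance > threshold:
        return "high variance (overfitting)"
    return "low bias and low variance (doing well)"

for train, test in [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30), (0.005, 0.01)]:
    print(f"train={train:.1%}, test={test:.1%} -> {diagnose(train, test)}")
```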
What is the Optimal Error Rate?
This is unavoidable bias, also known as the Bayes error rate. It refers to the error rate inherent in whatever we are trying to model; even the best algorithm in the world could not achieve an error rate lower than this.
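Using the 15%/30% card above, and assuming (purely for illustration) that the optimal error rate for that task is 14%, one can split the bias into unavoidable and avoidable parts:

```python
optimal_error = 0.14  # assumed optimal / Bayes error rate (illustrative figure)
train_error = 0.15
test_error = 0.30

avoidable_bias = train_error - optimal_error  # 1% -> little to gain from reducing bias
variance = test_error - train_error           # 15% -> focus on reducing variance instead
print(f"avoidable bias={avoidable_bias:.0%}, variance={variance:.0%}")
```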
How do you address Bias?
Increase the size of your model (for example, add more parameters or features) so it can capture more complex relationships between the features and the output.
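A small sketch of the idea, assuming scikit-learn is available (the dataset and the choice of a decision tree are just for illustration): making the model bigger - here, deeper - drives the training error, and hence the bias, down, though as the cards above note it also raises the risk of higher variance.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

for depth in (1, 3, 10, None):  # None lets the tree grow fully (the "largest" model)
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    train_error = 1 - clf.score(X, y)  # error rate on the training set = bias
    print(f"max_depth={depth}: training error = {train_error:.1%}")
```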