4_Generalization Flashcards
Peril of Overfitting
Machine learning’s goal is to predict well on new data drawn from a (hidden) true probability distribution. Unfortunately, the model can’t see the whole truth; it can only learn from a sample, the training data set. If a model fits the current examples well, how can you trust that it will also make good predictions on never-before-seen examples?
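To make the peril concrete, here is a minimal sketch (assuming NumPy is available; the sine-plus-noise distribution, sample sizes, and polynomial degrees are illustrative choices, not anything prescribed by the course). It fits a modest and a very flexible polynomial to the same 15 examples, then compares their errors on fresh data from the same distribution:

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

def sample(n):
    """Draw n examples from a hidden 'true' distribution: y = sin(x) + noise."""
    x = rng.uniform(0, 3, n)
    return x, np.sin(x) + rng.normal(0, 0.2, n)

x_train, y_train = sample(15)    # the small sample the model actually sees
x_new, y_new = sample(1000)      # never-before-seen examples, same distribution

for degree in (3, 12):
    coefs = P.polyfit(x_train, y_train, degree)
    train_mse = np.mean((P.polyval(x_train, coefs) - y_train) ** 2)
    new_mse = np.mean((P.polyval(x_new, coefs) - y_new) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, unseen MSE {new_mse:.3f}")
```

The degree-12 fit typically drives its training error toward zero yet does worse on the unseen sample: it has memorized the peculiarities of the 15 training points rather than the underlying curve.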
William of Ockham, a 14th-century friar and philosopher, loved simplicity. He believed that scientists should prefer simpler formulas or theories over more complex ones. To put Ockham’s razor in machine learning terms:
The less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample.
In modern times, we’ve formalized Ockham’s razor into the fields of statistical learning theory and computational learning theory. These fields have developed generalization bounds: statistical descriptions of a model’s ability to generalize to new data based on factors such as the following (one classic bound is sketched just after the list):
- the complexity of the model
- the model’s performance on training data
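As one illustration of such a bound (a standard textbook result for a finite hypothesis class $H$, not necessarily the exact form any given analysis would use): with probability at least $1 - \delta$ over the draw of $m$ i.i.d. training examples, every hypothesis $h \in H$ satisfies

$$
R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{\ln\lvert H\rvert + \ln(1/\delta)}{2m}}
$$

where $R(h)$ is the true error on the hidden distribution and $\hat{R}(h)$ is the error on the training data. The $\ln\lvert H\rvert$ term stands in for model complexity and $\hat{R}(h)$ for training performance, mirroring the two factors above; the bound tightens as the training set grows.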
While these theoretical analyses provide formal guarantees under idealized assumptions, they can be difficult to apply in practice. Machine Learning Crash Course focuses instead on empirical evaluation to judge a model’s ability to generalize to new data.
A machine learning model aims to make good predictions on new, previously unseen data. But if you are building a model from your data set, how can you get previously unseen data? Well, one way is to divide your data set into two subsets (see the code sketch after this list):
- training set—a subset to train a model.
- test set—a subset to test the model.
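Here is a minimal sketch of that split (assuming scikit-learn; the synthetic data, the 80/20 ratio, and the seed are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for a real data set: 1,000 examples with 5 features each.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 20% of the examples as a test set; examples are shuffled first
# (the default behavior of train_test_split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 800 200
```

Shuffling before splitting matters for the i.i.d. assumptions discussed further below.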
Good performance on the test set is a useful indicator of good performance on new data in general (a sketch of this check follows the list), assuming that:
- The test set is large enough.
- You don’t cheat by using the same test set over and over.
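Putting the two subsets to work, a hedged sketch of the empirical check (scikit-learn again; logistic regression is a placeholder model, and the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# Comparable scores suggest the model generalizes; high training accuracy
# paired with much lower test accuracy is the classic overfitting symptom.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))
```

Roughly comparable scores are the "useful indicator" above; a training score far above the test score is the warning sign of overfitting.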
The Machine Learning Fine Print
The following three basic assumptions guide generalization:
- We draw examples independently and identically (i.i.d.) at random from the distribution. In other words, examples don’t influence each other. (An alternate explanation: i.i.d. is a way of referring to the randomness of variables.)
- The distribution is stationary; that is, the distribution doesn’t change within the data set.
- We draw examples for all partitions, including training, validation, and test sets, from the same distribution (see the shuffling sketch below).
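Since you can rarely verify these assumptions directly, a common practical hedge is to shuffle before partitioning so that the partitions at least look like draws from one distribution. A sketch (NumPy only; the data and the 80/20 split are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data set whose rows arrive in a non-random order
# (say, sorted by date); a naive head/tail split would then put
# systematically different examples into each partition.
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Shuffle so the training and test partitions are drawn alike.
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```

When stationarity genuinely fails, as with time-series data, shuffling masks the problem rather than fixing it; splitting by time is the more honest choice there.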