Data Science MODULE 3 Flashcards
What does it mean when a statistical model has been overfit?
It mirrors the training data extremely well, but performs poorly on unseen data
Why is overfitting such a big problem?
So the goal of a data scientist is to predict responses with a learned model. If you overfit, it has a direct negative impact on your ability to predict.
In tree-based models, the algorithm is programmed to? (2)
- reduce the impurity of partitions in a classification problem
- reduce the mean squared error in regression problems
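A minimal sketch of how these two objectives appear in scikit-learn's tree classes (the criterion names are sklearn's; X and the labels are random toy placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.random.rand(100, 3)  # toy features

# Classification: splits are chosen to reduce partition impurity (Gini here).
clf = DecisionTreeClassifier(criterion="gini")
clf.fit(X, np.random.randint(0, 2, 100))

# Regression: splits are chosen to reduce the mean squared error.
reg = DecisionTreeRegressor(criterion="squared_error")
reg.fit(X, np.random.rand(100))
```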
When fitting data we need to find the balance between?
Bias and variance
What is variance?
It indicates an error related to how robust the model is when unseen data is used
What does it mean when a model has high variance?
It performs well on training data but not on the test set, i.e. the model is overfitted
The ideal model in terms of variance and bias?
Low variance and low bias
Models with high bias tend to be?
Too simple; they don't capture the shape of the data
Can you measure the variance and bias?
Nope, because we don't know what the underlying data actually looks like
What is a validation error?
As I understand it, you train the model but also keep test data aside. Once training is done, you test the model on that unseen data and determine the error. This is known as the validation set approach.
If the validation error is high but the training error is small?
You have done it: you have overfitted the data
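A minimal sketch of this diagnostic, assuming scikit-learn and deliberately random toy labels so a deep tree can only memorise:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Random labels: there is no real signal, so the tree can only memorise.
X, y = np.random.rand(300, 5), np.random.randint(0, 2, 300)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)
tree = DecisionTreeClassifier().fit(X_train, y_train)

print(tree.score(X_train, y_train))  # ~1.0: tiny training error
print(tree.score(X_val, y_val))      # ~0.5: large validation error -> overfit
```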
So there are actually three sets of data we work with
The initial training data, the validation data, and then the test data
So during fitting, how is the validation data actually used?
We try to find the best value for the hyperparameter (denoted as alpha)
Different hyperparameters are assessed to determine the one that results in?
The lowest validation error
So when do we actually start using the test data?
Once the validation error has been minimised through the hyperparameter adjustments
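A minimal sketch of that tune-then-test workflow, assuming scikit-learn's GridSearchCV and using the pruning hyperparameter ccp_alpha as the alpha being tuned (the alpha grid is an illustrative guess):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = np.random.rand(200, 4), np.random.randint(0, 2, 200)  # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# Validation stage: pick the alpha with the lowest validation error.
grid = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid={"ccp_alpha": [0.0, 0.01, 0.05, 0.1]},  # illustrative grid
    cv=5,
)
grid.fit(X_train, y_train)

# Only now is the test set touched, once, with the chosen model.
print(grid.best_params_, grid.score(X_test, y_test))
```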
Two of the most common types of validation used?
- K-fold cross validation
- Standard validation (the validation set approach)
What does the data split look like for standard validation?
70% for the training set, 20% for the validation set, 10% for the test set
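A minimal sketch of that 70/20/10 split, assuming scikit-learn's train_test_split (X and y are toy placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(100, 4), np.random.randint(0, 2, 100)  # toy data

# First carve off the 10% test set ...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10)

# ... then take 20% of the original data as validation: 0.20 / 0.90 of the rest.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.20 / 0.90
)
```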
Broadly, how does k-fold cross validation work?
Initially the data is simply split into training data and a test set. The training data is then divided into k "folds", one of which is used for validation. The data is randomly assigned to the folds, so the overall distribution should be fairly even.
So if you have five folds, how does the training work?
Each fold gets a turn to be the validation set. Once a fold has been used for validation, it is used again for training, so the model is trained five times. The final validation error is then the average of the 5 errors.
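A minimal sketch of five-fold cross validation with scikit-learn; the mean of the fold scores is the final estimate, and their spread gives the variation mentioned in the next card:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = np.random.rand(100, 4), np.random.randint(0, 2, 100)  # toy data

cv = KFold(n_splits=5, shuffle=True)  # folds are randomly assigned
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)

print(scores.mean())  # the final estimate: the average over the 5 folds
print(scores.std())   # the variation in the validation error across folds
```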
Two advantages of k-fold validation? (2)
- You get a better idea of the final error
- It gives you a better idea of the variation in the validation error
So how do we manage complexity with tree-based models?
Via pruning and validation sets
What is the main goal of cost complexity pruning?
To reduce overfitting by penalising large trees
How does the penalising work with cost complexity pruning?
A penalty term is included. This penalty term involves the hyperparameter, as well as |T|, the number of terminal nodes in the tree.
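For reference, a standard form of the penalised objective (the regression-tree version; an assumption, since the module's exact notation isn't shown here):

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha\,|T|$$

where $R_m$ is the region of the $m$-th terminal node, $\hat{y}_{R_m}$ its predicted value, and $\alpha$ the hyperparameter: a larger $\alpha$ penalises trees with more terminal nodes.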
Broadly, what are the three steps followed to prune a decision tree?
- Grow the large tree
- K-fold cross validation to find the optimum hyperparameter
- Apply cost complexity pruning to the large tree with the optimum alpha value
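A minimal sketch of those three steps, assuming scikit-learn's cost-complexity pruning API (cost_complexity_pruning_path and the ccp_alpha parameter); the toy data is a placeholder:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = np.random.rand(200, 4), np.random.randint(0, 2, 200)  # toy data

# Step 1: grow the large tree and read off its candidate alpha values.
path = DecisionTreeClassifier().cost_complexity_pruning_path(X, y)

# Step 2: k-fold cross validation to find the optimum alpha.
cv_scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]

# Step 3: apply cost complexity pruning by refitting with the optimum alpha.
pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha).fit(X, y)
```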