Data Science MODULE 3 Flashcards
What does it mean when a statistical model has been overfit?
It mirrors the training data extremely well, but doesn’t perform well with unseen data
Why is overfitting such a big problem?
A data scientist's goal is to predict responses with a learned model. Overfitting has a direct negative impact on your ability to predict.
In tree-based models, what is the algorithm programmed to do? (2)
- reduce the impurity of partitions in a classification problem
- reduce the mean squared error in regression problems
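A minimal sketch of how a candidate split could be scored under both criteria. The card does not name an impurity measure, so Gini impurity is used here as an assumption; the helper names (gini, mse, weighted_score) and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def mse(values):
    """Mean squared error of the values around their own mean."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def weighted_score(left, right, score):
    """Size-weighted average of a score over the two sides of a split."""
    n = len(left) + len(right)
    return (len(left) * score(left) + len(right) * score(right)) / n

# Classification: a split that separates the classes removes all impurity.
labels = ["a", "a", "b", "b"]
parent_gini = gini(labels)
split_gini = weighted_score(["a", "a"], ["b", "b"], gini)

# Regression: a split that groups equal targets removes all squared error.
targets = [1.0, 1.0, 3.0, 3.0]
parent_mse = mse(targets)
split_mse = weighted_score([1.0, 1.0], [3.0, 3.0], mse)

print(parent_gini, split_gini)
print(parent_mse, split_mse)
```

A tree-growing algorithm would compare many candidate splits this way and keep the one with the lowest weighted score.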
When fitting data we need to find the balance between?
Bias and variance
What is variance?
The error that arises from the model's sensitivity to the particular training data; it indicates how robust the model is when unseen data is used
What does it mean when a model has high variance?
It performs well on the training data but not on the test set, i.e. the model is overfitted
What is the ideal model in terms of variance and bias?
Low variance and low bias
Models with a high bias tend to be?
Too simple, they don’t capture the shape of the data
Can you measure the variance and bias?
No, because we do not know what the underlying data really looks like
What is a validation error?
As I understand it, you train the model but keep some data aside. Once training is done, you test the model on this unseen data and determine the error. This is known as the validation set approach
What if the validation error is high but the training error is small?
Then you have overfitted the data
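A rough illustration of the validation set approach and of the small-training-error, high-validation-error pattern. Everything here is an assumption made for the sketch: the synthetic data, the linear true_signal, and the choice of a 1-nearest-neighbour predictor, which memorises its training points.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def true_signal(x):
    """Assumed underlying relationship (hypothetical, for illustration only)."""
    return 2.0 * x

# Evenly spaced training inputs; the held-out validation inputs fall between them.
train_x = [i / 20 for i in range(20)]
val_x = [(i + 0.5) / 20 for i in range(20)]
train_y = [true_signal(x) + random.gauss(0, 0.3) for x in train_x]
val_y = [true_signal(x) + random.gauss(0, 0.3) for x in val_x]

def predict_1nn(x):
    """Return the target of the closest training point (memorises the data)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def mean_squared_error(predict, xs, ys):
    """Average squared difference between predictions and observed targets."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_error = mean_squared_error(predict_1nn, train_x, train_y)
val_error = mean_squared_error(predict_1nn, val_x, val_y)

print(f"training error:   {train_error:.3f}")
print(f"validation error: {val_error:.3f}")
```

Because the predictor reproduces every training target exactly, its training error is zero, while the error on the held-out points is not. That gap is the overfitting signal.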
So there are actually three sets of data that we work with?
Initial training data, validation data, and then test data
So during fitting, how is the validation data actually used?
We use it to find the best value for the hyperparameter (denoted as alpha)
Different hyperparameter values are assessed to determine the one that results in?
The lowest validation error
So when do we start using the test data?
Once the validation error has been minimised by adjusting the hyperparameter
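The three-set workflow could be sketched as follows. The cards do not say what alpha controls, so a ridge-style penalty on a one-feature slope stands in for it here as an assumption; the data, the candidate grid, and the helper names (make_data, fit_slope, mse) are likewise hypothetical.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

def make_data(n):
    """Synthetic data from an assumed linear relationship plus noise."""
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [3.0 * x + random.gauss(0, 0.5) for x in xs]
    return xs, ys

train_x, train_y = make_data(30)
val_x, val_y = make_data(30)
test_x, test_y = make_data(30)

def fit_slope(xs, ys, alpha):
    """Ridge-style slope for y ~ w * x: w = sum(x*y) / (sum(x*x) + alpha)."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + alpha
    return numerator / denominator

def mse(w, xs, ys):
    """Mean squared error of the fitted slope on a data set."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Step 1: fit on the training data for each candidate alpha,
# then score each fitted model on the validation data.
alphas = [0.0, 0.1, 1.0, 10.0, 100.0]
val_errors = {a: mse(fit_slope(train_x, train_y, a), val_x, val_y) for a in alphas}

# Step 2: keep the alpha with the lowest validation error.
best_alpha = min(val_errors, key=val_errors.get)

# Step 3: only now touch the test data, once, to report the final error.
best_w = fit_slope(train_x, train_y, best_alpha)
test_error = mse(best_w, test_x, test_y)

print("best alpha:", best_alpha)
print(f"test error: {test_error:.3f}")
```

Keeping the test data out of steps 1 and 2 is what makes the final figure an honest estimate of performance on unseen data.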