Path3.Mod1.c - Automated Machine Learning - Overfitting Flashcards

1
Q

How Overfitting occurs

A

When a model fits the training data too well and therefore cannot generalize to unseen/new test data. To put it another way, the model has “memorized” the specific patterns and noise of the training set and has become inflexible to real-world data

See Prevent Overfitting and Imbalanced Data with AutoML

2
Q

Consider the following data:

Describe A, B and C w.r.t. Overfitting vs Underfitting

| Model | Train Accuracy | Test Accuracy |
| ----- | -------------- | ------------- |
| A     | 99.9%          | 95%           |
| B     | 87%            | 87%           |
| C     | 99.9%          | 45%           |

A

A exhibits near-perfect training accuracy with only a small gap to test accuracy. This is normal for ML model training, since we usually seek to minimize error (a larger train/test discrepancy would indicate overfitting).
B shows training and test accuracies that are identical, which is good generalization but may indicate data leakage (the model was tested with training data).
C shows a large gap between training accuracy (99.9%) and test accuracy (45%), the classic signature of overfitting.
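The train/test gap in the table can be checked programmatically. A minimal sketch using scikit-learn as a stand-in (the source discusses Azure AutoML; the 0.10 gap threshold is an illustrative assumption, not from the source):

```python
# Sketch: flag possible overfitting by comparing train vs. test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained decision tree can memorize the training set (like Model C).
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# A large train/test gap suggests overfitting; 0.10 is an arbitrary cutoff.
if train_acc - test_acc > 0.10:
    print(f"Possible overfitting: train={train_acc:.2f}, test={test_acc:.2f}")
```

Model B's pattern (train == test) would pass this check but still warrants a leakage review.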

3
Q

Best practices the user implements to protect from Overfitting

A
  • Use more training data and eliminate statistical bias - More training data generally improves accuracy and makes it harder for the model to memorize exact patterns, resulting in more flexibility. W.r.t. bias, ensure the data doesn’t contain isolated patterns
  • Prevent target leakage - Occurs when your model “cheats” during training by using data that is only meant to be available at prediction time (non-training data). Characterized by abnormally high accuracy
  • Use fewer features (the Curse of Dimensionality) - Fewer features means more flexibility. Remember the Curse of Dimensionality: with too many features, performance starts to degrade toward zero.
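The “use fewer features” practice can be sketched with scikit-learn’s univariate feature selection (an illustrative stand-in; the dataset sizes and `k=10` are assumptions, not from the source):

```python
# Sketch: "use fewer features" via univariate feature selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 100 features but only 5 informative: a curse-of-dimensionality setup.
X, y = make_classification(n_samples=200, n_features=100, n_informative=5,
                           random_state=0)

# Keep only the 10 features most correlated with the target.
X_reduced = SelectKBest(f_classif, k=10).fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (200, 100) -> (200, 10)
```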
4
Q

Reg HpOp MCL CV

Best practices Automated ML implements to protect from Overfitting

A
  • Regularization - Minimizing a cost function that penalizes complex, overfitted models.
  • Hyperparameter optimization - Adjusting hyperparameter values until your model exhibits a consistent, desired output.
  • Model complexity limitations - Mostly for decision-tree or forest algorithms; certain runtime properties are limited for these models.
  • Cross-validation - Splitting the training data into training and validation sets. You specify how many n splits/subsets to create. The drawback is that the more splits, the more time and cost it takes to train your model (you train and validate it n times).
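Two of the safeguards above, regularization and cross-validation, can be sketched with scikit-learn equivalents (a stand-in for AutoML’s internals; the model and split count are illustrative assumptions):

```python
# Sketch: L2 regularization + k-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# C is the inverse regularization strength: smaller C = stronger penalty
# on complex models.
model = LogisticRegression(C=1.0, penalty="l2", max_iter=1000)

# cv=5 splits: the model is trained and validated 5 times.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Note the cost trade-off from the card: `cv=5` means five separate fit/validate runs.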
5
Q

Imbalanced Data.
- What it is
- Commonly found in…
- Leads to this result

A
  • Data that contains a disproportionate ratio of observations in each class
  • Commonly found in ML classification scenarios
  • Leads to a falsely perceived positive effect on model accuracy (the input data is biased toward one class, leading the model to mimic that bias)
6
Q

CM P-RC ROCC

Three charts generated by AutoML to help identify models that have imbalanced data

A
  • Confusion Matrix - Evaluates predicted vs. expected labels for their correctness and incorrectness
  • Precision-Recall Curve - Evaluates the ratio of correct labels against possible labels. Good models stay within the boundaries of Precision = 1, Recall = 100%. Anything else is considered bad (see image)
  • ROC Curve - Receiver Operating Characteristic. Evaluates the ratio of correct labels vs. false-positive labels (true positive rate vs. false positive rate). You want this curve to approach y = 1, i.e. 100% TPR and 0% FPR
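The quantities behind all three charts can be computed with scikit-learn (a stand-in for AutoML’s generated charts; the tiny label arrays are made-up illustrative data):

```python
# Sketch: the data behind the three imbalance-diagnostic charts.
from sklearn.metrics import (confusion_matrix, precision_recall_curve,
                             roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1]                # expected labels
y_pred = [0, 0, 1, 1, 1, 1]                # predicted labels
y_score = [0.1, 0.2, 0.6, 0.7, 0.8, 0.9]   # predicted probabilities

cm = confusion_matrix(y_true, y_pred)      # rows: expected, cols: predicted
prec, rec, _ = precision_recall_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)       # 1.0 = TPR reaches 100% at 0% FPR
print(cm, auc)
```

Here every positive is scored above every negative, so the ROC AUC is a perfect 1.0, while the confusion matrix still shows one false positive at the 0.5 prediction threshold.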
7
Q

WC 20%CW PM Re

Built-in mechanisms AutoML uses to handle imbalanced data

A
  • Weight columns: weighting rows of data to increase/decrease their importance
  • 20% class-weight check: algorithms detect when samples of the minority class number less than 20% of the samples of the majority class; if so, class weights are applied as remediation, provided this yields better performance
  • Performance metrics: use metrics that work better at identifying imbalanced data
  • Resampling: reduce samples of the majority class or increase samples of the minority class, respectively
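The 20% minority check and class-weight remediation can be sketched with scikit-learn’s balanced-weight helper (a stand-in; AutoML’s internal algorithm is not public, and the 90/10 split here is an illustrative assumption):

```python
# Sketch: detect minority < 20% of majority, then compute class weights.
from collections import Counter
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)        # minority is ~11% of majority
counts = Counter(y)
minority_ratio = counts[1] / counts[0]   # 10/90 < 0.20, so remediate

# "balanced" weight per class = n_samples / (n_classes * class_count),
# so the minority class's rows count for more during training.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(minority_ratio, weights)           # minority weight = 100/(2*10) = 5.0
```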