Path3.Mod1.c - Automated Machine Learning - Overfitting Flashcards

1
Q

How Overfitting occurs

A

When a model fits the training data too well and therefore cannot generalize to unseen/new test data. To put it another way, the model has “memorized” the specific patterns and noise of the training set and has become inflexible to real-world data

See Prevent Overfitting and Imbalanced Data with AutoML

2
Q

Consider the following data:

Describe A, B and C w.r.t. Overfitting vs Underfitting

| Model | Train Accuracy | Test Accuracy |
| ----- | -------------- | ------------- |
| A     | 99.9%          | 95%           |
| B     | 87%            | 87%           |
| C     | 99.9%          | 45%           |

A

A exhibits near-perfect training accuracy with only a small gap to test accuracy. This is normal for ML model training, since we usually seek to minimize error (a larger train/test discrepancy would indicate overfitting).
B shows training and test accuracies that are identical, which is good generalization but may indicate data leakage (the model was tested with training data).
C shows a large gap between training accuracy (99.9%) and test accuracy (45%), the classic signature of overfitting.
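The train/test gap in the table can be checked programmatically. A minimal sketch using scikit-learn as a stand-in (the source discusses Azure AutoML; the 0.10 gap threshold is an illustrative assumption, not from the source):

```python
# Sketch: flag possible overfitting by comparing train vs. test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained decision tree can memorize the training set (like Model C).
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# A large train/test gap suggests overfitting; 0.10 is an arbitrary cutoff.
if train_acc - test_acc > 0.10:
    print(f"Possible overfitting: train={train_acc:.2f}, test={test_acc:.2f}")
```

Model B's pattern (train == test) would pass this check but still warrants a leakage review.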

3
Q

Best practices the user implements to protect from Overfitting

A
  • Use more training data and eliminate statistical bias - More training data generally improves accuracy and makes it harder for the model to memorize exact patterns, resulting in more flexibility. W.r.t. bias, ensure the data doesn’t contain isolated patterns
  • Prevent target leakage - Occurs when your model “cheats” during training by using data that is only meant to be available at prediction time (non-training data). Characterized by abnormally high accuracy
  • Use fewer features (the Curse of Dimensionality) - Fewer features means more flexibility. Remember the Curse of Dimensionality: with too many features, performance starts to degrade toward zero.
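The “use fewer features” practice can be sketched with scikit-learn’s univariate feature selection (an illustrative stand-in; the dataset sizes and `k=10` are assumptions, not from the source):

```python
# Sketch: "use fewer features" via univariate feature selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 100 features but only 5 informative: a curse-of-dimensionality setup.
X, y = make_classification(n_samples=200, n_features=100, n_informative=5,
                           random_state=0)

# Keep only the 10 features most correlated with the target.
X_reduced = SelectKBest(f_classif, k=10).fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (200, 100) -> (200, 10)
```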
4
Q

Reg HpOp MCL CV

Best practices Automated ML implements to protect from Overfitting

A
  • Regularization - Minimizing a cost function that penalizes complex, overfitted models.
  • Hyperparameter optimization - Adjusting hyperparameter values until your model exhibits a consistent, desired output.
  • Model complexity limitations - Mostly for decision-tree or forest algorithms; certain runtime properties are limited for these models.
  • Cross-validation - Splitting the training data into training and validation sets. You specify how many n splits/subsets to create. The drawback is that the more splits, the more time and cost it takes to train your model (you train and validate it n times).
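Two of the safeguards above, regularization and cross-validation, can be sketched with scikit-learn equivalents (a stand-in for AutoML’s internals; the model and split count are illustrative assumptions):

```python
# Sketch: L2 regularization + k-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# C is the inverse regularization strength: smaller C = stronger penalty
# on complex models.
model = LogisticRegression(C=1.0, penalty="l2", max_iter=1000)

# cv=5 splits: the model is trained and validated 5 times.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Note the cost trade-off from the card: `cv=5` means five separate fit/validate runs.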
5
Q

Imbalanced Data.
- What it is
- Commonly found in…
- Leads to this result

A
  • Data that contains a disproportionate ratio of observations in each class
  • Commonly found in ML classification scenarios
  • Leads to a falsely perceived positive effect on model accuracy (the input data is biased toward one class, leading the model to mimic that bias)
6
Q

CM P-RC ROCC

Three charts generated by AutoML to help identify models that have imbalanced data

A
  • Confusion Matrix - Evaluates predicted vs. expected labels for their correctness and incorrectness
  • Precision-Recall Curve - Evaluates the ratio of correct labels against possible labels. Good models stay within the boundaries of Precision = 1, Recall = 100%. Anything else is considered bad (see image)
  • ROC Curve - Receiver Operating Characteristic. Evaluates the ratio of correct labels vs. false-positive labels (true positive rate vs. false positive rate). You want this curve to approach y = 1, i.e. 100% TPR and 0% FPR
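The quantities behind all three charts can be computed with scikit-learn (a stand-in for AutoML’s generated charts; the tiny label arrays are made-up illustrative data):

```python
# Sketch: the data behind the three imbalance-diagnostic charts.
from sklearn.metrics import (confusion_matrix, precision_recall_curve,
                             roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1]                # expected labels
y_pred = [0, 0, 1, 1, 1, 1]                # predicted labels
y_score = [0.1, 0.2, 0.6, 0.7, 0.8, 0.9]   # predicted probabilities

cm = confusion_matrix(y_true, y_pred)      # rows: expected, cols: predicted
prec, rec, _ = precision_recall_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)       # 1.0 = TPR reaches 100% at 0% FPR
print(cm, auc)
```

Here every positive is scored above every negative, so the ROC AUC is a perfect 1.0, while the confusion matrix still shows one false positive at the 0.5 prediction threshold.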
7
Q

WC 20%CW PM Re

Built-in mechanisms AutoML uses to handle imbalanced data

A
  • Weight columns: weighting rows of data to increase/decrease their importance
  • 20% class-weight check: algorithms detect when samples of the minority class number less than 20% of the samples of the majority class; if so, class weights are applied as remediation, provided this yields better performance
  • Performance metrics: use metrics that work better at identifying imbalanced data
  • Resampling: reduce samples of the majority class or increase samples of the minority class, respectively
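The 20% minority check and class-weight remediation can be sketched with scikit-learn’s balanced-weight helper (a stand-in; AutoML’s internal algorithm is not public, and the 90/10 split here is an illustrative assumption):

```python
# Sketch: detect minority < 20% of majority, then compute class weights.
from collections import Counter
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)        # minority is ~11% of majority
counts = Counter(y)
minority_ratio = counts[1] / counts[0]   # 10/90 < 0.20, so remediate

# "balanced" weight per class = n_samples / (n_classes * class_count),
# so the minority class's rows count for more during training.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(minority_ratio, weights)           # minority weight = 100/(2*10) = 5.0
```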