8.3 Model Training and Evaluation Flashcards
Supervised or unsupervised?: * Regression; * Ensemble trees; * Support vector machines (SVMs); * Neural networks;
Supervised;
Supervised or unsupervised?: * Clustering; * Dimension reduction; * Anomaly detection; * Deep learning networks;
Unsupervised;
What are the three steps of model training?
1. Method selection; 2. Performance evaluation; 3. Tuning;
If you have _______ data you can use supervised learning; if you have ___________ data, you would use unsupervised learning.
labeled; unlabeled;
What types of data should you be familiar with in the context of machine learning?
numerical data (continuous or categorical); image data; text data; speech data;
Is “bias error” due to overfitting or underfitting? Is it from the training data or from the out-of-sample data?
underfitting (not enough variables); from the training data;
Is “variance error” due to overfitting or underfitting? Is it from the training data or in the out-of-sample data?
overfitting (too many variables); out-of-sample data;
What do the X and Y axes stand for in a fitting curve for a learning model? Where on the curve do “bias error” and “variance error” lie?
Which curve represents in-sample (training sample) error?
X = Model complexity; Y = Error;
Bias error dominates at the left of the curve (low complexity, underfitting) and falls as complexity increases; variance error dominates at the right (high complexity, overfitting) and rises as complexity increases;
The in-sample (training sample) error curve is the one that declines steadily as model complexity increases.
In order to validate the fit of a machine learning algorithm, you can create a confusion matrix and then calculate various metrics from it. Using the following column headings “Actual: Default” and “Actual: No Default” and the following row titles “Prediction: Default” and “Prediction: No Default”, create a confusion matrix showing the results of the example classification problem “Classification of Defaulters”.
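Whatever the specific counts from the example, the matrix has this structure (rows = predictions, columns = actuals):
|                        | Actual: Default      | Actual: No Default   |
| Prediction: Default    | True positives (TP)  | False positives (FP) |
| Prediction: No Default | False negatives (FN) | True negatives (TN)  |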
Using a confusion matrix, what is the formula for calculating the metric “precision”? Explain it verbally.
Precision (P) = TP / (TP + FP)
The ratio of true positives to all predicted positives. The sum of all predicted positives is the sum of the first row of results in the confusion matrix.
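For example, with hypothetical counts TP = 80 and FP = 20, precision = 80 / (80 + 20) = 0.80.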
Using a confusion matrix, what is the formula for the metric “recall”?
Explain it verbally.
Recall (R) = TP / (TP + FN)
The ratio of true positives to all actual positives. The sum of all actual positives is the sum of the values in the first column of the confusion matrix.
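For example, with hypothetical counts TP = 80 and FN = 40, recall = 80 / (80 + 40) ≈ 0.67.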
Using a confusion matrix, what is the formula for the metric “accuracy”?
Explain it verbally.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The proportion of correct forecasts out of the total number of forecasts.
Numerator = the value in the upper-left quadrant (TP) plus the value in the lower-right quadrant (TN); divided by the sum of all four values.
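For example, with hypothetical counts TP = 80, TN = 60, FP = 20, and FN = 40, accuracy = (80 + 60) / 200 = 0.70.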
What does TPR stand for, what axis represents it, and what is the formula for calculating it?
TPR = true positive rate;
Y-axis;
TPR = TP / (TP + FN) (same as the formula for “recall”)
What does FPR stand for, what axis represents it, and what is the formula for calculating it?
FPR = false positive rate;
X-axis;
FPR = FP / (FP + TN)
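For example, with hypothetical counts FP = 20 and TN = 60, FPR = 20 / (20 + 60) = 0.25.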
What is the definition of the “F1 Score”?
What is the formula for the F1 Score?
The F1 Score is the harmonic mean of precision and recall.
F1 Score = (2 x P x R) / (P + R)
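Continuing the hypothetical counts used above (precision = 0.80, recall ≈ 0.67), F1 Score = (2 x 0.80 x 0.67) / (0.80 + 0.67) ≈ 0.73. As a cross-check, a minimal Python sketch computing all four metrics from those hypothetical counts (TP = 80, FP = 20, FN = 40, TN = 60; the numbers are illustrative, not from the curriculum example):

```python
# Hypothetical confusion-matrix counts (illustrative only)
tp, fp, fn, tn = 80, 20, 40, 60

precision = tp / (tp + fp)                          # TP / (TP + FP)
recall = tp / (tp + fn)                             # TP / (TP + FN), same as TPR
accuracy = (tp + tn) / (tp + tn + fp + fn)          # correct forecasts / all forecasts
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f} f1={f1:.2f}")
# precision=0.80 recall=0.67 accuracy=0.70 f1=0.73
```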
Calculated from a confusion matrix:
(1) What does “ROC” stand for, and
(2) What does it show?
(3) What does the x-axis represent and what does the Y-axis represent?
ROC = Receiver operating characteristic;
The ROC plots a curve showing the tradeoff between the false positive rate and the true positive rate as the classification threshold is varied;
Y-axis = TPR (true positive rate);
X-axis = FPR (false positive rate);
Using the following ROC curves, answer these questions:
(1) What does AUC stand for?
(2) Which line represents the best model?
(3) What would a curve representing 100% accuracy look like?
(4) What would a curve representing 0% accuracy look like?
(5) What is implied by the results of Model #2?
(1) Area under the curve;
(2) Model 3;
(3) A curve that hugs the upper-left corner: straight up the left edge to TPR = 1, then flat across the top of the graph (AUC = 1);
(4) A flat line going across the bottom of the graph (AUC = 0);
(5) An AUC of about 0.5, i.e., 50% accuracy, which is what one might expect from random guessing (the curve lies near the 45-degree diagonal).
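A minimal sketch of how an ROC curve and its AUC can be computed in Python with scikit-learn; the labels and scores below are made-up illustrations, not data from the curriculum example:

```python
from sklearn.metrics import roc_curve, auc

# Hypothetical true labels (1 = default) and model scores (predicted probability of default)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.3, 0.8, 0.5, 0.1, 0.65]

# roc_curve sweeps the classification threshold, returning FPR (x-axis) and TPR (y-axis)
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC is the area under the ROC curve: 1.0 = perfect ranking, 0.5 = random guessing
print("AUC =", auc(fpr, tpr))
```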