ML Flashcards
Instance Based Vs. Model Based
Instance = Learns the examples by heart, then generalizes to new cases using a similarity measure
Model = Generalizes from a set of examples to build a model, then uses that model to make predictions
Model Based Learning is much more common
Main Instance Based Algorithms
K-Nearest Neighbor (KNN)
Learning Vector Quantization (LVQ)
Self Organizing Map (SOM)
Locally Weighted Learning (LWL)
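A minimal sketch of instance-based learning with KNN, assuming scikit-learn is available; the feature vectors and labels below are made up for illustration:

```python
# Sketch: KNN stores the training examples and classifies a new point by the
# majority label of the k most similar (closest) stored examples.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]]  # toy feature vectors
y_train = [0, 0, 1, 1]                                       # toy labels

knn = KNeighborsClassifier(n_neighbors=3)  # similarity metric defaults to Euclidean distance
knn.fit(X_train, y_train)                  # "learning" here is just memorizing the examples

print(knn.predict([[1.05, 1.0]]))          # nearest neighbors are mostly class 0 -> [0]
```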
Sampling Noise Vs. Sampling Bias
If the sample is too small, you will have sampling noise: non-representative data that arises purely by chance.
Even large samples can be non-representative if the sampling method is flawed. This is called Sampling Bias.
Accuracy Paradox
Your accuracy measure tells a story of excellent performance (such as 90%), but the number is only reflecting the underlying class distribution in a dataset with imbalanced classes.
Sometimes you’ll see high accuracy but one class is being ignored completely.
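A toy illustration of the paradox in plain Python (the labels are made up): a model that always predicts the majority class scores 90% accuracy while never finding a single positive.

```python
# 90 negatives, 10 positives; the "model" simply predicts the majority class (0).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)  # 0.9 -- looks great
print(recall)    # 0.0 -- the positive class is ignored entirely
```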
Tactics for Imbalanced Training Data
- Collect more data
- Change the performance metric - Confusion matrix etc
- Try resampling your data set - add or subtract rows for training
- Generate synthetic samples - SMOTE technique (see the sketch after this list)
- Try different algorithms - Decision trees often do well
- Try penalized models
- Try a different perspective on the problem
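A minimal sketch of the resampling/SMOTE tactic, assuming the imbalanced-learn package is installed; the dataset here is synthetic and only for illustration:

```python
# Oversample the minority class with SMOTE so the training set is balanced
# before fitting a model.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic, heavily imbalanced dataset (roughly 95% vs. 5%).
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))            # majority class vastly outnumbers the minority

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))        # classes are now balanced
```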
Rescaling
If you have features with very different scales (e.g., kilometers vs. nanometers), you may need to rescale the features to make comparisons easier.
Typically this is done by dividing each dimension by its standard deviation (often after subtracting the mean). This preserves the relative spacing of points within each dimension while making values across dimensions comparable.
The need for scaling depends on the problem at hand.
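A minimal standardization sketch, assuming scikit-learn; the two feature columns below are made up and sit on wildly different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column 0 in kilometers, column 1 in nanometers -- very different scales.
X = np.array([[1.0, 5e9],
              [2.0, 7e9],
              [3.0, 9e9]])

# StandardScaler centers each column and divides by its standard deviation,
# so both features end up on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```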
Classification Models - Accuracy
Ratio of correctly predicted instances to the total number of instances in the dataset.
Can be misleading in cases of class imbalance
Classification Models - Precision
Measures accuracy of positive predictions
Tells you how many instances predicted as positive are actually positive.
High precision = low rate of false positives
This is important when the cost of false positives is high.
Classification Models - Recall
AKA Sensitivity or True positive rate
Tells you how many of the actual positive instances were correctly predicted by the model
High recall = lower rate of false negatives
This is important when the cost of false negatives is high
Classification Models - Specificity
AKA True Negative rate
How many of the actual negative instances are correctly predicted as negative
Tells you how well the model predicts negative cases.
High specificity = lower false positives b/c negative cases are correctly predicted
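A sketch computing precision, recall, and specificity from a single confusion matrix, assuming scikit-learn; the labels are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision   = tp / (tp + fp)   # of everything predicted positive, how much really is
recall      = tp / (tp + fn)   # of the actual positives, how many were found
specificity = tn / (tn + fp)   # of the actual negatives, how many were found

print(precision, recall, specificity)  # 0.6, 0.75, ~0.67
```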
Classification Models - F1
Combines Precision and Recall into a single score - harmonic mean
If precision and recall are very different, F1 will be closer to the lower value.
Useful when you want to balance precision and recall esp. if the data is imbalanced
Scores range from 0-1
* 0: No true positives (no positive predictions, or none of them are correct).
* 1: Perfect precision and recall (every positive prediction is correct and every actual positive is found).
* 0.5 - 0.75: Moderate performance.
* 0.75 - 1: Strong performance.
High F1 indicates generally good performance
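A tiny sketch in plain Python (numbers made up) showing how the harmonic mean pulls F1 toward the lower of precision and recall:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))   # 0.9  -- balanced, F1 matches both
print(f1(0.9, 0.1))   # 0.18 -- very different, F1 sits near the lower value
```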
Classification Models - F2
F2 places more weight on recall (β = 2; recall is weighted 4x as heavily as precision in the harmonic mean)
useful when recall is more critical than precision
Classification Models - F0.5
F0.5 places more weight on precision (β = 0.5; precision is weighted 4x as heavily as recall in the harmonic mean)
Useful when you need to minimize false positives
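F2 and F0.5 are both instances of the general F-beta score; a sketch with scikit-learn, reusing the made-up labels from the precision/recall card:

```python
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Here recall (0.75) exceeds precision (0.6), so F2 > F1 > F0.5.
print(fbeta_score(y_true, y_pred, beta=2))    # F2: emphasizes recall
print(fbeta_score(y_true, y_pred, beta=1))    # F1: the balanced case
print(fbeta_score(y_true, y_pred, beta=0.5))  # F0.5: emphasizes precision
```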
Classification Metrics - Matthews Correlation Coefficient (MCC)
Used to evaluate binary classification models
Provides a balanced score that works well even with class imbalance.
The MCC scale runs from +1 to -1
1 = perfect classification
0 = random guessing
-1 = classification is 100% wrong
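A quick MCC sketch, assuming scikit-learn and reusing the same made-up labels as above:

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Ranges from -1 (always wrong) through 0 (random guessing) to +1 (perfect).
print(matthews_corrcoef(y_true, y_pred))  # ~0.41 for these labels
```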
PR curve
X = Recall
Y = Precision
The "chance" baseline for a PR curve is a horizontal line at the positive-class prevalence (the diagonal baseline belongs to ROC curves); if the curve sits above that baseline, the model is doing better than chance.
The shape of the curve is influenced by the decision threshold.
Lower threshold = more instances classified as positive = higher recall and lower precision (more false positives)
AUC-PR - A higher AUC indicates better performance.
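A minimal PR-curve / AUC-PR sketch, assuming scikit-learn; the labels and predicted scores are made up for illustration:

```python
from sklearn.metrics import precision_recall_curve, auc

y_true   = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]  # predicted probabilities

# Precision and recall at every decision threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Area under the PR curve: higher is better.
print(auc(recall, precision))
```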