ML Flashcards
Instance Based Vs. Model Based
Instance = Learns the examples by heart, then generalizes to new cases using a similarity measure
Model = Generalizes from a set of examples to build a model, then uses that model to make predictions
Model Based Learning is much more common
Main Instance Based Algorithms
K-Nearest Neighbor (KNN)
Learning Vector Quantization (LVQ)
Self Organizing Map (SOM)
Locally Weighted Learning (LWL)
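A minimal sketch of instance-based learning with KNN, assuming scikit-learn is available; the feature vectors and labels below are made up for illustration:

```python
# Sketch: KNN stores the training examples and classifies a new point by the
# majority label of the k most similar (closest) stored examples.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]]  # toy feature vectors
y_train = [0, 0, 1, 1]                                       # toy labels

knn = KNeighborsClassifier(n_neighbors=3)  # similarity metric defaults to Euclidean distance
knn.fit(X_train, y_train)                  # "learning" here is just memorizing the examples

print(knn.predict([[1.05, 1.0]]))          # nearest neighbors are mostly class 0 -> [0]
```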
Sampling Noise Vs. Sampling Bias
If the sample is too small, you will have sampling noise: non-representative data that arises purely by chance.
Even large samples can be non-representative if the sampling method is flawed. This is called Sampling Bias.
Accuracy Paradox
Your accuracy measure tells a story of excellent performance (such as 90%), but the number is only reflecting the underlying class distribution in a dataset with imbalanced classes.
Sometimes you’ll see high accuracy but one class is being ignored completely.
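A toy illustration of the paradox in plain Python (the labels are made up): a model that always predicts the majority class scores 90% accuracy while never finding a single positive.

```python
# 90 negatives, 10 positives; the "model" simply predicts the majority class (0).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)  # 0.9 -- looks great
print(recall)    # 0.0 -- the positive class is ignored entirely
```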
Tactics for Imbalanced Training Data
- Collect more data
- Change the performance metric - Confusion matrix etc
- Try resampling your data set - add or subtract rows for training
- Generate synthetic samples - SMOTE technique (see the sketch after this list)
- Try different algorithms - Decision trees often do well
- Try penalized models
- Try a different perspective on the problem
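A minimal sketch of the resampling/SMOTE tactic, assuming the imbalanced-learn package is installed; the dataset here is synthetic and only for illustration:

```python
# Oversample the minority class with SMOTE so the training set is balanced
# before fitting a model.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic, heavily imbalanced dataset (roughly 95% vs. 5%).
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))            # majority class vastly outnumbers the minority

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))        # classes are now balanced
```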
Rescaling
If you have features with very different scales (e.g., kilometers vs. nanometers), you may need to rescale the features to make comparisons easier.
Typically this is done by dividing each dimension by its standard deviation (often after subtracting the mean). This preserves the relative spacing of points within each dimension while making values across dimensions comparable.
The need for scaling depends on the problem at hand.
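A minimal standardization sketch, assuming scikit-learn; the two feature columns below are made up and sit on wildly different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column 0 in kilometers, column 1 in nanometers -- very different scales.
X = np.array([[1.0, 5e9],
              [2.0, 7e9],
              [3.0, 9e9]])

# StandardScaler centers each column and divides by its standard deviation,
# so both features end up on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```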
Classification Models - Accuracy
Ratio of correctly predicted instances to the total number of instances in the dataset.
Can be misleading in cases of class imbalance
Classification Models - Precision
Measures accuracy of positive predictions
Tells you how many instances predicted as positive are actually positive.
High precision = low rate of false positives
This is important when the cost of false positives is high.
Classification Models - Recall
AKA Sensitivity or True positive rate
Tells you how many of the actual positive instances were correctly predicted by the model
High recall = lower rate of false negatives
This is important when the cost of false negatives is high
Classification Models - Specificity
AKA True Negative rate
How many of the actual negative instances are correctly predicted as negative
Tells you how well the model predicts negative cases.
High specificity = lower false positives b/c negative cases are correctly predicted
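A sketch computing precision, recall, and specificity from a single confusion matrix, assuming scikit-learn; the labels are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision   = tp / (tp + fp)   # of everything predicted positive, how much really is
recall      = tp / (tp + fn)   # of the actual positives, how many were found
specificity = tn / (tn + fp)   # of the actual negatives, how many were found

print(precision, recall, specificity)  # 0.6, 0.75, ~0.67
```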
Classification Models - F1
Combines Precision and Recall into a single score - harmonic mean
If precision and recall are very different, F1 will be closer to the lower value.
Useful when you want to balance precision and recall esp. if the data is imbalanced
Scores range from 0-1
* 0: No true positives (no positive predictions, or none of them are correct).
* 1: Perfect precision and recall (every positive prediction is correct and every actual positive is found).
* 0.5 - 0.75: Moderate performance.
* 0.75 - 1: Strong performance.
High F1 indicates generally good performance
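A tiny sketch in plain Python (numbers made up) showing how the harmonic mean pulls F1 toward the lower of precision and recall:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))   # 0.9  -- balanced, F1 matches both
print(f1(0.9, 0.1))   # 0.18 -- very different, F1 sits near the lower value
```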
Classification Models - F2
F2 places more weight on recall (β = 2; recall is weighted 4x as heavily as precision in the harmonic mean)
useful when recall is more critical than precision
Classification Models - F0.5
F0.5 places more weight on precision (β = 0.5; precision is weighted 4x as heavily as recall in the harmonic mean)
Useful when you need to minimize false positives
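F2 and F0.5 are both instances of the general F-beta score; a sketch with scikit-learn, reusing the made-up labels from the precision/recall card:

```python
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Here recall (0.75) exceeds precision (0.6), so F2 > F1 > F0.5.
print(fbeta_score(y_true, y_pred, beta=2))    # F2: emphasizes recall
print(fbeta_score(y_true, y_pred, beta=1))    # F1: the balanced case
print(fbeta_score(y_true, y_pred, beta=0.5))  # F0.5: emphasizes precision
```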
Classification Metrics - Matthews Correlation Coefficient (MCC)
Used to evaluate binary classification models
Provides a balanced score that works well even with class imbalance.
The MCC scale runs from +1 to -1
1 = perfect classification
0 = random guessing
-1 = classification is 100% wrong
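A quick MCC sketch, assuming scikit-learn and reusing the same made-up labels as above:

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Ranges from -1 (always wrong) through 0 (random guessing) to +1 (perfect).
print(matthews_corrcoef(y_true, y_pred))  # ~0.41 for these labels
```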
PR curve
X = Recall
Y = Precision
The "chance" baseline for a PR curve is a horizontal line at the positive-class prevalence (the diagonal baseline belongs to ROC curves); if the curve sits above that baseline, the model is doing better than chance.
The shape of the curve is influenced by the decision threshold.
Lower threshold = more instances classified as positive = higher recall and lower precision (more false positives)
AUC-PR - A higher AUC indicates better performance.
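A minimal PR-curve / AUC-PR sketch, assuming scikit-learn; the labels and predicted scores are made up for illustration:

```python
from sklearn.metrics import precision_recall_curve, auc

y_true   = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]  # predicted probabilities

# Precision and recall at every decision threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Area under the PR curve: higher is better.
print(auc(recall, precision))
```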