ML Flashcards

1
Q

Instance Based Vs. Model Based

A

Instance-based = Learns the training examples by heart, then generalizes to new cases by comparing them to the stored examples with a similarity metric

Model-based = Generalizes from a set of examples to build a model, then uses the model to make predictions

Model-based learning is much more common

2
Q

Main Instance Based Algorithms

A

K-Nearest Neighbor (KNN)

Learning Vector Quantization (LVQ)

Self-Organizing Map (SOM)

Locally Weighted Learning (LWL)
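
A minimal sketch of the most widely used of these, KNN, with scikit-learn (the dataset and k = 5 are illustrative choices, not from the card):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)  # k=5 is an arbitrary choice
knn.fit(X_train, y_train)                  # "fit" essentially stores the examples
print(knn.score(X_test, y_test))           # predictions come from nearest stored neighbors
```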

3
Q

Sampling Noise Vs. Sampling Bias

A

If the sample is too small, you will have sampling noise: non-representative data that arises purely by chance.

Even large samples can be non-representative if the sampling method is flawed. This is called Sampling Bias.
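
A quick simulation of the difference (pure numpy; the population and sample sizes are made-up numbers for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)  # true mean ~= 50

# Small samples scatter widely around the true mean: sampling noise
print([float(rng.choice(population, 10).mean().round(1)) for _ in range(5)])

# Large samples pin the mean down -- but a biased sampling *method* would
# not be fixed by size, e.g. drawing only from population[population > 50]
print([float(rng.choice(population, 10_000).mean().round(1)) for _ in range(5)])
```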

4
Q

Accuracy Paradox

A

When your accuracy measure tells the story that you have excellent accuracy (such as 90%), but the accuracy only reflects the underlying class distribution of a dataset with imbalanced classes.

Sometimes you’ll see high accuracy but one class is being ignored completely.
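
A worked example (the 900/100 split is illustrative): with 900 negatives and 100 positives, a model that always predicts "negative" scores 90% accuracy while ignoring the positive class entirely.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 900 + [1] * 100)  # 90/10 class imbalance
y_pred = np.zeros(1000, dtype=int)        # always predict the majority class

print(accuracy_score(y_true, y_pred))     # 0.9 -- looks excellent
print(recall_score(y_true, y_pred))       # 0.0 -- the positive class is ignored
```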

5
Q

Tactics for Imbalanced Training Data

A
  1. Collect more data
  2. Change the performance metric - confusion matrix, precision/recall, etc.
  3. Try resampling your dataset - over- or under-sample rows for training
  4. Generate synthetic samples - e.g. the SMOTE technique (see the sketch after this list)
  5. Try different algorithms - decision trees often do well
  6. Try penalized models
  7. Try a different perspective on the problem
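
A minimal SMOTE sketch for tactic 4, assuming the third-party imbalanced-learn package (the toy dataset and its roughly 95/5 split are illustrative assumptions):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# imbalanced toy dataset, roughly 95/5
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
print(Counter(y))                      # e.g. Counter({0: ~950, 1: ~50})

# SMOTE synthesizes new minority-class rows by interpolating between
# existing minority samples and their nearest minority neighbors
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))                  # classes now balanced
```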
6
Q

Rescaling

A

If you have features on very different scales (e.g. kilometers vs. nanometers), you may need to rescale the features to make comparisons easier.

Typically this is done by subtracting each dimension's mean and dividing by its standard deviation (standardization). This preserves the relative spacing of points within each dimension, but changes the values so the dimensions become comparable.

The need for scaling depends on the problem at hand.
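
A minimal standardization sketch with scikit-learn (the two-feature array is an illustrative assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# feature 0 in km, feature 1 in nm: wildly different scales
X = np.array([[1.0, 2e6], [2.0, 3e6], [3.0, 4e6]])

X_scaled = StandardScaler().fit_transform(X)  # (x - mean) / std, per column
print(X_scaled)  # both columns now have mean 0 and unit variance
```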

7
Q

Classification Models - Accuracy

A

Ratio of correctly predicted instances to the total number of instances in the dataset.

Can be misleading in cases of class imbalance
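
As a formula, in standard confusion-matrix notation (TP/TN/FP/FN):

Accuracy = (TP + TN) / (TP + TN + FP + FN)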

8
Q

Classification Models - Precision

A

Measures accuracy of positive predictions

Tells you how many instances predicted as positive are actually positive.

High precision = low rate of false positives

This is important when the cost of false positives is high.
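
As a formula:

Precision = TP / (TP + FP)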

9
Q

Classification Models - Recall

A

AKA Sensitivity or True positive rate

Tells you how many of the actual positive instances were correctly predicted by the model

High recall = lower rate of false negatives

This is important when the cost of false negatives is high
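
As a formula:

Recall = TP / (TP + FN)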

10
Q

Classification Models - Specificity

A

AKA True Negative rate

How many of the actual negative instances are correctly predicted as negative

Tells you how well the model predicts negative cases.

High specificity = fewer false positives, because actual negatives are rarely misclassified as positive
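
As a formula:

Specificity = TN / (TN + FP)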

11
Q

Classification Models - F1

A

Combines Precision and Recall into a single score - harmonic mean

If precision and recall are very different, F1 will be closer to the lower value.

Useful when you want to balance precision and recall esp. if the data is imbalanced

Scores range from 0-1
* 0: No positive predictions or all predictions are wrong.
* 1: All positive predictions are correct.
* 0.5 - 0.75: Moderate performance.
* 0.75 - 1: Strong performance.

High F1 indicates generally good performance
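
As a formula, with P = precision and R = recall:

F1 = 2PR / (P + R)

Worked example of the "closer to the lower value" behavior: with P = 0.9 and R = 0.3, F1 = 2(0.9)(0.3) / (0.9 + 0.3) = 0.54 / 1.2 = 0.45, well below the arithmetic mean of 0.6.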

12
Q

Classification Models - F2

A

F2 places more weight on recall (4x more)

Useful when recall is more critical than precision
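
Both weighted F-scores come from the general formula F_beta = (1 + beta^2)PR / (beta^2 P + R); F2 sets beta = 2, giving recall beta^2 = 4 times the weight of precision:

F2 = 5PR / (4P + R)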

13
Q

Classification Models - F0.5

A

F0.5 places more weight on precision (recall's weight falls to beta^2 = 0.25, so precision effectively counts 4x as much)

Useful when you need to minimize false positives
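
With beta = 0.5 in the same F-beta formula:

F0.5 = 1.25PR / (0.25P + R)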

14
Q

Classification Metrics - Matthews Correlation Coefficient (MCC)

A

Used to evaluate binary classification models

Provides a balanced score that works well even with class imbalance.

MCC ranges from -1 to 1:
1 = perfect classification
0 = random guessing
-1 = classification 100% wrong
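
As a formula:

MCC = (TP*TN - FP*FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))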

15
Q

PR curve

A

X = Recall
Y = Precision

The chance baseline is a horizontal line at the positive-class proportion (not the ROC-style diagonal); a curve above that line is doing better than chance.

The shape of the curve is influenced by the decision threshold.

Lower threshold = more instances classified as positive = higher recall and lower precision (more false positives)

AUC-PR - A higher AUC indicates better performance.
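
A minimal sketch of computing the curve and its AUC-PR summary with scikit-learn (the model and synthetic ~90/10 dataset are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, scores)  # one point per threshold
print(average_precision_score(y_te, scores))  # AUC-PR summary; higher is better
```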

16
Q

AUC curve

A

AUC of the ROC (Receiver Operating Characteristic) curve

Single value that summarizes overall performance across all possible classification thresholds.

X = FPR
Y = TPR

AUC ranges from 0 to 1:
1 = perfect classification
0.5 = random guessing

An AUC of 0.88 suggests that the model correctly ranks positive instances higher than negative instances 88% of the time across all possible classification thresholds.
This means that if you randomly select one positive instance and one negative instance, the model will predict the positive instance as more likely to be positive than the negative instance 88% of the time.
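
A minimal AUC-ROC sketch with scikit-learn (same illustrative setup as the PR-curve card):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
# probability that a random positive is ranked above a random negative
print(roc_auc_score(y_te, scores))
```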