Machine Learning Technologies Flashcards
What are the 4 types of ML techniques?
Supervised
Semi-Supervised
Unsupervised
Reinforcement
What is error rate?
The proportion of incorrectly classified samples to the total number of samples
What is empirical error?
Error calculated on training set
What is generalisation error?
Error calculated on unseen samples
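A minimal Python sketch of these definitions, using made-up label arrays: the error rate is just the fraction of mismatches, computed on the training set (empirical error) or on unseen samples (to estimate generalisation error).

```python
import numpy as np

# Toy labels and predictions (all values here are invented for illustration)
y_train     = np.array([0, 1, 1, 0, 1, 0])
train_preds = np.array([0, 1, 0, 0, 1, 0])   # model's predictions on the training set
y_test      = np.array([1, 0, 1, 1])
test_preds  = np.array([1, 1, 1, 0])         # model's predictions on unseen samples

empirical_error = np.mean(train_preds != y_train)               # error rate on the training set
estimated_generalisation_error = np.mean(test_preds != y_test)  # error rate on unseen samples

print(f"Empirical error: {empirical_error:.3f}")                                 # 1/6 ≈ 0.167
print(f"Estimated generalisation error: {estimated_generalisation_error:.3f}")   # 2/4 = 0.500
```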
What are the 4 reasons why underfitting happens?
Model too simple
Insufficient training
Uninformative dataset
Over-regularised
What are the 4 reasons why overfitting happens?
Model too complex
Excessive training
Small dataset
Lacking regularisation
How to fix overfitting?
Change model and/or change data
How to fix underfitting?
Update model and/or add more data
Why is overfitting unavoidable?
Because if overfitting could be avoided entirely, minimising the empirical error would efficiently yield the optimal model, which would constructively prove P = NP. Since we believe P ≠ NP (some solutions can be verified quickly but cannot be found efficiently), overfitting can only be mitigated, never eliminated
What’s the hold-out method?
Where the dataset is split into two disjoint subsets: a training set and a testing set
Why do we use stratified sampling?
To keep the class proportions of the original dataset in both subsets, preventing a biased error estimate
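A minimal sketch of a stratified hold-out split, assuming scikit-learn is available; the toy data and the 70/30 split ratio are illustrative choices, not prescribed by the cards above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)      # 10 toy samples with 2 features each
y = np.array([0] * 7 + [1] * 3)       # imbalanced classes: 70% / 30%

# stratify=y keeps the 70/30 class ratio in both disjoint subsets,
# which is what prevents a biased error estimate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

print(np.bincount(y_train), np.bincount(y_test))   # class counts stay proportional
```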
What are the 2 difficulties in choosing the data split?
More data in training set -> better model approximation but less reliable evaluation
More data in testing set -> better evaluation but weaker approximation
What is LOO (Leave-One-Out)?
A case of k-fold cross-validation where k = n, so each test set contains a single sample and the training set is the remaining n-1 samples
Close to an ideal evaluation, but the computational cost is prohibitive for large datasets
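A minimal sketch of leave-one-out cross-validation, assuming scikit-learn; the 1-nearest-neighbour classifier and toy dataset are arbitrary stand-ins.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0], [1], [2], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # Each fold: train on n-1 samples, test on the single left-out sample
    model = KNeighborsClassifier(n_neighbors=1).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append(pred[0] != y[test_idx][0])

print(f"LOO error estimate: {np.mean(errors):.3f}")   # average over all n folds
```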
What are the 5 steps of bootstrapping?
For dataset D containing n samples
1) Randomly pick a sample from D
2) Copy it to D’
3) Put it back in D
4) Repeat steps 1-3 n times
5) Use D’ as training set and D\D’ as testing set
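A minimal sketch of the bootstrap procedure above on a toy dataset of n = 10 samples (the dataset and seed are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
D = np.arange(n)                        # original dataset D, samples indexed 0..9

picks = rng.integers(0, n, size=n)      # steps 1-4: n random draws with replacement
D_prime = D[picks]                      # step 5: D' is the training set
oob = np.setdiff1d(D, D_prime)          # D \ D' is the out-of-bag testing set

print("D' (training):", D_prime)
print("OOB (testing):", oob)
```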
What proportion of the data ends up in the testing set in bootstrapping?
Chance of a sample never being picked in n draws: (1 - 1/n)^n
As n -> infinity, this tends to 1/e ≈ 0.368
So about 36.8% of the original samples never appear in D’ (this remaining data is called OOB (out-of-bag) data and forms the testing set)
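A quick numeric check of that limit (plain Python, nothing assumed beyond the standard library):

```python
import math

# (1 - 1/n)^n approaches 1/e as n grows
for n in (10, 100, 1000, 1_000_000):
    print(n, (1 - 1 / n) ** n)

print("1/e =", 1 / math.e)              # ≈ 0.3679
```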
What is out-of-bag estimate?
The error estimate obtained by testing on the out-of-bag data left over after bootstrapping
Parameters vs hyperparameters
Parameters are internal variables learned automatically from the data; modern models can have huge numbers of them (>10 billion)
Hyperparameters are external variables set by the user before training; there are typically few of them (<10)
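A minimal sketch of the distinction, assuming scikit-learn's LogisticRegression; the specific hyperparameter values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Hyperparameters: external, chosen by the user before training
model = LogisticRegression(C=1.0, max_iter=100)

# Parameters: internal, learned automatically from the data during fitting
model.fit(X, y)
print("learned parameters:", model.coef_, model.intercept_)
```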
What is accuracy?
Correctly predicted instances / all instances
What is error?
Incorrectly predicted instances / all instances
What is precision?
Correctly predicted positives / predicted positives
What is recall?
Correctly predicted positives / actual positives
What is specificity?
Correctly predicted negatives / actual negatives
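A minimal sketch computing all five metrics from confusion-matrix counts (the TP/FP/TN/FN values are made up for illustration):

```python
TP, FP, TN, FN = 40, 10, 45, 5          # invented confusion-matrix counts

accuracy    = (TP + TN) / (TP + FP + TN + FN)   # correct / all instances
error       = (FP + FN) / (TP + FP + TN + FN)   # incorrect / all instances
precision   = TP / (TP + FP)                    # correct positives / predicted positives
recall      = TP / (TP + FN)                    # correct positives / actual positives
specificity = TN / (TN + FP)                    # correct negatives / actual negatives

print(accuracy, error, precision, recall, specificity)
```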
What is a P-R curve?
Precision-recall curve
A plot of precision against recall; a tool for evaluating the effectiveness of a classification model
What 3 solutions are there to intersecting lines in a P-R curve?
- Compare areas under curves - not easy to compute
- Break-even point - measure the point on the curves where precision & recall are equal
- F1-Measure - harmonic mean of P & R:
2 x (P * R) / (P + R)
= 2 x TP / (N + TP - TN), where N is the total number of samples
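A minimal check that the two F1 forms above agree, reusing made-up confusion-matrix counts (N is the total number of samples):

```python
TP, FP, TN, FN = 40, 10, 45, 5            # invented counts
N = TP + FP + TN + FN                     # total number of samples

P = TP / (TP + FP)                        # precision
R = TP / (TP + FN)                        # recall

f1_from_pr = 2 * P * R / (P + R)          # harmonic mean of P and R
f1_from_counts = 2 * TP / (N + TP - TN)   # algebraically identical form

print(f1_from_pr, f1_from_counts)         # both ≈ 0.842
```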