ML_Projects (Coursera) Flashcards
Chain of assumptions - orthogonalization
Concept similar to independent (orthogonalized) TV-set adjustment knobs - one knob, one problem fix
- fit training set well on COST function; knobs: bigger network, different optimization algo
- fit dev set well on COST function; knobs: regularization, bigger training set
- fit test set well on COST function; knobs: bigger dev set
- perform well in live settings; knobs: change dev set, change COST function
Early-stopping cons
It is a single ‘knob’ that affects two things at once, which contradicts orthogonalization (see the sketch below):
- it limits how well the training set is fit (training stops before the COST function is fully optimized)
- it simultaneously acts as regularization (affects DEV set performance)
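A minimal sketch of early stopping on dev-set loss, assuming a Keras-style `model` and hypothetical `train_one_epoch` / `dev_loss` helpers; it shows how the single knob both cuts short the fitting of the training set and regularizes:

```python
# Hedged sketch: early stopping with patience on the dev-set loss.
# `model`, train_one_epoch(model) and dev_loss(model) are hypothetical placeholders.
best_dev, patience, waited = float("inf"), 5, 0
best_weights = None

for epoch in range(100):                 # max epochs, arbitrary
    train_one_epoch(model)               # knob 1: how well the training set is fit
    current = dev_loss(model)
    if current < best_dev:
        best_dev, waited = current, 0
        best_weights = model.get_weights()   # remember the best dev-set model so far
    else:
        waited += 1
        if waited >= patience:           # knob 2: stopping early acts as regularization
            break                        # the training COST is never fully optimized

if best_weights is not None:
    model.set_weights(best_weights)
```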
Using metric goals
- use a single metric for project evaluation; combined with a well-defined DEV set it speeds up iterations
e.g. F1 instead of separate precision and recall; a single error averaged across geographic regions
- use an optimizing metric subject to one or more satisficing metrics (each only has to do better than a threshold); see the sketch below
- change the evaluation metric (and DEV set) when it no longer ranks estimator performance correctly
e.g. add per-sample weights that increase the error on unacceptable misclassifications, so the metric better reflects the classifier’s discriminative power
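A minimal sketch (with made-up numbers) of selecting a model by one optimizing metric, F1, subject to a satisficing runtime constraint:

```python
# Hedged sketch: F1 as the single optimizing metric, runtime as a satisficing metric.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# made-up candidate models: (name, precision, recall, runtime in ms)
candidates = [
    ("A", 0.95, 0.90, 80),
    ("B", 0.98, 0.85, 150),
    ("C", 0.90, 0.92, 40),
]

RUNTIME_THRESHOLD_MS = 100  # satisficing: just has to be under the threshold
feasible = [c for c in candidates if c[3] <= RUNTIME_THRESHOLD_MS]
best = max(feasible, key=lambda c: f1(c[1], c[2]))   # optimize F1 among feasible models
print(best[0], round(f1(best[1], best[2]), 3))       # -> A 0.924
```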
Using TRAIN/DEV/TEST sets
- use the DEV set (to optimize the estimator) and the TEST set (to evaluate the generalization error) that come from the SAME distribution (‘aiming at target’ paradigm)
- for big data (e.g. 1 million samples) split: 98% train, 1% dev, 1% test (see the sketch below)
- for smaller datasets (e.g. thousands of samples) split: 60% / 20% / 20%
- pick the TEST set size large enough to give high confidence in the final evaluation
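A minimal sketch of the 98/1/1 split, assuming the data is already shuffled and that the dev/test portions come from the target distribution:

```python
# Hedged sketch: 98% / 1% / 1% split for a large, already-shuffled dataset.
def split_98_1_1(X, y):
    n = len(X)
    n_train = int(0.98 * n)
    n_dev = int(0.01 * n)
    X_train, y_train = X[:n_train], y[:n_train]
    X_dev, y_dev = X[n_train:n_train + n_dev], y[n_train:n_train + n_dev]
    X_test, y_test = X[n_train + n_dev:], y[n_train + n_dev:]
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)
```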
Orthogonalization for metric
- define a metric that correctly captures what you want from the estimator (place the target)
- figure out how to make the estimator do well on that metric (aim at the target)
- if doing well on the DEV set + metric does not translate into doing well in the application, change the metric and/or the DEV set
Human-level performance
- an algorithm may surpass human-level performance; after that, progress typically slows down as it approaches, but never exceeds, the Bayes optimal error
- in many tasks (natural data tasks) human-level performance is very close to Bayes optimal error
Improving algo to human-level performance
- get human-labeled data
- do manual error analysis: find out why human does better and incorporate into algo
- better bias/variance analysis
Bias/variance
- bias reduction tactics (diff estimator, larger DL network) when the training error is far from human-level error used as a proxy for Bayes error
- variance reduction tactics (regularization, larger training set) when the DEV set error is far from TRAIN error
- avoidable bias = TRAIN err - HL err
- variance = DEV err - TRAIN err
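A tiny sketch applying the two formulas above to decide which set of tactics to focus on (the error values are made up):

```python
# Hedged sketch: avoidable bias vs. variance from the formulas above.
human_err, train_err, dev_err = 0.01, 0.08, 0.10   # made-up error rates

avoidable_bias = train_err - human_err   # TRAIN err - HL err (proxy for Bayes error)
variance = dev_err - train_err           # DEV err - TRAIN err

if avoidable_bias > variance:
    print("focus on bias: bigger network, train longer, better optimizer")
else:
    print("focus on variance: more data, regularization")
```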
Human-level performance
The importance of HL performance lies in its use as a proxy for the Bayes error in human perception tasks; in the literature, the state-of-the-art result can also serve as a proxy for the Bayes error
Once HL performance is surpassed, it becomes much more difficult to improve the ML algorithm further
Techniques for supervised learning
- doing well on training set (how good are the assumptions): small avoidable bias; if not then:
- train another model
- train longer/better optimization algo (add momentum, RMSprop, adam)
- diff NN architecture, hyperparam search
- doing well on DEV/test sets (generalizes well): small variance; if not then:
- more data
- regularization: l2, dropout, data augmentation (see the sketch after this list)
- diff NN architecture, hyperparam search
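A hedged sketch of the l2 and dropout knobs in Keras; the layer sizes, dropout rate, regularization strength, and input dimension are arbitrary placeholders:

```python
import tensorflow as tf

# Hedged sketch: L2 weight decay + dropout as variance-reduction knobs.
# 100 input features, 64 hidden units, 0.01 L2 and 0.5 dropout are all arbitrary.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dropout(0.5),        # randomly zeroes half the activations at train time
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```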
Simple error analysis
- take ~100 random misclassified examples from the DEV set (false positives and false negatives) and count how many fall into each error category, e.g. 5 dogs (see the sketch below)
- the relative percentages give a “ceiling” on how much fixing each category can improve performance: 5 of 100 errors takes a 10% error to at best 9.5%, 50 of 100 takes it to 5%
- results would suggest which options to pursue for improvements
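A minimal sketch of the tally, with made-up category counts over 100 misclassified dev examples and a 10% overall dev error:

```python
from collections import Counter

# Hedged sketch: error-category counts from ~100 misclassified dev examples (made-up
# numbers) and the "ceiling" each category puts on possible improvement.
overall_dev_error = 0.10   # 10%
n_reviewed = 100
category_counts = Counter({"blurry": 50, "other breeds": 45, "dog": 5})

for category, count in category_counts.most_common():
    fraction = count / n_reviewed
    ceiling = overall_dev_error * fraction      # best-case error reduction for this category
    print(f"{category}: {fraction:.0%} of errors, fixing all of them -> "
          f"error {overall_dev_error:.1%} -> {overall_dev_error - ceiling:.1%}")
```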
Incorrectly labeled samples - training set
- DL algos are robust to (near) random errors in the training set
- DL algos are not robust to systematic errors in the training set, e.g. consistently labeling the same type of image wrong (white dogs labeled as cats)
Incorrectly labeled samples - dev set
- fix labels when incorrect-label errors are a significant fraction of the overall dev set error, e.g. 0.6% out of a 2% overall error (~30%, computed in the check below)
- a significant fraction of incorrect labels in the dev set undermines the goal of choosing between two models, because the dev set evaluation can no longer be trusted
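A tiny check of the 0.6% vs 2% arithmetic, with an arbitrary threshold for what counts as “significant”:

```python
# Hedged sketch: decide whether mislabeled dev examples are worth fixing.
overall_dev_error = 0.02        # 2% overall dev error
error_due_to_labels = 0.006     # 0.6% of dev examples carry wrong labels

fraction = error_due_to_labels / overall_dev_error
print(f"{fraction:.0%} of dev error comes from bad labels")   # ~30%
if fraction > 0.2:              # arbitrary threshold for "significant"
    print("worth fixing the dev labels before comparing models")
```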
Review corrected dev and test sets
- !!! make sure that they are still from the same distribution
- review examples that algo got both right and wrong: 100 or 200 examples
- it’s OK if the TRAIN set distribution ends up slightly different from the corrected DEV/TEST sets
Different training and testing data distributions - bad option
- e.g. sharp images (large set, 200k) vs blurred images (small set, 10k), where the blurred images are the distribution the application cares about
- bad option: mixing and shuffling the two sets and then splitting randomly produces sets that in expectation preserve the original ratio, so ~95% of the dev set items come from the large sharp set instead of the distribution we actually want to optimize against
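A quick computation with the 200k/10k numbers above, showing what a naive shuffle-and-split does to an illustrative 2,500-example dev set:

```python
# Hedged sketch: expected dev-set composition after mixing and randomly splitting.
n_sharp, n_blurred = 200_000, 10_000    # large sharp set vs small blurred (target) set
dev_size = 2_500                        # illustrative dev-set size after the split

p_sharp = n_sharp / (n_sharp + n_blurred)          # ~95.2% of any random sample is sharp
expected_blurred_in_dev = dev_size * (1 - p_sharp)
print(f"~{p_sharp:.1%} of the dev set comes from the sharp set; "
      f"only ~{expected_blurred_in_dev:.0f} dev examples match the target (blurred) distribution")
```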