ML_Projects (Coursera) Flashcards
Chain of assumptions - orthogonalization
Concept similar to a TV set with independent (orthogonal) adjustment knobs: one knob, one problem to fix
- fit training set well on COST function; knobs: bigger network, different optimization algo
- fit dev set well on COST function; knobs: regularization, bigger training set
- fit test set well on COST function; knobs: bigger dev set
- perform well in live settings; knobs: change dev set, change COST function
Early-stopping cons
It is a ‘knob’ that affects two things at once, which breaks orthogonalization:
- it affects how well the training set is fit (optimization of the COST function is stopped early)
- it acts as regularization (affects dev set performance)
Using metric goals
- use a single metric for project evaluation; combined with a well-defined DEV set it speeds up iteration
e.g. F1 instead of tracking precision and recall separately; an average of the error across geo regions
- use one optimizing metric subject to one or more satisficing metrics (each only has to do better than a threshold)
- change the evaluation metric (and DEV set) when it no longer ranks estimator performance correctly
e.g. add a per-sample weight that increases the error for unacceptable misclassifications so the metric reflects the classifier’s real discriminative goal
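A minimal sketch of the weighted-metric idea above (the function name, the mask, and the penalty value are assumptions, not from the course):

```python
import numpy as np

# weighted classification error: samples flagged as unacceptable to misclassify
# count 10x, so an estimator making those mistakes ranks worse under the metric
def weighted_error(y_true, y_pred, unacceptable_mask, penalty=10.0):
    weights = np.where(unacceptable_mask, penalty, 1.0)
    mistakes = (y_true != y_pred).astype(float)
    return np.sum(weights * mistakes) / np.sum(weights)
```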
Using TRAIN/DEV/TEST sets
- use the DEV set (to optimize the estimator) and the TEST set (to evaluate the generalization error) that come from the SAME distribution (‘aiming at target’ paradigm)
- for big data (e.g. 1 million samples) split: 98% train, 1% dev, 1% test
- for moderate data (e.g. thousands of samples) split: 60% / 20% / 20%
- pick the size of the TEST set so that the final evaluation has high confidence
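A minimal sketch of the big-data split (the 1,000,000-sample size and index arrays are placeholders):

```python
import numpy as np

# 98% / 1% / 1% split for a large dataset (assumed 1,000,000 samples)
rng = np.random.default_rng(0)
indices = rng.permutation(1_000_000)
n_dev = n_test = 10_000                       # 1% each
train_idx = indices[: -(n_dev + n_test)]      # 980,000 samples
dev_idx = indices[-(n_dev + n_test) : -n_test]
test_idx = indices[-n_test:]
```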
Orthogonalization for metric
- define a metric that correctly captures what matters for the estimator
- figure out how to do well on that metric, i.e. optimize the estimator against it
- if doing well on the DEV set + metric does not translate into doing well in the application, change the metric and/or the DEV set
Human-level performance
- an algorithm may surpass human-level performance, but its progress then slows as it approaches, without ever reaching, the Bayes optimal error
- for many tasks on natural data, human-level performance is very close to the Bayes optimal error
Improving algo to human-level performance
- get human-labeled data
- do manual error analysis: find out why humans do better and incorporate that insight into the algo
- better bias/variance analysis
Bias/variance
- bias reduction tactics (different estimator, larger DL network) when the training error is far from human-level error, used as a proxy for Bayes error
- variance reduction tactics (regularization, larger training set) when the DEV set error is far from the TRAIN error
- avoidable bias = TRAIN err - HL err
- variance = DEV err - TRAIN err
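Worked example (hypothetical numbers): HL err = 1%, TRAIN err = 8%, DEV err = 10%
- avoidable bias = 8% - 1% = 7%; variance = 10% - 8% = 2% => bias reduction is the bigger lever here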
Human-level performance
HL performance matters because it can stand in for the Bayes error in human-perception tasks; in papers, the state-of-the-art result is also sometimes used as a proxy for the Bayes error
Once HL performance is surpassed, it becomes much harder to keep improving the ML algorithm
Techniques for supervised learning
- doing well on the training set (first assumption in the chain): small avoidable bias; if not:
  - train a bigger or different model
  - train longer / use a better optimization algo (momentum, RMSprop, Adam)
  - try a different NN architecture / hyperparameter search
- doing well on the DEV/test sets (generalizes well): small variance; if not:
  - get more data
  - regularization: L2, dropout, data augmentation
  - try a different NN architecture / hyperparameter search
Simple error analysis
- review ~100 random misclassified examples (FP and FN) and count how many fall into each error category, e.g. 5 dogs
- each category’s share of the errors is a “ceiling” on the possible improvement: fixing 5 of 100 errors takes a 10% error rate to 9.5%; fixing 50 of 100 takes it from 10% to 5%
- results would suggest which options to pursue for improvements
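A minimal sketch of the tally (category names and data format are assumptions; in practice this is often just a spreadsheet):

```python
from collections import Counter

# categories assigned while manually reviewing ~100 misclassified dev examples
# (hypothetical tags; one example may fall into several categories)
reviewed = [
    {"dog"}, {"blurry"}, {"great cat"}, {"dog", "blurry"}, {"incorrect label"},
    # ... remaining reviewed examples
]
counts = Counter(tag for tags in reviewed for tag in tags)
n = len(reviewed)
for tag, c in counts.most_common():
    # the fraction of errors per category is the ceiling on the improvement
    print(f"{tag}: {c}/{n} = {100 * c / n:.0f}% of errors")
```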
Incorrectly labeled samples - training set
- DL algos are robust to (near) random label errors in the training set
- DL algos are not robust to systematic label errors in the training set, e.g. the same type of image consistently mislabeled (white dogs labeled as cats)
Incorrectly labeled samples - dev set
- fix labels when incorrect-label errors are a significant fraction of the overall dev set error, e.g. 0.6% out of a 2% total error (~30% of the errors)
- a significant share of incorrect labels in the dev set defeats its purpose of choosing between 2 models, because the dev set can no longer be trusted
review corrected dev and test sets
- !!! make sure that they are still from the same distribution
- review examples the algo got right as well as examples it got wrong: 100 or 200 examples
- it’s ok if the train set distribution ends up slightly different from dev/test
different training and testing data distributions - bad option
- i.e. sharp (large set, 200k) vs blurred images (small set, 10k)
- bad option: mixing and shuffling the original sets and then splitting at random preserves the original ratios in expectation, so ~95% of the dev set items would still come from the large non-target set, and that is the wrong target to optimize against
different training and testing data distributions - good option
- i.e. sharp (large set, 200k) vs blurred images (small set, 10k)
- good option: training set = the large set (200k) plus half (5k) of the small set; split the remaining small-set samples into dev and test sets (2.5k and 2.5k)
- this ensures the model is optimized against the ‘target’ (real-life app) distribution, and that dev & test come from the same distribution!!!
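A minimal sketch of the ‘good option’ split (array names are placeholders; sizes follow the 200k/10k example):

```python
import numpy as np

rng = np.random.default_rng(0)
web_idx = np.arange(200_000)            # large set, easy-to-collect distribution
app_idx = rng.permutation(10_000)       # small set, target (real app) distribution

train = np.concatenate([web_idx, app_idx[:5_000]])   # 205k training examples
dev = app_idx[5_000:7_500]              # 2.5k dev, target distribution only
test = app_idx[7_500:]                  # 2.5k test, same distribution as dev
```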
handling mismatched data distribution for training and dev sets
- there are 2 effects to disentangle in the results: generalization (variance) and data mismatch
- generalization/variance: carve a training-dev set out of the training data (not trained on), so training vs training-dev errors compare performance on the same distribution
- data mismatch: compare the training-dev error with the dev error to spot this problem
performance/error levels - mismatched data distributions
- human level (or state-of-the-art) error: HLE
- training set error: TRE
- training-dev set error: TRDE
- dev set error: DE
- test set error: TSE
bias/variance - mismatched data distributions
- avoidable bias = TRE - HLE
- variance = TRDE - TRE
- data mismatch = DE - TRDE
- overfitting to dev set = TSE - DE
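A minimal sketch of turning these gaps into a diagnosis (the error values are hypothetical):

```python
# hypothetical error levels, as fractions
HLE, TRE, TRDE, DE, TSE = 0.01, 0.06, 0.08, 0.14, 0.15

gaps = {
    "avoidable bias": TRE - HLE,
    "variance": TRDE - TRE,
    "data mismatch": DE - TRDE,
    "overfitting to dev set": TSE - DE,
}
# the largest gap points at the main problem to address next
print("focus on:", max(gaps, key=gaps.get))
```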
table representation - bias/variance analysis data-mismatch
                     | training distribution | dev/test distribution
human level          | HLE                   | (optional: HLE on dev/test data)
data trained on      | TRE                   |
data not trained on  | TRDE                  | DE or TSE
addressing data mismatch training/test sets
- do a manual error analysis to understand the differences
- make the training data more similar to dev/test, e.g. via data synthesis; caveat: the model can overfit to a small synthesis source (e.g. 1 hour of car noise reused across a huge speech set is too small a sample)
- collect more training data in conditions similar to the dev/test sets
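A minimal sketch of the data-synthesis idea from the second bullet (the audio example and all names are assumptions):

```python
import numpy as np

def synthesize(clean_audio, noise_pool, rng, noise_level=0.5):
    # mix a randomly chosen clip from the noise pool into the clean recording;
    # caveat from above: if the pool is tiny (e.g. 1 hour of car noise reused
    # over thousands of hours of speech), the network may overfit to that noise
    noise = noise_pool[rng.integers(len(noise_pool))]
    noise = np.resize(noise, clean_audio.shape)   # repeat/trim to match length
    return clean_audio + noise_level * noise
```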
building the first system
- set up a ‘target’: dev + test set and a metric
- build the first model/system quick and dirty
- use Bias/Variance and Error analysis to decide on next steps
- iterate prioritizing and improving the system
- if the problem area is fairly new, DON’T overthink it!! just get a first system going
- if there is an existing body of knowledge, it’s ok to start from that, but still DON’T overthink
transfer learning
- take a pre-trained DL network and re-train only the last layer (or last couple of layers) when the new dataset is small, keeping all the other layers’ weights fixed
- used to transfer from a problem with a lot of data to a problem with much less data
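A minimal sketch of this recipe in PyTorch (assumes a recent torchvision; the 5-class target task is hypothetical):

```python
import torch.nn as nn
from torch.optim import Adam
from torchvision import models

# network pre-trained on the data-rich problem (ImageNet here)
model = models.resnet18(weights="IMAGENET1K_V1")

# keep all pre-trained layers' weights fixed
for param in model.parameters():
    param.requires_grad = False

# replace the last layer for the new, data-poor task (5 classes is made up)
model.fc = nn.Linear(model.fc.in_features, 5)

# only the new layer's parameters get updated during training
optimizer = Adam(model.fc.parameters(), lr=1e-3)
```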
multi-task learning
- used much less than transfer learning
- use one DL network to solve several tasks through multiple outputs in the output layer, e.g. detecting car, pedestrian and traffic sign in an image
- a multi-label problem: use a per-label logistic (sigmoid) loss summed over the outputs instead of a softmax over a single label
incomplete output labels - multi-task learning
- for samples with incomplete output labels, compute the loss only over the available output components, e.g. only the car and pedestrian labels when the traffic-sign label is missing
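A minimal sketch of such a masked multi-task loss in PyTorch (the -1 ‘missing label’ encoding is an assumption):

```python
import torch.nn.functional as F

def multitask_loss(logits, targets):
    # targets: (batch, n_tasks) with 0/1 labels and -1 where a label is missing
    mask = (targets >= 0).float()
    safe_targets = targets.clamp(min=0).float()
    # per-label logistic (sigmoid) loss, as in the multi-label card above
    per_label = F.binary_cross_entropy_with_logits(logits, safe_targets, reduction="none")
    # average only over the labels that are actually present
    return (per_label * mask).sum() / mask.sum().clamp(min=1)
```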
when to use multi-task learning
- it’s beneficial when the tasks share low-level features and the outputs are multi-label
- (sometimes) when the amount of data available for each task is fairly similar, so the other tasks’ data meaningfully augments each one
- when it is possible to train a single bigger DL network that does well on all tasks
end-to-end DL
- replacing multiple processing stages of an ML system with a single, usually huge, DL network
- in practice an intermediate solution often works better, e.g. 2 steps: detect the face in the image, then zoom in and identify the person
- there is more data available for each of the 2 subtasks than for the full end-to-end mapping
pros and cons - end-to-end DL
pros:
- lets the data speak
- less hand-designing of components
cons:
- requires large amounts of data
- excludes potentially useful hand-designed components; when less data is available, hand-designed components inject human knowledge that compensates for the lack of data
sources of knowledge
- data: can be relied on exclusively when a large amount of data is available
- human knowledge: hand-designed components are useful when the amount of data is rather limited