ML_Projects (Coursera) Flashcards

1
Q

Chain of assumptions - orthogonalization

A

Concept similar to independent (orthogonalized) TV-set adjustment knobs - one knob, one problem fix

  • fit training set well on COST function; knobs: bigger network, different optimization algo
  • fit dev set well on COST function; knobs: regularization, bigger training set
  • fit test set well on COST function; knobs: bigger dev set
  • perform well in live settings; knobs: change dev set, change COST function
2
Q

Early-stopping cons

A

It is a ‘knob’ with 2 functions at once, which contradicts orthogonalization:

  • it affects how well the training set is fit on the COST function (stopping earlier means less optimization)
  • it simultaneously acts as regularization (affects dev set performance)
3
Q

Using metric goals

A
  • use a single metric for project evaluation; in combination with a well-defined DEV set it will speed up iterations
    e.g. F1 instead of separate precision and recall; the average error across geographic regions
  • use one optimizing metric subject to one or more satisficing metrics (each only has to beat a threshold)
  • change the evaluation metric (and the DEV set) when it no longer ranks estimator performance correctly
    e.g. add per-sample weights that increase the error contribution of unacceptable misclassifications, to sharpen the classifier’s discriminative power (sketch below)
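A minimal Python sketch of both ideas, assuming hypothetical arrays y_true, y_pred and a per-sample weight vector (names not from the course):

    import numpy as np

    def f1(y_true, y_pred):
        """Single-number metric combining precision and recall."""
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    def weighted_error(y_true, y_pred, weights):
        """Per-sample weights (e.g. 10 for unacceptable mistakes, 1 otherwise)
        make the metric penalize the errors we care about most."""
        mistakes = (y_true != y_pred).astype(float)
        return np.sum(weights * mistakes) / np.sum(weights)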
4
Q

Using TRAIN/DEV/TEST sets

A
  • use a DEV set (to optimize the estimator) and a TEST set (to evaluate the generalization error) that come from the SAME distribution (‘aiming at the target’ paradigm)
  • for big data (e.g. 1 million samples) split: 98% train, 1% dev, 1% test
  • for normal-sized data (e.g. thousands of samples) split: 60% / 20% / 20%
  • pick the size of the TEST set so there is high confidence in the final evaluation (sketch below)
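A minimal sketch of the two split regimes, assuming only a hypothetical sample count and a fixed random seed:

    import numpy as np

    def split_indices(n, fracs, seed=0):
        """Shuffle indices and cut them into train/dev/test by the given fractions."""
        idx = np.random.default_rng(seed).permutation(n)
        n_train, n_dev = int(fracs[0] * n), int(fracs[1] * n)
        return idx[:n_train], idx[n_train:n_train + n_dev], idx[n_train + n_dev:]

    train, dev, test = split_indices(1_000_000, (0.98, 0.01, 0.01))  # big data
    train, dev, test = split_indices(10_000, (0.60, 0.20, 0.20))     # normal-sized data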
5
Q

Orthogonalization for metric

A
  1. define a metric that correctly captures what matters for the estimator
  2. figure out how to do well on it (optimize) - worry separately about making the estimator perform well on that metric
  3. if doing well on the DEV set + metric does not translate into doing well in the application, change the metric and/or the DEV set
6
Q

Human-level performance

A
  • an algorithm may surpass human-level performance, but its progress then typically slows down as it approaches, without ever going below, the Bayes optimal error
  • in many tasks on natural data, human-level performance is very close to the Bayes optimal error
7
Q

Improving algo to human-level performance

A
  • get human-labeled data
  • do manual error analysis: find out why humans do better and incorporate that insight into the algo
  • do better bias/variance analysis
8
Q

Bias/variance

A
  • apply bias-reduction tactics (different estimator, larger DL network) when the TRAIN error is far from the human-level error, used as a proxy for the Bayes error
  • apply variance-reduction tactics (regularization, larger training set) when the DEV error is far from the TRAIN error
  • avoidable bias = TRAIN err - HL err
  • variance = DEV err - TRAIN err (worked example below)
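A worked example of the two formulas with hypothetical error levels:

    human_level_err = 0.01   # proxy for Bayes error
    train_err       = 0.08
    dev_err         = 0.10

    avoidable_bias = train_err - human_level_err   # 0.07 -> focus on bias reduction first
    variance       = dev_err - train_err           # 0.02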
9
Q

Human-level performance

A

The importance of HL performance lies in its use as a proxy for the Bayes error on human-perception tasks; in papers, the state of the art can also serve as a proxy for the Bayes error.
Once HL performance is surpassed, it becomes much more difficult to improve the ML algorithm.

10
Q

Techniques for supervised learning

A
  1. doing well on the training set (how good are the assumptions): small avoidable bias; if not, then:
    - train another model
    - train longer / use a better optimization algo (momentum, RMSprop, Adam)
    - different NN architecture, hyperparameter search
  2. doing well on the DEV/test sets (generalizes well): small variance; if not, then:
    - more data
    - regularization: L2, dropout, data augmentation
    - different NN architecture, hyperparameter search
11
Q

Simple error analysis

A
  • review ~100 randomly chosen misclassified examples (FPs and FNs) and count the different categories of errors, e.g. 5 of them are dogs mislabeled as cats
  • each relative percentage is a “ceiling” on the improvement from fixing that category: for 5 of 100, dev error could drop from 10% to at best 9.5%; for 50 of 100, from 10% to 5% (sketch below)
  • the results suggest which options are worth pursuing for improvements
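A minimal sketch of the “ceiling” computation, using hypothetical category counts over 100 reviewed errors:

    dev_error = 0.10                                      # overall dev set error
    counts = {"dogs": 5, "blurry": 50, "great cats": 8}   # out of 100 reviewed mistakes
    n_reviewed = 100

    for category, count in counts.items():
        ceiling = dev_error * (1 - count / n_reviewed)    # best case if the category is fully fixed
        print(f"{category}: at best {dev_error:.1%} -> {ceiling:.1%}")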
12
Q

Incorrectly labeled samples - training set

A
  • DL algos are robust to (near) random label errors in the training set
  • DL algos are not robust to systematic label errors in the training set, e.g. the same type of image consistently mislabeled (white dogs labeled as cats)
13
Q

Incorrectly labeled samples - dev set

A
  • fix labels when incorrect-label errors are a significant fraction of the overall dev set error, e.g. 0.6% out of an overall 2% ≈ 30% (arithmetic below)
  • a significant share of incorrect labels in the dev set undermines the goal of selecting between 2 models, because the dev set can no longer be trusted
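The arithmetic behind the 30% figure (hypothetical numbers from the example above):

    dev_error           = 0.020   # overall dev set error
    incorrect_label_err = 0.006   # portion caused by wrong labels
    fraction = incorrect_label_err / dev_error   # 0.30 -> worth fixing the labels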
14
Q

review corrected dev and test sets

A
  • !!! make sure they are still from the same distribution
  • review examples the algo got both right and wrong: 100 or 200 examples
  • it’s OK if the TRAIN set distribution ends up slightly different
15
Q

different training and testing data distributions - bad option

A
  • e.g. sharp images (large set, 200k) vs blurred images (small set, 10k)
  • bad option: mix and shuffle the original sets and split randomly; in expectation the resulting sets preserve the original ratio, so ~95% of the dev set items still come from the original large set and you end up optimizing against the wrong target (arithmetic below)
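The expectation behind the ~95% figure, using the example set sizes:

    n_large, n_small = 200_000, 10_000        # e.g. sharp web images vs blurred target images
    p_large = n_large / (n_large + n_small)   # ~0.952: after shuffling, a random dev item
                                              # is ~95% likely to come from the large set,
                                              # so the dev set no longer matches the target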
16
Q

different training and testing data distributions - good option

A
  • e.g. sharp images (large set, 200k) vs blurred images (small set, 10k)
  • good option: training set: add half (5k) of the small set to the large set (200k); then split the remaining small set into dev and test sets (2.5k and 2.5k) to optimize against (sketch below)
  • this ensures that the model is optimized against the ‘target’ (the real-life app) and that dev & test come from the same distribution!!!
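A minimal sketch of this split, assuming the 200k/10k example sizes and a shuffled index array for the small target-distribution set:

    import numpy as np

    n_web, n_target = 200_000, 10_000                 # sharp web images vs blurred target images
    target_idx = np.random.default_rng(0).permutation(n_target)

    train_extra = target_idx[:5_000]        # 5k target images join the 200k web images in training
    dev         = target_idx[5_000:7_500]   # dev and test come only from the target
    test        = target_idx[7_500:]        # distribution, 2.5k each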
17
Q

handling mismatched data distribution for training and dev sets

A
  • disentangle 2 effects on the results: generalization (variance) and data-mismatch problems
  • generalization/variance: hold out a training-dev set (split from the training set but not trained on), so that comparing TRAIN and training-dev errors measures variance on the same distribution
  • data mismatch: compare the training-dev results with the dev results to spot this problem
18
Q

performance/error levels - mismatched data distributions

A
  • human level (or state-of-the-art) error: HLE
  • training set error: TRE
  • training-dev set error: TRDE
  • dev set error: DE
  • test set error: TSE
19
Q

bias/variance - mismatched data distributions

A

avoidable bias = TRE - HLE
variance = TRDE - TRE
data mismatch = DE - TRDE
overfitting to dev set = TSE - DE
(worked example below)
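A worked example with hypothetical error levels:

    HLE  = 0.01   # human-level error (Bayes proxy)
    TRE  = 0.04   # training set error
    TRDE = 0.05   # training-dev set error
    DE   = 0.09   # dev set error
    TSE  = 0.10   # test set error

    avoidable_bias     = TRE - HLE    # 0.03
    variance           = TRDE - TRE   # 0.01
    data_mismatch      = DE - TRDE    # 0.04 -> the dominant problem in this example
    overfit_to_dev_set = TSE - DE     # 0.01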

20
Q

table representation - bias/variance analysis data-mismatch

A
  • columns: training-distribution data, dev/test-distribution data
  • rows: human-level error, error on data trained on, error on data not trained on

                                     training distribution | dev/test distribution
    human-level error                HLE                   | (optional) HL on dev/test
    error on data trained on         TRE                   | -
    error on data not trained on     TRDE                  | DE or TSE
21
Q

addressing data mismatch training/test sets

A
  • do a manual error analysis to understand the differences
  • make the training data more similar, e.g. with synthesized data - with the caveat that the network can overfit to a small synthesized source, e.g. 1 hour of car noise looped over many examples is too small a sample
  • collect more training data in conditions similar to the dev/test sets
22
Q

building the first system

A
  • set up a ‘target’: dev + test set and a metric
  • build the first model/system quickly, even if it is dirty
  • use bias/variance and error analysis to decide on next steps
  • iterate, prioritizing and improving the system
  • if the problem is fairly new, DON’T overthink!! - just get it going
  • if there is an existing body of knowledge, it is OK to start from that, but still DON’T overthink
23
Q

transfer learning

A
  • use a pre-trained DL network and re-train only the last layer, or the last couple of layers if the new dataset is small, keeping all the other layers’ weights fixed (sketch below)
  • used to transfer from a problem with a lot of data to a problem with much less data
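A minimal PyTorch sketch, assuming torchvision’s pre-trained ResNet-18 and a hypothetical 5-class target task (neither is specified in the course):

    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights="DEFAULT")     # network pre-trained on a large dataset

    for param in model.parameters():               # freeze all pre-trained weights
        param.requires_grad = False

    model.fc = nn.Linear(model.fc.in_features, 5)  # fresh last layer for the small new task
    # only model.fc.parameters() are given to the optimizer and updated during training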
24
Q

multi-task learning

A
  • used much less often than transfer learning
  • one DL network solves several tasks via multiple outputs in the output layer, e.g. detecting cars, pedestrians and traffic signs in an image
  • it is a multi-label problem: use a per-output logistic-regression (sigmoid) loss instead of a softmax over a single label (sketch below)
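A minimal PyTorch sketch of such a multi-label head, assuming 3 hypothetical tasks (car, pedestrian, traffic sign) and a 128-dim shared feature vector:

    import torch
    import torch.nn as nn

    n_tasks = 3                          # car, pedestrian, traffic sign
    head = nn.Linear(128, n_tasks)       # output layer: one logit per task
    loss_fn = nn.BCEWithLogitsLoss()     # independent sigmoid + binary cross-entropy per output

    features = torch.randn(8, 128)                       # hypothetical shared features (batch of 8)
    labels = torch.randint(0, 2, (8, n_tasks)).float()   # one 0/1 label per task
    loss = loss_fn(head(features), labels)               # averages the per-task losses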
25
Q

incomplete output labels - multi-task learning

A
  • for samples with incomplete output labels, compute the loss using only the available output components, e.g. only the car and pedestrian terms if the traffic-sign label is missing (sketch below)
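A sketch of the same loss when some labels are missing, assuming missing components are marked with -1 (a convention chosen here, not from the course):

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 3)                 # hypothetical batch, 3 tasks
    labels = torch.tensor([[1., 0., -1.],      # -1 marks a missing label
                           [0., -1., 1.],
                           [1., 1., 0.],
                           [-1., 0., 1.]])

    mask = labels != -1                                    # True where a label exists
    per_term = F.binary_cross_entropy_with_logits(
        logits, labels.clamp(min=0), reduction="none")     # clamp: dummy 0 for missing slots
    loss = (per_term * mask).sum() / mask.sum()            # average only the labeled terms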
26
Q

when to use multi-task learning

A
  • it’s beneficial when the tasks share low-level features and a multi-label output
  • (sometimes) when the amounts of data for the candidate tasks are quite similar (unlike the transfer-learning setting, where one task has much more data)
  • when it is possible to train a single, bigger DL network that does well on all tasks
27
Q

end-to-end DL

A
  • replacing multiple processing stages of an ML system with a single, usually huge, DL network
  • in practice an intermediate solution, e.g. 2 steps, often works better: detect the face in an image, zoom in, then identify the person
  • there is more data for each of the 2 subtasks than for the full end-to-end mapping
28
Q

pros and cons - end-to-end DL

A
pros:
- lets the data speak
- fewer hand-designed components
cons:
- requires large amounts of data
- excludes potentially useful hand-designed components; when less data is available, such components compensate for the lack of data with human knowledge
29
Q

sources of knowledge

A
  • data: can be relied on almost exclusively when a large amount of data is available
  • human knowledge: hand-designed components are useful when the amount of data is rather limited