ML_Projects (Coursera) Flashcards
Chain of assumptions - orthogonalization
Concept similar to a TV set with independent (orthogonal) adjustment knobs: one knob, one problem to fix
- fit training set well on COST function; knobs: bigger network, different optimization algo
- fit dev set well on COST function; knobs: regularization, bigger training set
- fit test set well on COST function; knobs: bigger dev set
- perform well in live settings; knobs: change dev set, change COST function
Early-stopping cons
It is a ‘knob’ that affects two things at once, which breaks orthogonalization:
- it affects how well the training set is fit (optimization of the COST function is stopped early)
- it acts as regularization (affects dev set performance)
Using metric goals
- use a single metric for project evaluation; combined with a well-defined DEV set it speeds up iteration
e.g. F1 instead of tracking precision and recall separately; an average of the error across geo regions
- use one optimizing metric subject to one or more satisficing metrics (each only has to do better than a threshold)
- change the evaluation metric (and DEV set) when it no longer ranks estimator performance correctly
e.g. add a per-sample weight that increases the error for unacceptable misclassifications so the metric reflects the classifier’s real discriminative goal
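A minimal sketch of the weighted-metric idea above (the function name, the mask, and the penalty value are assumptions, not from the course):

```python
import numpy as np

# weighted classification error: samples flagged as unacceptable to misclassify
# count 10x, so an estimator making those mistakes ranks worse under the metric
def weighted_error(y_true, y_pred, unacceptable_mask, penalty=10.0):
    weights = np.where(unacceptable_mask, penalty, 1.0)
    mistakes = (y_true != y_pred).astype(float)
    return np.sum(weights * mistakes) / np.sum(weights)
```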
Using TRAIN/DEV/TEST sets
- use the DEV set (to optimize the estimator) and the TEST set (to evaluate the generalization error) that come from the SAME distribution (‘aiming at target’ paradigm)
- for big data (e.g. 1 million samples) split: 98% train, 1% dev, 1% test
- for moderate data (e.g. thousands of samples) split: 60% / 20% / 20%
- pick the size of the TEST set so that the final evaluation has high confidence
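A minimal sketch of the big-data split (the 1,000,000-sample size and index arrays are placeholders):

```python
import numpy as np

# 98% / 1% / 1% split for a large dataset (assumed 1,000,000 samples)
rng = np.random.default_rng(0)
indices = rng.permutation(1_000_000)
n_dev = n_test = 10_000                       # 1% each
train_idx = indices[: -(n_dev + n_test)]      # 980,000 samples
dev_idx = indices[-(n_dev + n_test) : -n_test]
test_idx = indices[-n_test:]
```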
Orthogonalization for metric
- define a metric that correctly captures what matters for the estimator
- figure out how to do well on that metric, i.e. optimize the estimator against it
- if doing well on the DEV set + metric does not translate into doing well in the application, change the metric and/or the DEV set
Human-level performance
- an algorithm may surpass human-level performance, but its progress then slows as it approaches, without ever reaching, the Bayes optimal error
- for many tasks on natural data, human-level performance is very close to the Bayes optimal error
Improving algo to human-level performance
- get human-labeled data
- do manual error analysis: find out why humans do better and incorporate that insight into the algo
- better bias/variance analysis
Bias/variance
- bias reduction tactics (different estimator, larger DL network) when the training error is far from human-level error, used as a proxy for Bayes error
- variance reduction tactics (regularization, larger training set) when the DEV set error is far from the TRAIN error
- avoidable bias = TRAIN err - HL err
- variance = DEV err - TRAIN err
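Worked example (hypothetical numbers): HL err = 1%, TRAIN err = 8%, DEV err = 10%
- avoidable bias = 8% - 1% = 7%; variance = 10% - 8% = 2% => bias reduction is the bigger lever here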
Human-level performance
HL performance matters because it can stand in for the Bayes error in human-perception tasks; in papers, the state-of-the-art result is also sometimes used as a proxy for the Bayes error
Once HL performance is surpassed, it becomes much harder to keep improving the ML algorithm
Techniques for supervised learning
- doing well on the training set (first assumption in the chain): small avoidable bias; if not:
  - train a bigger or different model
  - train longer / use a better optimization algo (momentum, RMSprop, Adam)
  - try a different NN architecture / hyperparameter search
- doing well on the DEV/test sets (generalizes well): small variance; if not:
  - get more data
  - regularization: L2, dropout, data augmentation
  - try a different NN architecture / hyperparameter search
Simple error analysis
- review ~100 random misclassified examples (FP and FN) and count how many fall into each error category, e.g. 5 dogs
- each category’s share of the errors is a “ceiling” on the possible improvement: fixing 5 of 100 errors takes a 10% error rate to 9.5%; fixing 50 of 100 takes it from 10% to 5%
- results would suggest which options to pursue for improvements
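A minimal sketch of the tally (category names and data format are assumptions; in practice this is often just a spreadsheet):

```python
from collections import Counter

# categories assigned while manually reviewing ~100 misclassified dev examples
# (hypothetical tags; one example may fall into several categories)
reviewed = [
    {"dog"}, {"blurry"}, {"great cat"}, {"dog", "blurry"}, {"incorrect label"},
    # ... remaining reviewed examples
]
counts = Counter(tag for tags in reviewed for tag in tags)
n = len(reviewed)
for tag, c in counts.most_common():
    # the fraction of errors per category is the ceiling on the improvement
    print(f"{tag}: {c}/{n} = {100 * c / n:.0f}% of errors")
```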
Incorrectly labeled samples - training set
- DL algos are robust to (near) random label errors in the training set
- DL algos are not robust to systematic label errors in the training set, e.g. the same type of image consistently mislabeled (white dogs labeled as cats)
Incorrectly labeled samples - dev set
- fix labels when incorrect-label errors are a significant fraction of the overall dev set error, e.g. 0.6% out of a 2% total error (~30% of the errors)
- a significant share of incorrect labels in the dev set defeats its purpose of choosing between 2 models, because the dev set can no longer be trusted
review corrected dev and test sets
- !!! make sure that they are still from the same distribution
- review examples the algo got right as well as examples it got wrong: 100 or 200 examples
- it’s ok if the train set distribution ends up slightly different from dev/test
different training and testing data distributions - bad option
- i.e. sharp (large set, 200k) vs blurred images (small set, 10k)
- bad option: mixing and shuffling the original sets and then splitting at random preserves the original ratios in expectation, so ~95% of the dev set items would still come from the large non-target set, and that is the wrong target to optimize against
different training and testing data distributions - good option
- i.e. sharp (large set, 200k) vs blurred images (small set, 10k)
- good option: training set = the large set (200k) plus half (5k) of the small set; split the remaining small-set samples into dev and test sets (2.5k and 2.5k)
- this ensures the model is optimized against the ‘target’ (real-life app) distribution, and that dev & test come from the same distribution!!!
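A minimal sketch of the ‘good option’ split (array names are placeholders; sizes follow the 200k/10k example):

```python
import numpy as np

rng = np.random.default_rng(0)
web_idx = np.arange(200_000)            # large set, easy-to-collect distribution
app_idx = rng.permutation(10_000)       # small set, target (real app) distribution

train = np.concatenate([web_idx, app_idx[:5_000]])   # 205k training examples
dev = app_idx[5_000:7_500]              # 2.5k dev, target distribution only
test = app_idx[7_500:]                  # 2.5k test, same distribution as dev
```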
handling mismatched data distribution for training and dev sets
- there are 2 effects to disentangle in the results: generalization (variance) and data mismatch
- generalization/variance: carve a training-dev set out of the training data (not trained on), so training vs training-dev errors compare performance on the same distribution
- data mismatch: compare the training-dev error with the dev error to spot this problem
performance/error levels - mismatched data distributions
- human level (or state-of-the-art) error: HLE
- training set error: TRE
- training-dev set error: TRDE
- dev set error: DE
- test set error: TSE
bias/variance - mismatched data distributions
- avoidable bias = TRE - HLE
- variance = TRDE - TRE
- data mismatch = DE - TRDE
- overfitting to dev set = TSE - DE
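A minimal sketch of turning these gaps into a diagnosis (the error values are hypothetical):

```python
# hypothetical error levels, as fractions
HLE, TRE, TRDE, DE, TSE = 0.01, 0.06, 0.08, 0.14, 0.15

gaps = {
    "avoidable bias": TRE - HLE,
    "variance": TRDE - TRE,
    "data mismatch": DE - TRDE,
    "overfitting to dev set": TSE - DE,
}
# the largest gap points at the main problem to address next
print("focus on:", max(gaps, key=gaps.get))
```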
table representation - bias/variance analysis data-mismatch
                     | training distribution | dev/test distribution
human level          | HLE                   | (optional: HLE on dev/test data)
data trained on      | TRE                   |
data not trained on  | TRDE                  | DE or TSE
addressing data mismatch training/test sets
- do a manual error analysis to understand the differences
- make the training data more similar to dev/test, e.g. via data synthesis; caveat: the model can overfit to a small synthesis source (e.g. 1 hour of car noise reused across a huge speech set is too small a sample)
- collect more training data in conditions similar to the dev/test sets
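A minimal sketch of the data-synthesis idea from the second bullet (the audio example and all names are assumptions):

```python
import numpy as np

def synthesize(clean_audio, noise_pool, rng, noise_level=0.5):
    # mix a randomly chosen clip from the noise pool into the clean recording;
    # caveat from above: if the pool is tiny (e.g. 1 hour of car noise reused
    # over thousands of hours of speech), the network may overfit to that noise
    noise = noise_pool[rng.integers(len(noise_pool))]
    noise = np.resize(noise, clean_audio.shape)   # repeat/trim to match length
    return clean_audio + noise_level * noise
```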
building the first system
- set up a ‘target’: dev + test set and a metric
- build the first model/system quick and dirty
- use Bias/Variance and Error analysis to decide on next steps
- iterate prioritizing and improving the system
- if the problem area is fairly new, DON’T overthink it!! just get a first system going
- if there is an existing body of knowledge, it’s ok to start from that, but still DON’T overthink
transfer learning
- take a pre-trained DL network and re-train only the last layer (or last couple of layers) when the new dataset is small, keeping all the other layers’ weights fixed
- used to transfer from a problem with a lot of data to a problem with much less data
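A minimal sketch of this recipe in PyTorch (assumes a recent torchvision; the 5-class target task is hypothetical):

```python
import torch.nn as nn
from torch.optim import Adam
from torchvision import models

# network pre-trained on the data-rich problem (ImageNet here)
model = models.resnet18(weights="IMAGENET1K_V1")

# keep all pre-trained layers' weights fixed
for param in model.parameters():
    param.requires_grad = False

# replace the last layer for the new, data-poor task (5 classes is made up)
model.fc = nn.Linear(model.fc.in_features, 5)

# only the new layer's parameters get updated during training
optimizer = Adam(model.fc.parameters(), lr=1e-3)
```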
multi-task learning
- used much less than transfer learning
- use one DL network to solve several tasks through multiple outputs in the output layer, e.g. detecting car, pedestrian and traffic sign in an image
- a multi-label problem: use a per-label logistic (sigmoid) loss summed over the outputs instead of a softmax over a single label
incomplete output labels - multi-task learning
- for samples with incomplete output labels, compute the loss only over the available output components, e.g. only the car and pedestrian labels when the traffic-sign label is missing
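A minimal sketch of such a masked multi-task loss in PyTorch (the -1 ‘missing label’ encoding is an assumption):

```python
import torch.nn.functional as F

def multitask_loss(logits, targets):
    # targets: (batch, n_tasks) with 0/1 labels and -1 where a label is missing
    mask = (targets >= 0).float()
    safe_targets = targets.clamp(min=0).float()
    # per-label logistic (sigmoid) loss, as in the multi-label card above
    per_label = F.binary_cross_entropy_with_logits(logits, safe_targets, reduction="none")
    # average only over the labels that are actually present
    return (per_label * mask).sum() / mask.sum().clamp(min=1)
```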
when to use multi-task learning
- it’s beneficial when the tasks share low-level features and the outputs are multi-label
- (sometimes) when the amount of data available for each task is fairly similar, so the other tasks’ data meaningfully augments each one
- when it is possible to train a single bigger DL network that does well on all tasks
end-to-end DL
- replacing multiple processing stages of an ML system with a single, usually huge, DL network
- in practice an intermediate solution often works better, e.g. 2 steps: detect the face in the image, then zoom in and identify the person
- there is more data available for each of the 2 subtasks than for the full end-to-end mapping
pros and cons - end-to-end DL
pros:
- lets the data speak
- less hand-designing of components
cons:
- requires large amounts of data
- excludes potentially useful hand-designed components; when less data is available, hand-designed components inject human knowledge that compensates for the lack of data
sources of knowledge
- data: can be relied on exclusively when a large amount of data is available
- human knowledge: hand-designed components are useful when the amount of data is rather limited