Week 2 Flashcards
Let’s say we have a cat classifier.
Write an error analysis procedure that lets you know whether focusing on a particular category of mislabeled examples (e.g., dogs classified as cats) to improve performance is worth the effort.
(Carrying out Error Analysis)
1:42
1) Get about 100 mislabeled dev set examples
2) Examine them manually and count how many of the mislabeled examples are actually dogs
3) Suppose only 5 of these mislabeled examples are actually pictures of dogs
This means that even if the classifier is improved so that it never mistakes a dog for a cat, you only remove 5% of the errors, because on average you are fixing 5 out of every 100 mistakes (see the worked numbers below)
4) Now we can decide whether doing so is a good use of our time or not
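Worked numbers (illustrative, assuming the overall dev error is 10%): removing every dog mistake removes 5% of the errors, so the dev error drops at best from 10% to 9.5%. If instead 50 of the 100 examined mistakes had been dogs, the ceiling would be a drop from 10% to 5%, which would clearly be worth pursuing.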
What’s “ceiling of performance” in ML?
(Carrying out Error Analysis)
3:10
Given that we fix some category of error, the ceiling on performance is the best-case overall improvement we could get from doing so. E.g., in the cat classifier example, we saw that even if we eliminate all the dog mistakes, we attain at best a 5% reduction in the errors.
How do we evaluate multiple ideas in error analysis?
(Carrying out Error Analysis)
5:40
1) Create a table; the rows, listed down the left side, are the images you plan to look at manually
So the rows go from 1 to 100 if we have sampled 100 misrecognized examples from the dev set
2) Columns of this table are the ideas you evaluate
3) You tick the ideas present in each picture
4) Sum up the ticks in each column to get a sense of which ideas are worth pursuing (a minimal tally sketch follows)
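A minimal sketch in Python of summing up such a table; the idea columns and the per-image ticks below are made up for illustration.

```python
# Minimal sketch of summing up an error analysis table.
# The idea columns and per-image annotations are illustrative only.
from collections import Counter

ideas = ["dog", "great_cat", "blurry", "incorrectly_labeled"]

# One dict per manually examined misclassified dev example; tick the ideas that apply.
annotations = [
    {"dog": True, "blurry": True},
    {"great_cat": True},
    {"blurry": True, "incorrectly_labeled": True},
    # ... roughly 100 examples in total
]

counts = Counter()
for example in annotations:
    for idea, ticked in example.items():
        if ticked:
            counts[idea] += 1

total = len(annotations)
for idea in ideas:
    print(f"{idea}: {counts[idea]}/{total} ({100 * counts[idea] / total:.0f}%)")
```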
What’s the difference between a mislabeled example and an incorrectly labeled example?
(Cleaning up incorrectly labeled data)
0:35
If the algorithm outputs the wrong label, the example is mislabeled
If the example is labeled wrongly in the dataset (the ground-truth label Y is wrong), it’s incorrectly labeled
Deep learning algorithms are robust to random errors in the training set, but less so to systematic errors. True/False
(Cleaning up incorrectly labeled data)
1:21, 2:21
True
If the errors are reasonably random, then it’s probably ok to leave them as they are
What should we do if we’re worried about the impact of incorrectly labeled examples in the dev or test sets?
(Cleaning up incorrectly labeled data)
2:54
We can add a column for incorrectly labeled examples to the error analysis table, and count up the number of examples where the label Y was incorrect (illustrative numbers below)
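Illustrative numbers: if the overall dev error is 10% and 6% of the examined mistakes are due to incorrect labels, then incorrect labels account for only about 0.6% of the overall error; whether correcting them is worthwhile depends on how that compares with the other error categories in the table.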
What are the principles to consider if we decide to manually re-examine the labels and correct the incorrectly labeled ones in dev/test sets?
(Cleaning up incorrectly labeled data)
8:22, 10:34
1) Apply the same process to both dev and test sets (to make sure they still come from the same distribution)
2) Consider looking at the examples your algorithm got right as well as the ones it got wrong (there could be incorrectly labeled examples that the algorithm only “got right” because the label itself is wrong; if we don’t examine those, our error estimate ends up biased 9:34)
3) Since the training set has most of the data, it usually isn’t practical to go through it and check for incorrectly labeled data. After correcting dev/test but not training, the training set and the dev/test sets may therefore come from slightly different distributions, but that’s actually OK, because learning algorithms are quite robust to differences between the training distribution and the dev/test distribution
It’s super important, though, that the dev and test sets come from the same distribution
Why isn’t examining the correctness of labels in the dev/test sets for examples the algorithm got right always done?
(Cleaning up incorrectly labeled data)
9:59
It’s not easy to do, especially if your classifier has high accuracy, e.g. 98%: then 98% of the data is classified correctly, so there is far more correctly classified data to look through. It takes a lot of time, so it’s not something that’s always done, but it’s something to consider
What does “Build your first system quickly then iterate” mean?
(Build your first system quickly then iterate)
1:54
1) Set up dev/test sets and a metric, and build an initial ML system quickly
2) Use avoidable bias / variance analysis and error analysis to prioritize the next steps
3) Iterate
Where does “Build your first system quickly then iterate” apply less strongly?
(Build your first system quickly then iterate)
3:37
1) If you’re experienced in the area you’re working in
2) If there’s a significant body of academic literature you can draw on for the problem you’re solving
Let’s say we want to build a cat classifier.
We have 10,000 pictures taken by users and 200,000 pictures from the web.
Pictures taken by users are usually not framed well and are in general different from the pictures on the web.
What we care about is for the app to do well on user pictures.
What are the ways we can split the data into the three sets (train/dev/test)? What is the advantage and disadvantage of each?
(Training and testing on different distributions)
2:43, 9:08
1) One way is to shuffle all the data together and use, for example, 2,500 images for the dev set, 2,500 for the test set, and the rest for the training set.
- The advantage is that all three sets come from the same distribution and are therefore easier to manage
- The disadvantage, which is a huge one, is that most of the data in the dev/test sets would then come from the web, not from users (which is what we ultimately care about)
So this way of splitting is NOT recommended.
2) The recommended option is to use all 200,000 web images plus 5,000 of the user images for the training set, and split the remaining 5,000 user images between the dev and test sets. This way the training distribution differs from the dev/test distribution, but the dev/test sets come from the distribution we care about (user images). A minimal split sketch follows.
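A minimal sketch of the recommended split (option 2), assuming the images are represented as lists of file names; the names below are placeholders.

```python
# Minimal sketch of the recommended split (option 2).
# The file names are placeholders; in practice these lists would come from disk.
import random

web_images  = [f"web_{i}.jpg" for i in range(200_000)]   # web-crawled pictures
user_images = [f"user_{i}.jpg" for i in range(10_000)]   # pictures taken by app users

random.seed(0)
random.shuffle(user_images)

# All 200,000 web images plus 5,000 user images form the training set;
# the remaining 5,000 user images (the distribution we care about) are
# split evenly between dev and test.
train_set = web_images + user_images[:5000]
dev_set   = user_images[5000:7500]
test_set  = user_images[7500:]
```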
What problem arises when estimating avoidable bias and variance if the training set and dev set have different distributions (the case of using web images plus some user images for training and the remaining user images for dev/test)?
How can we solve it?
(Bias and Variance with mismatched data distributions)
Part 1: 0:37, 1:51
Part 2: 2:19, 3:34
Part 1
Since the training set and dev set are not from the same distribution, we can’t cleanly read variance off the gap between training error and dev error, because when the model is used on the dev set it is being tested on a different data distribution than the one it was trained on.
Maybe the model actually generalizes fine, and the dev set is simply harder to do well on because its distribution is different.
So we don’t know how much of the dev set error is due to variance and how much is due to the difference in distributions
Part 2
We carve out a subset of the (shuffled) training data and call it the training-dev set. It has the same distribution as the training data, but the network is not trained on it.
Now, for error analysis, we look at the performance of the model on the training set, the training-dev set, and the dev set. This way we separate the variance problem from the distribution-difference problem.
If the gap between the training error and the training-dev error is small but the gap between the training-dev error and the dev error is large, the problem is data mismatch: the model hasn’t learned to do well on the different distribution.
If the gap between the training error and the training-dev error is large, we have a variance problem (the model does worse even on unseen data from the same distribution, so it has overfitted the training set). A worked example follows.
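Worked example (illustrative numbers): if training error is 1%, training-dev error is 9%, and dev error is 10%, the big jump is between training and training-dev (same distribution), so this is a variance problem. If training error is 1%, training-dev error is 1.5%, and dev error is 10%, the model generalizes fine within its own distribution and the big jump is between training-dev and dev, so this is a data mismatch problem.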
There are systematic solutions to the data mismatch problem (data mismatch: when, after error analysis, we realize that the majority of the error comes from the difference between the training set distribution and the dev/test set distribution). True/False
(Addressing data mismatch)
0:09
False, there’s no systematic way of dealing with data mismatch
What is a recommended way of dealing with data mismatch?
(Addressing data mismatch)
0:15
1) Carry out manual error analysis and try to understand the differences between the training set and the dev/test sets
NOTE: To avoid overfitting the test set, for error analysis you should technically only look at the dev set, not the test set
2) After finding the differences, try to add data similar to the dev/test sets to the training set (one way is to artificially synthesize the kinds of noise present in the dev/test data and add them to the training set; see the sketch below)
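A minimal sketch of artificial data synthesis for audio, assuming the clean recording and the background noise are already loaded as NumPy arrays at the same sample rate; the function name, variable names, and mixing gain are hypothetical.

```python
# Minimal sketch of synthesizing a "dev-like" noisy example from a clean one.
# Assumes both signals are float arrays in [-1, 1] at the same sample rate.
import numpy as np

def synthesize_noisy_example(clean: np.ndarray, noise: np.ndarray,
                             noise_gain: float = 0.3) -> np.ndarray:
    # Repeat or trim the noise so it covers the whole clean clip.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]
    # Additive mixing, clipped back to the valid float-audio range.
    return np.clip(clean + noise_gain * noise, -1.0, 1.0)

# Example with dummy signals (1 second of audio at 16 kHz).
clean = np.zeros(16_000, dtype=np.float32)
noise = 0.1 * np.random.randn(4_000).astype(np.float32)
noisy = synthesize_noisy_example(clean, noise)
```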
Let’s say we have conducted error analysis on audio data and found a data mismatch (different training and dev distributions) that comes from a certain noise present in the dev set but not in the training set. We synthesize one clip of that noise and add this single clip to every training example. What problem could that cause? What is a suggested way of tackling it?
(Addressing data mismatch)
4:51
The NN might overfit to that single synthesized noise clip and learn it too well (6:23)
It’s possible that using a large amount of unique noise clips instead can improve performance (6:48)