Week 2 Flashcards
Let’s say we have a cat classifier.
Write an error analysis procedure that lets you know whether focusing on a particular category of mislabeled examples (e.g., dogs classified as cats) to improve performance is worth the effort.
(Carrying out Error Analysis)
1:42
1) Get about 100 mislabeled dev set examples
2) Examine them manually and count how many of the mislabeled examples are actually dogs
3) Suppose only 5 of these mislabeled examples are actually pictures of dogs
This means that even if the classifier is improved so that it never mistakes a dog for a cat, you only remove 5% of the errors, because on average you are fixing 5 out of every 100 mistakes (see the worked numbers below)
4) Now we can decide whether doing so is a good use of our time or not
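Worked numbers (illustrative, assuming the overall dev error is 10%): removing every dog mistake removes 5% of the errors, so the dev error drops at best from 10% to 9.5%. If instead 50 of the 100 examined mistakes had been dogs, the ceiling would be a drop from 10% to 5%, which would clearly be worth pursuing.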
What’s “ceiling of performance” in ML?
(Carrying out Error Analysis)
3:10
Given that we fix some category of error, the ceiling on performance is the best-case overall improvement we could get from doing so. E.g., in the cat classifier example, we saw that even if we eliminate all the dog mistakes, we attain at best a 5% reduction in the errors.
How do we evaluate multiple ideas in error analysis?
(Carrying out Error Analysis)
5:40
1) Create a table; the rows, listed down the left side, are the images you plan to look at manually
So the rows go from 1 to 100 if we have sampled 100 misrecognized examples from the dev set
2) Columns of this table are the ideas you evaluate
3) You tick the ideas present in each picture
4) Sum up the ticks in each column to get a sense of which ideas are worth pursuing (a minimal tally sketch follows)
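A minimal sketch in Python of summing up such a table; the idea columns and the per-image ticks below are made up for illustration.

```python
# Minimal sketch of summing up an error analysis table.
# The idea columns and per-image annotations are illustrative only.
from collections import Counter

ideas = ["dog", "great_cat", "blurry", "incorrectly_labeled"]

# One dict per manually examined misclassified dev example; tick the ideas that apply.
annotations = [
    {"dog": True, "blurry": True},
    {"great_cat": True},
    {"blurry": True, "incorrectly_labeled": True},
    # ... roughly 100 examples in total
]

counts = Counter()
for example in annotations:
    for idea, ticked in example.items():
        if ticked:
            counts[idea] += 1

total = len(annotations)
for idea in ideas:
    print(f"{idea}: {counts[idea]}/{total} ({100 * counts[idea] / total:.0f}%)")
```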
What’s the difference between a mislabeled example and an incorrectly labeled example?
(Cleaning up incorrectly labeled data)
0:35
If the algorithm outputs the wrong label, the example is mislabeled
If the example is labeled wrongly in the dataset (the ground-truth label Y is wrong), it’s incorrectly labeled
Deep learning algorithms are robust to random errors in the training set, but less so to systematic errors. True/False
(Cleaning up incorrectly labeled data)
1:21, 2:21
True
If the errors are reasonably random, then it’s probably ok to leave them as they are
What should we do if we’re worried about the impact of incorrectly labeled examples in the dev or test sets?
(Cleaning up incorrectly labeled data)
2:54
We can add a column for incorrectly labeled examples to the error analysis table, and count up the number of examples where the label Y was incorrect (illustrative numbers below)
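Illustrative numbers: if the overall dev error is 10% and 6% of the examined mistakes are due to incorrect labels, then incorrect labels account for only about 0.6% of the overall error; whether correcting them is worthwhile depends on how that compares with the other error categories in the table.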
What are the principles to consider if we decide to manually re-examine the labels and correct the incorrectly labeled ones in dev/test sets?
(Cleaning up incorrectly labeled data)
8:22, 10:34
1) Apply the same process to both dev and test sets (to make sure they still come from the same distribution)
2) Consider looking at the examples your algorithm got right as well as the ones it got wrong (there could be incorrectly labeled examples that the algorithm only “got right” because the label itself is wrong; if we don’t examine those, our error estimate ends up biased 9:34)
3) Since the training set has most of the data, it usually isn’t practical to go through it and check for incorrectly labeled data. After correcting dev/test but not training, the training set and the dev/test sets may therefore come from slightly different distributions, but that’s actually OK, because learning algorithms are quite robust to differences between the training distribution and the dev/test distribution
It’s super important, though, that the dev and test sets come from the same distribution
Why isn’t examining the correctness of labels in the dev/test sets for examples the algorithm got right always done?
(Cleaning up incorrectly labeled data)
9:59
It’s not easy to do, especially if your classifier has high accuracy, e.g. 98%: then 98% of the data is classified correctly, so there is far more correctly classified data to look through. It takes a lot of time, so it’s not something that’s always done, but it’s something to consider
What does “Build your first system quickly then iterate” mean?
(Build your first system quickly then iterate)
1:54
1) Set up dev/test sets and a metric, and build an initial ML system quickly
2) Use avoidable bias / variance analysis and error analysis to prioritize the next steps
3) Iterate
Where does “Build your first system quickly then iterate” apply less strongly?
(Build your first system quickly then iterate)
3:37
1) If you’re experienced in the area you’re working in
2) If there’s a significant body of academic literature you can draw on for the problem you’re solving
Let’s say we want to build a cat classifier.
We have 10,000 pictures taken by users and 200,000 pictures from the web.
Pictures taken by users are usually not framed well and are in general different from the pictures on the web.
What we care about is for the app to do well on user pictures.
What are the ways we can split the data into the three sets (train/dev/test)? What is the advantage and disadvantage of each?
(Training and testing on different distributions)
2:43, 9:08
1) One way is to shuffle all the data together and use, for example, 2,500 images for the dev set, 2,500 for the test set, and the rest for the training set.
- The advantage is that all three sets come from the same distribution and are therefore easier to manage
- The disadvantage, which is a huge one, is that most of the data in the dev/test sets would then come from the web, not from users (which is what we ultimately care about)
So this way of splitting is NOT recommended.
2) The recommended option is to use all 200,000 web images plus 5,000 of the user images for the training set, and split the remaining 5,000 user images between the dev and test sets. This way the training distribution differs from the dev/test distribution, but the dev/test sets come from the distribution we care about (user images). A minimal split sketch follows.
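A minimal sketch of the recommended split (option 2), assuming the images are represented as lists of file names; the names below are placeholders.

```python
# Minimal sketch of the recommended split (option 2).
# The file names are placeholders; in practice these lists would come from disk.
import random

web_images  = [f"web_{i}.jpg" for i in range(200_000)]   # web-crawled pictures
user_images = [f"user_{i}.jpg" for i in range(10_000)]   # pictures taken by app users

random.seed(0)
random.shuffle(user_images)

# All 200,000 web images plus 5,000 user images form the training set;
# the remaining 5,000 user images (the distribution we care about) are
# split evenly between dev and test.
train_set = web_images + user_images[:5000]
dev_set   = user_images[5000:7500]
test_set  = user_images[7500:]
```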
What problem arises when estimating avoidable bias and variance if the training set and dev set have different distributions (the case of using web images plus some user images for training and the remaining user images for dev/test)?
How can we solve it?
(Bias and Variance with mismatched data distributions)
Part 1: 0:37, 1:51
Part 2: 2:19, 3:34
Part 1
Since the training set and dev set are not from the same distribution, we can’t cleanly read variance off the gap between training error and dev error, because when the model is used on the dev set it is being tested on a different data distribution than the one it was trained on.
Maybe the model actually generalizes fine, and the dev set is simply harder to do well on because its distribution is different.
So we don’t know how much of the dev set error is due to variance and how much is due to the difference in distributions
Part 2
We carve out a subset of the (shuffled) training data and call it the training-dev set. It has the same distribution as the training data, but the network is not trained on it.
Now, for error analysis, we look at the performance of the model on the training set, the training-dev set, and the dev set. This way we separate the variance problem from the distribution-difference problem.
If the gap between the training error and the training-dev error is small but the gap between the training-dev error and the dev error is large, the problem is data mismatch: the model hasn’t learned to do well on the different distribution.
If the gap between the training error and the training-dev error is large, we have a variance problem (the model does worse even on unseen data from the same distribution, so it has overfitted the training set). A worked example follows.
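Worked example (illustrative numbers): if training error is 1%, training-dev error is 9%, and dev error is 10%, the big jump is between training and training-dev (same distribution), so this is a variance problem. If training error is 1%, training-dev error is 1.5%, and dev error is 10%, the model generalizes fine within its own distribution and the big jump is between training-dev and dev, so this is a data mismatch problem.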
There are systematic solutions to the data mismatch problem (data mismatch: when, after error analysis, we realize that the majority of the error comes from the difference between the training set distribution and the dev/test set distribution). True/False
(Addressing data mismatch)
0:09
False, there’s no systematic way of dealing with data mismatch
What is a recommended way of dealing with data mismatch?
(Addressing data mismatch)
0:15
1) Carry out manual error analysis and try to understand the differences between the training set and the dev/test sets
NOTE: To avoid overfitting the test set, for error analysis you should technically only look at the dev set, not the test set
2) After finding the differences, try to add data similar to the dev/test sets to the training set (one way is to artificially synthesize the kinds of noise present in the dev/test data and add them to the training set; see the sketch below)
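A minimal sketch of artificial data synthesis for audio, assuming the clean recording and the background noise are already loaded as NumPy arrays at the same sample rate; the function name, variable names, and mixing gain are hypothetical.

```python
# Minimal sketch of synthesizing a "dev-like" noisy example from a clean one.
# Assumes both signals are float arrays in [-1, 1] at the same sample rate.
import numpy as np

def synthesize_noisy_example(clean: np.ndarray, noise: np.ndarray,
                             noise_gain: float = 0.3) -> np.ndarray:
    # Repeat or trim the noise so it covers the whole clean clip.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]
    # Additive mixing, clipped back to the valid float-audio range.
    return np.clip(clean + noise_gain * noise, -1.0, 1.0)

# Example with dummy signals (1 second of audio at 16 kHz).
clean = np.zeros(16_000, dtype=np.float32)
noise = 0.1 * np.random.randn(4_000).astype(np.float32)
noisy = synthesize_noisy_example(clean, noise)
```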
Let’s say we have conducted error analysis on audio data and found a data mismatch (different training and dev distributions) that comes from a certain noise present in the dev set but not in the training set. We synthesize one clip of that noise and add this single clip to every training example. What problem could that cause? What is a suggested way of tackling it?
(Addressing data mismatch)
4:51
The NN might overfit to that single synthesized noise clip and learn it too well (6:23)
It’s possible that using a large amount of unique noise clips instead can improve performance (6:48)