Lesson 1-3 Flashcards

1
Q

Is ML a black box?

A

No. Interpretable ML: visualize gradients and activations.

2
Q

Does deep learning need too much data?

A

No. Use transfer learning: share and reuse pre-trained networks.

3
Q

What does Union[…] mean in a function signature?

A

The argument can be any one of the listed types.
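For illustration, a minimal sketch with a hypothetical function whose size argument accepts either an int or a pair:

from typing import Tuple, Union

def to_size(size: Union[int, Tuple[int, int]]) -> Tuple[int, int]:
    # Union[int, Tuple[int, int]] means the caller may pass either one
    if isinstance(size, int):
        return (size, size)      # a single int means a square
    return size

print(to_size(224))         # (224, 224)
print(to_size((224, 320)))  # (224, 320)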

4
Q

Why do you need to make all the images the same shape and size?

A

Because for the GPU to work fast, it has to apply the same operations to a whole batch of images at once, so they all need the same shape and size!

5
Q

What size usually works?

A

A square with size=224 usually works :-)
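A hedged fastai v1-style sketch (the data path and folder layout are assumptions):

from fastai.vision import *

path = Path('data/images')   # hypothetical folder with one sub-folder per class
data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(), size=224)  # every image resized to 224x224
data.normalize(imagenet_stats)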

6
Q

What are the resnets?

A

Pre-trained models: ResNet-34 and ResNet-50 (the number is the layer count), i.e. different sizes; start with the smaller one. They were trained on ~1.5 million ImageNet pictures and come with pre-trained weights, so you start with a model that already knows how to recognize 1,000 categories.

7
Q

What is transfer learning?

A

Take a model that already does something well, and teach it to do your thing REALLY well. You can train with 1/100th or less of the data and time (maybe thousands of times less).

8
Q

What is one cycle learning?

A

Trains better and faster; from a recent paper (Leslie Smith's 1cycle policy; TODO: look this up).

9
Q

What does unfreeze do?

A

Without unfreeze, fitting only trains the final (newly added) layers, leaving the initial pre-trained layers untouched; this makes training very fast and helps avoid overfitting. Unfreeze means fit all the layers.

10
Q

Paper on Understanding CNN

A

Visualizing and Understanding Convolutional Networks; Matthew Zeiler and Rob Fergus

11
Q

Why run prod (“inference”) on CPU instead of GPU?

A

Because in production you are unlikely to be doing many, many predictions at once; with one item at a time the GPU's parallelism is wasted, and a CPU is simpler and cheaper.

12
Q

Why does train loss < valid loss not mean overfitting?

A

Training loss being lower than validation loss is expected and normal. As long as the (validation) error rate keeps improving as you train, YOU ARE NOT OVERFITTING.

13
Q

What does the doc function do?

A

A fastai function that shows the documentation for a function or class, with a link to the full HTML docs.
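Usage sketch (fastai v1, in a Jupyter notebook):

from fastai.vision import *

doc(ImageDataBunch.from_folder)  # shows the docstring plus a link to the full HTML docs and source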

14
Q

What do you do with unbalanced data?

A

Jeremy says: nothing, it always works (lol). However, you could try oversampling the underrepresented class.

15
Q

What does _ mean at the end of a PyTorch function?

A

The operation happens in place: it modifies the tensor itself instead of returning a new one.
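For example:

import torch

t = torch.ones(3)
t.add(1)     # returns a new tensor; t is unchanged
t.add_(1)    # trailing underscore: modifies t in place
print(t)     # tensor([2., 2., 2.])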

16
Q

How do you create a tensor of ones in PyTorch?

A

x = torch.ones(n,2)

17
Q

How do you make a column of a tensor random?

A

x[:,0].uniform_()

18
Q

What does gradient descent look like in PyTorch?

A

# x, y, mse(), and the initial a are defined in earlier cells of the lesson-2 SGD notebook
a = nn.Parameter(a); a

def update():
    y_hat = x@a
    loss = mse(y, y_hat)
    if t % 10 == 0: print(loss)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)
        a.grad.zero_()

lr = 1e-1
for t in range(100): update()

19
Q

What is the difference between gradient descent and SGD?

A

SGD is done on mini-batches: instead of computing the gradient on the entire dataset, we pick a batch of data at random each step (shuffled without replacement, so you still see all the images each epoch).
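A minimal sketch of the difference in plain PyTorch (the data, batch size, and learning rate are made up):

import torch

x, y = torch.randn(1000, 2), torch.randn(1000)
a = torch.randn(2, requires_grad=True)
lr, bs = 0.1, 64

for epoch in range(5):
    perm = torch.randperm(len(x))      # shuffle: sample without replacement, so every row is seen
    for i in range(0, len(x), bs):
        idx = perm[i:i + bs]           # one random mini-batch
        loss = ((x[idx] @ a - y[idx]) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            a -= lr * a.grad           # the SGD step uses only this mini-batch's gradient
            a.grad.zero_()

Plain gradient descent would instead compute the loss over all 1000 rows at every step.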

20
Q

How do we make sure we don’t overfit?

A

Not with parsimonious models! With regularization, and by using a validation set!

21
Q

How does Python typing work?

A

def greeting(name: str) -> str:
    return 'Hello ' + name

22
Q

What is CamVid?

A

A labeled street-scene dataset for segmentation: every pixel has a class label (segmentation masks).

23
Q

What do you do with the learning rate finding graph?

A

In the first stage of training, find the region where the curve has the steepest downward slope and pick your learning rate (or slice) there. When you unfreeze, you likely want the point where the loss starts sloping up, and then about 10x before that.
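A hedged fastai v1 sketch (assumes a learn object already exists; the values come from reading the plot):

learn.lr_find()                                    # briefly trains while increasing the learning rate
learn.recorder.plot()                              # look for the steepest downward-sloping region
learn.fit_one_cycle(4, max_lr=slice(1e-5, 1e-3))   # example range picked from the plot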

24
Q

What is unet?

A

A conv net that makes the input smaller and smaller, then the reverse path that upsamples back up, with cross connections between the matching resolutions. learn = Learner.create_unet(data, models.resnet34, metrics=metrics)

25
Q

What does fit_one_cycle do?

A

It varies the learning rate: it goes up and then comes back down. Why is this good? When you start far from the minimum you want to jump around a bit; the loss surface is not smooth, so you want to jump over the bumps. But when you get close to the minimum you want the learning rate to come down (learning rate annealing). The innovation here is that up front you want it to increase: gradually increasing the learning rate is a really good way to help the model explore the whole loss surface. So when you call learn.recorder.plot_losses() you will often see the loss go up for a while before it drops a lot!
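Sketch (fastai v1):

learn.fit_one_cycle(4, max_lr=1e-3)
learn.recorder.plot_lr()       # the learning rate ramps up and then anneals back down
learn.recorder.plot_losses()   # the loss may rise during the ramp-up before dropping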

26
Q

What is mixed precision learning?

A

Add .to_fp16() to your Learner…() call and it will train with half-precision floating point. This saves GPU RAM and should be faster. Making things less precise in deep learning (sometimes) makes them generalize better. You need the most recent CUDA drivers and a very recent GPU.
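Sketch (fastai v1; the create_cnn arguments follow the lesson notebooks):

learn = create_cnn(data, models.resnet34, metrics=error_rate).to_fp16()
learn.fit_one_cycle(4)   # trains in half precision: less GPU RAM, often faster on recent GPUs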

27
Q

How do you create an NLP classifier?

A

NLP classification (Lesson 3, around 1:41, after segmentation):

1) data = (TextSplitData.from_csv…)

2) learn = language_model_learner(…)

Jeremy's model in this video is SOTA for IMDB (ULMFiT).

28
Q

Do you need to use ngram in DL?

A

NO! Each token is just a word and the DL model figures it out. (TODO: what about misspellings?)

29
Q

How do you deal with 4-channel images (e.g., some satellite imagery) in a pretrained model?

A

You have to change the model :-) (e.g., adapt the first conv layer so it accepts 4 input channels).

30
Q

What is the universal approximation theorem?

A

The idea that a neural net can approximate ANY function to arbitrary precision (given enough parameters).
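A tiny PyTorch illustration: a one-hidden-layer net fitting an arbitrary wiggly function (the sizes and learning rate are arbitrary choices):

import torch
from torch import nn

x = torch.linspace(-3, 3, 200).unsqueeze(1)
y = torch.sin(2 * x)                 # the function we want to approximate

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(2000):
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())                   # small: the net has approximated sin(2x) on this interval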

31
Q

What is transfer learning in NLP?

A

1) Start with a pretrained model that was trained for something else: a "language model", i.e. a model that learns to predict the next word of a sentence. It was trained on the Wikitext-103 dataset, a subset of the largest Wikipedia articles, about 1bn tokens.

2) Use the pretrained model to make a model that predicts the next word in your own domain (no labels needed: "self-supervised" labels). This is fine-tuning: data_lm = … Use the text in the test set to train your LM as well; this is a good trick on Kaggle!

3) learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.3)

4) learn.lr_find()

5) learn.recorder.plot(skip_end=15)

6) learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

32
Q

How do you use DL on tabular data?

A

1) procs = [FillMissing, Categorify, Normalize]  # pre-processing

2) data = TabularList.from_df(df, path, cat_names=cat_names, cont_names=cont_names, procs=procs)

3) learn = get_tabular_learner(data, layers=[200,100])

4) learn.fit(1, 1e-2)

33
Q

What is collab filtering?

A

A three-column table: user id, movie/product id, rating.

Conceptually a very sparse matrix: users on the rows, movies on the columns, ratings as the values.

learn = get_collab_learner(ratings, n_factors=50, min_score=0, max_score=5)

learn.fit_one_cycle(5, 5e-3)

To predict, you give it a user id and a movie id and it predicts the score.

34
Q

What is the cold start problem?

A

You want to make good predictions for a brand-new user, and for a brand-new movie, but you have no data on them! You need a second model: a metadata-based model for new users or new movies.

35
Q

How does collab filtering work?

A

linear model

Matrix multiply: A @ B = pred_matrix

A is num_movies by N

B is N by num_users

pred_matrix is num_movies by num_users

Loss is MSE between this matrix and your given ratings data matrix

Minimize the loss with gradient descent

(NOTE: need bias for each user and each item, maybe an item is just naturally popular, etc.)

Map to sigmoid between max and min score.
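A minimal PyTorch sketch of that idea (the sizes, rating range, and class name are assumptions; you would train it by minimizing MSE against the known ratings):

import torch
from torch import nn

class DotProduct(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50, y_range=(0, 5.5)):
        super().__init__()
        self.u_emb = nn.Embedding(n_users, n_factors)   # one factor vector per user
        self.m_emb = nn.Embedding(n_movies, n_factors)  # one factor vector per movie
        self.u_bias = nn.Embedding(n_users, 1)          # per-user bias
        self.m_bias = nn.Embedding(n_movies, 1)         # per-movie bias ("naturally popular")
        self.y_range = y_range

    def forward(self, user, movie):
        dot = (self.u_emb(user) * self.m_emb(movie)).sum(1)
        score = dot + self.u_bias(user).squeeze(1) + self.m_bias(movie).squeeze(1)
        lo, hi = self.y_range
        return torch.sigmoid(score) * (hi - lo) + lo    # map to the min/max rating range

model = DotProduct(n_users=100, n_movies=200)
preds = model(torch.tensor([3, 7]), torch.tensor([10, 42]))   # predicted ratings for two (user, movie) pairs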

36
Q

What is an nn.Embedding?

A

A matrix of weights that you can look up into (index into it like an array) and grab one vector out of.
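For example:

import torch
from torch import nn

emb = nn.Embedding(10, 4)       # a 10 x 4 weight matrix
vec = emb(torch.tensor([3]))    # look up row 3; result has shape (1, 4)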

37
Q

What is a Parameter in PyTorch?

A

Weight matrices are PyTorch Parameters

Biases are also Parameters

Activations are the numbers a layer computes (e.g., the result of a matrix multiply or a ReLU)

Anything that does a calculation is a layer

The last layer is likely a sigmoid rather than a ReLU, because you want the output squashed into the target range

The loss function compares the output to the target during training

38
Q

What is the one line description of gradient descent?

A

parameters -= learning_rate * parameters.grad

39
Q

How does fine-tuning work?

A

resnet last layer has 1000 columns

target vector is length 1000

So we throw that last layer away; create_cnn does this for us.

Instead it puts in two new weight matrices with a ReLU in between

The earlier the layer, the more likely you want its weights to stay the same

Freeze: fastai and PyTorch will NOT backprop the gradients into frozen layers

When you unfreeze, give different learning rates to each part of the model; earlier layers get smaller learning rates, because they are already pretty good

"Discriminative learning rates": pass a slice as the learning rate, e.g. fit(1, slice(…))

slice(1e-3): final layers get 1e-3, earlier layers get 1e-3/3

slice(1e-5, 1e-3): the final layers get 1e-3, earlier layer groups get rates spread between those two values

Zeiler and Fergus paper, Visualizing CNN
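A hedged fastai v1 sketch of the whole workflow:

learn = create_cnn(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(4)                              # head only: the pretrained layers stay frozen
learn.unfreeze()                                    # now gradients flow into every layer
learn.fit_one_cycle(2, max_lr=slice(1e-5, 1e-3))    # discriminative LRs: earliest layers get 1e-5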

40
Q

What is an affine function?

A

Just means a linear function; like matrix multiplication

If you are multiplying things together and adding them up, you have an affine function

41
Q

I want to do a matrix multiplication by a one-hot encoded matrix without ever having to create the OHE matrix, what is this?

A

an embedding!

embedding means look something up in an array

Therefore, an embedding can be a kind of layer

"embedding" = an array lookup which is mathematically equivalent to multiplying by a one-hot-encoded matrix
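A quick check of that equivalence:

import torch
from torch import nn
import torch.nn.functional as F

emb = nn.Embedding(5, 3)
idx = torch.tensor([2])

one_hot = F.one_hot(idx, num_classes=5).float()   # shape (1, 5)
via_matmul = one_hot @ emb.weight                 # (1, 5) @ (5, 3) -> (1, 3)
via_lookup = emb(idx)                             # plain array lookup

print(torch.allclose(via_matmul, via_lookup))     # True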

42
Q

Latent features?

A

Underlying features that emerge: hidden things that were there all along, which show up when we do gradient descent.

43
Q

What is the bias in an embedding?

A

an extra single number per user or per product in the embedding matrix