Lesson 1-3 Flashcards

1
Q

Is ML a black box?

A

No. Interpretable ML: visualize gradients and activations.

2
Q

Does deep learning need too much data?

A

No. Use transfer learning: share and reuse pre-trained networks.

3
Q

What does Union[…] mean in a function signature?

A

The argument can be any one of the listed types.
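For illustration, a minimal sketch with a hypothetical function whose size argument accepts either an int or a pair:

from typing import Tuple, Union

def to_size(size: Union[int, Tuple[int, int]]) -> Tuple[int, int]:
    # Union[int, Tuple[int, int]] means the caller may pass either one
    if isinstance(size, int):
        return (size, size)      # a single int means a square
    return size

print(to_size(224))         # (224, 224)
print(to_size((224, 320)))  # (224, 320)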

4
Q

Why do you need to make all the images the same shape and size?

A

Because for the GPU to work fast, it has to apply the same operations to a whole batch of images at once, so they all need the same shape and size!

5
Q

What size usually works?

A

A square with size=224 usually works :-)
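A hedged fastai v1-style sketch (the data path and folder layout are assumptions):

from fastai.vision import *

path = Path('data/images')   # hypothetical folder with one sub-folder per class
data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(), size=224)  # every image resized to 224x224
data.normalize(imagenet_stats)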

6
Q

What are the resnets?

A

Pre-trained models: ResNet-34 and ResNet-50 (the number is the layer count), i.e. different sizes; start with the smaller one. They were trained on ~1.5 million ImageNet pictures and come with pre-trained weights, so you start with a model that already knows how to recognize 1,000 categories.

7
Q

What is transfer learning?

A

Take a model that already does something well, and teach it to do your thing REALLY well. You can train with 1/100th or less of the data and time (maybe thousands of times less).

8
Q

What is one cycle learning?

A

Trains better and faster; from a recent paper (Leslie Smith's 1cycle policy; TODO: look this up).

9
Q

What does unfreeze do?

A

Without unfreeze, fitting only trains the final (newly added) layers, leaving the initial pre-trained layers untouched; this makes training very fast and helps avoid overfitting. Unfreeze means fit all the layers.

10
Q

Paper on Understanding CNN

A

Visualizing and Understanding Convolutional Networks; Matthew Zeiler and Rob Fergus

11
Q

Why run prod (“inference”) on CPU instead of GPU?

A

Because in production you are unlikely to be doing many, many predictions at once; with one item at a time the GPU's parallelism is wasted, and a CPU is simpler and cheaper.

12
Q

Why does train loss < valid loss not mean overfitting?

A

Training loss being lower than validation loss is expected and normal. As long as the (validation) error rate keeps improving as you train, YOU ARE NOT OVERFITTING.

13
Q

What does the doc function do?

A

A fastai function that shows the documentation for a function or class, with a link to the full HTML docs.
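Usage sketch (fastai v1, in a Jupyter notebook):

from fastai.vision import *

doc(ImageDataBunch.from_folder)  # shows the docstring plus a link to the full HTML docs and source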

14
Q

What do you do with unbalanced data?

A

Jeremy says: nothing, it always works (lol). However, you could try oversampling the underrepresented class.

15
Q

What does _ mean at the end of a PyTorch function?

A

The operation happens in place: it modifies the tensor itself instead of returning a new one.
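For example:

import torch

t = torch.ones(3)
t.add(1)     # returns a new tensor; t is unchanged
t.add_(1)    # trailing underscore: modifies t in place
print(t)     # tensor([2., 2., 2.])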

16
Q

How do you create a tensor of ones in PyTorch?

A

x = torch.ones(n,2)

17
Q

How do you make a column of a tensor random?

A

x[:,0].uniform_()

18
Q

What does gradient descent look like in PyTorch?

A

# x, y, mse(), and the initial a are defined in earlier cells of the lesson-2 SGD notebook
a = nn.Parameter(a); a

def update():
    y_hat = x@a
    loss = mse(y, y_hat)
    if t % 10 == 0: print(loss)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)
        a.grad.zero_()

lr = 1e-1
for t in range(100): update()

19
Q

What is the difference between gradient descent and SGD?

A

SGD is done on mini-batches: instead of computing the gradient on the entire dataset, we pick a batch of data at random each step (shuffled without replacement, so you still see all the images each epoch).
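A minimal sketch of the difference in plain PyTorch (the data, batch size, and learning rate are made up):

import torch

x, y = torch.randn(1000, 2), torch.randn(1000)
a = torch.randn(2, requires_grad=True)
lr, bs = 0.1, 64

for epoch in range(5):
    perm = torch.randperm(len(x))      # shuffle: sample without replacement, so every row is seen
    for i in range(0, len(x), bs):
        idx = perm[i:i + bs]           # one random mini-batch
        loss = ((x[idx] @ a - y[idx]) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            a -= lr * a.grad           # the SGD step uses only this mini-batch's gradient
            a.grad.zero_()

Plain gradient descent would instead compute the loss over all 1000 rows at every step.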

20
Q

How do we make sure we don’t overfit?

A

Not with parsimonious models! With regularization, and by using a validation set!

21
Q

How does Python typing work?

A

def greeting(name: str) -> str:
    return 'Hello ' + name

22
Q

What is CamVid?

A

A labeled street-scene dataset for segmentation: every pixel has a class label (segmentation masks).

23
Q

What do you do with the learning rate finding graph?

A

In the first stage of training, find the region where the curve has the steepest downward slope and pick your learning rate (or slice) there. When you unfreeze, you likely want the point where the loss starts sloping up, and then about 10x before that.
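A hedged fastai v1 sketch (assumes a learn object already exists; the values come from reading the plot):

learn.lr_find()                                    # briefly trains while increasing the learning rate
learn.recorder.plot()                              # look for the steepest downward-sloping region
learn.fit_one_cycle(4, max_lr=slice(1e-5, 1e-3))   # example range picked from the plot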

24
Q

What is unet?

A

A conv net that makes the input smaller and smaller, then the reverse path that upsamples back up, with cross connections between the matching resolutions. learn = Learner.create_unet(data, models.resnet34, metrics=metrics)

25
Q

What does fit_one_cycle do?

A

It varies the learning rate: it goes up and then comes back down. Why is this good? When you start far from the minimum you want to jump around a bit; the loss surface is not smooth, so you want to jump over the bumps. But when you get close to the minimum you want the learning rate to come down (learning rate annealing). The innovation here is that up front you want it to increase: gradually increasing the learning rate is a really good way to help the model explore the whole loss surface. So when you call learn.recorder.plot_losses() you will often see the loss go up for a while before it drops a lot!
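Sketch (fastai v1):

learn.fit_one_cycle(4, max_lr=1e-3)
learn.recorder.plot_lr()       # the learning rate ramps up and then anneals back down
learn.recorder.plot_losses()   # the loss may rise during the ramp-up before dropping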

26
Q

What is mixed precision learning?

A

Add .to_fp16() to your Learner…() call and it will train with half-precision floating point. This saves GPU RAM and should be faster. Making things less precise in deep learning (sometimes) makes them generalize better. You need the most recent CUDA drivers and a very recent GPU.
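Sketch (fastai v1; the create_cnn arguments follow the lesson notebooks):

learn = create_cnn(data, models.resnet34, metrics=error_rate).to_fp16()
learn.fit_one_cycle(4)   # trains in half precision: less GPU RAM, often faster on recent GPUs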

27
Q

How do you create an NLP classifier?

A

NLP classification (Lesson 3, around 1:41, after segmentation):

1) data = (TextSplitData.from_csv…)

2) learn = language_model_learner(…)

Jeremy's model in this video is SOTA for IMDB (ULMFiT).

28
Q

Do you need to use ngram in DL?

A

NO! Each token is just a word and the DL model figures it out. (TODO: what about misspellings?)

29
Q

How do you deal with 4-channel images (e.g., some satellite imagery) in a pretrained model?

A

You have to change the model :-) (e.g., adapt the first conv layer so it accepts 4 input channels).

30
Q

What is the universal approximation theorem?

A

The idea that a neural net can approximate ANY function to arbitrary precision (given enough parameters).
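A tiny PyTorch illustration: a one-hidden-layer net fitting an arbitrary wiggly function (the sizes and learning rate are arbitrary choices):

import torch
from torch import nn

x = torch.linspace(-3, 3, 200).unsqueeze(1)
y = torch.sin(2 * x)                 # the function we want to approximate

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(2000):
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())                   # small: the net has approximated sin(2x) on this interval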

31
Q

What is transfer learning in NLP?

A

1) Start with a pretrained model that was trained for something else: a "language model", i.e. a model that learns to predict the next word of a sentence. It was trained on the Wikitext-103 dataset, a subset of the largest Wikipedia articles, about 1bn tokens.

2) Use the pretrained model to make a model that predicts the next word in your own domain (no labels needed: "self-supervised" labels). This is fine-tuning: data_lm = … Use the text in the test set to train your LM as well; this is a good trick on Kaggle!

3) learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.3)

4) learn.lr_find()

5) learn.recorder.plot(skip_end=15)

6) learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

32
Q

How do you use DL on tabular data?

A

1) procs = [FillMissing, Categorify, Normalize]  # pre-processing

2) data = TabularList.from_df(df, path, cat_names=cat_names, cont_names=cont_names, procs=procs)

3) learn = get_tabular_learner(data, layers=[200,100])

4) learn.fit(1, 1e-2)

33
Q

What is collab filtering?

A

A three-column table: user id, movie/product id, rating.

Conceptually a very sparse matrix: users on the rows, movies on the columns, ratings as the values.

learn = get_collab_learner(ratings, n_factors=50, min_score=0, max_score=5)

learn.fit_one_cycle(5, 5e-3)

To predict, you give it a user id and a movie id and it predicts the score.

34
Q

What is the cold start problem?

A

You want to make good predictions for a brand-new user, and for a brand-new movie, but you have no data on them! You need a second model: a metadata-based model for new users or new movies.

35
Q

How does collab filtering work?

A

linear model

Matrix multiply: A @ B = pred_matrix

A is num_movies by N

B is N by num_users

pred_matrix is num_movies by num_users

Loss is MSE between this matrix and your given ratings data matrix

Minimize the loss with gradient descent

(NOTE: need bias for each user and each item, maybe an item is just naturally popular, etc.)

Map to sigmoid between max and min score.
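A minimal PyTorch sketch of that idea (the sizes, rating range, and class name are assumptions; you would train it by minimizing MSE against the known ratings):

import torch
from torch import nn

class DotProduct(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50, y_range=(0, 5.5)):
        super().__init__()
        self.u_emb = nn.Embedding(n_users, n_factors)   # one factor vector per user
        self.m_emb = nn.Embedding(n_movies, n_factors)  # one factor vector per movie
        self.u_bias = nn.Embedding(n_users, 1)          # per-user bias
        self.m_bias = nn.Embedding(n_movies, 1)         # per-movie bias ("naturally popular")
        self.y_range = y_range

    def forward(self, user, movie):
        dot = (self.u_emb(user) * self.m_emb(movie)).sum(1)
        score = dot + self.u_bias(user).squeeze(1) + self.m_bias(movie).squeeze(1)
        lo, hi = self.y_range
        return torch.sigmoid(score) * (hi - lo) + lo    # map to the min/max rating range

model = DotProduct(n_users=100, n_movies=200)
preds = model(torch.tensor([3, 7]), torch.tensor([10, 42]))   # predicted ratings for two (user, movie) pairs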

36
Q

What is an nn.Embedding?

A

A matrix of weights that you can look up into (index into it like an array) and grab one vector out of.
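For example:

import torch
from torch import nn

emb = nn.Embedding(10, 4)       # a 10 x 4 weight matrix
vec = emb(torch.tensor([3]))    # look up row 3; result has shape (1, 4)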

37
Q

What is a Parameter in PyTorch?

A

Weight matrices are PyTorch Parameters

Biases are also Parameters

Activations are the numbers a layer computes (e.g., the result of a matrix multiply or a ReLU)

Anything that does a calculation is a layer

The last layer is likely a sigmoid rather than a ReLU, because you want the output squashed into the target range

The loss function compares the output to the target during training

38
Q

What is the one line description of gradient descent?

A

parameters -= learning_rate * parameters.grad

39
Q

How does fine-tuning work?

A

resnet last layer has 1000 columns

target vector is length 1000

So we throw that last layer away; create_cnn does this for us.

Instead it puts in two new weight matrices with a ReLU in between

The earlier the layer, the more likely you want its weights to stay the same

Freeze: fastai and PyTorch will NOT backprop the gradients into frozen layers

When you unfreeze, give different learning rates to each part of the model; earlier layers get smaller learning rates, because they are already pretty good

"Discriminative learning rates": pass a slice as the learning rate, e.g. fit(1, slice(…))

slice(1e-3): final layers get 1e-3, earlier layers get 1e-3/3

slice(1e-5, 1e-3): the final layers get 1e-3, earlier layer groups get rates spread between those two values

Zeiler and Fergus paper, Visualizing CNN
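A hedged fastai v1 sketch of the whole workflow:

learn = create_cnn(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(4)                              # head only: the pretrained layers stay frozen
learn.unfreeze()                                    # now gradients flow into every layer
learn.fit_one_cycle(2, max_lr=slice(1e-5, 1e-3))    # discriminative LRs: earliest layers get 1e-5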

40
Q

What is an affine function?

A

Just means a linear function; like matrix multiplication

If you are multiplying things together and adding them up, you have an affine function

41
Q

I want to do a matrix multiplication by a one-hot encoded matrix without ever having to create the OHE matrix, what is this?

A

an embedding!

embedding means look something up in an array

Therefore, an embedding can be a kind of layer

"embedding" = an array lookup which is mathematically equivalent to multiplying by a one-hot-encoded matrix
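A quick check of that equivalence:

import torch
from torch import nn
import torch.nn.functional as F

emb = nn.Embedding(5, 3)
idx = torch.tensor([2])

one_hot = F.one_hot(idx, num_classes=5).float()   # shape (1, 5)
via_matmul = one_hot @ emb.weight                 # (1, 5) @ (5, 3) -> (1, 3)
via_lookup = emb(idx)                             # plain array lookup

print(torch.allclose(via_matmul, via_lookup))     # True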

42
Q

Latent features?

A

Underlying features that emerge: hidden things that were there all along, which show up when we do gradient descent.

43
Q

What is the bias in an embedding?

A

an extra single number per user or per product in the embedding matrix