Lesson 5: Foundations of Neural Networks Flashcards

1
Q

What does the discriminative learning rate API look like?

A

Three options:
fit(1, a)
fit(1, slice(a))
fit(1, slice(a, b))

1) all (unfrozen) layers get the same learning rate
2) the last layer gets a; all other layers get a/3
3) the first layer group gets a, the last gets b, and the others are spread equally between the layer groups; by default you get three layer groups; the last layer group is what fastai adds

Why divide by 3? See the second half of the course, on batchnorm.
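
A minimal sketch of the three call forms, assuming an existing fastai learn object; the learning rates here are illustrative:

learn.fit(1, 1e-3)                # 1) one learning rate for all unfrozen layers
learn.fit(1, slice(1e-3))         # 2) last layer group: 1e-3; all earlier layers: 1e-3 / 3
learn.fit(1, slice(1e-5, 1e-3))   # 3) first group: 1e-5, last group: 1e-3, rest spread in between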

2
Q

What is an affine function?

A

Matrix multiplication, but more general: it means a linear function.
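
A minimal sketch (the shapes are illustrative): a matrix multiply plus an added constant, which is what a linear layer computes.

import torch

x = torch.randn(4, 3)    # batch of 4 inputs with 3 features
W = torch.randn(3, 2)    # weight matrix
b = torch.randn(2)       # bias
y = x @ W + b            # affine function: matrix multiply plus a constant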

3
Q

What is an embedding layer?

A

Multiplying a one-hot-encoded (OHE) matrix by the embedding matrix returns the embedding vectors for the right rows.

The prediction is the dot product of the two embedding vectors (e.g. user and movie).

An embedding is an array lookup.

A matrix multiply by an OHE matrix is the same as an array lookup on the vector.

The array lookup is much faster and less memory intensive.

Since multiplying by an OHE matrix is identical to an array lookup, we should always do the array lookup version.

You can pass in a bunch of ints and pretend they are one-hot encoded, and that's called an embedding.

Since it is the same as a matrix multiply, it fits into a neural net model.

It is a weight matrix whose rows correspond to particular int values of your input.

These underlying features are called latent features.
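
A minimal sketch in PyTorch (the sizes and ids are made up) showing that the one-hot matrix multiply and the array lookup give the same result:

import torch

n_items, n_factors = 5, 3
emb = torch.randn(n_items, n_factors)                        # embedding weight matrix; row i belongs to id i

ids = torch.tensor([0, 2, 2])                                # a bunch of ints, "pretend they are OHE"
one_hot = torch.nn.functional.one_hot(ids, n_items).float()

via_matmul = one_hot @ emb                                   # matrix multiply by the OHE matrix
via_lookup = emb[ids]                                        # array lookup: same answer, faster, less memory
assert torch.allclose(via_matmul, via_lookup)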

4
Q

How could you do image similarity?

A

Look at the activations at some layer, then run PCA on those activations to reduce dimensionality and compare images in that reduced space.
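
A minimal sketch, assuming acts is an (n_images, n_features) array of activations already pulled from some layer of a trained model (the random data here is a stand-in):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

acts = np.random.randn(100, 512)                    # stand-in for activations from a chosen layer
reduced = PCA(n_components=10).fit_transform(acts)  # reduce dimensionality of the activations
sims = cosine_similarity(reduced)                   # pairwise similarity in the reduced space
nearest_to_0 = sims[0].argsort()[::-1][1:6]         # the 5 images most similar to image 0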

5
Q

What is weight decay?

A

A type of regularization.

Traditionally, you are told you need fewer parameters. You were fed this lie because it is a convenient fiction: you want to make your function less complex. But why can't we have lots of parameters if many of them are small? You can. If in your head complexity is scored by the number of parameters, you are doing it wrong. More parameters = more curvy bits, more interactions, and real life is full of curvy bits. Regularization = let's use lots of parameters and penalize complexity.

In the loss function we are going to add the sum of the squares of the parameters. But there is a problem, because then the best loss could be to set all the params to zero. So we multiply the penalty by a number we choose. That number is called "wd": weight decay. Most of the time it should be 0.1. :-)
The default is 0.01 because, by default, we prefer to overfit rather than underfit. Every learner has a wd argument.
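
A minimal sketch of the penalty being added to the loss, with a made-up model and batch (in fastai you would just pass wd to the learner instead):

import torch

model = torch.nn.Linear(3, 1)                            # made-up tiny model
x, y = torch.randn(8, 3), torch.randn(8, 1)              # made-up batch
loss_func = torch.nn.MSELoss()

wd = 0.1
l2 = sum((p ** 2).sum() for p in model.parameters())     # sum of the squares of the parameters
loss = loss_func(model(x), y) + wd * l2                  # penalize complexity, not parameter count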

6
Q

What does a function ending in _ mean in PyTorch?

A

The calculation is done in place.
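
For example:

import torch

t = torch.ones(3)
t.add_(1)          # trailing underscore: t itself is modified (now tensor([2., 2., 2.]))
u = t.add(1)       # no underscore: t is unchanged and a new tensor is returned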

7
Q

How do you get the weight matrices out of a trained PyTorch model?

A

[p for p in model.parameters()]
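
For example, with a made-up model:

import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))   # made-up model
params = [p for p in model.parameters()]
print([p.shape for p in params])
# [torch.Size([5, 10]), torch.Size([5]), torch.Size([1, 5]), torch.Size([1])]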

8
Q

Why is it called weight decay?

A

In the loss function you are adding wd * sum(w^2); when you take dL/dw you get the original gradient plus 2 * wd * w, so the update step subtracts a multiple of the weights themselves, making them smaller: they decay.
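
A minimal sketch of one update step with made-up numbers, showing the weights shrinking toward zero:

import torch

lr, wd = 0.01, 0.1
w = torch.tensor([1.0, -2.0])
grad = torch.tensor([0.3, 0.1])          # hypothetical gradient of the unregularized loss

w_new = w - lr * (grad + 2 * wd * w)     # equals w * (1 - 2*lr*wd) - lr*grad: the weights decay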
