Lesson 5: Foundations of Neural Networks Flashcards
what does the discriminative learning rate API look like?
three options:
fit(1, a)
fit(1, slice(a))
fit(1, slice(a,b))
1) all (unfrozen) layers get the same learning rate
2) last layer group: a; all other layers: a/3
3) first layer group: a; last layer group: b; the other groups get learning rates spread equally in between. By default you get three layer groups; the last layer group is the part fastai adds on top of the pretrained model
why divide by 3? see the second half of the course, where batchnorm is covered
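A minimal sketch of the three calls, assuming a fastai Learner named learn (the learning rate values are just illustrative):

learn.fit(1, 1e-3)               # 1) one learning rate for all unfrozen layers
learn.fit(1, slice(1e-3))        # 2) last layer group: 1e-3; earlier groups: 1e-3/3
learn.fit(1, slice(1e-5, 1e-3))  # 3) first group: 1e-5, last group: 1e-3, rest spread in between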
what is an affine function?
like matrix multiplication but more general: a linear function plus a constant, e.g. y = x @ W + b
what is an embedding layer?
multiplying a one-hot-encoded (OHE) matrix by the embedding matrix returns the embedding vectors for the corresponding rows
the prediction is the dot product of the two embedding vectors (e.g. the user and movie vectors in collaborative filtering)
an embedding is an array lookup
a matrix multiply by an OHE matrix is the same as an array lookup into the weight matrix
the array lookup is much faster and less memory-intensive
multiplying by an OHE matrix is identical to an array lookup, so we should always do the array-lookup version (see the sketch after this card)
you can pass in a bunch of ints, treat them as if they were OHE vectors, and that's called an embedding
since it is the same as a matrix multiply, it fits into a NN model
an embedding is a weight matrix whose rows correspond to the integer values of your input
These underlying features are called latent features
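A minimal PyTorch sketch of the equivalence (sizes and values are made up):

import torch
import torch.nn.functional as F

n_items, n_factors = 10, 4
emb = torch.randn(n_items, n_factors)        # the embedding (weight) matrix

idx = torch.tensor([3, 7])                   # a batch of ints (e.g. movie ids)
one_hot = F.one_hot(idx, n_items).float()    # pretend the ints are OHE vectors

via_matmul = one_hot @ emb                   # matrix multiply by the OHE matrix
via_lookup = emb[idx]                        # plain array lookup on the same rows

assert torch.allclose(via_matmul, via_lookup)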
How could you do image similarity?
look at the activations at some layer, then run PCA on those activations to reduce dimensionality; images whose compressed activations are close to each other are similar
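A rough sketch, assuming acts is an (n_images, n_features) array of activations pulled from some layer of a trained model (the stand-in data and names are illustrative):

import numpy as np
from sklearn.decomposition import PCA

acts = np.random.randn(100, 512)             # stand-in for real activations

compressed = PCA(n_components=50).fit_transform(acts)

query = compressed[0]                        # compare image 0 to the rest
dists = np.linalg.norm(compressed - query, axis=1)
nearest = dists.argsort()[1:6]               # indices of the 5 most similar images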
what is weight decay?
a type of regularization.
traditionally, you were told you need fewer parameters. You were fed this lie because it is a convenient fiction: you want to make your function less complex. But why can't we have lots of params if many of them are small? You can. If in your head complexity is scored by the number of parameters, you are doing it wrong. More params = more curvy bits, more interactions, and real life is full of curvy bits. Regularization = let's use lots of parameters and penalize complexity.
in the loss function we are going to add the sum of the squares of the parameters. Problem: the best loss might then be to set all the params to zero. So we multiply that sum by a number we choose. That number is called "wd", for weight decay. Most of the time it should be 0.1. :-)
Default is 0.01 because, by default, we'd rather overfit than underfit. Every Learner has a wd argument.
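An illustrative sketch of the idea in plain PyTorch (model, loss_fn, x, y are assumed to exist; this is not fastai's internal code):

wd = 0.01
preds = model(x)
loss = loss_fn(preds, y)
loss = loss + wd * sum((p ** 2).sum() for p in model.parameters())  # penalize big weights
loss.backward()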
what does a function ending in _ mean in PyTorch?
the calculation is done in place: the tensor is modified rather than a new one being returned
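For example:

import torch

t = torch.zeros(3)
t.add_(1)        # trailing underscore: modifies t in place
u = t.add(1)     # no underscore: returns a new tensor, t is unchanged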
how do you get the weight matrices out of a trained PyTorch model?
[p for p in model.parameters()]
Why is it called weight decay?
In the loss function you are adding wd * sum(w^2); when you take dL/dw you get the original gradient plus 2 * wd * w, so the update step subtracts an extra lr * 2 * wd * w from the weights, making them a little smaller every step. The weights literally decay.
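Written out as a single SGD step (purely illustrative numbers):

import torch

w = torch.tensor([1.0, -2.0])          # current weights
grad = torch.tensor([0.5, 0.5])        # gradient of the original loss
lr, wd = 0.1, 0.01

# d/dw of (loss + wd * w**2) adds 2*wd*w to the gradient
w_new = w - lr * (grad + 2 * wd * w)
# equivalently: w_new = w - lr*grad - lr*2*wd*w; the last term shrinks w each step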