Lesson 6 - Learning & Optimization Flashcards
Which optimizations can you do prior to training?
- data augmentation
- input normalization
- Xavier/Glorot initialization of weights
Which optimizations can you do during training?
- Dropout
- Batch Normalization
Which optimizations can you do when computing the loss?
- training with weighted examples
- focal loss: training with examples of different complexity
- triplet loss: learning representations by comparison
- using multiple loss functions: MinMaxCAM
How can we optimize the training procedure (while searching for the best solution)?
By having a variable learning rate
What is input normalization?
It is an optimization applied before training:
- subtract the mean image
- standardize the input by dividing by the standard deviation
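A minimal NumPy sketch of the idea (array shapes and variable names are illustrative, not from the lesson; the statistics are computed over the training set):

```python
import numpy as np

# X_train: training images, shape (N, H, W, C) -- placeholder data for illustration
X_train = np.random.rand(100, 32, 32, 3).astype(np.float32)

mean_image = X_train.mean(axis=0)        # per-pixel mean over the training set
std = X_train.std(axis=0) + 1e-8         # per-pixel std (epsilon avoids division by zero)

X_train_norm = (X_train - mean_image) / std   # zero-mean, unit-variance input
```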
To what problem is the Xavier/Glorot initialization a solution?
When initializing the weights of the network, the common practice was to initialize randomly from a normal distribution.
The problem: the variance of the pre-activations, var(z), grows with the number of inputs n, so activations can explode or saturate.
What was the Xavier/Glorot solution?
Make the weights smaller so that var(w) = 1/n, which keeps var(z) comparable to the variance of the input.
Therefore:
weight_i = weight_i x sqrt(1/n)
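A minimal PyTorch sketch of the scaling (layer sizes and variable names are illustrative assumptions; the built-in initializer uses the 2/(n_in + n_out) variant):

```python
import math
import torch

n_in, n_out = 512, 256   # fan-in and fan-out of a fully connected layer (illustrative)

# Naive initialization: standard-normal weights -> var(z) grows with n_in
W_naive = torch.randn(n_out, n_in)

# Xavier/Glorot-style scaling: var(w) = 1/n_in keeps var(z) close to var(x)
W_xavier = torch.randn(n_out, n_in) * math.sqrt(1.0 / n_in)

# PyTorch also ships a built-in initializer (uses 2 / (n_in + n_out) by default)
W_builtin = torch.empty(n_out, n_in)
torch.nn.init.xavier_normal_(W_builtin)
```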
What problem does using dropout tackle?
It decreases the dependence on any single feature: because neurons are randomly dropped during training, the network cannot rely on one feature always being present.
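A minimal PyTorch sketch of dropout in a small network (architecture and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Small MLP with dropout between layers
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations during training
    nn.Linear(256, 10),
)

x = torch.randn(4, 784)

model.train()
out_train = model(x)     # dropout active: random features are zeroed each forward pass

model.eval()
out_eval = model(x)      # dropout disabled at inference time
```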
What is batch normalization?
It is a during-training optimization technique.
-> normalize internal activations using dataset statistics
-> with stochastic optimization, batch-level statistics are used instead
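A minimal PyTorch sketch of batch normalization inside a convolutional block (layer sizes and batch shape are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Convolutional block with batch normalization
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),   # normalizes each channel using mini-batch statistics
    nn.ReLU(),
)

x = torch.randn(8, 3, 32, 32)   # a mini-batch of 8 RGB images
out = block(x)                  # activations normalized with batch mean/variance
```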
What problem does batch normalization tackle?
During training, weight updates at a later layer should take into account changes at earlier layers (internal covariate shift)
-> earlier updates change the distribution of internal activations
-> this requires careful initialization and a small learning rate
What are the benefits of using batch normalization?
- Less sensitivity to initialization
- Allows using larger learning rates (faster training)
What could be a potential problem/weakness with the gradient descent as how we have seen it so far? And how do we tackle it?
If 80% of the examples are from one class, the model will mainly learn the important features of that class, because the weight updates are dominated by the majority of examples.
Tackle this by weighting the examples (e.g., giving minority-class examples a larger weight), as in the sketch below.
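A minimal PyTorch sketch of class-weighted cross-entropy (the class counts and the inverse-frequency weighting scheme are illustrative assumptions, not from the lesson):

```python
import torch
import torch.nn.functional as F

# Assume class 0 has ~80% of the examples and class 1 only ~20% (illustrative counts)
class_counts = torch.tensor([800.0, 200.0])
class_weights = class_counts.sum() / (len(class_counts) * class_counts)  # rarer class -> larger weight

logits = torch.randn(16, 2)              # model outputs for a mini-batch
targets = torch.randint(0, 2, (16,))     # ground-truth labels

# Weighted cross-entropy: errors on the minority class contribute more to the update
loss = F.cross_entropy(logits, targets, weight=class_weights)
```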
What does Focal Loss do?
- down-weights the loss from well-classified examples
- focuses training on a sparse set of hard examples
What problem does focal loss tackle?
When the dataset is balanced (e.g., 50-50) but one class has features that are more difficult to learn (finer, more detailed patterns)
Where could focal loss be useful?
- dense prediction tasks
- in the presence of outliers
How does focal loss work?
It adds a modulating factor that increases the relative loss for examples that are harder to classify, forcing the model to focus its training on them.
The factor is based on the probability the model assigns to the correct label: the higher that probability, the more the loss is down-weighted. For very uncertain examples the loss stays high, so the model is pushed to improve on them (see the sketch below).
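A minimal PyTorch sketch of the focal loss idea, assuming the standard modulating factor (1 - p_t)^gamma on top of cross-entropy (gamma = 2 and the batch shapes are illustrative choices):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss sketch: down-weights well-classified examples by (1 - p_t)^gamma."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-example cross-entropy
    p_t = torch.exp(-ce)                                     # probability of the correct class
    return ((1.0 - p_t) ** gamma * ce).mean()                # easy examples (high p_t) contribute little

logits = torch.randn(16, 5)
targets = torch.randint(0, 5, (16,))
loss = focal_loss(logits, targets)
```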
What is Triplet Loss?
- given three examples: Anchor, Positive, Negative
- learn a representation such that distance(positive, anchor) < distance(negative, anchor)
How is triplet loss different?
With normal loss we compare prediction to ground truth (original label)
With triplet loss, we use three examples, and compare distance.
Anchor and positive should share the same class
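A minimal PyTorch sketch using the built-in triplet margin loss (embedding size, batch size, and margin are illustrative assumptions; in practice the embeddings come from the network being trained):

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)   # penalizes until d(anchor, pos) + margin < d(anchor, neg)

# Embeddings produced for anchor / positive / negative examples
anchor   = torch.randn(16, 128)
positive = torch.randn(16, 128)   # same class as the anchor
negative = torch.randn(16, 128)   # different class

loss = triplet(anchor, positive, negative)   # zero once negatives are pushed far enough away
```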
What is the idea behind using multiple loss functions?
Object localization
–> regularize a high-performing classifier to enable localization
Why would we opt to use a variable learning rate? (Annealing)
As training progresses, the steps taken might be too large to reach the optimum (when using a fixed learning rate)
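A minimal PyTorch sketch of learning-rate annealing with a step-decay schedule (the model, dummy data, and schedule parameters are illustrative assumptions):

```python
import torch

model = torch.nn.Linear(10, 2)                           # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step decay: multiply the learning rate by 0.1 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))   # dummy batch
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()   # shrink the learning rate as training progresses
```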
In self-supervised learning, we have the problem that data annotation is expensive. What could be a solution to this?
Supervise using labels generated from data (without manual annotation)