Lesson 6 - Learning & Optimization Flashcards
Which optimizations can you do prior to training?
- data augmentation
- input normalization
- Xavier/Glorot initialization of weights
Which optimizations can you do during training?
- Dropout
- Batch Normalization
Which optimizations can you do when computing the loss?
- training with weighted examples
- focal loss: training with examples of different complexity
- triplet loss: learning representations by comparison (see the sketch after this list)
- using multiple loss functions: MinMaxCAM
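A minimal sketch of the comparison idea behind triplet loss (the margin of 1.0 and the embedding sizes are illustrative assumptions, not values from the lesson):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the anchor towards the positive example and push it away
    from the negative one by at least `margin` (hinge on the gap)."""
    d_pos = F.pairwise_distance(anchor, positive)  # anchor-positive distance
    d_neg = F.pairwise_distance(anchor, negative)  # anchor-negative distance
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# toy embeddings: batch of 4, embedding size 8
a, p, n = (torch.randn(4, 8) for _ in range(3))
print(triplet_loss(a, p, n))
```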
How can we optimize the training procedure (while searching for the best solution)?
By using a variable learning rate (a learning rate schedule)
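A minimal sketch of one way to vary the learning rate, assuming a simple step-decay schedule (the step size and decay factor are illustrative choices):

```python
import torch

model = torch.nn.Linear(10, 2)  # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# halve the learning rate every 10 epochs (step decay)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... forward pass, loss, backward pass, optimizer.step() ...
    scheduler.step()  # move to the next point of the schedule
```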
What is input normalization?
It is an optimization applied prior to training (sketched below):
- subtract the mean image
- standardize the input (divide by the standard deviation)
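A minimal sketch of both steps, assuming the training images are stacked in a NumPy array:

```python
import numpy as np

train_x = np.random.rand(1000, 32, 32, 3)  # placeholder training images
mean = train_x.mean(axis=0)                # per-pixel mean image
std = train_x.std(axis=0) + 1e-8           # per-pixel std (avoid division by 0)

train_x = (train_x - mean) / std           # subtract the mean, divide by the std
# at test time, reuse the statistics computed on the training set:
# test_x = (test_x - mean) / std
```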
To what problem is the Xavier/Glorot initialization a solution?
When initializing the weights of the network, the common practice was to initialize randomly from a normal distribution.
The problem: the pre-activation z = sum_i w_i x_i has a large variance, var(z) = n x var(w) x var(x), which grows with the number of inputs n.
What was the Xavier/Glorot solution?
Make the weights smaller so that var(w) = 1/n, which keeps var(z) roughly constant from layer to layer.
Therefore
weight_i = weight_i x sqrt(1/n)
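A minimal sketch of the scaling for a fully connected layer with n inputs (the layer sizes are illustrative):

```python
import numpy as np

n_in, n_out = 512, 256
# draw from a standard normal, then rescale so that var(w) = 1/n_in,
# which keeps var(z) roughly constant from layer to layer
w = np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)
print(w.var())  # ~ 1/512
```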
What problem does using dropout tackle?
It decreases the dependence on any given feature (prevents units from co-adapting)
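A minimal sketch of (inverted) dropout at training time; the drop probability of 0.5 is an illustrative choice:

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    """Zero each unit with probability p and rescale the survivors by
    1/(1-p) so the expected activation stays the same."""
    if not training:
        return activations  # dropout is disabled at test time
    mask = (np.random.rand(*activations.shape) > p) / (1.0 - p)
    return activations * mask

h = np.random.rand(4, 8)  # toy activations: batch of 4, 8 units
print(dropout(h))
```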
What is batch normalization?
It is an optimization applied during training (see the sketch below).
-> normalize the internal activations using dataset statistics
-> with stochastic optimization, the dataset statistics are approximated by batch-level statistics
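A minimal sketch of the batch-level normalization at training time (gamma and beta stand for the learnable scale and shift; the running statistics used at test time are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature with the statistics of the current batch,
    then apply the learnable scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                   # batch mean per feature
    var = x.var(axis=0)                   # batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 16)               # batch of 32 examples, 16 features
out = batch_norm(x, gamma=np.ones(16), beta=np.zeros(16))
```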
What problem does batch normalization tackle?
During training, updates on weights at a later layer should take into account changes at earlier layers (internal covariate shift)
-> introduce changes in the distribution of internal activations
-> requires careful initialization and a small learning rate
What are the benefits of using batch normalization?
- Less sensitivity to initialization
- Allows using larger learning rates (faster training)
What could be a potential weakness of gradient descent as we have seen it so far? And how do we tackle it?
If 80% of the examples come from one class, the model mainly learns the important features of that class, because the weight updates are dominated by the majority of examples.
Tackle this by weighting the examples (e.g. giving the minority class a larger weight in the loss)
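A minimal sketch of class-weighted cross-entropy for the 80/20 case above; weighting by inverse class frequency is one possible choice:

```python
import torch

# give the minority class (20% of the examples) a larger weight
class_weights = torch.tensor([1.0 / 0.8, 1.0 / 0.2])
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)           # model outputs for a batch of 8
labels = torch.randint(0, 2, (8,))   # ground-truth classes
loss = criterion(logits, labels)     # errors on the minority class count more
```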
What does Focal Loss do?
- down-weights the loss from well-classified examples
- focuses training on a sparse set of hard examples (see the sketch below)
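A minimal sketch of the binary focal loss, FL(p_t) = -(1 - p_t)^gamma * log(p_t); gamma = 2 is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Down-weight easy examples: (1 - p_t)^gamma shrinks the loss of
    well-classified examples so training focuses on the hard ones."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                     # probability of the true class
    return ((1 - p_t) ** gamma * bce).mean()

logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(focal_loss(logits, targets))
```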
What problem does focal loss tackle?
When the dataset is balanced (e.g. 50-50) but one class has features that are more difficult to learn (more detail, smaller/finer structures)
Where could focal loss be useful?
- dense prediction tasks
- in the presence of outliers