ANN Lecture 6 - Training and Enhancing ANNs Flashcards
Parts of the ANN architecture
- Number of layers (depth)
- Kind of layers (convolution, fully-connected, etc.)
- Neurons/Kernels per layer (width)
With an infinite amount of computational power you could use machine learning to find these parameters as well.
The wider and the deeper a Network, the better?
Obviously this means more parameters, which slows down training.
Deeper networks are difficult to train.
Wider networks can lead to overfitting.
Vanishing Gradient
TanH:
- If many neurons have high drives and thus small derivatives, the magnitude of the gradients decreases towards the earlier layers.
- If the gradients for the first layers are too small, the weights stay random. Therefore there is no learning!
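A minimal NumPy sketch of this effect (layer count, width, and weight scale are illustrative choices): the large random weights saturate tanh, so the local derivatives are tiny and the gradient magnitude collapses towards the earlier layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, width = 10, 100

# Forward pass through a deep tanh network with (intentionally large) random weights.
x = rng.normal(size=width)
weights, drives = [], []
for _ in range(n_layers):
    W = rng.normal(scale=1.0, size=(width, width))  # large scale -> high drives saturate tanh
    d = W @ x                                       # drive (pre-activation)
    weights.append(W)
    drives.append(d)
    x = np.tanh(d)

# Backward pass: each layer multiplies the gradient by the local tanh
# derivative (1 - tanh(d)^2), which is tiny for saturated (high) drives.
grad = np.ones(width)
for W, d in zip(reversed(weights), reversed(drives)):
    grad = W.T @ ((1 - np.tanh(d) ** 2) * grad)
    print(f"mean |gradient|: {np.abs(grad).mean():.3e}")  # shrinks towards earlier layers
```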
Exploding/Dying Gradients
ReLU:
- For each drive that is larger than zero the derivative is one. This can lead to very large gradients in the early layers and is called exploding gradients.
- -> Weights effectively stay random.
- If the drive of a neuron is below zero, the activation and the derivative are zero, thus the gradients are also zero (dying gradients).
- -> Weights will never change.
Solution: keep the drives/activations centered around zero.
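A small sketch of the ReLU behaviour behind both problems (the drive values are purely illustrative):

```python
import numpy as np

# ReLU activation and its derivative: 1 for positive drives, 0 for negative drives.
def relu(d):
    return np.maximum(0.0, d)

def relu_grad(d):
    return (d > 0).astype(float)

drives = np.array([-2.0, -0.5, 0.3, 4.0])
print(relu(drives))       # [0.  0.  0.3 4. ]
print(relu_grad(drives))  # [0. 0. 1. 1.]
# Negative drives pass no gradient at all -> those weights never change ("dying").
# Positive drives pass the gradient through undamped (derivative = 1), so the
# product over many layers can grow very large -> exploding gradients.
```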
How do we center the drive around zero?
Input Normalization:
(Input - Mean) / Standard Deviation
(For images: normalize each image on its own.)
Weights Initialization:
Random normal initialization with a variance that depends on the number of input neurons of the layer (fan-in), e.g. Xavier or He initialization.
Bias Initialization:
The bias can be initialized with zeros.
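A short NumPy sketch combining the three points of this card; the sizes and the 1/fan-in variance are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Input normalization: zero mean, unit standard deviation.
def normalize(x):
    return (x - x.mean()) / (x.std() + 1e-8)  # epsilon avoids division by zero

image = rng.uniform(0, 255, size=(32, 32))
image = normalize(image)  # for images: normalize each image on its own

# Weight initialization: random normal with variance depending on fan-in
# (here the 1/fan_in variant; Xavier/He use similar fan-in based scales).
fan_in, fan_out = 256, 128
W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

# Bias initialization: zeros are fine.
b = np.zeros(fan_out)

drive = image.reshape(-1)[:fan_in] @ W + b  # drives stay roughly centered around zero
print(drive.mean(), drive.std())
```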
Batch Normalization
- Calculate the mean and standard deviation over all drives in each mini-batch
Normalized_Drive_i = (Drive_i - Mean) / Standard_Deviation
New_Normalized_Drive_i = Scale * Normalized_Drive_i + Shift
(Scale initialized with 1’s and Shift with 0’s)
-> At test time there is no batch to normalize over, so running averages of mean and standard deviation from training have to be used, which complicates the implementation.
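A minimal NumPy sketch of batch normalization at training time (the running averages needed for testing are omitted; variable names are illustrative):

```python
import numpy as np

def batch_norm(drives, scale, shift, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    drives: array of shape (batch_size, num_features)
    """
    mean = drives.mean(axis=0)                  # per-feature mean over the batch
    std = drives.std(axis=0)                    # per-feature standard deviation
    normalized = (drives - mean) / (std + eps)  # zero mean, unit variance
    return scale * normalized + shift           # learnable rescaling

num_features = 4
scale = np.ones(num_features)   # Scale initialized with 1's
shift = np.zeros(num_features)  # Shift initialized with 0's

batch = np.random.default_rng(0).normal(5.0, 3.0, size=(8, num_features))
out = batch_norm(batch, scale, shift)
print(out.mean(axis=0), out.std(axis=0))  # roughly 0 and 1 per feature
```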
Data augmentation
A common problem is that the input data are either too few or too homogeneous. By augmenting the data you can produce artificial extra samples, which also brings more variance into your dataset.
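A possible sketch of simple image augmentations in NumPy (the specific transformations and parameters are illustrative; libraries such as torchvision provide ready-made versions):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return a randomly perturbed copy of an (H, W, C) image."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                         # random horizontal flip
    dy, dx = rng.integers(0, 5, size=2)
    image = np.roll(image, (dy, dx), axis=(0, 1))      # small random shift
    image = image * rng.uniform(0.8, 1.2)              # random brightness change
    return np.clip(image, 0.0, 1.0)

image = rng.uniform(0, 1, size=(32, 32, 3))
extra_samples = [augment(image) for _ in range(4)]     # artificial extra data
```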
Class imbalance
Example Skin cancer images:
(More images of no skin cancer than of cancer)
After training on random mini-batches from this data your network will classify each image as ‘no cancer’, because this gives high accuracy and low loss.
Solutions for Class imbalance
- Draw balanced mini-batches. (Can overfit on the rare classes.)
- Penalize wrong classifications of rare classes more strongly (class weighting).
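A hedged PyTorch sketch of the second solution, weighting the loss so that errors on the rare class are punished more strongly (class order and weight values are illustrative); for the first solution, torch.utils.data.WeightedRandomSampler can draw balanced mini-batches.

```python
import torch
import torch.nn as nn

# Suppose class 0 = "no cancer" (frequent) and class 1 = "cancer" (rare).
# Weight the rare class higher so misclassifying it costs more.
class_weights = torch.tensor([1.0, 10.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)             # model outputs for a mini-batch
labels = torch.randint(0, 2, (8,))     # ground-truth classes
loss = criterion(logits, labels)       # misclassified rare samples dominate the loss
```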
Training Parameters (Batch Size and Learning Rate)
If the batch size is large, the gradients show a clear direction to go → the learning rate can be quite large.
If the batch size is small, the gradients do not show a clear direction to go → we have to take careful small steps, i.e. the learning rate should be small.
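A tiny numerical illustration of why this holds (all numbers are made up): averaging more per-sample gradients makes the estimated direction less noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are per-sample gradient values; the "true" direction is +1.
per_sample_grads = rng.normal(1.0, 5.0, size=100_000)

for batch_size in (8, 512):
    batches = per_sample_grads[: 100 * batch_size].reshape(100, batch_size)
    noise = batches.mean(axis=1).std()   # spread of the mini-batch gradient estimates
    print(f"batch size {batch_size:4d}: gradient noise {noise:.3f}")
# Small batches -> noisy direction -> small, careful learning rate.
# Large batches -> clear direction -> a larger learning rate is safe.
```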
What does overfitting mean?
Overfitting means that the training error gets significantly lower than the validation error.
The model overfits the training data and does not generalize to unseen data.
What are solutions against overfitting?
- Early Stopping
- Transfer Learning
- L2 Regularization
- Dropout
Early Stopping
Stop training if you see that the validation accuracy decreases again.
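A schematic early-stopping loop; train_one_epoch, evaluate, and the patience value are placeholder assumptions, not part of the lecture:

```python
import random

def train_one_epoch(model):   # placeholder for a real training pass
    pass

def evaluate(model):          # placeholder: returns validation accuracy
    return random.random()

model = None
best_val_acc, patience, stale = 0.0, 5, 0

for epoch in range(100):
    train_one_epoch(model)
    val_acc = evaluate(model)
    if val_acc > best_val_acc:
        best_val_acc, stale = val_acc, 0
        best_model = model            # in practice: save a checkpoint of the weights
    else:
        stale += 1
        if stale >= patience:
            break                     # validation accuracy stopped improving -> stop early
```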
Transfer Learning
- Overfitting often occurs due to too little training data.
- Use a network that was already trained on enough data (e.g. ImageNet) and only retrain the last layers.
- This also works with models that we trained ourselves.
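A hedged sketch with recent torchvision versions: load an ImageNet-pretrained network, freeze its body, and retrain only a new last layer (model choice and number of classes are illustrative):

```python
import torch.nn as nn
from torchvision import models

# Load a network pretrained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all pretrained parameters so they are not retrained.
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer with a fresh one for our own task (e.g. 2 classes).
model.fc = nn.Linear(model.fc.in_features, 2)
# Train as usual; only model.fc receives gradient updates.
```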
L2 Regularization
- If the parameters adapt strongly to the training data, one can often observe that they get very large.
- Thus a common regularizer penalizes large weights by adding their squared magnitude to the loss.
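A minimal PyTorch sketch of this penalty (the lambda value is illustrative); the weight_decay argument of the optimizers achieves the same effect:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
l2_lambda = 1e-4                                   # regularization strength (illustrative)

inputs, labels = torch.randn(8, 10), torch.randint(0, 2, (8,))
data_loss = criterion(model(inputs), labels)

# Punish large weights: add the sum of squared parameters to the loss.
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = data_loss + l2_lambda * l2_penalty

# Equivalent shortcut: torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```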