C4 Flashcards
exploding/vanishing gradients
the deeper the network, the more factors are multiplied when backpropagating the gradient: roughly 2×depth multiplications (a weight factor and an activation-derivative factor per layer)
products of many small numbers become very small (vanishing gradients), products of many big numbers become very big (exploding gradients)
solution: alternative activation functions
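A minimal NumPy sketch (an illustration added here, not part of the original notes) of how a product of roughly 2×depth factors either vanishes or explodes as depth grows:

```python
# Illustrative sketch, assuming NumPy only: multiply ~2*depth random
# per-layer factors and watch the product vanish or explode with depth.
import numpy as np

rng = np.random.default_rng(0)

for scale, label in [(0.5, "small factors -> vanishing"),
                     (1.5, "large factors -> exploding")]:
    for depth in (10, 50, 100):
        # roughly 2*depth multiplications: one weight factor and one
        # activation-derivative factor per layer
        factors = scale * rng.uniform(0.8, 1.2, size=2 * depth)
        print(f"{label}, depth={depth}: gradient magnitude ~ {np.prod(factors):.3e}")
```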
alternative activation functions
- Logistic sigmoid
- Tanh (hyperbolic tangent)
- Linear (identity)
- ReLU (Rectified Linear Unit)
- LReLU (Leaky Rectified Linear Unit)
- ELU (Exponential Linear Unit)
- SELU (Scaled Exponential Linear Unit)
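Minimal NumPy sketches of the listed activations (the SELU constants are the standard published values; everything else is a plain definition, not tied to any particular library):

```python
import numpy as np

def sigmoid(x):                  # logistic sigmoid: squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                     # hyperbolic tangent: squashes to (-1, 1)
    return np.tanh(x)

def identity(x):                 # linear identity: passes values through
    return x

def relu(x):                     # max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):   # small slope alpha for x < 0
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):           # smooth negative part: alpha * (exp(x) - 1)
    return np.where(x > 0, x, alpha * np.expm1(x))

def selu(x):                     # scaled ELU; lambda ~ 1.0507, alpha ~ 1.6733
    lam, alpha = 1.0507, 1.6733
    return lam * np.where(x > 0, x, alpha * np.expm1(x))
```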
batch normalization
When training a network with batches of data, the network “gets confused” because the statistical properties (mean, variance) of the activations vary from batch to batch
Idea 1: normalize each batch => subtract the mean and divide by the std deviation
Idea 2: scale and shift each normalized batch by learnable parameters gamma and beta, chosen to minimize the network loss (error) on the whole training set
Idea 3: the optimal gamma and beta are found with SGD (gradient descent), together with the other network weights
Batch Normalization allows higher learning rates, reducing the number of epochs needed; consequently, training converges much faster
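A sketch of the training-time forward pass implied by Ideas 1 to 3 (NumPy, hypothetical function name; the running statistics used at inference time are left out):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch_size, n_features) activations of one mini-batch
    mean = x.mean(axis=0)                     # Idea 1: per-feature batch mean
    var = x.var(axis=0)                       # ... and batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize the batch
    return gamma * x_hat + beta               # Ideas 2/3: learnable scale/shift

# toy usage: gamma and beta start at 1 and 0 and would be learned by SGD
x = np.random.randn(32, 4) * 5.0 + 3.0
gamma, beta = np.ones(4), np.zeros(4)
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))          # ~0 mean, ~1 std per feature
```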
advantages of Batch Normalization
- superior accuracy
- reduces risk of vanishing/exploding gradients
- much faster training than plain backpropagation without normalization
- allows for using “big learning rates” => fewer epochs needed for convergence
- allows for training much deeper networks
- acts as a regularizer: lower risk of overfitting
regularization
an additional mechanism added to the training process to prevent overfitting
L1 or L2 regularization: adds a penalty on too-large weight values to the loss function
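A short sketch of how such a penalty is added to the loss (lambda_reg and the function name are hypothetical, just for illustration):

```python
import numpy as np

def regularized_loss(data_loss, weights, lambda_reg=1e-3, kind="l2"):
    # L2: penalizes squared weight values; L1: penalizes absolute values
    if kind == "l2":
        penalty = lambda_reg * np.sum(weights ** 2)
    else:
        penalty = lambda_reg * np.sum(np.abs(weights))
    return data_loss + penalty

# toy usage: big weights inflate the loss, discouraging overfitting
w = np.array([0.1, -3.0, 2.5])
print(regularized_loss(data_loss=0.42, weights=w, kind="l2"))
```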