Week 5: Deep Discriminative Neural Networks Flashcards
Deep Neural Networks
Neural networks with more than 3 layers, typically many more than 3. Deep neural networks are better at learning a hierarchy of representations with increasingly high levels of abstraction. In recurrent neural networks, connections take the output of every hidden neuron and feed it back as additional input at the next iteration.
Vanishing Gradient Problem
When doing backpropagation, the error gradient tends to get smaller as it is propagated backward through the hidden layers. This means that neurons in earlier layers learn much more slowly than neurons in later layers.
Exploding Gradient Problem
If weights are initialised to, or learn, large values, then the gradient gets larger as we move back through the hidden layers. As such, neurons in earlier layers often make large, random changes in their weights. Later layers can’t learn due to the constantly changing output of earlier layers.
Activation Functions w/ Non-Vanishing Derivatives
These functions have derivatives that do not shrink towards zero as the input grows, which keeps error gradients from vanishing during backpropagation.
Backpropagation Stabilisation
To stabilise backpropagation, the following techniques are used:
- Activation functions with non-vanishing gradients.
- Better ways to initialise weights
- Adaptive variations of standard backpropagation
- Batch normalisation
- Skip connections
- Making sure datasets are very large and labelled
- Making sure there are large computational resources
Rectified Linear Unit (ReLU)
\varphi(net_j) = net_j if net_j \ge 0, 0 if net_j < 0
Leaky Rectified Linear Unit (LReLU)
\varphi(net_j) = net_j if net_j \ge 0, a \times net_j if net_j < 0, (0 < a < 1, with a usually being very close to 0)
Parametric Rectified Linear Unit (PReLU)
\varphi(net_j) = net_j if net_j \ge 0, a \times net_j if net_j < 0, a is a learned parameter
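As an illustration, a minimal NumPy sketch of the three rectifier variants (the function names and the example slope value 0.01 are illustrative choices, not taken from the flashcards):
```python
import numpy as np

def relu(net):
    # ReLU: pass positive inputs through unchanged, zero out negative inputs
    return np.where(net >= 0, net, 0.0)

def leaky_relu(net, a=0.01):
    # LReLU: small fixed slope a (0 < a < 1, close to 0) for negative inputs
    return np.where(net >= 0, net, a * net)

def prelu(net, a):
    # PReLU: same shape as LReLU, but a is a parameter learned during training
    return np.where(net >= 0, net, a * net)
```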
Xavier/Glorot Weight Initialisation
Activation Function: sigmoid/tanh
Given m inputs and n outputs, choose weights from a uniform distribution over the range
(- \sqrt{6/(m+n)}, \sqrt{6/(m+n)}), or from a normal distribution with mean = 0, std = \sqrt{2/(m+n)}
Kaiming/He Weight Initialisation
Activation Function: ReLU/ LReLU / PReLU
Given m inputs and n outputs, choose weights from a uniform distribution over the range
(- \sqrt{6/m}, \sqrt{6/m}), or from a normal distribution with mean = 0, std = \sqrt{2/m}
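A minimal NumPy sketch of the uniform variants of both schemes, assuming m is the number of inputs (fan-in) and n the number of outputs (fan-out) of a layer; the function names and layer sizes are illustrative:
```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(m, n):
    # Xavier/Glorot: uniform over (-sqrt(6/(m+n)), sqrt(6/(m+n))), for sigmoid/tanh layers
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

def kaiming_uniform(m, n):
    # Kaiming/He: uniform over (-sqrt(6/m), sqrt(6/m)), for ReLU/LReLU/PReLU layers
    limit = np.sqrt(6.0 / m)
    return rng.uniform(-limit, limit, size=(m, n))

W1 = xavier_uniform(784, 256)   # e.g. a 784-input, 256-output tanh layer
W2 = kaiming_uniform(256, 128)  # e.g. a 256-input, 128-output ReLU layer
```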
Adaptive Versions of Backpropagation
Regular backpropagation can fail in the following ways:
- Gradient too low leads to slow learning
- Gradient too large leads to failure to find optimal parameters
- Learning rate too low leads to many iterations for little performance gain.
- Learning rate too large leads to failure to find optimal parameters.
Introducing momentum allows gradient descent to climb out of local minima. Adaptive learning rates vary the learning rate during training: increase the learning rate while the cost is decreasing, and decrease it when the cost is increasing.
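A minimal sketch of these two ideas, classical momentum and a simple cost-driven learning-rate adjustment; the update rule, factor values, and names here are illustrative assumptions, not the specific algorithms named in the next cards:
```python
import numpy as np

def momentum_step(w, grad, velocity, lr, momentum=0.9):
    # Classical momentum: accumulate a velocity that remembers past gradients,
    # which can carry the weights out of shallow local minima.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def adapt_learning_rate(lr, cost, prev_cost, up=1.05, down=0.5):
    # Adaptive learning rate: grow lr while the cost is decreasing,
    # shrink it when the cost starts increasing.
    return lr * up if cost < prev_cost else lr * down
```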
Backpropagation Algorithms with Adaptive Learning
AdaGrad, RMSprop
Backpropagation Algorithms with Adaptive Learning and Momentum
ADAM, Nadam
Batch Normalisation
The output of each neuron is scaled so that it has a mean close to 0 and a standard deviation close to 1.
BN(x) = \beta + \gamma \frac{x - E(x)}{\sqrt{Var(x) + \epsilon}}
\beta, \gamma are parameters learnt through backpropagation
\epsilon is a constant used to prevent division-by-0 errors
Batch normalisation can be applied before or after an activation function. The result is that the inputs to the activation function are kept within the range where its gradient is non-zero.
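A minimal NumPy sketch of the BN formula above, applied per feature over a mini-batch; the batch size, feature count, and parameter shapes are assumptions for illustration:
```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features) activations for one mini-batch
    mean = x.mean(axis=0)                  # E(x) per feature
    var = x.var(axis=0)                    # Var(x) per feature
    x_hat = (x - mean) / np.sqrt(var + eps)
    return beta + gamma * x_hat            # scale/shift with learnt gamma, beta

x = np.random.randn(32, 64) * 3.0 + 5.0    # activations with mean ~5, std ~3
y = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))
# y now has per-feature mean close to 0 and standard deviation close to 1
```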
Skip Connections
These are connections that skip 1 or more layers of the network; networks that use them are also called residual networks. Skip connections let gradients bypass parts of the network where they have vanished. The network effectively becomes shallower, but this may only be temporary.
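A minimal sketch of a residual (skip) connection, assuming some layer function f whose output has the same shape as its input; the identity path lets gradients flow around f during backpropagation:
```python
import numpy as np

def residual_block(x, f):
    # Skip connection: the block's output is the layer's output plus its input,
    # so gradients can flow through the identity path even if f's gradient vanishes.
    return f(x) + x

# Example: f is a single ReLU layer with square weight matrix W so shapes match
W = np.random.randn(64, 64) * np.sqrt(2.0 / 64)   # He-style initialisation
f = lambda x: np.maximum(x @ W, 0.0)
y = residual_block(np.random.randn(8, 64), f)
```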