Week 5: Deep Discriminative Neural Networks Flashcards
Deep Neural Networks
Neural networks with more than 3 layers, typically many more than 3. Deep neural networks are better at learning a hierarchy of representations with increasingly high levels of abstraction. In recurrent neural networks, connections take the output of every hidden neuron and feed it back as additional input at the next iteration.
Vanishing Gradient Problem
When doing backpropagation, the error gradient tends to get smaller as it is propagated backward through the hidden layers. This means that neurons in earlier layers learn much more slowly than neurons in later layers.
Exploding Gradient Problem
If weights are initialised to, or learn, large values, then the gradient gets larger as we move back through the hidden layers. As such, neurons in earlier layers often make large, random changes in their weights. Later layers can’t learn due to the constantly changing output of earlier layers.
Activation Functions w/ Non-Vanishing Derivatives
These functions have derivatives that do not shrink towards zero as the input grows, which keeps error gradients from vanishing during backpropagation.
Backpropagation Stabilisation
To stabilise backpropagation, the following techniques are used:
- Activation functions with non-vanishing gradients.
- Better ways to initialise weights
- Adaptive variations of standard backpropagation
- Batch normalisation
- Skip connections
- Making sure datasets are very large and labelled
- Making sure there are large computational resources
Rectified Linear Unit (ReLU)
\varphi(net_j) = net_j if net_j \ge 0, 0 if net_j < 0
Leaky Rectified Linear Unit (LReLU)
\varphi(net_j) = net_j if net_j \ge 0, a \times net_j if net_j < 0, (0 < a < 1, with a usually being very close to 0)
Parametric Rectified Linear Unit (PReLU)
\varphi(net_j) = net_j if net_j \ge 0, a \times net_j if net_j < 0, a is a learned parameter
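As an illustration, a minimal NumPy sketch of the three rectifier variants (the function names and the example slope value 0.01 are illustrative choices, not taken from the flashcards):
```python
import numpy as np

def relu(net):
    # ReLU: pass positive inputs through unchanged, zero out negative inputs
    return np.where(net >= 0, net, 0.0)

def leaky_relu(net, a=0.01):
    # LReLU: small fixed slope a (0 < a < 1, close to 0) for negative inputs
    return np.where(net >= 0, net, a * net)

def prelu(net, a):
    # PReLU: same shape as LReLU, but a is a parameter learned during training
    return np.where(net >= 0, net, a * net)
```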
Xavier/Glorot Weight Initialisation
Activation Function: sigmoid/tanh
Given m inputs and n outputs, choose weights from a uniform distribution over the range
(- \sqrt{6/(m+n)}, \sqrt{6/(m+n)}), or from a normal distribution with mean = 0, std = \sqrt{2/(m+n)}
Kaiming/He Weight Initialisation
Activation Function: ReLU/ LReLU / PReLU
Given m inputs and n outputs, choose weights from a uniform distribution over the range
(- \sqrt{6/m}, \sqrt{6/m}), or from a normal distribution with mean = 0, std = \sqrt{2/m}
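A minimal NumPy sketch of the uniform variants of both schemes, assuming m is the number of inputs (fan-in) and n the number of outputs (fan-out) of a layer; the function names and layer sizes are illustrative:
```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(m, n):
    # Xavier/Glorot: uniform over (-sqrt(6/(m+n)), sqrt(6/(m+n))), for sigmoid/tanh layers
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

def kaiming_uniform(m, n):
    # Kaiming/He: uniform over (-sqrt(6/m), sqrt(6/m)), for ReLU/LReLU/PReLU layers
    limit = np.sqrt(6.0 / m)
    return rng.uniform(-limit, limit, size=(m, n))

W1 = xavier_uniform(784, 256)   # e.g. a 784-input, 256-output tanh layer
W2 = kaiming_uniform(256, 128)  # e.g. a 256-input, 128-output ReLU layer
```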
Adaptive Versions of Backpropagation
Regular backpropagation can fail in the following ways:
- Gradient too low leads to slow learning
- Gradient too large leads to failure to find optimal parameters
- Learning rate too low leads to many iterations for little performance gain.
- Learning rate too large leads to failure to find optimal parameters.
Introducing momentum allows gradient descent to climb out of local minima. Adaptive learning rates vary the learning rate during training: increase the learning rate while the cost is decreasing, and decrease it when the cost is increasing.
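A minimal sketch of these two ideas, classical momentum and a simple cost-driven learning-rate adjustment; the update rule, factor values, and names here are illustrative assumptions, not the specific algorithms named in the next cards:
```python
import numpy as np

def momentum_step(w, grad, velocity, lr, momentum=0.9):
    # Classical momentum: accumulate a velocity that remembers past gradients,
    # which can carry the weights out of shallow local minima.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def adapt_learning_rate(lr, cost, prev_cost, up=1.05, down=0.5):
    # Adaptive learning rate: grow lr while the cost is decreasing,
    # shrink it when the cost starts increasing.
    return lr * up if cost < prev_cost else lr * down
```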
Backpropagation Algorithms with Adaptive Learning
AdaGrad, RMSprop
Backpropagation Algorithms with Adaptive Learning and Momentum
ADAM, Nadam
Batch Normalisation
The output of each neuron is scaled so that it has a mean close to 0 and a standard deviation close to 1.
BN(x) = \beta + \gamma \frac{x - E(x)}{\sqrt{Var(x) + \epsilon}}
\beta, \gamma are parameters learnt through backpropagation
\epsilon is a constant used to prevent division-by-0 errors
Batch normalisation can be applied before or after an activation function. The result is that the inputs to the activation function are kept within the range where its gradient is non-zero.
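A minimal NumPy sketch of the BN formula above, applied per feature over a mini-batch; the batch size, feature count, and parameter shapes are assumptions for illustration:
```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features) activations for one mini-batch
    mean = x.mean(axis=0)                  # E(x) per feature
    var = x.var(axis=0)                    # Var(x) per feature
    x_hat = (x - mean) / np.sqrt(var + eps)
    return beta + gamma * x_hat            # scale/shift with learnt gamma, beta

x = np.random.randn(32, 64) * 3.0 + 5.0    # activations with mean ~5, std ~3
y = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))
# y now has per-feature mean close to 0 and standard deviation close to 1
```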
Skip Connections
These are connections that skip 1 or more layers of the network; networks that use them are also called residual networks. Skip connections let gradients bypass parts of the network where they have vanished. The network effectively becomes shallower, but this may only be temporary.
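A minimal sketch of a residual (skip) connection, assuming some layer function f whose output has the same shape as its input; the identity path lets gradients flow around f during backpropagation:
```python
import numpy as np

def residual_block(x, f):
    # Skip connection: the block's output is the layer's output plus its input,
    # so gradients can flow through the identity path even if f's gradient vanishes.
    return f(x) + x

# Example: f is a single ReLU layer with square weight matrix W so shapes match
W = np.random.randn(64, 64) * np.sqrt(2.0 / 64)   # He-style initialisation
f = lambda x: np.maximum(x @ W, 0.0)
y = residual_block(np.random.randn(8, 64), f)
```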