ANN Lecture 6 - Training and Enhancing ANNs Flashcards
Parts of the ANN architecture
- Number of layers (depth)
- Kind of layers (convolution, fully-connected, etc.)
- Neurons/Kernels per layer (width)
With an infinite amount of computational power you could use machine learning to find these parameters as well.
The wider and the deeper a Network, the better?
Obviously this means more parameters, which slows down training.
Deeper networks are difficult to train.
Wider networks can lead to overfitting.
Vanishing Gradient
TanH:
- If many neurons have high drives and thus small derivatives, the magnitude of the gradients decreases towards the earlier layers.
- If the gradients for the first layers are too small, the weights stay random. Therefore there is no learning!
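A minimal NumPy sketch of this effect (layer count, width, and weight scale are illustrative choices): the large random weights saturate tanh, so the local derivatives are tiny and the gradient magnitude collapses towards the earlier layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, width = 10, 100

# Forward pass through a deep tanh network with (intentionally large) random weights.
x = rng.normal(size=width)
weights, drives = [], []
for _ in range(n_layers):
    W = rng.normal(scale=1.0, size=(width, width))  # large scale -> high drives saturate tanh
    d = W @ x                                       # drive (pre-activation)
    weights.append(W)
    drives.append(d)
    x = np.tanh(d)

# Backward pass: each layer multiplies the gradient by the local tanh
# derivative (1 - tanh(d)^2), which is tiny for saturated (high) drives.
grad = np.ones(width)
for W, d in zip(reversed(weights), reversed(drives)):
    grad = W.T @ ((1 - np.tanh(d) ** 2) * grad)
    print(f"mean |gradient|: {np.abs(grad).mean():.3e}")  # shrinks towards earlier layers
```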
Exploding/Dying Gradients
ReLU:
- For each drive that is larger than zero the derivative is one. This can lead to very large gradients in the early layers and is called exploding gradients.
- -> Weights effectively stay random.
- If the drive of a neuron is below zero, the activation and the derivative are zero, thus the gradients are also zero (dying gradients).
- -> Weights will never change.
Solution: keep the drives/activations centered around zero.
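A small sketch of the ReLU behaviour behind both problems (the drive values are purely illustrative):

```python
import numpy as np

# ReLU activation and its derivative: 1 for positive drives, 0 for negative drives.
def relu(d):
    return np.maximum(0.0, d)

def relu_grad(d):
    return (d > 0).astype(float)

drives = np.array([-2.0, -0.5, 0.3, 4.0])
print(relu(drives))       # [0.  0.  0.3 4. ]
print(relu_grad(drives))  # [0. 0. 1. 1.]
# Negative drives pass no gradient at all -> those weights never change ("dying").
# Positive drives pass the gradient through undamped (derivative = 1), so the
# product over many layers can grow very large -> exploding gradients.
```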
How do we center the drive around zero?
Input Normalization:
(Input - Mean) / Standard Deviation
(For images: normalize each image on its own.)
Weights Initialization:
Random normal initialization with a variance that depends on the number of input neurons of the layer (fan-in), e.g. Xavier or He initialization.
Bias Initialization:
The bias can be initialized with zeros.
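A short NumPy sketch combining the three points of this card; the sizes and the 1/fan-in variance are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Input normalization: zero mean, unit standard deviation.
def normalize(x):
    return (x - x.mean()) / (x.std() + 1e-8)  # epsilon avoids division by zero

image = rng.uniform(0, 255, size=(32, 32))
image = normalize(image)  # for images: normalize each image on its own

# Weight initialization: random normal with variance depending on fan-in
# (here the 1/fan_in variant; Xavier/He use similar fan-in based scales).
fan_in, fan_out = 256, 128
W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

# Bias initialization: zeros are fine.
b = np.zeros(fan_out)

drive = image.reshape(-1)[:fan_in] @ W + b  # drives stay roughly centered around zero
print(drive.mean(), drive.std())
```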
Batch Normalization
- Calculate the mean and standard deviation over all drives in each mini-batch
Normalized_Drive_i = (Drive_i - Mean) / Standard_Deviation
New_Normalized_Drive_i = Scale * Normalized_Drive_i + Shift
(Scale initialized with 1’s and Shift with 0’s)
-> At test time there is no batch to normalize over, so running averages of mean and standard deviation from training have to be used, which complicates the implementation.
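A minimal NumPy sketch of batch normalization at training time (the running averages needed for testing are omitted; variable names are illustrative):

```python
import numpy as np

def batch_norm(drives, scale, shift, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    drives: array of shape (batch_size, num_features)
    """
    mean = drives.mean(axis=0)                  # per-feature mean over the batch
    std = drives.std(axis=0)                    # per-feature standard deviation
    normalized = (drives - mean) / (std + eps)  # zero mean, unit variance
    return scale * normalized + shift           # learnable rescaling

num_features = 4
scale = np.ones(num_features)   # Scale initialized with 1's
shift = np.zeros(num_features)  # Shift initialized with 0's

batch = np.random.default_rng(0).normal(5.0, 3.0, size=(8, num_features))
out = batch_norm(batch, scale, shift)
print(out.mean(axis=0), out.std(axis=0))  # roughly 0 and 1 per feature
```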
Data augmentation
A common problem is that the input data are either too few or too homogeneous. By augmenting the data you can produce artificial extra samples, which also brings more variance into your dataset.
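A possible sketch of simple image augmentations in NumPy (the specific transformations and parameters are illustrative; libraries such as torchvision provide ready-made versions):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return a randomly perturbed copy of an (H, W, C) image."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                         # random horizontal flip
    dy, dx = rng.integers(0, 5, size=2)
    image = np.roll(image, (dy, dx), axis=(0, 1))      # small random shift
    image = image * rng.uniform(0.8, 1.2)              # random brightness change
    return np.clip(image, 0.0, 1.0)

image = rng.uniform(0, 1, size=(32, 32, 3))
extra_samples = [augment(image) for _ in range(4)]     # artificial extra data
```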
Class imbalance
Example Skin cancer images:
(More images of no skin cancer than of cancer)
After training on random mini-batches from this data your network will classify each image as ‘no cancer’, because this gives high accuracy and low loss.
Solutions for Class imbalance
- Draw balanced mini-batches. (Can overfit on the rare classes.)
- Penalize wrong classifications of rare classes more strongly (class weighting).
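A hedged PyTorch sketch of the second solution, weighting the loss so that errors on the rare class are punished more strongly (class order and weight values are illustrative); for the first solution, torch.utils.data.WeightedRandomSampler can draw balanced mini-batches.

```python
import torch
import torch.nn as nn

# Suppose class 0 = "no cancer" (frequent) and class 1 = "cancer" (rare).
# Weight the rare class higher so misclassifying it costs more.
class_weights = torch.tensor([1.0, 10.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)             # model outputs for a mini-batch
labels = torch.randint(0, 2, (8,))     # ground-truth classes
loss = criterion(logits, labels)       # misclassified rare samples dominate the loss
```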
Training Parameters (Batch Size and Learning Rate)
If the batch size is large, the gradients show a clear direction to go → the learning rate can be quite large.
If the batch size is small, the gradients do not show a clear direction to go → we have to take careful small steps, i.e. the learning rate should be small.
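A tiny numerical illustration of why this holds (all numbers are made up): averaging more per-sample gradients makes the estimated direction less noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are per-sample gradient values; the "true" direction is +1.
per_sample_grads = rng.normal(1.0, 5.0, size=100_000)

for batch_size in (8, 512):
    batches = per_sample_grads[: 100 * batch_size].reshape(100, batch_size)
    noise = batches.mean(axis=1).std()   # spread of the mini-batch gradient estimates
    print(f"batch size {batch_size:4d}: gradient noise {noise:.3f}")
# Small batches -> noisy direction -> small, careful learning rate.
# Large batches -> clear direction -> a larger learning rate is safe.
```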
What does overfitting mean?
Overfitting means that the training error gets significantly lower than the validation error.
The model overfits the training data and does not generalize to unseen data.
What are solutions against overfitting?
- Early Stopping
- Transfer Learning
- L2 Regularization
- Dropout
Early Stopping
Stop training if you see that the validation accuracy decreases again.
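A schematic early-stopping loop; train_one_epoch, evaluate, and the patience value are placeholder assumptions, not part of the lecture:

```python
import random

def train_one_epoch(model):   # placeholder for a real training pass
    pass

def evaluate(model):          # placeholder: returns validation accuracy
    return random.random()

model = None
best_val_acc, patience, stale = 0.0, 5, 0

for epoch in range(100):
    train_one_epoch(model)
    val_acc = evaluate(model)
    if val_acc > best_val_acc:
        best_val_acc, stale = val_acc, 0
        best_model = model            # in practice: save a checkpoint of the weights
    else:
        stale += 1
        if stale >= patience:
            break                     # validation accuracy stopped improving -> stop early
```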
Transfer Learning
- Overfitting often occurs due to too little training data.
- Use a network that was already trained on enough data (e.g. ImageNet) and only retrain the last layers.
- This also works with models that we trained ourselves.
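A hedged sketch with recent torchvision versions: load an ImageNet-pretrained network, freeze its body, and retrain only a new last layer (model choice and number of classes are illustrative):

```python
import torch.nn as nn
from torchvision import models

# Load a network pretrained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all pretrained parameters so they are not retrained.
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer with a fresh one for our own task (e.g. 2 classes).
model.fc = nn.Linear(model.fc.in_features, 2)
# Train as usual; only model.fc receives gradient updates.
```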
L2 Regularization
- If the parameters adapt strongly to the training data, one can often observe that they get very large.
- Thus a common regularizer penalizes large weights by adding their squared magnitude to the loss.
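A minimal PyTorch sketch of this penalty (the lambda value is illustrative); the weight_decay argument of the optimizers achieves the same effect:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
l2_lambda = 1e-4                                   # regularization strength (illustrative)

inputs, labels = torch.randn(8, 10), torch.randint(0, 2, (8,))
data_loss = criterion(model(inputs), labels)

# Punish large weights: add the sum of squared parameters to the loss.
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = data_loss + l2_lambda * l2_penalty

# Equivalent shortcut: torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```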