ANN Lecture 6 - Training and Enhancing ANNs Flashcards

1
Q

Parts of the ANN architecture

A
  • Number of layers (depth)
  • Kind of layers (convolution, fully-connected, etc.)
  • Neurons/Kernels per layer (width)
    If you had an infinite amount of computational power, you could use machine learning to find these parameters.
2
Q

The wider and the deeper a network, the better?

A

Obviously this means more parameters, which slows down training.
Deeper networks are more difficult to train.
Wider networks can lead to overfitting.

3
Q

Vanishing Gradient

A

TanH:

  • If there are a lot of high drives and thus small derivatives, the magnitude of the gradients decreases towards the earlier layers.
  • If the gradients for the first layers are too small, the weights stay random. Therefore there is no learning! (See the sketch below.)
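A minimal NumPy sketch (not from the lecture) of this effect: repeatedly multiplying by tanh derivatives, which are close to zero for large drives, shrinks the gradient towards the earlier layers. The layer count and drive values are invented for illustration.

```python
import numpy as np

def tanh_deriv(drive):
    # d/dx tanh(x) = 1 - tanh(x)^2, which is close to 0 for large |drive| ("high drives")
    return 1.0 - np.tanh(drive) ** 2

grad = 1.0                      # gradient magnitude arriving at the last layer
drives = [3.0, 2.5, 3.5, 2.0]   # hypothetical large drives in four hidden layers
for layers_back, d in enumerate(reversed(drives), start=1):
    grad *= tanh_deriv(d)       # each backprop step multiplies by a small derivative
    print(f"{layers_back} layer(s) back: gradient magnitude ~ {grad:.2e}")
# The gradient shrinks roughly geometrically, so the earliest layers barely learn.
```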
4
Q

Exploding / Dying Gradients

A
ReLU:

  • For each drive that is larger than zero the derivative is one. This can lead to very large gradients in the early layers and is called exploding gradients.
  • → The weights change erratically and effectively stay random.
  • If the drive of a neuron is below zero, the activation and the derivative are zero, thus the gradients are also zero (dying gradients).
  • → The weights will never change. (See the sketch below.)

Solution: keep the drive / activation centered around zero.
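A small NumPy sketch (not from the lecture) of the two ReLU failure modes named above; the weight magnitude and drive values are invented for illustration.

```python
import numpy as np

def relu_deriv(drive):
    # derivative of ReLU: 1 where the drive is positive, 0 otherwise
    return (drive > 0).astype(float)

# Exploding gradients: with derivative 1 for every positive drive, large weights
# multiply up layer after layer during backprop.
grad = 1.0
for _ in range(10):
    grad *= 3.0 * 1.0            # hypothetical weight of magnitude 3, ReLU derivative 1
print("gradient after 10 layers:", grad)    # 3**10 = 59049

# Dying gradients: a neuron whose drive is always negative gets zero gradient forever.
drives = np.array([-2.3, -1.7, -0.5, -3.1])
print("gradients through a dead ReLU:", relu_deriv(drives))  # all zeros -> no updates
```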

5
Q

How do we center the drive around zero?

A

Input Normalization:
(Input - Mean) / Standard Deviation
(For images: normalize each image on its own)

Weights Initialization:
Random Normal Initialization with a variance dependent on the number of input neurons to one layer (fan-in)

Bias Initialization:
The bias can be initialized with zeros.
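A minimal NumPy sketch (not from the lecture) of the three steps on this card: per-image input normalization, random normal weights whose variance depends on the fan-in, and zero biases. The layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input normalization: normalize each image on its own
images = rng.integers(0, 256, size=(32, 28, 28)).astype(np.float32)
normalized = (images - images.mean(axis=(1, 2), keepdims=True)) / (
    images.std(axis=(1, 2), keepdims=True) + 1e-8
)

# Weight initialization: random normal with variance dependent on the fan-in
fan_in, fan_out = 784, 100
W = rng.normal(loc=0.0, scale=np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

# Bias initialization: zeros
b = np.zeros(fan_out)
```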

6
Q

Batch Normalization

A
  • Calculate the mean and standard deviation of all drives in each mini-batch

Normalized_Drive_i =
(Drive_i - Mean) / Standard Deviation

New_Normalized_Drive =
Scale * Normalized_Drive + Shift
(Scale initialized with 1’s and Shift with 0’s)

–> At test time there is no mini-batch to normalize over, so the implementation is more complicated (running estimates collected during training are used instead).
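A minimal NumPy sketch (not from the lecture) of the training-time forward pass described above; the test-time path with running averages is omitted.

```python
import numpy as np

def batch_norm_train(drives, scale, shift, eps=1e-5):
    """drives: array of shape (batch_size, num_neurons) with one mini-batch of drives."""
    mean = drives.mean(axis=0)                  # per-neuron mean over the mini-batch
    std = drives.std(axis=0)                    # per-neuron standard deviation
    normalized = (drives - mean) / (std + eps)  # Normalized_Drive_i
    return scale * normalized + shift           # New_Normalized_Drive

num_neurons = 4
scale = np.ones(num_neurons)    # Scale initialized with 1's
shift = np.zeros(num_neurons)   # Shift initialized with 0's
mini_batch = np.random.default_rng(0).normal(size=(8, num_neurons))
out = batch_norm_train(mini_batch, scale, shift)
```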

7
Q

Data augmentation

A

A common problem is that the input data is either too scarce or too homogeneous. By augmenting the data you can produce artificial extra samples, which also brings more variance into your dataset.
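A small NumPy sketch (not from the lecture) of two simple image augmentations, a random horizontal flip and a small horizontal shift; real pipelines usually rely on library transforms, but the idea is just to generate altered copies of existing samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return a randomly flipped and shifted copy of a (height, width) image."""
    if rng.random() < 0.5:
        image = image[:, ::-1]              # horizontal flip
    shift = rng.integers(-2, 3)             # shift by -2 .. 2 pixels
    return np.roll(image, shift, axis=1)

image = rng.random((28, 28))
extra_samples = [augment(image) for _ in range(4)]   # artificial extra data
```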

8
Q

Class imbalance

A

Example: skin cancer images
(many more images without cancer than with cancer)

After training on random mini-batches from this data your network will classify each image as ‘no cancer’, because this gives high accuracy and low loss.

9
Q

Solutions for Class imbalance

A
  1. Draw balanced mini-batches. (Can overfit on the rare classes.)

  2. Penalize wrong classifications of rare classes more strongly. (See the sketch below.)
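A minimal NumPy sketch (not from the lecture) of the second idea as a class-weighted cross-entropy; the class counts and the inverse-frequency weighting are just one illustrative choice.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """probs: (N, C) predicted class probabilities, labels: (N,) integer class indices."""
    true_class_probs = probs[np.arange(len(labels)), labels]
    losses = -np.log(true_class_probs + 1e-12) * class_weights[labels]
    return losses.mean()

# e.g. 950 'no cancer' images (class 0) and 50 'cancer' images (class 1):
class_counts = np.array([950, 50])
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
print(class_weights)   # ~[0.53, 10.0] -> mistakes on the rare class cost much more
```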

10
Q

Training Parameters (batch size and learning rate)

A

If the batch size is large, the gradients show a clear direction to go → the learning rate can be quite large.

If the batch size is small, the gradients do not show a clear direction to go → we have to take careful small steps = the learning rate should be small.
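A rough sketch (not from the lecture) of this rule of thumb as a linear scaling of the learning rate with the batch size; the base values are arbitrary and real schedules are usually tuned empirically.

```python
# Linear scaling heuristic: larger batches give less noisy gradients,
# so larger steps are tolerable; smaller batches call for smaller steps.
BASE_BATCH_SIZE = 32
BASE_LEARNING_RATE = 0.01

def scaled_learning_rate(batch_size):
    return BASE_LEARNING_RATE * batch_size / BASE_BATCH_SIZE

print(scaled_learning_rate(256))   # 0.08   -> large batch, larger learning rate
print(scaled_learning_rate(8))     # 0.0025 -> small batch, smaller learning rate
```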

11
Q

What does overfitting mean?

A

Overfitting means that the training error gets significantly lower than the validation error.
The model overfits the training data and does not generalize to unseen data.

12
Q

What are solutions against overfitting?

A
  • Early Stopping
  • Transfer Learning
  • L2 Regularization
  • Dropout
13
Q

Early Stopping

A

Early Stopping:

Stop training once you see that the validation accuracy starts to decrease again.
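A minimal sketch (not from the lecture) of an early-stopping loop with a patience counter; train_one_epoch(), validate(), get_weights() and set_weights() are hypothetical placeholders for your own training code.

```python
def train_with_early_stopping(model, epochs=100, patience=5):
    """Stop when the validation accuracy has not improved for `patience` epochs."""
    best_acc, best_weights, epochs_without_improvement = 0.0, None, 0
    for epoch in range(epochs):
        train_one_epoch(model)               # hypothetical training step
        val_acc = validate(model)            # hypothetical validation accuracy
        if val_acc > best_acc:
            best_acc, best_weights = val_acc, model.get_weights()
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # validation accuracy keeps dropping -> stop
    model.set_weights(best_weights)          # roll back to the best checkpoint
    return model
```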

14
Q

Transfer Learning

A
  • Overfitting often occurs because there is too little training data.
  • Use a network that was already trained on enough data (e.g. ImageNet) and only retrain the last layers. (See the sketch below.)
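A small sketch (not from the lecture) using PyTorch/torchvision: load an ImageNet-pretrained ResNet-18, freeze its layers, and replace only the final layer for the new task (assumes torchvision >= 0.13 for the `weights=` argument).

```python
import torch.nn as nn
from torchvision import models

# Network that was already trained on enough data (ImageNet)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pretrained feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer with a fresh one for our own classes and retrain only it
num_classes = 2   # e.g. cancer / no cancer
model.fc = nn.Linear(model.fc.in_features, num_classes)
```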
15
Q

L2 Regularization

A
  • If the parameters adapt to the data, one can often observe that they get really large.
  • Thus one common regularizer punishes large weights by adding their squared magnitude to the loss. (See the sketch below.)
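A minimal NumPy sketch (not from the lecture) of an L2 penalty added to the loss and to the weight gradients; the regularization strength `lam` is arbitrary.

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-4):
    # punish large weights: add lambda * sum of squared weights to the loss
    return data_loss + lam * sum(np.sum(w ** 2) for w in weights)

def l2_weight_gradient(weight_grad, weight, lam=1e-4):
    # the penalty contributes 2 * lambda * w to each weight's gradient ("weight decay")
    return weight_grad + 2 * lam * weight
```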
16
Q

Dropout

A
  • During the forward step in training we randomly drop some neurons with a fixed probability 1 - p.
  • This way we train multiple different models that share parameters.
  • During testing we do not drop any neurons, but multiply all weights by the probability p.
  • This averages over all the models that were trained. (See the sketch below.)
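A minimal NumPy sketch (not from the lecture) of dropout as described on this card: drop units with probability 1 - p during training, and scale by p at test time (the sketch scales the activations, which has the same effect as scaling the weights; many frameworks instead use "inverted dropout" and scale already during training).

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p=0.8, training=True):
    """Keep each neuron with probability p (drop it with probability 1 - p)."""
    if training:
        mask = (rng.random(activations.shape) < p).astype(float)   # 1 = keep, 0 = drop
        return activations * mask
    # testing: keep all neurons but scale by p, approximating the average
    # over all the thinned networks that shared parameters during training
    return activations * p
```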