4/5 - Deep Convolutional Neural Networks Flashcards
Deep learning
Learning on a network with more than 3 layers
Overfitting in deep learning (parameters)
More parameters increase the risk of overfitting.
Too many parameters and not enough data points = overfitting
Shape analogy for layers (Abstraction)
Layer 1 could detect a horizontal or vertical line
Layer 2 could then detect a shape
Layer 3 could then detect an item.
Abstraction
MNIST
Handwritten digit data set (digits 0–9)
4 AI uses in images
Image segmentation (item separation)
Image captioning
Question answering (is there a boat in this image?)
Action recognition
How do you choose the values in a kernel/filter?
Randomly initialised; the system then learns the best filter values during training
CNN Do you flatten the input image?
No, it stays as 2D
2D Convolution Layer (5x5 input and 3x3 kernel example)
5x5 Input for example with a 3x3 kernel.
You slide the 3x3 kernel across the image from the top left; at each position you multiply elementwise, sum the products, and write the result into the corresponding location of a new 3x3 output matrix
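A minimal NumPy sketch of this operation (stride 1, no padding; strictly this is cross-correlation, which is what CNN libraries actually compute):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding); at each
    position, multiply elementwise and sum into the output."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 input
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging kernel
print(conv2d_valid(image, kernel).shape)          # (3, 3)
```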
Stride
Number of pixels the kernel moves each step when sliding.
E.g. stride 1 moves one pixel at a time,
stride 2 moves two pixels.
A higher stride moves faster through the image.
Higher stride does what to output?
Reduces the output size (fewer positions are sampled)
Padding
Add zeros all around the edge of the input, then scan with the kernel
Computing the output size of a convolutional layer:
(W − F + 2P)/S + 1
W = input size
F = filter size
P = padding
S = stride
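A quick sanity check of the formula in Python (the example values are the 5x5 input / 3x3 kernel case from above):

```python
def conv_output_size(W, F, P=0, S=1):
    # (W - F + 2P)/S + 1
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=5, F=3))            # 3  (5x5 input, 3x3 kernel)
print(conv_output_size(W=5, F=3, P=1, S=2))  # 3  (padding 1, stride 2)
```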
Images have how many channels and why?
3 - RGB
(Where the image is RGB of course)
3 Channel input and a 3 channel filter. How many channels is the output?
1
To make a 3 channel output of a 3 channel input, we need how many 3 channel filters?
3 filters
More filters you have means what for parameters?
More parameters. Potential overfitting if too many filters.
Must your number of input channels match the number of filter channels?
YES
Stack of convolutional layers example
Input → Filters → Feature Maps → Filters → Feature Maps → Filters → …
ReLU
Rectified Linear Unit
max(0,x)
On a graph: y = 0 for x < 0, otherwise y = x
When is ReLU needed?
After every convolutional layer.
Why can’t we have two convolutional layers right next to each other?
Convolution is a linear operation, so two convolutional layers one after the other would collapse into the equivalent of a single layer.
We need non-linearity between them (ReLU, sigmoid, tanh, etc.)
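A hedged PyTorch-style sketch of the resulting conv → ReLU pattern (the channel counts are made up for illustration):

```python
import torch.nn as nn

# Non-linearity after each convolution; without the ReLUs, the two
# stacked convolutions would collapse into one linear operation.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
)
```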
Advantages of ReLU vs tanh/sigmoid
- Faster convergence than tanh
- Easier and faster calculation
- Lower probability of vanishing gradient
Disadvantages of ReLU:
- Dying ReLU: when inputs are all or mostly negative, recovery is hard because the gradient is 0 in the negative half
Pooling Layer: Max Pooling
Scans a kernel over the image LIKE CONVOLUTIONAL, but takes the highest value in each window as the output (rather than multiplying and summing)
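A minimal NumPy sketch of 2x2 max pooling with stride 2 (the common case; even input sizes assumed):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling, stride 2: keep the largest value in each window."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H - 1, 2):
        for j in range(0, W - 1, 2):
            out[i // 2, j // 2] = x[i:i+2, j:j+2].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 0],
              [7, 2, 9, 8],
              [3, 1, 4, 5]], dtype=float)
print(max_pool_2x2(x))  # [[6. 4.]
                        #  [7. 9.]]
```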
Pooling: does it collapse the channels, like convolutional?
No, 3 channels in 3 channels out
Advantages of pooling layer
- Improved translational invariance (moving pixels to slightly different locations has little effect)
- Downsampling (reduced resources)
Batch Normalisation concept
Network is easier to train if the input is normalised
Issue with normalisation
During training, the parameters are changing, which changes the distribution of the data flowing into different layers
Batch normalisation
Input: values of x over a mini-batch B = {x₁, …, xₘ}
Parameters to be learned: γ (scale) and β (shift)
Mini-batch mean: μ = (1/m) Σ xᵢ
Mini-batch variance: σ² = (1/m) Σ (xᵢ − μ)²
Normalise: x̂ᵢ = (xᵢ − μ) / √(σ² + ε)
Scale and shift: yᵢ = γx̂ᵢ + β
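A minimal NumPy sketch of those four steps for one mini-batch (ε is the usual small constant for numerical stability; in practice γ and β are learned):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: mini-batch of activations, shape (m, features)."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalise
    return gamma * x_hat + beta            # scale and shift

x = np.random.randn(32, 8) * 5 + 3         # 32 samples, 8 features
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1
```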
Fully Connected Layer
Each neuron/unit in the previous layer connects to every unit in this layer.
If you had a 9x9 patch and compared a 4 layer approach to a 1 layer approach (ending at 1x1), how many parameters does each have?
Approach 1 (four stacked 3×3 filters): 4 biases + 4 × 3 × 3 = 36 weights
4 + 36 = 40 parameters
Approach 2 (one 9×9 filter): 1 bias + 1 × 9 × 9 = 81 weights
1 + 81 = 82 parameters
Approach 2 is more likely to overfit
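A quick PyTorch check of that arithmetic (single-channel convolutions, matching the counts above):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

four_small = nn.Sequential(*[nn.Conv2d(1, 1, kernel_size=3) for _ in range(4)])
one_big = nn.Conv2d(1, 1, kernel_size=9)

print(n_params(four_small))  # 40  (4 x (3*3 weights + 1 bias))
print(n_params(one_big))     # 82  (9*9 weights + 1 bias)
```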
2 advantages of using more, small filters
- Fewer parameters: lower chance of overfitting and lower computational cost
- More ReLUs: more non-linearity so higher representation capacity
Why does the GoogLeNet architecture use several different filter sizes in parallel? What is the problem with this?
Some filter sizes might be better in certain circumstances than others.
The problem is computational cost
How did GoogLeNet solve the computational cost issue?
They use dimension reduction with bottleneck layers:
the 1x1 convolutional layer
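A hedged sketch of the bottleneck idea (channel counts are made up): a cheap 1x1 conv shrinks the channels before the expensive 3x3 conv.

```python
import torch.nn as nn

# Bottleneck: 256 channels -> 64 via a cheap 1x1 conv, so the 3x3 conv
# runs on 64 channels instead of 256, cutting the computational cost.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
)
```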
Residual Block
Takes the input, bypasses the layers, and sums it with the layer outputs: output = f(x) + x.
If the mapping has already been learned, f(x) only has to learn to barely affect the input.
ResNet uses them every 2 layers.
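A minimal PyTorch-style residual block matching that pattern (batch norm omitted for brevity; the channel count is preserved so the sum works):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))  # f(x): two conv layers
        return self.relu(out + x)                   # skip connection: f(x) + x
```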
Densely connected Networks
Like residual blocks, but each layer's output is connected to every future layer.
Pros of Densely connected networks
- Stronger gradient in back propagation so it’s easier to train
- Learn features from multiple levels
- Reuse of features (fewer filters but same num of feature maps)
Drawback of densely connected network
High memory cost. You must save all intermediate feature maps.
Dense Block Networks
Uses smaller blocks of densely connected layers, separated by standard layers.
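A hedged PyTorch-style sketch of one dense block (growth rate and layer count are illustrative); the growing concatenation is also where the memory cost above comes from:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer takes the concatenation of all earlier feature maps."""
    def __init__(self, in_channels, growth, n_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_channels + i * growth, growth, kernel_size=3, padding=1)
            for i in range(n_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = torch.relu(layer(torch.cat(features, dim=1)))
            features.append(out)  # reused by every later layer
        return torch.cat(features, dim=1)
```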
Fine-Tuning data requirements
Lots of data is required to train from scratch. You could instead use a pre-trained network and then fine-tune it.
Fine tuning with less data
You could freeze all layers other than the last and retrain that with only a few data points.
You could also just train a linear classifier on top of the frozen features.
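A hedged PyTorch sketch of the freeze-all-but-the-last-layer case (the `weights` argument and `fc` attribute follow recent torchvision and are assumptions here):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # pre-trained network

for param in model.parameters():
    param.requires_grad = False                   # freeze every layer

# Replace the last layer; only its parameters are now trainable.
model.fc = nn.Linear(model.fc.in_features, 10)    # e.g. 10 target classes
# With more data: unfreeze more layers, working backwards from the output.
```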
Fine tuning with more data
Freeze fewer layers and retrain more of them, working backwards from the output.
Fine tuning with a LOT data
Freeze no layers and retrain all.
Be careful of overfitting
Data augmentation
Add transforms like rotation, translation, zooming, noise, etc.
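A small torchvision sketch of such a pipeline (the specific transforms and magnitudes are illustrative):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(15),                        # rotation
    transforms.RandomAffine(0, translate=(0.1, 0.1)),     # translation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # zooming
    transforms.ToTensor(),
])
```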
Dropout
Randomly black out (zero) some units during each training pass; dropout is switched off at inference.
Harder to overfit because the network can't rely on any single neuron anymore
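In PyTorch this is a single layer; note it is only active in training mode:

```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zeroes half the units on each training pass
    nn.Linear(256, 10),
)
# classifier.train() enables dropout; classifier.eval() disables it.
```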
Early stop
Stop training when the network starts to overfit (e.g. when validation error begins to rise).