Lesson 3 - Convolutional Neural Networks Flashcards
What is the goal or the use of having the cross entropy block there during training?
It tells us how close our prediction ŷ is to the ground truth that was provided with the input. During training it therefore gives an indication of how well the model is behaving at that moment.
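As a hedged illustration (a minimal NumPy sketch, not code from the lesson), cross-entropy is small when ŷ puts its probability mass on the true class and large otherwise:

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Cross-entropy between predicted probabilities y_hat and a one-hot target y."""
    eps = 1e-12                               # avoid log(0)
    return -np.sum(y * np.log(y_hat + eps))

y    = np.array([0.0, 1.0, 0.0])              # ground truth: class 1
good = np.array([0.05, 0.90, 0.05])           # confident, correct prediction
bad  = np.array([0.70, 0.20, 0.10])           # confident, wrong prediction

print(cross_entropy(good, y))                 # ~0.105 (low loss, close to the ground truth)
print(cross_entropy(bad, y))                  # ~1.609 (high loss, far from the ground truth)
```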
Why do we use the gradient information during the backward pass? What is the gradient telling us?
The gradient tells us in which direction to change the parameters of the network (in this case the weights of the layers) to change the loss. Because we want to minimize the loss, we always step along the negative of the gradient.
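A toy sketch (plain gradient descent on a one-parameter loss, not the exact training loop from class) of following the negative gradient:

```python
# Minimize L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3).
w, lr = 0.0, 0.1                  # initial weight and learning rate (step size)

for _ in range(50):
    grad = 2 * (w - 3)            # the gradient points in the direction of increasing loss
    w = w - lr * grad             # so we step in the opposite (negative) direction

print(w)                          # approaches 3, the minimizer of the loss
```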
How do we know when to stop training?
When we see that the validation loss starts increasing again (while the training loss keeps decreasing)
= overfitting
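A sketch of that stopping rule (early stopping on a made-up validation-loss curve; the numbers are illustrative, not from the lesson):

```python
# Toy validation-loss curve: it decreases, then starts increasing again (overfitting).
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]

best_val, patience, bad_epochs = float("inf"), 2, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0    # still improving: keep training
    else:
        bad_epochs += 1                       # validation loss rose again
        if bad_epochs >= patience:
            print(f"stop at epoch {epoch}")   # overfitting detected: stop here
            break
```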
ReLU allows us to circumvent some issues that we had when computing the gradient. What were those issues? What was the problem with, for example, Sigmoid?
Sigmoid tends to saturate at the two extremes, which has the side-effect that in those regions the gradient tends to vanish (become 0).
A gradient of 0 means that when you navigate the parameter space to update your model, you do not get enough information to do so.
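A quick numeric check (a sketch, not lesson code) of the saturation: the sigmoid derivative collapses toward 0 at the extremes, while ReLU keeps a gradient of 1 for positive inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

sig_grad  = sigmoid(x) * (1 - sigmoid(x))     # derivative of the sigmoid
relu_grad = (x > 0).astype(float)             # derivative of ReLU (0 for x <= 0, else 1)

print(sig_grad)    # ~[4.5e-05, 0.105, 0.25, 0.105, 4.5e-05] -> vanishes at the extremes
print(relu_grad)   # [0. 0. 0. 1. 1.]                        -> stays 1 for positive inputs
```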
What are some characteristics of visual data?
- Locality: neighboring pixels are highly correlated
- Translation Invariance: meaningful patterns can appear anywhere
- Compositionality: Learning feature hierarchies
Why is it not sufficient to just flatten an image?
All sorts of translations can happen, and we don't want the model to learn specific positions.
For example, if the image shifts 6 pixels to the left, the weights learned for the original positions may suddenly no longer be any good.
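A small illustration (a sketch with a made-up 1-D "image", not from the lesson) of why position-specific weights break under a shift while a sliding kernel still finds the pattern:

```python
import numpy as np

pattern = np.array([1.0, 2.0, 1.0])                   # the feature we care about
image   = np.array([0, 0, 1, 2, 1, 0, 0, 0], float)   # pattern at position 2
shifted = np.roll(image, 3)                           # same pattern, shifted 3 pixels

# "Flattened" model: one fixed weight per pixel position, tuned to the original position.
w_flat = image.copy()
print(w_flat @ image, w_flat @ shifted)               # 6.0 vs 0.0 -> the response collapses

# Convolutional model: the same small kernel slides over every position.
print(np.convolve(image,   pattern[::-1], mode="valid").max())   # 6.0
print(np.convolve(shifted, pattern[::-1], mode="valid").max())   # 6.0 -> the response just moves
```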
What are the differences between locally connected layers and convolutional layers?
Locally Connected Layers
Locally Connected Layer: In a locally connected layer, each neuron is connected to a small, local region of the input, but these connections are not shared across the spatial dimensions. This means that the weights for connections in different regions are independent.
Drawback: This lack of weight sharing leads to a significant increase in the number of parameters, which can be inefficient and prone to overfitting, especially for large input sizes.
Convolutional Layers
Weight Sharing: A convolutional layer addresses this by sharing weights across different spatial locations. A single set of weights (called a filter or kernel) is used to slide across the entire input, creating a feature map.
Efficiency: This drastically reduces the number of parameters compared to locally connected layers, making the model more efficient and less prone to overfitting.
Parameter Sharing: By using the same filter for all spatial locations, convolutional layers are able to detect the same feature (like an edge or a texture) regardless of its position in the input.
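Back-of-the-envelope parameter counts (a sketch with assumed sizes, not numbers from the lesson) that make the difference concrete:

```python
# Assume a 32x32 grayscale input and 3x3 neighborhoods producing a 30x30 output map.
out_h, out_w, k = 30, 30, 3

# Locally connected: every output position owns its own 3x3 weights (no sharing).
local_params = out_h * out_w * k * k      # 30 * 30 * 9 = 8100 weights

# Convolutional: a single 3x3 kernel is shared across all positions.
conv_params = k * k                       # 9 weights

print(local_params, conv_params)          # 8100 vs 9
```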
Can you explain what the response/feature map is?
The output (result) of applying the convolution operation in a convolutional layer.
As the kernel slides over the input, its outputs at each position together form the response map.
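A minimal sketch (using SciPy and a hypothetical vertical-edge kernel, not lesson code) of the response map produced by sliding a kernel over an input:

```python
import numpy as np
from scipy.signal import correlate2d

image = np.zeros((6, 6))
image[:, 3:] = 1.0                        # left half dark, right half bright

kernel = np.array([[-1.0, 1.0]])          # responds to a left-to-right intensity jump

feature_map = correlate2d(image, kernel, mode="valid")
print(feature_map)                        # strong responses exactly along the vertical edge
```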
What is the receptive field?
The receptive field of a specific neuron is the part of the input that this neuron perceives, i.e. the region of the input it can observe and that contributes to its activation (response value).
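A sketch (assuming stacked layers with made-up kernel sizes and strides, not values from the lesson) of how the receptive field of a neuron grows with depth:

```python
# (kernel_size, stride) for each layer, from the first layer to the last.
layers = [(3, 1), (3, 1), (2, 2), (3, 1)]

rf, jump = 1, 1                    # receptive field size and input-pixel step of one output unit
for k, s in layers:
    rf += (k - 1) * jump           # each layer widens the window the neuron can observe
    jump *= s                      # striding makes later layers skip over input pixels

print(rf)                          # receptive field (in input pixels) of a neuron in the last layer
```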
When you say “my layer has 10 kernels”, what does that mean?
It means you have ten masks (kernels) sliding over your input, so the layer is looking for ten different features.
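In PyTorch terms (a sketch assuming a single-channel input, not lesson code), "a layer with 10 kernels" is a conv layer with out_channels=10, which produces 10 response maps:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3)   # ten different 3x3 kernels

x = torch.randn(1, 1, 28, 28)      # one single-channel 28x28 input
y = conv(x)
print(y.shape)                     # torch.Size([1, 10, 26, 26]): one 26x26 map per kernel
```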
What convolution operations did we see in class?
- Valid Convolution -> every considered point lies within the input
- Full Convolution -> at least one value of the kernel covers the input
- Same Convolution -> kernel evaluated (centered) at every location of the input
- Strided Convolution -> sparser kernel evaluations
- Dilated Convolution -> points considered in the kernel are spread
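For the last two entries above, a PyTorch sketch (with made-up sizes) showing how stride and dilation change where the kernel is evaluated and hence the output size:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 16, 16)

plain   = nn.Conv2d(1, 1, kernel_size=3)               # kernel evaluated at every valid position
strided = nn.Conv2d(1, 1, kernel_size=3, stride=2)     # sparser kernel evaluations
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)   # kernel points spread apart

print(plain(x).shape)     # torch.Size([1, 1, 14, 14])
print(strided(x).shape)   # torch.Size([1, 1, 7, 7])
print(dilated(x).shape)   # torch.Size([1, 1, 12, 12])
```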
What is VALID convolution? How is the size ratio for input and output?
Normal convolution, every considered point lies within the input
the output is smaller than the input (for input size n and kernel size k, the output has size n − k + 1)
What is FULL convolution? How is the size ratio for input and output?
At least one value of the kernel covers the input
the output is larger than the input (for input size n and kernel size k, the output has size n + k − 1)
What is a potential issue with full convolution?
Points outside the input are also considered, but those points are undefined.
Therefore we need padding.
We cannot pad with 0 because that may influence the result
What is SAME convolution? How is the size ratio for input and output?
Kernel evaluated (centered) at every location of the input
Sizes are equal
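The three modes map directly onto NumPy's 1-D convolution modes (a sketch with a made-up signal, not lesson code), which makes the size ratios easy to verify:

```python
import numpy as np

x = np.arange(1.0, 8.0)     # input of length 7
k = np.ones(3)              # kernel of length 3

print(np.convolve(x, k, mode="valid").shape)   # (5,) -> output smaller than input (7 - 3 + 1)
print(np.convolve(x, k, mode="full").shape)    # (9,) -> output larger than input (7 + 3 - 1)
print(np.convolve(x, k, mode="same").shape)    # (7,) -> output the same size as the input
```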