CNN Flashcards
What does a CNN do?
What Does a CNN Do with Images?
CNNs extract important features from images, like edges, shapes, and textures, and use them to recognize objects, patterns, or even emotions.
Uses of CNN in Images
✔ Image Classification – Identifying objects (e.g., cat vs. dog).
✔ Face Recognition – Used in biometric systems (e.g., Face ID).
✔ Medical Imaging – Detecting diseases in X-rays or MRIs.
✔ Object Detection – Finding objects in images (e.g., self-driving cars).
✔ Image Segmentation – Separating different parts of an image (e.g., background removal).
What is the Architecture of CNN?
CNN Architecture (Layers Overview)
1️⃣ Input Layer → Raw image (e.g., 32×32×3 for RGB).
2️⃣ Convolution Layer → Extracts features using filters/kernels.
3️⃣ Activation Layer (ReLU) → Applies non-linearity by setting negative values to zero.
4️⃣ Pooling Layer (Max/Average Pooling) → Reduces spatial size.
5️⃣ (Repeat Conv → ReLU → Pooling) → Deeper feature extraction.
6️⃣ Flatten Layer → Converts feature maps into a 1D vector.
7️⃣ Fully Connected Layer (Dense) → Processes features for classification.
8️⃣ Output Layer (Softmax/Sigmoid) → Gives final class probabilities.
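A minimal sketch of this stack in Keras (the filter counts, dense width, and 10 output classes here are illustrative assumptions, not fixed CNN rules):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),              # 1. input: 32x32 RGB image
    layers.Conv2D(32, (3, 3), activation="relu"), # 2+3. convolution + ReLU
    layers.MaxPooling2D((2, 2)),                  # 4. pooling
    layers.Conv2D(64, (3, 3), activation="relu"), # 5. repeat conv -> ReLU
    layers.MaxPooling2D((2, 2)),                  #    ... -> pooling
    layers.Flatten(),                             # 6. feature maps -> 1D vector
    layers.Dense(128, activation="relu"),         # 7. fully connected layer
    layers.Dense(10, activation="softmax"),       # 8. class probabilities
])
model.summary()
```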
What is the Convolution Layer in CNN?
Convolution Layer in CNN
Input Representation:
The image is represented as a matrix of pixel values (grayscale: 2D matrix, color: 3D matrix with RGB channels).
Filters (Kernels):
A small grid (filter/kernel) slides over the image to detect patterns.
Example filters:
Edge detection (horizontal, vertical, diagonal).
Blur, sharpen, texture extraction, etc.
Convolution Operation:
Multiply the filter element-wise with the corresponding image patch.
Sum up the products to get a single number.
This forms a new matrix called the feature map.
Purpose of Feature Maps:
Helps the CNN detect edges, textures, shapes, and patterns.
Multiple filters extract different features from the same image.
Output Size Adjustment:
The output feature map is usually smaller than the original image, because the filter can only be placed where it fits entirely inside the image.
Padding is added to maintain size if needed.
Learnable Filters:
Unlike traditional fixed filters (Sobel, Prewitt), a CNN learns its filters automatically through training.
The model updates filter values to detect the most relevant features.
Final Thought: The Convolution Layer transforms raw pixel data into meaningful feature maps, helping the network recognize objects efficiently.
What are common Filters in CNN?
Common Filters in CNN
Edge Detection Filters:
Sobel, Prewitt, Scharr (detect horizontal/vertical edges).
Sharpening Filters:
Enhance the contrast and make features more pronounced.
Blurring Filters:
Gaussian Blur (reduces noise).
Embossing Filters:
Highlights high and low areas in an image.
Learned Filters (CNN-specific):
Initial layers detect simple patterns (edges, corners, textures).
Deeper layers detect complex patterns (eyes, faces, objects).
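For reference, a few of these classic fixed kernels written out as 3×3 NumPy arrays (standard textbook values):

```python
import numpy as np

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])               # vertical-edge detector

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])             # boosts the center pixel

gaussian_blur = np.array([[1, 2, 1],
                          [2, 4, 2],
                          [1, 2, 1]]) / 16.0   # smooths / reduces noise

emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]])               # highlights raised/recessed areas
```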
Convolution Layer in CNN with an Example
Input Layer (The Original Image - 5×5 Matrix)
This is your input image, where each cell (pixel) is an input neuron:
1 2 3 4 5
5 3 5 6 3
1 2 5 6 7
5 5 7 8 3
1 4 6 8 2
Each number represents a grayscale pixel value (0-255 for real images).
Convolution Layer (Feature Extraction Using a 3×3 Filter)
A filter (kernel) moves over this input. Example 3×3 filter (edge detection):
1 0 -1
1 0 -1
1 0 -1
How it works:
This filter slides over the image and multiplies corresponding pixel values, then sums them up.
This process extracts edges or patterns from the image.
Feature Map (Output After Convolution)
After applying the filter over the entire image (stride = 1, no padding), the result is a 3×3 feature map:
-6 -9 -2
-6 -10 4
-11 -11 6
This is the first hidden layer of neurons in CNN!
These values represent detected edges in the image.
Activation Function (ReLU)
ReLU is applied to remove negative values, keeping only strong features:
0 0 0
0 0 4
0 0 6
Now only important edge information is retained.
This is the activated hidden layer.
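A short NumPy sketch that reproduces the numbers above (stride 1, no padding):

```python
import numpy as np

image = np.array([[1, 2, 3, 4, 5],
                  [5, 3, 5, 6, 3],
                  [1, 2, 5, 6, 7],
                  [5, 5, 7, 8, 3],
                  [1, 4, 6, 8, 2]])

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# Slide the 3x3 filter over the 5x5 image: multiply element-wise, then sum.
out = np.zeros((3, 3))
for r in range(3):
    for c in range(3):
        out[r, c] = np.sum(image[r:r+3, c:c+3] * kernel)

print(out)                 # the 3x3 feature map
print(np.maximum(out, 0))  # after ReLU: negatives set to zero
```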
What is stride in CNN?
Strides in CNN
What is Stride?
Stride defines how much the filter moves over the input matrix at each step.
📌 Stride = 1 → The filter moves one pixel at a time (default setting).
📌 Stride = 2 → The filter moves two pixels at a time, reducing the output size.
Effect of Stride:
Higher Stride → Smaller Output (less computational cost, but might lose details).
Lower Stride → Larger Output (captures more details but increases computation).
🔹 Example:
If you have a 5×5 image and a 3×3 filter:
With stride 1, output size = 3×3
With stride 2, output size = 2×2
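The general rule behind this example (no padding) is: output size = floor((N − F) / S) + 1, for an N×N input, F×F filter, and stride S. A tiny helper to check it:

```python
# Output size of a convolution with no padding.
def conv_output_size(n, f, s):
    return (n - f) // s + 1

print(conv_output_size(5, 3, 1))  # 3 -> 3x3 feature map
print(conv_output_size(5, 3, 2))  # 2 -> 2x2 feature map
```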
What is Padding in CNN?
Padding in CNN
What is Padding?
Padding means adding extra rows and columns of zeros around the input image before applying the filter.
Why use Padding?
Preserves image size → Without padding, output size shrinks after each convolution.
Better feature detection → Ensures edges & corners of the image are processed well.
Types of Padding:
Valid Padding (No Padding, “VALID” in TensorFlow)
No extra zeros are added.
Output size reduces after each convolution.
Same Padding (“SAME” in TensorFlow)
Adds enough padding to keep the output same size as the input.
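A quick Keras check of both modes on a 5×5 input (the shapes are the point here; the weights are random):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 5, 5, 1))  # batch of one 5x5 grayscale image

valid = layers.Conv2D(1, (3, 3), padding="valid")(x)
same  = layers.Conv2D(1, (3, 3), padding="same")(x)

print(valid.shape)  # (1, 3, 3, 1) -- output shrinks
print(same.shape)   # (1, 5, 5, 1) -- output size preserved
```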
What is Depth in CNN?
Depth in CNN
📌 What is Depth in CNN?
Depth refers to the number of filters (kernels) applied in a convolutional layer.
How It Works?
Each filter detects different features from the image (e.g., edges, textures, shapes).
If we apply N filters to an input image, we get N output feature maps stacked together → This forms the depth of the output.
Example:
Input Image: 28×28×3 (RGB image with 3 color channels)
Filters Applied: 32 filters of size 3×3×3
Output Shape: 28×28×32 (with same padding) → The depth is 32
📌 Why is Depth Important?
More filters = More feature extraction power
Helps in learning complex patterns
Deeper layers capture high-level features (e.g., object parts, textures)
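A quick Keras check of this example (assuming same padding, as in the numbers above):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 28, 28, 3))              # one 28x28 RGB image
y = layers.Conv2D(32, (3, 3), padding="same")(x)  # 32 filters -> depth 32
print(y.shape)  # (1, 28, 28, 32)
```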
What is pooling layer in CNN?
Pooling Layer – Simplifying the Feature Maps
After the convolution layer extracts patterns, the Pooling Layer reduces the size of the feature maps while keeping the important information. This makes the CNN faster and less sensitive to small changes in the image.
How Pooling Works:
A small window (e.g., 2×2 or 3×3) moves across the feature map.
It applies a function (like max pooling or average pooling) to summarize the values inside the window.
This reduces the dimensions while keeping key features.
Types of Pooling:
Max Pooling (Most Common) – Takes the largest value in the window.
Why? Keeps the strongest features and removes weak ones.
Average Pooling – Takes the average of values in the window.
Why? Keeps general information but loses sharp details.
Example (Max Pooling, 2×2 Window, Stride = 2):
Feature Map (4×4 matrix):
1 3 2 4
5 6 1 2
3 1 7 8
9 2 4 6
After Max Pooling (2×2, Stride=2):
6 4
9 8
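A short NumPy sketch that reproduces this pooling result:

```python
import numpy as np

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [3, 1, 7, 8],
                 [9, 2, 4, 6]])

# 2x2 window, stride 2: take the max of each non-overlapping 2x2 block.
pooled = np.zeros((2, 2))
for r in range(2):
    for c in range(2):
        pooled[r, c] = fmap[2*r:2*r+2, 2*c:2*c+2].max()

print(pooled)  # [[6. 4.]
               #  [9. 8.]]
```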
Why is Pooling Needed in CNN?
Why Use Pooling?
✅ Reduces computation (fewer neurons).
✅ Makes CNN more robust to small shifts in the image.
✅ Prevents overfitting by reducing unnecessary details.
What are the layers in CNN?
Where Are These Layers in CNN?
First layers: Initial convolution + activation + pooling cycle.
Middle layers: Additional convolution + activation + pooling cycles.
Deeper layers: Fully connected (Dense) layers before output.
Breaking It Down:
First (Shallow) Layers:
Detect basic patterns like edges, corners, and textures.
Example: A filter might detect horizontal or vertical edges in an image.
Middle Layers:
Combine basic features into shapes and objects (like eyes, noses, or fur patterns).
Example: A filter might detect circles or curves in an image of a face.
Deeper Layers:
Recognize complex structures and full objects (like faces, cats, or cars).
Example: A filter might detect a whole face rather than just eyes or lips.
What is the Flatten Layer?
The Flatten layer in CNN converts the 2D feature maps (from convolution/pooling layers) into a 1D vector so it can be passed into a fully connected (dense) layer for classification.
Why Flatten?
CNN extracts spatial features using convolution & pooling, but classification needs a fully connected layer that works with a 1D input. Flatten handles this transition.
Example
Suppose after pooling, we have a 4 × 4 × 16 feature map.
Flatten converts it into a 1D array of 256 values (4 × 4 × 16 = 256).
This is then fed into a dense layer for classification.
Key Takeaway
Flatten preserves extracted features but converts them into a format suitable for classification.
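A one-line check of this with NumPy (random values, only the shape matters):

```python
import numpy as np

fmap = np.random.rand(4, 4, 16)  # 4x4x16 feature map from pooling
flat = fmap.reshape(-1)          # flatten to a 1D vector
print(flat.shape)                # (256,) since 4 * 4 * 16 = 256
```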
What is Fully Connected Layer?
What Happens in the Fully Connected Layer?
Flattening:
The output from the last pooling layer is a 3D tensor (e.g., 7×7×64).
The Flatten layer converts this into a 1D vector (e.g., size 3136 = 7×7×64).
Fully Connected (Dense) Layer:
This 1D vector is passed through multiple fully connected neurons.
Every neuron is connected to every value in the vector (like in a standard neural network).
It learns to combine extracted features to recognize objects.
Activation Function (ReLU & Softmax/Sigmoid):
Hidden layers in the FC part usually use ReLU to introduce non-linearity.
The final layer uses Softmax (for multi-class) or Sigmoid (for binary classification) to produce probabilities.
Think of it like an ANN here: a series of fully connected neurons.
How CNN Works – Step-by-Step Flow
How CNN Works – Step-by-Step Flow
1️⃣ Input Image (Matrix Form)
The image is converted into a matrix of pixel values.
2️⃣ Convolution Layer
Multiple filters (kernels) slide over the image.
Each filter extracts specific patterns (edges, textures, shapes).
3️⃣ Activation Function (ReLU)
Applied after convolution to remove negative values.
Helps introduce non-linearity for complex learning.
4️⃣ Pooling Layer (Max/Average Pooling)
Reduces size of feature maps (downsampling).
Keeps important features, removes noise & reduces computation.
5️⃣ Repeat Steps 2-4 Multiple Times
Each layer extracts more complex patterns.
First layers → simple edges, middle layers → textures, deep layers → full objects.
6️⃣ Flattening
Converts final feature map into a 1D vector.
This vector contains all extracted features.
7️⃣ Fully Connected (Dense) Layer
Standard ANN layer that learns patterns from extracted features.
Makes the final classification/prediction.
8️⃣ Output Layer (Softmax/Sigmoid)
Converts final output into probabilities (e.g., dog = 90%, cat = 10%).
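A tiny sketch of how softmax produces such probabilities (the logit values here are made up for illustration):

```python
import numpy as np

logits = np.array([2.2, 0.0])                    # raw scores for [dog, cat]
probs = np.exp(logits) / np.exp(logits).sum()    # softmax
print(probs)  # ~[0.90, 0.10] -> dog = 90%, cat = 10%
```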
🔹 Backpropagation & Training
The CNN compares predictions with actual labels.
Errors are sent back (backpropagation) to adjust filters & weights.
Model learns over time to improve accuracy.
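In Keras, compile() and fit() run this loop for you. A minimal sketch, reusing the model from the architecture sketch above with dummy data:

```python
import numpy as np

x_train = np.random.rand(100, 32, 32, 3)       # dummy images
y_train = np.random.randint(0, 10, size=100)   # dummy integer labels

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# fit() does the forward passes, compares predictions with labels,
# and backpropagates the error to update filters and weights.
model.fit(x_train, y_train, epochs=3, batch_size=16)
```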
How to overcome overfitting in CNN?
Deep learning models, especially Convolutional Neural Networks (CNNs), are particularly susceptible to overfitting because of their high capacity: they can learn (and memorize) very detailed patterns in large-scale data. Common remedies are listed below; a short sketch combining several of them follows the list.
Dropout: Randomly drops some neurons during training, which forces the remaining neurons to learn more robust features from the input data.
Batch normalization: Reduces overfitting to some extent by normalizing a layer's inputs (adjusting and scaling the activations). It is also used to speed up and stabilize the training process.
Pooling Layers: Reduce the spatial dimensions of the feature maps, giving the model a more abstract representation and hence reducing the chance of overfitting.
Early stopping: Consistently monitor the model's performance on validation data during training and stop as soon as the validation error no longer improves.
Noise injection: Adding noise to the inputs or to the outputs of hidden layers during training makes the model more robust and prevents weak generalization.
L1 and L2 regularization: Both add a penalty to the loss function based on the size of the weights. L1 encourages the weights to be sparse, leading to a form of feature selection; L2 (also called weight decay) encourages the weights to be small, preventing any single weight from having too much influence on the predictions.
Data augmentation: This is the process of artificially increasing the size and diversity of the training dataset by applying random transformations like rotation, scaling, flipping, or cropping to the input images.
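A short Keras sketch combining several of these remedies (all sizes, rates, and penalties here are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.RandomFlip("horizontal"),               # data augmentation
    layers.RandomRotation(0.1),                    # data augmentation
    layers.Conv2D(32, (3, 3), padding="same",
                  kernel_regularizer=regularizers.l2(1e-4)),  # L2 penalty
    layers.BatchNormalization(),                   # batch normalization
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),                   # pooling
    layers.Flatten(),
    layers.Dropout(0.5),                           # dropout
    layers.Dense(10, activation="softmax"),
])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3)                # early stopping
# model.fit(..., validation_data=..., callbacks=[early_stop])
```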