Lecture 7 - CNNs Flashcards

1
Q

What is image classification in machine learning?

A
  • Image classification involves analyzing an image (represented as pixel data with color values) to assign it a label.
  • Example: determining whether a picture contains a cat or not.
2
Q

What is the typical data structure for an image in image classification?

A
  • Each image is represented as a 3D array
  • dimensions height × width × number of color channels
  • e.g., a 64×64 image with 3 color channels results in 64×64×3 = 12,288 datapoints per image
3
Q

What is object detection, and how does it differ from image classification?

A
  • Object detection involves identifying objects within an image, determining how many objects are present, and specifying their locations.
  • Image classification, in contrast, only assigns a label to the whole image; it does not count or locate objects.
4
Q

What is neural style transfer?

A

Neural style transfer combines a source image with a style from another image, transferring the style onto the source image while preserving its content.

5
Q

How do larger images affect normal neural network design?

A

Larger images have more datapoints, leading to a greater number of weights in the network. This increases computational complexity and memory requirements.

6
Q

Why are regular neural networks not ideal for image classification tasks with large images?

A
  • Regular neural networks require a very high number of weights for large images, making them computationally infeasible.
  • Specialized architectures, like convolutional neural networks (CNNs), are used to manage this complexity.
7
Q

What is edge detection in image processing?

A
  • Edge detection identifies areas in an image where the intensity (brightness) changes sharply.
  • These areas often represent object boundaries or transitions from one color to another.
8
Q

How does a filter (kernel) work in edge detection?

A
  • A filter scans across the image by sliding over the input grid and performs a convolution operation to compute an output.
  • This operation detects changes in intensity, indicating edges.
9
Q

If you apply a 3×3 filter on a 6×6 image, what will be the size of the output?

A
  • The output will be a 4×4 matrix
  • applying an f×f filter reduces each output dimension by f − 1: the output is (n − f + 1) × (n − f + 1) = (6 − 3 + 1) × (6 − 3 + 1) = 4×4
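A minimal NumPy sketch of this operation (illustrative only; real frameworks use optimized routines):

    import numpy as np

    def conv2d(image, kernel):
        """'Valid' convolution as used in CNNs (strictly speaking, cross-correlation)."""
        n, _ = image.shape
        f, _ = kernel.shape
        out = np.zeros((n - f + 1, n - f + 1))
        for i in range(n - f + 1):
            for j in range(n - f + 1):
                # element-wise multiply the patch with the filter, then sum
                out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
        return out

    image = np.zeros((6, 6))
    image[:, :3] = 10                          # bright left half, dark right half
    vertical_edge = np.array([[1, 0, -1],
                              [1, 0, -1],
                              [1, 0, -1]])
    print(conv2d(image, vertical_edge).shape)  # (4, 4): 6 - 3 + 1 = 4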
10
Q

Does the orientation of the input image affect the result of edge detection?

A

No, flipping the input image does not change the filter’s ability to detect edges; the convolution still responds to the same intensity transitions (only the sign of the response flips).

11
Q

What is the mathematical operation performed during convolution?

A

The filter and a corresponding section of the input image are multiplied element-wise, and the resulting values are summed to produce one output value.

12
Q

What is the primary purpose of using convolution in edge detection?

A

Convolution helps extract meaningful patterns, such as edges, from the input image, facilitating feature extraction in downstream computer vision tasks.

13
Q

What does a vertical filter detect in an image?

A
  • A vertical filter detects vertical edges by emphasizing intensity differences between the left and right sides of the filter.
  • If one side is much brighter, an edge is detected.
14
Q

How does a horizontal filter detect edges?

A
  • A horizontal filter detects horizontal edges by comparing brightness between the top and bottom parts of the filter.
  • It is the transposed version of a vertical filter.
15
Q

What is a Sobel filter, and why is it useful?

A
  • A Sobel filter is an advanced edge detection filter that gives more importance to the center of the image section being analyzed.
  • It works well for detecting faint edges.
16
Q

What is the purpose of a Scharr filter?

A

A Scharr filter is a fine-tuned version of the Sobel filter that detects edges with even greater precision.
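For reference, the standard vertical Sobel and Scharr kernels (the horizontal versions are their transposes; sign conventions vary between sources):

    import numpy as np

    # Both kernels weight the centre row more heavily than a plain edge filter.
    sobel_vertical = np.array([[ 1, 0,  -1],
                               [ 2, 0,  -2],
                               [ 1, 0,  -1]])
    scharr_vertical = np.array([[  3, 0,  -3],
                                [ 10, 0, -10],
                                [  3, 0,  -3]])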

17
Q

What are the two main advantages of using convolution in image processing?

A
  1. Parameter/Weight sharing: The filter size is fixed, reducing the number of weights significantly.
  2. Local information: Convolution captures local patterns by taking into account the spatial relationship of neighboring pixels.
18
Q

Why is the filter size considered a hyperparameter in convolutional neural networks?

A
  • The filter size determines the receptive field and affects the output size.
  • It is typically chosen as an odd number (e.g., 3×3 or 5×5) to ensure proper centering.
19
Q

What is the purpose of padding in convolutional neural networks?

A

Padding prevents the output from shrinking after each convolution, enabling the network to go deeper while preserving the original image size.

20
Q

How does padding help in edge detection tasks?

A

Padding ensures that pixels on the borders of the image are used as frequently as those in the center, allowing the network to accurately detect information near the edges.

21
Q

How is padding typically applied to an image?

A

Padding adds extra rows and columns (usually filled with zeros) around the original image to maintain the desired output size.

22
Q

What is the difference in output size between convolution with and without padding?

A
  1. Without padding: (n − f + 1) × (n − f + 1)
  2. With padding: (n + 2p − f + 1) × (n + 2p − f + 1)
  • The output size remains the same as the input if the padding p is chosen appropriately.
23
Q

What are the two main benefits of using padding in CNNs?

A
  1. It allows the filter to operate at the edges, ensuring that all pixels are considered equally.
  2. It maintains the image size after convolution, making it easier to design deeper networks.
24
Q

How is the required padding size determined for padding?

A
  • p = (f − 1) / 2
  • f = filter size
  • this gives an n×n output, the same size as the input
  • This formula works best when the filter size is an odd number, ensuring an integer padding value.
25
Q

What is the primary output difference between valid convolution and same convolution?

A
  1. Valid convolution (no padding): The output size is smaller than the input.
  2. Same convolution (zero padding): The output size is the same as the input due to zero padding.
26
Q

What is stride in the context of convolutional neural networks?

A
  • Stride refers to how much the filter moves across the input image during convolution.
  • It determines the step size of the filter movement both horizontally and vertically.
27
Q

How does stride affect computation and time in convolutional neural networks?

A
  • Increasing the stride reduces the number of computations by skipping pixels, which saves time and computational power.
  • This is particularly effective for high-resolution images.
28
Q

What happens when the stride is set to 1 in convolution?

A

When the stride is 1, the filter moves one pixel at a time in both horizontal and vertical directions, resulting in maximum overlap between adjacent filter positions.

29
Q

How is the output size calculated when using stride in convolution?

A
  • The output size is determined by the formula:
  • ⌊(n + 2p − f) / s⌋ + 1, where the floor accounts for positions where the filter no longer fits
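A one-line helper capturing this formula:

    def conv_output_size(n, f, p=0, s=1):
        """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
        return (n + 2 * p - f) // s + 1

    print(conv_output_size(6, 3))            # 4: valid convolution (card 9)
    print(conv_output_size(6, 3, p=1))       # 6: 'same' convolution, p = (f - 1) / 2
    print(conv_output_size(7, 3, s=2))       # 3: stride 2 skips positions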
30
Q

Why are 3×3 filters commonly used in convolutional neural networks?

A
  1. They are the smallest possible filters that can capture information in four directions (up, down, left, right).
  2. Stacking multiple smaller filters can achieve the same effect as using a larger filter (e.g., 5×5 or 7×7).
31
Q

What are the benefits of using multiple smaller filters instead of a single large filter?

A
  1. Smaller filters lead to deeper networks with more non-linearities, which improves the network’s ability to learn complex patterns.
  2. They result in fewer parameters, helping with regularization and reducing overfitting.
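A back-of-the-envelope comparison (assuming, for illustration, C = 64 channels in and out, biases ignored): two stacked 3×3 layers cover the same 5×5 receptive field with fewer parameters:

    C = 64
    params_5x5 = 5 * 5 * C * C                 # one 5x5 layer: 102,400 weights
    params_3x3_stacked = 2 * (3 * 3 * C * C)   # two 3x3 layers: 73,728 weights
    print(params_5x5, params_3x3_stacked)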
32
Q

How are RGB images represented in convolutional neural networks?

A
  • RGB images are represented as three input channels: red, green, and blue.
  • Each pixel’s color is determined by its RGB values.
33
Q

Why can’t RGB channels be treated separately in convolutional operations?

A

The RGB channels are highly correlated, so they need to be processed together to capture meaningful features across all color channels.

34
Q

What happens when a filter is applied to an RGB image?

A

A filter (e.g., 3×3×3) is applied across the spatial dimensions of the RGB image, performing a dot product between the filter weights and the overlapping region of the input to produce a single output value.

35
Q

What is the result of applying a convolutional filter on an RGB image?

A

The resulting feature map is a 2D grid (e.g., 4×4 in size, depending on the input size and filter parameters) that summarizes the detected features.

36
Q

How does the convolution operation proceed across an RGB image?

A

The filter slides across the image spatially, performing the dot product at each position to compute the corresponding output value in the feature map.

37
Q

What happens when multiple filters are applied in a convolutional neural network?

A
  • Multiple filters produce a set of output feature maps.
  • These output feature maps collectively form a new 3D tensor, which is the input for the next layer.
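A naive sketch of a full convolutional layer over an RGB input, tying together the last few cards: each f×f×C filter produces one 2D feature map via a dot product over the full volume, and stacking K filters yields a 3D output tensor:

    import numpy as np

    def conv_layer(image, filters):
        """image: (H, W, C); filters: (K, f, f, C) -> output (H-f+1, W-f+1, K)."""
        H, W, C = image.shape
        K, f, _, _ = filters.shape
        out = np.zeros((H - f + 1, W - f + 1, K))
        for k in range(K):                       # one feature map per filter
            for i in range(H - f + 1):
                for j in range(W - f + 1):
                    # dot product over the whole f x f x C volume -> one number
                    out[i, j, k] = np.sum(image[i:i+f, j:j+f, :] * filters[k])
        return out

    rgb = np.random.rand(6, 6, 3)
    filters = np.random.rand(10, 3, 3, 3)        # ten 3x3x3 filters
    print(conv_layer(rgb, filters).shape)        # (4, 4, 10)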
38
Q

What does a convolutional neural network learn during training?

A

The network learns the weights of the filters through backpropagation, enabling it to detect various features in the input data.

39
Q

How do the spatial dimensions of the image and the number of filters change in a CNN as the network gets deeper?

A
  • The spatial dimensions of the image decrease over time
  • The number of filters (and thus the depth of the output tensor) increases.
40
Q

What is a 1×1 output in CNNs, and how is it used?

A

A 1×1 output is a feature vector representing high-level features of the input image. It can be fed into fully connected layers for tasks like classification or regression.

41
Q

Why are fully connected layers used after convolutional layers?

A

Fully connected layers perform the final classification or regression by mapping the high-level feature vector to the desired output (e.g., class probabilities or bounding boxes).

42
Q

How is the parameter count for filters in a CNN calculated?

A
  • The parameter count for a filter is its number of weights plus one bias parameter

e.g., for a 3×3×3 filter:

  • parameters per filter: 3 × 3 × 3 + 1 = 28
  • for 10 filters: 28 × 10 = 280
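The same computation in code:

    f, c_in, num_filters = 3, 3, 10
    params_per_filter = f * f * c_in + 1       # weights plus one bias
    print(params_per_filter)                   # 28
    print(params_per_filter * num_filters)     # 280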
43
Q

What is size independence?

A
  1. The number of parameters in a CNN is independent of the input image dimensions.
  2. It depends only on the filter sizes and the number of filters, not on the input image size.
44
Q

What are the three main types of layers in a convolutional neural network (CNN)?

A
  1. Convolutional layer (CONV): Applies filters to extract features from the input.
  2. Pooling layer (POOL): Reduces the spatial dimensions of the feature maps.
  3. Fully connected layer (FC): Connects all neurons to produce the final output for classification or regression.
45
Q

What is max pooling, and how does it work?

A
  • Max pooling is the most common form of pooling.
  • It summarizes data by taking the maximum value in each region of the feature map
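A minimal max-pooling sketch (2×2 regions with stride 2, the most common setting):

    import numpy as np

    def max_pool(fmap, size=2, stride=2):
        """Keep the maximum value of each (size x size) region."""
        h, w = fmap.shape
        out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = fmap[i*stride:i*stride+size,
                                 j*stride:j*stride+size].max()
        return out

    print(max_pool(np.arange(16).reshape(4, 4)))  # 4x4 -> 2x2: [[5, 7], [13, 15]]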
46
Q

How does max pooling help in a CNN?

A
  • Reduces the size of feature maps
  • Helps control overfitting by reducing parameters
  • Preserves the most important features (max values) in each region.
47
Q

What is average pooling, and how is it different from max pooling?

A

Average pooling summarizes data by taking the average value in each region instead of the maximum value, resulting in a smoother representation of the feature map.

48
Q

Why is pooling used in CNNs?

A

Pooling reduces the spatial dimensions of feature maps and decreases computational complexity

49
Q

Why do convolutions work?

A
  1. Parameter sharing
  2. Sparsity of connections
50
Q

Why does parameter sharing make convolutional layers effective?

A

A feature detector (such as a vertical edge detector) that is useful in one part of the image is probably useful in another part of the image, so the same filter weights can be reused across all positions.

51
Q

What is sparsity of connections in a convolutional layer, and why is it beneficial?

A
  • Sparsity of connections means each output value depends only on a small number of inputs (the receptive field).
  • This reduces the complexity and increases the computational efficiency of the network.
52
Q

How did CNNs prove their effectiveness in the ImageNet Challenge?

A

In the ImageNet Challenge, CNNs decisively outperformed traditional hand-engineered computer-vision pipelines, and classification error kept dropping as architectures became deeper.

53
Q

What is LeNet-5, and why is it significant?

A
  • LeNet-5 was the first deep CNN to achieve significant results in image recognition
  • It processed grayscale images and had the following layers:
    CONV → POOL → CONV → POOL → FC → FC → FC → SOFTMAX
54
Q

What is the significance of fully connected layers in LeNet-5?

A

Most of the parameters in LeNet-5 are in the fully connected layers, which play a crucial role in classification by combining extracted features.

55
Q

What is AlexNet, and how did it improve upon LeNet-5?

A
  • AlexNet was a deeper network with multiple convolutional and max-pooling layers, followed by fully connected layers.
  • It used ReLU activation and dropout to improve performance and reduce overfitting.
56
Q

How does dropping fully connected layers affect AlexNet’s performance?

A
  • The bulk of the parameters is in the fully connected layers
  • Dropping fully connected layers reduces parameters significantly, with a small performance drop.
  • Dropping the other layers causes a smaller performance drop, but also removes far fewer parameters
57
Q

What is VGG, and what was its key architectural innovation?

A
  • used a simple and uniform design of 3×3 convolutional filters throughout the network.
  • Using small 3×3 filters allowed VGG to increase depth while maintaining computational efficiency, leading to better accuracy.
  • However, making networks deeper introduces the vanishing gradient problem
58
Q

What challenge did ResNet address, and how?

A

ResNet addressed the vanishing gradient problem by introducing residual blocks with skip connections, allowing very deep networks (e.g., 152 layers) to be trained effectively.

59
Q

How do residual blocks in ResNet work?

A

Residual blocks add a skip connection that passes a layer’s input both to the next layer and directly to a layer further ahead, enabling gradients to flow through this identity path and preventing them from vanishing.

60
Q

What was the key idea behind Inception (GoogLeNet) networks?

A
  • Inception networks used multiple filter sizes (1×1, 3×3, 5×5) in parallel within each layer to capture features at different scales.
  • They also used 1×1 convolutions to reduce dimensionality and computational cost.
61
Q

How did 1×1 convolutions improve computational efficiency in Inception networks?

A

1×1 convolutions acted as bottleneck layers, reducing the number of channels before applying larger filters, significantly decreasing the number of multiplications required.

62
Q

What are the three main components of object detection?

A
  1. Classification: Determining what the object is.
  2. Localization: Identifying where the object is in the image.
  3. Detecting multiple objects: Recognizing and localizing multiple objects within an image.
63
Q

What is object localization, and what does it predict?

A

Object localization predicts the location of a single object in an image:

  • It predicts the position and dimensions of the bounding box enclosing the object.
64
Q

How are bounding box coordinates represented in object localization?

A
  • (b_x, b_y, b_h, b_w)
  • denote the center coordinates, height, and width of the bounding box
65
Q

How is the output for object localization structured for a classification task?

A
  1. A binary value indicating whether there is an object (1) or not (0).
  2. Bounding box coordinates if an object is present.
  3. Class label of the object (e.g., 010 for car).
66
Q

How is the cost function adjusted in object localization?

A
  • Adjusted so that it combines the prediction error for class labels with the localization error for bounding box coordinates
  • i.e., the cost function differs depending on whether an object is present (the 1/0 in the first output position), as sketched below
  • if an object is present, the error is the sum of squared differences over all output components
  • if no object is present, only the classification (objectness) error on the first component is considered
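A minimal sketch of this case split, assuming a squared-error formulation and an output vector laid out as [p_object, b_x, b_y, b_h, b_w, class scores...]:

    import numpy as np

    def localization_loss(y, y_hat):
        if y[0] == 1:                     # object present: penalise every component
            return np.sum((y - y_hat) ** 2)
        return (y[0] - y_hat[0]) ** 2     # no object: only the objectness term counts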
67
Q

How is object detection achieved using a CNN trained for object localization?

A

A rectangle is slid over the image at different scales using a CNN trained for localization, and predictions are made for each region to detect objects.

68
Q

How can fully connected layers be replaced in object detection models?

A

Fully connected layers can be replaced by 1×1 convolutional layers, enabling operations to handle spatial information throughout the network using convolution.

69
Q

How does using fully convolutional networks improve object detection?

A

Fully convolutional networks allow the detection process to be done in a single pass by adapting convolutional operations to handle both localization and classification in one step.

70
Q

What is a common problem in object detection related to bounding boxes?

A

Several bounding boxes may detect the same object, leading to redundant detections.

71
Q

What is the Intersection over Union (IoU) algorithm used for in object detection?

A

IoU evaluates how well the predicted bounding box matches the ground truth by comparing the ratio of the overlapping area (intersection) to the combined area of both boxes (union).

72
Q

How is IoU calculated?

A

IoU = [area of intersection] / [area of union]
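A sketch using corner coordinates (x1, y1, x2, y2) rather than the centre/size format of card 64, since the overlap is easier to compute that way:

    def iou(box_a, box_b):
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # 0 when boxes do not overlap
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143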

73
Q

What happens if the predicted bounding box and the ground truth do not intersect?

A

The IoU value will be 0, indicating a completely incorrect prediction.

74
Q

How does IoU penalize large bounding boxes?

A

Large bounding boxes increase the union area without significantly increasing the intersection area, resulting in a lower IoU value, which correctly penalizes imprecise detections.

75
Q

When does IoU have a high value?

A
  1. when the intersection area is large
  2. when the union is not much larger than the intersection (i.e., the predicted box is not much bigger than the ground-truth box)
  • this indicates a good match between the predicted and ground-truth bounding boxes.
76
Q

What is the purpose of the Non-Max Suppression (NMS) algorithm in object detection?

A

NMS is a post-processing step used to remove redundant or overlapping bounding boxes for the same object while retaining only the box with the highest confidence score.

77
Q

How does the Non-Max Suppression (NMS) algorithm work?

A
  1. Select the bounding box with the highest confidence score.
  2. Calculate the IoU with other boxes.
  3. Discard boxes with IoU greater than a threshold (e.g., 0.5).
  4. Repeat until only unique, high-confidence detections remain.
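A greedy sketch of these steps, assuming an iou(box_a, box_b) helper like the one in card 72:

    def non_max_suppression(boxes, scores, iou_threshold=0.5):
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        keep = []
        while order:
            best = order.pop(0)            # highest remaining confidence score
            keep.append(best)
            # discard remaining boxes that overlap the kept box too much
            order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        return keep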
78
Q

What problem does the Anchor Box algorithm solve in object detection?

A

The Anchor Box algorithm helps in detecting overlapping or closely packed objects by pre-defining a set of bounding boxes at different scales and aspect ratios.

79
Q

How does the Anchor Box algorithm work?

A

The output vector is repeated once per anchor box, so each location can predict several objects; every object is assigned to the anchor box whose shape best matches its bounding box.

  • The number of anchor boxes therefore sets an upper limit on how many overlapping objects can be detected at the same location.
80
Q

What is the difference between face verification and face recognition?

A
  • Face verification: One-to-one comparison to check if an input image matches a claimed identity (e.g., logging into a phone using face recognition).
  • Face recognition: One-to-many comparison to identify an individual from a database (e.g., identifying someone in a surveillance video).
81
Q

What is the input and output for face verification?

A
  • Input: An image and a claimed identity (name or ID).
  • Output: A binary result (yes/no) indicating whether the input image matches the claimed identity.
82
Q

What is the input and output for face recognition?

A

  • Input: An image.

  • Output: The ID of the matched individual from the database, or “not recognized” if no match is found.
83
Q

What is the main challenge in face recognition systems?

A

The main challenge is one-shot learning, where the system must generalize from minimal data and recognize faces with high accuracy.

84
Q

How is one-shot learning applied in face recognition?

A

One-shot learning trains a network to learn a similarity function, which compares two face images and outputs a similarity score indicating whether the faces belong to the same person.

85
Q

What type of network is commonly used to implement the similarity function in one-shot learning for face recognition?

A

A Siamese network is commonly used, which compares two input images and learns to group similar images together based on a distance function.

86
Q

What is a Siamese network, and how does it work?

A
  • A Siamese network consists of two identical sub-networks that process two input images and output feature vectors (encodings).
  • The similarity between images is determined by comparing these encodings using a distance metric (e.g., Euclidean distance or cosine similarity).
87
Q

What is the key idea behind using a Siamese network for face recognition?

A

Instead of training a separate classifier for each class, a Siamese network learns a similarity function that can compare any pair of images and determine whether they belong to the same class.

88
Q

What is the triplet loss function, and how is it used in training a Siamese network?

A

The triplet loss function is used to ensure that the distance between an anchor and a positive example is smaller than the distance between the anchor and a negative example by a margin α.

89
Q

What are the three components of a triplet used in the triplet loss function?

A
  1. Anchor (A): Reference image.
  2. Positive (P): Image of the same class as the anchor.
  3. Negative (N): Image of a different class from the anchor.
90
Q

Why is a margin (α) added in the triplet loss function?

A

The margin ensures that the network does not learn trivial solutions where the distances are zero by requiring a minimum difference between positive and negative distances.

91
Q

How is the triplet loss function formulated?

A

L(A, P, N) = max(‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α, 0)

  • f(A), f(P), and f(N) are the encodings for the anchor, positive, and negative examples.
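The same loss as a NumPy sketch (α = 0.2 is just an illustrative margin):

    import numpy as np

    def triplet_loss(f_a, f_p, f_n, alpha=0.2):
        d_pos = np.sum((f_a - f_p) ** 2)   # anchor-positive squared distance
        d_neg = np.sum((f_a - f_n) ** 2)   # anchor-negative squared distance
        return max(d_pos - d_neg + alpha, 0.0)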
92
Q

What is Neural Style Transfer (NST)?

A

NST is a technique that uses neural networks to combine the content of one image with the artistic style of another image to create a new, stylized image.

93
Q

What are the key components of Neural Style Transfer?

A
  1. Content image (C): Provides the structure or layout (e.g., buildings, objects).
  2. Style image (S): Provides the artistic feel or texture (e.g., brush strokes, colors).
  3. Generated image (G): Combines the structure of the content image with the style of the style image.
94
Q

How does Neural Style Transfer visualize content and style features at different layers of a CNN?

A
  1. Early layers detect simple features like edges and colors.
  2. Later layers detect complex patterns, parts of objects, and full objects.
  • For content representation, features are sampled from later layers of a pretrained CNN.
95
Q

What is the goal of the cost function J(G) in Neural Style Transfer?

A

The goal is to minimize the cost function J(G), which combines two losses:

  1. Content loss: Ensures the generated image has a similar structure to the content image.
  2. Style loss: Ensures the generated image has a similar texture to the style image.
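In the usual formulation, the two losses are weighted and summed (the weights α and β are discussed in card 99):

J(G) = α · J_content(C, G) + β · J_style(S, G)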
96
Q

How is the content loss J_content(C,G) defined?

A

As the similarity (e.g., squared difference) between the activations of the generated image G and the content image C at a chosen layer l of a pretrained CNN.

97
Q

How is the style loss J_style(S, G) defined?

A

As the similarity between the activations of the generated image G and the style image S, measured by comparing the Gram matrices of their feature maps.

98
Q

What is the role of the Gram matrix in Neural Style Transfer?

A
  • The Gram matrix captures correlations between different channels (feature maps) and helps represent the style of an image by measuring how different features co-occur.
  • i.e., we match the correlation structure across the different channels of the generated image to that of the style image, transferring the style into the generated image.
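A minimal sketch of the Gram matrix computation for a (H, W, C) activation volume:

    import numpy as np

    def gram_matrix(feature_maps):
        H, W, C = feature_maps.shape
        F = feature_maps.reshape(H * W, C)   # one column of activations per channel
        return F.T @ F                       # (C, C) channel-correlation matrix

    print(gram_matrix(np.random.rand(4, 4, 3)).shape)  # (3, 3)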
99
Q

How do the weights α and β in the cost function J(G) affect the generated image?

A
  • They control the relative importance of content and style similarity.
  • Adjusting these weights changes the balance between preserving the structure of the content and applying the style.