Lecture 7 - CNNs Flashcards

1
Q

What is image classification in machine learning?

A
  • Image classification involves analyzing an image (represented as pixel data with color values) to assign it a label.
  • Example: determining whether a picture contains a cat or not.
2
Q

What is the typical data structure for an image in image classification?

A
  • Each image is represented as a 3D array
  • dimensions height × width × number of color channels
  • e.g., a 64×64 image with 3 color channels results in 64×64×3 = 12,288 datapoints per image
3
Q

What is object detection, and how does it differ from image classification?

A
  • Object detection involves identifying objects within an image, determining how many objects are present, and specifying their locations.
  • Image classification, in contrast, only assigns a label to the whole image; it does not count or locate objects.
4
Q

What is neural style transfer?

A

Neural style transfer combines a source image with a style from another image, transferring the style onto the source image while preserving its content.

5
Q

How do larger images affect normal neural network design?

A

Larger images have more datapoints, leading to a greater number of weights in the network. This increases computational complexity and memory requirements.

6
Q

Why are regular neural networks not ideal for image classification tasks with large images?

A
  • Regular neural networks require a very high number of weights for large images, making them computationally infeasible.
  • Specialized architectures, like convolutional neural networks (CNNs), are used to manage this complexity.
7
Q

What is edge detection in image processing?

A
  • Edge detection identifies areas in an image where the intensity (brightness) changes sharply.
  • These areas often represent object boundaries or transitions from one color to another.
8
Q

How does a filter (kernel) work in edge detection?

A
  • A filter scans across the image by sliding over the input grid and performs a convolution operation to compute an output.
  • This operation detects changes in intensity, indicating edges.
9
Q

If you apply a 3×3 filter on a 6×6 image, what will be the size of the output?

A
  • The output will be a 4×4 matrix
  • applying an f×f filter reduces each output dimension by f − 1: the output is (n − f + 1) × (n − f + 1) = (6 − 3 + 1) × (6 − 3 + 1) = 4×4
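A minimal NumPy sketch of this operation (illustrative only; real frameworks use optimized routines):

    import numpy as np

    def conv2d(image, kernel):
        """'Valid' convolution as used in CNNs (strictly speaking, cross-correlation)."""
        n, _ = image.shape
        f, _ = kernel.shape
        out = np.zeros((n - f + 1, n - f + 1))
        for i in range(n - f + 1):
            for j in range(n - f + 1):
                # element-wise multiply the patch with the filter, then sum
                out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
        return out

    image = np.zeros((6, 6))
    image[:, :3] = 10                          # bright left half, dark right half
    vertical_edge = np.array([[1, 0, -1],
                              [1, 0, -1],
                              [1, 0, -1]])
    print(conv2d(image, vertical_edge).shape)  # (4, 4): 6 - 3 + 1 = 4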
10
Q

Does the orientation of the input image affect the result of edge detection?

A

No, flipping the input image does not change the filter’s ability to detect edges; the convolution still responds to the same intensity transitions (only the sign of the response flips).

11
Q

What is the mathematical operation performed during convolution?

A

The filter and a corresponding section of the input image are multiplied element-wise, and the resulting values are summed to produce one output value.

12
Q

What is the primary purpose of using convolution in edge detection?

A

Convolution helps extract meaningful patterns, such as edges, from the input image, facilitating feature extraction in downstream computer vision tasks.

13
Q

What does a vertical filter detect in an image?

A
  • A vertical filter detects vertical edges by emphasizing intensity differences between the left and right sides of the filter.
  • If one side is much brighter, an edge is detected.
14
Q

How does a horizontal filter detect edges?

A
  • A horizontal filter detects horizontal edges by comparing brightness between the top and bottom parts of the filter.
  • It is the transposed version of a vertical filter.
15
Q

What is a Sobel filter, and why is it useful?

A
  • A Sobel filter is an advanced edge detection filter that gives more importance to the center of the image section being analyzed.
  • It works well for detecting faint edges.
16
Q

What is the purpose of a Scharr filter?

A

A Scharr filter is a fine-tuned version of the Sobel filter that detects edges with even greater precision.
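For reference, the standard vertical Sobel and Scharr kernels (the horizontal versions are their transposes; sign conventions vary between sources):

    import numpy as np

    # Both kernels weight the centre row more heavily than a plain edge filter.
    sobel_vertical = np.array([[ 1, 0,  -1],
                               [ 2, 0,  -2],
                               [ 1, 0,  -1]])
    scharr_vertical = np.array([[  3, 0,  -3],
                                [ 10, 0, -10],
                                [  3, 0,  -3]])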

17
Q

What are the two main advantages of using convolution in image processing?

A
  1. Parameter/Weight sharing: The filter size is fixed, reducing the number of weights significantly.
  2. Local information: Convolution captures local patterns by taking into account the spatial relationship of neighboring pixels.
18
Q

Why is the filter size considered a hyperparameter in convolutional neural networks?

A
  • The filter size determines the receptive field and affects the output size.
  • It is typically chosen as an odd number (e.g., 3×3 or 5×5) to ensure proper centering.
19
Q

What is the purpose of padding in convolutional neural networks?

A

Padding prevents the output from shrinking after each convolution, enabling the network to go deeper while preserving the original image size.

20
Q

How does padding help in edge detection tasks?

A

Padding ensures that pixels on the borders of the image are used as frequently as those in the center, allowing the network to accurately detect information near the edges.

21
Q

How is padding typically applied to an image?

A

Padding adds extra rows and columns (usually filled with zeros) around the original image to maintain the desired output size.

22
Q

What is the difference in output size between convolution with and without padding?

A
  1. Without padding: (n − f + 1) × (n − f + 1)
  2. With padding: (n + 2p − f + 1) × (n + 2p − f + 1)
  • The output size remains the same as the input if the padding p is chosen appropriately.
23
Q

What are the two main benefits of using padding in CNNs?

A
  1. It allows the filter to operate at the edges, ensuring that all pixels are considered equally.
  2. It maintains the image size after convolution, making it easier to design deeper networks.
24
Q

How is the required padding size determined for padding?

A
  • p = (f − 1) / 2
  • f = filter size
  • this gives an n×n output, the same size as the input
  • This formula works best when the filter size is an odd number, ensuring an integer padding value.
25
Q

What is the primary output difference between valid convolution and same convolution?

A
  1. Valid convolution (no padding): The output size is smaller than the input.
  2. Same convolution (zero padding): The output size is the same as the input due to zero padding.
26
Q

What is stride in the context of convolutional neural networks?

A
  • Stride refers to how much the filter moves across the input image during convolution.
  • It determines the step size of the filter movement both horizontally and vertically.
27
Q

How does stride affect computation and time in convolutional neural networks?

A
  • Increasing the stride reduces the number of computations by skipping pixels, which saves time and computational power.
  • This is particularly effective for high-resolution images.
28
Q

What happens when the stride is set to 1 in convolution?

A

When the stride is 1, the filter moves one pixel at a time in both horizontal and vertical directions, resulting in maximum overlap between adjacent filter positions.

29
Q

How is the output size calculated when using stride in convolution?

A
  • The output size is determined by the formula:
  • ⌊(n + 2p − f) / s⌋ + 1, where the floor accounts for positions where the filter no longer fits
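A one-line helper capturing this formula:

    def conv_output_size(n, f, p=0, s=1):
        """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
        return (n + 2 * p - f) // s + 1

    print(conv_output_size(6, 3))            # 4: valid convolution (card 9)
    print(conv_output_size(6, 3, p=1))       # 6: 'same' convolution, p = (f - 1) / 2
    print(conv_output_size(7, 3, s=2))       # 3: stride 2 skips positions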
30
Q

Why are 3×3 filters commonly used in convolutional neural networks?

A
  1. They are the smallest possible filters that can capture information in four directions (up, down, left, right).
  2. Stacking multiple smaller filters can achieve the same effect as using a larger filter (e.g., 5×5 or 7×7).
31
Q

What are the benefits of using multiple smaller filters instead of a single large filter?

A
  1. Smaller filters lead to deeper networks with more non-linearities, which improves the network’s ability to learn complex patterns.
  2. They result in fewer parameters, helping with regularization and reducing overfitting.
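A back-of-the-envelope comparison (assuming, for illustration, C = 64 channels in and out, biases ignored): two stacked 3×3 layers cover the same 5×5 receptive field with fewer parameters:

    C = 64
    params_5x5 = 5 * 5 * C * C                 # one 5x5 layer: 102,400 weights
    params_3x3_stacked = 2 * (3 * 3 * C * C)   # two 3x3 layers: 73,728 weights
    print(params_5x5, params_3x3_stacked)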
32
Q

How are RGB images represented in convolutional neural networks?

A
  • RGB images are represented as three input channels: red, green, and blue.
  • Each pixel’s color is determined by its RGB values.
33
Q

Why can’t RGB channels be treated separately in convolutional operations?

A

The RGB channels are highly correlated, so they need to be processed together to capture meaningful features across all color channels.

34
Q

What happens when a filter is applied to an RGB image?

A

A filter (e.g., 3×3×3) is applied across the spatial dimensions of the RGB image, performing a dot product between the filter weights and the overlapping region of the input to produce a single output value.

35
Q

What is the result of applying a convolutional filter on an RGB image?

A

The resulting feature map is a 2D grid (e.g., 4×4 in size, depending on the input size and filter parameters) that summarizes the detected features.

36
Q

How does the convolution operation proceed across an RGB image?

A

The filter slides across the image spatially, performing the dot product at each position to compute the corresponding output value in the feature map.

37
Q

What happens when multiple filters are applied in a convolutional neural network?

A
  • Multiple filters produce a set of output feature maps.
  • These output feature maps collectively form a new 3D tensor, which is the input for the next layer.
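A naive sketch of a full convolutional layer over an RGB input, tying together the last few cards: each f×f×C filter produces one 2D feature map via a dot product over the full volume, and stacking K filters yields a 3D output tensor:

    import numpy as np

    def conv_layer(image, filters):
        """image: (H, W, C); filters: (K, f, f, C) -> output (H-f+1, W-f+1, K)."""
        H, W, C = image.shape
        K, f, _, _ = filters.shape
        out = np.zeros((H - f + 1, W - f + 1, K))
        for k in range(K):                       # one feature map per filter
            for i in range(H - f + 1):
                for j in range(W - f + 1):
                    # dot product over the whole f x f x C volume -> one number
                    out[i, j, k] = np.sum(image[i:i+f, j:j+f, :] * filters[k])
        return out

    rgb = np.random.rand(6, 6, 3)
    filters = np.random.rand(10, 3, 3, 3)        # ten 3x3x3 filters
    print(conv_layer(rgb, filters).shape)        # (4, 4, 10)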
38
Q

What does a convolutional neural network learn during training?

A

The network learns the weights of the filters through backpropagation, enabling it to detect various features in the input data.

39
Q

How do the spatial dimensions of the image and the number of filters change in a CNN as the network gets deeper?

A
  • The spatial dimensions of the image decrease over time
  • The number of filters (and thus the depth of the output tensor) increases.
40
Q

What is a 1×1 output in CNNs, and how is it used?

A

A 1×1 output is a feature vector representing high-level features of the input image. It can be fed into fully connected layers for tasks like classification or regression.

41
Q

Why are fully connected layers used after convolutional layers?

A

Fully connected layers perform the final classification or regression by mapping the high-level feature vector to the desired output (e.g., class probabilities or bounding boxes).

42
Q

How is the parameter count for filters in a CNN calculated?

A
  • The parameter count for a filter is its number of weights plus one bias parameter

e.g., for a 3×3×3 filter:

  • parameters per filter: 3 × 3 × 3 + 1 = 28
  • for 10 filters: 28 × 10 = 280
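The same computation in code:

    f, c_in, num_filters = 3, 3, 10
    params_per_filter = f * f * c_in + 1       # weights plus one bias
    print(params_per_filter)                   # 28
    print(params_per_filter * num_filters)     # 280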
43
Q

What is size independence?

A
  1. The number of parameters in a CNN is independent of the input image dimensions.
  2. It depends only on the filter sizes and the number of filters, not on the input image size.
44
Q

What are the three main types of layers in a convolutional neural network (CNN)?

A
  1. Convolutional layer (CONV): Applies filters to extract features from the input.
  2. Pooling layer (POOL): Reduces the spatial dimensions of the feature maps.
  3. Fully connected layer (FC): Connects all neurons to produce the final output for classification or regression.
45
Q

What is max pooling, and how does it work?

A
  • Max pooling is the most common form of pooling.
  • It summarizes data by taking the maximum value in each region of the feature map
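A minimal max-pooling sketch (2×2 regions with stride 2, the most common setting):

    import numpy as np

    def max_pool(fmap, size=2, stride=2):
        """Keep the maximum value of each (size x size) region."""
        h, w = fmap.shape
        out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = fmap[i*stride:i*stride+size,
                                 j*stride:j*stride+size].max()
        return out

    print(max_pool(np.arange(16).reshape(4, 4)))  # 4x4 -> 2x2: [[5, 7], [13, 15]]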
46
Q

How does max pooling help in a CNN?

A
  • Reduces the size of feature maps
  • Helps control overfitting by reducing parameters
  • Preserves the most important features (max values) in each region.
47
Q

What is average pooling, and how is it different from max pooling?

A

Average pooling summarizes data by taking the average value in each region instead of the maximum value, resulting in a smoother representation of the feature map.

48
Q

Why is pooling used in CNNs?

A

Pooling reduces the spatial dimensions of feature maps and decreases computational complexity

49
Q

Why do convolutions work?

A
  1. Parameter sharing
  2. Sparsity of connections
50
Q

Why does parameter sharing make convolutional layers effective?

A

A feature detector (such as a vertical edge detector) that is useful in one part of the image is probably useful in another part of the image, so the same filter weights can be reused across all positions.

51
Q

What is sparsity of connections in a convolutional layer, and why is it beneficial?

A
  • Sparsity of connections means each output value depends only on a small number of inputs (the receptive field).
  • This reduces the complexity and increases the computational efficiency of the network.
52
Q

How did CNNs prove their effectiveness in the ImageNet Challenge?

A

In the ImageNet Challenge, CNNs decisively outperformed traditional hand-engineered computer-vision pipelines, and classification error kept dropping as architectures became deeper.

53
Q

What is LeNet-5, and why is it significant?

A
  • LeNet-5 was the first deep CNN to achieve significant results in image recognition
  • It processed grayscale images and had the following layers:
    CONV → POOL → CONV → POOL → FC → FC → FC → SOFTMAX
54
Q

What is the significance of fully connected layers in LeNet-5?

A

Most of the parameters in LeNet-5 are in the fully connected layers, which play a crucial role in classification by combining extracted features.

55
Q

What is AlexNet, and how did it improve upon LeNet-5?

A
  • AlexNet was a deeper network with multiple convolutional and max-pooling layers, followed by fully connected layers.
  • It used ReLU activation and dropout to improve performance and reduce overfitting.
56
Q

How does dropping fully connected layers affect AlexNet’s performance?

A
  • The bulk of the parameters is in the fully connected layers
  • Dropping fully connected layers reduces parameters significantly, with a small performance drop.
  • Dropping the other layers causes a smaller performance drop, but also removes far fewer parameters
57
Q

What is VGG, and what was its key architectural innovation?

A
  • used a simple and uniform design of 3×3 convolutional filters throughout the network.
  • Using small 3×3 filters allowed VGG to increase depth while maintaining computational efficiency, leading to better accuracy.
  • However, making networks deeper introduces the vanishing gradient problem
58
Q

What challenge did ResNet address, and how?

A

ResNet addressed the vanishing gradient problem by introducing residual blocks with skip connections, allowing very deep networks (e.g., 152 layers) to be trained effectively.

59
Q

How do residual blocks in ResNet work?

A

Residual blocks add a skip connection that passes a layer’s input both to the next layer and directly to a layer further ahead, enabling gradients to flow through this identity path and preventing them from vanishing.

60
Q

What was the key idea behind Inception (GoogLeNet) networks?

A
  • Inception networks used multiple filter sizes (1×1, 3×3, 5×5) in parallel within each layer to capture features at different scales.
  • They also used 1×1 convolutions to reduce dimensionality and computational cost.
61
Q

How did 1×1 convolutions improve computational efficiency in Inception networks?

A

1×1 convolutions acted as bottleneck layers, reducing the number of channels before applying larger filters, significantly decreasing the number of multiplications required.

62
Q

What are the three main components of object detection?

A
  1. Classification: Determining what the object is.
  2. Localization: Identifying where the object is in the image.
  3. Detecting multiple objects: Recognizing and localizing multiple objects within an image.
63
Q

What is object localization, and what does it predict?

A

Object localization predicts the location of a single object in an image:

  • It predicts the position and dimensions of the bounding box enclosing the object.
64
Q

How are bounding box coordinates represented in object localization?

A
  • (b_x, b_y, b_h, b_w)
  • denote the center coordinates, height, and width of the bounding box
65
Q

How is the output for object localization structured for a classification task?

A
  1. A binary value indicating whether there is an object (1) or not (0).
  2. Bounding box coordinates if an object is present.
  3. Class label of the object (e.g., 010 for car).
66
Q

How is the cost function adjusted in object localization?

A
  • Adjusted so that it combines the prediction error for class labels with the localization error for bounding box coordinates
  • i.e., the cost function differs depending on whether an object is present (the 1/0 in the first output position), as sketched below
  • if an object is present, the error is the sum of squared differences over all output components
  • if no object is present, only the classification (objectness) error on the first component is considered
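A minimal sketch of this case split, assuming a squared-error formulation and an output vector laid out as [p_object, b_x, b_y, b_h, b_w, class scores...]:

    import numpy as np

    def localization_loss(y, y_hat):
        if y[0] == 1:                     # object present: penalise every component
            return np.sum((y - y_hat) ** 2)
        return (y[0] - y_hat[0]) ** 2     # no object: only the objectness term counts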
67
Q

How is object detection achieved using a CNN trained for object localization?

A

A rectangle is slid over the image at different scales using a CNN trained for localization, and predictions are made for each region to detect objects.

68
Q

How can fully connected layers be replaced in object detection models?

A

Fully connected layers can be replaced by 1×1 convolutional layers, enabling operations to handle spatial information throughout the network using convolution.

69
Q

How does using fully convolutional networks improve object detection?

A

Fully convolutional networks allow the detection process to be done in a single pass by adapting convolutional operations to handle both localization and classification in one step.

70
Q

What is a common problem in object detection related to bounding boxes?

A

Several bounding boxes may detect the same object, leading to redundant detections.

71
Q

What is the Intersection over Union (IoU) algorithm used for in object detection?

A

IoU evaluates how well the predicted bounding box matches the ground truth by comparing the ratio of the overlapping area (intersection) to the combined area of both boxes (union).

72
Q

How is IoU calculated?

A

IoU = [area of intersection] / [area of union]
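A sketch using corner coordinates (x1, y1, x2, y2) rather than the centre/size format of card 64, since the overlap is easier to compute that way:

    def iou(box_a, box_b):
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # 0 when boxes do not overlap
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143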

73
Q

What happens if the predicted bounding box and the ground truth do not intersect?

A

The IoU value will be 0, indicating a completely incorrect prediction.

74
Q

How does IoU penalize large bounding boxes?

A

Large bounding boxes increase the union area without significantly increasing the intersection area, resulting in a lower IoU value, which correctly penalizes imprecise detections.

75
Q

When does IoU have a high value?

A
  1. when the intersection area is large
  2. when the union is not much larger than the intersection (i.e., the predicted box is not much bigger than the ground-truth box)
  • this indicates a good match between the predicted and ground-truth bounding boxes.
76
Q

What is the purpose of the Non-Max Suppression (NMS) algorithm in object detection?

A

NMS is a post-processing step used to remove redundant or overlapping bounding boxes for the same object while retaining only the box with the highest confidence score.

77
Q

How does the Non-Max Suppression (NMS) algorithm work?

A
  1. Select the bounding box with the highest confidence score.
  2. Calculate the IoU with other boxes.
  3. Discard boxes with IoU greater than a threshold (e.g., 0.5).
  4. Repeat until only unique, high-confidence detections remain.
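A greedy sketch of these steps, assuming an iou(box_a, box_b) helper like the one in card 72:

    def non_max_suppression(boxes, scores, iou_threshold=0.5):
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        keep = []
        while order:
            best = order.pop(0)            # highest remaining confidence score
            keep.append(best)
            # discard remaining boxes that overlap the kept box too much
            order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        return keep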
78
Q

What problem does the Anchor Box algorithm solve in object detection?

A

The Anchor Box algorithm helps in detecting overlapping or closely packed objects by pre-defining a set of bounding boxes at different scales and aspect ratios.

79
Q

How does the Anchor Box algorithm work?

A

The output vector is repeated once per anchor box, so each location can predict several objects; every object is assigned to the anchor box whose shape best matches its bounding box.

  • The number of anchor boxes therefore sets an upper limit on how many overlapping objects can be detected at the same location.
80
Q

What is the difference between face verification and face recognition?

A
  • Face verification: One-to-one comparison to check if an input image matches a claimed identity (e.g., logging into a phone using face recognition).
  • Face recognition: One-to-many comparison to identify an individual from a database (e.g., identifying someone in a surveillance video).
81
Q

What is the input and output for face verification?

A
  • Input: An image and a claimed identity (name or ID).
  • Output: A binary result (yes/no) indicating whether the input image matches the claimed identity.
82
Q

What is the input and output for face recognition?

A

  • Input: An image.

  • Output: The ID of the matched individual from the database, or “not recognized” if no match is found.
83
Q

What is the main challenge in face recognition systems?

A

The main challenge is one-shot learning, where the system must generalize from minimal data and recognize faces with high accuracy.

84
Q

How is one-shot learning applied in face recognition?

A

One-shot learning trains a network to learn a similarity function, which compares two face images and outputs a similarity score indicating whether the faces belong to the same person.

85
Q

What type of network is commonly used to implement the similarity function in one-shot learning for face recognition?

A

A Siamese network is commonly used, which compares two input images and learns to group similar images together based on a distance function.

86
Q

What is a Siamese network, and how does it work?

A
  • A Siamese network consists of two identical sub-networks that process two input images and output feature vectors (encodings).
  • The similarity between images is determined by comparing these encodings using a distance metric (e.g., Euclidean distance or cosine similarity).
87
Q

What is the key idea behind using a Siamese network for face recognition?

A

Instead of training a separate classifier for each class, a Siamese network learns a similarity function that can compare any pair of images and determine whether they belong to the same class.

88
Q

What is the triplet loss function, and how is it used in training a Siamese network?

A

The triplet loss function is used to ensure that the distance between an anchor and a positive example is smaller than the distance between the anchor and a negative example by a margin α.

89
Q

What are the three components of a triplet used in the triplet loss function?

A
  1. Anchor (A): Reference image.
  2. Positive (P): Image of the same class as the anchor.
  3. Negative (N): Image of a different class from the anchor.
90
Q

Why is a margin (α) added in the triplet loss function?

A

The margin ensures that the network does not learn trivial solutions where the distances are zero by requiring a minimum difference between positive and negative distances.

91
Q

How is the triplet loss function formulated?

A

L(A, P, N) = max(‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α, 0)

  • f(A), f(P), and f(N) are the encodings for the anchor, positive, and negative examples.
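The same loss as a NumPy sketch (α = 0.2 is just an illustrative margin):

    import numpy as np

    def triplet_loss(f_a, f_p, f_n, alpha=0.2):
        d_pos = np.sum((f_a - f_p) ** 2)   # anchor-positive squared distance
        d_neg = np.sum((f_a - f_n) ** 2)   # anchor-negative squared distance
        return max(d_pos - d_neg + alpha, 0.0)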
92
Q

What is Neural Style Transfer (NST)?

A

NST is a technique that uses neural networks to combine the content of one image with the artistic style of another image to create a new, stylized image.

93
Q

What are the key components of Neural Style Transfer?

A
  1. Content image (C): Provides the structure or layout (e.g., buildings, objects).
  2. Style image (S): Provides the artistic feel or texture (e.g., brush strokes, colors).
  3. Generated image (G): Combines the structure of the content image with the style of the style image.
94
Q

How does Neural Style Transfer visualize content and style features at different layers of a CNN?

A
  1. Early layers detect simple features like edges and colors.
  2. Later layers detect complex patterns, parts of objects, and full objects.
  • For content representation, features are sampled from later layers of a pretrained CNN.
95
Q

What is the goal of the cost function J(G) in Neural Style Transfer?

A

The goal is to minimize the cost function J(G), which combines two losses:

  1. Content loss: Ensures the generated image has a similar structure to the content image.
  2. Style loss: Ensures the generated image has a similar texture to the style image.
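In the usual formulation, the two losses are weighted and summed (the weights α and β are discussed in card 99):

J(G) = α · J_content(C, G) + β · J_style(S, G)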
96
Q

How is the content loss J_content(C,G) defined?

A

As the similarity (e.g., squared difference) between the activations of the generated image G and the content image C at a chosen layer l of a pretrained CNN.

97
Q

How is the style loss J_style(S, G) defined?

A

As the similarity between the activations of the generated image G and the style image S, measured by comparing the Gram matrices of their feature maps.

98
Q

What is the role of the Gram matrix in Neural Style Transfer?

A
  • The Gram matrix captures correlations between different channels (feature maps) and helps represent the style of an image by measuring how different features co-occur.
  • i.e., we match the correlation structure across the different channels of the generated image to that of the style image, transferring the style into the generated image.
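A minimal sketch of the Gram matrix computation for a (H, W, C) activation volume:

    import numpy as np

    def gram_matrix(feature_maps):
        H, W, C = feature_maps.shape
        F = feature_maps.reshape(H * W, C)   # one column of activations per channel
        return F.T @ F                       # (C, C) channel-correlation matrix

    print(gram_matrix(np.random.rand(4, 4, 3)).shape)  # (3, 3)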
99
Q

How do the weights α and β in the cost function J(G) affect the generated image?

A
  • They control the relative importance of content and style similarity.
  • Adjusting these weights changes the balance between preserving the structure of the content and applying the style.