Deep Learning 2 Flashcards

1
Q
How can we represent an image such that a computer can process it in a neural network?
A

A computer sees an image as a 2D or 3D array of pixel values, e.g. a 1080×1080×3 array (height × width × color channels) for an RGB image. Each pixel value is an integer from 0 to 255 per channel.
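
A minimal NumPy sketch of this representation (the pixel values here are random, purely for illustration):

    import numpy as np

    # A hypothetical 1080x1080 RGB image: height x width x channels, uint8 in [0, 255].
    img = np.random.randint(0, 256, size=(1080, 1080, 3), dtype=np.uint8)
    print(img.shape)  # (1080, 1080, 3)
    print(img[0, 0])  # one pixel: three channel values, each 0-255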

2
Q
Give an example task in computer vision and describe how a classification output might look for an image of a face.
A

For a face classification task, the model might output probabilities over possible identities (e.g., Lincoln: 0.8, Washington: 0.1, Jefferson: 0.05, Obama: 0.05). The highest probability label is the predicted class.

3
Q
Before the deep learning era, how were computer vision features typically handled, and what was a drawback of that approach?
A

Engineers would manually define and extract features (like edges, corners, SIFT, HOG). This was time-consuming, required domain expertise, and wasn’t easily scalable, making it brittle for new tasks.

4
Q
How does a convolutional filter help extract visual features from an image?
A

A filter (a small matrix of weights) is convolved with local patches of the image. At each spatial position, we do element-wise multiplication and sum, detecting specific patterns like edges, corners, or textures.
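
A short NumPy sketch of the slide-multiply-sum idea (the vertical-edge filter shown is a standard example, not from the original card):

    import numpy as np

    def conv2d_valid(image, filt):
        # Slide the filter over the image; at each position, multiply
        # element-wise with the patch underneath and sum the result.
        H, W = image.shape
        k = filt.shape[0]
        out = np.zeros((H - k + 1, W - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i+k, j:j+k] * filt)
        return out

    # A classic vertical-edge filter: responds where intensity changes left-to-right.
    edge = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]])
    img = np.random.rand(5, 5)
    print(conv2d_valid(img, edge).shape)  # (3, 3)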

5
Q
What does it mean to ‘share parameters’ across spatial locations when applying a convolutional filter?
A

Instead of learning different weights for every position, one small filter’s weights are used repeatedly across all positions in the image. This drastically reduces the number of parameters and ensures that the same feature can be detected anywhere in the image.

6
Q
Why is a fully connected network not ideal for image tasks?
A

Fully connecting each neuron to all pixels ignores the 2D spatial structure and leads to a large number of parameters. It can’t exploit local patterns and is computationally more expensive, making it less efficient for images.
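
Rough arithmetic makes the gap concrete (the layer sizes are illustrative assumptions):

    # Hypothetical first layer for a 1080x1080x3 input (biases omitted).
    fc_params = (1080 * 1080 * 3) * 1000   # dense layer with 1,000 neurons
    conv_params = (3 * 3 * 3) * 64         # 64 shared 3x3x3 filters
    print(f"{fc_params:,}")    # 3,499,200,000 weights
    print(f"{conv_params:,}")  # 1,728 weights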

7
Q
In a convolutional neural network (CNN), describe the sequence of operations typically performed on the input image.
A

(1) Convolution: apply learned filters to patches of the image. (2) Non-linearity: pass the resulting feature maps through an activation (e.g., ReLU). (3) Pooling: downsample the feature maps for spatial invariance. (4) Repeat or proceed to fully connected layers for classification.
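
A minimal PyTorch sketch of this pipeline (the channel counts and the 32×32 input size are illustrative assumptions):

    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),   # (1) convolution: 16 learned filters
        nn.ReLU(),                                    # (2) non-linearity
        nn.MaxPool2d(2),                              # (3) pooling: halve spatial size
        nn.Conv2d(16, 32, kernel_size=3, padding=1),  # (4) repeat conv/ReLU/pool
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 8 * 8, 10),                    # classifier, assuming a 32x32 input
    )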

8
Q
What does a ‘patch’ or ‘receptive field’ refer to in the convolutional layer context?
A

It’s the local region of the input that a particular filter (or neuron) sees. For instance, a 3×3 filter only ‘looks’ at a 3×3 patch of the input at each position before sliding (moving by stride) to the next patch.
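
The number of positions a filter visits follows the standard output-size formula; a small sketch:

    def output_size(input_size, kernel_size, stride=1, padding=0):
        """Positions a filter visits along one spatial dimension."""
        return (input_size + 2 * padding - kernel_size) // stride + 1

    print(output_size(5, 3))            # 3: a 3x3 filter fits 3 times across 5 pixels
    print(output_size(5, 3, stride=2))  # 2: a larger stride skips positions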

9
Q
How does applying multiple different filters benefit feature extraction in CNNs?
A

Each filter specializes in detecting specific patterns (e.g., vertical edges, color gradients, corners). Multiple filters allow the network to capture a variety of features at different orientations and scales in parallel.
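
In PyTorch, for example, a convolutional layer's weight tensor makes this explicit (the channel counts are illustrative):

    import torch.nn as nn

    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
    print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]): 16 distinct 3x3x3 filters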

10
Q
Show a small numerical example of convolving a 3×3 filter over a 5×5 image patch and getting a single output value.
A

If the 5×5 patch is:

[1 0 1 2 3]
[2 1 1 0 1]
[3 2 2 2 2]
[1 0 1 1 2]
[2 1 1 1 0]

and the 3×3 filter is:

[0 1 0]
[1 1 1]
[0 1 0]

then the output at the top-left position is the element-wise product of the top-left 3×3 window with the filter, summed: (0·1 + 1·0 + 0·1) + (1·2 + 1·1 + 1·1) + (0·3 + 1·2 + 0·2) = 6. Sliding the filter to every valid position produces a 3×3 output map.
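
The same top-left value, checked in NumPy:

    import numpy as np

    patch = np.array([[1, 0, 1, 2, 3],
                      [2, 1, 1, 0, 1],
                      [3, 2, 2, 2, 2],
                      [1, 0, 1, 1, 2],
                      [2, 1, 1, 1, 0]])
    filt = np.array([[0, 1, 0],
                     [1, 1, 1],
                     [0, 1, 0]])

    # Top-left position: multiply the top-left 3x3 window by the filter and sum.
    print(np.sum(patch[:3, :3] * filt))  # 6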

11
Q
What is meant by ‘feature map’ in a CNN after the convolution operation?
A

A feature map is the resulting 2D (or 3D including channel dimension) output after convolving a filter across the entire input. Each location in the feature map indicates how strongly the filter matched that patch of the input.

12
Q
Why do CNN architectures typically include a ReLU activation after each convolution?
A

ReLU (Rectified Linear Unit) introduces non-linearity by zeroing out negative values. Without it, a stack of convolutions would collapse into one overall linear operation; the non-linearity is what lets CNNs model complex visual patterns.
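
ReLU itself is a one-liner; a tiny NumPy sketch with made-up values:

    import numpy as np

    x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])  # raw feature-map values
    print(np.maximum(0, x))                     # [0. 0. 0. 1.5 3.] -- negatives zeroed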

13
Q
What is ‘pooling’ in a CNN, and why is it used?
A

Pooling (like max pooling or average pooling) downsamples the feature map by summarizing local regions (e.g., picking the max value in a 2×2 patch). This reduces spatial resolution, lowers parameter counts, and provides spatial invariance.
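
A small NumPy sketch of 2×2 max pooling (the feature-map values are made up):

    import numpy as np

    fmap = np.array([[1, 3, 2, 0],
                     [4, 2, 1, 1],
                     [0, 1, 5, 2],
                     [2, 2, 1, 3]])

    # 2x2 max pooling, stride 2: keep the max of each non-overlapping 2x2 block.
    pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print(pooled)  # [[4 2]
                   #  [2 5]]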

14
Q
Describe one advantage of using pooling layers in convolutional neural networks.
A

Pooling makes the network more robust to small translations or distortions in the image: if the object shifts slightly, the pooled responses remain largely unchanged, so later layers see stable features.
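
A tiny sketch of this robustness (values made up; note the invariance only holds for shifts that stay within a pooling window):

    import numpy as np

    a = np.array([[9, 0, 0, 0],
                  [0, 0, 0, 0]])
    b = np.roll(a, 1, axis=1)  # same pattern, shifted one pixel right

    def pool2x2(m):
        # 2x2 max pooling via reshape: max over each non-overlapping block
        return m.reshape(m.shape[0] // 2, 2, m.shape[1] // 2, 2).max(axis=(1, 3))

    print(pool2x2(a))  # [[9 0]]
    print(pool2x2(b))  # [[9 0]] -- the small shift disappears after pooling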

15
Q
Summarize how a CNN learns to classify images, from input to output.
A

(1) The input image is convolved with learned filters in multiple layers. (2) Non-linearities and pooling refine and compress the extracted features. (3) A final fully connected layer (or layers) aggregates these features and outputs class scores or probabilities.

16
Q
Why do we often say CNNs learn a hierarchy of features (low-level to high-level)?
A

Early layers focus on simple edges and textures (low-level), deeper layers combine these into mid-level shapes, and the final layers recognize high-level concepts (e.g., faces, objects) as the network depth increases.

17
Q
What is ‘ImageNet,’ and how did CNNs revolutionize its classification challenge?
A

ImageNet is a large-scale dataset with 14 million images in over 20,000 categories. CNNs (starting with AlexNet in 2012) drastically reduced error rates in the annual ImageNet competition, outperforming traditional methods and sparking the deep learning boom in computer vision.

18
Q
Mention two state-of-the-art CNN-based architectures that have emerged from the ImageNet Challenge and a key difference between them.
A

Examples include: VGG (up to 19 layers, built from stacked 3×3 convolutions) and ResNet (up to 152 layers, introduces residual connections). The key difference is ResNet’s skip/residual connections, which mitigate vanishing gradients and make extremely deep networks trainable.

19
Q
Beyond classification, name three different computer vision tasks that CNNs can perform and briefly describe each.
A

(1) Object Detection: locate objects in an image and classify them (e.g., bounding boxes). (2) Semantic Segmentation: assign a class label to every pixel. (3) Image Captioning: generate a natural language description for the entire image.

20
Q
What is a Fully Convolutional Network (FCN), and for which task is it typically used?
A

An FCN replaces fully connected layers with convolutional layers (and often upsampling layers). It’s commonly used for semantic segmentation because it outputs a label map, assigning class labels at each pixel.
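
A minimal PyTorch sketch of the idea (the channel counts, the 21-class head, and the upsampling factor are assumptions, loosely echoing the PASCAL VOC setup):

    import torch.nn as nn

    head = nn.Sequential(
        nn.Conv2d(64, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(64, 21, kernel_size=1),  # class scores at every spatial location
        nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),  # back toward input resolution
    )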

21
Q
How does R-CNN approach object detection differently than just standard CNN classification?
A

R-CNN first proposes candidate regions likely to contain objects (region proposals). Then it applies a CNN to each region patch to classify what object might be there. This differs from classifying the whole image at once.

22
Q
Give an example of how CNNs integrate with Recurrent Neural Networks (RNNs) to produce image captions.
A

The CNN extracts a high-level feature vector from the image (e.g., from the last convolutional layer). This vector is then fed into an RNN (like an LSTM) that generates a sentence token by token, using the CNN features as context.
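
A hedged PyTorch sketch of the CNN-to-RNN handoff (all sizes and the seeding scheme are assumptions; real captioning models vary):

    import torch
    import torch.nn as nn

    feat = torch.randn(1, 512)         # stand-in for the CNN's image feature vector
    lstm = nn.LSTMCell(input_size=300, hidden_size=512)
    h, c = feat, torch.zeros(1, 512)   # image features seed the LSTM's hidden state
    token = torch.randn(1, 300)        # embedding of a start-of-sentence token
    for _ in range(5):                 # emit a few tokens
        h, c = lstm(token, (h, c))
        # a vocabulary projection (omitted) would map h to the next word's embedding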

23
Q
Why is having large datasets (like ImageNet) so critical for training modern CNNs?
A

CNNs have millions of trainable parameters. Large, diverse datasets reduce the risk of overfitting and help the network learn generalizable feature representations. Without enough data, deep models tend to memorize the training set instead of learning meaningful representations.

24
Q
Mention two real-world impact areas of CNNs in computer vision with a quick example of each.
A

(1) Face Detection and Recognition (e.g., phone face unlock or tagging in social media). (2) Self-driving cars (e.g., detecting pedestrians, lanes, and traffic signs in real-time for autonomous navigation).

25
Q
Summarize how CNN-based techniques have drastically influenced the field of computer vision overall.
A

CNNs exploit local connectivity and shared filters to efficiently learn hierarchical features from image data. They’ve led to breakthroughs across classification, detection, segmentation, image captioning, and more, often exceeding human performance on large-scale tasks and unlocking new applications in healthcare, robotics, and beyond.