Deep Learning for Computer Vision Flashcards

Question 1

Q

What should the number of biases be equal to?

Answer

A

The number of output neurons

Question 2

Q

How do we count the number of layers in a neural network?

Answer

A

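Typically we only count the layers that have learnable weights, e.g. convolution layers and fully connected layers.
Pooling and ReLU layers have no parameters, so they are not counted. BatchNorm layers do have a small number of learnable scale and shift parameters, but by convention they are not counted either.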

Question 3

Q

What is the formula for calculating the number of parameters for a convolutional layer:

For an input image of 32x32 with convolutional layer of 5x5x6 filters, what is the number of parameters used?

Answer

A

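(filter_width * filter_height * input_depth + 1 for bias) * number of filters

(5 * 5 * 1 + 1) * 6 = 156, assuming a single-channel (grayscale) input; the 32x32 spatial size does not affect the parameter count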
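
As a minimal sketch of the formula above (the function name `conv_params` is hypothetical, and the worked example assumes an input depth of 1, i.e. a grayscale image, which the card does not state explicitly):

```python
def conv_params(filter_width, filter_height, input_depth, num_filters):
    """Count learnable parameters in a convolutional layer (one bias per filter)."""
    weights_per_filter = filter_width * filter_height * input_depth
    return (weights_per_filter + 1) * num_filters


# Assumed example: six 5x5 filters applied to a single-channel input.
print(conv_params(5, 5, 1, 6))
```

Note that the spatial size of the input (32x32 here) does not appear in the formula, because the same filter weights are shared across every position.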

Question 4

Q

What is the formula for calculating the number of parameters for a fully connected layer:

For 120 input nodes with 84 output nodes, what is the number of parameters used?

Answer

A

(Input nodes + 1 for bias) * output nodes

(120 + 1) * 84 = 10164
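
The same count can be checked with a short sketch (the function name `fc_params` is hypothetical):

```python
def fc_params(input_nodes, output_nodes):
    """Count learnable parameters in a fully connected layer (one bias per output node)."""
    return (input_nodes + 1) * output_nodes


print(fc_params(120, 84))
```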

Question 5

Q

How many layers does the LeNet-5 model have?

Answer

A

5:
3 convolution layers and 2 fully connected layers

Question 6

Q

How many layers does the AlexNet model have?

Answer

A

8:
5 convolution layers and 3 fully connected layers

Question 7

Q

How many layers does the VGG-16 model have?

Answer

A

16:
13 convolution layers and 3 fully connected layers

Question 8

Q

Why do we decrease the feature map size and increase the depth of the channels/filters for each stage?

Answer

A

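This is done to maintain the information content: each stage reduces the spatial resolution of the feature maps, so the number of channels is increased to compensate and avoid losing representational capacity.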

Question 9

Q

What is the advantage of using small filters/kernel size?

Answer

A

Stacking two 3x3 conv layers (stride 1) gives the same receptive field as one 5x5 conv layer, but the stacked smaller filters use fewer parameters and hence save memory.

Question 10

Q

What is the difference in parameter size by using a 3x3 filter rather than a 5x5 filter?

Answer

A

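Two stacked 3x3 filters use 2 * (3^2) = 18 weights, whereas one 5x5 filter uses 5^2 = 25 weights (per input/output channel pair, supposing we don’t count the bias). So we save memory by using the smaller filters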
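
A small sketch of why the comparison is between two 3x3 layers and one 5x5 layer (the helper names are hypothetical; stride 1 and no dilation are assumed, and weights are counted per input/output channel pair without biases):

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions with the given kernel sizes."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each stride-1 layer widens the field by (k - 1)
    return field


def weights_per_channel_pair(kernel_sizes):
    """Total filter weights in the stack for one input/output channel pair, ignoring bias."""
    return sum(k * k for k in kernel_sizes)


for stack in ([3, 3], [5]):
    print(stack, receptive_field(stack), weights_per_channel_pair(stack))
```

Both stacks cover a 5x5 region of the input, but the two 3x3 layers need fewer weights.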

Question 11

Q

Why use a deeper neural network over a shallow one?

Answer

A

With good regularisation they generalise better on unseen data
They’re able to learn more complex features

Question 12

Q

How does VGG-Net differ from AlexNet and LeNet?

Answer

A

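VGG-Net uses only small 3x3 convolution filters, stacked into a much deeper network. ReLU is not the distinguishing feature, since AlexNet already uses ReLU activations.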

Question 13

Q

How many layers does GoogLeNet have and what makes it different to other models?

Answer

A

It has 22 layers. It uses efficient building blocks called inception modules
- about 12 times fewer parameters than AlexNet

Question 14

Q

What are auxiliary classifiers?

Answer

A

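Additional classifiers attached to earlier (intermediate) layers of the network during training:
- they help mitigate the vanishing gradient problem by injecting extra gradient signal into the earlier layers
- they act as a form of regularisation: earlier layers are evaluated directly against the labels, so their parameters can be updated to be more accurate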

Question 15

Q

How many auxiliary classifiers does GoogLeNet use?

Answer

A

2 auxiliary classifiers (3 classifier outputs in total, counting the main classifier at the end of the network)

Question 16

Q

How does an inception module work?

Answer

A

Uses parallel convolutional layers (with different filter sizes) and a pooling layer
Extracts features across multiple scales and combines these into a single output
Keeps the computational cost manageable

Question 17

Q

What are the two types of inception module:

Answer

A

Naive and dimension reduction

Question 18

Q

Very deep CNNs are prone to degradation. What can be used to avoid/handle this?

Answer

A

Using Residual blocks

Question 19

Q

What does ResNet use that makes it different to other NNs?

Answer

A

Residual blocks

Question 20

Q

What are residual blocks?

Answer

A

A block with a skip (shortcut) connection that adds the input feature map to the block's output using an element-wise sum.
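
A minimal sketch of the element-wise sum, using plain lists in place of feature maps (the name `residual_block` and the `transform` argument, which stands in for the block's convolutional layers, are hypothetical; any activation applied after the sum is left out):

```python
def residual_block(x, transform):
    """Return transform(x) + x, element by element (the skip connection)."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("the transformed output must have the same shape as the input")
    return [f + xi for f, xi in zip(fx, x)]


# If the layers inside the block output zeros, the block reduces to the identity.
print(residual_block([1.0, 2.0], lambda v: [0.0 for _ in v]))
```

This shows why a residual block can fall back to passing its input through unchanged: the inner layers only need to learn a correction to the input.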

Question 21

Q

What do residual networks solve?

Answer

A

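The vanishing gradient problem (and the resulting degradation of very deep networks): the skip connections let information and gradients pass directly between the early layers and the final layer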

Question 22

Q

What is DenseNet’s key feature?

Answer

A

Dense connectivity

Question 23

Q

What is dense connectivity?

Answer

A

Within a dense block, each layer receives the feature maps of all preceding layers as input, combined using concatenation, to ensure maximum information reuse and gradient flow
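
To illustrate how concatenation accumulates feature maps, here is a sketch that tracks channel counts only (the function name is hypothetical, and the idea that every layer adds a fixed number of new channels, often called the growth rate, is an assumption taken from common DenseNet descriptions rather than from these cards):

```python
def dense_block_channels(input_channels, growth_rate, num_layers):
    """Return (channels seen by each layer, channels leaving the block)."""
    channels = input_channels
    seen_by_layer = []
    for _ in range(num_layers):
        seen_by_layer.append(channels)  # this layer sees everything produced so far
        channels += growth_rate         # its new feature maps are concatenated on
    return seen_by_layer, channels


print(dense_block_channels(64, 32, 4))
```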

Question 24

Q

Why use dense connectivity over resnet blocks?

Answer

A

DenseNet uses far fewer parameters

Question 25

Q

What is the benefit of using dense connectivity?

Answer

A

The loss is propagated directly to all the layers in a dense block, which gives strong gradients
It ensures that gradients can flow directly to earlier layers during backpropagation, addressing vanishing gradient issues.

Question 26

Q

Describe SE blocks:

Answer

A

Squeeze: performs global average pooling on each feature map, then flattens the results into a 1D vector (one value per channel) to get a compact representation of the image
Excitation: two fully connected layers with a non-linearity (ReLU) in between generate weights that indicate the importance of each channel
Scale: a sigmoid squashes these weights into the range 0 to 1, and each channel of the original feature map is multiplied by its weight, amplifying important channels and suppressing irrelevant ones
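
The three steps can be sketched with plain nested lists (everything here is illustrative: the function name `se_block`, the weight matrices `w1` and `w2`, and the omission of biases are assumptions, not a reference implementation):

```python
import math


def se_block(feature_maps, w1, w2):
    """feature_maps: list of C channels, each an H x W nested list.
    w1: weights of the first FC layer, shape (C_reduced x C).
    w2: weights of the second FC layer, shape (C x C_reduced)."""
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    scales = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
              for row in w2]
    # Scale: multiply every value in a channel by that channel's weight.
    scaled = [[[v * s for v in row] for row in ch]
              for ch, s in zip(feature_maps, scales)]
    return scaled, scales


maps = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
scaled, scales = se_block(maps, w1=[[1.0, 1.0]], w2=[[0.0], [0.0]])
print(scales)
```

With the all-zero second layer used in this toy call, every channel gets the same weight; trained weights would instead favour the more informative channels.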

Question 27

Q

What is the motivation behind SENet/SE blocks?

Answer

A

They improve the representational power of the features and produce better model performance

Question 28

Q

What are the two types of separable convolution?

Answer

A

Depth-wise and point-wise convolution

Question 29

Q

What is depthwise convolution?

Answer

A

Given an input with 3 channels and a filter with 3 channels, we do the convolution channel by channel, e.g. the first filter channel only convolves with the first input channel.

Question 30

Q

What is pointwise convolution?

Answer

A

It can be treated as a one-by-one convolution: the filter size is 1x1, and it mixes the information across all channels.
If we want 256 output channels we need to use 256 of these 1x1 filters.

Question 31

Q

What does MobileNet use in its architecture? What benefit does this have?

Answer

A

Uses depthwise and pointwise convolution

This greatly reduces the number of parameters needed
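
A rough sketch of the saving, counting weights only (the function names are hypothetical, biases are ignored, and the 3-channel input with 256 output channels is borrowed from the earlier cards as an example):

```python
def standard_conv_weights(kernel_size, in_channels, out_channels):
    """Weights in a standard convolution: every filter spans all input channels."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_weights(kernel_size, in_channels, out_channels):
    """Weights in a depthwise convolution followed by a pointwise (1x1) convolution."""
    depthwise = kernel_size * kernel_size * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels               # 1x1 filters that mix the channels
    return depthwise + pointwise


print(standard_conv_weights(3, 3, 256), separable_conv_weights(3, 3, 256))
```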

Question 32

Q

What is Transfer Learning?

Answer

A

An ML technique where a model that’s been trained for a specific task is repurposed as a starting point for another similar task

Question 33

Q

What is the difference between traditional ML and transfer learning?

Answer

A

Training using traditional ML uses isolated single task learning: learning is performed without considering past learnt information from other tasks.

Training using transfer learning is where learning a new task relies on a previously learned task.

Question 34

Q

What is true about the layers in the model used for a new task when transfer learning is used?

Answer

A

The new model will typically borrow several layers from a past model which are fixed (“frozen”)
Only the last few layers of the new model will be trained

Question 35

Q

What is true about the datasets used in the new and old models for transfer learning?

Answer

A

The old model typically uses a large dataset and the transfer learning model typically uses a small dataset

Question 36

Q

What are the 4 steps of the transfer learning process?

Answer

A

1) Design a NN and train it on a large dataset
2) Borrow the first several layers of this trained NN for the new transfer learning model, including their learned parameters (weights and biases)
3) Train only the last layers of the new NN on a small dataset
4) Evaluate the model on a new small dataset and measure classification accuracy

Question 37

Q

For a small dataset if you don’t have enough access to GPUs, what can you use?

Answer

A

Transfer Learning

Question 38

Q

Why does a transfer learning model use a small learning rate?

Answer

A

Ensures that pre-trained weights are not drastically altered, allowing the model to fine-tune its existing knowledge to the new task gradually.
Ensures gradient updates are smooth and that the pre-trained features adapt without disrupting the stability of the network.

Question 39

Q

What are the advantages of transfer learning?

Answer

A

faster training time for model (as the new model borrows optimised weights and bias from trained model)
it performs very well on small datasets

Question 40

Q

When might transfer learning not be appropriate to use?

Answer

A

If the dataset is very large

Question 41

Q

What is metric learning?

Answer

A

An ML method that involves learning a distance function or metric over a dataset. Used, for example, for face verification

Uses a loss to move similar objects as close together as possible and dissimilar objects far away from each other, forming clusters
For x categories, there will be x clusters

Question 42

Q

How does metric learning with contrastive loss work?

Answer

A

Take an input pair of samples that are either similar or dissimilar
Use the loss formula to bring similar samples closer together and push dissimilar samples far apart
If the samples are similar, minimise the distance between them
If the samples are dissimilar, maximise the distance between them (in practice, until it exceeds a margin)
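
One common way to write this loss is sketched below (the exact formula is not given in these cards; the squared-distance form, the `margin` parameter, and the function names are assumptions):

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, similar, margin=1.0):
    """Small when similar pairs are close, or when dissimilar pairs are at least `margin` apart."""
    d = euclidean(a, b)
    if similar:
        return d ** 2                    # pull similar samples together
    return max(0.0, margin - d) ** 2     # push dissimilar samples apart, up to the margin


print(contrastive_loss([0.0], [0.5], similar=True))
print(contrastive_loss([0.0], [0.5], similar=False))
```

The margin means dissimilar pairs that are already far enough apart contribute no loss, so the model is not forced to push them apart indefinitely.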

Question 43

Q

Describe the pipeline for training a model to do face recognition:

Answer

A

1) design a NN to extract facial features
2) turn this into a 1D vector using pooling
3) Use this to predict the category
4) Compare similarities to decide if the feature is the same or not
5) Calculate overall similarity. If it meets a threshold we classify it as the same person.

Question 44

Q

What is CLIP?

Answer

A

Contrastive Language-Image Pre-training: an ML model that uses metric learning
- associates images with textual descriptions from a large dataset of text and images

Question 45

Q

How does CLIP work for the training stage?

Answer

A

Uses two neural networks: a text encoder and an image encoder, these generate text and image embeddings and project them onto a joint space
During training, the model is given image-text pairs
A contrastive loss function brings the embeddings of corresponding pairs closer together and pushes apart the embeddings of mismatched pairs

Question 46

Q

Describe how text for an image is predicted using CLIP for testing?

Answer

A

1) image is passed through image encoder to obtain its embedding in the shared space.
2) A set of text descriptions is encoded using the text encoder. Each candidate description results in a separate text embedding.
3) computes similarity between image embedding and each text embedding
4) text with highest similarity score to the image embedding is selected as predicted description.
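
Steps 3 and 4 can be sketched as follows, assuming cosine similarity as the similarity measure and using small hand-made vectors in place of real encoder outputs (the function names and the toy embeddings are hypothetical):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_description(image_embedding, text_embeddings):
    """text_embeddings maps each candidate description to its embedding."""
    scores = {text: cosine_similarity(image_embedding, emb)
              for text, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)  # description with the highest similarity
    return best, scores


# Toy embeddings, NOT produced by a real image or text encoder.
candidates = {"a photo of a dog": [0.9, 0.1], "a photo of a cat": [0.1, 0.9]}
best, scores = predict_description([0.8, 0.2], candidates)
print(best)
```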

Question 47

Q

Why are R-CNN, Fast R-CNN and Faster R-CNN known as 2 stage object detection?

Answer

A

Because the first stage involves getting proposals for objects and the second stage involves using a neural network to refine the position and predict the category

Question 48

Q

What is the difference between how R-CNN, Fast R-CNN and Faster R-CNN find proposals for object regions?

Answer

A

R-CNN and Fast R-CNN use selective search, Faster R-CNN uses a small neural network known as the region proposal network

Question 49

Q

What are the summaries of R-CNN, Fast R-CNN and Faster R-CNN?

Answer

A

R-CNN: Selective search + CNN + SVM
Fast R-CNN: Selective search + CNN + ROI pooling
Faster R-CNN: Region proposal NN + CNN + ROI pooling

Question 50

Q

What is the difference between Nearest neighbour interpolation and max unpooling for upsampling?

Answer

A

Nearest neighbour only needs to store the pixel value whereas max unpooling requires the pixel value and the pixel position, so you need to also store the position
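
The difference in what has to be stored can be seen in a small sketch using nested lists (the function names are hypothetical, and the grid dimensions are assumed to be divisible by the pooling size):

```python
def nearest_neighbour_upsample(grid, factor=2):
    """Copy each value into a factor x factor block; only the values are needed."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def max_pool_with_indices(grid, size=2):
    """Max pooling that also records where each maximum came from."""
    pooled, indices = [], []
    for r in range(0, len(grid), size):
        pooled_row, index_row = [], []
        for c in range(0, len(grid[0]), size):
            window = [(grid[r + dr][c + dc], (r + dr, c + dc))
                      for dr in range(size) for dc in range(size)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, height, width):
    """Put each pooled value back at its stored position; everything else is 0."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


grid = [[1, 2, 5, 6], [3, 4, 7, 8], [9, 1, 2, 3], [1, 1, 4, 0]]
pooled, indices = max_pool_with_indices(grid)
print(max_unpool(pooled, indices, 4, 4))
```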

Question 51

Q

What is the motivation of generative models?

Answer

A

To learn complex distributions in order to generate new realistic images.

Question 52

Q

What are the two separate tasks of object detection?

Answer

A

Classification
Localisation

First perform feature extraction, then perform classification to predict labels and bounding boxes

Question 53

Q

Name a one stage method for deep learning object detection:

Answer

A

YOLO

Question 54

Q

Name some region proposal two stage methods for deep learning object detection:

Answer

A

R-CNN, Fast R-CNN, Faster R-CNN

Question 55

Q

What is the sliding window method for multiple object detection using deep learning:

Answer

A

Given an image, crop a region of the image to a predefined window size, then forward it through the classifier/CNN to predict the category and position. Slide the window along the image, forwarding every subregion through the CNN.
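
A sketch of how the window positions might be enumerated (the function name, the square window, and the `stride` parameter are assumptions; the classifier itself is left out):

```python
def sliding_windows(image_width, image_height, window_size, stride):
    """Yield (left, top, right, bottom) for every window position in the image."""
    for top in range(0, image_height - window_size + 1, stride):
        for left in range(0, image_width - window_size + 1, stride):
            yield (left, top, left + window_size, top + window_size)


windows = list(sliding_windows(8, 8, window_size=4, stride=2))
print(len(windows), windows[0])
```

Each yielded window would be cropped and forwarded through the CNN separately, so the number of forward passes grows quickly with image size and with smaller strides.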

Question 56

Q

What is the disadvantage of using the sliding window as a deep learning object detection method?

Answer

A

Repeatedly applying the CNN to many cropped regions is very computationally expensive and time consuming

Question 57

Q

Describe YOLO as a deep learning object detection method:

Answer

A

resize image to a predefined size
forward it through a CNN to extract features
divide the feature map into an SxS grid
for each grid cell, predict two branches: one for the bounding box position (and whether it is foreground/background) and one for the scores of each category
combine position and category to get final detections
perform non-maximum suppression to remove duplicate detections

Question 58

Q

Describe how non-maximum suppression is performed for YOLO:

Answer

A

1) sort the bounding boxes by confidence score, highest first
2) save the bounding box with the highest confidence score as a detection
3) remove all remaining bounding boxes whose IoU with the selected bounding box exceeds a threshold
- Repeat steps 2 and 3 until no bounding boxes remain
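
The procedure can be sketched in plain Python (boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names and the default threshold of 0.5 are illustrative choices, not values from these cards):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept as detections."""
    # Step 1: sort box indices by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)   # Step 2: keep the most confident remaining box.
        kept.append(best)
        # Step 3: drop boxes that overlap the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(non_max_suppression(boxes, scores=[0.9, 0.8, 0.7]))
```

In this toy input the second box overlaps the first heavily and is suppressed, while the third box does not overlap and is kept as a separate detection.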

Question 59

Q

Describe how R-CNN works:

Answer

A

1) selective search produces a set of proposal regions in an image
2) for each proposed region, crop and resize this to a predefined size
3) forward the region through the CNN to extract features
4) use SVMs to predict category scores and a bounding-box regressor to refine the bounding box position

Question 60

Q

How does selective search used in R-CNN and Fast R-CNN work?

Answer

A

generate many candidate regions
use region growing to combine similar regions into larger regions used as final region proposals

Question 61

Q

What are the issues with R-CNN and Fast R-CNN?

Answer

A

1) R-CNN forwards every proposal region through the CNN separately, which is very time consuming (Fast R-CNN avoids this by running the CNN once on the whole image)
2) selective search is a fixed algorithm, no learning takes place, could make bad region proposals and is very time consuming

Question 62

Q

Describe how fast R-CNN works and improves R-CNN:

Answer

A

1) selective search produces a set of proposal regions for the image
2) the whole image is forwarded through the CNN once to get the feature map, which is why Fast R-CNN is much faster than R-CNN
3) the proposal regions are marked as regions of interest on the CNN output and forwarded through a ROI pooling layer
4) fully connected layers are used to predict the class scores and bounding box position

Question 63

Q

How is Faster R-CNN performed?

Answer

A

1) the whole image is forwarded through a CNN to get the feature map
2) the feature map is fed through a region proposal network to produce the region proposals
3) the proposal regions are marked as regions of interest on the CNN output and forwarded through a ROI pooling layer
4) fully connected layers are used to predict the class scores and bounding box position

Question 64

Q

What does the region proposal network used in Faster R-CNN consist of?

Answer

A

candidate bounding boxes are generated
uses sliding windows on the feature map to refine candidate bounding boxes
uses anchors: predefined bounding boxes of different sizes placed at each position on the feature map

Answer 64

A

Fast R-CNN is faster and more accurate than R-CNN
Faster R-CNN is faster and more accurate than Fast R-CNN and R-CNN

Answer 65

A

thresholding
region based
K-means clustering
sliding window
fully convolutional networks (U-Net, SegNet, PSPNet, Mask R-CNN)

Answer 66

A

1) given an image, crop it to a predefined window size.
2) pass the cropped region through a CNN to get class scores
3) slide the window across the whole image, passing every region through the CNN

Answer 67

A

It’s inefficient and very time consuming as you have to forward every subregion separately through the CNN

Answer 68

A

1) Given an image and its label, pass it through a classifier to get feature maps
2) replace FC layer with 1x1 convolution to get feature map
3) use upsampling to resize feature map to original size
4) use ground truth to calculate loss

Answer 69

A

Nearest neighbour interpolation: copy the nearest neighbour to surrounding region
Max unpooling: requires storing values and locations from original image, values are 0 apart from pixel with max value
Deconvolution (transposed convolution): learnable upsampling that reverses the spatial effect of a convolution, although it is not a true mathematical inverse

Answer 70

A

Symmetrical architecture, requires fewer training samples
downsamples to learn context and upsamples to restore spatial details with skip connections
used for medical image segmentation

Answer 71

A

Has no fully connected layers
Uses max pooling in encoder and max unpooling in decoder
uses softmax in final layer
gives pixel wise predictions
uses simpler more efficient upsampling

Answer 72

A

uses pyramid pooling module which captures global context from different spatial scales
features captured are upsampled and concatenated to learn segmentations
doesn’t use fully connected layers, which is more efficient
high accuracy for segregating similar regions

Answer 73

A

Same architecture as Faster R-CNN but adds a third branch
branches for Faster R-CNN are class scores and bounding box position.
third branch generates pixel-level masks for each region of interest
uses ROI align
very good at instance segmentation

Answer 74

A

Given a dataset they generate images
- they learn complex distributions in attempt to generate realistic looking images.
- example: GANs

Answer 75

A

Generative Adversarial networks:
- A two player game, consisting of two neural networks:
- a generator takes random noise as input and learns to generate realistic looking images that resemble the training dataset
- these are passed to the discriminator who classifies if an image is real or fake.

Answer 76

A

because they involve two competing neural networks:
the generator is seen as player one, trying to minimise the objective by generating realistic images to fool the discriminator
the discriminator is seen as player two, and tries to catch out the generator by telling which images are real and fake, maximising the objective

Answer 77

A

They’re trained iteratively in a minimax game.
They’re trained in turns, so when the discriminator is training, the generator is fixed and vice versa.
the generator tries to minimise the objective
the discriminator tries to maximise the objective
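
The objective being minimised and maximised is usually written as below. This is the standard formulation from the GAN literature rather than something stated in these cards; $G$ is the generator, $D$ the discriminator, $x$ a real image and $z$ random noise fed to the generator.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator increases $V$ by outputting values near 1 for real images and near 0 for generated ones, while the generator decreases $V$ by making $D(G(z))$ close to 1.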