Deep Learning for Computer Vision Flashcards

1
Q

What should the number of biases be equal to?

A

The number of output neurons

2
Q

How do we count the number of layers in a neural network?

A
  • typically only count the layers that have parameters, e.g. convolution layers and fully connected layers
  • pooling and ReLU layers have no parameters so are not counted; BatchNorm layers are conventionally not counted either, even though they have learnable scale and shift parameters
3
Q

What is the formula for calculating the number of parameters for a convolutional layer:

For a 32x32 single-channel input image and a convolutional layer of six 5x5 filters, what is the number of parameters used?

A

(filter_width * filter_height * input_depth + 1 for bias) * number of filters

(5 * 5 * 1 + 1) * 6 = 156
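
The formula above can be checked with a few lines of Python. This is a minimal sketch: the function name is made up for illustration, and the input depth of 1 (a single-channel image) is an assumption. Note that the 32x32 image size does not affect the count, because the same filters are reused at every position.

```python
def conv_params(filter_width: int, filter_height: int,
                input_depth: int, num_filters: int) -> int:
    """Each filter has width * height * input_depth weights plus one bias."""
    return (filter_width * filter_height * input_depth + 1) * num_filters


# Six 5x5 filters on a single-channel image (input depth of 1 is assumed).
print(conv_params(5, 5, 1, 6))  # 156
```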

4
Q

What is the formula for calculating the number of parameters for a fully connected layer:

For 120 input nodes with 84 output nodes, what is the number of parameters used?

A

(Input nodes + 1 for bias) * output nodes

(120 + 1) * 84 = 10164
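
The same count as a small Python sketch (the function name is hypothetical): every output node has one weight per input node plus one bias.

```python
def fc_params(input_nodes: int, output_nodes: int) -> int:
    """Each output node has input_nodes weights plus one bias."""
    return (input_nodes + 1) * output_nodes


print(fc_params(120, 84))  # 10164
```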

5
Q

How many layers does the LeNet-5 model have?

A

5:
3 convolution layers and 2 fully connected layers

6
Q

How many layers does the AlexNet model have?

A

8:
5 convolution layers and 3 fully connected layers

7
Q

How many layers does the VGG-16 model have?

A

16:
13 convolution layers and 3 fully connected layers

8
Q

Why do we decrease the feature map size and increase the depth of the channels/filters for each stage?

A

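This is done to maintain the content: as the spatial size of the feature maps shrinks, adding more channels keeps the amount of information the network can carry roughly constant.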

9
Q

What is the advantage of using small filters/kernel size?

A

Stacking two 3x3 conv layers (stride 1) gives the same receptive field as one 5x5 conv layer, but the stacked smaller filters use fewer parameters and hence save memory.

10
Q

What is the difference in parameter size by using two stacked 3x3 filters rather than one 5x5 filter?

A

Two stacked 3x3 filters: 2*(3^2) = 18 weights vs.
one 5x5 filter: 5^2 = 25 weights (supposing we don’t calculate bias). So we save memory by using the smaller filters
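
Both the receptive field and the weight count can be verified with a short sketch. The helper names are made up for illustration, and the code assumes stride-1 convolutions with one input channel and one output channel, ignoring biases.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions: each k x k layer adds k - 1."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


def weights_per_channel_pair(kernel_sizes):
    """Weights (no bias) for one input channel and one output channel per layer."""
    return sum(k * k for k in kernel_sizes)


print(stacked_receptive_field([3, 3]), weights_per_channel_pair([3, 3]))  # 5 18
print(stacked_receptive_field([5]), weights_per_channel_pair([5]))        # 5 25
```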

11
Q

Why use a deeper neural network over a shallow one?

A
  • With good regularisation they generalise better on unseen data
  • They’re able to learn more complex features
12
Q

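How does VGG-Net differ from AlexNet and LeNet?

A

VGG-Net uses only small 3x3 convolution filters, stacked into a much deeper network. Its use of ReLU layers sets it apart from LeNet only, as AlexNet already uses ReLU.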

13
Q

How many layers does GoogLeNet have and what makes it different to other models?

A

It has 22 layers. It uses efficient modules such as the inception module
- 12 times fewer parameters than AlexNet

14
Q

What are auxiliary classifiers?

A

Additional classifiers placed at earlier layers of the network:
- help mitigate the vanishing gradient issue by injecting gradient signal into the earlier layers
- act as a form of regularisation: earlier layers get evaluated sooner and thus their parameters can be updated to be more accurate

15
Q

How many auxiliary classifiers does GoogLeNet use?

A
  • 2 auxiliary classifiers (3 classifiers in total, counting the main classifier at the end of the network)
16
Q

How do inception modules work?

A
  • uses parallel convolutional and pooling layers
  • Extracts features across multiple scales and combines these into a single output
  • keeps the computational cost manageable
17
Q

What are the two types of inception module:

A

Naive and dimension reduction

18
Q

Very deep CNNs are prone to degradation. What can be used to avoid/handle this?

A

Using Residual blocks

19
Q

What does ResNet use that makes it different to other NNs?

A

Residual blocks

20
Q

What are residual blocks?

A
  • A block with a skip connection that adds the input feature map to the block’s output using an element-wise sum.
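
A minimal sketch of the element-wise sum, using plain Python lists in place of feature maps. The names `residual_block` and `toy_transform` are hypothetical, the transform stands in for the block's convolution layers, and applying ReLU after the sum is an assumption taken from the usual ResNet design rather than from this card.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Return relu(transform(x) + x): the skip connection adds the input back element-wise."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the shape for an element-wise sum")
    return relu([f + xi for f, xi in zip(fx, x)])


def toy_transform(x):
    # Hypothetical stand-in for the convolution layers inside the block.
    return [0.5 * v - 1.0 for v in x]


print(residual_block([2.0, -4.0, 6.0], toy_transform))  # [2.0, 0.0, 8.0]
```

If the transform outputs zeros, the block simply passes its (rectified) input through, which is why adding such blocks should not make a deeper network worse.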
21
Q

What do residual networks solve?

A

The vanishing gradient problem: skip connections let information and gradients propagate directly between earlier layers and the final layer

22
Q

What is DenseNet’s key feature?

A
  • Dense connectivity
23
Q

What is dense connectivity?

A
  • within a dense block, each layer is connected to every other layer: it receives the feature maps of all preceding layers, combined by concatenation, to ensure maximum information reuse and gradient flow
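
The concatenation pattern can be sketched with plain lists standing in for feature maps. This is an illustrative toy, not DenseNet itself: `dense_block` and the lambda layers are hypothetical names, and each toy layer emits a single new feature.

```python
def dense_block(x, layers):
    """Feed every layer the concatenation of the input and all earlier layer outputs."""
    features = [x]
    for layer in layers:
        concatenated = [v for f in features for v in f]
        features.append(layer(concatenated))
    # The block output is the concatenation of everything produced so far.
    return [v for f in features for v in f]


# Each toy layer produces one new feature: the sum of everything it receives.
toy_layers = [lambda inp: [sum(inp)], lambda inp: [sum(inp)]]
print(dense_block([1.0, 2.0], toy_layers))  # [1.0, 2.0, 3.0, 6.0]
```

The second layer sees both the original input and the first layer's output, which is the reuse of earlier features that the card describes.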
24
Q

Why use dense connectivity over resnet blocks?

A
  • DenseNet uses far fewer parameters
25
Q

What is the benefit of using dense connectivity?

A
  • Loss is propagated through all the layers in a dense block, which gives a strong gradient signal
  • It ensures that gradients can flow directly to earlier layers during backpropagation, addressing vanishing gradient issues.
26
Q

Describe SE blocks:

A
  • squeeze: performs global average pooling on each feature map, then flattens the results into a 1D vector to get a compact representation of the image (one value per channel)
  • excitation: two fully connected layers with a non-linearity (ReLU) in between generate weights that indicate the importance of each channel
  • scale: a sigmoid maps the weights to the range (0, 1), and each channel of the original feature map is multiplied by its weight, amplifying important channels and suppressing irrelevant ones
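
The three steps can be sketched in plain Python, with a feature map stored as a list of channels and each channel as a 2D list. All function names and the weight values below are made up for illustration; a real SE block would learn the fully connected weights during training.

```python
import math


def squeeze(feature_map):
    """Global average pooling: one number per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def dense(vector, weights, biases):
    """Fully connected layer: weights[j][i] connects input i to output j."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, biases)]


def se_block(feature_map, w1, b1, w2, b2):
    z = squeeze(feature_map)                                  # squeeze
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]          # excitation: FC + ReLU
    gates = [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w2, b2)]  # FC + sigmoid
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch] for ch, g in zip(feature_map, gates)]
    return scaled, gates


feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
# Illustrative weights: 2 channels reduced to 1 hidden unit, then expanded back to 2.
scaled, gates = se_block(feature_map, [[1.0, 1.0]], [0.0], [[1.0], [-1.0]], [0.0, 0.0])
print([round(g, 3) for g in gates])
```

With these example weights the first channel receives a gate close to 1 and the second a gate close to 0, so the first is kept and the second is suppressed.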
27
Q

What is the motivation behind SENet/ SE-blocks?

A

They improve the representational power of features and produce better model performance

28
Q

What are the two types of separable convolution?

A

Depthwise and pointwise convolution (applied one after the other, they form a depthwise separable convolution)

29
Q

What is depthwise convolution?

A

Given an input with 3 channels and a filter with 3 channels, we do the convolution channel by channel, e.g. the first filter channel only convolves with the first input channel.

30
Q

What is pointwise convolution?

A

Can be treated as a one-by-one convolution: the filter size is 1x1, and this mixes all the channel information.
If we want 256 output channels we need to use 256 of these 1x1 filters.

30
Q

What does MobileNet use in its architecture? What benefit does this have?

A

Uses depthwise and pointwise convolution

  • this greatly reduces the number of parameters needed
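
The saving can be made concrete by counting weights. This is a rough sketch with hypothetical function names, ignoring biases, and the 3-channel input with 256 output channels is just an example configuration.

```python
def standard_conv_weights(k: int, c_in: int, c_out: int) -> int:
    """Every one of the c_out filters spans all c_in channels."""
    return k * k * c_in * c_out


def depthwise_separable_weights(k: int, c_in: int, c_out: int) -> int:
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # c_out filters of size 1 x 1 x c_in
    return depthwise + pointwise


print(standard_conv_weights(3, 3, 256))        # 6912
print(depthwise_separable_weights(3, 3, 256))  # 795
```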
31
Q

What is Transfer Learning?

A

An ML technique where a model that’s been trained for a specific task is repurposed as a starting point for another similar task

32
Q

What is the difference between traditional ML and transfer learning?

A

Training using traditional ML uses isolated single-task learning: learning is performed without considering previously learnt information from other tasks.

Training using transfer learning is where learning a new task relies on a previously learned task.

33
Q

What is true about the layers in the model used for a new task when transfer learning is used?

A
  • The new model will typically borrow several layers from a past model which are fixed (“frozen”)
  • Only the last few layers of the new model will be trained
34
Q

What is true about the datasets used in the new and old models for transfer learning?

A

The old model typically uses a large dataset and the transfer learning model typically uses a small dataset

35
Q

What are the 4 steps of the transfer learning process?

A

1) Design a NN and train it on a large dataset
2) Borrow the first several layers of this trained NN for the new transfer learning model, including their parameters (weights and biases).
3) Train only the last layers of the new NN on a small dataset
4) Evaluate the model on a new small dataset and measure classification accuracy.

36
Q

If you have a small dataset and don’t have enough access to GPUs, what can you use?

A

Transfer Learning

37
Q

Why does a transfer learning model use a small learning rate?

A
  • Ensures that pre-trained weights are not drastically altered, allowing the model to fine-tune its existing knowledge to the new task gradually.
  • Ensures gradient updates are smooth and that the pre-trained features adapt without disrupting the stability of the network.
38
Q

What are the advantages of transfer learning?

A
  • faster training time for model (as the new model borrows optimised weights and biases from the trained model)
  • it performs very well on small datasets
39
Q

When might transfer learning not be appropriate to use?

A

If the dataset is very large

40
Q

What is metric learning?

A

An ML method that involves learning a distance function or metric over a dataset. Used for face verification

  • Uses loss to move similar objects as close as possible and dissimilar objects far away from each other, in clusters
  • for x categories, there’ll be x clusters
41
Q

How does metric learning with contrastive loss work?

A
  • take an input pair of samples that are either similar or dissimilar
  • use loss formula to bring similar samples closer and dissimilar samples far apart
  • if samples are similar, minimise distance
  • if samples are dissimilar, maximise distance (typically only until it exceeds a margin)
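
One common form of the contrastive loss can be sketched as follows. The function names, the Euclidean distance, and the margin value of 1.0 are assumptions for illustration; some formulations also include a factor of 1/2.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs apart until they exceed the margin."""
    d = euclidean(a, b)
    if similar:
        return d ** 2                      # any distance between similar samples is penalised
    return max(0.0, margin - d) ** 2       # dissimilar samples are penalised only inside the margin


print(contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=True))   # 25.0
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=False))  # 0.0
print(contrastive_loss([0.0, 0.0], [0.0, 0.5], similar=False))  # 0.25
```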
41
Q

Describe the pipeline for training a model to do face recognition:

A

1) design a NN to extract facial features
2) turn this into a 1D vector using pooling
3) Use this to predict the category
4) Compare similarities to decide if the feature is the same or not
5) Calculate overall similarity. If it meets a threshold we classify it as the same person.

42
Q

What is CLIP?

A

Contrastive Language-Image Pre-training: an ML model that uses metric learning
- associates images with textual descriptions from a large dataset of text and images

43
Q

How does CLIP work for the training stage?

A
  • Uses two neural networks: a text encoder and an image encoder, these generate text and image embeddings and project them onto a joint space
  • During training, the model is given image-text pairs
  • use contrastive loss function to bring embeddings of corresponding pairs closer together and distance embeddings of mismatched pairs
43
Q

Describe how text for an image is predicted using CLIP for testing?

A

1) image is passed through image encoder to obtain its embedding in the shared space.
2) A set of text descriptions is encoded using the text encoder. Each candidate description results in a separate text embedding.
3) computes similarity between image embedding and each text embedding
4) text with highest similarity score to the image embedding is selected as predicted description.
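
Steps 3 and 4 can be sketched with cosine similarity. The embeddings below are made-up numbers standing in for real encoder outputs, and `predict_description` is a hypothetical helper, not part of any CLIP library.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_description(image_embedding, text_embeddings):
    """text_embeddings maps each candidate description to its embedding."""
    scores = {text: cosine_similarity(image_embedding, emb)
              for text, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Made-up embeddings in a 3-dimensional shared space.
image_embedding = [0.9, 0.1, 0.2]
candidates = {
    "a photo of a dog": [1.0, 0.0, 0.1],
    "a photo of a cat": [0.0, 1.0, 0.0],
    "a photo of a car": [0.1, 0.2, 1.0],
}
best, scores = predict_description(image_embedding, candidates)
print(best)  # a photo of a dog
```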

44
Q

Why are R-CNN, Fast R-CNN and Faster R-CNN known as two-stage object detectors?

A

Because the first stage involves getting proposals for objects and the second stage involves using a neural network to refine the position and predict the category

45
Q

What is the difference between how R-CNN, Fast R-CNN and Faster R-CNN find proposals for object regions?

A

R-CNN and Fast R-CNN use selective search, Faster R-CNN uses a small neural network known as the region proposal network

46
Q

What are the summaries of R-CNN, Fast R-CNN and Faster R-CNN?

A
  • R-CNN: Selective search + CNN + SVM
  • Fast R-CNN: Selective search + CNN + ROI pooling
  • Faster R-CNN: Region proposal NN + CNN + ROI pooling
47
Q

What is the difference between Nearest neighbour interpolation and max unpooling for upsampling?

A

Nearest neighbour interpolation only needs the pixel values, whereas max unpooling needs both the pixel values and their positions, so the positions of the maxima must also be stored during max pooling

48
Q

What is the motivation of generative models?

A

To learn complex distributions in order to generate new realistic images.

49
Q

What are the two separate tasks of object detection?

A
  • Classification
  • Localisation

First perform feature extraction, then predict class labels (classification) and bounding boxes (localisation)

50
Q

Name a one stage method for deep learning object detection:

A

YOLO

51
Q

Name some region proposal two stage methods for deep learning object detection:

A

R-CNN, Fast R-CNN, Faster R-CNN

52
Q

What is the sliding window method for multiple object detection using deep learning:

A

Given an image, crop a region of the image to a predefined window size, then forward it through the classifier/CNN to predict the category and position. Slide the window along the image, forwarding every subregion through the CNN.

53
Q

What is the disadvantage of using the sliding window as a deep learning object detection method?

A
  • repeatedly applying the CNN to many cropped regions is very computationally expensive and time consuming
54
Q

Describe YOLO as a deep learning object detection method:

A
  • resize image to a predefined size
  • forward it through a CNN to extract features
  • divide the feature map into an SxS grid
  • for each grid cell, predict two branches: one for bounding box positions (and whether each box is foreground or background) and one for the scores of each category
  • combine position and category to get final detections
  • perform non-maximum suppression to remove duplicate detections
55
Q

Describe how non-maximum suppression is performed for YOLO:

A

1) sort bounding boxes by confidence score, highest first
2) save the remaining bounding box with the highest confidence score as a detection
3) remove all remaining bounding boxes whose IOU with the selected bounding box exceeds a threshold
- Repeat steps 2 and 3 until no bounding boxes remain
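
A minimal sketch of these steps in Python. Boxes are assumed to be (x1, y1, x2, y2) tuples, the IOU threshold of 0.5 is an illustrative default, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept as detections."""
    # Step 1: sort box indices by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)            # Step 2: keep the most confident remaining box.
        keep.append(best)
        # Step 3: drop boxes that overlap the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IOU of about 0.68, so it is treated as a duplicate; the third box does not overlap and is kept.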

56
Q

Describe how R-CNN works:

A

1) selective search produces a number of proposal regions in an image
2) each proposed region is cropped and resized to a predefined size
3) each region is forwarded through the CNN to extract features
4) SVMs predict the category scores, and a bounding box regressor refines the box position

57
Q

How does selective search used in R-CNN and Fast R-CNN work?

A
  • generate many candidate regions
  • use region growing to combine similar regions into larger regions used as final region proposals
58
Q

What are the issues with R-CNN and Fast R-CNN?

A

1) in R-CNN, forwarding every proposal region through the CNN separately is very time consuming (Fast R-CNN avoids this by forwarding the whole image once)
2) selective search is a fixed algorithm: no learning takes place, so it could make bad region proposals, and it is very time consuming

59
Q

Describe how Fast R-CNN works and improves R-CNN:

A

1) selective search produces a number of proposal regions for the image
2) the whole image is forwarded through the CNN once to get the feature map, which makes Fast R-CNN much faster than R-CNN
3) the proposal regions are marked as regions of interest on the CNN output and forwarded through a ROI pooling layer
4) A fully connected layer is used to predict the class scores and bounding box position

60
Q

How is Faster R-CNN performed?

A

1) the whole image is forwarded through a CNN to get the feature map
2) the feature map is fed through a region proposal network to produce the region proposals
3) the proposal regions are marked as regions of interest on the CNN output and forwarded through ROI pooling layer
4) a fully connected layer is used to predict the class scores and bounding box position

61
Q

What does the region proposal network used in Faster R-CNN consist of?

A
  • candidate bounding boxes are generated
  • uses sliding windows on the feature map to refine candidate bounding boxes
  • uses anchors: predefined bounding boxes of different sizes placed at each position on the feature map
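
Anchor placement can be sketched as below. This only illustrates the idea of predefined boxes at every feature map position: the function name, the stride, and the scale and aspect ratio values are assumptions, not the settings of any particular Faster R-CNN implementation.

```python
def generate_anchors(feature_height, feature_width, stride, scales, ratios):
    """Return (x1, y1, x2, y2) anchors in image coordinates for every feature map cell."""
    anchors = []
    for row in range(feature_height):
        for col in range(feature_width):
            # Centre of this feature map cell, mapped back to the image.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # ratio is width / height; the anchor area stays scale * scale.
                    w = scale * ratio ** 0.5
                    h = scale / ratio ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(2, 3, stride=16, scales=[64, 128], ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 36
```

Each of the 6 positions gets 2 scales times 3 ratios, giving 36 anchors that the region proposal network would then score and refine.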
62
Q

How do R-CNN, Fast R-CNN and Faster R-CNN compare to each other?

A
  • Fast R-CNN is faster and more accurate than R-CNN
  • Faster R-CNN is faster and more accurate than Fast R-CNN and R-CNN
63
Q

Name 3 traditional object segmentation methods and then name some deep learning methods:

A
  • traditional: thresholding
  • traditional: region based
  • traditional: K-means clustering
  • deep learning: sliding window
  • deep learning: fully convolutional networks (U-Net, SegNet, PSPNet, Mask R-CNN)
64
Q

Describe the sliding window method for DL object segmentation:

A

1) given an image, crop it to a predefined window size.
2) pass the cropped region through a CNN to get class scores
3) slide the window across the whole image, passing every region through the CNN

65
Q

What is the downfall of the sliding window method for DL object segmentation?

A

It’s inefficient and very time consuming as you have to forward every subregion separately through the CNN

66
Q

Describe Fully convolutional networks as a method for DL object segmentation:

A

1) Given an image and its label, pass it through a classifier to get feature maps
2) replace the FC layer with a 1x1 convolution to get a feature map
3) use upsampling to resize the feature map to the original size
4) use the ground truth to calculate the loss

67
Q

Describe the 3 in-network upsampling methods used by DL object segmentation:

A
  • Nearest neighbour interpolation: copy each value to its surrounding region in the larger output
  • Max unpooling: requires storing the locations of the maxima from the corresponding max pooling layer; each value is placed back at its stored location and all other values are 0
  • deconvolution (transposed convolution): learnable upsampling that reverses the spatial downsampling of a convolution
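
The first two methods can be sketched with plain 2D lists. The function names are hypothetical, and the code assumes square, non-overlapping pooling windows whose size divides the grid dimensions.

```python
def nearest_neighbour_upsample(grid, factor=2):
    """Copy every value into a factor x factor block."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def max_pool_with_indices(grid, size=2):
    """Max pooling that also records where each maximum came from."""
    pooled, indices = [], []
    for r in range(0, len(grid), size):
        pooled_row, index_row = [], []
        for c in range(0, len(grid[0]), size):
            window = [(grid[r + dr][c + dc], (r + dr, c + dc))
                      for dr in range(size) for dc in range(size)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(values, indices, height, width):
    """Put each value back at its stored position; everything else is 0."""
    out = [[0] * width for _ in range(height)]
    for value_row, index_row in zip(values, indices):
        for value, (r, c) in zip(value_row, index_row):
            out[r][c] = value
    return out


print(nearest_neighbour_upsample([[1, 2], [3, 4]]))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]

grid = [[1, 3, 2, 1], [4, 2, 0, 5], [6, 1, 1, 2], [0, 2, 3, 7]]
pooled, indices = max_pool_with_indices(grid)
print(pooled)                             # [[4, 5], [6, 7]]
print(max_unpool(pooled, indices, 4, 4))
# [[0, 0, 0, 0], [4, 0, 0, 5], [6, 0, 0, 0], [0, 0, 0, 7]]
```

Nearest neighbour upsampling needs nothing beyond the values themselves, while max unpooling cannot run without the stored positions.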
68
Q

Describe U-Net architecture and motivation:

A
  • Symmetrical architecture, requires fewer training samples
  • downsamples to learn context and upsamples to restore spatial details with skip connections
  • used for medical image segmentation
69
Q

Describe SegNet architecture and motivation:

A
  • Has no fully connected layers
  • Uses max pooling in encoder and max unpooling in decoder
  • uses softmax in final layer
  • gives pixel wise predictions
  • uses simpler more efficient upsampling
69
Q

Describe PSPNet architecture and motivation:

A
  • uses pyramid pooling module which captures global context from different spatial scales
  • features captured are upsampled and concatenated to learn segmentations
  • doesn’t use fully connected layers, which is more efficient
  • high accuracy when separating similar-looking regions
70
Q

Describe Mask R-CNN architecture and motivation:

A
  • Same architecture as Faster R-CNN but adds a third branch
  • branches for Faster R-CNN are class scores and bounding box position.
  • third branch generates pixel-level masks for each region of interest
  • uses ROI align in place of ROI pooling
  • very good at instance segmentation
71
Q

What are generative models:
Give an example:

A

Given a dataset they generate images
- they learn complex distributions in attempt to generate realistic looking images.
- example: GANs

72
Q

Describe how GANs work:

A

Generative Adversarial networks:
- A two player game, consisting of two neural networks:
- a generator learns from a dataset to generate realistic looking images from random noise
- these are passed to the discriminator, which classifies whether an image is real or fake.

73
Q

Why is a GAN considered a two player game?

A
  • because they involve two competing neural networks:
  • the generator is seen as player one, trying to minimise the objective by generating realistic images to fool the discriminator
  • the discriminator is seen as player two, and tries to catch out the generator by telling which images are real and fake, maximising the objective
74
Q

Describe training a GAN:

A
  • They’re trained iteratively in a minimax game.
  • They’re trained in turns, so when the discriminator is training, the generator is fixed and vice versa.
  • the generator tries to minimise the objective
  • the discriminator tries to maximise the objective
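
The objective being minimised and maximised is usually written as V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]. The sketch below only evaluates this value from discriminator outputs, which are made-up probabilities here; it does not train any network, and the function name is hypothetical.

```python
import math


def gan_objective(d_real, d_fake):
    """V(D, G): d_real are D's outputs on real images, d_fake its outputs on generated ones."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Discriminator winning: real images scored high, fakes scored low.
print(gan_objective([0.9, 0.8], [0.1, 0.2]))
# Generator winning: the discriminator cannot tell real from fake.
print(gan_objective([0.5, 0.5], [0.5, 0.5]))
```

The first value is higher than the second, matching the card: the discriminator pushes the objective up, and the generator pushes it down by making its images harder to tell apart from real ones.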