Deep Learning Basics Flashcards

1
Q
  1. What happens if you apply a 5x5 filter to a 7x7 image with no padding and stride = 1?
  2. What happens if you apply a 5x5 filter to a 7x7 image with no padding and stride = 2?
  3. What happens if you apply a 5x5 filter to a 7x7 image with no padding and stride = 3?
A
  1. You get a 3x3 output
  2. You get a 2x2 output
  3. Not possible: (7 - 5)/3 is not an integer, so the filter would go outside the image
2
Q

What is stride?

A

Stride is the number of pixels we move the filter by at each step.
Stride = 1 means the filter moves 1 pixel at a time.

3
Q

What is the formula for the output size after applying a filter to an image?

What is the size of the following:
1. 7x7 image, filter = 3x3, stride = 1
2. 7x7 image, filter = 3x3, stride = 2
3. 7x7 image, filter = 3x3, stride = 3

A

((N-F)/Stride) + 1

stride 1 => ((7 - 3) / 1) + 1 = 5
stride 2 => ((7 - 3) / 2) + 1 = 3
stride 3 => ((7 - 3) / 3) + 1 = 2.33, which is not an integer, so this stride is not recommended (the filter does not fit the image evenly).
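A minimal Python sketch of this formula (the helper name conv_output_size is just illustrative; the padding argument anticipates the later cards):

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output width/height for an n x n input and an f x f filter."""
    size = (n - f + 2 * padding) / stride + 1
    if not float(size).is_integer():
        raise ValueError(f"stride {stride} does not fit evenly: output would be {size:.2f}")
    return int(size)

print(conv_output_size(7, 3, stride=1))  # 5
print(conv_output_size(7, 3, stride=2))  # 3
# conv_output_size(7, 3, stride=3) raises: output would be 2.33
```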

4
Q

What is the motivation for padding?

A

To obtain an output size that is the same as the input image size.

5
Q

What formula do we use to calculate the padding size?

How much should we pad for filter size:
- 3
- 5
- 7

A

(F-1)/2

(3-1)/2 = 1 (padding)
(5-1)/2 = 2 (padding)
(7-1)/2 = 3 (padding)
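A quick Python sketch of this rule (assuming an odd filter size and stride 1):

```python
def same_padding(f):
    """Zero padding that keeps the output the same size as the input
    (odd filter size f, stride 1)."""
    return (f - 1) // 2

for f in (3, 5, 7):
    print(f, "->", same_padding(f))   # 3 -> 1, 5 -> 2, 7 -> 3
```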

6
Q

How do we deal with images that have a depth larger than 1, e.g. RGB images with depth 3?

A

We must use a filter with a depth that matches the input image depth. E.g. for an RGB image of depth 3, the filter must have depth 3.
  - At each position we perform the convolution on each channel and compute the dot product across the depth, adding the 3 per-channel values to merge them into 1 output value.

7
Q

What happens if the image depth and filter depth aren’t the same value?

A

We can’t calculate their dot product.

8
Q

We may want to use multiple filters, how does this affect the output size?

A

The number of filters decides the depth of the output; each filter produces one activation map, so the output depth equals the number of activation maps.

9
Q

Convolutional Neural Networks (CNN) size
In one convolution layer, we have 128 filters of 3x3x3 applied to input volume 128x128x3 with stride 2 and pad 1. What is the size of the output volume? Give details of how you calculate the size of the output volume.

A

((N - F + 2P) / stride) + 1

((128 - 3 + 2*1) / 2) + 1 = 64.5

round down/floor to 64

so output is 64x64x128
- number of filters is always the same as the number of output activation maps
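A short Python sketch that reproduces this calculation (the helper conv_output_volume is hypothetical, not a library function):

```python
import math

def conv_output_volume(w, h, num_filters, f, stride, pad):
    """Output volume (W2, H2, D2) of a conv layer, applying floor when the
    division is not exact."""
    w2 = math.floor((w - f + 2 * pad) / stride) + 1
    h2 = math.floor((h - f + 2 * pad) / stride) + 1
    return w2, h2, num_filters

# 128 filters of 3x3x3 applied to a 128x128x3 input, stride 2, pad 1
print(conv_output_volume(128, 128, num_filters=128, f=3, stride=2, pad=1))
# (64, 64, 128)
```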

10
Q

What are the formulas for calculating the output image's height, width and depth:

A

Input: a volume of size W1 x H1 x D1.
Four hyperparameters are required: number of filters K, filter size F, stride S, amount of zero padding P.

Next layer: a volume of size W2 x H2 x D2, where:
- W2 = (W1 - F + 2P) / S + 1
- H2 = (H1 - F + 2P) / S + 1
- D2 = K

When (W1 - F + 2P) / S or (H1 - F + 2P) / S is not an integer, take the floor:
- W2 = floor((W1 - F + 2P) / S) + 1
- H2 = floor((H1 - F + 2P) / S) + 1
- D2 = K

11
Q

Explain the role of the pooling layer in CNN:

A
  • performs down-sampling making image representations smaller and more manageable, by aggregating information, reducing computation costs and memory usage
  • e.g. given an input image of 200x200, it can make it 100x100.
  • it operates over each activation map independently. The width and height get smaller but the depth remains the same
12
Q

List two methods of down-sampling/pooling layers:

A

Max pooling: given a subregion, take the max value, then move the window by the stride to the next region. max(2, 9, 4, 5) = 9

Average pooling: given a subregion, take the average of the pixel values, then move the window by the stride to the next region. avg(2, 9, 4, 5) = 20/4 = 5
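A naive NumPy sketch of both operations on a single-channel image (the pool2d helper is illustrative only):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Naive 2D pooling over one channel, no padding."""
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            region = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

x = np.array([[2., 9., 1., 3.],
              [4., 5., 6., 2.],
              [7., 0., 8., 1.],
              [2., 3., 4., 5.]])
print(pool2d(x, mode="max"))   # [[9. 6.] [7. 8.]]
print(pool2d(x, mode="avg"))   # [[5. 3.] [3. 4.5]]
```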

13
Q

What is a drawback of max and average pooling:

A
  • They may remove important information or whole features from an image when down-sampling.
14
Q

Calculate the output size of the pooling layer:

Give the output size for this question:
image = 128x128x3, Filter/pooling size = 2x2, stride = 2, no padding

A

image: W x H x D, filter F, stride S
- depth is the same
- W = floor((W - F)/S) + 1
- H = floor((H - F)/S) + 1

D = 3
W = ((128-2)/2)+1 = 64
H = ((128-2)/2)+1 = 64
output = 64x64x3

15
Q

Explain the role of the fully connected layer in CNN:

A
  • connects all neurons, so every input of the input vector influences every output of the output vector.
  • maps learned high level features to the final output
16
Q

Explain the process of the fully connected layer:

What is the output size of a fully connected layer with input image 32x32x3, 10 categories.

How many parameters would there be:

A

step 1) flatten the input volume (all channels) into a 1D vector
step 2) connect each input node to each output node
step 3) y = Wx (ignoring bias)

  • the 32x32x3 image becomes a 3072x1 vector
  • the weight matrix is 10x3072; matrix multiplication gives a 10x1 output

there would be 30720 parameters without bias. (30730 with bias)
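A NumPy sketch of these steps for the 32x32x3 example (random values stand in for real image data and learned weights):

```python
import numpy as np

x = np.random.rand(32, 32, 3)      # input volume
x_flat = x.reshape(-1, 1)          # step 1: flatten -> 3072 x 1
W = np.random.rand(10, 3072)       # step 2: one row of weights per category
b = np.random.rand(10, 1)

y = W @ x_flat + b                 # step 3: 10 x 1 output of class scores
print(y.shape)                     # (10, 1)
print(W.size, W.size + b.size)     # 30720 parameters, 30730 with bias
```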

17
Q

What is the role of the activation function in CNN:

A
  • it introduces non-linearity to the network, enabling it to learn more complex relationships/patterns in the data.
  • some activation functions (e.g. ReLU) can mitigate vanishing gradients
18
Q

What is the role of the BatchNorm layer in CNN?

A
  • It normalises the output of each layer in the network
  • Ensures outputs of a layer have mean = 0 and standard deviation = 1 which improves training stability
  • Improves convergence speed. Can use random initialisation and a big learning rate
19
Q

When is the BatchNorm layer applied:

A

After the convolution layer but before the activation function
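For illustration, a typical PyTorch-style block with this ordering (layer sizes are arbitrary):

```python
import torch.nn as nn

# convolution -> BatchNorm -> activation
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),   # normalises each of the 64 output channels
    nn.ReLU(),
)
```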

20
Q

What happens if we don’t perform batchNorm:

A

The distribution of each layer's activations is uncontrolled: it is unknown and varies from iteration to iteration, which makes training harder.

21
Q

Why do we forward batches through the model instead of single images:

A

Gradients and statistics computed over a batch are less noisy than those from a single image, which creates a more stable network.

22
Q

Why does batchNorm have two learnable parameters?

A
  • γ (scale) and β (shift) for each channel
    - The parameters γ and β allow the network to learn the optimal scale and shift for each feature map.
23
Q

Why is the training process more stable using batchNorm?

A

Because a moving average of the mean and standard deviation is used to update them, so the normalisation statistics change smoothly between iterations.

24
Q

What are the differences between a standard network and a network that uses batchNorm?

A
  • a standard network has to initialise its parameters beforehand and needs a carefully designed initialisation strategy
  • a BatchNorm network can use random initialisation
  • a standard network sometimes can't use a big learning rate and converges more slowly
  • a BatchNorm network can use a big learning rate, so it converges quicker
  • a BatchNorm network contains extra (BatchNorm) layers that a standard network does not
25
Q

What effect does batchNorm have on the number of epochs required to train a deep network?

A
  • it dramatically reduces the number of epochs, as it can use a bigger learning rate and therefore converge faster
26
Q

What is the role of the dropout layer in CNN:

A
  • Dropout is used to overcome overfitting to the training data and improve generalisation of the model
27
Q

When might overfitting take place?

A

If there’s a relatively large number of parameters with an insufficient number of training samples, overfitting often occurs

28
Q

How is the dropout operation/layer performed?

A
  • each node in the layer is dropped with a probability equal to the dropout rate
  • if we set a dropout rate of 30%, then roughly 30% of the nodes in that layer will be dropped
  • if a node is dropped, all of its connections to the higher layers are disconnected for that pass (see the sketch below)
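A rough NumPy sketch of one common variant ("inverted" dropout, where each node is dropped independently with probability equal to the rate):

```python
import numpy as np

def dropout(activations, rate=0.3, training=True):
    """Drop each node with probability `rate` during training; at test
    time the input is returned unchanged."""
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) >= rate
    return activations * mask / (1.0 - rate)   # rescale to keep the expected value

a = np.ones((1, 8))
print(dropout(a, rate=0.3))   # roughly 30% of entries zeroed, the rest scaled up
```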
29
Q

Why may we only perform dropout in the later layers of our neural network:

A
  • early dropout may disrupt the gradient
  • perform dropout only on deeper layers to preserve feature extraction in the early layers
  • deeper layers are more prone to overfitting, so it is more effective to perform dropout there
30
Q

What is the role of the ReLU layer?

A
  • an activation function that improves the non-linear representation ability of the network.
31
Q

What is the role of the deconvolutional layer?

A
  • aka transposed convolution
  • used to upsample feature maps to a bigger size
  • useful when you need the output to match the size of the input dimensions
  • used for semantic segmentation and by GANs for image generation.
32
Q

What does regularisation achieve?

A
  • prevents overfitting
  • improves model generalisation
  • stabilises the training process
  • reduces model complexity
33
Q

What can happen if your dataset is too small?

A
  • underfitting, due to bias
34
Q

What technique can be used if the dataset is too small?

A

N-Fold Cross Validation

35
Q

What is N-Fold Cross Validation?

A
  • split the data into n folds
  • train the model on n-1 folds and use the remaining fold for testing
  • each round a different fold is used as the test set, while the rest are used for training
  • after each round the model's performance is recorded
  • for n folds you get n performance scores
  • these are then averaged to get the final model evaluation score (see the sketch below)
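A minimal sketch using scikit-learn's KFold (the toy data and the placeholder "score" are only for illustration; a real model's fit/score calls would go where the comment is):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features each
y = np.arange(10) % 2

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # model.fit(X_train, y_train); score = model.score(X_test, y_test)
    score = 1.0                     # placeholder performance for this fold
    scores.append(score)

print(sum(scores) / len(scores))    # final evaluation: average over the n folds
```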
36
Q

What is the long tail problem?

And what is the solution to the long tail problem?

A
  • there are many categories, but the dataset has only a few examples of some categories, so it's hard to train the model accurately on them.
  • randomly splitting the dataset could mean the training set doesn't include even one instance of a category; all its instances would be in the validation set, so the model would have no ability to predict this category.
  • the solution is n-fold cross-validation, as this makes sure every image appears in the training and test sets at least once.
37
Q

What evaluation metrics can we use for a model performing binary classification?

A

Two categories, positive (1) and negative (0).

From the classifier's predictions we count:
TP - true positives: number of positive samples the classifier has correctly identified as positive
TN - true negatives: number of negative samples the classifier has correctly identified as negative (e.g. 'not lion')
FP - false positives: number of negative samples the classifier has incorrectly identified as positive
FN - false negatives: number of positive samples the classifier has incorrectly identified as negative

38
Q

Given the metrics used to evaluate a classifier's performance on binary classification, what are the formulas we can calculate with them?

A

Precision: TP / (TP + FP)

Recall: TP / (TP + FN)

F1: 2 * (precision * recall) / (precision + recall)
- combines precision and recall to give the final performance of the model

Accuracy: (TP + TN) / (TP + TN + FP + FN)
- correct predictions over the total dataset
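A small Python sketch of these formulas (the example counts are made up):

```python
def binary_metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# hypothetical counts: 8 TP, 80 TN, 2 FP, 10 FN
print(binary_metrics(tp=8, tn=80, fp=2, fn=10))
# (0.8, 0.444..., 0.571..., 0.88)
```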

39
Q

What qualities lead to underfitting, fitting well and overfitting:

A

underfitting: high bias
fit well: low bias, low variance
overfitting: high variance

40
Q

What can we do to fix underfitting:

A
  • train the network with more epochs
41
Q

What can we do to fix overfitting:

A
  • data augmentation
  • add regularisation layers
42
Q

What are L1 and L2 regularisation methods?

A
  • used to prevent overfitting
  • they add a penalty term to the loss function that encourages smaller weights
43
Q

What is the role of the convolution layer in CNN?

A
  • It extracts local features by applying convolutional filters/kernels to the input to produce output feature maps.
44
Q

What are the differences between the training, validation and test datasets?

How is a dataset split into training, validation and testing?

A
  • Training set: Training models
  • Validation set: Test the model trained on the training set and select the best-performing model
  • Testing set: Test the model
  • Training set vs validation set: 80% vs 20% or 90% vs 10%
  • Training vs validation vs testing set: 60%-20%-20%
  • No standard rule, if a large amount of data, e.g., 1,000,000. It is possible to have 98%-1%-1%
  • Sometimes we only get the training & validation set (in some competitions the test set isn’t available to us)
45
Q

How is L1 regularisation calculated?

A

loss = squared error loss + penalty term
loss = sum of (actual - input*weight)^2 + λ * sum of |weights|   (λ is the regularisation strength)

46
Q

How is L2 regularisation calculated?

A

loss = squared error loss + penalty term
loss = sum of (actual - input*weight)^2 + λ * sum of (weights^2)   (λ is the regularisation strength)

47
Q

What is the difference between L1 and L2 regularisation loss:

A

L1: the penalty term added is proportional to the absolute values of the weights
L2: the penalty term added is proportional to the square of the weights

L1 loss encourages sparsity in the weights
L2 loss discourages large weights
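A NumPy sketch of both penalties added to a squared-error loss (the regularisation strength lam is an assumed hyperparameter):

```python
import numpy as np

def regularised_loss(y_true, y_pred, weights, lam=0.01, kind="l2"):
    """Squared-error loss plus an L1 or L2 penalty on the weights."""
    data_loss = np.sum((y_true - y_pred) ** 2)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(weights))   # encourages sparse weights
    else:
        penalty = lam * np.sum(weights ** 2)      # discourages large weights
    return data_loss + penalty

w = np.array([0.5, -2.0, 0.0])
print(regularised_loss(np.array([1.0]), np.array([0.8]), w, kind="l1"))
print(regularised_loss(np.array([1.0]), np.array([0.8]), w, kind="l2"))
```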

48
Q

What are some regularisation techniques?

A
  • L1 and L2 regularisation
  • Dropout
  • early stopping
  • N-fold cross-validation
49
Q

Why do dropout layers help with regularisation?

A

They ensure the model doesn't put too much weight on a single node, as any node may be dropped during training.

50
Q

When training a deep network, what could be the reason why the loss doesn’t decrease in a few epochs?

A
  • The learning rate is too small
  • The model is overfitted
51
Q

How is early stopping used as a regularisation strategy:

A

Early stopping stops the training process when the validation loss stops decreasing, before the model starts to overfit.

52
Q

What is an epoch?

A

One complete pass of the entire training dataset forwards and backwards through the neural network.

53
Q

Calculate the epoch size:
Total samples: 1000
Batch size: 50

A

1000/50 = 20
epoch size = 20 batches or iterations

54
Q

What are the benefits of data augmentation

A
  • Prevents models from overfitting
  • used when the initial training set is too small
  • improves model accuracy by improving model generalisation
55
Q

Give some examples of data augmentation:

A
  • image flip
  • image crop
  • image rotation
  • CutMix
56
Q

Describe the network training pipeline:

A

1) given a training dataset of images and their ground truth labels, design a neural network, e.g. a CNN
2) perform the forward passes, which extracts the image features and produces predictions for each of the categories.
3) choose the max value of the predictions as the category of the image
4) use the ground truth to calculate the loss and then perform backpropagation to update the parameters
5) train the model, iteration by iteration

57
Q

Describe one iteration of the training process:

A

1) sample a batch of the dataset (e.g. 25 images)
2) forward propagate the batch through the network to calculate the loss
3) backpropagate to calculate the gradients
4) update the parameters using the gradients

58
Q

How does batchNorm normalise data:

A

Transforms the data to have a mean of 0 and a standard deviation of 1

59
Q

What is a mini-batch:

A

Used when the full dataset/batch is too large to forward through the network in one iteration, so instead we forward a smaller sample of the data (a mini-batch) per iteration.

60
Q

If the training data has 32,000 images, and batch size is set to 32, how many mini batches will there be?

A

1000 mini-batches

61
Q

Describe the softmax loss function:

Give the formula:

A
  • normalises the raw scores to the range [0,1], as we want a prediction for each category
  • all predictions add up to 1

z_j is the logit/raw score for class j
e^(z_j) is the numerator
K is the number of categories
the sum of e^(z_k) over all K classes is the denominator

softmax for category j = e^(z_j) / sum over k of e^(z_k)

the labelled (predicted) category is the one with the max softmax value
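A NumPy sketch of the softmax formula (the logits are arbitrary example values):

```python
import numpy as np

def softmax(z):
    """Convert raw logits z into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())               # approx [0.66 0.24 0.10], sums to 1
print(np.argmax(p))             # index of the predicted (max) category
```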

62
Q

What does softmax loss do?

A

It converts raw logits (output scores from a NN) into probabilities.

63
Q

What does cross entropy loss measure?

A

It measures how well the predicted probability distribution aligns with the true labels.

64
Q

When is the cross entropy loss function applied?

A

After the softmax loss function

65
Q

What is the cross entropy loss formula?

A

loss = -(1/n) * sum over the n samples of [ sum over categories of (ground truth * log(softmax prediction)) ]

where n is the number of samples

for 1 sample with ground truth [1 0 0] and prediction [0.66 0.24 0.1]:
-(1/1) * (1*log(0.66) + 0*log(0.24) + 0*log(0.1)) = -log(0.66) ≈ 0.416
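A NumPy sketch of this formula using the same example (a single sample, n = 1):

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """Cross-entropy for one-hot ground truth and softmax predictions."""
    y_true = np.atleast_2d(y_true)
    y_pred = np.atleast_2d(y_pred)
    n = y_true.shape[0]                          # number of samples
    return -np.sum(y_true * np.log(y_pred)) / n

print(cross_entropy([1, 0, 0], [0.66, 0.24, 0.10]))   # -log(0.66) ≈ 0.416
```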

66
Q

What is the formula for the MSE mean squared error loss function:

A

MSE = (1/n) * sum of (actual - prediction)^2

67
Q

What does a model optimiser do?

A

Adjusts the model's parameters during training to minimise the loss.

68
Q

What does the learning rate parameter do?

A
  • the weights are updated by multiplying the learning rate by the gradient (calculated using the chain rule)
  • it determines how fast or slow we move towards the optimal weights.
69
Q

What is the difference between calculating gradient descent, stochastic gradient descent and mini-batch gradient descent:

A
  • gradient descent: pass the whole dataset through the model, calculate the loss and use it to compute the gradient for a single update
  • stochastic gradient descent: pass 1 sample through, calculate its loss and use it to compute the gradient
  • mini-batch gradient descent: pass a mini-batch through, calculate its loss and use it to compute the gradient (see the sketch below)
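A toy NumPy sketch of the mini-batch case on a one-parameter linear model (the data, learning rate and batch size are made up); the other two variants correspond to batch_size = len(X) and batch_size = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 1))
y = 3 * X + 0.01 * rng.standard_normal((1000, 1))   # toy data: y ≈ 3x

w, lr, batch_size = 0.0, 0.1, 32

for epoch in range(5):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):             # one mini-batch per update
        batch = idx[start:start + batch_size]
        pred = w * X[batch]
        grad = -2 * np.mean((y[batch] - pred) * X[batch])  # d(MSE)/dw on the batch
        w -= lr * grad                                     # update after every mini-batch

print(w)   # approaches 3
```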
70
Q

What will a small learning rate result in?

A
  • updates to the weights are small, so training may take too long to converge or get stuck at a local minimum
71
Q

What will a large learning rate result in?

A
  • the model learns fast, but may oscillate around or jump over minima; it might also cause the weights to overflow or fail to converge
72
Q

What does a good learning rate look like:

A

A tradeoff between convergence rate and overshooting:
- not too small, so that the algorithm can converge swiftly
- not too large, so that the algorithm doesn't jump back and forth without ever reaching a good minimum.

73
Q

What is learning rate decay?

A
  • reduce the learning rate over time
  • it decreases as the number of epochs increases
  • this can be done with multi-step LR decay or exponential decay (see the sketch below)
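A PyTorch-style sketch of multi-step decay (the model, milestones and decay factor are arbitrary; ExponentialLR could be used instead):

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# multiply the learning rate by 0.1 at epochs 30 and 60
scheduler = MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    # ... one epoch of training would go here ...
    scheduler.step()    # decay the learning rate as the epoch count grows

print(optimizer.param_groups[0]["lr"])   # 0.1 -> 0.01 -> 0.001
```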