Deep Learning Basics Flashcards

1
Q
  1. What happens if you apply a 5x5 filter to a 7x7 image with no padding and stride = 1?
  2. What happens if you apply a 5x5 filter to a 7x7 image with no padding and stride = 2?
  3. What happens if you apply a 5x5 filter to a 7x7 image with no padding and stride = 3?
A
  1. You get a 3x3 output
  2. You get a 2x2 output
  3. Not possible: (7 - 5)/3 is not an integer, so the filter would go outside the image
2
Q

What is stride?

A

Stride is the number of pixels we move the filter by at each step.
Stride = 1 means the filter moves 1 pixel at a time.

3
Q

What is the formula for the output size after applying a filter to an image?

What is the size of the following:
1. 7x7 image, filter = 3x3, stride = 1
2. 7x7 image, filter = 3x3, stride = 2
3. 7x7 image, filter = 3x3, stride = 3

A

((N-F)/Stride) + 1

stride 1 => ((7 - 3) / 1) + 1 = 5
stride 2 => ((7 - 3) / 2) + 1 = 3
stride 3 => ((7 - 3) / 3) + 1 = 2.33, which is not an integer, so this stride is not recommended (the filter does not fit the image evenly).
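A minimal Python sketch of this formula (the helper name conv_output_size is just illustrative; the padding argument anticipates the later cards):

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output width/height for an n x n input and an f x f filter."""
    size = (n - f + 2 * padding) / stride + 1
    if not float(size).is_integer():
        raise ValueError(f"stride {stride} does not fit evenly: output would be {size:.2f}")
    return int(size)

print(conv_output_size(7, 3, stride=1))  # 5
print(conv_output_size(7, 3, stride=2))  # 3
# conv_output_size(7, 3, stride=3) raises: output would be 2.33
```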

4
Q

What is the motivation for padding?

A

To obtain an output size that is the same as the input image size.

5
Q

What formula do we use to calculate the padding size?

How much should we pad for filter size:
- 3
- 5
- 7

A

(F-1)/2

(3-1)/2 = 1 (padding)
(5-1)/2 = 2 (padding)
(7-1)/2 = 3 (padding)
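A quick Python sketch of this rule (assuming an odd filter size and stride 1):

```python
def same_padding(f):
    """Zero padding that keeps the output the same size as the input
    (odd filter size f, stride 1)."""
    return (f - 1) // 2

for f in (3, 5, 7):
    print(f, "->", same_padding(f))   # 3 -> 1, 5 -> 2, 7 -> 3
```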

6
Q

How do we deal with images that have a depth larger than 1, e.g. RGB images with depth 3?

A

We must use a filter with a depth that matches the input image depth. E.g. for an RGB image of depth 3, the filter must have depth 3.
  - At each position we perform the convolution on each channel and compute the dot product across the depth, adding the 3 per-channel values to merge them into 1 output value.

7
Q

What happens if the image depth and filter depth aren’t the same value?

A

We can’t calculate their dot product.

8
Q

We may want to use multiple filters, how does this affect the output size?

A

The number of filters decides the depth of the output; each filter produces one activation map, so the output depth equals the number of activation maps.

9
Q

Convolutional Neural Networks (CNN) size
In one convolution layer, we have 128 filters of 3x3x3 applied to input volume 128x128x3 with stride 2 and pad 1. What is the size of the output volume? Give details of how you calculate the size of the output volume.

A

((N - F + 2P) / stride) + 1

((128 - 3 + 2*1) / 2) + 1 = 64.5

round down/floor to 64

so output is 64x64x128
- number of filters is always the same as the number of output activation maps
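A short Python sketch that reproduces this calculation (the helper conv_output_volume is hypothetical, not a library function):

```python
import math

def conv_output_volume(w, h, num_filters, f, stride, pad):
    """Output volume (W2, H2, D2) of a conv layer, applying floor when the
    division is not exact."""
    w2 = math.floor((w - f + 2 * pad) / stride) + 1
    h2 = math.floor((h - f + 2 * pad) / stride) + 1
    return w2, h2, num_filters

# 128 filters of 3x3x3 applied to a 128x128x3 input, stride 2, pad 1
print(conv_output_volume(128, 128, num_filters=128, f=3, stride=2, pad=1))
# (64, 64, 128)
```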

10
Q

What are the formulas for calculating the output image's height, width and depth:

A

Input: a volume of size W1 x H1 x D1.
Four hyperparameters are required: number of filters K, filter size F, stride S, amount of zero padding P.

Next layer: a volume of size W2 x H2 x D2, where:
- W2 = (W1 - F + 2P) / S + 1
- H2 = (H1 - F + 2P) / S + 1
- D2 = K

When (W1 - F + 2P) / S or (H1 - F + 2P) / S is not an integer, take the floor:
- W2 = floor((W1 - F + 2P) / S) + 1
- H2 = floor((H1 - F + 2P) / S) + 1
- D2 = K

11
Q

Explain the role of the pooling layer in CNN:

A
  • performs down-sampling making image representations smaller and more manageable, by aggregating information, reducing computation costs and memory usage
  • e.g. given an input image of 200x200, it can make it 100x100.
  • it operates over each activation map independently. The width and height get smaller but the depth remains the same
12
Q

List two methods of down-sampling/pooling layers:

A

Max pooling: given a subregion, take the max value, then move the window by the stride to the next region. max(2, 9, 4, 5) = 9

Average pooling: given a subregion, take the average of the pixel values, then move the window by the stride to the next region. avg(2, 9, 4, 5) = 20/4 = 5
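A naive NumPy sketch of both operations on a single-channel image (the pool2d helper is illustrative only):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Naive 2D pooling over one channel, no padding."""
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            region = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

x = np.array([[2., 9., 1., 3.],
              [4., 5., 6., 2.],
              [7., 0., 8., 1.],
              [2., 3., 4., 5.]])
print(pool2d(x, mode="max"))   # [[9. 6.] [7. 8.]]
print(pool2d(x, mode="avg"))   # [[5. 3.] [3. 4.5]]
```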

13
Q

What is a drawback of max and average pooling:

A
  • They may remove important information or whole features from an image when down-sampling.
14
Q

Calculate the output size of the pooling layer:

Give the output size for this question:
image = 128x128x3, Filter/pooling size = 2x2, stride = 2, no padding

A

image: W x H x D, filter F, stride S
- depth is the same
- W = floor((W - F)/S) + 1
- H = floor((H - F)/S) + 1

D = 3
W = ((128-2)/2)+1 = 64
H = ((128-2)/2)+1 = 64
output = 64x64x3

15
Q

Explain the role of the fully connected layer in CNN:

A
  • connects all neurons, so every input of the input vector influences every output of the output vector.
  • maps learned high level features to the final output
16
Q

Explain the process of the fully connected layer:

What is the output size of a fully connected layer with input image 32x32x3, 10 categories.

How many parameters would there be:

A

step 1) flatten the input volume (all channels) into a 1D vector
step 2) connect each input node to each output node
step 3) y = Wx (ignoring bias)

  • the 32x32x3 image becomes a 3072x1 vector
  • the weight matrix is 10x3072; matrix multiplication gives a 10x1 output

there would be 30720 parameters without bias. (30730 with bias)
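A NumPy sketch of these steps for the 32x32x3 example (random values stand in for real image data and learned weights):

```python
import numpy as np

x = np.random.rand(32, 32, 3)      # input volume
x_flat = x.reshape(-1, 1)          # step 1: flatten -> 3072 x 1
W = np.random.rand(10, 3072)       # step 2: one row of weights per category
b = np.random.rand(10, 1)

y = W @ x_flat + b                 # step 3: 10 x 1 output of class scores
print(y.shape)                     # (10, 1)
print(W.size, W.size + b.size)     # 30720 parameters, 30730 with bias
```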

17
Q

What is the role of the activation function in CNN:

A
  • it introduces non-linearity to the network, enabling it to learn more complex relationships/patterns in the data.
  • some activation functions (e.g. ReLU) can mitigate vanishing gradients
18
Q

What is the role of the BatchNorm layer in CNN?

A
  • It normalises the output of each layer in the network
  • Ensures outputs of a layer have mean = 0 and standard deviation = 1 which improves training stability
  • Improves convergence speed. Can use random initialisation and a big learning rate
19
Q

When is the BatchNorm layer applied:

A

After the convolution layer but before the activation function
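For illustration, a typical PyTorch-style block with this ordering (layer sizes are arbitrary):

```python
import torch.nn as nn

# convolution -> BatchNorm -> activation
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),   # normalises each of the 64 output channels
    nn.ReLU(),
)
```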

20
Q

What happens if we don’t perform batchNorm:

A

The distribution of each layer's activations is uncontrolled: it is unknown and varies from iteration to iteration, which makes training harder.

21
Q

Why do we forward batches through the model instead of single images:

A

Gradients and statistics computed over a batch are less noisy than those from a single image, which creates a more stable network.

22
Q

Why does batchNorm have two learnable parameters?

A
  • γ (scale) and β (shift) for each channel
    - The parameters γ and β allow the network to learn the optimal scale and shift for each feature map.
23
Q

Why is the training process more stable using batchNorm?

A

Because a moving average of the mean and standard deviation is used to update them, so the normalisation statistics change smoothly between iterations.

24
Q

What are the differences between a standard network and a network that uses batchNorm?

A
  • a standard network has to initialise its parameters beforehand and needs a carefully designed initialisation strategy
  • a BatchNorm network can use random initialisation
  • a standard network sometimes can't use a big learning rate and converges more slowly
  • a BatchNorm network can use a big learning rate, so it converges quicker
  • a BatchNorm network contains extra (BatchNorm) layers that a standard network does not
25
Q

What effect does batchNorm have on the number of epochs required to train a deep network?

A
  • it dramatically reduces the number of epochs, as it can use a bigger learning rate and therefore converge faster
26
Q

What is the role of the dropout layer in CNN:

A
  • Dropout is used to overcome overfitting to the training data and improve generalisation of the model
27
Q

When might overfitting take place?

A

If there’s a relatively large number of parameters with an insufficient number of training samples, overfitting often occurs

28
Q

How is the dropout operation/layer performed?

A
  • each node in the layer is dropped with a probability equal to the dropout rate
  • if we set a dropout rate of 30%, then roughly 30% of the nodes in that layer will be dropped
  • if a node is dropped, all of its connections to the higher layers are disconnected for that pass (see the sketch below)
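A rough NumPy sketch of one common variant ("inverted" dropout, where each node is dropped independently with probability equal to the rate):

```python
import numpy as np

def dropout(activations, rate=0.3, training=True):
    """Drop each node with probability `rate` during training; at test
    time the input is returned unchanged."""
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) >= rate
    return activations * mask / (1.0 - rate)   # rescale to keep the expected value

a = np.ones((1, 8))
print(dropout(a, rate=0.3))   # roughly 30% of entries zeroed, the rest scaled up
```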
29
Q

Why may we only perform dropout in the later layers of our neural network:

A
  • early dropout may disrupt the gradient
  • perform dropout only on deeper layers to preserve feature extraction in the early layers
  • deeper layers are more prone to overfitting, so it is more effective to perform dropout there
30
Q

What is the role of the ReLU layer?

A
  • an activation function that improves the non-linear representation ability of the network.
31
Q

What is the role of the deconvolutional layer?

A
  • aka transposed convolution
  • used to upsample feature maps to a bigger size
  • useful when you need the output to match the size of the input dimensions
  • used for semantic segmentation and by GANs for image generation.
32
Q

What does regularisation achieve?

A
  • prevents overfitting
  • improves model generalisation
  • stabilises the training process
  • reduces model complexity
33
Q

What can happen if your dataset is too small?

A
  • underfitting, due to bias
34
Q

What technique can be used if the dataset is too small?

A

N-Fold Cross Validation

35
Q

What is N-Fold Cross Validation?

A
  • split the data into n folds
  • train the model on n-1 folds and use the remaining fold for testing
  • each round a different fold is used as the test set, while the rest are used for training
  • after each round the model's performance is recorded
  • for n folds you get n performance scores
  • these are then averaged to get the final model evaluation score (see the sketch below)
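A minimal sketch using scikit-learn's KFold (the toy data and the placeholder "score" are only for illustration; a real model's fit/score calls would go where the comment is):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features each
y = np.arange(10) % 2

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # model.fit(X_train, y_train); score = model.score(X_test, y_test)
    score = 1.0                     # placeholder performance for this fold
    scores.append(score)

print(sum(scores) / len(scores))    # final evaluation: average over the n folds
```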
36
Q

What is the long tail problem?

And what is the solution to the long tail problem?

A
  • there are many categories, but the dataset has only a few examples of some categories, so it's hard to train the model accurately on them.
  • randomly splitting the dataset could mean the training set doesn't include even one instance of a category; all its instances would be in the validation set, so the model would have no ability to predict this category.
  • the solution is n-fold cross-validation, as this makes sure every image appears in the training and test sets at least once.
37
Q

What evaluation metrics can we use for a model performing binary classification?

A

Two categories, positive (1) and negative (0).

From the classifier's predictions we count:
TP - true positives: number of positive samples the classifier has correctly identified as positive
TN - true negatives: number of negative samples the classifier has correctly identified as negative (e.g. 'not lion')
FP - false positives: number of negative samples the classifier has incorrectly identified as positive
FN - false negatives: number of positive samples the classifier has incorrectly identified as negative

38
Q

Given the metrics used to evaluate a classifier's performance on binary classification, what are the formulas we can calculate with them?

A

Precision: TP / (TP + FP)

Recall: TP / (TP + FN)

F1: 2 * (precision * recall) / (precision + recall)
- combines precision and recall to give the final performance of the model

Accuracy: (TP + TN) / (TP + TN + FP + FN)
- correct predictions over the total dataset
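A small Python sketch of these formulas (the example counts are made up):

```python
def binary_metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# hypothetical counts: 8 TP, 80 TN, 2 FP, 10 FN
print(binary_metrics(tp=8, tn=80, fp=2, fn=10))
# (0.8, 0.444..., 0.571..., 0.88)
```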

39
Q

What qualities lead to underfitting, fitting well and overfitting:

A

underfitting: high bias
fit well: low bias, low variance
overfitting: high variance

40
Q

What can we do to fix underfitting:

A
  • train the network with more epochs
41
Q

What can we do to fix overfitting:

A
  • data augmentation
  • add regularisation layers
42
Q

What are L1 and L2 regularisation methods?

A
  • used to prevent overfitting
  • they add a penalty term to the loss function that encourages smaller weights
43
Q

What is the role of the convolution layer in CNN?

A
  • It extracts local features by applying convolutional filters/kernels to the input to produce output feature maps.
44
Q

What are the differences between the training, validation and test datasets?

How is a dataset split into training, validation and testing?

A
  • Training set: Training models
  • Validation set: Test the model trained on the training set and select the best-performing model
  • Testing set: Test the model
  • Training set vs validation set: 80% vs 20% or 90% vs 10%
  • Training vs validation vs testing set: 60%-20%-20%
  • No standard rule, if a large amount of data, e.g., 1,000,000. It is possible to have 98%-1%-1%
  • Sometimes we only get the training & validation set (in some competitions the test set isn’t available to us)
45
Q

How is L1 regularisation calculated?

A

loss = squared error loss + penalty term
loss = sum of (actual - input*weight)^2 + λ * sum of |weights|   (λ is the regularisation strength)

46
Q

How is L2 regularisation calculated?

A

loss = squared error loss + penalty term
loss = sum of (actual - input*weight)^2 + λ * sum of (weights^2)   (λ is the regularisation strength)

47
Q

What is the difference between L1 and L2 regularisation loss:

A

L1: the penalty term added is proportional to the absolute values of the weights
L2: the penalty term added is proportional to the square of the weights

L1 loss encourages sparsity in the weights
L2 loss discourages large weights
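A NumPy sketch of both penalties added to a squared-error loss (the regularisation strength lam is an assumed hyperparameter):

```python
import numpy as np

def regularised_loss(y_true, y_pred, weights, lam=0.01, kind="l2"):
    """Squared-error loss plus an L1 or L2 penalty on the weights."""
    data_loss = np.sum((y_true - y_pred) ** 2)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(weights))   # encourages sparse weights
    else:
        penalty = lam * np.sum(weights ** 2)      # discourages large weights
    return data_loss + penalty

w = np.array([0.5, -2.0, 0.0])
print(regularised_loss(np.array([1.0]), np.array([0.8]), w, kind="l1"))
print(regularised_loss(np.array([1.0]), np.array([0.8]), w, kind="l2"))
```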

48
Q

What are some regularisation techniques?

A
  • L1 and L2 regularisation
  • Dropout
  • early stopping
  • N-fold cross-validation
49
Q

Why do dropout layers help with regularisation?

A

They ensure the model doesn't put too much weight on a single node, as any node may be dropped during training.

50
Q

When training a deep network, what could be the reason why the loss doesn’t decrease in a few epochs?

A
  • The learning rate is too small
  • The model is overfitted
51
Q

How is early stopping used as a regularisation strategy:

A

Early stopping stops the training process when the validation loss stops decreasing, before the model starts to overfit.

52
Q

What is an epoch?

A

One complete pass of the entire training dataset forwards and backwards through the neural network.

53
Q

Calculate the epoch size:
Total samples: 1000
Batch size: 50

A

1000/50 = 20
epoch size = 20 batches or iterations

54
Q

What are the benefits of data augmentation

A
  • Prevents models from overfitting
  • used when the initial training set is too small
  • improves model accuracy by improving model generalisation
55
Q

Give some examples of data augmentation:

A
  • image flip
  • image crop
  • image rotation
  • CutMix
56
Q

Describe the network training pipeline:

A

1) given a training dataset of images and their ground truth labels, design a neural network, e.g. a CNN
2) perform the forward passes, which extracts the image features and produces predictions for each of the categories.
3) choose the max value of the predictions as the category of the image
4) use the ground truth to calculate the loss and then perform backpropagation to update the parameters
5) train the model, iteration by iteration

57
Q

Describe one iteration of the training process:

A

1) sample a batch of the dataset (e.g. 25 images)
2) forward propagate the batch through the network to calculate the loss
3) backpropagate to calculate the gradients
4) update the parameters using the gradients

58
Q

How does batchNorm normalise data:

A

Transforms the data to have a mean of 0 and a standard deviation of 1

59
Q

What is a mini-batch:

A

Used when the full dataset/batch is too large to forward through the network in one iteration, so instead we forward a smaller sample of the data (a mini-batch) per iteration.

60
Q

If the training data has 32,000 images, and batch size is set to 32, how many mini batches will there be?

A

1000 mini-batches

61
Q

Describe the softmax loss function:

Give the formula:

A
  • normalises the raw scores to the range [0,1], as we want a prediction for each category
  • all predictions add up to 1

z_j is the logit/raw score for class j
e^(z_j) is the numerator
K is the number of categories
the sum of e^(z_k) over all K classes is the denominator

softmax for category j = e^(z_j) / sum over k of e^(z_k)

the labelled (predicted) category is the one with the max softmax value
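A NumPy sketch of the softmax formula (the logits are arbitrary example values):

```python
import numpy as np

def softmax(z):
    """Convert raw logits z into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())               # approx [0.66 0.24 0.10], sums to 1
print(np.argmax(p))             # index of the predicted (max) category
```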

62
Q

What does softmax loss do?

A

It converts raw logits (output scores from a NN) into probabilities.

63
Q

What does cross entropy loss measure?

A

It measures how well the predicted probability distribution aligns with the true labels.

64
Q

When is the cross entropy loss function applied?

A

After the softmax loss function

65
Q

What is the cross entropy loss formula?

A

loss = -(1/n) * sum over the n samples of [ sum over categories of (ground truth * log(softmax prediction)) ]

where n is the number of samples

for 1 sample with ground truth [1 0 0] and prediction [0.66 0.24 0.1]:
-(1/1) * (1*log(0.66) + 0*log(0.24) + 0*log(0.1)) = -log(0.66) ≈ 0.416
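A NumPy sketch of this formula using the same example (a single sample, n = 1):

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """Cross-entropy for one-hot ground truth and softmax predictions."""
    y_true = np.atleast_2d(y_true)
    y_pred = np.atleast_2d(y_pred)
    n = y_true.shape[0]                          # number of samples
    return -np.sum(y_true * np.log(y_pred)) / n

print(cross_entropy([1, 0, 0], [0.66, 0.24, 0.10]))   # -log(0.66) ≈ 0.416
```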

66
Q

What is the formula for the MSE mean squared error loss function:

A

MSE = (1/n) * sum of (actual - prediction)^2

67
Q

What does a model optimiser do?

A

Adjusts the model's parameters during training to minimise the loss.

68
Q

What does the learning rate parameter do?

A
  • the weights are updated by multiplying the learning rate by the gradient (calculated using the chain rule)
  • it determines how fast or slow we move towards the optimal weights.
69
Q

What is the difference between calculating gradient descent, stochastic gradient descent and mini-batch gradient descent:

A
  • gradient descent: pass the whole dataset through the model, calculate the loss and use it to compute the gradient for a single update
  • stochastic gradient descent: pass 1 sample through, calculate its loss and use it to compute the gradient
  • mini-batch gradient descent: pass a mini-batch through, calculate its loss and use it to compute the gradient (see the sketch below)
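A toy NumPy sketch of the mini-batch case on a one-parameter linear model (the data, learning rate and batch size are made up); the other two variants correspond to batch_size = len(X) and batch_size = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 1))
y = 3 * X + 0.01 * rng.standard_normal((1000, 1))   # toy data: y ≈ 3x

w, lr, batch_size = 0.0, 0.1, 32

for epoch in range(5):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):             # one mini-batch per update
        batch = idx[start:start + batch_size]
        pred = w * X[batch]
        grad = -2 * np.mean((y[batch] - pred) * X[batch])  # d(MSE)/dw on the batch
        w -= lr * grad                                     # update after every mini-batch

print(w)   # approaches 3
```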
70
Q

What will a small learning rate result in?

A
  • updates to the weights are small, so training may take too long to converge or get stuck at a local minimum
71
Q

What will a large learning rate result in?

A
  • the model learns fast, but may oscillate around or jump over minima; it might also cause the weights to overflow or fail to converge
72
Q

What does a good learning rate look like:

A

A tradeoff between convergence rate and overshooting:
- not too small, so that the algorithm can converge swiftly
- not too large, so that the algorithm doesn't jump back and forth without ever reaching a good minimum.

73
Q

What is learning rate decay?

A
  • reduce the learning rate over time
  • it decreases as the number of epochs increases
  • this can be done with multi-step LR decay or exponential decay (see the sketch below)
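A PyTorch-style sketch of multi-step decay (the model, milestones and decay factor are arbitrary; ExponentialLR could be used instead):

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# multiply the learning rate by 0.1 at epochs 30 and 60
scheduler = MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    # ... one epoch of training would go here ...
    scheduler.step()    # decay the learning rate as the epoch count grows

print(optimizer.param_groups[0]["lr"])   # 0.1 -> 0.01 -> 0.001
```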