Deep Learning Flashcards
What does ReLU stand for? And what does it mean?
Rectified Linear Unit
The rectified linear activation function or ReLU for short is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero.
What type of neural network architecture is used for eg. House price prediction or advertisement clicking probability?
Standard neural network architecture
What type of neural network architecture is used for image recognition?
CNN (convolutional neural network)
For sequence data eg. Audio over time, what type of neural network architecture do we use?
Recurrent Neural Network
What is the vanishing gradient problem?
Parameter optimisation uses gradient descent to find the best parameters. The vanishing gradient problem occurs when the gradient becomes exponentially small, so the update to the parameter we are trying to learn becomes insignificant. The implication is that the model never converges to the optimum, or takes much longer to train.
Explain gradient descent
Gradient descent is a method of updating the coefficients a0 and a1 to minimize the cost function (e.g. MSE). A regression model uses gradient descent by starting from random coefficient values and then iteratively updating them to reach the minimum of the cost function:
1) Start with random coefficients
2) Calculate predicted values
3) Calculate the partial derivatives of the cost w.r.t. a0 and a1, substituting in the predicted values
4) Multiply each derivative by the learning rate and subtract it from the corresponding coefficient
5) Stop after a set number of iterations (e.g. 100) or once the error is low
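A minimal NumPy sketch of these steps for simple linear regression (the function name, learning rate and iteration count are illustrative):
import numpy as np

def gradient_descent(x, y, lr=0.01, n_iters=100):
    a0, a1 = np.random.randn(), np.random.randn()   # 1) start with random coefficients
    m = len(x)
    for _ in range(n_iters):                        # 5) stop after a set number of iterations
        y_pred = a0 + a1 * x                        # 2) calculate predicted values
        d_a0 = (2 / m) * np.sum(y_pred - y)         # 3) partial derivatives of MSE w.r.t. a0 and a1
        d_a1 = (2 / m) * np.sum((y_pred - y) * x)
        a0 -= lr * d_a0                             # 4) multiply by learning rate and subtract
        a1 -= lr * d_a1
    return a0, a1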
What activation should you use for the output layer of a binary classification problem, and why?
Sigmoid, because you want to limit the output value to the range 0 to 1
PS: for the sigmoid function, when x = 0, y = 0.5
Name other activation functions more suitable for hidden layers
Tanh: similar to the sigmoid function but limits the output range to -1 to 1; when x = 0, y = 0
ReLU (default choice when you don't know what to use, because training is faster than with tanh or sigmoid due to the lack of vanishing gradients): a = max(0, z). Tanh and sigmoid have vanishing gradient problems at the tails
Leaky ReLU: a = max(0.01z, z). When z is negative, instead of the slope being zero there is a small slope. The constant 0.01 can be another learned parameter
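For reference, these activations in plain NumPy (the 0.01 leak factor is the conventional default):
import numpy as np

def sigmoid(z):    return 1 / (1 + np.exp(-z))      # output in (0, 1); sigmoid(0) = 0.5
def tanh(z):       return np.tanh(z)                # output in (-1, 1); tanh(0) = 0
def relu(z):       return np.maximum(0, z)          # a = max(0, z)
def leaky_relu(z): return np.maximum(0.01 * z, z)   # small slope for negative z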
What type of activation function should you use for a regression problem where the output is non-negative
ReLU
Why is there a need for non-linear activation functions
For networks with any number of hidden layers, if you use linear activation functions in every layer, the composition of linear functions is itself linear, so the whole network collapses to a single linear function of the input, no matter how many layers you stack.
What are the two phases in neural network?
During forward propagation, the input is fed into the neural network, and the network calculates the output. During backward propagation, the error between the predicted output and the actual output is calculated, and the weights and biases of each neuron are adjusted to reduce the error.
What are the common steps for pre-processing a new dataset
- Figure out dimensions and shapes of problem (m_train, m_test, num_px)
- Reshape the dataset such that each example is a flattened column vector
- Standardise the data
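A sketch of these steps, assuming an image dataset train_x_orig of shape (m_train, num_px, num_px, 3) as in course-style exercises (the variable names are placeholders):
m_train = train_x_orig.shape[0]

# reshape so each example becomes one flattened column vector of length num_px * num_px * 3
train_x_flat = train_x_orig.reshape(m_train, -1).T   # shape (num_px * num_px * 3, m_train)

# standardise pixel values to the [0, 1] range
train_x = train_x_flat / 255.0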
In a neural network with 1 hidden layer with 3 nodes and 4 input features what is the shape of the weights matrix in layer 1
(3,4)
The number of rows in W is the number of neurons in that layer, and the number of columns is the number of inputs to the layer
How to build a 2 layer neural network?
1) Initial parameters:
Weights matrix 1 with shape (size of hidden layer, size of input layer) | all random weights
Bias vector 1 with shape (size of hidden layer, 1) | all zero
Weights matrix 2 with shape (size of output layer, size of hidden layer) | all random
Bias vector 2 with shape (size of output layer, 1) | all zeros
2) Forward propagation:
Z = np.dot(W, A) + b
Where A is the activation from the previous layer (or the input data)
b is the bias vector
W is the weight matrix
3) Calculate the activation for the layer by applying the activation function to z
g(z)
4) Compute the cost function
- if it’s a regression problem, cost functions are MAE, MSE, RMSE
- If it is a classification problem, cost functions are cross entropy loss
5) Backward propagation
- compute the derivative of cost function with respect to AL ( probability vector)
- Use dAL to calculate the derivative of the cost function with respect to z
- Use dZ to calculate the derivatives of the cost function with respect to W and b
6) Update the parameters
- The new W and b are obtained by subtracting the learning rate multiplied by the gradients computed in backward propagation
7) Repeat steps 2 to 6 for a set number of iterations like 1000 times or until the cost is at a satisfactory level
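A compact NumPy sketch of steps 1-6 for a binary classifier, assuming inputs X of shape (n_x, m) and labels Y of shape (1, m); the layer sizes and learning rate are illustrative:
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))
def relu(z):    return np.maximum(0, z)

n_x, n_h, n_y, lr = 4, 3, 1, 0.01
W1, b1 = np.random.randn(n_h, n_x) * 0.01, np.zeros((n_h, 1))    # 1) initialise parameters
W2, b2 = np.random.randn(n_y, n_h) * 0.01, np.zeros((n_y, 1))
m = X.shape[1]

for i in range(1000):                                            # 7) repeat for a set number of iterations
    Z1 = np.dot(W1, X) + b1;  A1 = relu(Z1)                      # 2)-3) forward propagation
    Z2 = np.dot(W2, A1) + b2; A2 = sigmoid(Z2)
    cost = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))   # 4) cross-entropy cost
    dZ2 = A2 - Y                                                 # 5) backward propagation
    dW2 = np.dot(dZ2, A1.T) / m; db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (Z1 > 0)                           # (Z1 > 0) is the ReLU derivative
    dW1 = np.dot(dZ1, X.T) / m;  db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    W1 -= lr * dW1; b1 -= lr * db1                               # 6) update parameters
    W2 -= lr * dW2; b2 -= lr * db2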
How does l2 regularisation work in neural networks
If lambda, the regularisation parameter, is large, then the weights W will be relatively small, because large weights are penalised in the cost function. Since z is a function of W, if W is very small then z also stays in a small range around zero, and in that range the activation g(z) is roughly linear. So every layer behaves roughly linearly, and a deep network made of linear layers can only compute a linear function. It therefore can't fit very complicated, highly non-linear decision boundaries, which is what prevents it from overfitting.
How does dropout regularisation work in NN
- each layer has a probability of keeping a node
- Keep_prob is lower for bigger weight matrices (e.g. large hidden layers) to increase dropout, and higher for layers with fewer nodes
- Intuition: the network can't rely on any one feature, so it has to spread out the weights
- With dropout the cost function is no longer well defined, so turn dropout off to check that the cost is decreasing (to make sure the model is working), then turn it back on
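A sketch of inverted dropout for one layer's activation matrix A (the keep probability is illustrative):
import numpy as np

keep_prob = 0.8                               # lower keep_prob => more dropout
D = np.random.rand(*A.shape) < keep_prob      # random mask of nodes to keep
A = A * D                                     # shut off the dropped nodes
A = A / keep_prob                             # "inverted" scaling keeps the expected value of A unchanged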
What are the ways to speed up mini batch gradient descent and explain how
1) Momentum gradient descent
The problem with mini-batch gradient descent is that it may take many steps to reach the minimum due to the noise of the batches, causing the cost to oscillate before reaching the minimum.
Momentum gradient descent reduces the number of steps taken by smoothing out the movement using an exponentially weighted average.
The smoothing constant, beta, is a hyperparameter, usually 0.9, which roughly averages over the last 10 iterations.
2) RMSprop
- known as root mean square prop
- The vertical axis represents b and the horizontal axis represents w.
- The aim is to slow down learning in the vertical direction (i.e. b) and speed up learning in the horizontal direction (i.e. w).
- S_dW = beta * S_dW + (1 - beta) * dW^2 (element-wise square)
- W = W - alpha * dW / (sqrt(S_dW) + epsilon)
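Both update rules as small helper functions (assuming v_dW and s_dW start at zero; the beta values are the usual defaults):
import numpy as np

def momentum_step(W, dW, v_dW, alpha=0.01, beta=0.9):
    v_dW = beta * v_dW + (1 - beta) * dW             # exponentially weighted average of the gradients
    return W - alpha * v_dW, v_dW

def rmsprop_step(W, dW, s_dW, alpha=0.01, beta=0.999, eps=1e-8):
    s_dW = beta * s_dW + (1 - beta) * dW ** 2        # average of element-wise squared gradients
    return W - alpha * dW / (np.sqrt(s_dW) + eps), s_dW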
Why the need for learning rate decay?
At the start of learning you can afford to take bigger steps. But if the learning rate stays large when nearing the minimum, the algorithm might wander around and never settle at the minimum due to the large, noisy steps. Learning rate decay makes the learning rate, and hence the steps, smaller towards the end of training to better find the minimum.
What is one epoch
One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters. So if you have 500 mini-batches of 100 samples, the number of iterations is 500, i.e. the parameters have been updated 500 times and all 50,000 samples have been seen by the model.
How is the learning rate alpha updated in LR decay
It is updated after each epoch:
alpha = alpha_0 / (1 + decay_rate * epoch_num)
Other methods
Exponential decay: alpha = 0.95^epoch_num * alpha_0
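Both schedules as one-liners (alpha0 is the initial learning rate; decay_rate and the 0.95 base are illustrative hyperparameters):
alpha = alpha0 / (1 + decay_rate * epoch_num)   # standard learning rate decay
alpha = 0.95 ** epoch_num * alpha0              # exponential decay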
How to sample values on a log scale
1) determine upper and lower limit
2) transform the limits with log10
3) sample uniformly between the transformed limits (np.random.uniform, not np.randn) and use 10 to the power of the sample
How to sample values on a log scale when sampling for learning rate
1) determine upper and lower limit
2) a = log10(lower limit) | b = log10(upper limit)
3) r = uniform random sample between a and b
4) lr = 10^r
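For example, to sample a learning rate between 0.0001 and 1 on a log scale (the limits are illustrative):
import numpy as np

a, b = np.log10(0.0001), np.log10(1)   # a = -4, b = 0
r = np.random.uniform(a, b)            # sample the exponent uniformly
lr = 10 ** r                           # learning rate sampled on a log scale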
What is batch norm
We know that normalising inputs helps speed up training by transforming the optimisation problem from an elongated shape to a more circular one
Batch norm normalises the inputs to the next layer, e.g. a or z (where a = g(z) and z = Wa + b), to speed up training
Batch norm also helps with regularisation
Each mini batch is scaled by the mean/variance computed on just that mini-batch
This adds some noise to the values z[l] within that mini-batch, so similar to dropout it adds some noise to each hidden layer's activations, giving a slight regularisation effect
How to implement batchnorm
mu = (1/m) * sum(z_i)
variance = (1/m) * sum((z_i - mu)^2)
z_norm_i = (z_i - mu) / sqrt(variance + epsilon)
But sometimes you don't want z_norm to have mean 0 and variance 1, so:
z_tilde_i = gamma * z_norm_i + beta
Where gamma and beta are trainable parameters
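A NumPy sketch of batch norm on the pre-activations Z (shape: units x batch size) of one layer for a single mini-batch; gamma and beta are assumed to be trainable parameters of matching shape:
import numpy as np

eps = 1e-8
mu = np.mean(Z, axis=1, keepdims=True)      # mean over the mini-batch
var = np.var(Z, axis=1, keepdims=True)      # variance over the mini-batch
Z_norm = (Z - mu) / np.sqrt(var + eps)      # mean 0, variance 1
Z_tilde = gamma * Z_norm + beta             # learned scale and shift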
What activation function do you use for the output layer of a multi-class classification problem? Explain how it works
Softmax. It exponentiates each logit of the output layer and divides by the sum of all the exponentials, so the outputs lie between 0 and 1, sum to 1, and can be interpreted as class probabilities.
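A minimal sketch:
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return t / np.sum(t)

print(softmax(np.array([2.0, 1.0, 0.1])))   # ≈ [0.66, 0.24, 0.10], sums to 1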
CNN
A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks that specializes in processing data that has a grid-like topology, such as an image. A digital image is a binary representation of visual data. It contains a series of pixels arranged in a grid-like fashion that contains pixel values to denote how bright and what color each pixel should be.
A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully connected layer.
The convolution layer performs a dot product between two matrices, where one matrix is the set of learnable parameters otherwise known as a kernel, and the other matrix is the restricted portion of the receptive field.
During the forward pass, the kernel slides across the height and width of the image-producing the image representation of that receptive region.
The pooling layer replaces the output of the network at certain locations by deriving a summary statistic of the nearby outputs. This helps in reducing the spatial size of the representation, which decreases the required amount of computation and weights. Default is max pooling
The fully connected layer helps to map the representation between the input and the output.
Other regularisation techniques
- Data augmentation: e.g. for images you can flip or crop images to increase the dataset
- Early stopping: plot cost against iterations for both the CV dataset and the train dataset, and stop training when the CV cost starts to increase
Why normalise data
Easier to converge to minima when using gradient descent
Minibatch or batch gradient descent for larger training sets? Why?
For training on a large dataset, mini-batch gradient descent runs much faster than batch gradient descent
With batch gradient descent, a single pass through the training set allows you to take only one gradient descent step. With mini-batch gradient descent, a single pass through the training set (one epoch) allows you to take as many gradient descent steps as there are mini-batches, e.g. 5,000. You still usually want to take multiple passes through the training set, with an outer loop over epochs.
Diff between gradient descent and stochastic gradient descent
- Because gradient descent uses the whole batch for training, the algorithm takes nice large steps towards the minimum. Because SGD uses each training sample as its own mini-batch, the descent towards the minimum is noisier, as each sample has different quality, so it won't land exactly at the minimum but will wander around near it
What is the con of SGD
You lose the speed-up from vectorisation
Mini batch size
- if less than 2000 use batch gradient descent
- Else 64, 128, 256 or 512 (powers of 2, due to how computer memory is configured)
- Also depends on CPU and GPU memory
- Can try different values and see which is most efficient
Pros of minibatch
- make progress before processing the whole dataset
What is RMS prop
- known as root mean square prop
- The vertical axis represents b and the horizontal axis represents w.
- The aim is to slow down learning in the vertical direction (i.e. b) and speed up learning in the horizontal direction (i.e. w).
- S_dW = beta * S_dW + (1 - beta) * dW^2 (element-wise square)
- W = W - alpha * dW / (sqrt(S_dW) + epsilon)
What is Adam optimisation
- adaptive moment estimation
- combining momentum with RMSProp
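A sketch of a single Adam step, combining the two running averages above (default hyperparameters shown; t is the iteration count starting at 1, used for bias correction):
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dW            # momentum term
    s = beta2 * s + (1 - beta2) * dW ** 2       # RMSprop term
    v_hat = v / (1 - beta1 ** t)                # bias correction
    s_hat = s / (1 - beta2 ** t)
    return W - alpha * v_hat / (np.sqrt(s_hat) + eps), v, s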
Tuning DL networks
List of things to tune
- number of layers
- number of nodes
- LR rate / LR decay rate
- Size of mini batch
- Dropout
Use random search rather than grid search to cover more of the search space, then narrow down the search area (coarse to fine)
Tuning importance:
Alpha / LR
Momentum term = ~0.8
Hidden units
Mini batch size
Number of layers
LR decay
never tune
Beta1, beta2 and epsilon
Why use transfer learning
Transfer learning saves training time and gives better performance without needing a lot of data.
How does transfer learning work
In computer vision, neural networks usually detect edges in the earlier layers, shapes in the middle layers, and task-specific features in the later layers. In transfer learning, the early and middle layers are reused and only the later layers are retrained. This leverages the labelled data of the task the network was initially trained on.
Common data augmentation techniques
Random crop, mirroring, colour shifting (adding numbers to the RGB values, e.g. using PCA colour augmentation)
Normal vs Depthwise convolution
- For normal convolutions, you slide each filter block across the image block; at each position you multiply all the numbers in the filter block with the corresponding numbers in the image block and sum them up. Computational cost = number of filter parameters * number of filter positions * number of filters.
- For depthwise convolution, the number of filters equals the number of image channels. Match each filter to a channel and slide it over only that channel, multiplying and summing as before. Computational cost = number of filter parameters * number of filter positions * number of channels. Then do a pointwise convolution (1x1xn_c with n_c' filters) to combine the channels.
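A worked comparison using the cost formulas above with illustrative numbers (6x6x3 input, 3x3 filters, 5 output channels, stride 1, no padding):
n, f, n_c, n_c_out = 6, 3, 3, 5   # input size, filter size, input channels, output channels
n_out = n - f + 1                 # output positions per dimension

normal = (f * f * n_c) * (n_out * n_out) * n_c_out       # filter params * positions * filters
depthwise = (f * f) * (n_out * n_out) * n_c              # one f x f filter per input channel
pointwise = (1 * 1 * n_c) * (n_out * n_out) * n_c_out    # 1x1xn_c convolution with n_c' filters
print(normal, depthwise + pointwise)                     # 2160 vs 672 multiplications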
Advantages of Mobile nets
- low computational cost at deployment
- Useful for mobile and embedded vision applications
Inception Network
Instead of choosing which conv / pooling layer to apply, just do them all (including a 1x1 convolution) in parallel, concatenate the resulting blocks together, and let the network learn what to use.
The drawback is computational cost, but with 1x1 convolutions you can shrink the number of channels before applying the larger convolutions.
Why does resnet work?
It works because:
a[l+2] = g(z[l+2] + a[l])
       = g(W[l+2] a[l+1] + b[l+2] + a[l])
If you use L2 regularisation, W[l+2] tends to shrink, and if W is 0 the expression reduces to g(a[l]), i.e. the identity. Hence it is easy for residual blocks to learn the identity function, so adding them doesn't hurt performance.
Purpose of 1x1 convolution
Purpose 1: shrink channels
- E.g. you have a 28x28x192 volume and you want to shrink it to 28x28x32: apply 32 filters of size 1x1x192
Purpose 2: add non-linearity
What are resnets?
ResNets, also known as residual networks, are built out of something called residual blocks, which allow you to train very, very deep networks (>100 layers)
ResNet works by adding residual connections to the network, which helps to maintain the information flow throughout the network and prevents the gradients from vanishing. The residual connection is a shortcut that allows the information to bypass one or more layers in the network and reach the output directly.
In theory, the more layers the lower the error, but in practice with plain networks the training error eventually increases, meaning the network has a harder time learning. With ResNets, adding more layers keeps lowering the error, which eventually flattens out and plateaus.
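A minimal Keras-style sketch of an identity residual block, assuming the input and output channel counts match (batch norm omitted for brevity; the filter count is up to you):
from tensorflow.keras import layers

def identity_residual_block(x, filters, kernel_size=3):
    shortcut = x                                               # the skip connection a[l]
    x = layers.Conv2D(filters, kernel_size, padding='same')(x)
    x = layers.Activation('relu')(x)
    x = layers.Conv2D(filters, kernel_size, padding='same')(x)
    x = layers.Add()([x, shortcut])                            # z[l+2] + a[l]
    return layers.Activation('relu')(x)                        # a[l+2] = g(z[l+2] + a[l])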
CNN why add padding
- For every convolution layer the image shrinks: the new size is n - f + 1. So if you don't want your image to shrink a lot, especially when building very deep networks, you can add padding. New size with padding = n + 2p - f + 1
- Also, if you don't add padding, a pixel in the corner contributes to the output only once, compared with a more central pixel, so you'd be throwing away a lot of information from the edges
How to build a tensorflow CNN?
A Sequential model stacking Conv2D, MaxPooling2D, Flatten and Dense layers, for example (n_filters, pixels, n_neurons and n_outputs are placeholders):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(n_filters, (3, 3), activation='relu', input_shape=(pixels, pixels, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),                    # shape of the pooling window
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(n_neurons, activation='relu'),
    tf.keras.layers.Dense(n_outputs, activation='softmax'),
])
Explain the architecture of a image classification with localisation problem
Architecture (Two outputs)
- conv net
- Softmax to output the different possible classes of object (e.g. car, pedestrian, motorcycle, background)
- Output a bounding box (bx, by, bh, bw), where (bx, by) is the middle of the bounding box
Y = [Pc (is there an object?), bx, by, bh, bw, c1 (is it class 1?), c2, c3]
If there is no object, the remaining components are labelled as "don't care"
Loss function:
- If the actual label has an object, the loss is the sum of squared differences between the prediction and the actual label for every component (Pc, bx, by, bh, bw, c1, c2, c3)
- If the actual label has no object, the loss is just the square of (Pc_hat - Pc)
Explain landmark detection
Landmark detection
- landmark detection is like detecting eyes/nose on a face
- Label the training dataset with a certain number of coordinates/landmarks that surround the feature you are trying to detect
- Change the CNN output layer to output whether the feature (e.g. a face) is present and the coordinates of the points you want in the image, like this:
- Y = [face?, l1x, l1y, ..., l64x, l64y]
Explain object detection
Object detection
- train a conv net to detect a car using heavily cropped images
- Then, for an image, start by picking a window size, input the window's contents into the conv net and get a prediction. Slide the window and pass the next window's contents into the conv net. Do this until the whole image is covered. The stride can be customised.
- Change window size and do it again
Intersection over union
= size of intersection between ground truth box and prediction box / size of union of both ground truth box and prediction box
If IOU>= 0.5 then prediction is correct
More generally, IOU is a measure of the overlap between two bounding boxes
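A sketch of the IoU computation for two boxes given as (x1, y1, x2, y2) corners (the corner format is an assumption; the cards above use centre/width/height):
def iou(box1, box2):
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])   # intersection rectangle
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)                    # intersection / union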
Non max suppression is a way to make sure your algo only detects the object once instead of multiple times
Steps:
- discard all boxes with probability of object <= 0.6
- Pick the box with the largest probability of object as a prediction
- Discard any remaining box with IOU >= 0.5 with the box output in prev step
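A sketch of those steps, assuming each detection is a dict with a probability and corner coordinates, and reusing the iou helper above (the thresholds follow the card):
def non_max_suppression(boxes, prob_threshold=0.6, iou_threshold=0.5):
    boxes = [b for b in boxes if b['prob'] > prob_threshold]     # discard low-probability boxes
    kept = []
    while boxes:
        best = max(boxes, key=lambda b: b['prob'])               # pick the most confident remaining box
        kept.append(best)
        boxes = [b for b in boxes                                # drop boxes overlapping it too much
                 if b is not best and iou(b['box'], best['box']) < iou_threshold]
    return kept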
Anchor box
- Aims to solve the problem that one grid cell can only detect one object, e.g. when the grid cell contains two overlapping objects
Steps
- Encode y with one set of components per anchor box:
Y = [Pc, bx, by, bh, bw, c1, c2, c3,   <- anchor box 1
     Pc, bx, by, bh, bw, c1, c2, c3]   <- anchor box 2
where Pc = is there an object, and c1, c2, c3 = class indicators
Limitation: it doesn't handle two objects whose shapes match the same anchor box
Output shape = grid rows x grid cols x number of anchor boxes x 8 parameters (Pc, bx, by, bh, bw, c1, c2, c3)
What architecture is used for image segmentation eg. Medical imaging
- Use a U-Net style encoder-decoder that blows the image back up to its original size using transposed convolutions
Contracting path (Encoder containing downsampling steps):
Images are first fed through several convolutional layers which reduce height and width, while growing the number of channels.
The contracting path follows a regular CNN architecture, with convolutional layers, their activations, and pooling layers to downsample the image and extract its features. In detail, it consists of the repeated application of two 3 x 3 same padding convolutions, each followed by a rectified linear unit (ReLU) and a 2 x 2 max pooling operation with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled.
Crop function: This step crops the image from the contracting path and concatenates it to the current image on the expanding path to create a skip connection.
Expanding path (Decoder containing upsampling steps):
The expanding path performs the opposite operation of the contracting path, growing the image back to its original size, while shrinking the channels gradually.
In detail, each step in the expanding path upsamples the feature map, followed by a 2 x 2 convolution (the transposed convolution). This transposed convolution halves the number of feature channels, while growing the height and width of the image.
Next is a concatenation with the correspondingly cropped feature map from the contracting path, and two 3 x 3 convolutions, each followed by a ReLU. You need to perform cropping to handle the loss of border pixels in every convolution.
Final Feature Mapping Block: In the final layer, a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. The channel dimensions from the previous layer correspond to the number of filters used, so when you use 1x1 convolutions, you can transform that dimension by choosing an appropriate number of 1x1 filters. When this idea is applied to the last layer, you can reduce the channel dimensions to have one layer per class.
The U-Net network has 23 convolutional layers in total.
Neural style transfer
Cost function = alpha * J_content(content image, generated image) + beta * J_style(style image, generated image)
Steps
1. Randomly initialise the generated image G (i.e. the pixel values are random)
2. Use gradient descent to minimise the cost function by updating the pixels of G
Content cost function
- use a pre trained conv net
- Let a[l](C) and a[l](G) be the activations of layer l on the content and generated images
- If a[l](C) and a[l](G) are similar, both images have similar content
- The similarity is measured by the sum of the squared element-wise differences of the two activation vectors
Style cost function
- Sum of squares of the element-wise differences between the style (Gram) matrices of the style image and the generated image, where the style matrix measures the correlation between channels in the activations of a hidden layer
How do you build a language model
Tokenize the sentence, build a vocabulary, map to one hot vectors, add EOS tokens
What is GRU
it is a gated recurrent unit, used in RNN, to solve vanishing gradient problem and capture long range connections
What is bidirectional RNN
A type of RNN that takes info from both earlier and later in the sequence
What is the peephole connection in LSTM
A connection that allows the gate values to depend not just on a<t-1> and x<t> but also on the previous memory cell value c<t-1>
What type of RNN is commonly used in NLP
Bidirectional RNN with LSTM blocks
What is the vanishing gradient problem
It is when the error signal in backpropagation becomes too small to update the weights of the earlier layers (or earlier time steps) of an RNN
LSTM vs GRU
LSTM is more powerful and flexible with 3 gates while GRU is simpler with two gates and easier to scale to larger networks
Advantage of LSTM over GRU
LSTM can learn longer range connections over GRU
What is the loss function for a single prediction time at a single time step
The standard logistic regression loss also called cross entropy loss
When is sequence to sequence model used
Machine translation
Image captioning
Explain sequence to sequence architecture
Use encoder network to find encoding of input sequence and decoder network to generate the corresponding sequence
Difference between machine translation and a language model
Machine translation is a conditional language model: it is conditioned on the encoding of the given sentence rather than starting off with a vector of zeros
It doesn’t just pick any sentence but it picks the most likely sentence conditioned on a given sentence
Why not greedy search to pick the best sentence
Greedy approach is picking the first best word and the second best word and so on and so forth.
The reason it doesn’t rly work is because the next best word doesn’t necessarily mean the final sentence is better
Best algo to pick the best sentence
Approximate search algo / beam search
Try picking the sentence that maximises the conditional probability (in the context of conditional language models)
Compared to greedy search, which commits to the single best next word, beam search keeps track of the B best candidate continuations at each step, where B is a parameter known as the beam width
How it works
Let's set our beam width to 3 and grab the top three predicted words at each position in a given sequence. The encoded input (audio or text) sequence is passed to a decoder, where a softmax is applied over all the words in a set vocabulary (defined in advance, whether we're working with audio sequencing or text translation).
For the second word in the sequence we pass each of the first three selected words as input for the second position. As before, we apply the same softmax output layer over the vocabulary to find the 3 best words for the second position. While this happens, we use conditional probability to decide on the best combinations of first-position and second-position words. We run these 3 input words against all words in the vocabulary to find the best 3 combinations and pass them to the next step as input again. Words from the first position can get dropped if another input token has a higher probability in combination with different continuations. For instance, if "I will" and "I am" scored higher than any combination starting with "Us", we can drop the "Us" token and continue with our new top three sequences. We repeat this process until we reach an END token and have generated 3 different sequences.
We now have 3 different text translations or audio sequence results that we still have to decide between. These output sequences can be different in length and total tokens, which can create nice variation in our results. We simply pick the decoder output with the highest probability at the end.
Conditional probability: p(y1, y2 | x) = p(y1 | x) * p(y2 | x, y1), i.e. the probability of the first word times the probability of the second word given the first
Large B is better but slower
Small B is worse result but faster
Beam search doesn’t find the max all the time but it is faster
What is Length normalisation
It is an enhancement to the beam search algorithm to get better results.
1) Maximise the sum of log P(y_t | x, y_1, ..., y_{t-1}) instead of the product of the probabilities (to avoid numerical underflow)
2) Normalise by the number of words in the output sequence raised to the power alpha, where alpha is tunable (alpha = 1 normalises completely by length, alpha = 0 doesn't normalise at all)
How is accuracy of machine translation measured
Using the Bleu score
It compares the machine translation against human generated references.
With modified precision we give each word credit only up to the maximum number of times it appears in the reference sentence.
E.g. if the MT output is "the the the the the the the"
and the reference is "the cat is on the mat", the modified precision is 2/7, where 2 is the clipped credit for "the" (it appears at most twice in the reference) and 7 is the count of "the" in the MT output
Bleu score using bi grams
Ref 1: the cat is on the mat
Ref 2 there is a cat on the mat
MT output the cat the cat on the mat
- "the cat": 2, clipped 1
- "cat the": 1, clipped 0
- "cat on": 1, clipped 1
- "on the": 1, clipped 1
- "the mat": 1, clipped 1
Modified precision = 4/6 (4 clipped matches out of 6 total bigrams)
Precision(n-gram) = sum of clipped n-gram counts / total number of n-grams in the MT output
Combined BLEU score = BP * exp(average of the log n-gram precisions)
Brevity penalty BP = 1 if the MT output is longer than the reference, otherwise exp(1 - reference length / MT output length), which penalises outputs that are too short
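A sketch that reproduces the bigram example above with Python's Counter, clipping each candidate bigram count by its maximum count in any reference:
from collections import Counter

def bigrams(words):
    return list(zip(words, words[1:]))

mt = "the cat the cat on the mat".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]

mt_counts = Counter(bigrams(mt))
max_ref_counts = Counter()
for ref in refs:
    for bg, c in Counter(bigrams(ref)).items():
        max_ref_counts[bg] = max(max_ref_counts[bg], c)

clipped = sum(min(c, max_ref_counts[bg]) for bg, c in mt_counts.items())
print(clipped / len(bigrams(mt)))   # 4/6 ≈ 0.67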
Attention model
Why? The precision of encoder-decoder networks (measured by the BLEU score) drops when sentences get long, and also when they are very short.
Intuition? Use a bidirectional RNN encoder so that, for each output time step, the model takes into account the activations of the words before and after each input position.
The context is the sum of the encoder features for the surrounding words multiplied by the corresponding attention weights.
How it works:
- To compute the attention weight for each input position at the current output step, a small neural network takes as input the previous decoder hidden state and that position's encoder activation.
- Each encoder activation concatenates the forward and backward activations, so it takes into account the words in front of, behind, and at the input position.
- Through gradient descent, the small network learns how much attention to place on each activation, and the current output is computed from the resulting attention-weighted context.
CTC
Connections temporal classification.
Rule: collapse repeated characters that are not separated by a blank. If a character is repeated but separated by a blank, both occurrences are included in the string
When to use transfer learning
Given that you want a model to do task B but don't have much data for it, while you have more data for a similar task A: pretrain your model on task A and fine-tune it on task B. First you train on task A, then, starting from the same weights, you retrain the network on the task B data. If the task B dataset is very small, you can update only the weights of the last layer.
What is multitask learning and when does it make sense
Multitask learning is training a neural network to predict multiple things.
It makes sense if training on a set of tasks that could benefit from shared lower-level features
And the amount of data you have for each task is quite similar
And you can train a big enough network to do well on all tasks
What are some encoder only models and in what types of use cases do you use encoder only models
How it’s trained:
- Masked language modelling
- Random words in the sentence are masked and the training objective is to predict the masked tokens to reconstruct the original sentence
- Bidirectional, i.e. it uses context from the whole sequence
Usecases
- NER
- sentiment analysis
- word classification
Model
BERT
RoBERTa
Decoder only models
Training objective:
Causal language modelling: Predict next token based on previous set of tokens
Use case
Text gen
Example
GPT
BLOOM
Sequence to sequence models (encoder decoder)
Training objective:
Span corruption: random spans of tokens are masked and replaced with a sentinel token, and the model is trained to reconstruct the masked span (e.g. "The teacher teaches the student" -> "The teacher <X> student", with target "<X> teaches the")
Use case
Translation
Text summarisation
Question answering
Example
T5
BART
LLM eval metrics
ROUGE (text summarisation), compares a summary to one or more reference summaries
- the higher the rouge 1 score, the better
BLEU score (text translation), compare to human generated translations
Drawbacks of full fine tuning
Because full fine-tuning creates a full copy of the LLM for each task, it requires a lot of memory to store the weights, gradients, optimiser states, forward activations, etc.
What is Parameter Efficient Finetuning and motivation
And what are the different PEFT methods
As models get larger, full finetuning becomes infeasible to train on for consumer hardware. Storing and deployment for each downstream task becomes expensive.
PEFT usually consists of freezing the original model weights and only fine-tuning a small number of parameters, or adding new layers. As a result the new PEFT weights are only a few MB in size.
However there are some trade-offs:
- parameter efficiency
- training speed
- inference costs
- model perf
- memory efficiency
PEFT methods:
1) Selective: select subset of initial LLM parameters to fine tune
2) Reparameterization: reparametize model weights using a low-rank representation (LoRA)
3) Additive: add trainable layers or parameters to model
— adapters: add new trainable layers, typically inside the encoder or decoder (after the attention layers)
— soft prompts: fix model architecture, focus on manipulating input to achieve better perf by 1) adding trainable parameters to prompt embedding and 2) retraining embedding weights (prompt tuning)
LoRA (low rank adaptation of large language models)
Intuition: LoRA doesn’t change the underlying model, but it changes how the model emphasizes different connections. Think of each low-rank matrix as a filter.
- Freeze most of the original LLM weights
- Inject 2 rank decomposition matrices before the self attention layer
- Train the weights of the smaller matrices
Steps to update model for inference
1. Matrix-multiply the two low-rank matrices (B x A) <- the weight update learned for the fine-tuned task
2. Add to original weights (frozen + BxA)
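A NumPy sketch of the idea for one weight matrix (the dimensions and rank are illustrative; in practice LoRA is applied to the attention weights):
import numpy as np

d, k, r = 512, 512, 8                 # original weight shape (d x k), low rank r
W = np.random.randn(d, k)             # frozen pretrained weights
B = np.zeros((d, r))                  # trainable, initialised to zero
A = np.random.randn(r, k) * 0.01      # trainable

# during fine-tuning the forward pass uses W @ x + B @ (A @ x), updating only A and B
W_merged = W + B @ A                  # merge once for inference: same shape as W, no extra latency
print(W.size, A.size + B.size)        # 262144 frozen vs 8192 trainable parameters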
Bleu score
Bleu score is a metric to measure the performance of a sequence to sequence model like machine translation. It takes as input the output sequence of the model and a human generated reference.
Unigram precision = number of word matches / number of words in the generation. The problem with unigram precision is that if the output sequence contains a word from the human reference repeated many times, unigram precision will give a high score even though the output sequence is nonsense.
Modified unigram precision = clip(number of word matches) based on the number of times the word appears in the human reference.
However it still doesn't take into account the order of the sequence. To deal with word ordering problems, we use the BLEU score, which computes the precision for several different n-grams and then averages the results.
If there are no 4-gram matches then the 4-gram precision is 0.
The BLEU score is the geometric mean of all four n-gram precisions (unigram, bigram, trigram, 4-gram), i.e. (p1 * p2 * p3 * p4)^0.25.
In actual coding, people use the sacreBLEU score because it works on non-tokenised text, i.e. it takes a string of words rather than a list of tokens.
Rouge score
Rouge score tells us how good a machine-generated summary is compared to one or more reference summaries.
Rouge compares n-grams of the generated summary with n-grams of the references.
Rouge 1 recall = number of word matches / number of words in reference
Rouge 1 precision = number of word matches / number of words in machine generated summary
Rouge 1 F1 score = 2 * (precision * recall) / (precision + recall)
Rouge L: instead of unigrams or bigrams, it uses the length of the longest common subsequence (LCS) between the generated summary and the reference
Recall = LCS(ref, gen) / number of words in reference
Precision = LCS (ref, gen) / number of words in summary
The advantage of Rouge L over Rouge 1 or Rouge 2 is that it doesn't depend on consecutive n-gram matches, so it tends to capture sentence structure more accurately.
Rouge Lsum is computed over the whole summary, while Rouge L is averaged across individual sentences.
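A sketch of Rouge 1 recall, precision and F1 on a made-up generated summary and reference:
from collections import Counter

reference = "the cat sat on the mat".split()
generated = "the cat lay on the mat".split()

overlap = sum((Counter(reference) & Counter(generated)).values())   # matched unigrams, with counts
recall = overlap / len(reference)                                   # matches / words in reference
precision = overlap / len(generated)                                # matches / words in generated summary
f1 = 2 * precision * recall / (precision + recall)
print(recall, precision, f1)                                        # 5/6 for all three here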
Strategies for Inference Time Optimization
Model Pruning: Trim non-essential parameters, ensuring only those crucial to performance remain. This can drastically reduce the model’s size without significantly compromising accuracy.
Includes: PEFT
Quantization: Convert the 32-bit floating-point numbers into more memory-efficient formats, such as 16-bit or 8-bit, to streamline operations without a discernible loss in quality.
Model Distillation: Use larger models to train smaller, more compact versions that can deliver similar performance with a fraction of the resource requirement. The idea is to transfer the knowledge of larger models to smaller ones with simpler architecture.
Optimized Hardware Deployment: Deploy models on specialized hardware like Tensor Processing Units (TPUs) or Field-Programmable Gate Arrays (FPGAs) designed for accelerated model inference.
Batch Inference: The above LLM optimization techniques help optimise inference time but can reduce model accuracy, so the trade-off between inference time and accuracy needs special attention. One approach is batch inference: a batch prompting approach enables the LLM to run inference in batches instead of one sample at a time, reducing both token and time costs while retaining downstream performance.
How does LLM overcome knowledge cutoff or how to fit LLM with custom data
One way is to retrain the model on new data, but this quickly becomes expensive.
Another way is to give the LLM access to additional external data at inference time using retrieval augmented generation (RAG). RAG is a framework for providing the LLM access to data not seen during training by
- connecting to external data sources
- connecting to APIs
The external data source can be:
- sql database
- csv files
- web pages
- vector stores
Vector stores store embeddings of words/text chunks, which are vector representations. They also help LLMs to:
1. Retrieve relevant info or context during generation or understanding tasks. LLMs can query the vector store to obtain embeddings for specific words or phrases, enhancing their ability to understand and generate human-like text
2. They are commonly used in various tasks like semantic search, info retrieval, and similarity analysis
Vector databases (VDB) are an implementation of a vector store: a collection of unstructured text broken up (split) into chunks (portions), with vector embeddings generated for each chunk. Each vector is also identified by a key, which allows the text generated by RAG to include a citation for the document from which it was retrieved.
How does retrieval work in RAG
- Convert the question into an embedding
- Do a cosine similarity search in the vector DB containing the chunks of text in embedding form
- Grab the top N chunks
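A NumPy sketch of the retrieval step, assuming you already have an embedding function and a matrix of chunk embeddings (embed, chunk_embeddings and chunks are placeholders):
import numpy as np

def cosine_similarity(query_vec, chunk_matrix):
    return (chunk_matrix @ query_vec) / (np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(query_vec))

query_vec = embed("What is our refund policy?")            # 1) convert the question into an embedding
scores = cosine_similarity(query_vec, chunk_embeddings)    # 2) similarity against every stored chunk
top_n = np.argsort(scores)[::-1][:3]                       # 3) grab the top N chunk indices
context = [chunks[i] for i in top_n]                       # pass these chunks to the LLM as context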
RAG steps
- Index documents
- Retrieving document
- Generating using context window
LLM optimising techniques, when to use what techniques
If you want something fast:
- Use prompt engineering and few shot training
If model performance is not good, e.g. there are hallucinations:
- Use Active RAG
If you have time, money and high quality data:
- use Finetuning