Deep Learning Flashcards
What does ReLU stand for? And what does it mean?
Rectified Linear Unit
The rectified linear activation function or ReLU for short is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero.
What type of neural network architecture is used for eg. House price prediction or advertisement clicking probability?
Standard neural network architecture
What type of neural network architecture is used for image recognition?
CNN (convolutional neural network)
For sequence data eg. Audio over time, what type of neural network architecture do we use?
Recurrent Neural Network
What is the vanishing gradient problem?
Parameter optimisation uses gradient descent to find the best parameters. The vanishing gradient problem occurs when the gradient becomes exponentially small, so the update to the parameter we are trying to learn becomes insignificant. The implication is that the model never converges to the optimum, or takes much longer to train.
Explain gradient descent
Gradient descent is a method of updating the coefficients a0 and a1 to minimize the cost function (e.g. MSE). A regression model uses gradient descent by starting from random coefficient values and then iteratively updating them to reach the minimum of the cost function:
1) Start with random coefficients
2) Calculate predicted values
3) Calculate the partial derivatives of the cost w.r.t. a0 and a1, substituting in the predicted values
4) Multiply each derivative by the learning rate and subtract it from the corresponding coefficient
5) Stop after a set number of iterations (e.g. 100) or once the error is low
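A minimal NumPy sketch of these steps for simple linear regression (the function name, learning rate and iteration count are illustrative):
import numpy as np

def gradient_descent(x, y, lr=0.01, n_iters=100):
    a0, a1 = np.random.randn(), np.random.randn()   # 1) start with random coefficients
    m = len(x)
    for _ in range(n_iters):                        # 5) stop after a set number of iterations
        y_pred = a0 + a1 * x                        # 2) calculate predicted values
        d_a0 = (2 / m) * np.sum(y_pred - y)         # 3) partial derivatives of MSE w.r.t. a0 and a1
        d_a1 = (2 / m) * np.sum((y_pred - y) * x)
        a0 -= lr * d_a0                             # 4) multiply by learning rate and subtract
        a1 -= lr * d_a1
    return a0, a1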
What activation should you use for the output layer of a binary classification problem, and why?
Sigmoid, because you want to limit the output value to the range 0 to 1
PS: for the sigmoid function, when x = 0, y = 0.5
Name other activation functions more suitable for hidden layers
Tanh: similar to the sigmoid function but limits the output range to -1 to 1; when x = 0, y = 0
ReLU (default choice when you don't know what to use, because training is faster than with tanh or sigmoid due to the lack of vanishing gradients): a = max(0, z). Tanh and sigmoid have vanishing gradient problems at the tails
Leaky ReLU: a = max(0.01z, z). When z is negative, instead of the slope being zero there is a small slope. The constant 0.01 can be another learned parameter
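For reference, these activations in plain NumPy (the 0.01 leak factor is the conventional default):
import numpy as np

def sigmoid(z):    return 1 / (1 + np.exp(-z))      # output in (0, 1); sigmoid(0) = 0.5
def tanh(z):       return np.tanh(z)                # output in (-1, 1); tanh(0) = 0
def relu(z):       return np.maximum(0, z)          # a = max(0, z)
def leaky_relu(z): return np.maximum(0.01 * z, z)   # small slope for negative z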
What type of activation function should you use for a regression problem where the output is non-negative
ReLU
Why is there a need for non-linear activation functions
For networks with any number of hidden layers, if you use linear activation functions in every layer, the composition of linear functions is itself linear, so the whole network collapses to a single linear function of the input, no matter how many layers you stack.
What are the two phases in neural network?
During forward propagation, the input is fed into the neural network, and the network calculates the output. During backward propagation, the error between the predicted output and the actual output is calculated, and the weights and biases of each neuron are adjusted to reduce the error.
What are the common steps for pre-processing a new dataset
- Figure out dimensions and shapes of problem (m_train, m_test, num_px)
- Reshape the dataset such that each example is a flattened column vector
- Standardise the data
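A sketch of these steps, assuming an image dataset train_x_orig of shape (m_train, num_px, num_px, 3) as in course-style exercises (the variable names are placeholders):
m_train = train_x_orig.shape[0]

# reshape so each example becomes one flattened column vector of length num_px * num_px * 3
train_x_flat = train_x_orig.reshape(m_train, -1).T   # shape (num_px * num_px * 3, m_train)

# standardise pixel values to the [0, 1] range
train_x = train_x_flat / 255.0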
In a neural network with 1 hidden layer with 3 nodes and 4 input features what is the shape of the weights matrix in layer 1
(3,4)
The number of rows in W is the number of neurons in that layer, and the number of columns is the number of inputs to the layer
How to build a 2 layer neural network?
1) Initial parameters:
Weights matrix 1 with shape (size of hidden layer, size of input layer) | all random weights
Bias vector 1 with shape (size of hidden layer, 1) | all zero
Weights matrix 2 with shape (size of output layer, size of hidden layer) | all random
Bias vector 2 with shape (size of output layer, 1) | all zeros
2) Forward propagation:
Z = np.dot(W, A) + b
Where A is the activation from the previous layer (or the input data)
b is the bias vector
W is the weight matrix
3) Calculate the activation for the layer by applying the activation function to z
g(z)
4) Compute the cost function
- if it’s a regression problem, cost functions are MAE, MSE, RMSE
- If it is a classification problem, cost functions are cross entropy loss
5) Backward propagation
- compute the derivative of cost function with respect to AL ( probability vector)
- Use dAL to calculate the derivative of the cost function with respect to z
- Use dZ to calculate the derivatives of the cost function with respect to W and b
6) Update the parameters
- The new W and b are obtained by subtracting the learning rate multiplied by the gradients computed in backward propagation
7) Repeat steps 2 to 6 for a set number of iterations like 1000 times or until the cost is at a satisfactory level
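A compact NumPy sketch of steps 1-6 for a binary classifier, assuming inputs X of shape (n_x, m) and labels Y of shape (1, m); the layer sizes and learning rate are illustrative:
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))
def relu(z):    return np.maximum(0, z)

n_x, n_h, n_y, lr = 4, 3, 1, 0.01
W1, b1 = np.random.randn(n_h, n_x) * 0.01, np.zeros((n_h, 1))    # 1) initialise parameters
W2, b2 = np.random.randn(n_y, n_h) * 0.01, np.zeros((n_y, 1))
m = X.shape[1]

for i in range(1000):                                            # 7) repeat for a set number of iterations
    Z1 = np.dot(W1, X) + b1;  A1 = relu(Z1)                      # 2)-3) forward propagation
    Z2 = np.dot(W2, A1) + b2; A2 = sigmoid(Z2)
    cost = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))   # 4) cross-entropy cost
    dZ2 = A2 - Y                                                 # 5) backward propagation
    dW2 = np.dot(dZ2, A1.T) / m; db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (Z1 > 0)                           # (Z1 > 0) is the ReLU derivative
    dW1 = np.dot(dZ1, X.T) / m;  db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    W1 -= lr * dW1; b1 -= lr * db1                               # 6) update parameters
    W2 -= lr * dW2; b2 -= lr * db2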
How does l2 regularisation work in neural networks
If lambda, the regularisation parameter, is large, then the weights W will be relatively small, because large weights are penalised in the cost function. Since z is a function of W, if W is very small then z also stays in a small range around zero, and in that range the activation g(z) is roughly linear. So every layer behaves roughly linearly, and a deep network made of linear layers can only compute a linear function. It therefore can't fit very complicated, highly non-linear decision boundaries, which is what prevents it from overfitting.
How does dropout regularisation work in NN
- each layer has a probability of keeping a node
- Keep_prob is lower for bigger weight matrices (e.g. large hidden layers) to increase dropout, and higher for layers with fewer nodes
- Intuition: the network can't rely on any one feature, so it has to spread out the weights
- With dropout the cost function is no longer well defined, so turn dropout off to check that the cost is decreasing (to make sure the model is working), then turn it back on
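A sketch of inverted dropout for one layer's activation matrix A (the keep probability is illustrative):
import numpy as np

keep_prob = 0.8                               # lower keep_prob => more dropout
D = np.random.rand(*A.shape) < keep_prob      # random mask of nodes to keep
A = A * D                                     # shut off the dropped nodes
A = A / keep_prob                             # "inverted" scaling keeps the expected value of A unchanged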
What are the ways to speed up mini batch gradient descent and explain how
1) Momentum gradient descent
The problem with mini-batch gradient descent is that it may take many steps to reach the minimum due to the noise of the batches, causing the cost to oscillate before reaching the minimum.
Momentum gradient descent reduces the number of steps taken by smoothing out the movement using an exponentially weighted average.
The smoothing constant, beta, is a hyperparameter, usually 0.9, which roughly averages over the last 10 iterations.
2) RMSprop
- known as root mean square prop
- The vertical axis represents b and the horizontal axis represents w.
- The aim is to slow down learning in the vertical direction (i.e. b) and speed up learning in the horizontal direction (i.e. w).
- S_dW = beta * S_dW + (1 - beta) * dW^2 (element-wise square)
- W = W - alpha * dW / (sqrt(S_dW) + epsilon)
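Both update rules as small helper functions (assuming v_dW and s_dW start at zero; the beta values are the usual defaults):
import numpy as np

def momentum_step(W, dW, v_dW, alpha=0.01, beta=0.9):
    v_dW = beta * v_dW + (1 - beta) * dW             # exponentially weighted average of the gradients
    return W - alpha * v_dW, v_dW

def rmsprop_step(W, dW, s_dW, alpha=0.01, beta=0.999, eps=1e-8):
    s_dW = beta * s_dW + (1 - beta) * dW ** 2        # average of element-wise squared gradients
    return W - alpha * dW / (np.sqrt(s_dW) + eps), s_dW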
Why the need for learning rate decay?
At the start of learning you can afford to take bigger steps. But if the learning rate stays large when nearing the minimum, the algorithm might wander around and never settle at the minimum due to the large, noisy steps. Learning rate decay makes the learning rate, and hence the steps, smaller towards the end of training to better find the minimum.
What is one epoch
One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters. So if you have 500 mini-batches of 100 samples, the number of iterations is 500, i.e. the parameters have been updated 500 times and all 50,000 samples have been seen by the model.
How is the learning rate alpha updated in LR decay
It is updated after each epoch:
alpha = alpha_0 / (1 + decay_rate * epoch_num)
Other methods
Exponential decay: alpha = 0.95^epoch_num * alpha_0
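Both schedules as one-liners (alpha0 is the initial learning rate; decay_rate and the 0.95 base are illustrative hyperparameters):
alpha = alpha0 / (1 + decay_rate * epoch_num)   # standard learning rate decay
alpha = 0.95 ** epoch_num * alpha0              # exponential decay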
How to sample values on a log scale
1) determine upper and lower limit
2) transform the limits with log10
3) sample uniformly between the transformed limits (np.random.uniform, not np.randn) and use 10 to the power of the sample
How to sample values on a log scale when sampling for learning rate
1) determine upper and lower limit
2) a = log10(lower limit) | b = log10(upper limit)
3) r = uniform random sample between a and b
4) lr = 10^r
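For example, to sample a learning rate between 0.0001 and 1 on a log scale (the limits are illustrative):
import numpy as np

a, b = np.log10(0.0001), np.log10(1)   # a = -4, b = 0
r = np.random.uniform(a, b)            # sample the exponent uniformly
lr = 10 ** r                           # learning rate sampled on a log scale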
What is batch norm
We know that normalising inputs helps speed up training by transforming the optimisation problem from an elongated shape to a more circular one
Batch norm normalises the inputs to the next layer, e.g. a or z (where a = g(z) and z = Wa + b), to speed up training
Batch norm also helps with regularisation
Each mini batch is scaled by the mean/variance computed on just that mini-batch
This adds some noise to the values z[l] within that mini-batch, so similar to dropout it adds some noise to each hidden layer's activations, giving a slight regularisation effect
How to implement batchnorm
mu = (1/m) * sum(z_i)
variance = (1/m) * sum((z_i - mu)^2)
z_norm_i = (z_i - mu) / sqrt(variance + epsilon)
But sometimes you don't want z_norm to have mean 0 and variance 1, so:
z_tilde_i = gamma * z_norm_i + beta
Where gamma and beta are trainable parameters
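A NumPy sketch of batch norm on the pre-activations Z (shape: units x batch size) of one layer for a single mini-batch; gamma and beta are assumed to be trainable parameters of matching shape:
import numpy as np

eps = 1e-8
mu = np.mean(Z, axis=1, keepdims=True)      # mean over the mini-batch
var = np.var(Z, axis=1, keepdims=True)      # variance over the mini-batch
Z_norm = (Z - mu) / np.sqrt(var + eps)      # mean 0, variance 1
Z_tilde = gamma * Z_norm + beta             # learned scale and shift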
What activation function do you use for the output layer of a multi-class classification problem? Explain how it works
Softmax. It exponentiates each logit of the output layer and divides by the sum of all the exponentials, so the outputs lie between 0 and 1, sum to 1, and can be interpreted as class probabilities.
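A minimal sketch:
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return t / np.sum(t)

print(softmax(np.array([2.0, 1.0, 0.1])))   # ≈ [0.66, 0.24, 0.10], sums to 1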
CNN
A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks that specializes in processing data that has a grid-like topology, such as an image. A digital image is a binary representation of visual data. It contains a series of pixels arranged in a grid-like fashion that contains pixel values to denote how bright and what color each pixel should be.
A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully connected layer.
The convolution layer performs a dot product between two matrices, where one matrix is the set of learnable parameters otherwise known as a kernel, and the other matrix is the restricted portion of the receptive field.
During the forward pass, the kernel slides across the height and width of the image-producing the image representation of that receptive region.
The pooling layer replaces the output of the network at certain locations by deriving a summary statistic of the nearby outputs. This helps in reducing the spatial size of the representation, which decreases the required amount of computation and weights. Default is max pooling
The fully connected layer helps to map the representation between the input and the output.
Other regularisation techniques
- Data augmentation: e.g. for images you can flip or crop images to increase the dataset
- Early stopping: plot cost against iterations for both the CV dataset and the train dataset, and stop training when the CV cost starts to increase
Why normalise data
Easier to converge to minima when using gradient descent
Minibatch or batch gradient descent for larger training sets? Why?
For training on a large dataset, mini-batch gradient descent runs much faster than batch gradient descent
With batch gradient descent, a single pass through the training set allows you to take only one gradient descent step. With mini-batch gradient descent, a single pass through the training set (one epoch) allows you to take as many gradient descent steps as there are mini-batches, e.g. 5,000. You still usually want to take multiple passes through the training set, with an outer loop over epochs.
Diff between gradient descent and stochastic gradient descent
- Because gradient descent uses the whole batch for training, the algorithm takes nice large steps towards the minimum. Because SGD uses each training sample as its own mini-batch, the descent towards the minimum is noisier, as each sample has different quality, so it won't land exactly at the minimum but will wander around near it
What is the con of SGD
You lose the speed-up from vectorisation
Mini batch size
- if less than 2000 use batch gradient descent
- Else 64, 128, 256 or 512 (powers of 2, due to how computer memory is configured)
- Also depends on CPU and GPU memory
- Can try different values and see which is most efficient
Pros of minibatch
- make progress before processing the whole dataset
What is RMS prop
- known as root mean square prop
- The vertical axis represents b and the horizontal axis represents w.
- The aim is to slow down learning in the vertical direction (i.e. b) and speed up learning in the horizontal direction (i.e. w).
- S_dW = beta * S_dW + (1 - beta) * dW^2 (element-wise square)
- W = W - alpha * dW / (sqrt(S_dW) + epsilon)
What is Adam optimisation
- adaptive moment estimation
- combining momentum with RMSProp
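A sketch of a single Adam step, combining the two running averages above (default hyperparameters shown; t is the iteration count starting at 1, used for bias correction):
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dW            # momentum term
    s = beta2 * s + (1 - beta2) * dW ** 2       # RMSprop term
    v_hat = v / (1 - beta1 ** t)                # bias correction
    s_hat = s / (1 - beta2 ** t)
    return W - alpha * v_hat / (np.sqrt(s_hat) + eps), v, s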
Tuning DL networks
List of things to tune
- number of layers
- number of nodes
- LR rate / LR decay rate
- Size of mini batch
- Dropout
Use random search rather than grid search to cover more of the search space, then narrow down the search area (coarse to fine)
Tuning importance:
Alpha / LR
Momentum term = ~0.8
Hidden units
Mini batch size
Number of layers
LR decay
never tune
Beta1, beta2 and epsilon
Why use transfer learning
Transfer learning saves training time and gives better performance without needing a lot of data.
How does transfer learning work
In computer vision, neural networks usually detect edges in the earlier layers, shapes in the middle layers, and task-specific features in the later layers. In transfer learning, the early and middle layers are reused and only the later layers are retrained. This leverages the labelled data of the task the network was initially trained on.
Common data augmentation techniques
Random crop, mirroring, colour shifting (adding numbers to the RGB values, e.g. using PCA colour augmentation)
Normal vs Depthwise convolution
- For normal convolutions, you slide each filter block across the image block; at each position you multiply all the numbers in the filter block with the corresponding numbers in the image block and sum them up. Computational cost = number of filter parameters * number of filter positions * number of filters.
- For depthwise convolution, the number of filters equals the number of image channels. Match each filter to a channel and slide it over only that channel, multiplying and summing as before. Computational cost = number of filter parameters * number of filter positions * number of channels. Then do a pointwise convolution (1x1xn_c with n_c' filters) to combine the channels.
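A worked comparison using the cost formulas above with illustrative numbers (6x6x3 input, 3x3 filters, 5 output channels, stride 1, no padding):
n, f, n_c, n_c_out = 6, 3, 3, 5   # input size, filter size, input channels, output channels
n_out = n - f + 1                 # output positions per dimension

normal = (f * f * n_c) * (n_out * n_out) * n_c_out       # filter params * positions * filters
depthwise = (f * f) * (n_out * n_out) * n_c              # one f x f filter per input channel
pointwise = (1 * 1 * n_c) * (n_out * n_out) * n_c_out    # 1x1xn_c convolution with n_c' filters
print(normal, depthwise + pointwise)                     # 2160 vs 672 multiplications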
Advantages of Mobile nets
- low computational cost at deployment
- Useful for mobile and embedded vision applications
Inception Network
Instead of choosing which conv / pooling layer to apply, just do them all (including a 1x1 convolution) in parallel, concatenate the resulting blocks together, and let the network learn what to use.
The drawback is computational cost, but with 1x1 convolutions you can shrink the number of channels before applying the larger convolutions.
Why does resnet work?
It works because:
a[l+2] = g(z[l+2] + a[l])
       = g(W[l+2] a[l+1] + b[l+2] + a[l])
If you use L2 regularisation, W[l+2] tends to shrink, and if W is 0 the expression reduces to g(a[l]), i.e. the identity. Hence it is easy for residual blocks to learn the identity function, so adding them doesn't hurt performance.
Purpose of 1x1 convolution
Purpose 1: shrink channels
- E.g. you have a 28x28x192 volume and you want to shrink it to 28x28x32: apply 32 filters of size 1x1x192
Purpose 2: add non-linearity
What are resnets?
ResNets, also known as residual networks, are built out of something called residual blocks, which allow you to train very, very deep networks (>100 layers)
ResNet works by adding residual connections to the network, which helps to maintain the information flow throughout the network and prevents the gradients from vanishing. The residual connection is a shortcut that allows the information to bypass one or more layers in the network and reach the output directly.
In theory, the more layers the lower the error, but in practice with plain networks the training error eventually increases, meaning the network has a harder time learning. With ResNets, adding more layers keeps lowering the error, which eventually flattens out and plateaus.
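A minimal Keras-style sketch of an identity residual block, assuming the input and output channel counts match (batch norm omitted for brevity; the filter count is up to you):
from tensorflow.keras import layers

def identity_residual_block(x, filters, kernel_size=3):
    shortcut = x                                               # the skip connection a[l]
    x = layers.Conv2D(filters, kernel_size, padding='same')(x)
    x = layers.Activation('relu')(x)
    x = layers.Conv2D(filters, kernel_size, padding='same')(x)
    x = layers.Add()([x, shortcut])                            # z[l+2] + a[l]
    return layers.Activation('relu')(x)                        # a[l+2] = g(z[l+2] + a[l])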
CNN why add padding
- For every convolution layer the image shrinks: the new size is n - f + 1. So if you don't want your image to shrink a lot, especially when building very deep networks, you can add padding. New size with padding = n + 2p - f + 1
- Also, if you don't add padding, a pixel in the corner contributes to the output only once, compared with a more central pixel, so you'd be throwing away a lot of information from the edges
How to build a tensorflow CNN?
A Sequential model stacking Conv2D, MaxPooling2D, Flatten and Dense layers, for example (n_filters, pixels, n_neurons and n_outputs are placeholders):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(n_filters, (3, 3), activation='relu', input_shape=(pixels, pixels, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),                    # shape of the pooling window
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(n_neurons, activation='relu'),
    tf.keras.layers.Dense(n_outputs, activation='softmax'),
])
Explain the architecture of a image classification with localisation problem
Architecture (Two outputs)
- conv net
- Softmax to output the different possible classes of object (e.g. car, pedestrian, motorcycle, background)
- Output a bounding box (bx, by, bh, bw), where (bx, by) is the middle of the bounding box
Y = [Pc (is there an object?), bx, by, bh, bw, c1 (is it class 1?), c2, c3]
If there is no object, the remaining components are labelled as "don't care"
Loss function:
- If the actual label has an object, the loss is the sum of squared differences between the prediction and the actual label for every component (Pc, bx, by, bh, bw, c1, c2, c3)
- If the actual label has no object, the loss is just the square of (Pc_hat - Pc)
Explain landmark detection
Landmark detection
- landmark detection is like detecting eyes/nose on a face
- Label the training dataset with a certain number of coordinates/landmarks that surround the feature you are trying to detect
- Change the CNN output layer to output whether the feature (e.g. a face) is present and the coordinates of the points you want in the image, like this:
- Y = [face?, l1x, l1y, ..., l64x, l64y]
Explain object detection
Object detection
- train a conv net to detect a car using heavily cropped images
- Then, for an image, start by picking a window size, input the window's contents into the conv net and get a prediction. Slide the window and pass the next window's contents into the conv net. Do this until the whole image is covered. The stride can be customised.
- Change window size and do it again
Intersection over union
= size of intersection between ground truth box and prediction box / size of union of both ground truth box and prediction box
If IOU>= 0.5 then prediction is correct
More generally, IOU is a measure of the overlap between two bounding boxes
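A sketch of the IoU computation for two boxes given as (x1, y1, x2, y2) corners (the corner format is an assumption; the cards above use centre/width/height):
def iou(box1, box2):
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])   # intersection rectangle
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)                    # intersection / union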
Non max suppression is a way to make sure your algo only detects the object once instead of multiple times
Steps:
- discard all boxes with probability of object <= 0.6
- Pick the box with the largest probability of object as a prediction
- Discard any remaining box with IOU >= 0.5 with the box output in prev step
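A sketch of those steps, assuming each detection is a dict with a probability and corner coordinates, and reusing the iou helper above (the thresholds follow the card):
def non_max_suppression(boxes, prob_threshold=0.6, iou_threshold=0.5):
    boxes = [b for b in boxes if b['prob'] > prob_threshold]     # discard low-probability boxes
    kept = []
    while boxes:
        best = max(boxes, key=lambda b: b['prob'])               # pick the most confident remaining box
        kept.append(best)
        boxes = [b for b in boxes                                # drop boxes overlapping it too much
                 if b is not best and iou(b['box'], best['box']) < iou_threshold]
    return kept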
Anchor box
- Aims to solve the problem that one grid cell can only detect one object, e.g. when the grid cell contains two overlapping objects
Steps
- Encode y with one set of components per anchor box:
Y = [Pc, bx, by, bh, bw, c1, c2, c3,   <- anchor box 1
     Pc, bx, by, bh, bw, c1, c2, c3]   <- anchor box 2
where Pc = is there an object, and c1, c2, c3 = class indicators
Limitation: it doesn't handle two objects whose shapes match the same anchor box
Output shape = grid rows x grid cols x number of anchor boxes x 8 parameters (Pc, bx, by, bh, bw, c1, c2, c3)
What architecture is used for image segmentation eg. Medical imaging
- Use a U-Net style encoder-decoder that blows the image back up to its original size using transposed convolutions
Contracting path (Encoder containing downsampling steps):
Images are first fed through several convolutional layers which reduce height and width, while growing the number of channels.
The contracting path follows a regular CNN architecture, with convolutional layers, their activations, and pooling layers to downsample the image and extract its features. In detail, it consists of the repeated application of two 3 x 3 same padding convolutions, each followed by a rectified linear unit (ReLU) and a 2 x 2 max pooling operation with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled.
Crop function: This step crops the image from the contracting path and concatenates it to the current image on the expanding path to create a skip connection.
Expanding path (Decoder containing upsampling steps):
The expanding path performs the opposite operation of the contracting path, growing the image back to its original size, while shrinking the channels gradually.
In detail, each step in the expanding path upsamples the feature map, followed by a 2 x 2 convolution (the transposed convolution). This transposed convolution halves the number of feature channels, while growing the height and width of the image.
Next is a concatenation with the correspondingly cropped feature map from the contracting path, and two 3 x 3 convolutions, each followed by a ReLU. You need to perform cropping to handle the loss of border pixels in every convolution.
Final Feature Mapping Block: In the final layer, a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. The channel dimensions from the previous layer correspond to the number of filters used, so when you use 1x1 convolutions, you can transform that dimension by choosing an appropriate number of 1x1 filters. When this idea is applied to the last layer, you can reduce the channel dimensions to have one layer per class.
The U-Net network has 23 convolutional layers in total.
Neural style transfer
Cost function = alpha * J_content(content image, generated image) + beta * J_style(style image, generated image)
Steps
1. Randomly initialise the generated image G (i.e. the pixel values are random)
2. Use gradient descent to minimise the cost function by updating the pixels of G
Content cost function
- use a pre trained conv net
- Let a[l](C) and a[l](G) be the activations of layer l on the content and generated images
- If a[l](C) and a[l](G) are similar, both images have similar content
- The similarity is measured by the sum of the squared element-wise differences of the two activation vectors
Style cost function
- Sum of squares of the element-wise differences between the style (Gram) matrices of the style image and the generated image, where the style matrix measures the correlation between channels in the activations of a hidden layer
How do you build a language model
Tokenize the sentence, build a vocabulary, map to one hot vectors, add EOS tokens
What is GRU
it is a gated recurrent unit, used in RNN, to solve vanishing gradient problem and capture long range connections
What is bidirectional RNN
A type of RNN that takes info from both earlier and later in the sequence
What is the peephole connection in LSTM
A connection that allows the gate values to depend not just on a<t-1> and x<t> but also on the previous memory cell value c<t-1>
What type of RNN is commonly used in NLP
Bidirectional RNN with LSTM blocks
What is the vanishing gradient problem
It is when the error signal in backpropagation becomes too small to update the weights of the earlier layers (or earlier time steps) of an RNN
LSTM vs GRU
LSTM is more powerful and flexible with 3 gates while GRU is simpler with two gates and easier to scale to larger networks
Advantage of LSTM over GRU
LSTM can learn longer range connections over GRU
What is the loss function for a single prediction time at a single time step
The standard logistic regression loss also called cross entropy loss
When is sequence to sequence model used
Machine translation
Image captioning
Explain sequence to sequence architecture
Use encoder network to find encoding of input sequence and decoder network to generate the corresponding sequence
Difference between machine translation and a language model
Machine translation is a conditional language model: it is conditioned on the encoding of the given sentence rather than starting off with a vector of zeros
It doesn’t just pick any sentence but it picks the most likely sentence conditioned on a given sentence
Why not greedy search to pick the best sentence
Greedy approach is picking the first best word and the second best word and so on and so forth.
The reason it doesn’t rly work is because the next best word doesn’t necessarily mean the final sentence is better
Best algo to pick the best sentence
Approximate search algo / beam search
Try picking the sentence that maximises the conditional probability (in the context of conditional language models)
Compared to greedy search, which commits to the single best next word, beam search keeps track of the B best candidate continuations at each step, where B is a parameter known as the beam width
How it works
Let's set our beam width to 3 and grab the top three predicted words at each position in a given sequence. The encoded input (audio or text) sequence is passed to a decoder, where a softmax is applied over all the words in a set vocabulary (defined in advance, whether we're working with audio sequencing or text translation).
For the second word in the sequence we pass each of the first three selected words as input for the second position. As before, we apply the same softmax output layer over the vocabulary to find the 3 best words for the second position. While this happens, we use conditional probability to decide on the best combinations of first-position and second-position words. We run these 3 input words against all words in the vocabulary to find the best 3 combinations and pass them to the next step as input again. Words from the first position can get dropped if another input token has a higher probability in combination with different continuations. For instance, if "I will" and "I am" scored higher than any combination starting with "Us", we can drop the "Us" token and continue with our new top three sequences. We repeat this process until we reach an END token and have generated 3 different sequences.
We now have 3 different text translations or audio sequence results that we still have to decide between. These output sequences can be different in length and total tokens, which can create nice variation in our results. We simply pick the decoder output with the highest probability at the end.
Conditional probability: p(y1, y2 | x) = p(y1 | x) * p(y2 | x, y1), i.e. the probability of the first word times the probability of the second word given the first
Large B is better but slower
Small B is worse result but faster
Beam search doesn’t find the max all the time but it is faster
What is Length normalisation
It is an enhancement to the beam search algorithm to get better results.
1) Maximise the sum of log P(y_t | x, y_1, ..., y_{t-1}) instead of the product of the probabilities (to avoid numerical underflow)
2) Normalise by the number of words in the output sequence raised to the power alpha, where alpha is tunable (alpha = 1 normalises completely by length, alpha = 0 doesn't normalise at all)
How is accuracy of machine translation measured
Using the Bleu score
It compares the machine translation against human generated references.
With modified precision we give each word credit only up to the maximum number of times it appears in the reference sentence.
E.g. if the MT output is "the the the the the the the"
and the reference is "the cat is on the mat", the modified precision is 2/7, where 2 is the clipped credit for "the" (it appears at most twice in the reference) and 7 is the count of "the" in the MT output
Bleu score using bi grams
Ref 1: the cat is on the mat
Ref 2 there is a cat on the mat
MT output the cat the cat on the mat
- "the cat": 2, clipped 1
- "cat the": 1, clipped 0
- "cat on": 1, clipped 1
- "on the": 1, clipped 1
- "the mat": 1, clipped 1
Modified precision = 4/6 (4 clipped matches out of 6 total bigrams)
Precision(n-gram) = sum of clipped n-gram counts / total number of n-grams in the MT output
Combined BLEU score = BP * exp(average of the log n-gram precisions)
Brevity penalty BP = 1 if the MT output is longer than the reference, otherwise exp(1 - reference length / MT output length), which penalises outputs that are too short
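A sketch that reproduces the bigram example above with Python's Counter, clipping each candidate bigram count by its maximum count in any reference:
from collections import Counter

def bigrams(words):
    return list(zip(words, words[1:]))

mt = "the cat the cat on the mat".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]

mt_counts = Counter(bigrams(mt))
max_ref_counts = Counter()
for ref in refs:
    for bg, c in Counter(bigrams(ref)).items():
        max_ref_counts[bg] = max(max_ref_counts[bg], c)

clipped = sum(min(c, max_ref_counts[bg]) for bg, c in mt_counts.items())
print(clipped / len(bigrams(mt)))   # 4/6 ≈ 0.67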
Attention model
Why? The precision of encoder-decoder networks (measured by the BLEU score) drops when sentences get long, and also when they are very short.
Intuition? Use a bidirectional RNN encoder so that, for each output time step, the model takes into account the activations of the words before and after each input position.
The context is the sum of the encoder features for the surrounding words multiplied by the corresponding attention weights.
How it works:
- To compute the attention weight for each input position at the current output step, a small neural network takes as input the previous decoder hidden state and that position's encoder activation.
- Each encoder activation concatenates the forward and backward activations, so it takes into account the words in front of, behind, and at the input position.
- Through gradient descent, the small network learns how much attention to place on each activation, and the current output is computed from the resulting attention-weighted context.
CTC
Connections temporal classification.
Rule: collapse repeated characters that are not separated by a blank. If a character is repeated but separated by a blank, both occurrences are included in the string
When to use transfer learning
Given that you want a model to do task B but don't have much data for it, while you have more data for a similar task A: pretrain your model on task A and fine-tune it on task B. First you train on task A, then, starting from the same weights, you retrain the network on the task B data. If the task B dataset is very small, you can update only the weights of the last layer.
What is multitask learning and when does it make sense
Multitask learning is training a neural network to predict multiple things.
It makes sense if training on a set of tasks that could benefit from shared lower-level features
And the amount of data you have for each task is quite similar
And you can train a big enough network to do well on all tasks
What are some encoder only models and in what types of use cases do you use encoder only models
How it’s trained:
- Masked language modelling
- Random words in the sentence are masked and the training objective is to predict the masked tokens to reconstruct the original sentence
- Bidirectional, i.e. it uses context from the whole sequence
Usecases
- NER
- sentiment analysis
- word classification
Model
BERT
RoBERTa
Decoder only models
Training objective:
Causal language modelling: Predict next token based on previous set of tokens
Use case
Text gen
Example
GPT
BLOOM
Sequence to sequence models (encoder decoder)
Training objective:
Span corruption: random spans of tokens are masked and replaced with a sentinel token, and the model is trained to reconstruct the masked span (e.g. "The teacher teaches the student" -> "The teacher <X> student", with target "<X> teaches the")
Use case
Translation
Text summarisation
Question answering
Example
T5
BART
LLM eval metrics
ROUGE (text summarisation), compares a summary to one or more reference summaries
- the higher the rouge 1 score, the better
BLEU score (text translation), compare to human generated translations
Drawbacks of full fine tuning
Because full fine-tuning creates a full copy of the LLM for each task, it requires a lot of memory to store the weights, gradients, optimiser states, forward activations, etc.
What is Parameter Efficient Finetuning and motivation
And what are the different PEFT methods
As models get larger, full finetuning becomes infeasible to train on for consumer hardware. Storing and deployment for each downstream task becomes expensive.
PEFT usually consists of freezing the original model weights and only fine-tuning a small number of parameters, or adding new layers. As a result the new PEFT weights are only a few MB in size.
However there are some trade-offs:
- parameter efficiency
- training speed
- inference costs
- model perf
- memory efficiency
PEFT methods:
1) Selective: select subset of initial LLM parameters to fine tune
2) Reparameterization: reparametize model weights using a low-rank representation (LoRA)
3) Additive: add trainable layers or parameters to model
— adapters: add new trainable layers, typically inside the encoder or decoder (after the attention layers)
— soft prompts: fix model architecture, focus on manipulating input to achieve better perf by 1) adding trainable parameters to prompt embedding and 2) retraining embedding weights (prompt tuning)
LoRA (low rank adaptation of large language models)
Intuition: LoRA doesn’t change the underlying model, but it changes how the model emphasizes different connections. Think of each low-rank matrix as a filter.
- Freeze most of the original LLM weights
- Inject 2 rank decomposition matrices before the self attention layer
- Train the weights of the smaller matrices
Steps to update model for inference
1. Matrix-multiply the two low-rank matrices (B x A) <- the weight update learned for the fine-tuned task
2. Add to original weights (frozen + BxA)
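A NumPy sketch of the idea for one weight matrix (the dimensions and rank are illustrative; in practice LoRA is applied to the attention weights):
import numpy as np

d, k, r = 512, 512, 8                 # original weight shape (d x k), low rank r
W = np.random.randn(d, k)             # frozen pretrained weights
B = np.zeros((d, r))                  # trainable, initialised to zero
A = np.random.randn(r, k) * 0.01      # trainable

# during fine-tuning the forward pass uses W @ x + B @ (A @ x), updating only A and B
W_merged = W + B @ A                  # merge once for inference: same shape as W, no extra latency
print(W.size, A.size + B.size)        # 262144 frozen vs 8192 trainable parameters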
Bleu score
Bleu score is a metric to measure the performance of a sequence to sequence model like machine translation. It takes as input the output sequence of the model and a human generated reference.
Unigram precision = number of word matches / number of words in the generation. The problem with unigram precision is that if the output sequence contains a word from the human reference repeated many times, unigram precision will give a high score even though the output sequence is nonsense.
Modified unigram precision = clip(number of word matches) based on the number of times the word appears in the human reference.
However it still doesn't take into account the order of the sequence. To deal with word ordering problems, we use the BLEU score, which computes the precision for several different n-grams and then averages the results.
If there are no 4-gram matches then the 4-gram precision is 0.
The BLEU score is the geometric mean of all four n-gram precisions (unigram, bigram, trigram, 4-gram), i.e. (p1 * p2 * p3 * p4)^0.25.
In actual coding, people use the sacreBLEU score because it works on non-tokenised text, i.e. it takes a string of words rather than a list of tokens.
Rouge score
Rouge score tells us how good a machine-generated summary is compared to one or more reference summaries.
Rouge compares n-grams of the generated summary with n-grams of the references.
Rouge 1 recall = number of word matches / number of words in reference
Rouge 1 precision = number of word matches / number of words in machine generated summary
Rouge 1 F1 score = 2 * (precision * recall) / (precision + recall)
Rouge L: instead of unigrams or bigrams, it uses the length of the longest common subsequence (LCS) between the generated summary and the reference
Recall = LCS(ref, gen) / number of words in reference
Precision = LCS (ref, gen) / number of words in summary
The advantage of Rouge L over Rouge 1 or Rouge 2 is that it doesn't depend on consecutive n-gram matches, so it tends to capture sentence structure more accurately.
Rouge Lsum is computed over the whole summary, while Rouge L is averaged across individual sentences.
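A sketch of Rouge 1 recall, precision and F1 on a made-up generated summary and reference:
from collections import Counter

reference = "the cat sat on the mat".split()
generated = "the cat lay on the mat".split()

overlap = sum((Counter(reference) & Counter(generated)).values())   # matched unigrams, with counts
recall = overlap / len(reference)                                   # matches / words in reference
precision = overlap / len(generated)                                # matches / words in generated summary
f1 = 2 * precision * recall / (precision + recall)
print(recall, precision, f1)                                        # 5/6 for all three here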
Strategies for Inference Time Optimization
Model Pruning: Trim non-essential parameters, ensuring only those crucial to performance remain. This can drastically reduce the model’s size without significantly compromising accuracy.
Includes: PEFT
Quantization: Convert the 32-bit floating-point numbers into more memory-efficient formats, such as 16-bit or 8-bit, to streamline operations without a discernible loss in quality.
Model Distillation: Use larger models to train smaller, more compact versions that can deliver similar performance with a fraction of the resource requirement. The idea is to transfer the knowledge of larger models to smaller ones with simpler architecture.
Optimized Hardware Deployment: Deploy models on specialized hardware like Tensor Processing Units (TPUs) or Field-Programmable Gate Arrays (FPGAs) designed for accelerated model inference.
Batch Inference: The above LLM optimization techniques help optimise inference time but can reduce model accuracy, so the trade-off between inference time and accuracy needs special attention. One approach is batch inference: a batch prompting approach enables the LLM to run inference in batches instead of one sample at a time, reducing both token and time costs while retaining downstream performance.
How does LLM overcome knowledge cutoff or how to fit LLM with custom data
One way is to retrain the model on new data, but this quickly becomes expensive.
Another way is to give the LLM access to additional external data at inference time using retrieval augmented generation (RAG). RAG is a framework for providing the LLM access to data not seen during training by
- connecting to external data sources
- connecting to APIs
The external data source can be:
- sql database
- csv files
- web pages
- vector stores
Vector stores store embeddings of words/text chunks, which are vector representations. They also help LLMs to:
1. Retrieve relevant info or context during generation or understanding tasks. LLMs can query the vector store to obtain embeddings for specific words or phrases, enhancing their ability to understand and generate human-like text
2. They are commonly used in various tasks like semantic search, info retrieval, and similarity analysis
Vector databases (VDB) are an implementation of a vector store: a collection of unstructured text broken up (split) into chunks (portions), with vector embeddings generated for each chunk. Each vector is also identified by a key, which allows the text generated by RAG to include a citation for the document from which it was retrieved.
How does retrieval work in RAG
- Convert the question into an embedding
- Do a cosine similarity search in the vector DB containing the chunks of text in embedding form
- Grab the top N chunks
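A NumPy sketch of the retrieval step, assuming you already have an embedding function and a matrix of chunk embeddings (embed, chunk_embeddings and chunks are placeholders):
import numpy as np

def cosine_similarity(query_vec, chunk_matrix):
    return (chunk_matrix @ query_vec) / (np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(query_vec))

query_vec = embed("What is our refund policy?")            # 1) convert the question into an embedding
scores = cosine_similarity(query_vec, chunk_embeddings)    # 2) similarity against every stored chunk
top_n = np.argsort(scores)[::-1][:3]                       # 3) grab the top N chunk indices
context = [chunks[i] for i in top_n]                       # pass these chunks to the LLM as context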
RAG steps
- Index documents
- Retrieving document
- Generating using context window
LLM optimising techniques, when to use what techniques
If you want something fast:
- Use prompt engineering and few shot training
If model performance is not good, e.g. there are hallucinations:
- Use Active RAG
If you have time, money and high quality data:
- use Finetuning