Deep Learning Flashcards

1
Q

What does ReLU stand for, and what does it mean?

A

Rectified Linear Unit

The rectified linear activation function or ReLU for short is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero.

2
Q

What type of neural network architecture is used for, e.g., house price prediction or predicting the probability of an advertisement click?

A

Standard neural network architecture

3
Q

What type of neural network architecture is used for image recognition?

A

CNN (convolutional neural network)

4
Q

For sequence data, e.g. audio over time, what type of neural network architecture do we use?

A

Recurrent Neural Network

5
Q

What is the vanishing gradient problem?

A

Parameters are optimised with gradient descent. The vanishing gradient problem occurs when the gradient becomes exponentially small as it is propagated back through the layers, so that the parameter updates become insignificant. As a result, the model may never converge to the optimum, or it may take much longer to train.

6
Q

Explain gradient descent

A

Gradient descent is a method of updating the parameters a0 and a1 to minimise a cost function (e.g. MSE). A regression model uses gradient descent to fit the coefficients of the line (a0 and a1, i.e. the intercept and the slope): it starts from randomly chosen coefficient values and then iteratively updates them to move towards the minimum of the cost function.

1) Start with random coefficients
2) Calculate the predicted values
3) Calculate the partial derivatives of the cost w.r.t. a0 and a1, substituting in the predicted values
4) Multiply each derivative by the learning rate and subtract it from the corresponding coefficient
5) Repeat steps 2 to 4; stop after a fixed number of iterations (e.g. 100) or when the error is low enough
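
The steps above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the function name fit_line, the toy data, the learning rate and the iteration count are all assumptions chosen for the example, and the coefficients start at zero instead of random values so that the run is reproducible.

```python
def fit_line(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y = a0 + a1 * x by batch gradient descent on the MSE cost."""
    a0, a1 = 0.0, 0.0  # assumption: fixed start; random values also work
    m = len(xs)
    for _ in range(iterations):
        # Prediction error for every sample with the current coefficients
        errors = [(a0 + a1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to a0 and a1
        grad_a0 = (2.0 / m) * sum(errors)
        grad_a1 = (2.0 / m) * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient, scaled by the learning rate
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # toy data generated from y = 1 + 2x
a0, a1 = fit_line(xs, ys)
print(round(a0, 4), round(a1, 4))
```

With too large a learning rate the updates overshoot and diverge, which is why the step size is kept small here.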

7
Q

What activation should you use for the output layer of a binary classifier, and why?

A

Sigmoid, because you want to limit the output value to the range 0 to 1 so that it can be read as a probability

P.S. In the sigmoid function, when x=0, y=0.5

8
Q

Name other activation functions more suitable for hidden layers

A

Tanh: similar to the sigmoid function but limits the output range to between -1 and 1; when x=0, y=0

ReLU: a=max(0,z). The default choice when you don't know what to use, because training is usually faster than with tanh or sigmoid: its gradient does not vanish for positive z, whereas tanh and sigmoid have vanishing gradients at the tails

Leaky ReLU: a=max(0.01z, z). When z is negative, instead of the slope being zero, there is a small slope. The constant 0.01 can itself be a learned parameter
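
As a rough sketch, the activation functions mentioned on these cards (sigmoid, tanh, ReLU and leaky ReLU) can be written as scalar Python functions. The function names and the slope argument are illustrative assumptions; real frameworks apply the same formulas element-wise to whole arrays.

```python
import math


def sigmoid(z):
    # Squashes any real number into (0, 1); sigmoid(0) is 0.5
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    # Squashes any real number into (-1, 1); tanh(0) is 0
    return math.tanh(z)


def relu(z):
    # a = max(0, z)
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    # a = max(slope * z, z): a small slope instead of zero for negative z
    return max(slope * z, z)


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```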

9
Q

What type of activation function should you use for a regression problem where the output is non-negative?

A

ReLU

10
Q

Why is there a need for non-linear activation functions

A

For networks with more than one hidden layer, if you use linear activation functions in all layers, the output is still a linear function of the input, so the network can represent no more than a single layer with a linear activation function can.
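
A small worked example can make this concrete. The sketch below uses made-up weights (W1, b1, W2, b2 and x are illustrative assumptions) to show that two stacked linear layers give the same output as one linear layer with W = W2 W1 and b = W2 b1 + b2.

```python
def matvec(w, x):
    # Matrix-vector product for a matrix stored as a list of rows
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def matmul(a, b):
    # Matrix-matrix product for matrices stored as lists of rows
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def add(u, v):
    return [p + q for p, q in zip(u, v)]


W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # hidden layer, shape (3, 2)
b1 = [0.5, 0.0, -1.0]
W2 = [[2.0, 1.0, 0.0]]                      # output layer, shape (1, 3)
b2 = [4.0]
x = [1.0, -2.0]

# Two layers with linear (identity) activations
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single linear layer
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layers, one_layer)
```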

11
Q

What are the two phases of training a neural network?

A

Forward propagation and backward propagation. During forward propagation, the input is fed into the neural network, and the network calculates the output. During backward propagation, the error between the predicted output and the actual output is calculated, and the weights and biases of each neuron are adjusted to reduce the error.

12
Q

What are the common steps for pre-processing a new dataset

A
  1. Figure out dimensions and shapes of problem (m_train, m_test, num_px)
  2. Reshape the dataset such that each example is now a vector of size
  3. Standardise the data
13
Q

In a neural network with 1 hidden layer with 3 nodes and 4 input features what is the shape of the weights matrix in layer 1

A

(3,4)

The number of rows in W is the number of neurons in that layer, and the number of columns is the number of inputs to that layer

14
Q

How to build a 2 layer neural network?

A

1) Initialise parameters:

Weights matrix 1 with shape (size of hidden layer, size of input layer) | all random weights
Bias vector 1 with shape (size of hidden layer, 1) | all zero
Weights matrix 2 with shape (size of output layer, size of hidden layer) | all random weights
Bias vector 2 with shape (size of output layer, 1) | all zero

2) Forward propagation:

Z = np.dot(W, A) + b

Where A is the activation from the previous layer (or the input data for the first layer)
b is the bias vector
W is the weight matrix

3) Calculate the activation for the layer by applying the activation function to Z
A = g(Z)

4) Compute the cost function
- If it is a regression problem, common cost functions are MAE, MSE and RMSE
- If it is a classification problem, the usual cost function is cross-entropy loss

5) Backward propagation
- Compute the derivative of the cost function with respect to AL (the output probability vector), dAL
- Use dAL to calculate the derivative of the cost function with respect to Z, dZ
- Use dZ to calculate the derivatives of the cost function with respect to W and b

6) Update the parameters
- The new W and b are obtained by subtracting the learning rate multiplied by the gradients computed in backward propagation

7) Repeat steps 2 to 6 for a set number of iterations like 1000 times or until the cost is at a satisfactory level

15
Q

How does L2 regularisation work in neural networks?

A

If lambda, the regularisation parameter, is large, the weights W are kept relatively small because large weights are penalised in the cost function. Since z is a function of W, small weights make z small as well. When z stays within a small range around zero, an activation such as tanh is roughly linear there, so every layer behaves roughly linearly. A network in which every layer is linear can only compute a linear function, however deep it is, so it cannot fit the very complicated, highly non-linear decision boundaries that would allow it to overfit.

16
Q

How does dropout regularisation work in neural networks?

A
  • Each layer has a probability of keeping each node (keep_prob)
  • Use a lower keep_prob for layers with bigger weight matrices (e.g. large hidden layers) to increase dropout, and a higher keep_prob for layers with fewer nodes
  • Intuition: the network can't rely on any one feature, so it has to spread out the weights
  • With dropout the cost function is no longer well defined, so it may not decrease on every iteration. Turn off dropout first to check that the cost is dropping (i.e. the model is working), then turn it back on
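
A minimal sketch of how a dropout mask can be applied to one layer's activations. It assumes the commonly used "inverted dropout" variant, which divides the surviving activations by keep_prob so that their expected value is unchanged; the card does not state this scaling, so treat it, the function name and the toy values as assumptions.

```python
import random


def dropout(activations, keep_prob, rng):
    # Keep each node with probability keep_prob, otherwise zero it out
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in activations]
    # Inverted dropout (assumption): rescale so the expected value is unchanged
    return [a * m / keep_prob for a, m in zip(activations, mask)]


rng = random.Random(0)  # fixed seed so the example is reproducible
a = [1.0] * 10000
dropped = dropout(a, keep_prob=0.8, rng=rng)
mean_after = sum(dropped) / len(dropped)
print(mean_after)  # close to 1.0, the mean before dropout
```

At test time no mask is applied, which is why the rescaling during training matters.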
17
Q

What are the ways to speed up mini batch gradient descent and explain how

A

1) Gradient descent with momentum

The problem with mini-batch gradient descent is that it may take many steps to get to the minimum due to the noise of the batches, causing the cost to oscillate before reaching the minimum.

Momentum reduces the number of steps taken by smoothing out the movement using an exponentially weighted average of the gradients.

The smoothing constant, beta, is a hyperparameter, commonly set to 0.9, which roughly averages over the last 10 iterations.

2) RMSprop

  • Known as root mean square prop
  • Picture the vertical axis as b and the horizontal axis as w
  • The aim is to slow down learning in the vertical direction (the b direction) and speed up learning in the horizontal direction (the w direction)
  • S_dw = beta * S_dw + (1 - beta) * dw^2 (element-wise)
  • w = w - alpha * dw / sqrt(S_dw)
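
The two update rules can be sketched for a single scalar parameter as follows. The function names, the default alpha and beta values, and the small eps added to the RMSprop denominator (to avoid dividing by zero) are assumptions for illustration; they are not taken from the card.

```python
import math


def momentum_step(w, dw, v, alpha=0.1, beta=0.9):
    # Exponentially weighted average of past gradients
    v = beta * v + (1 - beta) * dw
    # Step along the smoothed gradient
    return w - alpha * v, v


def rmsprop_step(w, dw, s, alpha=0.1, beta=0.9, eps=1e-8):
    # Exponentially weighted average of squared gradients
    s = beta * s + (1 - beta) * dw ** 2
    # Directions with large gradients get smaller steps
    return w - alpha * dw / (math.sqrt(s) + eps), s


# One step on f(w) = w^2 from w = 1, where the gradient is dw = 2w = 2
w1, v1 = momentum_step(1.0, 2.0, 0.0)
w2, s2 = rmsprop_step(1.0, 2.0, 0.0)
print(w1, v1, w2, s2)
```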
18
Q

Why the need for learning rate decay?

A

At the start of learning you can afford to take bigger steps. But if the learning rate stays large near the minimum, the algorithm may wander around it without converging because of the large, noisy steps. Learning rate decay makes the learning rate, and hence the steps, smaller towards the end of training so that the algorithm can settle closer to the minimum.

19
Q

What is one epoch

A

One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters. So if you have 500 mini-batches of 100 samples, one epoch takes 500 iterations, i.e. the parameters have been updated 500 times and all 50,000 samples have been seen by the model.

20
Q

How is the learning rate alpha updated in LR decay

A

It is updated after each epoch:
alpha = alpha_0 / (1 + decay_rate * epoch_num), where alpha_0 is the initial learning rate

Other methods
Exponential decay: alpha = 0.95^epoch_num * alpha_0
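
A short sketch of the two schedules, assuming both are computed from the initial learning rate alpha0 and the epoch number. The function names and example values are illustrative.

```python
def inverse_decay(alpha0, decay_rate, epoch_num):
    # alpha = alpha0 / (1 + decay_rate * epoch_num)
    return alpha0 / (1.0 + decay_rate * epoch_num)


def exponential_decay(alpha0, epoch_num, base=0.95):
    # alpha = base^epoch_num * alpha0
    return (base ** epoch_num) * alpha0


for epoch in range(4):
    print(epoch, inverse_decay(0.2, 1.0, epoch), exponential_decay(0.2, epoch))
```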

21
Q

How to sample values on a log scale

A

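1) Determine the upper and lower limits
2) Transform both limits with log10
3) Sample uniformly between the two transformed limits (e.g. with np.random.uniform), then raise 10 to the power of the sampled value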

22
Q

How to sample values on a log scale when sampling for learning rate

A

1) Determine the upper and lower limits
2) a = log10(lower limit), b = log10(upper limit)
3) r = uniform random number between a and b
4) lr = 10^r
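
The four steps can be sketched with the standard library. The function name sample_log_uniform and the limits 0.0001 to 1 are assumptions chosen for the example.

```python
import math
import random


def sample_log_uniform(lower, upper, rng):
    a = math.log10(lower)   # step 2: transform the limits
    b = math.log10(upper)
    r = rng.uniform(a, b)   # step 3: sample uniformly in log space
    return 10 ** r          # step 4: map back to the original scale


rng = random.Random(0)  # fixed seed so the example is reproducible
samples = [sample_log_uniform(1e-4, 1.0, rng) for _ in range(1000)]
print(min(samples), max(samples))
```

Sampling this way spends roughly the same number of draws on each decade (0.0001 to 0.001, 0.001 to 0.01, and so on), whereas sampling uniformly on the original scale would put about 90% of the draws between 0.1 and 1.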

23
Q

What is batch norm

A

We know that normalising the inputs helps speed up training by making the optimisation problem less elongated and more circular.

Batch norm normalises the inputs to the next layer, e.g. a or z, where a = g(z) and z = w * a_prev + b, to speed up training.

Batch norm also has a slight regularisation effect.

Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
This adds some noise to the values z[l] within that mini-batch, so, similar to dropout, it adds some noise to each hidden layer's activations, giving a slight regularisation effect.

24
Q

How to implement batchnorm

A

mu = (1/m) * sum_i z_i
variance = (1/m) * sum_i (z_i - mu)^2
z_norm_i = (z_i - mu) / sqrt(variance + epsilon)

But sometimes you don't want z_norm to have mean 0 and variance 1, so:

z~_i = gamma * z_norm_i + beta

where gamma and beta are trainable parameters
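
The formulas above can be sketched for one hidden unit over a mini-batch of scalar values. The function name, the default epsilon and the toy inputs are assumptions; in a real network gamma and beta would be learned by gradient descent, not passed in by hand.

```python
import math


def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-8):
    m = len(z)
    mu = sum(z) / m                                  # mini-batch mean
    var = sum((zi - mu) ** 2 for zi in z) / m        # mini-batch variance
    z_norm = [(zi - mu) / math.sqrt(var + eps) for zi in z]
    # Scale and shift so the layer can choose its own mean and variance
    return [gamma * zn + beta for zn in z_norm]


z = [1.0, 2.0, 3.0, 4.0]
out = batch_norm(z)                          # mean about 0, variance about 1
scaled = batch_norm(z, gamma=2.0, beta=3.0)  # mean about 3
print(out, scaled)
```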

25
Q

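What activation function do you use for the output layer of a multi-class classification problem? Explain how it works

A

Softmax. It exponentiates each output z_i and divides by the sum of the exponentials over all classes, so the outputs are positive, sum to 1 and can be read as class probabilities.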
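
A minimal softmax sketch in plain Python. Subtracting the largest value before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result.

```python
import math


def softmax(z):
    shift = max(z)  # assumption: shift for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    # Each output is positive and the outputs sum to 1
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```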

26
Q

CNN

A

A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks that specializes in processing data that has a grid-like topology, such as an image. A digital image is a binary representation of visual data. It contains a series of pixels arranged in a grid-like fashion that contains pixel values to denote how bright and what color each pixel should be.

A CNN typically has three types of layers: convolutional layers, pooling layers, and fully connected layers.

The convolution layer performs a dot product between two matrices, where one matrix is the set of learnable parameters, otherwise known as a kernel, and the other matrix is the restricted portion of the receptive field.

During the forward pass, the kernel slides across the height and width of the image, producing the image representation of that receptive region.

The pooling layer replaces the output of the network at certain locations by deriving a summary statistic of the nearby outputs. This helps in reducing the spatial size of the representation, which decreases the required amount of computation and weights. Max pooling is the most common choice.

The fully connected layer helps to map the representation between the input and the output.

27
Q

Other regularisation techniques

A

- Data augmentation: e.g. for images, flip or crop the images to enlarge the dataset
- Early stopping: plot the cost against iterations for both the CV dataset and the training dataset, and stop training when the CV cost starts to increase

28
Q

Why normalise data

A

Easier to converge to minima when using gradient descent

29
Q

Minibatch or batch gradient descent for larger training sets? Why?

A

For large training sets, use mini-batch gradient descent, which runs much faster than batch gradient descent.

With batch gradient descent, a single pass through the training set allows you to take only one gradient descent step. With mini-batch gradient descent, a single pass through the training set (one epoch) allows you to take one gradient descent step per mini-batch, e.g. 5,000 steps if the training set is split into 5,000 mini-batches. You usually want multiple passes through the training set, so you wrap this in an outer loop over epochs.

30
Q

Diff between gradient descent and stochastic gradient descent

A
  • Because (batch) gradient descent uses the whole training set for each step, the algorithm takes smooth, relatively large steps towards the minimum. SGD uses each training sample as its own mini-batch, so the descent is noisier because individual samples vary in quality; it also never settles at the minimum but wanders around it
31
Q

What is the con of SGD

A

You lose the speed-up from vectorisation, because each step processes only one example

32
Q

Mini batch size

A
  • If the training set has fewer than about 2,000 examples, use batch gradient descent
  • Otherwise use 64, 128, 256 or 512 (powers of 2, because of how computer memory is laid out)
  • Also depends on CPU and GPU memory: a mini-batch should fit in memory
  • Can try different values and see which one is more efficient
33
Q

Pros of minibatch

A
  • make progress before processing the whole dataset
34
Q

What is RMS prop

A
  • Known as root mean square prop
  • Picture the vertical axis as b and the horizontal axis as w
  • The aim is to slow down learning in the vertical direction (the b direction) and speed up learning in the horizontal direction (the w direction)
  • S_dw = beta * S_dw + (1 - beta) * dw^2 (element-wise)
  • w = w - alpha * dw / sqrt(S_dw)
35
Q

What is Adam optimisation

A
  • adaptive moment estimation
  • combining momentum with RMSProp
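
One Adam update for a single scalar parameter might look like the sketch below. The bias-correction terms and the default values (alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8) are the commonly cited ones and are assumptions here, not something stated on the card.

```python
import math


def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum part: moving average of the gradients
    v = beta1 * v + (1 - beta1) * dw
    # RMSprop part: moving average of the squared gradients
    s = beta2 * s + (1 - beta2) * dw ** 2
    # Bias correction for the first few steps (t starts at 1)
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (math.sqrt(s_hat) + eps)
    return w, v, s


# First step on f(w) = w^2 from w = 1, where dw = 2
w_new, v_new, s_new = adam_step(1.0, 2.0, 0.0, 0.0, t=1)
print(w_new, v_new, s_new)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly alpha regardless of the gradient's size.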
36
Q

Tuning DL networks

A

List of things to tune
- Number of layers
- Number of nodes
- Learning rate / learning rate decay
- Size of mini-batch
- Dropout

Use random search rather than grid search to cover more of the search space, then narrow down the search area

Tuning importance:
Most important: alpha (learning rate)

Next: momentum term (~0.9 is a common default), hidden units, mini-batch size

Then: number of layers, LR decay

Rarely tuned: beta1, beta2 and epsilon (Adam)

37
Q

Why use transfer learning

A

Transfer learning saves training time and gives better performance without needing a lot of data.

38
Q

How does transfer learning work

A

In computer vision, neural networks usually learn to detect edges in the earlier layers, shapes in the middle layers and task-specific features in the later layers. In transfer learning, the early and middle layers are reused and only the later layers are retrained. This leverages the labelled data of the task the network was initially trained on.

39
Q

Common data augmentation techniques

A

Random crop, mirroring, color shifting (adding offsets to the RGB values, e.g. using PCA color augmentation)

40
Q

Normal vs Depthwise convolution

A
  • For normal convolutions, you slide each filter across the image; at each position, multiply all the numbers in the filter with the corresponding numbers in the image patch under the filter and sum them up. The computational cost = number of filter parameters * number of filter positions * number of filters
  • For depthwise separable convolution, first do a depthwise convolution: the number of filters = number of image channels, so match each filter to one channel, slide it over that channel only, multiply and sum. The computational cost = number of filter parameters * number of filter positions * number of filters. Then do a pointwise convolution with n_c' filters of size 1 x 1 x n_c
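
The cost formulas can be compared on a small made-up case: a 6 x 6 x 3 input, 3 x 3 filters, a 4 x 4 output and 5 output filters. These sizes and the function names are assumptions chosen only to make the arithmetic easy to follow.

```python
def normal_conv_cost(f, n_c, n_out, n_filters):
    # filter parameters * filter positions * number of filters
    return (f * f * n_c) * (n_out * n_out) * n_filters


def depthwise_separable_cost(f, n_c, n_out, n_filters):
    # Depthwise step: one f x f filter per input channel
    depthwise = (f * f) * (n_out * n_out) * n_c
    # Pointwise step: n_filters filters of size 1 x 1 x n_c
    pointwise = n_c * (n_out * n_out) * n_filters
    return depthwise + pointwise


normal = normal_conv_cost(f=3, n_c=3, n_out=4, n_filters=5)
separable = depthwise_separable_cost(f=3, n_c=3, n_out=4, n_filters=5)
print(normal, separable, separable / normal)
```

The saving grows with the number of output filters and the filter size, which is why the idea suits low-compute settings.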
41
Q

Advantages of Mobile nets

A
  • low computational cost at deployment
  • Useful for mobile and embedded vision applications
42
Q

Inception Network

A

Instead of choosing a filter size, or deciding between a conv layer and a pooling layer, apply them all in parallel (1x1, 3x3 and 5x5 convolutions and max pooling), concatenate the outputs along the channel dimension and let the network learn which to use.

The drawback is computational cost, but with 1x1 convolutions you can shrink the number of channels before applying the larger convolutions

43
Q

Why does resnet work?

A

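It works because
a^[l+2] = g(z^[l+2] + a^[l])
= g(w^[l+2] a^[l+1] + b^[l+2] + a^[l])

If you use L2 regularisation, w^[l+2] will tend to shrink, and if w^[l+2] and b^[l+2] are 0, the equation reduces to g(a^[l]) = a^[l] (for ReLU, since a^[l] >= 0). Hence it is easy for a residual block to learn the identity function, so adding the block does not hurt performance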

44
Q

Purpose of 1x1 convolution

A

Purpose 1: shrink the number of channels
- E.g. if you have a 28x28x192 volume and want to shrink it to 28x28x32, you can apply 32 filters of size 1x1x192
Purpose 2: add non-linearity

45
Q

What are resnets?

A

ResNets, also known as residual networks, are built out of residual blocks that allow you to train very deep networks (>100 layers)

ResNet works by adding residual connections to the network, which helps to maintain the information flow throughout the network and prevents the gradients from vanishing. The residual connection is a shortcut that allows the information to bypass one or more layers in the network and reach the output directly.

In theory, the more layers the lower the training error, but in practice, for a plain network, the training error starts to increase beyond a certain depth, which means the network has a harder time learning. With ResNets, the training error keeps decreasing as layers are added, though it eventually flattens and plateaus.

46
Q

CNN why add padding

A
  • For every convolution layer the image shrinks: an n x n image convolved with an f x f filter gives an output of size n-f+1. If you don't want your image to shrink a lot, you can add padding, especially when building very deep networks. With padding p, the output size is n+2p-f+1
  • Also, without padding, a pixel in the corner contributes to the output only once, compared to a more central pixel that is covered many times, so you would be throwing away a lot of information from the edges
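
The size formula is easy to check with a couple of numbers. The sketch assumes stride 1 (as in the formula above) and an odd filter size for the "same" padding helper; both function names are made up for the example.

```python
def conv_output_size(n, f, p=0):
    # Output height/width for an n x n input, f x f filter, padding p, stride 1
    return n + 2 * p - f + 1


def same_padding(f):
    # Padding that keeps the output the same size as the input (odd f only)
    return (f - 1) // 2


print(conv_output_size(6, 3))                      # no padding: shrinks
print(conv_output_size(6, 3, p=same_padding(3)))   # padded: size preserved
```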
47
Q

How to build a tensorflow CNN?

A

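Use a Keras Sequential model:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(n_filters, filter_shape, activation=activation, input_shape=(pixels, pixels, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),  # shape of the pooling window
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(n_neurons, activation=activation),
    tf.keras.layers.Dense(n_outputs, activation=output_activation),
])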

49
Q

Explain the architecture of an image classification with localisation problem

A

Architecture (two outputs)
- Conv net
- Softmax to output the different possible classes of object (e.g. car, pedestrian, motorcycle and background)
- Output bounding box (bx, by, bh, bw), where bx and by are the midpoint of the bounding box

Y = vector (
Pc ie. is there an object,
bx,
by,
bh,
bw,
C1 Ie. is it object C1,
C2,
C3)

If there is no object (Pc = 0), the remaining components are labelled as don't care

Loss function:
- If the actual label has an object (Pc = 1), the loss is the sum of squared differences between the prediction and the actual label over every component (Pc, bx, by, bh, bw, C1, C2, C3)
- If the actual label has no object (Pc = 0), the loss is just the square of (Pc hat - Pc)

50
Q

Explain landmark detection

A

Landmark detection
- Landmark detection is like detecting the eyes or nose on a face
- Label the training dataset with a fixed number of coordinates (landmarks) that outline the feature you are trying to detect
- Change the CNN output layer to output whether there is a face, plus the coordinates of the points you want in the image, like this:
- Y = vector(face?, l1x, l1y, …, l64x, l64y)

51
Q

Explain object detection

A

Object detection
- Train a conv net to detect a car using closely cropped images
- Then, for an image, start by picking a window size, input the window image into the conv net and get a prediction. Slide the window and pass the next window image into the conv net. Do this until the whole image is covered. The stride can be customised
- Change the window size and do it again

Intersection over union
= size of intersection between ground truth box and prediction box / size of union of both ground truth box and prediction box

If IOU>= 0.5 then prediction is correct
More generally, IOU is a measure of the overlap between two bounding boxes
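
A minimal IoU sketch, assuming boxes are given as corner coordinates (x1, y1, x2, y2). That representation and the function name are assumptions; a box stored as midpoint plus height and width would first need converting.

```python
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if no overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
```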

Non max suppression is a way to make sure your algo only detects the object once instead of multiple times

Steps:
- discard all boxes with probability of object <= 0.6
- Pick the box with the largest probability of object as a prediction
- Discard any remaining box with IOU >= 0.5 with the box output in prev step
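
The steps above can be sketched as a loop that repeats the pick-and-discard steps until no boxes remain. The repetition, the corner-coordinate box format (x1, y1, x2, y2), the helper names and the toy boxes are assumptions for illustration; the thresholds default to the 0.6 and 0.5 used above.

```python
def iou(box_a, box_b):
    # Intersection over union for boxes given as (x1, y1, x2, y2)
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    # Step 1: discard low-probability boxes
    candidates = [(s, b) for s, b in zip(scores, boxes) if s > score_threshold]
    candidates.sort(key=lambda sb: sb[0], reverse=True)
    kept = []
    while candidates:
        # Step 2: output the most confident remaining box
        best_score, best_box = candidates.pop(0)
        kept.append((best_score, best_box))
        # Step 3: discard boxes that overlap it too much
        candidates = [(s, b) for s, b in candidates
                      if iou(best_box, b) < iou_threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = non_max_suppression(boxes, scores)
print(kept)
```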

Anchor boxes
- Aim to solve the problem that each grid cell can detect only one object, which fails when a grid cell contains two overlapping objects

Steps
- Encode y with one set of outputs per anchor box

Y = vector (
Pc ie. is there an object,
bx,
by,
bh,
bw,
C1 Ie. is it object C1,
C2,
C3,

Pc,
bx,
by,
bh,
bw,
C1,
C2,
C3)

Where the second set refers to anchor box 2

Limitation: it does not handle two objects in the same grid cell that have the same anchor box shape

Output shape = grid rows x grid cols x number of anchor boxes x 8 params (Pc, bx, by, bh, bw, C1, C2, C3)

52
Q

What architecture is used for image segmentation eg. Medical imaging

A
  • U-Net: shrink the image with a contracting path, then blow it back up in size using transposed convolutions

Contracting path (Encoder containing downsampling steps):
Images are first fed through several convolutional layers which reduce height and width, while growing the number of channels.
The contracting path follows a regular CNN architecture, with convolutional layers, their activations, and pooling layers to downsample the image and extract its features. In detail, it consists of the repeated application of two 3 x 3 same padding convolutions, each followed by a rectified linear unit (ReLU) and a 2 x 2 max pooling operation with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled.
Crop function: This step crops the image from the contracting path and concatenates it to the current image on the expanding path to create a skip connection.

Expanding path (Decoder containing upsampling steps):
The expanding path performs the opposite operation of the contracting path, growing the image back to its original size, while shrinking the channels gradually.
In detail, each step in the expanding path upsamples the feature map, followed by a 2 x 2 convolution (the transposed convolution). This transposed convolution halves the number of feature channels, while growing the height and width of the image.
Next is a concatenation with the correspondingly cropped feature map from the contracting path, and two 3 x 3 convolutions, each followed by a ReLU. You need to perform cropping to handle the loss of border pixels in every convolution.
Final Feature Mapping Block: In the final layer, a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. The channel dimensions from the previous layer correspond to the number of filters used, so when you use 1x1 convolutions, you can transform that dimension by choosing an appropriate number of 1x1 filters. When this idea is applied to the last layer, you can reduce the channel dimensions to have one layer per class.
The U-Net network has 23 convolutional layers in total.
Important Note:
The figures shown in the assignment for the U-Net architecture depict the layer dimensions and filter sizes as per the original paper on U-Net with smaller

53
Q

Neural style transfer

A

Cost function = alpha * J_content(content image, generated image) + beta * J_style(style image, generated image)

Steps
1. Randomly initialise the generated image, i.e. the pixel values are random
2. Use gradient descent to minimise the cost function by updating the pixels of the generated image

Content cost function
- Use a pre-trained conv net
- Let a[l](C) and a[l](G) be the activations of layer l on the content image and the generated image
- If a[l](C) and a[l](G) are similar, both images have similar content
- The similarity is measured by the sum of the squared element-wise differences of the two activations

Style cost function
- Sum of squares of the element-wise differences between the correlation matrices of the style image and the generated image, where the correlation matrix measures the correlation between the channels in the activations of a hidden layer

54
Q

How do you build a language model

A

Tokenize the sentences, build a vocabulary, map each token to a one-hot vector, and add EOS tokens to mark the end of each sentence

56
Q

What is GRU

A

It is a gated recurrent unit, used in RNNs to mitigate the vanishing gradient problem and capture long-range dependencies

58
Q

What is bidirectional RNN

A

A type of RNN that takes info from both earlier and later in the sequence

59
Q

What is the peephole connection in LSTM

A

A connection that allows the gate values to depend not just on the previous activation a<t-1> and the current input x<t> but also on the previous memory cell value c<t-1>

60
Q

What type of RNN is commonly used in NLP

A

Bidirectional RNN with LSTM blocks

61
Q

What is the vanishing gradient problem

A

It is when the error signal in back propagation becomes too small to update the weights of the earlier layers (or earlier time steps) of an RNN

62
Q

LSTM vs GRU

A

LSTM is more powerful and flexible, with three gates, while GRU is simpler, with two gates, and easier to scale to larger networks

64
Q

Advantage of LSTM over GRU

A

LSTM can learn longer-range connections than GRU

65
Q

What is the loss function for a single prediction at a single time step

A

The standard logistic regression loss also called cross entropy loss

66
Q

When is sequence to sequence model used

A

Machine translation
Image captioning

67
Q

Explain sequence to sequence architecture

A

Use encoder network to find encoding of input sequence and decoder network to generate the corresponding sequence

68
Q

Diff between machine translation and language model

A

Machine translation is a conditional language model. It is conditioned on the encoding of the given sentence rather than starting off with a vector of zeros

It doesn’t just pick any sentence but it picks the most likely sentence conditioned on a given sentence

69
Q

Why not greedy search to pick the best sentence

A

The greedy approach picks the single most likely first word, then the most likely second word given the first, and so on.

It doesn’t really work because picking the best next word at each step doesn’t necessarily give the most likely sentence overall

70
Q

Best algo to pick the best sentence

A

Approximate search algo / beam search

Try picking the sentence that maximises the conditional probability (in the context of conditional language models)

Compared to greedy search, which keeps only the single best next word, beam search keeps the B most likely partial sentences at each step, where B is a parameter known as the beam width

How it works

Let’s set our beam width to 3 and grab the top three predicted words at each position in a given sequence. The encoded input sequence is passed to a decoder, where a softmax function is applied over all the words in a fixed vocabulary (defined beforehand, whether we’re working with audio sequences or text translation).

For the second word in the sequence we pass the first three selected words as input into the second position. As before, we apply the same softmax output layer over the vocabulary to find the next 3 words we could use for the second position. While this happens, we use conditional probability to decide on the best combinations of first-position and second-position words. We run these 3 input words against all words in the vocabulary to find the best 3 combinations and pass them to the next step as input again. Words from the first position can be dropped moving forward if another input token gives a higher probability with two different sequences. For instance, if “I will” and “I am” score higher than any combination with “Us”, we can drop the “Us” token and continue with our new top three sequences. We repeat this process until we reach an END token, at which point we have generated 3 different sequences.

We now have 3 different text translations or audio sequence results that we still have to decide between. These output sequences can be different in length and total tokens, which can create nice variation in our results. We simply pick the decoder output with the highest probability at the end.

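Conditional probability: P(y1, y2 | x) = P(y1 | x) * P(y2 | x, y1), i.e. the probability of the first word times the probability of the second word given the first

Large B gives better results but is slower
Small B gives worse results but is faster

Beam search is not guaranteed to find the maximum, but it is much faster than an exact search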

71
Q

What is Length normalisation

A

It is an enhancement to the beam search algorithm to get better results.

1 Maximise the sum of log probabilities instead of the product of probabilities (this avoids numerical underflow)
2 Normalise by the number of words in the output sequence to the power of alpha, where alpha is tunable (alpha = 1 normalises completely by length, alpha = 0 does not normalise at all)
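
The effect of the two steps can be sketched by scoring candidate sentences from their per-word probabilities. The function name, the default alpha of 0.7 and the toy probabilities are assumptions for illustration.

```python
import math


def normalised_log_prob(token_probs, alpha=0.7):
    # Step 1: sum of log probabilities instead of a product of probabilities
    log_prob = sum(math.log(p) for p in token_probs)
    # Step 2: divide by (number of words) ** alpha
    return log_prob / (len(token_probs) ** alpha)


short = [0.5, 0.5]          # 2 words
long = [0.6, 0.6, 0.6, 0.6]  # 4 words, each individually more likely

# Without normalisation the shorter sentence wins simply because it is shorter
print(normalised_log_prob(short, alpha=0.0), normalised_log_prob(long, alpha=0.0))
# With full normalisation the longer sentence scores higher
print(normalised_log_prob(short, alpha=1.0), normalised_log_prob(long, alpha=1.0))
```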

72
Q

How is accuracy of machine translation measured

A

Using the BLEU score

It compares the machine translation against human-generated references.

Modified precision: each word is given credit only up to the maximum number of times it appears in a reference sentence.

E.g. if the MT output is “the the the the the the the”

and the reference is “the cat is on the mat”, the modified precision is 2/7, where 2 is the number of times “the” is credited and 7 is the number of times “the” appears in the MT output

73
Q

Bleu score using bi grams

A

Ref 1: the cat is on the mat
Ref 2: there is a cat on the mat
MT output: the cat the cat on the mat

the cat: count 2, clipped 1
cat the: count 1, clipped 0
cat on: count 1, clipped 1
on the: count 1, clipped 1
the mat: count 1, clipped 1

Modified precision = 4/6 (clipped count / total bigrams in the MT output)

Precision (n-gram) = sum of clipped n-gram counts / total number of n-grams in the MT output

Combined BLEU score = BP * exp(average of log p_n), where p_n is the n-gram precision and BP is the brevity penalty

BP = 1 if the MT output is longer than the reference, otherwise exp(1 - reference length / MT output length)
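
The clipped counts can be reproduced with a short sketch. The helper names ngrams and modified_precision are made up for the example, and it returns the clipped count and the total separately so that the fraction stays visible; it does not compute the brevity penalty or the combined score.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    cand_counts = Counter(ngrams(candidate, n))
    # Maximum number of times each n-gram appears in any single reference
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Credit each candidate n-gram only up to that maximum
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped, total


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
unigram = modified_precision("the the the the the the the".split(), refs[:1], 1)
bigram = modified_precision("the cat the cat on the mat".split(), refs, 2)
print(unigram, bigram)
```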

74
Q

Attention model

A

Why? The precision (measured by the Bleu score) of encoder-decoder networks drops as sentences get longer, and very short sentences are also hard to get right.

Intuition? It uses a bidirectional RNN so that the features at each time step take into account the words before and after the word at that time step.

The context for an output step is the sum of these input features, each multiplied by its corresponding attention weight.

How it works:
To compute the attention weight for each input position, it uses a small neural network that takes as input the previous decoder hidden state and the activation (features) at that input position. The scores are passed through a softmax so that the weights sum to 1.

The activation at an input position combines the forward and backward activations, so it reflects the words before, the words after and the input word itself.

Through gradient descent the small neural network learns how much attention to place on each activation.
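
A minimal sketch of the weighted sum, assuming a toy scoring function in place of the small trained network. `toy_score` (a plain dot product) and all the numbers are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(prev_state, activations, score_fn):
    """Attention weights over input positions and the resulting context vector."""
    scores = [score_fn(prev_state, a) for a in activations]
    weights = softmax(scores)
    dim = len(activations[0])
    context = [sum(w * a[i] for w, a in zip(weights, activations)) for i in range(dim)]
    return weights, context

def toy_score(prev_state, activation):
    # Hypothetical stand-in for the small neural network: a dot product
    return sum(s * a for s, a in zip(prev_state, activation))

activations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per input word
prev_state = [2.0, 0.0]                             # previous decoder hidden state
weights, context = attention_context(prev_state, activations, toy_score)
```

In a real model the scorer is trained jointly with the rest of the network. Here it is fixed, so only the softmax and the weighted sum are shown.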

75
Q

CTC

A

Connectionist temporal classification.

Rule: collapse repeated characters not separated by a blank. If a character is repeated but separated by a blank, both occurrences are kept in the string
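
A short sketch of the collapse rule, assuming the blank is written as an underscore. The function name `ctc_collapse` is made up for illustration.

```python
from itertools import groupby

def ctc_collapse(raw, blank="_"):
    """Merge runs of repeated characters, then drop the blank symbol."""
    merged = (ch for ch, _ in groupby(raw))
    return "".join(ch for ch in merged if ch != blank)

decoded = ctc_collapse("hh_e_ll_lll_oo")
```

The two runs of "l" are separated by a blank, so both survive and the result keeps the double letter.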

76
Q

When to use transfer learning

A

Use it when you want a model to do task B but have little data for it, while you have much more data for a similar task (task A). You pretrain your model on task A and fine-tune it on task B: first train on task A, then, starting from those weights, update the network by retraining on task B data. If the task B dataset is very small, you can update only the weights of the last layer.

77
Q

What is multitask learning and when does it make sense

A

Multitask learning is training a neural network to predict multiple things.

It makes sense if you are training on a set of tasks that could benefit from having shared lower-level features

And the amount of data you have for each task is quite similar

And you can train a big enough network to do well on all tasks

78
Q

What are some encoder only models and in what types of use cases do you use encoder only models

A

How it’s trained:
- Masked language modelling
- random words in the sentence are masked and the training objective is to predict the masked tokens to reconstruct the original sentence
- bidirectional, i.e. uses context from the whole sequence

Use cases
- NER
- sentiment analysis
- word classification

Models
BERT
RoBERTa

79
Q

Decoder only models

A

Training objective:
Causal language modelling: Predict next token based on previous set of tokens

Use case
Text generation

Example
GPT
BLOOM

80
Q

Sequence to sequence models (encoder decoder)

A

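Training objective:
Span corruption: spans of tokens are masked and replaced with a sentinel token, and the model reconstructs them
E.g. The teacher <X> student

Use cases:
Translation
Text summarisation
Question answering

Examples:
T5
BART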

81
Q

LLM eval metrics

A

ROUGE (text summarisation): compares a summary to one or more reference summaries
- the higher the ROUGE-1 score, the better
BLEU score (text translation): compares a translation to human-generated translations

82
Q

Drawbacks of full fine tuning

A

Because full fine-tuning updates every weight and creates a full copy of the LLM for each task, it requires a large amount of memory to store the weights, gradients, optimiser states, forward activations, etc.

83
Q

What is Parameter Efficient Finetuning and motivation

And what are the different PEFT methods

A

As models get larger, full fine-tuning becomes infeasible on consumer hardware. Storing and deploying a separate full model for each downstream task also becomes expensive.

PEFT usually consists of freezing the original model weights and fine-tuning only a small subset of model parameters, or adding new trainable layers. As a result, the new PEFT weights can take up as little as a few MB.

However, there are some trade-offs:
- parameter efficiency
- training speed
- inference costs
- model performance
- memory efficiency

PEFT methods:
1) Selective: select a subset of the initial LLM parameters to fine-tune
2) Reparameterization: reparameterize model weights using a low-rank representation (LoRA)
3) Additive: add trainable layers or parameters to the model
   - adapters: add new trainable layers, typically inside the encoder or decoder components (after the attention or feed-forward layers)
   - soft prompts: keep the model architecture fixed and focus on manipulating the input to achieve better performance, either by 1) adding trainable parameters to the prompt embeddings (prompt tuning) or 2) keeping the input fixed and retraining the embedding weights

84
Q

LoRA (low rank adaptation of large language models)

A

Intuition: LoRA doesn’t change the underlying model, but it changes how the model emphasizes different connections. Think of each low-rank matrix as a filter.

  1. Freeze most of the original LLM weights
  2. Inject 2 rank decomposition matrices alongside the weights of the self-attention layers
  3. Train the weights of the smaller matrices

Steps to update model for inference
1. Matrix multiply the low rank matrices (B x A) to get the weight update for the fine-tuned task
2. Add the result to the original weights (frozen + B x A)
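
A toy sketch of the merge step in plain Python lists, assuming a frozen 3 x 3 weight matrix and rank r = 1. The matrices and the helper names `matmul` and `merge_lora` are made up for illustration.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, B, A):
    """Return the frozen weights plus the low-rank update B x A."""
    delta = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]      # frozen original weights (3 x 3)
B = [[1.0], [2.0], [0.0]]  # trained matrix, 3 x r with r = 1
A = [[0.5, 0.0, 1.0]]      # trained matrix, r x 3

merged = merge_lora(W, B, A)

# Trainable parameter counts for a hypothetical 512 x 512 layer with r = 8
full_params = 512 * 512
lora_params = 8 * (512 + 512)
```

Because the merged matrix has the same shape as the original, inference costs the same as for the base model. Only the small B and A matrices need to be stored per task.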

85
Q

Bleu score

A

Bleu score is a metric to measure the performance of a sequence to sequence model like machine translation. It takes as input the output sequence of the model and a human generated reference.

Unigram precision = number of word matches / number of words in the generation. The problem with unigram precision is that if the output sequence contains a word from the human reference and repeats it many times, the unigram precision will be high even though the output sequence is nonsense.

Modified unigram precision = clipped number of word matches / number of words in the generation, where the matches for each word are clipped to the number of times it appears in the human reference.

However it still doesn’t take into account the order of the sequence. To deal with word ordering problems, we use BLEU score which computes the precision for several different n-grams and then averages the result.

If there are no 4-gram matches then the 4-gram precision is 0.

The Bleu score is the geometric mean of all four n-gram precisions (unigram, bigram, trigram, 4-gram), i.e. (p1 * p2 * p3 * p4)^0.25
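
A tiny sketch of the geometric mean, assuming the four precisions are already computed. The precision values are made up for illustration, and the brevity penalty is left out.

```python
import math

def bleu_from_precisions(precisions):
    """Geometric mean of n-gram precisions; 0 if any precision is 0."""
    if any(p == 0 for p in precisions):
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

score = bleu_from_precisions([0.8, 0.6, 0.4, 0.2])
no_4gram = bleu_from_precisions([0.8, 0.6, 0.4, 0.0])
```

The second call shows the effect of a zero 4-gram precision: the whole geometric mean collapses to 0.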

In practice, people use the sacrebleu score because it takes untokenised text, i.e. a string of words rather than a list of tokens, and applies its own standard tokenisation

86
Q

Rouge score

A

Rouge score tells us how good a machine-generated summary is compared to one or more reference summaries.

Rouge score compares the n-grams of the generated summary with the n-grams of the references.

Rouge 1 recall = number of word matches / number of words in reference

Rouge 1 precision = number of word matches / number of words in machine generated summary

Rouge 1 F1 score = 2 * (precision * recall) / (precision + recall)

Rouge L: instead of using unigrams or bigrams, it uses the length of the longest common subsequence (LCS) between the generated summary and the reference

Recall = LCS(ref, gen) / number of words in reference
Precision = LCS(ref, gen) / number of words in generated summary

Advantage of rouge L over rouge 1 or 2 is that it doesn’t depend on consecutive n gram matches so it tends to capture sentence structure more accurately

Rouge L sum is computed over a whole summary while Rouge L is averaged across individual sentences.
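
A minimal sketch of Rouge 1 and Rouge L on whitespace-split words, assuming non-empty inputs. The function names and the example sentences are made up for illustration; in practice people use a library implementation.

```python
from collections import Counter

def rouge_1(reference, generated):
    """Unigram recall, precision and F1 with clipped word matches."""
    ref, gen = reference.split(), generated.split()
    matches = sum((Counter(ref) & Counter(gen)).values())
    recall = matches / len(ref)
    precision = matches / len(gen)
    f1 = 0.0 if matches == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, generated):
    """Recall and precision based on the LCS length."""
    ref, gen = reference.split(), generated.split()
    lcs = lcs_length(ref, gen)
    return lcs / len(ref), lcs / len(gen)

r1 = rouge_1("it is cold outside", "it is very cold outside")
rl = rouge_l("it is cold outside", "it is very cold outside")
```

In this example the extra word "very" lowers precision to 4/5 but leaves recall at 1, for both Rouge 1 and Rouge L.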

87
Q

Strategies for Inference Time Optimization

A

Model Pruning: Trim non-essential parameters, ensuring only those crucial to performance remain. This can drastically reduce the model’s size without significantly compromising accuracy.

Includes: PEFT

Quantization: Convert the 32-bit floating-point numbers into more memory-efficient formats, such as 16-bit or 8-bit, to reduce memory use and speed up operations, usually with only a small loss in quality.
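
A rough sketch of one simple scheme, symmetric 8-bit quantization with a single scale per list of weights. This is an assumption for illustration: real toolkits use more elaborate schemes, and the names and values here are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case rounding error is about half of one quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most about `scale / 2` per weight.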

Model Distillation: Use larger models to train smaller, more compact versions that can deliver similar performance with a fraction of the resource requirement. The idea is to transfer the knowledge of larger models to smaller ones with simpler architecture.

Optimized Hardware Deployment: Deploy models on specialized hardware like Tensor Processing Units (TPUs) or Field-Programmable Gate Arrays (FPGAs) designed for accelerated model inference.

Batch Inference: The optimization techniques above help reduce inference time but can reduce model accuracy, so the trade-off between inference time and accuracy requires attention. One option is batch inference: a batch prompting approach lets the LLM run inference on several samples at a time instead of one sample at a time, which reduces both token and time costs while retaining downstream performance.

88
Q

How does an LLM overcome its knowledge cutoff, or how do you give an LLM access to custom data

A

One way is to retrain the model on new data, but this quickly becomes expensive.

Another way is to give the LLM access to additional external data at inference time using retrieval augmented generation (RAG). RAG is a framework for giving an LLM access to data not seen in training by
- connecting to external data sources
- connecting to APIs

The external data source can be:
- sql database
- csv files
- web pages
- vector stores

Vector stores hold embeddings, which are vector representations of text (words, phrases or chunks). They help LLMs to:
1. Retrieve relevant information or context during a generation or understanding task. An LLM application can query the vector store for text related to specific words or phrases, which improves its ability to understand and generate human-like text
2. Support tasks such as semantic search, information retrieval and similarity analysis

Vector databases (VDB) are an implementation of a vector store, which is a collection of unstructured text split into chunks, with a vector embedding generated for each chunk. Each vector is also identified by a key, which allows the text generated by RAG to include a citation for the document from which it was retrieved.

89
Q

How does retrieval work in RAG

A
  1. Convert the question into an embedding
  2. Do a cosine similarity search in the vector DB containing the text chunks in embedding form
  3. Grab the top N chunks
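
A toy sketch of the three steps, assuming the embeddings are already computed. The 3-dimensional vectors, keys and texts are made up for illustration; a real system gets embeddings from an embedding model and uses a vector database for the search.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_n(query_vec, store, n=2):
    """Return the n stored chunks most similar to the query embedding."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["embedding"]), reverse=True)
    return ranked[:n]

# Hypothetical vector store: one entry per text chunk, identified by a key
store = [
    {"key": "doc1#0", "text": "Refunds are issued within 14 days.", "embedding": [0.9, 0.1, 0.0]},
    {"key": "doc2#0", "text": "Our office is in Berlin.", "embedding": [0.0, 0.2, 0.9]},
    {"key": "doc1#1", "text": "Refund requests need a receipt.", "embedding": [0.8, 0.3, 0.1]},
]

query_vec = [1.0, 0.2, 0.0]  # hypothetical embedding of "How do refunds work?"
hits = top_n(query_vec, store, n=2)
```

The keys of the returned chunks are what allow a RAG answer to cite its source documents.
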
90
Q

RAG steps

A
  1. Indexing documents
  2. Retrieving documents
  3. Generating using the retrieved context in the context window
91
Q

LLM optimising techniques, when to use what techniques

A

If you want something fast:
- Use prompt engineering and few-shot prompting

If model performance is not good, e.g. there are hallucinations:
- Use Active RAG

If you have time, money and high-quality data:
- Use fine-tuning