Interviews Flashcards
Attention Mechanism
- The attention mechanism allows the model to “pay attention” to certain parts of the data and give them more weight when making predictions.
- Essentially, for a given sentence, it compares each word’s vector with every other vector in the sentence and computes an attention score for each pair.
- The attention mechanism addresses a major limitation of earlier sequence-to-sequence models: their difficulty handling long sequences, because the entire input has to be compressed into a fixed-length context vector.
- The attention mechanism helps preserve the context of every word in a sentence by assigning an attention weight to it relative to all other words.
- This way, even if the sentence is large, the model can preserve the contextual importance of each word.
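To make this concrete, here is a minimal self-attention sketch in NumPy (illustrative only; real models use learned query/key/value projection matrices, which are omitted here):

```python
# A minimal sketch of scaled dot-product self-attention using NumPy.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of query/key/value vectors.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # pairwise attention scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights            # weighted sum of value vectors

# Toy example: a 4-token "sentence" with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention
print(attn.round(2))  # each token's weights over all tokens
```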
LLM - GPT & BERT
1) GPT (Generative Pre-trained Transformer): It’s a generative model trained to predict the next word in a sequence. It’s trained in an unsupervised manner using a massive amount of text and can be fine-tuned later for specific tasks.
2) BERT (Bidirectional Encoder Representations from Transformers): BERT is trained by predicting masked (or hidden) words in a sentence. It looks at the context from both the left and the right (hence, bidirectional). This pre-trained model can then be fine-tuned on a smaller dataset for specific tasks.
While GPT is often used for generative tasks like text generation, BERT shines in tasks that require understanding the context like question answering, sentiment analysis, etc.
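As an illustrative sketch of the two training objectives (this assumes the Hugging Face `transformers` library is installed and downloads the public `gpt2` and `bert-base-uncased` checkpoints):

```python
# Contrasting the two objectives with Hugging Face pipelines (illustrative sketch).
from transformers import pipeline

# GPT-style: autoregressive next-word prediction (left-to-right).
generator = pipeline("text-generation", model="gpt2")
print(generator("The attention mechanism allows the model to", max_new_tokens=10))

# BERT-style: masked-word prediction using context from both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The attention mechanism allows the model to [MASK] on certain words."))
```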
RNNs
- Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed for sequential data processing and prediction.
- Unlike traditional feedforward neural networks, RNNs have connections that form directed cycles.
- This means that information can be recycled in the network, which makes these types of networks very effective for tasks where context or chronological order is important, such as time series prediction, natural language processing, and speech recognition.
- Common variants of RNNs include LSTMs and GRUs
- RNNs process sequences step by step (sequentially) and have difficulty capturing long-range dependencies due to the vanishing gradient problem.
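A minimal PyTorch sketch of the recurrence (toy dimensions; the explicit loop shows why processing is inherently sequential):

```python
# A minimal sketch of an RNN unrolled over time in PyTorch.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(2, 15, 10)      # batch of 2 sequences, 15 time steps, 10 features
out, h_n = rnn(x)               # out: hidden state at every step, h_n: final hidden state
print(out.shape, h_n.shape)     # torch.Size([2, 15, 20]) torch.Size([1, 2, 20])

# The same recurrence written explicitly: h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)
cell = nn.RNNCell(input_size=10, hidden_size=20)
h = torch.zeros(2, 20)
for t in range(x.size(1)):      # sequential processing: step t depends on step t-1
    h = cell(x[:, t, :], h)
```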
Transformers
- Transformers are a neural network architecture introduced in 2017 (in the paper “Attention Is All You Need”)
- They have become a foundational architecture across many fields, particularly in NLP
1) Architecture
- The original architecture consists of an encoder and a decoder, each made up of multiple stacked layers
- And unlike RNNs or LSTMs, Transformers allow for parallel processing of sequences and can handle long-range dependencies in data.
2) Key Components
1. Multi-Head Attention
- Allows the model to focus on different words for a given input word and can attend to all positions in the input sequence simultaneously.
2. Positional Encoding
- Since Transformers lack recurrence (unlike RNNs), positional encodings are added to the input embeddings to give the model information about the position of each word in the sentence (see the sketch below).
3) Advantages
- Parallelization: Allows faster computation as each position is processed simultaneously.
- Long-Range Dependencies: Capable of handling long sequences and maintaining long-range dependencies between inputs.
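A short PyTorch sketch of sinusoidal positional encodings added to token embeddings and fed through a stack of encoder layers (toy dimensions; assumes a reasonably recent PyTorch version):

```python
# Sinusoidal positional encodings + a small Transformer encoder (illustrative sketch).
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

seq_len, d_model = 16, 64
x = torch.randn(2, seq_len, d_model)                        # batch of 2 embedding sequences
x = x + sinusoidal_positional_encoding(seq_len, d_model)    # inject position information

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
out = encoder(x)      # all positions are processed in parallel
print(out.shape)      # torch.Size([2, 16, 64])
```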
Why not Transformers for Tabular Data
1) Overparameterization: Transformers have many parameters, which might be excessive for simple tabular data, leading to overfitting.
2) Inefficiency: The self-attention mechanism in transformers computes attention scores for every pair of data points, which is often overkill for tabular data where columns (features) have fixed semantics.
3) Lack of Inherent Sequential Nature: Unlike text or time series, tabular data doesn’t always have a sequential nature, so transformers might not leverage their full power.
Dropout Layers
- Dropout is a regularization technique where randomly selected neurons are ignored during training. They are “dropped out” randomly.
- This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass and any weight updates are not applied to the neuron on the backward pass.
- Purpose: It helps prevent overfitting by ensuring that the network does not rely too heavily on any single neuron, which forces it to learn more robust, redundant representations.
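A minimal PyTorch sketch showing that dropout is only active in training mode:

```python
# Dropout: active in training mode, disabled in eval mode (illustrative sketch).
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)     # each element is zeroed with probability 0.5
x = torch.ones(1, 8)

drop.train()
print(drop(x))               # roughly half the values zeroed, the rest scaled by 1/(1-p)

drop.eval()
print(drop(x))               # identity at inference time
```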
Pooling Layers
- Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of parameters to learn and the amount of computation performed in the network.
- The pooling layer summarises the features present in a region of the feature map generated by a convolution layer.
- So, further operations are performed on summarised features instead of precisely positioned features generated by the convolution layer.
- This makes the model more robust to variations in the position of the features in the input image.
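A minimal PyTorch sketch of 2×2 max pooling halving the spatial dimensions of a feature map (toy sizes):

```python
# Max pooling reduces spatial dimensions of a feature map (illustrative sketch).
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
feature_map = torch.randn(1, 16, 32, 32)   # (batch, channels, height, width)
pooled = pool(feature_map)
print(pooled.shape)                        # torch.Size([1, 16, 16, 16])
```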
Convolutional Layers
- Convolutional layers in a CNN systematically apply learned filters to input images in order to create feature maps that summarize the presence of those features in the input.
- Convolutional layers prove very effective, and stacking convolutional layers in deep models allows layers close to the input to learn low-level features (e.g. lines) and layers deeper in the model to learn high-order or more abstract features, like shapes or specific objects.
- A limitation of the feature map output of convolutional layers is that they record the precise position of features in the input.
- This means that small movements in the position of the feature in the input image will result in a different feature map.
- This can happen with re-cropping, rotation, shifting, and other minor changes to the input image.
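A minimal PyTorch sketch of a convolutional layer turning an image into a stack of feature maps (toy sizes):

```python
# A convolutional layer produces one feature map per learned filter (illustrative sketch).
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
image = torch.randn(1, 3, 64, 64)   # one RGB image, 64x64
feature_maps = conv(image)
print(feature_maps.shape)           # torch.Size([1, 16, 64, 64])
```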
Fully Connected Layers
- A fully connected (dense) layer is one in which each neuron applies a linear transformation to the entire input vector through a weight matrix.
- As a result, every input of the input vector influences every output of the output vector.
- In CNNs, FC layers are often used at the end to perform classification based on high-level features.
- In standard feedforward neural networks, they can be used throughout the architecture
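A minimal PyTorch sketch of a fully connected layer mapping, say, 512 CNN features to 10 class scores (dimensions are arbitrary):

```python
# A fully connected layer: every input influences every output, y = Wx + b (illustrative sketch).
import torch
import torch.nn as nn

fc = nn.Linear(in_features=512, out_features=10)   # e.g. 512 features -> 10 class scores
features = torch.randn(4, 512)                     # batch of 4 flattened feature vectors
logits = fc(features)
print(logits.shape)                                # torch.Size([4, 10])
```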
Activation Functions
- As a rule of thumb, you can begin with using the ReLU activation function and then move over to other activation functions if ReLU doesn’t provide optimum results.
Guidelines:
- ReLU activation function should only be used in the hidden layers.
- Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they make the model more susceptible to problems during training (due to vanishing gradients).
The output activation function should be chosen based on the prediction problem:
- Regression - Linear Activation Function
- Binary Classification—Sigmoid/Logistic Activation Function
- Multiclass Classification—Softmax
- Multilabel Classification—Sigmoid
Activation Function based on type of NN:
- Convolutional Neural Network (CNN): ReLU activation function.
- Recurrent Neural Network: Tanh and/or Sigmoid activation function.
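An illustrative PyTorch sketch of matching the output activation to the task (layer sizes are arbitrary; in practice multiclass models usually output raw logits and let CrossEntropyLoss apply softmax internally):

```python
# Matching the output activation to the prediction problem (illustrative sketch).
import torch.nn as nn

regression_head = nn.Linear(64, 1)                                     # linear output, no activation
binary_head     = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())        # binary classification
multiclass_head = nn.Sequential(nn.Linear(64, 5), nn.Softmax(dim=-1))  # multiclass classification
multilabel_head = nn.Sequential(nn.Linear(64, 5), nn.Sigmoid())        # multilabel classification

hidden = nn.Sequential(nn.Linear(128, 64), nn.ReLU())                  # ReLU in hidden layers
```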
ResNet vs MobileNet
1) ResNet (Residual Networks):
- ResNet introduced a way to train very deep networks by using “skip connections” or “shortcuts” that allow the gradient to be directly backpropagated to earlier layers.
- This architecture alleviates the vanishing gradient problem, which is prevalent in deep networks.
- The fundamental building block of ResNet is the residual block.
- Instead of learning the desired underlying mapping H(x) directly, the block learns the residual F(x) = H(x) − x and outputs F(x) + x.
- Deeper ResNet models can be quite large in terms of parameters and computational cost.
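A minimal PyTorch sketch of a basic residual block (identity shortcut; assumes input and output channel counts match):

```python
# A ResNet-style residual block with an identity skip connection (illustrative sketch).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)   # skip connection: learn F(x), output F(x) + x

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```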
2) MobileNet:
- As the name suggests, MobileNet is designed to be used in mobile applications, where the amount of computational resources is constrained.
- MobileNet uses depthwise separable convolutions, which factor a standard convolution into a depthwise convolution followed by a 1×1 convolution called a pointwise convolution.
- This reduces the computational cost and model size.
- MobileNet offers a tunable trade-off between performance and efficiency: by adjusting hyperparameters such as the input resolution or the width multiplier, you can create a smaller or larger model tailored to your needs.
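A minimal PyTorch sketch of a depthwise separable convolution and its parameter count versus a standard convolution (toy channel counts):

```python
# Depthwise separable convolution: depthwise conv (groups=in_channels) + 1x1 pointwise conv.
import torch.nn as nn

in_ch, out_ch = 32, 64
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)  # one filter per channel
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)                          # mixes channels
standard  = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)               # for comparison

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(depthwise) + params(pointwise))   # ~2.4k parameters
print(params(standard))                        # ~18.5k parameters
```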
===================
Similarities & Differences:
Purpose:
- ResNet was primarily designed to achieve high accuracy by going deeper
- MobileNet was designed with efficiency in mind for mobile and edge devices without compromising too much on accuracy.
Architecture:
- ResNet uses identity skip connections that bypass small blocks of (typically two or three) convolutional layers
- MobileNet employs depthwise separable convolutions to reduce computation.
Model Size & Speed:
- MobileNet is generally smaller and quicker, making it more suitable for real-time applications on mobile devices.
- ResNet, especially its deeper variants, is heavier (has more parameters) and more computationally intensive.
Accuracy:
- In general, deeper ResNet architectures might achieve higher accuracy on most tasks
- MobileNet performs decently given its size and is often preferred when computational resources are at a premium.
Why CNN for Image
CNNs are designed to automatically and adaptively learn spatial hierarchies of features from images. They have three primary advantages for image processing:
1) Local Connectivity: Neurons in a layer are connected only to a small region of the layer before it, mimicking the receptive fields of the human visual system.
2) Weight Sharing: A feature detector (like an edge detector) that’s useful in one part of the image is probably useful across the entire image. This reduces the number of parameters.
3) Pooling Layers: These layers reduce spatial dimensions, leading to a hierarchy of features and invariance to small translations.
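An illustrative PyTorch sketch (toy sizes) of how weight sharing keeps a convolutional layer tiny compared with a dense layer producing an output of the same size:

```python
# Weight sharing: a conv layer has far fewer parameters than an equivalent dense layer.
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
fc   = nn.Linear(3 * 64 * 64, 16 * 64 * 64)   # dense layer producing the same output size

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(conv))   # 448 parameters (16 * 3 * 3 * 3 + 16)
print(params(fc))     # ~201 million parameters
```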
Vanishing and Exploding Gradients
1) Vanishing Gradient Problem:
- As the gradient is backpropagated through the layers of a deep network (especially in RNNs), it can become extremely small.
- This means the weight updates during training become negligible, making the network effectively stop learning or learn incredibly slowly.
==> How it Occurs:
- When using activation functions like the sigmoid or tanh, their derivatives can be small (close to 0 for values far from 0).
- In a deep network, as gradients are calculated using the chain rule, these small derivatives can be multiplied together multiple times.
- This causes the gradient to shrink exponentially as it’s propagated backward through layers.
==> Consequences:
- The earlier layers of the network (those closer to the input) learn very slowly or almost not at all, which can lead to sub-optimal or poor performance.
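A quick illustrative sketch (PyTorch) of the chain-rule effect: the sigmoid derivative is at most 0.25, so the product of these factors shrinks exponentially with depth:

```python
# Sigmoid derivatives are <= 0.25; multiplying them across layers shrinks the gradient.
import torch

x = torch.linspace(-2, 2, 5)
s = torch.sigmoid(x)
d = s * (1 - s)                             # sigmoid derivative, at most 0.25
print(d)                                    # values roughly between 0.1 and 0.25
for depth in (5, 10, 20):
    print(depth, d.max().item() ** depth)   # upper bound on the gradient factor through `depth` layers
```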
2) Exploding Gradient Problem:
- Opposite to the vanishing gradient problem, the gradient can become extremely large as it’s backpropagated, which can result in very large weight updates during training.
==> How it Occurs:
- This issue is often seen in recurrent neural networks (RNNs) where the accumulation of gradients across time steps can grow without bound.
- If the weights in a network are initialized with large values or the derivatives of the activation functions are significantly greater than 1, then the gradients can explode as they are backpropagated.
==> Consequences:
- Leads to oscillation in weight updates: the weights can drastically swing between large positive and negative values.
- Can cause numerical instability, with weights becoming NaN or Infinity, effectively breaking the training process.
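An illustrative PyTorch sketch of gradients blowing up through a deep stack of large-weight layers, plus gradient clipping as the usual mitigation (toy setup; clipping is not discussed above but is the standard fix):

```python
# With large weights, gradients grow multiplicatively during backpropagation;
# clip_grad_norm_ caps the global gradient norm before the optimizer step.
import torch
import torch.nn as nn

layers = nn.Sequential(*[nn.Linear(10, 10, bias=False) for _ in range(20)])
for layer in layers:
    nn.init.normal_(layer.weight, std=1.0)   # deliberately large initialization

x = torch.randn(1, 10)
loss = layers(x).pow(2).sum()
loss.backward()
print(layers[0].weight.grad.norm())   # an enormous gradient norm

torch.nn.utils.clip_grad_norm_(layers.parameters(), max_norm=1.0)
print(layers[0].weight.grad.norm())   # small again: global gradient norm capped at 1.0
```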