Interviews Flashcards
Attention Mechanism
- The attention mechanism allows the model to “pay attention” to certain parts of the data and to give them more weight when making predictions.
- Essentially, in a given sentence, it compares each word’s vector with every other vector in the sentence and calculates an attention score.
- The attention mechanism solves a big problem that many deep learning models have: the inability to memorise long sequences because of a fixed-length context vector.
- The attention mechanism helps preserve the context of every word in a sentence by assigning an attention weight to it relative to all other words.
- This way, even if the sentence is large, the model can preserve the contextual importance of each word.
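A minimal sketch of the idea, assuming scaled dot-product self-attention (the standard Transformer formulation) and NumPy; the token vectors here are random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices of query/key/value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compare every token with every other token
    weights = softmax(scores, axis=-1)  # attention weights: each row sums to 1
    return weights @ V, weights         # weighted sum of the value vectors

# toy "sentence" of 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(attn.round(2))  # row i shows how much token i attends to every other token
```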
LLM - GPT & BERT
1) GPT (Generative Pre-trained Transformer): It’s a generative model trained to predict the next word in a sequence. It’s trained in an unsupervised manner using a massive amount of text and can be fine-tuned later for specific tasks.
2) BERT (Bidirectional Encoder Representations from Transformers): BERT is trained by predicting masked (or hidden) words in a sentence. It looks at the context from both the left and the right (hence, bidirectional). This pre-trained model can then be fine-tuned on a smaller dataset for specific tasks.
While GPT is often used for generative tasks like text generation, BERT shines in tasks that require understanding the context like question answering, sentiment analysis, etc.
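A hedged illustration of the two training objectives, assuming the Hugging Face transformers library and the public bert-base-uncased and gpt2 checkpoints (stand-ins chosen for this sketch, not named by the card):

```python
# pip install transformers torch
from transformers import pipeline

# BERT-style objective: fill in a masked token using context from both sides
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# GPT-style objective: generate text by repeatedly predicting the next token
generate = pipeline("text-generation", model="gpt2")
print(generate("The attention mechanism allows a model to", max_new_tokens=20))
```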
RNNs
- Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed for sequential data processing and prediction.
- Unlike traditional feedforward neural networks, RNNs have connections that form directed cycles.
- This means that information can be recycled in the network, which makes these types of networks very effective for tasks where context or chronological order is important, such as time series prediction, natural language processing, and speech recognition.
- Common variants of RNNs include LSTMs and GRUs
- RNNs process data sequentially and have difficulty capturing long-range dependencies in sequences due to the vanishing gradient problem
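A small usage sketch, assuming PyTorch, showing the recurrent interface (a hidden state carried across time steps) and the common LSTM/GRU variants; all sizes are arbitrary:

```python
import torch
import torch.nn as nn

# a single-layer RNN: 10-dim inputs, 20-dim hidden state
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(1, 5, 10)   # batch of 1 sequence, 5 time steps, 10 features per step
output, h_n = rnn(x)        # output: hidden state at every step; h_n: final hidden state

print(output.shape)         # torch.Size([1, 5, 20])
print(h_n.shape)            # torch.Size([1, 1, 20])

# drop-in variants that better preserve long-range information
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
gru = nn.GRU(input_size=10, hidden_size=20, batch_first=True)
```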
Transformers
- Transformers are a type of neural network (NN) architecture introduced in 2017
- They have become a staple, foundational architecture in many areas and fields, particularly NLP
1) Architecture
- Its architecture consists of an encoder and a decoder each having multiple layers
- And unlike RNNs or LSTMs, Transformers allow for parallel processing of sequences and can handle long-range dependencies in data.
2) Key Components
1. Multi-Head Attention
- Allows the model to focus on different words for a given input word and can attend to all positions in the input sequence simultaneously.
2. Positional Encoding
- Since Transformers lack recurrence (unlike RNNs), positional encodings are added to the input to give the model information about the position of a word in a sentence.
3) Advantages
- Parallelization: Allows faster computation as each position is processed simultaneously.
- Long-Range Dependencies: Capable of handling long sequences and maintaining long-range dependencies between inputs.
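An illustrative PyTorch sketch (dimensions are arbitrary): sinusoidal positional encodings are added to the embeddings, then one encoder layer (multi-head self-attention plus a feed-forward block) processes all positions in parallel:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(seq_len, d_model):
    """Classic sin/cos positional encoding from the original Transformer."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

d_model, seq_len = 64, 10
tokens = torch.randn(1, seq_len, d_model)                            # embedded input sequence
tokens = tokens + sinusoidal_positional_encoding(seq_len, d_model)   # inject position info

# one encoder layer = multi-head self-attention + feed-forward network
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
print(layer(tokens).shape)   # torch.Size([1, 10, 64])
```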
Why not Transformers for Tabular Data
1) Overparameterization: Transformers have many parameters, which might be excessive for simple tabular data, leading to overfitting.
2) Inefficiency: The self-attention mechanism in transformers computes attention scores for every pair of data points, which is often overkill for tabular data where columns (features) have fixed semantics.
3) Lack of Inherent Sequential Nature: Unlike text or time series, tabular data doesn’t always have a sequential nature, so transformers might not leverage their full power.
Dropout Layers
- Dropout is a regularization technique where randomly selected neurons are ignored during training. They are “dropped out” randomly.
- This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass and any weight updates are not applied to the neuron on the backward pass.
- Purpose: It helps prevent overfitting by ensuring that the network does not rely too heavily on any specific neuron (reducing co-adaptation, so errors tied to individual neurons don’t propagate).
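A tiny PyTorch illustration of the train/eval behaviour (which positions get zeroed is random):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each neuron is zeroed with probability 0.5 during training
x = torch.ones(1, 8)

drop.train()   # training mode: random neurons dropped, survivors scaled by 1/(1-p)
print(drop(x)) # e.g. tensor([[2., 0., 2., 2., 0., 0., 2., 2.]])

drop.eval()    # evaluation mode: dropout is disabled, input passes through unchanged
print(drop(x)) # tensor([[1., 1., 1., 1., 1., 1., 1., 1.]])
```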
Pooling Layers
- Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of parameters to learn and the amount of computation performed in the network.
- The pooling layer summarises the features present in a region of the feature map generated by a convolution layer.
- So, further operations are performed on summarised features instead of precisely positioned features generated by the convolution layer.
- This makes the model more robust to variations in the position of the features in the input image.
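A toy NumPy example of 2×2 max pooling with stride 2, which keeps only the strongest activation in each region and halves each spatial dimension:

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [7, 9, 8, 3],
                        [1, 2, 4, 6]])

# 2x2 max pooling with stride 2: take the max of each non-overlapping 2x2 block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 5]
#  [9 8]]
```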
Convolutional Layers
- Convolutional layers in a CNN systematically apply learned filters to input images in order to create feature maps that summarize the presence of those features in the input.
- Convolutional layers prove very effective, and stacking convolutional layers in deep models allows layers close to the input to learn low-level features (e.g. lines) and layers deeper in the model to learn high-order or more abstract features, like shapes or specific objects.
- A limitation of the feature map output of convolutional layers is that they record the precise position of features in the input.
- This means that small movements in the position of the feature in the input image will result in a different feature map.
- This can happen with re-cropping, rotation, shifting, and other minor changes to the input image.
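A small PyTorch sketch with a hand-fixed vertical-edge filter (in a real CNN the filter weights are learned): the feature map responds where the edge is, and shifting the input shifts the response, which is exactly the position sensitivity noted above:

```python
import torch
import torch.nn.functional as F

# a fixed 3x3 vertical-edge filter, shape (out_channels, in_channels, H, W)
kernel = torch.tensor([[[[-1., 0., 1.],
                         [-1., 0., 1.],
                         [-1., 0., 1.]]]])

# 8x8 image with a vertical edge: dark left half, bright right half
img = torch.zeros(1, 1, 8, 8)
img[:, :, :, 4:] = 1.0
print(F.conv2d(img, kernel)[0, 0])   # the feature map lights up at the edge

# shifting the edge by one pixel shifts the response in the feature map too
img_shifted = torch.zeros(1, 1, 8, 8)
img_shifted[:, :, :, 5:] = 1.0
print(F.conv2d(img_shifted, kernel)[0, 0])
```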
Fully Connected Layers
- A fully connected layer is a layer in which each neuron applies a linear transformation to the entire input vector through a weight matrix.
- As a result, every input of the input vector influences every output of the output vector.
- In CNNs, FC layers are often used at the end to perform classification based on high-level features.
- In standard feedforward NNs, they can be used throughout
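A minimal PyTorch sketch of the usual layout: convolution and pooling extract features, and an FC layer acts as the classification head (all sizes are arbitrary):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Conv layer extracts features, pooling shrinks them, an FC layer classifies."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 28x28 -> 28x28, 16 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
        )
        self.classifier = nn.Linear(16 * 14 * 14, num_classes)   # fully connected head

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(start_dim=1)   # flatten the feature maps into one vector per sample
        return self.classifier(x)

model = TinyCNN()
logits = model(torch.randn(4, 1, 28, 28))   # batch of 4 single-channel 28x28 images
print(logits.shape)                         # torch.Size([4, 10])
```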
Activation Functions
- As a rule of thumb, you can begin with using the ReLU activation function and then move over to other activation functions if ReLU doesn’t provide optimum results.
Guidelines:
- ReLU activation function should only be used in the hidden layers.
- Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they make the model more susceptible to problems during training (due to vanishing gradients).
The activation function should be decided based on the prediction problem:
- Regression: Linear Activation Function
- Binary Classification: Sigmoid/Logistic Activation Function
- Multiclass Classification: Softmax
- Multilabel Classification: Sigmoid
Activation Function based on type of NN:
- Convolutional Neural Network (CNN): ReLU activation function.
- Recurrent Neural Network: Tanh and/or Sigmoid activation function.
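A quick PyTorch illustration of these functions and of typical output heads for each prediction problem (layer sizes are placeholders):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

print(torch.relu(x))             # hidden layers: zeroes out negatives, keeps positives
print(torch.sigmoid(x))          # binary / multilabel outputs: squashes to (0, 1)
print(torch.tanh(x))             # common inside RNN cells: squashes to (-1, 1)
print(torch.softmax(x, dim=0))   # multiclass output: probabilities that sum to 1

# typical output layers per prediction problem
regression_head = nn.Linear(16, 1)                               # linear (no activation)
binary_head     = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())  # one probability
multiclass_head = nn.Sequential(nn.Linear(16, 5), nn.Softmax(dim=1))  # one class out of 5
multilabel_head = nn.Sequential(nn.Linear(16, 5), nn.Sigmoid())  # one probability per label
```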
Resnet VS Mobilenet
1) ResNet (Residual Networks):
- ResNet introduced a way to train very deep networks by using “skip connections” or “shortcuts” that allow the gradient to be directly backpropagated to earlier layers.
- This architecture alleviates the vanishing gradient problem, which is prevalent in deep networks.
- The fundamental building block of ResNet is the residual block.
- Instead of trying to learn an underlying function, the block learns the residual (or difference) between the input and the desired output.
- Deeper ResNet models can be quite large in terms of parameters and computational cost.
2) MobileNet:
- As the name suggests, MobileNet is designed to be used in mobile applications, where the amount of computational resources is constrained.
- MobileNet uses depthwise separable convolutions, which split a standard convolution into a depthwise convolution followed by a 1×1 convolution called a pointwise convolution (sketched below).
- This reduces the computational cost and model size.
- MobileNet allows for tunable performance and efficiency: by adjusting parameters like the input resolution or the width multiplier, you can create a smaller or larger model tailored specifically to your needs.
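A rough PyTorch sketch of the parameter saving from a depthwise separable convolution compared to a standard convolution (channel counts are arbitrary):

```python
import torch.nn as nn

in_ch, out_ch = 32, 64

# standard convolution: one 3x3 kernel spanning all input channels per output channel
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# depthwise separable: per-channel 3x3 (depthwise) + 1x1 channel mixing (pointwise)
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))                      # 18,496 parameters
print(count(depthwise) + count(pointwise))  # 2,432 parameters, roughly 7-8x fewer
```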
===================
Similarities & Differences:
Purpose:
- ResNet was primarily designed to achieve high accuracy by going deeper
- MobileNet was designed with efficiency in mind for mobile and edge devices without compromising too much on accuracy.
Architecture:
- ResNet uses skip connections around every two layers
- MobileNet employs depthwise separable convolutions to reduce computation.
Model Size & Speed:
- MobileNet is generally smaller and quicker, making it more suitable for real-time applications on mobile devices.
- ResNet, especially its deeper variants, is heavier (has more parameters) and more computationally intensive.
Accuracy:
- In general, deeper ResNet architectures might achieve higher accuracy on most tasks
- MobileNet performs decently given its size and is often preferred when computational resources are at a premium.
Why CNN for Image
CNNs are designed to automatically and adaptively learn spatial hierarchies of features from images. They have three primary advantages for image processing:
1) Local Connectivity: Neurons in a layer are connected only to a small region of the layer before it, mimicking the receptive fields of the human visual system.
2) Weight Sharing: A feature detector (like an edge detector) that’s useful in one part of the image is probably useful across the entire image. This reduces the number of parameters.
3) Pooling Layers: These layers reduce spatial dimensions, leading to a hierarchy of features and invariance to small translations.
Vanishing and Exploding Gradients
1) Vanishing Gradient Problem:
- As the gradient is backpropagated through the layers of a deep network (especially in RNNs), it can become extremely small.
- This means the weight updates during training become negligible, making the network effectively stop learning or learn incredibly slowly.
==> How it Occurs:
- When using activation functions like the sigmoid or tanh, their derivatives can be small (close to 0 for values far from 0).
- In a deep network, as gradients are calculated using the chain rule, these small derivatives can be multiplied together multiple times.
- This causes the gradient to shrink exponentially as it’s propagated backward through layers.
==> Consequences:
- The earlier layers of the network (those closer to the input) learn very slowly or almost not at all, which can lead to sub-optimal or poor performance.
2) Exploding Gradient Problem:
- Opposite to the vanishing gradient problem, the gradient can become extremely large as it’s backpropagated, which can result in very large weight updates during training.
==> How it Occurs:
- This issue is often seen in recurrent neural networks (RNNs) where the accumulation of gradients across time steps can grow without bound.
- If the weights in a network are initialized with large values or the derivatives of the activation functions are significantly greater than 1, then the gradients can explode as they are backpropagated.
==> Consequences:
- Leads to oscillation in weight updates: the weights can drastically swing between large positive and negative values.
- Can cause numerical instability, with weights becoming NaN or Infinity, effectively breaking the training process.
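A common remedy for exploding gradients is gradient clipping; a minimal PyTorch sketch (the LSTM, loss, and data are placeholders for illustration):

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(30, 1, 10)    # one long sequence: 30 time steps
output, _ = model(x)
loss = output.pow(2).mean()   # dummy loss, just to produce gradients
loss.backward()

# rescale the overall gradient norm before the weight update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```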
Precision VS Recall VS F1 VS Accuracy
1) Precision (P):
Precision answers the question: Of all the anomalies detected by the model, how many were actual anomalies?
Precision = TP / (TP + FP)
2) Recall (R) or Sensitivity:
Recall answers the question: Of all the actual anomalies in the data, how many were detected by the model?
Recall = TP / (TP + FN)
3) F1-Score:
The F1-Score is the harmonic mean of precision and recall and provides a single metric that balances the two. It’s especially useful when the class distribution is imbalanced.
F1-Score = 2 x (Precision x Recall) / (Precision + Recall)
4) Accuracy:
Accuracy answers the question: Of all the predictions made by the model, how many were correct?
Accuracy = (TP + TN) / Total Predictions
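A worked example with made-up confusion-matrix counts, showing how accuracy can look strong on imbalanced data while recall exposes the missed positives:

```python
# counts from a hypothetical confusion matrix
TP, FP, FN, TN = 40, 10, 20, 930

precision = TP / (TP + FP)                                 # 0.80
recall    = TP / (TP + FN)                                 # ~0.67
f1        = 2 * precision * recall / (precision + recall)  # ~0.73
accuracy  = (TP + TN) / (TP + FP + FN + TN)                # 0.97

print(precision, recall, f1, accuracy)
```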
Backpropagation
- Backpropagation is the algorithm used (together with gradient descent) for minimizing the error in a neural network. It computes how the weights should be adjusted, working in reverse order - from the output layer to the input layer. The process involves:
1) Forward pass:
- In the forward pass, the input is fed through the network to produce a predicted output; when this output is incorrect, we get an output error.
- This error is the difference between the actual and predicted outputs.
- A cost function measures this error.
- The cost function indicates how accurately the model performs and tells us how far-off our predicted output values are from our actual values.
- Because the cost function quantifies the error, we aim to minimize the cost function.
- What we want is to reduce the output error. Since the weights affect the error, we will need to readjust the weights. We have to adjust the weights such that we have a combination of weights that minimizes the cost function.
2) Backpropagation:
- Backpropagation allows us to readjust our weights to reduce output error.
- Essentially, backpropagation calculates the gradient of the cost function with respect to each weight.
- Moving along the negative of this gradient is what adjusts the weights.
- It gives us an idea of how we need to change the weights so that we can reduce the cost function.
By propagating backwards, we know how much “error” each node or layer is responsible for, and we can then adjust (“optimise”) the corresponding weights.
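A minimal NumPy sketch of one forward pass and one backward pass through a tiny two-layer network (weights and data are made up); the resulting gradients dW1 and dW2 are exactly what gradient descent uses next:

```python
import numpy as np

# one training example for a tiny network: input -> hidden (sigmoid) -> output (linear)
x, y = np.array([[1.0, 2.0]]), np.array([[1.0]])
W1 = np.array([[0.1, 0.3], [0.2, 0.4]])
W2 = np.array([[0.5], [0.6]])

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# forward pass: compute the prediction and the cost
h = sigmoid(x @ W1)                    # hidden activations
y_hat = h @ W2                         # predicted output
cost = 0.5 * ((y_hat - y) ** 2).sum()  # squared-error cost

# backward pass: apply the chain rule from the cost back to each weight matrix
d_yhat = y_hat - y                   # dCost/dy_hat
dW2 = h.T @ d_yhat                   # dCost/dW2
d_h = (d_yhat @ W2.T) * h * (1 - h)  # dCost/d(hidden pre-activation)
dW1 = x.T @ d_h                      # dCost/dW1

print(dW1, dW2)   # gradients: how much each weight is "responsible" for the error
```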
Gradient Descent
- The weights are adjusted using a process called gradient descent.
- Gradient descent is an optimization algorithm that is used to find the weights that minimize the cost function.
- Minimizing the cost function means getting to the minimum point of the cost function.
- So, gradient descent aims to find a weight corresponding to the cost function’s minimum point.
- To find this weight, we must navigate down the cost function until we find its minimum point.
- To know in which direction to navigate, gradient descent uses backpropagation.
- More specifically, it uses the gradients calculated through backpropagation.
- These gradients are used for determining the direction to navigate to find the minimum point.
- Specifically, we move along the negative gradient, because a negative gradient indicates a decreasing slope.
- A decreasing slope means that moving downward will lead us to the minimum point.
The step size of each update is determined by the learning rate.
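A bare-bones sketch of gradient descent on a one-dimensional cost function C(w) = (w - 3)^2, where the learning rate sets the step size:

```python
cost_grad = lambda w: 2 * (w - 3)   # derivative dC/dw of C(w) = (w - 3)^2

w = 0.0              # initial weight
learning_rate = 0.1  # step size

for step in range(50):
    w -= learning_rate * cost_grad(w)   # move along the negative gradient

print(round(w, 4))   # converges towards the minimum at w = 3
```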
Linear Regression
- Linear regression is a type of statistical analysis used to model the relationship between a dependent variable and one or more independent variables.
- It assumes a linear relationship between the independent and dependent variables and aims to find the best-fitting line that describes the relationship.
- This line is then determined by minimizing the sum of the squared differences between the predicted values and actual values
- Essentially, in linear regression MSE is often used as the cost function, which amounts to minimising the residual sum of squares (where a residual is the difference between the predicted and actual value).
- The cost function can be minimised analytically (the OLS closed-form solution) or iteratively with gradient descent.
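An illustrative sketch with scikit-learn on synthetic data (the library and the data are assumptions for this sketch, not part of the card):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: y ≈ 2x + 1 plus a little noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)   # fits by minimizing the residual sum of squares
print(model.coef_, model.intercept_)   # slope ≈ 2, intercept ≈ 1

mse = np.mean((model.predict(X) - y) ** 2)   # the MSE cost on the training data
print(mse)
```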
Assumptions of Linear Regression
- Linearity (Linear Relationship)
- Should be Linear Relationship between Independent and Dependent Variable
- Normality
- For any fixed value of X, Y is normally distributed (the dependent variable is normally distributed around the regression line)
- Homoscedasticity (Constant Variance)
- Variance of the residuals should be constant across all levels of the independent variables.
- This means spread of residuals should be similar across entire range of independent variables.
- This is important because normally distributed errors with constant variance allow for valid hypothesis testing, confidence interval estimation, and accurate prediction of the response variable
- Independence of Residuals (No Autocorrelation)
- No correlation between the residuals (differences between observed and predicted values) of different data points
- Allows for valid hypothesis testing to be conducted
- No Multicollinearity (or low collinearity) and No Outliers
- Variables should not be highly correlated —> High Multicollinearity can lead to unstable and unreliable coefficient estimates
- Linear regression assumes that the errors (residuals) are normally distributed and have constant variance. Outliers can violate these assumptions and introduce heteroscedasticity or non-normality into the residuals. Violations of these assumptions can result in biased coefficient estimates, invalid hypothesis tests, and unreliable predictions.
OLS Regression
- Method for estimating the parameters of a Linear Regression model.
- It essentially tries to find the values of the linear regression model’s parameters that minimise the sum of the squared residuals
Assumptions:
1. Errors are normally distributed w/ 0 mean and constant variance
2. No multicollinearity among independent variables
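A small NumPy sketch of the OLS closed-form solution via the normal equations, beta = (X^T X)^(-1) X^T y, on made-up data:

```python
import numpy as np

# design matrix with an intercept column, and the response vector
X = np.array([[1, 1.0], [1, 2.0], [1, 3.0], [1, 4.0]])
y = np.array([3.1, 4.9, 7.2, 9.1])

# normal equations: the beta that minimises the sum of squared residuals
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)          # [intercept, slope]
print(y - X @ beta)  # the residuals whose squared sum OLS minimises
```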
Lasso and Ridge Regression
They are just simple regularized versions of linear regression.
Lasso Regression:
- L1 Regularization
- Adds the “absolute value of magnitude” of the coefficient as a penalty term to the loss function
Ridge Regression:
- L2 Regularization
- Adds the “squared magnitude” of the coefficient as a penalty term to the loss function
L1 Regularization:
- Can shrink some coefficients exactly to 0
- Can be used for dimension reduction and feature selection
L2 Regularization:
- Shrinks all coefficients towards 0 but does not set them exactly to 0
- Useful when we have collinear features
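A scikit-learn sketch on synthetic data with only two informative features, showing L1 driving coefficients to exactly 0 while L2 only shrinks them (the alpha penalties are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=100)  # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: many coefficients become exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: all coefficients shrunk, none exactly 0

print(lasso.coef_.round(2))   # sparse: mostly zeros
print(ridge.coef_.round(2))   # small but nonzero everywhere
```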
L1 & L2 Regularization
L1 Regularization:
- Adds the “absolute value of magnitude” of the coefficient as a penalty term to the loss function
- Can shrink some coefficients exactly to 0
- Can be used for dimension reduction and feature selection
L2 Regularization:
- Adds the “squared magnitude” of the coefficient as a penalty term to the loss function
- Shrinks all coefficients towards 0 but does not set them exactly to 0
- Useful when we have collinear features
Logistic Regression
Supervised machine learning algorithm mainly used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class.
It essentially takes the output of a linear regression function as an input and applies a sigmoid function to estimate the probability of a given class. Hence it outputs a probability value between 0 and 1.
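A small scikit-learn sketch on made-up one-feature data, also recomputing the probability by hand as the sigmoid of the linear combination:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy binary problem: class 1 becomes likely as x grows
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.2]]))   # [P(class 0), P(class 1)], each between 0 and 1

# the same class-1 probability by hand: sigmoid of the linear combination
z = clf.intercept_ + clf.coef_ @ np.array([2.2])
print(1 / (1 + np.exp(-z)))
```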
ASSUMPTIONS of Logistic Regression
1. Independent Observations:
- Each observation is independent of the others, meaning there is no correlation between observations (e.g., no repeated measurements of the same subject)
2. Binary Dependent variables:
- Assumes the dependent variable is binary (for more than 2 classes, the softmax function / multinomial logistic regression is used)
3. Linear relationship between the independent variables and the log-odds of the dependent variable
4. No outliers
5. Large Sample Size