Interviews Flashcards

1
Q

Attention Mechanism

A
  • The attention mechanism allows the model to “pay attention” to certain parts of the data and give them more weight when making predictions.
  • Essentially, in a given sentence, it compares each word’s vector with every other vector in the sentence and calculates an attention score.
  • Attention solves a big problem of many earlier sequence models: the inability to handle long sequences because of a fixed-length context vector.
  • The attention mechanism helps preserve the context of every word in a sentence by assigning it an attention weight relative to all other words.
  • This way, even if the sentence is long, the model preserves the contextual importance of each word.
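A minimal NumPy sketch of scaled dot-product attention (the computation behind these attention scores); shapes and variable names here are illustrative, not from the original card:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays; returns the weighted values and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # pairwise attention scores between positions
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

# toy self-attention over 4 "words" with 8-dimensional embeddings
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(attn.round(2))                          # each word's attention over all other words
```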
2
Q

LLM - GPT & BERT

A

1) GPT (Generative Pre-trained Transformer): It’s a generative model trained to predict the next word in a sequence. It’s trained in an unsupervised manner using a massive amount of text and can be fine-tuned later for specific tasks.

2) BERT (Bidirectional Encoder Representations from Transformers): BERT is trained by predicting masked (or hidden) words in a sentence. It looks at the context from both the left and the right (hence, bidirectional). This pre-trained model can then be fine-tuned on a smaller dataset for specific tasks.

While GPT is often used for generative tasks like text generation, BERT shines in tasks that require understanding context, such as question answering and sentiment analysis.
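A small illustration of the two usage patterns, assuming the Hugging Face transformers library is installed (model names and prompts are only examples, and running this downloads the pretrained models):

```python
from transformers import pipeline

# BERT-style: predict a masked word using context from both the left and the right
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("Paris is the [MASK] of France.")[0]["token_str"])

# GPT-style: generate text by repeatedly predicting the next word
generator = pipeline("text-generation", model="gpt2")
print(generator("The attention mechanism allows", max_new_tokens=20)[0]["generated_text"])
```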

3
Q

RNNs

A
  • Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed for sequential data processing and prediction.
  • Unlike traditional feedforward neural networks, RNNs have connections that form directed cycles.
  • This means that information can be recycled in the network, which makes these types of networks very effective for tasks where context or chronological order is important, such as time series prediction, natural language processing, and speech recognition.
  • Common variants of RNNs include LSTMs and GRUs.
  • Because RNNs process sequences step by step, they struggle to capture long-range dependencies due to the vanishing gradient problem.
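A minimal sketch of a single vanilla RNN step in NumPy, just to show the recurrence (weight names and sizes are illustrative):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: the new hidden state mixes the current input with the previous state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
d_in, d_hid = 5, 8
W_xh, W_hh, b_h = rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_hid, d_hid)), np.zeros(d_hid)

h = np.zeros(d_hid)                      # initial hidden state
for x_t in rng.normal(size=(10, d_in)):  # a toy sequence of 10 time steps
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.round(2))                        # final hidden state summarising the sequence
```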
4
Q

Transformers

A
  • Transformers are a type of neural network architecture introduced in 2017 (“Attention Is All You Need”)
  • They have become a foundational architecture in many areas and fields, particularly NLP

1) Architecture
- Its architecture consists of an encoder and a decoder each having multiple layers
- And unlike RNNs or LSTMs, Transformers allow for parallel processing of sequences and can handle long-range dependencies in data.

2) Key Components
1. Multi-Head Attention
- Allows the model to focus on different words for a given input word and can attend to all positions in the input sequence simultaneously.

2. Positional Encoding
- Since Transformers lack recurrence (unlike RNNs), positional encodings are added to the input to give the model information about the position of a word in a sentence.

3) Advantages
- Parallelization: Allows faster computation as each position is processed simultaneously.
- Long-Range Dependencies: Capable of handling long sequences and maintaining long-range dependencies between inputs.
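A short sketch of the sinusoidal positional encoding from the original Transformer paper, in NumPy (the dimensions are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                          # added element-wise to the token embeddings

print(positional_encoding(seq_len=4, d_model=8).round(2))
```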

5
Q

Why no Transformers for Tabular Data

A

1) Overparameterization: Transformers have many parameters, which might be excessive for simple tabular data, leading to overfitting.

2) Inefficiency: The self-attention mechanism in transformers computes attention scores for every pair of data points, which is often overkill for tabular data where columns (features) have fixed semantics.

3) Lack of Inherent Sequential Nature: Unlike text or time series, tabular data doesn’t always have a sequential nature, so transformers might not leverage their full power.

6
Q

Dropout Layers

A
  • Dropout is a regularization technique where randomly selected neurons are ignored during training. They are “dropped out” randomly.
  • This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass and any weight updates are not applied to the neuron on the backward pass.
  • Purpose: It helps prevent overfitting by ensuring that the network does not rely too heavily on any specific neuron, forcing it to learn more robust, redundant representations.
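A minimal sketch of (inverted) dropout during training, in NumPy, to make the idea concrete:

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    """Randomly zero activations with probability p; scale the rest so the expected value is unchanged."""
    if not training or p == 0.0:
        return x                                   # at inference time, dropout is a no-op
    mask = np.random.rand(*x.shape) >= p           # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)

activations = np.ones((2, 6))
print(dropout(activations, p=0.5))                 # roughly half the units are zeroed each call
```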
7
Q

Pooling Layers

A
  • Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of parameters to learn and the amount of computation performed in the network.
  • The pooling layer summarises the features present in a region of the feature map generated by a convolution layer.
  • So, further operations are performed on summarised features instead of precisely positioned features generated by the convolution layer.
  • This makes the model more robust to variations in the position of the features in the input image.
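For example, 2x2 max pooling with stride 2 keeps only the largest value in each 2x2 block; a small NumPy sketch:

```python
import numpy as np

def max_pool_2x2(fmap):
    """Downsample a 2D feature map by taking the max over non-overlapping 2x2 blocks."""
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]])
print(max_pool_2x2(fmap))   # [[4 2]
                            #  [2 8]]
```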
8
Q

Convolutional Layers

A
  • Convolutional layers in a CNN systematically apply learned filters to input images in order to create feature maps that summarize the presence of those features in the input.
  • Convolutional layers prove very effective, and stacking convolutional layers in deep models allows layers close to the input to learn low-level features (e.g. lines) and layers deeper in the model to learn high-order or more abstract features, like shapes or specific objects.
  • A limitation of the feature map output of convolutional layers is that they record the precise position of features in the input.
  • This means that small movements in the position of the feature in the input image will result in a different feature map.
  • This can happen with re-cropping, rotation, shifting, and other minor changes to the input image.
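A bare-bones sketch of what a single convolutional filter does (technically cross-correlation, as in most deep learning libraries), using NumPy; the image and filter are made up:

```python
import numpy as np

def conv2d_single(image, kernel):
    """Slide one small filter over a 2D image and record its response at each position (valid padding)."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out                                   # the feature map for this filter

image = np.random.default_rng(0).normal(size=(6, 6))
vertical_edge = np.array([[1, 0, -1]] * 3)       # a hand-crafted vertical-edge detector
print(conv2d_single(image, vertical_edge).shape) # (4, 4) feature map
```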
9
Q

Fully Connected Layers

A
  • A fully connected (dense) layer is one in which each neuron applies a linear transformation to the entire input vector through a weight matrix, usually followed by a non-linearity.
  • As a result, every element of the input vector influences every element of the output vector.
  • In CNNs, FC layers are often used at the end to perform classification based on the high-level features extracted by earlier layers.
  • In standard feedforward NNs, they can be used throughout.
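In code, a fully connected layer is just a matrix multiply plus a bias; a tiny NumPy sketch with made-up sizes:

```python
import numpy as np

def dense(x, W, b):
    """Every input element contributes to every output element via the weight matrix W."""
    return x @ W + b

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 128))          # e.g. flattened high-level CNN features
W, b = rng.normal(size=(128, 10)), np.zeros(10)
logits = dense(x, W, b)                # 10 class scores for classification
print(logits.shape)                    # (1, 10)
```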
10
Q

Activation Functions

A
  • As a rule of thumb, you can begin with using the ReLU activation function and then move over to other activation functions if ReLU doesn’t provide optimum results.

Guidelines:

  • ReLU activation function should only be used in the hidden layers.
  • Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they make the model more susceptible to problems during training (due to vanishing gradients).

The activation function should be chosen based on the prediction problem:
- Regression - Linear Activation Function
- Binary Classification—Sigmoid/Logistic Activation Function
- Multiclass Classification—Softmax
- Multilabel Classification—Sigmoid

Activation Function based on type of NN:
- Convolutional Neural Network (CNN): ReLU activation function.
- Recurrent Neural Network: Tanh and/or Sigmoid activation function.
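Definitions of the activation functions mentioned above, as a quick NumPy reference:

```python
import numpy as np

def relu(x):    return np.maximum(0, x)
def sigmoid(x): return 1 / (1 + np.exp(-x))
def tanh(x):    return np.tanh(x)
def linear(x):  return x

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()                 # outputs a probability distribution over classes

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), sigmoid(z).round(2), softmax(z).round(2))
```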

11
Q

Activation Functions Examples

A
  • As a rule of thumb, you can begin with using the ReLU activation function and then move over to other activation functions if ReLU doesn’t provide optimum results.

Guidelines:

  • ReLU activation function should only be used in the hidden layers.
  • Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they make the model more susceptible to problems during training (due to vanishing gradients).

The activation function should be chosen based on the prediction problem:
- Regression - Linear Activation Function
- Binary Classification—Sigmoid/Logistic Activation Function
- Multiclass Classification—Softmax
- Multilabel Classification—Sigmoid

Activation Function based on type of NN:
- Convolutional Neural Network (CNN): ReLU activation function.
- Recurrent Neural Network: Tanh and/or Sigmoid activation function.

12
Q

How do you pick the right Activation Function

A
  • As a rule of thumb, you can begin with using the ReLU activation function and then move over to other activation functions if ReLU doesn’t provide optimum results.

Guidelines:

  • ReLU activation function should only be used in the hidden layers.
  • Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they make the model more susceptible to problems during training (due to vanishing gradients).

The activation function should be chosen based on the prediction problem:
- Regression - Linear Activation Function
- Binary Classification—Sigmoid/Logistic Activation Function
- Multiclass Classification—Softmax
- Multilabel Classification—Sigmoid

Activation Function based on type of NN:
- Convolutional Neural Network (CNN): ReLU activation function.
- Recurrent Neural Network: Tanh and/or Sigmoid activation function.

13
Q

ResNet VS MobileNet

A

1) ResNet (Residual Networks):

  • ResNet introduced a way to train very deep networks by using “skip connections” or “shortcuts” that allow the gradient to be directly backpropagated to earlier layers.
  • This architecture alleviates the vanishing gradient problem, which is prevalent in deep networks.
  • The fundamental building block of ResNet is the residual block.
  • Instead of trying to learn an underlying function, the block learns the residual (or difference) between the input and the desired output.
  • Deeper ResNet models can be quite large in terms of parameters and computational cost.

2) MobileNet:

  • As the name suggests, MobileNet is designed to be used in mobile applications, where the amount of computational resources is constrained.
  • MobileNet uses depthwise separable convolutions, which divides a standard convolution into a depthwise convolution and a 1×1 convolution called pointwise convolution.
  • This reduces the computational cost and model size.
  • MobileNet allows performance and efficiency to be tuned: by adjusting parameters like the input resolution or the width multiplier, you can create a smaller or larger model tailored specifically to your needs.

===================

Similarities & Differences:

Purpose:
- ResNet was primarily designed to achieve high accuracy by going deeper
- MobileNet was designed with efficiency in mind for mobile and edge devices without compromising too much on accuracy.

Architecture:
- ResNet uses skip connections around every two layers
- MobileNet employs depthwise separable convolutions to reduce computation.

Model Size & Speed:
- MobileNet is generally smaller and quicker, making it more suitable for real-time applications on mobile devices.
- ResNet, especially its deeper variants, is heavier (has more parameters) and more computationally intensive.

Accuracy:
- In general, deeper ResNet architectures might achieve higher accuracy on most tasks
- MobileNet performs decently given its size and is often preferred when computational resources are at a premium.
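A hedged PyTorch sketch contrasting the two building blocks, assuming PyTorch is installed (the layer sizes are arbitrary and not from any specific ResNet/MobileNet variant):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """ResNet-style block: the layers learn a residual F(x), and the skip connection adds x back."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)        # skip connection: the gradient can flow straight through

def depthwise_separable(in_ch, out_ch):
    """MobileNet-style block: per-channel (depthwise) conv followed by a 1x1 (pointwise) conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),  # depthwise
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),                          # pointwise
    )

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape, depthwise_separable(64, 128)(x).shape)
```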

14
Q

Why CNN for Image

A

CNNs are designed to automatically and adaptively learn spatial hierarchies of features from images. They have three primary advantages for image processing:

1) Local Connectivity: Neurons in a layer are connected only to a small region of the layer before it, mimicking the receptive fields of the human visual system.

2) Weight Sharing: A feature detector (like an edge detector) that’s useful in one part of the image is probably useful across the entire image. This reduces the number of parameters.

3) Pooling Layers: These layers reduce spatial dimensions, leading to a hierarchy of features and invariance to small translations.

15
Q

Vanishing and Exploding Gradients

A

1) Vanishing Gradient Problem:
- As the gradient is backpropagated through the layers of a deep network (especially in RNNs), it can become extremely small.
- This means the weight updates during training become negligible, making the network effectively stop learning or learn incredibly slowly.

==> How it Occurs:
- When using activation functions like the sigmoid or tanh, their derivatives can be small (close to 0 for values far from 0).
- In a deep network, as gradients are calculated using the chain rule, these small derivatives can be multiplied together multiple times.
- This causes the gradient to shrink exponentially as it’s propagated backward through layers.

==> Consequences:

  • The earlier layers of the network (those closer to the input) learn very slowly or almost not at all, which can lead to sub-optimal or poor performance.

2) Exploding Gradient Problem:

  • Opposite to the vanishing gradient problem, the gradient can become extremely large as it’s backpropagated, which can result in very large weight updates during training.

==> How it Occurs:

  • This issue is often seen in recurrent neural networks (RNNs) where the accumulation of gradients across time steps can grow without bound.
  • If the weights in a network are initialized with large values or the derivatives of the activation functions are significantly greater than 1, then the gradients can explode as they are backpropagated.

==> Consequences:

  • Leads to oscillation in weight updates: the weights can drastically swing between large positive and negative values.
  • Can cause numerical instability, with weights becoming NaN or Infinity, effectively breaking the training process.
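A quick numeric illustration of why repeated multiplication of small derivatives makes gradients vanish (the sigmoid's derivative is at most 0.25):

```python
sigmoid_grad_max = 0.25   # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) peaks at 0.25
for depth in (5, 10, 20):
    print(depth, sigmoid_grad_max ** depth)   # upper bound on the product of activation derivatives
# 5  0.0009765625
# 10 9.5367431640625e-07
# 20 9.094947017729282e-10
```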
16
Q

Precision VS Recall VS F1 VS Accuracy

A

1) Precision (P):

Precision answers the question: Of all the anomalies detected by the model, how many were actual anomalies?
Precision = TP / (TP + FP)

2) Recall (R) or Sensitivity:

Recall answers the question: Of all the actual anomalies in the data, how many were detected by the model?
Recall = TP / (TP + FN)

3) F1-Score:

The F1-Score is the harmonic mean of precision and recall and provides a single metric that balances the two. It’s especially useful when the class distribution is imbalanced.
F1-Score = 2 x (Precision x Recall) / (Precision + Recall)

4) Accuracy:

Accuracy answers the question: Of all the predictions made by the model, how many were correct?
Accuracy = (TP + TN) / Total Predictions
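These four formulas as small Python helpers, applied to an illustrative (made-up) confusion-matrix count:

```python
def precision(tp, fp):        return tp / (tp + fp)
def recall(tp, fn):           return tp / (tp + fn)
def f1(p, r):                 return 2 * p * r / (p + r)
def accuracy(tp, tn, total):  return (tp + tn) / total

# hypothetical anomaly-detection counts: 40 TP, 10 FP, 20 FN, 930 TN
p, r = precision(40, 10), recall(40, 20)
print(round(p, 2), round(r, 2), round(f1(p, r), 2), accuracy(40, 930, 1000))
# 0.8 0.67 0.73 0.97  <- high accuracy despite the imbalance, which is why F1 matters here
```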

16
Q

Backpropagation

A
  • Backpropagation is the algorithm used for computing how much each weight contributes to the error in a neural network, so that the weights can be adjusted to minimize that error. It works backwards through the network - from the output layer to the input layer. The process involves:

1) Forward pass:
- The network first produces an output (the forward pass); when this output is incorrect, we get an output error.
- This error is the difference between the actual and predicted outputs.
- A cost function measures this error.
- The cost function indicates how accurately the model performs and tells us how far-off our predicted output values are from our actual values.
- Because the cost function quantifies the error, we aim to minimize the cost function.

  • What we want is to reduce the output error. Since the weights affect the error, we will need to readjust the weights. We have to adjust the weights such that we have a combination of weights that minimizes the cost function.

2) Backpropagation:
- Backpropagation allows us to readjust our weights to reduce output error.

  • Essentially, backpropagation aims to calculate the negative gradient of the cost function.
  • This negative gradient is what helps in adjusting of the weights.
  • It gives us an idea of how we need to change the weights so that we can reduce the cost function.

By propagating the error backwards, we know how much “error” each node or layer is responsible for, and the weights can then be adjusted accordingly.
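A compact NumPy sketch of a forward pass plus backpropagation for a one-hidden-layer network with MSE loss (the sizes, data, and learning rate are arbitrary, just to show the chain rule in code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                 # 4 samples, 3 features
y = rng.normal(size=(4, 1))                 # regression targets
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)
lr = 0.1

for step in range(200):
    # forward pass: compute the prediction and the cost (MSE)
    h = 1 / (1 + np.exp(-(x @ W1 + b1)))    # hidden layer with sigmoid activation
    y_hat = h @ W2 + b2
    cost = np.mean((y_hat - y) ** 2)

    # backward pass: chain rule, from the output layer back towards the input layer
    d_yhat = 2 * (y_hat - y) / len(y)       # dCost/dy_hat
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(0)
    d_h = d_yhat @ W2.T
    d_z1 = d_h * h * (1 - h)                # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    dW1, db1 = x.T @ d_z1, d_z1.sum(0)

    # gradient descent step (see the next card)
    W1, b1, W2, b2 = W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2

print(round(cost, 4))                       # the cost should have decreased substantially
```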

17
Q

Gradient Descent

A
  • The weights are adjusted using a process called gradient descent.
  • Gradient descent is an optimization algorithm that is used to find the weights that minimize the cost function.
  • Minimizing the cost function means getting to the minimum point of the cost function.
  • So, gradient descent aims to find a weight corresponding to the cost function’s minimum point.
  • To find this weight, we must navigate down the cost function until we find its minimum point.
  • To know in which direction to navigate, gradient descent uses backpropagation.
  • More specifically, it uses the gradients calculated through backpropagation.
  • These gradients are used for determining the direction to navigate to find the minimum point.
  • Specifically, we step in the direction of the negative gradient, because the negative gradient points towards decreasing cost.
  • Moving step by step in that direction leads us to the minimum point.

The step size of each update is determined by the learning rate.
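A minimal sketch of gradient descent on a simple quadratic cost, showing the role of the learning rate (the numbers are illustrative):

```python
def cost(w):      return (w - 3.0) ** 2        # toy cost function with its minimum at w = 3
def gradient(w):  return 2.0 * (w - 3.0)       # its derivative

w, learning_rate = 0.0, 0.1
for step in range(50):
    w = w - learning_rate * gradient(w)        # step in the direction of the negative gradient
print(round(w, 4), round(cost(w), 6))          # w is now very close to 3, the minimum point
```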

18
Q

Linear Regression

A
  • Linear regression is a type of statistical analysis used to model the relationship between a dependent variable and one or more independent variables.
  • It assumes a linear relationship between the independent and dependent variables and aims to find the best-fitting line that describes that relationship.
  • This line is determined by minimizing the sum of the squared differences between the predicted values and actual values.
  • Essentially, in linear regression MSE is often the cost function used to minimise the residual sum of squares (where a residual is the difference between the predicted and actual value).
  • The cost function can be minimised with gradient descent, or solved in closed form (as in OLS).
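A small NumPy example fitting the best-fit line by least squares (the data is synthetic, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=50)   # true slope 2.5, intercept 1.0, plus noise

X = np.column_stack([x, np.ones_like(x)])            # design matrix with an intercept column
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)
print(round(slope, 2), round(intercept, 2))          # should be close to 2.5 and 1.0
```
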
19
Q

Assumptions of Linear Regression

A
  1. Linearity (Linear Relationship)
    • Should be Linear Relationship between Independent and Dependent Variable
  2. Normality
    • For any fixed value of X, Y is normally distributed (the dependent variable is normally distributed)
  3. Homoscedasticity (Constant Variance)
    • Variance of the residuals should be constant across all levels of the independent variables.
    • This means spread of residuals should be similar across entire range of independent variables.
    • This is important because it ensures normally distributed errors + constant variances —> which allows for valid hypothesis testing, confidence interval estimation and accurate prediction of response variable
  4. Independence of Residuals (No Autocorrelation)
    • No Correlation between the residuals (Diff betw Observed and Predicted Values) of different data points
    • Allows for valid hypothesis testing to be conducted
  5. No Multicollinearity (or low collinearity) and No Outliers
    • Variables should not be highly correlated —> High Multicollinearity can lead to unstable and unreliable coefficient estimates
    • Linear regression assumes that the errors (residuals) are normally distributed and have constant variance. Outliers can violate these assumptions and introduce heteroscedasticity or non-normality into the residuals. Violations of these assumptions can result in biased coefficient estimates, invalid hypothesis tests, and unreliable predictions.
20
Q

OLS Regression

A
  • Method for estimating the parameters of a Linear Regression model.
  • It essentially tries to find the values of the linear regression model’s parameters that minimise the sum of the squared residuals

Assumptions:
1. Errors are normally distributed w/ 0 mean and constant variance
2. No multicollinearity among independent variables
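Under those assumptions, the OLS estimate has a closed-form solution (the normal equation); a quick NumPy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=(100, 2)), np.ones(100)])  # two features plus an intercept column
true_beta = np.array([1.5, -2.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

# normal equation: beta_hat = (X^T X)^(-1) X^T y  (solve() is more stable than an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat.round(2))                                        # close to [1.5, -2.0, 0.5]
```
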

21
Q

Lasso and Ridge Regression

A

They are just simple regularized versions of linear regression.

Lasso Regression:
- L1 Regularization
- Adds the “absolute value of magnitude” of the coefficient as a penalty term to the loss function

Ridge Regression:
- L2 Regularization
- Adds the “squared magnitude” of the coefficient as a penalty term to the loss function

L1 Regularization:
- Can shrink some coefficients exactly to 0
- Can be used for dimension reduction and feature selection

L2 Regularization:
- Shrinks all coefficients towards 0, but rarely exactly to 0
- Useful when we have collinear features
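A brief sklearn illustration of the two penalties (assumes scikit-learn is installed; the alpha values and data are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)  # only 2 features matter

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: irrelevant coefficients are driven to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: all coefficients shrink towards 0
print(lasso.coef_.round(2))          # sparse, e.g. zeros on the noise features
print(ridge.coef_.round(2))          # small but typically non-zero everywhere
```
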

22
Q

L1 & L2 Regularization

A

L1 Regularization:
- Adds the “absolute value of magnitude” of the coefficient as a penalty term to the loss function
- Can shrink some coefficients exactly to 0
- Can be used for dimension reduction and feature selection

L2 Regularization:
- Adds the “squared magnitude” of the coefficient as a penalty term to the loss function
- Shrinks all coefficients towards 0, but rarely exactly to 0
- Useful when we have collinear features

23
Q

Logistic Regression

A

Supervised machine learning algorithm mainly used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class.

It essentially takes the output of a linear regression function as input and passes it through a sigmoid function to estimate the probability of a given class. Hence it outputs a probability value between 0 and 1.

ASSUMPTIONS of Logistic Regression
1. Independent Observations:
- Each observation is independent of the others (no correlation between observations)
2. Binary Dependent Variable:
- Assumes the dependent variable is binary (for more than 2 classes, the softmax function is used)
3. Linear relationship between the independent variables and the log-odds of the outcome
4. No outliers
5. Large Sample Size
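The core computation, sketched in NumPy (the weights here are made up purely for illustration):

```python
import numpy as np

def predict_proba(x, w, b):
    """A linear combination passed through the sigmoid gives a probability between 0 and 1."""
    z = x @ w + b
    return 1 / (1 + np.exp(-z))

x = np.array([2.0, -1.0, 0.5])          # one instance with 3 features
w, b = np.array([0.8, -0.4, 1.2]), -0.3
p = predict_proba(x, w, b)
print(round(p, 3), int(p >= 0.5))       # probability of the positive class and the predicted label
```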

24
Q

GBDTs

A
  • Gradient boosting works by ensembling weak learners to improve the performance of the model as a whole, and these weak learners are usually decision trees.
  • Maybe it’s good to talk about ensembling before I proceed:
  • Essentially, ensembling is just the act of combining a number of different models into 1 and the 2 most popular ensemble learning methods are bagging and boosting.
    a. Bagging:
  • Training a bunch of models in a parallel way and each model learns from a random subset of data
    b. Boosting
  • Training a bunch of models sequentially and each model learns from mistakes of the previous model
  • So obviously GBDTs use Boosting. So why boosting?
  • Boosting works on the principle of improving mistakes of the previous learner through the next learner.
  • In boosting, weak learners are used, which perform only slightly better than random chance.
  • Boosting focuses on sequentially adding up these weak learners and filtering out the observations that a learner gets correct at every step.
  • Basically the stress is on developing new weak learners to handle the remaining difficult observations at each step.
  • We often use Decision Trees as the weak learner. A Decision Tree is an ML model that builds upon iteratively asking questions to partition data and reach a solution
  • So the boosting process looks something like this:

  1. Build an initial model with the data,
  2. Run predictions on the whole data set,
  3. Calculate the error using the predictions and the actual values,
  4. Assign more weight to the incorrect predictions,
  5. Create another model that attempts to fix errors from the last model,
  6. Run predictions on the entire dataset with the new model,
  7. Create several models with each model aiming at correcting the errors generated by the previous one,
  8. Obtain the final model by weighting the mean of all the models.
  • So in GBDTs, we often combine multiple Decision Trees (weak learners) to come up with 1 strong learner. All the trees are connected in series, and each tree tries to minimise the error of the previous tree; due to this sequential connection, boosting algorithms are usually slow to train but also highly accurate.
  • The weak learners are fit in such a way that each new learner fits the residuals of the previous step, so the model improves. The final model aggregates the result of each step, and thus a strong learner is achieved.
  • A loss function is used to compute the residuals: for instance, mean squared error (MSE) can be used for a regression task and logarithmic loss (log loss) for classification tasks.
  • It is worth noting that existing trees in the model do not change when a new tree is added; the added decision tree fits the residuals of the current model. (A minimal sketch of this loop follows below.)
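A minimal sketch of the residual-fitting loop for regression, using shallow sklearn decision trees as the weak learners (assumes scikit-learn; the data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate, n_rounds = 0.1, 100
prediction = np.full_like(y, y.mean())        # step 1: the initial model is just the mean
trees = []
for _ in range(n_rounds):
    residuals = y - prediction                # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # weak learner fits the residuals
    prediction += learning_rate * tree.predict(X)                # existing trees stay unchanged
    trees.append(tree)

print(round(np.mean((y - prediction) ** 2), 4))  # training MSE after boosting
```
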
25
Q

Supervised VS Unsupervised Learning

A

1) Definition:

Supervised Learning:
- Supervised learning is a type of machine learning paradigm where the model is trained on labeled data. The data is provided with the answer key, and the algorithm iteratively makes predictions and is corrected by the provided labels whenever it’s wrong.

Unsupervised Learning:
- Unsupervised learning involves training the model on data that is neither classified nor labeled. The model works without guidance and groups unsorted information according to similarities, patterns, and differences without any labeled responses to guide the learning process.

2) Data:

Supervised Learning:
- Requires labeled data for training. Each example in the training dataset is paired with an output label.

Unsupervised Learning:
- Works with unlabeled data. It tries to learn the underlying structure from the input data directly.

3) Goal:

Supervised Learning:
- The goal is often prediction or classification. It aims to make predictions or infer mappings based on the input-output pairs.

Unsupervised Learning:
- The goal is to find structure in the data, like clustering or association. It tries to achieve a transformation that is subject to certain criteria, such as dimensionality reduction.

4) Feedback:

Supervised Learning:
- The model receives explicit feedback in terms of labels or correct answers during training.

Unsupervised Learning:
- There’s no feedback, and the algorithm tries to identify patterns directly from the input data.

26
Q

Handle Multicolinearity

A

1) What is Multicollinearity

  • The situation where 2 or more predictor variables in a regression model are highly correlated, such that one can be linearly predicted from the others with substantial accuracy. When multicollinearity is present:

a. It can inflate the variance of the coefficient estimates, leading to less reliable interpretations.
b. It can make the model’s estimates sensitive to minor changes in the model.

2) To handle multicollinearity:

  • Principal Component Analysis (PCA): PCA can be used to transform the original variables into a new set of uncorrelated variables.
  • Removing Variables: In some cases, based on domain knowledge and correlation analysis, I considered dropping one variable from a pair of highly correlated variables.
  • Regularization Techniques: Techniques like Ridge and Lasso regression can help in handling multicollinearity. Ridge regression adds a penalty to the coefficients, and Lasso can lead to feature selection.
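For the PCA option, a short sklearn sketch (assumes scikit-learn; the data here is synthetic, with one feature nearly a copy of another):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 * 0.95 + rng.normal(scale=0.1, size=200)        # x2 is almost a linear copy of x1
X = np.column_stack([x1, x2, rng.normal(size=200)])

components = PCA(n_components=2).fit_transform(X)       # new features are uncorrelated by construction
print(np.corrcoef(components, rowvar=False).round(3))   # off-diagonal correlations ~ 0
```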
27
Q

Overfitting Solution

A

To ensure models were not overfitting:

1) Cross-Validation: I utilized k-fold cross-validation, where the training set was split into ‘k’ smaller sets. For each of the k “folds”, a model was trained on k-1 of those chunks and validated on the remaining chunk.

2) Regularization: Implemented L1 (Lasso) and L2 (Ridge) regularization techniques that add penalty terms to the loss function, constraining the magnitude of the coefficients.

3) Early Stopping: When training deep learning models, I monitored the validation loss, and if it stopped decreasing (or started increasing), training was halted.

4) Pruning: In tree-based algorithms, pruning helps reduce the size of the tree, which minimizes overfitting.

5) Dropout: In neural networks, dropout layers were introduced, where during training, random subsets of neurons are dropped out to prevent reliance on any one neuron.
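For the cross-validation point, a short sklearn sketch (assumes scikit-learn; the model and data here are placeholders):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# 5-fold CV: train on 4 folds, validate on the held-out fold, repeat 5 times
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(scores.round(2), round(scores.mean(), 2))   # consistent fold scores suggest no severe overfitting
```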

28
Q

Underfitting Solution

A

Underfitting occurs when the model fails to capture the underlying patterns in the data. Here’s how you can prevent it:

1) Complexity: Ensure the model has adequate complexity to capture the data patterns. This might mean adding more layers or neurons to a neural network.

2) Features: Use feature engineering to provide more informative features or to transform them in ways that make relationships more apparent.

3) Training Duration: Train for more epochs, as sometimes longer training is required to achieve convergence.

4) Regularization: If using regularization, ensure the regularization parameters aren’t set too high, which can suppress the model’s capacity.

5) Model Selection: Consider switching to a more complex model if simpler models like linear regression aren’t capturing the data patterns.

29
Q

How do you know when your model is overfitting or underfitting

A

Training vs. Validation Error:

  • Underfitting: Both training and validation errors are high.
  • Overfitting: Training error is low, but validation error is significantly higher.
  • Learning Curves: Plotting training and validation error over epochs. If they converge and plateau with a high error, it’s likely underfitting. If there’s a large gap between them, it’s overfitting.
30
Q

How do you assess the importance of each model in the ensemble

A

When using an ensemble approach, assessing the importance or contribution of each model can help understand which model brings the most value. Methods I used:

1) Permutation Importance: By shuffling the predictions of a particular model and observing the drop in the ensemble’s performance, one can gauge the importance of that model.

2) Correlation of Errors: Models that make very different errors compared to others can be considered more valuable, as they bring diversity to the ensemble. Evaluating the correlation of errors among models can be insightful.

31
Q

How do you deal with imbalanced data (For Images and Tabular Data)

A

Dealing with imbalanced data is crucial to prevent models from being biased towards the majority class:

  1. Resampling:
  • Oversampling: Increase the number of instances in the minority class by duplicating samples or generating synthetic samples (e.g., SMOTE).
  • Undersampling: Reduce the number of instances in the majority class. However, it might lead to loss of information.
  2. Weighted Loss Function: Assign higher weights to the minority class during model training (see the sketch below).
  3. Anomaly Detection: Treat the minority class as an anomaly detection problem.
  4. Using Different Evaluation Metrics: Accuracy might be misleading. Instead, focus on metrics like precision, recall, F1-score, or the area under the precision-recall curve.
  5. Ensemble Methods: Bagging and boosting algorithms can help, or stacking / voting etc.
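For the weighted-loss idea, a one-line sklearn illustration (assumes scikit-learn; X_train and y_train are placeholders for your own imbalanced dataset):

```python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights classes inversely to their frequency,
# so mistakes on the rare class are penalised more heavily during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
# clf.fit(X_train, y_train)
```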
32
Q

Primary Components of a Neural Network (NN)

A

The primary components of a neural network include:

1) Layers: These can be input, hidden, or output layers.

2) Nodes or Neurons: These are computational units in each layer.

3) Weights and Biases: Parameters that get adjusted during training.

4) Activation Function: Determines the output of a neuron, e.g., ReLU, sigmoid, tanh.

33
Q

RCNN / Fast RCNN / Faster RCNN

A

1) R-CNN (Regions with CNN Features):
Architecture:
- Region Proposal: Uses selective search to propose candidate object bounding boxes.
- Feature Extraction: For each proposed region, a CNN extracts features.
- Classification: SVM classifiers identify the object within the proposed regions.
- Bounding Box Regression: Refines the bounding boxes for better accuracy.

2) Fast R-CNN:

An improved version that addresses some inefficiencies of R-CNN.

Architecture:
- Single CNN Pass: The entire image goes through CNN to generate a feature map.
- Region of Interest (RoI) Pooling: Proposed regions from the feature map are pooled to have a fixed size.
- Classification and Bounding Box Regression: Fully connected layers followed by two output layers—one for classifying objects and the other for bounding box regression.

3) Faster R-CNN:

Builds upon Fast R-CNN by adding a Region Proposal Network (RPN).

Architecture:
- Region Proposal Network (RPN): Learns to propose candidate object bounding boxes directly, replacing selective search.
- RoI Pooling: Similar to Fast R-CNN, it has RoI pooling, followed by layers for classification and bounding box regression.

34
Q

YOLO

A

YOLO:

  • YOLO is known for its speed, performing object detection in real-time.

Architecture:
1) Grid Division:
- Divides the input image into a grid (e.g., 13x13).

2) Single Network Pass:
- Passes the image through a single neural network.

3) Predictions:
- Each grid cell predicts bounding boxes, objectness scores, and class probabilities.

4) Non-maximum Suppression:
- Reduces overlapping bounding boxes, keeping only the ones with the highest confidence scores.

35
Q

SSD

A

SSD:

  • SSD combines aspects of YOLO and Faster R-CNN, excelling in both speed and accuracy.

Architecture:
1) Multiple Feature Maps:
- Uses feature maps from multiple layers of the network for detection, allowing for objects of varying sizes to be detected.

2) Predictions:
- Each feature map cell predicts categories and bounding boxes.

3) Non-maximum Suppression:
- Similar to YOLO, it uses non-maximum suppression to reduce redundant bounding boxes and keep the most confident ones.
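Since both YOLO and SSD rely on non-maximum suppression, here is a hedged NumPy sketch of the basic algorithm (the box format, scores, and threshold are illustrative):

```python
import numpy as np

def iou(box, boxes):
    """Intersection-over-union between one box and many boxes, in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping others that overlap them too much."""
    order = np.argsort(scores)[::-1]          # indices sorted by confidence, highest first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) < iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the near-duplicate of box 0 is suppressed
```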