Theory Flashcards

1
Q

What is the main difference between deep learning and machine learning?

A

The main difference between deep learning and machine learning lies in their structure and complexity:

  1. Machine Learning (ML): It involves algorithms that can learn from data, identify patterns, and make decisions with minimal human intervention. Traditional ML methods, such as decision trees, support vector machines, and linear regression, typically require human expertise to extract relevant features from data.
  2. Deep Learning (DL): A subset of ML, deep learning uses neural networks with many layers (hence “deep”) to automatically learn features and representations from large amounts of data. It requires less human intervention for feature extraction because it can automatically discover patterns from raw data. Deep learning excels at complex tasks like image recognition, natural language processing, and speech recognition.

In short, machine learning often requires manual feature extraction, while deep learning automates this through multiple layers of abstraction in neural networks.

2
Q

How were biological neurons mapped into artificial neurons?

A

The concept of artificial neurons in deep learning was inspired by the way biological neurons work, though the mapping is simplified for computational purposes. Here’s a basic comparison of the two:

  Biological neuron:
  1. Structure: A biological neuron consists of dendrites (inputs), a cell body, and an axon (output).
  2. Process: Neurons receive electrical signals from other neurons through dendrites. If the cumulative input reaches a certain threshold, the neuron “fires,” sending an electrical signal through the axon to the next neuron.
  3. Communication: Neurons communicate via synapses, where chemical signals are transferred from one neuron to another. Learning occurs by adjusting the strength of these connections (synaptic weights).

  Artificial neuron:
  1. Structure: An artificial neuron is designed to simulate this process with three components—inputs (like dendrites), weights (like synapses), and an output.
  2. Process: Inputs (numerical values) are multiplied by corresponding weights and summed up. This weighted sum is passed through an activation function (similar to the biological “firing” threshold), which determines if the neuron should “activate” (fire) or not.
  3. Learning: The strength of connections (weights) between artificial neurons is adjusted during training using algorithms like backpropagation to minimize the error between predicted and actual outputs.

  The mapping:
  • Dendrites (inputs) → Multiple input features (data points).
  • Synapses (connections) → Weights that control the importance of each input.
  • Neuron firing → Activation function that decides if the neuron produces an output.
  • Learning through synaptic changes → Adjusting weights based on error feedback during training.

While biological neurons are much more complex, the simplified model of artificial neurons allows for practical use in computations, forming the basis of neural networks in deep learning.
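As a minimal illustration of this mapping (a sketch added here, not part of the original card), a single artificial neuron can be written in a few lines of Python; the input values, weights, and bias below are arbitrary example numbers:

import numpy as np

def step(z):
    # "Firing" decision: output 1 if the weighted sum reaches the threshold, else 0
    return 1 if z >= 0 else 0

def artificial_neuron(x, w, b):
    # Inputs (dendrites) are multiplied by weights (synapses), summed,
    # and passed through the activation function (firing decision)
    z = np.dot(w, x) + b
    return step(z)

x = np.array([0.5, 1.0])    # example input features
w = np.array([0.8, -0.3])   # example synaptic weights
b = -0.1                    # example bias
print(artificial_neuron(x, w, b))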

3
Q

What happened to the original formula h_j(sum(w_i x_i) − b)?

A

In artificial neural networks, the bias is typically transformed into a weight-times-input representation (i.e., a constant value treated as an additional input with its own weight) for the following reasons:

The bias in a neuron allows the model to shift the activation function. Mathematically, a neuron computes the following:

[
z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
]

where ( w ) are the weights, ( x ) are the inputs, and ( b ) is the bias. By treating the bias as a weight connected to a constant input (usually set to 1), the equation becomes:

[
z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + w_0 \cdot 1
]

Here, the bias is represented by ( w_0 ), a weight multiplied by a constant input of 1. This form unifies the mathematical structure, making it easier to handle during computation.

In practice, deep learning frameworks use matrix multiplication to handle computations efficiently. By transforming the bias into a weight with a constant input of 1, it allows the entire model’s calculations to be performed as a single matrix operation. This avoids the need for separate handling of the bias term.

For example, without this transformation, we would need two separate operations: one for matrix multiplication of the inputs and weights, and one for adding the bias term. Combining them simplifies implementation and speeds up training.
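A small numpy sketch (with illustrative values, not taken from the card) showing that appending a constant 1 to the input and folding the bias into the weight vector gives exactly the same result as a separate bias addition:

import numpy as np

x = np.array([2.0, 3.0])     # inputs x_1, x_2
w = np.array([0.5, -1.0])    # weights w_1, w_2
b = 0.25                     # bias

# Separate handling: matrix multiplication plus a bias addition
z_separate = w @ x + b

# Bias as a weight: augment the input with a constant 1 and fold b into the weights
x_aug = np.append(x, 1.0)    # [x_1, x_2, 1]
w_aug = np.append(w, b)      # [w_1, w_2, w_0] with w_0 = b
z_folded = w_aug @ x_aug

print(z_separate, z_folded)  # identical results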

Conceptually, the bias can be viewed as controlling the threshold for activation. It shifts the decision boundary (for classification tasks) or the output (for regression tasks). By representing the bias as a weight with a fixed input, it allows the network to learn both the importance (via weights) and the threshold (via the bias) in a consistent manner.

The bias provides flexibility to the model by allowing neurons to activate even when all inputs are zero. Without a bias term, the activation would be strictly dependent on the inputs and their corresponding weights, limiting the network’s ability to learn optimal patterns.

In summary, transforming the bias into a weight-times-input form allows for computational efficiency, mathematical consistency, and flexibility in training, making it easier to implement neural networks in practice.

4
Q

When counting weights, do we consider the bias?

A

Yes, the bias counts as an additional weight (a weight on a constant input of 1).

5
Q

What is the difference between the step and the sign functions?

A

Step: 0 for negative inputs, 1 for positive.

Sign: −1 for negative inputs, +1 for positive.

6
Q

How do we build an OR gate using a neural network and the sign function?

A

Building an OR gate using a neural network involves designing a single-layer perceptron (a type of artificial neuron) with appropriate weights and bias. Here’s a step-by-step guide:

The OR gate has two inputs, (x_1) and (x_2), and one output. The output is 1 if either or both of the inputs are 1; otherwise, it’s 0. The truth table for the OR gate is shown at the end of this answer.

A perceptron makes decisions based on the weighted sum of its inputs and a bias, which is passed through an activation function. For an OR gate, the output is 1 when the weighted sum exceeds a certain threshold (which is handled by the activation function).

The equation for a perceptron with two inputs is:

[
z = w_1 x_1 + w_2 x_2 + b
]

where:
- ( w_1 ) and ( w_2 ) are the weights for the inputs ( x_1 ) and ( x_2 ),
- ( b ) is the bias,
- ( z ) is the weighted sum, which will be passed through the activation function.

For a perceptron, a step function (Heaviside function) is typically used as the activation function. The step function outputs 1 if the input (z) is greater than or equal to 0, and outputs 0 otherwise:

[
\text{Activation}(z) =
\begin{cases}
1 & \text{if } z \geq 0 \\
0 & \text{if } z < 0
\end{cases}
]

To implement an OR gate, we need to choose weights and a bias that make the perceptron behave like the OR logic. We want the perceptron to output 1 if either (x_1) or (x_2) is 1, and 0 if both inputs are 0.

By trial and error or simple logic, we can determine that:
- Set (w_1 = 1),
- Set (w_2 = 1),
- Set (b = -0.5).

This means the perceptron will fire (output 1) when the sum (x_1 + x_2) is greater than or equal to 0.5, which matches the truth table of the OR gate.

Now, let’s verify the behavior of the perceptron for the four possible input combinations:

  • Case 1: (x_1 = 0), (x_2 = 0)[
    z = (1 \cdot 0) + (1 \cdot 0) + (-0.5) = -0.5
    ]Since (z < 0), the output is 0, which matches the OR gate output.
  • Case 2: (x_1 = 0), (x_2 = 1)[
    z = (1 \cdot 0) + (1 \cdot 1) + (-0.5) = 0.5
    ]Since (z \geq 0), the output is 1, which matches the OR gate output.
  • Case 3: (x_1 = 1), (x_2 = 0)[
    z = (1 \cdot 1) + (1 \cdot 0) + (-0.5) = 0.5
    ]Since (z \geq 0), the output is 1, which matches the OR gate output.
  • Case 4: (x_1 = 1), (x_2 = 1)[
    z = (1 \cdot 1) + (1 \cdot 1) + (-0.5) = 1.5
    ]Since (z \geq 0), the output is 1, which matches the OR gate output.

In summary, the OR gate is implemented with:
  • Weights: (w_1 = 1), (w_2 = 1)
  • Bias: (b = -0.5)
  • Activation function: Step function.

This configuration successfully implements the OR gate using a neural network (specifically, a single-layer perceptron).

| (x_1) | (x_2) | OR Output |
|-------|-------|-----------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
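As an illustrative Python sketch (not part of the original card), the configuration above can be checked directly against the truth table:

import numpy as np

def step(z):
    # Heaviside step activation: 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

w = np.array([1.0, 1.0])  # w_1, w_2
b = -0.5                  # bias

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = w @ np.array([x1, x2]) + b
    print(f"OR({x1}, {x2}) = {step(z)}")  # prints 0, 1, 1, 1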

7
Q

What does Hebbian learning say? What are its rules? Describe the formulas.

A

“The strength of a synapse increases according to the simultaneous
activation of the relative input and the desired target.”

Update rule: w_i ← w_i + Δw_i with Δw_i = η · x_i · t, where η is the learning rate, x_i the i-th input, and t the desired target.

Start from a random initialization. Fix the weights one sample at a time (online learning), and only if the sample is not correctly predicted.

8
Q

Implement a Hebbian-learning perceptron starting from w = [0, 0, 0] and η = 0.5 to obtain an OR gate.

A
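The original answer is missing here; below is one possible worked sketch, assuming the sign activation, targets coded as −1/+1, and the update Δw = η · x · t applied online only to misclassified samples, with the bias folded in as an extra input fixed to 1:

import numpy as np

eta = 0.5
w = np.zeros(3)  # [w_0 (bias), w_1, w_2], starting from w = [0, 0, 0]

# OR gate: inputs augmented with a constant 1 (bias input), targets in {-1, +1}
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
t = np.array([-1, 1, 1, 1])

def sign(z):
    return 1 if z >= 0 else -1

for epoch in range(10):
    errors = 0
    for x_i, t_i in zip(X, t):
        if sign(w @ x_i) != t_i:     # update only if the sample is misclassified (online)
            w = w + eta * x_i * t_i  # Hebbian update: Delta w = eta * x * t
            errors += 1
    if errors == 0:                  # stop once every sample is classified correctly
        break

print(w)  # with this sample order the weights converge to [-0.5, 0.5, 0.5]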
9
Q

What is an epoch?

A

One complete pass through the whole training dataset.

10
Q

What is the difference between a batch and an epoch?

A

An epoch is one pass through the whole dataset; a batch is the subset of the data used for a single weight update.

11
Q

If you change the order of the input data, does the result change?

A

Yes

12
Q

If you start from different initial weights, will you get the same final weights?

A

No

13
Q

Does the procedure always converge?

A

Yes, if a solution exists (i.e., the data is linearly separable).

14
Q

What is the math behind the perceptron process? Why can a single perceptron not build an XOR gate? How could we solve that?

A

Because XOR is not linearly separable: there is no single line separating the two classes of results (+1, −1). The solution is to add another layer of perceptrons.

15
Q

What is the relation between the math behind a perceptron process, the topology of the network and the type of decision region?

A

A single perceptron defines a line (a hyperplane) separating the positive from the negative points, so it can only produce half-plane decision regions. Adding hidden layers lets the network combine several hyperplanes: one hidden layer can produce convex decision regions, and two hidden layers can produce arbitrary decision regions.

16
Q

Why can’t we apply straightforward Hebbian learning to multilayer perceptron networks?

A

Because Hebbian (perceptron) learning updates each weight using the desired target of the neuron it feeds, and for the hidden units of a multilayer network we do not know their desired targets. With hard-threshold units there is also no gradient to propagate the output error back to the hidden layers, which is why backpropagation with differentiable activations is needed instead.
17
Q

What is the difference between perceptrons and feed-forward neural networks?

A

FFNNs use continuous activation functions instead of a threshold function. This, together with the guarantee that the signal travels in one direction only and that the functions in the nodes are continuous and differentiable, is what makes backpropagation possible.

18
Q

What are the conditions to apply back propagation?

A

The signal travels in one direction only, and the nodes use continuous, differentiable activation functions.

19
Q

What are FFNN? What does the input layer do in an FFNN?

A

An FFNN is a non-linear model characterized by the number of neurons, the activation functions, and the values of the weights.

In a Feed Forward Neural Network (FFNN), the input layer is the first layer of the network and serves as the interface between the raw data and the rest of the network. Its primary role is to:
1. Receive Input Data:
• The input layer accepts the raw features of the data (e.g., pixel values of an image, numerical features, or text embeddings). Each feature corresponds to one node (neuron) in the input layer.
• The number of nodes in the input layer is equal to the number of features in the input data.
2. Pass Data to the Next Layer:
• The input layer doesn’t perform any computations or transformations on the data. It simply passes the input values to the subsequent layer (usually the first hidden layer) for further processing.
• Each value is passed to the next layer through weighted connections.
3. Structure the Input for the Network:
• The input layer ensures that the data is in the correct format and dimension for the neural network to process. For example, in a network designed for images, the input layer might flatten a 2D image into a 1D array.

Key Point:
• The input layer does not perform any activation function or transformation—its purpose is simply to provide a conduit for the raw data to enter the neural network.

20
Q

What is the size of the output layer?

A

The size of the output layer in a Feed Forward Neural Network (FFNN) depends on the nature of the task and the type of data being predicted. Specifically:

  1. Single Value Prediction (Regression Tasks)
    • Output Size: 1 neuron.
    • Example: Predicting house prices, stock values, or other continuous numeric values.
    • Reason: A single output neuron represents the continuous predicted value.
  2. Binary Classification
    • Output Size: 1 neuron.
    • Example: Classifying whether an email is spam or not spam.
    • Reason: The single neuron outputs a probability (usually between 0 and 1) after applying a sigmoid activation function.
  3. Multi-Class Classification
    • Output Size: Equal to the number of classes (n_classes).
    • Example: If classifying images of digits (0–9), the output layer will have 10 neurons.
    • Reason: Each neuron corresponds to one class, and the outputs represent class probabilities (often processed by a softmax activation function).
  4. Multi-Label Classification
    • Output Size: Equal to the number of labels (n_labels).
    • Example: Predicting multiple attributes of an object, like weather conditions (e.g., sunny, windy, rainy).
    • Reason: Each neuron represents whether a specific label is present (e.g., using sigmoid activations for probabilities of each label).
  5. Custom Outputs (e.g., Vector Outputs)
    • Output Size: Depends on the task-specific requirements.
    • Example: Predicting embeddings (e.g., in NLP or recommendation systems) or generating multiple outputs (e.g., multi-task learning).

Summary

The size of the output layer is determined by:
• The type of task (regression, classification, etc.).
• The number of values or classes that need to be predicted.

21
Q

What are the values of the output layer according to the activation function used for the output neuron?

A

In Regression the output spans the whole ℝ domain:
• Use a Linear activation function for the output neuron

In Classification with two classes, choose according to their coding:
• Two classes Ω0 = −1, Ω1 = +1: use a Tanh output activation
• Two classes Ω0 = 0, Ω1 = 1: use a Sigmoid output activation
(it can be interpreted as the class posterior probability)

When dealing with multiple classes (K), use as many output neurons as classes:
• Classes are one-hot coded, e.g. Ω0 = [0 0 1], Ω1 = [0 1 0], Ω2 = [1 0 0]
• Output neurons use a softmax unit
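As a brief illustrative Keras sketch (the layer sizes and number of classes are placeholders, not from the original card), the cases above correspond to output layers such as:

from tensorflow.keras.layers import Dense

# Regression: output spans the reals -> one neuron with a linear activation
regression_output = Dense(1, activation='linear')

# Two classes coded as -1/+1 -> one tanh neuron
tanh_output = Dense(1, activation='tanh')

# Two classes coded as 0/1 -> one sigmoid neuron (class posterior probability)
binary_output = Dense(1, activation='sigmoid')

# K classes with one-hot coding -> K neurons with a softmax unit
K = 3
multiclass_output = Dense(K, activation='softmax')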

22
Q

When can we say a computer program can learn?

A

“A computer program is said to learn from experience E with respect to some class of task T and a performance measure P, if its performance at tasks in T, as measured by P, improves because of experience E.”

23
Q

What are the machine learning paradigms?

A

Imagine you have a certain experience D, i.e., data, and let’s name it

D = {x_1, x_2, x_3, … , x_N}

• Supervised learning: given the desired outputs t_1, t_2, t_3, … , t_N, learn to produce the correct output given a new set of inputs

• Unsupervised learning: exploit regularities in D to build a representation to be used for reasoning or prediction

• Reinforcement learning: producing actions a_1, a_2, a_3, … , a_N that affect the environment, and receiving rewards r_1, r_2, r_3, … , r_N, learn to act in order to maximize rewards in the long term

24
Q

Which conceptual difference makes Deep Learning differ significantly from being just another Machine Learning paradigm, alongside supervised learning, unsupervised learning, reinforcement learning, etc.?

A

The key conceptual difference that makes Deep Learning stand out from traditional Machine Learning paradigms (like supervised learning, unsupervised learning, or reinforcement learning) lies in its focus on representation learning and its ability to automatically discover features from raw data.

Here are the key points of differentiation:

  1. Feature Engineering
    • Traditional Machine Learning: Relies heavily on manual feature engineering, where domain experts identify and extract relevant features from the data to feed into the algorithm.
    • Deep Learning: Learns hierarchical feature representations directly from raw data using deep neural networks. For example, in image recognition, it learns low-level features (edges, textures) in early layers and high-level abstractions (shapes, objects) in deeper layers.
  2. Scalability with Data
    • Traditional Machine Learning: Performance often plateaus as data increases. The quality of manually crafted features and simpler models limits scalability.
    • Deep Learning: Excels in scenarios with large amounts of labeled data, as it can leverage its depth to improve performance with increased data.
  3. End-to-End Learning
    • Traditional Machine Learning: Often requires multiple stages, such as preprocessing, feature extraction, and applying a predictive model.
    • Deep Learning: Provides end-to-end learning, where raw inputs are mapped directly to outputs without intermediate manual steps.
  4. Representation of Complex Data
    • Traditional Machine Learning: Struggles with unstructured data like images, audio, and text, often requiring domain-specific techniques to process such data.
    • Deep Learning: Can model unstructured data effectively due to its ability to learn abstract representations, making it especially powerful in fields like natural language processing, computer vision, and speech recognition.
  5. Architectural Depth
    • Traditional Machine Learning: Models (e.g., linear regression, decision trees, support vector machines) generally have a shallower architecture, limiting their ability to capture complex patterns.
    • Deep Learning: Utilizes deep neural networks with many layers, enabling it to model highly non-linear and intricate relationships in data.
  6. Generalization Through Pretraining and Transfer Learning
    • Deep learning allows for pretrained models that generalize across tasks (e.g., models like GPT or ResNet), a capability not common in traditional machine learning paradigms.

Thus, Deep Learning is not just another paradigm but a significant leap due to its ability to automatically learn hierarchical, data-driven representations, scale with data, and excel in domains requiring complex pattern recognition.

25
Q

What can’t you do with perceptrons?

A

What if the dataset we want to learn does not have a linear separation boundary?

The Perceptron does not work any more and we need alternative solutions
• Non linear boundary
• Alternative input representations

26
Q

How do we minimize a generic function?

A

To find the minimum of a generic function, we compute the partial
derivatives of the function and set them to zero

Closed-form solutions are practically never available, so we use
iterative solutions (gradient descent):
• Initialize the weights to a random value
• Update the weights by moving against the gradient, w ← w − η ∂L/∂w
• Iterate until convergence
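A minimal sketch of this iterative procedure on a toy one-dimensional function (the function, starting point, and learning rate are arbitrary examples):

# Minimize f(w) = (w - 3)^2 with gradient descent
def grad(w):
    return 2 * (w - 3)  # df/dw

w = 10.0    # random-ish initialization
eta = 0.1   # learning rate

for _ in range(100):  # iterate until (approximate) convergence
    w = w - eta * grad(w)

print(w)  # close to the minimum at w = 3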

27
Q

How does gradient descent work in an FFNN? What are some strategies to avoid local minima?

A

How Gradient Descent Works in a Feed Forward Neural Network (FFNN):

Gradient Descent is the optimization algorithm used to minimize the loss function by iteratively updating the weights and biases of the network. Here’s how it works in an FFNN:

  1. Forward Propagation:
    • Input data is passed through the network layer by layer, and activations are calculated using weights, biases, and activation functions.
    • The loss function measures the difference between the network’s prediction and the actual target.
  2. Backward Propagation (Backpropagation):
    • The gradient of the loss function is computed with respect to each weight and bias in the network, moving backward from the output layer to the input layer.
    • Gradients are calculated using the chain rule, layer by layer.
  3. Weight Updates (Gradient Descent):
    • The weights and biases are updated using the gradients computed during backpropagation. The update rule is:

w_{i,j} = w_{i,j} - \eta \cdot \frac{\partial L}{\partial w_{i,j}}

where:
• w_{i,j} : weight between node i and node j
• \eta : learning rate (step size)
• \frac{\partial L}{\partial w_{i,j}} : gradient of the loss function with respect to the weight w_{i,j}
• This process is repeated for all parameters in the network until convergence (when the loss function reaches a minimum).
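A compact numpy sketch of these three steps on a tiny one-hidden-layer network (the architecture, data, learning rate, and squared-error loss 0.5 · Σ(out − y)² are illustrative assumptions, not from the original card):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: learn the OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [1]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)  # input -> hidden
W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)  # hidden -> output
eta = 0.5

for _ in range(5000):
    # 1. Forward propagation
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # 2. Backward propagation (chain rule, gradients of the squared-error loss)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # 3. Gradient descent updates: w <- w - eta * dL/dw
    W2 -= eta * (h.T @ d_out)
    b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * (X.T @ d_h)
    b1 -= eta * d_h.sum(axis=0)

print(out.round(2))  # predictions approach the OR targets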

Strategies to Avoid Local Minima:

While local minima can be a concern in optimization, the high dimensionality of neural network loss surfaces usually makes the problem less about local minima and more about saddle points. Regardless, the following strategies can help:

  1. Use Stochastic Gradient Descent (SGD):
    • Instead of calculating the gradient over the entire dataset (batch gradient descent), compute it over mini-batches of data.
    • This introduces noise, which helps the optimization process escape shallow local minima or saddle points.
  2. Momentum:
    • Incorporate a fraction of the previous update into the current update:

v = \gamma v - \eta \nabla L

w = w + v

•	 v : velocity (previous update)
•	 \gamma : momentum term (e.g., 0.9)
•	Momentum helps smooth out updates and can push the optimization process out of local minima.
  3. Adaptive Learning Rate Methods (e.g., Adam, RMSProp):
    • These algorithms adjust the learning rate dynamically for each parameter based on past gradients.
    • Adam is particularly popular for deep learning as it combines momentum and adaptive learning rates.
  4. Learning Rate Scheduling:
    • Gradually reduce the learning rate as training progresses.
    • Helps avoid overshooting or getting stuck near minima.
  5. Weight Initialization:
    • Poor initialization can lead to getting stuck in poor local minima. Techniques like Xavier initialization or He initialization can improve convergence.
  6. Regularization:
    • Adding a penalty term to the loss function (e.g., L1, L2 regularization) can help smooth the loss surface, reducing the chance of getting stuck in sharp local minima.
  7. Batch Normalization:
    • Helps stabilize and speed up training, which can make optimization less sensitive to local minima.
  8. Gradient Noise Injection:
    • Add small noise to gradients during updates to encourage exploration of the loss surface.
  9. Resilience of Loss Surfaces in Deep Learning:
    • Research suggests that in high-dimensional spaces, local minima are often “good enough” since they are statistically close to the global minimum. However, combining the above strategies improves the chances of finding better solutions.

Summary:

Gradient Descent in FFNN optimizes the weights by computing gradients and updating parameters iteratively. Strategies like SGD, momentum, adaptive learning rates, and proper initialization help avoid local minima and improve convergence. In practice, local minima are less of a concern in deep networks due to the smoothness and high dimensionality of their loss surfaces.

28
Q

Explain the difference between the standard gradient descent and its variations: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent

A

The main difference between standard gradient descent and its variations—batch gradient descent, stochastic gradient descent (SGD), and mini-batch gradient descent—lies in how much of the dataset is used to compute the gradients at each iteration.

  1. Standard (Batch) Gradient Descent
    • Description: In standard gradient descent, the gradient of the loss function is computed using the entire training dataset.
    • Update Rule:

\theta = \theta - \eta \cdot \nabla L(\theta)

where \nabla L(\theta) is the gradient of the loss function computed over all training samples.
• Characteristics:
• Requires a pass through the entire dataset at each iteration.
• Converges smoothly due to the exact gradient computation.
• Computationally expensive for large datasets.
• Memory-intensive since all data must be loaded.
• When to Use: Small datasets where computational resources are sufficient.

  2. Stochastic Gradient Descent (SGD)
    • Description: In SGD, the gradient is computed and the parameters are updated for each individual training sample.
    • Update Rule:

\theta = \theta - \eta \cdot \nabla L_i(\theta)

where \nabla L_i(\theta) is the gradient of the loss function for a single data point i .
• Characteristics:
• Performs updates more frequently (one per sample).
• Introduces noise in updates, making it less smooth but often faster to converge.
• Can escape shallow local minima or saddle points due to noisy updates.
• Lightweight computation, making it suitable for very large datasets.
• Might lead to erratic convergence or oscillation around the minimum.
• When to Use: When working with extremely large datasets or when you want faster initial progress in optimization.

  3. Mini-Batch Gradient Descent
    • Description: Mini-batch gradient descent is a compromise between batch gradient descent and SGD. Gradients are computed on small subsets (mini-batches) of the training data.
    • Update Rule:

\theta = \theta - \eta \cdot \nabla L_B(\theta)

where \nabla L_B(\theta) is the gradient of the loss function for a mini-batch B .
• Characteristics:
• Combines the efficiency of batch gradient descent and the noise of SGD.
• Reduces memory requirements compared to full-batch gradient descent.
• Makes use of vectorized operations for efficient computation.
• Converges faster than SGD and is more stable than SGD due to averaging gradients within mini-batches.
• When to Use: The most commonly used approach in practice, especially with modern machine learning frameworks (e.g., TensorFlow, PyTorch).

Comparison Table

| Property | Batch Gradient Descent | Stochastic Gradient Descent (SGD) | Mini-Batch Gradient Descent |
|----------|------------------------|-----------------------------------|-----------------------------|
| Data Used Per Iteration | Entire dataset | One data point | Subset (mini-batch) of the dataset |
| Update Frequency | Once per epoch | After each sample | After each mini-batch |
| Convergence Smoothness | Smooth and stable | Noisy and less stable | Balance between smoothness and noise |
| Computation Per Update | High | Low | Moderate |
| Memory Usage | High | Low | Moderate |
| Typical Use Case | Small datasets | Extremely large datasets | Most modern machine learning tasks |

Which One to Use?
• Small Datasets: Batch Gradient Descent (or mini-batch if memory is a concern).
• Large Datasets: Mini-Batch Gradient Descent is the best choice due to its balance between efficiency and convergence stability.
• Streaming Data or Online Learning: Stochastic Gradient Descent.
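A schematic numpy sketch of the three update schemes (the linear model, squared-error gradient, and hyperparameters are placeholder assumptions, not from the original card):

import numpy as np

def grad_L(theta, X, y):
    # Gradient of the mean squared error of a linear model X @ theta
    return 2 * X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
eta, epochs, batch_size = 0.01, 50, 16

# Batch gradient descent: one update per epoch, over the whole dataset
theta = np.zeros(3)
for _ in range(epochs):
    theta -= eta * grad_L(theta, X, y)

# Stochastic gradient descent: one update per sample
theta = np.zeros(3)
for _ in range(epochs):
    for i in rng.permutation(len(y)):
        theta -= eta * grad_L(theta, X[i:i+1], y[i:i+1])

# Mini-batch gradient descent: one update per mini-batch
theta = np.zeros(3)
for _ in range(epochs):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        theta -= eta * grad_L(theta, X[batch], y[batch])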

29
Q

How do we measure generalization on Deep Learning trained model?

A

Generalization in deep learning refers to a model’s ability to perform well on unseen data (data that wasn’t used during training). Measuring generalization involves evaluating how well the trained model predicts outputs for new inputs. Here are key approaches to measure generalization:

  1. Split the Dataset into Training, Validation, and Test Sets
    • Training Set: Used to train the model.
    • Validation Set: Used to tune hyperparameters and monitor the model’s performance during training.
    • Test Set: Used exclusively to evaluate the final model’s performance after training and validation.

By evaluating the model on the test set, you estimate its ability to generalize to new data.

  2. Use Performance Metrics

The choice of metric depends on the task:
• For Regression:
• Mean Squared Error (MSE)
• Mean Absolute Error (MAE)
• R-squared ( R^2 )
• For Classification:
• Accuracy
• Precision, Recall, F1 Score
• Area Under the Receiver Operating Characteristic Curve (AUC-ROC)

Evaluate these metrics on the validation set during training and on the test set post-training to check generalization.

  3. Compare Training and Validation/Test Loss
    • Plot the training loss and validation loss during training:
    • If the validation loss is significantly higher than the training loss, the model might be overfitting.
    • If the training loss is high, the model might be underfitting and failing to learn from the data.
  4. Cross-Validation
    • Use k-fold cross-validation, where the dataset is split into k subsets (folds). Train the model on k-1 folds and evaluate it on the remaining fold, repeating this process k times.
    • This provides a robust measure of generalization by ensuring the model is tested on multiple unseen subsets of data.
  5. Check for Overfitting and Underfitting
    • Overfitting: The model performs well on the training data but poorly on validation/test data. It memorizes the training data instead of learning general patterns.
    • Underfitting: The model performs poorly on both training and validation/test data, indicating it hasn’t captured the underlying patterns.
  6. Regularization Techniques
    • Regularization techniques like dropout, L1/L2 regularization, and early stopping can help improve generalization by preventing overfitting. Use these to monitor and improve the model’s behavior during training.
  7. Test on an Unseen Dataset
    • Use a completely independent dataset (not part of training, validation, or test sets) to evaluate how well the model generalizes in real-world scenarios.
  8. Look at Error Distribution
    • Analyze the distribution of errors (residuals) on the test data:
    • Uniform error distribution indicates good generalization.
    • Systematic errors suggest the model may not have learned some key patterns.
  9. Train with More Data or Augment Data
    • Evaluate the model’s performance as the size of the training data increases:
    • If adding more data consistently improves performance, the model is likely generalizing better.
    • Use data augmentation for tasks like image recognition to simulate new data and test the model’s generalization.
  10. Use Learning Curves
    • Plot learning curves to visualize performance trends:
    • A narrowing gap between training and validation loss/accuracy suggests better generalization.
    • A widening gap suggests overfitting.

By combining these techniques and monitoring results across multiple evaluation methods, you can effectively measure and enhance your model’s generalization capabilities.
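A short illustrative sketch of the basic split-and-evaluate workflow described above (the synthetic data and the small Keras model are stand-ins, not part of the original card):

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Synthetic data standing in for a real dataset
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.integers(0, 2, size=1000)

# Hold out a test set, then carve a validation set out of the remaining data
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

model = Sequential([Dense(32, activation='relu', input_shape=(20,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train while monitoring validation performance, then estimate generalization on the test set
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20, verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(test_loss, test_acc)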

30
Q

What is cross-validation?

A

What is Cross-Validation?

Cross-validation is a statistical technique used to evaluate a model’s performance and generalization by partitioning the dataset into multiple subsets (folds). The key idea is to repeatedly train the model on different subsets of the data and test it on the remaining subset(s). This ensures that the model is evaluated on different portions of the data, making the evaluation more robust.

In deep learning, cross-validation is used to estimate how well a model will generalize to unseen data, tune hyperparameters, and reduce overfitting.

How Cross-Validation Works
1. k-Fold Cross-Validation:
• The dataset is split into k subsets (folds).
• For each iteration:
• Use k−1 folds for training.
• Use the remaining fold for validation.
• Repeat this process k times, rotating the validation fold each time.
• The final performance metric is the average of the scores across all k iterations (see the sketch after this list).
2. Stratified k-Fold Cross-Validation:
• Similar to k-fold but ensures that the class distribution in each fold is representative of the overall dataset (useful for imbalanced data).
3. Leave-One-Out Cross-Validation (LOOCV):
• Uses one data point for validation and the rest for training. Repeat for every data point.
• Computationally expensive for large datasets.
4. Nested Cross-Validation:
• Used for hyperparameter tuning, with an outer loop for performance evaluation and an inner loop for hyperparameter optimization.
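A minimal scikit-learn sketch of k-fold cross-validation (the classifier and the synthetic data are illustrative stand-ins):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.integers(0, 2, size=200)

# 5-fold CV: train on 4 folds, validate on the remaining one, rotate 5 times
scores = cross_val_score(LogisticRegression(), X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())  # average validation score across the 5 folds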

Benefits of Cross-Validation in Deep Learning
1. Improved Model Evaluation:
• Cross-validation gives a more reliable estimate of the model’s performance by testing it on multiple subsets of data. It reduces the risk of bias caused by a single train-test split.
2. Better Generalization Estimates:
• By validating the model on different portions of the data, cross-validation provides insights into how well the model generalizes to unseen data.
3. Robust Hyperparameter Tuning:
• Cross-validation can be used to systematically tune hyperparameters (e.g., learning rate, dropout, number of layers) by finding configurations that work well across multiple folds.
4. Mitigates Overfitting:
• Regularly evaluating the model on different validation sets during training reduces the likelihood of overfitting to a specific subset of data.
5. Handles Data Scarcity:
• When data is limited, cross-validation makes better use of the available data by allowing all samples to be used for both training and validation at some point.
6. Bias-Variance Tradeoff Analysis:
• Cross-validation helps assess whether the model is overfitting (low bias, high variance) or underfitting (high bias, low variance).

Challenges of Cross-Validation in Deep Learning
1. Computational Expense:
• Training deep learning models is computationally intensive, and running cross-validation involves training the model multiple times, which can be impractical for large models.
2. Need for Large Datasets:
• In deep learning, models often require large datasets to learn effectively. Splitting a dataset into k folds may result in insufficient data per fold.
3. Alternative Strategies:
• Instead of traditional cross-validation, techniques like train-validation-test splits and bootstrapping are often preferred due to their efficiency.

When to Use Cross-Validation in Deep Learning
• When data is scarce, and you want to make the most of it.
• When tuning hyperparameters or comparing multiple architectures.
• When using smaller or simpler models where computational cost is manageable.
• For imbalanced datasets, stratified cross-validation helps maintain class proportions across splits.

By applying cross-validation thoughtfully, you can obtain a more robust measure of your model’s generalization ability, especially in scenarios where computational resources and dataset size permit.

31
Q

What is early stopping and how could it be beneficial for deep learning models

A

Define early stopping callback

What is Early Stopping?

Early stopping is a regularization technique used during training to prevent a deep learning model from overfitting. It involves monitoring the model’s performance on a validation set and halting training when the performance stops improving (e.g., when the validation loss stops decreasing or validation accuracy plateaus).

How Early Stopping Works
1. Monitor a Validation Metric:
• During training, evaluate the model’s performance on the validation set after each epoch (e.g., validation loss or accuracy).
2. Define a Stopping Criterion:
• If the validation metric doesn’t improve for a set number of epochs (called the patience), stop training early.
3. Save the Best Model:
• Keep track of the model’s weights when the validation metric was at its best. After stopping, revert to these weights.

Steps of Early Stopping
1. Train the model on the training set and evaluate on the validation set after each epoch.
2. Check if the validation metric (e.g., loss) has improved compared to the best value recorded so far.
3. If there’s no improvement for a pre-defined patience (e.g., 5 epochs), stop training.
4. Restore the model weights to the epoch with the best validation performance.

Why Use Early Stopping?
1. Prevents Overfitting:
• If the model continues to train beyond the point where it generalizes well, it starts memorizing the training data, leading to overfitting. Early stopping prevents this by halting training when validation performance deteriorates.
2. Improves Generalization:
• By stopping training when validation performance peaks, early stopping helps the model maintain its ability to generalize to unseen data.
3. Saves Time and Resources:
• Training deep learning models can be computationally expensive. Early stopping prevents unnecessary training epochs, saving time and computational resources.
4. Works with Other Regularization Techniques:
• Early stopping can be combined with methods like dropout, batch normalization, or L1/L2 regularization for better generalization.

Benefits of Early Stopping
• Simple to Implement: Most deep learning frameworks (e.g., TensorFlow, PyTorch, Keras) have built-in support for early stopping.
• Avoids Wasting Computational Power: Training halts once performance stagnates, avoiding redundant epochs.
• Reduces Risk of Overfitting: By monitoring validation performance, it ensures the model doesn’t overtrain.

Challenges and Considerations
1. Choice of Patience:
• If the patience value is too small, training might stop prematurely (underfitting).
• If it’s too large, training might continue unnecessarily (leading to overfitting or wasted time).
2. Dependent on Validation Set:
• Early stopping relies on the validation set being representative of the test set. If the validation set is poorly chosen or too small, early stopping may not work effectively.
3. Overhead in Monitoring:
• While minor, tracking validation metrics during training adds a small computational overhead.

When to Use Early Stopping
• Large Models or Datasets: When training deep models for long durations, early stopping can prevent overfitting and save resources.
• Limited Data: For small datasets prone to overfitting, early stopping helps maintain generalization.
• Hyperparameter Tuning: Early stopping can be useful during grid or random search to quickly terminate non-promising configurations.

How to Implement Early Stopping

In most frameworks, early stopping is straightforward:

Example in Keras:

from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, callbacks=[early_stopping])

•	monitor: Metric to monitor (e.g., validation loss).
•	patience: Number of epochs to wait without improvement.
•	restore_best_weights: Ensures the final model uses the best weights.

Summary

Early stopping is a simple yet effective regularization technique that halts training once the model stops improving on a validation set. It helps reduce overfitting, save computational resources, and ensures better generalization. By carefully tuning parameters like patience, early stopping can make training more efficient and robust.

32
Q

What would be the problem of only using linear activation functions?

A

The resulting input–output function would itself be linear: a composition of linear functions is still linear, so stacking layers would add no expressive power and the network could not model non-linear relationships.
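A quick numpy check (illustrative matrices, not from the card) that stacking two linear layers collapses into a single linear map:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # first linear "layer"
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # second linear "layer"

two_layers = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)  # an equivalent single linear layer
print(np.allclose(two_layers, one_layer))   # True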

33
Q

What does the soft max function do?

A

It amplifies the differences between the predicted scores for each class and normalizes the results so that they sum to one: softmax(z_i) = e^{z_i} / Σ_j e^{z_j}, so the outputs can be read as class probabilities.
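A tiny numpy sketch (with assumed example scores) showing both effects:

import numpy as np

z = np.array([2.0, 1.0, 0.1])          # raw class scores (logits)
softmax = np.exp(z) / np.exp(z).sum()  # exponentiate, then normalize

print(softmax)        # differences between classes are amplified
print(softmax.sum())  # 1.0 -- the outputs behave like class probabilities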

34
Q

How could we use cross-validation for hyperparameter tunning?

A

Using Cross-Validation for Hyperparameter Tuning

Hyperparameter tuning is the process of finding the best combination of hyperparameters (e.g., learning rate, batch size, number of layers) to optimize the performance of a model. Cross-validation (CV) is a robust technique to evaluate hyperparameter configurations by testing their performance across multiple splits of the data.

Steps to Use Cross-Validation for Hyperparameter Tuning
1. Choose the Hyperparameters to Tune:
• Identify which hyperparameters to optimize (e.g., learning rate, number of hidden layers, dropout rate, regularization strength).
• Define the search space for each hyperparameter (e.g., a range of values for the learning rate).
2. Split the Dataset:
• Divide the dataset into k folds for k-fold cross-validation.
• Use k-1 folds for training and the remaining fold for validation.
• Rotate the validation fold through all k folds.
3. Train and Evaluate:
• For each combination of hyperparameters in the search space:
• Train the model on k-1 folds.
• Validate the model on the remaining fold.
• Repeat for all k folds and compute the average validation performance (e.g., accuracy, loss, F1-score).
4. Select the Best Hyperparameters:
• Choose the hyperparameter combination with the best average performance across the k -folds.
5. Train the Final Model:
• Use the best hyperparameter configuration to train the final model on the entire training dataset.
• Test the model on the independent test set to evaluate generalization.

Cross-Validation Methods for Hyperparameter Tuning
1. Grid Search with Cross-Validation:
• Perform an exhaustive search over a pre-defined grid of hyperparameter values.
• For each combination, evaluate using cross-validation.
Example in Scikit-Learn:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best Hyperparameters:", grid_search.best_params_)

•	Pros: Exhaustive and thorough.
•	Cons: Computationally expensive for large search spaces.

2.	Random Search with Cross-Validation:
•	Instead of searching all possible combinations, randomly sample a subset of hyperparameter combinations.
•	Useful when the search space is very large. Example in Scikit-Learn:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

param_distributions = {
    'n_estimators': randint(10, 200),
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': randint(2, 10)
}

random_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions, cv=5, n_iter=20, random_state=42)
random_search.fit(X_train, y_train)

print("Best Hyperparameters:", random_search.best_params_)

•	Pros: More efficient than grid search.
•	Cons: May miss optimal configurations if not enough samples are drawn.

3.	Bayesian Optimization with Cross-Validation:
•	Uses probabilistic models to predict the performance of hyperparameter configurations and iteratively refines the search.
•	Often implemented using tools like Optuna or HyperOpt.
•	Pros: Efficient for large search spaces; focuses on promising areas of the hyperparameter space.
•	Cons: More complex to implement.
4.	Nested Cross-Validation:
•	Combines hyperparameter tuning with model evaluation.
•	Outer loop: Splits data into training and test sets.
•	Inner loop: Performs cross-validation on the training set to tune hyperparameters.
•	Pros: Provides an unbiased estimate of model performance.
•	Cons: Computationally expensive.

Benefits of Using Cross-Validation for Hyperparameter Tuning
1. Reduces Overfitting to Validation Set:
• By testing hyperparameters on multiple folds, CV ensures that the selected configuration generalizes well across different subsets of the data.
2. Maximizes Data Utilization:
• All data points are used for training and validation at some point, which is especially useful for small datasets.
3. Robust Performance Estimates:
• Averaging performance across folds reduces the variability caused by a single train-test split.
4. Better Model Selection:
• Ensures that the chosen hyperparameters perform consistently across different data partitions.

Challenges
1. Computational Cost:
• Training models repeatedly for each fold and hyperparameter combination can be expensive, especially for large datasets or deep learning models.
2. Careful Data Splitting:
• Ensure proper stratification or data splitting (e.g., stratified k-fold) to maintain class balance in imbalanced datasets.
3. Risk of Overfitting to Validation Set:
• If too many hyperparameters are tuned, there’s a risk of overfitting to the cross-validation procedure itself.

When to Use Cross-Validation for Hyperparameter Tuning
• When the dataset is small and you need to maximize its usage.
• When you’re evaluating multiple models or hyperparameters and want robust comparisons.
• When generalization to unseen data is critical.

By leveraging cross-validation, hyperparameter tuning becomes more reliable, leading to models that generalize well to new data.

35
Q

What is the dropout technique? What are its effects and why?

A

What is the Dropout Technique?

Dropout is a regularization technique used in deep learning to reduce overfitting and improve generalization. It works by randomly “dropping out” (i.e., setting to zero) a fraction of the neurons in the network during each training iteration. The dropout fraction is typically controlled by a hyperparameter p , which represents the probability of keeping a neuron active (e.g., p = 0.8 means 80% of neurons are retained, and 20% are dropped).

When applying dropout:
• During training: Neurons are randomly removed from the network.
• During inference: Dropout is turned off, and the full network is used, with weights scaled by p to account for the reduced activity during training.

How Does Dropout Work?
1. At each training step, dropout selects a random subset of neurons to deactivate (set to zero).
2. This forces the network to rely on different combinations of neurons and prevents co-adaptation (where neurons depend too heavily on each other).
3. The result is a more robust model that generalizes better to unseen data.

Effects of Dropout
1. Reduces Overfitting:
• By preventing neurons from becoming overly specialized or relying too much on specific features, dropout reduces the likelihood of overfitting, especially in large or complex networks.
2. Improves Generalization:
• Dropout encourages the network to learn more general patterns in the data, as it cannot rely on the same subset of neurons at every step.
3. Acts as an Ensemble:
• During inference, the network effectively behaves like an ensemble of many smaller networks (one for each subset of neurons selected during training), which improves stability and accuracy.
4. Slightly Slower Convergence:
• Dropout introduces noise into the training process, which can slow down convergence initially, but leads to better long-term performance.

Why Does Dropout Work?

Dropout works because:
1. It reduces co-adaptation between neurons, making the model less sensitive to noise or specific data points in the training set.
2. It forces redundancy in the network, as neurons must independently learn useful features.
3. It prevents over-reliance on certain neurons or pathways, which helps the model generalize better to unseen data.

Mathematical Perspective

Let the output of a neuron z in a layer be:

z = w^T x + b

Where w are the weights, x is the input, and b is the bias.

When dropout is applied:

z_{\text{dropout}} = (r \cdot w)^T x + b

Where r is a mask vector sampled from a Bernoulli distribution:

r_i \sim \text{Bernoulli}(p)

During inference, dropout is disabled, but the weights are scaled by p to ensure the same expected output:

z_{\text{inference}} = p \cdot w^T x + b
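A small numpy sketch of the scheme described above (standard, non-inverted dropout; the values and keep probability are illustrative):

import numpy as np

rng = np.random.default_rng(0)
p = 0.8  # probability of keeping a neuron active
w, x, b = rng.normal(size=5), rng.normal(size=5), 0.1

# Training: sample a Bernoulli(p) mask r and drop the corresponding weights/inputs
r = rng.binomial(1, p, size=5)
z_train = (r * w) @ x + b

# Inference: no mask, but the weights are scaled by p to match the expected activity
z_inference = p * (w @ x) + b
print(z_train, z_inference)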

Key Parameters
1. Dropout Rate ( 1-p ): The fraction of neurons to drop. Common values:
• 0.5 for hidden layers.
• 0.2 to 0.3 for input layers.
2. Where to Apply Dropout: Usually applied to hidden layers, and sometimes to input layers, but not to output layers.

Advantages of Dropout
1. Simple and easy to implement.
2. Effective in reducing overfitting, especially in over-parameterized models.
3. Can be combined with other regularization techniques like weight decay or batch normalization.

Disadvantages of Dropout
1. Increased Training Time:
• The randomness introduced by dropout can slow down convergence.
2. Potential Underfitting:
• If the dropout rate is too high, the model may fail to learn useful patterns (underfitting).
3. Not Always Necessary:
• In smaller datasets or when other regularization techniques (e.g., batch normalization) are used, dropout may not provide significant benefits.

When to Use Dropout
• In deep neural networks prone to overfitting, especially when:
• The dataset is small.
• The model has a large number of parameters.
• When no other regularization techniques are applied.
• Less effective or unnecessary in networks with batch normalization, as both techniques address overfitting differently.

Example of Dropout in Keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(128, activation='relu', input_shape=(input_dim,)),
    Dropout(0.5),  # Drop 50% of neurons
    Dense(64, activation='relu'),
    Dropout(0.3),  # Drop 30% of neurons
    Dense(10, activation='softmax')
])

Summary

Dropout is a simple, effective regularization technique for preventing overfitting in deep learning. By randomly deactivating neurons during training, it encourages redundancy and independence, resulting in better generalization. However, it must be used carefully, as excessive dropout can lead to underfitting or slower training.

36
Q

What is meant by: sigmoid and tanh activation functions saturate? Why do they saturate? What is the consequences of this? What are the solutions for this problem?

A

What Does It Mean That Sigmoid and Tanh Saturate?

When we say that the sigmoid and tanh activation functions “saturate,” it means that their outputs approach their asymptotic values as the input becomes very large or very small. Specifically:
1. For the sigmoid function:

\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}

•	As  x \to +\infty , sigmoid approaches 1.
•	As  x \to -\infty , sigmoid approaches 0.
2.	For the tanh function:

\text{tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

•	As  x \to +\infty , tanh approaches 1.
•	As  x \to -\infty , tanh approaches -1.

In both cases, the gradients (derivatives of the activation functions) become very small for large positive or negative values of x . This is called saturation.

Why Do Sigmoid and Tanh Saturate?
1. Mathematical Nature:
• Both functions have a limited output range (sigmoid: [0, 1], tanh: [-1, 1]) and asymptotically approach their maximum/minimum values.
• As x \to \pm\infty , the derivative of the function becomes very small:
• For sigmoid: \frac{d}{dx} \text{sigmoid}(x) = \text{sigmoid}(x) (1 - \text{sigmoid}(x))
• For tanh: \frac{d}{dx} \text{tanh}(x) = 1 - \text{tanh}^2(x)
• For large |x| , these derivatives approach 0, leading to vanishing gradients.
2. Initializations and Training Dynamics:
• During training, weights and biases might cause the pre-activation inputs ( z = w^T x + b ) to fall into the saturated region of the activation function, where gradients are nearly zero.
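A quick numerical check (input values chosen for illustration) of how fast the sigmoid gradient collapses in the saturated regions:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    grad = sigmoid(x) * (1 - sigmoid(x))  # derivative of the sigmoid
    print(x, grad)  # ~0.25 at 0, ~4.5e-5 at 10: the gradient vanishes as |x| grows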

Consequences of Saturation
1. Vanishing Gradients:
• When gradients are close to zero, the backpropagation algorithm fails to update the weights effectively, causing very slow or stalled learning for deep networks.
2. Inefficient Training:
• Learning becomes slow or impossible for layers whose activations are in the saturated regions.
3. Difficulty in Training Deep Networks:
• Saturation exacerbates the vanishing gradient problem in deep networks, making it hard for earlier layers to learn useful features.
4. Bias Shift:
• Saturation regions cause large biases in the updates, as the activations don’t change significantly, leading to slower convergence.

Solutions to the Saturation Problem
1. Use Non-Saturating Activation Functions:
• Replace sigmoid and tanh with ReLU (Rectified Linear Unit) or its variants:
• ReLU: \text{ReLU}(x) = \max(0, x)
• No saturation for positive values of x .
• Leaky ReLU: Allows small negative gradients to flow for x < 0 .
• ELU (Exponential Linear Unit): Smooth version of ReLU with non-zero gradients for x < 0 .
2. Careful Weight Initialization:
• Initialize weights using methods that prevent activations from falling into the saturated regions:
• Xavier Initialization: Works well for tanh.
• He Initialization: Works well for ReLU and its variants.
3. Batch Normalization:
• Normalize activations during training so that they stay in the active (non-saturated) range of the activation functions.
• This helps prevent activations from growing too large or small and stabilizes training.
4. Gradient Clipping:
• Clip gradients during backpropagation to prevent extremely small or large updates.
5. Gradient Boosting Techniques:
• Use optimizers like Adam, RMSprop, or momentum-based optimizers to help navigate saturated regions more effectively.
6. Residual Connections:
• Use architectures like ResNets (Residual Networks) to facilitate gradient flow through skip connections, mitigating the vanishing gradient issue.

Summary
• Saturation in sigmoid and tanh occurs because their gradients become very small for large positive or negative inputs, leading to vanishing gradients.
• This slows down or even stops learning, particularly in deep networks.
• Solutions include using non-saturating activation functions (like ReLU), proper weight initialization, batch normalization, and advanced optimizers. These techniques have largely replaced sigmoid and tanh in modern deep learning architectures.

37
Q

What are the advantages of ReLu?

A

• Faster SGD Convergence (6x w.r.t sigmoid/tanh)
• Sparse activation (only part of hidden units are activated)
• Efficient gradient propagation (no vanishing or exploding gradient problems), and Efficient computation (just thresholding at zero)
• Scale-invariant

38
Q

How could different weights initialization affect the function of a network?

A

The final result of gradient descent is affected by weight initialization:
• Zeros: it does not work! All gradients would be zero, so no learning would happen
• Big numbers: a bad idea, if unlucky it might take very long to converge
• w ∼ N(0, σ² = 0.01): good for small networks, but it might be a problem for deeper neural networks

In deep networks:
• If the weights start too small, the gradient shrinks as it passes through each layer
• If the weights start too large, the gradient grows as it passes through each layer until it is too massive to be useful
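A rough numpy experiment (the depth, width, and scales are arbitrary assumptions, not from the original card) showing how the signal shrinks or saturates depending on the initialization scale:

import numpy as np

rng = np.random.default_rng(0)

def forward_std(weight_std, n_layers=20, width=100):
    # Propagate a random input through a deep stack of tanh layers
    h = rng.normal(size=(1, width))
    for _ in range(n_layers):
        W = rng.normal(0, weight_std, size=(width, width))
        h = np.tanh(h @ W)
    return h.std()

print(forward_std(0.01))              # too small: activations (and gradients) shrink towards 0
print(forward_std(1.0))               # too large: pre-activations blow up and tanh saturates
print(forward_std(np.sqrt(1 / 100)))  # Xavier-like scale keeps the signal in a healthy range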

39
Q

What are the proposals to the problems of weight initialization? Explain each

A

What is the Xavier Weight Initialization Technique?

The Xavier weight initialization (also called Glorot initialization) is a method designed to initialize the weights of neural networks to optimize training. It was introduced by Xavier Glorot and Yoshua Bengio in their 2010 paper “Understanding the difficulty of training deep feedforward neural networks”.

The main idea behind Xavier initialization is to maintain a balance in the variance of activations and gradients across all layers of a network. This helps avoid the problems of vanishing gradients or exploding gradients, which can occur if weights are initialized poorly.

How Does Xavier Initialization Work?

For a given neuron in a layer:
• Let n_{\text{in}} = the number of input connections (or neurons in the previous layer).
• Let n_{\text{out}} = the number of output connections (or neurons in the current layer).

The Xavier initialization sets the weights W by drawing values from either:
1. A uniform distribution:

W \sim \mathcal{U} \left( -\frac{\sqrt{6}}{\sqrt{n_{\text{in}} + n_{\text{out}}}}, \frac{\sqrt{6}}{\sqrt{n_{\text{in}} + n_{\text{out}}}} \right)

2. A normal distribution with zero mean and variance 2/(n_in + n_out):

W \sim \mathcal{N} \left( 0, \frac{2}{n_{\text{in}} + n_{\text{out}}} \right)

Why Does Xavier Initialization Work?
1. Balancing Activations:
• By scaling the weights based on n_{\text{in}} + n_{\text{out}} , Xavier initialization ensures that the activations are neither too large nor too small as they propagate through the layers. This helps avoid saturating activation functions like sigmoid or tanh.
2. Balancing Gradients:
• Similarly, the method prevents gradients from becoming too small or too large during backpropagation, which is crucial for efficient training, especially in deeper networks.

When to Use Xavier Initialization
• Activation Functions:
• Xavier initialization is particularly effective for activation functions like sigmoid or tanh, where keeping activations in the active region is important to avoid saturation.
• For ReLU and its variants, a slightly modified version called He initialization is often more appropriate: ReLU zeroes out negative inputs, roughly halving the variance of the activations, which He initialization compensates for by using a variance of 2/n_in.

Example of Xavier Initialization in Practice

In TensorFlow/Keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import GlorotUniform

model = Sequential()
model.add(Dense(128, activation='tanh', kernel_initializer=GlorotUniform()))

In PyTorch

import torch
import torch.nn as nn

layer = nn.Linear(in_features=128, out_features=64)
nn.init.xavier_uniform_(layer.weight)
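
For comparison with the framework calls above, a from-scratch NumPy sketch of the Glorot/Xavier uniform rule might look like this; the layer sizes mirror the PyTorch example and are otherwise arbitrary.

import numpy as np

def xavier_uniform(n_in, n_out, rng):
    limit = np.sqrt(6.0 / (n_in + n_out))   # bound from the uniform formula above
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = xavier_uniform(128, 64, rng)
print(W.shape, round(float(W.std()), 4))    # std is close to sqrt(2 / (n_in + n_out)) ≈ 0.102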

Advantages of Xavier Initialization
1. Prevents gradients from vanishing or exploding.
2. Improves convergence speed.
3. Helps maintain stable training dynamics.

Limitations
1. Not optimal for activation functions like ReLU, which require He initialization.
2. May not be sufficient alone in very deep networks without additional techniques like batch normalization or residual connections.

Summary

Xavier initialization is a method for initializing weights in neural networks to maintain stable variance of activations and gradients. It is particularly effective for sigmoid and tanh activations and is based on the number of input and output connections of each layer. It is widely used in practice due to its simplicity and effectiveness in mitigating gradient-related issues.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
40
Q

What is the batch normalization technique? What are its advantages?

A

What is Batch Normalization?

Batch Normalization (BatchNorm) is a deep learning technique that normalizes the activations of a layer across the mini-batch during training. It was introduced by Sergey Ioffe and Christian Szegedy in 2015 to address issues such as internal covariate shift, improve training stability, and speed up convergence.

In essence, BatchNorm standardizes the inputs to each layer (or neuron) so that their mean is approximately 0 and variance is approximately 1, for each mini-batch during training. This helps stabilize and accelerate the training process.

How Batch Normalization Works

For a given mini-batch x_1, x_2, …, x_m , the process of batch normalization involves the following steps:
1. Compute the Mean and Variance:

\mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \quad \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2

2.	Normalize the Input: Normalize each input  x_i  by subtracting the batch mean and dividing by the standard deviation:

\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}

Here, \epsilon is a small constant added for numerical stability.
3. Apply Learnable Parameters (Scale and Shift):
To allow the network to recover any necessary transformation of the normalized inputs, BatchNorm introduces two learnable parameters:

y_i = \gamma \hat{x}_i + \beta

•	 \gamma : Scale parameter.
•	 \beta : Shift parameter.
4.	Use Running Statistics During Inference: During training, the mean and variance are calculated for each mini-batch. However, during inference, BatchNorm uses a running average of the mean and variance computed during training.
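
To make these steps concrete, here is a minimal NumPy sketch of the training-time BatchNorm computation described above; the mini-batch, gamma, and beta values are illustrative placeholders.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to ~zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.randn(32, 4) * 3.0 + 5.0       # mini-batch of 32 samples, 4 features
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature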

Advantages of Batch Normalization
1. Stabilizes Training:
• By normalizing the input to each layer, BatchNorm reduces sensitivity to the initial weights and hyperparameters, making training more robust.
2. Speeds Up Convergence:
• Normalized inputs ensure that the gradients flow smoothly through the network, leading to faster convergence and potentially fewer training epochs.
3. Mitigates the Vanishing/Exploding Gradient Problem:
• By keeping activations in a reasonable range, BatchNorm helps prevent gradients from becoming too small (vanishing) or too large (exploding), especially in deep networks.
4. Reduces Dependence on Initialization:
• BatchNorm reduces the reliance on careful weight initialization by standardizing the activations, allowing for simpler initialization schemes.
5. Acts as a Regularizer:
• The randomness introduced by mini-batch statistics during training acts as a regularization effect, potentially reducing overfitting.
6. Enables Higher Learning Rates:
• Normalized inputs allow for the use of larger learning rates without risking instability in training.
7. Makes Deeper Networks Easier to Train:
• Deep networks often suffer from issues like internal covariate shift and vanishing gradients; BatchNorm helps address both, making it easier to train very deep architectures.

Limitations of Batch Normalization
1. Dependency on Mini-Batch Size:
• Small batch sizes can lead to noisy estimates of the mean and variance, reducing the effectiveness of BatchNorm.
2. Increased Computational Overhead:
• BatchNorm introduces extra computations for mean, variance, and normalization, which slightly increases training time.
3. Inference Complexity:
• During inference, running averages of mean and variance must be maintained and used, which adds complexity.
4. Not Always Ideal for RNNs:
• BatchNorm does not work well for recurrent neural networks (RNNs) due to their sequential nature, where statistics across time steps may differ.

Alternatives to Batch Normalization

For specific cases where BatchNorm is less effective, alternative normalization techniques include:
• Layer Normalization: Normalizes across all features of a single data point instead of across the mini-batch.
• Instance Normalization: Normalizes each individual data sample (used in style transfer).
• Group Normalization: Splits features into groups and normalizes each group separately (useful for small batch sizes).

Example of BatchNorm in Keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential([
    Dense(128),
    BatchNormalization(),  # Normalizes the activations of the Dense layer
    Activation('relu'),
    Dense(10, activation='softmax')
])

Summary

Batch normalization is a widely-used technique in deep learning to normalize layer activations during training, improving stability and training speed. By reducing sensitivity to initialization and allowing higher learning rates, BatchNorm helps mitigate challenges in training deep neural networks. However, it may not always be suitable for every model (e.g., RNNs) or situation (e.g., small batch sizes).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
41
Q

What are adaptive learning rates? Why are they important?

A

What Are Adaptive Learning Rates?

Adaptive learning rates are techniques used in optimization algorithms to dynamically adjust the learning rate during training based on the gradients of the loss function. Instead of using a fixed learning rate, adaptive methods modify the learning rate for each parameter, typically by increasing it for small gradients and decreasing it for large gradients.

These techniques aim to optimize training efficiency by:
1. Speeding up convergence in flatter regions of the loss surface.
2. Preventing overshooting in steeper regions.

Why Are Adaptive Learning Rates Important?
1. Handle Complex Loss Surfaces:
• Deep learning models often have highly non-convex loss surfaces with regions of steep gradients (where large learning rates can overshoot) and flat gradients (where small learning rates can stagnate). Adaptive learning rates allow the model to adjust to these variations dynamically.
2. Improved Training Stability:
• Adjusting learning rates for each parameter helps prevent unstable training caused by inappropriate step sizes.
3. Faster Convergence:
• By adapting the learning rate for each parameter, these methods optimize training efficiency, leading to faster convergence, particularly in deep and complex architectures.
4. Reduced Need for Manual Tuning:
• Fixed learning rates require extensive hyperparameter tuning, while adaptive methods automatically adjust the learning rate, reducing the need for manual intervention.
5. Better Handling of Sparse Data:
• In cases where some features are rarely updated (e.g., sparse datasets), adaptive learning rates allow for larger updates for these features while maintaining smaller updates for frequently updated ones.

Common Adaptive Learning Rate Optimizers
1. Adagrad (Adaptive Gradient):
• Scales the learning rate for each parameter inversely proportional to the square root of the sum of past squared gradients.
• Formula:

\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{G_{t, i} + \epsilon}} \cdot g_t

Where:
• G_{t, i} : Sum of past squared gradients for parameter i .
• \epsilon : Small constant for numerical stability.
• Advantages: Handles sparse data well.
• Disadvantage: The learning rate diminishes over time, leading to slow convergence.

2.	RMSprop (Root Mean Square Propagation):
•	Addresses Adagrad’s diminishing learning rate issue by using an exponentially weighted moving average of past squared gradients.
•	Formula:

\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t

Where:
• E[g^2]_t : Exponential moving average of squared gradients.
• Advantages: Works well for non-stationary loss surfaces and recurrent neural networks.
• Disadvantage: Requires tuning the decay hyperparameter.

3.	Adam (Adaptive Moment Estimation):
•	Combines the benefits of RMSprop and momentum by maintaining both an exponentially decayed average of past gradients (momentum) and squared gradients.
•	Formula:

\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t

Where:
• \hat{m}_t : Bias-corrected first moment estimate (average of gradients).
• \hat{v}_t : Bias-corrected second moment estimate (average of squared gradients).
• Advantages: Robust, widely used, and requires minimal tuning.
• Disadvantage: May not always generalize well to new data.

4.	Adadelta:
•	A variant of Adagrad that addresses its diminishing learning rate problem by focusing on a window of past gradients instead of accumulating all past gradients.
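
As an illustration of the Adam update described above, here is a minimal NumPy sketch of a single parameter update; the function being minimized and the hyperparameter values are illustrative assumptions (the commonly cited defaults).

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 4):                             # a few steps on the toy objective f(theta) = theta^2
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)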

Benefits of Adaptive Learning Rates
1. Efficient Training Across Diverse Architectures:
• Works well for deep neural networks with a mix of steep and flat loss surface regions.
2. Handles Sparse Gradients:
• Effective for models with sparse features or data.
3. Reduced Hyperparameter Sensitivity:
• Removes the need for careful selection of a global learning rate.
4. Improves Convergence:
• Faster convergence compared to optimizers with fixed learning rates.

Challenges and Trade-Offs
1. Computational Overhead:
• Maintaining separate learning rates or gradient statistics for each parameter increases memory and computation.
2. May Not Always Generalize:
• Some adaptive optimizers (e.g., Adam) may result in models that don’t generalize as well as those trained with SGD.
3. Hyperparameters Still Matter:
• Decay rates and learning rate schedules still need to be tuned for optimal results.

When to Use Adaptive Learning Rates
1. When training deep networks where the loss landscape is highly non-convex.
2. When working with sparse data or large-scale datasets.
3. When faster convergence is desired without extensive learning rate tuning.

Summary

Adaptive learning rates dynamically adjust the learning rate for each parameter during training, making optimization more efficient and robust. They are critical for training deep neural networks, especially for non-stationary loss surfaces or sparse data. Popular methods like RMSprop and Adam have become standard tools in deep learning, balancing speed, stability, and ease of use.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
42
Q

When training a deep neural network, what is the difference between a static and a dynamic dataset?

A

When training a deep neural network, the distinction between a static dataset and a dynamic dataset lies in whether the data remains constant or changes during training.

  1. Static Dataset

A static dataset is a fixed collection of data samples that does not change throughout the training process.

Characteristics:
• Fixed Size: The dataset is predefined, with a fixed number of samples.
• Preprocessing: Data is typically preprocessed and augmented beforehand, ensuring consistency during training.
• Reproducibility: Training on the same static dataset ensures the results are repeatable since the input data remains unchanged.

Use Cases:
• Image Classification: Training models on datasets like MNIST, CIFAR-10, or ImageNet.
• Tabular Data: Predictive models for structured datasets, like customer churn or loan approvals.
• Pre-processed Text Datasets: Such as pre-tokenized datasets for NLP tasks.

Advantages:
• Easier to manage, preprocess, and debug.
• Results are more predictable and easier to replicate.
• Well-suited for tasks where the dataset is finite and does not evolve over time.

Disadvantages:
• Limited variability, potentially leading to overfitting if the model memorizes the data instead of generalizing.
• May not simulate real-world scenarios where data distribution changes over time.

  2. Dynamic Dataset

A dynamic dataset changes or is generated on-the-fly during training. This could involve dynamically generating new samples, augmenting data, or pulling in real-time data streams.

Characteristics:
• Generated or Augmented Data: Data samples might be dynamically augmented (e.g., cropping, rotating, flipping images) or synthesized using generative models.
• Online Data Streams: The data may come from live or continuously updated sources, such as IoT devices, sensors, or APIs.
• Shifting Distribution: In some cases, the distribution of the data may change over time, which the model needs to adapt to.

Use Cases:
• Data Augmentation: Dynamically applying transformations to images during training to increase diversity.
• Streaming Data: Online learning models that process real-time data, such as stock prices or user activity logs.
• Synthetic Data: Training GANs or reinforcement learning models where new data is generated iteratively during training.

Advantages:
• Improves generalization by introducing variability (e.g., data augmentation).
• Can simulate real-world scenarios, such as changing environments or evolving data distributions.
• Effective for large-scale or infinite data scenarios.

Disadvantages:
• Increased computational complexity due to on-the-fly data processing.
• Harder to debug and replicate experiments since the dataset may not be consistent.
• More complex data management and storage requirements.

Comparison: Static vs. Dynamic Datasets

| Aspect | Static Dataset | Dynamic Dataset |
| --- | --- | --- |
| Size | Fixed | Can be infinite or variable |
| Data distribution | Constant | Can change during training |
| Preprocessing | Done before training | Done on-the-fly during training |
| Reproducibility | Easy to reproduce results | Harder to reproduce due to changing data |
| Use cases | Image classification, tabular data | Data augmentation, streaming data, RL |
| Complexity | Low | Higher computational and implementation cost |

Which to Use?
• Static Dataset: When the dataset is finite and preprocessed, such as classical supervised learning tasks with fixed training sets.
• Dynamic Dataset: When data diversity, generalization, or adaptability is critical, such as with streaming data, reinforcement learning, or augmentation-heavy tasks.

Both static and dynamic datasets are crucial in different contexts and contribute to the model’s performance depending on the problem being addressed.

43
Q

What are memoryless models and models with memory? How do they relate to static and dynamic data?

A

What Are Memoryless Models?

A memoryless model (also known as a feed-forward model) is a type of machine learning or deep learning model that does not retain any information about past inputs or outputs. Each input is processed independently, and the model’s predictions are based solely on the current input.

Characteristics:
1. No Dependency on Previous Inputs: The model assumes that inputs are independent of each other.
2. Static Processing: The model maps input to output without retaining any historical information.
3. Simple Structure: Typically involves feed-forward neural networks (e.g., MLPs, CNNs).

Use Cases:
• Image classification (e.g., categorizing a single image).
• Tabular data prediction (e.g., regression or classification).
• Static datasets where the data points are independent of one another.

Relationship to Static Data:
• Memoryless models are often used with static datasets, as they do not require context or temporal relationships between data points.

What Are Models with Memory?

A model with memory (also known as a stateful model) is a model that retains information about past inputs or outputs, allowing it to make predictions based on historical context. These models are essential for tasks where the order or temporal relationships of the data matter.

Characteristics:
1. Temporal or Sequential Processing: The model accounts for dependencies between inputs.
2. State Retention: The model maintains a “memory” of past inputs through hidden states or other mechanisms.
3. Complex Structure: Examples include recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and transformers.

Use Cases:
• Time-series forecasting (e.g., predicting stock prices based on past trends).
• Natural language processing (e.g., translating text, generating sequences).
• Reinforcement learning (e.g., games, robotics).

Relationship to Dynamic Data:
• Models with memory are often paired with dynamic datasets, where data evolves over time or the current data point depends on previous ones.

Key Differences Between Memoryless Models and Models with Memory

| Aspect | Memoryless Models | Models with Memory |
| --- | --- | --- |
| Dependency on past | No (independent inputs) | Yes (requires sequential or historical context) |
| Structure | Feed-forward (e.g., MLPs, CNNs) | Recurrent or attention-based (e.g., RNNs, LSTMs, Transformers) |
| Data type | Static (independent data points) | Dynamic (temporal or sequential relationships) |
| Use cases | Image classification, tabular prediction | Time-series forecasting, NLP, speech processing |
| Memory mechanism | None | Hidden states or attention mechanisms |

Relationship Between Static/Dynamic Data and Memory Models

Static Data:
• Typically used with memoryless models because the data points are independent.
• Example: A static image dataset like CIFAR-10 where each image is classified individually.

Dynamic Data:
• Requires models with memory to capture sequential or temporal dependencies.
• Example: A dynamic dataset like video frames or text sequences where each data point depends on previous ones.

Practical Example
1. Memoryless Model (Static Data):
• Task: Classify whether an email is spam or not.
• Data: Independent static dataset of email features.
• Model: Feed-forward neural network (no need to remember past emails).
2. Model with Memory (Dynamic Data):
• Task: Predict the next word in a sentence.
• Data: Sequence of words in a sentence (dynamic relationship between words).
• Model: LSTM or Transformer (retains context from previous words).

Summary
• Memoryless Models: Process inputs independently; used for tasks with static, independent data.
• Models with Memory: Retain historical context and are designed for sequential or temporal data, making them well-suited for dynamic datasets.
• Choosing between these types depends on whether the problem requires modeling dependencies between data points or not.

44
Q

What are Recurrent Neural Networks? How backpropagation happens on this model? Is it different than in a FFNN?

A

What Are Recurrent Neural Networks (RNNs)?

Recurrent Neural Networks (RNNs) are a type of neural network designed for processing sequential data by retaining information from previous inputs through their hidden states. Unlike feedforward neural networks (FFNNs), RNNs have connections that form cycles, allowing them to maintain “memory” of past data.

Key Characteristics of RNNs:
1. Sequential Data Processing:
• Designed for tasks where order and temporal dependencies matter, such as time series, natural language processing, and speech recognition.
2. Hidden State:
• At each time step, the RNN maintains a hidden state that acts as memory. This hidden state is updated based on the current input and the previous hidden state.

h_t = f(W \cdot h_{t-1} + U \cdot x_t + b)

Where:
• h_t : Hidden state at time t
• h_{t-1} : Hidden state at time t-1
• x_t : Input at time t
• W, U, b : Trainable weights and biases
• f : Activation function (e.g., \tanh or ReLU)
3. Output:
• At each time step, the network produces an output based on the hidden state:

y_t = g(V \cdot h_t + c)

Where g is an activation function for the output layer.
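
To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass equations above for a vanilla RNN; the dimensions and the random input sequence are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 5, 2, 4

W = rng.standard_normal((n_hidden, n_hidden)) * 0.1   # recurrent weights
U = rng.standard_normal((n_hidden, n_in)) * 0.1       # input weights
V = rng.standard_normal((n_out, n_hidden)) * 0.1      # output weights
b, c = np.zeros(n_hidden), np.zeros(n_out)

h = np.zeros(n_hidden)                                 # initial hidden state
xs = rng.standard_normal((T, n_in))                    # input sequence

for t in range(T):
    h = np.tanh(W @ h + U @ xs[t] + b)                 # h_t = f(W h_{t-1} + U x_t + b)
    y = V @ h + c                                      # y_t = g(V h_t + c), with g = identity here
    print(f"t={t}, y_t={y.round(3)}")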

How Backpropagation Happens in RNNs

Backpropagation in RNNs involves a special version of backpropagation called Backpropagation Through Time (BPTT) because RNNs operate over sequences and share weights across time steps.

Steps in BPTT:
1. Unrolling the RNN:
• The RNN is “unrolled” across all time steps, creating a structure that resembles a deep feedforward network where each time step corresponds to one layer.
2. Forward Pass:
• Compute the hidden states ( h_t ) and outputs ( y_t ) at each time step using the input sequence and the shared weights.
3. Compute Loss:
• Calculate the total loss, typically as the sum of losses at each time step:

L = \sum_{t=1}^{T} \mathcal{L}(y_t, \hat{y}_t)

Where \hat{y}_t is the true target at time t .
4. Backward Pass (Gradient Computation):
• Gradients are propagated backward through time using the chain rule. The gradient of the loss with respect to the shared weights is computed by summing over all time steps:

\frac{\partial L}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial \theta}

Where \theta represents the shared weights W, U, V, b, c .
5. Update Weights:
• After computing gradients, update the weights using gradient descent or an optimizer like Adam.

Differences Between Backpropagation in FFNN and RNN:

| Aspect | FFNN | RNN |
| --- | --- | --- |
| Structure | No temporal connections; inputs processed independently | Includes temporal connections; inputs processed sequentially |
| Gradient flow | Gradients flow through one layer (or time step) | Gradients flow through time (via hidden states) |
| Backpropagation | Gradients computed for each layer independently | Gradients computed across all time steps using BPTT |
| Weight sharing | Each layer has unique weights | Weights are shared across time steps |
| Challenges | Vanishing/exploding gradients are less pronounced | More prone to vanishing/exploding gradient problems |

Challenges with BPTT
1. Vanishing/Exploding Gradients:
• Gradients can diminish (vanish) or grow exponentially (explode) as they are propagated backward through time. This is particularly problematic for long sequences.
2. Computational Cost:
• Unrolling the RNN and computing gradients for long sequences can be computationally expensive.
3. Long-Term Dependencies:
• Standard RNNs struggle to capture long-term dependencies due to the vanishing gradient problem.

Solutions to Address BPTT Challenges
1. Gradient Clipping:
• Restricts the gradients to a maximum value to prevent exploding gradients.
2. Use of Specialized Architectures:
• LSTMs (Long Short-Term Memory networks) and GRUs (Gated Recurrent Units) mitigate vanishing gradients by incorporating gates that help preserve long-term information.
3. Truncated BPTT:
• Instead of unrolling the RNN for the entire sequence, truncate it to a fixed number of time steps, reducing computational cost.
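
As an illustration of the gradient clipping mentioned in point 1, here is a minimal PyTorch sketch of one training step; the model, input, and dummy loss are placeholders for illustration only.

import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 20, 8)                  # batch of 4 sequences, 20 time steps each
out, _ = model(x)
loss = out.pow(2).mean()                   # dummy loss, for illustration

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the gradient norm
optimizer.step()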

Summary

Recurrent Neural Networks process sequential data by maintaining hidden states, making them ideal for tasks where temporal or sequential dependencies exist. Backpropagation in RNNs is done through Backpropagation Through Time (BPTT), which unrolls the network over time steps and computes gradients. However, BPTT is prone to vanishing and exploding gradients, which can be addressed using techniques like gradient clipping, truncated BPTT, or advanced architectures like LSTMs and GRUs.

45
Q

Discuss the general idea of a Neural network

A

A neural network is a sequence of applications of non-linear functions to weighted sums of the previous layer's outputs.

So if we want the derivative of the final function z with respect to a given weight w, we need the derivative of a composition f(g(w)), which by the chain rule is f'(g(w)) · g'(w).

46
Q
A

First, our goal is to compute the derivative of the error function with respect to a given weight (dE(w)/dw_ij). This in turn requires the derivative of the network's output with respect to the targeted weight (dg(x_n, w)/dw_ij).

48
Q

What is an efficient way to execute the backpropagation technique?

A

Use the forward-backward pass. In the forward pass you evaluate the output of each layer; from the weighted sum of a neuron's inputs and its output you can compute the local derivative of that neuron. Then you multiply all those local derivatives along a path, which gives the derivative of the error. This can be done in parallel.

49
Q

What are the sequential data problems in deep learning?

A

One-to-one: fixed-size input to fixed-size output (e.g., image classification)

One-to-many: sequence output (e.g., image captioning takes an image and outputs a sentence)

Many-to-one: sequence input (e.g., sentiment analysis)

Many-to-many: sequence input and sequence output (e.g., machine translation, English -> French)

Many-to-many (synced): synchronized sequence input and output (e.g., video classification, labelling each frame of the video)

50
Q

How do we feed an image to a neural network?

A

Column-wise unfolding

51
Q

What is the geometric interpretation of a linear image classifier?

A

The geometric interpretation of a linear image classifier relates to how it separates classes in the feature space using a hyperplane. Here’s a breakdown of this interpretation:

  1. Linear Classifier Overview

A linear classifier predicts the class of an input based on a linear decision boundary. For an input \mathbf{x} , the classifier computes a score for each class using a linear transformation:

z = W \mathbf{x} + \mathbf{b}

Where:
• W is the weight matrix.
• \mathbf{x} is the input (e.g., flattened image features).
• \mathbf{b} is the bias vector.
• z represents the class scores (logits).

The class with the highest score (e.g., from a softmax output) determines the prediction.
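
As a tiny numerical illustration of this scoring rule, consider the following NumPy sketch; the dimensions and random values are purely illustrative.

import numpy as np

W = np.random.randn(3, 4)        # one weight row (hyperplane normal) per class
b = np.random.randn(3)           # one bias per class
x = np.random.randn(4)           # flattened input features

z = W @ x + b                    # class scores (logits)
print("predicted class:", int(np.argmax(z)))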

  2. Geometric Interpretation

Feature Space
• Each input \mathbf{x} can be thought of as a point in a high-dimensional feature space (e.g., \mathbb{R}^n ).
• The weights W define a set of linear decision boundaries (hyperplanes) in this space.

Decision Boundaries
• A decision boundary is the locus of points where the classifier is indifferent between two classes (i.e., where their scores are equal). For two classes i and j :

W_i \mathbf{x} + b_i = W_j \mathbf{x} + b_j

Simplifying:

(W_i - W_j) \cdot \mathbf{x} + (b_i - b_j) = 0

This equation represents a hyperplane that separates the feature space into regions assigned to each class.

Regions in Space
• The feature space is divided into convex regions, one for each class.
• Points on one side of a hyperplane are classified into one class, and points on the other side into another class.

Visualization (2D Case)

In a 2D feature space:
• The hyperplane is a line.
• The weight vector \mathbf{w} (normal to the hyperplane) points toward the direction of increasing score for a class.
• The bias b shifts the hyperplane.

For higher dimensions:
• The hyperplane generalizes to a plane (3D) or a higher-dimensional surface.

  3. Implications for Images

When using images as inputs:
• The feature space corresponds to the pixel space or a reduced feature space if the input passes through a feature extractor (e.g., CNN).
• The hyperplanes separate image features (e.g., texture, edges) into regions associated with specific classes.

  4. Limitations of Linear Classifiers

Linear classifiers assume that the data is linearly separable:
• If classes cannot be separated by straight hyperplanes (e.g., spiral patterns or concentric circles), the classifier struggles.
• Non-linear transformations (e.g., neural network layers) are often applied to map the data to a feature space where classes are linearly separable.

  5. Summary of Geometric Interpretation
    • A linear image classifier separates classes in the feature space using hyperplanes.
    • The weight vectors W define the orientation of these hyperplanes, while biases b determine their position.
    • The feature space is divided into regions, with each region corresponding to a specific class.

For simple, linearly separable data, the geometric interpretation is intuitive and effective. However, for complex datasets like images, deeper models with non-linear transformations are often required to learn meaningful feature spaces.

52
Q

Why is it important to bear in mind how large an image is when implementing a CNN?

A

Because the whole batch and the corresponding activations have to be stored in memory

53
Q

What are some of the challenges in image classification?

A

Images are very high dimensional data

A label might not uniquely identify the image

There are many transformations that change the image dramatically while not changing its label

Images in the same class might be drastically different

Perceptual similarity in images is not related to pixel similarity

54
Q

What is maximum likelihood?

A

Choose the parameters which maximize the probability of the data.

You want the hypothesis under which the observed data is most likely.

We choose the model that maximizes the likelihood of the data.

55
Q

Can the maximum likelihood be applied to any model? How can we maximize the likelihood for a model?

A

Yes.

To maximize the likelihood of a model you can use:
- analytical techniques (e.g., solving the equations in closed form)
- optimization techniques (e.g., Lagrange multipliers)
- numerical techniques (e.g., gradient descent)

56
Q

What property do we automatically obtain by using GAP instead of Flatten?

A

Features extracted by the convolutional part of the network become invariant to shifts of the input image

57
Q

What are the inception modules?

A

They are local modules in which multiple convolutions are run in parallel, so that multiple filter sizes are exploited at the same level and then merged by concatenation.

The blocks preserve the spatial dimensions, so the outputs can be concatenated depth-wise.

58
Q

Why do we use 1x1 convolutions in inception blocks before applying the 3x3 or 5x5 convolutions?

A

Because this way we can reduce the depth of the input while preserving the spatial dimensions, so the number of operations needed to perform the 3x3 and 5x5 convolutions is much lower (see the sketch below).
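
As a rough illustration of this saving, the following back-of-the-envelope Python sketch compares multiply-accumulate counts for a direct 5x5 convolution versus a 1x1 bottleneck followed by the same 5x5 convolution; the feature-map size and channel counts are illustrative assumptions, not taken from a specific Inception configuration.

H, W_, C_in, C_out, C_mid = 28, 28, 192, 32, 16

direct = H * W_ * C_out * (5 * 5 * C_in)                       # 5x5 conv on the full depth
bottleneck = (H * W_ * C_mid * (1 * 1 * C_in)                  # 1x1 conv reduces the depth first
              + H * W_ * C_out * (5 * 5 * C_mid))              # then 5x5 conv on the reduced depth
print(f"direct: {direct:,} MACs, with 1x1 bottleneck: {bottleneck:,} MACs")
# ~120M vs ~12M multiply-accumulates: roughly a 10x reduction in this example.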

59
Q

What are residual connections, first introduced by the ResNet architecture?

A

Residual connections, first introduced in the ResNet (Residual Network) architecture by He et al. in 2015, are a key innovation that addressed the issue of vanishing gradients and degradation in very deep neural networks. These connections enable the training of much deeper networks by allowing the model to learn residual mappings instead of directly learning the desired output.

What are Residual Connections?

A residual connection is a shortcut or skip connection that bypasses one or more layers in a neural network, directly adding the input of a layer to its output. This can be mathematically expressed as:

y = F(x) + x

Where:
• x: The input to the block (also referred to as the identity mapping).
• F(x): The output of the layers (e.g., convolutional layers, batch normalization, and activation functions) that the residual connection bypasses.
• y: The output of the block after the addition of F(x) and x.

Instead of trying to learn the full mapping H(x), the network learns a residual mapping F(x) = H(x) - x, which makes it easier for the network to optimize.

Why Use Residual Connections?

Residual connections solve two major issues associated with training deep networks:
1. Vanishing Gradient Problem:
• In very deep networks, gradients during backpropagation tend to shrink as they pass through multiple layers, leading to ineffective updates in early layers. This problem worsens as the network depth increases.
• Residual connections help gradients flow more easily back through the network by providing a direct path for the gradient, mitigating the vanishing gradient problem.
2. Degradation Problem:
• As networks become deeper, adding more layers does not always improve accuracy and can even degrade performance due to difficulties in optimizing the deeper layers.
• By introducing residual connections, deeper layers can learn corrections (residuals) rather than the entire mapping, making it easier to optimize the network.

How Do Residual Connections Work?
• A residual connection skips over a block of layers, allowing the input to bypass these layers and be directly added to the block’s output.
• The idea is that if the deeper layers cannot improve performance, the network can simply learn an identity mapping (i.e., F(x) = 0), and the residual connection ensures that the input x is propagated forward unchanged.
• This flexibility allows deeper networks to avoid degradation and achieve better performance.

Residual Block Design

A basic residual block in ResNet consists of:
1. A few layers (typically convolutional layers) with batch normalization and ReLU activations.
2. A skip connection that adds the input of the block to the output of the layers.

Example:

For a residual block with two convolutional layers:
1. Input x is passed through:
• First convolutional layer → Batch normalization → ReLU.
• Second convolutional layer → Batch normalization.
2. The result of these layers, F(x), is added to the input x (via the residual connection):

y = F(x) + x

3. A final ReLU activation is applied to y.

Types of Residual Connections
1. Identity Mapping:
• The input x is added directly to F(x) without any modifications.
• Common when the input and output have the same dimensions.
2. Projection Shortcut:
• Used when the dimensions of x and F(x) are different (e.g., due to stride in convolutional layers).
• A 1x1 convolution is applied to x to match dimensions before adding it to F(x).
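
A minimal PyTorch sketch of a basic residual block with an identity shortcut is shown below; the channel count and layer configuration are illustrative, not the exact ResNet specification.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))   # first conv -> BN -> ReLU
        out = self.bn2(self.conv2(out))            # second conv -> BN
        out = out + x                              # residual (skip) connection
        return self.relu(out)                      # final ReLU after the addition

y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))
print(y.shape)   # torch.Size([1, 64, 32, 32]) -- spatial size and depth preserved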

Advantages of Residual Connections
1. Enables Training of Very Deep Networks:
• ResNet successfully trained networks with over 1000 layers, which was not feasible before due to optimization difficulties.
2. Improves Gradient Flow:
• The direct path created by residual connections ensures that gradients can flow backward effectively during backpropagation, even in very deep networks.
3. Prevents Degradation:
• Residual connections reduce the risk of performance degradation in deeper networks by allowing the network to default to identity mappings if additional layers are not useful.
4. Efficient Learning:
• The network focuses on learning residuals (corrections) rather than the full mapping, which is simpler and more efficient.

Applications of Residual Connections
1. Image Classification:
• Residual connections are a core component of ResNet architectures, which achieved state-of-the-art results on benchmarks like ImageNet.
2. Object Detection and Segmentation:
• Residual connections are widely used in models like Faster R-CNN and Mask R-CNN.
3. Natural Language Processing (NLP):
• Variants of residual connections are used in transformer architectures like BERT and GPT.
4. Generative Models:
• Residual connections are utilized in GANs and autoencoders to improve stability and performance.

ResNet Variants
1. ResNet-18, ResNet-34: Shallower ResNets with fewer layers.
2. ResNet-50, ResNet-101, ResNet-152: Deeper ResNets with bottleneck layers for computational efficiency.
3. Wide ResNet (WRN): Increases the number of filters in each layer for wider residual blocks.
4. ResNeXt: Incorporates group convolutions for more efficient feature extraction.

Conclusion

Residual connections introduced by ResNet revolutionized deep learning by enabling the successful training of very deep networks. By learning residual mappings and ensuring efficient gradient flow, these connections overcome challenges like vanishing gradients and degradation, allowing networks to scale in depth and achieve high performance across a variety of tasks.

60
Q

How could we interpret the validation error when training a NN?

A

It is an estimation of the generalization error

61
Q

What is image segmentation? What are its goals? What are its types?

A

Identify pixels that “go together”
- group together similar-looking pixels for efficiency
- separate images into coherent objects

There are two types of segmentation:
- unsupervised
- supervised (or semantic)

62
Q

What is the composition of a training set for image segmentation?

A

The training set is made of pairs (I, GT), where GT is a pixel-wise annotation of the image over the categories in the label set

63
Q

What happens if, in order to maintain the input dimensions, we only used convolutions for image segmentation?

A

It would be very inefficient since we would have very small receptive fields

64
Q

What conflicting requirements does semantic segmentation impose?

A

On the one hand, we need to go deep to extract high-level information from the image. On the other hand, we want to stay local so as not to lose spatial resolution in the predictions.

65
Q

What is in general a good preprocessing strategy?

A

To bring the data to the origin and have similar ranges (variance) for the feature components

66
Q

What are the common preprocessing steps applied to CNNs?

A

Zero-center the data

Normalize every pixel

67
Q

Why couldn't recurrent architectures remember more than 20 steps in the past?

A

Because of the problem of vanishing gradients

68
Q

What can you do to avoid the problem of dying neurons caused by ReLU?

A

Use one of its variants (ELU, Leaky ReLU)

Reduce the learning rate

69
Q

When training a model that both predicts the bounding box coordinates and classifies the image, we cannot treat the alpha weighting the two terms in the loss function as a hyperparameter. Why not?

A

Because changing alpha changes the loss function itself, so losses obtained with different alpha values are measured on different objectives and cannot be compared to select between models.

70
Q

How can we create class activation maps?

A

We can create models that output class activation maps by changing the last part of a CNN to a Global Average Pooling (GAP) layer followed by an output layer with one neuron per class, as sketched below.
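
A minimal Keras sketch of this head replacement might look as follows; the convolutional layers, sizes, and input shape are illustrative assumptions.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, GlobalAveragePooling2D, Dense

num_classes = 10
model = Sequential([
    Conv2D(64, 3, activation="relu", padding="same", input_shape=(32, 32, 3)),
    Conv2D(128, 3, activation="relu", padding="same"),   # last conv feature maps
    GlobalAveragePooling2D(),                            # one value per feature map
    Dense(num_classes, activation="softmax"),            # these weights are reused to build the CAM
])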

71
Q

What do we mean by models that only analyze static datasets?

A

Models that do not have memory. They are deterministic models for which the history of the input does not matter

72
Q

Memoryless models do not have memory?

A

They have memory, but it is limited

73
Q

When do we say a model has memory?

A

When it is able to represent an unlimited amount of time in the past

74
Q

How do we represent memory?

A

As a state. And you use this state to build a representation of the history

75
Q

How does a recurrent neural network keep a history?

A

It uses a loop to build a state/memory

76
Q

What are autoregressive models? What about linear models with fixed lag?

A

Autoregressive models aim to predict the next input based on the previous ones.

Linear models with fixed lag predict an output from a linear combination of a fixed window (lag) of past inputs.

They are both linear, memoryless models.

77
Q

What are the non-linear versions of autoregressive models and linear models with fixed lag?

A

Autoregressive models: take the desired number of previous inputs, feed them into a FFNN, and predict the next value. No recurrence is needed.

78
Q

Are recursions necessary to deal with sequences?

A

No. We could use a feed-forward neural network for tasks that do not require looking too many steps into the past.

79
Q

Are Recurrent Neural Networks deterministic? What does it mean?

A

Yes, they are. It means that if you provide the same input twice, starting from the same initial configuration, you get the same result.

80
Q

What is the problem in using back propagation as training method for RNN? What is the solution to it?

A

The problem is the loop in the memory (recurrent) neurons.

To solve it, we can use the backpropagation through time technique, where the loop is unrolled over time so that each time step becomes a copy of the network receiving the previous state as input.

What is Backpropagation Through Time (BPTT)?

Backpropagation Through Time (BPTT) is an extension of the backpropagation algorithm used to train recurrent neural networks (RNNs). It handles the sequential nature of RNNs by unrolling the network through time, treating it as a multi-layer feedforward network where each layer corresponds to a time step.

This approach computes gradients of the loss function with respect to the weights by propagating errors backward through both time steps and the network layers.

How Does BPTT Work?
1. Unrolling the Network:
• An RNN is “unrolled” for a given sequence length T , where each time step t is treated as a separate layer with shared weights.
• For a sequence of inputs \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T , the hidden state \mathbf{h}_t is updated recursively:

\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t; \mathbf{W})

where \mathbf{W} represents the shared weights.
2. Forward Pass:
• The network processes the entire sequence, computing hidden states \mathbf{h}_t and predictions \hat{\mathbf{y}}_t at each time step.
• The loss is computed for each time step:

\mathcal{L} = \sum_{t=1}^T \mathcal{L}_t

3.	Backward Pass (Error Propagation):
•	The error at the final time step  T  is propagated backward through time to earlier steps.
•	Gradients are computed for each time step and accumulated:

\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \sum_{t=1}^T \frac{\partial \mathcal{L}_t}{\partial \mathbf{W}}

4.	Weight Update:
•	The computed gradients are used to update the shared weights  \mathbf{W}  using an optimization algorithm (e.g., gradient descent or Adam).

Mathematics of BPTT

Gradient Computation

The loss function at time t depends not only on the input \mathbf{x}_t but also on the hidden states from previous time steps. The total gradient of the loss with respect to the weights \mathbf{W} involves the recursive dependency:

\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \sum_{t=1}^T \frac{\partial \mathcal{L}_t}{\partial \mathbf{h}_t} \cdot \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}}

Recursive Dependency

The gradient \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}} depends on all previous time steps due to the recurrent structure:

\frac{\partial \mathbf{h}_t}{\partial \mathbf{W}} = \frac{\partial f(\mathbf{h}_{t-1}, \mathbf{x}_t; \mathbf{W})}{\partial \mathbf{W}} + \frac{\partial f(\mathbf{h}_{t-1}, \mathbf{x}_t; \mathbf{W})}{\partial \mathbf{h}_{t-1}} \cdot \frac{\partial \mathbf{h}_{t-1}}{\partial \mathbf{W}}

This recursive dependency is why BPTT “backpropagates” through all time steps.

Challenges of BPTT
1. Vanishing/Exploding Gradients:
• Gradients can shrink or grow exponentially during backpropagation through many time steps, making learning long-term dependencies difficult.
• Solutions:
• Use gated architectures like LSTMs or GRUs.
• Apply gradient clipping for exploding gradients.
2. Computational Cost:
• BPTT is computationally expensive because it requires maintaining dependencies for all time steps in memory.
3. Memory Constraints:
• For long sequences, storing all intermediate states and gradients becomes resource-intensive.

Variants of BPTT
1. Truncated Backpropagation Through Time (TBPTT):
• Instead of backpropagating through the entire sequence, TBPTT limits the number of time steps to a fixed window k .
• For example, instead of backpropagating through all T time steps, gradients are propagated through the last k steps:

\mathbf{h}_t \text{ updates only depend on } \mathbf{h}_{t-k:t}.

•	Benefits:
•	Reduces memory and computational cost.
•	Mitigates gradient vanishing over long sequences.
2.	Online Training:
•	BPTT can be applied incrementally for each input at each time step, rather than waiting for the entire sequence.

Key Advantages of BPTT
1. Handles Sequential Dependencies:
• Captures temporal relationships by considering previous hidden states.
2. General Framework:
• Can be applied to any RNN-based architecture.

Conclusion

BPTT is essential for training RNNs because it accounts for temporal dependencies in sequential data. However, its challenges—especially vanishing/exploding gradients and computational cost—are mitigated through techniques like gated architectures (e.g., LSTMs, GRUs), truncated backpropagation, and gradient clipping. These improvements make BPTT more effective for training modern sequence models.

81
Q

What Are Dynamical Systems?

A

A dynamical system is a system that evolves over time based on specific rules. In the context of machine learning and neural networks, models with memory (e.g., recurrent neural networks, LSTMs, GRUs) are types of dynamical systems because their behavior at any given time depends not only on the current input but also on their internal state, which reflects past inputs. These models are particularly useful for time-series data, sequential data, or systems where temporal dependencies are critical.

How Dynamical Systems (Models with Memory) Work?

Dynamical systems in machine learning are characterized by the following:

  1. Input and State Dependence
    • The output at a given time depends on both:
    • The current input.
    • The current state of the system, which encodes information about past inputs (memory).
  2. State Update Rule
    • At each time step t, the model updates its internal state \mathbf{h}_t based on the current input \mathbf{x}_t and the previous state \mathbf{h}_{t-1}. This update is typically governed by a function:

\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t; \theta)

where:
• f : A non-linear function (e.g., implemented by neural network layers).
• \theta : Model parameters (weights).

  3. Output Generation
    • The system produces an output \mathbf{y}_t , which can depend on both the internal state \mathbf{h}_t and the input \mathbf{x}_t :

\mathbf{y}_t = g(\mathbf{h}_t, \mathbf{x}_t; \phi)

where g is a function (e.g., a fully connected layer).

Types of Memory-Based Models

Here’s how memory is handled in some common dynamical systems used in machine learning:

  1. Recurrent Neural Networks (RNNs)
    • RNNs are the simplest memory-based neural models.
    • State Update Rule:

\mathbf{h}_t = \tanh(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b})

•	Problem: RNNs struggle to retain long-term dependencies due to vanishing/exploding gradient problems during backpropagation.
  2. Long Short-Term Memory Networks (LSTMs)
    • LSTMs improve upon RNNs by introducing gates (forget, input, and output gates) to control the flow of information and selectively retain or discard memory.
    • State Update Rule:
    • The cell state \mathbf{c}_t acts as a long-term memory.
    • Gates allow the network to:
    • Forget irrelevant information.
    • Add new information from \mathbf{x}_t .
    • Output relevant information to \mathbf{h}_t .
  3. Gated Recurrent Units (GRUs)
    • GRUs simplify LSTMs by combining some gates while retaining the ability to capture long-term dependencies.
    • State Update Rule:
    • GRUs have reset gates (control how much past information to forget) and update gates (control how much new information to add).
  4. Transformer Models (Memory in Attention)
    • Transformer models use self-attention mechanisms instead of recurrent connections to model dependencies across sequences.
    • Memory is captured explicitly through attention scores, which determine how much one element of the sequence influences another.

How Dynamical Systems Handle Memory
1. Recurrent Connections:
• In RNNs, LSTMs, and GRUs, recurrent connections between time steps allow the model to propagate state information across the sequence.
2. Cell State or Context Vector:
• LSTMs and GRUs explicitly maintain a cell state or a compressed memory representation that evolves over time.
3. Attention Mechanisms:
• In transformer models, attention layers allow the model to directly access all previous inputs, effectively “remembering” without recurrent connections.

Applications of Dynamical Systems in Machine Learning
1. Time-Series Forecasting:
• Predicting future values based on historical trends (e.g., stock prices, weather patterns).
2. Natural Language Processing:
• Handling sequential data like text (e.g., translation, text generation).
3. Speech Recognition:
• Analyzing audio signals, which are inherently sequential.
4. Control Systems:
• Training models to control systems that evolve dynamically, such as robots or autonomous vehicles.

Key Challenges
1. Vanishing/Exploding Gradients:
• RNNs struggle with long-term dependencies due to the difficulty of propagating gradients over many time steps.
2. Memory Length:
• Models like RNNs have a limited memory capacity, which is why LSTMs and GRUs were introduced to capture longer-term dependencies.
3. Efficiency:
• Sequential models (like RNNs) can be computationally expensive for long sequences compared to parallelized models like transformers.

Geometric Perspective

In dynamical systems, the state space can be seen as a geometric representation of all possible states of the system. As the system evolves (over time steps), it traces a trajectory in this space, influenced by the inputs and the state update rules.

In summary, dynamical systems with memory are designed to process sequential data by maintaining and updating a state that captures past context. Their effectiveness depends on their ability to model dependencies across time, making them essential for tasks involving sequential or time-varying data.

82
Q

RNNs with BPTT have infinite memory? Why? What solution was proposed then to solve this problem?

A

In theory yes, but in practice no. This is because of the vanishing gradient problem: gradients shrink as they are multiplied across many time steps in these deep unrolled models. One solution proposed to increase the memory capacity of these networks was to use ReLU or linear functions as the activation function in the recurrent neurons.

83
Q

What is the principle of Long Short-Term Memory models? How does it solve the problem of vanishing gradient?

A

Input gate: at each time step, based on the current state of the memory and the current input, the network decides whether or not to write something into the memory.

Forget gate: at each time step, based on the current state of the memory and the current input, the network decides whether or not to erase something from the memory.

Output of the memory: a non-linear function of the memory content; the network decides whether or not to output (read) this content.

Principle of Long Short-Term Memory (LSTM) Models

Long Short-Term Memory (LSTM) models are a special type of recurrent neural network (RNN) architecture designed to learn long-term dependencies in sequential data. They were introduced by Hochreiter and Schmidhuber (1997) to address the vanishing gradient problem that standard RNNs face during training.

Core Idea of LSTMs

The LSTM architecture introduces a memory cell and a set of gates to regulate the flow of information. These gates allow the model to remember or forget information over long sequences, effectively learning dependencies that span across many time steps.

Key Components of LSTMs
1. Cell State (C_t):
• The cell state is the memory of the network. It acts as a highway, carrying information across time steps with minimal modification, which mitigates the vanishing gradient issue.
2. Gates:
• Gates are the core mechanism of LSTMs, allowing the network to decide which information to keep, forget, or output. They are controlled by sigmoid activations, which output values between 0 and 1 (indicating how much information should pass through).
• Forget Gate (f_t):
• Decides what information to discard from the cell state.
• Formula:

f_t = \sigma(\mathbf{W}_f \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)

f_t values close to 0 discard information, while values close to 1 retain it.
• Input Gate (i_t):
• Decides what new information to add to the cell state.
• Formula:

i_t = \sigma(\mathbf{W}_i \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)

•	Candidate Cell State (\tilde{C}_t):
•	Represents the new information to be added.
•	Formula:

\tilde{C}_t = \tanh(\mathbf{W}_c \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)

•	Output Gate (o_t):
•	Decides what part of the cell state to output as the hidden state.
•	Formula:

o_t = \sigma(\mathbf{W}_o \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)

3.	Hidden State (h_t):
•	Represents the current output of the LSTM at time t, derived from the cell state and output gate:

\mathbf{h}_t = o_t \cdot \tanh(C_t)

How LSTMs Work Step by Step
1. Forget Old Information:
• The forget gate decides which parts of the previous cell state (C_{t-1}) to retain or discard.

C_t = f_t \cdot C_{t-1}

2.	Add New Information:
•	The input gate determines how much of the candidate cell state (\tilde{C}_t) should be added to the current cell state.

C_t = C_t + i_t \cdot \tilde{C}_t

3.	Compute Hidden State:
•	The output gate determines what information from the cell state is output as the current hidden state.

\mathbf{h}_t = o_t \cdot \tanh(C_t)
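
Putting the gate equations above together, a minimal NumPy sketch of a single LSTM time step could look like this; the dimensions, random parameters, and the packing of all four gates into one weight matrix are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    z = W @ np.concatenate([h_prev, x_t]) + b          # all four gate pre-activations at once
    H = h_prev.size
    f = sigmoid(z[0:H])                                # forget gate
    i = sigmoid(z[H:2*H])                              # input gate
    o = sigmoid(z[2*H:3*H])                            # output gate
    c_tilde = np.tanh(z[3*H:4*H])                      # candidate cell state
    c = f * c_prev + i * c_tilde                       # update the cell state (memory)
    h = o * np.tanh(c)                                 # hidden state / output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
W = rng.standard_normal((4 * n_hidden, n_hidden + n_in)) * 0.1
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(4):                                     # run a short random sequence
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
print(h.round(3))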

How LSTMs Solve the Vanishing Gradient Problem
1. Cell State as a Memory Highway:
• The cell state (C_t) enables the network to maintain information across many time steps without significant modification.
• Gradients flow directly through the cell state during backpropagation, bypassing non-linear activations like sigmoid or tanh that typically cause gradient vanishing.
2. Gates for Controlled Updates:
• Gates regulate the flow of information in a controlled manner, preventing gradients from becoming too small or too large.
• This prevents the network from overwriting long-term information stored in the cell state.
3. Mitigating Vanishing Gradients:
• During backpropagation, the partial derivatives of the loss with respect to the cell state remain consistent because the forget and input gates allow selective retention of information.
• Gradients are preserved across many time steps, enabling the network to learn long-term dependencies.

Summary of Benefits of LSTMs
• Mitigates vanishing gradient by using the cell state as a persistent memory.
• Learns long-term dependencies effectively, even in very long sequences.
• Can adaptively forget irrelevant information and remember important features via gating mechanisms.
• Handles complex temporal patterns, making it effective for tasks like natural language processing, time-series analysis, and speech recognition.

In essence, the LSTM architecture allows the model to address the limitations of standard RNNs, making it a powerful tool for sequential data tasks.

84
Q

Why would we have different LSTM layers in a model?

A

Having multiple LSTM layers in a model, also known as a stacked LSTM architecture, is a way to build deep recurrent neural networks. Each layer processes the sequence data and passes its output to the next LSTM layer, allowing the network to learn more complex, hierarchical features from the data.

Reasons for Using Multiple LSTM Layers
1. Hierarchical Feature Learning:
• Each LSTM layer can capture different levels of abstraction:
• The lower layers typically extract lower-level features from the input sequence, such as short-term dependencies or simple patterns.
• The higher layers build on these features to learn more complex, long-term dependencies or higher-order representations.
Example: In speech recognition:
• Lower LSTM layers might learn phonemes.
• Higher LSTM layers might combine phonemes to understand words or sentences.
2. Increased Model Capacity:
• Adding more layers increases the capacity of the model, enabling it to model more complex relationships in the data.
• This is particularly useful for tasks with highly non-linear and hierarchical patterns, such as language modeling, machine translation, or video understanding.
3. Improved Long-Term Dependency Modeling:
• Deeper architectures allow the model to learn longer-term dependencies by passing information through multiple layers of processing, which can capture higher-level temporal features.
4. Specialized Layers for Intermediate Processing:
• In some cases, different LSTM layers might focus on processing different aspects of the sequence:
• The first layer processes the raw input (e.g., embedding representations for text or raw sensor data).
• Later layers refine the learned features for specific tasks, such as classification or prediction.
5. Task-Specific Feature Transformation:
• In multi-task learning or sequence-to-sequence models, intermediate LSTM layers might transform features for specific sub-tasks, like encoding in an encoder-decoder framework.

How Stacked LSTMs Work
• Output of a Layer:
The output of each LSTM layer is the sequence of hidden states it produces at every time step; this sequence becomes the input to the next LSTM layer.
• Processing Pipeline:
For a stacked LSTM with N layers:
• The first layer processes the raw input sequence.
• The second layer processes the output of the first layer.
• This process repeats for all layers, with each layer refining the learned representation further.
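
A minimal sketch of this pipeline, assuming PyTorch (the library choice, sizes, and dropout value are illustrative assumptions); nn.LSTM with num_layers > 1 implements exactly this stacking, feeding each layer's hidden-state sequence into the layer above:

import torch
import torch.nn as nn

# Three stacked LSTM layers: layer 1 reads the raw input sequence,
# layers 2 and 3 read the sequence of hidden states from the layer below.
stacked_lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=3,
                       batch_first=True, dropout=0.2)  # dropout between layers helps against overfitting

x = torch.randn(8, 20, 32)            # (batch, time steps, features)
outputs, (h_n, c_n) = stacked_lstm(x)
print(outputs.shape)                  # (8, 20, 64): top layer's hidden state at every time step
print(h_n.shape)                      # (3, 8, 64): final hidden state of each of the 3 layers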

Example of Stacked LSTM

Consider a sequence-to-sequence model for machine translation:
1. Input sentence: “I am learning.”
2. Layer 1: Encodes the sentence into low-level linguistic features (e.g., word relationships).
3. Layer 2: Refines features into phrases or grammatical structures.
4. Layer 3: Maps refined features into higher-level semantic meaning for translation.

Considerations for Using Multiple LSTM Layers
1. Overfitting:
• More layers increase the risk of overfitting, especially with small datasets. Regularization techniques like dropout, weight decay, or early stopping should be applied.
2. Computational Cost:
• Stacked LSTMs are computationally expensive in terms of both memory and training time, as each additional layer increases the number of parameters.
3. Gradient Stability:
• While LSTMs mitigate vanishing gradients, stacking too many layers can still cause gradient issues. Proper initialization and techniques like gradient clipping can help.
4. Depth vs. Performance Trade-Off:
• Adding layers does not always improve performance; the ideal number of layers depends on the dataset and task complexity.

Key Applications of Stacked LSTMs
• Natural Language Processing:
Tasks like machine translation, text generation, and sentiment analysis benefit from stacked LSTMs to learn linguistic patterns at different levels of abstraction.
• Time-Series Analysis:
For tasks like stock price prediction or energy demand forecasting, multiple LSTM layers can capture short-term trends and long-term seasonality.
• Speech Recognition:
Stacked LSTMs can model low-level audio features (e.g., phonemes) and high-level speech structures (e.g., words or sentences).
• Video Understanding:
In tasks like action recognition or video captioning, stacked LSTMs process sequential visual features to learn spatial and temporal dynamics.

Summary

Having multiple LSTM layers in a model enables it to:
1. Learn hierarchical and complex features from sequential data.
2. Capture both short-term and long-term dependencies.
3. Handle complex tasks with structured patterns, like natural language, speech, or video analysis.

While effective, the use of stacked LSTMs requires balancing model depth with considerations like overfitting, computational cost, and task-specific needs.

85
Q

What are bidirectional RNNs?

A

What Are Bidirectional RNNs?

A Bidirectional Recurrent Neural Network (Bidirectional RNN) is a type of RNN architecture that processes input sequences in both forward and backward directions. This allows the model to consider both past (previous inputs) and future (subsequent inputs) context when making predictions.

In a standard RNN, the hidden states are updated sequentially from the beginning to the end of the input sequence, meaning it only captures past dependencies. Bidirectional RNNs address this limitation by combining information from both directions, making them particularly useful for tasks where the context from both the start and end of the sequence is important.

How Does a Bidirectional RNN Work?
1. Two Separate RNNs:
• A Bidirectional RNN consists of two RNNs:
• A forward RNN that processes the input sequence from start to end.
• A backward RNN that processes the input sequence from end to start.
2. Hidden States:
• At each time step t, the hidden states from both the forward RNN (\overrightarrow{h_t}) and the backward RNN (\overleftarrow{h_t}) are concatenated (or combined) to form the final hidden representation:

h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]
3. Output:
• The outputs of the two RNNs are combined (e.g., via concatenation, addition, or averaging) to produce the final output for each time step.

Mathematical Representation

Given an input sequence X = {x_1, x_2, \dots, x_T}:
1. The forward RNN computes:

\overrightarrow{h_t} = \text{RNN}_{\text{forward}}(\overrightarrow{h_{t-1}}, x_t)

2. The backward RNN computes:

\overleftarrow{h_t} = \text{RNN}_{\text{backward}}(\overleftarrow{h_{t+1}}, x_t)

3. The final hidden state at time t is:

h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]

4. The final output is derived from the combined hidden states.
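
A minimal PyTorch sketch of this combination (the library choice and sizes are illustrative assumptions); bidirectional=True runs one recurrent pass forward and one backward and concatenates their hidden states at every time step:

import torch
import torch.nn as nn

birnn = nn.LSTM(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)

x = torch.randn(4, 10, 16)    # (batch, time steps, features)
h, _ = birnn(x)
print(h.shape)                # (4, 10, 64): [forward 32 ; backward 32] concatenated per time step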

Advantages of Bidirectional RNNs
1. Captures Context in Both Directions:
• Unlike standard RNNs, Bidirectional RNNs use both past and future context, which is critical for understanding relationships within the entire sequence.
Example: In a sentence like “The cat chased the mouse”, understanding the word “chased” requires knowledge of both “The cat” (past) and “the mouse” (future).
2. Improved Accuracy for Sequential Tasks:
• Bidirectional RNNs are particularly effective for tasks that require analyzing the entire sequence, such as natural language understanding and speech processing.
3. Better Handling of Ambiguities:
• When the context of an input depends on future inputs (e.g., word sense disambiguation in NLP), Bidirectional RNNs help disambiguate more effectively.

Disadvantages of Bidirectional RNNs
1. Increased Computational Cost:
• Since the model processes the sequence in both directions, the computational and memory requirements are roughly doubled compared to a unidirectional RNN. In addition, the full input sequence must be available before the backward pass can run, so bidirectional RNNs are not suited to online or streaming prediction.

86
Q

What are the ways we could specify the initial state of an RNN?

A

Specifying the initial state of an RNN is important as it serves as the starting point for the hidden states that are updated throughout the sequence processing. The way the initial state is specified can significantly impact the model’s performance, especially in tasks where prior information or context is crucial. Below are the common ways to initialize the state of an RNN:

  1. Zero Initialization
    • Description: The most common and straightforward approach is to initialize the initial state (h_0) to a vector of zeros.
    • Formula:

h_0 = \mathbf{0}

•	Advantages:
•	Simple and computationally inexpensive.
•	Often sufficient when the RNN’s hidden state can “warm up” by processing the sequence.
•	Disadvantages:
•	May not be effective for sequences where prior knowledge or context is essential.
•	Can slow down convergence in certain tasks.
  2. Random Initialization
    • Description: The initial state is initialized with small random values, often drawn from a normal or uniform distribution.
    • Advantages:
    • Adds variability and breaks symmetry in training.
    • Disadvantages:
    • The randomness may introduce instability, especially in tasks sensitive to initial conditions.
  3. Trainable Parameters
    • Description: The initial state is treated as a learnable parameter, updated during training alongside the model weights.
    • Advantages:
    • Allows the model to learn an optimal starting point for the hidden states.
    • Useful in tasks where a fixed initial state needs to encode task-specific information.
    • Disadvantages:
    • Adds additional parameters, increasing the model’s complexity.
    • Can lead to overfitting if not regularized properly.
  4. Pre-trained or Contextual Initialization
    • Description: The initial state is set based on pre-trained features or contextual information from another model or dataset.
    • For example, in sequence-to-sequence models, the initial state of the decoder RNN is often set to the final hidden state of the encoder RNN.
    • Another example is using embeddings or features learned from another network as the initial state.
    • Advantages:
    • Allows the RNN to start with a meaningful representation of the context.
    • Essential for models like encoder-decoder architectures (e.g., in machine translation or summarization).
    • Disadvantages:
    • Requires additional computation or a pre-trained model to generate the initial state.
  5. Application-Specific Initialization
    • Description: The initial state is tailored to the specific task or data. For example:
    • In time-series forecasting, the initial state may encode seasonal patterns or trends.
    • In personalized models, the initial state could represent user-specific information.
    • Advantages:
    • Aligns with the domain requirements and can improve task performance.
    • Disadvantages:
    • Requires careful design and additional domain knowledge.
    • May introduce complexity during implementation.
  6. Batch-Specific Initialization
    • Description: The initial state is determined dynamically for each batch of input sequences, such as:
    • Using a summary of the input sequence (e.g., mean or max-pooling over the embeddings of the sequence).
    • Computing the initial state from auxiliary inputs or features.
    • Advantages:
    • Provides adaptive initialization for each input sequence, improving generalization.
    • Disadvantages:
    • Adds computational overhead to compute dynamic initial states.
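
A minimal PyTorch sketch (an illustrative assumption, not prescribed by the card) contrasting zero initialization with a trainable initial state:

import torch
import torch.nn as nn

class RNNWithLearnedInit(nn.Module):
    def __init__(self, input_size=8, hidden_size=32):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        # Trainable initial state: a learnable parameter updated by backpropagation.
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden_size))

    def forward(self, x, h0=None):
        if h0 is None:
            # Expand the learned initial state to the current batch size.
            h0 = self.h0.expand(1, x.size(0), -1).contiguous()
        return self.rnn(x, h0)

x = torch.randn(4, 12, 8)
model = RNNWithLearnedInit()
out_learned, _ = model(x)                       # trainable initialization
out_zero, _ = model(x, torch.zeros(1, 4, 32))   # explicit zero initialization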

Which Method to Choose?

The choice of initialization method depends on the task, data, and architecture:
• Zero Initialization: Suitable for most standard tasks unless prior context is crucial.
• Trainable Initialization: Effective for tasks requiring a learned starting point.
• Pre-trained Initialization: Best for models like encoder-decoder architectures or transfer learning.
• Task-Specific Initialization: Ideal for specialized applications like time-series forecasting or personalized models.

By carefully selecting or designing the initial state, the RNN’s performance and convergence can be improved significantly for various applications.

87
Q

What do we mean by sequential data?

A

Data in which the order of the elements matters: the ordering itself carries information that the model can exploit (e.g., words in a sentence or successive measurements in a time series).

88
Q

What are language models? What about conditional language models?

A

Text is a sequence of words, and a language model is one that tries to capture the probability distribution over all the sentences of the language. A conditional language model instead tries to model the probability distribution over all possible sentences of the language given another sentence (for example, the one we would like to translate).
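
Concretely (a standard formulation, added here for clarity rather than quoted from the card), a language model factorizes the probability of a sentence y = (y_1, \dots, y_T) as

P(y) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1})

while a conditional language model, e.g. for translation, models

P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)

where x is the source sentence.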

89
Q

How can we model sequence to sequence translations?

A

We can model them with encoder-decoder architectures: the encoder turns the input into a vector, and the decoder uses this source representation to generate the target sentence.

90
Q

What is the goal behind the encoding part of a encoder-decoder architecture for sequence to sequence model?

A

It tries to capture the meaning of the input sentence

91
Q

Explain the encoder-decoder architecture for seq2seq models

A

We first create a vector representation of the input phrase (what it means). Then, based on that meaning, the decoder asks: what is the first word to output? Then, given the word that was just produced and the meaning of the sentence being translated, it asks what the next word to output is, and so on until an end-of-sequence token is generated.
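
A minimal sketch of this idea, assuming PyTorch; the architectural details (GRU encoder and decoder, embedding and hidden sizes) are illustrative assumptions. The encoder compresses the source into a vector, which initializes the decoder; at each step the decoder takes the previously produced word and emits scores for the next word:

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encode the source sentence into a single vector (the final hidden state).
        _, h = self.encoder(self.src_emb(src))
        # Decode: the source representation is the decoder's initial state,
        # and each step is conditioned on the previous target token.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), h)
        return self.out(dec_out)    # (batch, target length, target vocab): next-word scores per step

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))     # a batch of source sentences (token ids)
tgt_in = torch.randint(0, 1200, (2, 9))  # previous target tokens fed to the decoder
logits = model(src, tgt_in)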

92
Q

What are the differences in complexity between greedy decoding, exact decoding, and beam search decoding?

A

With the greedy approach, at each step we pick the single most likely token given the output produced so far plus the context. This gives a complexity that is linear in the length of the sequence.

In the exact approach we would have to compute the probability of all possible sentences, so the complexity is exponential in the length of the sentence.

In beam search, instead, we keep a fixed number of the most probable partial sentences and extend them in parallel, dropping the least likely candidates at each step so the complexity does not explode (become exponential). The complexity is therefore linear in the sequence length, with the beam width acting as a constant branching factor.
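
A minimal sketch of the two practical strategies, assuming a helper next_token_logprobs(prefix) that returns the log-probabilities of the next token given the prefix and the encoded context (that helper name, the eos id, and the beam size are purely illustrative):

import numpy as np

def greedy_decode(next_token_logprobs, max_len, eos=0):
    # One choice per output position -> linear in the output length.
    prefix = []
    for _ in range(max_len):
        tok = int(np.argmax(next_token_logprobs(prefix)))
        prefix.append(tok)
        if tok == eos:
            break
    return prefix

def beam_search_decode(next_token_logprobs, max_len, beam_size=4, eos=0):
    # Keep only the `beam_size` most probable partial sentences at every step,
    # so the cost stays linear in length with the beam size as a constant branching factor.
    beams = [([], 0.0)]                       # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:  # finished hypotheses are carried over unchanged
                candidates.append((prefix, score))
                continue
            logprobs = next_token_logprobs(prefix)
            for tok in np.argsort(logprobs)[-beam_size:]:   # expand only the top tokens
                candidates.append((prefix + [int(tok)], score + float(logprobs[tok])))
        # Prune back to the beam width instead of keeping every expansion (which would grow exponentially).
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[1])[0]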

93
Q

How do we calculate the error function in the seq2seq models?

A

We compute the cross-entropy at each step, because it is a classification problem: at each step the model has to predict the right word out of the vocabulary. The per-step cross-entropies are then summed over the entire sequence.

Also, at training time, at each token prediction the model is fed the correct previous token rather than its own prediction, in addition to the context vector (teacher forcing). In this way, the sequence becomes many separate classification problems.
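
A minimal PyTorch sketch of this loss (the tensors are random stand-ins for real decoder outputs, and the start-of-sequence token id of 1 is an assumption of the example): one classification problem per step, summed over the sequence, with the gold previous tokens fed to the decoder (teacher forcing):

import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 5, 100
logits = torch.randn(batch, seq_len, vocab)          # decoder scores at every step (stand-in)
targets = torch.randint(0, vocab, (batch, seq_len))  # gold next tokens

# Teacher forcing: the decoder input at step t is the gold token from step t-1,
# not the model's own previous prediction.
decoder_inputs = torch.roll(targets, shifts=1, dims=1)
decoder_inputs[:, 0] = 1   # assumed start-of-sequence id (these inputs would produce `logits` in a real model)

# Cross-entropy at every step, summed over the whole sequence.
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1), reduction="sum")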

94
Q

Give a basic definition of the attention mechanism

A

Look back over the input when decoding your sequence.

As the encoder processes the input, at each step it produces a representation of the sequence seen so far. When decoding, instead of relying only on the final embedding, the decoder can also look back at these past representations, some of which might be important for the current step.

Focusing on the most relevant part is attention: a mechanism for selecting, among a set of items, which ones are most relevant to the current state.

95
Q

How can we implement the attention mechanism?

A

It uses an attention function which, given an encoder state at a certain time step and the current decoder state, tells you how relevant that past item is with respect to the current state. The resulting scores are normalized with a softmax and used to take a weighted average of the encoder states, which is passed to the decoder.

The attention mechanism is also trainable.
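
A minimal NumPy sketch of this idea (dot-product scoring is assumed here; the next card lists the alternatives): score each encoder state against the current decoder state, normalize with a softmax, and take the weighted average as the context vector:

import numpy as np

def attention(decoder_state, encoder_states):
    # encoder_states: (T, d), decoder_state: (d,)
    scores = encoder_states @ decoder_state    # relevance of each past item w.r.t. the current state
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()          # softmax over the input positions
    context = weights @ encoder_states         # weighted average of the encoder states
    return context, weights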

96
Q

Why is it good that the attention mechanism outputs an averaged (weighted) vector?

A

Because it allows you to spread your attention over words that belong together: two tokens that represent a single concept, like "neural networks".

97
Q

What are the possible ways we could compute the attention scores?

A

The attention functions could be (a small sketch of each is given below):

  • Simple dot product. Does not add any additional parameters to train.
  • Bilinear function (Luong attention). More flexible; increases the number of parameters, but remains linear.
  • Multi-layer perceptron (Bahdanau attention). Adds even more parameters and also adds non-linearity. It takes the current state and the past state for which you want the attention score, concatenates them, and feeds the result to a small neural network.
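
A minimal NumPy sketch of the three scoring functions (the matrix W and the vector v are illustrative stand-ins for parameters that would be learned during training):

import numpy as np

def dot_score(s, h):
    # Simple dot product: no extra parameters.
    return s @ h

def bilinear_score(s, h, W):
    # Luong-style bilinear score: one extra weight matrix, still linear in s and h.
    return s @ W @ h

def mlp_score(s, h, W, v):
    # Bahdanau-style score: concatenate the two states and feed them through
    # a small feed-forward network with a non-linearity.
    return v @ np.tanh(W @ np.concatenate([s, h]))
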
98
Q

What are the core algorithms for chatbots? What about the different ways of context handling for those chatbots?

A

Generative models (which generate the reply token by token) and retrieval-based models (which select the best reply from a set of candidate responses).

Context handling varies from single-turn (only the current user message is considered) to multi-turn (previous turns of the conversation are taken into account as well).

99
Q

How do the translator models work?

A

You take the input sequence and process it in parallel. Each input token is rewritten taking into account all the other tokens, and this is repeated multiple times. You then have a sequence of token representations (built as described above), and you use all of them to predict the next token.

100
Q

How do we update the encoding of each token in the encoder part of the translator architecture?

A

We use the attention mechanism, but now, instead of using the decoder state to build an attention vector over the input, all the input tokens build attention with each other (self-attention).

101
Q

Why can the transformer architecture be run in parallel?

A

Because the re-encoding of each token is like a feed-forward computation and can be done for all tokens at the same time.

This is achieved by the query, key, value strategy.

Each token is projected into a query, a key, and a value. The query of a token, together with the keys of all tokens, is used to form the attention scores, and the attention is applied over the values.

For each token you use its query (a projection of the token), compute the scores as dot products of the query with the keys, apply a softmax to turn the scores into weights, and use those weights to average the values.
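
A minimal NumPy sketch of this query/key/value computation for all tokens at once (the projection matrices W_q, W_k, W_v are illustrative learned parameters, and the 1/sqrt(d) scaling follows the standard transformer formulation):

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (T, d_model) -- one row per input token.
    Q = X @ W_q    # each token becomes a query,
    K = X @ W_k    # a key,
    V = X @ W_v    # and a value
    scores = Q @ K.T / np.sqrt(K.shape[-1])                    # dot products of queries with keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the tokens
    return weights @ V    # each token is rewritten as a weighted average of the values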

102
Q

In transformers models, once you have the rewritten input, how do we decode it? What is the difference in complexity compared to RNNs?

A

During training we predict all target tokens in parallel, with each position allowed to look only at the tokens before it (it does not look into the future; this is enforced with a causal mask).

Training an RNN scales linearly with the lengths of the source and target sequences, because the time steps must be processed one after another. In transformer training, the number of sequential steps is constant, because all positions can be processed in parallel.
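
A minimal NumPy sketch of the causal ("no looking into the future") mask that makes this parallel training possible; the score values are random and purely illustrative:

import numpy as np

T = 5
scores = np.random.randn(T, T)   # scores[i, j]: how much position i attends to position j

# Causal mask: future positions (j > i) get -inf, so the softmax assigns them zero weight.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))      # lower-triangular: each position attends only to itself and earlier positions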

103
Q

How do we transfer the encoding into the decoding in a transformer architecture?

A

Through cross-attention: the decoder attends over the encoder's output representations, which provide the keys and values, while the decoder states provide the queries. As with the encoding, this part of the decoder can also be run in parallel over all positions during training.