Build, Train and Tune Model Flashcards
Activation Function
A mathematical function applied to the output of each neuron in a neural network to introduce non-linearity and enable the network to learn complex patterns and relationships in the data. Common activation functions include sigmoid, tanh, ReLU (Rectified Linear Unit), Leaky ReLU, and softmax. Activation functions play a crucial role in determining the output of neural networks and affect the network’s training dynamics, convergence speed, and performance.
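A minimal NumPy sketch of several of these functions (deep learning frameworks ship optimized versions; this is for illustration only):

```python
import numpy as np

def sigmoid(x):
    # Squashes inputs to (0, 1); saturates for large |x|.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes inputs to (-1, 1), zero-centered.
    return np.tanh(x)

def relu(x):
    # Rectified Linear Unit: passes positive values, zeroes out negatives.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but keeps a small slope for negative inputs to avoid "dead" neurons.
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    # Converts a vector of scores into a probability distribution (numerically stabilized).
    e = np.exp(x - np.max(x))
    return e / e.sum()
```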
Activation map
Also known as a feature map, an activation map is a two-dimensional array or tensor that represents the output of a layer in a convolutional neural network (CNN) after applying an activation function. Each element in the activation map corresponds to the activation value of a specific neuron in the layer, capturing the presence of certain features or patterns in the input data. Activation maps are used for visualizing and interpreting the learned representations in CNNs and are instrumental in understanding how the network processes and transforms input data.
Adam Optimization
Adam (Adaptive Moment Estimation) is an optimization algorithm commonly used to update the parameters of neural networks during training. It combines the benefits of adaptive learning rate methods (such as RMSprop) and momentum-based optimization techniques to achieve faster convergence and better generalization performance. Adam computes adaptive learning rates for each parameter based on past gradients and stores exponentially decaying averages of past gradients and squared gradients. It is widely used in deep learning frameworks for training various types of neural network architectures.
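A simplified NumPy sketch of the Adam update rule for a single parameter vector, using the commonly cited default hyperparameters; in practice you would call a framework implementation such as torch.optim.Adam(model.parameters(), lr=1e-3):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running averages of gradients and squared gradients."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (RMSprop-like)
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```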
Backpropagation
A fundamental algorithm for training neural networks. It involves computing the gradient of a loss function with respect to the network’s parameters, then using this gradient to update the parameters in the direction that minimizes the loss. This process is repeated iteratively to optimize the network’s performance. Backpropagation enables neural networks to learn from data by adjusting their internal parameters to better approximate the desired output for a given input. For example, in a simple feedforward neural network used for image classification, backpropagation adjusts the weights connecting neurons in each layer to reduce the difference between the predicted class and the actual class of each image in the training dataset.
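A toy example of one backpropagation step for a single linear neuron with a squared-error loss, writing out the chain rule by hand (frameworks automate this via automatic differentiation):

```python
# Toy data point and parameters (illustrative values).
x, y_true = 2.0, 1.0
w, b, lr = 0.5, 0.0, 0.1

# Forward pass.
y_pred = w * x + b
loss = (y_pred - y_true) ** 2

# Backward pass: chain rule gives dloss/dw and dloss/db.
dloss_dypred = 2 * (y_pred - y_true)
dloss_dw = dloss_dypred * x    # dy_pred/dw = x
dloss_db = dloss_dypred * 1.0  # dy_pred/db = 1

# Gradient descent update: move parameters against the gradient.
w -= lr * dloss_dw
b -= lr * dloss_db
```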
Backpropagation through time
An extension of the backpropagation algorithm specifically designed for training recurrent neural networks (RNNs) over sequential data. It unfolds the network through time, treating each time step as a layer, and computes gradients using the chain rule of calculus. BPTT is widely used in tasks such as speech recognition, natural language processing, and time series prediction, where the input data is sequential and has temporal dependencies. However, BPTT suffers from the vanishing gradient problem, where gradients diminish exponentially over long sequences, making it challenging to learn dependencies over extended periods.
Batch Normalization
A technique used to improve the training of deep neural networks by normalizing the input of each layer across mini-batches of data. It reduces internal covariate shift and accelerates convergence by stabilizing the distributions of layer inputs. Batch Normalization helps mitigate issues such as vanishing gradients, enables the use of higher learning rates, and acts as a regularizer, reducing the need for other regularization techniques. It is typically inserted between a layer's linear transformation and its activation function (as in the original formulation), normalizing the pre-activations before they are passed on, although some practitioners apply it after the activation instead.
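A minimal PyTorch sketch (illustrative layer sizes) showing a batch normalization layer inserted between a linear transformation and its activation:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalizes each of the 64 features across the mini-batch
    nn.ReLU(),
    nn.Linear(64, 10),
)
```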
Bayesian hyperparameter optimization
A method used to efficiently search for optimal hyperparameters of machine learning algorithms by modeling the objective function as a probabilistic surrogate model. It leverages Bayesian techniques to iteratively update a probabilistic model of the objective function based on observed evaluations, allowing for more effective exploration of the hyperparameter space.
Purpose: The goal of Bayesian hyperparameter optimization is to find hyperparameters that maximize the performance of a machine learning model while minimizing the number of evaluations required.
Example: In practice, Bayesian optimization is used in tasks such as tuning the hyperparameters of support vector machines, random forests, and deep neural networks, where manually searching the hyperparameter space would be prohibitively time-consuming.
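A minimal sketch of Bayesian optimization, assuming scikit-optimize (skopt) is available; the objective here is cross-validated accuracy of a random forest, and the parameter ranges are illustrative:

```python
from skopt import gp_minimize
from skopt.space import Integer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

def objective(params):
    n_estimators, max_depth = params
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=0)
    # Minimizing the negative accuracy is equivalent to maximizing accuracy.
    return -cross_val_score(model, X, y, cv=3).mean()

search_space = [Integer(10, 200, name="n_estimators"), Integer(2, 10, name="max_depth")]
result = gp_minimize(objective, search_space, n_calls=20, random_state=0)
print("Best hyperparameters:", result.x, "best CV accuracy:", -result.fun)
```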
Bootstrapping
A resampling technique used in statistics and machine learning to estimate the sampling distribution of a statistic by repeatedly sampling with replacement from the observed data. In bootstrapping, multiple samples (bootstrap samples) are drawn from the original dataset, and statistical estimates or models are computed on each bootstrap sample; the variation across these estimates approximates the sampling distribution and can be used to quantify uncertainty (e.g., confidence intervals).
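A small NumPy sketch estimating a 95% confidence interval for the mean of a toy sample via bootstrapping:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # observed sample (toy)

boot_means = []
for _ in range(2000):
    # Resample with replacement, same size as the original sample.
    sample = rng.choice(data, size=len(data), replace=True)
    boot_means.append(sample.mean())

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrap 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```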
Bottleneck layer
Bottleneck layers in neural networks act like compression belts for information flow. Imagine a large crowd trying to squeeze through a narrow tunnel. The bottleneck layer is that tunnel, forcing the network to compress its data into a lower-dimensional representation.
A bottleneck layer has fewer neurons compared to the layers before and after it. This “bottleneck” forces the network to identify the most critical information and discard redundancy. Despite the size reduction, the bottleneck layer aims to capture the essence of the data. It does this by applying filters that highlight the most significant features learned by previous layers. By reducing data size, bottleneck layers make the network more efficient. They require fewer calculations and can help prevent overfitting, a situation where the network memorizes specifics instead of learning general patterns. Often, bottleneck layers are used in conjunction with residual connections. These connections allow the network to bypass the bottleneck entirely and add the original, uncompressed data to the output. This ensures the network retains important details while still benefiting from the efficiency gains.
Bottleneck layers are used, for example, in autoencoders.
Bounding box
A rectangular or cuboidal area used to encapsulate objects or regions of interest in an image or scene. It is defined by its coordinates, typically represented as (xmin, ymin, xmax, ymax) for 2D bounding boxes in image space. Bounding boxes are commonly used in computer vision tasks such as object detection, instance segmentation, and object tracking to localize and identify objects within images or video frames.
Checkpoints in models
Checkpoints are snapshots of a model’s parameters saved during training. These checkpoints include the model’s architecture, weights, optimizer state, and other relevant parameters. Checkpoints are crucial for resuming training from a specific point, fine-tuning models, or deploying trained models for inference. They allow practitioners to monitor training progress, prevent data loss in case of interruptions, and facilitate model evaluation and experimentation.
Checkpoints are saved in standardized formats (e.g., TensorFlow’s SavedModel, PyTorch’s .pth files) and managed using tools like callbacks in TensorFlow or torch.save() in PyTorch. They ensure reproducibility, scalability, and reliability in machine learning workflows.
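A minimal PyTorch sketch of saving and restoring a checkpoint; the dictionary keys are a common convention, not a fixed API:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                  # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
epoch, loss = 5, 0.42                                     # example training state

# Saving: capture everything needed to resume training later.
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": loss,
}, "checkpoint.pth")

# Loading: restore parameters and optimizer state, then continue training.
checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1
```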
Convolving
Convolving refers to the process of applying a convolution operation to input data using a convolutional kernel or filter. In the context of image processing and computer vision, convolution is used to extract features from images by sliding a kernel over the input image and computing the dot product between the kernel and local regions of the image. Convolving is a fundamental operation in convolutional neural networks (CNNs) and is used to detect patterns, edges, textures, and other visual features in images.
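A hand-rolled NumPy example of sliding a 3×3 Sobel kernel over a toy image to produce a feature map (strictly speaking this is cross-correlation, the variant CNNs actually compute; libraries implement it far more efficiently):

```python
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Dot product between the kernel and the local image patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])          # responds strongly to vertical edges
image = np.random.rand(8, 8)              # toy "image"
feature_map = convolve2d(image, sobel_x)  # the resulting feature/activation map
```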
Criterion
An objective function or measure used to evaluate the performance of a model or algorithm. The criterion quantifies how well the model’s predictions match the true outcomes or how effectively the algorithm achieves its objectives. Common criteria in machine learning include loss functions, accuracy, precision, recall, F1-score, and mean squared error. The choice of criterion depends on the specific task, dataset, and optimization goals.
Data Loader
A component or module in machine learning frameworks and libraries used to load, preprocess, and batch input data for training or inference. Data loaders are responsible for reading data from storage (e.g., disk, database), applying data transformations (e.g., normalization, augmentation), and organizing data into batches suitable for efficient processing by machine learning models. Data loaders play a critical role in managing large datasets, handling data pipelines, and optimizing the training process for deep learning models.
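A minimal PyTorch sketch wrapping toy tensors in a dataset and iterating over shuffled mini-batches:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(1000, 20)                 # toy features
y = torch.randint(0, 2, (1000,))          # toy binary labels

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch_X, batch_y in loader:
    # Each iteration yields one shuffled mini-batch ready for a training step.
    pass
```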
Decision boundary
A hypersurface or boundary that separates different classes or categories in the feature space of a classification problem. It represents the region where the decision function changes from predicting one class to another. In binary classification tasks, the decision boundary is typically a line, curve, or hyperplane that partitions the feature space into two regions corresponding to different class labels. Decision boundaries are learned by machine learning algorithms based on the training data and model parameters and are used to make predictions on new or unseen data points.
Denoising autoencoder
A type of artificial neural network used for learning efficient representations of data by removing noise or corruption from input samples. Unlike traditional autoencoders, denoising autoencoders are trained to reconstruct clean or uncorrupted versions of input data from noisy observations. They learn robust features that capture the underlying structure of the data while filtering out irrelevant or noisy information. Denoising autoencoders find applications in dimensionality reduction, feature learning, and unsupervised pretraining in machine learning and deep learning.
Dropout
A regularization technique used in neural networks to prevent overfitting and improve generalization performance. During training, dropout randomly deactivates (sets to zero) a proportion of neurons in a layer with a specified dropout rate. This prevents individual neurons from relying too heavily on specific features or co-adapting with other neurons and encourages the network to learn more robust and generalizable representations. Dropout is commonly used in deep learning models, especially fully connected and convolutional neural networks.
EOS Token
End of sequence (end of sentence). The EOS token acts like a full stop in a sentence: it is a special symbol that signals the end of an output sequence.
During training, the model learns to associate the EOS token with the end of a coherent sentence or translation. When generating text, the model keeps producing words or tokens until it outputs the EOS token. The EOS token is crucial because it allows models to generate sequences of different lengths. Without it, the model wouldn’t have a clear signal for when to stop generating text. In machine translation, the EOS token tells the decoder (the part generating the target language) that the source sentence has been fully processed, and it’s time to wrap up the translation.
Evolutionary hyperparameter optimization techniques
Methods inspired by principles of natural selection and evolution, used to search for optimal hyperparameters in machine learning models. These techniques typically involve the use of evolutionary algorithms, such as genetic algorithms, evolution strategies, or genetic programming, to explore the hyperparameter space and find combinations that result in improved model performance. Evolutionary hyperparameter optimization techniques are useful when dealing with complex optimization problems or when traditional methods such as grid search or random search are impractical or inefficient.
The mechanism, inspired by natural selection:
- Population of Solutions: Instead of manually trying different hyperparameter combinations, you start with a population of random solutions (sets of hyperparameters).
- Fitness Evaluation: Each solution is evaluated, often by training a model with those hyperparameters and seeing how it performs on a validation set.
- Survival of the Fittest: The best-performing solutions have a higher chance of being selected for the next generation.
- Crossover and Mutation: New solutions are created by:
- Crossover: Combining elements from two good parent solutions
- Mutation: Making small random changes to existing solutions. This helps explore the search space.
- Repeat: This process repeats for several generations, with the aim that better solutions evolve over time.
Examples of Evolutionary Algorithms
- Genetic Algorithms (GA): Solutions are represented like chromosomes; crossover and mutation are modeled after biological processes.
- Particle Swarm Optimization (PSO): Solutions are like particles moving through space, influenced by their own best-found position and the globally best positions.
- Differential Evolution (DE): New solutions are generated based on differences between existing solutions in the population.
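A toy NumPy sketch of the evolutionary loop described above, evolving two hyperparameters (learning rate and tree depth, as stand-ins) against a made-up fitness function; in practice the fitness evaluation would train and validate a real model:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(params):
    # Stand-in for "train a model and return validation performance".
    lr, depth = params
    return -((np.log10(lr) + 2) ** 2 + (depth - 6) ** 2)  # peak near lr=1e-2, depth=6

# 1. Population of solutions: random hyperparameter sets.
population = [(10 ** rng.uniform(-4, 0), rng.integers(2, 12)) for _ in range(10)]

for generation in range(20):
    # 2. Fitness evaluation and 3. survival of the fittest: keep the best half.
    scored = sorted(population, key=fitness, reverse=True)
    survivors = scored[:5]
    children = []
    while len(children) < 5:
        p1, p2 = rng.choice(len(survivors), 2, replace=False)
        # 4a. Crossover: mix parameters from two parent solutions.
        child = [survivors[p1][0], survivors[p2][1]]
        # 4b. Mutation: small random perturbations to explore the search space.
        child[0] *= 10 ** rng.normal(0, 0.1)
        child[1] = int(np.clip(child[1] + rng.integers(-1, 2), 2, 12))
        children.append(tuple(child))
    population = survivors + children     # 5. Repeat with the new generation.

print("Best hyperparameters found:", max(population, key=fitness))
```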
Exploding gradient
During training, neural networks update their weights using backpropagation, which calculates the error (the difference between the predicted and true value) and propagates it backward through the layers to compute gradients. Gradients tell us how much to adjust the weights. In deep neural networks, these gradients can get multiplied through many layers.
If the weights are initialized too large, or certain conditions arise within the network (very deep architectures, repeated multiplication by large weight matrices), these multiplied gradients can become extremely large, leading to the exploding gradient problem. (Saturating activation functions such as sigmoid, which flatten out near 0 and 1, are more closely associated with the opposite issue, vanishing gradients.)
Huge gradients result in massive updates to the network weights during training. This can lead to instability: the model may wildly overshoot the optimal solution, or weights might become so large they overflow to NaN (Not a Number), breaking your training entirely.
Techniques such as gradient clipping and normalization are often used to mitigate the problem of exploding gradients.
Exploration vs. Exploitation
A fundamental trade-off in decision-making and optimization, particularly in reinforcement learning and multi-armed bandit problems.
Exploration refers to the process of gathering information about the environment or exploring different options to discover potentially better solutions.
Exploitation, on the other hand, involves leveraging known information or exploiting current knowledge to maximize immediate rewards or benefits.
Balancing exploration and exploitation is essential for learning and decision-making in dynamic environments, where the goal is to achieve a balance between gathering new information and exploiting existing knowledge to optimize long-term performance. Machine learning models learn from data, and your dataset is often a mere snapshot of all possible scenarios. A model too focused on exploiting the knowledge in your current data may perform poorly on new, unseen data (this is overfitting). Exploration is needed to help it generalize better.
The exploration-exploitation dilemma is most pronounced in reinforcement learning (RL), where an agent learns through interacting with an environment and receiving rewards. The same principles apply for example in:
- Epsilon-greedy (particularly in RL): Take a random, exploratory action with a small probability.
- Decaying Epsilon: Start with a lot of exploration and decrease it over time.
- Optimism in the face of uncertainty: Favor under-explored actions or parts of the parameter space, providing an incentive for the model to try new things.
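A small NumPy sketch of epsilon-greedy action selection with a decaying epsilon, on a toy three-armed bandit:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.8]           # hidden reward probabilities of 3 "arms"
counts = np.zeros(3)
value_estimates = np.zeros(3)
epsilon = 1.0                           # start fully exploratory

for step in range(1000):
    if rng.random() < epsilon:
        action = rng.integers(3)                    # explore: pick a random arm
    else:
        action = int(np.argmax(value_estimates))    # exploit: pick the best arm so far
    reward = rng.random() < true_means[action]      # Bernoulli reward
    counts[action] += 1
    # Incremental average update of the action-value estimate.
    value_estimates[action] += (reward - value_estimates[action]) / counts[action]
    epsilon = max(0.01, epsilon * 0.995)            # decaying epsilon

print("Estimated arm values:", value_estimates.round(2))
```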
Feature detector (kernel, filter)
A feature detector, also known as a kernel or filter, is a small matrix or template used in convolutional neural networks (CNNs) to extract specific features or patterns from input data. Feature detectors are applied to input data using a convolution operation, where the filter is convolved with the input to produce feature maps. Different types of feature detectors (e.g., edge detectors, texture detectors, shape detectors) are designed to capture different aspects of the input data and are learned or manually defined during the training process.
Fold
In cross-validation, a fold refers to a distinct subset of data used for training and validation. The dataset is divided into multiple folds, typically of equal size, where each fold is used as a validation set exactly once while the remaining folds are used for training. The cross-validation process is repeated for each fold, ensuring that every data point is used for both training and validation.
For example, in k-fold cross-validation, the dataset is divided into k folds. The model is trained k times, with each fold used once as a validation set and the remaining k-1 folds used for training. The performance metrics are averaged over all k runs to provide an overall estimate of the model’s performance. Cross-validation helps in assessing the generalization performance of a model, detecting overfitting, and tuning hyperparameters.
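A minimal scikit-learn sketch of 5-fold cross-validation that makes the role of each fold explicit:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(X):
    # Each fold serves as the validation set exactly once.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

print("Mean accuracy over 5 folds:", sum(scores) / len(scores))
```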
Fully connected layers
Also known as dense layers, fully connected layers are layers in artificial neural networks where each neuron is connected to every neuron in the preceding layer; networks built entirely from them are sometimes called fully connected neural networks (FCNNs). In a fully connected layer, each neuron receives input from all neurons in the previous layer and computes a weighted sum of these inputs, followed by an activation function to produce the output. Fully connected layers are commonly used in feedforward neural networks and deep learning architectures for tasks such as classification, regression, and feature learning.
Gating
A process of controlling or modulating the flow of information within neural networks using gating mechanisms. Gating mechanisms selectively filter, amplify, or suppress information based on learned or predefined criteria, allowing networks to focus on relevant features or suppress irrelevant noise. Gating is commonly used in recurrent neural networks (RNNs) through mechanisms such as LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) cells to regulate the flow of information over sequential data.
Gaussian noise
It’s a type of statistical noise where the probability distribution of the noise, meaning the values the noise can take on, follows a Gaussian distribution (also known as a normal distribution). A Gaussian distribution forms the classic bell-shaped curve, where values near the mean (average) are the most likely, and the probability decreases as values move further away.
It is characterized by random fluctuations with a mean of zero and a constant variance, resulting in a symmetric distribution around the mean. Gaussian noise is often added to signals or data to simulate random variability or uncertainty, model measurement errors, or introduce randomness in stochastic processes. In machine learning, Gaussian noise is sometimes injected into input data or model parameters to regularize the learning process, prevent overfitting, or augment the training data.
Gini index (Gini impurity)
A measure of the impurity or randomness of a set of elements in a classification problem. It quantifies the probability of misclassifying an element randomly chosen from the set if it were labeled according to the class distribution in the set. A lower Gini index indicates higher purity and better separation of classes, while a higher Gini index indicates higher impurity and mixing of classes. The Gini index is commonly used as a criterion for splitting nodes in decision trees and evaluating the quality of splits in decision tree algorithms such as CART (Classification and Regression Trees).
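A short NumPy sketch of the Gini impurity, G = 1 - sum_i p_i^2, where p_i is the proportion of class i in the node:

```python
import numpy as np

def gini_impurity(labels):
    # Proportion of each class among the labels in the node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))   # 0.0 -> pure node
print(gini_impurity([0, 0, 1, 1]))   # 0.5 -> maximally mixed for two classes
```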
Gradient clipping
A technique used to deal with exploding gradients. The central idea is simple: if the gradient exceeds a certain threshold, you clip its magnitude to stay within a reasonable range. Here are the common methods:
Clipping by Value:
- You define a minimum and maximum threshold.
- If a gradient component is less than the minimum, clip it to the minimum value.
- If a gradient component is larger than the maximum, clip it to the maximum value.
Clipping by Norm:
- Calculate the norm of the gradient vector (e.g., the L2 norm).
- If the norm exceeds a threshold, rescale the entire gradient vector so its norm is equal to the threshold. This preserves the direction of the gradient while limiting its magnitude.
The ideal threshold is problem-dependent, but some experimentation often helps. While it helps with exploding gradients, clipping doesn’t address vanishing gradients. Other techniques (e.g., careful weight initialization, LSTMs) may be needed as well.
Gradient clipping helps prevent giant updates that derail your learning process. It also makes the model less sensitive to the choice of learning rate, since you're capping how much change can occur in a single update.
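A PyTorch sketch of both clipping styles inside a single training step (the threshold of 1.0 is only a common starting point; normally you would pick one method, not both):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Clipping by value: each gradient component is kept within [-1.0, 1.0].
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
# Clipping by norm: rescale so the total gradient norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```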
Gradient descent
An iterative optimization algorithm used to minimize the loss function and find the optimal parameters (weights and biases) of a machine learning model. It works by iteratively adjusting the model parameters in the opposite direction of the gradient of the loss function with respect to the parameters. By following the gradient, the algorithm seeks to descend along the steepest path towards the minimum of the loss function. During gradient descent, each network parameter receives an update proportional to the partial derivative of the cost function with respect to that parameter at every iteration of training.
Gradient descent comes in different variants, including batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, each with its own trade-offs in terms of convergence speed, memory usage, and computational efficiency.
The magnitude of the gradient for a specific weight or bias signifies how sensitive the error is to changes in that parameter. A larger gradient indicates that a change in that weight or bias will have a more significant impact on the error. The sign of the gradient tells us whether to increase or decrease the parameter: because parameters are updated in the direction opposite the gradient, a positive gradient means decreasing the parameter will reduce the error, while a negative gradient means increasing it will help.
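A NumPy sketch of plain batch gradient descent fitting a one-parameter linear model y ≈ w·x with a mean-squared-error loss:

```python
import numpy as np

# Toy data generated from y = 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + rng.normal(0, 0.1, 100)

w, lr = 0.0, 0.1
for step in range(200):
    y_pred = w * x
    # Gradient of the MSE loss with respect to w.
    grad = np.mean(2 * (y_pred - y) * x)
    # Move w in the opposite direction of the gradient.
    w -= lr * grad

print("Learned weight:", round(w, 3))   # close to 3.0
```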
Gradient-based hyperparameter tuning
The process of optimizing the hyperparameters of machine learning models using gradient-based optimization algorithms. Instead of manually tuning hyperparameters or using grid search techniques, gradient-based methods leverage the gradients of a chosen performance metric (e.g., validation loss) with respect to the hyperparameters. By iteratively updating the hyperparameters in the direction that minimizes the loss, these methods efficiently search the hyperparameter space and find optimal or near-optimal configurations. Examples include hypergradient methods, which differentiate the validation loss directly with respect to hyperparameters such as the learning rate, and gradient-based meta-learning approaches, which learn to adapt hyperparameters during training. (Bayesian optimization, which models the performance metric with a probabilistic surrogate, is a related but gradient-free alternative.)
Grid Search CV
Grid Search Cross-Validation (CV) is a technique used to tune the hyperparameters of a machine learning model by exhaustively searching through a specified grid of hyperparameter values and evaluating each combination using cross-validation to determine the optimal set of hyperparameters.
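A minimal scikit-learn example tuning an SVM over a small grid with 5-fold cross-validation (parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # tries all 9 combinations
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```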
HOG
Histogram of Oriented Gradients (HOG) is a feature extraction technique used in computer vision and image processing to represent the local texture and shape information of an image. HOG computes histograms of gradient orientations within localized regions of the image and concatenates these histograms to form a feature vector that describes the overall structure of the image. HOG features are commonly used in object detection, pedestrian detection, and other tasks where capturing shape and texture information is important.
Holdout sets
Also known as validation sets or validation data, these are subsets of the dataset used to evaluate the performance of a machine learning model during training. Holdout sets are distinct from the training set and are not used for model parameter estimation. Instead, they are used to assess the generalization performance of the model on unseen data and to tune hyperparameters such as learning rate, regularization strength, and model architecture. Holdout sets are typically held out from the training process and only used intermittently to monitor the model’s performance and prevent overfitting.
You might use a holdout set iteratively throughout development, adjusting your model based on its performance. The test set is meant to be used only once. If you use a holdout set repeatedly to tune your model, you risk it subtly influencing your choices and biasing your estimation. The test set, held strictly separate, avoids this.
The holdout set is used primarily during the model development process, while the test set is reserved for a final, rigorous assessment at the very end. This is meant to give an unbiased estimate of the final model’s performance and help you decide whether it is ready for deployment.
Hooks
In the context of deep learning frameworks such as PyTorch and TensorFlow, hooks are callback functions or mechanisms used to intercept and observe internal states or operations of neural network modules during the forward and backward passes. Hooks allow users to inspect and manipulate intermediate activations, gradients, and other internal variables of the network for debugging, visualization, and research purposes. Hooks are commonly used for feature visualization, gradient-based optimization, and model interpretation in deep learning workflows.
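A minimal PyTorch sketch registering a forward hook to capture a layer's intermediate activations:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
activations = {}

def save_activation(module, inputs, output):
    # Called automatically every time the hooked module runs a forward pass.
    activations["hidden"] = output.detach()

handle = model[1].register_forward_hook(save_activation)   # hook the ReLU layer

_ = model(torch.randn(4, 10))
print(activations["hidden"].shape)   # torch.Size([4, 32])

handle.remove()   # remove hooks when done to avoid lingering side effects
```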
Hyperparameter C
The C hyperparameter in SVMs acts as a regularization parameter. It navigates the trade-off between:
Large C: Enforces a stricter decision boundary, aiming to classify training examples correctly even if it leads to a more complex model (risks overfitting).
Small C: Allows for a wider margin around the decision boundary, accepting some misclassifications on the training data for the sake of better generalizing to unseen data (risks underfitting).
When training an SVM, the goal is to find a hyperplane that separates the classes in your data while maximizing the margin (the distance between the hyperplane and the closest data points for each class). The C parameter controls the penalty applied for having data points within the margin or on the wrong side of the hyperplane.
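A short scikit-learn sketch comparing cross-validated accuracy for a few values of C (dataset and values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

for C in [0.01, 1, 100]:
    # Larger C -> stricter fit to training data; smaller C -> wider, more tolerant margin.
    scores = cross_val_score(SVC(C=C, kernel="rbf"), X, y, cv=5)
    print(f"C={C}: mean CV accuracy = {scores.mean():.3f}")
```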
Hyperparameters to tune in gradient boosting
Tree-Specific Parameters:
max_depth: Maximum depth of each tree.
min_child_weight: Minimum sum of instance weight needed in a child node.
subsample: Subsample ratio of the training instance.
colsample_bytree: Subsample ratio of columns when constructing each tree.
colsample_bylevel: Subsample ratio of columns for each level.
colsample_bynode: Subsample ratio of columns for each split.
max_delta_step: Maximum delta step allowed for each tree’s weight estimation.
gamma: Minimum loss reduction required to make a further partition on a leaf node.
lambda: L2 regularization term on weights.
alpha: L1 regularization term on weights.
scale_pos_weight: Control the balance of positive and negative weights.
Learning Task Parameters:
objective: The learning objective or loss function.
eval_metric: Evaluation metric for validation data.
num_class: Number of classes in a multi-class classification.
Learning Control Parameters:
eta or learning_rate: Step size shrinkage used to prevent overfitting.
n_estimators or num_boost_round: Number of boosting rounds or trees to build.
early_stopping_rounds: Early stopping to prevent overfitting based on a validation dataset.
verbose: Verbosity level.
silent: Whether to print messages during training.
Additional Parameters (Specific to Certain Implementations):
Parameters specific to XGBoost: e.g., tree_method, booster, gpu_id, max_bin.
Parameters specific to LightGBM: e.g., boosting_type, num_leaves, max_bin, device.
Parameters specific to CatBoost: e.g., depth, border_count, l2_leaf_reg.
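A hedged example of setting a handful of these parameters through XGBoost's scikit-learn interface; parameter names follow the sklearn-style aliases (e.g., reg_lambda for lambda, reg_alpha for alpha), and the values are illustrative rather than recommended defaults:

```python
from xgboost import XGBClassifier   # assumes the xgboost package is installed

model = XGBClassifier(
    max_depth=4,               # tree-specific: limits tree complexity
    min_child_weight=1,
    subsample=0.8,             # row subsampling per boosting round
    colsample_bytree=0.8,      # column subsampling per tree
    gamma=0.0,                 # minimum loss reduction required to split
    reg_lambda=1.0,            # L2 regularization (the "lambda" parameter)
    reg_alpha=0.0,             # L1 regularization (the "alpha" parameter)
    learning_rate=0.1,         # "eta": shrinkage applied to each boosting round
    n_estimators=300,          # number of boosting rounds
    objective="binary:logistic",
    eval_metric="logloss",
)
```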