Build, Train and Tune Model Flashcards

1
Q

Activation Function

A

A mathematical function applied to the output of each neuron in a neural network to introduce non-linearity and enable the network to learn complex patterns and relationships in the data. Common activation functions include sigmoid, tanh, ReLU (Rectified Linear Unit), Leaky ReLU, and softmax. Activation functions play a crucial role in determining the output of neural networks and affect the network’s training dynamics, convergence speed, and performance.
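
As a rough illustration of the functions named above, here is a minimal pure-Python sketch. The function names and the default `slope=0.01` for Leaky ReLU are choices made for this example, not values fixed by the card.

```python
import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real number into the range (-1, 1).
    return math.tanh(x)

def relu(x):
    # Passes positive values through and zeroes out negatives.
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Like ReLU, but keeps a small slope for negative inputs (slope is an assumed default).
    return x if x > 0 else slope * x

def softmax(values):
    # Turns a list of scores into probabilities that sum to 1.
    # Subtracting the maximum first is a common numerical-stability trick.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

Softmax differs from the others in that it acts on a whole vector of scores rather than on one value at a time, which is why it is typically used at the output of a classifier.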

2
Q

Activation map

A

Also known as a feature map, an activation map is a two-dimensional array or tensor that represents the output of a layer in a convolutional neural network (CNN) after applying an activation function. Each element in the activation map corresponds to the activation value of a specific neuron in the layer, capturing the presence of certain features or patterns in the input data. Activation maps are used for visualizing and interpreting the learned representations in CNNs and are instrumental in understanding how the network processes and transforms input data.

3
Q

Adam Optimization

A

Adam (Adaptive Moment Estimation) is an optimization algorithm commonly used to update the parameters of neural networks during training. It combines the benefits of adaptive learning rate methods (such as RMSprop) and momentum-based optimization techniques, which often yields faster convergence in practice. Adam stores exponentially decaying averages of past gradients and of past squared gradients (estimates of the first and second moments) and uses them to compute an adaptive learning rate for each parameter. It is widely used in deep learning frameworks for training various types of neural network architectures.
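
The update described above can be sketched for a single scalar parameter as follows. This is a minimal illustration, not a framework implementation: the function name `adam_step`, the toy loss `(w - 3)^2`, and the step count are assumptions for the example, and the default hyperparameters are the commonly quoted ones.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponentially decaying average of past gradients (first moment).
    m = beta1 * m + (1 - beta1) * grad
    # Exponentially decaying average of past squared gradients (second moment).
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction, because m and v start at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter adaptive step.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy usage: minimise loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of how large the gradient is.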

4
Q

Backpropagation

A

A fundamental algorithm for training neural networks. It involves computing the gradient of a loss function with respect to the network’s parameters, then using this gradient to update the parameters in the direction that minimizes the loss. This process is repeated iteratively to optimize the network’s performance. Backpropagation enables neural networks to learn from data by adjusting their internal parameters to better approximate the desired output for a given input. For example, in a simple feedforward neural network used for image classification, backpropagation adjusts the weights connecting neurons in each layer to reduce the difference between the predicted class and the actual class of each image in the training dataset.

5
Q

Backpropagation through time

A

An extension of the backpropagation algorithm specifically designed for training recurrent neural networks (RNNs) over sequential data. It unfolds the network through time, treating each time step as a layer, and computes gradients using the chain rule of calculus. BPTT is widely used in tasks such as speech recognition, natural language processing, and time series prediction, where the input data is sequential and has temporal dependencies. However, BPTT suffers from the vanishing gradient problem, where gradients diminish exponentially over long sequences, making it challenging to learn dependencies over extended periods.

6
Q

Batch Normalization

A

A technique used to improve the training of deep neural networks by normalizing the inputs of each layer over each mini-batch of data (subtracting the batch mean and dividing by the batch standard deviation, then applying a learned scale and shift). It was introduced to reduce internal covariate shift, and it accelerates convergence by stabilizing the distributions of layer inputs. Batch Normalization helps mitigate issues such as vanishing gradients, enables the use of higher learning rates, and acts as a mild regularizer, which can reduce the need for other regularization techniques. In the original formulation it is applied to the output of the linear or convolutional transform, before the activation function; applying it after the activation is a variant also used in practice.

7
Q

Bayesian hyperparameters optimization

A

A method used to efficiently search for optimal hyperparameters of machine learning algorithms by approximating the objective function with a probabilistic surrogate model (commonly a Gaussian process). It leverages Bayesian techniques to iteratively update this surrogate based on observed evaluations, allowing for more effective exploration of the hyperparameter space.

Purpose: The goal of Bayesian Hyperparameters Optimization is to find hyperparameters that maximize the performance of a machine learning model while minimizing the number of evaluations required.
Example: In practice, Bayesian optimization is used in tasks such as tuning the hyperparameters of support vector machines, random forests, and deep neural networks, where manually searching the hyperparameter space would be prohibitively time-consuming.

8
Q

Bootstrapping

A

A resampling technique used in statistics and machine learning to estimate the sampling distribution of a statistic by repeatedly sampling with replacement from the observed data. In bootstrapping, multiple samples (bootstrap samples) are drawn from the original dataset, and statistical estimates or models are computed or fitted on each of them.
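
A minimal sketch of the idea, using the sample mean as the statistic. The data values, the number of resamples, and the function name `bootstrap_means` are made up for illustration.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    # Draw n_resamples bootstrap samples (same size as data, with replacement)
    # and record the mean of each one.
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(data) for _ in range(n)]
        means.append(statistics.mean(sample))
    return means

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # illustrative values
means = bootstrap_means(data)
# The spread of the bootstrap means approximates the standard error of the mean.
std_error = statistics.stdev(means)
```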

9
Q

Bottleneck layer

A

Bottleneck layers in neural networks act as a compression stage for information flow. Imagine a large crowd trying to squeeze through a narrow tunnel. The bottleneck layer is that tunnel, forcing the network to compress its data into a lower-dimensional representation.

A bottleneck layer has fewer neurons compared to the layers before and after it. This “bottleneck” forces the network to identify the most critical information and discard redundancy. Despite the size reduction, the bottleneck layer aims to capture the essence of the data. It does this by applying filters that highlight the most significant features learned by previous layers. By reducing data size, bottleneck layers make the network more efficient. They require fewer calculations and can help prevent overfitting, a situation where the network memorizes specifics instead of learning general patterns. Often, bottleneck layers are used in conjunction with residual connections. These connections allow the network to bypass the bottleneck entirely and add the original, uncompressed data to the output. This ensures the network retains important details while still benefiting from the efficiency gains.

Used for example in autoencoders.

10
Q

Bounding box

A

A rectangular or cuboidal area used to encapsulate objects or regions of interest in an image or scene. It is defined by its coordinates, typically represented as (xmin, ymin, xmax, ymax) for 2D bounding boxes in image space. Bounding boxes are commonly used in computer vision tasks such as object detection, instance segmentation, and object tracking to localize and identify objects within images or video frames.

11
Q

Checkpoints in models

A

Checkpoints are snapshots of a model’s state saved during training. Depending on the framework and settings, a checkpoint can include the model’s architecture, weights, optimizer state, and other relevant values such as the current epoch. Checkpoints are crucial for resuming training from a specific point, fine-tuning models, or deploying trained models for inference. They allow practitioners to monitor training progress, prevent data loss in case of interruptions, and facilitate model evaluation and experimentation.

Checkpoints are saved in standardized formats (e.g., TensorFlow’s SavedModel, PyTorch’s .pth files) and managed using tools like callbacks in TensorFlow or torch.save() in PyTorch. They ensure reproducibility, scalability, and reliability in machine learning workflows.

12
Q

Convolving

A

Convolving refers to the process of applying a convolution operation to input data using a convolutional kernel or filter. In the context of image processing and computer vision, convolution is used to extract features from images by sliding a kernel over the input image and computing the dot product between the kernel and local regions of the image. Convolving is a fundamental operation in convolutional neural networks (CNNs) and is used to detect patterns, edges, textures, and other visual features in images.
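
The sliding-window dot product described above can be sketched in plain Python. This version uses no padding and a stride of 1, and, like most CNN libraries, it does not flip the kernel (strictly speaking this is cross-correlation). The tiny image and the vertical-edge kernel are made-up inputs for the example.

```python
def convolve2d_valid(image, kernel):
    # Slide the kernel over every position where it fits entirely inside the image
    # and take the sum of element-wise products at each position.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output

# A 3x4 image that is dark (0) on the left and bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A 2x2 kernel that responds to a dark-to-bright transition from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d_valid(image, vertical_edge)
# feature_map == [[0, 2, 0], [0, 2, 0]]
```

The output is non-zero only in the column where the window straddles the edge, which is the sense in which a kernel "detects" a feature.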

13
Q

Criterion

A

An objective function or measure used to evaluate the performance of a model or algorithm. The criterion quantifies how well the model’s predictions match the true outcomes or how effectively the algorithm achieves its objectives. Common criteria in machine learning include loss functions, accuracy, precision, recall, F1-score, and mean squared error. The choice of criterion depends on the specific task, dataset, and optimization goals.

14
Q

Data Loader

A

A component or module in machine learning frameworks and libraries used to load, preprocess, and batch input data for training or inference. Data loaders are responsible for reading data from storage (e.g., disk, database), applying data transformations (e.g., normalization, augmentation), and organizing data into batches suitable for efficient processing by machine learning models. Data loaders play a critical role in managing large datasets, handling data pipelines, and optimizing the training process for deep learning models.

15
Q

Decision boundary

A

A hypersurface or boundary that separates different classes or categories in the feature space of a classification problem. It represents the region where the decision function changes from predicting one class to another. In binary classification tasks, the decision boundary is typically a line, curve, or hyperplane that partitions the feature space into two regions corresponding to different class labels. Decision boundaries are learned by machine learning algorithms based on the training data and model parameters and are used to make predictions on new or unseen data points.

16
Q

Denoising autoencoder

A

A type of artificial neural network used for learning efficient representations of data by removing noise or corruption from input samples. Unlike traditional autoencoders, denoising autoencoders are trained to reconstruct clean or uncorrupted versions of input data from noisy observations. They learn robust features that capture the underlying structure of the data while filtering out irrelevant or noisy information. Denoising autoencoders find applications in dimensionality reduction, feature learning, and unsupervised pretraining in machine learning and deep learning.

17
Q

Dropout

A

A regularization technique used in neural networks to prevent overfitting and improve generalization performance. During training, dropout randomly deactivates (sets to zero) a proportion of neurons in a layer with a specified dropout rate. This prevents individual neurons from relying too heavily on specific features or co-adapting with other neurons and encourages the network to learn more robust and generalizable representations. Dropout is commonly used in deep learning models, especially fully connected and convolutional neural networks.
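
A minimal sketch of dropout on a list of activations. It uses the "inverted dropout" convention, in which the surviving activations are scaled up by 1 / (1 - rate) during training so that nothing needs to change at inference time; that scaling is an assumption of this example and is not stated in the card. The function name and the `rng` parameter are also illustrative.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    # At inference time (or with rate 0) dropout is a no-op.
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Each activation is kept with probability `keep` and rescaled,
    # otherwise it is set to zero.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

For example, `dropout([1.0] * 1000, 0.5, rng=random.Random(0))` returns a list in which every entry is either 0.0 or 2.0, with roughly half of each.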

18
Q

EOS Token

A

End of sequence (sometimes expanded as end of sentence). The EOS token acts like a full stop in a sentence. It’s a special symbol that signals the end of an output sequence.

During training, the model learns to associate the EOS token with the end of a coherent sentence or translation. When generating text, the model keeps producing words or tokens until it outputs the EOS token. The EOS token is crucial because it allows models to generate sequences of different lengths. Without it, the model wouldn’t have a clear signal for when to stop generating text. In machine translation, an EOS token at the end of the source sentence marks where the input ends, and the decoder (the part generating the target language) emits its own EOS token when the translation is complete.

19
Q

Evolutionary hyperparameters optimization techniques

A

Methods inspired by principles of natural selection and evolution to search for optimal hyperparameters in machine learning models. These techniques typically involve the use of evolutionary algorithms, such as genetic algorithms, evolutionary strategies, or genetic programming, to explore the hyperparameter space and find combinations that result in improved model performance. Evolutionary hyperparameters optimization techniques are useful when dealing with complex optimization problems or when traditional methods such as grid search or random search are impractical or inefficient.

Inspiration for mechanism:
- Population of Solutions: Instead of manually trying different hyperparameter combinations, you start with a population of random solutions (sets of hyperparameters).
- Fitness Evaluation: Each solution is evaluated, often by training a model with those hyperparameters and seeing how it performs on a validation set.
- Survival of the Fittest: The best-performing solutions have a higher chance of being selected for the next generation.
- Crossover and Mutation: New solutions are created by:
  - Crossover: Combining elements from two good parent solutions.
  - Mutation: Making small random changes to existing solutions. This helps explore the search space.
- Repeat: This process repeats for several generations, with the aim that better solutions evolve over time.

Examples of Evolutionary Algorithms
- Genetic Algorithms (GA): Solutions are represented like chromosomes; crossover and mutation are modeled after biological processes.
- Particle Swarm Optimization (PSO): A swarm-intelligence method often grouped with evolutionary algorithms. Solutions are like particles moving through space, influenced by their own best-found position and the globally best position.
- Differential Evolution (DE): New solutions are generated based on differences between existing solutions in the population.
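
The population / fitness / selection / crossover / mutation loop can be sketched as a small genetic algorithm. Everything concrete here is hypothetical: the two hyperparameters (`learning_rate`, `max_depth`), their ranges, and especially `fitness`, which is a stand-in formula replacing the expensive "train a model and score it on a validation set" step.

```python
import random

def fitness(candidate):
    # Hypothetical stand-in for a validation score; peaks at
    # learning_rate = 0.1 and max_depth = 6.
    return (-((candidate["learning_rate"] - 0.1) ** 2)
            - 0.01 * (candidate["max_depth"] - 6) ** 2)

def random_candidate(rng):
    return {"learning_rate": rng.uniform(0.001, 0.5),
            "max_depth": rng.randint(2, 12)}

def crossover(a, b, rng):
    # Each hyperparameter is inherited from one of the two parents.
    return {key: rng.choice([a[key], b[key]]) for key in a}

def mutate(candidate, rng, prob=0.3):
    # Small random changes, clamped to the allowed ranges.
    child = dict(candidate)
    if rng.random() < prob:
        lr = child["learning_rate"] + rng.gauss(0, 0.02)
        child["learning_rate"] = min(0.5, max(0.001, lr))
    if rng.random() < prob:
        depth = child["max_depth"] + rng.choice([-1, 1])
        child["max_depth"] = min(12, max(2, depth))
    return child

def evolve(generations=30, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Survival of the fittest: keep the better half as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

best = evolve()
```

Because the parents are carried over unchanged into the next generation, the best score in the population can never get worse from one generation to the next.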

20
Q

Exploding gradient

A

During training, neural networks update their weights using backpropagation, which calculates the error (the difference between the predicted and true value) and propagates it backward through the layers to compute gradients. Gradients tell us how much to adjust the weights. In deep neural networks, these gradients can get multiplied through many layers.
If the weights are initialized too large, or certain conditions arise within the network (for example, very many layers or long unrolled sequences in recurrent networks), these multiplied gradients can become extremely large, leading to the exploding gradient problem. Saturating activation functions such as sigmoid, whose outputs flatten near 0 and 1, are mainly associated with the opposite problem, vanishing gradients.

Huge gradients result in massive updates to the network weights during training. This can lead to instability: the model may wildly overshoot the optimal solution, or the weights might become so large that they overflow to NaN (Not a Number), breaking your training entirely.

Techniques such as gradient clipping and normalization are often used to mitigate the problem of exploding gradients.

21
Q

Exploration vs. Exploitation

A

A fundamental trade-off in decision-making and optimization, particularly in reinforcement learning and multi-armed bandit problems.

Exploration refers to the process of gathering information about the environment or exploring different options to discover potentially better solutions.
Exploitation, on the other hand, involves leveraging known information or exploiting current knowledge to maximize immediate rewards or benefits.

Balancing exploration and exploitation is essential for learning and decision-making in dynamic environments, where the goal is to achieve a balance between gathering new information and exploiting existing knowledge to optimize long-term performance. Machine learning models learn from data, and your dataset is often a mere snapshot of all possible scenarios. A model too focused on exploiting the knowledge in your current data may perform poorly on new, unseen data (this is overfitting). Exploration is needed to help it generalize better.

The exploration-exploitation dilemma is most pronounced in reinforcement learning (RL), where an agent learns through interacting with an environment and receiving rewards. Common strategies for balancing the two include:
- Epsilon-greedy (particularly in RL): Take a random, exploratory action with a small probability.
- Decaying Epsilon: Start with a lot of exploration and decrease it over time.
- Optimism in the face of uncertainty: Favor under-explored actions or parts of the parameter space, providing an incentive for the model to try new things.
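
The first two strategies can be sketched in a few lines. The function names, the list-of-estimated-values representation, and the decay schedule (`start`, `end`, `decay`) are assumptions chosen for this example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon, explore: pick a random action.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Otherwise exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, start=1.0, end=0.05, decay=0.99):
    # Start with a lot of exploration and shrink it over time,
    # never going below the floor `end`.
    return max(end, start * decay ** step)
```

A typical usage would call `epsilon_greedy(q_values, decayed_epsilon(step))` on each step, so early steps are mostly random and later steps mostly greedy.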

22
Q

Feature detector (kernel, filter)

A

A feature detector, also known as a kernel or filter, is a small matrix or template used in convolutional neural networks (CNNs) to extract specific features or patterns from input data. Feature detectors are applied to input data using a convolution operation, where the filter is convolved with the input to produce feature maps. Different types of feature detectors (e.g., edge detectors, texture detectors, shape detectors) capture different aspects of the input data. In CNNs the filter values are learned during training, whereas in classical image processing they are defined by hand (e.g., Sobel edge filters).

23
Q

Fold

A

In cross-validation, a fold refers to a distinct subset of data used for training and validation. The dataset is divided into multiple folds, typically of equal size, where each fold is used as a validation set exactly once while the remaining folds are used for training. The cross-validation process is repeated for each fold, ensuring that every data point is used for both training and validation.

For example, in k-fold cross-validation, the dataset is divided into k folds. The model is trained k times, with each fold used once as a validation set and the remaining k-1 folds used for training. The performance metrics are averaged over all k runs to provide an overall estimate of the model’s performance. Cross-validation helps in assessing the generalization performance of a model, detecting overfitting, and tuning hyperparameters.
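
A minimal sketch of how the folds can be formed from sample indices. It does not shuffle the data first, which real workflows usually do; the function name `k_fold_indices` is made up for this example.

```python
def k_fold_indices(n_samples, k):
    # Split indices 0..n_samples-1 into k contiguous folds of near-equal size
    # and return one (train_indices, validation_indices) pair per fold.
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, val))
        start += size
    return splits

splits = k_fold_indices(10, 3)
# Validation folds: [0, 1, 2, 3], [4, 5, 6], [7, 8, 9]
```

Each index appears in exactly one validation fold and in the training set of the other k-1 runs.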

24
Q

Fully connected layers

A

Also known as dense layers, fully connected layers are layers in artificial neural networks where each neuron is connected to every neuron in the preceding layer (a network built only from such layers is called a fully connected neural network, FCNN). In a fully connected layer, each neuron receives input from all neurons in the previous layer and computes a weighted sum of these inputs, followed by an activation function to produce the output. Fully connected layers are commonly used in feedforward neural networks and deep learning architectures for tasks such as classification, regression, and feature learning.

25
Q

Gating

A

A process of controlling or modulating the flow of information within neural networks using gating mechanisms. Gating mechanisms selectively filter, amplify, or suppress information based on learned or predefined criteria, allowing networks to focus on relevant features or suppress irrelevant noise. Gating is commonly used in recurrent neural networks (RNNs) through mechanisms such as LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) cells to regulate the flow of information over sequential data.

26
Q

Gaussian noise

A

A type of statistical noise whose values follow a Gaussian distribution (also known as a normal distribution). A Gaussian distribution forms the classic bell-shaped curve, where values near the mean (average) are the most likely, and the probability decreases as values move further away.

It is characterized by random fluctuations with a mean of zero and a constant variance, resulting in a symmetric distribution around the mean. Gaussian noise is often added to signals or data to simulate random variability or uncertainty, model measurement errors, or introduce randomness in stochastic processes. In machine learning, Gaussian noise is sometimes injected into input data or model parameters to regularize the learning process, prevent overfitting, or augment the training data.

27
Q

Gini index (Gini impurity)

A

A measure of the impurity or randomness of a set of elements in a classification problem. It quantifies the probability of misclassifying an element randomly chosen from the set if it were labeled according to the class distribution in the set. A lower Gini index indicates higher purity and better separation of classes, while a higher Gini index indicates higher impurity and mixing of classes. The Gini index is commonly used as a criterion for splitting nodes in decision trees and evaluating the quality of splits in decision tree algorithms such as CART (Classification and Regression Trees).
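
For a set whose classes occur with proportions p_k, the Gini impurity is 1 minus the sum of the squared proportions. A minimal sketch (the function name is chosen for this example):

```python
from collections import Counter

def gini_impurity(labels):
    # Gini impurity = 1 - sum over classes of (class proportion squared).
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())
```

A pure set such as `["a", "a", "a", "a"]` scores 0.0, while an evenly mixed two-class set such as `["a", "a", "b", "b"]` scores 0.5, the maximum for two classes.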

28
Q

Gradient clipping

A

A technique used to deal with exploding gradients. The central idea is simple: if the gradient exceeds a certain threshold, you clip its magnitude to stay within a reasonable range. Here are the common methods:

Clipping by Value:
- You define a minimum and maximum threshold.
- If a gradient component is less than the minimum, clip it to the minimum value.
- If a gradient component is larger than the maximum, clip it to the maximum value.

Clipping by Norm:
- Calculate the norm of the gradient vector (e.g., L2 norm)
- If the norm exceeds a threshold, rescale the entire gradient vector so its norm is equal to the threshold. This preserves the direction of the gradient while limiting its magnitude.

The ideal threshold is problem-dependent, so some experimentation is usually needed. While it helps with exploding gradients, clipping doesn’t address vanishing gradients; other techniques (e.g., careful weight initialization, LSTMs) may be needed as well.
Gradient clipping helps prevent giant updates that derail your learning process. It also makes training less sensitive to the choice of learning rate, since you’re capping how much change can occur in a single update.
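
Both methods can be sketched on a gradient stored as a plain list of floats. The function names are chosen for this example; deep learning frameworks ship their own equivalents.

```python
import math

def clip_by_value(grad, min_value, max_value):
    # Clamp each component independently into [min_value, max_value].
    return [min(max(g, min_value), max_value) for g in grad]

def clip_by_norm(grad, max_norm):
    # Rescale the whole vector if its L2 norm exceeds max_norm.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]
```

For the gradient `[3.0, 4.0]` (L2 norm 5), `clip_by_norm(grad, 1.0)` gives approximately `[0.6, 0.8]`, which points in the same direction, whereas `clip_by_value(grad, -1.0, 1.0)` gives `[1.0, 1.0]`, which changes the direction.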

29
Q

Gradient descent

A

An iterative optimization algorithm used to minimize the loss function and find the optimal parameters (weights and biases) of a machine learning model. It works by iteratively adjusting the model parameters in the opposite direction of the gradient of the loss function with respect to the parameters. By moving against the gradient, the algorithm descends along the steepest path towards a minimum of the loss function. During gradient descent, each neural network parameter receives an update proportional to the partial derivative of the cost function with respect to that parameter in each iteration of training.
Gradient descent comes in different variants, including batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, each with its own trade-offs in terms of convergence speed, memory usage, and computational efficiency.

The magnitude of the gradient for a specific weight or bias signifies how sensitive the error is to changes in that parameter. A larger gradient indicates that a change in that weight or bias will have a more significant impact on the error. The sign of the gradient tells us whether to increase or decrease the parameter. A positive gradient means that increasing the parameter would increase the error, so the parameter is decreased; a negative gradient means that increasing the parameter will reduce the error.
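
A minimal sketch of the update rule on a one-parameter problem. The toy loss `(w - 3)^2`, the learning rate, and the step count are assumptions chosen so the example is easy to follow.

```python
def gradient_descent(grad_fn, w0, learning_rate=0.1, steps=100):
    # Repeatedly step against the gradient: w <- w - learning_rate * gradient.
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w

# Toy loss(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

With these settings each step shrinks the distance to the minimum by a constant factor of 0.8, so `w_min` ends up very close to 3.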

30
Q

Gradient-based hyperparameters tuning

A

The process of optimizing the hyperparameters of machine learning models using gradient-based optimization algorithms. Instead of manually tuning hyperparameters or using grid search techniques, gradient-based methods leverage the gradients of a chosen performance metric (e.g., validation loss) with respect to the hyperparameters, which requires the hyperparameters to be continuous and the metric to be differentiable in them. By iteratively updating the hyperparameters in the direction that minimizes the loss, these methods efficiently search the hyperparameter space and find optimal or near-optimal configurations. Examples include hypergradient methods, which differentiate the validation loss through the training procedure, and gradient-based meta-learning approaches, which learn to adapt hyperparameters during training. Bayesian optimization, by contrast, models the performance metric with a probabilistic surrogate and does not need gradients of the objective, so it is usually classed as a separate family of methods.

31
Q

Grid Search CV

A

Grid Search Cross-Validation (CV) is a technique used to tune the hyperparameters of a machine learning model by exhaustively searching through a specified grid of hyperparameter values and evaluating each combination using cross-validation to determine the optimal set of hyperparameters.
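
The exhaustive search can be sketched with `itertools.product`. The scoring function here is a hypothetical placeholder: in a real workflow `cv_score` would train the model with the given hyperparameters on each cross-validation split and return the mean validation score. The grid values and the names `C` and `gamma` are likewise only illustrative.

```python
from itertools import product

def grid_search(param_grid, cv_score):
    # Try every combination in the grid and keep the best-scoring one.
    best_params, best_score = None, float("-inf")
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_cv_score(params):
    # Hypothetical placeholder for a mean cross-validated score;
    # it is highest at C = 1.0 and gamma = 0.1.
    return -abs(params["C"] - 1.0) - abs(params["gamma"] - 0.1)

grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(grid, toy_cv_score)
```

The cost grows with the product of the grid sizes (here 3 x 3 = 9 evaluations, each of which would itself involve k model fits under k-fold cross-validation).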

32
Q

HOG

A

Histogram of Oriented Gradients (HOG) is a feature extraction technique used in computer vision and image processing to represent the local texture and shape information of an image. HOG computes histograms of gradient orientations within localized regions of the image and concatenates these histograms to form a feature vector that describes the overall structure of the image. HOG features are commonly used in object detection, pedestrian detection, and other tasks where capturing shape and texture information is important.

33
Q

Holdout sets

A

Also known as validation sets or validation data, holdout sets are subsets of the dataset used to evaluate the performance of a machine learning model during training. Holdout sets are distinct from the training set and are not used for model parameter estimation. Instead, they are used to assess the generalization performance of the model on unseen data and to tune hyperparameters such as learning rate, regularization strength, and model architecture. Holdout sets are typically held out from the training process and only used intermittently to monitor the model’s performance and prevent overfitting.

You might use a holdout set iteratively throughout development, adjusting your model based on its performance. The test set is meant to be used only once. If you use a holdout set repeatedly to tune your model, you risk it subtly influencing your choices and biasing your estimation. The test set, held strictly separate, avoids this.

The holdout set is used primarily during the model development process. The test set is used for a final, rigorous assessment reserved until the very end of the development process. This is meant to give an unbiased estimate of the final model’s performance to help you decide if it’s ready for deployment.

34
Q

Hooks

A

In the context of deep learning frameworks such as PyTorch and TensorFlow, hooks are callback functions or mechanisms used to intercept and observe internal states or operations of neural network modules during the forward and backward passes. Hooks allow users to inspect and manipulate intermediate activations, gradients, and other internal variables of the network for debugging, visualization, and research purposes. Hooks are commonly used for feature visualization, gradient-based optimization, and model interpretation in deep learning workflows.

35
Q

Hyperparameter C

A

The C hyperparameter in SVMs acts as a regularization parameter. It controls the trade-off between fitting the training data and keeping a wide margin:
- Large C: Enforces a stricter decision boundary, aiming to classify training examples correctly even if it leads to a more complex model (risks overfitting).
- Small C: Allows for a wider margin around the decision boundary, accepting some misclassifications on the training data for the sake of better generalizing to unseen data (risks underfitting).

When training an SVM, the goal is to find a hyperplane that separates the classes in your data while maximizing the margin (the distance between the hyperplane and the closest data points for each class). The C parameter controls the penalty applied for having data points within the margin or on the wrong side of the hyperplane.

36
Q

Hyperparameters to tune in gradient boosting

A

Tree-Specific Parameters:
max_depth: Maximum depth of each tree.
min_child_weight: Minimum sum of instance weight needed in a child node.
subsample: Subsample ratio of the training instances.
colsample_bytree: Subsample ratio of columns when constructing each tree.
colsample_bylevel: Subsample ratio of columns for each level.
colsample_bynode: Subsample ratio of columns for each split.
max_delta_step: Maximum delta step allowed for each tree’s weight estimation.
gamma: Minimum loss reduction required to make a further partition on a leaf node.
lambda: L2 regularization term on weights.
alpha: L1 regularization term on weights.
scale_pos_weight: Control the balance of positive and negative weights.

Learning Task Parameters:
objective: The learning objective or loss function.
eval_metric: Evaluation metric for validation data.
num_class: Number of classes in a multi-class classification.

Learning Control Parameters:
eta or learning_rate: Step size shrinkage used to prevent overfitting.
n_estimators or num_boost_round: Number of boosting rounds or trees to build.
early_stopping_rounds: Early stopping to prevent overfitting based on a validation dataset.
verbose: Verbosity level.
silent: Whether to suppress messages during training (in XGBoost, deprecated in favor of verbosity).

Additional Parameters (Specific to Certain Implementations):
Parameters specific to XGBoost: e.g., tree_method, booster, gpu_id, max_bin.
Parameters specific to LightGBM: e.g., boosting_type, num_leaves, max_bin, device.
Parameters specific to CatBoost: e.g., depth, border_count, l2_leaf_reg.

37
Q

Kernel Function

A

A mathematical function used to compute the similarity or dot product between pairs of data points in a higher-dimensional space without explicitly mapping them to that space. Kernel functions are commonly used in kernel methods, such as support vector machines (SVMs) and kernel ridge regression, to transform input data into a higher-dimensional feature space where linear separation or regression is easier to achieve. Popular kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid kernels, each with its own characteristics and suitability for different types of data.

38
Q

Kernel trick

A

A technique used in machine learning to implicitly map input data into a higher-dimensional feature space using kernel functions without explicitly computing the transformed feature vectors. By applying the kernel trick, kernel methods such as support vector machines (SVMs) and kernel ridge regression can operate in the original input space while benefiting from the advantages of working in a higher-dimensional feature space, such as increased expressiveness and improved separability of classes or patterns. The kernel trick allows these algorithms to efficiently handle non-linear relationships in the data and perform complex pattern recognition tasks without explicitly computing the feature vectors.
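
A small numeric check can make the trick concrete. For 2-D inputs, the degree-2 polynomial kernel `(x . y)^2` gives the same number as an ordinary dot product after explicitly mapping each point to the three features `(x1^2, sqrt(2) * x1 * x2, x2^2)`. The helper names and the sample points are made up for this illustration; an RBF kernel is included only to show its formula.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly2_kernel(x, y):
    # Kernel evaluated directly in the original 2-D input space.
    return dot(x, y) ** 2

def poly2_features(x):
    # The explicit feature map that poly2_kernel corresponds to (2-D inputs only).
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def rbf_kernel(x, y, gamma=1.0):
    # Similarity that decays with squared distance; identical points score 1.0.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 2.0], [3.0, 4.0]
implicit = poly2_kernel(x, y)                          # (1*3 + 2*4)^2 = 121.0
explicit = dot(poly2_features(x), poly2_features(y))   # same value up to rounding
```

The kernel version never builds the 3-D feature vectors, which is the saving the trick provides; for kernels such as RBF the implicit feature space is infinite-dimensional, so the explicit route is not even available.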

39
Q

Kernels

A

The word “kernel” means different things in linear algebra and in ML. In linear algebra, the kernel of a linear map is its null space. In ML, a kernel is a function built on inner products (dot products) and has nothing to do with finding a null space.
ML kernels serve as generalized similarity measures, with high values indicating strong relatedness, often within a high-dimensional space. This enables algorithms to discover nonlinear patterns in the data, which is essential for many real-world tasks. In other words, at their heart, machine learning kernels are functions that compute the degree of similarity between two data points and they often unlock the power of working in higher-dimensional spaces without the computational cost of explicitly transforming the data. This is known as the “kernel trick.”

A kernel’s output signifies the degree of similarity. This could be based on a simple linear relationship or a complex, nonlinear pattern. Many kernels implicitly correspond to similarity measures in high-dimensional or even infinite-dimensional spaces. The beauty is, these calculations happen without directly transforming the data, saving computational resources.

Many kernels are built on dot products, because the dot product measures how closely two vectors align.

  • Linear Kernel: The simple dot product.
  • Polynomial Kernel: Dot product raised to a power, implicitly capturing feature combinations.
  • Gaussian Kernel (RBF): Computes similarity based on scaled distance, interpretable as a transformation and then a dot product.

Beyond Dot Products
* Not All Similarity Measures Are Kernels: A valid kernel must be symmetric and positive semi-definite (Mercer’s condition), which guarantees that it corresponds to a dot product in some feature space.
* Custom Kernels: Researchers design specialized kernels to compare complex structures like text, graphs, and trees.

Understanding the Kernel Trick
* The power of the kernel trick lies in calculating a complex transformation’s results without explicitly performing it. This is massively computationally efficient.
* This implicit mapping often helps find nonlinear patterns and decision boundaries, making complex problems tractable.

40
Q

Learning Rate

A

A hyperparameter that determines the step size or rate at which the parameters (weights and biases) of a machine learning model are updated during training using optimization algorithms such as gradient descent. A higher learning rate results in faster convergence but may risk overshooting the optimal solution or causing instability, while a lower learning rate may lead to slower convergence or getting stuck in local minima. The learning rate is a critical hyperparameter that requires careful tuning to ensure optimal performance and convergence speed of the training process.
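
A small sketch of the trade-off described above, using plain gradient descent on the toy function f(x) = x^2. The function and the three learning rates are arbitrary choices for illustration.

```python
def gradient_descent(lr, steps=20, x0=5.0):
    """Minimize f(x) = x^2 (gradient 2x) starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2.0 * x
    return x

small = gradient_descent(lr=0.01)  # too small: still far from 0 after 20 steps
good = gradient_descent(lr=0.3)    # reasonable: very close to the minimum at 0
large = gradient_descent(lr=1.1)   # too large: every step overshoots and |x| grows
```

On this function each step multiplies x by (1 - 2 * lr), so the iterate shrinks only when that factor has magnitude below one.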

41
Q

Loss function

A

Measures the error of a model’s prediction on a single data point or example. However, it is often used interchangeably with “cost function”, which measures the overall error of the model across the dataset.

42
Q

Loss vs Cost function

A

The terms “loss function” and “cost function” are often used interchangeably to refer to the function that measures the error or discrepancy between the model’s predictions and the true labels in the training data. However, in some contexts, the term “loss function” is used to refer to the function applied to a single training example, while the term “cost function” refers to the aggregate or average loss over the entire training dataset.
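
A minimal sketch of the distinction, assuming squared error as the per-example loss and its mean as the cost. The function names are illustrative.

```python
def squared_error(y_true, y_pred):
    """Loss: the error on a single example."""
    return (y_true - y_pred) ** 2

def mse_cost(ys_true, ys_pred):
    """Cost: the average of the per-example losses over the dataset."""
    losses = [squared_error(t, p) for t, p in zip(ys_true, ys_pred)]
    return sum(losses) / len(losses)

single_loss = squared_error(3.0, 2.5)
cost = mse_cost([3.0, 1.0, 2.0], [2.5, 1.0, 4.0])
```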

43
Q

Many-to-Many (different lengths)

A

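In sequence modeling with RNNs, a many-to-many architecture with different lengths maps an input sequence to an output sequence whose length may differ from the input’s, as in machine translation. It is typically built as an encoder-decoder: the encoder reads the whole input, then the decoder generates the output step by step. Not to be confused with the data-modeling sense, where a many-to-many relationship is when multiple instances of one entity can be related to multiple instances of another entity, and vice-versa.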

44
Q

Max Pooling

A

A pooling operation used in convolutional neural networks (CNNs) to downsample feature maps and reduce spatial dimensions while retaining important features. Max pooling divides the input feature map into non-overlapping regions (typically squares) and outputs the maximum value within each region, discarding the rest. This process reduces the spatial size of the feature maps, making them more computationally efficient to process and less sensitive to small spatial variations. Max pooling is commonly used after convolutional layers in CNN architectures to progressively reduce the spatial resolution of feature maps while preserving important features.
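
The operation can be sketched in a few lines, assuming NumPy, a single 2-D feature map, and non-overlapping windows (stride equal to the window size). `max_pool2d` is a made-up helper, not a library function.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Max pooling over non-overlapping size x size windows."""
    h, w = x.shape
    h_out, w_out = h // size, w // size
    # Trim rows/columns that do not fill a complete window.
    x = x[:h_out * size, :w_out * size]
    # windows[i, a, j, b] == x[i * size + a, j * size + b]
    windows = x.reshape(h_out, size, w_out, size)
    return windows.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 1],
                        [0, 2, 9, 8],
                        [1, 1, 7, 5]])
pooled = max_pool2d(feature_map)  # [[6, 2], [2, 9]]
```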

45
Q

Mini-batch

A

Mini-batch refers to the practice of dividing your large dataset into smaller, fixed-size groups of samples called mini-batches. Instead of updating the model’s parameters based on the entire dataset at once (batch gradient descent) or using a single example at a time (stochastic gradient descent), the model is updated after processing each mini-batch. This approach strikes a balance between the stability of batch gradient descent and the speed of stochastic gradient descent, leading to faster convergence and better generalization for most deep learning problems.
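
One possible way to produce mini-batches, assuming NumPy arrays and one reshuffle per epoch. The generator name is illustrative.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the data once."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# 10 samples with batch_size=4 give batches of 4, 4 and 2 samples.
batch_sizes = [len(xb) for xb, _ in iterate_minibatches(X, y, batch_size=4, rng=rng)]
```

In a training loop, the parameter update would run once per yielded batch.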

46
Q

Momentum

A

A technique used in optimization algorithms, particularly in gradient descent variants, to accelerate convergence and improve optimization performance. In momentum-based optimization, the update to the model parameters (weights and biases) is not only influenced by the current gradient but also by a momentum term that accumulates previous gradients. It incorporates a fraction of the previous update into the current update, creating a kind of ‘rolling snowball’ effect. By incorporating momentum, the optimization algorithm gains inertia and smooths out oscillations in the gradient descent trajectory, allowing it to overcome local minima and escape saddle points more effectively. Momentum helps accelerate convergence, improve stability, and navigate complex optimization landscapes in machine learning models.
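
A sketch of the classical momentum update on the toy function f(x) = x^2, assuming the common form v <- beta * v + grad, x <- x - lr * v. Other formulations scale the terms differently, and the hyperparameter values here are arbitrary.

```python
def sgd_momentum(grad_fn, x0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with classical momentum on a 1-D function."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + grad_fn(x)  # accumulate a decaying sum of past gradients
        x = x - lr * v             # step along the accumulated direction
    return x

# f(x) = x^2 has gradient 2x and its minimum at x = 0.
x_final = sgd_momentum(lambda x: 2.0 * x, x0=5.0)
```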

47
Q

Multitask loss

A

Also known as joint loss or composite loss, is a loss function used in multitask learning to optimize multiple learning objectives simultaneously. In multitask learning, a single model is trained to perform multiple related tasks simultaneously, leveraging shared representations and transfer learning to improve performance on each task. The multitask loss function aggregates the individual losses from each task into a single composite loss, which is minimized during training using gradient-based optimization algorithms. Multitask loss encourages the model to learn task-specific features while sharing knowledge and information across tasks, leading to more efficient learning and better generalization performance.

48
Q

Non-linearity in NN

A

Achieved by activation functions that take a layer’s linear output and apply a non-linear operation to it. Neural networks wouldn’t be very powerful without non-linearity. Each layer in a neural network performs a linear combination of its inputs, which is like drawing a straight line through the data, and stacking such layers adds nothing, because a composition of linear functions is still linear. But the real world is full of complex relationships, not straight lines. Here’s where non-linearity comes in: activation functions applied after each layer’s linear operation introduce bends and curves, allowing the network to model complex patterns. Imagine stacking multiple curved functions together, like building a curvy road: the network can learn to represent very intricate relationships between the input data and the desired output, making it suitable for tasks like image recognition or speech translation.
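
A tiny check of why the activation is needed, assuming NumPy and hand-picked weights: two linear layers with nothing in between behave exactly like one linear layer, while inserting ReLU changes the result.

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 0.0])

# Two stacked linear layers equal a single layer with weights W2 @ W1.
two_linear = W2 @ (W1 @ x)
one_linear = (W2 @ W1) @ x

# With ReLU in between, the negative hidden value is clipped to zero,
# so the network no longer collapses to one matrix product.
with_relu = W2 @ np.maximum(0.0, W1 @ x)
```

Here `two_linear` and `one_linear` are both `[0.]`, while `with_relu` is `[1.]`.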

49
Q

Optimization

A

Optimizing means finding the minimum or maximum of a function.

Process of adjusting the parameters (weights and biases) of a machine learning model to minimize a predefined objective function or loss function. The goal of optimization is to find the optimal set of parameters that best fits the training data and generalizes well to unseen data. Optimization algorithms, such as gradient descent and its variants, iteratively update the model parameters based on the gradients of the loss function with respect to the parameters. Optimization techniques play a critical role in training machine learning models, ensuring convergence, stability, and efficiency in the learning process.

50
Q

Parametrizing model

A

Parametrizing a model is more akin to setting up the skeletal structure, which determines the types and potential number of parameters the model will learn. Parameterizing a model refers to the process of defining and setting the parameters (also known as weights and biases) that govern the behavior of the model. In the context of machine learning, parameters are the variables that the model learns from the training data to make predictions or perform a specific task. The goal of parameterizing a model is to find the optimal values for these parameters that minimize the difference between the model’s predictions and the actual outcomes.

The model’s structure and parameters determine the kinds of patterns it could learn, much like a recipe defines a range of possible cakes. The actual power of the model comes from finding the best possible parameter settings during training.

51
Q

Pooling

A

Pooling works similarly to convolution, but instead of applying a trainable filter it applies a fixed operator such as max or average. Pooling has no learnable parameters, only hyperparameters: the window size, the stride, and the type of operator (max or average). By reducing the amount of information passed on, it speeds up network training.

52
Q

Principal Component Analysis (PCA)

A

A dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving the most important information. PCA identifies the principal components, which are the orthogonal axes that capture the maximum variance in the data. By projecting the data onto these principal components, PCA reduces the dimensionality of the dataset while minimizing information loss. PCA is widely used for data visualization, noise reduction, and feature extraction in various machine learning and data analysis tasks.
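
A minimal PCA sketch via the singular value decomposition, assuming NumPy and data laid out as samples by features. The synthetic dataset and the helper name `pca` are illustrative.

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance[:n_components]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Synthetic 2-D data lying close to the line y = 2x.
X = np.column_stack([t, 2.0 * t + 0.05 * rng.normal(size=200)])

projected, components, variance = pca(X, n_components=1)
```

The first principal axis should point roughly along the direction (1, 2), so one coordinate is enough to describe almost all of the variation in this dataset.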

53
Q

Propagation

A

Propagation generally refers to how information or changes flow through a complex, interconnected system. Essentially, it means a chain of calculations in which each step depends on the result of the previous one. In machine learning, it occurs in two key areas:

  1. Forward Propagation (Input to Output)

Input data is fed into the first layer of a neural network. Calculations propagate forward, layer by layer, with each layer applying weights, biases, and activation functions. Finally, the output layer produces the prediction or classification. Think of it as a chain reaction, where the output of one stage triggers calculations in the next.

  2. Backward Propagation (Errors for Training)

Backpropagation (the heart of training neural networks) is where “propagation” becomes truly crucial. The prediction error (difference between the desired output and the network’s actual output) is calculated. This error signal is propagated backward through the network, layer by layer. Using calculus (chain rule), the contribution of each weight and bias to the error is determined.
Weights and biases are adjusted slightly in a direction that minimizes the error (gradient descent).

54
Q

Pruning

A

A technique used in machine learning and neural networks to reduce the size and complexity of the model by removing unnecessary or redundant parameters, connections, or nodes. Pruning helps improve model efficiency, reduce overfitting, and enhance generalization performance by simplifying the model structure and removing irrelevant features. Pruning can be applied during training or after training by setting small weights or connections to zero based on certain criteria, such as magnitude, importance, or contribution to the overall model performance.

55
Q

Random hyperparameters search

A

In the vast space of possible hyperparameter combinations, random search is often surprisingly effective compared to exhaustive searches. Unlike grid search, it doesn’t assume that hyperparameters impact performance in a smooth or monotonic way. This can be helpful when the relationship between hyperparameters and performance is complex. By sampling randomly, it can often find good-enough hyperparameter values much faster than trying every possible combination in a grid.

To do it you specify a distribution (ranges) for each hyperparameter to search over (e.g., uniform distribution between a minimum and a maximum value). The algorithm randomly samples combinations of hyperparameters from the defined distributions.
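
The sampling procedure can be sketched as follows, using only the standard library. The search space, the parameter names, and the toy objective (standing in for a real validation score) are all assumptions made for the example.

```python
import math
import random

def random_search(objective, space, n_trials, seed=0):
    """Sample hyperparameters at random and keep the best-scoring combination."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: sampler(rng) for name, sampler in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical space: log-uniform learning rate, uniform integer depth.
space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),
    "max_depth": lambda rng: rng.randint(2, 10),
}

def toy_objective(p):
    """Stand-in for a validation score; peaks at learning_rate=0.01, max_depth=6."""
    return -(math.log10(p["learning_rate"]) + 2) ** 2 - 0.1 * (p["max_depth"] - 6) ** 2

best_params, best_score = random_search(toy_objective, space, n_trials=50)
```

Sampling the learning rate on a log scale is a common choice when plausible values span several orders of magnitude.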

When to Use It: Good at the beginning when you have little knowledge about which hyperparameters are important. Useful when you have many hyperparameters to tune. Useful when you’re limited by time or computational resources.

56
Q

RBF Kernel

A

The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is a popular kernel function used in kernel methods, particularly in support vector machines (SVMs) and kernelized regression models. The RBF kernel computes the similarity between two data points as a Gaussian function of the Euclidean distance between them: identical points score 1, and the score decays toward 0 as the points move apart.
It is defined as:

K(x, x′) = exp(−||x − x′||^2 / (2σ^2)),

where x and x′ are data points, ||x − x′||^2 denotes the squared Euclidean distance between them, and σ is a hyperparameter that controls the spread of the kernel. The RBF kernel is versatile and effective for capturing non-linear relationships in the data.
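
A short sketch of the kernel in its standard Gaussian form, exp(−||x − x′||^2 / (2σ^2)), assuming NumPy. `rbf_kernel` here is a hand-written helper, not a library call.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two points."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    sq_dist = np.sum(diff ** 2)
    return float(np.exp(-sq_dist / (2.0 * sigma ** 2)))

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical points -> 1.0
near = rbf_kernel([0.0, 0.0], [1.0, 0.0])  # squared distance 1 -> exp(-0.5)
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])   # squared distance 25 -> close to 0
```

A larger σ makes the similarity decay more slowly with distance, and a smaller σ makes the kernel more local.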

57
Q

ReLU Function

A

The Rectified Linear Unit (ReLU) function is a non-linear activation function commonly used in neural networks to introduce non-linearity and enable the network to learn complex patterns and relationships in the data. The ReLU function is defined as:

f(x) = max(0, x)

This means it outputs the input x if it is positive and zero otherwise. ReLU activation is computationally efficient, easy to implement, and helps mitigate the vanishing gradient problem during training. It’s widely used in deep learning architectures like convolutional neural networks (CNNs) and feedforward neural networks.

58
Q

Resampling methods

A

Resampling methods in machine learning are techniques used to modify or create new training data sets by randomly sampling from the original data set. These methods are particularly useful for tasks such as model evaluation, model selection, and dealing with imbalanced data sets.

They can improve the performance of models when certain classes are severely underrepresented. They can also reduce the variance of model evaluation when you have limited data. Resampling shouldn’t be the first solution for imbalance: sometimes collecting more data or making algorithmic adjustments works better. With over-sampling in particular, there is a risk that the model will memorize the specific oversampled examples.

Over-Sampling: Used for imbalanced datasets where some classes are underrepresented. Techniques include:
Random Over-Sampling: Replicating examples from the minority class (can lead to overfitting).
SMOTE: Generating synthetic new examples in the minority class by interpolating between existing minority examples.
ADASYN: Similar to SMOTE, but focuses on creating more difficult minority examples near class boundaries.

Under-Sampling: Used when you have an abundance of data in certain classes. Techniques include:
Random Under-Sampling: Randomly removing examples from the majority class (risks discarding valuable information).
Cluster Centroids: Replacing groups of majority class samples with the cluster centroid.
Tomek Links: Identifying and removing borderline majority examples that might be mislabeled or noisy.

Bootstrapping: Used to estimate the statistical properties of a model, such as its accuracy or confidence intervals.
It involves randomly sampling with replacement from the original dataset to create multiple bootstrap samples, each of the same size as the original dataset.
The model is trained on each bootstrap sample, and the aggregated results are used to estimate the model’s performance or statistical properties.
Bootstrapping is particularly useful when the dataset is limited or when estimating uncertainty in model predictions.

Cross-Validation: Used to evaluate the performance of a machine learning model.
The dataset is divided into multiple subsets, or folds, with a portion of the data reserved for training and the rest for testing.
The model is trained on the training set and evaluated on the test set. This process is repeated multiple times, with each fold used as the test set exactly once.
Common types of cross-validation include k-fold cross-validation, stratified k-fold cross-validation, leave-one-out cross-validation, and repeated cross-validation.
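
Two of the methods above, k-fold cross-validation and bootstrapping, can be sketched as index generators, assuming NumPy. The helper names are illustrative; libraries such as scikit-learn provide fuller versions.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def bootstrap_sample(n_samples, rng):
    """Indices drawn with replacement, same size as the original dataset."""
    return rng.integers(0, n_samples, size=n_samples)

splits = list(k_fold_indices(n_samples=10, k=5))
boot = bootstrap_sample(10, np.random.default_rng(1))
```

A bootstrap sample usually contains repeated indices and leaves some samples out. The left-out samples are sometimes used as an "out-of-bag" test set.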

59
Q

Sequential model

A

A type of neural network architecture in which the layers are arranged sequentially, one after the other, with each layer feeding its output as the input to the next layer. Sequential models are simple and easy to understand, making them suitable for a wide range of machine learning tasks, including classification, regression, and sequence prediction. Sequential models are one option among several, alongside non-sequential (branching) networks, tree-based models, and others.

60
Q

Sigmoid Function

A

The sigmoid function, also known as the logistic function, is a non-linear activation function commonly used in neural networks to introduce non-linearity and squash the output of a neuron into the range (0, 1). The sigmoid function is defined as:

f(x) = 1 / (1 + e^(-x))

where e is the base of the natural logarithm. The sigmoid function produces a smooth S-shaped curve, suitable for binary classification and outputting probabilities. However, it’s prone to saturation and vanishing gradients for extreme input values, potentially slowing learning in deep neural networks.

61
Q

Skip connections

A

In deep neural networks, a skip connection is a direct link that bypasses one or more layers, allowing information to “jump” ahead. Instead of data only flowing sequentially through layers, skip connections introduce alternative paths, creating a multi-layered, less linear structure.

When networks get very deep, gradients used for updating weights during backpropagation can vanish or explode. Skip connections ease the flow of gradients back through the layers, helping with training very deep models. Skip connections allow earlier layers’ information to directly reach later layers. This can help preserve important features that might otherwise get diluted as data passes through many transformations. Skip connections create networks that resemble ensembles of shallower models. This can improve performance and reduce overfitting.

One of the most famous examples of using skip connections is ResNets. In ResNets, the output of a block of layers is added to that block’s input before the result enters the next block, creating these skip paths.
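
A heavily simplified residual block, assuming NumPy and fully connected layers. Real ResNets use convolutions and batch normalization, so treat this only as a sketch of the "add the input back" idea.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """output = relu(F(x) + x), where F is two linear layers with a ReLU between."""
    fx = W2 @ relu(W1 @ x)
    return relu(fx + x)  # the "+ x" term is the skip connection

x = np.array([1.0, 2.0, 3.0])
zeros = np.zeros((3, 3))

# With all-zero weights F(x) is zero, yet the input still passes through unchanged.
out = residual_block(x, zeros, zeros)
```

This shows why the block only has to learn a correction to the identity. If the layers contribute nothing useful, the signal and its gradient can still travel through the skip path.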

62
Q

Sliding window

A

A technique used in signal processing, computer vision, and time series analysis to process data in a sequential manner by moving a fixed-size window or kernel over the input data one step at a time. The sliding window approach allows for local feature extraction, segmentation, or analysis of data streams by capturing temporal or spatial patterns within the window. Sliding windows are commonly used in tasks such as object detection, edge detection, motion estimation, and feature extraction from time series data. They make it possible to process large datasets and streaming data piece by piece, including in real time. Sliding a convolutional kernel across an image is one instance of this idea.
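
A minimal sketch on a 1-D sequence, using a moving average as the per-window computation. The helper name and the sample data are made up for the example.

```python
def sliding_windows(seq, size, step=1):
    """Return every window of `size` items, moving `step` items at a time."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]

signal = [1, 2, 3, 4, 5, 6]
windows = sliding_windows(signal, size=3)
# windows == [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

moving_avg = [sum(w) / len(w) for w in windows]
# moving_avg == [2.0, 3.0, 4.0, 5.0]
```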

63
Q

Softmax function

A

The softmax function is a generalization of the sigmoid function to multidimensional outputs. It is a mathematical function that converts a vector of raw scores or logits into a probability distribution over multiple classes. It is commonly used as the output activation function in multi-class classification problems, where the goal is to predict the probability of each class given a set of input features. The softmax function computes the probability of each class as the exponential of the input score divided by the sum of the exponentials of all scores, ensuring that the output probabilities sum to one. Softmax activation produces a smooth probability distribution that can be interpreted as the model’s confidence in each class prediction.
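
A short sketch of the computation, assuming NumPy and a single 1-D vector of logits. Subtracting the maximum before exponentiating is a standard trick that leaves the result unchanged but avoids overflow.

```python
import numpy as np

def softmax(logits):
    """Convert a 1-D vector of scores into probabilities that sum to 1."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()      # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax([2.0, 1.0, 0.1])  # the largest logit gets the largest probability
big = softmax([1000.0, 1000.0])   # would overflow without the shift
```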

64
Q

State (in RNN)

A

The state is a dynamic representation of the sequence the RNN has seen so far. Think of it as the current output of the memory cell, reflecting what it “remembers” from the past inputs. The state changes with every step in the sequence. The state is calculated using the current input, the previous state, and the weights of the RNN.

Analogy:
Recipe (Weights): The instructions for how to bake a cake.
Bowl of Batter (State): What’s currently in the bowl after following some of the recipe steps. The batter changes as you add ingredients (inputs), but the recipe itself remains the same.

In Recurrent Neural Networks (RNNs), the “state” (often called the “hidden state”) refers to a piece of information that the network carries forward through time as it processes a sequence. Think of it as the RNN’s memory. At each step, the RNN takes the current input and the previous state, combines them, and produces an output and an updated state. Representation: The state is a vector of numbers, which can be thought of as a learned representation of the sequence up to that point.
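
The update can be sketched for a vanilla RNN, assuming NumPy, a tanh activation, and randomly initialized weights. The weight names `W_xh` and `W_hh` are conventional, but here they are just illustrative.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """Vanilla RNN update: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

sequence = rng.normal(size=(5, input_size))  # 5 time steps
h = np.zeros(hidden_size)                    # initial state
states = []
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)    # same weights, evolving state
    states.append(h)
```

The weights (the recipe) stay fixed across the loop, while `h` (the batter) changes at every step.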

65
Q

Stochastic Gradient Descent

A

Stochastic Gradient Descent (SGD) is a variant of the gradient descent optimization algorithm used to train machine learning models by updating the model parameters (weights and biases) iteratively based on the gradients of the loss function computed on small random subsets of the training data. Strictly speaking, SGD uses a single randomly chosen example per update, but in practice the term usually covers mini-batch updates as well. Unlike batch gradient descent, which computes the gradients using the entire training dataset, SGD approximates the gradients from these small samples, giving cheaper and more frequent updates and reduced memory requirements. SGD is widely used in deep learning and large-scale machine learning tasks due to its computational efficiency and ability to handle large datasets.

66
Q

Stride

A

In convolutional neural networks (CNNs), the stride refers to the step size or displacement with which the convolutional kernel slides or moves across the input data during the convolution operation. The stride determines the amount of overlap between adjacent receptive fields and affects the spatial dimensions of the output feature maps. A stride of one (the usual default) means that the kernel moves one pixel at a time; with suitable padding the output feature maps keep the same spatial dimensions as the input, otherwise they shrink by the kernel size minus one along each dimension. Larger strides reduce the spatial dimensions of the output feature maps, leading to spatial downsampling. A standard convolution cannot use a stride below one; upsampling is done instead with transposed (fractionally strided) convolutions. Stride plays a crucial role in controlling the spatial resolution and receptive field size in CNN architectures.
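
The effect of stride on output size follows the usual formula floor((n + 2p − k) / s) + 1 along each spatial dimension, where n is the input size, k the kernel size, p the padding, and s the stride. A small sketch (the helper name is made up):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution."""
    return (n + 2 * padding - kernel) // stride + 1

no_pad = conv_output_size(32, kernel=3, stride=1)               # 30
same_pad = conv_output_size(32, kernel=3, stride=1, padding=1)  # 32
strided = conv_output_size(32, kernel=3, stride=2, padding=1)   # 16
```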

67
Q

Target

A

The desired output or ground truth associated with each input data point in the training dataset. The target represents the correct label, class, or value that the machine learning model aims to predict or approximate during training. In classification tasks, the target is typically a categorical label or class label indicating the correct category or class membership of the input data. In regression tasks, the target is a continuous numerical value representing the true output or target variable associated with the input data. The goal of supervised learning is to train the model to predict the target accurately based on the input features and minimize the discrepancy between the predicted output and the true target value.

68
Q

Unknown word token

A

During training, a vocabulary is built. Even very large datasets can’t cover every possible word, especially with misspellings, new terms, slang, and proper nouns. If a model encountered an unseen word in the wild, it would break down if it didn’t have a way to handle it. That is why we need an additional token, <UNK>. It’s a placeholder used to represent words that were not encountered during the training phase of a language model. These words are considered out-of-vocabulary (OOV). Generally, there’s one <UNK> token in the vocabulary, and all unknown words are mapped to this token.

It is therefore not exactly an “unknown word” token: its vector holds averaged, generalized features of all the words that were unknown. Think of unknown words as being ‘collapsed’ into the <UNK> token. It doesn’t store individual unknown words, but the <UNK> representation is shaped by the statistical patterns of how diverse unknown words are used in the contexts where they appear. Also, words below a certain frequency threshold are often replaced with the <UNK> token.

It is useful because, even though the model will never output the word that was unknown, it can still properly learn the features of the other words. If those words have features that describe how they relate to the unknown words, having this token helps preserve those features.

However, it is not perfect. Multiple unknown words with contradictory sentiment in the same sentence confuse the model: the more different the unknown words are, the less effective a single <UNK> token becomes. In general, the unknown word token relies on there being few unknown words, or at least on their being similar. Otherwise the vector is unspecific: it captures all types of features and points in all sorts of directions.

This is why techniques like character-level modeling or subword tokenization are helpful, especially when you expect diverse unknown words. Here are some more developed techniques that address the limitations of the basic <UNK> token approach:

  1. Character-Level Models
    * No Unknown Words: Instead of operating on word-level tokens, these models break down text into individual characters or sequences of characters.
    * Robustness: Any word, including unseen ones, can be represented, eliminating the need for an <UNK> token altogether.
    * Trade-offs:
      • Complexity: Computationally more expensive, since sequences become much longer.
      • May need more data to learn meaningful character-level patterns.
  2. Subword Tokenization
    * Smaller Units: Words are split into subword units, such as common prefixes, suffixes, and roots (e.g., “undesirable” might become “un-,” “desir-,” “-able”).
    * Reduced Unknowns: Significantly decreases the vocabulary size needed, leading to fewer instances of unknown words.
    * Flexibility: Techniques like Byte Pair Encoding (BPE) or SentencePiece learn the subword vocabulary directly from your data.
  3. Leveraging Context Dynamically
    * Attention-Based Models (e.g., Transformers): These models excel at dynamically weighting the importance of different words in a sentence. This makes them more robust to unknown words, as they can focus on the known words that provide a stronger signal.
    * Pretrained Language Models: Models like BERT or GPT-3 are trained on massive text datasets, giving them a wider vocabulary and a better ability to understand the context around unknown words.
  4. Hybrid Approaches
    * Combining Techniques: Using character-level or subword models for unknown words while maintaining a word-level vocabulary for frequent terms can be a powerful hybrid approach.
    * Data Augmentation: Strategically replacing known words with <UNK> during training can improve the model’s ability to handle unseen words.
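
The basic word-level mechanism (a frequency threshold, with everything else mapped to one placeholder) can be sketched as follows. The tiny corpus, the `min_count` threshold, and the function names are assumptions for illustration.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(tokens, min_count=2):
    """Keep words seen at least min_count times; all others map to <UNK>."""
    counts = Counter(tokens)
    vocab = {UNK: 0}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token by its id, falling back to the <UNK> id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

train = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(train, min_count=2)      # {"<UNK>": 0, "the": 1, "cat": 2}
ids = encode("the dog sat".split(), vocab)   # [1, 0, 0]
```

Note that "sat" was seen during training but falls below the threshold, so it collapses into the same id as the never-seen "dog".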
69
Q

Vanishing gradient

A

A problem encountered during the training of deep neural networks, where the gradients of the loss function with respect to the model parameters become extremely small as they are backpropagated through many layers of the network. In deep networks, as gradients are propagated backward from the output layer to the input layer during backpropagation, they can diminish exponentially with each layer due to the repeated application of activation functions with small derivatives, such as sigmoid or hyperbolic tangent functions. As a result, the updates to the parameters in the early layers become negligible, leading to slow convergence and difficulty in learning meaningful representations from the data. Vanishing gradient can impede the training of deep networks and affect their ability to capture complex patterns in the data.
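
A back-of-the-envelope sketch of the effect for sigmoid activations, using only the standard library. It ignores the weight matrices, which also scale the gradient, so it only bounds the contribution of the activations.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The sigmoid derivative is at most 0.25 (at x = 0), so a chain of n sigmoid
# layers scales the gradient by at most 0.25 ** n.
max_grad = sigmoid_grad(0.0)
bound_10_layers = max_grad ** 10   # below one in a million
saturated = sigmoid_grad(10.0)     # far from 0 the derivative is tiny
```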

70
Q

volume (CNN)

A

In the context of Convolutional Neural Networks (CNNs), “volume” typically refers to the three-dimensional structure of input data, intermediate representations (feature maps), and learnable parameters (filters or kernels) within the network architecture. In CNNs, we don’t just deal with flat images but with 3-dimensional blocks of data: height, width, and colour channels (e.g., 224 pixels x 224 pixels x 3). As data flows through the convolutional and pooling layers of a CNN, the dimensions of the representations change. However, they still maintain this 3D volume-like structure.

But “Volume” can also refer to the number of images processed simultaneously during training or inference, known as the batch size. “Volume” might refer to the rate at which images are processed per unit of time, such as images processed per second (IPS) or frames per second (FPS), especially in real-time applications like video processing. Additionally, “volume” can represent the amount of data contained within a single image, particularly when dealing with high-dimensional data such as medical images or satellite imagery.

71
Q

Types of Activation Functions

A
  1. Sigmoid:
    Output range: (0, 1)
    Best usage: Output layer for binary classification tasks. Less common in hidden layers due to vanishing gradient problem.
  2. Hyperbolic Tangent (tanh):
    Output range: (-1, 1)
    Best usage: Hidden layers in neural networks for general-purpose tasks.
  3. Rectified Linear Unit (ReLU):
    Output range: [0, +∞)
    Best usage: Hidden layers in deep neural networks. Effective in alleviating the vanishing gradient problem.
  4. Leaky ReLU:
    Output range: (-∞, +∞)
    Best usage: Alternative to ReLU, providing a small, non-zero gradient for negative inputs, helpful in preventing dying ReLU problem.
  5. Parametric ReLU (PReLU):
    Output range: (-∞, +∞)
    Best usage: Similar to Leaky ReLU but with the negative slope α learned during training.
  6. Exponential Linear Unit (ELU):
    Output range: (-α, +∞)
    Best usage: An alternative to ReLU, with smoother transitions for negative inputs.
  7. Scaled Exponential Linear Unit (SELU):
    Output range: (-λα, +∞), approximately (-1.76, +∞)
    Best usage: Introduced for maintaining mean and variance during training, works well for deeper architectures.
  8. Softmax:
    Output range: (0, 1) for each element, with the sum being 1.
    Best usage: Output layer for multi-class classification tasks to obtain probabilities for each class.
  9. Softplus:
    Output range: (0, +∞)
    Best usage: A smooth approximation of ReLU, often used in shallow networks or when smoothness is preferred over sparsity.
  10. Swish:
    Output range: approximately [-0.28, +∞)
    Best usage: Introduced as a more effective alternative to ReLU, tends to perform well across a range of tasks, particularly in deeper models.
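
Several of the functions listed above can be written in a line each, assuming NumPy. The default `alpha` values are common choices rather than fixed parts of the definitions, and these simple forms are not hardened against overflow for very large inputs.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softplus(x):
    return np.log1p(np.exp(x))

def swish(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

x = np.array([-2.0, 0.0, 2.0])
outputs = {f.__name__: f(x) for f in (relu, leaky_relu, elu, softplus, swish)}
```

Comparing the entries of `outputs` at x = -2 shows the main difference between them: ReLU returns exactly zero, while the others keep a small non-zero value for negative inputs.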
72
Q

Elastic Net Regularization

A

A technique used in machine learning to prevent overfitting and improve the generalization of models by adding a regularization term to the loss function. It combines the penalties of L1 (Lasso) and L2 (Ridge) regularization methods, allowing both feature selection and coefficient shrinkage. Elastic Net regularization is particularly useful when dealing with high-dimensional datasets where there are many features, as it helps to automatically select relevant features while shrinking coefficients towards zero to avoid model complexity.
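
The combined penalty can be sketched as below, assuming NumPy and one common parameterization (the one used by scikit-learn’s ElasticNet), in which `alpha` sets the overall strength and `l1_ratio` the mix between L1 and L2. Other texts use two separate coefficients instead.

```python
import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2)."""
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))  # encourages exact zeros (feature selection)
    l2 = np.sum(w ** 2)     # shrinks all coefficients smoothly
    return float(alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2))

# ||w||_1 = 3 and ||w||_2^2 = 5, so the penalty is 0.1 * (1.5 + 1.25) = 0.275.
penalty = elastic_net_penalty([1.0, -2.0, 0.0], alpha=0.1, l1_ratio=0.5)
```

This value is added to the data-fitting loss. Setting `l1_ratio=1.0` recovers a pure Lasso penalty and `l1_ratio=0.0` a pure Ridge penalty.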

73
Q

How to deal with overfitting?

A

Overfitting occurs when a machine learning model learns to perform well on the training data but fails to generalize to unseen data. Several techniques can help address overfitting:

  1. Data-Focused Techniques
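    Gather More Data: More training examples make it harder for the model to memorize noise.
    Data Augmentation: Creating modified copies of existing examples (e.g., flipped, cropped, or noised inputs) increases the diversity of the training data and helps the model learn more robust features.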
    Feature Selection: Remove irrelevant or redundant features that might be causing your model to focus on spurious patterns.
  2. Regularization: Techniques like L1 and L2 regularization add penalty terms to the loss function, discouraging the model from learning overly complex patterns.
  3. Model-Based Strategies
    Early Stopping: Monitoring the model’s performance on a validation set during training and stopping when performance starts to degrade can prevent overfitting.
    Ensemble Methods: Combining predictions from multiple models (e.g., bagging, boosting) can reduce overfitting by leveraging the wisdom of crowds.
    Simpler Models: Start with less complex models (e.g., fewer layers, fewer neurons). They’re less prone to memorize intricate patterns of the training data.
  4. Other Approaches
    Cross-Validation: Trains multiple models on different splits of the data, helping in selecting hyperparameters and getting a more robust sense of the model’s generalization.
    Hyperparameter Tuning: Experiment with learning rates, regularization strength, batch sizes, etc., to find settings that help prevent overfitting.
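
A minimal sketch of the early-stopping rule from the list above, applied to a made-up sequence of validation losses. The function name and the patience parameter (how many non-improving epochs to tolerate) are illustrative assumptions; deep learning frameworks ship their own callbacks for this.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember this epoch and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch  # stop; keep the weights from best_epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses: improving at first, then degrading.
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
best_epoch, stop_epoch = early_stopping(losses, patience=2)
print(best_epoch, stop_epoch)
```

Here the loss is lowest at epoch 2 and training stops at epoch 4, after two epochs without improvement.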
74
Q

How to deal with underfitting?

A

Underfitting occurs when a machine learning model is too simple to capture the underlying patterns in the data. An underfit model performs poorly on both the training set and the validation set.

  1. Increase Model Complexity:
    More Layers/Neurons: Add layers or increase the number of neurons in existing layers in your neural network.
    More Complex Models: Switch to a model with higher inherent capacity (e.g., from a linear model to a tree ensemble or a neural network).
    Reduce Regularization: If your regularization terms (L1, L2, dropout) are too strong, reduce their influence.
  2. Feature Engineering:
    Create New Features: Extract more informative features from your existing data.
    Reduce Noise: Remove irrelevant features or outliers in your data that might confuse the model.
  3. Train Longer:
    Increase Epochs: Allow the model more passes over the training data to learn more complex patterns.
    Monitor: Use a validation set to check whether performance continues to improve, and stop once it no longer does.
  4. Other Approaches:
    Adjust Hyperparameters: Experiment with different hyperparameter settings, such as learning rate, batch size, and network architecture, to find a configuration that reduces underfitting.
    Increase Training Data: More examples can help when the data is too sparse to reveal the pattern, but extra data alone rarely fixes a model that is too simple.
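
A rough heuristic sketch for telling underfitting apart from overfitting by comparing training and validation error. The function name and both thresholds are hypothetical and would need to be chosen per problem.

```python
def diagnose_fit(train_error, val_error, acceptable_error=0.10, max_gap=0.05):
    """Classify a model's fit from its training and validation error rates."""
    if train_error > acceptable_error:
        # Poor even on the data it was trained on: the model is too simple.
        return "underfitting"
    if val_error - train_error > max_gap:
        # Good on training data but much worse on unseen data.
        return "overfitting"
    return "reasonable fit"

print(diagnose_fit(0.30, 0.32))  # high error on both sets
print(diagnose_fit(0.02, 0.20))  # large gap between the sets
print(diagnose_fit(0.05, 0.07))  # low error, small gap
```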
75
Q

Hyperparameters for CatBoost

A

learning_rate: Controls step size during gradient-based updates. Smaller values lead to slower, potentially more accurate training.
depth: Maximum depth of each decision tree. Deeper trees can model more complex patterns but are prone to overfitting.
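iterations: The number of trees to train, i.e. the number of boosting iterations. More trees usually improve performance but increase training time.
l2_leaf_reg: Coefficient of the L2 regularization term, used to prevent overfitting.
bagging_temperature: Controls the randomness of the Bayesian bootstrap used to weight samples during training. A value of 0 gives every sample the same weight; higher values make the bagging more aggressive.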

76
Q

Hyperparameters for Decision Trees

A

max_depth: Maximum depth allowed for the tree. Restricts complexity and helps prevent overfitting.
min_samples_split: Minimum samples needed to consider a split at a node. Makes the tree less sensitive to noise.
criterion: Function to measure the quality of a split (e.g., “gini” for Gini impurity, “entropy” for information gain).
min_samples_leaf: Minimum number of samples required to be at a leaf node.
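
A minimal sketch of the two split criteria named above, computed on a toy set of labels. The helper names are illustrative; a split is scored by the size-weighted impurity of its two children, and lower is better.

```python
import math
from collections import Counter

def class_proportions(labels):
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    # Shannon entropy in bits; information gain is the drop in entropy after a split.
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def split_score(left, right, impurity):
    # Size-weighted impurity of the two child nodes.
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

parent = ["a", "a", "a", "b", "b", "b"]
pure_split = split_score(["a", "a", "a"], ["b", "b", "b"], gini)
mixed_split = split_score(["a", "a", "b"], ["a", "b", "b"], gini)
print(gini(parent), entropy(parent), pure_split, mixed_split)
```

The split that separates the classes perfectly scores 0, while the mixed split barely improves on the parent's impurity of 0.5.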

77
Q

Hyperparameters for ElasticNet

A

alpha: Overall regularization strength. A higher alpha applies a stronger combined L1 and L2 penalty.
l1_ratio: The mix between L1 and L2 regularization. Values closer to 1 promote sparsity (feature selection); values closer to 0 behave more like Ridge, shrinking groups of correlated features together instead of dropping them.
max_iter: Maximum number of iterations for the optimization algorithm.
Tolerance: Convergence threshold for optimization.

78
Q

Hyperparameters for Gradient Boosting

A

learning_rate: Controls the contribution of each tree, promoting gradual learning and reducing overfitting.
n_estimators: Number of trees in the ensemble. More trees generally improve accuracy, but too many can overfit and slow down training.
subsample: Fraction of data to sample for each tree, promoting diversity and reducing overfitting.
max_depth: Maximum depth of the individual trees.
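
A toy sketch of how learning_rate and n_estimators interact, using gradient boosting for squared error on one-dimensional data. Every tree is a depth-1 stump, and all names are illustrative; real libraries add subsampling, deeper trees, and many optimizations that are not shown here.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump on 1-D inputs (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_estimators):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each tree's contribution is scaled down by the learning rate.
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
few = gradient_boost(xs, ys, n_estimators=5, learning_rate=0.1)
many = gradient_boost(xs, ys, n_estimators=100, learning_rate=0.1)
print(mse(few, xs, ys), mse(many, xs, ys))
```

With a small learning rate, five trees leave a large training error, and a hundred trees reduce it substantially. This is why a lower learning_rate is usually paired with a higher n_estimators.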

79
Q

Hyperparameters for K-Means Clustering

A

Initialization method: Method for initializing cluster centroids (e.g., random or K-means++).
n_clusters: The “k” - the number of clusters you want the algorithm to find.
Maximum iterations: Maximum number of iterations for convergence.
Tolerance: Convergence threshold for termination.
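
A minimal one-dimensional sketch showing where each of these hyperparameters enters the algorithm. The initialization here (evenly spaced picks from the sorted data) is a deliberately simple, deterministic stand-in chosen for illustration; it is neither random initialization nor K-means++.

```python
def kmeans_1d(points, n_clusters, max_iter=100, tol=1e-6):
    """Return (centroids, iterations_used) for 1-D points."""
    # Illustrative initialization: evenly spaced picks from the sorted data.
    ordered = sorted(points)
    step = (len(ordered) - 1) / max(n_clusters - 1, 1)
    centroids = [ordered[round(i * step)] for i in range(n_clusters)]

    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(n_clusters), key=lambda k: abs(p - centroids[k]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[k]
            for k, c in enumerate(clusters)
        ]
        shift = max(abs(a - b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:  # tolerance: stop once the centroids barely move
            break
    return centroids, iteration

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
centroids, iterations = kmeans_1d(points, n_clusters=3)
print(centroids, iterations)
```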

80
Q

Hyperparameters for K-Nearest Neighbors (KNN)

A

n_neighbors: The “k” - number of neighbors to consider for classification or regression.
weights: How to weight neighbors (e.g., ‘uniform’ for equal, ‘distance’ to give closer ones more influence).
p: Power parameter for the Minkowski distance metric. p=1 is Manhattan distance, p=2 is Euclidean distance. Other distances, such as cosine or Jaccard distance, are not Minkowski distances and are chosen through the distance metric setting rather than p.
Algorithm: Algorithm used to compute nearest neighbors (e.g., brute force or kd-tree).
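
A minimal brute-force sketch showing how n_neighbors, weights, and p affect a prediction. The function names and the tiny dataset are illustrative assumptions.

```python
from collections import defaultdict

def minkowski(a, b, p=2):
    # p=1 gives Manhattan distance, p=2 gives Euclidean distance.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_classify(train, query, n_neighbors=3, weights="uniform", p=2):
    """train is a list of (features, label) pairs."""
    neighbors = sorted(train, key=lambda item: minkowski(item[0], query, p))
    votes = defaultdict(float)
    for features, label in neighbors[:n_neighbors]:
        if weights == "distance":
            # Closer neighbors get a larger say in the vote.
            votes[label] += 1.0 / (minkowski(features, query, p) + 1e-12)
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "red"), ((3.0, 3.0), "blue"), ((3.0, 3.5), "blue")]
query = (1.5, 1.5)
uniform_vote = knn_classify(train, query, n_neighbors=3, weights="uniform")
distance_vote = knn_classify(train, query, n_neighbors=3, weights="distance")
print(uniform_vote, distance_vote)
```

With uniform weights the two distant "blue" points outvote the nearby "red" point; with distance weighting the single close neighbor wins.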

81
Q

Hyperparameters for LightGBM

A

learning_rate: Step size for updates. Lower values lead to slower but potentially more accurate training.
num_leaves: Maximum number of leaves in each tree, controlling model complexity.
max_depth: Limits the depth of each tree, a second way (alongside num_leaves) to control complexity and prevent overfitting.
feature_fraction: Fraction of features randomly selected for each tree, promoting diversity.

82
Q

Hyperparameters for Linear Regression

A

Regularization: Type (L1, L2, or none) and strength (controlled by an alpha/lambda parameter) to prevent overfitting.
Intercept: Whether to include an intercept term in the model.
Solver: Optimization algorithm for fitting the model (e.g., ordinary least squares or gradient descent).
Tolerance: Convergence threshold for optimization.

83
Q

Hyperparameters for Logistic Regression

A

Regularization type: L1, L2, or none.
C: Inverse of the regularization strength (smaller C means stronger regularization).
solver: Optimization algorithm for fitting the model (e.g., Newton’s method or stochastic gradient descent).
Maximum iterations: Maximum number of iterations for optimization.

84
Q

Hyperparameters for Naive Bayes

A

smoothing: Add pseudo-counts to avoid zero-probability issues (common with categorical variables).
Distribution assumption: Assumption about the distribution of the features (e.g., Gaussian or multinomial).
Prior probabilities: Prior probabilities of the classes (if not estimated from the data).
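
A minimal sketch of additive (Laplace) smoothing for one categorical feature within one class. The function name and the toy observations are illustrative.

```python
from collections import Counter

def smoothed_likelihoods(observations, vocabulary, smoothing=1.0):
    """Estimate P(value | class) with pseudo-counts added to every value."""
    counts = Counter(observations)
    total = len(observations) + smoothing * len(vocabulary)
    return {value: (counts[value] + smoothing) / total for value in vocabulary}

vocabulary = ["sunny", "rainy", "snowy"]
observed = ["sunny", "sunny", "rainy"]  # "snowy" never appears for this class

unsmoothed = smoothed_likelihoods(observed, vocabulary, smoothing=0.0)
smoothed = smoothed_likelihoods(observed, vocabulary, smoothing=1.0)
print(unsmoothed["snowy"], smoothed["snowy"])
```

Without smoothing, the unseen value gets probability 0, which would zero out the whole product of likelihoods; with a pseudo-count of 1 it gets a small non-zero probability instead.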

85
Q

Hyperparameters for Polynomial Regression

A

degree: Degree of the polynomial fit to the data. Higher degrees model more complex relationships.
Regularization strength: Strength of L1 or L2 regularization (if used).
Interaction terms: Whether to include interaction terms between features.
Solver: Optimization algorithm for fitting the model (e.g., ordinary least squares or gradient descent).
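
A minimal sketch of how degree and the interaction-terms setting determine which features the model sees. The function and parameter names are hypothetical; it only lists the term names rather than computing values.

```python
from itertools import combinations_with_replacement

def polynomial_terms(feature_names, degree=2, interaction_only=False):
    """List the polynomial terms (without the bias) up to the given degree."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(feature_names, d):
            # With interaction_only, skip powers of a single feature such as x1*x1.
            if interaction_only and len(set(combo)) != len(combo):
                continue
            terms.append("*".join(combo))
    return terms

full = polynomial_terms(["x1", "x2"], degree=2)
interactions = polynomial_terms(["x1", "x2"], degree=2, interaction_only=True)
cubic = polynomial_terms(["x1", "x2"], degree=3)
print(full)
print(interactions)
print(len(cubic))
```

The number of terms grows quickly with the degree, which is why higher degrees fit more complex relationships but also overfit more easily.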

86
Q

Hyperparameters for Random Forest

A

n_estimators: Number of trees in the forest. More trees generally make predictions more stable, with diminishing returns, but slow down both training and prediction.
max_depth: Maximum tree depth, regulating complexity.
min_samples_leaf: Minimum number of samples required in each leaf node. A split is only made if both resulting branches keep at least this many samples.
max_features: Number of features to consider at each split, introducing randomness.
Criterion: Function to measure the quality of a split (e.g., Gini impurity or entropy).

87
Q

Hyperparameters for Support Vector Machines (SVM)

A

kernel: Type of kernel (“linear”, “rbf”, “poly”) used to define the similarity space, implicitly mapping the data into a higher-dimensional space. Model complexity depends on the kernel choice.
C: Regularization parameter, controlling the trade-off between fitting the training data and keeping a smooth decision boundary (smaller C means stronger regularization).
gamma: Kernel coefficient for the non-linear kernels (e.g., rbf and poly). Larger values make the influence of each training point more local.
Degree: Degree of the polynomial kernel function (if polynomial kernel is used).
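
A minimal sketch of the three kernel functions and where gamma and degree enter. The coef0 offset in the polynomial kernel is an extra parameter not listed on this card, and the formulas follow the common textbook definitions; all names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_kernel(a, b):
    return dot(a, b)

def polynomial_kernel(a, b, degree=3, gamma=1.0, coef0=1.0):
    # coef0 is an assumed offset term, not part of the card above.
    return (gamma * dot(a, b) + coef0) ** degree

def rbf_kernel(a, b, gamma=1.0):
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

a, b = (1.0, 2.0), (2.0, 0.0)
print(linear_kernel(a, b))
print(polynomial_kernel(a, b, degree=2))
print(rbf_kernel(a, b, gamma=0.1), rbf_kernel(a, b, gamma=1.0))
```

A larger gamma makes the RBF similarity fall off faster with distance, so each training point influences a smaller neighborhood.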

88
Q

Hyperparameters for XGBoost

A

eta (alias learning_rate): Controls the step size during boosting by shrinking the contribution of each new tree.
gamma: Minimum loss reduction needed to create a new split, promoting conservative trees.
Subsample: Fraction of samples used for fitting the individual trees.
Number of estimators: Number of boosting iterations.
Maximum depth: Maximum depth of the trees.
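
A minimal sketch of how gamma acts as a threshold on the split gain, using the gain expression from XGBoost's "Introduction to Boosted Trees" documentation. The reg_lambda argument (XGBoost's L2 term on leaf weights) is not listed on this card, and the gradient and hessian sums below are made-up numbers.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate split from summed gradients (g) and hessians (h)."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    improvement = (score(g_left, h_left)
                   + score(g_right, h_right)
                   - score(g_left + g_right, h_left + h_right))
    # gamma is subtracted, so a split is only worthwhile if the gain stays positive.
    return 0.5 * improvement - gamma

lenient = split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=1.0, gamma=0.0)
strict = split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=1.0, gamma=5.0)
print(lenient, strict)
```

The same candidate split has positive gain with gamma=0 and negative gain with gamma=5, so the higher setting would reject it and produce a more conservative tree.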

89
Q

L1 Regularization (Lasso)

A

Machine learning models want to “learn” patterns from data to make predictions. Overfitting happens when a model becomes too complex and starts fitting the random noise in the training data instead of the true underlying pattern.

L1 regularization adds a penalty term to the model’s cost function (the thing it’s trying to minimize). This penalty is based on the size of the model’s coefficients (weights). The L1 penalty encourages the model to shrink coefficients towards zero. Some coefficients might even become exactly zero. By setting coefficients to zero, L1 regularization effectively performs feature selection – it helps identify the most important features for making predictions.

Because fewer features remain in the model, Lasso leads to models that are simpler and easier to interpret. The penalty also helps prevent overfitting, improving the model’s ability to perform well on new data.
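
A minimal sketch of why L1 produces exact zeros. It assumes the simplest possible setting, one coefficient at a time, minimizing (1/2)(b - w)^2 + penalty * |w|, where b is the unregularized estimate; the solution is the soft-thresholding rule below. The function name and coefficient values are illustrative.

```python
def soft_threshold(unregularized, penalty):
    """Minimizer of 0.5 * (unregularized - w)**2 + penalty * abs(w)."""
    if abs(unregularized) <= penalty:
        return 0.0  # small coefficients are set exactly to zero
    # Larger coefficients are moved towards zero by the penalty amount.
    if unregularized > 0:
        return unregularized - penalty
    return unregularized + penalty

unregularized = [3.0, -0.4, 0.2, -1.5]
lasso = [soft_threshold(b, penalty=0.5) for b in unregularized]
print(lasso)
```

The two small coefficients are dropped entirely, which is the feature-selection effect, while the two large ones are shrunk but kept.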

90
Q

L2 Regularization (Ridge)

A

A technique used in machine learning to prevent overfitting. Overfitting happens when a model becomes too complex and fits the noise in your training data rather than the underlying pattern. L2 regularization adds a penalty term to the model’s cost function, which is based on the square of the size of the model’s coefficients (weights).

Instead of driving some coefficients to zero like L1 regularization, L2 regularization encourages them to shrink, making them smaller but not necessarily eliminating them. This leads to smoother models that are less likely to overfit. While L2 regularization doesn’t perform explicit feature selection like L1, it can still help reduce model complexity. However, because the penalty squares each coefficient, it penalizes large coefficients far more heavily than small ones, whereas L1 penalizes every unit of magnitude at the same rate.
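
A minimal sketch of L2 shrinkage in the simplest setting: one coefficient at a time, minimizing (1/2)(b - w)^2 + (penalty/2) * w^2, where b is the unregularized estimate. The minimizer is a proportional rescaling. The function name and coefficient values are illustrative.

```python
def ridge_shrink(unregularized, penalty):
    """Minimizer of 0.5 * (unregularized - w)**2 + 0.5 * penalty * w**2."""
    return unregularized / (1.0 + penalty)

unregularized = [3.0, -0.4, 0.2, -1.5]
ridge = [ridge_shrink(b, penalty=0.5) for b in unregularized]
print(ridge)
```

Every coefficient is scaled by the same factor, so all of them get smaller but none becomes exactly zero.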

91
Q

Step size (learning rate)

A

The learning rate is a crucial hyperparameter in machine learning that dictates how aggressively a model adjusts its parameters during training. Gradient descent, a common training algorithm, calculates the direction of steepest increase in error (the gradient). The learning rate controls how big a step the model takes in the opposite direction of the gradient, aiming to descend towards the lowest point on the error landscape, which represents the minimum error. A learning rate that is too small leads to slow, timid steps, making convergence take a long time or possibly leaving the model trapped in local minima (small dips that are not the global minimum). Conversely, an overly large learning rate can result in reckless leaps, causing the model to overshoot the valleys or oscillate without ever converging. Finding the right learning rate is tricky because there is no single magic number: it depends on your specific problem, dataset, and even where you are in the training process. Often, practitioners start with a reasonable default value (like 0.01) and experiment. Techniques like learning rate schedulers (which typically decrease the learning rate over time) and adaptive optimizers like Adam (which adapt the step size for each parameter) can make finding an effective learning rate considerably easier.
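
A minimal sketch of the three regimes described above, using plain gradient descent on f(x) = x^2 (whose gradient is 2x). The starting point, step count, and the three learning rates are arbitrary illustrative choices.

```python
def gradient_descent(learning_rate, start=5.0, steps=20):
    """Minimize f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2.0 * x            # derivative of x**2
        x -= learning_rate * gradient  # step against the gradient
    return x

too_small = gradient_descent(learning_rate=0.001)  # barely moves from the start
reasonable = gradient_descent(learning_rate=0.1)   # approaches the minimum at 0
too_large = gradient_descent(learning_rate=1.1)    # overshoots further each step

print(too_small, reasonable, too_large)
```

After 20 steps the small rate is still close to the starting point, the moderate rate is near the minimum, and the large rate has diverged.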

92
Q
A