General Knowledge (ML) Flashcards
Active learning
A machine learning paradigm where a model is able to interactively query the user (or an oracle) to obtain labels for new data points. The key idea behind active learning is that the model can choose the most informative instances to query labels for, thereby maximizing the learning efficiency with fewer labeled examples.
Applied when obtaining labels is costly (for example, when they require a medical expert's opinion). We start with a small number of labeled examples and a large number of unlabeled ones, and then label only those examples that contribute the most to model quality. We map the examples, search for clusters around the labeled ones, and compute the importance of each unlabeled example to the model. We then take the most important examples and ask the expert to label only those. Then we rebuild the model.
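A minimal sketch of one common strategy, uncertainty sampling, assuming a binary task; the `oracle` callable is a hypothetical stand-in for the human expert:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_labeled, y_labeled, X_pool, oracle, n_rounds=10):
    """oracle(x) is assumed to return the true label, e.g. an expert's answer."""
    model = LogisticRegression()
    for _ in range(n_rounds):
        model.fit(X_labeled, y_labeled)
        # Query the pool example the model is least certain about
        # (predicted probability closest to 0.5 in the binary case).
        probs = model.predict_proba(X_pool)[:, 1]
        idx = np.argmin(np.abs(probs - 0.5))
        x_new = X_pool[idx]
        y_new = oracle(x_new)                       # ask the expert
        X_labeled = np.vstack([X_labeled, x_new])
        y_labeled = np.append(y_labeled, y_new)
        X_pool = np.delete(X_pool, idx, axis=0)     # remove from the unlabeled pool
    return model
```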
Association Rule Learning
A machine learning technique used to discover interesting relationships or associations between variables in large datasets. It aims to identify patterns such as frequent itemsets or rules that describe the co-occurrence of items or events. Common algorithms for association rule learning include Apriori and FP-Growth, which are widely used in market basket analysis, recommendation systems, and customer behavior analysis.
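As a toy illustration of the two quantities these algorithms are built on, support and confidence, computed directly from transactions:

```python
# Toy computation of support and confidence for the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]

n = len(transactions)
support_bread = sum("bread" in t for t in transactions) / n             # P(bread)
support_both = sum({"bread", "butter"} <= t for t in transactions) / n  # P(bread, butter)
confidence = support_both / support_bread                               # P(butter | bread)

print(f"support={support_both:.2f}, confidence={confidence:.2f}")  # 0.50, 0.67
```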
Binary classification (binomial)
A type of classification task in supervised learning, where the goal is to categorize data points into one of two possible classes or outcomes (e.g., positive/negative, spam/not spam, present/absent).
Computational graph
A graphical representation of mathematical dependencies and operations or computations performed by a machine learning model. It consists of nodes representing mathematical operations and edges representing the flow of data between these operations. It’s essential for efficient calculation of gradients during backpropagation, the core algorithm for training neural networks.
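For instance, PyTorch builds such a graph dynamically and traverses it backwards to compute gradients (a minimal sketch):

```python
import torch

# Build a tiny computational graph: y = x^2 + 3x
x = torch.tensor(2.0, requires_grad=True)  # leaf node
y = x ** 2 + 3 * x                         # operation nodes: pow, mul, add

# Backpropagation traverses the graph from y back to x.
y.backward()
print(x.grad)  # dy/dx = 2x + 3 = 7.0
```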
Convolution
Convolution is a mathematical operation that allows the merging of two sets of information; the result represents the amount of overlap between two functions. In the case of CNNs, convolution is applied to the input data to filter the information, producing a feature map. The filter is also called a kernel, or feature detector.
In convolutional neural networks (CNNs), convolutional layers use learned filters or kernels to extract features from input data. These filters slide over the input data, computing a dot product between the filter weights and the local regions of the input at each position. This process captures spatial hierarchies and patterns in the data, enabling CNNs to learn hierarchical representations and perform tasks such as image recognition, object detection, and natural language processing.
Convolution in CNNs plays a crucial role in feature extraction, where the learned filters act as feature detectors, detecting edges, textures, and other patterns in the input data. By stacking multiple convolutional layers and combining them with activation functions and pooling operations, CNNs can learn complex representations of the input data, making them powerful tools for various machine learning tasks.
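A minimal NumPy sketch of the sliding-window dot product described above (technically cross-correlation, which is what most deep learning libraries implement as "convolution"):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Dot product between the kernel and the local image patch.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0]])   # simple horizontal edge detector
print(conv2d(image, edge_kernel))       # produces a 4x3 feature map
```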
Deep learning
Training Neural Networks that have more than two non-output layers
Deep models
Neural Networks that have 2 or more hidden layers
Discriminative vs. Generative model
Generative models model the underlying probability distribution of the data. They learn the joint probability distribution of both the input and output variables, and they can generate new samples from the learned distribution (e.g., Naive Bayes, GANs).
Discriminative models directly model the decision boundary between different classes in the input space. They learn the conditional probability distribution of the output variables given the input variables; that is, they try to predict the class of an example (e.g., logistic regression, SVMs).
Eager
An approach where computations or actions are performed immediately, without delay or postponement. Eager evaluation involves eagerly executing statements or expressions as soon as they are encountered in the program flow. This contrasts with lazy evaluation, where computations are deferred until their results are explicitly needed. Eager execution is commonly used in imperative programming languages and eager loading strategies in database systems, where the goal is to eagerly retrieve and process data to improve responsiveness and efficiency.
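A quick Python illustration: a list comprehension evaluates eagerly, while a generator expression defers work until values are requested:

```python
def loud(x):
    print(f"computing {x}")
    return x * x

eager = [loud(i) for i in range(3)]   # prints immediately: all values computed now
lazy = (loud(i) for i in range(3))    # prints nothing yet: computation is deferred

print(next(lazy))  # only now is the first value computed
```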
Empirical risk
Empirical risk, also known as empirical loss or training loss, measures the average error of a model’s predictions on the training dataset. Empirical risk is typically calculated as the average of a loss function applied to the predictions made by the model on the training data. Common loss functions include mean squared error (MSE) for regression tasks and cross-entropy loss for classification tasks. It quantifies how well the model fits the training data.
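Written out, for a model f trained on n examples (x_i, y_i) with loss function L:

$$ \hat{R}_{\text{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i),\, y_i\big) $$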
Ensemble models
Simple models may be too simple, and neural networks need large amounts of labeled data. Instead, we can train many simple (weak) models and then combine them to obtain a high-accuracy meta-model. Common ensemble techniques follow; a short code sketch comes after the list.
Random Forest: A Random Forest is an ensemble learning method that operates by constructing a multitude of decision trees during training and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
Gradient Boosting Machines (GBM): GBM is a boosting technique where new models are added to correct the errors made by existing models. Each new model focuses on the examples that were misclassified by previous models.
AdaBoost (Adaptive Boosting): AdaBoost is a boosting algorithm that combines multiple weak classifiers to create a strong classifier. It works by iteratively training weak classifiers on various distributions of the data, with each subsequent classifier giving more weight to the examples that were misclassified by previous classifiers.
XGBoost (Extreme Gradient Boosting): XGBoost is an optimized implementation of gradient boosting. It is known for its efficiency, scalability, and performance. It incorporates additional features such as regularization to prevent overfitting.
Stacking: Stacking, also known as Stacked Generalization, involves training a meta-model that combines the predictions of multiple base models. Instead of directly averaging or voting on predictions, stacking learns how to best combine the predictions of the base models to make the final prediction.
Bagging (Bootstrap Aggregating): Bagging involves training multiple instances of the same base learning algorithm on different subsets of the training data, typically using bootstrapping, and then averaging the predictions to reduce variance and improve generalization.
Voting Classifier/Regressor: This technique combines the predictions from multiple different models (e.g., decision trees, support vector machines, logistic regression) and outputs the majority (for classification) or average (for regression) prediction.
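A short sketch of two of these in scikit-learn, on synthetic data (the model choices here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging-style ensemble: many decision trees, majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Voting ensemble: combine heterogeneous base models.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier()),
    ],
    voting="hard",  # majority vote on predicted classes
).fit(X, y)

print(forest.score(X, y), voter.score(X, y))
```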
Feature
Features are individual measurable properties or characteristics extracted from raw data that are relevant for solving a particular task in machine learning. These could be numeric, categorical, or binary variables that represent different aspects of the data. Features are used as input variables for training machine learning models.
Features can be informative or not. They can be:
- discriminative (help distinguish between different classes or categories in the dataset)
- domain-specific
- Noise (do not contain any meaningful information and are purely random or noisy)
- redundant (highly correlated with other features in the dataset and do not provide additional information. Including redundant features can lead to overfitting and increased computational overhead. Redundant features often arise from feature transformations or combinations.)
- constant features (have the same value across all instances in the dataset. These features do not contribute any variability to the data and are thus irrelevant for modeling)
By the book, we would divide features into:
- Numerical vs. Categorical
- Binary vs. Multi-valued Categorical
- Derived (Transformed and Engineered)
- Temporal
- Spatial
- Textual
- Image
- Meta-Features
- Composite
Functional Programming (FP)
A programming paradigm centered around the concept of functions as first-class citizens. In functional programming languages like Haskell, Lisp, and Clojure, functions are treated as values that can be passed as arguments to other functions, returned as results, or stored in data structures.
Key characteristics of functional programming include:
- Immutability: Data is immutable, meaning that once defined, it cannot be changed. Functions operate on immutable data structures, ensuring referential transparency and avoiding side effects.
- Higher-order Functions: Functions can take other functions as arguments or return functions as results. This allows for powerful abstractions and composition of functions.
- Pure Functions: Functions are pure, meaning that they have no side effects and produce the same output for the same input. Pure functions are easier to reason about, test, and parallelize.
Functional programming promotes declarative and concise code, emphasizing the expression of computations as compositions of functions and transformations of data. It is particularly well-suited for parallel and distributed computing, concurrency, and building scalable and maintainable software systems.
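These ideas are available in Python too; a small sketch using pure and higher-order functions:

```python
from functools import reduce

# Pure function: no side effects, same output for the same input.
def square(x):
    return x * x

# Higher-order functions: map/filter/reduce take functions as arguments.
numbers = [1, 2, 3, 4, 5]
sum_of_even_squares = reduce(
    lambda acc, x: acc + x,
    map(square, filter(lambda x: x % 2 == 0, numbers)),
    0,
)
print(sum_of_even_squares)  # 4 + 16 = 20
```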
Greedy
A greedy algorithm is a strategy for solving problems where you make a series of choices, each time selecting the option that seems best at that moment. It focuses on short-term gains, hoping the final accumulation of these local decisions will lead to a good overall solution.
While greedy algorithms are simple and efficient, they may not always yield the optimal solution. They tend to find local minima. In complex optimization problems where global optimization is required they may fail. In other words: while they might find a decent solution quickly, there’s no guarantee it will be the best possible solution.
Greedy strategies are commonly used in algorithms such as RFE (Recursive Feature Elimination), greedy search, greedy heuristics, and greedy algorithms for solving problems like the Knapsack problem, Minimum Spanning Tree, and Shortest Path problems.
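A classic illustration is greedy coin change, where locally optimal picks are globally optimal for some coin systems but not for others:

```python
def greedy_change(amount, coins):
    """Repeatedly take the largest coin that still fits."""
    result = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            amount -= coin
            result.append(coin)
    return result

print(greedy_change(63, [25, 10, 5, 1]))  # [25, 25, 10, 1, 1, 1] -- optimal here
print(greedy_change(6, [4, 3, 1]))        # [4, 1, 1] -- suboptimal: [3, 3] is better
```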
Image Captioning
Image captioning is the task of automatically generating a natural language description (a caption) of an image. It combines CNNs with NLP. The model is trained on a large dataset of image-caption pairs. The goal is to learn how to generate captions that accurately and fluently describe the visual content of the images. Image captioning is often done by Encoder-Decoder models. Here is how they work:
- The Encoder: Extracting Visual Information
CNN Pre-trained on ImageNet: A powerful Convolutional Neural Network (often ResNet or VGG) pre-trained on a massive image classification task (like ImageNet) is utilized.
Removing the Final Layer: The last classification layer of the CNN is removed, allowing it to function as a rich feature extractor.
Image to Feature Vector: The CNN takes your input image and transforms it into a dense feature vector that encapsulates the visual essence of the image.
- The Decoder: Translating Features into Language
LSTM as Word Generator: A Long Short-Term Memory (LSTM) network, a type of RNN, is tasked with generating the caption word by word.
Start Token: A special "<start>" token signals the beginning of the caption.
Step-by-Step Generation:
- The LSTM takes the image's feature vector and the previously generated word as input.
- It predicts probabilities for the next word in the vocabulary.
- A word is sampled from this probability distribution (or the word with the highest probability is chosen).
- The process repeats until an "<end>" token is generated or a max caption length is reached.
- Key Refinement: Attention
An attention mechanism allows the LSTM decoder to selectively focus on specific regions or features of the image while generating each word. This dynamic attention helps the model generate more contextually relevant and descriptive captions.
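A compressed PyTorch sketch of this encoder-decoder setup (greedy decoding, no attention; the vocabulary size, dimensions, and token ids are hypothetical):

```python
import torch
import torch.nn as nn
import torchvision.models as models

VOCAB, EMBED, HIDDEN = 10000, 256, 512  # hypothetical sizes

class Captioner(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet18(weights=None)  # use pretrained weights in practice
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier
        self.img_proj = nn.Linear(512, EMBED)   # map image features to embedding space
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.lstm = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.to_vocab = nn.Linear(HIDDEN, VOCAB)

    def generate(self, image, start_id=1, end_id=2, max_len=20):
        feat = self.encoder(image).flatten(1)      # (1, 512) image feature vector
        img_in = self.img_proj(feat).unsqueeze(1)
        _, state = self.lstm(img_in)               # prime the LSTM with the image
        token = torch.tensor([[start_id]])         # "<start>" token
        words = []
        for _ in range(max_len):
            out, state = self.lstm(self.embed(token), state)
            token = self.to_vocab(out[:, -1]).argmax(-1, keepdim=True)  # greedy choice
            if token.item() == end_id:             # stop at "<end>"
                break
            words.append(token.item())
        return words

caption_ids = Captioner().generate(torch.randn(1, 3, 224, 224))
```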
Inference
Inferences are steps in reasoning, moving from specific premises to logical consequences and broader generalizations. They are probabilistic in nature: conclusions are likely, based on evidence, but not necessarily certain. (Deduction is similar but deals with certainties.)
In machine learning, inference is the process of applying a trained model to new, unseen data to make predictions and draw insights. It usually happens at deployment time.
Types of Inference:
Batch Inference: Making predictions on a group of data points all at once. This is often more efficient computationally.
Real-time Inference: Making predictions on individual data points immediately as they become available (e.g., fraud detection systems).
Instance-based learning algorithm
Instance-based learning algorithms, also known as lazy learning algorithms, make predictions based on the similarity of input instances to instances in the training dataset. These algorithms do not explicitly build a model during training; instead, they store the training instances and use them for making predictions at runtime. Examples include k-nearest neighbors (KNN) and kernel density estimation (KDE).
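A minimal scikit-learn sketch; note that fit() essentially just stores the training instances:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# No explicit model is built: fit() stores the training instances.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Prediction happens at runtime by comparing to the stored neighbors.
print(knn.predict(X[:3]))
```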
Integral image concept
A technique used for fast computation of rectangular features in object detection tasks, particularly in Viola-Jones face detection algorithm. It involves calculating the sum of pixel intensities within rectangular regions of an image to efficiently compute features used for classification without the need for repeated calculations.
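A small NumPy sketch: after a single pass to build the integral image, the sum over any rectangle takes only four lookups:

```python
import numpy as np

image = np.random.randint(0, 256, size=(6, 8))

# Integral image with a zero row/column prepended to simplify indexing.
ii = np.zeros((image.shape[0] + 1, image.shape[1] + 1), dtype=np.int64)
ii[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)

def rect_sum(r0, c0, r1, c1):
    """Sum of image[r0:r1, c0:c1] in constant time via four lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

assert rect_sum(1, 2, 4, 6) == image[1:4, 2:6].sum()
```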
Meta Learning
Also known as “learning to learn,” refers to a subfield of machine learning focused on understanding and developing algorithms capable of learning new tasks or adapting to new environments rapidly and efficiently. Unlike traditional machine learning approaches that focus on learning from a fixed dataset to solve a specific task, meta learning aims to enable models to learn from a variety of tasks or experiences and generalize that knowledge to new tasks or domains. Meta learning algorithms often involve training a meta-learner on a distribution of tasks or datasets, allowing it to extract common patterns or principles that can be applied to unseen tasks. Meta learning has applications in few-shot learning, transfer learning, and reinforcement learning, among others, and it holds promise for enabling AI systems to learn more autonomously and adaptively in diverse and dynamic environments.
Meta-learning systems are trained on a large and diverse collection of tasks, each often with only a small amount of associated data. The meta-learner extracts patterns in how to approach different types of problems and learns a strategy to update its own parameters rapidly when given a new task. Common families of approaches:
- Metric-based: the idea is to learn a similarity metric between data points. A new task is then solved by comparing new data points to examples from the training tasks using this learned metric (similar to k-nearest neighbors).
- Model-based: the meta-learner is a recurrent neural network (RNN) or similar architecture with internal "memory," trained to update its own parameters quickly to adapt to a new task, utilizing its past experiences.
- Optimization-based: the focus is on learning a good initialization point for the model's parameters and/or designing an optimization algorithm that quickly converges to a solution on new tasks.
Multi-Label Classification
In most classification tasks, each data point belongs to one and only one class (e.g., an image is either a "cat" or a "dog", but not both). In multi-label classification, a single data point can be associated with multiple labels simultaneously. Traditional classifiers predict a single class; multi-label classifiers predict a set of relevant labels. Multi-label problems are more complex because you need to model potential correlations between labels: a cat and a dog appearing together in an image is likely, but certain disease combinations might be rare. Single-label classifiers focus on boundaries between classes; multi-label classifiers must also consider relationships between possible labels (some labels might co-occur frequently, others might be mutually exclusive).
Approaches
1. Problem Transformation: The core idea behind problem transformation is to convert a multi-label classification problem into a format that traditional single-label classification algorithms can understand.
* Binary Relevance: Train a separate binary classifier for each possible label. Treats each possible label as a separate yes/no classification problem. For a dataset with 'N' possible labels, you would train 'N' independent binary classifiers. To classify a new data point, you run it through each of the binary classifiers; any classifier that outputs "yes" (or a probability above a threshold) contributes its label. It is easy to implement using existing single-label classifiers, and the binary classifiers can be trained in parallel. It works decently when labels are mostly independent, or when computational efficiency is paramount. The biggest drawback: it treats each label in complete isolation, missing out on potentially helpful relationships between labels.
* Label Powerset: Create a new class for every possible label combination. Explicitly models every possible combination of labels as a unique class, and you train a single (now traditional) multi-class classifier on this transformed problem. It can capture potential correlations between labels, which makes it useful for problems with a small number of possible labels where correlations between those labels are crucial. However, the number of new classes grows exponentially with the number of labels; this quickly becomes computationally demanding and can lead to data sparsity (few examples for each combination).
2. Specialized Algorithms:
- Classifier Chains: Train classifiers sequentially, where each classifier’s predictions are used as features for the next one in the chain. Involves training a series of binary classifiers, but unlike binary relevance, they are linked together. The first classifier is trained just on the input features. The second classifier is trained on the input features AND the predictions made by the first classifier. This continues for each subsequent classifier, building a chain where previous predictions become inputs. The idea is to gradually learn dependencies between labels. Labels that frequently occur together should influence each other within the chain. The core advantage over binary relevance is modeling some degree of relationship between labels. The order of labels in the chain can be determined strategically for potential performance gains. Mistakes made early in the chain can cascade down and negatively affect the predictions of later classifiers. Performance can be affected by the chosen label order in the chain.
- ML-KNN (Multi-Label K-Nearest Neighbors): Find the ‘K’ nearest neighbors of a new data point. Examine the label sets associated with those neighbors. Determine relevant labels for the new point based on statistics about the neighbors’ labels (e.g., frequency, average probabilities). Easy to grasp and explain. Doesn’t make strong assumptions about the underlying data distribution. Finding nearest neighbors, especially in large datasets, can be slow. Outliers or incorrect labels in the training data can throw off its predictions.
- Deep Learning Methods: Adapted neural network architectures for handling multiple outputs.
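As a sketch, binary relevance can be had almost for free in scikit-learn via MultiOutputClassifier, which fits one independent classifier per label (toy data below):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = (X[:, :3] > 0).astype(int)  # toy data: 3 labels, each independently on or off

# One independent binary classifier per label (binary relevance).
clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(X[:2]))       # a set of labels per data point, e.g. [[1 0 1], ...]
```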
Multiclass classification (multinomial)
Multiclass classification is a type of supervised learning task where the goal is to classify input data into one of three or more classes or categories. It involves predicting a single output variable with multiple possible discrete values. This classification method produces a probability for each class and selects the most probable class as the final prediction. Multinomial logistic regression and softmax activation in neural networks are common approaches for handling multiclass classification tasks.
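The softmax step in a few lines of NumPy (a minimal sketch):

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.argmax())  # the most probable class is the prediction
```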
Multitask learning
Multitask learning (MTL) is a machine learning paradigm where a model is trained to perform multiple tasks simultaneously, leveraging shared information across tasks to improve generalization performance. In MTL, the model is trained on a joint objective that combines the loss functions of all tasks, encouraging the model to learn representations that are beneficial for all tasks. By sharing knowledge across related tasks, multitask learning can lead to better generalization, especially when individual tasks have limited amounts of training data or when tasks are related in some way (e.g., semantic similarity or shared underlying structure). Multitask learning has applications in various domains, including natural language processing, computer vision, and healthcare, where different tasks may benefit from leveraging common features or learning from related data sources.
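A minimal sketch of hard parameter sharing in PyTorch: a shared trunk feeding two task-specific heads, trained on a combined loss (the dimensions and tasks are hypothetical):

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=16, hidden=32):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # shared
        self.head_cls = nn.Linear(hidden, 3)  # task A: 3-class classification
        self.head_reg = nn.Linear(hidden, 1)  # task B: regression

    def forward(self, x):
        h = self.trunk(x)
        return self.head_cls(h), self.head_reg(h)

net = MultiTaskNet()
x = torch.randn(8, 16)
y_cls, y_reg = torch.randint(0, 3, (8,)), torch.randn(8, 1)
logits, preds = net(x)

# Joint objective: combine the per-task losses.
loss = nn.CrossEntropyLoss()(logits, y_cls) + nn.MSELoss()(preds, y_reg)
loss.backward()  # gradients flow into both heads and the shared trunk
```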
Nodes
A node is the same as a neuron. However, because the neuron's name was inspired by real neurons, and because artificial neurons don't really work like biological ones, we use "node" to highlight that nodes work differently from neurons.
“A neuron is the most basic processing unit within an artificial neural network. The concept of artificial neurons in neural networks is loosely inspired by biological neurons in the brain. Biological neurons receive signals (inputs) through connections called dendrites, process them, and send an output signal through the axon if a certain threshold is met.
Neural networks learn by adjusting the weights and biases during training. The goal is to find the optimal values that produce the desired output given a specific input. Artificial neural networks are organized into layers: an input layer, one or more hidden layers, and an output layer. Neurons in one layer are connected to neurons in the next, creating a complex network of calculations.
In a neural network, a neuron is a mathematical function that performs the following:
1) Inputs: A neuron receives multiple input values. These inputs could come from raw data (e.g., pixel values of an image) or be the outputs of neurons from a previous layer in the neural network.
2) Weights: Each input is multiplied by a corresponding weight. Weights are like knobs that determine how much influence each input has on the neuron’s output.
3) Summation: The weighted inputs are summed together.
4) Bias: A bias term is added to the sum. The bias acts as an adjustable offset that controls how easily the neuron activates.
5) Activation Function: The result of the summation (and bias) is passed through a non-linear activation function. This function introduces non-linearity into the model, which is essential for neural networks to learn complex patterns. Common activation functions include:
- Sigmoid
- Tanh
- ReLU (Rectified Linear Unit)
6) Output: The output of the activation function is the final output of the neuron. This output can then be sent to neurons in the next layer of the neural network.
Simple Analogy
Imagine a neuron like a decision-maker. Consider the decision of whether to wear a coat outside:
1) Inputs: Temperature, wind speed, likelihood of rain.
2) Weights: How heavily you weigh each factor (you might care more about temperature than wind, etc.)
3) Bias: Your general predisposition towards wearing a coat (some people are more likely to get cold).
4) Activation Function: Your mental model deciding if the combined factors cross a threshold for putting on a coat.
5) Output: The decision – coat or no coat.”
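Steps 1-6 above in a few lines of NumPy (a single neuron with a sigmoid activation; the inputs, weights, and bias are made-up values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, -1.2, 3.0])   # e.g. temperature, wind, rain likelihood
weights = np.array([0.8, 0.2, 0.5])   # how heavily each factor is weighed
bias = -0.5                           # general predisposition

z = np.dot(weights, inputs) + bias    # weighted sum plus bias
output = sigmoid(z)                   # activation function
print(output)                         # ~0.76 -> "wear the coat"
```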
Non-parametric model
A type of statistical model that does not make explicit assumptions about the functional form or distribution of the underlying data. Instead of estimating fixed parameters, nonparametric models estimate the underlying data distribution directly from the observed data.
Examples include k-nearest neighbors (KNN), decision trees, and kernel density estimation (KDE). Nonparametric models are flexible and can capture complex relationships in the data without making strong assumptions.
Non-sequential models
Non-sequential models refer to machine learning models that do not inherently rely on sequential data or temporal dependencies. Unlike sequential models such as recurrent neural networks (RNNs) or transformers, which are designed to process sequential data with explicit temporal order, non-sequential models can handle input data without assuming any specific order or sequence. Examples of non-sequential models include feedforward neural networks (e.g., multilayer perceptrons), convolutional neural networks (CNNs), and graph neural networks (GNNs). Non-sequential models are commonly used in tasks such as image classification, object detection, and graph analytics, where the input data does not have an inherent temporal or sequential structure.
Numerical overflow
Occurs when a numerical computation results in a value that exceeds the maximum representable value for a numeric data type. This can lead to inaccuracies or errors in computations and is particularly common in floating-point arithmetic. Overflow can occur due to large intermediate results or excessively large inputs.
That is why computers represent very large (and very small) numbers in floating-point notation, with a separate exponent and mantissa.
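A quick NumPy demonstration: the same computation overflows in 32-bit floats but not in 64-bit:

```python
import numpy as np

x = np.float32(100.0)
print(np.exp(x))              # inf: e^100 exceeds the float32 maximum (~3.4e38)
print(np.exp(np.float64(x)))  # ~2.69e43: float64 can still represent it
```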
Objective function
Also known as a loss function or cost function, measures how well a model’s predictions match the true values in the training data. It quantifies the discrepancy between predicted and actual values and is used to optimize model parameters during training. The goal is to minimize the value of the objective function to improve the model’s performance.
Parallelize
Parallelization in machine learning means splitting computationally intensive tasks across multiple processors or cores to speed up model training or processing. This can involve parallelizing data processing (e.g., loading and transforming data simultaneously on multiple cores), parallelizing the calculations within model algorithms (e.g., matrix operations in neural networks spread across multiple GPUs), or parallelizing multiple model training experiments (e.g., hyperparameter tuning with different configurations running concurrently). The primary goal is to reduce the time it takes to learn from data, allowing for faster experimentation, handling larger datasets, or exploring more complex models.
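For example, independent experiments can be spread across cores with joblib; `train_and_score` below is a hypothetical stand-in for a real training run:

```python
from joblib import Parallel, delayed

def train_and_score(learning_rate):
    """Hypothetical stand-in for fitting a model with one hyperparameter setting."""
    return learning_rate, sum(i * learning_rate for i in range(1000))

configs = [0.001, 0.01, 0.1, 1.0]

# Run the four experiments concurrently on up to 4 worker processes.
results = Parallel(n_jobs=4)(delayed(train_and_score)(lr) for lr in configs)
print(results)
```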