Machine Learning Flashcards
Terms and concepts related to machine learning
ablation
A technique for evaluating the importance of a feature or component by temporarily removing it from a model. You then retrain the model without that feature or component, and if the retrained model performs significantly worse, then the removed feature or component was likely important.
For example, suppose you train a classification model on 10 features and achieve 88% precision on the test set. To check the importance of the first feature, you can retrain the model using only the nine other features. If the retrained model performs significantly worse (for instance, 55% precision), then the removed feature was probably important. Conversely, if the retrained model performs equally well, then that feature was probably not that important.
Ablation can also help determine the importance of:
- Larger components, such as an entire subsystem of a larger ML system
- Processes or techniques, such as a data preprocessing step
In both cases, you would observe how the system’s performance changes (or doesn’t change) after you’ve removed the component.
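The following sketch shows one way to run a simple feature-ablation study, assuming scikit-learn, a pandas feature matrix `X`, and a label vector `y` (the function name and setup are illustrative, not a standard API):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def ablation_report(X: pd.DataFrame, y) -> dict:
    """Retrain without each feature and compare test-set scores."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    # Baseline: a model trained on all features.
    baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    report = {"all_features": baseline.score(X_test, y_test)}
    # Remove one feature at a time, retrain, and record the new score.
    for feature in X.columns:
        model = LogisticRegression(max_iter=1000).fit(
            X_train.drop(columns=[feature]), y_train)
        report[f"without_{feature}"] = model.score(
            X_test.drop(columns=[feature]), y_test)
    return report
```

A feature whose removal causes a large drop in the score was probably important; a feature whose removal barely changes the score probably was not.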
A/B testing
A statistical way of comparing two (or more) techniques—the A and the B. Typically, the A is an existing technique, and the B is a new technique. A/B testing not only determines which technique performs better but also whether the difference is statistically significant.
A/B testing usually compares a single metric on two techniques; for example, how does model accuracy compare for two techniques? However, A/B testing can also compare any finite number of metrics.
accelerator chip
A category of specialized hardware components designed to perform key computations needed for deep learning algorithms.
Accelerator chips (or just accelerators, for short) can significantly increase the speed and efficiency of training and inference tasks compared to a general-purpose CPU. They are ideal for training neural networks and similar computationally intensive tasks.
Examples of accelerator chips include:
- Google’s Tensor Processing Units (TPUs) with dedicated hardware for deep learning.
- NVIDIA’s GPUs, which were initially designed for graphics processing but whose parallel-processing capabilities can significantly increase processing speed.
accuracy
The number of correct classification predictions divided by the total number of predictions. That is:
Accuracy = correct predictions ÷ (correct predictions + incorrect predictions)
For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:
Accuracy = 40 ÷ (40 + 10) = 80%
Binary classification provides specific names for the different categories of correct predictions and incorrect predictions. So, the accuracy formula for binary classification is as follows:
Accuracy = (TP + TN) ÷ (TP + TN + FP + FN)
where:
- TP is the number of true positives (correct predictions).
- TN is the number of true negatives (correct predictions).
- FP is the number of false positives (incorrect predictions).
- FN is the number of false negatives (incorrect predictions).
Compare and contrast accuracy with precision and recall.
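As a minimal illustration, accuracy for a binary classifier can be computed directly from the four counts (a hypothetical helper, not part of any particular library):

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# 40 correct predictions (30 TP + 10 TN) out of 50 total -> 0.8
print(accuracy(tp=30, tn=10, fp=6, fn=4))  # 0.8
```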
action
In reinforcement learning, the mechanism by which the agent transitions between states of the environment. The agent chooses the action by using a policy.
activation function
A function that enables neural networks to learn nonlinear (complex) relationships between features and the label. The plots of activation functions are never single straight lines. Popular activation functions include:
- ReLU
- Sigmoid
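For illustration, here are minimal NumPy versions of the two activation functions listed above (a sketch, not any framework's implementation):

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x), applied element-wise.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid: squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(relu(np.array([-2.0, 0.5])))    # [0.  0.5]
print(sigmoid(np.array([0.0, 2.0])))  # [0.5  ~0.881]
```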
active learning
A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.
AdaGrad
A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate. AdaGrad was one of the first algorithms to use adaptive learning rates and set the stage for further development in this area.
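A minimal sketch of the AdaGrad parameter update using NumPy, where `accum` holds the running sum of squared gradients and starts as an array of zeros (the names and the learning-rate value are illustrative):

```python
import numpy as np

def adagrad_step(params, grad, accum, learning_rate=0.1, epsilon=1e-8):
    # Accumulate the squared gradients observed so far for each parameter.
    accum = accum + grad ** 2
    # Each parameter's effective learning rate shrinks in proportion to the
    # square root of its own accumulated squared gradients.
    params = params - learning_rate * grad / (np.sqrt(accum) + epsilon)
    return params, accum
```

Parameters with consistently large gradients see their step size shrink quickly, while rarely updated parameters keep a larger effective learning rate.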
agent
In reinforcement learning, the entity that uses a policy to maximize the expected return gained from transitioning between states of the environment.
agglomerative clustering
A form of hierarchical clustering. Agglomerative clustering first assigns every example to its own cluster and then iteratively merges the closest clusters to create a hierarchical tree.
anomaly detection
The process of identifying outliers. For example, if the mean for a certain feature is 100 with a standard deviation of 10, then anomaly detection should flag a value of 200 as suspicious.
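A simple z-score rule illustrates the example above (a hypothetical helper; real anomaly-detection systems are usually more sophisticated):

```python
def is_anomalous(value, mean, std_dev, threshold=3.0):
    # Flag values more than `threshold` standard deviations from the mean.
    return abs(value - mean) / std_dev > threshold

print(is_anomalous(200, mean=100, std_dev=10))  # True (10 standard deviations away)
```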
artificial general intelligence
A non-human mechanism that demonstrates a broad range of problem solving, creativity, and adaptability. For example, a program demonstrating artificial general intelligence could translate text, compose symphonies, and excel at games that have not yet been invented.
artificial intelligence
A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.
Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.
attention
A mechanism used in a neural network that indicates the importance of a particular word or part of a word. Attention compresses the amount of information a model needs to predict the next token/word. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.
Refer also to self-attention and multi-head self-attention, which are the building blocks of Transformers.
attribute
Synonym for feature.
In machine learning fairness, attributes often refer to characteristics pertaining to individuals.
attribute sampling
A tactic for training a decision forest in which each decision tree considers only a random subset of possible features when learning the condition. Generally, a different subset of features is sampled for each node. In contrast, when training a decision tree without attribute sampling, all possible features are considered for each node.
AUC (Area under the ROC curve)
Area under the receiver operating characteristic curve. A number between 0.0 and 1.0 representing a binary classification model’s ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model’s ability to separate classes from each other. AUC ignores any value you set for classification threshold. Instead, AUC considers all possible classification thresholds.
augmented reality
A technology that superimposes a computer-generated image on a user’s view of the real world, thus providing a composite view.
autoencoder
A system that learns to extract the most important information from the input. Autoencoders are a combination of an encoder and decoder. Autoencoders rely on the following two-step process:
- The encoder maps the input to a (typically) lossy lower-dimensional (intermediate) format.
- The decoder builds a lossy version of the original input by mapping the lower-dimensional format to the original higher-dimensional input format.
Autoencoders are trained end-to-end by having the decoder attempt to reconstruct the original input from the encoder’s intermediate format as closely as possible. Because the intermediate format is smaller (lower-dimensional) than the original format, the autoencoder is forced to learn what information in the input is essential, and the output won’t be perfectly identical to the input.
For example:
- If the input data is a graphic, the non-exact copy would be similar to the original graphic, but somewhat modified. Perhaps the non-exact copy removes noise from the original graphic or fills in some missing pixels.
- If the input data is text, an autoencoder would generate new text that mimics (but is not identical to) the original text.
See also variational autoencoders.
automation bias
When a human decision maker favors recommendations made by an automated decision-making system over information produced without automation, even when the automated decision-making system makes errors.
AutoML
Any automated process for building machine learning models. AutoML can automatically do tasks such as the following:
- Search for the most appropriate model.
- Tune hyperparameters.
- Prepare data (including performing feature engineering).
- Deploy the resulting model.
AutoML is useful for data scientists because it can save them time and effort in developing machine learning pipelines and improve prediction accuracy. It is also useful to non-experts, by making complicated machine learning tasks more accessible to them.
auto-regressive model
A model that infers a prediction based on its own previous predictions. For example, auto-regressive language models predict the next token based on the previously predicted tokens. All Transformer-based large language models are auto-regressive.
In contrast, GAN-based image models are usually not auto-regressive since they generate an image in a single forward-pass and not iteratively in steps. However, certain image generation models are auto-regressive because they generate an image in steps.
auxiliary loss
A loss function—used in conjunction with a neural network model’s main loss function—that helps accelerate training during the early iterations when weights are randomly initialized.
Auxiliary loss functions push effective gradients to the earlier layers. This facilitates convergence during training by combating the vanishing gradient problem.
average precision
A metric for summarizing the performance of a ranked sequence of results. Average precision is calculated by taking the average of the precision values for each relevant result (each result in the ranked list where the recall increases relative to the previous result).
See also Area under the PR Curve.
axis-aligned condition
In a decision tree, a condition that involves only a single feature.
backpropagation
The algorithm that implements gradient descent in neural networks.
Training a neural network involves many iterations of the following two-pass cycle:
- During the forward pass, the system processes a batch of examples to yield prediction(s). The system compares each prediction to each label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
- During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layer(s).
Neural networks often contain many neurons across many hidden layers. Each of those neurons contributes to the overall loss in different ways. Backpropagation determines whether to increase or decrease the weights applied to particular neurons.
The learning rate is a multiplier that controls the degree to which each backward pass increases or decreases each weight. A large learning rate will increase or decrease each weight more than a small learning rate.
In calculus terms, backpropagation implements the chain rule. That is, backpropagation calculates the partial derivative of the error with respect to each parameter.
Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like TensorFlow now implement backpropagation for you. Phew!
bagging
A method to train an ensemble where each constituent model trains on a random subset of training examples sampled with replacement. For example, a random forest is a collection of decision trees trained with bagging.
The term bagging is short for bootstrap aggregating.
bag of words
A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents the following three phrases identically:
- the dog jumps
- jumps the dog
- dog jumps the
Each word is mapped to an index in a sparse vector, where the vector has an index for every word in the vocabulary. For example, the phrase the dog jumps is mapped into a feature vector with non-zero values at the three indices corresponding to the words the, dog, and jumps. The non-zero value can be any of the following:
- A 1 to indicate the presence of a word.
- A count of the number of times a word appears in the bag. For example, if the phrase were the maroon dog is a dog with maroon fur, then both maroon and dog would be represented as 2, while the other words would be represented as 1.
- Some other value, such as the logarithm of the count of the number of times a word appears in the bag.
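A minimal sketch of building a bag-of-words count vector over a small, hypothetical vocabulary:

```python
from collections import Counter

vocabulary = ["the", "maroon", "dog", "is", "a", "with", "fur", "jumps"]

def bag_of_words(phrase):
    # Count occurrences of each vocabulary word, ignoring word order.
    counts = Counter(phrase.lower().split())
    return [counts[word] for word in vocabulary]

print(bag_of_words("the maroon dog is a dog with maroon fur"))
# [1, 2, 2, 1, 1, 1, 1, 0]  -> "maroon" and "dog" each appear twice
```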
baseline
A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model.
For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.
batch
The set of examples used in one training iteration. The batch size determines the number of examples in a batch.
See epoch for an explanation of how a batch relates to an epoch.
batch inference
The process of inferring predictions on multiple unlabeled examples divided into smaller subsets (“batches”).
Batch inference can leverage the parallelization features of accelerator chips. That is, multiple accelerators can simultaneously infer predictions on different batches of unlabeled examples, dramatically increasing the number of inferences per second.
batch normalization
Normalizing the input or output of the activation functions in a hidden layer. Batch normalization can provide the following benefits:
- Make neural networks more stable by protecting against outlier weights.
- Enable higher learning rates, which can speed training.
- Reduce overfitting.
batch size
The number of examples in a batch. For instance, if the batch size is 100, then the model processes 100 examples per iteration.
The following are popular batch size strategies:
- Stochastic Gradient Descent (SGD), in which the batch size is 1.
- full batch, in which the batch size is the number of examples in the entire training set. For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
- mini-batch in which the batch size is usually between 10 and 1000. Mini-batch is usually the most efficient strategy.
Bayesian neural network
A probabilistic neural network that accounts for uncertainty in weights and outputs. A standard neural network regression model typically predicts a scalar value; for example, a standard model predicts a house price of 853,000. In contrast, a Bayesian neural network predicts a distribution of values; for example, a Bayesian model predicts a house price of 853,000 with a standard deviation of 67,200.
A Bayesian neural network relies on Bayes’ Theorem to calculate uncertainties in weights and predictions. A Bayesian neural network can be useful when it is important to quantify uncertainty, such as in models related to pharmaceuticals. Bayesian neural networks can also help prevent overfitting.
Bayesian optimization
A probabilistic regression model technique for optimizing computationally expensive objective functions by instead optimizing a surrogate that quantifies the uncertainty via a Bayesian learning technique. Since Bayesian optimization is itself very expensive, it is usually used to optimize expensive-to-evaluate tasks that have a small number of parameters, such as selecting hyperparameters.
Bellman equation
A necessary condition for optimality associated with the mathematical optimization method known as dynamic programming. It writes the “value” of a decision problem at a certain point in time in terms of the payoff from some initial choices and the “value” of the remaining decision problem that results from those initial choices. This breaks a dynamic optimization problem into a sequence of simpler subproblems, as Bellman’s “principle of optimality” prescribes. The equation applies to algebraic structures with a total ordering; for algebraic structures with a partial ordering, the generic Bellman’s equation can be used.
BERT (Bidirectional Encoder Representations from Transformers)
A model architecture for text representation. A trained BERT model can act as part of a larger model for text classification or other ML tasks.
BERT has the following characteristics:
- Uses the Transformer architecture, and therefore relies on self-attention.
- Uses the encoder part of the Transformer. The encoder’s job is to produce good text representations, rather than to perform a specific task like classification.
- Is bidirectional.
- Uses masking for unsupervised training.
BERT’s variants include:
- ALBERT, which is an acronym for A Light BERT.
- LaBSE.
bias (ethics/fairness)
- Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:
- automation bias
- confirmation bias
- experimenter’s bias
- group attribution bias
- implicit bias
- in-group bias
- out-group homogeneity bias
- Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:
- coverage bias
- non-response bias
- participation bias
- reporting bias
- sampling bias
- selection bias
Not to be confused with the bias term in machine learning models or prediction bias.
bias (math) or bias term
An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either of the following:
- b
- w0
For example, bias is the b in the following formula:
y’ = b + w1x1 + w2x2 + … + wnxn
In a simple two-dimensional line, bias just means “y-intercept.”
Bias exists because not all models start from the origin (0,0). For example, suppose an amusement park costs 2 Euros to enter and an additional 0.5 Euro for every hour a customer stays. Therefore, a model mapping the total cost has a bias of 2 because the lowest cost is 2 Euros.
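The amusement park example as a tiny one-feature linear model, where the bias is the 2 Euro entry fee (a worked sketch):

```python
def total_cost(hours: float) -> float:
    bias = 2.0    # b: entry fee in Euros
    weight = 0.5  # w1: Euros per hour of stay
    return bias + weight * hours

print(total_cost(0))  # 2.0  (the lowest possible cost is the bias)
print(total_cost(3))  # 3.5
```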
bidirectional
A term used to describe a system that evaluates the text that both precedes and follows a target section of text. In contrast, a unidirectional system only evaluates the text that precedes a target section of text.
For example, consider a masked language model that must determine probabilities for the word or words representing the underline in the following question:
What is the _____ with you?
A unidirectional language model would have to base its probabilities only on the context provided by the words “What”, “is”, and “the”. In contrast, a bidirectional language model could also gain context from “with” and “you”, which might help the model generate better predictions.
bidirectional language model
A language model that determines the probability that a given token is present at a given location in an excerpt of text based on the preceding and following text.
bigram
An N-gram (ordered sequence of N words) in which N=2.
binary classification
A type of classification task that predicts one of two mutually exclusive classes:
- the positive class
- the negative class
For example, the following two machine learning models each perform binary classification:
- A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
- A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn’t have that disease (the negative class).
Contrast with multi-class classification.
binary condition
In a decision tree, a condition that has only two possible outcomes, typically yes or no.
binning
Synonym for bucketing. Converting a single feature into multiple binary features called buckets or bins, typically based on a value range. The chopped feature is typically a continuous feature.
BLEU (Bilingual Evaluation Understudy)
A score between 0.0 and 1.0, inclusive, indicating the quality of a translation between two human languages (for example, between English and Russian). A BLEU score of 1.0 indicates a perfect translation; a BLEU score of 0.0 indicates a terrible translation.
boosting
A machine learning technique that iteratively combines a set of simple and not very accurate classifiers (referred to as “weak” classifiers) into a classifier with high accuracy (a “strong” classifier) by upweighting the examples that the model is currently misclassifying.
bounding box
In an image, the (x, y) coordinates of a rectangle around an area of interest, such as identifying the portion of an image containing a dog.
broadcasting
Expanding the shape of an operand in a matrix math operation to dimensions compatible for that operation. For instance, linear algebra requires that the two operands in a matrix addition operation must have the same dimensions. Consequently, you can’t add a matrix of shape (m, n) to a vector of length n. Broadcasting enables this operation by virtually expanding the vector of length n to a matrix of shape (m, n) by replicating the same values down each column.
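For example, NumPy performs this expansion automatically when adding a length-n vector to an (m, n) matrix:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
vector = np.array([10, 20, 30])     # shape (3,)

# The vector is virtually expanded to shape (2, 3) by repeating its values
# down each column, so the addition is well defined.
print(matrix + vector)
# [[11 22 33]
#  [14 25 36]]
```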
bucketing
Converting a single feature into multiple binary features called buckets or bins, typically based on a value range. The chopped feature is typically a continuous feature.
For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:
- <= 10 degrees Celsius would be the “cold” bucket.
- 11 - 24 degrees Celsius would be the “temperate” bucket.
- >= 25 degrees Celsius would be the “warm” bucket.
The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.
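A minimal sketch of the temperature bucketing described above (the function name is illustrative):

```python
def temperature_bucket(celsius: float) -> str:
    # Map a continuous temperature to one of three discrete buckets.
    if celsius <= 10:
        return "cold"
    elif celsius < 25:
        return "temperate"
    else:
        return "warm"

print(temperature_bucket(13), temperature_bucket(22))  # temperate temperate
```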
calibration layer
A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and probabilities should match the distribution of an observed set of labels.
candidate generation
The initial set of recommendations chosen by a recommendation system. For example, consider a bookstore that offers 100,000 titles. The candidate generation phase creates a much smaller list of suitable books for a particular user, say 500. But even 500 books is way too many to recommend to a user. Subsequent, more expensive, phases of a recommendation system (such as scoring and re-ranking) reduce those 500 to a much smaller, more useful set of recommendations.
candidate sampling
A training-time optimization that calculates a probability for all the positive labels, using, for example, softmax, but only for a random sample of negative labels. For instance, given an example labeled beagle and dog, candidate sampling computes the predicted probabilities and corresponding loss terms for:
- beagle
- dog
- a random subset of the remaining negative classes (for example, cat, lollipop, fence).
The idea is that the negative classes can learn from less frequent negative reinforcement as long as positive classes always get proper positive reinforcement, and this is indeed observed empirically.
Candidate sampling is more computationally efficient than training algorithms that compute predictions for all negative classes, particularly when the number of negative classes is very large.
categorical data
Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state, which can only have one of the following three possible values:
- red
- yellow
- green
By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red, green, and yellow on driver behavior.
Categorical features are sometimes called discrete features.
Contrast with numerical data.
causal language model
Synonym for unidirectional language model. A language model that bases its probabilities only on the tokens appearing before, not after, the target token(s). Contrast with bidirectional language model.
centroid
The center of a cluster as determined by a k-means or k-median algorithm. For instance, if k is 3, then the k-means or k-median algorithm finds 3 centroids.
centroid-based clustering
A category of clustering algorithms that organizes data into nonhierarchical clusters. k-means is the most widely used centroid-based clustering algorithm.
Contrast with hierarchical clustering algorithms.
chain-of-thought prompting
A prompt engineering technique that encourages a large language model (LLM) to explain its reasoning, step by step. For example, consider the following prompt, paying particular attention to the second sentence:
How many g forces would a driver experience in a car that goes from 0 to 60 miles per hour in 7 seconds? In the answer, show all relevant calculations.
The LLM’s response would likely:
- Show a sequence of physics formulas, plugging in the values 0, 60, and 7 in appropriate places.
- Explain why it chose those formulas and what the various variables mean.
Chain-of-thought prompting forces the LLM to perform all the calculations, which might lead to a more correct answer. In addition, chain-of-thought prompting enables the user to examine the LLM’s steps to determine whether or not the answer makes sense.
checkpoint
Data that captures the state of a model’s parameters at a particular training iteration. Checkpoints enable exporting model weights, or performing training across multiple sessions. Checkpoints also enable training to continue past errors (for example, job preemption).
When fine-tuning, the starting point for training the new model will be a specific checkpoint of the pre-trained model.
class
A category that a label can belong to. For example:
- In a binary classification model that detects spam, the two classes might be spam and not spam.
- In a multi-class classification model that identifies dog breeds, the classes might be poodle, beagle, pug, and so on.
A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.
classification model
A model whose prediction is a class. For example, the following are all classification models:
- A model that predicts an input sentence’s language (French? Spanish? Italian?).
- A model that predicts tree species (Maple? Oak? Baobab?).
- A model that predicts the positive or negative class for a particular medical condition.
In contrast, regression models predict numbers rather than classes.
Two common types of classification models are:
- binary classification
- multi-class classification
classification threshold
In binary classification, a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class. Note that the classification threshold is a value that a human chooses, not a value chosen by model training.
A logistic regression model outputs a raw value between 0 and 1. Then:
- If this raw value is greater than the classification threshold, then the positive class is predicted.
- If this raw value is less than the classification threshold, then the negative class is predicted.
For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.
The choice of classification threshold strongly influences the number of false positives and false negatives.
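A sketch of applying a classification threshold to a model's raw output, using the 0.8 threshold from the example above:

```python
def classify(raw_value: float, threshold: float = 0.8) -> str:
    # The threshold is chosen by a human, not learned during training.
    return "positive" if raw_value > threshold else "negative"

print(classify(0.9))  # positive
print(classify(0.7))  # negative
```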
class-imbalanced dataset
A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:
- 1,000,000 negative labels
- 10 positive labels
The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.
In contrast, the following dataset is not class-imbalanced because the ratio of negative labels to positive labels is relatively close to 1:
- 517 negative labels
- 483 positive labels
Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:
- 1,000,000 labels with class “green”
- 200 labels with class “purple”
- 350 labels with class “orange”
See also entropy, majority class, and minority class.
clipping
A technique for handling outliers by doing either or both of the following:
- Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
- Increasing feature values that are less than a minimum threshold up to that minimum threshold.
For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:
- Clip all values over 60 (the maximum threshold) to be exactly 60.
- Clip all values under 40 (the minimum threshold) to be exactly 40.
Outliers can damage models, sometimes causing weights to overflow during training. Some outliers can also dramatically spoil metrics like accuracy. Clipping is a common technique to limit the damage.
Gradient clipping forces gradient values within a designated range during training.
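A sketch of the feature clipping described above, using NumPy's clip function:

```python
import numpy as np

values = np.array([35.0, 52.0, 48.0, 71.0, 60.0])
# Clip to the range [40, 60]: values below 40 become 40, values above 60 become 60.
clipped = np.clip(values, 40.0, 60.0)
print(clipped)  # [40. 52. 48. 60. 60.]
```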
Cloud TPU
A specialized hardware accelerator designed to speed up machine learning workloads on Google Cloud Platform.
clustering
Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.
Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid. As another example, consider a clustering algorithm based on an example’s distance from a center point.
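For example, a minimal k-means run using scikit-learn (the data points here are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2D points.
examples = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                     [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(examples)
print(kmeans.labels_)           # cluster assignment for each example
print(kmeans.cluster_centers_)  # the two centroids
```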
co-adaptation
When neurons predict patterns in training data by relying almost exclusively on outputs of specific other neurons instead of relying on the network’s behavior as a whole. When the patterns that cause co-adaptation are not present in validation data, then co-adaptation causes overfitting. Dropout regularization reduces co-adaptation because dropout ensures neurons cannot rely solely on specific other neurons.
collaborative filtering
Making predictions about the interests of one user based on the interests of many other users. Collaborative filtering is often used in recommendation systems.
condition
In a decision tree, any node that evaluates an expression. A condition is also called a split or a test.
Contrast condition with leaf.
configuration
The process of assigning the initial property values used to train a model, including:
- the model’s composing layers
- the location of the data
- hyperparameters such as:
- learning rate
- iterations
- optimizer
- loss function
In machine learning projects, configuration can be done through a special configuration file or via configuration libraries such as the following:
- HParam
- Gin
- Fiddle
confirmation bias
The tendency to search for, interpret, favor, and recall information in a way that confirms one’s preexisting beliefs or hypotheses. Machine learning developers may inadvertently collect or label data in ways that influence an outcome supporting their existing beliefs. Confirmation bias is a form of implicit bias.
Experimenter’s bias is a form of confirmation bias in which an experimenter continues training models until a preexisting hypothesis is confirmed.
confusion matrix
An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. The confusion matrix for a multi-class classification problem can help you identify patterns of mistakes. Confusion matrices contain sufficient information to calculate a variety of performance metrics, including precision and recall.
constituency parsing
Dividing a sentence into smaller grammatical structures (“constituents”). A later part of the ML system, such as a natural language understanding model, can parse the constituents more easily than the original sentence. For example, consider the following sentence:
My friend adopted two cats.
A constituency parser can divide this sentence into the following two constituents:
- My friend is a noun phrase.
- adopted two cats is a verb phrase.
These constituents can be further subdivided into smaller constituents. For example, the verb phrase
adopted two cats
could be further subdivided into:
- adopted is a verb.
- two cats is another noun phrase.
continuous feature
A floating-point feature with an infinite range of possible values, such as temperature or weight.
Contrast with discrete feature.
convenience sampling
Using a dataset not gathered scientifically in order to run quick experiments. Later on, it’s essential to switch to a scientifically gathered dataset.
convergence
A state reached when loss values change very little or not at all with each iteration. A model converges when additional training won’t improve the model.
In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending. During a long period of constant loss values, you may temporarily get a false sense of convergence.
See also early stopping.
convex function
A function in which the region above the graph of the function is a convex set. The prototypical convex function is shaped something like the letter U. A strictly convex function has exactly one local minimum point, which is also the global minimum point. The classic U-shaped functions are strictly convex functions. However, some convex functions (for example, straight lines) are not U-shaped.
convex optimization
The process of using mathematical techniques such as gradient descent to find the minimum of a convex function. A great deal of research in machine learning has focused on formulating various problems as convex optimization problems and in solving those problems more efficiently.
convex set
A subset of Euclidean space such that a line drawn between any two points in the subset remains completely within the subset.
For example, a square or a circle is a convex set; a star or a horseshoe is not.
convolution
In mathematics, casually speaking, a mixture of two functions. In machine learning, a convolution mixes the convolutional filter and the input matrix in order to train weights.
The term “convolution” in machine learning is often a shorthand way of referring to either convolutional operation or convolutional layer.
Without convolutions, a machine learning algorithm would have to learn a separate weight for every cell in a large tensor. For example, a machine learning algorithm training on 2K x 2K images would be forced to find 4M separate weights. Thanks to convolutions, a machine learning algorithm only has to find weights for every cell in the convolutional filter, dramatically reducing the memory needed to train the model. When the convolutional filter is applied, it is simply replicated across cells such that each is multiplied by the filter.
convolutional filter
One of the two actors in a convolutional operation. (The other actor is a slice of an input matrix.) A convolutional filter is a matrix having the same rank as the input matrix, but a smaller shape. For example, given a 28x28 input matrix, the filter could be any 2D matrix smaller than 28x28.
In photographic manipulation, all the cells in a convolutional filter are typically set to a constant pattern of ones and zeroes. In machine learning, convolutional filters are typically seeded with random numbers and then the network trains the ideal values.
convolutional layer
A layer of a deep neural network in which a convolutional filter passes along an input matrix. For example, consider the following 3x3 convolutional filter:
0 1 0
1 0 1
0 1 0
When applied to a 5x5 input matrix, the filter yields a 3x3 output matrix by performing the convolutional operation on each of the nine 3x3 slices of the input matrix.
convolutional neural network
A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers:
- convolutional layers
- pooling layers
- dense layers
Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.
convolutional operation
The following two-step mathematical operation:
- Element-wise multiplication of the convolutional filter and a slice of an input matrix. (The slice of the input matrix has the same rank and size as the convolutional filter.)
- Summation of all the values in the resulting product matrix.
A convolutional layer consists of a series of convolutional operations, each acting on a different slice of the input matrix.
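A NumPy sketch of the convolutional operation, sliding a 3x3 filter over a 5x5 input to produce a 3x3 output (matching the convolutional layer example above; this sketch assumes a stride of 1 and no padding):

```python
import numpy as np

conv_filter = np.array([[0, 1, 0],
                        [1, 0, 1],
                        [0, 1, 0]])

def convolve(input_matrix, conv_filter):
    fh, fw = conv_filter.shape
    out_h = input_matrix.shape[0] - fh + 1
    out_w = input_matrix.shape[1] - fw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # One convolutional operation: element-wise multiply, then sum.
            window = input_matrix[i:i + fh, j:j + fw]
            output[i, j] = np.sum(window * conv_filter)
    return output

input_matrix = np.arange(25).reshape(5, 5)
print(convolve(input_matrix, conv_filter))  # a 3x3 output matrix
```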
cost
Synonym for loss. During the training of a supervised model, a measure of how far a model’s prediction is from its label.
A loss function calculates the loss.
co-training
A semi-supervised learning approach particularly useful when all of the following conditions are true:
- The ratio of unlabeled examples to labeled examples in the dataset is high.
- This is a classification problem (binary or multi-class).
- The dataset contains two different sets of predictive features that are independent of each other and complementary.
Co-training essentially amplifies independent signals into a stronger signal. For instance, consider a classification model that categorizes individual used cars as either Good or Bad. One set of predictive features might focus on aggregate characteristics such as the year, make, and model of the car; another set of predictive features might focus on the previous owner’s driving record and the car’s maintenance history.
counterfactual fairness
A fairness metric that checks whether a classifier produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes. Evaluating a classifier for counterfactual fairness is one method for surfacing potential sources of bias in a model.
coverage bias
A form of selection bias, in which the population represented in the dataset doesn’t match the population that the machine learning model is making predictions about.
crash blossom
A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in natural language understanding. For example, the headline Red Tape Holds Up Skyscraper is a crash blossom because an NLU model could interpret the headline literally or figuratively.
critic
Synonym for Deep Q-Network: in Q-learning, a deep neural network that predicts Q-functions.
cross-entropy
A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.
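A minimal sketch of cross-entropy between a one-hot true distribution and a model's predicted probability distribution (a small epsilon guards against log(0)):

```python
import numpy as np

def cross_entropy(true_dist, predicted_dist, epsilon=1e-12):
    # -sum(p * log(q)), summed over all classes.
    return -np.sum(true_dist * np.log(predicted_dist + epsilon))

true_dist = np.array([0.0, 1.0, 0.0])       # the correct class is class 1
predicted_dist = np.array([0.1, 0.7, 0.2])  # model's predicted probabilities
print(cross_entropy(true_dist, predicted_dist))  # ~0.357
```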
cross-validation
A mechanism for estimating how well a model would generalize to new data by testing the model against one or more non-overlapping data subsets withheld from the training set.
data analysis
Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be particularly useful when a dataset is first received, before one builds the first model. It is also crucial in understanding experiments and debugging problems with the system.
data augmentation
Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your dataset doesn’t contain enough image examples for the model to learn useful associations. Ideally, you’d add enough labeled images to your dataset to enable your model to train properly. If that’s not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.
DataFrame
A popular pandas datatype for representing datasets in memory.
A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number.
A DataFrame is structured like a 2D array, except that each column can be assigned its own data type.
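For example, a small DataFrame in which each column has its own data type:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Oslo", "Lima"],    # string column
    "temperature_c": [18.5, 4.0, 22.1],   # float column
    "is_capital": [True, True, True],     # bool column
})
print(df.dtypes)   # one dtype per column
print(df.iloc[0])  # the first row
```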
data parallelism
A way of scaling training or inference that replicates an entire model onto multiple devices and then passes a subset of the input data to each device. Data parallelism can enable training and inference on very large batch sizes; however, data parallelism requires that the model be small enough to fit on all devices.
Data parallelism typically speeds training and inference.
See also model parallelism.
data set or dataset
A collection of raw data, commonly (but not exclusively) organized in one of the following formats:
- a spreadsheet
- a file in CSV (comma-separated values) format
Dataset API (tf.data)
A high-level TensorFlow API for reading data and transforming it into a form that a machine learning algorithm requires. A tf.data.Dataset object represents a sequence of elements, in which each element contains one or more Tensors. A tf.data.Iterator object provides access to the elements of a Dataset.
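For example, a small tf.data pipeline that shuffles and batches an in-memory dataset (the values are illustrative):

```python
import tensorflow as tf

features = tf.constant([[1.0], [2.0], [3.0], [4.0]])
labels = tf.constant([0, 1, 1, 0])

# Build a Dataset of (feature, label) pairs, shuffle it, and group it into batches.
dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(buffer_size=4)
           .batch(2))

for batch_features, batch_labels in dataset:
    print(batch_features.numpy(), batch_labels.numpy())
```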
decision boundary
The separator between classes learned by a model in a binary-class or multi-class classification problem. For a visual example, consider a 2D scatter plot representing a binary classification problem: the decision boundary would be the line separating the two classes (clusters of data points).
decision forest
A model created from multiple decision trees. A decision forest makes a prediction by aggregating the predictions of its decision trees. Popular types of decision forests include random forests and gradient boosted trees.