Google Glossary Flashcards
A/B testing
A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival. A/B testing aims to determine not only which technique performs better but also to understand whether the difference is statistically significant. A/B testing usually considers only two techniques using one measurement, but it can be applied to any finite number of techniques and measures.
accuracy
The fraction of predictions that a classification model got right. In multi-class classification, accuracy is defined as follows:
Accuracy = (correct predictions) / (total number of examples)
In binary classification, accuracy has the following definition:
Accuracy = (true positives + true negatives) / (total number of examples)
See true positive and true negative.
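As a quick illustration (a minimal Python sketch, not part of the original glossary), binary-classification accuracy can be computed directly from the four confusion-matrix counts; the counts below are made up:

def accuracy(tp, tn, fp, fn):
    # Fraction of predictions the model got right.
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=18, tn=452, fp=6, fn=1))  # 470 / 477 ≈ 0.985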
action
In reinforcement learning, the mechanism by which the agent transitions between states of the environment. The agent chooses the action by using a policy.
activation function
A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer.
active learning
A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.
AdaGrad
A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate. For a full explanation, see this paper.
agent
In reinforcement learning, the entity that uses a policy to maximize expected return gained from transitioning between states of the environment.
agglomerative clustering
See hierarchical clustering.
hierarchical clustering
A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:
Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree. Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.
Contrast with centroid-based clustering.
augmented reality
A technology that superimposes a computer-generated image on a user’s view of the real world, thus providing a composite view.
PR AUC (area under the PR curve)
Area under the interpolated precision-recall curve, obtained by plotting (recall, precision) points for different values of the classification threshold. Depending on how it’s calculated, PR AUC may be equivalent to the average precision of the model.
AUC (Area under the ROC Curve)
An evaluation metric that considers all possible classification thresholds.
The Area Under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
artificial general intelligence
A non-human mechanism that demonstrates a broad range of problem solving, creativity, and adaptability. For example, a program demonstrating artificial general intelligence could translate text, compose symphonies, and excel at games that have not yet been invented.
artificial intelligence
A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.
Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.
attribute
Synonym for feature. In fairness, attributes often refer to characteristics pertaining to individuals.
automation bias
When a human decision maker favors recommendations made by an automated decision-making system over information made without automation, even when the automated decision-making system makes errors.
average precision
A metric for summarizing the performance of a ranked sequence of results. Average precision is calculated by taking the average of the precision values for each relevant result (each result in the ranked list where the recall increases relative to the previous result).
See also Area under the PR Curve.
backpropagation
The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.
bag of words
A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents the following three phrases identically:
the dog jumps
jumps the dog
dog jumps the
Each word is mapped to an index in a sparse vector, where the vector has an index for every word in the vocabulary. For example, the phrase the dog jumps is mapped into a feature vector with non-zero values at the three indices corresponding to the words the, dog, and jumps. The non-zero value can be any of the following:
- A 1 to indicate the presence of a word.
- A count of the number of times a word appears in the bag. For example, if the phrase were the maroon dog is a dog with maroon fur, then both maroon and dog would be represented as 2, while the other words would be represented as 1.
- Some other value, such as the logarithm of the count of the number of times a word appears in the bag.
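To make the mapping concrete, here is a minimal Python sketch (not from the glossary) that builds count-style bag-of-words vectors; the vocabulary and helper name are illustrative only:

from collections import Counter

def bag_of_words(phrase, vocabulary):
    # Map a phrase to a count vector indexed by vocabulary position.
    counts = Counter(phrase.split())
    return [counts[word] for word in vocabulary]

vocab = ["a", "dog", "fur", "is", "jumps", "maroon", "the", "with"]
print(bag_of_words("the dog jumps", vocab))
# [0, 1, 0, 0, 1, 0, 1, 0]  (identical for "jumps the dog" and "dog jumps the")
print(bag_of_words("the maroon dog is a dog with maroon fur", vocab))
# [1, 2, 1, 1, 0, 2, 1, 1]  (maroon and dog each counted twice)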
baseline
A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model.
For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.
batch
The set of examples used in one iteration (that is, one gradient update) of model training.
See also batch size.
batch normalization
Normalizing the input or output of the activation functions in a hidden layer. Batch normalization can provide the following benefits:
- Make neural networks more stable by protecting against outlier weights.
- Enable higher learning rates.
- Reduce overfitting.
batch size
The number of examples in a batch. For example, the batch size of SGD is 1, while the batch size of a mini-batch is usually between 10 and 1000. Batch size is usually fixed during training and inference; however, TensorFlow does permit dynamic batch sizes.
Bayesian neural network
A probabilistic neural network that accounts for uncertainty in weights and outputs. A standard neural network regression model typically predicts a scalar value; for example, a model predicts a house price of 853,000. By contrast, a Bayesian neural network predicts a distribution of values; for example, a model predicts a house price of 853,000 with a standard deviation of 67,200. A Bayesian neural network relies on Bayes’ Theorem to calculate uncertainties in weights and predictions. A Bayesian neural network can be useful when it is important to quantify uncertainty, such as in models related to pharmaceuticals. Bayesian neural networks can also help prevent overfitting.
Bellman equation #rl
In reinforcement learning, the following identity satisfied by the optimal Q-function:
Q(s, a) = r(s, a) + γ max_a' Q(s', a')
Reinforcement learning algorithms apply this identity to create Q-learning via the following update rule:
Q(s, a) ← Q(s, a) + α [r(s, a) + γ max_a' Q(s', a') − Q(s, a)]
Beyond reinforcement learning, the Bellman equation has applications to dynamic programming. See the Wikipedia entry for Bellman Equation.
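As an illustrative sketch (not part of the glossary), a tabular Q-learning update that applies the rule above might look like this in Python; the tiny environment and all names are hypothetical:

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # One Q-learning step toward the Bellman target.
    best_next = max(q_table[next_state].values())
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])

q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 0.0, "right": 1.0}}
q_update(q, state="s0", action="right", reward=0.5, next_state="s1")
print(q["s0"]["right"])  # 0.1 * (0.5 + 0.9 * 1.0) = 0.14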
bias (ethics/fairness) #fairness
- Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:
automation bias, confirmation bias, experimenter’s bias, group attribution bias, implicit bias, in-group bias, out-group homogeneity bias
- Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:
coverage bias, non-response bias, participation bias, reporting bias, sampling bias, selection bias
Not to be confused with the bias term in machine learning models or prediction bias.
bias (math)
An intercept or offset from an origin. Bias (also known as the bias term) is referred to as b or w0 in machine learning models. For example, bias is the b in the following formula:
y' = b + w1x1 + w2x2 + … + wnxn
Not to be confused with bias in ethics and fairness or prediction bias.
bigram
An N-gram in which N=2.
binary classification
A type of classification task that outputs one of two mutually exclusive classes. For example, a machine learning model that evaluates email messages and outputs either “spam” or “not spam” is a binary classifier.
binning
See bucketing.
bucketing
Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically based on value range. For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, all temperatures between 0.0 and 15.0 degrees could be put into one bin, 15.1 to 30.0 degrees could be a second bin, and 30.1 to 50.0 degrees could be a third bin.
boosting
A machine learning technique that iteratively combines a set of simple and not very accurate classifiers (referred to as “weak” classifiers) into a classifier with high accuracy (a “strong” classifier) by upweighting the examples that the model is currently misclassifying.
broadcasting
Expanding the shape of an operand in a matrix math operation to dimensions compatible for that operation. For instance, linear algebra requires that the two operands in a matrix addition operation must have the same dimensions. Consequently, you can’t add a matrix of shape (m, n) to a vector of length n. Broadcasting enables this operation by virtually expanding the vector of length n to a matrix of shape (m,n) by replicating the same values down each column.
For example, given the following definitions, linear algebra prohibits A+B because A and B have different dimensions:
A = [[7, 10, 4],
[13, 5, 9]]
B = [2]
However, broadcasting enables the operation A+B by virtually expanding B to:
[[2, 2, 2],
[2, 2, 2]]
Thus, A+B is now a valid operation:
[[ 7, 10,  4],     [[2, 2, 2],     [[ 9, 12,  6],
 [13,  5,  9]]  +   [2, 2, 2]]  =   [15,  7, 11]]
See the following description of broadcasting in NumPy for more details.
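The same example can be reproduced with NumPy, which performs the broadcasting automatically; this is just an illustrative sketch:

import numpy as np

A = np.array([[7, 10, 4],
              [13, 5, 9]])   # shape (2, 3)
B = np.array([2])            # shape (1,)

# NumPy virtually expands B to shape (2, 3) before adding.
print(A + B)
# [[ 9 12  6]
#  [15  7 11]]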
bucketing
Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically based on value range. For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, all temperatures between 0.0 and 15.0 degrees could be put into one bin, 15.1 to 30.0 degrees could be a second bin, and 30.1 to 50.0 degrees could be a third bin.
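A minimal sketch of the temperature example, assuming NumPy is available; the sample temperatures are made up:

import numpy as np

temperatures = np.array([3.2, 14.9, 15.1, 29.7, 30.1, 48.5])

# Three buckets: below 15.0, 15.0 up to 30.0, and 30.0 and above.
bucket_index = np.digitize(temperatures, bins=[15.0, 30.0])
print(bucket_index)  # [0 0 1 1 2 2]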
calibration layer
A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and probabilities should match the distribution of an observed set of labels.
candidate generation
The initial set of recommendations chosen by a recommendation system. For example, consider a bookstore that offers 100,000 titles. The candidate generation phase creates a much smaller list of suitable books for a particular user, say 500. But even 500 books is way too many to recommend to a user. Subsequent, more expensive, phases of a recommendation system (such as scoring and re-ranking) whittle down those 500 to a much smaller, more useful set of recommendations.
candidate sampling
A training-time optimization in which a probability is calculated for all the positive labels, using, for example, softmax, but only for a random sample of negative labels. For example, if we have an example labeled beagle and dog, candidate sampling computes the predicted probabilities and corresponding loss terms for the beagle and dog class outputs in addition to a random subset of the remaining classes (cat, lollipop, fence). The idea is that the negative classes can learn from less frequent negative reinforcement as long as positive classes always get proper positive reinforcement, and this is indeed observed empirically. The motivation for candidate sampling is a computational efficiency win from not computing predictions for all negatives.
categorical data
Features having a discrete set of possible values. For example, consider a categorical feature named house style, which has a discrete set of three possible values: Tudor, ranch, colonial. By representing house style as categorical data, the model can learn the separate impacts of Tudor, ranch, and colonial on house price.
Sometimes, values in the discrete set are mutually exclusive, and only one value can be applied to a given example. For example, a car maker categorical feature would probably permit only a single value (Toyota) per example. Other times, more than one value may be applicable. A single car could be painted more than one different color, so a car color categorical feature would likely permit a single example to have multiple values (for example, red and white).
Categorical features are sometimes called discrete features.
Contrast with numerical data.
centroid
The center of a cluster as determined by a k-means or k-median algorithm. For instance, if k is 3, then the k-means or k-median algorithm finds 3 centroids.
centroid-based clustering
A category of clustering algorithms that organizes data into nonhierarchical clusters. k-means is the most widely used centroid-based clustering algorithm.
Contrast with hierarchical clustering algorithms.
checkpoint
Data that captures the state of the variables of a model at a particular time. Checkpoints enable exporting model weights, as well as performing training across multiple sessions. Checkpoints also enable training to continue past errors (for example, job preemption). Note that the graph itself is not included in a checkpoint.
class
One of a set of enumerated target values for a label. For example, in a binary classification model that detects spam, the two classes are spam and not spam. In a multi-class classification model that identifies dog breeds, the classes would be poodle, beagle, pug, and so on.
classification model
A type of machine learning model for distinguishing among two or more discrete classes. For example, a natural language processing classification model could determine whether an input sentence was in French, Spanish, or Italian. Compare with regression model.
classification threshold
A scalar-value criterion that is applied to a model’s predicted score in order to separate the positive class from the negative class. Used when mapping logistic regression results to binary classification. For example, consider a logistic regression model that determines the probability of a given email message being spam. If the classification threshold is 0.9, then logistic regression values above 0.9 are classified as spam and those below 0.9 are classified as not spam.
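As an illustrative sketch, applying the 0.9 threshold from the example above to predicted spam probabilities might look like this; the probabilities are made up:

def classify(probability, threshold=0.9):
    # Map a predicted probability to a binary class label.
    return "spam" if probability > threshold else "not spam"

for p in [0.95, 0.91, 0.62, 0.10]:
    print(p, classify(p))
# 0.95 spam, 0.91 spam, 0.62 not spam, 0.10 not spam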
class-imbalanced dataset
A binary classification problem in which the labels for the two classes have significantly different frequencies. For example, a disease dataset in which 0.0001 of examples have positive labels and 0.9999 have negative labels is a class-imbalanced problem, but a football game predictor in which 0.51 of examples label one team winning and 0.49 label the other team winning is not a class-imbalanced problem.
clipping
A technique for handling outliers. Specifically, reducing feature values that are greater than a set maximum value down to that maximum value. Also, increasing feature values that are less than a specific minimum value up to that minimum value.
For example, suppose that only a few feature values fall outside the range 40–60. In this case, you could do the following:
Clip all values over 60 to be exactly 60. Clip all values under 40 to be exactly 40.
In addition to bringing input values within a designated range, clipping can also be used to force gradient values to within a designated range during training.
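A minimal sketch of the 40–60 example using NumPy; the feature values are made up:

import numpy as np

feature = np.array([12, 43, 55, 61, 98])
clipped = np.clip(feature, 40, 60)   # values below 40 become 40, above 60 become 60
print(clipped)  # [40 43 55 60 60]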
Cloud TPU
A specialized hardware accelerator designed to speed up machine learning workloads on Google Cloud Platform. #TensorFlow #GoogleCloud
clustering
Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.
Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid, as in the following diagram:
A human researcher could then review the clusters and, for example, label cluster 1 as “dwarf trees” and cluster 2 as “full-size trees.”
As another example, consider a clustering algorithm based on an example’s distance from a center point, illustrated as follows:
co-adaptation
When neurons predict patterns in training data by relying almost exclusively on outputs of specific other neurons instead of relying on the network’s behavior as a whole. When the patterns that cause co-adaption are not present in validation data, then co-adaptation causes overfitting. Dropout regularization reduces co-adaptation because dropout ensures neurons cannot rely solely on specific other neurons.
collaborative filtering
Making predictions about the interests of one user based on the interests of many other users. Collaborative filtering is often used in recommendation systems.
confirmation bias
fairness
The tendency to search for, interpret, favor, and recall information in a way that confirms one’s preexisting beliefs or hypotheses. Machine learning developers may inadvertently collect or label data in ways that influence an outcome supporting their existing beliefs. Confirmation bias is a form of implicit bias.
Experimenter’s bias is a form of confirmation bias in which an experimenter continues training models until a preexisting hypothesis is confirmed.
confusion matrix
An NxN table that summarizes how successful a classification model’s predictions were; that is, the correlation between the label and the model’s classification. One axis of a confusion matrix is the label that the model predicted, and the other axis is the actual label. N represents the number of classes. In a binary classification problem, N=2. For example, here is a sample confusion matrix for a binary classification problem:
                     Tumor (predicted)   Non-Tumor (predicted)
Tumor (actual)               18                      1
Non-Tumor (actual)            6                    452
The preceding confusion matrix shows that of the 19 samples that actually had tumors, the model correctly classified 18 as having tumors (18 true positives), and incorrectly classified 1 as not having a tumor (1 false negative). Similarly, of 458 samples that actually did not have tumors, 452 were correctly classified (452 true negatives) and 6 were incorrectly classified (6 false positives).
The confusion matrix for a multi-class classification problem can help you determine mistake patterns. For example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or 1 instead of 7.
Confusion matrices contain sufficient information to calculate a variety of performance metrics, including precision and recall.
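For instance, precision and recall can be read straight off the tumor example above; the sketch below simply restates that arithmetic in Python:

tp, fn = 18, 1     # actual tumors: correctly and incorrectly classified
fp, tn = 6, 452    # actual non-tumors: incorrectly and correctly classified

precision = tp / (tp + fp)   # 18 / 24 = 0.75
recall = tp / (tp + fn)      # 18 / 19 ≈ 0.947
print(precision, recall)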
continuous feature
A floating-point feature with an infinite range of possible values. Contrast with discrete feature.
convenience sampling
Using a dataset not gathered scientifically in order to run quick experiments. Later on, it’s essential to switch to a scientifically gathered dataset.
convergence
Informally, often refers to a state reached during training in which training loss and validation loss change very little or not at all with each iteration after a certain number of iterations. In other words, a model reaches convergence when additional training on the current data will not improve the model. In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending, temporarily producing a false sense of convergence.
See also early stopping.
See also Boyd and Vandenberghe, Convex Optimization.
convex function
A function in which the region above the graph of the function is a convex set. The prototypical convex function is shaped something like the letter U. For example, the following are all convex functions:
A typical convex function is shaped like the letter ‘U’.
By contrast, the following function is not convex. Notice how the region above the graph is not a convex set:
A strictly convex function has exactly one local minimum point, which is also the global minimum point. The classic U-shaped functions are strictly convex functions. However, some convex functions (for example, straight lines) are not U-shaped.
A lot of the common loss functions, including the following, are convex functions:
L2 loss Log Loss L1 regularization L2 regularization
Many variations of gradient descent are guaranteed to find a point close to the minimum of a strictly convex function. Similarly, many variations of stochastic gradient descent have a high probability (though, not a guarantee) of finding a point close to the minimum of a strictly convex function.
The sum of two convex functions (for example, L2 loss + L1 regularization) is a convex function.
Deep models are never convex functions. Remarkably, algorithms designed for convex optimization tend to find reasonably good solutions on deep networks anyway, even though those solutions are not guaranteed to be a global minimum.
convex optimization
The process of using mathematical techniques such as gradient descent to find the minimum of a convex function. A great deal of research in machine learning has focused on formulating various problems as convex optimization problems and in solving those problems more efficiently.
For complete details, see Boyd and Vandenberghe, Convex Optimization.
convex set
A subset of Euclidean space such that a line drawn between any two points in the subset remains completely within the subset. For instance, the following two shapes are convex sets:
A rectangle and a semi-ellipse are both convex sets.
By contrast, the following two shapes are not convex sets:
A pie-chart with a missing slice and a firework are both nonconvex sets.
convolution
In mathematics, casually speaking, a mixture of two functions. In machine learning, a convolution mixes the convolutional filter and the input matrix in order to train weights.
The term “convolution” in machine learning is often a shorthand way of referring to either convolutional operation or convolutional layer.
Without convolutions, a machine learning algorithm would have to learn a separate weight for every cell in a large tensor. For example, a machine learning algorithm training on 2K x 2K images would be forced to find 4M separate weights. Thanks to convolutions, a machine learning algorithm only has to find weights for every cell in the convolutional filter, dramatically reducing the memory needed to train the model. When the convolutional filter is applied, it is simply replicated across cells such that each is multiplied by the filter.
convolutional filter
One of the two actors in a convolutional operation. (The other actor is a slice of an input matrix.) A convolutional filter is a matrix having the same rank as the input matrix, but a smaller shape. For example, given a 28x28 input matrix, the filter could be any 2D matrix smaller than 28x28.
In photographic manipulation, all the cells in a convolutional filter are typically set to a constant pattern of ones and zeroes. In machine learning, convolutional filters are typically seeded with random numbers and then the network trains the ideal values.
convolutional layer
A layer of a deep neural network in which a convolutional filter passes along an input matrix. For example, consider the following 3x3 convolutional filter:
The following animation shows a convolutional layer consisting of 9 convolutional operations involving the 5x5 input matrix. Notice that each convolutional operation works on a different 3x3 slice of the input matrix. The resulting 3x3 matrix (on the right) consists of the results of the 9 convolutional operations:
convolutional neural network
A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers:
convolutional layers pooling layers dense layers
Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.
convolutional operation
The following two-step mathematical operation:
- Element-wise multiplication of the convolutional filter and a slice of an input matrix. (The slice of the input matrix has the same rank and size as the convolutional filter.)
- Summation of all the values in the resulting product matrix.
For example, consider the following 5x5 input matrix:
Now imagine the following 2x2 convolutional filter:
Each convolutional operation involves a single 2x2 slice of the input matrix. For instance, suppose we use the 2x2 slice at the top-left of the input matrix. So, the convolution operation on this slice looks as follows:
A convolutional layer consists of a series of convolutional operations, each acting on a different slice of the input matrix.
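A minimal NumPy sketch of a single convolutional operation; the input matrix and 2x2 filter values are made up:

import numpy as np

input_matrix = np.array([[128,  97,  53, 201, 198],
                         [ 35,  22,  25, 200, 195],
                         [ 37,  24,  28, 197, 182],
                         [ 33,  28,  92, 195, 179],
                         [ 31,  40, 100, 192, 177]])
conv_filter = np.array([[1, 0],
                        [0, 1]])

# Element-wise multiply the top-left 2x2 slice by the filter, then sum.
top_left_slice = input_matrix[0:2, 0:2]
print(np.sum(top_left_slice * conv_filter))  # 128*1 + 97*0 + 35*0 + 22*1 = 150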
cost
Synonym for loss.
counterfactual fairness
fairness
A fairness metric that checks whether a classifier produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes. Evaluating a classifier for counterfactual fairness is one method for surfacing potential sources of bias in a model.
See “When Worlds Collide: Integrating Different Counterfactual Assumptions in Fairness” for a more detailed discussion of counterfactual fairness.
coverage bias
See selection bias.
crash blossom
A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in natural language understanding. For example, the headline Red Tape Holds Up Skyscraper is a crash blossom because an NLU model could interpret the headline literally or figuratively.
critic
rl
Synonym for Deep Q-Network.
cross-entropy
A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.
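As a minimal sketch, cross-entropy between a one-hot label distribution and a model's predicted distribution can be computed as follows; the probabilities are made up:

import math

def cross_entropy(true_dist, predicted_dist):
    # Cross-entropy between two discrete probability distributions.
    return -sum(p * math.log(q) for p, q in zip(true_dist, predicted_dist) if p > 0)

true_label = [0.0, 1.0, 0.0]    # one-hot: the example belongs to class 1
prediction = [0.1, 0.7, 0.2]    # model's predicted class probabilities
print(cross_entropy(true_label, prediction))  # -log(0.7) ≈ 0.357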
cross-validation
A mechanism for estimating how well a model will generalize to new data by testing the model against one or more non-overlapping data subsets withheld from the training set.
custom Estimator
TensorFlow
An Estimator that you write yourself by following these directions.
Contrast with premade Estimators.
data analysis
Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be particularly useful when a dataset is first received, before one builds the first model. It is also crucial in understanding experiments and debugging problems with the system.
data augmentation
Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your dataset doesn’t contain enough image examples for the model to learn useful associations. Ideally, you’d add enough labeled images to your dataset to enable your model to train properly. If that’s not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.
DataFrame
A popular datatype for representing datasets in pandas. A DataFrame is analogous to a table. Each column of the DataFrame has a name (a header), and each row is identified by a number.
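A minimal pandas sketch; the column names and values are illustrative only:

import pandas as pd

# Each column has a name (header); each row is identified by an integer index.
df = pd.DataFrame({
    "bedrooms": [3, 2, 4],
    "bathrooms": [2, 1, 3],
    "price": [853000, 420000, 1150000],
})
print(df)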
data set or dataset
A collection of examples.
Dataset API (tf.data)
TensorFlow
A high-level TensorFlow API for reading data and transforming it into a form that a machine learning algorithm requires. A tf.data.Dataset object represents a sequence of elements, in which each element contains one or more Tensors. A tf.data.Iterator object provides access to the elements of a Dataset.
For details about the Dataset API, see Importing Data in the TensorFlow Programmer’s Guide.
decision boundary
The separator between classes learned by a model in binary-class or multi-class classification problems. For example, in the following image representing a binary classification problem, the decision boundary is the frontier between the orange class and the blue class:
A well-defined boundary between one class and another.
decision threshold
Synonym for classification threshold.
decision tree
A model represented as a sequence of branching statements. For example, the following over-simplified decision tree branches a few times to predict the price of a house (in thousands of USD). According to this decision tree, a house larger than 160 square meters, having more than three bedrooms, and built less than 10 years ago would have a predicted price of 510 thousand USD.
A tree three-levels deep whose branches predict house prices.
Machine learning can generate deep decision trees.
deep model
A type of neural network containing multiple hidden layers.
Contrast with wide model.
deep neural network
Synonym for deep model.
Deep Q-Network (DQN)
rl
In Q-learning, a deep neural network that predicts Q-functions.
Critic is a synonym for Deep Q-Network.
demographic parity
fairness
A fairness metric that is satisfied if the results of a model’s classification are not dependent on a given sensitive attribute.
For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other.
Contrast with equalized odds and equality of opportunity, which permit classification results in aggregate to depend on sensitive attributes, but do not permit classification results for certain specified ground-truth labels to depend on sensitive attributes. See “Attacking discrimination with smarter machine learning” for a visualization exploring the tradeoffs when optimizing for demographic parity.
dense feature
A feature in which most values are non-zero, typically a Tensor of floating-point values. Contrast with sparse feature.
dense layer
Synonym for fully connected layer.
fully connected layer
A hidden layer in which each node is connected to every node in the subsequent hidden layer.
A fully connected layer is also known as a dense layer.
depth
The number of layers (including any embedding layers) in a neural network that learn weights. For example, a neural network with 5 hidden layers and 1 output layer has a depth of 6.
depthwise separable convolutional neural network (sepCNN)
A convolutional neural network architecture based on Inception, but where Inception modules are replaced with depthwise separable convolutions. Also known as Xception.
A depthwise separable convolution (also abbreviated as separable convolution) factors a standard 3-D convolution into two separate convolution operations that are more computationally efficient: first, a depthwise convolution, with a depth of 1 (n ✕ n ✕ 1), and then second, a pointwise convolution, with length and width of 1 (1 ✕ 1 ✕ n).
To learn more, see Xception: Deep Learning with Depthwise Separable Convolutions.
device
TensorFlow
A category of hardware that can run a TensorFlow session, including CPUs, GPUs, and TPUs.
dimension reduction
Decreasing the number of dimensions used to represent a particular feature in a feature vector, typically by converting to an embedding.
dimensions
Overloaded term having any of the following definitions:
- The number of levels of coordinates in a Tensor. For example:
  A scalar has zero dimensions; for example, "Hello".
  A vector has one dimension; for example, [3, 5, 7, 11].
  A matrix has two dimensions; for example, [[2, 4, 18], [5, 7, 14]].
  You can uniquely specify a particular cell in a one-dimensional vector with one coordinate; you need two coordinates to uniquely specify a particular cell in a two-dimensional matrix.
- The number of entries in a feature vector.
- The number of elements in an embedding layer.
discrete feature
A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature. Contrast with continuous feature.
discriminative model
A model that predicts labels from a set of one or more features. More formally, discriminative models define the conditional probability of an output given the features and weights; that is:
p(output | features, weights)
For example, a model that predicts whether an email is spam from features and weights is a discriminative model.
The vast majority of supervised learning models, including classification and regression models, are discriminative models.
Contrast with generative model.
discriminator
A system that determines whether examples are real or fake.
The subsystem within a generative adversarial network that determines whether the examples created by the generator are real or fake.
disparate impact
fairness
Making decisions about people that impact different population subgroups disproportionately. This usually refers to situations where an algorithmic decision-making process harms or benefits some subgroups more than others.
For example, suppose an algorithm that determines a Lilliputian’s eligibility for a miniature-home loan is more likely to classify them as “ineligible” if their mailing address contains a certain postal code. If Big-Endian Lilliputians are more likely to have mailing addresses with this postal code than Little-Endian Lilliputians, then this algorithm may result in disparate impact.
Contrast with disparate treatment, which focuses on disparities that result when subgroup characteristics are explicit inputs to an algorithmic decision-making process.
disparate treatment
fairness
Factoring subjects’ sensitive attributes into an algorithmic decision-making process such that different subgroups of people are treated differently.
For example, consider an algorithm that determines Lilliputians’ eligibility for a miniature-home loan based on the data they provide in their loan application. If the algorithm uses a Lilliputian’s affiliation as Big-Endian or Little-Endian as an input, it is enacting disparate treatment along that dimension.
Contrast with disparate impact, which focuses on disparities in the societal impacts of algorithmic decisions on subgroups, irrespective of whether those subgroups are inputs to the model.
divisive clustering
See hierarchical clustering.
hierarchical clustering
A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:
Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree. Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.
Contrast with centroid-based clustering.
hinge loss
A family of loss functions for classification designed to find the decision boundary as distant as possible from each training example, thus maximizing the margin between examples and the boundary. KSVMs use hinge loss (or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as follows:
loss = max(0, 1 - (y * y'))
where y is the true label, either -1 or +1, and y' is the raw output of the classifier model:
y' = b + w1x1 + w2x2 + … + wnxn
Consequently, a plot of hinge loss vs. (y * y') is zero when (y * y') >= 1 and rises linearly as (y * y') falls below 1.
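A minimal Python sketch of the binary hinge loss defined above:

def hinge_loss(y, y_raw):
    # y is the true label (-1 or +1); y_raw is the model's raw output.
    return max(0.0, 1.0 - y * y_raw)

print(hinge_loss(+1, 2.5))   # 0.0  (confidently correct, no loss)
print(hinge_loss(+1, 0.3))   # 0.7  (correct but inside the margin)
print(hinge_loss(-1, 0.3))   # 1.3  (wrong side of the boundary)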
holdout data
Examples intentionally not used (“held out”) during training. The validation dataset and test dataset are examples of holdout data. Holdout data helps evaluate your model’s ability to generalize to data other than the data it was trained on. The loss on the holdout set provides a better estimate of the loss on an unseen dataset than does the loss on the training set.
hyperparameter
The “knobs” that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.
Contrast with parameter.
hyperplane
A boundary that separates a space into two subspaces. For example, a line is a hyperplane in two dimensions and a plane is a hyperplane in three dimensions. More typically in machine learning, a hyperplane is the boundary separating a high-dimensional space. Kernel Support Vector Machines use hyperplanes to separate positive classes from negative classes, often in a very high-dimensional space.
independently and identically distributed (i.i.d)
Data drawn from a distribution that doesn’t change, and where each value drawn doesn’t depend on values that have been drawn previously. i.i.d. is the ideal gas of machine learning—a useful mathematical construct but almost never exactly found in the real world. For example, the distribution of visitors to a web page may be i.i.d. over a brief window of time; that is, the distribution doesn’t change during that brief window and one person’s visit is generally independent of another’s visit. However, if you expand that window of time, seasonal differences in the web page’s visitors may appear.
individual fairness
fairness
A fairness metric that checks whether similar individuals are classified similarly. For example, Brobdingnagian Academy might want to satisfy individual fairness by ensuring that two students with identical grades and standardized test scores are equally likely to gain admission.
Note that individual fairness relies entirely on how you define “similarity” (in this case, grades and test scores), and you can run the risk of introducing new fairness problems if your similarity metric misses important information (such as the rigor of a student’s curriculum).
See “Fairness Through Awareness” for a more detailed discussion of individual fairness.
inference
In machine learning, often refers to the process of making predictions by applying the trained model to unlabeled examples. In statistics, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. (See the Wikipedia article on statistical inference.)
in-group bias
fairness
Showing partiality to one’s own group or own characteristics. If testers or raters consist of the machine learning developer’s friends, family, or colleagues, then in-group bias may invalidate product testing or the dataset.
In-group bias is a form of group attribution bias. See also out-group homogeneity bias.
input function
TensorFlow
In TensorFlow, a function that returns input data to the training, evaluation, or prediction method of an Estimator. For example, the training input function returns a batch of features and labels from the training set.
input layer
The first layer (the one that receives the input data) in a neural network.
instance
Synonym for example. One row of a dataset. An example contains one or more features and possibly a label. See also labeled example and unlabeled example.
interpretability
The degree to which a model’s predictions can be readily explained. Deep models are often non-interpretable; that is, a deep model’s different layers can be hard to decipher. By contrast, linear regression models and wide models are typically far more interpretable.
inter-rater agreement
A measurement of how often human raters agree when doing a task. If raters disagree, the task instructions may need to be improved. Also sometimes called inter-annotator agreement or inter-rater reliability. See also Cohen’s kappa, which is one of the most popular inter-rater agreement measurements.
item matrix
In recommendation systems, a matrix of embeddings generated by matrix factorization that holds latent signals about each item. Each row of the item matrix holds the value of a single latent feature for all items. For example, consider a movie recommendation system. Each column in the item matrix represents a single movie. The latent signals might represent genres, or might be harder-to-interpret signals that involve complex interactions among genre, stars, movie age, or other factors.
The item matrix has the same number of columns as the target matrix that is being factorized. For example, given a movie recommendation system that evaluates 10,000 movie titles, the item matrix will have 10,000 columns.
items
In a recommendation system, the entities that a system recommends. For example, videos are the items that a video store recommends, while books are the items that a bookstore recommends.
iteration
A single update of a model’s weights during training. An iteration consists of computing the gradients of the parameters with respect to the loss on a single batch of data.
Keras
A popular Python machine learning API. Keras runs on several deep learning frameworks, including TensorFlow, where it is made available as tf.keras.
Kernel Support Vector Machines (KSVMs)
A classification algorithm that seeks to maximize the margin between positive and negative classes by mapping input data vectors to a higher dimensional space. For example, consider a classification problem in which the input dataset has a hundred features. To maximize the margin between positive and negative classes, a KSVM could internally map those features into a million-dimension space. KSVMs use a loss function called hinge loss.
k-means
A popular clustering algorithm that groups examples in unsupervised learning. The k-means algorithm basically does the following:
- Iteratively determines the best k center points (known as centroids).
- Assigns each example to the closest centroid. Those examples nearest the same centroid belong to the same group.
The k-means algorithm picks centroid locations to minimize the cumulative square of the distances from each example to its closest centroid.
For example, consider the following plot of dog height to dog width:
If k=3, the k-means algorithm will determine three centroids. Each example is assigned to its closest centroid, yielding three groups:
Imagine that a manufacturer wants to determine the ideal sizes for small, medium, and large sweaters for dogs. The three centroids identify the mean height and mean width of each dog in that cluster. So, the manufacturer should probably base sweater sizes on those three centroids. Note that the centroid of a cluster is typically not an example in the cluster.
The preceding illustration shows k-means for examples with only two features (height and width). Note that k-means can group examples across many features.
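As an illustrative sketch of the dog-sweater example (assuming scikit-learn is installed; the height and width measurements are made up):

import numpy as np
from sklearn.cluster import KMeans

# Made-up (height, width) measurements for a handful of dogs.
dogs = np.array([[20, 14], [22, 15], [24, 16],
                 [35, 28], [37, 30], [39, 31],
                 [50, 42], [52, 44], [54, 45]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dogs)
print(kmeans.cluster_centers_)   # mean height and width of each cluster
print(kmeans.labels_)            # cluster assignment for each dog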
k-median
A clustering algorithm closely related to k-means. The practical difference between the two is as follows:
- In k-means, centroids are determined by minimizing the sum of the squares of the distance between a centroid candidate and each of its examples.
- In k-median, centroids are determined by minimizing the sum of the distance between a centroid candidate and each of its examples.
Note that the definitions of distance are also different:
- k-means relies on the Euclidean distance from the centroid to an example. (In two dimensions, the Euclidean distance means using the Pythagorean theorem to calculate the hypotenuse.) For example, the k-means distance between (2,2) and (5,-2) would be:
  Euclidean distance = sqrt((2-5)^2 + (2-(-2))^2) = 5
- k-median relies on the Manhattan distance from the centroid to an example. This distance is the sum of the absolute deltas in each dimension. For example, the k-median distance between (2,2) and (5,-2) would be:
  Manhattan distance = |2-5| + |2-(-2)| = 7
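The two distance calculations above, restated as a minimal Python sketch:

import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(euclidean((2, 2), (5, -2)))  # 5.0  (used by k-means)
print(manhattan((2, 2), (5, -2)))  # 7    (used by k-median)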
L1 loss
Loss function based on the absolute value of the difference between the values that a model is predicting and the actual values of the labels. L1 loss is less sensitive to outliers than L2 loss.
L1 regularization
A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model. Contrast with L2 regularization.
L2 loss
See squared loss.
squared loss
The loss function used in linear regression. (Also known as L2 Loss.) This function calculates the squares of the difference between a model’s predicted value for a labeled example and the actual value of the label. Due to squaring, this loss function amplifies the influence of bad predictions. That is, squared loss reacts more strongly to outliers than L1 loss.
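A minimal sketch contrasting L1 loss and squared (L2) loss on the same made-up predictions; note how the one bad prediction dominates the squared loss:

def l1_loss(labels, predictions):
    return sum(abs(y - p) for y, p in zip(labels, predictions))

def l2_loss(labels, predictions):
    return sum((y - p) ** 2 for y, p in zip(labels, predictions))

labels      = [10.0, 12.0, 15.0]
predictions = [11.0, 12.5, 25.0]   # the last prediction is far off

print(l1_loss(labels, predictions))  # 1.0 + 0.5 + 10.0 = 11.5
print(l2_loss(labels, predictions))  # 1.0 + 0.25 + 100.0 = 101.25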
L2 regularization
A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. (Contrast with L1 regularization.) L2 regularization always improves generalization in linear models.
label
In supervised learning, the “answer” or “result” portion of an example. Each example in a labeled dataset consists of one or more features and a label. For instance, in a housing dataset, the features might include the number of bedrooms, the number of bathrooms, and the age of the house, while the label might be the house’s price. In a spam detection dataset, the features might include the subject line, the sender, and the email message itself, while the label would probably be either “spam” or “not spam.”
labeled example
An example that contains features and a label. In supervised training, models learn from labeled examples.
lambda
Synonym for regularization rate.
This is an overloaded term. Here we’re focusing on the term’s definition within regularization.
layer
A set of neurons in a neural network that process a set of input features, or the output of those neurons.
Also, an abstraction in TensorFlow. Layers are Python functions that take Tensors and configuration options as input and produce other tensors as output. Once the necessary Tensors have been composed, the user can convert the result into an Estimator via a model function.
Layers API (tf.layers)
TensorFlow
A TensorFlow API for constructing a deep neural network as a composition of layers. The Layers API enables you to build different types of layers, such as:
- tf.layers.Dense for a fully connected layer.
- tf.layers.Conv2D for a convolutional layer.
When writing a custom Estimator, you compose Layers objects to define the characteristics of all the hidden layers.
The Layers API follows the Keras layers API conventions. That is, aside from a different prefix, all functions in the Layers API have the same names and signatures as their counterparts in the Keras layers API.
learning rate
A scalar used to train a model via gradient descent. During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step.
Learning rate is a key hyperparameter.
least squares regression
A linear regression model trained by minimizing L2 Loss.
linear model
A model that assigns one weight per feature to make predictions. (Linear models also incorporate a bias.) By contrast, the relationship of weights to features in deep models is not one-to-one.
A linear model uses the following formula:
y' = b + w1x1 + w2x2 + … + wnxn
where:
y' is the raw prediction. (In certain kinds of linear models, this raw prediction will be further modified. For example, see logistic regression.)
b is the bias.
w is a weight, so w1 is the weight of the first feature, w2 is the weight of the second feature, and so on.
x is a feature, so x1 is the value of the first feature, x2 is the value of the second feature, and so on.
Linear models tend to be easier to analyze and train than deep models. However, deep models can model complex relationships between features.
Linear regression and logistic regression are two types of linear models. Linear models include not only models that use the linear equation but also a broader set of models that use the linear equation as part of the formula. For example, logistic regression post-processes the raw prediction (y') to calculate the prediction.
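A minimal sketch of the raw linear-model prediction; the bias, weights, and feature values are made up:

def linear_prediction(bias, weights, features):
    # y' = b + w1*x1 + w2*x2 + ... + wn*xn
    return bias + sum(w * x for w, x in zip(weights, features))

b = 0.5
w = [2.0, -1.0, 0.25]
x = [3.0, 4.0, 8.0]
print(linear_prediction(b, w, x))  # 0.5 + 6.0 - 4.0 + 2.0 = 4.5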