Google Glossary Flashcards

1
Q

A/B testing

A

A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival. A/B testing aims to determine not only which technique performs better but also to understand whether the difference is statistically significant. A/B testing usually considers only two techniques using one measurement, but it can be applied to any finite number of techniques and measures.

2
Q

accuracy

A

The fraction of predictions that a classification model got right. In multi-class classification, accuracy is defined as follows:
Accuracy = (Correct Predictions) / (Total Number of Examples)
In binary classification, accuracy has the following definition:
Accuracy = (True Positives + True Negatives) / (Total Number of Examples)
See true positive and true negative.
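
As a worked illustration, here is a minimal Python sketch of the binary-classification formula above; the counts are hypothetical:

true_positives = 18
true_negatives = 452
total_examples = 477   # every prediction the model made, right or wrong

accuracy = (true_positives + true_negatives) / total_examples
print(accuracy)  # 470 / 477 = 0.985...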

3
Q

action

A

In reinforcement learning, the mechanism by which the agent transitions between states of the environment. The agent chooses the action by using a policy.

4
Q

activation function

A

A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer.

5
Q

active learning

A

A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.

6
Q

AdaGrad

A

A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate. For a full explanation, see Duchi et al., “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization” (2011).
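
A minimal sketch of the core idea in Python, assuming NumPy arrays for the parameters and gradients (the intuition, not the exact algorithm from the paper):

import numpy as np

def adagrad_update(params, grads, accum, learning_rate=0.01, eps=1e-8):
    # Accumulate squared gradients, then shrink each parameter's step
    # by the square root of its own accumulated history.
    accum += grads ** 2
    params -= learning_rate * grads / (np.sqrt(accum) + eps)
    return params, accum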

7
Q

agent

A

In reinforcement learning, the entity that uses a policy to maximize expected return gained from transitioning between states of the environment.

8
Q

agglomerative clustering

A

See hierarchical clustering.
A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:

Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.
Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.

Contrast with centroid-based clustering.

9
Q

augmented reality

A

A technology that superimposes a computer-generated image on a user’s view of the real world, thus providing a composite view.

10
Q

PR AUC (area under the PR curve)

A

Area under the interpolated precision-recall curve, obtained by plotting (recall, precision) points for different values of the classification threshold. Depending on how it’s calculated, PR AUC may be equivalent to the average precision of the model.

11
Q

AUC (Area under the ROC Curve)

A

An evaluation metric that considers all possible classification thresholds.

The Area Under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
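
That probabilistic reading suggests a direct, if inefficient, way to estimate AUC: compare the score of every positive example against the score of every negative example. A hedged sketch in Python, with hypothetical scores:

def auc_by_pairs(pos_scores, neg_scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores
               for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc_by_pairs([0.9, 0.7], [0.8, 0.3]))  # 3 of 4 pairs correct = 0.75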

12
Q

artificial general intelligence

A

A non-human mechanism that demonstrates a broad range of problem solving, creativity, and adaptability. For example, a program demonstrating artificial general intelligence could translate text, compose symphonies, and excel at games that have not yet been invented.

13
Q

artificial intelligence

A

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.

Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

14
Q

attribute

A

Synonym for feature. In fairness, attributes often refer to characteristics pertaining to individuals.

15
Q

automation bias

A

When a human decision maker favors recommendations made by an automated decision-making system over information produced without automation, even when the automated decision-making system makes errors.

16
Q

average precision

A

A metric for summarizing the performance of a ranked sequence of results. Average precision is calculated by taking the average of the precision values for each relevant result (each result in the ranked list where the recall increases relative to the previous result).

See also Area under the PR Curve.
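
A minimal Python sketch of that calculation, assuming a ranked list encoded as hypothetical 0/1 relevance flags:

def average_precision(relevances):
    # Average precision@k over the positions k where a relevant result
    # appears (that is, where recall increases).
    precisions, hits = [], 0
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 = 0.833...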

17
Q

backpropagation

A

The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.
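
A minimal sketch of the two passes for a one-node network with squared loss (all values hypothetical; real frameworks automate this bookkeeping):

import numpy as np

# Forward pass: compute and cache each node's output.
x, y = np.array([1.0, 2.0]), 1.0       # one example and its label
w, b = np.array([0.1, -0.2]), 0.05     # parameters
z = w @ x + b                          # cached weighted sum
y_hat = 1.0 / (1.0 + np.exp(-z))       # cached sigmoid output
loss = (y_hat - y) ** 2

# Backward pass: chain rule from the loss back to each parameter.
d_yhat = 2.0 * (y_hat - y)             # d loss / d y_hat
d_z = d_yhat * y_hat * (1.0 - y_hat)   # through the sigmoid
grad_w, grad_b = d_z * x, d_z          # d loss / d w and d loss / d b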

18
Q

bag of words

A

A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents the following three phrases identically:

the dog jumps
jumps the dog
dog jumps the

Each word is mapped to an index in a sparse vector, where the vector has an index for every word in the vocabulary. For example, the phrase the dog jumps is mapped into a feature vector with non-zero values at the three indices corresponding to the words the, dog, and jumps. The non-zero value can be any of the following:

A 1 to indicate the presence of a word.
A count of the number of times a word appears in the bag. For example, if the phrase were the maroon dog is a dog with maroon fur, then both maroon and dog would be represented as 2, while the other words would be represented as 1.
Some other value, such as the logarithm of the count of the number of times a word appears in the bag.
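
For instance, a minimal count-based encoding in Python (the vocabulary and phrase are hypothetical):

vocabulary = ["the", "maroon", "dog", "is", "a", "with", "fur"]
phrase = "the maroon dog is a dog with maroon fur"

# One index per vocabulary word; the value is the word's count in the bag.
counts = [phrase.split().count(word) for word in vocabulary]
print(counts)  # [1, 2, 2, 1, 1, 1, 1]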
19
Q

baseline

A

A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model.

For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.

20
Q

batch

A

The set of examples used in one iteration (that is, one gradient update) of model training.

See also batch size.

21
Q

batch normalization

A

Normalizing the input or output of the activation functions in a hidden layer. Batch normalization can provide the following benefits:

Make neural networks more stable by protecting against outlier weights.
Enable higher learning rates.
Reduce overfitting.
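
A hedged sketch of the core normalization step, omitting the learned scale and shift parameters that full batch normalization also applies:

import numpy as np

def batch_norm(activations, eps=1e-5):
    # Normalize each feature to zero mean and unit variance
    # across the examples in one batch.
    mean = activations.mean(axis=0)
    var = activations.var(axis=0)
    return (activations - mean) / np.sqrt(var + eps)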
22
Q

batch size

A

The number of examples in a batch. For example, the batch size of SGD is 1, while the batch size of a mini-batch is usually between 10 and 1000. Batch size is usually fixed during training and inference; however, TensorFlow does permit dynamic batch sizes.

23
Q

Bayesian neural network

A

A probabilistic neural network that accounts for uncertainty in weights and outputs. A standard neural network regression model typically predicts a scalar value; for example, a model predicts a house price of 853,000. By contrast, a Bayesian neural network predicts a distribution of values; for example, a model predicts a house price of 853,000 with a standard deviation of 67,200.

A Bayesian neural network relies on Bayes’ Theorem to calculate uncertainties in weights and predictions. A Bayesian neural network can be useful when it is important to quantify uncertainty, such as in models related to pharmaceuticals. Bayesian neural networks can also help prevent overfitting.

24
Q
Bellman equation
#rl
A

In reinforcement learning, the following identity satisfied by the optimal Q-function:

Q(s, a) = E[r + γ · max_a’ Q(s’, a’)]

Reinforcement learning algorithms apply this identity to create Q-learning via the following update rule:

Q(s, a) ← Q(s, a) + α · (r + γ · max_a’ Q(s’, a’) − Q(s, a))

Beyond reinforcement learning, the Bellman equation has applications to dynamic programming. See the Wikipedia entry for Bellman Equation.
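
A minimal sketch of that update rule in Python; the Q-table is assumed to be a hypothetical dict of dicts mapping state -> action -> value:

def q_learning_update(q, s, a, reward, s_next, alpha=0.1, gamma=0.99):
    # Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a').
    target = reward + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])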

25
Q
bias (ethics/fairness)
#fairness
A
  1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:
    automation bias
    confirmation bias
    experimenter’s bias
    group attribution bias
    implicit bias
    in-group bias
    out-group homogeneity bias
  2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:
    coverage bias
    non-response bias
    participation bias
    reporting bias
    sampling bias
    selection bias

Not to be confused with the bias term in machine learning models or prediction bias.

26
Q

bias (math)

A

An intercept or offset from an origin. Bias (also known as the bias term) is referred to as b or w0 in machine learning models. For example, bias is the b in the following formula:
y’ = b + w1x1 + w2x2 + … + wnxn
Not to be confused with bias in ethics and fairness or prediction bias.

27
Q

bigram

A

An N-gram in which N=2.

28
Q

binary classification

A

A type of classification task that outputs one of two mutually exclusive classes. For example, a machine learning model that evaluates email messages and outputs either “spam” or “not spam” is a binary classifier.

29
Q

binning

A

See bucketing.

Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically based on value range. For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, all temperatures between 0.0 and 15.0 degrees could be put into one bin, 15.1 to 30.0 degrees could be a second bin, and 30.1 to 50.0 degrees could be a third bin.

30
Q

boosting

A

A machine learning technique that iteratively combines a set of simple and not very accurate classifiers (referred to as “weak” classifiers) into a classifier with high accuracy (a “strong” classifier) by upweighting the examples that the model is currently misclassifying.

31
Q

broadcasting

A

Expanding the shape of an operand in a matrix math operation to dimensions compatible for that operation. For instance, linear algebra requires that the two operands in a matrix addition operation must have the same dimensions. Consequently, you can’t add a matrix of shape (m, n) to a vector of length n. Broadcasting enables this operation by virtually expanding the vector of length n to a matrix of shape (m,n) by replicating the same values down each column.

For example, given the following definitions, linear algebra prohibits A+B because A and B have different dimensions:

A = [[ 7, 10,  4],
     [13,  5,  9]]
B = [2]

However, broadcasting enables the operation A+B by virtually expanding B to:

[[2, 2, 2],
 [2, 2, 2]]

Thus, A+B is now a valid operation:

[[ 7, 10,  4]     [[2, 2, 2]     [[ 9, 12,  6]
 [13,  5,  9]]  +  [2, 2, 2]]  =  [15,  7, 11]]

See broadcasting in NumPy for more details.
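
The same example in NumPy, which performs the virtual expansion automatically:

import numpy as np

A = np.array([[7, 10, 4],
              [13, 5, 9]])
B = np.array([2])   # shape (1,), broadcast to shape (2, 3)

print(A + B)
# [[ 9 12  6]
#  [15  7 11]]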

32
Q

bucketing

A

Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically based on value range. For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, all temperatures between 0.0 and 15.0 degrees could be put into one bin, 15.1 to 30.0 degrees could be a second bin, and 30.1 to 50.0 degrees could be a third bin.
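
As a hedged sketch, the three temperature bins from the example above in NumPy (np.digitize returns the bin index for each value):

import numpy as np

temperatures = np.array([3.1, 14.9, 22.0, 41.5])
bin_edges = [15.1, 30.1]   # bins: below 15.1, 15.1-30.0, 30.1 and up
print(np.digitize(temperatures, bin_edges))  # [0 0 1 2]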

33
Q

calibration layer

A

A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and probabilities should match the distribution of an observed set of labels.

34
Q

candidate generation

A

The initial set of recommendations chosen by a recommendation system. For example, consider a bookstore that offers 100,000 titles. The candidate generation phase creates a much smaller list of suitable books for a particular user, say 500. But even 500 books is way too many to recommend to a user. Subsequent, more expensive, phases of a recommendation system (such as scoring and re-ranking) whittle down those 500 to a much smaller, more useful set of recommendations.

35
Q

candidate sampling

A

A training-time optimization in which a probability is calculated for all the positive labels, using, for example, softmax, but only for a random sample of negative labels. For example, if we have an example labeled beagle and dog, candidate sampling computes the predicted probabilities and corresponding loss terms for the beagle and dog class outputs in addition to a random subset of the remaining classes (cat, lollipop, fence). The idea is that the negative classes can learn from less frequent negative reinforcement as long as positive classes always get proper positive reinforcement, and this is indeed observed empirically. The motivation for candidate sampling is a computational efficiency win from not computing predictions for all negatives.

36
Q

categorical data

A

Features having a discrete set of possible values. For example, consider a categorical feature named house style, which has a discrete set of three possible values: Tudor, ranch, colonial. By representing house style as categorical data, the model can learn the separate impacts of Tudor, ranch, and colonial on house price.

Sometimes, values in the discrete set are mutually exclusive, and only one value can be applied to a given example. For example, a car maker categorical feature would probably permit only a single value (Toyota) per example. Other times, more than one value may be applicable. A single car could be painted more than one different color, so a car color categorical feature would likely permit a single example to have multiple values (for example, red and white).

Categorical features are sometimes called discrete features.

Contrast with numerical data.

37
Q

centroid

A

The center of a cluster as determined by a k-means or k-median algorithm. For instance, if k is 3, then the k-means or k-median algorithm finds 3 centroids.

38
Q

centroid-based clustering

A

A category of clustering algorithms that organizes data into nonhierarchical clusters. k-means is the most widely used centroid-based clustering algorithm.

Contrast with hierarchical clustering algorithms.

39
Q

checkpoint

A

Data that captures the state of the variables of a model at a particular time. Checkpoints enable exporting model weights, as well as performing training across multiple sessions. Checkpoints also enable training to continue past errors (for example, job preemption). Note that the graph itself is not included in a checkpoint.

40
Q

class

A

One of a set of enumerated target values for a label. For example, in a binary classification model that detects spam, the two classes are spam and not spam. In a multi-class classification model that identifies dog breeds, the classes would be poodle, beagle, pug, and so on.

41
Q

classification model

A

A type of machine learning model for distinguishing among two or more discrete classes. For example, a natural language processing classification model could determine whether an input sentence was in French, Spanish, or Italian. Compare with regression model.

42
Q

classification threshold

A

A scalar-value criterion that is applied to a model’s predicted score in order to separate the positive class from the negative class. Used when mapping logistic regression results to binary classification. For example, consider a logistic regression model that determines the probability of a given email message being spam. If the classification threshold is 0.9, then logistic regression values above 0.9 are classified as spam and those below 0.9 are classified as not spam.

43
Q

class-imbalanced dataset

A

A binary classification problem in which the labels for the two classes have significantly different frequencies. For example, a disease dataset in which 0.0001 of examples have positive labels and 0.9999 have negative labels is a class-imbalanced problem, but a football game predictor in which 0.51 of examples label one team winning and 0.49 label the other team winning is not a class-imbalanced problem.

44
Q

clipping

A

A technique for handling outliers. Specifically, reducing feature values that are greater than a set maximum value down to that maximum value. Also, increasing feature values that are less than a specific minimum value up to that minimum value.

For example, suppose that only a few feature values fall outside the range 40–60. In this case, you could do the following:

Clip all values over 60 to be exactly 60.
Clip all values under 40 to be exactly 40.

In addition to bringing input values within a designated range, clipping can also be used to force gradient values within a designated range during training.
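
For example, using NumPy (the values are hypothetical):

import numpy as np

values = np.array([12, 45, 58, 73])
print(np.clip(values, 40, 60))  # [40 45 58 60]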

45
Q

Cloud TPU

A
A specialized hardware accelerator designed to speed up machine learning workloads on Google Cloud Platform.
#TensorFlow
#GoogleCloud
46
Q

clustering

A

Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.

Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid, as in the following diagram:

A human researcher could then review the clusters and, for example, label cluster 1 as “dwarf trees” and cluster 2 as “full-size trees.”

As another example, consider a clustering algorithm based on an example’s distance from a center point, illustrated as follows:

47
Q

co-adaptation

A

When neurons predict patterns in training data by relying almost exclusively on outputs of specific other neurons instead of relying on the network’s behavior as a whole. When the patterns that cause co-adaptation are not present in validation data, then co-adaptation causes overfitting. Dropout regularization reduces co-adaptation because dropout ensures neurons cannot rely solely on specific other neurons.

48
Q

collaborative filtering

A

Making predictions about the interests of one user based on the interests of many other users. Collaborative filtering is often used in recommendation systems.

49
Q

confirmation bias

A

#fairness

The tendency to search for, interpret, favor, and recall information in a way that confirms one’s preexisting beliefs or hypotheses. Machine learning developers may inadvertently collect or label data in ways that influence an outcome supporting their existing beliefs. Confirmation bias is a form of implicit bias.

Experimenter’s bias is a form of confirmation bias in which an experimenter continues training models until a preexisting hypothesis is confirmed.

50
Q

confusion matrix

A

An NxN table that summarizes how successful a classification model’s predictions were; that is, the correlation between the label and the model’s classification. One axis of a confusion matrix is the label that the model predicted, and the other axis is the actual label. N represents the number of classes. In a binary classification problem, N=2. For example, here is a sample confusion matrix for a binary classification problem:
                     Tumor (predicted)   Non-Tumor (predicted)
Tumor (actual)               18                      1
Non-Tumor (actual)            6                    452

The preceding confusion matrix shows that of the 19 samples that actually had tumors, the model correctly classified 18 as having tumors (18 true positives), and incorrectly classified 1 as not having a tumor (1 false negative). Similarly, of 458 samples that actually did not have tumors, 452 were correctly classified (452 true negatives) and 6 were incorrectly classified (6 false positives).

The confusion matrix for a multi-class classification problem can help you determine mistake patterns. For example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or 1 instead of 7.

Confusion matrices contain sufficient information to calculate a variety of performance metrics, including precision and recall.
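
For example, precision and recall for the tumor class follow directly from the preceding matrix:

tp, fn = 18, 1    # actual tumors: predicted tumor / predicted non-tumor
fp, tn = 6, 452   # actual non-tumors: predicted tumor / predicted non-tumor

precision = tp / (tp + fp)   # 18 / 24 = 0.75
recall = tp / (tp + fn)      # 18 / 19 = 0.947...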

51
Q

continuous feature

A

A floating-point feature with an infinite range of possible values. Contrast with discrete feature.

52
Q

convenience sampling

A

Using a dataset not gathered scientifically in order to run quick experiments. Later on, it’s essential to switch to a scientifically gathered dataset.

53
Q

convergence

A

Informally, often refers to a state reached during training in which training loss and validation loss change very little or not at all with each iteration after a certain number of iterations. In other words, a model reaches convergence when additional training on the current data will not improve the model. In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending, temporarily producing a false sense of convergence.

See also early stopping.

See also Boyd and Vandenberghe, Convex Optimization.

54
Q

convex function

A

A function in which the region above the graph of the function is a convex set. The prototypical convex function is shaped something like the letter U. For example, the following are all convex functions:

A typical convex function is shaped like the letter ‘U’.

By contrast, the following function is not convex. Notice how the region above the graph is not a convex set:

A strictly convex function has exactly one local minimum point, which is also the global minimum point. The classic U-shaped functions are strictly convex functions. However, some convex functions (for example, straight lines) are not U-shaped.

A lot of the common loss functions, including the following, are convex functions:

L2 loss
Log Loss
L1 regularization
L2 regularization

Many variations of gradient descent are guaranteed to find a point close to the minimum of a strictly convex function. Similarly, many variations of stochastic gradient descent have a high probability (though, not a guarantee) of finding a point close to the minimum of a strictly convex function.

The sum of two convex functions (for example, L2 loss + L1 regularization) is a convex function.

Deep models are never convex functions. Remarkably, algorithms designed for convex optimization tend to find reasonably good solutions on deep networks anyway, even though those solutions are not guaranteed to be a global minimum.

55
Q

convex optimization

A

The process of using mathematical techniques such as gradient descent to find the minimum of a convex function. A great deal of research in machine learning has focused on formulating various problems as convex optimization problems and in solving those problems more efficiently.

For complete details, see Boyd and Vandenberghe, Convex Optimization.

56
Q

convex set

A

A subset of Euclidean space such that a line drawn between any two points in the subset remains completely within the subset. For instance, the following two shapes are convex sets:

A rectangle and a semi-ellipse are both convex sets.

By contrast, the following two shapes are not convex sets:

A pie-chart with a missing slice and a firework are both nonconvex sets.

57
Q

convolution

A

In mathematics, casually speaking, a mixture of two functions. In machine learning, a convolution mixes the convolutional filter and the input matrix in order to train weights.

The term “convolution” in machine learning is often a shorthand way of referring to either convolutional operation or convolutional layer.

Without convolutions, a machine learning algorithm would have to learn a separate weight for every cell in a large tensor. For example, a machine learning algorithm training on 2K x 2K images would be forced to find 4M separate weights. Thanks to convolutions, a machine learning algorithm only has to find weights for every cell in the convolutional filter, dramatically reducing the memory needed to train the model. When the convolutional filter is applied, it is simply replicated across cells such that each is multiplied by the filter.

58
Q

convolutional filter

A

One of the two actors in a convolutional operation. (The other actor is a slice of an input matrix.) A convolutional filter is a matrix having the same rank as the input matrix, but a smaller shape. For example, given a 28x28 input matrix, the filter could be any 2D matrix smaller than 28x28.

In photographic manipulation, all the cells in a convolutional filter are typically set to a constant pattern of ones and zeroes. In machine learning, convolutional filters are typically seeded with random numbers and then the network trains the ideal values.

59
Q

convolutional layer

A

A layer of a deep neural network in which a convolutional filter passes along an input matrix. For example, consider the following 3x3 convolutional filter:

The following animation shows a convolutional layer consisting of 9 convolutional operations involving the 5x5 input matrix. Notice that each convolutional operation works on a different 3x3 slice of the input matrix. The resulting 3x3 matrix (on the right) consists of the results of the 9 convolutional operations:

60
Q

convolutional neural network

A

A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers:

convolutional layers
pooling layers
dense layers

Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.

61
Q

convolutional operation

A

The following two-step mathematical operation:

Element-wise multiplication of the convolutional filter and a slice of an input matrix. (The slice of the input matrix has the same rank and size as the convolutional filter.)
Summation of all the values in the resulting product matrix.

For example, consider the following 5x5 input matrix:

Now imagine the following 2x2 convolutional filter:

Each convolutional operation involves a single 2x2 slice of the input matrix. For instance, suppose we use the 2x2 slice at the top-left of the input matrix. So, the convolution operation on this slice looks as follows:

A convolutional layer consists of a series of convolutional operations, each acting on a different slice of the input matrix.
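
A hedged sketch of the two steps in NumPy; the 2x2 slice and filter below are hypothetical stand-ins for the missing figures:

import numpy as np

input_slice = np.array([[128, 97],
                        [35, 22]])   # a 2x2 slice of the input matrix
conv_filter = np.array([[1, 0],
                        [0, 1]])     # a 2x2 convolutional filter

# Step 1: element-wise multiplication. Step 2: sum the products.
print(np.sum(input_slice * conv_filter))  # 128 + 0 + 0 + 22 = 150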

62
Q

cost

A

Synonym for loss.

63
Q

counterfactual fairness

A
#fairness
A fairness metric that checks whether a classifier produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes. Evaluating a classifier for counterfactual fairness is one method for surfacing potential sources of bias in a model.

See “When Worlds Collide: Integrating Different Counterfactual Assumptions in Fairness” for a more detailed discussion of counterfactual fairness.

64
Q

coverage bias

A

See selection bias.

65
Q

crash blossom

A

A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in natural language understanding. For example, the headline Red Tape Holds Up Skyscraper is a crash blossom because an NLU model could interpret the headline literally or figuratively.

66
Q

critic

A

#rl

Synonym for Deep Q-Network.

67
Q

cross-entropy

A

A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.
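
A minimal sketch for one example, given a hypothetical one-hot label and predicted distribution:

import numpy as np

true_dist = np.array([0.0, 1.0, 0.0])   # one-hot label: class 1
pred_dist = np.array([0.2, 0.7, 0.1])   # model's predicted probabilities

cross_entropy = -np.sum(true_dist * np.log(pred_dist))
print(cross_entropy)  # -log(0.7) = 0.356...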

68
Q

cross-validation

A

A mechanism for estimating how well a model will generalize to new data by testing the model against one or more non-overlapping data subsets withheld from the training set.

69
Q

custom Estimator

A

#TensorFlow

An Estimator that you write yourself, typically by defining your own model function.

Contrast with premade Estimators.

70
Q

data analysis

A

Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be particularly useful when a dataset is first received, before one builds the first model. It is also crucial in understanding experiments and debugging problems with the system.

71
Q

data augmentation

A

Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your dataset doesn’t contain enough image examples for the model to learn useful associations. Ideally, you’d add enough labeled images to your dataset to enable your model to train properly. If that’s not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.

72
Q

DataFrame

A

A popular datatype for representing datasets in pandas. A DataFrame is analogous to a table. Each column of the DataFrame has a name (a header), and each row is identified by a number.

73
Q

data set or dataset

A

A collection of examples.

74
Q

Dataset API (tf.data)

A

#TensorFlow

A high-level TensorFlow API for reading data and transforming it into a form that a machine learning algorithm requires. A tf.data.Dataset object represents a sequence of elements, in which each element contains one or more Tensors. A tf.data.Iterator object provides access to the elements of a Dataset.

For details about the Dataset API, see Importing Data in the TensorFlow Programmer’s Guide.

75
Q

decision boundary

A

The separator between classes learned by a model in binary-class or multi-class classification problems. For example, in the following image representing a binary classification problem, the decision boundary is the frontier between the orange class and the blue class:

A well-defined boundary between one class and another.

76
Q

decision threshold

A

Synonym for classification threshold.

77
Q

decision tree

A

A model represented as a sequence of branching statements. For example, the following over-simplified decision tree branches a few times to predict the price of a house (in thousands of USD). According to this decision tree, a house larger than 160 square meters, having more than three bedrooms, and built less than 10 years ago would have a predicted price of 510 thousand USD.

A tree three-levels deep whose branches predict house prices.

Machine learning can generate deep decision trees.

78
Q

deep model

A

A type of neural network containing multiple hidden layers.

Contrast with wide model.

79
Q

deep neural network

A

Synonym for deep model.

80
Q

Deep Q-Network (DQN)

A

#rl

In Q-learning, a deep neural network that predicts Q-functions.

Critic is a synonym for Deep Q-Network.

81
Q

demographic parity

A

#fairness

A fairness metric that is satisfied if the results of a model’s classification are not dependent on a given sensitive attribute.

For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other.

Contrast with equalized odds and equality of opportunity, which permit classification results in aggregate to depend on sensitive attributes, but do not permit classification results for certain specified ground-truth labels to depend on sensitive attributes. See “Attacking discrimination with smarter machine learning” for a visualization exploring the tradeoffs when optimizing for demographic parity.

82
Q

dense feature

A

A feature in which most values are non-zero, typically a Tensor of floating-point values. Contrast with sparse feature.

83
Q

dense layer

A

Synonym for fully connected layer.

A hidden layer in which each node is connected to every node in the subsequent hidden layer.

A fully connected layer is also known as a dense layer.

84
Q

depth

A

The number of layers (including any embedding layers) in a neural network that learn weights. For example, a neural network with 5 hidden layers and 1 output layer has a depth of 6.

85
Q

depthwise separable convolutional neural network (sepCNN)

A

A convolutional neural network architecture based on Inception, but where Inception modules are replaced with depthwise separable convolutions. Also known as Xception.

A depthwise separable convolution (also abbreviated as separable convolution) factors a standard 3-D convolution into two separate convolution operations that are more computationally efficient: first, a depthwise convolution, with a depth of 1 (n ✕ n ✕ 1), and then second, a pointwise convolution, with length and width of 1 (1 ✕ 1 ✕ n).

To learn more, see Xception: Deep Learning with Depthwise Separable Convolutions.

86
Q

device

A

#TensorFlow

A category of hardware that can run a TensorFlow session, including CPUs, GPUs, and TPUs.

87
Q

dimension reduction

A

Decreasing the number of dimensions used to represent a particular feature in a feature vector, typically by converting to an embedding.

88
Q

dimensions

A

Overloaded term having any of the following definitions:

The number of levels of coordinates in a Tensor. For example:
    A scalar has zero dimensions; for example, ["Hello"].
    A vector has one dimension; for example, [3, 5, 7, 11].
    A matrix has two dimensions; for example, [[2, 4, 18], [5, 7, 14]].

You can uniquely specify a particular cell in a one-dimensional vector with one coordinate; you need two coordinates to uniquely specify a particular cell in a two-dimensional matrix.

The number of entries in a feature vector.

The number of elements in an embedding layer.
89
Q

discrete feature

A

A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature. Contrast with continuous feature.

90
Q

discriminative model

A

A model that predicts labels from a set of one or more features. More formally, discriminative models define the conditional probability of an output given the features and weights; that is:

p(output | features, weights)

For example, a model that predicts whether an email is spam from features and weights is a discriminative model.

The vast majority of supervised learning models, including classification and regression models, are discriminative models.

Contrast with generative model.

91
Q

discriminator

A

A system that determines whether examples are real or fake.

The subsystem within a generative adversarial network that determines whether the examples created by the generator are real or fake.

92
Q

disparate impact

A

#fairness

Making decisions about people that impact different population subgroups disproportionately. This usually refers to situations where an algorithmic decision-making process harms or benefits some subgroups more than others.

For example, suppose an algorithm that determines a Lilliputian’s eligibility for a miniature-home loan is more likely to classify them as “ineligible” if their mailing address contains a certain postal code. If Big-Endian Lilliputians are more likely to have mailing addresses with this postal code than Little-Endian Lilliputians, then this algorithm may result in disparate impact.

Contrast with disparate treatment, which focuses on disparities that result when subgroup characteristics are explicit inputs to an algorithmic decision-making process.

93
Q

disparate treatment

A

#fairness

Factoring subjects’ sensitive attributes into an algorithmic decision-making process such that different subgroups of people are treated differently.

For example, consider an algorithm that determines Lilliputians’ eligibility for a miniature-home loan based on the data they provide in their loan application. If the algorithm uses a Lilliputian’s affiliation as Big-Endian or Little-Endian as an input, it is enacting disparate treatment along that dimension.

Contrast with disparate impact, which focuses on disparities in the societal impacts of algorithmic decisions on subgroups, irrespective of whether those subgroups are inputs to the model.

94
Q

divisive clustering

A

See hierarchical clustering.

A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:

Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.
Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.

Contrast with centroid-based clustering.

95
Q

hinge loss

A

A family of loss functions for classification designed to find the decision boundary as distant as possible from each training example, thus maximizing the margin between examples and the boundary. KSVMs use hinge loss (or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as follows:

loss = max(0, 1 − (y * y’))

where y is the true label, either -1 or +1, and y’ is the raw output of the classifier model. Consequently, hinge loss is zero whenever y * y’ ≥ 1, and it grows linearly as y * y’ decreases below 1.
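
In code, the binary hinge loss is a one-liner (a sketch, not a full KSVM):

def hinge_loss(y, y_prime):
    # y is the true label (-1 or +1); y_prime is the raw model output.
    return max(0.0, 1.0 - y * y_prime)

print(hinge_loss(+1, 0.4))  # 0.6: inside the margin, so some loss
print(hinge_loss(+1, 2.0))  # 0.0: confidently correct, no loss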

96
Q

holdout data

A

Examples intentionally not used (“held out”) during training. The validation dataset and test dataset are examples of holdout data. Holdout data helps evaluate your model’s ability to generalize to data other than the data it was trained on. The loss on the holdout set provides a better estimate of the loss on an unseen dataset than does the loss on the training set.

97
Q

hyperparameter

A

The “knobs” that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.

Contrast with parameter.

98
Q

hyperplane

A

A boundary that separates a space into two subspaces. For example, a line is a hyperplane in two dimensions and a plane is a hyperplane in three dimensions. More typically in machine learning, a hyperplane is the boundary separating a high-dimensional space. Kernel Support Vector Machines use hyperplanes to separate positive classes from negative classes, often in a very high-dimensional space.

99
Q

independently and identically distributed (i.i.d)

A

Data drawn from a distribution that doesn’t change, and where each value drawn doesn’t depend on values that have been drawn previously. An i.i.d. is the ideal gas of machine learning—a useful mathematical construct but almost never exactly found in the real world. For example, the distribution of visitors to a web page may be i.i.d. over a brief window of time; that is, the distribution doesn’t change during that brief window and one person’s visit is generally independent of another’s visit. However, if you expand that window of time, seasonal differences in the web page’s visitors may appear.


100
Q

individual fairness

A

#fairness

A fairness metric that checks whether similar individuals are classified similarly. For example, Brobdingnagian Academy might want to satisfy individual fairness by ensuring that two students with identical grades and standardized test scores are equally likely to gain admission.

Note that individual fairness relies entirely on how you define “similarity” (in this case, grades and test scores), and you can run the risk of introducing new fairness problems if your similarity metric misses important information (such as the rigor of a student’s curriculum).

See “Fairness Through Awareness” for a more detailed discussion of individual fairness.

101
Q

inference

A

In machine learning, often refers to the process of making predictions by applying the trained model to unlabeled examples. In statistics, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. (See the Wikipedia article on statistical inference.)

102
Q

in-group bias

A

#fairness

Showing partiality to one’s own group or own characteristics. If testers or raters consist of the machine learning developer’s friends, family, or colleagues, then in-group bias may invalidate product testing or the dataset.

In-group bias is a form of group attribution bias. See also out-group homogeneity bias.

103
Q

input function

A

#TensorFlow

In TensorFlow, a function that returns input data to the training, evaluation, or prediction method of an Estimator. For example, the training input function returns a batch of features and labels from the training set.

104
Q

input layer

A

The first layer (the one that receives the input data) in a neural network.

105
Q

instance

A

Synonym for example. One row of a dataset. An example contains one or more features and possibly a label. See also labeled example and unlabeled example.

106
Q

interpretability

A

The degree to which a model’s predictions can be readily explained. Deep models are often non-interpretable; that is, a deep model’s different layers can be hard to decipher. By contrast, linear regression models and wide models are typically far more interpretable.

107
Q

inter-rater agreement

A

A measurement of how often human raters agree when doing a task. If raters disagree, the task instructions may need to be improved. Also sometimes called inter-annotator agreement or inter-rater reliability. See also Cohen’s kappa, which is one of the most popular inter-rater agreement measurements.

108
Q

item matrix

A

In recommendation systems, a matrix of embeddings generated by matrix factorization that holds latent signals about each item. Each row of the item matrix holds the value of a single latent feature for all items. For example, consider a movie recommendation system. Each column in the item matrix represents a single movie. The latent signals might represent genres, or might be harder-to-interpret signals that involve complex interactions among genre, stars, movie age, or other factors.

The item matrix has the same number of columns as the target matrix that is being factorized. For example, given a movie recommendation system that evaluates 10,000 movie titles, the item matrix will have 10,000 columns.

109
Q

items

A

In a recommendation system, the entities that a system recommends. For example, videos are the items that a video store recommends, while books are the items that a bookstore recommends.

110
Q

iteration

A

A single update of a model’s weights during training. An iteration consists of computing the gradients of the parameters with respect to the loss on a single batch of data.

111
Q

Keras

A

A popular Python machine learning API. Keras runs on several deep learning frameworks, including TensorFlow, where it is made available as tf.keras.

112
Q

Kernel Support Vector Machines (KSVMs)

A

A classification algorithm that seeks to maximize the margin between positive and negative classes by mapping input data vectors to a higher dimensional space. For example, consider a classification problem in which the input dataset has a hundred features. To maximize the margin between positive and negative classes, a KSVM could internally map those features into a million-dimension space. KSVMs use a loss function called hinge loss.

113
Q

k-means

A

A popular clustering algorithm that groups examples in unsupervised learning. The k-means algorithm basically does the following:

Iteratively determines the best k center points (known as centroids).
Assigns each example to the closest centroid. Those examples nearest the same centroid belong to the same group.

The k-means algorithm picks centroid locations to minimize the cumulative square of the distances from each example to its closest centroid.

For example, consider the following plot of dog height to dog width:

If k=3, the k-means algorithm will determine three centroids. Each example is assigned to its closest centroid, yielding three groups:

Imagine that a manufacturer wants to determine the ideal sizes for small, medium, and large sweaters for dogs. The three centroids identify the mean height and mean width of each dog in that cluster. So, the manufacturer should probably base sweater sizes on those three centroids. Note that the centroid of a cluster is typically not an example in the cluster.

The preceding illustration shows k-means for examples with only two features (height and width). Note that k-means can group examples across many features.
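
A hedged sketch of the two alternating steps on NumPy arrays; it ignores edge cases such as empty clusters, and production code would normally use a library implementation:

import numpy as np

def kmeans(points, k, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(steps):
        # Assignment step: each example joins its closest centroid.
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centroids, axis=2), axis=1)
        # Update step: each centroid moves to the mean of its examples.
        centroids = np.array(
            [points[labels == i].mean(axis=0) for i in range(k)])
    return centroids, labels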

114
Q

k-median

A

A clustering algorithm closely related to k-means. The practical difference between the two is as follows:

In k-means, centroids are determined by minimizing the sum of the squares of the distance between a centroid candidate and each of its examples.
In k-median, centroids are determined by minimizing the sum of the distance between a centroid candidate and each of its examples.

Note that the definitions of distance are also different:

k-means relies on the Euclidean distance from the centroid to an example. (In two dimensions, the Euclidean distance means using the Pythagorean theorem to calculate the hypotenuse.) For example, the k-means distance between (2,2) and (5,-2) would be: Euclidean distance = sqrt((2-5)^2 + (2-(-2))^2) = 5
k-median relies on the Manhattan distance from the centroid to an example. This distance is the sum of the absolute deltas in each dimension. For example, the k-median distance between (2,2) and (5,-2) would be: Manhattan distance = |2-5| + |2-(-2)| = 7
115
Q

L1 loss

A

Loss function based on the absolute value of the difference between the values that a model is predicting and the actual values of the labels. L1 loss is less sensitive to outliers than L2 loss.

116
Q

L1 regularization

A

A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model. Contrast with L2 regularization.

117
Q

L2 loss

A

See squared loss.

The loss function used in linear regression. (Also known as L2 Loss.) This function calculates the squares of the difference between a model’s predicted value for a labeled example and the actual value of the label. Due to squaring, this loss function amplifies the influence of bad predictions. That is, squared loss reacts more strongly to outliers than L1 loss.

118
Q

L2 regularization

A

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. (Contrast with L1 regularization.) L2 regularization always improves generalization in linear models.

119
Q

label

A

In supervised learning, the “answer” or “result” portion of an example. Each example in a labeled dataset consists of one or more features and a label. For instance, in a housing dataset, the features might include the number of bedrooms, the number of bathrooms, and the age of the house, while the label might be the house’s price. In a spam detection dataset, the features might include the subject line, the sender, and the email message itself, while the label would probably be either “spam” or “not spam.”

120
Q

labeled example

A

An example that contains features and a label. In supervised training, models learn from labeled examples.

121
Q

lambda

A

Synonym for regularization rate.

This is an overloaded term. Here we’re focusing on the term’s definition within regularization.

122
Q

layer

A

A set of neurons in a neural network that process a set of input features, or the output of those neurons.

Also, an abstraction in TensorFlow. Layers are Python functions that take Tensors and configuration options as input and produce other tensors as output. Once the necessary Tensors have been composed, the user can convert the result into an Estimator via a model function.

123
Q

Layers API (tf.layers)

A

#TensorFlow

A TensorFlow API for constructing a deep neural network as a composition of layers. The Layers API enables you to build different types of layers, such as:

tf.layers.Dense for a fully connected layer.
tf.layers.Conv2D for a convolutional layer.

When writing a custom Estimator, you compose Layers objects to define the characteristics of all the hidden layers.

The Layers API follows the Keras layers API conventions. That is, aside from a different prefix, all functions in the Layers API have the same names and signatures as their counterparts in the Keras layers API.

124
Q

learning rate

A

A scalar used to train a model via gradient descent. During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step.

Learning rate is a key hyperparameter.
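
The gradient step in one line of Python (all values hypothetical):

weights, gradient, learning_rate = 0.8, 2.5, 0.01

gradient_step = learning_rate * gradient   # 0.025
weights = weights - gradient_step          # 0.775: move against the gradient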

125
Q

least squares regression

A

A linear regression model trained by minimizing L2 Loss.

126
Q

linear model

A

A model that assigns one weight per feature to make predictions. (Linear models also incorporate a bias.) By contrast, the relationship of weights to features in deep models is not one-to-one.

A linear model uses the following formula:

y’ = b + w1x1 + w2x2 + … + wnxn

where:

y’ is the raw prediction. (In certain kinds of linear models, this raw prediction will be further modified. For example, see logistic regression.)
b is the bias.
w is a weight, so w1 is the weight of the first feature, w2 is the weight of the second feature, and so on.
x is a feature, so x1 is the value of the first feature, x2 is the value of the second feature, and so on.

Linear models tend to be easier to analyze and train than deep models. However, deep models can model complex relationships between features.

Linear regression and logistic regression are two types of linear models. Linear models include not only models that use the linear equation but also a broader set of models that use the linear equation as part of the formula. For example, logistic regression post-processes the raw prediction (y’) to calculate the prediction.
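
For illustration, a minimal Python sketch of the formula above (the feature values, weights, and bias are made up):

def linear_model(features, weights, bias):
    # Returns the raw prediction y' = b + w1*x1 + ... + wn*xn.
    return bias + sum(w * x for w, x in zip(weights, features))

print(linear_model(features=[2.0, 3.0], weights=[0.5, -1.0], bias=4.0))  # 2.0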

127
Q

linear regression

A

Using the raw output (y’) of a linear model as the actual prediction in a regression model. The goal of a regression problem is to make a real-valued prediction. For example, if the raw output (y’) of a linear model is 8.37, then the prediction is 8.37.

Contrast linear regression with logistic regression. Also, contrast regression with classification.

128
Q

logistic regression

A

A classification model that uses a sigmoid function to convert a linear model’s raw prediction (y’) into a value between 0 and 1. You can interpret the value between 0 and 1 in either of the following two ways:

As a probability that the example belongs to the positive class in a binary classification problem.
As a value to be compared against a classification threshold. If the value is equal to or above the classification threshold, the system classifies the example as the positive class. Conversely, if the value is below the given threshold, the system classifies the example as the negative class. For example, suppose the classification threshold is 0.82:

Imagine an example that produces a raw prediction (y’) of 2.6. The sigmoid of 2.6 is 0.93. Since 0.93 is greater than 0.82, the system classifies this example as the positive class.
Imagine a different example that produces a raw prediction of 1.3. The sigmoid of 1.3 is 0.79. Since 0.79 is less than 0.82, the system classifies that example as the negative class.

Although logistic regression is often used in binary classification problems, logistic regression can also be used in multi-class classification problems (where it is called multi-class logistic regression or multinomial regression).
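
For illustration, a minimal Python sketch of the 0.82-threshold example above:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify(raw_prediction, threshold=0.82):
    return "positive" if sigmoid(raw_prediction) >= threshold else "negative"

print(round(sigmoid(2.6), 2), classify(2.6))  # 0.93 positive
print(round(sigmoid(1.3), 2), classify(1.3))  # 0.79 negative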

129
Q

logits

A

The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.

In addition, logits sometimes refer to the element-wise inverse of the sigmoid function. For more information, see tf.nn.sigmoid_cross_entropy_with_logits.

130
Q

Log Loss

A

The loss function used in binary logistic regression.
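
For illustration, a minimal Python sketch of the standard binary log loss formula (not quoted from the glossary): labels are 0 or 1, and each probability is the model’s predicted P(positive).

import math

def log_loss(labels, probabilities):
    n = len(labels)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probabilities)) / n

print(round(log_loss([1, 0], [0.9, 0.2]), 3))  # ≈ 0.164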

131
Q

log-odds

A

The logarithm of the odds of some event.

If the event refers to a binary probability, then odds refers to the ratio of the probability of success (p) to the probability of failure (1-p). For example, suppose that a given event has a 90% probability of success and a 10% probability of failure. In this case, odds is calculated as follows:
odds = p/(1-p) = .9/(1-.9) = .9/.1 = 9

The log-odds is simply the logarithm of the odds. By convention, “logarithm” refers to natural logarithm, but logarithm could actually be any base greater than 1. Sticking to convention, the log-odds of our example is therefore:

log-odds = ln(9) = 2.2
The log-odds function is the inverse of the sigmoid function.
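
For illustration, a minimal Python sketch of the example above, which also checks that the sigmoid maps the log-odds back to the original probability:

import math

p = 0.9
odds = p / (1 - p)            # 9.0
log_odds = math.log(odds)     # ln(9) ≈ 2.2

back_to_p = 1 / (1 + math.exp(-log_odds))
print(round(log_odds, 1))     # 2.2
print(round(back_to_p, 1))    # 0.9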

132
Q

Long Short-Term Memory (LSTM)

A

A type of cell in a recurrent neural network used to process sequences of data in applications such as handwriting recognition, machine translation, and image captioning. LSTMs address the vanishing gradient problem that occurs when training RNNs due to long data sequences by maintaining history in an internal memory state based on new input and context from previous cells in the RNN.

133
Q

loss

A

A measure of how far a model’s predictions are from its label. Or, to phrase it more pessimistically, a measure of how bad the model is. To determine this value, a model must define a loss function. For example, linear regression models typically use mean squared error for a loss function, while logistic regression models use Log Loss.

134
Q

loss curve

A

A graph of loss as a function of training iterations. For example:

[Figure: loss versus training iterations, showing a steady drop as iterations increase, then a slight rise in loss at a high number of iterations.]

The loss curve can help you determine when your model is converging, overfitting, or underfitting.

135
Q

loss surface

A

A graph of weight(s) vs. loss. Gradient descent aims to find the weight(s) for which the loss surface is at a local minimum.

136
Q

machine learning

A

A program or system that builds (trains) a predictive model from input data. The system uses the learned model to make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model. Machine learning also refers to the field of study concerned with these programs or systems.

137
Q

majority class

A

The more common label in a class-imbalanced dataset. For example, given a dataset containing 99% non-spam labels and 1% spam labels, the non-spam labels are the majority class.

138
Q

Markov decision process (MDP)

A

rl

A graph representing the decision-making model where decisions (or actions) are taken to navigate a sequence of states under the assumption that the Markov property holds. In reinforcement learning, these transitions between states return a numerical reward.

139
Q

Markov property

A

rl

A property of certain environments, where state transitions are entirely determined by information implicit in the current state and the agent’s action.

140
Q

matplotlib

A

An open-source Python 2D plotting library. matplotlib helps you visualize different aspects of machine learning.

141
Q

matrix factorization

A

In math, a mechanism for finding the matrices whose dot product approximates a target matrix.

In recommendation systems, the target matrix often holds users’ ratings on items. For example, the target matrix for a movie recommendation system might look something like the following, where the positive integers are user ratings and 0 means that the user didn’t rate the movie:
          Casablanca   The Philadelphia Story   Black Panther   Wonder Woman   Pulp Fiction
User 1    5.0          3.0                      0.0             2.0            0.0
User 2    4.0          0.0                      0.0             1.0            5.0
User 3    3.0          1.0                      4.0             5.0            0.0

The movie recommendation system aims to predict user ratings for unrated movies. For example, will User 1 like Black Panther?

One approach for recommendation systems is to use matrix factorization to generate the following two matrices:

A user matrix, shaped as the number of users X the number of embedding dimensions.
An item matrix, shaped as the number of embedding dimensions X the number of items.

For example, using matrix factorization on our three users and five items could yield the following user matrix and item matrix:

User Matrix         Item Matrix

1.1  2.3            0.9  0.2  1.4  2.0  1.2
0.6  2.0            1.7  1.2  1.2  -0.1 2.1
2.5  0.5

The dot product of the user matrix and item matrix yields a recommendation matrix that contains not only the original user ratings but also predictions for the movies that each user hasn’t seen. For example, consider User 1’s rating of Casablanca, which was 5.0. The dot product corresponding to that cell in the recommendation matrix should hopefully be around 5.0, and it is:

(1.1 * 0.9) + (2.3 * 1.7) = 4.9

More importantly, will User 1 like Black Panther? Taking the dot product corresponding to the first row and the third column yields a predicted rating of 4.3:

(1.1 * 1.4) + (2.3 * 1.2) = 4.3

Matrix factorization typically yields a user matrix and item matrix that, together, are significantly more compact than the target matrix.
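
For illustration, a minimal NumPy sketch that multiplies the user matrix and item matrix above to reproduce the known rating and the Black Panther prediction:

import numpy as np

user_matrix = np.array([[1.1, 2.3],
                        [0.6, 2.0],
                        [2.5, 0.5]])
item_matrix = np.array([[0.9, 0.2, 1.4, 2.0, 1.2],
                        [1.7, 1.2, 1.2, -0.1, 2.1]])

recommendations = user_matrix @ item_matrix   # 3 users x 5 movies
print(round(recommendations[0, 0], 1))        # 4.9 (User 1, Casablanca)
print(round(recommendations[0, 2], 1))        # 4.3 (User 1, Black Panther)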

142
Q

Mean Absolute Error (MAE)

A

An error metric calculated by taking an average of absolute errors. In the context of evaluating a model’s accuracy, MAE is the average absolute difference between the expected and predicted values across all training examples. Specifically, for n examples, where y_i is each actual value and y_i_hat is its prediction, MAE is defined as follows:

MAE = (1/n) * sum(|y_i - y_i_hat|), for i = 1 to n
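
For illustration, a minimal Python sketch of the formula (the values are made up):

def mean_absolute_error(actuals, predictions):
    n = len(actuals)
    return sum(abs(y - y_hat) for y, y_hat in zip(actuals, predictions)) / n

print(mean_absolute_error([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))  # (0.5 + 0.5 + 0.0) / 3 ≈ 0.333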

143
Q

Mean Squared Error (MSE)

A

The average squared loss per example. MSE is calculated by dividing the squared loss by the number of examples. The values that TensorFlow Playground displays for “Training loss” and “Test loss” are MSE.

144
Q

metric

A

TensorFlow

A number that you care about. May or may not be directly optimized in a machine-learning system. A metric that your system tries to optimize is called an objective.

145
Q

Metrics API (tf.metrics)

A

A TensorFlow API for evaluating models. For example, tf.metrics.accuracy determines how often a model’s predictions match labels. When writing a custom Estimator, you invoke Metrics API functions to specify how your model should be evaluated.

146
Q

mini-batch

A

A small, randomly selected subset of the entire batch of examples run together in a single iteration of training or inference. The batch size of a mini-batch is usually between 10 and 1,000. It is much more efficient to calculate the loss on a mini-batch than on the full training data.

147
Q

mini-batch stochastic gradient descent (SGD)

A

A gradient descent algorithm that uses mini-batches. In other words, mini-batch SGD estimates the gradient based on a small subset of the training data. Vanilla SGD uses a mini-batch of size 1.
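
For illustration, a minimal Python sketch of mini-batch SGD fitting a one-feature linear model (the data and hyperparameters are made up):

import random

random.seed(0)
examples = [(float(x), 2.0 * x + 1.0) for x in range(10)]  # exactly y = 2x + 1
w, b = 0.0, 0.0
learning_rate = 0.01
batch_size = 5

for _ in range(2000):
    batch = random.sample(examples, batch_size)  # the randomly selected mini-batch
    # Average the squared-loss gradients over the mini-batch only.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # approaches 2.0 and 1.0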

148
Q

minimax loss

A

A loss function for generative adversarial networks, based on the cross-entropy between the distribution of generated data and real data.

Minimax loss is used in the first paper to describe generative adversarial networks.

149
Q

minority class

A

The less common label in a class-imbalanced dataset. For example, given a dataset containing 99% non-spam labels and 1% spam labels, the spam labels are the minority class.

150
Q

MNIST

A

A public-domain dataset compiled by LeCun, Cortes, and Burges containing 60,000 images, each image showing how a human manually wrote a particular digit from 0–9. Each image is stored as a 28x28 array of integers, where each integer is a grayscale value between 0 and 255, inclusive.

MNIST is a canonical dataset for machine learning, often used to test new machine learning approaches. For details, see The MNIST Database of Handwritten Digits.

151
Q

model

A

The representation of what a machine learning system has learned from the training data. Within TensorFlow, model is an overloaded term, which can have either of the following two related meanings:

The TensorFlow graph that expresses the structure of how a prediction will be computed.
The particular weights and biases of that TensorFlow graph, which are determined by training.
152
Q

model capacity

A

The complexity of problems that a model can learn. The more complex the problems that a model can learn, the higher the model’s capacity. A model’s capacity typically increases with the number of model parameters. For a formal definition of classifier capacity, see VC dimension.

153
Q

model function

A

TensorFlow

The function within an Estimator that implements machine learning training, evaluation, and inference. For example, the training portion of a model function might handle tasks such as defining the topology of a deep neural network and identifying its optimizer function. When using premade Estimators, someone has already written the model function for you. When using custom Estimators, you must write the model function yourself.

For details about writing a model function, see the Creating Custom Estimators chapter in the TensorFlow Programmers Guide.

154
Q

model training

A

The process of determining the best model.

155
Q

Momentum

A

A sophisticated gradient descent algorithm in which a learning step depends not only on the derivative in the current step, but also on the derivatives of the step(s) that immediately preceded it. Momentum involves computing an exponentially weighted moving average of the gradients over time, analogous to momentum in physics. Momentum sometimes prevents learning from getting stuck in local minima.

156
Q

multi-class classification

A

Classification problems that distinguish among more than two classes. For example, there are approximately 128 species of maple trees, so a model that categorized maple tree species would be multi-class. Conversely, a model that divided emails into only two categories (spam and not spam) would be a binary classification model.

157
Q

multi-class logistic regression

A

Using logistic regression in multi-class classification problems.

158
Q

multinomial classification

A

Synonym for multi-class classification.

159
Q

NaN trap

A

When one number in your model becomes a NaN during training, which causes many or all other numbers in your model to eventually become a NaN.

NaN is an abbreviation for “Not a Number.”

160
Q

natural language understanding (NLU)

A

Determining a user’s intentions based on what the user typed or said. For example, a search engine uses natural language understanding to determine what the user is searching for based on what the user typed or said.

161
Q

negative class

A

In binary classification, one class is termed positive and the other is termed negative. The positive class is the thing we’re looking for and the negative class is the other possibility. For example, the negative class in a medical test might be “not tumor.” The negative class in an email classifier might be “not spam.” See also positive class.

162
Q

neural network

A

A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden) consisting of simple connected units or neurons followed by nonlinearities.

163
Q

neuron

A

A node in a neural network, typically taking in multiple input values and generating one output value. The neuron calculates the output value by applying an activation function (nonlinear transformation) to a weighted sum of input values.

164
Q

N-gram

A

An ordered sequence of N words. For example, truly madly is a 2-gram. Because order is relevant, madly truly is a different 2-gram than truly madly.
N   Name(s) for this kind of N-gram   Examples
2   bigram or 2-gram                  to go, go to, eat lunch, eat dinner
3   trigram or 3-gram                 ate too much, three blind mice, the bell tolls
4   4-gram                            walk in the park, dust in the wind, the boy ate lentils

Many natural language understanding models rely on N-grams to predict the next word that the user will type or say. For example, suppose a user typed three blind. An NLU model based on trigrams would likely predict that the user will next type mice.

Contrast N-grams with bag of words, which are unordered sets of words.
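
For illustration, a minimal Python sketch of extracting N-grams from a token list:

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the bell tolls for thee".split()
print(ngrams(tokens, 2))  # [('the', 'bell'), ('bell', 'tolls'), ('tolls', 'for'), ('for', 'thee')]
print(ngrams(tokens, 3))  # [('the', 'bell', 'tolls'), ('bell', 'tolls', 'for'), ('tolls', 'for', 'thee')]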

165
Q

node (neural network)

A

A neuron in a hidden layer.

166
Q

node (TensorFlow graph)

A

TensorFlow

An operation in a TensorFlow graph.

167
Q

noise

A

Broadly speaking, anything that obscures the signal in a dataset. Noise can be introduced into data in a variety of ways. For example:

Human raters make mistakes in labeling.
Humans and instruments mis-record or omit feature values.
168
Q

non-response bias

A

fairness

See selection bias.

Errors in conclusions drawn from sampled data due to a selection process that generates systematic differences between samples observed in the data and those not observed. The following forms of selection bias exist:

coverage bias: The population represented in the dataset does not match the population that the machine learning model is making predictions about.
sampling bias: Data is not collected randomly from the target group.
non-response bias (also called participation bias): Users from certain groups opt-out of surveys at different rates than users from other groups.

For example, suppose you are creating a machine learning model that predicts people’s enjoyment of a movie. To collect training data, you hand out a survey to everyone in the front row of a theater showing the movie. Offhand, this may sound like a reasonable way to gather a dataset; however, this form of data collection may introduce the following forms of selection bias:

coverage bias: By sampling from a population who chose to see the movie, your model's predictions may not generalize to people who did not already express that level of interest in the movie.
sampling bias: Rather than randomly sampling from the intended population (all the people at the movie), you sampled only the people in the front row. It is possible that the people sitting in the front row were more interested in the movie than those in other rows.
non-response bias: In general, people with strong opinions tend to respond to optional surveys more frequently than people with mild opinions. Since the movie survey is optional, the responses are more likely to form a bimodal distribution than a normal (bell-shaped) distribution.
169
Q

normalization

A

The process of converting an actual range of values into a standard range of values, typically -1 to +1 or 0 to 1. For example, suppose the natural range of a certain feature is 800 to 6,000. Through subtraction and division, you can normalize those values into the range -1 to +1.

See also scaling.
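
For illustration, a minimal Python sketch of normalizing the 800-to-6,000 range above into -1 to +1 through subtraction and division:

def normalize(value, min_value=800.0, max_value=6000.0):
    midpoint = (min_value + max_value) / 2      # 3400.0
    half_range = (max_value - min_value) / 2    # 2600.0
    return (value - midpoint) / half_range

print(normalize(800))    # -1.0
print(normalize(3400))   # 0.0
print(normalize(6000))   # 1.0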

170
Q

numerical data

A

Features represented as integers or real-valued numbers. For example, in a real estate model, you would probably represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as numerical data indicates that the feature’s values have a mathematical relationship to each other and possibly to the label. For example, representing the size of a house as numerical data indicates that a 200 square-meter house is twice as large as a 100 square-meter house. Furthermore, the number of square meters in a house probably has some mathematical relationship to the price of the house.

Not all integer data should be represented as numerical data. For example, postal codes in some parts of the world are integers; however, integer postal codes should not be represented as numerical data in models. That’s because a postal code of 20000 is not twice (or half) as potent as a postal code of 10000. Furthermore, although different postal codes do correlate to different real estate values, we can’t assume that real estate values at postal code 20000 are twice as valuable as real estate values at postal code 10000. Postal codes should be represented as categorical data instead.

Numerical features are sometimes called continuous features.

171
Q

NumPy

A

An open-source math library that provides efficient array operations in Python. pandas is built on NumPy.

172
Q

objective

A

A metric that your algorithm is trying to optimize.

173
Q

objective function

A

The mathematical formula or metric that a model aims to optimize. For example, the objective function for linear regression is usually squared loss. Therefore, when training a linear regression model, the goal is to minimize squared loss.

In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy, the goal is to maximize accuracy.

See also loss.

174
Q

offline inference

A

Generating a group of predictions, storing those predictions, and then retrieving those predictions on demand. Contrast with online inference.

175
Q

one-hot encoding

A

A sparse vector in which:

One element is set to 1.
All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a given botany dataset chronicles 15,000 different species, each denoted with a unique string identifier. As part of feature engineering, you’ll probably encode those string identifiers as one-hot vectors in which the vector has a size of 15,000.
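
For illustration, a minimal Python sketch using a tiny, made-up vocabulary instead of 15,000 species:

def one_hot(index, vocabulary_size):
    vector = [0] * vocabulary_size
    vector[index] = 1
    return vector

species = ["oak", "maple", "pine", "birch"]
print(one_hot(species.index("pine"), len(species)))  # [0, 0, 1, 0]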

176
Q

one-shot learning

A

A machine learning approach, often used for object classification, designed to learn effective classifiers from a single training example.

See also few-shot learning.

177
Q

one-vs.-all

A

Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers—one binary classifier for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classifiers:

animal vs. not animal
vegetable vs. not vegetable
mineral vs. not mineral
178
Q

online inference

A

Generating predictions on demand. Contrast with offline inference.

179
Q

Operation (op)

A

TensorFlow

A node in the TensorFlow graph. In TensorFlow, any procedure that creates, manipulates, or destroys a Tensor is an operation. For example, a matrix multiply is an operation that takes two Tensors as input and generates one Tensor as output.

180
Q

optimizer

A

A specific implementation of the gradient descent algorithm. TensorFlow’s base class for optimizers is tf.train.Optimizer. Popular optimizers include:

AdaGrad, which stands for ADAptive GRADient descent.
Adam, which stands for ADAptive with Momentum.

Different optimizers may leverage one or more of the following concepts to enhance the effectiveness of gradient descent on a given training set:

momentum (Momentum)
update frequency
sparsity/regularization (Ftrl)
more complex math (Proximal, and others)

You might even imagine an NN-driven optimizer.

181
Q

out-group homogeneity bias

A

fairness

The tendency to see out-group members as more alike than in-group members when comparing attitudes, values, personality traits, and other characteristics. In-group refers to people you interact with regularly; out-group refers to people you do not interact with regularly. If you create a dataset by asking people to provide attributes about out-groups, those attributes may be less nuanced and more stereotyped than attributes that participants list for people in their in-group.

For example, Lilliputians might describe the houses of other Lilliputians in great detail, citing small differences in architectural styles, windows, doors, and sizes. However, the same Lilliputians might simply declare that Brobdingnagians all live in identical houses.

Out-group homogeneity bias is a form of group attribution bias.

See also in-group bias.

182
Q

outliers

A

Values distant from most other values. In machine learning, any of the following are outliers:

Weights with high absolute values.
Predicted values relatively far away from the actual values.
Input data whose values are more than roughly 3 standard deviations from the mean.

Outliers often cause problems in model training. Clipping is one way of managing outliers.

183
Q

output layer

A

The “final” layer of a neural network. The layer containing the answer(s).

184
Q

overfitting

A

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

185
Q

pandas

A

A column-oriented data analysis API. Many machine learning frameworks, including TensorFlow, support pandas data structures as input. See the pandas documentation for details.

186
Q

parameter

A

A variable of a model that the machine learning system trains on its own. For example, weights are parameters whose values the machine learning system gradually learns through successive training iterations. Contrast with hyperparameter.

187
Q

Parameter Server (PS)

A

TensorFlow

A job that keeps track of a model’s parameters in a distributed setting.

See the TensorFlow Architecture chapter in the TensorFlow Programmers Guide for details.

188
Q

parameter update

A

The operation of adjusting a model’s parameters during training, typically within a single iteration of gradient descent.

189
Q

partial derivative

A

A derivative in which all but one of the variables is considered a constant. For example, the partial derivative of f(x, y) with respect to x is the derivative of f considered as a function of x alone (that is, keeping y constant). The partial derivative of f with respect to x focuses only on how x is changing and ignores all other variables in the equation.

190
Q

participation bias

A

fairness

Synonym for non-response bias. See selection bias. Non-response bias (also called participation bias) occurs when users from certain groups opt out of surveys at different rates than users from other groups. In general, people with strong opinions tend to respond to optional surveys more frequently than people with mild opinions, so the responses to an optional survey are more likely to form a bimodal distribution than a normal (bell-shaped) distribution.

191
Q

partitioning strategy

A

The algorithm by which variables are divided across parameter servers.

192
Q

perceptron

A
A system (either hardware or software) that takes in one or more input values, runs a function on the weighted sum of the inputs, and computes a single output value. In machine learning, the function is typically nonlinear, such as ReLU, sigmoid, or tanh. For example, the following perceptron relies on the sigmoid function to process three input values:
f(x1,x2,x3) = sigmoid(w1x1+w2x2+w3x3)

In the following illustration, the perceptron takes three inputs, each of which is itself modified by a weight before entering the perceptron:

A perceptron that takes in 3 inputs, each multiplied by separate weights. The perceptron outputs a single value.

Perceptrons are the nodes in deep neural networks. That is, a deep neural network consists of multiple connected perceptrons, plus a backpropagation algorithm to introduce feedback.

193
Q

performance

A

Overloaded term with the following meanings:

The traditional meaning within software engineering. Namely: How fast (or efficiently) does this piece of software run?
The meaning within machine learning. Here, performance answers the following question: How correct is this model? That is, how good are the model's predictions?
194
Q

perplexity

A

One measure of how well a model is accomplishing its task. For example, suppose your task is to read the first few letters of a word a user is typing on a smartphone keyboard, and to offer a list of possible completion words. Perplexity, P, for this task is approximately the number of guesses you need to offer in order for your list to contain the actual word the user is trying to type.

Perplexity is related to cross-entropy as follows:
P = 2^(-cross entropy)

195
Q

pipeline

A

The infrastructure surrounding a machine learning algorithm. A pipeline includes gathering the data, putting the data into training data files, training one or more models, and exporting the models to production.

196
Q

policy

A

rl

In reinforcement learning, an agent’s probabilistic mapping from states to actions.

197
Q

pooling

A

Reducing a matrix (or matrices) created by an earlier convolutional layer to a smaller matrix. Pooling usually involves taking either the maximum or average value across the pooled area. For example, suppose we have the following 3x3 matrix:

A pooling operation, just like a convolutional operation, divides that matrix into slices and then slides that convolutional operation by strides. For example, suppose the pooling operation divides the convolutional matrix into 2x2 slices with a 1x1 stride. As the following diagram illustrates, four pooling operations take place. Imagine that each pooling operation picks the maximum value of the four in that slice:

Pooling helps enforce translational invariance in the input matrix.

Pooling for vision applications is known more formally as spatial pooling. Time-series applications usually refer to pooling as temporal pooling. Less formally, pooling is often called subsampling or downsampling.
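
For illustration, a minimal NumPy sketch of max pooling with 2x2 slices and a 1x1 stride (the input values are made up):

import numpy as np

matrix = np.array([[5, 3, 1],
                   [8, 2, 5],
                   [9, 4, 3]])

# Slide a 2x2 window one position at a time, keeping the max of each slice.
pooled = np.array([[matrix[i:i + 2, j:j + 2].max() for j in range(2)]
                   for i in range(2)])
print(pooled)
# [[8 5]
#  [9 5]]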

198
Q

positive class

A

In binary classification, the two possible classes are labeled as positive and negative. The positive outcome is the thing we’re testing for. (Admittedly, we’re simultaneously testing for both outcomes, but play along.) For example, the positive class in a medical test might be “tumor.” The positive class in an email classifier might be “spam.”

Contrast with negative class.

199
Q

post-processing

A
#fairness
Processing the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

For example, one might apply post-processing to a binary classifier by setting a classification threshold such that equality of opportunity is maintained for some attribute by checking that the true positive rate is the same for all values of that attribute.

200
Q

PR AUC (area under the PR curve)

A

Area under the interpolated precision-recall curve, obtained by plotting (recall, precision) points for different values of the classification threshold. Depending on how it’s calculated, PR AUC may be equivalent to the average precision of the model.

201
Q

precision

A

A metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class. That is:
Precision = TP/(TP+FP)

202
Q

precision-recall curve

A

A curve of precision vs. recall at different classification thresholds.

203
Q

prediction

A

A model’s output when provided with an input example.

204
Q

prediction bias

A

fairness

A value indicating how far apart the average of predictions is from the average of labels in the dataset.

Not to be confused with the bias term in machine learning models or with bias in ethics and fairness.

205
Q

predictive parity

A

fairness

A fairness metric that checks whether, for a given classifier, the precision rates are equivalent for subgroups under consideration.

For example, a model that predicts college acceptance would satisfy predictive parity for nationality if its precision rate is the same for Lilliputians and Brobdingnagians.

Predictive parity is sometimes also called predictive rate parity.

See “Fairness Definitions Explained” (section 3.2.1) for a more detailed discussion of predictive parity.

206
Q

predictive rate parity

A

fairness

Another name for predictive parity. A fairness metric that checks whether, for a given classifier, the precision rates are equivalent for subgroups under consideration.

For example, a model that predicts college acceptance would satisfy predictive parity for nationality if its precision rate is the same for Lilliputians and Brobdingnagians.

Predictive parity is sometimes also called predictive rate parity.

See “Fairness Definitions Explained” (section 3.2.1) for a more detailed discussion of predictive parity.

207
Q

premade Estimator

A

TensorFlow

An Estimator that someone has already built. TensorFlow provides several premade Estimators, including DNNClassifier, DNNRegressor, and LinearClassifier. To learn more about premade Estimators, see the Premade Estimators chapter in the TensorFlow Programmers Guide.

Contrast with custom estimators.

208
Q

preprocessing

A
#fairness
Processing data before it's used to train a model. Preprocessing could be as simple as removing words from an English text corpus that don't occur in the English dictionary, or could be as complex as re-expressing data points in a way that eliminates as many attributes that are correlated with sensitive attributes as possible. Preprocessing can help satisfy fairness constraints.
209
Q

pre-trained model

A

Models or model components (such as embeddings) that have already been trained. Sometimes, you’ll feed pre-trained embeddings into a neural network. Other times, your model will train the embeddings itself rather than rely on the pre-trained embeddings.

210
Q

prior belief

A

What you believe about the data before you begin training on it. For example, L2 regularization relies on a prior belief that weights should be small and normally distributed around zero.

211
Q

proxy (sensitive attributes)

A
#fairness
An attribute used as a stand-in for a sensitive attribute. For example, an individual's postal code might be used as a proxy for their income, race, or ethnicity.
212
Q

proxy labels

A

Data used to approximate labels not directly available in a dataset.

For example, suppose you want is it raining? to be a Boolean label for your dataset, but the dataset doesn’t contain rain data. If photographs are available, you might establish pictures of people carrying umbrellas as a proxy label for is it raining? However, proxy labels may distort results. For example, in some places, it may be more common to carry umbrellas to protect against the sun than against rain.

213
Q

Q-function

A

rl

In reinforcement learning, the function that predicts the expected return from taking an action in a state and then following a given policy.

Q-function is also known as state-action value function.

214
Q

Q-learning

A

rl

In reinforcement learning, an algorithm that allows an agent to learn the optimal Q-function of a Markov decision process by applying the Bellman equation. The Markov decision process models an environment.

215
Q

quantile

A

Each bucket in quantile bucketing.

216
Q

quantile bucketing

A

Distributing a feature’s values into buckets so that each bucket contains the same (or almost the same) number of examples. For example, the following figure divides 44 points into 4 buckets, each of which contains 11 points. In order for each bucket in the figure to contain the same number of points, some buckets span a different width of x-values.

44 data points divided into 4 buckets of 11 points each. Some of the buckets contain a wider range of feature values than others.
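
For illustration, a minimal NumPy sketch (made-up data) that cuts 44 values into 4 equal-count buckets:

import numpy as np

values = np.random.default_rng(0).exponential(scale=100.0, size=44)
boundaries = np.quantile(values, [0.25, 0.5, 0.75])  # 3 cut points -> 4 buckets
buckets = np.digitize(values, boundaries)            # bucket index per value

print(np.bincount(buckets))  # [11 11 11 11]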

217
Q

quantization

A

An algorithm that implements quantile bucketing on a particular feature in a dataset.

218
Q

queue

A

TensorFlow

A TensorFlow Operation that implements a queue data structure. Typically used in I/O.

219
Q

random forest

A

An ensemble approach to finding the decision tree that best fits the training data by creating many decision trees and then determining the “average” one. The “random” part of the term refers to building each of the decision trees from a random selection of features; the “forest” refers to the set of decision trees.

220
Q

random policy

A

rl

In reinforcement learning, a policy that chooses an action at random.

221
Q

rank (ordinality)

A

The ordinal position of a class in a machine learning problem that categorizes classes from highest to lowest. For example, a behavior ranking system could rank a dog’s rewards from highest (a steak) to lowest (wilted kale).

222
Q

rank (Tensor)

A

TensorFlow

The number of dimensions in a Tensor. For instance, a scalar has rank 0, a vector has rank 1, and a matrix has rank 2.

Not to be confused with rank (ordinality).

223
Q

rater

A

A human who provides labels in examples. Sometimes called an “annotator.”

224
Q

recall

A

A metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify? That is:

recall = TP/(TP+FN)

225
Q

recommendation system

A

A system that selects for each user a relatively small set of desirable items from a large corpus. For example, a video recommendation system might recommend two videos from a corpus of 100,000 videos, selecting Casablanca and The Philadelphia Story for one user, and Wonder Woman and Black Panther for another. A video recommendation system might base its recommendations on factors such as:

Movies that similar users have rated or watched.
Genre, directors, actors, target demographic...
226
Q

Rectified Linear Unit (ReLU)

A

An activation function with the following rules:

If input is negative or zero, output is 0.
If input is positive, output is equal to input.
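
For illustration, a minimal Python sketch of the two rules:

def relu(x):
    return x if x > 0 else 0.0

print(relu(-3.0))  # 0.0
print(relu(2.5))   # 2.5
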
227
Q

recurrent neural network (RNN)

A

A neural network that is intentionally run multiple times, where parts of each run feed into the next run. Specifically, hidden layers from the previous run provide part of the input to the same hidden layer in the next run. Recurrent neural networks are particularly useful for evaluating sequences, so that the hidden layers can learn from previous runs of the neural network on earlier parts of the sequence.

For example, the following figure shows a recurrent neural network that runs four times. Notice that the values learned in the hidden layers from the first run become part of the input to the same hidden layers in the second run. Similarly, the values learned in the hidden layer on the second run become part of the input to the same hidden layer in the third run. In this way, the recurrent neural network gradually trains and predicts the meaning of the entire sequence rather than just the meaning of individual words.

228
Q

regression model

A

A type of model that outputs continuous (typically, floating-point) values. Compare with classification models, which output discrete values, such as “day lily” or “tiger lily.”

229
Q

regularization

A

The penalty on a model’s complexity. Regularization helps prevent overfitting. Different kinds of regularization include:

L1 regularization
L2 regularization
dropout regularization
early stopping (this is not a formal regularization method, but can effectively limit overfitting)
230
Q

regularization rate

A

A scalar value, represented as lambda, specifying the relative importance of the regularization function. The following simplified loss equation shows the regularization rate’s influence:
minimize(loss function + lambda * regularization function)

Raising the regularization rate reduces overfitting but may make the model less accurate.
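
For illustration, a minimal Python sketch of the equation above, using L2 regularization as the regularization function (all numbers are made up):

def regularized_loss(data_loss, weights, lambda_):
    l2_penalty = sum(w ** 2 for w in weights)
    return data_loss + lambda_ * l2_penalty

print(round(regularized_loss(data_loss=2.0, weights=[0.5, -1.2, 3.0], lambda_=0.1), 3))
# 2.0 + 0.1 * (0.25 + 1.44 + 9.0) = 3.069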

231
Q

reinforcement learning (RL)

A

rl

A family of algorithms that learn an optimal policy, whose goal is to maximize return when interacting with an environment. For example, the ultimate reward of most games is victory. Reinforcement learning systems can become expert at playing complex games by evaluating sequences of previous game moves that ultimately led to wins and sequences that ultimately led to losses.

232
Q

replay buffer

A

rl

In DQN-like algorithms, the memory used by the agent to store state transitions for use in experience replay.

233
Q

reporting bias

A

fairness

The fact that the frequency with which people write about actions, outcomes, or properties is not a reflection of their real-world frequencies or the degree to which a property is characteristic of a class of individuals. Reporting bias can influence the composition of data that machine learning systems learn from.

For example, in books, the word laughed is more prevalent than breathed. A machine learning model that estimates the relative frequency of laughing and breathing from a book corpus would probably determine that laughing is more common than breathing.

234
Q

representation

A

The process of mapping data to useful features.

235
Q

re-ranking

A

The final stage of a recommendation system, during which scored items may be re-graded according to some other (typically, non-ML) algorithm. Re-ranking evaluates the list of items generated by the scoring phase, taking actions such as:

Eliminating items that the user has already purchased.
Boosting the score of fresher items.
236
Q

return

A

rl

In reinforcement learning, given a certain policy and a certain state, the return is the sum of all rewards that the agent expects to receive when following the policy from the state to the end of the episode. The agent accounts for the delayed nature of expected rewards by discounting rewards according to the state transitions required to obtain the reward.

Therefore, if the discount factor is gamma, and r_0, …, r_(n-1) denote the rewards until the end of the episode, then the return calculation is as follows:

return = r_0 + gamma * r_1 + gamma^2 * r_2 + … + gamma^(n-1) * r_(n-1)
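
For illustration, a minimal Python sketch of the calculation above with made-up rewards:

def discounted_return(rewards, gamma):
    return sum(reward * gamma ** i for i, reward in enumerate(rewards))

print(round(discounted_return([1.0, 0.0, 2.0], gamma=0.9), 2))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62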

237
Q

reward

A

rl

In reinforcement learning, the numerical result of taking an action in a state, as defined by the environment.

238
Q

ridge regularization

A

Synonym for L2 regularization. The term ridge regularization is more frequently used in pure statistics contexts, whereas L2 regularization is used more often in machine learning.

239
Q

ROC (receiver operating characteristic) Curve

A

A curve of true positive rate vs. false positive rate at different classification thresholds. See also AUC.

240
Q

root directory

A

TensorFlow

The directory you specify for hosting subdirectories of the TensorFlow checkpoint and events files of multiple models.

241
Q

Root Mean Squared Error (RMSE)

A

The square root of the Mean Squared Error.

242
Q

rotational invariance

A

In an image classification problem, an algorithm’s ability to successfully classify images even when the orientation of the image changes. For example, the algorithm can still identify a tennis racket whether it is pointing up, sideways, or down. Note that rotational invariance is not always desirable; for example, an upside-down 9 should not be classified as a 9.

See also translational invariance and size invariance.

243
Q

sampling bias

A

fairness

See selection bias. Sampling bias occurs when data is not collected randomly from the target group. For example, rather than randomly sampling from the intended population (all the people at the movie), you might sample only the people in the front row, and it is possible that the people sitting in the front row were more interested in the movie than those in other rows.

244
Q

SavedModel

A

TensorFlow

The recommended format for saving and recovering TensorFlow models. SavedModel is a language-neutral, recoverable serialization format, which enables higher-level systems and tools to produce, consume, and transform TensorFlow models.

See the Saving and Restoring chapter in the TensorFlow Programmer’s Guide for complete details.

245
Q

Saver

A

TensorFlow

A TensorFlow object responsible for saving model checkpoints.

246
Q

scalar

A

A single number or a single string that can be represented as a tensor of rank 0. For example, the following lines of code each create one scalar in TensorFlow:

import tensorflow as tf

breed = tf.Variable("poodle", dtype=tf.string)
temperature = tf.Variable(27, dtype=tf.int16)
precision = tf.Variable(0.982375101275, dtype=tf.float64)
247
Q

scaling

A

A commonly used practice in feature engineering to tame a feature’s range of values to match the range of other features in the dataset. For example, suppose that you want all floating-point features in the dataset to have a range of 0 to 1. Given a particular feature’s range of 0 to 500, you could scale that feature by dividing each value by 500.

See also normalization.

248
Q

scikit-learn

A

A popular open-source machine learning platform. See www.scikit-learn.org.

249
Q

scoring

A

The part of a recommendation system that provides a value or ranking for each item produced by the candidate generation phase.

250
Q

selection bias

A

fairness

Errors in conclusions drawn from sampled data due to a selection process that generates systematic differences between samples observed in the data and those not observed. The following forms of selection bias exist:

coverage bias: The population represented in the dataset does not match the population that the machine learning model is making predictions about.
sampling bias: Data is not collected randomly from the target group.
non-response bias (also called participation bias): Users from certain groups opt-out of surveys at different rates than users from other groups.

For example, suppose you are creating a machine learning model that predicts people’s enjoyment of a movie. To collect training data, you hand out a survey to everyone in the front row of a theater showing the movie. Offhand, this may sound like a reasonable way to gather a dataset; however, this form of data collection may introduce the following forms of selection bias:

coverage bias: By sampling from a population who chose to see the movie, your model's predictions may not generalize to people who did not already express that level of interest in the movie.
sampling bias: Rather than randomly sampling from the intended population (all the people at the movie), you sampled only the people in the front row. It is possible that the people sitting in the front row were more interested in the movie than those in other rows.
non-response bias: In general, people with strong opinions tend to respond to optional surveys more frequently than people with mild opinions. Since the movie survey is optional, the responses are more likely to form a bimodal distribution than a normal (bell-shaped) distribution.
251
Q

semi-supervised learning

A

Training a model on data where some of the training examples have labels but others don’t. One technique for semi-supervised learning is to infer labels for the unlabeled examples, and then to train on the inferred labels to create a new model. Semi-supervised learning can be useful if labels are expensive to obtain but unlabeled examples are plentiful.

252
Q

sensitive attribute

A
#fairness
A human attribute that may be given special consideration for legal, ethical, social, or personal reasons.
253
Q

sentiment analysis

A

Using statistical or machine learning algorithms to determine a group’s overall attitude—positive or negative—toward a service, product, organization, or topic. For example, using natural language understanding, an algorithm could perform sentiment analysis on the textual feedback from a university course to determine the degree to which students generally liked or disliked the course.

254
Q

sequence model

A

A model whose inputs have a sequential dependence. For example, predicting the next video watched from a sequence of previously watched videos.

255
Q

serving

A

A synonym for inferring.

256
Q

session (tf.session)

A

TensorFlow

An object that encapsulates the state of the TensorFlow runtime and runs all or part of a graph. When using the low-level TensorFlow APIs, you instantiate and manage one or more tf.session objects directly. When using the Estimators API, Estimators instantiate session objects for you.

257
Q

shape (Tensor)

A

The number of elements in each dimension of a tensor. The shape is represented as a list of integers. For example, the following two-dimensional tensor has a shape of [3,4]:

[[5, 7, 6, 4],
[2, 9, 4, 8],
[3, 6, 5, 1]]

TensorFlow uses row-major (C-style) format to represent the order of dimensions, which is why the shape in TensorFlow is [3,4] rather than [4,3]. In other words, in a two-dimensional TensorFlow Tensor, the shape is [number of rows, number of columns].

258
Q

sigmoid function

A

A function that maps logistic or multinomial regression output (log odds) to probabilities, returning a value between 0 and 1. The sigmoid function has the following formula:

y = 1 / (1 + e^(-sigma))

where sigma in logistic regression problems is simply:

sigma = b + w1x1 + w2x2 + … + wnxn

In other words, the sigmoid function converts sigma into a probability between 0 and 1.

In some neural networks, the sigmoid function acts as the activation function.

259
Q

similarity measure

A

In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.

260
Q

size invariance

A

In an image classification problem, an algorithm’s ability to successfully classify images even when the size of the image changes. For example, the algorithm can still identify a cat whether it consumes 2M pixels or 200K pixels. Note that even the best image classification algorithms still have practical limits on size invariance. For example, an algorithm (or human) is unlikely to correctly classify a cat image consuming only 20 pixels.

See also translational invariance and rotational invariance.

261
Q

sketching

A

In unsupervised machine learning, a category of algorithms that perform a preliminary similarity analysis on examples. Sketching algorithms use a locality-sensitive hash function to identify points that are likely to be similar, and then group them into buckets.

Sketching decreases the computation required for similarity calculations on large datasets. Instead of calculating similarity for every single pair of examples in the dataset, we calculate similarity only for each pair of points within each bucket.

262
Q

softmax

A

A function that provides probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0. For example, softmax might determine that the probability of a particular image being a dog is 0.9, a cat is 0.08, and a horse is 0.02. (Also called full softmax.)

Contrast with candidate sampling.
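
For illustration, a minimal Python sketch (made-up logits for dog, cat, and horse):

import math

def softmax(logits):
    exps = [math.exp(x - max(logits)) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, -0.5, -2.0])
print([round(p, 2) for p in probabilities])  # [0.91, 0.07, 0.02]
print(round(sum(probabilities), 6))          # 1.0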

263
Q

sparse feature

A

Feature vector whose values are predominately zero or empty. For example, a vector containing a single 1 value and a million 0 values is sparse. As another example, words in a search query could also be a sparse feature—there are many possible words in a given language, but only a few of them occur in a given query.

Contrast with dense feature.

264
Q

sparse representation

A

A representation of a tensor that only stores nonzero elements.

For example, the English language consists of about a million words. Consider two ways to represent a count of the words used in one English sentence:

A dense representation of this sentence must set an integer for all one million cells, placing a 0 in most of them, and a low integer into a few of them.
A sparse representation of this sentence stores only those cells symbolizing a word actually in the sentence. So, if the sentence contained only 20 unique words, then the sparse representation for the sentence would store an integer in only 20 cells.

For example, consider two ways to represent the sentence, “Dogs wag tails.” As the following tables show, the dense representation consumes about a million cells; the sparse representation consumes only 3 cells:
Dense Representation

Cell Number   Word       Occurrence
0             a          0
1             aardvark   0
2             aargh      0
3             aarti      0
…             (140,391 more words with an occurrence of 0)
140395        dogs       1
…             (633,062 words with an occurrence of 0)
773458        tails      1
…             (189,136 words with an occurrence of 0)
962594        wag        1
…             (many more words with an occurrence of 0)

Sparse Representation

Cell Number   Word    Occurrence
140395        dogs    1
773458        tails   1
962594        wag     1
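
For illustration, a minimal Python sketch storing the “Dogs wag tails.” counts as a dictionary keyed by cell number instead of a million-element dense list:

sparse_counts = {140395: 1, 773458: 1, 962594: 1}  # dogs, tails, wag

def occurrence(cell_number):
    return sparse_counts.get(cell_number, 0)  # absent cells default to 0

print(occurrence(140395))  # 1 ("dogs")
print(occurrence(2))       # 0 ("aargh" never appears)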

265
Q

sparse vector

A

A vector whose values are mostly zeroes. See also sparse feature.

266
Q

sparsity

A

The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that vector or matrix. For example, consider a 10x10 matrix in which 98 cells contain zero. The calculation of sparsity is as follows:
sparsity = 98/100 = 0.98
Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model weights.

267
Q

spatial pooling

A

See pooling. Pooling for vision applications is known more formally as spatial pooling.

268
Q

squared hinge loss

A

The square of the hinge loss. Squared hinge loss penalizes outliers more harshly than regular hinge loss.

269
Q

squared loss

A

The loss function used in linear regression. (Also known as L2 Loss.) This function calculates the square of the difference between a model’s predicted value for a labeled example and the actual value of the label. Due to squaring, this loss function amplifies the influence of bad predictions. That is, squared loss reacts more strongly to outliers than L1 loss.

270
Q

state

A

rl

In reinforcement learning, the parameter values that describe the current configuration of the environment, which the agent uses to choose an action.

271
Q

state-action value function

A

rl

Synonym for Q-function.

In reinforcement learning, the function that predicts the expected return from taking an action in a state and then following a given policy.

Q-function is also known as state-action value function.

272
Q

static model

A

A model that is trained offline.

273
Q

stationarity

A

A property of data in a dataset, in which the data distribution stays constant across one or more dimensions. Most commonly, that dimension is time, meaning that data exhibiting stationarity doesn’t change over time. For example, data that exhibits stationarity doesn’t change from September to December.

274
Q

step

A

A forward and backward evaluation of one batch.

275
Q

step size

A

Synonym for learning rate.

276
Q

stochastic gradient descent (SGD)

A

A gradient descent algorithm in which the batch size is one. In other words, SGD relies on a single example chosen uniformly at random from a dataset to calculate an estimate of the gradient at each step.

277
Q

stride

A

In a convolutional operation or pooling, the delta in each dimension of the next series of input slices. For example, the following animation demonstrates a (1,1) stride during a convolutional operation. Therefore, the next input slice starts one position to the right of the previous input slice. When the operation reaches the right edge, the next slice is all the way over to the left but one position down.

The preceding example demonstrates a two-dimensional stride. If the input matrix is three-dimensional, the stride would also be three-dimensional.

278
Q

structural risk minimization (SRM)

A

An algorithm that balances two goals:

The desire to build the most predictive model (for example, lowest loss).
The desire to keep the model as simple as possible (for example, strong regularization).

For example, a function that minimizes loss+regularization on the training set is a structural risk minimization algorithm.

For more information, see http://www.svms.org/srm/.

Contrast with empirical risk minimization.

279
Q

subsampling

A

See pooling.

Less formally, pooling is often called subsampling or downsampling.

280
Q

summary

A

TensorFlow

In TensorFlow, a value or set of values calculated at a particular step, usually used for tracking model metrics during training.

281
Q

supervised machine learning

A

Training a model from input data and its corresponding labels. Supervised machine learning is analogous to a student learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, the student can then provide answers to new (never-before-seen) questions on the same topic. Compare with unsupervised machine learning.

282
Q

synthetic feature

A

A feature not present among the input features, but created from one or more of them. Kinds of synthetic features include:

Bucketing a continuous feature into range bins.
Multiplying (or dividing) one feature value by other feature value(s) or by itself.
Creating a feature cross.

Features created by normalizing or scaling alone are not considered synthetic features.

283
Q

tabular Q-learning

A

rl

In reinforcement learning, implementing Q-learning by using a table to store the Q-functions for every combination of state and action.

284
Q

target

A

Synonym for label.

In supervised learning, the “answer” or “result” portion of an example. Each example in a labeled dataset consists of one or more features and a label. For instance, in a housing dataset, the features might include the number of bedrooms, the number of bathrooms, and the age of the house, while the label might be the house’s price. In a spam detection dataset, the features might include the subject line, the sender, and the email message itself, while the label would probably be either “spam” or “not spam.”

285
Q

target network

A

rl

In Deep Q-learning, a neural network that is a stable approximation of the main neural network, where the main neural network implements either a Q-function or a policy. Then, you can train the main network on the Q-values predicted by the target network. Therefore, you prevent the feedback loop that occurs when the main network trains on Q-values predicted by itself. By avoiding this feedback, training stability increases.

286
Q

temporal data

A

Data recorded at different points in time. For example, winter coat sales recorded for each day of the year would be temporal data.

287
Q

Tensor

A

TensorFlow

The primary data structure in TensorFlow programs. Tensors are N-dimensional (where N could be very large) data structures, most commonly scalars, vectors, or matrices. The elements of a Tensor can hold integer, floating-point, or string values.
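
For illustration, here is a minimal sketch of Tensors of rank 0, 1, and 2:

    import tensorflow as tf

    scalar = tf.constant(3.0)                 # shape ()
    vector = tf.constant([1.0, 2.0, 3.0])     # shape (3,)
    matrix = tf.constant([[1, 2], [3, 4]])    # shape (2, 2)
    print(matrix.shape, matrix.dtype)         # (2, 2) int32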

288
Q

TensorFlow

A

TensorFlow

A large-scale, distributed machine learning platform. The term also refers to the base API layer in the TensorFlow stack, which supports general computation on dataflow graphs.

Although TensorFlow is primarily used for machine learning, you may also use TensorFlow for non-ML tasks that require numerical computation using dataflow graphs.

289
Q

termination condition

A

rl

In reinforcement learning, the conditions that determine when an episode ends, such as when the agent reaches a certain state or exceeds a threshold number of state transitions. For example, in tic-tac-toe (also known as noughts and crosses), an episode terminates either when a player marks three consecutive spaces or when all spaces are marked.
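
For illustration, here is a minimal sketch of an episode loop, assuming a hypothetical environment whose step method reports whether a terminal state was reached; the step limit is also an assumption:

    max_steps = 200   # assumed threshold number of state transitions

    def run_episode(env, policy):
        state, done, steps = env.reset(), False, 0
        while not done and steps < max_steps:   # the termination conditions
            action = policy(state)
            state, reward, done = env.step(action)
            steps += 1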

290
Q

test set

A

The subset of the dataset that you use to test your model after the model has gone through initial vetting by the validation set.

Contrast with training set and validation set.

291
Q

time series analysis

A

A subfield of machine learning and statistics that analyzes temporal data. Many types of machine learning problems require time series analysis, including classification, clustering, forecasting, and anomaly detection. For example, you could use time series analysis to forecast the future sales of winter coats by month based on historical sales data.

292
Q

timestep

A

One “unrolled” cell within a recurrent neural network. For example, a recurrent neural network unrolled over three timesteps has three cells, labeled with the subscripts t-1, t, and t+1.

293
Q

tower

A

A component of a deep neural network that is itself a deep neural network without an output layer. Typically, each tower reads from an independent data source. Towers are independent until their output is combined in a final layer.

294
Q

training

A

The process of determining the ideal parameters comprising a model.

295
Q

training set

A

The subset of the dataset used to train a model.

Contrast with validation set and test set.

296
Q

trajectory

A
rl

In reinforcement learning, a sequence of tuples that represent a sequence of state transitions of the agent, where each tuple corresponds to the state, action, reward, and next state for a given state transition.
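
For illustration, here is a minimal sketch of one common way to represent a trajectory in code, as a list of transition tuples:

    from typing import Any, List, NamedTuple

    class Transition(NamedTuple):
        state: Any
        action: Any
        reward: float
        next_state: Any

    trajectory: List[Transition] = [
        Transition(state=0, action=1, reward=0.0, next_state=1),
        Transition(state=1, action=0, reward=1.0, next_state=2),
    ]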

297
Q

transfer learning

A

Transferring information from one machine learning task to another. For example, in multi-task learning, a single model solves multiple tasks, such as a deep model that has different output nodes for different tasks. Transfer learning might involve transferring knowledge from the solution of a simpler task to a more complex one, or involve transferring knowledge from a task where there is more data to one where there is less data.

Most machine learning systems solve a single task. Transfer learning is a baby step towards artificial intelligence in which a single program can solve multiple tasks.
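
For illustration, here is a minimal sketch of a common transfer-learning recipe in tf.keras: reuse an ImageNet-pretrained base as a frozen feature extractor and train only a new task-specific head (the architecture and task are assumptions for the example):

    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights="imagenet")
    base.trainable = False                  # keep the transferred knowledge fixed

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # new task-specific head
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")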

298
Q

translational invariance

A

In an image classification problem, an algorithm’s ability to successfully classify images even when the position of objects within the image changes. For example, the algorithm can still identify a dog, whether it is in the center of the frame or at the left end of the frame.

See also size invariance and rotational invariance.

299
Q

trigram

A

An N-gram in which N=3.

An ordered sequence of N words. For example, truly madly is a 2-gram. Because order is relevant, madly truly is a different 2-gram than truly madly.
N   Name(s) for this kind of N-gram   Examples
2   bigram or 2-gram                  to go, go to, eat lunch, eat dinner
3   trigram or 3-gram                 ate too much, three blind mice, the bell tolls
4   4-gram                            walk in the park, dust in the wind, the boy ate lentils

Many natural language understanding models rely on N-grams to predict the next word that the user will type or say. For example, suppose a user typed three blind. An NLU model based on trigrams would likely predict that the user will next type mice.

Contrast N-grams with bag of words, which are unordered sets of words.
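
For illustration, here is a minimal sketch of extracting and counting trigrams (the sample sentence is an assumption for the example):

    from collections import Counter

    words = "the bell tolls for thee and the bell tolls again".split()
    trigrams = Counter(zip(words, words[1:], words[2:]))
    print(trigrams.most_common(1))   # [(('the', 'bell', 'tolls'), 2)]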

300
Q

true negative (TN)

A

An example in which the model correctly predicted the negative class. For example, the model inferred that a particular email message was not spam, and that email message really was not spam.

301
Q

true positive (TP)

A

An example in which the model correctly predicted the positive class. For example, the model inferred that a particular email message was spam, and that email message really was spam.

302
Q

true positive rate (TPR)

A

Synonym for recall. That is:
TPR = TP/(TP+FN)
True positive rate is the y-axis in an ROC curve.

303
Q

unawareness (to a sensitive attribute)

A

fairness

A situation in which sensitive attributes are present, but not included in the training data. Because sensitive attributes are often correlated with other attributes of one’s data, a model trained with unawareness about a sensitive attribute could still have disparate impact with respect to that attribute, or violate other fairness constraints.

304
Q

underfitting

A

Producing a model with poor predictive ability because the model hasn’t captured the complexity of the training data. Many problems can cause underfitting, including:

Training on the wrong set of features.
Training for too few epochs or at too low a learning rate.
Training with too high a regularization rate.
Providing too few hidden layers in a deep neural network.

305
Q

unlabeled example

A

An example that contains features but no label. Unlabeled examples are the input to inference. In semi-supervised and unsupervised learning, unlabeled examples are used during training.

306
Q

unsupervised machine learning

A

Training a model to find patterns in a dataset, typically an unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs together based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can be helpful in domains where true labels are hard to obtain. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Another example of unsupervised machine learning is principal component analysis (PCA). For example, applying PCA on a dataset containing the contents of millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.

Compare with supervised machine learning.
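
For illustration, here is a minimal sketch of clustering unlabeled examples with scikit-learn; note that no labels are involved in training (the toy points and the cluster count are assumptions for the example):

    from sklearn.cluster import KMeans

    X = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]]   # unlabeled examples
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)   # e.g., [0 0 1 1]: two groups of similar examples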

307
Q

upweighting

A

Applying a weight to the downsampled class equal to the factor by which you downsampled.

308
Q

user matrix

A

In recommendation systems, an embedding generated by matrix factorization that holds latent signals about user preferences. Each row of the user matrix holds information about the relative strength of various latent signals for a single user. For example, consider a movie recommendation system. In this system, the latent signals in the user matrix might represent each user’s interest in particular genres, or might be harder-to-interpret signals that involve complex interactions across multiple factors.

The user matrix has a column for each latent feature and a row for each user. That is, the user matrix has the same number of rows as the target matrix that is being factorized. For example, given a movie recommendation system for 1,000,000 users, the user matrix will have 1,000,000 rows.

309
Q

validation

A

A process used, as part of training, to evaluate the quality of a machine learning model using the validation set. Because the validation set is disjoint from the training set, validation helps ensure that the model’s performance generalizes beyond the training set.

Contrast with test set.

310
Q

validation set

A

A subset of the dataset—disjoint from the training set—used in validation.

Contrast with training set and test set.

311
Q

vanishing gradient problem

A

The tendency for the gradients of early hidden layers of some deep neural networks to become surprisingly flat (low). Increasingly lower gradients result in increasingly smaller changes to the weights on nodes in a deep neural network, leading to little or no learning. Models suffering from the vanishing gradient problem become difficult or impossible to train. Long Short-Term Memory cells address this issue.

Compare to exploding gradient problem.
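
For illustration, here is a minimal sketch of why the problem arises: backpropagation multiplies the gradient by a per-layer factor, and factors below 1 shrink it exponentially (the depth and the per-layer factor are assumptions for the example):

    grad = 1.0
    for _ in range(50):
        grad *= 0.25   # 0.25 is the maximum slope of the sigmoid function
    print(grad)        # ~7.9e-31: early layers receive almost no gradient signal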

312
Q

Wasserstein loss

A

One of the loss functions commonly used in generative adversarial networks, based on the earth-mover’s distance between the distribution of generated data and real data.

Wasserstein loss is the default loss function in TF-GAN.
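
For illustration, here is a minimal sketch of the two losses, where the critic outputs unbounded scores rather than probabilities (the toy scores are assumptions for the example):

    import numpy as np

    real_scores = np.array([1.2, 0.8, 1.0])    # critic scores on real examples
    fake_scores = np.array([-0.5, 0.1, -0.2])  # critic scores on generated examples

    critic_loss = fake_scores.mean() - real_scores.mean()   # critic minimizes this
    generator_loss = -fake_scores.mean()                    # generator minimizes this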

313
Q

weight

A

A coefficient for a feature in a linear model, or an edge in a deep network. The goal of training a linear model is to determine the ideal weight for each feature. If a weight is 0, then its corresponding feature does not contribute to the model.

314
Q

Weighted Alternating Least Squares (WALS)

A

An algorithm for minimizing the objective function during matrix factorization in recommendation systems, which allows a downweighting of the missing examples. WALS minimizes the weighted squared error between the original matrix and the reconstruction by alternating between fixing the row factorization and the column factorization. Each of these optimizations can be solved by least squares convex optimization. For details, see the Recommendation Systems course.
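
For illustration, here is a minimal sketch of weighted alternating least squares on a tiny ratings matrix, where zeros mark missing entries and the small weight on missing entries is an assumption for the example:

    import numpy as np

    rng = np.random.default_rng(0)
    R = np.array([[5.0, 3.0, 0.0],
                  [4.0, 0.0, 1.0],
                  [0.0, 1.0, 5.0]])
    W = np.where(R > 0, 1.0, 0.1)          # downweight the missing entries
    k = 2                                  # number of latent features
    U = rng.normal(size=(R.shape[0], k))   # user factors
    V = rng.normal(size=(R.shape[1], k))   # item factors

    def solve_rows(X, fixed, W, R):
        # With `fixed` held constant, each row is a weighted least-squares problem.
        for i in range(X.shape[0]):
            Wi = np.diag(W[i])
            A = fixed.T @ Wi @ fixed + 1e-6 * np.eye(k)   # tiny ridge for stability
            b = fixed.T @ Wi @ R[i]
            X[i] = np.linalg.solve(A, b)

    for _ in range(20):                 # alternate between the two factorizations
        solve_rows(U, V, W, R)          # fix the item factors, solve for users
        solve_rows(V, U, W.T, R.T)      # fix the user factors, solve for items
    print(np.round(U @ V.T, 2))         # weighted reconstruction of R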

315
Q

wide model

A

A linear model that typically has many sparse input features. We refer to it as “wide” since such a model is a special type of neural network with a large number of inputs that connect directly to the output node. Wide models are often easier to debug and inspect than deep models. Although wide models cannot express nonlinearities through hidden layers, they can use transformations such as feature crossing and bucketization to model nonlinearities in different ways.

Contrast with deep model.

316
Q

width

A

The number of neurons in a particular layer of a neural network.