Machine Learning Flashcards

Terms and concepts related to machine learning

1
Q

ablation

A

A technique for evaluating the importance of a feature or component by temporarily removing it from a model. You then retrain the model without that feature or component, and if the retrained model performs significantly worse, then the removed feature or component was likely important.

For example, suppose you train a classification model on 10 features and achieve 88% precision on the test set. To check the importance of the first feature, you can retrain the model using only the nine other features. If the retrained model performs significantly worse (for instance, 55% precision), then the removed feature was probably important. Conversely, if the retrained model performs equally well, then that feature was probably not that important.

Ablation can also help determine the importance of:

  • Larger components, such as an entire subsystem of a larger ML system
  • Processes or techniques, such as a data preprocessing step

In both cases, you would observe how the system’s performance changes (or doesn’t change) after you’ve removed the component.
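
A minimal sketch of this procedure using scikit-learn on a synthetic 10-feature dataset (the dataset, model, and choice of feature are all illustrative assumptions):

  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import precision_score
  from sklearn.model_selection import train_test_split

  # Synthetic stand-in for a real dataset with 10 features.
  X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Baseline: train on all 10 features.
  baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
  baseline_precision = precision_score(y_test, baseline.predict(X_test))

  # Ablation: retrain without the first feature and compare precision.
  ablated = LogisticRegression(max_iter=1000).fit(X_train[:, 1:], y_train)
  ablated_precision = precision_score(y_test, ablated.predict(X_test[:, 1:]))

  print(baseline_precision, ablated_precision)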

2
Q

A/B testing

A

A statistical way of comparing two (or more) techniques—the A and the B. Typically, the A is an existing technique, and the B is a new technique. A/B testing not only determines which technique performs better but also whether the difference is statistically significant.

A/B testing usually compares a single metric on two techniques; for example, how does model accuracy compare for two techniques? However, A/B testing can also compare any finite number of metrics.
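
A hedged sketch of the underlying statistics using SciPy’s two-sample t-test (the per-user metric samples below are made-up numbers):

  from scipy import stats

  # Hypothetical per-user metric values under technique A and technique B.
  a = [0.81, 0.79, 0.83, 0.80, 0.78, 0.82]
  b = [0.84, 0.86, 0.83, 0.85, 0.87, 0.84]

  # A small p-value suggests the difference is statistically significant.
  t_statistic, p_value = stats.ttest_ind(a, b)
  print(t_statistic, p_value)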

3
Q

accelerator chip

A

A category of specialized hardware components designed to perform key computations needed for deep learning algorithms.

Accelerator chips (or just accelerators, for short) can significantly increase the speed and efficiency of training and inference tasks compared to a general-purpose CPU. They are ideal for training neural networks and similar computationally intensive tasks.

Examples of accelerator chips include:

  • Google’s Tensor Processing Units (TPUs) with dedicated hardware for deep learning.
  • NVIDIA’s GPUs, which were initially designed for graphics processing but are built for parallel processing, which can significantly increase processing speed.
4
Q

accuracy

A

The number of correct classification predictions divided by the total number of predictions. That is:

Accuracy = correct predictions ÷ (correct predictions + incorrect predictions)

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

Accuracy = 40 ÷ (40 + 10) = 80%

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions. So, the accuracy formula for binary classification is as follows:

Accuracy = (TP + TN) ÷ (TP + TN + FP + FN)

where:

  • TP is the number of true positives (correct predictions).
  • TN is the number of true negatives (correct predictions).
  • FP is the number of false positives (incorrect predictions).
  • FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall.
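
The formula translates directly into code; the counts below are illustrative (25 TP + 15 TN = 40 correct, 6 FP + 4 FN = 10 incorrect):

  def accuracy(tp, tn, fp, fn):
      """Fraction of correct predictions in binary classification."""
      return (tp + tn) / (tp + tn + fp + fn)

  print(accuracy(tp=25, tn=15, fp=6, fn=4))  # 0.8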

5
Q

action

A

In reinforcement learning, the mechanism by which the agent transitions between states of the environment. The agent chooses the action by using a policy.

6
Q

activation function

A

A function that enables neural networks to learn nonlinear (complex) relationships between features and the label. The plots of activation functions are never single straight lines. Popular activation functions include:
  • ReLU
  • Sigmoid

7
Q

active learning

A

A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.

8
Q

AdaGrad

A

A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate. AdaGrad was one of the first algorithms to use adaptive learning rates and set the stage for further development in this area.

9
Q

agent

A

In reinforcement learning, the entity that uses a policy to maximize the expected return gained from transitioning between states of the environment.

10
Q

agglomerative clustering

A

A form of hierarchical clustering. Agglomerative clustering first assigns every example to its own cluster and then iteratively merges the closest clusters to create a hierarchical tree.
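
A minimal sketch using scikit-learn’s AgglomerativeClustering (the six 2-D points are made up for illustration):

  from sklearn.cluster import AgglomerativeClustering

  # Two obvious groups; the algorithm starts from six singleton clusters
  # and merges the closest pairs until two clusters remain.
  X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
  print(AgglomerativeClustering(n_clusters=2).fit_predict(X))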

11
Q

anomaly detection

A

The process of identifying outliers. For example, if the mean for a certain feature is 100 with a standard deviation of 10, then anomaly detection should flag a value of 200 as suspicious.
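
The example above as a tiny z-score check (the mean, standard deviation, and threshold are assumptions for illustration):

  def is_anomalous(value, mean=100.0, std=10.0, z_threshold=3.0):
      """Flag a value whose z-score exceeds the threshold."""
      return abs(value - mean) / std > z_threshold

  print(is_anomalous(200))  # True: 200 is 10 standard deviations from the mean
  print(is_anomalous(105))  # False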

12
Q

artificial general intelligence

A

A non-human mechanism that demonstrates a broad range of problem solving, creativity, and adaptability. For example, a program demonstrating artificial general intelligence could translate text, compose symphonies, and excel at games that have not yet been invented.

13
Q

artificial intelligence

A

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.

Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

14
Q

attention

A

A mechanism used in a neural network that indicates the importance of a particular word or part of a word. Attention compresses the amount of information a model needs to predict the next token/word. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.

Refer also to self-attention and multi-head self-attention, which are the building blocks of Transformers.

15
Q

attribute

A

Synonym for feature.

In machine learning fairness, attributes often refer to characteristics pertaining to individuals.

16
Q

attribute sampling

A

A tactic for training a decision forest in which each decision tree considers only a random subset of possible features when learning the condition. Generally, a different subset of features is sampled for each node. In contrast, when training a decision tree without attribute sampling, all possible features are considered for each node.

17
Q

AUC (Area under the ROC curve)

A

Area under the receiver operating characteristic curve. A number between 0.0 and 1.0 representing a binary classification model’s ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model’s ability to separate classes from each other. AUC ignores any value you set for classification threshold. Instead, AUC considers all possible classification thresholds.

18
Q

augmented reality

A

A technology that superimposes a computer-generated image on a user’s view of the real world, thus providing a composite view.

19
Q

autoencoder

A

A system that learns to extract the most important information from the input. Autoencoders are a combination of an encoder and decoder. Autoencoders rely on the following two-step process:

  1. The encoder maps the input to a (typically) lossy lower-dimensional (intermediate) format.
  2. The decoder builds a lossy version of the original input by mapping the lower-dimensional format to the original higher-dimensional input format.

Autoencoders are trained end-to-end by having the decoder attempt to reconstruct the original input from the encoder’s intermediate format as closely as possible. Because the intermediate format is smaller (lower-dimensional) than the original format, the autoencoder is forced to learn what information in the input is essential, and the output won’t be perfectly identical to the input.

For example:

  • If the input data is a graphic, the non-exact copy would be similar to the original graphic, but somewhat modified. Perhaps the non-exact copy removes noise from the original graphic or fills in some missing pixels.
  • If the input data is text, an autoencoder would generate new text that mimics (but is not identical to) the original text.

See also variational autoencoders.

20
Q

automation bias

A

When a human decision maker favors recommendations made by an automated decision-making system over information made without automation, even when the automated decision-making system makes errors.

21
Q

AutoML

A

Any automated process for building machine learning models. AutoML can automatically do tasks such as the following:

  • Search for the most appropriate model.
  • Tune hyperparameters.
  • Prepare data (including performing feature engineering).
  • Deploy the resulting model.

AutoML is useful for data scientists because it can save them time and effort in developing machine learning pipelines and improve prediction accuracy. It is also useful to non-experts, by making complicated machine learning tasks more accessible to them.

22
Q

auto-regressive model

A

A model that infers a prediction based on its own previous predictions. For example, auto-regressive language models predict the next token based on the previously predicted tokens. All Transformer-based large language models are auto-regressive.

In contrast, GAN-based image models are usually not auto-regressive because they generate an image in a single forward pass rather than iteratively in steps. However, certain image generation models are auto-regressive because they generate an image in steps.

23
Q

auxiliary loss

A

A loss function—used in conjunction with a neural network model’s main loss function—that helps accelerate training during the early iterations when weights are randomly initialized.

Auxiliary loss functions push effective gradients to the earlier layers. This facilitates convergence during training by combating the vanishing gradient problem.

24
Q

average precision

A

A metric for summarizing the performance of a ranked sequence of results. Average precision is calculated by taking the average of the precision values for each relevant result (each result in the ranked list where the recall increases relative to the previous result).

See also Area under the PR Curve.

25
Q

axis-aligned condition

A

In a decision tree, a condition that involves only a single feature.

26
Q

backpropagation

A

The algorithm that implements gradient descent in neural networks.

Training a neural network involves many iterations of the following two-pass cycle:

  1. During the forward pass, the system processes a batch of examples to yield prediction(s). The system compares each prediction to each label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
  2. During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layer(s).

Neural networks often contain many neurons across many hidden layers. Each of those neurons contributes to the overall loss in different ways. Backpropagation determines whether to increase or decrease the weights applied to particular neurons.

The learning rate is a multiplier that controls the degree to which each backward pass increases or decreases each weight. A large learning rate will increase or decrease each weight more than a small learning rate.

In calculus terms, backpropagation implements calculus’ chain rule. That is, backpropagation calculates the partial derivative of the error with respect to each parameter. For more details, see this tutorial in Machine Learning Crash Course.

Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like TensorFlow now implement backpropagation for you. Phew!

27
Q

bagging

A

A method to train an ensemble where each constituent model trains on a random subset of training examples sampled with replacement. For example, a random forest is a collection of decision trees trained with bagging.

The term bagging is short for bootstrap aggregating.
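
A rough sketch of bootstrap aggregating with decision trees (the synthetic dataset, ensemble size, and majority-vote aggregation are illustrative assumptions):

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.tree import DecisionTreeClassifier

  X, y = make_classification(n_samples=500, random_state=0)
  rng = np.random.default_rng(0)

  ensemble = []
  for _ in range(10):
      # Sample training examples with replacement (the "bootstrap" part).
      idx = rng.integers(0, len(X), size=len(X))
      ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

  # Aggregate by majority vote (the "aggregating" part).
  votes = np.mean([tree.predict(X) for tree in ensemble], axis=0)
  predictions = (votes >= 0.5).astype(int)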

28
Q

bag of words

A

A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents the following three phrases identically:

  • the dog jumps
  • jumps the dog
  • dog jumps the

Each word is mapped to an index in a sparse vector, where the vector has an index for every word in the vocabulary. For example, the phrase the dog jumps is mapped into a feature vector with non-zero values at the three indices corresponding to the words the, dog, and jumps. The non-zero value can be any of the following:

  • A 1 to indicate the presence of a word.
  • A count of the number of times a word appears in the bag. For example, if the phrase were the maroon dog is a dog with maroon fur, then both maroon and dog would be represented as 2, while the other words would be represented as 1.
  • Some other value, such as the logarithm of the count of the number of times a word appears in the bag.
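
A minimal count-based sketch over a toy vocabulary (the vocabulary itself is an assumption for illustration):

  from collections import Counter

  vocab = ["the", "dog", "jumps", "maroon", "is", "a", "with", "fur"]

  def bag_of_words(phrase):
      """Map a phrase to a count vector over vocab; word order is ignored."""
      counts = Counter(phrase.split())
      return [counts[word] for word in vocab]

  # All three orderings produce the identical vector.
  print(bag_of_words("the dog jumps"))
  print(bag_of_words("jumps the dog"))
  print(bag_of_words("the maroon dog is a dog with maroon fur"))
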
29
Q

baseline

A

A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model.

For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.

30
Q

batch

A

The set of examples used in one training iteration. The batch size determines the number of examples in a batch.

See epoch for an explanation of how a batch relates to an epoch.

31
Q

batch inference

A

The process of inferring predictions on multiple unlabeled examples divided into smaller subsets (“batches”).

Batch inference can leverage the parallelization features of accelerator chips. That is, multiple accelerators can simultaneously infer predictions on different batches of unlabeled examples, dramatically increasing the number of inferences per second.

32
Q

batch normalization

A

Normalizing the input or output of the activation functions in a hidden layer. Batch normalization can provide the following benefits:

  • Make neural networks more stable by protecting against outlier weights.
  • Enable higher learning rates, which can speed training.
  • Reduce overfitting.
33
Q

batch size

A

The number of examples in a batch. For instance, if the batch size is 100, then the model processes 100 examples per iteration.

The following are popular batch size strategies:

  • Stochastic Gradient Descent (SGD), in which the batch size is 1.
  • full batch, in which the batch size is the number of examples in the entire training set. For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
  • mini-batch, in which the batch size is usually between 10 and 1,000. Mini-batch is usually the most efficient strategy.
34
Q

Bayesian neural network

A

A probabilistic neural network that accounts for uncertainty in weights and outputs. A standard neural network regression model typically predicts a scalar value; for example, a standard model predicts a house price of 853,000. In contrast, a Bayesian neural network predicts a distribution of values; for example, a Bayesian model predicts a house price of 853,000 with a standard deviation of 67,200.

A Bayesian neural network relies on Bayes’ Theorem to calculate uncertainties in weights and predictions. A Bayesian neural network can be useful when it is important to quantify uncertainty, such as in models related to pharmaceuticals. Bayesian neural networks can also help prevent overfitting.

35
Q

Bayesian optimization

A

A probabilistic regression model technique for optimizing computationally expensive objective functions by instead optimizing a surrogate that quantifies the uncertainty via a Bayesian learning technique. Since Bayesian optimization is itself very expensive, it is usually used to optimize expensive-to-evaluate tasks that have a small number of parameters, such as selecting hyperparameters.

36
Q

Bellman equation

A

A necessary condition for optimality associated with the mathematical optimization method known as dynamic programming. It writes the “value” of a decision problem at a certain point in time in terms of the payoff from some initial choices and the “value” of the remaining decision problem that results from those initial choices. This breaks a dynamic optimization problem into a sequence of simpler subproblems, as Bellman’s “principle of optimality” prescribes. The equation applies to algebraic structures with a total ordering; for algebraic structures with a partial ordering, the generic Bellman’s equation can be used.

37
Q

BERT (Bidirectional Encoder Representations from Transformers)

A

A model architecture for text representation. A trained BERT model can act as part of a larger model for text classification or other ML tasks.

BERT has the following characteristics:

  • Uses the Transformer architecture, and therefore relies on self-attention.
  • Uses the encoder part of the Transformer. The encoder’s job is to produce good text representations, rather than to perform a specific task like classification.
  • Is bidirectional.
  • Uses masking for unsupervised training.

BERT’s variants include:

  • ALBERT, which is an acronym for A Lite BERT.
  • LaBSE.
38
Q

bias (ethics/fairness)

A
  1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:
  • automation bias
  • confirmation bias
  • experimenter’s bias
  • group attribution bias
  • implicit bias
  • in-group bias
  • out-group homogeneity bias
  2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:
  • coverage bias
  • non-response bias
  • participation bias
  • reporting bias
  • sampling bias
  • selection bias

Not to be confused with the bias term in machine learning models or prediction bias.

39
Q

bias (math) or bias term

A

An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either of the following:

  • b
  • w0

For example, bias is the b in the following formula:
y’ = b + w1x1 + w2x2 + … + wnxn

In a simple two-dimensional line, bias just means “y-intercept.”

Bias exists because not all models start from the origin (0,0). For example, suppose an amusement park costs 2 Euros to enter and an additional 0.5 Euro for every hour a customer stays. Therefore, a model mapping the total cost has a bias of 2 because the lowest cost is 2 Euros.
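
The amusement-park example as a one-feature model (the function and parameter names are hypothetical):

  def predicted_cost(hours, bias=2.0, hourly_rate=0.5):
      """y' = b + w1*x1: a 2-euro entry fee plus 0.5 euros per hour."""
      return bias + hourly_rate * hours

  print(predicted_cost(0))  # 2.0 -- the lowest possible cost is the bias
  print(predicted_cost(3))  # 3.5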

40
Q

bidirectional

A

A term used to describe a system that evaluates the text that both precedes and follows a target section of text. In contrast, a unidirectional system only evaluates the text that precedes a target section of text.

For example, consider a masked language model that must determine probabilities for the word or words representing the underline in the following question:

What is the _____ with you?

A unidirectional language model would have to base its probabilities only on the context provided by the words “What”, “is”, and “the”. In contrast, a bidirectional language model could also gain context from “with” and “you”, which might help the model generate better predictions.

41
Q

bidirectional language model

A

A language model that determines the probability that a given token is present at a given location in an excerpt of text based on the preceding and following text.

42
Q

bigram

A

An N-gram (ordered sequence of N words) in which N=2.

43
Q

binary classification

A

A type of classification task that predicts one of two mutually exclusive classes:

  • the positive class
  • the negative class

For example, the following two machine learning models each perform binary classification:

  • A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
  • A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn’t have that disease (the negative class).

Contrast with multi-class classification.

44
Q

binary condition

A

In a decision tree, a condition that has only two possible outcomes, typically yes or no.

45
Q

binning

A

Synonym for bucketing. Converting a single feature into multiple binary features called buckets or bins, typically based on a value range. The chopped feature is typically a continuous feature.

46
Q

BLEU (Bilingual Evaluation Understudy)

A

A score between 0.0 and 1.0, inclusive, indicating the quality of a translation between two human languages (for example, between English and Russian). A BLEU score of 1.0 indicates a perfect translation; a BLEU score of 0.0 indicates a terrible translation.

47
Q

boosting

A

A machine learning technique that iteratively combines a set of simple and not very accurate classifiers (referred to as “weak” classifiers) into a classifier with high accuracy (a “strong” classifier) by upweighting the examples that the model is currently misclassifying.

48
Q

bounding box

A

In an image, the (x, y) coordinates of a rectangle around an area of interest, such as identifying the portion of an image containing a dog.

49
Q

broadcasting

A

Expanding the shape of an operand in a matrix math operation to dimensions compatible for that operation. For instance, linear algebra requires that the two operands in a matrix addition operation must have the same dimensions. Consequently, you can’t add a matrix of shape (m, n) to a vector of length n. Broadcasting enables this operation by virtually expanding the vector of length n to a matrix of shape (m, n) by replicating the same values down each column.
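
A short NumPy illustration of the (m, n) + length-n case described above:

  import numpy as np

  matrix = np.zeros((3, 2))        # shape (m, n) = (3, 2)
  vector = np.array([10.0, 20.0])  # length n = 2

  # NumPy virtually expands the vector to shape (3, 2), repeating it for
  # each row, so the element-wise addition is well defined.
  print(matrix + vector)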

50
Q

bucketing

A

Converting a single feature into multiple binary features called buckets or bins, typically based on a value range. The chopped feature is typically a continuous feature.

For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:

  • <= 10 degrees Celsius would be the “cold” bucket.
  • 11–24 degrees Celsius would be the “temperate” bucket.
  • >= 25 degrees Celsius would be the “warm” bucket.

The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.
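
A minimal sketch of the temperature bucketing above (the bucket names and boundaries follow the example):

  def temperature_bucket(celsius):
      """Map a continuous temperature to one of three discrete buckets."""
      if celsius <= 10:
          return "cold"
      elif celsius < 25:
          return "temperate"
      else:
          return "warm"

  print(temperature_bucket(13), temperature_bucket(22))  # temperate temperate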

51
Q

calibration layer

A

A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and probabilities should match the distribution of an observed set of labels.

52
Q

candidate generation

A

The initial set of recommendations chosen by a recommendation system. For example, consider a bookstore that offers 100,000 titles. The candidate generation phase creates a much smaller list of suitable books for a particular user, say 500. But even 500 books is way too many to recommend to a user. Subsequent, more expensive, phases of a recommendation system (such as scoring and re-ranking) reduce those 500 to a much smaller, more useful set of recommendations.

53
Q

candidate sampling

A

A training-time optimization that calculates a probability for all the positive labels (using, for example, softmax) but only for a random sample of negative labels. For instance, given an example labeled beagle and dog, candidate sampling computes the predicted probabilities and corresponding loss terms for:

  • beagle
  • dog
  • a random subset of the remaining negative classes (for example, cat, lollipop, fence).

The idea is that the negative classes can learn from less frequent negative reinforcement as long as positive classes always get proper positive reinforcement, and this is indeed observed empirically.

Candidate sampling is more computationally efficient than training algorithms that compute predictions for all negative classes, particularly when the number of negative classes is very large.

54
Q

categorical data

A

Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state, which can only have one of the following three possible values:

  • red
  • yellow
  • green

By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red, green, and yellow on driver behavior.

Categorical features are sometimes called discrete features.

Contrast with numerical data.

55
Q

causal language model

A

Synonym for unidirectional language model. A language model that bases its probabilities only on the tokens appearing before, not after, the target token(s). Contrast with bidirectional language model.

56
Q

centroid

A

The center of a cluster as determined by a k-means or k-median algorithm. For instance, if k is 3, then the k-means or k-median algorithm finds 3 centroids.

57
Q

centroid-based clustering

A

A category of clustering algorithms that organizes data into nonhierarchical clusters. k-means is the most widely used centroid-based clustering algorithm.

Contrast with hierarchical clustering algorithms.

58
Q

chain-of-thought prompting

A

A prompt engineering technique that encourages a large language model (LLM) to explain its reasoning, step by step. For example, consider the following prompt, paying particular attention to the second sentence:

How many g forces would a driver experience in a car that goes from 0 to 60 miles per hour in 7 seconds? In the answer, show all relevant calculations.

The LLM’s response would likely:

  • Show a sequence of physics formulas, plugging in the values 0, 60, and 7 in appropriate places.
  • Explain why it chose those formulas and what the various variables mean.

Chain-of-thought prompting forces the LLM to perform all the calculations, which might lead to a more correct answer. In addition, chain-of-thought prompting enables the user to examine the LLM’s steps to determine whether or not the answer makes sense.

59
Q

checkpoint

A

Data that captures the state of a model’s parameters at a particular training iteration. Checkpoints enable exporting model weights or performing training across multiple sessions. Checkpoints also enable training to continue past errors (for example, job preemption).

When fine-tuning, the starting point for training the new model will be a specific checkpoint of the pre-trained model.

60
Q

class

A

A category that a label can belong to. For example:

  • In a binary classification model that detects spam, the two classes might be spam and not spam.
  • In a multi-class classification model that identifies dog breeds, the classes might be poodle, beagle, pug, and so on.

A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.

61
Q

classification model

A

A model whose prediction is a class. For example, the following are all classification models:

  • A model that predicts an input sentence’s language (French? Spanish? Italian?).
  • A model that predicts tree species (Maple? Oak? Baobab?).
  • A model that predicts the positive or negative class for a particular medical condition.

In contrast, regression models predict numbers rather than classes.

Two common types of classification models are:

  • binary classification
  • multi-class classification
62
Q

classification threshold

A

In binary classification, a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class. Note that the classification threshold is a value that a human chooses, not a value chosen by model training.

A logistic regression model outputs a raw value between 0 and 1. Then:

  • If this raw value is greater than the classification threshold, then the positive class is predicted.
  • If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.

The choice of classification threshold strongly influences the number of false positives and false negatives.
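
The thresholding logic itself is a one-liner; the 0.8 threshold follows the example above, and the function name is hypothetical:

  def classify(raw_value, threshold=0.8):
      """Convert a logistic regression output into a class prediction."""
      return "positive" if raw_value > threshold else "negative"

  print(classify(0.9))  # positive
  print(classify(0.7))  # negative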

63
Q

class-imbalanced dataset

A

A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:

  • 1,000,000 negative labels
  • 10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is not class-imbalanced because the ratio of negative labels to positive labels is relatively close to 1:

  • 517 negative labels
  • 483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:

  • 1,000,000 labels with class “green”
  • 200 labels with class “purple”
  • 350 labels with class “orange”

See also entropy, majority class, and minority class.

64
Q

clipping

A

A technique for handling outliers by doing either or both of the following:

  • Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
  • Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

  • Clip all values over 60 (the maximum threshold) to be exactly 60.
  • Clip all values under 40 (the minimum threshold) to be exactly 40.

Outliers can damage models, sometimes causing weights to overflow during training. Some outliers can also dramatically spoil metrics like accuracy. Clipping is a common technique to limit the damage.

Gradient clipping forces gradient values within a designated range during training.
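
A short NumPy sketch of clipping to the 40–60 range from the example (the sample values are made up):

  import numpy as np

  values = np.array([35.0, 50.0, 58.0, 72.0, 41.0])

  # Values below 40 become 40; values above 60 become 60.
  print(np.clip(values, 40.0, 60.0))  # [40. 50. 58. 60. 41.]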

65
Q

Cloud TPU

A

A specialized hardware accelerator designed to speed up machine learning workloads on Google Cloud Platform.

66
Q

clustering

A

Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.

Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid, while other algorithms cluster based on other measures of distance between examples.

67
Q

co-adaptation

A

When neurons predict patterns in training data by relying almost exclusively on outputs of specific other neurons instead of relying on the network’s behavior as a whole. When the patterns that cause co-adaptation are not present in validation data, then co-adaptation causes overfitting. Dropout regularization reduces co-adaptation because dropout ensures neurons cannot rely solely on specific other neurons.

68
Q

collaborative filtering

A

Making predictions about the interests of one user based on the interests of many other users. Collaborative filtering is often used in recommendation systems.

69
Q

condition

A

In a decision tree, any node that evaluates an expression. A condition is also called a split or a test.

Contrast condition with leaf.

70
Q

configuration

A

The process of assigning the initial property values used to train a model, including:

  • the model’s composing layers
  • the location of the data
  • hyperparameters such as:
    - learning rate
    - iterations
    - optimizer
    - loss function

In machine learning projects, configuration can be done through a special configuration file or via configuration libraries such as the following:

  • HParam
  • Gin
  • Fiddle
71
Q

confirmation bias

A

The tendency to search for, interpret, favor, and recall information in a way that confirms one’s preexisting beliefs or hypotheses. Machine learning developers may inadvertently collect or label data in ways that influence an outcome supporting their existing beliefs. Confirmation bias is a form of implicit bias.

Experimenter’s bias is a form of confirmation bias in which an experimenter continues training models until a preexisting hypothesis is confirmed.

72
Q

confusion matrix

A

An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. The confusion matrix for a multi-class classification problem can help you identify patterns of mistakes. Confusion matrices contain sufficient information to calculate a variety of performance metrics, including precision and recall.

73
Q

constituency parsing

A

Dividing a sentence into smaller grammatical structures (“constituents”). A later part of the ML system, such as a natural language understanding model, can parse the constituents more easily than the original sentence. For example, consider the following sentence:

My friend adopted two cats.

A constituency parser can divide this sentence into the following two constituents:

  • My friend is a noun phrase.
  • adopted two cats is a verb phrase.

These constituents can be further subdivided into smaller constituents. For example, the verb phrase

adopted two cats

could be further subdivided into:

  • adopted is a verb.
  • two cats is another noun phrase.
74
Q

continuous feature

A

A floating-point feature with an infinite range of possible values, such as temperature or weight.

Contrast with discrete feature.

75
Q

convenience sampling

A

Using a dataset not gathered scientifically in order to run quick experiments. Later on, it’s essential to switch to a scientifically gathered dataset.

76
Q

convergence

A

A state reached when loss values change very little or not at all with each iteration. A model converges when additional training won’t improve the model.

In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending. During a long period of constant loss values, you may temporarily get a false sense of convergence.

See also early stopping.

77
Q

convex function

A

A function in which the region above the graph of the function is a convex set. The prototypical convex function is shaped something like the letter U. A strictly convex function has exactly one local minimum point, which is also the global minimum point. The classic U-shaped functions are strictly convex functions. However, some convex functions (for example, straight lines) are not U-shaped.

78
Q

convex optimization

A

The process of using mathematical techniques such as gradient descent to find the minimum of a convex function. A great deal of research in machine learning has focused on formulating various problems as convex optimization problems and in solving those problems more efficiently.

79
Q

convex set

A

A subset of Euclidean space such that a line drawn between any two points in the subset remains completely within the subset. For example, a square or a circle is a convex set; a star or a horseshoe is not.

80
Q

convolution

A

In mathematics, casually speaking, a mixture of two functions. In machine learning, a convolution mixes the convolutional filter and the input matrix in order to train weights.

The term “convolution” in machine learning is often a shorthand way of referring to either convolutional operation or convolutional layer.

Without convolutions, a machine learning algorithm would have to learn a separate weight for every cell in a large tensor. For example, a machine learning algorithm training on 2K x 2K images would be forced to find 4M separate weights. Thanks to convolutions, a machine learning algorithm only has to find weights for every cell in the convolutional filter, dramatically reducing the memory needed to train the model. When the convolutional filter is applied, it is simply replicated across cells such that each is multiplied by the filter.

81
Q

convolutional filter

A

One of the two actors in a convolutional operation. (The other actor is a slice of an input matrix.) A convolutional filter is a matrix having the same rank as the input matrix, but a smaller shape. For example, given a 28x28 input matrix, the filter could be any 2D matrix smaller than 28x28.

In photographic manipulation, all the cells in a convolutional filter are typically set to a constant pattern of ones and zeroes. In machine learning, convolutional filters are typically seeded with random numbers and then the network trains the ideal values.

82
Q

convolutional layer

A

A layer of a deep neural network in which a convolutional filter passes along an input matrix. For example, consider the following 3x3 convolutional filter:

0 1 0
1 0 1
0 1 0

When applied to a 5x5 input matrix, this filter produces a 3x3 output matrix by performing the convolutional operation on each 3x3 slice of the input matrix.
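
A minimal NumPy sketch of exactly this case, sliding the 3x3 filter over a toy 5x5 input (the input values are arbitrary):

  import numpy as np

  conv_filter = np.array([[0, 1, 0],
                          [1, 0, 1],
                          [0, 1, 0]])
  image = np.arange(25).reshape(5, 5)  # toy 5x5 input matrix

  # Each output cell is one convolutional operation: element-wise
  # multiplication of the filter with a 3x3 slice, then summation.
  output = np.zeros((3, 3))
  for i in range(3):
      for j in range(3):
          output[i, j] = np.sum(image[i:i + 3, j:j + 3] * conv_filter)
  print(output)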

83
Q

convolutional neural network

A

A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers:

  • convolutional layers
  • pooling layers
  • dense layers

Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.

84
Q

convolutional operation

A

The following two-step mathematical operation:

  1. Element-wise multiplication of the convolutional filter and a slice of an input matrix. (The slice of the input matrix has the same rank and size as the convolutional filter.)
  2. Summation of all the values in the resulting product matrix.

A convolutional layer consists of a series of convolutional operations, each acting on a different slice of the input matrix.

85
Q

cost

A

Synonym for loss. During the training of a supervised model, a measure of how far a model’s prediction is from its label.

A loss function calculates the loss.

86
Q

co-training

A

A semi-supervised learning approach particularly useful when all of the following conditions are true:

  • The ratio of unlabeled examples to labeled examples in the dataset is high.
  • This is a classification problem (binary or multi-class).
  • The dataset contains two different sets of predictive features that are independent of each other and complementary.

Co-training essentially amplifies independent signals into a stronger signal. For instance, consider a classification model that categorizes individual used cars as either Good or Bad. One set of predictive features might focus on aggregate characteristics such as the year, make, and model of the car; another set of predictive features might focus on the previous owner’s driving record and the car’s maintenance history.

87
Q

counterfactual fairness

A

A fairness metric that checks whether a classifier produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes. Evaluating a classifier for counterfactual fairness is one method for surfacing potential sources of bias in a model.

88
Q

coverage bias

A

A form of selection bias, in which the population represented in the dataset doesn’t match the population that the machine learning model is making predictions about.

89
Q

crash blossom

A

A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in natural language understanding. For example, the headline Red Tape Holds Up Skyscraper is a crash blossom because an NLU model could interpret the headline literally or figuratively.

90
Q

critic

A

Synonym for Deep Q-Network: in Q-learning, a deep neural network that predicts Q-functions.

91
Q

cross-entropy

A

A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.
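
A small NumPy sketch of cross-entropy between a one-hot label and a predicted distribution (the numbers are made up):

  import numpy as np

  def cross_entropy(true_dist, predicted_dist):
      """H(p, q) = -sum(p * log(q)) over the classes."""
      p, q = np.asarray(true_dist), np.asarray(predicted_dist)
      return -np.sum(p * np.log(q))

  # One-hot label versus predicted class probabilities.
  print(cross_entropy([0, 1, 0], [0.1, 0.7, 0.2]))  # ~0.357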

92
Q

cross-validation

A

A mechanism for estimating how well a model would generalize to new data by testing the model against one or more non-overlapping data subsets withheld from the training set.

93
Q

data analysis

A

Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be particularly useful when a dataset is first received, before one builds the first model. It is also crucial in understanding experiments and debugging problems with the system.

94
Q

data augmentation

A

Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your dataset doesn’t contain enough image examples for the model to learn useful associations. Ideally, you’d add enough labeled images to your dataset to enable your model to train properly. If that’s not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.

95
Q

DataFrame

A

A popular pandas datatype for representing datasets in memory.

A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number.

A DataFrame itself is structured like a 2D array, except that each column can be assigned its own data type.
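
A minimal pandas example (the column names and values are made up):

  import pandas as pd

  # Each column has a name and its own data type; rows get unique numbers.
  df = pd.DataFrame({
      "city": ["Paris", "Oslo", "Lima"],
      "temperature": [18.5, 4.0, 22.1],
      "is_capital": [True, True, True],
  })
  print(df.dtypes)
  print(df.loc[1])  # the row identified by index 1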

96
Q

data parallelism

A

A way of scaling training or inference that replicates an entire model onto multiple devices and then passes a subset of the input data to each device. Data parallelism can enable training and inference on very large batch sizes; however, data parallelism requires that the model be small enough to fit on all devices.

Data parallelism typically speeds training and inference.

See also model parallelism.

97
Q

data set or dataset

A

A collection of raw data, commonly (but not exclusively) organized in one of the following formats:

  • a spreadsheet
  • a file in CSV (comma-separated values) format
98
Q

Dataset API (tf.data)

A

A high-level TensorFlow API for reading data and transforming it into a form that a machine learning algorithm requires. A tf.data.Dataset object represents a sequence of elements, in which each element contains one or more Tensors. A tf.data.Iterator object provides access to the elements of a Dataset.
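
A short TensorFlow 2-style sketch (the values and transformations are illustrative; TF2’s eager iteration replaces explicit Iterator objects):

  import tensorflow as tf

  # Build a Dataset, transform it, and batch it.
  ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])
  ds = ds.map(lambda x: x * 2).batch(2)

  for batch in ds:  # [2 4], [6 8], [10 12]
      print(batch.numpy())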

99
Q

decision boundary

A

The separator between classes learned by a model in binary-class or multi-class classification problems. For a visual example, consider a 2D scatter plot representing a binary classification problem: the decision boundary would be the line separating the two classes (clusters of data points).

100
Q

decision forest

A

A model created from multiple decision trees. A decision forest makes a prediction by aggregating the predictions of its decision trees. Popular types of decision forests include random forests and gradient boosted trees.

101
Q

decision tree

A

A supervised learning model composed of a set of conditions and leaves organized hierarchically.

102
Q

decoder

A

In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation.

Decoders are often a component of a larger model, where they are frequently paired with an encoder.

In sequence-to-sequence tasks, a decoder starts with the internal state generated by the encoder to predict the next sequence.

Refer to Transformer for the definition of a decoder within the Transformer architecture.

103
Q

deep model

A

A neural network containing more than one hidden layer.

A deep model is also called a deep neural network.

Contrast with wide model.

104
Q

Deep Q-Network (DQN)

A

In Q-learning, a deep neural network that predicts Q-functions.

Critic is a synonym for Deep Q-Network.

105
Q

demographic parity

A

A fairness metric that is satisfied if the results of a model’s classification are not dependent on a given sensitive attribute.

For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other.

Contrast with equalized odds and equality of opportunity, which permit classification results in aggregate to depend on sensitive attributes, but do not permit classification results for certain specified ground-truth labels to depend on sensitive attributes. See “Attacking discrimination with smarter machine learning” for a visualization exploring the tradeoffs when optimizing for demographic parity.

106
Q

dense feature

A

A feature in which most or all values are nonzero, typically a Tensor of floating-point values. Contrast with a sparse feature.

107
Q

denoising

A

A common approach to self-supervised learning in which:

  1. Noise is artificially added to the dataset.
  2. The model tries to remove the noise.

Denoising enables learning from unlabeled examples. The original dataset serves as the target or label and the noisy data as the input.

Some masked language models use denoising as follows:

  1. Noise is artificially added to an unlabeled sentence by masking some of the tokens.
  2. The model tries to predict the original tokens.
108
Q

dense layer

A

Synonym for fully connected layer. A hidden layer in which each node is connected to every node in the subsequent hidden layer.

109
Q

depth

A

The sum of the following in a neural network:

  • the number of hidden layers
  • the number of output layers, which is typically 1
  • the number of any embedding layers

For example, a neural network with five hidden layers and one output layer has a depth of 6.

Notice that the input layer doesn’t influence depth.

110
Q

depthwise separable convolutional neural network (sepCNN)

A

An advanced form of convolutional neural network (CNN) architecture that optimizes computational efficiency by separating the convolution process into depthwise and pointwise convolutions. This makes sepCNNs particularly suitable for applications where computational resources are limited.

A depthwise separable convolution (also abbreviated as separable convolution) factors a standard 3-D convolution into two separate convolution operations that are more computationally efficient: first, a depthwise convolution, with a depth of 1 (n ✕ n ✕ 1), and then second, a pointwise convolution, with length and width of 1 (1 ✕ 1 ✕ n).

111
Q

derived label

A

Synonym for proxy label: data used to approximate labels not directly available in a dataset (for example, the number of people carrying umbrellas as a proxy for whether or not it is raining).

112
Q

device

A

An overloaded term with the following two possible definitions:

  1. A category of hardware that can run a TensorFlow session, including CPUs, GPUs, and TPUs.
  2. When training an ML model on accelerator chips (GPUs or TPUs), the part of the system that actually manipulates tensors and embeddings. The device runs on accelerator chips. In contrast, the host typically runs on a CPU.
113
Q

differential privacy

A

An anonymization approach to privacy that protects an individual’s personal information that might be included in a model’s training set. This approach ensures that the model doesn’t infer much about a specific individual. Differential privacy injects noise during training to obscure individual data points.

Differential privacy is also used outside of machine learning. For example, data scientists sometimes use differential privacy to protect individual privacy when computing product usage statistics for different demographics.

114
Q

dimension reduction

A

Decreasing the number of dimensions used to represent a particular feature in a feature vector, typically by converting to an embedding vector.

115
Q

dimensions

A

Overloaded term having any of the following definitions:

  1. The number of levels of coordinates in a Tensor. For example:
    - A scalar has zero dimensions; for example, “Hello”.
    - A vector has one dimension; for example, [3, 5, 7, 11].
    - A matrix has two dimensions; for example, [[2, 4, 18], [5, 7, 14]].

You can uniquely specify a particular cell in a one-dimensional vector with one coordinate; you need two coordinates to uniquely specify a particular cell in a two-dimensional matrix.

  2. The number of entries in a feature vector.
  3. The number of elements in an embedding layer.
116
Q

direct prompting

A

Synonym for zero-shot prompting, a prompt that does not provide an example of how you want the large language model to respond.

117
Q

discrete feature

A

A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature.

Contrast with continuous feature.

118
Q

discriminative model

A

A model that predicts labels from a set of one or more features. More formally, discriminative models define the conditional probability of an output given the features and weights; that is:

p(output | features, weights)

For example, a model that predicts whether an email is spam from features and weights is a discriminative model.

The vast majority of supervised learning models, including classification and regression models, are discriminative models.

Contrast with generative model.

119
Q

discriminator

A

A system that determines whether examples are real or fake.

Alternatively, the subsystem within a generative adversarial network that determines whether the examples created by the generator are real or fake.

120
Q

disparate impact

A

Making decisions about people that impact different population subgroups disproportionately. This usually refers to situations where an algorithmic decision-making process harms or benefits some subgroups more than others.

For example, suppose an algorithm that determines a Lilliputian’s eligibility for a miniature-home loan is more likely to classify them as “ineligible” if their mailing address contains a certain postal code. If Big-Endian Lilliputians are more likely to have mailing addresses with this postal code than Little-Endian Lilliputians, then this algorithm may result in disparate impact.

Contrast with disparate treatment, which focuses on disparities that result when subgroup characteristics are explicit inputs to an algorithmic decision-making process.

120
Q

disparate treatment

A

Factoring subjects’ sensitive attributes into an algorithmic decision-making process such that different subgroups of people are treated differently.

For example, consider an algorithm that determines Lilliputians’ eligibility for a miniature-home loan based on the data they provide in their loan application. If the algorithm uses a Lilliputian’s affiliation as Big-Endian or Little-Endian as an input, it is enacting disparate treatment along that dimension.

Contrast with disparate impact, which focuses on disparities in the societal impacts of algorithmic decisions on subgroups, irrespective of whether those subgroups are inputs to the model.

121
Q

distillation

A

The process of reducing the size of one model (known as the teacher) into a smaller model (known as the student) that emulates the original model’s predictions as faithfully as possible. Distillation is useful because the smaller model has two key benefits over the larger model (the teacher):

  • Faster inference time
  • Reduced memory and energy usage

However, the student’s predictions are typically not as good as the teacher’s predictions.

Distillation trains the student model to minimize a loss function based on the difference between the predictions of the student and teacher models.

Compare and contrast distillation with the following terms:

  • fine-tuning
  • prompt-based learning
122
Q

downsampling

A

Overloaded term that can mean either of the following:

  1. Reducing the amount of information in a feature in order to train a model more efficiently. For example, before training an image recognition model, downsampling high-resolution images to a lower-resolution format.
  2. Training on a disproportionately low percentage of over-represented class examples in order to improve model training on under-represented classes. For example, in a class-imbalanced dataset, models tend to learn a lot about the majority class and not enough about the minority class. Downsampling helps balance the amount of training on the majority and minority classes.
123
Q

dropout regularization

A

A form of regularization useful in training neural networks. Dropout regularization removes a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks. For full details, see Dropout: A Simple Way to Prevent Neural Networks from Overfitting.

124
Q

dynamic

A

Something done frequently or continuously. The terms dynamic and online are synonyms in machine learning. The following are common uses of dynamic and online in machine learning:

  • A dynamic model (or online model) is a model that is retrained frequently or continuously.
  • Dynamic training (or online training) is the process of training frequently or continuously.
  • Dynamic inference (or online inference) is the process of generating predictions on demand.
125
Q

dynamic model

A

A model that is frequently (maybe even continuously) retrained. A dynamic model is a “lifelong learner” that constantly adapts to evolving data. A dynamic model is also known as an online model.

Contrast with static model.

126
Q

eager execution

A

A TensorFlow programming environment in which operations run immediately. In contrast, operations called in graph execution don’t run until they are explicitly evaluated. Eager execution is an imperative interface, much like the code in most programming languages. Eager execution programs are generally far easier to debug than graph execution programs.

127
Q

early stopping

A

A method for regularization that involves ending training before training loss finishes decreasing. In early stopping, you intentionally stop training the model when the loss on a validation dataset starts to increase; that is, when generalization performance worsens.

Early stopping may seem counterintuitive. After all, telling a model to halt training while the loss is still decreasing may seem like telling a chef to stop cooking before the dessert has fully baked. However, training a model for too long can lead to overfitting. That is, if you train a model too long, the model may fit the training data so closely that the model doesn’t make good predictions on new examples.

128
Q

earth mover’s distance (EMD)

A

A measure of the relative similarity between two documents. The lower the earth mover’s distance, the more similar the documents.

129
Q

edit distance

A

A measurement of how similar two text strings are to each other. In machine learning, edit distance is useful because it is simple and easy to compute, and an effective way to compare two strings that are known to be similar or to find strings that are similar to a given string.

There are several definitions of edit distance, each using different string operations. For example, the Levenshtein distance considers the fewest delete, insert, and substitute operations.

For example, the Levenshtein distance between the words “heart” and “darts” is 3 because the following 3 edits are the fewest changes to turn one word into the other:

  1. heart → deart (substitute “h” with “d”)
  2. deart → dart (delete “e”)
  3. dart → darts (insert “s”)
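
As an illustrative sketch (not part of the original card), the Levenshtein distance can be computed with dynamic programming; the function name and structure below are assumptions:

def levenshtein(a: str, b: str) -> int:
    # previous[j] holds the edit distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete a character from a
                current[j - 1] + 1,            # insert a character into a
                previous[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("heart", "darts"))  # 3
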
130
Q

Einsum notation

A

An efficient notation for describing how two tensors are to be combined. The tensors are combined by multiplying the elements of one tensor by the elements of the other tensor and then summing the products. Einsum notation uses symbols to identify the axes of each tensor, and those same symbols are rearranged to specify the shape of the new resulting tensor.

NumPy provides a common Einsum implementation.
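
For instance, here is a minimal NumPy sketch (with illustrative values) that uses einsum to multiply two matrices, summing products along the shared axis j:

import numpy as np

A = np.arange(6).reshape(2, 3)   # axes labeled i, j
B = np.arange(12).reshape(3, 4)  # axes labeled j, k

# Multiply along the shared axis j and sum the products; the output
# keeps axes i and k, so the result has shape (2, 4).
C = np.einsum("ij,jk->ik", A, B)
assert np.array_equal(C, A @ B)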

131
Q

embedding layer

A

A special hidden layer that trains on a high-dimensional categorical feature to gradually learn a lower dimension embedding vector. An embedding layer enables a neural network to train far more efficiently than training just on the high-dimensional categorical feature.

For example, Earth currently supports about 73,000 tree species. Suppose tree species is a feature in your model, so your model’s input layer includes a one-hot vector 73,000 elements long. A 73,000-element array is very long. If you don’t add an embedding layer to the model, training is going to be very time consuming due to multiplying 72,999 zeros. Perhaps you pick the embedding layer to consist of 12 dimensions. Consequently, the embedding layer will gradually learn a new embedding vector for each tree species.
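
A minimal Keras sketch of this example (the layer sizes mirror the card; the surrounding model is an assumption). Note that the Embedding layer takes an integer species ID rather than a 73,000-element one-hot vector, which is exactly how it avoids multiplying all those zeros:

import tensorflow as tf

model = tf.keras.Sequential([
    # Input: an integer species ID in [0, 73000).
    tf.keras.layers.Embedding(input_dim=73000, output_dim=12),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),  # illustrative downstream layer
])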

In certain situations, hashing is a reasonable alternative to an embedding layer.

132
Q

embedding space

A

The d-dimensional vector space that features from a higher-dimensional vector space are mapped to. Ideally, the embedding space contains a structure that yields meaningful mathematical results; for example, in an ideal embedding space, addition and subtraction of embeddings can solve word analogy tasks.

The dot product of two embeddings is a measure of their similarity.

133
Q

embedding vector

A

Broadly speaking, an array of floating-point numbers taken from any hidden layer that describe the inputs to that hidden layer. Often, an embedding vector is the array of floating-point numbers trained in an embedding layer.

An embedding vector is not a bunch of random numbers. An embedding layer determines these values through training, similar to the way a neural network learns other weights during training. Continuing the tree-species example from the embedding layer card, each element of the array is a rating along some characteristic of a tree species. Which element represents which characteristic? That’s very hard for humans to determine.

The mathematically remarkable part of an embedding vector is that similar items have similar sets of floating-point numbers. For example, similar tree species have a more similar set of floating-point numbers than dissimilar tree species. Redwoods and sequoias are related tree species, so they’ll have a more similar set of floating-point numbers than redwoods and coconut palms. The numbers in the embedding vector will change each time you retrain the model, even if you retrain the model with identical input.

134
Q

empirical risk minimization (ERM)

A

Choosing the function that minimizes loss on the training set. Contrast with structural risk minimization.

135
Q

encoder

A

In general, any ML system that converts from a raw, sparse, or external representation into a more processed, denser, or more internal representation.

Encoders are often a component of a larger model, where they are frequently paired with a decoder. Some Transformers pair encoders with decoders, though other Transformers use only the encoder or only the decoder.

Some systems use the encoder’s output as the input to a classification or regression network.

In sequence-to-sequence tasks, an encoder takes an input sequence and returns an internal state (a vector). Then, the decoder uses that internal state to predict the next sequence.

136
Q

ensemble

A

A collection of models trained independently whose predictions are averaged or aggregated. In many cases, an ensemble produces better predictions than a single model. For example, a random forest is an ensemble built from multiple decision trees. Note that not all ensembles are decision forests.

137
Q

entropy

A

In information theory, a description of how unpredictable a probability distribution is. Alternatively, entropy is also defined as how much information each example contains. A distribution has the highest possible entropy when all values of a random variable are equally likely.

The entropy of a set with two possible values “0” and “1” (for example, the labels in a binary classification problem) has the following formula:

H = -p log(p) - q log(q) = -p log(p) - (1-p) log(1-p)

where:

  • H is the entropy.
  • p is the fraction of “1” examples.
  • q is the fraction of “0” examples. Note that q = (1 - p)
  • log is generally log2. In this case, the entropy unit is a bit.

For example, suppose the following:

  • 100 examples contain the value “1”
  • 300 examples contain the value “0”

Therefore, the entropy value is:

p = 0.25
q = 0.75
H = -0.25 log2(0.25) - 0.75 log2(0.75) = 0.81 bits per example

A set that is perfectly balanced (for example, 200 “0”s and 200 “1”s) would have an entropy of 1.0 bit per example. As a set becomes more imbalanced, its entropy moves towards 0.0.
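
The following Python sketch (illustrative, not part of the original card) reproduces the calculation above:

import math

def binary_entropy(p: float) -> float:
    # Entropy (in bits) of a two-valued set where p is the fraction of "1"s.
    if p in (0.0, 1.0):
        return 0.0  # a pure set is perfectly predictable
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)

print(binary_entropy(0.25))  # ~0.811 bits per example
print(binary_entropy(0.5))   # 1.0 bit per example (perfectly balanced)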

In decision trees, entropy helps formulate information gain to help the splitter select the conditions during the growth of a classification decision tree.

Compare entropy with:

  • gini impurity
  • cross-entropy loss function

Entropy is often called Shannon’s entropy.

138
Q

environment

A

In reinforcement learning, the world that contains the agent and allows the agent to observe that world’s state. For example, the represented world can be a game like chess, or a physical world like a maze. When the agent applies an action to the environment, then the environment transitions between states.

139
Q

episode

A

In reinforcement learning, each of the repeated attempts by the agent to learn an environment.

140
Q

epoch

A

A full training pass over the entire training set such that each example has been processed once.

An epoch represents N/batch size training iterations, where N is the total number of examples.

For instance, suppose the following:

  • The dataset consists of 1,000 examples.
  • The batch size is 50 examples.

Therefore, a single epoch requires 20 iterations:
1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations

141
Q

epsilon greedy policy

A

In reinforcement learning, a policy that either follows a random policy with epsilon probability or a greedy policy otherwise. For example, if epsilon is 0.9, then the policy follows a random policy 90% of the time and a greedy policy 10% of the time.

Over successive episodes, the algorithm reduces epsilon’s value in order to shift from following a random policy to following a greedy policy. By shifting the policy, the agent first randomly explores the environment and then greedily exploits the results of random exploration.

142
Q

equality of opportunity

A

A fairness metric that checks whether, for a preferred label (one that confers an advantage or benefit to a person) and a given attribute, a classifier predicts that preferred label equally well for all values of that attribute. In other words, equality of opportunity measures whether the people who should qualify for an opportunity are equally likely to do so regardless of their group membership.

For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians’ secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians’ secondary schools don’t offer math classes at all, and as a result, far fewer of their students are qualified. Equality of opportunity is satisfied for the preferred label of “admitted” with respect to nationality (Lilliputian or Brobdingnagian) if qualified students are equally likely to be admitted irrespective of whether they’re a Lilliputian or a Brobdingnagian.

143
Q

equalized odds

A

A fairness metric that checks if, for any particular label and attribute, a classifier predicts that label equally well for all values of that attribute.

For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians’ secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians’ secondary schools don’t offer math classes at all, and as a result, far fewer of their students are qualified. Equalized odds is satisfied provided that no matter whether an applicant is a Lilliputian or a Brobdingnagian, if they are qualified, they are equally as likely to get admitted to the program, and if they are not qualified, they are equally as likely to get rejected.

144
Q

example

A

The values of one row of features and possibly a label. Examples in supervised learning fall into two general categories:

  • A labeled example consists of one or more features and a label. Labeled examples are used during training.
  • An unlabeled example consists of one or more features but no label. Unlabeled examples are used during inference.

The row of a dataset is typically the raw source for an example. That is, an example typically consists of a subset of the columns in the dataset. Furthermore, the features in an example can also include synthetic features, such as feature crosses.

145
Q

experience replay

A

In reinforcement learning, a DQN technique used to reduce temporal correlations in training data. The agent stores state transitions in a replay buffer, and then samples transitions from the replay buffer to create training data.

146
Q

exploding gradient problem

A

The tendency for gradients in deep neural networks (especially recurrent neural networks) to become surprisingly steep (high). Steep gradients often cause very large updates to the weights of each node in a deep neural network.

Models suffering from the exploding gradient problem become difficult or impossible to train. Gradient clipping can mitigate this problem.

Compare to vanishing gradient problem.

147
Q

F_1

A

A “roll-up” binary classification metric that relies on both precision and recall, calculated by the formula:

F_1 = 2 * (precision * recall) ÷ (precision + recall)

When precision and recall are fairly similar, F_1 is close to their mean. When precision and recall differ significantly, F_1 is closer to the lower value.
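
A small Python sketch of the formula, using illustrative precision and recall values:

def f1(precision: float, recall: float) -> float:
    return 2 * (precision * recall) / (precision + recall)

print(f1(0.9, 0.8))  # ~0.847, close to the mean of the two
print(f1(0.9, 0.1))  # 0.18, pulled toward the lower value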

148
Q

fairness constraint

A

Applying a constraint to an algorithm to ensure one or more definitions of fairness are satisfied. Examples of fairness constraints include:

  • Post-processing your model’s output.
  • Altering the loss function to incorporate a penalty for violating a fairness metric.
  • Directly adding a mathematical constraint to an optimization problem.
149
Q

fairness metric

A

A mathematical definition of “fairness” that is measurable. Some commonly used fairness metrics include:

  • equalized odds
  • predictive parity
  • counterfactual fairness
  • demographic parity

Many fairness metrics are mutually exclusive; see incompatibility of fairness metrics.

150
Q

false negative (FN)

A

An example in which the model mistakenly predicts the negative class. For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam.

151
Q

false negative rate

A

The proportion of actual positive examples for which the model mistakenly predicted the negative class. The following formula calculates the false negative rate:

false negative rate = false negatives ÷ (false negatives + true positives)

152
Q

false positive (FP)

A

An example in which the model mistakenly predicts the positive class. For example, the model predicts that a particular email message is spam (the positive class), but that email message is actually not spam.

153
Q

false positive rate (FPR)

A

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

false positive rate = false positives ÷ (false positives + true negatives)

The false positive rate is the x-axis in an ROC curve.

154
Q

feature

A

An input variable to a machine learning model. An example consists of one or more features. For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. The features might be temperature, precipitation, and wind speed, and the label might be test scores.

155
Q

feature cross

A

A synthetic feature formed by “crossing” categorical or bucketed features.

For example, consider a “mood forecasting” model that represents temperature in one of the following four buckets:

  • freezing
  • chilly
  • temperate
  • warm

And represents wind speed in one of the following three buckets:

  • still
  • light
  • windy

Without feature crosses, the linear model trains independently on each of the preceding seven buckets. So, the model trains on, for instance, freezing independently of its training on, for instance, windy.

Alternatively, you could create a feature cross of temperature and wind speed. This synthetic feature would have the following 12 possible values:

  • freezing-still
  • freezing-light
  • freezing-windy
  • chilly-still
  • chilly-light
  • chilly-windy
  • temperate-still
  • temperate-light
  • temperate-windy
  • warm-still
  • warm-light
  • warm-windy

Thanks to feature crosses, the model can learn mood differences between a freezing-windy day and a freezing-still day.

If you create a synthetic feature from two features that each have a lot of different buckets, the resulting feature cross will have a huge number of possible combinations. For example, if one feature has 1,000 buckets and the other feature has 2,000 buckets, the resulting feature cross has 2,000,000 buckets.

Formally, a cross is a Cartesian product.
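
A minimal Python sketch of that Cartesian product, using the buckets from this card (the variable names are illustrative):

from itertools import product

temperature = ["freezing", "chilly", "temperate", "warm"]
wind_speed = ["still", "light", "windy"]

# The feature cross is the Cartesian product of the two bucketed
# features: 4 x 3 = 12 synthetic feature values.
cross = [f"{t}-{w}" for t, w in product(temperature, wind_speed)]
print(len(cross))  # 12
print(cross[:3])   # ['freezing-still', 'freezing-light', 'freezing-windy']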

Feature crosses are mostly used with linear models and are rarely used with neural networks.

156
Q

feature engineering

A

A process that involves the following steps:

  1. Determining which features might be useful in training a model.
  2. Converting raw data from the dataset into efficient versions of those features.

For example, you might determine that temperature might be a useful feature. Then, you might experiment with bucketing to optimize what the model can learn from different temperature ranges.

Feature engineering is sometimes called feature extraction.

157
Q

feature extraction

A

Overloaded term having either of the following definitions:

  1. Retrieving intermediate feature representations calculated by an unsupervised or pretrained model (for example, hidden layer values in a neural network) for use in another model as input.
  2. Synonym for feature engineering.
158
Q

feature importances

A

Synonym for variable importances, a set of scores that indicates the relative importance of each feature to the model.

159
Q

feature set

A

The group of features your machine learning model trains on. For example, postal code, property size, and property condition might comprise a simple feature set for a model that predicts housing prices.

160
Q

feature spec

A

Describes the information required to extract features data from the tf.Example protocol buffer. Because the tf.Example protocol buffer is just a container for data, you must specify the following:

  • the data to extract (that is, the keys for the features)
  • the data type (for example, float or int)
  • the length (fixed or variable)
161
Q

feature vector

A

The array of feature values comprising an example. The feature vector is input during training and during inference.

Feature engineering determines how to represent features in the feature vector. For example, a categorical feature with five possible values might be represented with one-hot encoding. In this case, the portion of the feature vector for a particular example would consist of four zeroes and a single 1.

162
Q

federated learning

A

A distributed machine learning approach that trains machine learning models using decentralized examples residing on devices such as smartphones. In federated learning, a subset of devices downloads the current model from a central coordinating server. The devices use the examples stored on the devices to make improvements to the model. The devices then upload the model improvements (but not the training examples) to the coordinating server, where they are aggregated with other updates to yield an improved global model. After the aggregation, the model updates computed by devices are no longer needed, and can be discarded.

Since the training examples are never uploaded, federated learning follows the privacy principles of focused data collection and data minimization.

163
Q

feedback loop

A

In machine learning, a situation in which a model’s predictions influence the training data for the same model or another model. For example, a model that recommends movies will influence the movies that people see, which will then influence subsequent movie recommendation models.

164
Q

feedforward neural network (FFN)

A

A neural network without cyclic or recursive connections. For example, traditional deep neural networks are feedforward neural networks. Contrast with recurrent neural networks, which are cyclic.

165
Q

few-shot learning

A

A machine learning approach, often used for object classification, designed to train effective classifiers from only a small number of training examples.

See also one-shot learning and zero-shot learning.

166
Q

few-shot prompting

A

A prompt that contains more than one (a “few”) example demonstrating how the large language model should respond.

Few-shot prompting generally produces more desirable results than zero-shot prompting and one-shot prompting. However, few-shot prompting requires a lengthier prompt.

Few-shot prompting is a form of few-shot learning applied to prompt-based learning.

167
Q

Fiddle

A

A Python-first configuration library that sets the values of functions and classes without invasive code or infrastructure. In the case of Pax—and other ML codebases—these functions and classes represent models and training hyperparameters.

Fiddle assumes that machine learning codebases are typically divided into:

  • Library code, which defines the layers and optimizers.
  • Dataset “glue” code, which calls the libraries and wires everything together.

Fiddle captures the call structure of the glue code in an unevaluated and mutable form.

168
Q

fine-tuning

A

A second, task-specific training pass performed on a pre-trained model to refine its parameters for a specific use case. For example, the full training sequence for some large language models is as follows:

  1. Pre-training: Train a large language model on a vast general dataset, such as all the English language Wikipedia pages.
  2. Fine-tuning: Train the pre-trained model to perform a specific task, such as responding to medical queries. Fine-tuning typically involves hundreds or thousands of examples focused on the specific task.

As another example, the full training sequence for a large image model is as follows:

  1. Pre-training: Train a large image model on a vast general image dataset, such as all the images in Wikimedia commons.
  2. Fine-tuning: Train the pre-trained model to perform a specific task, such as generating images of orcas.

Fine-tuning can entail any combination of the following strategies:

  • Modifying all of the pre-trained model’s existing parameters. This is sometimes called full fine-tuning.
  • Modifying only some of the pre-trained model’s existing parameters (typically, the layers closest to the output layer), while keeping other existing parameters unchanged (typically, the layers closest to the input layer). See parameter-efficient tuning.
  • Adding more layers, typically on top of the existing layers closest to the output layer.

Fine-tuning is a form of transfer learning. As such, fine-tuning might use a different loss function or a different model type than those used to train the pre-trained model. For example, you could fine-tune a pre-trained large image model to produce a regression model that returns the number of birds in an input image.

Compare and contrast fine-tuning with the following terms:

  • distillation
  • prompt-based learning
169
Q

Flax

A

A high-performance open-source library for deep learning built on top of JAX. Flax provides functions for training neural networks, as well as methods for evaluating their performance.

170
Q

Flaxformer

A

An open-source Transformer library, built on Flax, designed primarily for natural language processing and multimodal research.

171
Q

forget gate

A

The portion of a Long Short-Term Memory cell that regulates the flow of information through the cell. Forget gates maintain context by deciding which information to discard from the cell state.

172
Q

fully connected layer

A

A hidden layer in which each node is connected to every node in the subsequent hidden layer.

A fully connected layer is also known as a dense layer.

173
Q

generalization

A

A model’s ability to make correct predictions on new, previously unseen data. A model that can generalize is the opposite of a model that is overfitting.

You train a model on the examples in the training set. Consequently, the model learns the peculiarities of the data in the training set. Generalization essentially asks whether your model can make good predictions on examples that are not in the training set.

To encourage generalization, regularization helps a model train less exactly to the peculiarities of the data in the training set.

174
Q

generalization curve

A

A plot of both training loss and validation loss as a function of the number of iterations.

A generalization curve can help you detect possible overfitting. For example, a generalization curve where the validation loss and training loss diverge, with validation loss increasing and training loss continuing to decrease, suggests overfitting.

175
Q

generalized linear model

A

A generalization of least squares regression models, which are based on Gaussian noise, to other types of models based on other types of noise, such as Poisson noise or categorical noise. Examples of generalized linear models include:

  • logistic regression
  • multi-class regression
  • least squares regression

The parameters of a generalized linear model can be found through convex optimization.

Generalized linear models exhibit the following properties:

  • The average prediction of the optimal least squares regression model is equal to the average label on the training data.
  • The average probability predicted by the optimal logistic regression model is equal to the average label on the training data.

The power of a generalized linear model is limited by its features. Unlike a deep model, a generalized linear model cannot “learn new features.”

176
Q

generative adversarial network (GAN)

A

A system to create new data in which a generator creates data and a discriminator determines whether that created data is valid or invalid.

177
Q

generative AI

A

An emerging transformative field with no formal definition. That said, most experts agree that generative AI models can create (“generate”) content that is all of the following:

  • complex
  • coherent
  • original

For example, a generative AI model can create sophisticated essays or images.

Some earlier technologies, including LSTMs and RNNs, can also generate original and coherent content. Some experts view these earlier technologies as generative AI, while others feel that true generative AI requires more complex output than those earlier technologies can produce.

Contrast with predictive ML.

178
Q

generative model

A

Practically speaking, a model that does either of the following:

  1. Creates (generates) new examples from the training dataset. For example, a generative model could create poetry after training on a dataset of poems. The generator part of a generative adversarial network falls into this category.
  2. Determines the probability that a new example comes from the training set, or was created from the same mechanism that created the training set. For example, after training on a dataset consisting of English sentences, a generative model could determine the probability that new input is a valid English sentence.

A generative model can theoretically discern the distribution of examples or particular features in a dataset.

Unsupervised learning models are generative.

Contrast with discriminative models.

179
Q

generator

A

The subsystem within a generative adversarial network that creates new examples.

Contrast with discriminator.

180
Q

gini impurity

A

A metric similar to entropy. Splitters use values derived from either gini impurity or entropy to compose conditions for classification decision trees. Information gain is derived from entropy. There is no universally accepted equivalent term for the metric derived from gini impurity; however, this unnamed metric is just as important as information gain.

Gini impurity is also called gini index, or simply gini.

Gini impurity is the probability of misclassifying a new piece of data taken from the same distribution. The gini impurity of a set with two possible values “0” and “1” (for example, the labels in a binary classification problem) is calculated from the following formula:

I = 1 - (p^2 + q^2) = 1 - (p^2 + (1-p)^2)

where:

  • I is the gini impurity.
  • p is the fraction of “1” examples.
  • q is the fraction of “0” examples. Note that q = 1-p

For example, consider the following dataset:

  • 100 labels (0.25 of the dataset) contain the value “1”
  • 300 labels (0.75 of the dataset) contain the value “0”

Therefore, the gini impurity is:

  • p = 0.25
  • q = 0.75
  • I = 1 - (0.25^2 + 0.75^2) = 0.375

Consequently, a random label from the same dataset would have a 37.5% chance of being misclassified, and a 62.5% chance of being properly classified.

A perfectly balanced label (for example, 200 “0”s and 200 “1”s) would have a gini impurity of 0.5. A highly imbalanced label would have a gini impurity close to 0.0.
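
The following Python sketch (illustrative) reproduces the calculation above:

def gini_impurity(p: float) -> float:
    # Gini impurity of a two-valued label set where p is the fraction of "1"s.
    q = 1.0 - p
    return 1.0 - (p**2 + q**2)

print(gini_impurity(0.25))  # 0.375
print(gini_impurity(0.5))   # 0.5 (perfectly balanced)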

181
Q

GPT (Generative Pre-trained Transformer)

A

A family of Transformer-based large language models developed by OpenAI.

GPT variants can apply to multiple modalities, including:

  • image generation (for example, ImageGPT)
  • text-to-image generation (for example, DALL-E).
182
Q

gradient

A

The vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent.

183
Q

gradient boosted (decision) trees (GBT)

A

A type of decision forest in which:

  • Training relies on gradient boosting.
  • The weak model is a decision tree.
184
Q

gradient boosting

A

A training algorithm where weak models are trained to iteratively improve the quality (reduce the loss) of a strong model. For example, a weak model could be a linear or small decision tree model. The strong model becomes the sum of all the previously trained weak models.

In the simplest form of gradient boosting, at each iteration, a weak model is trained to predict the loss gradient of the strong model. Then, the strong model’s output is updated by subtracting the predicted gradient, similar to gradient descent.

F_0 = 0
F_{i+1} = F_i - ξ * f_i

where:

  • F_0 is the starting strong model.
  • F_{i+1} is the next strong model.
  • F_i is the current strong model.
  • ξ is a value between 0.0 and 1.0 called shrinkage, which is analogous to the learning rate in gradient descent.
  • f_i is the weak model trained to predict the loss gradient of F_i.

Modern variations of gradient boosting also include the second derivative (Hessian) of the loss in their computation.

Decision trees are commonly used as weak models in gradient boosting. See gradient boosted (decision) trees.
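
Below is a minimal sketch of the simplest form described above, assuming squared loss (whose gradient with respect to the prediction is prediction - label) and shallow scikit-learn decision trees as the weak models; all names and hyperparameter values are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, num_iterations=50, shrinkage=0.1):
    strong_prediction = np.zeros_like(y, dtype=float)  # F_0 = 0
    weak_models = []
    for _ in range(num_iterations):
        gradient = strong_prediction - y  # dL/dF for squared loss
        weak = DecisionTreeRegressor(max_depth=2).fit(X, gradient)
        # F_{i+1} = F_i - shrinkage * f_i
        strong_prediction -= shrinkage * weak.predict(X)
        weak_models.append(weak)
    return weak_models

def predict(weak_models, X, shrinkage=0.1):
    prediction = np.zeros(len(X))
    for weak in weak_models:
        prediction -= shrinkage * weak.predict(X)
    return prediction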

185
Q

gradient clipping

A

A commonly used mechanism to mitigate the exploding gradient problem by artificially limiting (clipping) the maximum value of gradients when using gradient descent to train a model.

186
Q

gradient descent

A

A mathematical technique to minimize loss. Gradient descent iteratively adjusts weights and biases, gradually finding the best combination to minimize loss.
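
A minimal sketch (with illustrative names and toy data) of gradient descent fitting a one-feature linear model with squared loss:

def gradient_descent(xs, ys, learning_rate=0.05, steps=200):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of mean squared loss with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w  # step against the gradient
        b -= learning_rate * grad_b
    return w, b

print(gradient_descent([1, 2, 3, 4], [3, 5, 7, 9]))  # approaches (2.0, 1.0)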

Gradient descent is older—much, much older—than machine learning.

187
Q

graph

A

In TensorFlow, a computation specification. Nodes in the graph represent operations. Edges are directed and represent passing the result of an operation (a Tensor) as an operand to another operation. Use TensorBoard to visualize a graph.

188
Q

graph execution

A

A TensorFlow programming environment in which the program first constructs a graph and then executes all or part of that graph. Graph execution is the default execution mode in TensorFlow 1.x.

Contrast with eager execution.

189
Q

greedy policy

A

In reinforcement learning, a policy that always chooses the action with the highest expected return.

190
Q

ground truth

A

Reality.

The thing that actually happened.

For example, consider a binary classification model that predicts whether a student in their first year of university will graduate within six years. Ground truth for this model is whether or not that student actually graduated within six years.

191
Q

group attribution bias

A

Assuming that what is true for an individual is also true for everyone in that group. The effects of group attribution bias can be exacerbated if convenience sampling is used for data collection. In a non-representative sample, attributions may be made that do not reflect reality.

See also out-group homogeneity bias and in-group bias.

192
Q

hallucination

A

The production of plausible-seeming but factually incorrect output by a generative AI model that purports to be making an assertion about the real world. For example, a generative AI model that claims that Barack Obama died in 1865 is hallucinating.

193
Q

hashing

A

In machine learning, a mechanism for bucketing categorical data, particularly when the number of categories is large, but the number of categories actually appearing in the dataset is comparatively small.

For example, Earth is home to about 73,000 tree species. You could represent each of the 73,000 tree species in 73,000 separate categorical buckets. Alternatively, if only 200 of those tree species actually appear in a dataset, you could use hashing to divide tree species into perhaps 500 buckets.

A single bucket could contain multiple tree species. For example, hashing could place baobab and red maple—two genetically dissimilar species—into the same bucket. Regardless, hashing is still a good way to map large categorical sets into the desired number of buckets. Hashing turns a categorical feature having a large number of possible values into a much smaller number of values by grouping values in a deterministic way.
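
A minimal Python sketch of the tree-species example (zlib.crc32 is used here only to get a deterministic hash; the 500-bucket count follows the card):

import zlib

def species_bucket(species: str, num_buckets: int = 500) -> int:
    # Deterministically map a category string to one of num_buckets buckets.
    return zlib.crc32(species.encode("utf-8")) % num_buckets

print(species_bucket("baobab"))
print(species_bucket("red maple"))  # may or may not share a bucket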

194
Q

heuristic

A

A simple and quickly implemented solution to a problem. For example, “With a heuristic, we achieved 86% accuracy. When we switched to a deep neural network, accuracy went up to 98%.”

195
Q

hidden layer

A

A layer in a neural network between the input layer (the features) and the output layer (the prediction). Each hidden layer consists of one or more neurons. A deep neural network contains more than one hidden layer.

196
Q

hierarchical clustering

A

A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:

  • Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.
  • Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.

Contrast with centroid-based clustering.

197
Q

hinge loss

A

A family of loss functions for classification designed to find the decision boundary as distant as possible from each training example, thus maximizing the margin between examples and the boundary. KSVMs use hinge loss (or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as follows:
loss = max(0, 1 - (y * y’))

where y is the true label, either -1 or +1, and y’ is the raw output of the classifier model:

y’ = b + w1x1 + w2x2 + … + wnxn

Consequently, a plot of hinge loss versus (y * y’) follows f(x) = max(0, 1 - x): a line falling with slope -1 until it reaches 0 at x = 1, and staying at 0 thereafter.
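
A NumPy sketch of that definition (the names are illustrative):

import numpy as np

def hinge_loss(y_true, y_raw):
    # y_true holds labels in {-1, +1}; y_raw is the classifier's raw output.
    return np.maximum(0.0, 1.0 - y_true * y_raw)

print(hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.3, -0.5])))
# [0.  0.7 0.5] -- loss is zero only when y * y' >= 1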

198
Q

holdout data

A

Examples intentionally not used (“held out”) during training. The validation dataset and test dataset are examples of holdout data. Holdout data helps evaluate your model’s ability to generalize to data other than the data it was trained on. The loss on the holdout set provides a better estimate of the loss on an unseen dataset than does the loss on the training set.

199
Q

host

A

When training an ML model on accelerator chips (GPUs or TPUs), the part of the system that controls both of the following:

  • The overall flow of the code.
  • The extraction and transformation of the input pipeline.

The host typically runs on a CPU, not on an accelerator chip; the device manipulates tensors on the accelerator chips.

200
Q

hyperparameter

A

The variables that you or a hyperparameter tuning service adjust during successive runs of training a model. For example, learning rate is a hyperparameter. You could set the learning rate to 0.01 before one training session. If you determine that 0.01 is too high, you could perhaps set the learning rate to 0.003 for the next training session.

In contrast, parameters are the weights and biases that the model learns during training.

201
Q

hyperplane

A

A boundary that separates a space into two subspaces. For example, a line is a hyperplane in two dimensions and a plane is a hyperplane in three dimensions. More typically in machine learning, a hyperplane is the boundary separating a high-dimensional space. Kernel Support Vector Machines use hyperplanes to separate positive classes from negative classes, often in a very high-dimensional space.

202
Q

image recognition

A

A process that classifies object(s), pattern(s), or concept(s) in an image. Image recognition is also known as image classification.

203
Q

implicit bias

A

Automatically making an association or assumption based on one’s mental models and memories. Implicit bias can affect the following:

  • How data is collected and classified.
  • How machine learning systems are designed and developed.

For example, when building a classifier to identify wedding photos, an engineer may use the presence of a white dress in a photo as a feature. However, white dresses have been customary only during certain eras and in certain cultures.

204
Q

incompatibility of fairness metrics

A

The idea that some notions of fairness are mutually incompatible and cannot be satisfied simultaneously. As a result, there is no single universal metric for quantifying fairness that can be applied to all ML problems.

While this may seem discouraging, incompatibility of fairness metrics doesn’t imply that fairness efforts are fruitless. Instead, it suggests that fairness must be defined contextually for a given ML problem, with the goal of preventing harms specific to its use cases.

205
Q

independently and identically distributed (i.i.d)

A

Data drawn from a distribution that doesn’t change, and where each value drawn doesn’t depend on values that have been drawn previously. I.i.d. data is the ideal gas of machine learning—a useful mathematical construct but almost never exactly found in the real world. For example, the distribution of visitors to a web page may be i.i.d. over a brief window of time; that is, the distribution doesn’t change during that brief window and one person’s visit is generally independent of another’s visit. However, if you expand that window of time, seasonal differences in the web page’s visitors may appear.

206
Q

individual fairness

A

A fairness metric that checks whether similar individuals are classified similarly. For example, Brobdingnagian Academy might want to satisfy individual fairness by ensuring that two students with identical grades and standardized test scores are equally likely to gain admission.

Note that individual fairness relies entirely on how you define “similarity” (in this case, grades and test scores), and you can run the risk of introducing new fairness problems if your similarity metric misses important information (such as the rigor of a student’s curriculum).

207
Q

inference

A

In machine learning, the process of making predictions by applying a trained model to unlabeled examples.

Inference has a somewhat different meaning in statistics. Statistical inference is the process of using data analysis to infer properties of an underlying distribution of probability.

208
Q

inference path

A

In a decision tree, during inference, the route a particular example takes from the root to other conditions, terminating with a leaf.

209
Q

information gain

A

In decision forests, the difference between a node’s entropy and the weighted (by number of examples) sum of the entropy of its children nodes. A node’s entropy is the entropy of the examples in that node.

For example, consider the following entropy values:

  • entropy of parent node = 0.6
  • entropy of one child node with 16 relevant examples = 0.2
  • entropy of another child node with 24 relevant examples = 0.1

So 40% of the examples are in one child node and 60% are in the other child node. Therefore:

  • weighted entropy sum of child nodes = (0.4 * 0.2) + (0.6 * 0.1) = 0.14

So, the information gain is:

  • information gain = entropy of parent node - weighted entropy sum of child nodes
    information gain = 0.6 - 0.14 = 0.46

Most splitters seek to create conditions that maximize information gain.
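
A short Python sketch (illustrative) that reproduces the calculation above:

parent_entropy = 0.6
children = [(0.4, 0.2), (0.6, 0.1)]  # (fraction of examples, entropy)

weighted_child_entropy = sum(frac * h for frac, h in children)  # 0.14
information_gain = parent_entropy - weighted_child_entropy
print(round(information_gain, 2))  # 0.46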

210
Q

in-group bias

A

Showing partiality to one’s own group or own characteristics. If testers or raters consist of the machine learning developer’s friends, family, or colleagues, then in-group bias may invalidate product testing or the dataset.

In-group bias is a form of group attribution bias. See also out-group homogeneity bias.

211
Q

input generator

A

A mechanism by which data is loaded into a neural network.

An input generator can be thought of as a component responsible for processing raw data into tensors which are iterated over to generate batches for training, evaluation, and inference.

212
Q

input layer

A

The layer of a neural network that holds the feature vector. That is, the input layer provides examples for training or inference.

213
Q

in-set condition

A

In a decision tree, a condition that tests for the presence of one item in a set of items. For example, the following is an in-set condition:

house-style in [tudor, colonial, cape]

During inference, if the value of the house-style feature is tudor or colonial or cape, then this condition evaluates to Yes. If the value of the house-style feature is something else (for example, ranch), then this condition evaluates to No.

In-set conditions usually lead to more efficient decision trees than conditions that test one-hot encoded features.

214
Q

instruction tuning

A

A form of fine-tuning that improves a generative AI model’s ability to follow instructions. Instruction tuning involves training a model on a series of instruction prompts, typically covering a wide variety of tasks. The resulting instruction-tuned model then tends to generate useful responses to zero-shot prompts across a variety of tasks.

Compare and contrast with:

  • parameter-efficient tuning
  • prompt tuning
215
Q

interpretability

A

The ability to explain or to present an ML model’s reasoning in understandable terms to a human.

Most linear regression models, for example, are highly interpretable. (You merely need to look at the trained weights for each feature.) Decision forests are also highly interpretable. Some models, however, require sophisticated visualization to become interpretable.

216
Q

inter-rater agreement

A

A measurement of how often human raters agree when doing a task. If raters disagree, the task instructions may need to be improved. Also sometimes called inter-annotator agreement or inter-rater reliability. See also Cohen’s kappa, which is one of the most popular inter-rater agreement measurements.

217
Q

intersection over union (IoU)

A

The intersection of two sets divided by their union. In machine-learning image-detection tasks, IoU is used to measure the accuracy of the model’s predicted bounding box with respect to the ground-truth bounding box. In this case, the IoU for the two boxes is the ratio between the overlapping area and the total area, and its value ranges from 0 (no overlap of predicted bounding box and ground-truth bounding box) to 1 (predicted bounding box and ground-truth bounding box have the exact same coordinates).
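
A minimal Python sketch for axis-aligned boxes given as (x1, y1, x2, y2) corners (the box format is an assumption):

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero area if the boxes don't intersect).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - intersection)
    return intersection / union if union else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 (identical boxes)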

218
Q

item matrix

A

In recommendation systems, a matrix of embedding vectors generated by matrix factorization that holds latent signals about each item. Each row of the item matrix holds the value of a single latent feature for all items. For example, consider a movie recommendation system. Each column in the item matrix represents a single movie. The latent signals might represent genres, or might be harder-to-interpret signals that involve complex interactions among genre, stars, movie age, or other factors.

The item matrix has the same number of columns as the target matrix that is being factorized. For example, given a movie recommendation system that evaluates 10,000 movie titles, the item matrix will have 10,000 columns.

219
Q

items

A

In a recommendation system, the entities that a system recommends. For example, videos are the items that a video store recommends, while books are the items that a bookstore recommends.

220
Q

iteration

A

A single update of a model’s parameters—the model’s weights and biases—during training. The batch size determines how many examples the model processes in a single iteration. For instance, if the batch size is 20, then the model processes 20 examples before adjusting the parameters.

When training a neural network, a single iteration involves the following two passes:

  1. A forward pass to evaluate loss on a single batch.
  2. A backward pass (backpropagation) to adjust the model’s parameters based on the loss and the learning rate.
221
Q

JAX

A

An array computing library, bringing together XLA (Accelerated Linear Algebra) and automatic differentiation for high-performance numerical computing. JAX provides a simple and powerful API for writing accelerated numerical code with composable transformations. JAX provides features such as:

  • grad (automatic differentiation)
  • jit (just-in-time compilation)
  • vmap (automatic vectorization or batching)
  • pmap (parallelization)

JAX is a language for expressing and composing transformations of numerical code, analogous—but much larger in scope—to Python’s NumPy library. (In fact, the jax.numpy library is a functionally equivalent, but entirely rewritten version of the Python NumPy library.)

JAX is particularly well-suited for speeding up many machine learning tasks by transforming the models and data into a form suitable for parallelism across GPU and TPU accelerator chips.

Flax, Optax, Pax, and many other libraries are built on the JAX infrastructure.

222
Q

Keras

A

A popular Python machine learning API. Keras runs on several deep learning frameworks, including TensorFlow, where it is made available as tf.keras.

223
Q

Kernel Support Vector Machines (KSVMs)

A

A classification algorithm that seeks to maximize the margin between positive and negative classes by mapping input data vectors to a higher dimensional space. For example, consider a classification problem in which the input dataset has a hundred features. To maximize the margin between positive and negative classes, a KSVM could internally map those features into a million-dimension space. KSVMs use a loss function called hinge loss.

224
Q

keypoints

A

The coordinates of particular features in an image. For example, for an image recognition model that distinguishes flower species, keypoints might be the center of each petal, the stem, the stamen, and so on.

225
Q

k-fold cross validation

A

An algorithm for predicting a model’s ability to generalize to new data. The k in k-fold refers to the number of equal groups you divide a dataset’s examples into; that is, you train and test your model k times. For each round of training and testing, a different group is the test set, and all remaining groups become the training set. After k rounds of training and testing, you calculate the mean and standard deviation of the desired test metric(s).
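
A minimal scikit-learn sketch with k = 5 (the toy data and model are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X = np.arange(20, dtype=float).reshape(10, 2)
y = X.sum(axis=1)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Train on four folds; evaluate on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 per fold

print(np.mean(scores), np.std(scores))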

226
Q

k-means

A

A popular clustering algorithm that groups examples in unsupervised learning. The k-means algorithm basically does the following:

  • Iteratively determines the best k center points (known as centroids).
  • Assigns each example to the closest centroid. Those examples nearest the same centroid belong to the same group.

The k-means algorithm picks centroid locations to minimize the cumulative square of the distances from each example to its closest centroid.
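
A minimal NumPy sketch of those two steps (illustrative; real implementations also handle empty clusters and convergence checks):

import numpy as np

def k_means(examples, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct examples.
    centroids = examples[rng.choice(len(examples), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each example to its closest centroid.
        distances = np.linalg.norm(
            examples[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned examples.
        centroids = np.array(
            [examples[assignments == c].mean(axis=0) for c in range(k)])
    return centroids, assignments

examples = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
centroids, assignments = k_means(examples, k=2)
print(centroids)  # two centroids near (0, 0.5) and (10, 10.5)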

227
Q

k-median

A

A clustering algorithm closely related to k-means. The practical difference between the two is as follows:

  • In k-means, centroids are determined by minimizing the sum of the squares of the distance between a centroid candidate and each of its examples.
  • In k-median, centroids are determined by minimizing the sum of the distance between a centroid candidate and each of its examples.

Note that the definitions of distance are also different:

  • k-means relies on the Euclidean distance from the centroid to an example. (In two dimensions, the Euclidean distance means using the Pythagorean theorem to calculate the hypotenuse.) For example, the k-means distance between (2,2) and (5,-2) would be:

    Euclidean distance = sqrt((2-5)^2 + (2 - -2)^2) = sqrt(9 + 16) = 5

  • k-median relies on the Manhattan distance from the centroid to an example. This distance is the sum of the absolute deltas in each dimension. For example, the k-median distance between (2,2) and (5,-2) would be:

    Manhattan distance = abs(2-5) + abs(2 - -2) = 3 + 4 = 7
228
Q

L0 regularization

A

A type of regularization that penalizes the total number of nonzero weights in a model. For example, a model having 11 nonzero weights would be penalized more than a similar model having 10 nonzero weights.

L0 regularization is sometimes called L0-norm regularization. L0 regularization is generally impractical in large models.

229
Q

L1 loss

A

A loss function that calculates the absolute value of the difference between actual label values and the values that a model predicts.

L1 loss is less sensitive to outliers than L2 loss.

The Mean Absolute Error is the average L1 loss per example.

230
Q

L1 regularization

A

A type of regularization that penalizes weights in proportion to the sum of the absolute value of the weights. L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0. A feature with a weight of 0 is effectively removed from the model.

231
Q

L2 loss

A

A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, if a model’s prediction was 8 and the actual value for the example was 3, the L2 loss would be 5^2 = 25.

Due to squaring, L2 loss amplifies the influence of outliers. That is, L2 loss reacts more strongly to bad predictions than L1 loss. For example, the L1 loss for the preceding example would be 5 instead of 25.

Regression models typically use L2 loss as the loss function.

The Mean Squared Error is the average L2 loss per example. Squared loss is another name for L2 loss.

232
Q

L2 regularization

A

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. Features with values very close to 0 remain in the model but don’t influence the model’s prediction very much.

L2 regularization always improves generalization in linear models.

233
Q

label

A

In supervised machine learning, the “answer” or “result” portion of an example.

Each labeled example consists of one or more features and a label. For instance, in a spam detection dataset, the label would probably be either “spam” or “not spam.” In a rainfall dataset, the label might be the amount of rain that fell during a certain period.

234
Q

labeled example

A

An example that contains one or more features and a label. In supervised machine learning, models train on labeled examples and make predictions on unlabeled examples.

235
Q

label leakage

A

A model design flaw in which a feature is a proxy for the label. For example, consider a binary classification model that predicts whether or not a prospective customer will purchase a particular product. Suppose that one of the features for the model is a Boolean named SpokeToCustomerAgent. Further suppose that a customer agent is only assigned after the prospective customer has actually purchased the product. During training, the model will quickly learn the association between SpokeToCustomerAgent and the label.

236
Q

lambda

A

Synonym for regularization rate, a number that specifies the relative importance of regularization during training.

Lambda is an overloaded term. Here we’re focusing on the term’s definition within regularization.

237
Q

LaMDA (Language Model for Dialogue Applications)

A

A Transformer-based large language model developed by Google trained on a large dialogue dataset that can generate realistic conversational responses.

238
Q

landmarks

A

Synonym for keypoints, the coordinates of particular features in an image.

239
Q

language model

A

A model that estimates the probability of a token or sequence of tokens occurring in a longer sequence of tokens.

240
Q

large language model

A

An informal term with no strict definition that usually means a language model that has a high number of parameters. Some large language models contain over 100 billion parameters.

241
Q

layer

A

A set of neurons in a neural network. Three common types of layers are as follows:

  • The input layer, which provides values for all the features.
  • One or more hidden layers, which find nonlinear relationships between the features and the label.
  • The output layer, which provides the prediction.
242
Q

Layers API (tf.layers)

A

A TensorFlow API for constructing a deep neural network as a composition of layers. The Layers API enables you to build different types of layers, such as:

  • tf.layers.Dense for a fully-connected layer.
  • tf.layers.Conv2D for a convolutional layer.

The Layers API follows the Keras layers API conventions. That is, aside from a different prefix, all functions in the Layers API have the same names and signatures as their counterparts in the Keras layers API.

243
Q

leaf

A

Any endpoint in a decision tree. Unlike a condition, a leaf doesn’t perform a test. Rather, a leaf is a possible prediction. A leaf is also the terminal node of an inference path.

244
Q

Learning Interpretability Tool (LIT)

A

A visual, interactive model-understanding and data visualization tool.

You can use open-source LIT to interpret models or to visualize text, image, and tabular data.

245
Q

learning rate

A

A floating-point number that tells the gradient descent algorithm how strongly to adjust weights and biases on each iteration. For example, a learning rate of 0.3 would adjust weights and biases three times more powerfully than a learning rate of 0.1.

Learning rate is a key hyperparameter. If you set the learning rate too low, training will take too long. If you set the learning rate too high, gradient descent often has trouble reaching convergence.

246
Q

least squares regression

A

A linear regression model trained by minimizing L2 Loss.

247
Q

linear

A

A relationship between two or more variables that can be represented solely through addition and multiplication.

The plot of a linear relationship is a line.

248
Q

linear model

A

A model that assigns one weight per feature to make predictions. (Linear models also incorporate a bias.) In contrast, the relationship of features to predictions in deep models is generally nonlinear.

Linear models are usually easier to train and more interpretable than deep models. However, deep models can learn complex relationships between features.

Linear regression and logistic regression are two types of linear models.

249
Q

linear regression

A

A type of machine learning model in which both of the following are true:

  • The model is a linear model.
  • The prediction is a floating-point value. (This is the regression part of linear regression.)

Contrast linear regression with logistic regression. Also, contrast regression with classification.