The Hundred-Page Machine Learning (Book) Flashcards

Question

Stochastic gradient descent

Answer 1

Stochastic gradient descent (SGD) is a version of the gradient descent algorithm that speeds up the computation by approximating the gradient using smaller batches (subsets) of the training data. SGD itself has various “upgrades”. Adagrad is a version of SGD that scales the learning rate α for each parameter according to the history of gradients. As a result, α is reduced for very large gradients and vice versa. Momentum is a method that helps accelerate SGD by orienting the gradient descent in the relevant direction and reducing oscillations. In neural network training, variants of SGD such as RMSprop and Adam are most frequently used.
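As a rough illustration of the mini-batch idea, the sketch below fits a one-feature linear model by estimating the gradient of the mean squared error on small random batches. The function name, the toy data, and the values of the learning rate, batch size, and epoch count are all assumptions chosen for the example, not something taken from the book.

```python
import random

def minibatch_sgd(xs, ys, lr=0.1, batch_size=4, epochs=500, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the squared error, estimated on this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy, noise-free data generated from y = 2x + 1 (an assumption for the demo).
xs = [i / 19 for i in range(20)]
ys = [2 * x + 1 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Each update touches only four examples, which is what makes a step cheap compared with computing the gradient over the whole training set.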

Answer 2

The problem of transforming raw data into a dataset is called feature engineering. For most practical problems, feature engineering is a labor-intensive process that demands from the data analyst a lot of creativity and, preferably, domain knowledge.

Answer 3

Google it punk!

Answer 4

Some learning algorithms only work with numerical feature vectors. When some feature in your dataset is categorical, like “colors” or “days of the week,” you can transform such a categorical feature into several binary ones.
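A minimal sketch of this transformation (often called one-hot encoding) is shown here. The category list and the helper name `one_hot` are hypothetical, chosen only for illustration.

```python
def one_hot(value, categories):
    """Return one binary feature per category; exactly one of them is 1."""
    return [1 if value == category else 0 for category in categories]

colors = ["red", "yellow", "green"]  # hypothetical set of categories
encoded = [one_hot(c, colors) for c in ["green", "red"]]
# encoded is [[0, 0, 1], [1, 0, 0]]
```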

Answer 5

Binning applies when you have a numerical feature but you want to convert it into a categorical one. Binning (also called bucketing) is the process of converting a continuous feature into multiple binary features called bins or buckets, typically based on value range. In some cases, a carefully designed binning can help the learning algorithm to learn using fewer examples. It happens because we give a “hint” to the learning algorithm that if the value of a feature falls within a specific range, the exact value of the feature doesn’t matter.
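One way to sketch range-based binning is with the standard-library `bisect` module. The bin boundaries below (ages 18, 40, 65) are hypothetical, and the convention that a boundary value belongs to the bin on its right is an assumption of this sketch.

```python
from bisect import bisect_right

def bin_feature(value, edges):
    """Map a numeric value to binary bin indicators.

    `edges` are the inner boundaries in increasing order, so
    len(edges) + 1 bins are produced.
    """
    index = bisect_right(edges, value)  # position of the bin containing value
    return [1 if i == index else 0 for i in range(len(edges) + 1)]

age_edges = [18, 40, 65]  # hypothetical boundaries: <18, 18-39, 40-64, 65+
binned = bin_feature(30, age_edges)
# binned is [0, 1, 0, 0]
```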

Answer 6

Normalization is the process of converting an actual range of values which a numerical feature can take, into a standard range of values, typically in the interval [−1, 1] or [0, 1]. For example, suppose the natural range of a particular feature is 350 to 1450. By subtracting 350 from every value of the feature, and dividing the result by 1100, one can normalize those values into the range [0, 1].
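The 350 to 1450 example can be checked with a short sketch. The function name is an assumption, and the minimum and maximum are passed in explicitly rather than estimated from data.

```python
def min_max_normalize(values, low, high):
    """Rescale values from the range [low, high] into [0, 1]."""
    return [(v - low) / (high - low) for v in values]

normalized = min_max_normalize([350, 900, 1450], low=350, high=1450)
# normalized is [0.0, 0.5, 1.0]
```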

Answer 7

Why do we normalize? Normalizing the data is not a strict requirement. However, in practice, it can lead to an increased speed of learning. Remember the gradient descent example from the previous chapter. Imagine you have a two-dimensional feature vector. When you update the parameters w(1) and w(2), you use partial derivatives of the average squared error with respect to w(1) and w(2). If x(1) is in the range [0, 1000] and x(2) is in the range [0, 0.0001], then the derivative with respect to the larger feature will dominate the update.

Answer 8

Standardization (or z-score normalization) is the procedure during which the feature values are rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (the average value of the feature, averaged over all examples in the dataset) and σ is the standard deviation from the mean.
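A small sketch of the rescaling follows: subtract the mean and divide by the standard deviation. Using the population standard deviation (`statistics.pstdev`) rather than the sample one is an assumption of this example, as are the toy values.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

z_scores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# The toy data has mean 5 and standard deviation 2, so 9.0 maps to 2.0.
```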

Answer 9

Usually, if your dataset is not too big and you have time, you can try both normalization and standardization and see which one performs better for your task.
* unsupervised learning algorithms, in practice, more often benefit from standardization than from normalization;
* standardization is also preferred for a feature if the values this feature takes are distributed close to a normal distribution (so-called bell curve);
* again, standardization is preferred for a feature if it can sometimes have extremely high or low values (outliers); this is because normalization will “squeeze” the normal values into a very small range;
* in all other cases, normalization is preferable.

Answer 10

* Removing the examples with missing features from the dataset. That can be done if your dataset is big enough so you can sacrifice some training examples.
* Using a learning algorithm that can deal with missing feature values (depends on the library and a specific implementation of the algorithm).
* Using a data imputation technique.

Answer 11

One technique consists in replacing the missing value of a feature by an average value of this feature in the dataset.

Another technique is to replace the missing value by the same value outside the normal range of values. For example, if the normal range is [0, 1], then you can set the missing value equal to 2 or −1. The idea is that the learning algorithm will learn what is best to do when the feature has a value significantly different from other values. Alternatively, you can replace the missing value by a value in the middle of the range. For example, if the range for a feature is [−1, 1], you can set the missing value to be equal to 0. Here, the idea is that if we use the value in the middle of the range to replace missing features, such a value will not significantly affect the prediction.

A more advanced technique is to use the missing value as the target variable for a regression problem. You can use all remaining features [x_i^(1), x_i^(2), ..., x_i^(j−1), x_i^(j+1), ..., x_i^(D)] to form a feature vector x̂_i and set ŷ_i = x_i^(j), where j is the feature with a missing value. Now we can build a regression model to predict ŷ from the feature vectors x̂. Of course, to build training examples (x̂, ŷ), you only use those examples from the original dataset in which the value of feature j is present.

Finally, if you have a significantly large dataset and just a few features with missing values, you can increase the dimensionality of your feature vectors by adding a binary indicator feature for each feature with missing values. Let’s say feature j = 12 in your D-dimensional dataset has missing values. For each feature vector x, you then add the feature j = D + 1 which is equal to 1 if the value of feature 12 is present in x and 0 otherwise. The missing feature value then can be replaced by 0 or any number of your choice.
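The sketch below combines two of the ideas above: it fills a missing value with the feature's average and appends a binary indicator that records whether the value was originally present. Representing a missing value as `None`, the function name, and the choice to combine the two techniques are assumptions made for this example.

```python
def impute_mean_with_indicator(rows, j):
    """Fill missing values (None) in column j with the column mean and
    append a 1/0 feature telling whether the value was present."""
    present = [row[j] for row in rows if row[j] is not None]
    mean_j = sum(present) / len(present)
    result = []
    for row in rows:
        was_present = row[j] is not None
        new_row = list(row)
        if not was_present:
            new_row[j] = mean_j
        new_row.append(1 if was_present else 0)  # indicator feature D + 1
        result.append(new_row)
    return result

rows = [[1.0, 10.0], [2.0, None], [3.0, 30.0]]
imputed = impute_mean_with_indicator(rows, j=1)
# imputed is [[1.0, 10.0, 1], [2.0, 20.0, 0], [3.0, 30.0, 1]]
```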

Answer 12

* Explainability
* In-memory vs. out-of-memory
* Number of features and examples
* Categorical vs. numerical features
* Nonlinearity of the data
* Training speed
* Prediction speed

Answer 13

1) training set, 2) validation set, and 3) test set.

Answer 14

* your model is too simple for the data (for example, a linear model can often underfit);
* the features you engineered are not informative enough.

Answer 15

* your model is too complex for the data (for example, a very tall decision tree or a very deep or wide neural network often overfits);
* you have too many features but a small number of training examples.

Answer 16

Regularization is an umbrella term that encompasses methods that force the learning algorithm to build a less complex model. In practice, that often leads to slightly higher bias but significantly reduces the variance. This problem is known in the literature as the bias-variance tradeoff.

Answer 17

L1 and L2 regularization methods are also combined in what is called elastic net regularization, with L1 and L2 regularizations being special cases. You can find in the literature the name ridge regularization for L2 and lasso for L1.

Answer 18

* confusion matrix,
* accuracy,
* cost-sensitive accuracy,
* precision/recall, and
* area under the ROC curve.

Answer 19

The confusion matrix is a table that summarizes how successful the classification model is at predicting examples belonging to various classes. One axis of the confusion matrix is the label that the model predicted, and the other axis is the actual label. In a binary classification problem, there are two classes.

Answer 20

Precision is the ratio of correct positive predictions to the overall number of positive predictions: precision = TP / (TP + FP). Recall is the ratio of correct positive predictions to the overall number of positive examples in the test set: recall = TP / (TP + FN).
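Both ratios can be computed directly from the four confusion-matrix counts. The sketch below assumes binary labels where 1 is the positive class; the function names and the toy label lists are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall(actual, predicted, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(actual, predicted, positive)
    return tp / (tp + fp), tp / (tp + fn)

actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)       # (3, 1, 1, 3)
precision, recall = precision_recall(actual, predicted)  # (0.75, 0.75)
```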

Answer 21

Accuracy is given by the number of correctly classified examples divided by the total number of classified examples. In terms of the confusion matrix, it is given by: accuracy = (TP + TN) / (TP + TN + FP + FN).

Answer 22

For dealing with the situation in which different classes have different importance, a useful metric is cost-sensitive accuracy. To compute a cost-sensitive accuracy, you first assign a cost (a positive number) to both types of mistakes: FP and FN. You then compute the counts TP, TN, FP, FN as usual and multiply the counts for FP and FN by the corresponding cost before calculating the accuracy using eq. 5.
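One reading of this recipe is sketched below: the FP and FN counts are multiplied by their costs, and the weighted counts are then plugged into the usual accuracy ratio (correct predictions divided by all predictions). The function name, the default costs, and the example counts are assumptions.

```python
def cost_sensitive_accuracy(tp, tn, fp, fn, fp_cost=1.0, fn_cost=1.0):
    """Accuracy computed after weighting FP and FN by their costs."""
    weighted_fp = fp_cost * fp
    weighted_fn = fn_cost * fn
    return (tp + tn) / (tp + tn + weighted_fp + weighted_fn)

plain = cost_sensitive_accuracy(tp=3, tn=3, fp=1, fn=1)               # 0.75
costly_fn = cost_sensitive_accuracy(tp=3, tn=3, fp=1, fn=1, fn_cost=3.0)  # 0.6
```

With both costs equal to 1 the result is ordinary accuracy; raising the cost of false negatives lowers the score of a model that makes them.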

Answer 23

Google/gpt

Answer 24

As you already know, hyperparameters aren’t optimized by the learning algorithm itself. The data analyst has to “tune” hyperparameters by experimentally finding the best combination of values, one per hyperparameter.

Answer 25

Grid search is the simplest hyperparameter tuning strategy. Let’s say you train an SVM and you have two hyperparameters to tune: the penalty parameter C (a positive real number) and the kernel (either “linear” or “rbf”). If it’s the first time you are working with this dataset, you don’t know what the possible range of values for C is. The most common trick is to use a logarithmic scale. For example, for C you can try the following values: [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]. In this case you have 14 combinations of hyperparameters to try: [(0.001, “linear”), (0.01, “linear”), (0.1, “linear”), (1.0, “linear”), (10, “linear”), (100, “linear”), (1000, “linear”), (0.001, “rbf”), (0.01, “rbf”), (0.1, “rbf”), (1.0, “rbf”), (10, “rbf”), (100, “rbf”), (1000, “rbf”)].
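The 14 combinations can be enumerated with `itertools.product`. In the sketch below, `evaluate` is a hypothetical callback that would train a model with the given hyperparameters and return its score on the validation set; it is not part of any library.

```python
from itertools import product

c_values = [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]
kernels = ["linear", "rbf"]

# Every (C, kernel) pair, in the same order as the list in the text.
grid = [(c, kernel) for kernel, c in product(kernels, c_values)]

def grid_search(evaluate):
    """Return the combination with the highest validation score.

    `evaluate(c, kernel)` is a hypothetical function supplied by the caller.
    """
    return max(grid, key=lambda combo: evaluate(*combo))
```

A call such as `grid_search(my_validation_score)` would then try all 14 pairs, so the cost grows multiplicatively with every hyperparameter added to the grid.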

Answer 26

Random search differs from grid search in that you no longer provide a discrete set of values to explore for each hyperparameter; instead, you provide a statistical distribution for each hyperparameter from which values are randomly sampled and set the total number of combinations you want to try.

Answer 27

Bayesian techniques differ from random or grid search in that they use past evaluation results to choose the next values to evaluate. The idea is to limit expensive optimization of the objective function by choosing the next hyperparameter values based on those that have done well in the past.

Answer 28

When you don’t have a decent validation set to tune your hyperparameters on, the common technique that can help you is called cross-validation. When you have few training examples, it could be prohibitive to have both a validation and a test set. You would prefer to use more data to train the model. In such a case, you only split your data into a training and a test set. Then you use cross-validation on the training set to simulate a validation set.
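A common form of this idea is k-fold cross-validation, where the training set is split into k folds and each fold serves once as the simulated validation set. The card does not spell out the procedure, so the sketch below, including the function name and the choice of contiguous, unshuffled folds, is an assumption.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

splits = list(k_fold_splits(n_examples=10, k=5))
# Each of the 5 splits holds out 2 examples and trains on the other 8.
```

The hyperparameter score is then usually taken as the average of the k validation scores.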

Answer 29

Deep learning refers to training neural networks with more than two non-output layers. In the past, it became more difficult to train such networks as the number of layers grew. The two biggest challenges were referred to as the problems of exploding gradient and vanishing gradient as gradient descent was used to train the network parameters.

Answer 30

Backpropagation is an efficient algorithm for computing gradients on neural networks using the chain rule.

Answer 31

google/gpt

Answer 32

The layers that are neither input nor output are often called hidden layers.

Answer 33

google-gpt

Answer 34

In multiclass classification, the label can be one of the C classes: y ∈ {1, ..., C}. Many machine learning algorithms are binary; SVM is an example. Some algorithms can naturally be extended to handle multiclass problems. ID3 and other decision tree learning algorithms can be simply changed to handle multiple classes. Logistic regression can be naturally extended to multiclass learning problems by replacing the sigmoid function with the softmax function which we already saw in Chapter 6.
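For reference, a minimal sketch of the softmax function is given here. It turns C real-valued scores into C probabilities that sum to 1; subtracting the maximum score before exponentiating is a standard numerical-stability trick and an addition of this sketch.

```python
import math

def softmax(scores):
    """Convert a list of real-valued scores into probabilities."""
    largest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])  # hypothetical scores for 3 classes
predicted_class = probabilities.index(max(probabilities)) + 1  # classes are 1..C
```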

Answer 35

The idea is to transform a multiclass problem into C binary classification problems and build C binary classifiers. For example, if we have three classes, y ∈ {1, 2, 3}, we create copies of the original dataset and modify them. In the first copy, we replace all labels not equal to 1 by 0. In the second copy, we replace all labels not equal to 2 by 0. In the third copy, we replace all labels not equal to 3 by 0. Now we have three binary classification problems where we have to learn to distinguish between labels 1 and 0, 2 and 0, and between labels 3 and 0. Once we have the three models and we need to classify the new input feature vector x, we apply the three models to the input, and we get three predictions. We then pick the prediction of a non-zero class which is the most certain. Remember that in logistic regression, the model returns not a label but a score in (0, 1) that can be interpreted as the probability that the label is positive.
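The relabeling and the final pick can be sketched as follows. The function names are hypothetical, and the scores passed to `predict_one_vs_rest` stand in for the outputs of three already trained binary models.

```python
def one_vs_rest_labels(labels, classes):
    """For each class c, build a copy of the labels where every label
    other than c is replaced by 0."""
    return {c: [y if y == c else 0 for y in labels] for c in classes}

def predict_one_vs_rest(scores_by_class):
    """Pick the non-zero class whose binary model is the most certain.

    `scores_by_class` maps each class to the score in (0, 1) returned
    by that class's binary model for the input x.
    """
    return max(scores_by_class, key=scores_by_class.get)

copies = one_vs_rest_labels([1, 2, 3, 2], classes=[1, 2, 3])
# copies[2] is [0, 2, 0, 2]
prediction = predict_one_vs_rest({1: 0.2, 2: 0.7, 3: 0.4})  # made-up scores
```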

Answer 36

One-class classification, also known as unary classification or class modeling, tries to identify objects of a specific class among all objects, by learning from a training set containing only the objects of that class. That is different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all classes. A typical one-class classification problem is the classification of the traffic in a secure network as normal. In this scenario, there are few, if any, examples of the traffic under an attack or during an intrusion. However, the examples of normal traffic are often in abundance. One-class classification learning algorithms are used for outlier detection, anomaly detection, and novelty detection. There are several one-class learning algorithms. The most widely used in practice are one-class Gaussian, one-class k-means, one-class kNN, and one-class SVM.

Answer 37

The idea behind the one-class Gaussian is that we model our data as if it came from a Gaussian distribution, more precisely a multivariate normal distribution (MND). The probability density function (pdf) for MND is given by the following equation:

$$f_{\mu,\Sigma}(x) = \frac{\exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)}{\sqrt{(2\pi)^D |\Sigma|}}$$

where $f_{\mu,\Sigma}(x)$ returns the probability density corresponding to the input feature vector x, μ is the mean vector, Σ is the covariance matrix with determinant |Σ|, and D is the dimensionality of x. Probability density can be interpreted as the likelihood that example x was drawn from the probability distribution we model as an MND.

Answer 38

In multi-label classification, each training example doesn’t just have one label, but several of them. For instance, if we want to describe an image, we could assign several labels to it: “people,” “concert,” “nature,” all three at the same time.

Answer 39

Ensemble learning is a learning paradigm that, instead of trying to learn one super-accurate model, focuses on training a large number of low-accuracy models and then combining the predictions given by those weak models to obtain a high-accuracy meta-model. The two most widely used and effective ensemble learning algorithms are random forest and gradient boosting.

Answer 40

Bagging, also known as Bootstrap aggregating, is an ensemble learning technique that helps to improve the performance and accuracy of machine learning algorithms. It is used to deal with bias-variance trade-offs and reduces the variance of a prediction model. Bagging avoids overfitting of data and is used for both regression and classification models, specifically for decision tree algorithms.

Answer 41

Sequence-to-sequence learning (often abbreviated as seq2seq learning) is a generalization of the sequence labeling problem. In seq2seq, X_i and Y_i can have different lengths. Seq2seq models have found application in machine translation (where, for example, the input is an English sentence, and the output is the corresponding French sentence), conversational interfaces (where the input is a question typed by the user, and the output is the answer from the machine), text summarization, spelling correction, and many others.

Answer 42

The role of the encoder is to read the input and generate some sort of state (similar to the state in RNN) that can be seen as a numerical representation of the meaning of the input the machine can work with. The meaning of some entity, whether it be an image, a text or a video, is usually a vector or a matrix that contains real numbers. This vector (or matrix) is called in the machine learning jargon the embedding of the input.

Answer 43

Attention mechanism is implemented by an additional set of parameters that combine some information from the encoder (in RNNs, this information is the list of state vectors of the last recurrent layer from all encoder time steps) and the current state of the decoder to generate the label. That allows for even better retention of long-term dependencies than provided by gated units and bidirectional RNN.

Answer 44

Active learning is an interesting supervised learning paradigm. It is usually applied when obtaining labeled examples is costly. That is often the case in the medical or financial domains, where the opinion of an expert may be required to annotate patients’ or customers’ data. The idea is that we start the learning with relatively few labeled examples, and a large number of unlabeled ones, and then add labels only to those examples that contribute the most to the model quality.

Answer 45

In semi-supervised learning (SSL) we also have labeled a small fraction of the dataset; most of the remaining examples are unlabeled. Our goal is to leverage a large number of unlabeled examples to improve the model performance without asking an expert for additional labeled examples. For example, it was shown that for some datasets, such as MNIST (a frequent testbench in computer vision that consists of labeled images of handwritten digits from 0 to 9) the model trained in a semi-supervised way has an almost perfect performance with just 10 labeled examples per class (100 labeled examples overall). For comparison, MNIST contains 70,000 labeled examples (60,000 for training and 10,000 for test). The neural network architecture that attained such a remarkable performance is called a ladder network.

Answer 46

An autoencoder is a feed-forward neural network with an encoder-decoder architecture. It is trained to reconstruct its input. So the training example is a pair (x, x). We want the output x̂ of the model f(x) to be as similar to the input x as possible.

Answer 47

One of them is one-shot learning. In one-shot learning, typically applied in face recognition, we want to build a model that can recognize that two photos of the same person represent that same person. If we present to the model two photos of two different people, we expect the model to recognize that the two people are different. One way to build such a model is to train a siamese neural network (SNN). An SNN can be implemented as any kind of neural network: a CNN, an RNN, or an MLP. What matters is how we train the network. To train an SNN, we use the triplet loss function. For example, let us have three images of a face: the image A (for anchor), the image P (for positive) and the image N (for negative). A and P are two different pictures of the same person; N is a picture of another person. Each training example i is now a triplet (A_i, P_i, N_i).

Answer 48

We finish this chapter with zero-shot learning. It is a relatively new research area, so there are no algorithms that proved to have a significant practical utility yet. Therefore, I only outline here the basic idea and leave the details of various algorithms for further reading. In zero-shot learning (ZSL) we want to train a model to assign labels to objects. The most frequent application is to learn to assign labels to images.

Answer 49

If you set the cost of misclassification of examples of the minority class higher, then the model will try harder to avoid misclassifying those examples, at the cost of misclassifying some examples of the majority class, as illustrated in Figure 1b. Some SVM implementations (including SVC in scikit-learn) allow you to provide weights for every class. The learning algorithm takes this information into account when looking for the best hyperplane. If your learning algorithm doesn’t allow weighting classes, you can try to increase the importance of examples of some class by making multiple copies of the examples of this class (this is called oversampling). An opposite approach is to randomly remove from the training set some examples of the majority class (undersampling).
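Oversampling by copying and undersampling by random removal can be sketched with plain lists. The function names, the parameters `extra_copies` and `keep`, and the fixed random seed are assumptions made for this example.

```python
import random

def oversample(examples, labels, minority, extra_copies):
    """Append `extra_copies` additional copies of every minority example."""
    out_x, out_y = list(examples), list(labels)
    for x, y in zip(examples, labels):
        if y == minority:
            out_x.extend([x] * extra_copies)
            out_y.extend([y] * extra_copies)
    return out_x, out_y

def undersample(examples, labels, majority, keep, seed=0):
    """Randomly keep only `keep` examples of the majority class."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = set(rng.sample(majority_idx, keep))
    pairs = [(x, y) for i, (x, y) in enumerate(zip(examples, labels))
             if y != majority or i in kept]
    return [x for x, _ in pairs], [y for _, y in pairs]

examples = list(range(10))
labels = [0] * 8 + [1] * 2          # class 1 is the minority
over_x, over_y = oversample(examples, labels, minority=1, extra_copies=3)
under_x, under_y = undersample(examples, labels, majority=0, keep=2)
```

Both calls produce a balanced set here: eight examples per class after oversampling, two per class after undersampling.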

Answer 50

There are two popular algorithms that oversample the minority class by creating synthetic examples: the synthetic minority oversampling technique (SMOTE) and the adaptive synthetic sampling method (ADASYN).

Answer 51

In practice, we can sometimes get an additional performance gain by combining strong models made with different learning algorithms. In this case, we usually use only two or three models. There are three typical ways to combine models: 1) averaging, 2) majority vote, and 3) stacking.

Averaging works for regression as well as those classification models that return classification scores. You simply apply all your models, let’s call them base models, to the input x and then average the predictions. To see if the averaged model works better than each individual algorithm, you test it on the validation set using a metric of your choice.

Majority vote works for classification models. You apply all your base models to the input x and then return the majority class among all predictions. In the case of a tie, you either randomly pick one of the classes, or you return an error message (if the fact of misclassifying would incur a significant cost).

Stacking consists of building a meta-model that takes the output of your base models as input. Let’s say you want to combine a classifier f1 and a classifier f2, both predicting the same set of classes. To create a training example (x̂_i, ŷ_i) for the stacked model, you set x̂_i = [f1(x_i), f2(x_i)] and ŷ_i = y_i.
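The three combination schemes can be sketched with base models represented as plain Python callables. That representation, the function names, and the choice to raise an error on a tie (one of the two options mentioned above) are assumptions of this example.

```python
from collections import Counter

def average_predictions(models, x):
    """Averaging: mean of the base models' numeric predictions."""
    return sum(model(x) for model in models) / len(models)

def majority_vote(models, x):
    """Majority vote: most frequent predicted class; error on a tie."""
    counts = Counter(model(x) for model in models).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        raise ValueError("tie between classes")
    return counts[0][0]

def stacking_example(models, x, y):
    """Stacking: the base models' outputs become the meta-model's input."""
    return [model(x) for model in models], y

# Hypothetical base models, written as lambdas for the demo.
regressors = [lambda x: 2.0 * x, lambda x: 2.0 * x + 1.0]
classifiers = [lambda x: "cat", lambda x: "dog", lambda x: "cat"]

averaged = average_predictions(regressors, 3.0)          # 6.5
voted = majority_vote(classifiers, None)                 # "cat"
meta_x, meta_y = stacking_example(classifiers, None, "cat")
```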

Answer 52

In neural networks, besides L1 and L2 regularization, you can use neural network specific regularizers: dropout, batch normalization, and early stopping. Batch normalization is technically not a regularization technique, but it often has a regularization effect on the model.

Answer 53

The concept of dropout is very simple. Each time you run a training example through the network, you temporarily exclude at random some units from the computation. The higher the percentage of units excluded, the higher the regularization effect. Neural network libraries allow you to add a dropout layer between two successive layers, or you can specify the dropout parameter for the layer. The dropout parameter is in the range [0, 1] and it has to be found experimentally by tuning it on the validation data.

Answer 54

Batch normalization (which rather has to be called batch standardization) is a technique that consists of standardizing the outputs of each layer before the units of the subsequent layer receive them as input. In practice, batch normalization results in a faster and more stable training, as well as in some regularization effect.

Answer 55

Early stopping is the way to train a neural network by saving the preliminary model after every epoch and assessing the performance of the preliminary model on the validation set. As you remember from the section about gradient descent in Chapter 4, as the number of epochs increases, the cost decreases. The decreased cost means that the model fits the training data well. However, at some point, after some epoch e, the model can start overfitting: the cost keeps decreasing, but the performance of the model on the validation data deteriorates. If you keep, in a file, the version of the model after each epoch, you can stop the training once you start observing a decreased performance on the validation set. Alternatively, you can keep running the training process for a fixed number of epochs and then, in the end, you pick the best model. Models saved after each epoch are called checkpoints. Some machine learning practitioners rely on this technique very often; others try to properly regularize the model to avoid such undesirable behavior.
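Both variants described above can be sketched on a list of per-epoch validation scores: stopping as soon as performance drops, and running all epochs before picking the best checkpoint. The `patience` parameter (how many non-improving epochs to tolerate) is an assumption added in this sketch, as are the function names and the toy scores; higher scores are taken to mean better performance.

```python
def best_checkpoint(validation_scores):
    """Return the 0-based epoch whose checkpoint scored best."""
    return max(range(len(validation_scores)), key=validation_scores.__getitem__)

def stop_epoch(validation_scores, patience=1):
    """Return the epoch at which training stops: the first epoch where the
    score has failed to improve for `patience` epochs in a row."""
    best, since_best = float("-inf"), 0
    for epoch, score in enumerate(validation_scores):
        if score > best:
            best, since_best = score, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(validation_scores) - 1  # never triggered: ran all epochs

scores = [0.70, 0.78, 0.83, 0.82, 0.80]  # made-up validation accuracies
chosen = best_checkpoint(scores)   # epoch 2
stopped = stop_epoch(scores)       # epoch 3, the first drop
```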

Answer 56

Another regularization technique that can be applied not just to neural networks, but to virtually any learning algorithm, is called data augmentation. This technique is often used to regularize models that work with images. Once you have your original labeled training set, you can create a synthetic example from an original example by applying various transformations to the original image: zooming it slightly, rotating, flipping, darkening, and so on. You keep the original label in these synthetic examples. In practice, this often results in increased performance of the model.

Answer 57

In many of your practical problems, you will work with multimodal data. For example, your input could be an image and text and the binary output could indicate whether the text describes this image or not. With neural networks, you have more flexibility. You can build two subnetworks, one for each type of input. For example, a CNN subnetwork would read the image while an RNN subnetwork would read the text. Both subnetworks have as their last layer an embedding: CNN has an embedding for the image, while RNN has an embedding for the text. You can then concatenate two embeddings and then add a classification layer, such as softmax or sigmoid, on top of the concatenated embeddings. Neural network libraries provide simple to use tools that allow concatenating or averaging layers from several subnetworks.

Answer 58

Transfer learning is probably where neural networks have a unique advantage over the shallow models. In transfer learning, you pick an existing model trained on some dataset, and you adapt this model to predict examples from another dataset, different from the one the model was built on. With neural networks, the situation is much more favorable. Transfer learning in neural networks works like this.
1. You build a deep model on the original big dataset (wild animals).
2. You compile a much smaller labeled dataset for your second model (domestic animals).
3. You remove the last one or several layers from the first model. Usually, these are layers responsible for the classification or regression; they usually follow the embedding layer.
4. You replace the removed layers with new layers adapted for your new problem.
5. You “freeze” the parameters of the layers remaining from the first model.
6. You use your smaller labeled dataset and gradient descent to train the parameters of only the new layers.

Answer 59

Use the cProfile package in Python to find inefficiencies in your code. Finally, when nothing can be improved in your code from the algorithmic perspective, you can further boost the speed of your code by using:
* the multiprocessing package to run computations in parallel, and
* PyPy, Numba or similar tools to compile your Python code into fast, optimized machine code.
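A minimal profiling session with the standard-library `cProfile` and `pstats` modules might look like the sketch below. The function `slow_sum_of_squares` is a made-up stand-in for whatever code you want to inspect.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n):
    """Deliberately naive loop used as the code under inspection."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum_of_squares(100_000)
profiler.disable()

# Collect the report as text, sorted by cumulative time, top 5 entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

Printing `report` lists each profiled function with its call count and timings, which is usually enough to see where the time goes.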

Answer 60

Unsupervised learning deals with problems in which your dataset doesn’t have labels. This property is what makes it very problematic for many practical applications. The absence of labels which represent the desired behavior for your model means the absence of a solid reference point to judge the quality of your model. In this book, I only present unsupervised learning methods that allow building models that can be evaluated based on data as opposed to human judgment.

Answer 61

Clustering is a problem of learning to assign a label to examples by leveraging an unlabeled dataset. Because the dataset is completely unlabeled, deciding on whether the learned model is optimal is much more complicated than in supervised learning.

Answer 62

The k-means clustering algorithm works as follows. First, the analyst has to choose k, the number of classes (or clusters). Then we randomly put k feature vectors, called centroids, into the feature space. We then compute the distance from each example x to each centroid c using some metric, like the Euclidean distance. Then we assign the closest centroid to each example (as if we labeled each example with a centroid id as the label). For each centroid, we calculate the average feature vector of the examples labeled with it. These average feature vectors become the new locations of the centroids.
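The loop described above can be sketched in a few lines. Two choices here are assumptions of the sketch rather than part of the description: the initial centroids are k randomly chosen examples, and the loop runs for a fixed number of iterations instead of checking for convergence.

```python
import math
import random

def k_means(points, k, iterations=10, seed=0):
    """Cluster `points` (tuples of floats) into k groups."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k examples
    for _ in range(iterations):
        # Assign every example to its closest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the average of the examples assigned to it.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    assignments = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                   for p in points]
    return centroids, assignments

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = k_means(points, k=2)
```

On this toy data the three points near the origin end up with one centroid id and the three points near (10, 10) with the other.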

Answer 63

Clustering methods look in the text.

Answer 64

Many modern machine learning algorithms, such as ensemble algorithms and neural networks, handle very high-dimensional examples well, up to millions of features. With modern computers and graphical processing units (GPUs), dimensionality reduction techniques are used much less in practice than in the past. The most frequent use case for dimensionality reduction is data visualization: humans can only interpret at most three dimensions on a plot. Another situation in which you could benefit from dimensionality reduction is when you have to build an interpretable model and to do so you are limited in your choice of learning algorithms. For example, you can only use decision tree learning or linear regression. By reducing your data to lower dimensionality and by figuring out which quality of the original example each new feature in the reduced feature space reflects, one can use simpler algorithms. Dimensionality reduction removes redundant or highly correlated features; it also reduces the noise in the data. All of that contributes to the interpretability of the model. The three most widely used techniques of dimensionality reduction are principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and autoencoders.

Answer 65

Learning to rank is a supervised learning problem. Among others, one frequent problem solved using learning to rank is the optimization of search results returned by a search engine for a query. In search result ranking optimization, a labeled example X_i in the training set of size N is a ranked collection of documents of size r_i (labels are ranks of documents). A feature vector represents each document in the collection. The goal of the learning is to find a ranking function f which outputs values that can be used to rank documents.

Answer 66

Mean average precision (mAP) is a metric used to evaluate object detection models such as Fast R-CNN, YOLO, Mask R-CNN, etc. The mean of the average precision (AP) values is calculated over recall values from 0 to 1.

Answer 67

LambdaMART is a technique where ranking is transformed into a pairwise classification or regression problem. The algorithms consider a pair of items at a single time, coming up with a viable ordering of those items before initiating the final order of the entire list.

Answer 68

Learning to recommend is an approach to building recommender systems. Usually, we have a user who consumes some content. We have the history of consumption, and we want to suggest to this user new content that the user would like. It could be a movie on Netflix or a book on Amazon. Traditionally, two approaches were used to give recommendations: content-based filtering and collaborative filtering.

Answer 69

Content-based filtering is based on learning what users like based on the description of the content they consume. For example, if the user of a news site often reads news articles on science and technology, then we would suggest to this user more documents on science and technology. More generally, we could create one training set per user and add news articles to this dataset as a feature vector x and whether the user recently read this news article as a label y. Then we build the model of each user and can regularly examine each new piece of content to determine whether a specific user would read it or not. The content-based approach has many limitations. For example, the user can be trapped in the so-called filter bubble: the system will always suggest to that user the information that looks very similar to what the user already consumed. That could result in complete isolation of the user from information that disagrees with their viewpoints or expands them. On a more practical side, the users might just get recommendations of items they already know about, which is undesirable.

Answer 70

Collaborative filtering has a significant advantage over content-based filtering: the recommendations to one user are computed based on what other users consume or rate. For instance, if two users gave high ratings to the same ten movies, then it’s more likely that user 1 will appreciate new movies recommended based on the tastes of user 2 and vice versa. The drawback of this approach is that the content of the recommended items is ignored.

Answer 71

Google - find in book

(96 cards)