The Hundred-Page Machine Learning Book Flashcards

1
Q

What is Machine Learning?

A

Machine learning is a subfield of computer science that is concerned with building algorithms
which, to be useful, rely on a collection of examples of some phenomenon. These examples
can come from nature, be handcrafted by humans or generated by another algorithm.
Machine learning can also be defined as the process of solving a practical problem by 1)
gathering a dataset, and 2) algorithmically building a statistical model based on that dataset.

2
Q

Types of Learning

A

The four types are supervised, semi-supervised, unsupervised, and reinforcement learning.

3
Q

Supervised Learning

A

The goal of a supervised learning algorithm is to use the dataset to produce a model
that takes a feature vector x as input and outputs information that allows deducing the label
for this feature vector. For instance, the model created using the dataset of people could
take as input a feature vector describing a person and output a probability that the person
has cancer.

4
Q

Unsupervised Learning

A

In unsupervised learning, the dataset is a collection of unlabeled examples $\{\mathbf{x}_i\}_{i=1}^{N}$.
Again, x is a feature vector, and the goal of an unsupervised learning algorithm is
to create a model that takes a feature vector x as input and either transforms it into
another vector or into a value that can be used to solve a practical problem. For example,
in clustering, the model returns the id of the cluster for each feature vector in the dataset.
In dimensionality reduction, the output of the model is a feature vector that has fewer
features than the input x; in outlier detection, the output is a real number that indicates
how x is different from a “typical” example in the dataset.

5
Q

Semi-Supervised Learning

A

In semi-supervised learning, the dataset contains both labeled and unlabeled examples.
Usually, the quantity of unlabeled examples is much higher than the number of labeled
examples. The goal of a semi-supervised learning algorithm is the same as the goal of
the supervised learning algorithm. The hope here is that using many unlabeled examples can
help the learning algorithm to find (we might say “produce” or “compute”) a better model.

6
Q

Reinforcement Learning

A

Reinforcement learning is a subfield of machine learning where the machine “lives” in an
environment and is capable of perceiving the state of that environment as a vector of
features. The machine can execute actions in every state. Different actions bring different rewards and could also move the machine to another state of the environment. The goal
of a reinforcement learning algorithm is to learn a policy. A policy is a function f (similar
to the model in supervised learning) that takes the feature vector of a state as input and
outputs an optimal action to execute in that state. The action is optimal if it maximizes the
expected average reward.

7
Q

bag of words?

A

In a bag-of-words representation of an email message, built over a dictionary of 20,000 alphabetically sorted words:
* the first feature is equal to 1 if the email message contains the word “a”; otherwise,
this feature is 0;
* the second feature is equal to 1 if the email message contains the word “aaron”; otherwise,
this feature equals 0;
* …
* the feature at position 20,000 is equal to 1 if the email message contains the word
“zulu”; otherwise, this feature is equal to 0.

Now you have machine-readable input data, but the output labels are still in the form of
human-readable text.
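
A minimal sketch of this encoding, assuming a tiny three-word dictionary in place of the 20,000-word one; the function name `bag_of_words` and the whitespace tokenization are illustrative choices, not part of the book:

```python
def bag_of_words(message, dictionary):
    """Return a binary feature vector: 1 if the dictionary word occurs in the message."""
    words = set(message.lower().split())
    return [1 if word in words else 0 for word in dictionary]

# Hypothetical dictionary standing in for the 20,000 sorted words.
dictionary = ["a", "aaron", "zulu"]
features = bag_of_words("Aaron sent a message", dictionary)
print(features)
```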

8
Q

scalar

A

A scalar is a simple numerical value, like 15 or −3.25. Variables or constants that take scalar
values are denoted by an italic letter, like x or a.

9
Q

vector

A

A vector is an ordered list of scalar values, called attributes. We denote a vector as a bold
character, for example, x or w. Vectors can be visualized as arrows that point in some
direction as well as points in a multi-dimensional space. Illustrations of three two-dimensional
vectors, a = [2, 3], b = [−2, 5], and c = [1, 0], are given in fig. 1. We denote an attribute of a
vector as an italic value with an index, like this: w(j) or x(j). The index j denotes a specific
dimension of the vector, the position of an attribute in the list. For instance, in the vector a
shown in red in fig. 1, a(1) = 2 and a(2) = 3.

10
Q

set

A

A set is an unordered collection of unique elements. We denote a set as a calligraphic capital character, for example, S.

11
Q

The sum of two vectors

A

The sum of two vectors x + z is defined as the vector [x(1) + z(1), x(2) + z(2),…,x(m) + z(m)].

12
Q

vector multiplied by a scalar

A

A vector multiplied by a scalar is a vector. For example xc = [cx(1), cx(2), . . . , cx(m)].

13
Q

dot-product of two vectors

A

A dot-product of two vectors is a scalar. For example, $\mathbf{w}\mathbf{x} \stackrel{\text{def}}{=} \sum_{i=1}^{m} w^{(i)} x^{(i)}$. In some books,
the dot-product is denoted as w · x. The two vectors must be of the same dimensionality.
Otherwise, the dot-product is undefined.
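
The vector sum, scalar multiplication, and dot-product can be sketched with plain Python lists. The helper names below are illustrative, and the two vectors are a = [2, 3] and b = [−2, 5]:

```python
def vector_sum(x, z):
    """Element-wise sum; both vectors must have the same dimensionality."""
    if len(x) != len(z):
        raise ValueError("vectors must have the same dimensionality")
    return [xi + zi for xi, zi in zip(x, z)]

def scale(x, c):
    """Multiply every attribute of x by the scalar c."""
    return [c * xi for xi in x]

def dot(w, x):
    """Sum of products of matching attributes; undefined for unequal lengths."""
    if len(w) != len(x):
        raise ValueError("vectors must have the same dimensionality")
    return sum(wi * xi for wi, xi in zip(w, x))

a = [2, 3]
b = [-2, 5]
print(vector_sum(a, b), scale(a, 2), dot(a, b))
```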

14
Q

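Does (0, 1) contain 0 and 1? What about [0, 1]?

A

No for (0, 1): the open interval excludes both endpoints. Yes for [0, 1]: the closed interval includes them.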

15
Q

Derivative and Gradient

A

A derivative f′ of a function f is a function or a value that describes how fast f grows (or
decreases). If the derivative is a constant value, like 5 or −3, then the function grows (or
decreases) constantly at any point x of its domain. If the derivative f′ is a function, then the
function f can grow at a different pace in different regions of its domain. If the derivative f′
is positive at some point x, then the function f grows at this point. If the derivative of f is
negative at some x, then the function decreases at this point. The derivative of zero at x
means that the function’s slope at x is horizontal.

16
Q

probability distribution

A

The probability distribution of a discrete random variable is described by a list of probabilities
associated with each of its possible values. This list of probabilities is called probability mass
function (pmf). For example: Pr(X = red) = 0.3, Pr(X = yellow) = 0.45, Pr(X = blue) =
0.25. Each probability in a probability mass function is a value greater than or equal to 0.
The sum of probabilities equals 1.

17
Q

continuous random variable

A

A continuous random variable takes an infinite number of possible values in some interval.
Examples include height, weight, and time. Because the number of values of a continuous
random variable X is infinite, the probability Pr(X = c) for any c is 0. Therefore, instead
of the list of probabilities, the probability distribution of a continuous random variable (a
continuous probability distribution) is described by a probability density function (pdf). The
pdf is a function whose codomain is nonnegative and whose area under the curve is equal to 1.

18
Q

Bayes’ Rule

A

The conditional probability Pr(X = x|Y = y) is the probability of the random variable X to
have a specific value x given that another random variable Y has a specific value of y.
Bayes’ Rule (also known as Bayes’ Theorem) stipulates that:

$$\Pr(X = x \mid Y = y) = \frac{\Pr(Y = y \mid X = x)\,\Pr(X = x)}{\Pr(Y = y)}$$

19
Q

Model-Based vs. Instance-Based Learning

A

Most supervised learning algorithms are model-based. We have already seen one such
algorithm: SVM. Model-based learning algorithms use the training data to create a model
that has parameters learned from the training data. In SVM, the two parameters we saw
were w* and b*. After the model is built, the training data can be discarded.
Instance-based learning algorithms use the whole dataset as the model. One instance-based
algorithm frequently used in practice is k-Nearest Neighbors (kNN). In classification, to
predict a label for an input example the kNN algorithm looks at the close neighborhood of
the input example in the space of feature vectors and outputs the label that it saw the most
often in this close neighborhood.

20
Q

Shallow vs. Deep Learning

A

A shallow learning algorithm learns the parameters of the model directly from the features
of the training examples. Most supervised learning algorithms are shallow. The notorious
exceptions are neural network learning algorithms, specifically those that build neural
networks with more than one layer between input and output. Such neural networks are
called deep neural networks. In deep neural network learning (or, simply, deep learning),
contrary to shallow learning, most model parameters are learned not directly from the features
of the training examples, but from the outputs of the preceding layers.

21
Q

Decision Tree Learning

A

A decision tree is an acyclic graph that can be used to make decisions. In each branching
node of the graph, a specific feature j of the feature vector is examined. If the value of the
feature is below a specific threshold, then the left branch is followed; otherwise, the right
branch is followed. As the leaf node is reached, the decision is made about the class to which
the example belongs.

22
Q

k-Nearest Neighbors

A

k-Nearest Neighbors (kNN) is a non-parametric learning algorithm. Contrary to other
learning algorithms that allow discarding the training data after the model is built, kNN
keeps all training examples in memory. Once a new, previously unseen example x comes in,
the kNN algorithm finds k training examples closest to x and returns the majority label (in
case of classification) or the average label (in case of regression).
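
A minimal sketch of kNN classification, assuming Euclidean distance as the closeness measure (the card does not fix one); `knn_predict` and the toy training set are illustrative:

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """train is a list of (feature_vector, label) pairs kept in memory."""
    # Sort all training examples by distance to x and keep the k closest.
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    labels = [label for _, label in neighbors]
    # Return the majority label among the k neighbors.
    return Counter(labels).most_common(1)[0][0]

train = [
    ([0.0, 0.0], "a"), ([0.1, 0.2], "a"),
    ([1.0, 1.0], "b"), ([0.9, 1.1], "b"), ([1.2, 0.8], "b"),
]
prediction = knn_predict(train, [0.2, 0.1], k=3)
print(prediction)
```

For regression, the last line of the function would return the average of the neighbors' labels instead of the majority label.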

23
Q

Building Blocks of a Learning Algorithm

A

1) a loss function;
2) an optimization criterion based on the loss function (a cost function, for example); and
3) an optimization routine that leverages training data to find a solution to the optimization
criterion.

24
Q

Gradient descent

A

Gradient descent is an iterative optimization algorithm for finding the minimum of a function.
To find a local minimum of a function using gradient descent, one starts at some random
point and takes steps proportional to the negative of the gradient (or approximate gradient)
of the function at the current point.
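
As a rough illustration, here is gradient descent on a one-dimensional function. The function f(x) = (x − 3)², the fixed starting point, and the parameter names `learning_rate` and `steps` are assumptions chosen for the example:

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# f(x) = (x - 3)^2 has derivative f'(x) = 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```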

25
Q

Stochastic gradient descent

A

Stochastic gradient descent (SGD) is a version of the algorithm that speeds up the
computation by approximating the gradient using smaller batches (subsets) of the training
data. SGD itself has various “upgrades”. Adagrad is a version of SGD that scales the learning rate α for
each parameter according to the history of gradients. As a result, α is reduced for very large
gradients and vice-versa. Momentum is a method that helps accelerate SGD by orienting
the gradient descent in the relevant direction and reducing oscillations. In neural network
training, variants of SGD such as RMSprop and Adam are most frequently used.

26
Q

Feature Engineering

A

The problem of transforming raw data into a dataset is called feature engineering. For
most practical problems, feature engineering is a labor-intensive process that demands from
the data analyst a lot of creativity and, preferably, domain knowledge.

27
Q

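Bias and variance

A

Bias is the error a model makes because it is too simple to capture the pattern in the data; a model with high bias underfits. Variance is the sensitivity of a model to the particular training set it was built from; a model with high variance overfits.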

28
Q

One-Hot Encoding

A

Some learning algorithms only work with numerical feature vectors. When some feature in
your dataset is categorical, like “colors” or “days of the week,” you can transform such a
categorical feature into several binary ones.
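
A small sketch of the transformation, assuming a three-value “colors” feature; the category list and the function name `one_hot` are illustrative:

```python
def one_hot(value, categories):
    """Encode a categorical value as one binary feature per category."""
    return [1 if value == category else 0 for category in categories]

colors = ["red", "yellow", "green"]
encoded = one_hot("yellow", colors)
print(encoded)
```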

29
Q

Binning

A

Binning (also called bucketing) is used when you have a numerical
feature but you want to convert it into a categorical one. It
is the process of converting a continuous feature into multiple binary features called bins or
buckets, typically based on value range.

In some cases, a carefully designed binning can help the learning algorithm to learn using
fewer examples. It happens because we give a “hint” to the learning algorithm that if the
value of a feature falls within a specific range, the exact value of the feature doesn’t matter.
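
One way to sketch binning is with sorted bin edges. The age edges below are hypothetical, and `bin_feature` is an illustrative name:

```python
import bisect

def bin_feature(value, edges):
    """Map a numerical value to one binary feature per bin.

    edges must be sorted; they define len(edges) + 1 bins.
    """
    index = bisect.bisect_right(edges, value)
    return [1 if i == index else 0 for i in range(len(edges) + 1)]

# Hypothetical bins: under 18, 18 to 39, 40 to 64, 65 and over.
age_edges = [18, 40, 65]
bins = bin_feature(30, age_edges)
print(bins)
```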

30
Q

Normalization

A

Normalization is the process of converting an actual range of values which a numerical
feature can take, into a standard range of values, typically in the interval [−1, 1] or [0, 1].
For example, suppose the natural range of a particular feature is 350 to 1450. By subtracting
350 from every value of the feature, and dividing the result by 1100, one can normalize those
values into the range [0, 1].
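
The 350-to-1450 example can be sketched as min-max scaling. The function name and the sample values are illustrative:

```python
def min_max_normalize(values, low=None, high=None):
    """Rescale values into [0, 1] using the feature's minimum and maximum."""
    low = min(values) if low is None else low
    high = max(values) if high is None else high
    return [(v - low) / (high - low) for v in values]

# Subtract 350, then divide by 1450 - 350 = 1100.
normalized = min_max_normalize([350, 900, 1450])
print(normalized)
```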

31
Q

Why do we normalize?

A

Normalizing the data is not a strict requirement. However, in practice,
it can lead to an increased speed of learning. Remember the gradient descent example from
the previous chapter. Imagine you have a two-dimensional feature vector. When you update
the parameters w(1) and w(2), you use partial derivatives of the average squared error with
respect to w(1) and w(2). If x(1) is in the range [0, 1000] and x(2) is in the range [0, 0.0001], then
the derivative with respect to the larger feature will dominate the update.

32
Q

Standardization

A

Standardization (or z-score normalization) is the procedure during which the feature
values are rescaled so that they have the properties of a standard normal distribution with
μ = 0 and σ = 1, where μ is the mean (the average value of the feature, averaged over all
examples in the dataset) and σ is the standard deviation from the mean.
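
A sketch of the rescaling, assuming the population standard deviation is the intended σ; the sample values are made up:

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z)
```

After rescaling, the feature has mean 0 and standard deviation 1.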

33
Q

When should you use normalization and when standardization?

A

Usually, if your dataset is not too big and you have time,
you can try both and see which one performs better for your task.

  • unsupervised learning algorithms, in practice, more often benefit from standardization
    than from normalization;
  • standardization is also preferred for a feature if the values this feature takes are
    distributed close to a normal distribution (so-called bell curve);
  • again, standardization is preferred for a feature if it can sometimes have extremely high
    or low values (outliers); this is because normalization will “squeeze” the normal values
    into a very small range;
  • in all other cases, normalization is preferable.
34
Q

Dealing with Missing Features

A
  • Removing the examples with missing features from the dataset. That can be done if
    your dataset is big enough so you can sacrifice some training examples.
  • Using a learning algorithm that can deal with missing feature values (depends on the
    library and a specific implementation of the algorithm).
  • Using a data imputation technique.
35
Q

Data Imputation Techniques

A

One technique consists in replacing the missing value of a feature by the average value of this
feature in the dataset.

Another technique is to replace the missing value with the same value outside the normal range
of values. For example, if the normal range is [0, 1], then you can set the missing value equal
to 2 or −1. The idea is that the learning algorithm will learn what is best to do when the
feature has a value significantly different from other values.

Alternatively, you can replace the
missing value with a value in the middle of the range. For example, if the range for a feature is
[−1, 1], you can set the missing value to be equal to 0. Here, the idea is that if we use the
value in the middle of the range to replace missing features, such a value will not significantly
affect the prediction.

A more advanced technique is to use the missing value as the target variable for a regression
problem. You can use all remaining features $[x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(j-1)}, x_i^{(j+1)}, \ldots, x_i^{(D)}]$ to form
a feature vector $\hat{\mathbf{x}}_i$, set $\hat{y}_i = x_i^{(j)}$, where j is the feature with a missing value. Now we can
build a regression model to predict $\hat{y}$ from the feature vectors $\hat{\mathbf{x}}$. Of course, to build training
examples $(\hat{\mathbf{x}}, \hat{y})$, you only use those examples from the original dataset, in which the value of
feature j is present.

Finally, if you have a significantly large dataset and just a few features with missing values,
you can increase the dimensionality of your feature vectors by adding a binary indicator
feature for each feature with missing values. Let’s say feature j = 12 in your D-dimensional
dataset has missing values. For each feature vector x, you then add the feature j = D + 1
which is equal to 1 if the value of feature 12 is present in x and 0 otherwise. The missing
feature value then can be replaced by 0 or any number of your choice.
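
Two of these techniques, mean imputation and the binary indicator feature, can be combined in a short sketch. It assumes missing values are stored as `None`; the function name and toy rows are illustrative:

```python
def impute_with_indicator(rows, j):
    """Fill missing values of feature j with its mean and append an indicator.

    The indicator is 1 if the value of feature j was present and 0 otherwise.
    """
    present = [row[j] for row in rows if row[j] is not None]
    mean = sum(present) / len(present)
    result = []
    for row in rows:
        missing = row[j] is None
        new_row = list(row)
        if missing:
            new_row[j] = mean
        new_row.append(0 if missing else 1)
        result.append(new_row)
    return result

rows = [[1.0, 10.0], [2.0, None], [3.0, 30.0]]
imputed = impute_with_indicator(rows, j=1)
print(imputed)
```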

36
Q

Learning Algorithm Selection

A

Explainability
In-memory vs. out-of-memory
Number of features and examples
Categorical vs. numerical features
Nonlinearity of the data
Training speed
Prediction speed

37
Q

Three sets when working with data

A

1) training set,
2) validation set, and
3) test set.

38
Q

Underfit - high bias

A
  • your model is too simple for the data (for example a linear model can often underfit);
  • the features you engineered are not informative enough.
39
Q

Overfitting - high variance

A
  • your model is too complex for the data (for example a very tall decision tree or a very
    deep or wide neural network often overfit);
  • you have too many features but a small number of training examples.
40
Q

Regularization

A

Regularization is an umbrella term that encompasses methods that force the learning
algorithm to build a less complex model. In practice, that often leads to slightly higher
bias but significantly reduces the variance. This problem is known in the literature as the
bias-variance tradeoff.

41
Q

elastic net regularization

A

L1 and L2 regularization methods are also combined in what is called elastic net regularization,
with L1 and L2 regularizations being special cases. You can find in the literature the name
ridge regularization for L2 and lasso for L1.

42
Q

What are the most widely used metrics and tools to assess a classification model?

A
  • confusion matrix,
  • accuracy,
  • cost-sensitive accuracy,
  • precision/recall, and
  • area under the ROC curve.
43
Q

Confusion Matrix

A

The confusion matrix is a table that summarizes how successful the classification model
is at predicting examples belonging to various classes. One axis of the confusion matrix
is the label that the model predicted, and the other axis is the actual label. In a binary
classification problem, there are two classes.
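
For the binary case, the four cells of the confusion matrix can be counted as follows. The label lists are made up, and treating label 1 as the positive class is an assumption:

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn

actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts)
```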

44
Q

Precision/Recall

A

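Precision is the ratio of correct positive predictions to the overall number of positive predictions:

$$\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

Recall is the ratio of correct positive predictions to the overall number of positive examples
in the test set:

$$\text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

Here TP, FP, and FN are the numbers of true positives, false positives, and false negatives.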
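
Both ratios follow directly from the confusion-matrix counts. The counts used here are made-up values for illustration:

```python
def precision(tp, fp):
    """Correct positive predictions over all positive predictions."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Correct positive predictions over all positive examples."""
    return tp / (tp + fn)

p = precision(tp=3, fp=1)
r = recall(tp=3, fn=2)
print(p, r)
```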

45
Q

Accuracy

A

Accuracy is given by the number of correctly classified examples divided by the total number
of classified examples. In terms of the confusion matrix, it is given by:

$$\text{accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

46
Q

Cost-Sensitive Accuracy

A

For dealing with the situation in which different classes have different importance, a useful
metric is cost-sensitive accuracy. To compute a cost-sensitive accuracy, you first assign a
cost (a positive number) to both types of mistakes: FP and FN. You then compute the counts
TP, TN, FP, FN as usual and multiply the counts for FP and FN by the corresponding cost
before calculating the accuracy with the usual accuracy formula.
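
One possible reading of this recipe, sketched below, weights the FP and FN counts in the denominator of the accuracy ratio. The parameter names `fp_cost` and `fn_cost` are illustrative, and other conventions exist:

```python
def cost_sensitive_accuracy(tp, tn, fp, fn, fp_cost=1.0, fn_cost=1.0):
    """Accuracy with the FP and FN counts multiplied by their costs (assumed reading)."""
    return (tp + tn) / (tp + tn + fp_cost * fp + fn_cost * fn)

plain = cost_sensitive_accuracy(tp=3, tn=3, fp=1, fn=1)
weighted = cost_sensitive_accuracy(tp=3, tn=3, fp=1, fn=1, fn_cost=5.0)
print(plain, weighted)
```

With both costs equal to 1 the result is the ordinary accuracy; raising the cost of false negatives lowers the score of a model that makes them.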

47
Q

ROC curves

A

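A ROC (receiver operating characteristic) curve plots the true positive rate against the false positive rate as the classification threshold is varied. The area under the curve (AUC) summarizes it in one number: a perfect classifier has an AUC of 1, and random guessing gives about 0.5.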

48
Q

Hyperparameter Tuning

A

As you already know, hyperparameters aren’t optimized by the learning algorithm itself. The
data analyst has to “tune” hyperparameters by experimentally finding the best combination
of values, one per hyperparameter.

49
Q

grid search?

A

Grid search is the simplest hyperparameter tuning strategy. Let’s say you train an SVM
and you have two hyperparameters to tune: the penalty parameter C (a positive real number)
and the kernel (either “linear” or “rbf”).
If it’s the first time you are working with this dataset, you don’t know what the possible
range of values for C is. The most common trick is to use a logarithmic scale. For example, for
C you can try the following values: [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]. In this case you have
14 combinations of hyperparameters to try: [(0.001, “linear”), (0.01, “linear”), (0.1, “linear”),
(1.0, “linear”), (10, “linear”), (100, “linear”), (1000, “linear”), (0.001, “rbf”), (0.01, “rbf”),
(0.1, “rbf”), (1.0, “rbf”), (10, “rbf”), (100, “rbf”), (1000, “rbf”)].
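
The 14 combinations can be enumerated with `itertools.product`. The scoring function `toy_score` is a hypothetical stand-in for training an SVM and measuring it on validation data:

```python
from itertools import product

c_values = [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]
kernels = ["linear", "rbf"]
grid = list(product(c_values, kernels))  # every (C, kernel) pair

def grid_search(grid, evaluate):
    """Return the combination with the highest evaluation score."""
    return max(grid, key=lambda params: evaluate(*params))

def toy_score(c, kernel):
    # Hypothetical score; a real one would train and validate a model.
    return -abs(c - 10) + (0.5 if kernel == "rbf" else 0.0)

best = grid_search(grid, toy_score)
print(len(grid), best)
```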

50
Q

random search

A

Random search differs from grid search in that you no longer provide a discrete set of
values to explore for each hyperparameter; instead, you provide a statistical distribution for
each hyperparameter from which values are randomly sampled and set the total number of
combinations you want to try.
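
A sketch of the sampling step, assuming a log-uniform distribution for C over [0.001, 1000] and a uniform choice of kernel; both distributions are illustrative choices:

```python
import random

def sample_hyperparameters(rng):
    """Draw one (C, kernel) combination from the assumed distributions."""
    c = 10 ** rng.uniform(-3, 3)  # log-uniform between 0.001 and 1000
    kernel = rng.choice(["linear", "rbf"])
    return c, kernel

rng = random.Random(0)  # fixed seed so the draw is repeatable
n_combinations = 10     # total number of combinations to try
candidates = [sample_hyperparameters(rng) for _ in range(n_combinations)]
print(candidates[:3])
```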

51
Q

Bayesian hyperparameter optimization.

A

Bayesian techniques differ from random or grid search in that they use past evaluation results
to choose the next values to evaluate. The idea is to limit expensive optimization of the
objective function by choosing the next hyperparameter values based on those that have done
well in the past.

52
Q

Cross-Validation

A

When you don’t have a decent validation set to tune your hyperparameters on, the common
technique that can help you is called cross-validation. When you have few training examples,
it could be prohibitive to have both validation and test set. You would prefer to use more
data to train the model. In such a case, you only split your data into a training and a test
set. Then you use cross-validation on the training set to simulate a validation set.
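
A common form is k-fold cross-validation, where each fold of the training set takes a turn as the validation set. The card does not name a variant, so the k-fold scheme and the function below are assumptions:

```python
def k_fold_splits(n_examples, k=5):
    """Return k (training_indices, validation_indices) pairs."""
    indices = list(range(n_examples))
    fold_size, remainder = divmod(n_examples, k)
    splits = []
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra example.
        stop = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:stop]
        training = indices[:start] + indices[stop:]
        splits.append((training, validation))
        start = stop
    return splits

splits = k_fold_splits(10, k=5)
print(splits[0])
```

A model is trained on each training part and scored on the matching validation part; the k scores are then averaged.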

53
Q

Deep Learning

A

Deep learning refers to training neural networks with more than two non-output layers. In the
past, it became more difficult to train such networks as the number of layers grew. The two
biggest challenges were referred to as the problems of exploding gradient and vanishing
gradient as gradient descent was used to train the network parameters.

54
Q

backpropagation

A

Backpropagation is an efficient algorithm for computing gradients on neural networks using the chain rule.

55
Q

exploding gradient and vanishing gradient

A

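When backpropagation multiplies many derivatives together across the layers of a deep network, the gradient can shrink toward zero (vanishing gradient), so the early layers learn very slowly, or grow very large (exploding gradient), so the parameter updates become unstable.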

56
Q

hidden layers.

A

The layers that are neither input nor output are often called hidden layers.

57
Q

CNN, RNN

A

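A convolutional neural network (CNN) applies small learned filters across its input, which makes it well suited to grid-like data such as images. A recurrent neural network (RNN) processes a sequence one element at a time while carrying a state from step to step, which makes it well suited to text and other sequential data.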

58
Q

Multiclass Classification

A

In multiclass classification, the label can be one of the C classes: y ∈ {1, …, C}. Many
machine learning algorithms are binary; SVM is an example. Some algorithms can naturally
be extended to handle multiclass problems. ID3 and other decision tree learning algorithms
can be simply changed to handle more than two classes.

Logistic regression can be naturally extended to multiclass learning problems by replacing
the sigmoid function with the softmax function which we already saw in Chapter 6.
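
For reference, the softmax function turns a list of real-valued scores into probabilities that sum to 1. This sketch subtracts the maximum score first, a standard numerical-stability step that does not change the result:

```python
import math

def softmax(scores):
    """Convert real-valued scores into a probability distribution."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
```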

59
Q

One vs rest idea to transform multiclass problem?

A

The idea is to transform a multiclass problem into C binary classification
problems and build C binary classifiers. For example, if we have three classes, y ∈ {1, 2, 3},
we create copies of the original dataset and modify them. In the first copy, we replace all
labels not equal to 1 by 0. In the second copy, we replace all labels not equal to 2 by 0. In the
third copy, we replace all labels not equal to 3 by 0. Now we have three binary classification
problems where we have to learn to distinguish between labels 1 and 0, 2 and 0, and between
labels 3 and 0.

Once we have the three models and we need to classify the new input feature vector x,
we apply the three models to the input, and we get three predictions. We then pick the
prediction of a non-zero class which is the most certain. Remember that in logistic regression,
the model returns not a label but a score in the interval (0, 1) that can be interpreted as the probability
that the label is positive.
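
The relabeling and the final choice can be sketched as follows. The scores passed to `predict_one_vs_rest` are made-up stand-ins for the outputs of three trained binary models:

```python
def one_vs_rest_labels(labels, target):
    """Keep the target label and replace every other label by 0."""
    return [y if y == target else 0 for y in labels]

def predict_one_vs_rest(scores_by_class):
    """Pick the class whose binary model is the most certain."""
    return max(scores_by_class, key=scores_by_class.get)

labels = [1, 2, 3, 1, 2]
copies = {c: one_vs_rest_labels(labels, c) for c in (1, 2, 3)}

# Hypothetical scores in (0, 1) from the three binary classifiers.
predicted = predict_one_vs_rest({1: 0.2, 2: 0.7, 3: 0.4})
print(copies, predicted)
```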

60
Q

One-class classification

A

One-class classification, also known as unary classification or class modeling, tries to
identify objects of a specific class among all objects, by learning from a training set containing
only the objects of that class. That is different from and more difficult than the traditional
classification problem, which tries to distinguish between two or more classes with the
training set containing objects from all classes. A typical one-class classification problem is
the classification of the traffic in a secure network as normal. In this scenario, there are few,
if any, examples of the traffic under an attack or during an intrusion. However, the examples
of normal traffic are often in abundance. One-class classification learning algorithms are used
for outlier detection, anomaly detection, and novelty detection.

There are several one-class learning algorithms. The most widely used in practice are
one-class Gaussian, one-class k-means, one-class kNN, and one-class SVM.

61
Q

one-class gaussian

A

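The idea behind the one-class Gaussian is that we model our data as if it came from a Gaussian
distribution, more precisely a multivariate normal distribution (MND). The probability density
function (pdf) for MND is given by the following equation:

$$f_{\boldsymbol{\mu}, \boldsymbol{\Sigma}}(\mathbf{x}) = \frac{\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)}{\sqrt{(2\pi)^{D} |\boldsymbol{\Sigma}|}}$$

where $f_{\boldsymbol{\mu}, \boldsymbol{\Sigma}}(\mathbf{x})$ returns the probability density corresponding to the input feature vector x,
μ is the mean vector, Σ is the covariance matrix, and D is the dimensionality of x.
Probability density can be interpreted as the likelihood that example x was drawn from the
probability distribution we model as an MND.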

62
Q

Multi-Label Classification

A

In multi-label classification, each training example doesn’t just have one label, but several
of them. For instance, if we want to describe an image, we could assign several labels to it:
“people,” “concert,” “nature,” all three at the same time.

63
Q

Ensemble Learning

A

Ensemble learning is a learning paradigm that, instead of trying to learn one super-accurate
model, focuses on training a large number of low-accuracy models and then combining the
predictions given by those weak models to obtain a high-accuracy meta-model.

The two most widely used and effective ensemble learning algorithms are random forest and gradient boosting.

64
Q

What Is Bagging in Machine Learning?

A

Bagging, also known as bootstrap aggregating, is an ensemble learning technique that helps to improve the performance and accuracy of machine learning algorithms. It is used to deal with the bias-variance tradeoff and reduces the variance of a prediction model. Bagging helps reduce overfitting and is used for both regression and classification models, most commonly with decision tree algorithms.

65
Q

Random Forest/Gradient boosting

A

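Random forest trains many decision trees on bootstrap samples of the training data, with a random subset of features considered at each split, and averages their predictions or takes a majority vote. Gradient boosting builds trees sequentially, with each new tree trained to correct the errors of the ensemble built so far.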

66
Q

Sequence-to-sequence learning

A

Sequence-to-sequence learning (often abbreviated as seq2seq learning) is a generalization
of the sequence labeling problem. In seq2seq, Xi and Yi can have different lengths. seq2seq
models have found application in machine translation (where, for example, the input is
an English sentence, and the output is the corresponding French sentence), conversational
interfaces (where the input is a question typed by the user, and the output is the answer
from the machine), text summarization, spelling correction, and many others.

67
Q

embedding

A

The role of the encoder is to read
the input and generate some sort of state (similar to the state in RNN) that can be seen
as a numerical representation of the meaning of the input the machine can work with. The
meaning of some entity, whether it be an image, a text or a video, is usually a vector or a
matrix that contains real numbers. This vector (or matrix) is called in the machine learning
jargon the embedding of the input.

68
Q

architecture with attention.

A

The attention mechanism is implemented by an
additional set of parameters that combine some information from the encoder (in RNNs,
this information is the list of state vectors of the last recurrent layer from all encoder time
steps) and the current state of the decoder to generate the label. That allows for even better
retention of long-term dependencies than provided by gated units and bidirectional RNNs.

69
Q

Active learning

A

Active learning is an interesting supervised learning paradigm. It is usually applied when
obtaining labeled examples is costly. That is often the case in the medical or financial
domains, where the opinion of an expert may be required to annotate patients’ or customers’
data. The idea is that we start the learning with relatively few labeled examples, and a large
number of unlabeled ones, and then add labels only to those examples that contribute the
most to the model quality.

70
Q

semi-supervised learning

A

In semi-supervised learning (SSL) we also have labeled a small fraction of the dataset;
most of the remaining examples are unlabeled. Our goal is to leverage a large number of
unlabeled examples to improve the model performance without asking an expert for additional
labeled examples.

For example, it was shown that for some datasets, such as MNIST (a
frequent testbench in computer vision that consists of labeled images of handwritten digits
from 0 to 9) the model trained in a semi-supervised way has an almost perfect performance
with just 10 labeled examples per class (100 labeled examples overall). For comparison,
MNIST contains 70,000 labeled examples (60,000 for training and 10,000 for test). The
neural network architecture that attained such a remarkable performance is called a ladder
network.

71
Q

autoencoder

A

An autoencoder is a feed-forward neural network with an encoder-decoder architecture. It
is trained to reconstruct its input. So the training example is a pair (x, x). We want the
output xˆ of the model f(x) to be as similar to the input x as possible.

72
Q

One-Shot Learning

A

In one-shot learning, typically applied in
face recognition, we want to build a model that can recognize that two photos of the same
person represent that same person. If we present to the model two photos of two different
people, we expect the model to recognize that the two people are different.

One way to build such a model is to train a siamese neural network (SNN). An SNN can
be implemented as any kind of neural network, a CNN, an RNN, or an MLP. What matters
is how we train the network.
To train an SNN, we use the triplet loss function. For example, let us have three images of
a face: the image A (for anchor), the image P (for positive) and the image N (for negative).
A and P are two different pictures of the same person; N is a picture of another person.

Each training example i is now a triplet $(A_i, P_i, N_i)$.
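
A minimal sketch of how a triplet loss could be computed for one triplet. The card does not give the formula, so the form below (squared Euclidean distances plus a margin `alpha`, clipped at zero) is an assumption based on a common definition. The embeddings are made-up vectors standing in for the output of the network.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Zero once the negative is farther from the anchor than the
    positive by at least the margin alpha (alpha is an assumed value)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + alpha, 0.0)


# Hypothetical embeddings f(A), f(P), f(N) produced by the same network.
f_a, f_p, f_n = [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]
easy = triplet_loss(f_a, f_p, f_n)   # positive is close, negative is far
hard = triplet_loss(f_a, f_n, f_p)   # roles swapped: the loss becomes positive
```

The loss only pushes on triplets where the network still confuses the two people, which is why the choice of negatives matters in practice.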

73
Q

Zero-Shot Learning

A

We finish this chapter with zero-shot learning. It is a relatively new
research area, so there are no algorithms that proved to have a significant
practical utility yet. Therefore, I only outline here the basic idea and
leave the details of various algorithms for further reading. In zero-shot
learning (ZSL) we want to train a model to assign labels to objects. The
most frequent application is to learn to assign labels to images.

74
Q

Handling Imbalanced Datasets

A

If you set the cost of misclassification of examples of the minority class higher, then the
model will try harder to avoid misclassifying those examples, at the cost of
misclassification of some examples of the majority class, as illustrated in Figure 1b.
Some SVM implementations (including SVC in scikit-learn) allow you to provide weights for
every class. The learning algorithm takes this information into account when looking for the
best hyperplane.
If your learning algorithm doesn’t allow weighting classes, you can try to increase the
importance of examples of some class by making multiple copies of the examples of this class
(this is called oversampling).
An opposite approach is to randomly remove from the training set some examples of the
majority class (undersampling).
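
A small sketch of oversampling by duplication and random undersampling, using only the standard library. The function names, the `copies` and `keep` parameters, and the toy data are assumptions made for illustration.

```python
import random


def oversample(examples, labels, minority, copies):
    """Append `copies` extra duplicates of every minority-class example."""
    xs, ys = list(examples), list(labels)
    for x, y in zip(examples, labels):
        if y == minority:
            xs.extend([x] * copies)
            ys.extend([y] * copies)
    return xs, ys


def undersample(examples, labels, majority, keep, seed=0):
    """Keep all other examples and a random subset of `keep` majority examples."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = set(rng.sample(majority_idx, keep))
    pairs = [(x, y) for i, (x, y) in enumerate(zip(examples, labels))
             if y != majority or i in kept]
    return [x for x, _ in pairs], [y for _, y in pairs]


X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]          # class 1 is the minority
X_over, y_over = oversample(X, y, minority=1, copies=4)
X_under, y_under = undersample(X, y, majority=0, keep=1)
```

With scikit-learn's SVC, the class-weighting route mentioned above goes through the `class_weight` parameter instead of resampling.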

75
Q

popular algorithms that oversample the minority class

A

There are two popular algorithms that oversample the minority class by creating
synthetic examples: the synthetic minority oversampling technique (SMOTE) and the
adaptive synthetic sampling method (ADASYN).

76
Q

Combining Models

A

In practice, we can sometimes
get an additional performance gain by combining strong models made with different learning
algorithms. In this case, we usually use only two or three models.
There are three typical ways to combine models:
1) averaging,
2) majority vote, and
3) stacking.

Averaging works for regression as well as those classification models that return classification
scores. You simply apply all your models, let’s call them base models, to the input x and
then average the predictions. To see if the averaged model works better than each individual
algorithm, you test it on the validation set using a metric of your choice.
Majority vote works for classification models. You apply all your base models to the input
x and then return the majority class among all predictions. In the case of a tie, you either
randomly pick one of the classes, or you return an error message (if a misclassification
would incur a significant cost).
Stacking consists of building a meta-model that takes the output of your base models as
input. Let’s say you want to combine a classifier f1 and a classifier f2, both predicting the
same set of classes. To create a training example $(\hat{x}_i, \hat{y}_i)$ for the stacked model, you set
$\hat{x}_i = [f_1(x_i), f_2(x_i)]$ and $\hat{y}_i = y_i$.
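
The three ways of combining models can be sketched as below. The base models `f1`, `f2`, `f3` are hypothetical threshold rules, and returning `None` on a tie is one possible convention, not the only one.

```python
from collections import Counter


def average(predictions):
    """Average the scores returned by several base models."""
    return sum(predictions) / len(predictions)


def majority_vote(predictions):
    """Return the most frequent class, or None when the top count is tied."""
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: the caller decides (random pick or error)
    return counts[0][0]


def stacking_features(base_models, x):
    """Build the meta-model input from the base models' outputs."""
    return [model(x) for model in base_models]


# Hypothetical base models: each maps a number to a class label.
f1 = lambda x: "spam" if x > 0.5 else "ham"
f2 = lambda x: "spam" if x > 0.7 else "ham"
f3 = lambda x: "spam" if x > 0.3 else "ham"

vote = majority_vote([f(0.6) for f in (f1, f2, f3)])
tie = majority_vote(["spam", "ham"])
mean_score = average([0.2, 0.4, 0.9])
meta_input = stacking_features([f1, f2], 0.6)
```

For stacking, `meta_input` would be paired with the original label to form one training example for the meta-model.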

77
Q

Advanced Regularization

A

In neural networks, besides L1 and L2 regularization, you can use neural network specific
regularizers: dropout, batch normalization, and early stopping. Batch normalization
is technically not a regularization technique, but it often has a regularization effect on the model.

78
Q

dropout

A

The concept of dropout is very simple. Each time you run a training example through the
network, you temporarily exclude at random some units from the computation. The higher
the percentage of units excluded, the higher the regularization effect. Neural network libraries
allow you to add a dropout layer between two successive layers, or you can specify the dropout
parameter for the layer. The dropout parameter is in the range [0, 1] and it has to be found
experimentally by tuning it on the validation data.
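
A rough sketch of what a dropout layer does to the outputs of one layer during training. Rescaling the surviving units by 1 / (1 - rate), often called inverted dropout, is an assumption of this sketch and is not stated on the card. For that reason the sketch rejects rate = 1, which would exclude every unit.

```python
import random


def dropout(activations, rate, rng):
    """Zero each unit with probability `rate` for this training pass.

    Survivors are scaled by 1 / (1 - rate) so the expected activation
    stays the same (an assumption of this sketch).
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
layer_output = [1.0] * 1000
dropped = dropout(layer_output, rate=0.3, rng=rng)
fraction_zeroed = dropped.count(0.0) / len(dropped)   # close to the rate
```

At prediction time no units are excluded, so the function would simply not be applied.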

79
Q

Batch normalization

A

Batch normalization (which would more accurately be called batch standardization) is a technique that
consists of standardizing the outputs of each layer before the units of the subsequent layer
receive them as input. In practice, batch normalization results in a faster and more stable
training, as well as in some regularization effect.
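
A minimal sketch of the standardization step itself: each unit's output is shifted to zero mean and scaled to unit variance over the mini-batch. The small constant `eps` is an assumed safeguard against division by zero. Library implementations also learn a scale and a shift per unit, which this sketch leaves out.

```python
def standardize_batch(batch, eps=1e-5):
    """Standardize each feature (column) over the mini-batch (rows)."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n
                 for col, m in zip(columns, means)]
    return [[(v - m) / (var + eps) ** 0.5
             for v, m, var in zip(row, means, variances)]
            for row in batch]


# Outputs of one layer for a mini-batch of three examples, two units each.
layer_outputs = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = standardize_batch(layer_outputs)
column_means = [sum(col) / len(col) for col in zip(*normalized)]
```

After this step both units produce values on the same scale, even though the second unit's raw outputs were ten times larger.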

80
Q

Early stopping

A

Early stopping is a way to train a neural network by saving the preliminary model after
every epoch and assessing the performance of the preliminary model on the validation set. As
you remember from the section about gradient descent in Chapter 4, as the number of epochs
increases, the cost decreases. The decreased cost means that the model fits the training data
well. However, at some point, after some epoch e, the model can start overfitting: the cost
keeps decreasing, but the performance of the model on the validation data deteriorates. If
you keep, in a file, the version of the model after each epoch, you can stop the training once
you start observing a decreased performance on the validation set. Alternatively, you can
keep running the training process for a fixed number of epochs and then, in the end, you
pick the best model. Models saved after each epoch are called checkpoints. Some machine
learning practitioners rely on this technique very often; others try to properly regularize the
model to avoid such undesirable behavior.
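
The stopping rule can be sketched without a real network by scanning a list of per-epoch validation scores. The `patience` parameter (how many epochs without improvement to tolerate) is an assumption added here, and the scores are made up.

```python
def train_with_early_stopping(validation_scores, patience=2):
    """Scan per-epoch validation scores (higher is better).

    Stops once the score has not improved for `patience` epochs in a row
    and returns (best_epoch, epochs_run). In a real run each score would
    come from evaluating the checkpoint saved after that epoch.
    """
    best_epoch, best_score, epochs_run = None, float("-inf"), 0
    for epoch, score in enumerate(validation_scores, start=1):
        epochs_run = epoch
        if score > best_score:
            best_epoch, best_score = epoch, score   # keep this checkpoint
        elif epoch - best_epoch >= patience:
            break                                   # validation keeps degrading
    return best_epoch, epochs_run


# Made-up validation accuracies: improvement, then overfitting after epoch 4.
scores = [0.70, 0.78, 0.83, 0.85, 0.84, 0.82, 0.80, 0.79]
best_epoch, epochs_run = train_with_early_stopping(scores, patience=2)
```

Setting `patience` to the number of epochs reproduces the alternative described above: train for a fixed number of epochs and pick the best checkpoint at the end.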

81
Q

data augmentation.

A

Another regularization technique that can be applied not just to neural networks, but to
virtually any learning algorithm, is called data augmentation. This technique is often
used to regularize models that work with images. Once you have your original labeled
training set, you can create a synthetic example from an original example by applying various
transformations to the original image: zooming it slightly, rotating, flipping, darkening, and
so on. You keep the original label in these synthetic examples. In practice, this often results
in increased performance of the model.
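
A toy sketch of two such transformations on a grayscale image stored as a list of rows. The image, the label, and the darkening factor are made up. Real pipelines typically use an image library rather than nested lists.

```python
def flip_horizontal(image):
    """Mirror each row of a grayscale image stored as a list of rows."""
    return [row[::-1] for row in image]


def darken(image, factor=0.5):
    """Scale every pixel intensity down by `factor`."""
    return [[int(pixel * factor) for pixel in row] for row in image]


def augment(image, label):
    """Return the original plus synthetic examples that keep the same label."""
    variants = [image, flip_horizontal(image), darken(image)]
    return [(variant, label) for variant in variants]


# A tiny 2x3 "image" with pixel intensities in 0..255 (made-up data).
image = [[0, 100, 200],
         [50, 150, 250]]
examples = augment(image, label="cat")
```

One labeled image becomes three training examples, all labeled "cat".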

82
Q

Handling Multiple Inputs

A

In many of your practical problems, you will work with multimodal data. For example, your
input could be an image and text and the binary output could indicate whether the text
describes this image or not.
With neural networks, you have more flexibility. You can build two subnetworks, one for
each type of input. For example, a CNN subnetwork would read the image while an RNN
subnetwork would read the text. Both subnetworks have as their last layer an embedding:
CNN has an embedding for the image, while RNN has an embedding for the text. You can then
concatenate the two embeddings and add a classification layer, such as softmax or
sigmoid, on top of the concatenated embeddings. Neural network libraries provide simple-to-use
tools that allow concatenating or averaging layers from several subnetworks.

83
Q

Transfer learning

A

Transfer learning is probably where neural networks have a unique advantage over the
shallow models. In transfer learning, you pick an existing model trained on some dataset,
and you adapt this model to predict examples from another dataset, different from the one
the model was built on.

With neural networks, the situation is much more favorable. Transfer learning in neural
networks works like this.
1. You build a deep model on the original big dataset (wild animals).
2. You compile a much smaller labeled dataset for your second model (domestic animals).
3. You remove the last one or several layers from the first model. Usually, these are layers
responsible for the classification or regression; they usually follow the embedding layer.
4. You replace the removed layers with new layers adapted for your new problem.
5. You “freeze” the parameters of the layers remaining from the first model.
6. You use your smaller labeled dataset and gradient descent to train the parameters of
only the new layers.
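
Steps 5 and 6 can be illustrated with a toy that uses no deep learning library. Here `frozen_features` is a hypothetical stand-in for the layers kept from the first model (its weights are made up and never updated), and the new head is a single logistic unit trained by gradient descent on a tiny made-up dataset.

```python
import math


def frozen_features(x):
    """Stand-in for the layers kept from the first model.

    These weights are made up and are never updated below.
    """
    return [math.tanh(0.8 * x[0] - 0.3 * x[1]), math.tanh(0.5 * x[1])]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_new_head(data, epochs=500, lr=0.5):
    """Fit only the new output layer (logistic regression) by gradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            h = frozen_features(x)            # frozen part: forward pass only
            error = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - y
            w = [wi - lr * error * hi for wi, hi in zip(w, h)]
            b -= lr * error
    return w, b


def predict(x, w, b):
    h = frozen_features(x)
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


# Small made-up labeled dataset for the new task.
small_dataset = [([2.0, 0.0], 1), ([1.5, 0.5], 1),
                 ([-2.0, 0.0], 0), ([-1.0, 1.0], 0)]
w, b = train_new_head(small_dataset)
```

In a real library the same idea is expressed by marking the retained layers as non-trainable before fitting the model on the smaller dataset.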

84
Q

For faster computation: cProfile package?

A

Use the cProfile package in Python to find inefficiencies in your code.

Finally, when nothing can be improved in your code from the algorithmic perspective, you
can further boost the speed of your code by using:
* multiprocessing package to run computations in parallel, and
* PyPy, Numba or similar tools to compile your Python code into fast, optimized machine
code.
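
A short sketch of profiling one function with cProfile and reading the report through pstats, both from the standard library. The function `slow_sum_of_squares` is a made-up example of code under inspection.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop used as the code under inspection."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum_of_squares(100_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
```

Printing `report` lists the functions that took the most time, which is where optimization effort should go first.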

85
Q

Unsupervised Learning

A

Unsupervised learning deals with problems in which your dataset doesn’t have labels. This
property is what makes it very problematic for many practical applications. The absence
of labels which represent the desired behavior for your model means the absence of a solid
reference point to judge the quality of your model. In this book, I only present unsupervised
learning methods that allow building models that can be evaluated based on data as opposed
to human judgment.

86
Q

Clustering

A

Clustering is a problem of learning to assign a label to examples by leveraging an unlabeled
dataset. Because the dataset is completely unlabeled, deciding on whether the learned model
is optimal is much more complicated than in supervised learning.

87
Q

K-Means

A

The k-means clustering algorithm works as follows. First, the analyst has to choose k, the
number of classes (or clusters). Then we randomly put k feature vectors, called centroids,
into the feature space. We then compute the distance from each example x to each centroid c
using some metric, like the Euclidean distance. Then we assign the closest centroid to each
example (as if we labeled each example with a centroid id as the label). For each centroid,
we calculate the average feature vector of the examples labeled with it. These average feature
vectors become the new locations of the centroids.
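
The procedure can be sketched in plain Python as follows. Several details are assumptions of this sketch: the assign-and-update steps are repeated for a fixed number of iterations, the initial centroids default to k examples picked at random, and an empty cluster keeps its previous centroid.

```python
import random


def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def mean_vector(points):
    return [sum(values) / len(points) for values in zip(*points)]


def k_means(examples, k, iterations=10, seed=0, initial_centroids=None):
    """Return (centroids, labels) after a fixed number of iterations."""
    rng = random.Random(seed)
    if initial_centroids is None:
        # Assumption: start from k distinct examples picked at random.
        initial_centroids = rng.sample(examples, k)
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assign each example the id of its closest centroid.
        labels = [min(range(k), key=lambda c: euclidean(x, centroids[c]))
                  for x in examples]
        # Move each centroid to the average of the examples labeled with it.
        for c in range(k):
            members = [x for x, lab in zip(examples, labels) if lab == c]
            if members:                      # leave an empty cluster in place
                centroids[c] = mean_vector(members)
    return centroids, labels


points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]]
centroids, labels = k_means(points, k=2)
```

Because the starting centroids are random, different seeds can give different cluster ids and, on harder data, different clusterings.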

88
Q

DBSCAN and HDBSCAN

A

Density-based clustering methods; look them up in the text.

89
Q

Dimensionality Reduction

A

Many modern machine learning algorithms, such as ensemble algorithms and neural networks,
handle very high-dimensional examples well, up to millions of features. With modern
computers and graphical processing units (GPUs), dimensionality reduction techniques are
used much less in practice than in the past. The most frequent use case for dimensionality
reduction is data visualization: humans can interpret at most three dimensions on a
plot.

Another situation in which you could benefit from dimensionality reduction is when you
have to build an interpretable model and to do so you are limited in your choice of learning
algorithms. For example, you can only use decision tree learning or linear regression. By
reducing your data to lower dimensionality and by figuring out which quality of the original
example each new feature in the reduced feature space reflects, you can use simpler algorithms.
Dimensionality reduction removes redundant or highly correlated features; it also reduces the
noise in the data — all that contributes to the interpretability of the model.

The three most widely used techniques of dimensionality reduction are principal
component analysis (PCA), uniform manifold approximation and projection (UMAP), and autoencoders.

90
Q

Learning to rank

A

Learning to rank is a supervised learning problem. Among others, one frequent problem
solved using learning to rank is the optimization of search results returned by a search engine
for a query. In search result ranking optimization, a labeled example Xi in the training set
of size N is a ranked collection of documents of size ri (labels are ranks of documents). A
feature vector represents each document in the collection. The goal of the learning is to find
a ranking function f which outputs values that can be used to rank documents.

91
Q

mean average precision (MAP).

A

Mean average precision (mAP) is a metric used to evaluate object detection models such as Fast R-CNN, YOLO, and Mask R-CNN. The mean of the average precision (AP) values is calculated over recall values from 0 to 1.
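
For the ranking setting of the surrounding cards, AP and MAP could be computed as in the sketch below. It assumes binary relevance judgments and that every relevant document appears somewhere in the ranked list. The relevance data is made up.

```python
def average_precision(ranked_relevance):
    """AP for one ranked list; entries are 1 (relevant) or 0 (not relevant)."""
    hits, precision_sum = 0, 0.0
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / position   # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: the mean of the per-query AP values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Made-up relevance judgments for the results of two queries.
query_1 = [1, 0, 1, 0]    # AP = (1/1 + 2/3) / 2
query_2 = [0, 1, 0, 0]    # AP = (1/2) / 1
map_score = mean_average_precision([query_1, query_2])
```

Moving a relevant document toward the top of a list raises that query's AP, and therefore the MAP.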

92
Q

LambdaMART

A

LambdaMART is a technique where ranking is transformed into a pairwise classification or regression problem. The algorithm considers one pair of items at a time, coming up with a viable ordering of that pair before producing the final order of the entire list.

93
Q

Learning to Recommend

A

Learning to recommend is an approach to building recommender systems. Usually, we have a
user who consumes some content. We have the history of consumption, and we want to
suggest to this user new content that the user would like. It could be a movie on Netflix or a
book on Amazon.

Traditionally, two approaches were used to give recommendations: content-based filtering
and collaborative filtering.

94
Q

Content-based filtering

A

Content-based filtering is based on learning what users like based on the description of
the content they consume. For example, if the user of a news site often reads news articles on
science and technology, then we would suggest to this user more documents on science and
technology. More generally, we could create one training set per user and add news articles
to this dataset as a feature vector x and whether the user recently read this news article as a
label y. Then we build the model of each user and can regularly examine each new piece of
content to determine whether a specific user would read it or not.
The content-based approach has many limitations. For example, the user can be trapped in
the so-called filter bubble: the system will always suggest to that user the information that
looks very similar to what the user already consumed. That could result in complete isolation of
the user from information that disagrees with their viewpoints or expands them. On a more
practical side, the users might just get recommendations of items they already know about,
which is undesirable.

95
Q

Collaborative filtering

A

Collaborative filtering has a significant advantage over content-based filtering: the
recommendations to one user are computed based on what other users consume or rate. For
instance, if two users gave high ratings to the same ten movies, then it’s more likely that
user 1 will appreciate new movies recommended based on the tastes of user 2 and vice versa.
The drawback of this approach is that the content of the recommended items is ignored.
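
A minimal user-based sketch of this idea: find the user whose ratings are most similar to the target user's, then suggest items that neighbor rated and the target has not seen. Cosine similarity over shared items and the use of a single nearest neighbor are assumptions of this sketch. The ratings are made up.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity over the items both users rated."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(target, ratings, top_n=1):
    """Suggest items the most similar user rated that the target has not seen."""
    others = [u for u in ratings if u != target]
    neighbor = max(others,
                   key=lambda u: cosine_similarity(ratings[target], ratings[u]))
    unseen = {item: score for item, score in ratings[neighbor].items()
              if item not in ratings[target]}
    return sorted(unseen, key=unseen.get, reverse=True)[:top_n]


# Made-up ratings on a 1-5 scale.
ratings = {
    "user_1": {"Alien": 5, "Heat": 4, "Up": 1},
    "user_2": {"Alien": 5, "Heat": 5, "Up": 1, "Blade Runner": 5, "Cars": 2},
    "user_3": {"Alien": 1, "Heat": 2, "Up": 5, "Frozen": 5},
}
suggestions = recommend("user_1", ratings)
```

Nothing about the movies themselves is used, only who rated what, which is exactly the drawback noted above.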

96
Q

Word Embeddings

A

Google - find in book