The Hundred- Page Machine Learning (Book) Flashcards
What is Machine Learning
Machine learning is a subfield of computer science that is concerned with building algorithms
which, to be useful, rely on a collection of examples of some phenomenon. These examples
can come from nature, be handcrafted by humans or generated by another algorithm.
Machine learning can also be defined as the process of solving a practical problem by 1)
gathering a dataset, and 2) algorithmically building a statistical model based on that dataset.
Types of Learning
Learning can be supervised, semi-supervised, unsupervised and reinforcement.
Supervised Learning
The goal of a supervised learning algorithm is to use the dataset to produce a model
that takes a feature vector x as input and outputs information that allows deducing the label
for this feature vector. For instance, the model created using the dataset of people could
take as input a feature vector describing a person and output a probability that the person
has cancer.
Unsupervised Learning
In unsupervised learning, the dataset is a collection of unlabeled examples {xi}N
i=1.
Again, x is a feature vector, and the goal of an unsupervised learning algorithm is
to create a model that takes a feature vector x as input and either transforms it into
another vector or into a value that can be used to solve a practical problem. For example,
in clustering, the model returns the id of the cluster for each feature vector in the dataset.
In dimensionality reduction, the output of the model is a feature vector that has fewer
features than the input x; in outlier detection, the output is a real number that indicates
how x is di
erent from a “typical” example in the dataset.
Semi-Supervised Learning
In semi-supervised learning, the dataset contains both labeled and unlabeled examples.
Usually, the quantity of unlabeled examples is much higher than the number of labeled
examples. The goal of a semi-supervised learning algorithm is the same as the goal of
the supervised learning algorithm. The hope here is that using many unlabeled examples can
help the learning algorithm to find (we might say “produce” or “compute”) a better model2.
Reinforcement Learning
Reinforcement learning is a subfield of machine learning where the machine “lives” in an
environment and is capable of perceiving the state of that environment as a vector of
features. The machine can execute actions in every state. Different actions bring different rewards and could also move the machine to another state of the environment. The goal
of a reinforcement learning algorithm is to learn a policy. A policy is a function f (similar
to the model in supervised learning) that takes the feature vector of a state as input and
outputs an optimal action to execute in that state. The action is optimal if it maximizes the
expected average reward.
bag of words?
-the first feature is equal to 1 if the email message contains the word “a”; otherwise,
this feature is 0;
* the second feature is equal to 1 if the email message contains the word “aaron”; otherwise,
this feature equals 0;
* …
* the feature at position 20,000 is equal to 1 if the email message contains the word
“zulu”; otherwise, this feature is equal to 0.
Now you have a machine-readable input data, but the output labels are still in the form of
human-readable text.
scalar
A scalar is a simple numerical value, like 15 or ≠3.25. Variables or constants that take scalar
values are denoted by an italic letter, like x or a.
vector
A vector is an ordered list of scalar values, called attributes. We denote a vector as a bold
character, for example, x or w. Vectors can be visualized as arrows that point to some
directions as well as points in a multi-dimensional space. Illustrations of three two-dimensional
vectors, a = [2, 3], b = [≠2, 5], and c = [1, 0] is given in fig. 1. We denote an attribute of a
vector as an italic value with an index, like this: w(j) or x(j)
. The index j denotes a specific
dimension of the vector, the position of an attribute in the list. For instance, in the vector a
shown in red in fig. 1, a(1) = 2 and a(2) = 3.
set
A set is an unordered collection of unique elements. We denote a set as a calligraphic capital character, for example, S.
The sum of two vectors
The sum of two vectors x + z is defined as the vector [x(1) + z(1), x(2) + z(2),…,x(m) + z(m)].
vector multiplied by a scalar
A vector multiplied by a scalar is a vector. For example xc = [cx(1), cx(2), . . . , cx(m)].
dot-product of two vectors
A dot-product of two vectors is a scalar. For example, wx def
= qm
i=1 w(i)
x(i)
. In some books,
the dot-product is denoted as w · x. The two vectors must be of the same dimensionality.
Otherwise, the dot-product is undefined.
(0, 1) contains 0 and 1? what about [0,1]?
() no [] yes.
Derivative and Gradient
A derivative fÕ of a function f is a function or a value that describes how fast f grows (or
decreases). If the derivative is a constant value, like 5 or ≠3, then the function grows (or
decreases) constantly at any point x of its domain. If the derivative fÕ is a function, then the
function f can grow at a different pace in different regions of its domain. If the derivative fÕ
is positive at some point x, then the function f grows at this point. If the derivative of f is
negative at some x, then the function decreases at this point. The derivative of zero at x
means that the function’s slope at x is horizontal.
probability distribution
The probability distribution of a discrete random variable is described by a list of probabilities
associated with each of its possible values. This list of probabilities is called probability mass
function (pmf). For example: Pr(X = red)=0.3, Pr(X = yellow)=0.45, Pr(X = blue) =
0.25. Each probability in a probability mass function is a value greater than or equal to 0.
The sum of probabilities equals 1
continuous random variable
A continuous random variable takes an infinite number of possible values in some interval.
Examples include height, weight, and time. Because the number of values of a continuous
random variable X is infinite, the probability Pr(X = c) for any c is 0. Therefore, instead
of the list of probabilities, the probability distribution of a continuous random variable (a
continuous probability distribution) is described by a probability density function (pdf). The
pdf is a function whose codomain is nonnegative and the area under the curve is equal to 1
Bayes’ Rule
The conditional probability Pr(X = x|Y = y) is the probability of the random variable X to
have a specific value x given that another random variable Y has a specific value of y. The
Bayes’ Rule (also known as the Bayes’ Theorem) stipulates that:
Model-Based vs. Instance-Based Learning
Most supervised learning algorithms are model-based. We have already seen one such
algorithm: SVM. Model-based learning algorithms use the training data to create a model
that has parameters learned from the training data. In SVM, the two parameters we saw
were wú and bú. After the model was built, the training data can be discarded.
Instance-based learning algorithms use the whole dataset as the model. One instance-based
algorithm frequently used in practice is k-Nearest Neighbors (kNN). In classification, to
predict a label for an input example the kNN algorithm looks at the close neighborhood of
the input example in the space of feature vectors and outputs the label that it saw the most
often in this close neighborhood.
Shallow vs. Deep Learning
A shallow learning algorithm learns the parameters of the model directly from the features
of the training examples. Most supervised learning algorithms are shallow. The notorious
exceptions are neural network learning algorithms, specifically those that build neural
networks with more than one layer between input and output. Such neural networks are
called deep neural networks. In deep neural network learning (or, simply, deep learning),
contrary to shallow learning, most model parameters are learned not directly from the features
of the training examples, but from the outputs of the preceding layers.
Decision Tree Learning
A decision tree is an acyclic graph that can be used to make decisions. In each branching
node of the graph, a specific feature j of the feature vector is examined. If the value of the
feature is below a specific threshold, then the left branch is followed; otherwise, the right
branch is followed. As the leaf node is reached, the decision is made about the class to which
the example belongs.
k-Nearest Neighbors
k-Nearest Neighbors (kNN) is a non-parametric learning algorithm. Contrary to other
learning algorithms that allow discarding the training data after the model is built, kNN
keeps all training examples in memory. Once a new, previously unseen example x comes in,
the kNN algorithm finds k training examples closest to x and returns the majority label (in
case of classification) or the average label (in case of regression).
Building Blocks of a Learning Algorithm
1) a loss function;
2) an optimization criterion based on the loss function (a cost function, for example); and
3) an optimization routine that leverages training data to find a solution to the optimization
criterion.
Gradient descent
Gradient descent is an iterative optimization algorithm for finding the minimum of a function.
To find a local minimum of a function using gradient descent, one starts at some random
point and takes steps proportional to the negative of the gradient (or approximate gradient)
of the function at the current point.
Stochastic gradient descent
Stochastic gradient descent (SGD) is a version of the algorithm that speeds up the
computation by approximating the gradient using smaller batches (subsets) of the training
data. SGD itself has various “upgrades”. Adagrad is a version of SGD that scales – for
each parameter according to the history of gradients. As a result, – is reduced for very large
gradients and vice-versa. Momentum is a method that helps accelerate SGD by orienting
the gradient descent in the relevant direction and reducing oscillations. In neural network
training, variants of SGD such as RMSprop and Adam, are most frequently used.
Feature Engineering
The problem of transforming raw data into a dataset is called feature engineering. For
most practical problems, feature engineering is a labor-intensive process that demands from
the data analyst a lot of creativity and, preferably, domain knowledge.
Bias and variability
Google it punk!
One-Hot Encoding
Some learning algorithms only work with numerical feature vectors. When some feature in
your dataset is categorical, like “colors” or “days of the week,” you can transform such a
categorical feature into several binary ones.
Binning
is when you have a numerical
feature but you want to convert it into a categorical one. Binning (also called bucketing)
is the process of converting a continuous feature into multiple binary features called bins or
buckets, typically based on value range.
In some cases, a carefully designed binning can help the learning algorithm to learn using
fewer examples. It happens because we give a “hint” to the learning algorithm that if the
value of a feature falls within a specific range, the exact value of the feature doesn’t matter.
Normalization
Normalization is the process of converting an actual range of values which a numerical
feature can take, into a standard range of values, typically in the interval [≠1, 1] or [0, 1].
For example, suppose the natural range of a particular feature is 350 to 1450. By subtracting
350 from every value of the feature, and dividing the result by 1100, one can normalize those
values into the range [0, 1].
Why do we normalize?
Why do we normalize? Normalizing the data is not a strict requirement. However, in practice,
it can lead to an increased speed of learning. Remember the gradient descent example from
the previous chapter. Imagine you have a two-dimensional feature vector. When you update
the parameters of w(1) and w(2), you use partial derivatives of the average squared error with
respect to w(1) and w(2). If x(1) is in the range [0, 1000] and x(2) the range [0, 0.0001], then
the derivative with respect to a larger feature will dominate the update.
Standardization
Standardization (or z-score normalization) is the procedure during which the feature
values are rescaled so that they have the properties of a standard normal distribution with
μ = 0 and ‡ = 1, where μ is the mean (the average value of the feature, averaged over all
examples in the dataset) and ‡ is the standard deviation from the mean.
You may ask when you should use normalization and when standardization.
Usually, if your dataset is not too big and you have time,
you can try both and see which one performs better for your task.
- unsupervised learning algorithms, in practice, more often benefit from standardization
than from normalization; - standardization is also preferred for a feature if the values this feature takes are
distributed close to a normal distribution (so-called bell curve); - again, standardization is preferred for a feature if it can sometimes have extremely high
or low values (outliers); this is because normalization will “squeeze” the normal values
into a very small range; - in all other cases, normalization is preferable.
Dealing with Missing Features
- Removing the examples with missing features from the dataset. That can be done if
your dataset is big enough so you can sacrifice some training examples. - Using a learning algorithm that can deal with missing feature values (depends on the
library and a specific implementation of the algorithm). - Using a data imputation technique.
Data Imputation Techniques
One technique consists in replacing the missing value of a feature by an average value of this
feature in the dataset:
Another technique is to replace the missing value by the same value outside the normal range
of values. For example, if the normal range is [0, 1], then you can set the missing value equal
to 2 or ≠1. The idea is that the learning algorithm will learn what is it better to do when the
feature has a value significantly different from other values.
Alternatively, you can replace the
missing value by a value in the middle of the range. For example, if the range for a feature is
[≠1, 1], you can set the missing value to be equal to 0. Here, the idea is that if we use the
value in the middle of the range to replace missing features, such value will not significantly
affect the prediction.
A more advanced technique is to use the missing value as the target variable for a regression
problem. You can use all remaining features [x(1)
i , x(2)
i ,…,x(j≠1)
i , x(j+1)
i ,…,x(D)
i ] to form
a feature vector xˆi, set yˆi = x(j)
, where j is the feature with a missing value. Now we can
build a regression model to predict yˆ from the feature vectors xˆ. Of course, to build training
examples (xˆ, yˆ), you only use those examples from the original dataset, in which the value of
feature j is present.
Finally, if you have a significantly large dataset and just a few features with missing values,
you can increase the dimensionality of your feature vectors by adding a binary indicator
feature for each feature with missing values. Let’s say feature j = 12 in your D-dimensional
dataset has missing values. For each feature vector x, you then add the feature j = D + 1
which is equal to 1 if the value of feature 12 is present in x and 0 otherwise. The missing
feature value then can be replaced by 0 or any number of your choice.
Learning Algorithm Selection
Explainability
In-memory vs. out-of-memory
Number of features and examples
Categorical vs. numerical features
Nonlinearity of the data
Training speed
Prediction speed
Three Sets wehn working with data
1) training set,
2) validation set, and
3) test set.
Underfit - high bias
- your model is too simple for the data (for example a linear model can often underfit);
- the features you engineered are not informative enough.