Machine learning engineering (book) Flashcards

1
Q

Reinforcement Learning

A

Reinforcement learning is a subfield of machine learning where the machine (called an
agent) “lives” in an environment and is capable of perceiving the state of that environment
as a vector of features. The machine can execute actions in non-terminal states. Different
actions bring different rewards and could also move the machine to another state of the
environment. A common goal of a reinforcement learning algorithm is to learn an optimal
policy.

An optimal policy is a function (similar to the model in supervised learning) that takes the
feature vector of a state as input and outputs an optimal action to execute in that state. The
action is optimal if it maximizes the expected average long-term reward.

2
Q

Tidy data

A

Tidy data can be seen as a spreadsheet, in which each row represents one
example, and columns represent various attributes of an example, as shown in Figure 3.
Sometimes raw data can be tidy, e.g., provided to you in the form of a spreadsheet. However,
in practice, to obtain tidy data from raw data, data analysts often resort to the procedure
called feature engineering, which is applied to the direct and, optionally, indirect data
with the goal of transforming each raw example into a feature vector x. Chapter 4 is devoted
entirely to feature engineering.

3
Q

training, validation, and test datasets

A

training, validation, and test. The training set is usually the biggest one;
the learning algorithm uses the training set to produce the model. The validation and test
sets are roughly the same size, much smaller than the size of the training set. The learning
algorithm is not allowed to use examples from the validation or test sets to train the model.
That is why those two sets are also called holdout sets.

4
Q

Baseline

A

In machine learning, a baseline is a simple algorithm for solving a problem, usually based
on a heuristic, simple summary statistics, randomization, or very basic machine learning
algorithm. For example, if your problem is classification, you can pick a baseline classifier
and measure its performance. This baseline performance then becomes the reference against which you compare any future model (usually built using a more sophisticated approach).

5
Q

Machine Learning Pipeline

A

A machine learning pipeline is a sequence of operations on the dataset that goes from its
initial state to the model.

A pipeline can include, among others, such stages as data partitioning, missing data imputation, feature extraction, data augmentation, class imbalance reduction, dimensionality reduction, and model training.
In practice, when we deploy a model in production, we usually deploy an entire pipeline.
Furthermore, an entire pipeline is usually optimized when hyperparameters are tuned.

6
Q

Hyperparameters

A

Hyperparameters are inputs of machine learning algorithms or pipelines that influence
the performance of the model. They don’t belong to the training data and cannot be
learned from it. For example, the maximum depth of the tree in the decision tree learning
algorithm, the misclassification penalty in support vector machines, k in the k-nearest
neighbors algorithm, the target dimensionality in dimensionality reduction, and the choice of
the missing data imputation technique are all examples of hyperparameters.

7
Q

Parameters,

A

Parameters, on the other hand, are variables that define the model trained by the learning
algorithm. Parameters are directly modified by the learning algorithm based on the training
data. The goal of learning is to find such values of parameters that make the model optimal
in a certain sense. Examples of parameters are w and b in the equation of linear regression
y = wx + b. In this equation, x is the input of the model, and y is its output (the prediction).

8
Q

Model-Based

A

Most supervised learning algorithms are model-based. A typical model is a support
vector machine (SVM). Model-based learning algorithms use the training data to create a
model with parameters learned from the training data. In SVM, the two parameters are w
(a vector) and b (a real number). After the model is trained, it can be saved on disk while
the training data can be discarded.

9
Q

Instance-based learning algorithms

A

Instance-based learning algorithms use the whole dataset as the model. One instance-based algorithm frequently used in practice is k-Nearest Neighbors (kNN). In classification, to predict a label for an input example, the kNN algorithm looks at the close neighborhood of the input example in the space of feature vectors and outputs the label that it saw most often in this close neighborhood.

10
Q

Shallow vs. Deep Learning

A

A shallow learning algorithm learns the parameters of the model directly from the features of the training examples. Deep learning, in contrast, trains neural networks with multiple layers between input and output; there, most model parameters are learned not directly from the features of the training examples but from the outputs of the preceding layers.

11
Q

Training vs. Scoring

A

Training is when a learning algorithm uses the training data to produce a model. When we apply a trained model to an input example (or, sometimes, a sequence of examples) in order to obtain a prediction (or predictions), or to somehow transform an input, we talk about scoring.

12
Q

When to Use Machine Learning

A

When the Problem Is Too Complex for Coding
When the Problem Is Constantly Changing
When It Is a Perceptive Problem - image recognition…
When It Is an Unstudied Phenomenon
When the Problem Has a Simple Objective
When It Is Cost-Effective

13
Q

When Not to Use Machine Learning

A
  • every action of the system or a decision made by it must be explainable,
  • every change in the system’s behavior compared to its past behavior in a similar
    situation must be explainable,
  • the cost of an error made by the system is too high,
  • you want to get to the market as fast as possible,
  • getting the right data is too hard or impossible,
  • you can solve the problem using traditional software development at a lower cost,
  • a simple heuristic would work reasonably well,
  • the phenomenon has too many outcomes while you cannot get a sufficient amount of
    examples to represent them (like in video games or word processing software),
  • you build a system that will not have to be improved frequently over time,
  • you can manually fill an exhaustive lookup table by providing the expected output
    for any input (that is, the number of possible input values is not too large, or getting
    outputs is fast and cheap).
14
Q

Machine learning engineering

A

Machine learning engineering (MLE) is the use of scientific principles, tools, and techniques of
machine learning and traditional software engineering to design and build complex computing
systems. MLE encompasses all stages from data collection, to model training, to making the
model available for use by the product or the customers.
In other words, MLE includes any activity that lets machine learning algorithms be implemented as a part of an effective production system.

15
Q

Three factors highly influence the cost of a machine learning project

A
  • the difficulty of the problem,
  • the cost of data, and
  • the need for accuracy.
16
Q

Defining the Goal of a Machine Learning Project

A

The goal of a machine learning project is to build a model that solves, or helps solve, a
business problem. Within a project, the model is often seen as a black box described by the
structure of its input (or inputs) and output (or outputs), and the minimum acceptable level
of performance (as measured by accuracy of prediction or another performance metric).

17
Q

What a Model Can Do

A
  • automate (for example, by taking action on the user’s behalf or by starting or stopping
    a specific activity on a server),
  • alert or prompt (for example, by asking the user if an action should be taken or by
    asking a system administrator if the traffic seems suspicious),
  • organize, by presenting a set of items in an order that might be useful for a user (for
    example, by sorting pictures or documents in the order of similarity to a query or
    according to the user’s preferences),
  • annotate (for instance, by adding contextual annotations to displayed information, or
    by highlighting, in a text, phrases relevant to the user’s task),
  • extract (for example, by detecting smaller pieces of relevant information in a larger
    input, such as named entities in the text: proper names, companies, or locations),
  • recommend (for example, by detecting and showing to a user highly relevant items in a
    large collection based on item’s content or user’s reaction to the past recommendations),
  • classify (for example, by dispatching input examples into one, or several, of a predefined
    set of distinctly-named groups),
  • quantify (for example, by assigning a number, such as a price, to an object, such
    as a house),
  • synthesize (for example, by generating new text, image, sound, or another object similar
    to the objects in a collection),
  • answer an explicit question (for example, “Does this text describe that image?” or
    “Are these two images similar?”),
  • transform its input (for example, by reducing its dimensionality for visualization
    purposes, paraphrasing a long text as a short abstract, translating a sentence into
    another language, or augmenting an image by applying a filter to it),
  • detect a novelty or an anomaly.
18
Q

Properties of a Successful Model

A
  • it respects the input and output specifications and the performance requirement,
  • it benefits the organization (measured via cost reduction, increased sales or profit),
  • it helps the user (measured via productivity, engagement, and sentiment),
  • it is scientifically rigorous.
19
Q

Structuring a Machine Learning Team

A

Two Cultures

One culture says that a machine learning team has to be composed of data analysts who
collaborate closely with software engineers. In such a culture, a software engineer doesn’t
need to have deep expertise in machine learning, but has to understand the vocabulary of
their fellow data analysts.

According to the other culture, all engineers in a machine learning team must have a combination
of machine learning and software engineering skills.

20
Q

Data engineers

A

Data engineers are software engineers responsible for ETL (Extract, Transform, Load). These three conceptual steps are part of a typical data pipeline. Data engineers use ETL techniques and create an automated pipeline, in which raw data is transformed into analysis-ready data. Data engineers design how to structure the data and how to integrate it from various sources. They write on-demand queries on that data, or wrap the most frequent queries into fast application programming interfaces (APIs) to make sure that the data is easily accessible by analysts and other data consumers. Typically, data engineers are not expected to know any machine learning.

21
Q

labeler

A

A labeler is a person responsible for assigning labels to unlabeled examples. Again, in big
companies, data labeling experts may be organized in two or three different teams: one or
two teams of labelers (for example, one local and one outsourced) and a team of software
engineers, plus a user experience (UX) specialist, responsible for building labeling tools.

22
Q

machine learning projects can fail for many reasons

A
  • lack of experienced talent,
  • lack of support by the leadership,
  • missing data infrastructure,
  • data labeling challenge,
  • siloed organizations and lack of collaboration,
  • technically infeasible projects, and
  • lack of alignment between technical and business teams.
23
Q

Is the Data Sizeable?

A

Check whether the number of examples is big enough. Some rules of thumb:
* 10 times the number of features (this often exaggerates the size of the training set, but works well as an upper bound),
* 100 or 1000 times the number of classes (this often underestimates the size), or
* 10 times the number of trainable parameters (usually applied to neural networks).

Keep in mind that just because you have big data does not mean that you should use all of
it. A smaller sample of big data can give good results in practice and accelerate the search
for a better model. It’s important to ensure, though, that the sample is representative of the
whole big dataset. Sampling strategies such as stratified and systematic sampling can
lead to better results. We consider data sampling strategies in Section 3.10.

24
Q

data leakage (also known as target leakage)

A

What happened is called data leakage (also known as target leakage). After a more
careful examination of the dataset, you realize that one of the columns in the spreadsheet
contained the real estate agent’s commission. Of course, the model easily learned to convert
this attribute into the house price perfectly. However, this information is not available in the
production environment before the house is sold, because the commission depends on the
selling price. In Section 3.2.8, we will consider the problem of data leakage in more detail.

25
Q

Common Problems With Data

A

High Cost - Getting unlabeled data can be expensive; however, labeling data is the most expensive work,
especially if the work is done manually.

26
Q

Bias? And its types?

A

Bias in data is an inconsistency with the phenomenon that data represents. This inconsistency
may occur for a number of reasons (which are not mutually exclusive).

  • Selection bias
  • Self-selection bias
  • Omitted variable bias
  • Sponsorship or funding bias
  • Sampling bias
  • Prejudice or stereotype bias
  • Systematic value distortion
  • Experimenter bias
  • Labeling bias
27
Q

Selection bias

A

Selection bias is the tendency to skew your choice of data sources to those that are easily
available, convenient, and/or cost-effective. For example, you might want to know the opinion
of the readers on your new book.

28
Q

Self-selection bias

A

Self-selection bias is a form of selection bias where you get the data from sources that
“volunteered” to provide it. Most poll data has this type of bias. For example, you want to
train a model that predicts the behavior of successful entrepreneurs. You decide to first ask
entrepreneurs whether they are successful or not. Then you only keep the data obtained
from those who declared themselves successful. The problem here is that, most likely, really successful entrepreneurs don’t have time to answer your questions, while those who declare themselves successful may be wrong about it.

29
Q

Omitted variable bias

A

Omitted variable bias happens when your featurized data doesn’t have a feature necessary
for accurate prediction. For example, let’s assume that you are working on a churn prediction
model and you want to predict whether a customer cancels their subscription within six
months. You train a model, and it’s accurate enough; however, several weeks after deployment
you see many unexpected false negatives. You investigate the decreased model performance
and discover a new competitor now offers a very similar service for a lower price. This
feature wasn’t initially available to your model, therefore important information for accurate
prediction was missing.

30
Q

Sponsorship or funding bias

A

Sponsorship or funding bias affects the data produced by a sponsored agency. For example, suppose a famous video game company sponsors a news agency to report on the video game industry. If you try to make a prediction about the video game industry, you might include in your data the stories produced by this sponsored agency.

31
Q

Sampling bias (also known as distribution shift)

A

occurs when the distribution of examples
used for training doesn’t reflect the distribution of the inputs the model will receive in
production. This type of bias is frequently observed in practice. For example, you are working
on a system that classifies documents according to a taxonomy of several hundred topics.
You might decide to create a collection of documents in which an equal amount of documents
represents each topic. Once you finish the work on the model, you observe 5% error. Soon after
deployment, you see the wrong assignment to about 30% of documents. Why did this happen?
One of the possible reasons is sampling bias: one or two frequent topics in production data
might account for 80% of all input. If your model doesn’t perform well for these frequent
topics, then your system will make more errors in production than you initially expected.

32
Q

Prejudice or stereotype bias

A

Prejudice or stereotype bias is often observed in data obtained from historical sources, such as books or photo archives, or from online activity such as social media, online forums,
and comments to online publications. Using a photo archive to train a model that distinguishes men from women might show, for
example, men more frequently in work or outdoor contexts, and women more often at home
indoors. If we use such biased data, our model will have more difficulty recognizing a woman
outdoors or a man at home.

33
Q

Systematic value distortion

A

Systematic value distortion is bias usually occurring with the device making measurements
or observations. This results in a machine learning model making suboptimal predictions
when deployed in the production environment.

34
Q

Experimenter bias

A

Experimenter bias is the tendency to search for, interpret, favor, or recall information in a
way that affirms one’s prior beliefs or hypotheses. Applied to machine learning, experimenter
bias often occurs when each example in the dataset is obtained from the answers to a survey
given by a particular person, one example per person.

35
Q

Labeling bias

A

Labeling bias happens when labels are assigned to unlabeled examples by a biased process
or person.

36
Q

Low predictive power

A

Low predictive power is an issue that you often don’t consider until you have spent
fruitless energy trying to train a good model. Does the model underperform because it is not
expressive enough? Does the data not contain enough information from which to learn? You
don’t know.

37
Q

concept drift.

A

Concept drift is a fundamental
change in the statistical relationship between the features and the label.

38
Q

Outliers

A

Outliers are examples that look dissimilar to the majority of examples from the dataset. It’s
up to the data analyst to define “dissimilar.” Typically, dissimilarity is measured by some
distance metric, such as Euclidean distance.

39
Q

Data Leakage

A

Data leakage, also called target leakage, is a problem affecting several stages of the
machine learning life cycle, from data collection to model evaluation. In this section, I will
only describe how this problem manifests itself at the data collection and preparation stages.
In the subsequent chapters, I will describe its other forms.

40
Q

Summary of Good Data

A

For the convenience of future reference, let me once again repeat the properties of good data:
* it contains enough information that can be used for modeling,
* it has good coverage of what you want to do with the model,
* it reflects real inputs that the model will see in production,
* it is as unbiased as possible,
* it is not a result of the model itself,
* it has consistent labels, and
* it is big enough to allow generalization.

41
Q

Dealing With Interaction Data

A

Interaction data is the data you can collect from user interactions with the system your
model supports. You are considered lucky if you can gather good data from interactions of
the user with the system.
Good interaction data contains information on three aspects:
* context of interaction,
* action of the user in that context, and
* outcome of interaction.

As an example, assume that you build a search engine, and your model reranks search results
for each user individually. A reranking model takes as input the list of links returned by the
search engine, based on keywords provided by the user, and outputs another list in which the items change order. Usually, a reranking model “knows” something about the user and their
preferences and can reorder the generic search results for each user individually according
to that user’s learned preferences. The context here is the search query and the hundred
documents presented to the user in a specific order. The action is a click of the user on
a particular document link. The outcome is how much time the user spent reading the
document and whether the user hit “back.” Another action is the click on the “next page”
link.

42
Q

three most frequent causes of data leakage that can happen during data
collection and preparation:

A

Data leakage is when information from outside the training dataset is used to create the model. The three most frequent causes during data collection and preparation are: 1) the target being a function of a feature, 2) a feature hiding the target, and 3) a feature coming from the future.

43
Q

Data leakage - Target is a Function of a Feature

A

If you don’t do a careful analysis of each attribute and
its relation to GDP, you might let a leakage happen: in the data in Figure 9, two columns,
Population and GDP per capita, multiplied, equal GDP. The model you will train will
perfectly predict GDP by looking at these two columns only. The fact that you let GDP be
one of the features, though in a slightly modified form (divided by the population), constitutes
contamination and, therefore, leads to data leakage.

44
Q

Data leakage - Feature Hides the Target

A

If the data about
a customer’s gender and age is factual (as opposed to being guessed by another model that
might be available in production), then the column Group constitutes a form of data leakage,
when the value you want to predict is “hidden” in the value of a feature.

45
Q

Data leakage - Feature From the Future

A

Here is another example. Let’s say you have a news website and you want to predict the
ranking of news you serve to the user, so as to maximize the number of clicks on stories. If
in your training data, you have positional features for each news item served in the past (e.g.,
the x − y position of the title, and the abstract block on the webpage), such information will
not be available at serving time, because you don’t know the positions of articles on the
page before you rank them.

46
Q

Data Partitioning

A

The training set is used by the machine learning algorithm to train the model.

The validation set is needed to find the best values for the hyperparameters of the machine
learning pipeline. The analyst tries different combinations of hyperparameter values one by
one, trains a model by using each combination, and notes the model performance on the
validation set. The hyperparameters that maximize the model performance are then used to
train the model for production. We consider techniques of hyperparameter tuning in more
detail in Section ?? of Chapter 5.

The test set is used for reporting: once you have your best model, you test its performance
on the test set and report the results.

47
Q

To obtain good partitions of your entire dataset into these three disjoint sets (test, val, train) partitioning has to satisfy several conditions.

A

Condition 1: Split was applied to raw data.
Once you have access to raw examples, and before everything else, do the split. This
will allow avoiding data leakage, as we will see later.

Condition 2: Data was randomized before the split.
Randomly shuffle your examples first, then do the split.

Condition 3: Validation and test sets follow the same distribution.
When you select the best values of hyperparameters using the validation set, you want
that this selection yields a model that works well in production. The examples in the
test set are your best representatives of the production data. Hence the need for the
validation and test sets to follow the same distribution.

Condition 4: Leakage during the split was avoided.
Data leakage can happen even during the data partitioning. Below, we will see what
forms of leakage can happen at that stage.

48
Q

Ratio of data partitioning

A

There is no ideal ratio for the split. In older literature (pre-big data), you might find the
recommended splits of either 70%/15%/15% or 80%/10%/10% (for training, validation, and
test sets, respectively, in proportion to the entire dataset).
Today, in the era of the Internet and cheap labor (e.g., Mechanical Turk or crowdsourcing),
organizations, scientists, and even enthusiasts at home can get access to millions of training
examples. That makes it wasteful only to use 70% or 80% of the available data for training.

A small dataset of less than a thousand examples would do best with 90% of the data used for training. In this case, you might decide to not have a distinct validation set, and instead simulate one with the cross-validation technique.

49
Q

Leakage During Partitioning

A

Group leakage may occur during partitioning. Imagine you have magnetic resonance images
of the brains of multiple patients. Each image is labeled with a certain brain disease, and the
same patient may be represented by several images taken at different times. If you apply the
partitioning technique discussed above (shuffle, then split), images of the same patient might
appear in both the training and holdout data.

50
Q

Dealing with Missing Attributes

A
  • removing the examples with missing attributes from the dataset (this can be done if
    your dataset is big enough to safely sacrifice some data);
  • using a learning algorithm that can deal with missing attribute values (such as the
    decision tree learning algorithm);
  • using a data imputation technique.
51
Q

Data Imputation Techniques

A

-To impute the value of a missing numerical attribute, one technique consists of replacing the
missing value by the average value of this attribute in the rest of the dataset.
-Another technique is to replace the missing value with a value outside the normal range of
values. For example, if the regular range is [0, 1], you can set the missing value to 2 or −1; if
the attribute is categorical, such as days of the week, then a missing value can be replaced
by the value “Unknown.” Here, the learning algorithm learns what to do when the attribute
has a value different from regular values.
-A more advanced technique is to use the missing value as the target variable for a regression
problem.
-Finally, if you have a significantly large dataset and just a few attributes with missing values,
you can add a synthetic binary indicator attribute for each original attribute with missing
values. Let’s say that examples in your dataset are D-dimensional, and attribute at position
j = 12 has missing values. For each example x, you then add the attribute at position
j = D + 1, which is equal to 1 if the value of the attribute at position 12 is present in x and
0 otherwise. The missing value then can be replaced by 0 or any value of your choice.
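
A minimal sketch of the first, second, and fourth techniques using pandas and numpy; the dataset and column names are made up for illustration.

import numpy as np
import pandas as pd

# Toy dataset with a missing numerical value (illustrative only)
df = pd.DataFrame({"height": [1.7, 1.6, np.nan, 1.8]})

# 1) Replace the missing value with the average of the attribute
df["height_mean_imputed"] = df["height"].fillna(df["height"].mean())

# 2) Replace the missing value with a value outside the normal range
df["height_out_of_range"] = df["height"].fillna(-1.0)

# 4) Add a binary indicator attribute, then fill the missing value with 0
df["height_present"] = df["height"].notna().astype(int)
df["height_zero_filled"] = df["height"].fillna(0.0)

print(df)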

52
Q

Leakage During Imputation

A

If you use the imputation techniques that compute some statistic of one attribute (such as
average) or several attributes (by solving the regression problem), the leakage happens if you
use the whole dataset to compute this statistic. Using all available examples, you contaminate
the training data with information obtained from the validation and test examples.
This type of leakage is not as significant as other types discussed earlier. However, you still
have to be aware of it and avoid it by partitioning first, and then computing the imputation
statistic only on the training set.
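
A short sketch of the leakage-free order of operations: partition first, compute the imputation statistic on the training split only, then apply that same value to the holdout split (variable names and the toy data are illustrative).

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, 31, np.nan, 44, 52, np.nan, 38, 29]})

# Partition first (here a simple 75%/25% split after shuffling)
shuffled = df.sample(frac=1.0, random_state=42)
train, holdout = shuffled.iloc[:6], shuffled.iloc[6:]

# Compute the statistic on the training split only...
train_mean = train["age"].mean()

# ...and use that same value for both splits
train = train.assign(age=train["age"].fillna(train_mean))
holdout = holdout.assign(age=holdout["age"].fillna(train_mean))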

53
Q

Data Augmentation for Images

A

In Figure 14, you can see examples of operations that can be easily applied to a given image
to obtain one or more new images: flip, rotation, crop, color shift, noise addition, perspective
change, contrast change, and information loss.
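
A minimal NumPy sketch of a few of these operations (flip, 90-degree rotation, noise addition, crop) applied to an image represented as an array; real pipelines usually rely on a library such as torchvision or albumentations.

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a real image, values in [0, 1]

flipped = image[:, ::-1, :]                            # horizontal flip
rotated = np.rot90(image, k=1, axes=(0, 1))            # 90-degree rotation
noisy = np.clip(image + rng.normal(0.0, 0.05, image.shape), 0.0, 1.0)  # noise addition
cropped = image[4:28, 4:28, :]                         # crop the central region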

54
Q

data augmentation that seems counterintuitive, but works very
well in practice, is mixup.

A

As the name suggests, the technique consists of training the
model on a mix of the images from the training set. More precisely, instead of training the
model on the raw images, we take two images (that could be of the same class or not) and
use for training their linear combination:

mixup_image = t × image1 + (1 − t) × image2,

where t is a real number between 0 and 1. The target of that mixup image is a combination
of the original targets obtained using the same value of t:

mixup_target = t × target1 + (1 − t) × target2
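
A small NumPy sketch of mixup for one pair of examples, assuming one-hot targets; t is typically drawn from a Beta distribution.

import numpy as np

rng = np.random.default_rng(0)

def mixup(image1, target1, image2, target2, alpha=0.2):
    """Return the linear combination of two images and of their targets."""
    t = rng.beta(alpha, alpha)                 # mixing coefficient in (0, 1)
    mixup_image = t * image1 + (1 - t) * image2
    mixup_target = t * target1 + (1 - t) * target2
    return mixup_image, mixup_target

# Example with one-hot targets for a 3-class problem
img_a, img_b = rng.random((32, 32, 3)), rng.random((32, 32, 3))
y_a, y_b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
mixed_img, mixed_y = mixup(img_a, y_a, img_b, y_b)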

55
Q

Data Augmentation for Text

A

When it comes to text data augmentations, it is not as straightforward. We need to use
appropriate transformation techniques to preserve the contextual and grammatical structure
of natural language texts.
-One technique involves replacing random words in a sentence with their close synonyms.
For the sentence, “The car stopped near a shopping mall.” some equivalent sentences are:
“The automobile stopped near a shopping mall.”
“The car stopped near a shopping center.”
“The auto stopped near a mall.”

-A similar technique uses hypernyms instead of synonyms. A hypernym is a word that
has more general meaning. For example, “mammal” is a hypernym for “whale” and “cat”;

“vehicle” is a hypernym for “car” and “bus.” From our example above, we could create the fol-
lowing sentences:

“The vehicle stopped near a shopping mall.”
“The car stopped near a building.”

-A modern alternative to the k-nearest-neighbors approach described above is to use a deep
pre-trained model such as Bidirectional Encoder Representations from Transformers (BERT).
Models like BERT are trained to predict a masked word given other words in a sentence.
One can use BERT to generate k most likely predictions for a masked word and then use
them as synonyms for data augmentation.

-Another useful text data augmentation technique is back translation. To create a new
example from a text written in English (it can be a sentence or a document), first translate
it into another language l using a machine translation system. Then translate it back from l
into English. If the text obtained through back translation is different from the original text,
you add it to the dataset by assigning the same label as the original text.

56
Q

Class imbalance

A

Class imbalance is a condition in the data that can significantly affect the performance of
the model, independently of the chosen learning algorithm. The problem is a very uneven
distribution of labels in the training data.
Typically, a machine learning algorithm tries to classify most training examples
correctly. The algorithm is pushed to do so because it needs to minimize a cost function
that typically assigns a positive loss value to each misclassified example. If the loss is the
same for the misclassification of a minority class example as it is for the misclassification of a
majority class, then it’s very likely that the learning algorithm decides to “give up” on many
minority class examples in order to make fewer mistakes in the majority class.

57
Q

Oversampling

A

A technique used frequently to mitigate class imbalance is oversampling. By making multiple copies of minority class examples, it increases their weight, as illustrated in Figure 15a.

You might also create synthetic examples by sampling feature values of several examples of the minority class and combining them to obtain a new example of that class. Two popular algorithms that oversample the minority class by creating synthetic examples are Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic Sampling Method (ADASYN).
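
A short sketch assuming the imbalanced-learn package is installed; it oversamples the minority class of a synthetic imbalanced dataset with SMOTE.

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic dataset: roughly 90% majority class, 10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))            # e.g., about 900 examples of class 0, 100 of class 1

# Oversample the minority class with synthetic examples
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_resampled))  # the classes are now balanced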

58
Q

Undersampling

A

The undersampling can be done randomly; that is, the examples to remove from the majority
class can be chosen at random. Alternatively, examples to withdraw from the majority class
can be selected based on some property.

59
Q

Cluster-based undersampling

A

Cluster-based undersampling works as follows. Decide on the number of examples you want to have in the majority class after undersampling. Let that number be k. Run a centroid-based clustering algorithm on the majority class examples only, with k being the desired number of clusters. Then replace all examples of the majority class with the k centroids. An example of a centroid-based clustering algorithm is k-means.
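
A minimal sketch of cluster-based undersampling with scikit-learn's k-means: the majority class examples are replaced by k centroids (variable names and the toy data are illustrative).

import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample_majority(X_majority, k, random_state=42):
    """Replace the majority class examples with the centroids of k clusters."""
    kmeans = KMeans(n_clusters=k, random_state=random_state, n_init=10)
    kmeans.fit(X_majority)
    return kmeans.cluster_centers_

# Example: shrink 500 majority class examples down to 50 representatives
rng = np.random.default_rng(0)
X_majority = rng.normal(size=(500, 4))
X_majority_reduced = cluster_undersample_majority(X_majority, k=50)
print(X_majority_reduced.shape)  # (50, 4)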

60
Q

Hybrid Strategies for sampling

A

You can develop your hybrid strategies (by combining both over- and undersampling) and
possibly get better results. One such strategy consists of using ADASYN to oversample, and
then Tomek links to undersample.

61
Q

two main data sampling strategies

A

There are two main strategies: probability sampling and nonprobability sampling. In
probability sampling, all examples have a chance to be selected. These techniques involve
randomness.

Nonprobability sampling is not random. To build a sample, it follows a fixed deterministic
sequence of heuristic actions. This means that some examples don’t have a chance of being
selected, no matter how many samples you build.

The main drawback of nonprobability sampling
methods is that they include non-representative samples and might systematically exclude
important examples. These drawbacks outweigh the possible advantages of nonprobability
sampling methods. Therefore, in this book I will only present probability sampling methods.

62
Q

Simple random sampling

A

Simple random sampling is the most straightforward method, and the one I refer to when
I say “sample randomly.” Here, each example from the entire dataset is chosen purely by
chance; each example has an equal chance of being selected.

63
Q

systematic sampling

A

To implement systematic sampling (also known as interval sampling), you create a list
containing all examples. From that list, you randomly select the first example xstart from
the first k elements on the list. Then, you select every k-th item on the list starting from xstart. You choose such a value of k that will give you a sample of the desired size.
An advantage of the systematic sampling over the simple random sampling is that it draws
examples from the whole range of values. However, systematic sampling is inappropriate if
the list of examples has periodicity or repetitive patterns. In the latter case, the obtained
sample can exhibit a bias. However, if the list of examples is randomized, then systematic
sampling often results in a better sample than simple random sampling.
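
A small sketch of systematic sampling: pick a random starting position among the first k elements, then take every k-th element.

import random

def systematic_sample(examples, sample_size):
    """Return roughly sample_size examples taken at a fixed interval k."""
    k = max(1, len(examples) // sample_size)   # interval that yields the desired size
    start = random.randrange(k)                # random start within the first k elements
    return examples[start::k]

population = list(range(100))                  # examples 0..99
sample = systematic_sample(population, sample_size=10)
print(sample)                                  # e.g., [7, 17, 27, ..., 97]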

64
Q

Stratified Sampling

A

In stratified
sampling, you first divide your dataset into groups (called strata) and then randomly select
examples from each stratum, like in simple random sampling. The number of examples to
select from each stratum is proportional to the size of the stratum.
Stratified sampling often improves the representativeness of the sample by reducing its bias;
in the worst of cases, the resulting sample is of no less quality than the results of simple
random sampling. However, to define strata, the analyst has to understand the properties of
the dataset. Furthermore, it can be difficult to decide which attributes will define the strata.
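
A minimal pandas sketch: the strata are defined by a categorical column, and a fixed fraction is drawn at random from each stratum, so the number of examples selected stays proportional to the stratum size (column names and the toy data are illustrative).

import pandas as pd

df = pd.DataFrame({
    "topic": ["sports"] * 60 + ["politics"] * 30 + ["finance"] * 10,
    "value": range(100),
})

# Sample 20% from each stratum; counts remain proportional to stratum sizes
stratified = df.groupby("topic").sample(frac=0.2, random_state=42)
print(stratified["topic"].value_counts())  # sports 12, politics 6, finance 2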

65
Q

what is data serialization

A

Data serialization is the process of converting complex data structures, such as objects or data collections, into a format that can be easily stored, transmitted, or reconstructed. The serialized data can later be deserialized, which means it’s converted back into its original form, allowing it to be used in the same way as before serialization. Serialization is commonly used in various scenarios, such as when saving data to files, sending data over networks, or storing data in databases.

Serialization is important because it enables data to be transported or stored in a standardized format that can be understood by different systems or programming languages. It also helps preserve the structure and relationships within the data. Different serialization formats exist, each with its own characteristics and use cases.
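
A tiny illustration with Python's standard library: the same dictionary serialized to JSON (a language-independent text format) and with pickle (a Python-specific binary format), then deserialized back.

import json
import pickle

example = {"height": 1.75, "labels": ["tall", "adult"]}

# JSON: text format readable by other systems and languages
json_text = json.dumps(example)
restored_from_json = json.loads(json_text)

# pickle: Python-specific binary format
pickle_bytes = pickle.dumps(example)
restored_from_pickle = pickle.loads(pickle_bytes)

assert restored_from_json == restored_from_pickle == example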

66
Q

Reproducibility

A

Reproducibility should be an important concern in everything you do, including data
collection and preparation. You should avoid transforming data manually, or using powerful
tools included in text editors or command line shells, such as regular expressions, “quick and
dirty” ad hoc awk or sed commands, and piped expressions.
Usually, the data collection and transformation activities consist of multiple stages. These
include downloading data from web APIs or databases, replacing multiword expressions by
unique tokens, removing stop-words and noise, cropping and unblurring images, imputation
of missing values, and so on. Each step in this multistage process has to be implemented
as a software script, such as Python or R script with their inputs and outputs. If you are
organized like that in your work, it will allow you to keep track of all changes in the data.

67
Q

Data First, Algorithm Second

A

Remember that in the industry, contrary to academia, it’s “data first, algorithm second,” so
focus most of your effort and time on getting more data of wide variety and high quality,
instead of trying to squeeze the maximum out of a learning algorithm.
Data augmentation, when implemented well, will most likely contribute more to the quality
of the model than the search for the best hyperparameter values or model architecture.

68
Q

To obtain a good partition of your entire dataset into training, validation and test sets, the
process of partitioning has to satisfy several conditions:

A

1) data was randomized before the
split, 2) split was applied to raw data, 3) validation and test sets follow the same distribution,
and 4) leakage was avoided.

69
Q

Feature Engineering

A

Feature engineering is a process of first conceptually and then programmatically transforming
a raw example into a feature vector. It consists of conceptualizing a feature and then writing
the programming code that would transform the entire raw example, potentially with the help of some indirect data, into a feature.

70
Q

Feature Engineering for Text

A

When it comes to text, scientists and engineers often use simple feature engineering tricks.
Two such tricks are one-hot encoding and bag-of-words.
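
A small scikit-learn sketch of the bag-of-words representation: CountVectorizer builds the vocabulary and counts token occurrences per document (the sentences are made up for illustration).

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the car stopped near a shopping mall",
    "the car stopped near a shopping center",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(bow.toarray())                         # token counts per document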

71
Q

Mean encoding,

A

Mean encoding, also known as bin counting or feature calibration, is another technique.
First, the sample mean of the label is calculated using all examples where the feature has
value z. Each value z of the categorical feature is then replaced by that sample mean value.
The advantage of this technique is that the data dimensionality doesn’t increase, and by
design, the numerical value contains some information about the label.
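
A minimal pandas sketch of mean encoding for a binary label; the column names are illustrative. (As noted later in the leakage discussion, the per-value means should be computed on the training split only.)

import pandas as pd

train = pd.DataFrame({
    "city":  ["paris", "paris", "rome", "rome", "rome", "berlin"],
    "label": [1, 0, 1, 1, 0, 0],
})

# Sample mean of the label for each value z of the categorical feature
means = train.groupby("city")["label"].mean()

# Replace each categorical value by its mean-encoded number
train["city_encoded"] = train["city"].map(means)
print(train)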

72
Q

sine-cosine transformation.

A

It converts a cyclical feature into two
synthetic features. Let p denote the integer value of our cyclical feature. Replace the value p
of the cyclical feature with the following two values:

p_sin = sin(2 × π × p / max(p)),   p_cos = cos(2 × π × p / max(p))
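
A small sketch for a cyclical hour-of-day feature (p in 1..24), following the two formulas above.

import numpy as np

hours = np.arange(1, 25)          # cyclical feature p with max(p) = 24
p_sin = np.sin(2 * np.pi * hours / hours.max())
p_cos = np.cos(2 * np.pi * hours / hours.max())

# Hour 24 and hour 1 now end up close to each other in (p_sin, p_cos) space,
# which the raw integer encoding could not express.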

73
Q

Feature Hashing

A

Feature hashing, or the hashing trick, converts text data, or categorical attributes with many values, into a feature vector of arbitrary dimensionality. One-hot encoding and bag-of-words have a drawback: many unique values create high-dimensional feature vectors. Using a hash function, you first convert all values of your categorical attribute (or all tokens in your collection of documents) into a number, and then you convert this number into an index of your feature vector.
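
A minimal sketch of the hashing trick with a fixed-size vector of 5 dimensions; the choice of hash function (MD5 here) and the token list are illustrative assumptions.

import hashlib

def hashed_bag_of_words(tokens, num_dimensions=5):
    """Map tokens to indices of a fixed-size vector and count occurrences."""
    vector = [0] * num_dimensions
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % num_dimensions   # number -> feature index
        vector[index] += 1
    return vector

print(hashed_bag_of_words("love is a doing word".split()))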

74
Q

Topic Modeling

A

Topic modeling is a family of techniques that uses unlabeled data, typically in the form of
natural language text documents. The model learns to represent a document as a vector of
topics. For example, in a collection of news articles, the five major topics could be “sports,”
“politics,” “entertainment,” “finance,” and “technology”.

75
Q

Features for Time-Series

A

Time-series data is different from the traditional supervised learning data, which has a form
of unordered collections of independent observations. A time series is an ordered sequence
of observations, and each is marked with a time-related attribute, such as timestamp, date,
month-year, year, and so on.

Analysts typically use time-series data to solve two kinds of prediction problems. Given a
sequence of recent observations:
* predict something about the next observation (for example, given the stock price and the
value of stock indices for the last seven days, predict the stock price for tomorrow), or
* predict something about the phenomenon that generated that sequence (for example,
given a user’s connection log to a software system, predict whether they are likely to
cancel their subscription during the current quarter).

76
Q

Stacking Features

A

In our movie title classification problem, we first collect all the left contexts. We then apply
bag-of-words to transform each left context into a binary feature vector. Next, we collect all extractions and, using bag-of-words, transform each extraction into a binary feature vector.
Then we collect all the right contexts and apply bag-of-words to transform each right context
into a binary feature vector. Finally, we concatenate each example, joining the feature vectors
of the left context, the extraction, and the right context.

77
Q

Properties of Good Features

A

High Predictive Power
Fast Computability
Reliability
Uncorrelatedness

78
Q

Uncorrelatedness

A

Correlation of two features means their values are related. If the growth of one feature implies
the growth of the other, and the inverse is also true, then the two features are correlated.
Once the model is in production, its performance may change because the input data’s
properties may change over time. When many of your features are highly correlated, even
a minor change in the input data’s properties may result in significant changes in the
model’s behavior.
Sometimes the model was built under strict time constraints, so the developer used all
possible sources of features. With time, maintaining those sources can become costly. It’s
generally recommended to eliminate redundant or highly correlated features. Feature selection
techniques help reduce such features.

79
Q

Cutting the Long Tail

A

Typically, if a feature contains information (e.g., a non-zero value) only for a handful of
examples, such a feature could be removed from the feature vector. In bag-of-words, you
can build a graph with the distribution of token counts, and then cut off the so-called long
tail, as shown in Figure 15.

80
Q

Properties of Good Features

A

high predictive power
fast computability
reliability
uncorrelatedness

other: However, if you apply the model built on historical tweets to predict something about current tweets, the date of your production examples will always be out of the training distribution, which can result in a significant error.

81
Q

Boruta algorithm for assessing importance of features

A

Boruta is a wrapper feature selection method built around random forests. It adds “shadow” features (shuffled copies of the real features) to the dataset, trains a random forest, and iteratively compares the importance of each real feature with the best importance achieved by the shadow features; features that are consistently more important than the shadows are kept, and the rest are rejected.

82
Q

l1-regularization

A

L1 regularization (used, for example, in lasso regression) adds to the cost function a penalty proportional to the sum of the absolute values of the model parameters. It tends to drive the weights of unimportant features to exactly zero, so it can also serve as an implicit feature selection technique (in contrast to L2 regularization, which only shrinks weights toward zero).

83
Q

stop words.

A

Stop words are the words that are too generic or common for
the problem we are trying to solve. Frequent examples of stop words are articles, prepositions,
and pronouns. Dictionaries of stop words for most languages are available online.

84
Q

Feature Discretization

A

The reasons to discretize a real-valued numerical feature can be numerous. For example, some
feature selection techniques only apply to categorical features. A successful discretization
adds useful information to the learning algorithm when the training dataset is relatively small.
Numerous studies show that discretization can lead to improved predictive accuracy. It is
also simpler for a human to interpret a model’s prediction if it is based on discrete groups of
values, such as age groups or salary ranges.

85
Q

What is binning or bucketing?

A

Binning, also known as bucketing, is a popular technique that allows transforming a
numerical feature into a categorical one by replacing numerical values in a specific range by a
constant categorical value.
There are three typical approaches to binning:
* uniform binning,
* k-means-based binning, and
* quantile-based binning.
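
A compact scikit-learn sketch of all three approaches via KBinsDiscretizer, whose strategy parameter supports "uniform", "quantile", and "kmeans" binning (the ages are made-up example values).

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[18], [22], [25], [31], [38], [44], [52], [63], [70], [81]])

for strategy in ("uniform", "quantile", "kmeans"):
    binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    bins = binner.fit_transform(ages).ravel()
    print(strategy, bins)   # bin index assigned to each age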

86
Q

elbow method (PCA..)

A

Plot the explained variance (or clustering inertia) against k and choose the value of k at the “elbow,” the point where the curve bends and adding further components or clusters yields only marginal improvement.

87
Q

skipgram vs cbow

A

Both are word2vec architectures. Skip-gram uses the center word to predict the surrounding context words, while CBOW (continuous bag-of-words) uses the context words to predict the center word. Skip-gram usually works better for rare words and smaller corpora; CBOW is faster to train.

88
Q

what is fastText’s improvement over word2vec

A

fastText represents each word as a bag of character n-grams (subword units) and learns embeddings for those n-grams. This lets it build vectors for out-of-vocabulary and misspelled words and better capture morphology, which plain word2vec cannot do.

89
Q

doc2vec

A

doc2vec (paragraph vectors) extends word2vec to whole documents: in addition to word vectors, it learns a fixed-length vector for each document, trained to help predict the words appearing in that document (the PV-DM and PV-DBOW variants). The resulting document vectors can be used as features for classification or similarity search.

90
Q

choosing embedding dimensionality

A

d = ⁴√D (the fourth root of the number of unique categories D) is the rule of thumb suggested by Google. The principled way is to treat the dimensionality as a hyperparameter and tune it.

91
Q

Feature scaling

A

Feature scaling is bringing all your features to the same, or very similar, ranges of values
or distributions. Multiple experiments demonstrated that a learning algorithm applied to
scaled features might produce a better model. While there’s no guarantee that scaling
will have a positive impact on the quality of your model, it’s considered a best practice.
Scaling can also increase the training speed of deep neural networks. It also assures that no
individual feature dominates, especially in the initial iterations of gradient descent or other
iterative optimization algorithms. Finally, scaling reduces the risk of numerical overflow,
the problem that computers have when working with very small or very big numbers.

92
Q

Normalization

A

Normalization is the process of converting an actual range of values, which a numerical
feature can take, into a predefined and artificial range of values, typically in the interval
[−1, 1] or [0, 1].

93
Q

winsorization.

A

Winsorization consists of setting all outliers to a specified percentile of the data; for example,
a 90% winsorization would see all data below the 5th percentile set to the 5th percentile, and
data above the 95th percentile set to the 95th percentile.

94
Q

Standardization

A

Standardization (also known as z-score normalization) is the procedure during which the feature values are rescaled so that they have the properties of a standard normal distribution, with μ = 0 and σ = 1, where μ is the sample mean (the average value of the feature, averaged over all examples in the training data) and σ is the standard deviation from the sample mean.
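
A short NumPy sketch contrasting normalization (min-max scaling to [0, 1]) with standardization (z-scores); in practice the statistics would be computed on the training data only.

import numpy as np

feature = np.array([50.0, 60.0, 70.0, 80.0, 300.0])

# Normalization: rescale into the [0, 1] range
normalized = (feature - feature.min()) / (feature.max() - feature.min())

# Standardization: subtract the sample mean and divide by the standard deviation
standardized = (feature - feature.mean()) / feature.std()

print(normalized)
print(standardized)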

95
Q

Data Leakage in Feature Engineering

A

Now imagine you are working with text, and that you use bag-of-words to create features
with the entire dataset. After building the vocabulary, you split your data into the three
sets. In this situation, the learning algorithm will be exposed to features based on tokens
only present in the holdout sets. Again, the model will display artificially better performance
than had you divided your data before feature engineering.

A solution, as you might have guessed, is first to split the entire dataset into training and
holdout sets, and only do feature engineering on the training data. This also applies when
you use mean encoding to transform a categorical feature to a number: split the data first
and then compute the sample mean of the label, based on the training data only.

96
Q

Schema file - what is it and why is it useful?

A

A schema file is a formal description of the dataset: for each feature it records the name, type, allowed range, and basic statistics. It is useful as a reference during analysis and for validating new data (catching out-of-range values, unexpected zeroes or undefined values, and drift) before training and scoring. An example:

feature {
  name: "height"
  type: float
  min: 50.0
  max: 300.0
  mean: 160.0
  variance: 17.0
  zeroes: false
  undefined: false
  popularity: 1.0
}

feature {
  name: "color_red"
  type: binary
  zeroes: true
  undefined: false
  popularity: 0.76
}

feature {
  name: "color_green"
  type: binary
  zeroes: true
  undefined: false
  popularity: 0.65
}

feature {
  name: "color_blue"
  type: binary
  zeroes: true
  undefined: false
  popularity: 0.81
}

97
Q

random prediction algorithm

A

The random prediction algorithm makes a prediction by randomly choosing a label from
the collection of labels assigned to the training examples. In the classification problem, it
corresponds to randomly picking one class from all classes in the problem. In the regression
problem it means selecting from all unique target values in the training data.

98
Q

zero rule algorithm

A

The zero rule algorithm yields a tighter baseline than the random prediction algorithm.
This means that it usually improves the value of the metric as compared to random prediction.
To make predictions, the zero rule algorithm uses more information about the problem.
In classification, the zero rule algorithm strategy is to always predict the class most common
in the training set, independently of the input value. It can look ineffective, but consider the
following problem. Let the training data for your classification problem contain 800 examples
of the positive class, and 200 examples of the negative class. The zero rule algorithm will
predict the positive class all the time, and the accuracy (one of the popular performance
metrics that we will consider in Section 5.5.2) of the baseline will be 800/1000 = 0.8 or 80%,
which is not bad for such a simple classifier. Now you know that your statistical model,
independently of how close it is to the optimum, must have an accuracy of at least 80%.
Now, let’s consider the zero rule algorithm for regression. According to the zero rule algorithm,
the strategy for regression is to predict the sample average of the target values observed in
the training data. This strategy will likely have a lower error rate than random prediction.
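
A small sketch of both baselines for classification: the random prediction baseline draws a label at random from those seen in training, while the zero rule baseline always predicts the most common class.

import random
from collections import Counter

train_labels = ["pos"] * 800 + ["neg"] * 200

def random_prediction(train_labels):
    """Randomly choose one of the labels seen in the training data."""
    return random.choice(list(set(train_labels)))

def zero_rule_prediction(train_labels):
    """Always predict the class most common in the training set."""
    return Counter(train_labels).most_common(1)[0][0]

print(zero_rule_prediction(train_labels))   # 'pos', ~80% accurate on similar data
print(random_prediction(train_labels))      # 'pos' or 'neg' at random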

99
Q

What is Amazon Mechanical Turk service?

A

Mechanical Turk (MT) is a web-platform where people solve simple tasks for
a reward. MT provides an API that you can call to get human predictions. The quality of
such predictions can vary from very low to relatively high, depending on the task and the
reward. MT is relatively inexpensive, so you can get predictions fast and in large numbers.

100
Q

distribution shift

A

The distribution shift can be a hard problem to tackle. Using a different data distribution for
training could be a conscious choice because of the data availability. However, the analyst may
be unaware that the statistical properties of the training and development data are different.
This often happens when the model is frequently updated after production deployment, and
new examples are added to the training set. The properties of the data used to train the
model, and that of the data used to validate and test it, can diverge over time. Section ?? in
the next chapter provides guidance on how to handle that problem.

101
Q

Preconditions for Supervised Learning

A

Before you start working on your model, make sure the following conditions are satisfied:
1. You have a labeled dataset.
2. You have split the dataset into three subsets: training, validation, and test.
3. Examples in the validation and test sets are statistically similar.
4. You engineered features and filled missing values using only the training data.
5. You converted all examples into numerical feature vectors.
6. You have selected a performance metric that returns a single number (see Section 5.5).
7. You have a baseline.

102
Q

Main Properties of a Learning Algorithm: Explainability of learning algorithm

A

Do the model predictions require explanation for a non-technical audience? The most
accurate machine learning algorithms and models are so-called “black boxes.” They
make very few prediction errors, but it may be difficult to understand, and even harder
to explain, why a model or an algorithm made a specific prediction. Examples of such
models are deep neural networks and ensemble models.
In contrast, kNN, linear regression, and decision tree learning algorithms are not always the most accurate. However, their predictions are easy to interpret by a non-expert.

103
Q

Main Properties of a Learning Algorithm: In-memory vs. out-of-memory

A

Can your dataset be fully loaded into the RAM of your laptop or server? If yes, then you
can choose from a wide variety of algorithms. Otherwise, you would prefer incremental
learning algorithms that can improve the model by reading data gradually. Examples
of such algorithms are Naïve Bayes and the algorithms for training neural networks.

104
Q

Main Properties of a Learning Algorithm: Number of features and examples

A

How many training examples do you have in your dataset? How many features does each
example have? Some algorithms, including those used for training neural networks
and random forests, can handle a huge number of examples and millions of features.
Others, like the algorithms for training support vector machines (SVM), can be
relatively modest in their capacity.

105
Q

Main Properties of a Learning Algorithm: Nonlinearity of the data

A

Is your data linearly separable? Can it be modeled using a linear model? If yes, SVM
with the linear kernel, linear and logistic regression can be good choices. Otherwise,
deep neural networks or ensemble models might work better.

106
Q

Main Properties of a Learning Algorithm: Training speed

A

How much time is a learning algorithm allowed to use to build a model, and how often
will you need to retrain the model on updated data? If training takes two days, and
you need to retrain your model every 4 hours, then your model will never be up to date.
Neural networks are slow to train. Simple algorithms like linear and logistic regression,
or decision trees, are much faster.
Specialized libraries contain very efficient implementations of some algorithms. You
may prefer to do research online to find such libraries. Some algorithms, such as
random forest learning, benefit from multiple CPU cores, so their training time can
be significantly reduced on a machine with dozens of cores. Some machine learning
libraries leverage GPU (graphics processing unit) to speed up training.

107
Q

Main Properties of a Learning Algorithm: Prediction speed

A

How fast must the model be when generating predictions? Will your model be used
in a production environment where very high throughput is required? Models like
SVMs and linear and logistic regression models, and not-very-deep feedforward neural
networks, are extremely fast at prediction time. Others, like kNN, ensemble algorithms,
and very deep or recurrent neural networks, are slower.

108
Q

algorithm spot-checking

A

Shortlisting candidate learning algorithms for a given problem is sometimes called algorithm
spot-checking. For the most effective spot-checking, it is recommended to:
* select algorithms based on different principles (sometimes called orthogonal), such as
instance-based algorithms, kernel-based, shallow learning, deep learning, ensembles;
* try each algorithm with 3 − 5 different values of the most sensitive hyperparameters
(such as the number of neighbors k in k-nearest neighbors, penalty C in support vector
machines, or decision threshold in logistic regression);
* use the same training/validation split for all experiments,
* if the learning algorithm is not deterministic (such as the learning algorithms for neural
networks and random forests), run several experiments, and then average the results;
* once the project is over, note which algorithms performed the best, and use this
information when working on a similar problem in the future.
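
For illustration, a minimal spot-checking sketch in Python, assuming scikit-learn (the candidate algorithms, hyperparameter values, and synthetic dataset are illustrative assumptions, not from the book):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
# Same training/validation split for all experiments.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "kNN (instance-based)": [KNeighborsClassifier(n_neighbors=k) for k in (3, 5, 11)],
    "SVM (kernel-based)": [SVC(C=c) for c in (0.1, 1.0, 10.0)],
    "LogReg (shallow)": [LogisticRegression(C=c, max_iter=1000) for c in (0.1, 1.0, 10.0)],
    "RandomForest (ensemble)": [RandomForestClassifier(n_estimators=n, random_state=0)
                                for n in (50, 100, 200)],
}

for name, models in candidates.items():
    # A few hyperparameter values per algorithm, all evaluated on the same validation set.
    scores = [m.fit(X_tr, y_tr).score(X_val, y_val) for m in models]
    print(name, round(max(scores), 3))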

109
Q

Pipeline

A

Many modern machine learning packages and frameworks support the notion of a pipeline.
A pipeline is a sequence of transformations the training data goes through, before it becomes
a model.
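
For example, a minimal scikit-learn sketch (the stages and the synthetic data are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),   # missing-data imputation
    ("scale", StandardScaler()),                  # feature scaling
    ("model", LogisticRegression()),              # model training
])

pipeline.fit(X, y)            # the whole sequence is fit, tuned, and deployed as one object
predictions = pipeline.predict(X)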

110
Q

Performance Metrics for Regression

A

Regression and classification models are assessed using different metrics. Let’s first consider
performance metrics for regression: mean squared error (MSE), median absolute error (MAE),
and almost correct predictions error rate (ACPER).
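
A minimal sketch of the first two metrics, assuming scikit-learn (the toy values are illustrative; ACPER is sketched further below):

import numpy as np
from sklearn.metrics import mean_squared_error, median_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.1])

mse = mean_squared_error(y_true, y_pred)        # sensitive to outliers
mdae = median_absolute_error(y_true, y_pred)    # robust to outliers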

111
Q

When is median absolute error better than MSE?

A

If the data contains outliers, the examples very far from the “true” regression line, they
can significantly affect the value of MSE. By definition, the squared error for such outlying
examples will be high. In such situations, it is better to apply a different metric, the median
absolute error.

112
Q

What is almost correct predictions error rate (ACPER)

A

The almost correct predictions error rate (ACPER) is the percentage of predictions
that are within p percent of the true value. To calculate ACPER, proceed as follows:
  1. Define a threshold percentage error that you consider acceptable (let's say 2%).
  2. For each true value of the target yi, the desired prediction should lie between
    yi − 0.02·yi and yi + 0.02·yi.
  3. Using all examples i = 1, . . . , N, calculate the percentage of predicted values satisfying
    the above rule. This gives the value of the ACPER metric for your model.
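
A minimal NumPy sketch of this calculation (the function name acper is an illustrative assumption):

import numpy as np

def acper(y_true, y_pred, p=0.02):
    # Fraction of predictions within p (here 2%) of the true value.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    within = np.abs(y_pred - y_true) <= p * np.abs(y_true)
    return within.mean()
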
113
Q

For classification, things are a little more complicated. The most widely used metrics to
assess a classification model are:

A
  • precision-recall,
  • accuracy,
  • cost-sensitive accuracy, and
  • area under the ROC curve (AUC).
114
Q

F-measure, also known as
F-score.

A

The traditional F-measure, or F1-score, is the harmonic mean of precision and recall:

F1 = 2 · (precision · recall) / (precision + recall)

115
Q

Accuracy?

A

Accuracy is given by the number of correctly classified examples, divided by the total
number of classified examples. In terms of the confusion matrix, it is given by:

accuracy = (TP + TN) / (TP + TN + FP + FN)

116
Q

When is accuracy not a good measure?

A

Accuracy measures the performance of the model for all classes at once, and it conveniently
returns a single number. However, accuracy is not a good performance metric when the data
is imbalanced. In an imbalanced dataset, examples belonging to some class or a few classes
constitute the vast majority, while other classes include very few examples. Imbalanced
training data can significantly and adversely affect the model. We will talk more about
dealing with the imbalanced data in Section ?? of Chapter 6.

For imbalanced data, a better metric is per-class accuracy. First, calculate the accuracy
of prediction for each class {1, . . . , C}, and then take an average of C individual accuracy
measures. For the above confusion matrix of the spam detection problem, the accuracy
for the class “spam” is 23/(23 + 1) = 0.96, the accuracy for the class “not_spam” is
556/(12 + 556) = 0.98. The per-class accuracy is then (0.96 + 0.98)/2 = 0.97.
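
The same per-class calculation in NumPy, using the confusion-matrix counts quoted above (a minimal sketch):

import numpy as np

# Rows are true classes, columns are predicted classes, order: [spam, not_spam].
cm = np.array([[23, 1],
               [12, 556]])

per_class_accuracy = cm.diagonal() / cm.sum(axis=1)   # [0.96, 0.98] (rounded)
print(per_class_accuracy.mean())                       # approx. 0.97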

117
Q

Cohen’s kappa statistic

A

Cohen’s kappa statistic is a performance metric that applies to both multiclass and
imbalanced learning problems. The advantage of this metric over accuracy is that Cohen’s
kappa tells you how much better your classification model is performing, compared to a
classifier that randomly guesses a class according to the frequency of each class.
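
A minimal sketch, assuming scikit-learn's implementation (the toy labels are illustrative):

from sklearn.metrics import cohen_kappa_score

y_true = ["spam", "not_spam", "not_spam", "spam", "not_spam"]
y_pred = ["spam", "not_spam", "spam", "spam", "not_spam"]

kappa = cohen_kappa_score(y_true, y_pred)   # 1.0 = perfect agreement, 0.0 = chance level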

118
Q

The ROC curve

A

The ROC curve (stands for “receiver operating characteristic;” the term comes from radar
engineering) is a commonly-used method of assessing classification models. ROC curves use a
combination of the true positive rate (defined exactly as recall) and false positive rate
(the proportion of negative examples predicted incorrectly) to build up a summary picture of
the classification performance.
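
A minimal sketch, assuming scikit-learn (the toy scores are illustrative):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]   # predicted scores for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # false/true positive rate per threshold
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve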

119
Q

Grid search

A

Grid search is the simplest hyperparameter tuning technique. It’s used when the number
of hyperparameters and their range is not too large.
We explain it for the problem of tuning two numerical hyperparameters. The technique
consists of discretizing each of the two hyperparameters, and then evaluating each pair of
discrete values.
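
A minimal sketch, assuming scikit-learn's GridSearchCV (the estimator, grid values, and synthetic data are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Discretize each hyperparameter, then evaluate every pair of values.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)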

120
Q

Random search

A

Random search differs from grid search in that you do not provide a discrete set of values
to explore for each hyperparameter. Instead, you provide a statistical distribution for each
hyperparameter from which values are randomly sampled. Then set the total number of
combinations you want to evaluate.
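
A minimal sketch, assuming scikit-learn's RandomizedSearchCV and SciPy distributions (illustrative assumptions):

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# A distribution per hyperparameter and a fixed budget of sampled combinations.
param_distributions = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)}
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20, cv=3, random_state=0)
search.fit(X, y)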

121
Q

Hyperparameter tuning: Coarse-to-Fine Search

A

In practice, analysts often use a combination of grid search and random search called
coarse-to-fine search. This technique uses a coarse random search to first find the regions
of high potential. Then, using a fine grid search in these regions, one finds the best values
for hyperparameters, as shown in Figure 8.
You can decide to only explore one high-potential region or several such regions, depending
on the available time and computational resources.

122
Q

Hyperparameter tuning: Bayesian techniques

A

Bayesian techniques differ from random and grid searches in that they use past evaluation
results to choose the next values to evaluate. In practice, this allows Bayesian hyperparameter
optimization techniques to find better values of hyperparameters in less time.

123
Q

Cross-Validation

A

google

124
Q

Shallow models?

A

Shallow models make predictions based directly on the values in the input feature vector.
Most popular machine learning algorithms produce shallow models. The only kind of deep
models commonly used are deep neural networks. We consider a strategy to train them in
Section ?? of the next chapter.

125
Q

A typical model training strategy for shallow learning algorithms looks as follows:

A
  1. Define a performance metric P.
  2. Shortlist learning algorithms.
  3. Choose a hyperparameter tuning strategy T.
  4. Pick a learning algorithm A.
  5. Pick a combination H of hyperparameter values for algorithm A using strategy T.
  6. Use the training set and train a model M using algorithm A parametrized with
    hyperparameter values H.
  7. Use the validation set and calculate the value of metric P for model M.
  8. Decide:
    a. If there are still untested hyperparameter values, pick another combination H of
    hyperparameter values using strategy T and go back to step 6.
    b. Otherwise, pick a different learning algorithm A and go back to step 5, or proceed
    to step 9 if there are no more learning algorithms to try.
  9. Return the model for which the value of metric P is maximized.
126
Q

If the model makes too many mistakes on the training data, we say that it has a high bias, or
that the model underfits the training data. There could be several reasons for underfitting:

A
  • the model is too simple for the data (for example linear models often underfit);
  • the features are not informative enough;
  • you regularize too much (we talk about regularization in the next section).

The possible solutions to the problem of underfitting include:
* trying a more complex model,
* engineering features with higher predictive power,
* adding more training data, when possible, and
* reducing regularization.

127
Q

Several reasons can lead to overfitting:

A
  • the model is too complex for the data. Very tall decision trees or a very deep neural
    network often overfit;
  • there are too many features and few training examples; and
  • you don’t regularize enough.

Several solutions to overfitting are possible:
  • use a simpler model (try linear instead of polynomial regression, or an SVM with a
    linear kernel instead of radial basis function (RBF), or a neural network with fewer
    layers/units);
  • reduce the dimensionality of examples in the dataset;
  • add more training data, if possible; and
  • regularize the model.
128
Q

Regularization

A

Regularization is an umbrella term for methods that force a learning algorithm to train a
less complex model. In practice, it leads to higher bias, but significantly reduces the variance.

129
Q

lasso/ L1 and ridge regularization / L2.

A

google

130
Q

A common strategy to build a neural
network looks as follows:

A
  1. Define a performance metric P.
  2. Define the cost function C.
  3. Pick a parameter-initialization strategy W.
  4. Pick a cost-function optimization algorithm A.
  5. Choose a hyperparameter tuning strategy T.
  6. Pick a combination H of hyperparameter values using the tuning strategy T.
  7. Train model M, using algorithm A, parametrized with hyperparameters H, to optimize
    cost function C.
  8. If there are still untested hyperparameter values, pick another combination H of
    hyperparameter values using strategy T, and repeat step 7.
  9. Return the model for which the metric P was optimized.
131
Q

categorical cross-entropy (for multiclass classification) or binary cross-entropy (for binary
and multi-label classification).

A

Classification cost functions. Google for more

132
Q

Multi-class vs multi label classification

A

Note that the output layers in multiclass and multi-label classification are different. In
multiclass classification, one softmax unit is used. It generates a C-dimensional vector whose
values are bounded by the range (0, 1), and whose sum equals 1. In multi-label classification,
the output layer contains C logistic units whose values also lie in the range (0, 1), but their
sum lies in the range (0, C).
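
A minimal Keras-style sketch of the two output layers (the layer sizes and the use of Keras here are illustrative assumptions):

from tensorflow import keras

C = 4   # number of classes

multiclass_head = keras.layers.Dense(C, activation="softmax")   # outputs sum to 1
multilabel_head = keras.layers.Dense(C, activation="sigmoid")   # each output in (0, 1), sum in (0, C)

# Typical pairing with the cost functions mentioned above:
#   softmax head  -> categorical cross-entropy
#   sigmoid head  -> binary cross-entropy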

133
Q

Parameter-Initialization Strategies

A
  • ones — all parameters are initialized to 1;
  • zeros — all parameters are initialized to 0;
  • random normal — parameters are initialized to values sampled from the normal
    distribution, typically with mean of 0 and standard deviation of 0.05;
  • random uniform — parameters are initialized to values sampled from the uniform
    distribution with the range [−0.05, 0.05];
  • Xavier normal — parameters are initialized to values sampled from the truncated
    normal distribution, centered on 0, with standard deviation equal to √(2/(in + out)),
    where “in” is the number of units in the preceding layer to which the current unit is
    connected (the one whose parameters you initialize), and “out” is the number of units
    in the subsequent layer to which the current unit is connected; and
  • Xavier uniform — parameters are initialized to values sampled from a uniform
    distribution within [−limit, limit], where limit is √(6/(in + out)), and “in” and “out”
    are defined as in Xavier normal, above.
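
A minimal NumPy sketch of the two Xavier variants (a plain normal is used instead of a truncated one, for simplicity):

import numpy as np

def xavier_normal(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / (n_in + n_out))            # sqrt(2 / (in + out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def xavier_uniform(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))          # sqrt(6 / (in + out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))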

134
Q

We say that f(x) has a local minimum at x = c if?

A

We say that f(x) has a local minimum at x = c if f(x) ≥ f(c) for
every x in some open interval around x = c.

135
Q

Learning Rate Decay Schedules

A

Learning rate decay consists of gradually reducing the value of the learning rate α as
the epochs progress. Consequently, the parameter updates become finer. There are several
techniques: time-based, step-based, and exponential learning rate decay schedules.
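
Minimal sketches of the three schedules (the parameter names are illustrative assumptions):

import math

def time_based_decay(alpha0, decay, epoch):
    return alpha0 / (1.0 + decay * epoch)

def step_based_decay(alpha0, drop, epochs_per_drop, epoch):
    return alpha0 * drop ** math.floor(epoch / epochs_per_drop)

def exponential_decay(alpha0, k, epoch):
    return alpha0 * math.exp(-k * epoch)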

136
Q

There are several popular upgrades to minibatch SGD, such as Momentum, Root Mean
Squared Propagation (RMSProp), and Adam.

A

google them

137
Q

dropout

A

The concept of dropout is very simple. Each time you “run” a training example through
the network, you temporarily exclude at random some units from the computation. The
higher the percentage of units excluded, the stronger the regularization effect. Popular neural
network libraries allow you to add a dropout layer between two successive layers, or you can
specify the dropout hyperparameter for a layer. The dropout hyperparameter varies in the
range [0, 1] and characterizes the fraction of units to randomly exclude from computation.
The value of the hyperparameter has to be found experimentally. While simple, dropout’s
flexibility and regularizing effect are phenomenal.
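
A minimal Keras sketch (the layer sizes and dropout values are illustrative assumptions):

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),    # dropout hyperparameter: fraction of units to exclude
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation="sigmoid"),
])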

138
Q

Early stopping

A

Early stopping trains a neural network by saving the preliminary model after every epoch.
Models saved after each epoch are called checkpoints. Then it assesses each checkpoint’s
performance on the validation set. You’ll find during gradient descent that the cost decreases
as the number of epochs increases. After some epoch, the model can start overfitting, and
the model’s performance on the validation data can deteriorate. Remember the bias-variance
illustration in Figure ?? in Chapter 5. By keeping a version of the model after each epoch,
you can stop the training once you start observing a decreased performance on the validation
set. Alternatively, you can keep running the training process for a fixed number of epochs,
and then pick the best checkpoint. Some machine learning practitioners rely on this technique.
Others try to properly regularize the model using appropriate techniques.
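
A minimal Keras sketch of checkpointing plus early stopping (the file name, patience value, and the commented-out fit call are illustrative assumptions):

from tensorflow import keras

callbacks = [
    # Save a checkpoint after every epoch, keeping only the best one on the validation set.
    keras.callbacks.ModelCheckpoint("best_model.keras", monitor="val_loss",
                                    save_best_only=True),
    # Stop once the validation loss has not improved for 5 epochs.
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                  restore_best_weights=True),
]

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=callbacks)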

139
Q

Batch normalization

A

Batch normalization (which rather should be called batch standardization) consists of
standardizing the outputs of each layer before the next layer receives them as input. In
practice, batch normalization results in faster and more stable training, as well as some
regularization effect. So, it's always a good idea to use batch normalization. In popular
neural network libraries, you can often insert a batch normalization layer between two
subsequent layers.

140
Q

A pre-trained model can be used in two ways:

A

1) its learned parameters can be used to initialize your own model, or
2) it can be used as a feature extractor for your model.

141
Q

pre-trained model as feature extractors

A

In practice, it means that
you only keep several initial layers of the pre-trained model, those closest to and including
the input layer. You keep their parameters “frozen,” that is, unchanged and unchangeable.
Then you add new layers on top of the frozen layers, including the output layer appropriate
for your task. Only the parameters of the new layers will be updated by gradient descent
during training on your data.
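
A minimal Keras sketch of the frozen-feature-extractor pattern (the choice of ResNet50, the input size, and the head layers are illustrative assumptions):

from tensorflow import keras

base = keras.applications.ResNet50(include_top=False, weights="imagenet",
                                   pooling="avg", input_shape=(224, 224, 3))
base.trainable = False   # keep the pre-trained parameters "frozen"

model = keras.Sequential([
    base,
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),   # new output layer for your task
])
# Only the parameters of the two new Dense layers are updated during training.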

142
Q

use a pre-trained model as an initializer

A

If you use a pre-trained model as an initializer for your model, it gives you more flexibility.
The gradient descent will modify the parameters in all layers and, potentially, reach a better
performance for your problem. The downside is that you will often end up training a very
deep neural network.

143
Q

Adversarial Validation

A

Adversarial validation is a very clever and very simple way to check whether the test data and the training data are similar: we combine the training and test data, labeling the training rows with 0 and the test rows with 1, shuffle them, and then see whether a binary classifier can correctly re-identify them.

If we cannot correctly classify them, i.e., we obtain an area under the receiver operating characteristic curve (ROC AUC) of 0.5, then the two sets are indistinguishable and we are good to go.

However, if we can classify them (ROC AUC > 0.5), then we have a problem, either with the whole dataset or, more likely, with some features in particular, which are probably drawn from different distributions in the test and training sets. If we have a problem, we can look at the feature that was most out of place. The problem may be that some values were only seen in, say, the training data but not in the test data. If the contribution to the ROC AUC from one feature is very high, it may well be a good idea to remove that feature from the model.

We can also use it to improve the data for learning.
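
A minimal sketch of the procedure, assuming scikit-learn (the classifier, the synthetic data, and the deliberate shift are illustrative assumptions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))
X_test = rng.normal(loc=0.5, size=(200, 5))   # deliberately shifted test set

# Label training rows 0 and test rows 1, then try to tell them apart.
X_all = np.vstack([X_train, X_test])
y_all = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X_all, y_all, cv=5, scoring="roc_auc").mean()
# AUC near 0.5: train and test are indistinguishable.
# AUC well above 0.5: distribution shift; inspect feature importances.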

144
Q

Handling Imbalanced Datasets - class weight

A

google

145
Q

Handling Imbalanced Datasets - Ensemble of Resampled Datasets

A

google

146
Q

Handling Imbalanced Datasets - different learning rates for different classes:

A

If you use stochastic gradient descent, the class imbalance can be tackled in several ways.
First, you can have different learning rates for different classes: a lower value for the examples
of the majority class, and a higher value otherwise. Second, you can make several consecutive
updates of the model parameters each time you encounter an example of a minority class.

147
Q

There are two techniques often used to calibrate a binary model: Platt scaling and isotonic
regression.

A

google

148
Q

If your model does poorly on the training data (underfits it), common reasons are:

A
  • the model architecture or learning algorithm is not expressive enough (try a more
    advanced learning algorithm, an ensemble method, or a deeper neural network);
  • you regularize too much (reduce regularization);
  • you have chosen suboptimal values for hyperparameters (tune hyperparameters);
  • the features you engineered don’t have enough predictive power (add more informative
    features);
  • you don’t have enough data for the model to generalize (try to get more data, use data
    augmentation, or transfer learning); or
  • you have a bug in your code (debug the code that defines and trains the model).
149
Q

If your model does well on the training data, but poorly on the holdout data (overfits the
training data), common reasons are:

A
  • you don’t have enough data for generalization (add more data or use data augmentation);
  • your model is under-regularized (add regularization or, for neural networks, both
    regularization and batch normalization);
  • your training data distribution is different from the holdout data distribution (reduce
    the distribution shift);
  • you have chosen suboptimal values for hyperparameters (tune hyperparameters); or
  • your features have low predictive power (add features with high predictive power).
150
Q

Iterative Model Refinement
If you have access to new labeled data (for example, you can label examples yourself, or easily
request the help of a labeler) then, you can refine the model using a simple iterative process:

A
  1. Train the model using the best values of hyperparameters identified so far.
  2. Test the model by applying it to a small subset of the validation set (100−300 examples).
  3. Find the most frequent error patterns on that small validation set. Remove those
    examples from the validation set, because your model will now overfit to them.
  4. Generate new features, or add more training data to fix the observed error patterns.
  5. Repeat until no frequent error patterns are observed (most errors look dissimilar).
151
Q

Fixing Wrong Labels

A

Here is a simple way to identify the examples that have wrong labels. Apply the model to
the training data from which it was built, and analyze the examples for which it made a
different prediction as compared to the labels provided by humans. If you see that some
predictions are indeed correct, change those labels.
If you have time and resources, you could also examine the predictions with the score close
to the decision threshold. Those are often mislabeled cases too.

152
Q

Finding Additional Examples to Label in best way possible?

A

As discussed above, error analysis can reveal that more labeled data is needed from specific
regions of feature space. You might have an abundance of unlabeled examples. How should
you decide which examples to label so as to maximize the positive impact on the model?
If your model returns a prediction score, an effective way is to use your best model to score
the unlabeled examples. Then label those examples, whose prediction score is close to the
prediction threshold.
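
A minimal sketch, assuming scikit-learn and a 0.5 decision threshold (the data and model are illustrative assumptions):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_labeled, y_labeled = make_classification(n_samples=300, random_state=0)
X_unlabeled, _ = make_classification(n_samples=1000, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
scores = model.predict_proba(X_unlabeled)[:, 1]

# Pick the unlabeled examples whose score is closest to the decision threshold.
to_label_idx = np.argsort(np.abs(scores - 0.5))[:100]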

153
Q

Troubleshooting Deep Learning

A

look in the book: Machine learning engineering

154
Q

What is a good model?

A

A good model has two properties:
* it has the desired quality according to the performance metric; and
* it is safe to serve in a production environment.
For a model to be safe-to-serve means satisfying the following requirements:
* it will not crash or cause errors in the serving system when being loaded, or when
loaded with bad or unexpected inputs;
* it will not use an unreasonable amount of resources (such as CPU, GPU, or RAM).

155
Q

catastrophic forgetting

A

Furthermore, frequent model upgrades without retraining from scratch can lead to
catastrophic forgetting. It's a situation in which the model that was once capable of
something "forgets" that capability because of learning something new.

156
Q

warm-starting

A

Avoid the practice of warm-starting. It consists of iteratively upgrading the existing model
by using only new training examples and running additional training iterations.

157
Q

Correction Cascades

A

You might have model mA that solves problem A, but you need a solution mB for a slightly
different problem B. It can be tempting to use the output of mA as input for mB, and
only train mB on a small sample of examples that “correct” the output of mA for solving
problem B. Such technique is called correction cascading, and it is not recommended.

It’s important to note that model cascading is not always a bad practice. Using the output
of one model, as one of many inputs for another model, is common. It might significantly
reduce time to market. However, cascading must be used with caution, because the update
of one model in a cascade must involve an update of all models in the cascade, which can
end up being costly in the long-term.

158
Q

glue code

A

Reduce glue code to a minimum. This is how Google engineers put it. Machine learning
researchers tend to develop general purpose solutions as self-contained packages. A wide
variety of these are available as open-source packages or from in-house code, proprietary
packages, and cloud-based platforms. Using generic packages often results in a glue-code
system design pattern, in which a massive amount of supporting code is written to get data
into and out of general-purpose packages.

159
Q

More Data Beats Cleverer Algorithm

A

In practice, however, better results often come from getting more data, specifically, more
labeled examples. If designed well, the data labeling process can allow a labeler to produce
several thousand training examples daily. It can also be less expensive, compared to the
expertise needed to invent a more advanced machine learning algorithm.

160
Q

New Data Beats Cleverer Features

A

If, despite adding more training examples and designing clever features, the performance of
your model plateaus, think about different information sources.
For example, if you want to predict whether user U will like a news article, try to add
historical data about the user U as features. Or cluster all the users, and use the information
on the k-nearest users to user U as new features. This is a simpler approach compared to
programming very complex features, or combining existing features in a complex way.

161
Q

Facilitate Reproducibility

A

The random seed can be set as np.random.seed(15) (in NumPy and scikit-learn),
tf.random.set_seed(15) in TensorFlow, torch.manual_seed(15) (in PyTorch), and
set.seed(15) (in R). The seed value doesn’t matter as long as it remains constant.
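
A minimal sketch of fixing the seeds mentioned above (the framework-specific lines are commented out in case those libraries are not installed):

import random
import numpy as np

SEED = 15
random.seed(SEED)       # Python's built-in RNG
np.random.seed(SEED)    # NumPy and scikit-learn

# import tensorflow as tf; tf.random.set_seed(SEED)
# import torch; torch.manual_seed(SEED)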

162
Q

When delivering the model, make sure it’s accompanied by all relevant information for
reproducibility.

A

Besides the description of the dataset and features, such as documentation
and metadata considered in Sections ?? and ??, each model should contain the documentation
with the following details:
* a specification of all hyperparameters, including the ranges considered, and the default
values used,
* the method used to select the best hyperparameter configuration,
* the definition of the specific measure or statistics used to evaluate the candidate models,
and the value of it for the best model,
* a description of the computing infrastructure used, and
* the average runtime for each trained model, and an estimated cost of the training.

163
Q

There are different forms of online evaluation, each serving a different purpose.

A

Runtime monitoring is checking whether the running system meets the runtime requirements.
Another common scenario is to monitor user behavior in response to different versions of the
model. One popular technique used in this scenario is A/B testing. We split the users of a
system into two groups, A and B. The two groups are served the old and the new models,
respectively. Then we apply a statistical significance test to decide whether the performance
of the new model is better than the old one.
Multi-armed bandit (MAB) is another popular technique of online model evaluation. Similar
to A/B testing, it identifies the best-performing model by exposing model candidates to
a fraction of users. Then it gradually exposes the best model to more users, while continuing
to gather performance statistics until they are reliable.

164
Q

why A/B testing

A

To figure out which model is better.

165
Q

A/B testing - G-Test

A

google

166
Q

A/B testing - Z-Test

A

google

167
Q

Multi-Armed Bandit

A

A more advanced, and often preferable, way of online model evaluation and selection is
multi-armed bandit (MAB). A/B testing has one major drawback: the number of test results
needed in groups A and B to reach a conclusion is high, so a significant portion of users
routed to a suboptimal model would experience suboptimal behavior for a long time.

168
Q

UCB1

A

UCB1 (for Upper Confidence Bound) is a popular algorithm for solving the multi-armed
bandit problem. The algorithm dynamically chooses an arm, based on the performance of that
arm in the past, and how much the algorithm knows about it. In other words, UCB1 routes the
user to the best performing model more often when its confidence about the model performance
is high. Otherwise, UCB1 might route the user to a suboptimal model so as to get a more
confident estimate of that model’s performance. Once the algorithm is confident enough about
the performance of each model, it almost always routes users to the best performing model.
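
A minimal sketch of UCB1 for choosing between two candidate models (the reward definition and the simulated conversion rates are illustrative assumptions):

import math
import random

def ucb1_choice(counts, rewards):
    # counts[i]: users routed to model i so far; rewards[i]: total reward observed for model i.
    for i, n in enumerate(counts):
        if n == 0:
            return i                                    # try every model at least once
    total = sum(counts)
    scores = [rewards[i] / counts[i] + math.sqrt(2 * math.log(total) / counts[i])
              for i in range(len(counts))]
    return max(range(len(counts)), key=lambda i: scores[i])

counts, rewards = [0, 0], [0.0, 0.0]
true_rates = [0.05, 0.08]                               # hypothetical conversion rates
for _ in range(1000):
    arm = ucb1_choice(counts, rewards)
    reward = 1.0 if random.random() < true_rates[arm] else 0.0
    counts[arm] += 1
    rewards[arm] += reward
# Over time, most users end up routed to the better model (index 1 here).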

169
Q

Neuron Coverage

A

When we evaluate a neural network, especially one to be used in a mission-critical scenario,
such as a self-driving car or a space rocket, our test set must have good coverage. Neuron
coverage of a test set for a neural network model is defined as the ratio of the units (neurons)
activated by the examples from the test set, to the total number of units. A good test set
has close to 100% neuron coverage.

A unit is considered activated when its output is above a certain threshold. For ReLU, it’s
usually zero; for a logistic sigmoid, it’s 0.5.

170
Q

Mutation Testing

A

In software engineering, good test coverage for a software under test (SUT) can be
determined using the approach known as mutation testing. Let’s have a set of tests
designed to test an SUT. We generate several “mutants” of the SUT. A mutant is a version of
the SUT in which we randomly make some modifications, such as replacing, in the source code,
a "+" with a "−" or a "<" with a ">", deleting the else branch of an if-else statement,
and so on. Then we apply the test set to each mutant, and see if at least one test breaks on
that mutant. We say that we kill a mutant if one test breaks on it. We then compute the
ratio of killed mutants in the entire collection of mutants. A good test set makes this ratio
equal to 100%.
In machine learning, a similar approach can be followed. However, to create a mutant
statistical model, instead of modifying the code, we modify the training data. If the model
is deep, we can also randomly remove or add a layer, or remove or replace an activation
function. The training data can be modified by,
* adding duplicated examples,
* falsifying the labels of some examples,
* removing some examples, or
* adding random noise to the values of some features.
We say that we kill a mutant if at least one test example gets a wrong prediction by that
mutant statistical model.

171
Q

Robustness of a model

A

The robustness of a machine learning model refers to the stability of the model performance
after adding some noise to the input data. A robust model would exhibit the following
behavior. If the input example is perturbed by adding random noise, the performance of the
model would degrade proportionally to the level of noise.

172
Q

Fairness

A

Machine learning algorithms tend to learn what humans are teaching them. The teaching
comes in the form of training examples. Humans have biases which may affect how they
collect and label data. Sometimes, bias is present in historical, cultural, or geographical data.
This, in turn, as we have seen in Section ?? in Chapter 3, may lead to biased models.

173
Q

A model can be deployed following several patterns:

A
  • statically, as a part of an installable software package,
  • dynamically on the user’s device,
  • dynamically on a server, or
  • via model streaming.
174
Q

Static Deployment

A

The static deployment of a machine learning model is very similar to traditional software
deployment: you prepare an installable binary of the entire software. The model is packaged as a resource available at runtime. Depending on the operating system and the runtime
environment, the objects of both the model and the feature extractor can be packaged as a
part of a dynamic-link library (DLL on Windows), Shared Objects (*.so files on Linux), or
be serialized and saved in the standard resource location for virtual machine-based systems,
such as Java and .Net.
Static deployment has many advantages:
* the software has direct access to the model, so the execution time is fast for the user,
* the user data doesn’t have to be uploaded to the server at the time of prediction; this
saves time and preserves privacy,
* the model can be called when the user is offline, and
* the software vendor doesn’t have to care about keeping the model operational; it
becomes the user’s responsibility.

175
Q

what is load balancer?

A

A load balancer dispatches the incoming requests to a specific virtual machine, depending on its availability. The virtual machines can be added and closed manually, or be a part of
an autoscaling group that launches or terminates virtual machines based on their usage.
Figure 2 illustrates that deployment pattern. Each instance, denoted as an orange square,
contains all the code needed to run the feature extractor and the model. The instance also
contains a web service that has access to that code.

176
Q

example of a hidden feedback loop.

A

Model mB
used the output of model mA as a feature, without knowing that model mA also used the
output of model mB as its feature.
Another kind of hidden feedback loop only involves one model. Let’s say we have a model
that classifies incoming email messages as spam or not spam. Let the user interface allow
the user to mark messages as spam or not spam. Obviously, we want to use those marked
messages to improve our model. However, by so doing, we risk creating a hidden feedback
loop, and here is why.
In our application, the user will only mark a message as spam when they see it. However,
users only see the messages that our model classified as not spam. Also, it is unlikely that
the user will regularly go to the spam folder and mark some messages as not spam. So, the
action of the user is significantly affected by our model, which makes the data we get from
the user skewed: we influence the phenomenon from which we learn.

177
Q

what is message broker?

A

To deal with such situations, on-demand architectures include a message broker, such as
RabbitMQ or Apache Kafka. A message broker allows one process to write messages in
a queue, and another to read from that queue. On-demand requests are placed in the input
queue. The model runtime process periodically connects to the broker. It reads a batch
of input data elements from the input queue and generates predictions for each element in
batch mode. It then writes the predictions to the output queue. Another process periodically
connects to the broker, reads the predictions from the output queue, and pushes them to
users who sent the requests (Figure 3). In addition to allowing us to cope with demand
spikes, such an approach is more resource-efficient.

178
Q

There are three “cannots” we must accept and embrace:

A
  1. We cannot always explain why an error happened.
  2. We cannot reliably predict when it will happen, and even a high confidence prediction
    can be false.
  3. We cannot always know how to fix a specific error. If it’s fixable, what kind and how
    much training data is needed?
179
Q

thundersvm and cuML

A

Modern libraries, such as thundersvm and cuML, allow the analyst to
run shallow learning algorithms on GPUs, with a significant gain in training time. If you
cannot afford to wait for days or weeks to get an updated model, using a less complex (and,
therefore, less accurate) model might be your only choice.

180
Q
A