Machine learning engineering (book) Flashcards

1
Q

Reinforcement Learning

A

Reinforcement learning is a subfield of machine learning where the machine (called an
agent) “lives” in an environment and is capable of perceiving the state of that environment
as a vector of features. The machine can execute actions in non-terminal states. Different
actions bring different rewards and could also move the machine to another state of the
environment. A common goal of a reinforcement learning algorithm is to learn an optimal
policy.

An optimal policy is a function (similar to the model in supervised learning) that takes the
feature vector of a state as input and outputs an optimal action to execute in that state. The
action is optimal if it maximizes the expected average long-term reward.

2
Q

Tidy data

A

Tidy data can be seen as a spreadsheet, in which each row represents one
example, and columns represent various attributes of an example, as shown in Figure 3.
Sometimes raw data can be tidy, e.g., provided to you in the form of a spreadsheet. However,
in practice, to obtain tidy data from raw data, data analysts often resort to the procedure
called feature engineering, which is applied to the direct and, optionally, indirect data
with the goal of transforming each raw example into a feature vector x. Chapter 4 is devoted
entirely to feature engineering.

3
Q

training, validation, and test datasets

A

training, validation, and test. The training set is usually the biggest one;
the learning algorithm uses the training set to produce the model. The validation and test
sets are roughly the same size, much smaller than the size of the training set. The learning
algorithm is not allowed to use examples from the validation or test sets to train the model.
That is why those two sets are also called holdout sets.

4
Q

Baseline

A

In machine learning, a baseline is a simple algorithm for solving a problem, usually based
on a heuristic, simple summary statistics, randomization, or a very basic machine learning
algorithm. For example, if your problem is classification, you can pick a baseline classifier
and measure its performance. This baseline performance then becomes the reference against
which you compare any future model (usually one built using a more sophisticated approach).

5
Q

Machine Learning Pipeline

A

A machine learning pipeline is a sequence of operations on the dataset that goes from its
initial state to the model.

A pipeline can include, among others, such stages as data partitioning, missing data imputation,
feature extraction, data augmentation, class imbalance reduction, dimensionality reduction,
and model training.
In practice, when we deploy a model in production, we usually deploy an entire pipeline.
Furthermore, an entire pipeline is usually optimized when hyperparameters are tuned.

6
Q

Hyperparameters

A

Hyperparameters are inputs of machine learning algorithms or pipelines that influence
the performance of the model. They don’t belong to the training data and cannot be
learned from it. For example, the maximum depth of the tree in the decision tree learning
algorithm, the misclassification penalty in support vector machines, k in the k-nearest
neighbors algorithm, the target dimensionality in dimensionality reduction, and the choice of
the missing data imputation technique are all examples of hyperparameters.

7
Q

Parameters

A

Parameters, on the other hand, are variables that define the model trained by the learning
algorithm. Parameters are directly modified by the learning algorithm based on the training
data. The goal of learning is to find such values of parameters that make the model optimal
in a certain sense. Examples of parameters are w and b in the equation of linear regression
y = wx + b. In this equation, x is the input of the model, and y is its output (the prediction).
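
For illustration, here is a minimal sketch of parameters being learned for y = wx + b, assuming scikit-learn is available; the data values and variable names are made up:

    # Minimal sketch: parameters vs. hyperparameters, assuming scikit-learn is installed.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])   # inputs x
    y = np.array([2.1, 3.9, 6.2, 8.1])           # targets

    model = LinearRegression()   # hyperparameters (e.g., fit_intercept) are set here, before training
    model.fit(X, y)

    # w and b are the parameters learned from the training data
    w, b = model.coef_[0], model.intercept_
    print(f"y = {w:.2f} * x + {b:.2f}")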

8
Q

Model-Based

A

Most supervised learning algorithms are model-based. A typical model is a support
vector machine (SVM). Model-based learning algorithms use the training data to create a
model with parameters learned from the training data. In SVM, the two parameters are w
(a vector) and b (a real number). After the model is trained, it can be saved on disk while
the training data can be discarded.

9
Q

Instance-based learning algorithms

A

Instance-based learning algorithms use the whole dataset as the model. One instance-based
algorithm frequently used in practice is k-Nearest Neighbors (kNN). In classification, to
predict a label for an input example, the kNN algorithm looks at the close neighborhood
of the input example in the space of feature vectors and outputs the label that it saw most
often in this close neighborhood.

10
Q

Shallow vs. Deep Learning

A

A shallow learning algorithm learns the parameters of the model directly from the features of
the training examples. A deep learning algorithm trains a neural network with more than one
layer between input and output; most of its parameters are learned not directly from the
features of the training examples, but from the outputs of the preceding layers.

11
Q

Training vs. Scoring

A

When we apply a learning algorithm to a dataset in order to obtain a model, we talk about
training. When we apply a trained model to an input example (or, sometimes, a sequence of
examples) in order to obtain a prediction (or predictions) or to somehow transform an input,
we talk about scoring.

12
Q

When to Use Machine Learning

A

When the Problem Is Too Complex for Coding
When the Problem Is Constantly Changing
When It Is a Perceptive Problem - image recognition…
When It Is an Unstudied Phenomenon
When the Problem Has a Simple Objective
When It Is Cost-Effective

13
Q

When Not to Use Machine Learning

A
  • every action of the system or a decision made by it must be explainable,
  • every change in the system’s behavior compared to its past behavior in a similar
    situation must be explainable,
  • the cost of an error made by the system is too high,
  • you want to get to the market as fast as possible,
  • getting the right data is too hard or impossible,
  • you can solve the problem using traditional software development at a lower cost,
  • a simple heuristic would work reasonably well,
  • the phenomenon has too many outcomes while you cannot get a sufficient amount of
    examples to represent them (like in video games or word processing software),
  • you build a system that will not have to be improved frequently over time,
  • you can manually fill an exhaustive lookup table by providing the expected output
    for any input (that is, the number of possible input values is not too large, or getting
    outputs is fast and cheap).
14
Q

Machine learning engineering

A

Machine learning engineering (MLE) is the use of scientific principles, tools, and techniques of
machine learning and traditional software engineering to design and build complex computing
systems. MLE encompasses all stages from data collection, to model training, to making the
model available for use by the product or the customers.
In other words, MLE includes any activity that lets machine learning algorithms be
implemented as a part of an effective production system.

15
Q

Three factors highly influence the cost of a machine learning project

A
  • the difficulty of the problem,
  • the cost of data, and
  • the need for accuracy.
16
Q

Defining the Goal of a Machine Learning Project

A

The goal of a machine learning project is to build a model that solves, or helps solve, a
business problem. Within a project, the model is often seen as a black box described by the
structure of its input (or inputs) and output (or outputs), and the minimum acceptable level
of performance (as measured by accuracy of prediction or another performance metric).

17
Q

What a Model Can Do

A
  • automate (for example, by taking action on the user’s behalf or by starting or stopping
    a specific activity on a server),
  • alert or prompt (for example, by asking the user if an action should be taken or by
    asking a system administrator if the traffic seems suspicious),
  • organize, by presenting a set of items in an order that might be useful for a user (for
    example, by sorting pictures or documents in the order of similarity to a query or
    according to the user’s preferences),
  • annotate (for instance, by adding contextual annotations to displayed information, or
    by highlighting, in a text, phrases relevant to the user’s task),
  • extract (for example, by detecting smaller pieces of relevant information in a larger
    input, such as named entities in the text: proper names, companies, or locations),
  • recommend (for example, by detecting and showing to a user highly relevant items in a
    large collection based on item’s content or user’s reaction to the past recommendations),
  • classify (for example, by dispatching input examples into one, or several, of a predefined
    set of distinctly-named groups),
  • quantify (for example, by assigning a number, such as a price, to an object, such
    as a house),
  • synthesize (for example, by generating new text, image, sound, or another object similar
    to the objects in a collection),
  • answer an explicit question (for example, “Does this text describe that image?” or
    “Are these two images similar?”),
  • transform its input (for example, by reducing its dimensionality for visualization
    purposes, paraphrasing a long text as a short abstract, translating a sentence into
    another language, or augmenting an image by applying a filter to it),
  • detect a novelty or an anomaly.
18
Q

Properties of a Successful Model

A
  • it respects the input and output specifications and the performance requirement,
  • it benefits the organization (measured via cost reduction, increased sales or profit),
  • it helps the user (measured via productivity, engagement, and sentiment),
  • it is scientifically rigorous.
19
Q

Structuring a Machine Learning Team

A

Two Cultures

One culture says that a machine learning team has to be composed of data analysts who
collaborate closely with software engineers. In such a culture, a software engineer doesn’t
need to have deep expertise in machine learning, but has to understand the vocabulary of
their fellow data analysts.

According to the other culture, all engineers in a machine learning team must have a combination
of machine learning and software engineering skills.

20
Q

Data engineers

A

Data engineers are software engineers responsible for ETL (for Extract, Transform, Load).
These three conceptual steps are part of a typical data pipeline. Data engineers use ETL
techniques and create an automated pipeline, in which raw data is transformed into
analysis-ready data. Data engineers design how to structure the data and how to integrate it
from various resources. They write on-demand queries on that data, or wrap the most frequent
queries into fast application programming interfaces (APIs) to make sure that the data is
easily accessible by analysts and other data consumers. Typically, data engineers are not
expected to know any machine learning.

21
Q

labeler

A

A labeler is a person responsible for assigning labels to unlabeled examples. Again, in big
companies, data labeling experts may be organized in two or three different teams: one or
two teams of labelers (for example, one local and one outsourced) and a team of software
engineers, plus a user experience (UX) specialist, responsible for building labeling tools.

22
Q

machine learning projects can fail for many reasons

A
  • lack of experienced talent,
  • lack of support by the leadership,
  • missing data infrastructure,
  • data labeling challenge,
  • siloed organizations and lack of collaboration,
  • technically infeasible projects, and
  • lack of alignment between technical and business teams.
23
Q

Is the Data Sizeable?

A

Check whether the number of examples is big enough. There are some rules of thumb:
* 10 times the number of features (this often exaggerates the size of the training set, but
works well as an upper bound),
* 100 or 1000 times the number of classes (this often underestimates the size), or
* ten times the number of trainable parameters (usually applied to neural networks).

Keep in mind that just because you have big data does not mean that you should use all of
it. A smaller sample of big data can give good results in practice and accelerate the search
for a better model. It’s important to ensure, though, that the sample is representative of the
whole big dataset. Sampling strategies such as stratified and systematic sampling can
lead to better results. We consider data sampling strategies in Section 3.10.

24
Q

data leakage (also known as target leakage)

A

What happened is called data leakage (also known as target leakage). After a more
careful examination of the dataset, you realize that one of the columns in the spreadsheet
contained the real estate agent’s commission. Of course, the model easily learned to convert
this attribute into the house price perfectly. However, this information is not available in the
production environment before the house is sold, because the commission depends on the
selling price. In Section 3.2.8, we will consider the problem of data leakage in more detail.

25
Common Problems With Data
High Cost - Getting unlabeled data can be expensive; however, labeling data is the most expensive work, especially if the work is done manually.
26
Bias? And its types?
Bias in data is an inconsistency with the phenomenon that data represents. This inconsistency may occur for a number of reasons (which are not mutually exclusive): selection bias, self-selection bias, omitted variable bias, sponsorship or funding bias, sampling bias, prejudice or stereotype bias, systematic value distortion, experimenter bias, and labeling bias.
27
Selection bias
Selection bias is the tendency to skew your choice of data sources to those that are easily available, convenient, and/or cost-effective. For example, you might want to know the opinion of the readers on your new book; if you only survey the readers who are easiest to reach, such as friends or online followers, their opinions will not represent the whole readership.
28
Self-selection bias
Self-selection bias is a form of selection bias where you get the data from sources that “volunteered” to provide it. Most poll data has this type of bias. For example, you want to train a model that predicts the behavior of successful entrepreneurs. You decide to first ask entrepreneurs whether they are successful or not. Then you only keep the data obtained from those who declared themselves successful. The problem here is that most likely, really successful entrepreneurs don’t have time to answer your questions, while those who claim themselves successful can be wrong on that matter.
29
Omitted variable bias
Omitted variable bias happens when your featurized data doesn’t have a feature necessary for accurate prediction. For example, let’s assume that you are working on a churn prediction model and you want to predict whether a customer cancels their subscription within six months. You train a model, and it’s accurate enough; however, several weeks after deployment you see many unexpected false negatives. You investigate the decreased model performance and discover a new competitor now offers a very similar service for a lower price. This feature wasn’t initially available to your model, therefore important information for accurate prediction was missing.
30
Sponsorship or funding bias
Sponsorship or funding bias affects the data produced by a sponsored agency. For example, suppose a famous video game company sponsors a news agency to provide news about the video game industry. If you try to make a prediction about the video game industry, you might include in your data the stories produced by this sponsored agency.
31
Sampling bias (also known as distribution shift)
occurs when the distribution of examples used for training doesn’t reflect the distribution of the inputs the model will receive in production. This type of bias is frequently observed in practice. For example, you are working on a system that classifies documents according to a taxonomy of several hundred topics. You might decide to create a collection of documents in which an equal amount of documents represents each topic. Once you finish the work on the model, you observe 5% error. Soon after deployment, you see the wrong assignment to about 30% of documents. Why did this happen? One of the possible reasons is sampling bias: one or two frequent topics in production data might account for 80% of all input. If your model doesn’t perform well for these frequent topics, then your system will make more errors in production than you initially expected.
32
Prejudice or stereotype bias
Prejudice or stereotype bias is often observed in data obtained from historical sources, such as books or photo archives, or from online activity such as social media, online forums, and comments to online publications. Using a photo archive to train a model that distinguishes men from women might show, for example, men more frequently in work or outdoor contexts, and women more often at home indoors. If we use such biased data, our model will have more difficulty recognizing a woman outdoors or a man at home.
33
Systematic value distortion
Systematic value distortion is a bias usually introduced by the device making measurements or observations: the measured values systematically deviate from the true values. This can result in a machine learning model making suboptimal predictions when deployed in the production environment.
34
Experimenter bias
Experimenter bias is the tendency to search for, interpret, favor, or recall information in a way that affirms one’s prior beliefs or hypotheses. Applied to machine learning, experimenter bias often occurs when each example in the dataset is obtained from the answers to a survey given by a particular person, one example per person.
35
Labeling bias
Labeling bias happens when labels are assigned to unlabeled examples by a biased process or person.
36
Low predictive power
Low predictive power is an issue that you often don’t consider until you have spent fruitless energy trying to train a good model. Does the model underperform because it is not expressive enough? Does the data not contain enough information from which to learn? You don’t know.
37
concept drift.
Concept drift is a fundamental change in the statistical relationship between the features and the label.
38
Outliers
Outliers are examples that look dissimilar to the majority of examples from the dataset. It’s up to the data analyst to define “dissimilar.” Typically, dissimilarity is measured by some distance metric, such as Euclidean distance.
39
Data Leakage
Data leakage, also called target leakage, is a problem affecting several stages of the machine learning life cycle, from data collection to model evaluation. In this section, I will only describe how this problem manifests itself at the data collection and preparation stages. In the subsequent chapters, I will describe its other forms.
40
Summary of Good Data
For the convenience of future reference, let me once again repeat the properties of good data: * it contains enough information that can be used for modeling, * it has good coverage of what you want to do with the model, * it reflects real inputs that the model will see in production, * it is as unbiased as possible, * it is not a result of the model itself, * it has consistent labels, and * it is big enough to allow generalization.
41
Dealing With Interaction Data
Interaction data is the data you can collect from user interactions with the system your model supports. You are considered lucky if you can gather good data from interactions of the user with the system. Good interaction data contains information on three aspects: * context of interaction, * action of the user in that context, and * outcome of interaction. As an example, assume that you build a search engine, and your model reranks search results for each user individually. A reranking model takes as input the list of links returned by the search engine, based on the keywords provided by the user, and outputs another list in which the items change order. Usually, a reranking model “knows” something about the user and their preferences and can reorder the generic search results for each user individually according to that user’s learned preferences. The context here is the search query and the hundred documents presented to the user in a specific order. The action is a click of the user on a particular document link. The outcome is how much time the user spent reading the document and whether the user hit “back.” Another action is the click on the “next page” link.
42
three most frequent causes of data leakage that can happen during data collection and preparation:
Data leakage is when information from outside the training dataset is used to create the model. The three most frequent causes during data collection and preparation are: 1) the target being a function of a feature, 2) a feature hiding the target, and 3) a feature coming from the future.
43
Data leakage - Target is a Function of a Feature
If you don’t do a careful analysis of each attribute and its relation to GDP, you might let a leakage happen: in the data in Figure 9, two columns, Population and GDP per capita, multiplied, equal GDP. The model you will train will perfectly predict GDP by looking at these two columns only. The fact that you let GDP be one of the features, though in a slightly modified form (divided by the population), constitutes contamination and, therefore, leads to data leakage.
44
Data leakage - Feature Hides the Target
If the data about a customer’s gender and age is factual (as opposed to being guessed by another model that might be available in production), then the column Group constitutes a form of data leakage, when the value you want to predict is “hidden” in the value of a feature.
45
Data leakage - Feature From the Future
Here is another example. Let’s say you have a news website and you want to predict the ranking of news you serve to the user, so as to maximize the number of clicks on stories. If, in your training data, you have positional features for each news item served in the past (e.g., the x − y position of the title, and the abstract block on the webpage), such information will not be available at serving time, because you don’t know the positions of articles on the page before you rank them.
46
Data Partitioning
The training set is used by the machine learning algorithm to train the model. The validation set is needed to find the best values for the hyperparameters of the machine learning pipeline. The analyst tries different combinations of hyperparameter values one by one, trains a model by using each combination, and notes the model performance on the validation set. The hyperparameters that maximize the model performance are then used to train the model for production. We consider techniques of hyperparameter tuning in more detail in Section ?? of Chapter 5. The test set is used for reporting: once you have your best model, you test its performance on the test set and report the results.
47
To obtain good partitions of your entire dataset into these three disjoint sets (test, val, train) partitioning has to satisfy several conditions.
Condition 1: Split was applied to raw data. Once you have access to raw examples, and before everything else, do the split. This will allow avoiding data leakage, as we will see later. Condition 2: Data was randomized before the split. Randomly shuffle your examples first, then do the split. Condition 3: Validation and test sets follow the same distribution. When you select the best values of hyperparameters using the validation set, you want that this selection yields a model that works well in production. The examples in the test set are your best representatives of the production data. Hence the need for the validation and test sets to follow the same distribution. Condition 4: Leakage during the split was avoided. Data leakage can happen even during the data partitioning. Below, we will see what forms of leakage can happen at that stage.
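A minimal sketch of the shuffle-then-split procedure described above, assuming NumPy is available; the 70%/15%/15% ratio and the variable names are only illustrative:

    import numpy as np

    def partition(examples, seed=42, train_frac=0.70, val_frac=0.15):
        """Shuffle raw examples, then split into training, validation, and test sets."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(examples))            # randomize before the split
        n_train = int(train_frac * len(examples))
        n_val = int(val_frac * len(examples))
        train = [examples[i] for i in idx[:n_train]]
        val = [examples[i] for i in idx[n_train:n_train + n_val]]
        test = [examples[i] for i in idx[n_train + n_val:]]
        return train, val, test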
48
Ratio of data partitioning
There is no ideal ratio for the split. In older literature (pre-big data), you might find the recommended splits of either 70%/15%/15% or 80%/10%/10% (for training, validation, and test sets, respectively, in proportion to the entire dataset). Today, in the era of the Internet and cheap labor (e.g., Mechanical Turk or crowdsourcing), organizations, scientists, and even enthusiasts at home can get access to millions of training examples. That makes it wasteful to use only 70% or 80% of the available data for training. A small dataset of less than a thousand examples would do best with 90% of the data used for training. In this case, you might decide to not have a distinct validation set, and instead simulate it with the cross-validation technique.
49
Leakage During Partitioning
Group leakage may occur during partitioning. Imagine you have magnetic resonance images of the brains of multiple patients. Each image is labeled with a certain brain disease, and the same patient may be represented by several images taken at different times. If you apply the partitioning technique discussed above (shuffle, then split), images of the same patient might appear in both the training and holdout data. As a result, the model can partially learn to recognize the patient rather than the disease, and its performance on the holdout data will be optimistically biased.
50
Dealing with Missing Attributes
* removing the examples with missing attributes from the dataset (this can be done if your dataset is big enough to safely sacrifice some data); * using a learning algorithm that can deal with missing attribute values (such as the decision tree learning algorithm); * using a data imputation technique.
51
Data Imputation Techniques
-To impute the value of a missing numerical attribute, one technique consists in replacing the missing value by the average value of this attribute in the rest of the dataset. -Another technique is to replace the missing value with a value outside the normal range of values. For example, if the regular range is [0, 1], you can set the missing value to 2 or −1; if the attribute is categorical, such as days of the week, then a missing value can be replaced by the value “Unknown.” Here, the learning algorithm learns what to do when the attribute has a value different from regular values. -A more advanced technique is to use the missing value as the target variable for a regression problem. -Finally, if you have a significantly large dataset and just a few attributes with missing values, you can add a synthetic binary indicator attribute for each original attribute with missing values. Let’s say that examples in your dataset are D-dimensional, and attribute at position j = 12 has missing values. For each example x, you then add the attribute at position j = D + 1, which is equal to 1 if the value of the attribute at position 12 is present in x and 0 otherwise. The missing value then can be replaced by 0 or any value of your choice.
52
Leakage During Imputation
If you use the imputation techniques that compute some statistic of one attribute (such as average) or several attributes (by solving the regression problem), the leakage happens if you use the whole dataset to compute this statistic. Using all available examples, you contaminate the training data with information obtained from the validation and test examples. This type of leakage is not as significant as other types discussed earlier. However, you still have to be aware of it and avoid it by partitioning first, and then computing the imputation statistic only on the training set.
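A minimal sketch of leakage-free imputation, assuming scikit-learn is available: the imputation statistic is computed on the training set only and then reused for the holdout data. The toy matrices are made up.

    import numpy as np
    from sklearn.impute import SimpleImputer

    # Hypothetical feature matrices obtained after partitioning the raw data.
    X_train = np.array([[1.0, 7.0], [np.nan, 3.0], [3.0, np.nan]])
    X_val = np.array([[np.nan, 5.0]])

    imputer = SimpleImputer(strategy="mean")
    X_train_imp = imputer.fit_transform(X_train)   # statistic computed on training data only
    X_val_imp = imputer.transform(X_val)           # the same training-set means reused here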
53
Data Augmentation for Images
In Figure 14, you can see examples of operations that can be easily applied to a given image to obtain one or more new images: flip, rotation, crop, color shift, noise addition, perspective change, contrast change, and information loss.
54
data augmentation that seems counterintuitive, but works very well in practice, is mixup.
As the name suggests, the technique consists of training the model on a mix of the images from the training set. More precisely, instead of training the model on the raw images, we take two images (that could be of the same class or not) and use for training their linear combination: mixup_image = t × image1 + (1 − t) × image2 , where t is a real number between 0 and 1. The target of that mixup image is a combination of the original targets obtained using the same value of t: mixup_target = t × target1 + (1 − t) × target2
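A minimal sketch of mixup on two images represented as NumPy arrays; the one-hot targets and the Beta-distributed t are illustrative choices, not prescribed by this card:

    import numpy as np

    def mixup(image1, target1, image2, target2, t):
        """Linearly combine two training images and their (one-hot) targets."""
        mix_image = t * image1 + (1.0 - t) * image2
        mix_target = t * target1 + (1.0 - t) * target2
        return mix_image, mix_target

    # t is a real number in [0, 1]; in practice it is often drawn from a Beta distribution.
    t = np.random.beta(0.2, 0.2)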
55
Data Augmentation for Text
When it comes to text data augmentations, it is not as straightforward. We need to use appropriate transformation techniques to preserve the contextual and grammatical structure of natural language texts. -One technique involves replacing random words in a sentence with their close synonyms. For the sentence, “The car stopped near a shopping mall.” some equivalent sentences are: “The automobile stopped near a shopping mall.” “The car stopped near a shopping center.” “The auto stopped near a mall.” -A similar technique uses hypernyms instead of synonyms. A hypernym is a word that has more general meaning. For example, “mammal” is a hypernym for “whale” and “cat”; “vehicle” is a hypernym for “car” and “bus.” From our example above, we could create the fol- lowing sentences: “The vehicle stopped near a shopping mall.” “The car stopped near a building.” -A modern alternative to the k-nearest-neighbors approach described above is to use a deep pre-trained model such as Bidirectional Encoder Representations from Transformers (BERT). Models like BERT are trained to predict a masked word given other words in a sentence. One can use BERT to generate k most likely predictions for a masked word and then use them as synonyms for data augmentation. -Another useful text data augmentation technique is back translation. To create a new example from a text written in English (it can be a sentence or a document), first translate it into another language l using a machine translation system. Then translate it back from l into English. If the text obtained through back translation is different from the original text, you add it to the dataset by assigning the same label as the original text.
56
Class imbalance
Class imbalance is a condition in the data that can significantly affect the performance of the model, independently of the chosen learning algorithm. The problem is a very uneven distribution of labels in the training data. Typically, a machine learning algorithm tries to classify most training examples correctly. The algorithm is pushed to do so because it needs to minimize a cost function that typically assigns a positive loss value to each misclassified example. If the loss is the same for the misclassification of a minority class example as it is for the misclassification of a majority class, then it’s very likely that the learning algorithm decides to “give up” on many minority class examples in order to make fewer mistakes in the majority class.
57
Oversampling
A technique used frequently to mitigate class imbalance is oversampling. By making multiple copies of minority class examples, it increases their weight, as illustrated in Figure 15a. You might also create synthetic examples by sampling feature values of several examples of the minority class and combining them to obtain a new example of that class. Two popular algorithms that oversample the minority class by creating synthetic examples are the Synthetic Minority Oversampling Technique (SMOTE) and the Adaptive Synthetic Sampling Method (ADASYN).
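A minimal sketch of SMOTE, assuming the imbalanced-learn package is installed; the toy dataset and class weights are illustrative:

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Toy imbalanced dataset: roughly 95% majority class, 5% minority class.
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
    print(Counter(y))                         # e.g. {0: 947, 1: 53}

    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y_res))                     # minority class oversampled to match the majority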
58
Undersampling
The undersampling can be done randomly; that is, the examples to remove from the majority class can be chosen at random. Alternatively, examples to withdraw from the majority class can be selected based on some property.
59
Cluster-based undersampling
Cluster-based undersampling works as follows. Decide on the number of examples you want to have in the majority class after undersampling. Let that number be k. Run a centroid-based clustering algorithm on the majority class examples only, with k being the desired number of clusters. Then replace all examples in the majority class with the k centroids. An example of a centroid-based clustering algorithm is k-means.
60
Hybrid Strategies for sampling
You can develop your hybrid strategies (by combining both over- and undersampling) and possibly get better results. One such strategy consists of using ADASYN to oversample, and then Tomek links to undersample.
61
two main data sampling strategies
There are two main strategies: probability sampling and nonprobability sampling. In probability sampling, all examples have a chance to be selected. These techniques involve randomness. Nonprobability sampling is not random. To build a sample, it follows a fixed deterministic sequence of heuristic actions. This means that some examples don’t have a chance of being selected, no matter how many samples you build. The main drawback of nonprobability sampling methods is that they include non-representative samples and might systematically exclude important examples. These drawbacks outweigh the possible advantages of nonprobability sampling methods. Therefore, in this book I will only present probability sampling methods.
62
Simple random sampling
Simple random sampling is the most straightforward method, and the one I refer to when I say “sample randomly.” Here, each example from the entire dataset is chosen purely by chance; each example has an equal chance of being selected.
63
systematic sampling
To implement systematic sampling (also known as interval sampling), you create a list containing all examples. From that list, you randomly select the first example x_start from the first k elements on the list. Then, you select every kth item on the list, starting from x_start. You choose a value of k that gives you a sample of the desired size. An advantage of systematic sampling over simple random sampling is that it draws examples from the whole range of values. However, systematic sampling is inappropriate if the list of examples has periodicity or repetitive patterns; in that case, the obtained sample can exhibit a bias. If, on the other hand, the list of examples is randomized, then systematic sampling often results in a better sample than simple random sampling.
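A minimal sketch of systematic sampling; the function name and the choice of k are illustrative:

    import random

    def systematic_sample(examples, k, seed=0):
        """Pick a random start within the first k items, then take every kth item."""
        random.seed(seed)
        start = random.randrange(k)
        return examples[start::k]

    # Example: a sample of roughly len(data) / 5 items.
    data = list(range(100))
    sample = systematic_sample(data, k=5)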
64
Stratified Sampling
In stratified sampling, you first divide your dataset into groups (called strata) and then randomly select examples from each stratum, like in simple random sampling. The number of examples to select from each stratum is proportional to the size of the stratum. Stratified sampling often improves the representativeness of the sample by reducing its bias; in the worst of cases, the resulting sample is of no less quality than the results of simple random sampling. However, to define strata, the analyst has to understand the properties of the dataset. Furthermore, it can be difficult to decide which attributes will define the strata.
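A minimal sketch of a stratified split, assuming scikit-learn is available and that the class labels y define the strata; each class keeps its proportion in both parts:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Take a 20% sample while preserving the class proportions in both parts.
    X_sample, X_rest, y_sample, y_rest = train_test_split(
        X, y, train_size=0.2, stratify=y, random_state=0)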
65
what is data serialization
Data serialization is the process of converting complex data structures, such as objects or data collections, into a format that can be easily stored, transmitted, or reconstructed. The serialized data can later be deserialized, which means it's converted back into its original form, allowing it to be used in the same way as before serialization. Serialization is commonly used in various scenarios, such as when saving data to files, sending data over networks, or storing data in databases. Serialization is important because it enables data to be transported or stored in a standardized format that can be understood by different systems or programming languages. It also helps preserve the structure and relationships within the data. Different serialization formats exist, each with its own characteristics and use cases.
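A minimal sketch of serializing and deserializing a simple record, here with Python's standard json module; the record and format choice are only illustrative:

    import json

    example = {"height": 172.5, "color": "red", "labels": [1, 0, 1]}

    serialized = json.dumps(example)    # object -> string that can be stored or transmitted
    restored = json.loads(serialized)   # string -> equivalent object
    assert restored == example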
66
Reproducibility
Reproducibility should be an important concern in everything you do, including data collection and preparation. You should avoid transforming data manually, or using powerful tools included in text editors or command line shells, such as regular expressions, “quick and dirty” ad hoc awk or sed commands, and piped expressions. Usually, the data collection and transformation activities consist of multiple stages. These include downloading data from web APIs or databases, replacing multiword expressions by unique tokens, removing stop-words and noise, cropping and unblurring images, imputation of missing values, and so on. Each step in this multistage process has to be implemented as a software script, such as Python or R script with their inputs and outputs. If you are organized like that in your work, it will allow you to keep track of all changes in the data.
67
Data First, Algorithm Second
Remember that in the industry, contrary to academia, “data first, algorithm second,” so focus most of your effort and time on getting more data of wide variety and high quality, instead of trying to squeeze the maximum out of a learning algorithm. Data augmentation, when implemented well, will most likely contribute more to the quality of the model than the search for the best hyperparameter values or model architecture.
68
To obtain a good partition of your entire dataset into training, validation and test sets, the process of partitioning has to satisfy several conditions:
1) data was randomized before the split, 2) split was applied to raw data, 3) validation and test sets follow the same distribution, and 4) leakage was avoided.
69
Feature Engineering
Feature engineering is the process of first conceptually and then programmatically transforming a raw example into a feature vector. It consists of conceptualizing a feature and then writing the programming code that transforms the entire raw example, potentially with the help of some indirect data, into a feature.
70
Feature Engineering for Text
When it comes to text, scientists and engineers often use simple feature engineering tricks. Two such tricks are one-hot encoding and bag-of-words.
71
Mean encoding,
Mean encoding, also known as bin counting or feature calibration, is another technique. First, the sample mean of the label is calculated using all examples where the feature has value z. Each value z of the categorical feature is then replaced by that sample mean value. The advantage of this technique is that the data dimensionality doesn’t increase, and by design, the numerical value contains some information about the label.
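A minimal sketch of mean encoding with pandas, assuming a binary label; to avoid leakage, the per-value means are computed on the training split only (the city names are made up):

    import pandas as pd

    train = pd.DataFrame({"city": ["Rome", "Rome", "Oslo", "Oslo", "Oslo"],
                          "label": [1, 0, 1, 1, 0]})
    test = pd.DataFrame({"city": ["Rome", "Oslo", "Paris"]})

    means = train.groupby("city")["label"].mean()     # sample mean of the label per value
    train["city_enc"] = train["city"].map(means)
    # Values unseen in training fall back to the global label mean.
    test["city_enc"] = test["city"].map(means).fillna(train["label"].mean())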
72
sine-cosine transformation.
It converts a cyclical feature into two synthetic features. Let p denote the integer value of our cyclical feature. Replace the value p of the cyclical feature with the following two values: p_sin = sin(2 × π × p / max(p)) and p_cos = cos(2 × π × p / max(p)).
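A minimal sketch for a cyclical day-of-week feature encoded as 1-7, so max(p) = 7; the helper name is illustrative:

    import numpy as np

    def sine_cosine(p, p_max):
        """Encode a cyclical integer feature as two continuous features on a circle."""
        angle = 2.0 * np.pi * p / p_max
        return np.sin(angle), np.cos(angle)

    d_sin, d_cos = sine_cosine(7, 7)   # day 7 and day 1 become neighbors on the circle,
                                       # unlike in the raw encoding where they are 6 apart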
73
Feature Hashing
Feature hashing, or hashing trick, converts text data, or categorical attributes with many values, into a feature vector of arbitrary dimensionality. One-hot encoding and bag-of-words have a drawback: many unique values will create high-dimensional feature vectors. using a hash function, you first convert all values of your categorical attribute (or all tokens in your collection of documents) into a number, and then you convert this number into an index of your feature vector.
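A minimal sketch of the hashing trick for a bag of tokens, using a standard hash function; the dimensionality of 10 is arbitrary:

    import hashlib

    def hash_index(token, dim=10):
        """Map a token to a stable index in [0, dim) via a hash function."""
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        return int(digest, 16) % dim

    def hash_features(tokens, dim=10):
        """Count tokens into a fixed-size feature vector, regardless of vocabulary size."""
        vec = [0] * dim
        for token in tokens:
            vec[hash_index(token, dim)] += 1
        return vec

    print(hash_features("the cat sat on the mat".split()))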
74
Topic Modeling
Topic modeling is a family of techniques that uses unlabeled data, typically in the form of natural language text documents. The model learns to represent a document as a vector of topics. For example, in a collection of news articles, the five major topics could be “sports,” “politics,” “entertainment,” “finance,” and “technology”.
75
Features for Time-Series
Time-series data is different from the traditional supervised learning data, which has a form of unordered collections of independent observations. A time series is an ordered sequence of observations, and each is marked with a time-related attribute, such as timestamp, date, month-year, year, and so on. Analysts typically use time-series data to solve two kinds of prediction problems. Given a sequence of recent observations: * predict something about the next observation (for example, given the stock price and the value of stock indices for the last seven days, predict the stock price for tomorrow), or * predict something about the phenomenon that generated that sequence (for example, given a user’s connection log to a software system, predict whether they are likely to cancel their subscription during the current quarter).
76
Stacking Features
In our movie title classification problem, we first collect all the left contexts. We then apply bag-of-words to transform each left context into a binary feature vector. Next, collect all extractions and, using bag-of-words, transform each extraction into a binary feature vector. Then we collect all the right contexts and apply bag-of-words to transform each right context into a binary feature vector. Finally, we concatenate each example, joining the feature vectors of the left context, the extraction, and the right context.
77
Properties of Good Features
High predictive power, fast computability, reliability, and uncorrelatedness.
78
Uncorrelatedness
Correlation of two features means their values are related. If the growth of one feature implies the growth of the other, and the inverse is also true, then the two features are correlated. Once the model is in production, its performance may change because the input data’s properties may change over time. When many of your features are highly correlated, even a minor change in the input data’s properties may result in significant changes in the model’s behavior. Sometimes the model was built under strict time constraints, so the developer used all possible sources of features. With time, maintaining those sources can become costly. It’s generally recommended to eliminate redundant or highly correlated features. Feature selection techniques help reduce such features.
79
Cutting the Long Tail
Typically, if a feature contains information (e.g., a non-zero value) only for a handful of examples, such a feature could be removed from the feature vector. In bag-of-words, you can build a graph with the distribution of token counts, and then cut off the so-called long tail, as shown in Figure 15.
80
Properties of Good Features
High predictive power, fast computability, reliability, and uncorrelatedness. Also: if you apply a model built on historical tweets to predict something about current tweets, the date of your production examples will always be outside the training distribution, which can result in a significant error.
81
Boruta algorithm for assessing importance of features
Boruta is a feature selection method built around random forests. It creates shuffled “shadow” copies of all features, trains a random forest on the extended dataset, and iteratively keeps only those original features whose importance is consistently higher than the importance of the best shadow feature.
82
l1-regularization
L1 regularization (lasso) adds the sum of the absolute values of the model parameters, multiplied by a regularization coefficient, to the loss being minimized. It pushes many parameters to exactly zero, which yields sparse models and can serve as an implicit feature selection technique.
83
stop words.
Stop words are the words that are too generic or common for the problem we are trying to solve. Frequent examples of stop words are articles, prepositions, and pronouns. Dictionaries of stop words for most languages are available online.
84
Feature Discretization
The reasons to discretize a real-valued numerical feature can be numerous. For example, some feature selection techniques only apply to categorical features. A successful discretization adds useful information to the learning algorithm when the training dataset is relatively small. Numerous studies show that discretization can lead to improved predictive accuracy. It is also simpler for a human to interpret a model’s prediction if it is based on discrete groups of values, such as age groups or salary ranges.
85
What is binning or bucketing?
Binning, also known as bucketing, is a popular technique that allows transforming a numerical feature into a categorical one by replacing numerical values in a specific range by a constant categorical value. There are three typical approaches to binning: * uniform binning, * k-means-based binning, and * quantile-based binning.
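A minimal sketch of the three approaches, assuming pandas and scikit-learn are available; the data and bin counts are illustrative:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import KBinsDiscretizer

    ages = np.array([18, 22, 25, 31, 40, 47, 53, 60, 72, 85], dtype=float)

    uniform = pd.cut(ages, bins=3)      # uniform binning: equal-width ranges
    quantile = pd.qcut(ages, q=3)       # quantile-based binning: roughly equal-sized bins

    kmeans = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
    kmeans_bins = kmeans.fit_transform(ages.reshape(-1, 1))   # k-means-based binning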
86
elbow method (PCA..)
Plot the quantity of interest (for example, the explained variance in PCA, or the clustering objective) as a function of k, and pick the value of k at which the curve bends sharply, the “elbow”: beyond that point, increasing k brings only marginal improvement.
87
skipgram vs cbow
Both are word2vec architectures. Skip-gram predicts the surrounding context words from the center word, while CBOW (continuous bag-of-words) predicts the center word from its averaged context. CBOW trains faster; skip-gram tends to work better for rare words and smaller corpora.
88
what is fastext improvement over word to vec
fastText represents each word as a bag of character n-grams (subword units) in addition to the word itself, so it can build embeddings for out-of-vocabulary and rare words and better captures morphology. word2vec, by contrast, learns one vector per whole word.
89
doc2vec
doc2vec (paragraph vectors) extends word2vec by learning a fixed-length vector for an entire document: a document identifier vector is trained jointly with the word vectors and then serves as the document’s embedding.
90
choosing embedding dimentionality
A rule of thumb suggested by Google is d = D^(1/4), the fourth root of the number of categories D. The principled way is to treat the embedding dimensionality as a hyperparameter and tune it.
91
Feature scaling
Feature scaling is bringing all your features to the same, or very similar, ranges of values or distributions. Multiple experiments demonstrated that a learning algorithm applied to scaled features might produce a better model. While there’s no guarantee that scaling will have a positive impact on the quality of your model, it’s considered a best practice. Scaling can also increase the training speed of deep neural networks. It also assures that no individual feature dominates, especially in the initial iterations of gradient descent or other iterative optimization algorithms. Finally, scaling reduces the risk of numerical overflow, the problem that computers have when working with very small or very big numbers.
92
Normalization
Normalization is the process of converting an actual range of values, which a numerical feature can take, into a predefined and artificial range of values, typically in the interval [−1, 1] or [0, 1].
93
winsorization.
Winsorization consists of setting all outliers to a specified percentile of the data; for example, a 90% winsorization would see all data below the 5th percentile set to the 5th percentile, and data above the 95th percentile set to the 95th percentile.
94
Standardization
Standardization (also known as z-score normalization) is the procedure during which the feature values are rescaled so that they have the properties of a standard normal distribution, with μ = 0 and σ = 1, where μ is the sample mean (the average value of the feature, averaged over all examples in the training data) and σ is the standard deviation from the sample mean.
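A minimal sketch of both rescalings with NumPy (the values are made up): min-max normalization maps values into [0, 1], standardization produces mean 0 and standard deviation 1.

    import numpy as np

    x = np.array([50.0, 60.0, 80.0, 100.0, 300.0])

    # Min-max normalization into [0, 1]
    x_norm = (x - x.min()) / (x.max() - x.min())

    # Standardization (z-scores): mean 0, standard deviation 1
    x_std = (x - x.mean()) / x.std()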
95
Data Leakage in Feature Engineering
Now imagine you are working with text, and that you use bag-of-words to create features with the entire dataset. After building the vocabulary, you split your data into the three sets. In this situation, the learning algorithm will be exposed to features based on tokens only present in the holdout sets. Again, the model will display artificially better performance than had you divided your data before feature engineering. A solution, as you might have guessed, is first to split the entire dataset into training and holdout sets, and only do feature engineering on the training data. This also applies when you use mean encoding to transform a categorical feature to a number: split the data first and then compute the sample mean of the label, based on the training data only.
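A minimal sketch with scikit-learn's bag-of-words vectorizer: the vocabulary is built on the training texts only and then reused for the holdout texts (the example sentences are made up).

    from sklearn.feature_extraction.text import CountVectorizer

    train_texts = ["the cat sat on the mat", "dogs chase cats"]
    holdout_texts = ["a parrot on the mat"]

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)   # vocabulary comes from training data only
    X_holdout = vectorizer.transform(holdout_texts)   # tokens unseen in training are ignored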
96
Schema file: what is it and why is it useful?
A schema file describes each feature of the dataset: its name, type, expected range or distribution of values, whether zeroes and undefined values are allowed, and how often the feature is populated. It is useful because incoming data can be validated against these expectations automatically, before training or scoring. An example:

    feature {
      name : "height"
      type : float
      min : 50.0
      max : 300.0
      mean : 160.0
      variance : 17.0
      zeroes : false
      undefined : false
      popularity : 1.0
    }

    feature {
      name : "color_red"
      type : binary
      zeroes : true
      undefined : false
      popularity : 0.76
    }

    feature {
      name : "color_green"
      type : binary
      zeroes : true
      undefined : false
      popularity : 0.65
    }

    feature {
      name : "color_blue"
      type : binary
      zeroes : true
      undefined : false
      popularity : 0.81
    }
97
random prediction algorithm
The random prediction algorithm makes a prediction by randomly choosing a label from the collection of labels assigned to the training examples. In the classification problem, it corresponds to randomly picking one class from all classes in the problem. In the regression problem it means selecting from all unique target values in the training data.
98
zero rule algorithm
The zero rule algorithm yields a tighter baseline than the random prediction algorithm. This means that it usually improves the value of the metric as compared to random prediction. To make predictions, the zero rule algorithm uses more information about the problem. In classification, the zero rule algorithm strategy is to always predict the class most common in the training set, independently of the input value. It can look ineffective, but consider the following problem. Let the training data for your classification problem contain 800 examples of the positive class, and 200 examples of the negative class. The zero rule algorithm will predict the positive class all the time, and the accuracy (one of the popular performance metrics that we will consider in Section 5.5.2) of the baseline will be 800/1000 = 0.8 or 80%, which is not bad for such a simple classifier. Now you know that your statistical model, independently of how close it is to the optimum, must have an accuracy of at least 80%. Now, let’s consider the zero rule algorithm for regression. According to the zero rule algorithm, the strategy for regression is to predict the sample average of the target values observed in the training data. This strategy will likely have a lower error rate than random prediction.
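A minimal sketch of the zero rule baseline for classification, using the 800/200 split from this card; the function names are illustrative:

    from collections import Counter

    def zero_rule_fit(train_labels):
        """Return the class most common in the training labels."""
        return Counter(train_labels).most_common(1)[0][0]

    train_labels = [1] * 800 + [0] * 200
    majority = zero_rule_fit(train_labels)       # -> 1

    test_labels = [1] * 80 + [0] * 20
    accuracy = sum(1 for y in test_labels if y == majority) / len(test_labels)
    print(accuracy)                              # 0.8 on data with the same class proportions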
99
What is Amazon Mechanical Turk service?
Mechanical Turk (MT) is a web-platform where people solve simple tasks for a reward. MT provides an API that you can call to get human predictions. The quality of such predictions can vary from very low to relatively high, depending on the task and the reward. MT is relatively inexpensive, so you can get predictions fast and in large numbers.
100
distribution shift
The distribution shift can be a hard problem to tackle. Using a different data distribution for training could be a conscious choice because of the data availability. However, the analyst may be unaware that the statistical properties of the training and development data are different. This often happens when the model is frequently updated after production deployment, and new examples are added to the training set. The properties of the data used to train the model, and that of the data used to validate and test it, can diverge over time. Section ?? in the next chapter provides guidance on how to handle that problem.
101
Preconditions for Supervised Learning
Before you start working on your model, make sure the following conditions are satisfied: 1. You have a labeled dataset. 2. You have split the dataset into three subsets: training, validation, and test. 3. Examples in the validation and test sets are statistically similar. 4. You engineered features and filled missing values using only the training data. 5. You converted all examples into numerical feature vectors. 6. You have selected a performance metric that returns a single number (see Section 5.5). 7. You have a baseline.
102
Main Properties of a Learning Algorithm: Explainability of learning algorithm
Do the model predictions require explanation for a non-technical audience? The most accurate machine learning algorithms and models are so-called “black boxes.” They make very few prediction errors, but it may be difficult to understand, and even harder to explain, why a model or an algorithm made a specific prediction. Examples of such models are deep neural networks and ensemble models. In contrast, kNN, linear regression, and decision tree learning algorithms are not always the most accurate. However, their predictions are easy to interpret by a non-expert.
103
Main Properties of a Learning Algorithm: In-memory vs. out-of-memory
Can your dataset be fully loaded into the RAM of your laptop or server? If yes, then you can choose from a wide variety of algorithms. Otherwise, you would prefer incremental learning algorithms that can improve the model by reading data gradually. Examples of such algorithms are Naïve Bayes and the algorithms for training neural networks.
104
Main Properties of a Learning Algorithm: Number of features and examples
How many training examples do you have in your dataset? How many features does each example have? Some algorithms, including those used for training neural networks and random forests, can handle a huge number of examples and millions of features. Others, like the algorithms for training support vector machines (SVM), can be relatively modest in their capacity.
105
Main Properties of a Learning Algorithm: Nonlinearity of the data
Is your data linearly separable? Can it be modeled using a linear model? If yes, SVM with the linear kernel, linear and logistic regression can be good choices. Otherwise, deep neural networks or ensemble models might work better.
106
Main Properties of a Learning Algorithm: Training speed
How much time is a learning algorithm allowed to use to build a model, and how often you will need to retrain the model on updated data? If training takes two days, and you need to retrain your model every 4 hours, then your model will never be up to date. Neural networks are slow to train. Simple algorithms like linear and logistic regression, or decision trees, are much faster. Specialized libraries contain very efficient implementations of some algorithms. You may prefer to do research online to find such libraries. Some algorithms, such as random forest learning, benefit from multiple CPU cores, so their training time can be significantly reduced on a machine with dozens of cores. Some machine learning libraries leverage GPU (graphics processing unit) to speed up training.
107
Main Properties of a Learning Algorithm: Prediction speed
How fast must the model be when generating predictions? Will your model be used in a production environment where very high throughput is required? Models like SVMs and linear and logistic regression models, and not-very-deep feedforward neural networks, are extremely fast at prediction time. Others, like kNN, ensemble algorithms, and very deep or recurrent neural networks, are slower.
108
algorithm spot-checking.
Shortlisting candidate learning algorithms for a given problem is sometimes called algorithm spot-checking. For the most effective spot-checking, it is recommended to: * select algorithms based on different principles (sometimes called orthogonal), such as instance-based algorithms, kernel-based, shallow learning, deep learning, ensembles; * try each algorithm with 3 − 5 different values of the most sensitive hyperparameters (such as the number of neighbors k in k-nearest neighbors, penalty C in support vector machines, or decision threshold in logistic regression); * use the same training/validation split for all experiments, * if the learning algorithm is not deterministic (such as the learning algorithms for neural networks and random forests), run several experiments, and then average the results; * once the project is over, note which algorithms performed the best, and use this information when working on a similar problem in the future.
109
Pipeline
Many modern machine learning packages and frameworks support the notion of a pipeline. A pipeline is a sequence of transformations the training data goes through, before it becomes a model.
110
Performance Metrics for Regression
Regression and classification models are assessed using different metrics. Let’s first consider performance metrics for regression: mean squared error (MSE), median absolute error (MAE), and almost correct predictions error rate (ACPER).
111
When is median absolute error better than MSE?
If the data contains outliers, the examples very far from the “true” regression line, they can significantly affect the value of MSE. By definition, the squared error for such outlying examples will be high. In such situations, it is better to apply a different metric, the median absolute error.
112
What is almost correct predictions error rate (ACPER)
The almost correct predictions error rate (ACPER) is the percentage of predictions that is within p percentage of the true value. To calculate ACPER, proceed as follows: 1. Define a threshold percentage error that you consider acceptable (let’s say 2%). 2. For each true value of the target yi , the desired prediction should be between yi + 0.02yi and yi − 0.02yi . 3. By using all examples i = 1, . . . , N, calculate the percentage of predicted values fulfilling the above rule. This will give the value of the ACPER metric for your model.
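A minimal sketch of ACPER with NumPy, using the 2% threshold from the steps above; the sample values are made up:

    import numpy as np

    def acper(y_true, y_pred, p=0.02):
        """Share of predictions within a relative threshold p of the true values."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        within = np.abs(y_pred - y_true) <= p * np.abs(y_true)
        return within.mean()

    print(acper([100, 200, 300], [101, 210, 299]))   # 2 of 3 predictions within 2% -> 0.667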
113
For classification, things are a little more complicated. The most widely used metrics to assess a classification model are:
* precision-recall,
* accuracy,
* cost-sensitive accuracy, and
* area under the ROC curve (AUC).
114
F-measure, also known as F-score.
The traditional F-measure, or F1-score, is the harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall).
115
Accuracy?
Accuracy is given by the number of correctly classified examples, divided by the total number of classified examples. In terms of the confusion matrix, it is given by: accuracy = (TP + TN) / (TP + TN + FP + FN).
116
When is accuracy not a good measure?
Accuracy measures the performance of the model for all classes at once, and it conveniently returns a single number. However, accuracy is not a good performance metric when the data is imbalanced. In an imbalanced dataset, examples belonging to one class or a few classes constitute the vast majority, while other classes include very few examples. Imbalanced training data can significantly and adversely affect the model. We will talk more about dealing with imbalanced data in Section ?? of Chapter 6. For imbalanced data, a better metric is per-class accuracy: first calculate the accuracy of prediction for each class in {1, . . . , C}, and then take the average of the C individual accuracy values. For the confusion matrix of the spam detection problem above, the accuracy for the class “spam” is 23/(23 + 1) ≈ 0.96, and the accuracy for the class “not_spam” is 556/(12 + 556) ≈ 0.98. The per-class accuracy is then (0.96 + 0.98)/2 = 0.97.
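The same per-class accuracy arithmetic as plain Python, using the counts quoted in the card:

```python
tp_spam, fn_spam = 23, 1    # spam predicted as spam / spam predicted as not_spam
tn_ham, fp_ham = 556, 12    # not_spam predicted correctly / not_spam predicted as spam

acc_spam = tp_spam / (tp_spam + fn_spam)        # 23 / 24  ~= 0.96
acc_ham = tn_ham / (tn_ham + fp_ham)            # 556 / 568 ~= 0.98
per_class_accuracy = (acc_spam + acc_ham) / 2   # ~= 0.97
print(per_class_accuracy)
```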
117
Cohen’s kappa statistic
Cohen’s kappa statistic is a performance metric that applies to both multiclass and imbalanced learning problems. The advantage of this metric over accuracy is that Cohen’s kappa tells you how much better your classification model is performing, compared to a classifier that randomly guesses a class according to the frequency of each class.
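A one-line check with scikit-learn, assuming label arrays y_true and y_pred exist:

```python
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(y_true, y_pred)  # 1.0 = perfect agreement, 0.0 = no better than chance-level guessing
```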
118
The ROC curve
The ROC curve (ROC stands for “receiver operating characteristic;” the term comes from radar engineering) is a commonly used method of assessing classification models. ROC curves use a combination of the true positive rate (defined exactly as recall) and the false positive rate (the proportion of negative examples predicted incorrectly) to build up a summary picture of classification performance.
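A short scikit-learn sketch, assuming a trained model that exposes predict_proba and a held-out test set:

```python
from sklearn.metrics import roc_curve, roc_auc_score

scores = model.predict_proba(X_test)[:, 1]        # score of the positive class
fpr, tpr, thresholds = roc_curve(y_test, scores)  # points of the ROC curve
auc = roc_auc_score(y_test, scores)               # area under that curve
```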
119
Grid search
Grid search is the simplest hyperparameter tuning technique. It is used when the number of hyperparameters and their ranges are not too large. For the problem of tuning two numerical hyperparameters, the technique consists of discretizing each of the two hyperparameters, and then training and evaluating a model for every pair of the resulting discrete values.
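A grid search sketch with scikit-learn for two SVM hyperparameters; note that GridSearchCV scores each pair with cross-validation rather than a single fixed validation split, and the grids below are illustrative:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C":     [0.1, 1, 10, 100],        # discretized penalty values
    "gamma": [0.001, 0.01, 0.1, 1],    # discretized RBF kernel width
}
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)           # X_train, y_train assumed to exist
print(search.best_params_, search.best_score_)
```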
120
Random search
Random search differs from grid search in that you do not provide a discrete set of values to explore for each hyperparameter. Instead, you provide a statistical distribution for each hyperparameter from which values are randomly sampled, and you set the total number of combinations you want to evaluate.
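The same tuning problem as a random search sketch: distributions instead of grids, and a fixed budget of sampled combinations (the distributions and budget are illustrative):

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

param_distributions = {
    "C":     loguniform(1e-2, 1e2),   # sample the penalty on a log scale
    "gamma": loguniform(1e-4, 1e0),
}
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=25,  # total number of sampled combinations
                            scoring="accuracy", cv=5, random_state=42)
search.fit(X_train, y_train)            # X_train, y_train assumed to exist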
121
Hyperparameter tuning: Coarse-to-Fine Search
In practice, analysts often use a combination of grid search and random search called coarse-to-fine search. This technique uses a coarse random search to first find the regions of high potential. Then, using a fine grid search in these regions, one finds the best values for hyperparameters, as shown in Figure 8. You can decide to only explore one high-potential region or several such regions, depending on the available time and computational resources.
122
Hyperparameter tuning: Bayesian techniques
Bayesian techniques differ from random and grid searches in that they use past evaluation results to choose the next values to evaluate. In practice, this allows Bayesian hyperparameter optimization techniques to find better values of hyperparameters in less time.
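A hedged sketch using Optuna, one third-party library whose default TPE sampler proposes the next hyperparameter values based on past trial results; this assumes Optuna is installed and X_train, y_train exist, and the search ranges are illustrative:

```python
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(trial):
    # Each trial's suggestions are informed by the outcomes of earlier trials.
    c = trial.suggest_float("C", 1e-3, 1e3, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e0, log=True)
    return cross_val_score(SVC(C=c, gamma=gamma), X_train, y_train, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```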
123
Cross-Validation
In k-fold cross-validation, the training data is split into k folds of roughly equal size (k is typically 5 or 10). The model is then trained k times: each time, k − 1 folds are used for training and the remaining fold for validation, and the k validation scores are averaged. Cross-validation is especially useful when the dataset is too small to afford a fixed validation set.
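A minimal 5-fold cross-validation sketch with scikit-learn, assuming X and y exist:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # one score per fold
print(scores.mean(), scores.std())
```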
124
Shallow models?
Shallow models make predictions based directly on the values in the input feature vector. Most popular machine learning algorithms produce shallow models. The only kind of deep model in common use is the deep neural network. We consider a strategy to train them in Section ?? of the next chapter.
125
A typical model training strategy for shallow learning algorithms looks as follows:
1. Define a performance metric P.
2. Shortlist learning algorithms.
3. Choose a hyperparameter tuning strategy T.
4. Pick a learning algorithm A.
5. Pick a combination H of hyperparameter values for algorithm A using strategy T.
6. Use the training set and train a model M using algorithm A parametrized with hyperparameter values H.
7. Use the validation set and calculate the value of metric P for model M.
8. Decide:
a. If there are still untested hyperparameter values, pick another combination H of hyperparameter values using strategy T and go back to step 6.
b. Otherwise, pick a different learning algorithm A and go back to step 5, or proceed to step 9 if there are no more learning algorithms to try.
9. Return the model for which the value of metric P is maximized.
126
If the model makes too many mistakes on the training data, we say that it has a high bias, or that the model underfits the training data. There could be several reasons for underfitting:
* the model is too simple for the data (for example, linear models often underfit);
* the features are not informative enough; or
* you regularize too much (we talk about regularization in the next section).
The possible solutions to the problem of underfitting include:
* trying a more complex model,
* engineering features with higher predictive power,
* adding more training data, when possible, and
* reducing regularization.
127
Several reasons can lead to overfitting:
* the model is too complex for the data; very tall decision trees or a very deep neural network often overfit;
* there are too many features and few training examples; and
* you don’t regularize enough.
Several solutions to overfitting are possible:
* use a simpler model: try linear instead of polynomial regression, an SVM with a linear kernel instead of a radial basis function (RBF) kernel, or a neural network with fewer layers/units;
* reduce the dimensionality of examples in the dataset;
* add more training data, if possible; and
* regularize the model.
128
Regularization
Regularization is an umbrella term for methods that force a learning algorithm to train a less complex model. In practice, it leads to higher bias, but significantly reduces the variance.
129
lasso/ L1 and ridge regularization / L2.
L1 (lasso) regularization adds the sum of the absolute values of the model parameters to the cost function, scaled by a regularization hyperparameter. It tends to drive some parameters to exactly zero, so it also performs feature selection. L2 (ridge) regularization adds the sum of the squared parameter values instead; it shrinks parameters towards zero without eliminating them. Elastic net combines both penalties.
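A quick comparison in scikit-learn; alpha controls the penalty strength, and X_train, y_train are assumed to exist:

```python
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # L1: some coefficients become exactly 0
ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # L2: coefficients are shrunk, not zeroed
print(sum(c == 0 for c in lasso.coef_))          # number of features eliminated by the L1 penalty
```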
130
A common strategy to build a neural network looks as follows:
1. Define a performance metric P.
2. Define the cost function C.
3. Pick a parameter-initialization strategy W.
4. Pick a cost-function optimization algorithm A.
5. Choose a hyperparameter tuning strategy T.
6. Pick a combination H of hyperparameter values using the tuning strategy T.
7. Train model M, using algorithm A, parametrized with hyperparameters H, to optimize cost function C.
8. If there are still untested hyperparameter values, pick another combination H of hyperparameter values using strategy T, and repeat step 7.
9. Return the model for which the metric P was optimized.
131
categorical cross-entropy (for multiclass classification) or binary cross-entropy (for binary and multi-label classification).
These are the standard classification cost functions. Cross-entropy measures the dissimilarity between the predicted probability distribution and the true labels: categorical cross-entropy compares the softmax output with the one-hot encoded true class, while binary cross-entropy is applied to each logistic (sigmoid) output independently, which is why it also works for multi-label problems.
132
Multi-class vs multi label classification
Note that the output layers in multiclass and multi-label classification are different. In multiclass classification, one softmax unit is used. It generates a C-dimensional vector whose values are bounded by the range (0, 1), and whose sum equals 1. In multi-label classification, the output layer contains C logistic units whose values also lie in the range (0, 1), but their sum lies in the range (0, C).
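The two output layers expressed in tf.keras (the rest of the network is assumed to be defined elsewhere; C = 10 is illustrative):

```python
import tensorflow as tf

C = 10
multiclass_output = tf.keras.layers.Dense(C, activation="softmax")   # C values that sum to 1
multilabel_output = tf.keras.layers.Dense(C, activation="sigmoid")   # C independent values in (0, 1)
# Matching losses: "categorical_crossentropy" for multiclass, "binary_crossentropy" for multi-label.
```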
133
Parameter-Initialization Strategies
* ones: all parameters are initialized to 1;
* zeros: all parameters are initialized to 0;
* random normal: parameters are initialized to values sampled from the normal distribution, typically with a mean of 0 and a standard deviation of 0.05;
* random uniform: parameters are initialized to values sampled from the uniform distribution on the range [−0.05, 0.05];
* Xavier normal: parameters are initialized to values sampled from the truncated normal distribution, centered on 0, with standard deviation equal to sqrt(2/(in + out)), where “in” is the number of units in the preceding layer to which the current unit is connected (the one whose parameters you initialize), and “out” is the number of units in the subsequent layer to which the current unit is connected; and
* Xavier uniform: parameters are initialized to values sampled from a uniform distribution within [−limit, limit], where limit is sqrt(6/(in + out)), and “in” and “out” are defined as in Xavier normal, above.
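A small NumPy sketch of the Xavier uniform rule above (the function name and seed are illustrative); libraries such as tf.keras expose the same strategies as built-in initializers, for example GlorotUniform and GlorotNormal:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, seed=0):
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))           # the "limit" from the definition above
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)   # weight matrix for a 256-unit layer feeding a 128-unit layer
```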
134
We say that f(x) has a local minimum at x = c if?
We say that f(x) has a local minimum at x = c if f(x) ≥ f(c) for every x in some open interval around x = c.
135
Learning Rate Decay Schedules
Learning rate decay consists of gradually reducing the value of the learning rate α as the epochs progress; consequently, the parameter updates become finer. There are several common schedules: time-based, step-based, and exponential learning rate decay.
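Sketches of the three schedule families, written as plain functions of the epoch number; alpha0 is the initial learning rate, and the decay constants are illustrative:

```python
import math

def time_based(alpha0, epoch, decay=0.01):
    return alpha0 / (1.0 + decay * epoch)

def step_based(alpha0, epoch, drop=0.5, epochs_per_drop=10):
    return alpha0 * drop ** math.floor(epoch / epochs_per_drop)

def exponential(alpha0, epoch, k=0.1):
    return alpha0 * math.exp(-k * epoch)
```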
136
There are several popular upgrades to minibatch SGD, such as Momentum, Root Mean Squared Propagation (RMSProp), and Adam.
Momentum accumulates an exponentially decaying moving average of past gradients and moves the parameters in that direction, which damps oscillations and speeds up progress along consistent directions. RMSProp adapts the learning rate of each parameter by dividing it by a running average of the magnitudes of that parameter’s recent gradients. Adam combines the ideas of momentum and RMSProp and is a common default choice.
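The three upgrades as configured in tf.keras; the learning rates and momentum value are just common defaults:

```python
import tensorflow as tf

sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)
# model.compile(optimizer=adam, loss="binary_crossentropy", metrics=["accuracy"])
```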
137
dropout
The concept of dropout is very simple. Each time you “run” a training example through the network, you temporarily exclude at random some units from the computation. The higher the percentage of units excluded, the stronger the regularization effect. Popular neural network libraries allow you to add a dropout layer between two successive layers, or you can specify the dropout hyperparameter for a layer. The dropout hyperparameter varies in the range [0, 1] and characterizes the fraction of units to randomly exclude from computation. The value of the hyperparameter has to be found experimentally. While simple, dropout’s flexibility and regularizing effect are phenomenal.
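A dropout layer between two dense layers in tf.keras; 0.5 means half the units are dropped at random on each training pass (dropout is disabled automatically at prediction time):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),                    # the dropout hyperparameter in [0, 1]
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```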
138
Early stopping
Early stopping trains a neural network by saving the preliminary model after every epoch. Models saved after each epoch are called checkpoints. Then it assesses each checkpoint’s performance on the validation set. You’ll find during gradient descent that the cost decreases as the number of epochs increases. After some epoch, the model can start overfitting, and the model’s performance on the validation data can deteriorate. Remember the bias-variance illustration in Figure ?? in Chapter 5. By keeping a version of the model after each epoch, you can stop the training once you start observing a decreased performance on the validation set. Alternatively, you can keep running the training process for a fixed number of epochs, and then pick the best checkpoint. Some machine learning practitioners rely on this technique. Others try to properly regularize the model using appropriate techniques.
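A sketch of both variants with tf.keras callbacks, assuming a compiled model and training/validation arrays exist (the file name pattern and patience value are illustrative):

```python
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.ModelCheckpoint("checkpoint_{epoch:02d}.keras"),   # save a checkpoint every epoch
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),          # stop when validation loss stalls
]
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, callbacks=callbacks)
```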
139
Batch normalization
Batch normalization (which rather should be called batch standardization) consists of standardizing the outputs of each layer before the next layer receives them as input. In practice, batch normalization results in faster and more stable training, as well as some regularization effect. So, it’s always a good idea to use batch normalization. In popular neural network libraries, you can often insert a batch normalization layer between two subsequent layers.
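A batch normalization layer inserted between two layers in tf.keras:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.BatchNormalization(),            # standardize the previous layer's outputs per batch
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```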
140
A pre-trained model can be used in two ways:
1) its learned parameters can be used to initialize your own model, or 2) it can be used as a feature extractor for your model.
141
pre-trained model as feature extractors
In practice, it means that you only keep several initial layers of the pre-trained model, those closest to and including the input layer. You keep their parameters “frozen,” that is, unchanged and unchangeable. Then you add new layers on top of the frozen layers, including the output layer appropriate for your task. Only the parameters of the new layers will be updated by gradient descent during training on your data.
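A feature-extractor sketch in tf.keras; MobileNetV2 with ImageNet weights is just one possible pre-trained base, and the input shape and head are illustrative:

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False                               # freeze the pre-trained parameters

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new output layer for your task; only new layers train
])
```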
142
use a pre-trained model as an initializer
If you use a pre-trained model as an initializer for your own model, you get more flexibility: gradient descent will modify the parameters in all layers and can potentially reach a better performance for your problem. The downside is that you will often end up training a very deep neural network.
143
Adversarial Validation
Adversarial validation is a very clever and very simple way to find out whether our test data and our training data are similar. We combine the training and test data, labeling the training rows with 0 and the test rows with 1, shuffle them, and then check whether a binary classifier can re-identify them. If it cannot, that is, if we obtain an area under the receiver operating characteristic curve (AUC) of about 0.5, the two sets are indistinguishable and we are good to go. However, if the classifier can separate them (AUC > 0.5), we have a problem, either with the whole dataset or, more likely, with some features in particular, which probably come from different distributions in the test and training sets. In that case, we can look at the feature that was most out of place; the problem may be that some values were seen only in the training data but not in the test data. If one feature contributes very strongly to the AUC, it may well be a good idea to remove that feature from the model. We can also use this technique to improve the data used for learning.
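A minimal adversarial-validation sketch, assuming X_train and X_test are feature matrices with the same columns; the classifier choice is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_all = np.vstack([X_train, X_test])
y_all = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])  # 0 = train row, 1 = test row

clf = RandomForestClassifier(n_estimators=100, random_state=42)
auc = cross_val_score(clf, X_all, y_all, cv=5, scoring="roc_auc").mean()
print(auc)  # ~0.5: distributions look alike; clearly above 0.5: inspect the feature importances
```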
144
Handling Imbalanced Datasets - class weight
Class weighting makes the learning algorithm pay more attention to the minority class by multiplying the contribution of its examples to the cost function by a higher weight, without changing the data itself. Many libraries expose this directly, for example as a class_weight argument.
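A scikit-learn sketch: "balanced" weighs each class inversely to its frequency, and explicit weights are also possible (X_train, y_train assumed):

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
# or explicit weights, e.g. class_weight={0: 1.0, 1: 10.0} to penalize minority-class errors 10x
```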
145
Handling Imbalanced Datasets - Ensemble of Resampled Datasets
Instead of training one model on the full imbalanced dataset, split the majority class into several subsets, pair each subset with all the minority-class examples to obtain several roughly balanced datasets, train one model on each, and combine their predictions (for example, by voting or averaging).
146
Handling Imbalanced Datasets - different learning rates for different classes:
If you use stochastic gradient descent, the class imbalance can be tackled in several ways. First, you can have different learning rates for different classes: a lower value for the examples of the majority class, and a higher value otherwise. Second, you can make several consecutive updates of the model parameters each time you encounter an example of a minority class.
147
There are two techniques often used to calibrate a binary model: Platt scaling and isotonic regression.
Platt scaling fits a logistic (sigmoid) function that maps the model’s raw scores to calibrated probabilities. Isotonic regression fits a non-parametric, non-decreasing step function for the same purpose; it is more flexible but needs more data to avoid overfitting.
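Both calibration methods are available in scikit-learn through CalibratedClassifierCV ("sigmoid" is Platt scaling, "isotonic" is isotonic regression); the base model and data are assumed:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
probs = calibrated.predict_proba(X_val)[:, 1]   # calibrated probabilities of the positive class
```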
148
If your model does poorly on the training data (underfits it), common reasons are:
* the model architecture or learning algorithm is not expressive enough (try a more advanced learning algorithm, an ensemble method, or a deeper neural network);
* you regularize too much (reduce regularization);
* you have chosen suboptimal values for hyperparameters (tune hyperparameters);
* the features you engineered don’t have enough predictive power (add more informative features);
* you don’t have enough data for the model to generalize (try to get more data, use data augmentation, or apply transfer learning); or
* you have a bug in your code (debug the code that defines and trains the model).
149
If your model does well on the training data, but poorly on the holdout data (overfits the training data), common reasons are:
* you don’t have enough data for generalization (add more data or use data augmentation);
* your model is under-regularized (add regularization or, for neural networks, both regularization and batch normalization);
* your training data distribution is different from the holdout data distribution (reduce the distribution shift);
* you have chosen suboptimal values for hyperparameters (tune hyperparameters); or
* your features have low predictive power (add features with high predictive power).
150
Iterative Model Refinement If you have access to new labeled data (for example, you can label examples yourself, or easily request the help of a labeler) then, you can refine the model using a simple iterative process:
1. Train the model using the best values of hyperparameters identified so far.
2. Test the model by applying it to a small subset of the validation set (100–300 examples).
3. Find the most frequent error patterns on that small validation set. Remove those examples from the validation set, because your model will now overfit to them.
4. Generate new features, or add more training data, to fix the observed error patterns.
5. Repeat until no frequent error patterns are observed (most errors look dissimilar).
151
Fixing Wrong Labels
Here is a simple way to identify the examples that have wrong labels. Apply the model to the training data from which it was built, and analyze the examples for which it made a different prediction as compared to the labels provided by humans. If you see that some predictions are indeed correct, change those labels. If you have time and resources, you could also examine the predictions with the score close to the decision threshold. Those are often mislabeled cases too.
152
Finding Additional Examples to Label in best way possible?
As discussed above, error analysis can reveal that more labeled data is needed from specific regions of the feature space. You might have an abundance of unlabeled examples; how should you decide which of them to label so as to maximize the positive impact on the model? If your model returns a prediction score, an effective way is to use your best model to score the unlabeled examples, and then label those examples whose prediction score is close to the prediction threshold.
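A minimal sketch of picking the examples closest to a 0.5 decision threshold, assuming a trained model with predict_proba and an unlabeled pool X_unlabeled; the batch size of 200 is illustrative:

```python
import numpy as np

scores = model.predict_proba(X_unlabeled)[:, 1]   # score the pool with the best current model
uncertainty = np.abs(scores - 0.5)                # 0 means right on the decision threshold
to_label = np.argsort(uncertainty)[:200]          # indices of the 200 most uncertain examples to send to labelers
```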
153
Troubleshooting Deep Learning
look in the book: Machine learning engineering
154
What is a good model?
A good model has two properties:
* it has the desired quality according to the performance metric; and
* it is safe to serve in a production environment.
For a model to be safe-to-serve means satisfying the following requirements:
* it will not crash or cause errors in the serving system when being loaded, or when loaded with bad or unexpected inputs; and
* it will not use an unreasonable amount of resources (such as CPU, GPU, or RAM).
155
catastrophic forgetting
Furthermore, frequent model upgrades without retraining from scratch can lead to catastrophic forgetting. It’s a situation in which the model that was once capable of something “forgets” that capability because of learning something new.
156
warm-starting
However, avoid the practice of warm-starting. It consists of iteratively upgrading the existing model by using only new training examples and running additional training iterations.
157
Correction Cascades
You might have a model mA that solves problem A, but you need a solution mB for a slightly different problem B. It can be tempting to use the output of mA as input for mB, and to train mB only on a small sample of examples that “correct” the output of mA for solving problem B. Such a technique is called correction cascading, and it is not recommended. It’s important to note that model cascading is not always a bad practice: using the output of one model as one of many inputs for another model is common, and it might significantly reduce time to market. However, cascading must be used with caution, because an update of one model in a cascade must involve an update of all models in the cascade, which can end up being costly in the long term.
158
glue code
Reduce glue code to a minimum. This is how Google engineers put it: machine learning researchers tend to develop general-purpose solutions as self-contained packages. A wide variety of these are available as open-source packages, in-house code, proprietary packages, and cloud-based platforms. Using generic packages often results in a glue-code system design pattern, in which a massive amount of supporting code is written to get data into and out of the general-purpose packages.
159
More Data Beats Cleverer Algorithm
In practice, however, better results often come from getting more data, specifically, more labeled examples. If designed well, the data labeling process can allow a labeler to produce several thousand training examples daily. It can also be less expensive, compared to the expertise needed to invent a more advanced machine learning algorithm.
160
New Data Beats Cleverer Features
If, despite adding more training examples and designing clever features, the performance of your model plateaus, think about different information sources. For example, if you want to predict whether user U will like a news article, try to add historical data about the user U as features. Or cluster all the users, and use the information on the k-nearest users to user U as new features. This is a simpler approach compared to programming very complex features, or combining existing features in a complex way.
161
Facilitate Reproducibility
The random seed can be set as np.random.seed(15) (in NumPy and scikit-learn), tf.random.set_seed(15) (in TensorFlow), torch.manual_seed(15) (in PyTorch), and set.seed(15) (in R). The seed value doesn’t matter as long as it remains constant.
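One small helper that bundles the calls listed above; only the libraries you actually use need to be imported, and the helper name is illustrative:

```python
import random
import numpy as np
import tensorflow as tf
import torch

def set_seeds(seed=15):
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy and scikit-learn
    tf.random.set_seed(seed) # TensorFlow
    torch.manual_seed(seed)  # PyTorch

set_seeds(15)
```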
162
When delivering the model, make sure it’s accompanied by all relevant information for reproducibility.
Besides the description of the dataset and features, such as the documentation and metadata considered in Sections ?? and ??, each model should be accompanied by documentation with the following details:
* a specification of all hyperparameters, including the ranges considered and the default values used;
* the method used to select the best hyperparameter configuration;
* the definition of the specific measure or statistic used to evaluate the candidate models, and its value for the best model;
* a description of the computing infrastructure used; and
* the average runtime for each trained model, and an estimated cost of the training.
163
There are different forms of online evaluation, each serving a different purpose.
Runtime monitoring checks whether the running system meets the runtime requirements. Another common scenario is to monitor user behavior in response to different versions of the model. One popular technique used in this scenario is A/B testing: we split the users of the system into two groups, A and B, serve the old model to one group and the new model to the other, and then apply a statistical significance test to decide whether the performance of the new model is better than that of the old one. Multi-armed bandit (MAB) is another popular technique for online model evaluation. Similar to A/B testing, it identifies the best-performing model by exposing the candidate models to a fraction of users; it then gradually exposes the best model to more users, while continuing to gather performance statistics until they are reliable.
164
why A/B testing
To decide, based on the behavior of live users and a statistical significance test, whether the new model actually performs better than the old one.
165
A/B testing - G-Test
The G-test is a likelihood-ratio statistical significance test applied to count data. In A/B testing, it can compare the observed counts of outcomes (for example, clicks and non-clicks) in groups A and B with the counts expected if both models performed the same; a small p-value indicates that the difference between the two models is unlikely to be due to chance.
166
A/B testing - Z-Test
The Z-test compares the means or proportions measured in groups A and B (for example, conversion rates), assuming the test statistic is approximately normally distributed, which holds for large samples. If the resulting p-value is below the chosen significance level, the difference between the two models is considered statistically significant.
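A hedged sketch of a two-proportion Z-test using statsmodels; the counts are made-up conversion numbers for groups A and B:

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [45, 60]      # successes observed in group A and group B
users = [1000, 1000]        # users routed to each group

stat, p_value = proportions_ztest(count=conversions, nobs=users)
print(stat, p_value)        # p_value < 0.05 would suggest a real difference between the models
```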
167
Multi-Armed Bandit
A more advanced, and often preferable, way of online model evaluation and selection is the multi-armed bandit (MAB). A/B testing has one major drawback: the number of test results you need to collect in groups A and B before the test becomes conclusive is high, so a significant portion of users routed to the suboptimal model would experience suboptimal behavior for a long time.
168
UCB1
UCB1 (for Upper Confidence Bound) is a popular algorithm for solving the multi-armed bandit problem. The algorithm dynamically chooses an arm, based on the performance of that arm in the past, and how much the algorithm knows about it. In other words, UCB1 routes the user to the best performing model more often when its confidence about the model performance is high. Otherwise, UCB1 might route the user to a suboptimal model so as to get a more confident estimate of that model’s performance. Once the algorithm is confident enough about the performance of each model, it almost always routes users to the best performing model.
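A minimal UCB1 sketch for choosing between candidate models ("arms"); successes could be clicks or any per-request success signal, and the counts below are made up:

```python
import math

def ucb1_choose(successes, pulls):
    # Try every arm once before applying the UCB1 formula.
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm
    total = sum(pulls)
    # Score = observed mean reward + exploration bonus that shrinks as an arm is tried more.
    scores = [successes[a] / pulls[a] + math.sqrt(2 * math.log(total) / pulls[a])
              for a in range(len(pulls))]
    return max(range(len(pulls)), key=lambda a: scores[a])

arm = ucb1_choose(successes=[30, 42], pulls=[100, 100])   # route the next user to this model
```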
169
Neuron Coverage
When we evaluate a neural network, especially one to be used in a mission-critical scenario, such as a self-driving car or a space rocket, our test set must have good coverage. Neuron coverage of a test set for a neural network model is defined as the ratio of the units (neurons) activated by the examples from the test set, to the total number of units. A good test set has close to 100% neuron coverage. A unit is considered activated when its output is above a certain threshold. For ReLU, it’s usually zero; for a logistic sigmoid, it’s 0.5.
170
Mutation Testing
In software engineering, good test coverage for the software under test (SUT) can be determined using the approach known as mutation testing. Suppose we have a set of tests designed to test an SUT. We generate several “mutants” of the SUT; a mutant is a version of the SUT in which we randomly make some modifications, such as replacing, in the source code, a “+” with a “−” or a “<” with a “>”, deleting the else branch of an if-else statement, and so on. Then we apply the test set to each mutant and see whether at least one test breaks on that mutant; if so, we say that we kill the mutant. We then compute the ratio of killed mutants in the entire collection of mutants. A good test set makes this ratio equal to 100%. In machine learning, a similar approach can be followed. However, to create a mutant statistical model, instead of modifying the code, we modify the training data. If the model is deep, we can also randomly remove or add a layer, or remove or replace an activation function. The training data can be modified by:
* adding duplicated examples,
* falsifying the labels of some examples,
* removing some examples, or
* adding random noise to the values of some features.
We say that we kill a mutant if at least one test example gets a wrong prediction from that mutant statistical model.
171
Robustness of a model
The robustness of a machine learning model refers to the stability of the model performance after adding some noise to the input data. A robust model would exhibit the following behavior. If the input example is perturbed by adding random noise, the performance of the model would degrade proportionally to the level of noise.
172
Fairness
Machine learning algorithms tend to learn what humans are teaching them. The teaching comes in the form of training examples. Humans have biases which may affect how they collect and label data. Sometimes, bias is present in historical, cultural, or geographical data. This, in turn, as we have seen in Section ?? in Chapter 3, may lead to biased models.
173
A model can be deployed following several patterns:
* statically, as a part of an installable software package,
* dynamically on the user’s device,
* dynamically on a server, or
* via model streaming.
174
Static Deployment
The static deployment of a machine learning model is very similar to traditional software deployment: you prepare an installable binary of the entire software, and the model is packaged as a resource available at runtime. Depending on the operating system and the runtime environment, the objects of both the model and the feature extractor can be packaged as part of a dynamic-link library (DLL on Windows) or shared objects (*.so files on Linux), or be serialized and saved in the standard resource location for virtual-machine-based systems, such as Java and .NET. Static deployment has many advantages:
* the software has direct access to the model, so the execution time is fast for the user;
* the user data doesn’t have to be uploaded to the server at prediction time, which saves time and preserves privacy;
* the model can be called when the user is offline; and
* the software vendor doesn’t have to care about keeping the model operational; it becomes the user’s responsibility.
175
what is load balancer?
A load balancer dispatches the incoming requests to a specific virtual machine, depending on its availability. The virtual machines can be added and closed manually, or be a part of an autoscaling group that launches or terminates virtual machines based on their usage. Figure 2 illustrates that deployment pattern. Each instance, denoted as an orange square, contains all the code needed to run the feature extractor and the model. The instance also contains a web service that has access to that code.
176
example of a hidden feedback loop.
Model mB used the output of model mA as a feature, without knowing that model mA also used the output of model mB as its feature. Another kind of hidden feedback loop involves only one model. Say we have a model that classifies incoming email messages as spam or not spam, and the user interface allows the user to mark messages as spam or not spam. Obviously, we want to use those marked messages to improve our model. However, by doing so, we risk creating a hidden feedback loop, and here is why: in our application, the user will only mark a message as spam when they see it, but users only see the messages that our model classified as not spam. It is also unlikely that the user will regularly go to the spam folder and mark some messages as not spam. So, the action of the user is significantly affected by our model, which makes the data we get from the user skewed: we influence the phenomenon from which we learn.
177
what is message broker?
To deal with such situations, on-demand architectures include a message broker, such as RabbitMQ or Apache Kafka. A message broker allows one process to write messages in a queue, and another to read from that queue. On-demand requests are placed in the input queue. The model runtime process periodically connects to the broker. It reads a batch of input data elements from the input queue and generates predictions for each element in batch mode. It then writes the predictions to the output queue. Another process periodically connects to the broker, reads the predictions from the output queue, and pushes them to users who sent the requests (Figure 3). In addition to allowing us to cope with demand spikes, such an approach is more resource-efficient.
178
There are three “cannots” we must accept and embrace:
1. We cannot always explain why an error happened.
2. We cannot reliably predict when it will happen, and even a high-confidence prediction can be false.
3. We cannot always know how to fix a specific error, or, if it is fixable, what kind of training data, and how much of it, is needed.
179
thundersvm and cuML
Modern libraries, such as thundersvm and cuML, allow the analyst to run shallow learning algorithms on GPUs, with a significant gain in training time. If you cannot afford to wait for days or weeks to get an updated model, using a less complex (and, therefore, less accurate) model might be your only choice.
180