Machine learning engineering (book) Flashcards
Reinforcement Learning
Reinforcement learning is a subfield of machine learning where the machine (called an
agent) “lives” in an environment and is capable of perceiving the state of that environment
as a vector of features. The machine can execute actions in non-terminal states. Different
actions bring different rewards and could also move the machine to another state of the
environment. A common goal of a reinforcement learning algorithm is to learn an optimal
policy.
An optimal policy is a function (similar to the model in supervised learning) that takes the
feature vector of a state as input and outputs an optimal action to execute in that state. The
action is optimal if it maximizes the expected average long-term reward.
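Below is a toy sketch (not from the book) of what a learned policy looks like as a function. For simplicity, states are indexed rather than described by feature vectors, and the Q-table of estimated long-term rewards is assumed to have been learned already.

    import numpy as np

    # Hypothetical Q-table: rows are states, columns are actions; entries are
    # estimates of the expected long-term reward (assumed already learned).
    Q = np.array([[0.1, 0.9],   # in state 0, action 1 looks best
                  [0.7, 0.2]])  # in state 1, action 0 looks best

    def optimal_policy(state_index):
        # The policy returns the action that maximizes the estimated reward.
        return int(np.argmax(Q[state_index]))

    print(optimal_policy(0))  # -> 1
    print(optimal_policy(1))  # -> 0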
Tidy data
Tidy data can be seen as a spreadsheet, in which each row represents one
example, and columns represent various attributes of an example, as shown in Figure 3.
Sometimes raw data can be tidy, e.g., provided to you in the form of a spreadsheet. However,
in practice, to obtain tidy data from raw data, data analysts often resort to a procedure
called feature engineering, which is applied to the direct and, optionally, indirect data
with the goal of transforming each raw example into a feature vector x. Chapter 4 is devoted
entirely to feature engineering.
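A hedged illustration of the idea (the attribute names below are made up): feature engineering turns raw records into one tidy row of numeric features per example.

    # Toy raw data: one dict per example.
    raw_examples = [
        {"city": "Paris", "rooms": 3, "price": 450000},
        {"city": "Lyon",  "rooms": 2, "price": 210000},
    ]

    cities = sorted({r["city"] for r in raw_examples})

    def to_feature_vector(raw):
        # One-hot encode the categorical "city" attribute, keep numeric ones as-is.
        one_hot = [1.0 if raw["city"] == c else 0.0 for c in cities]
        return one_hot + [float(raw["rooms"])]

    X = [to_feature_vector(r) for r in raw_examples]  # feature vectors
    y = [r["price"] for r in raw_examples]            # labels
    print(X, y)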
training, validation, and test datasets
training, validation, and test. The training set is usually the biggest one;
the learning algorithm uses the training set to produce the model. The validation and test
sets are roughly the same size, much smaller than the size of the training set. The learning
algorithm is not allowed to use examples from the validation or test sets to train the model.
That is why those two sets are also called holdout sets.
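A minimal sketch of a three-way split; the 70/15/15 proportions below are just a common choice, not a prescription from the book.

    import random

    def split_dataset(examples, train_frac=0.70, valid_frac=0.15, seed=0):
        examples = list(examples)
        random.Random(seed).shuffle(examples)       # shuffle before partitioning
        n = len(examples)
        n_train = int(n * train_frac)
        n_valid = int(n * valid_frac)
        train = examples[:n_train]
        valid = examples[n_train:n_train + n_valid]
        test = examples[n_train + n_valid:]         # the rest becomes the test (holdout) set
        return train, valid, test

    train, valid, test = split_dataset(range(100))
    print(len(train), len(valid), len(test))  # 70 15 15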
Baseline
In machine learning, a baseline is a simple algorithm for solving a problem, usually based
on a heuristic, simple summary statistics, randomization, or a very basic machine learning
algorithm. For example, if your problem is classification, you can pick a baseline classifier
and measure its performance. This baseline performance then becomes the reference against
which you compare any future model (usually one built using a more sophisticated approach).
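For instance, a common classification baseline is to always predict the most frequent class. Here is an illustrative sketch (not the book's code):

    from collections import Counter

    class MajorityClassBaseline:
        def fit(self, y_train):
            # Remember the most frequent label in the training set.
            self.most_common_ = Counter(y_train).most_common(1)[0][0]
            return self

        def predict(self, X):
            # Predict that label for every input example.
            return [self.most_common_ for _ in X]

    baseline = MajorityClassBaseline().fit(["spam", "ham", "ham", "ham"])
    print(baseline.predict([None, None]))  # ['ham', 'ham']

Any trained model should at least beat this baseline on the holdout data.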
Machine Learning Pipeline
A machine learning pipeline is a sequence of operations on the dataset that goes from its
initial state to the model.
A pipeline can include, among others, such stages as data partitioning, missing data
imputation, feature extraction, data augmentation, class imbalance reduction, dimensionality
reduction, and model training.
In practice, when we deploy a model in production, we usually deploy an entire pipeline.
Furthermore, an entire pipeline is usually optimized when hyperparameters are tuned.
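A small sketch of such a pipeline, assuming scikit-learn is available: missing-data imputation, scaling, and model training are chained into one object that is fit, deployed, and tuned as a whole.

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.5, np.nan], [4.0, 5.0]])
    y = np.array([0, 0, 1, 1])

    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),  # missing data imputation
        ("scale", StandardScaler()),                 # feature scaling
        ("model", LogisticRegression()),             # model training
    ])
    pipeline.fit(X, y)                   # the entire pipeline is fit together
    print(pipeline.predict([[3.0, 4.0]]))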
Hyperparameters
Hyperparameters are inputs of machine learning algorithms or pipelines that influence
the performance of the model. They don’t belong to the training data and cannot be
learned from it. For example, the maximum depth of the tree in the decision tree learning
algorithm, the misclassification penalty in support vector machines, k in the k-nearest
neighbors algorithm, the target dimensionality in dimensionality reduction, and the choice of
the missing data imputation technique are all examples of hyperparameters.
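To make the distinction concrete, here is a sketch (assuming scikit-learn) where the hyperparameters are passed in before training; none of these values is learned from the data.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier

    tree = DecisionTreeClassifier(max_depth=3)   # maximum depth of the tree
    knn = KNeighborsClassifier(n_neighbors=5)    # k in k-nearest neighbors
    # Both values are chosen by the engineer (or by a tuning procedure that
    # compares models on the validation set), not learned from the training data.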
Parameters
Parameters, on the other hand, are variables that define the model trained by the learning
algorithm. Parameters are directly modified by the learning algorithm based on the training
data. The goal of learning is to find such values of parameters that make the model optimal
in a certain sense. Examples of parameters are w and b in the equation of linear regression
y = wx + b. In this equation, x is the input of the model, and y is its output (the prediction).
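A small worked example (using NumPy, assumed available): the learning algorithm sets the parameters w and b from the training data.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = 2.0 * x + 1.0                 # data generated with w = 2, b = 1

    w, b = np.polyfit(x, y, deg=1)    # least-squares fit of y = wx + b
    print(round(w, 2), round(b, 2))   # -> 2.0 1.0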
Model-Based
Most supervised learning algorithms are model-based. A typical model is a support
vector machine (SVM). Model-based learning algorithms use the training data to create a
model with parameters learned from the training data. In SVM, the two parameters are w
(a vector) and b (a real number). After the model is trained, it can be saved on disk while
the training data can be discarded.
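A sketch of this workflow, assuming scikit-learn and joblib are available:

    from sklearn.svm import LinearSVC
    import joblib

    X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.5]]
    y = [0, 0, 1, 1]

    model = LinearSVC().fit(X, y)            # w (coef_) and b (intercept_) are learned
    print(model.coef_, model.intercept_)
    joblib.dump(model, "svm_model.joblib")   # persist the model; X and y can be discarded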
Instance-based learning algorithms
Instance-based learning algorithms use the whole dataset as the model. One instance-
based algorithm frequently used in practice is k-Nearest Neighbors (kNN). In classification,
to predict a label for an input example, the kNN algorithm looks at the close neighborhood
of the input example in the space of feature vectors and outputs the label that it saw most
often in this close neighborhood.
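A bare-bones kNN classifier as an illustrative sketch (not the book's code); note that the "model" is simply the stored training set.

    import math
    from collections import Counter

    def knn_predict(train_X, train_y, query, k=3):
        # Distance from the query to every training example.
        dists = [math.dist(query, x) for x in train_X]
        # Indices of the k closest neighbors.
        nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
        labels = [train_y[i] for i in nearest]
        # The most frequent label in the close neighborhood wins.
        return Counter(labels).most_common(1)[0][0]

    train_X = [[0, 0], [0, 1], [5, 5], [6, 5]]
    train_y = ["a", "a", "b", "b"]
    print(knn_predict(train_X, train_y, [5, 6], k=3))  # -> 'b'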
Shallow vs. Deep Learning
Most machine learning algorithms are shallow: they learn the parameters of the model directly
from the features of the training examples. Deep learning algorithms are the notable exception:
they train neural networks with more than one layer between input and output, and most of the
model's parameters are learned not directly from the features of the training examples, but
from the outputs of the preceding layers.
Training vs. Scoring
Training is when we apply a learning algorithm to a dataset in order to produce a model.
When we apply the trained model to an input example (or, sometimes, a sequence of examples)
in order to obtain a prediction (or predictions) or to somehow transform an input, we talk
about scoring.
When to Use Machine Learning
When the Problem Is Too Complex for Coding
When the Problem Is Constantly Changing
When It Is a Perceptive Problem - image recognition…
When It Is an Unstudied Phenomenon
When the Problem Has a Simple Objective
When It Is Cost-Effective
When Not to Use Machine Learning
- every action of the system or a decision made by it must be explainable,
- every change in the system’s behavior compared to its past behavior in a similar situation must be explainable,
- the cost of an error made by the system is too high,
- you want to get to the market as fast as possible,
- getting the right data is too hard or impossible,
- you can solve the problem using traditional software development at a lower cost,
- a simple heuristic would work reasonably well,
- the phenomenon has too many outcomes while you cannot get a sufficient amount of examples to represent them (like in video games or word processing software),
- you build a system that will not have to be improved frequently over time,
- you can manually fill an exhaustive lookup table by providing the expected output for any input (that is, the number of possible input values is not too large, or getting outputs is fast and cheap).
Machine learning engineering
Machine learning engineering (MLE) is the use of scientific principles, tools, and techniques of
machine learning and traditional software engineering to design and build complex computing
systems. MLE encompasses all stages from data collection, to model training, to making the
model available for use by the product or the customers.
In other words, MLE includes any activity that lets machine learning algorithms be
implemented as a part of an effective production system.
Three factors highly influence the cost of a machine learning project
- the difficulty of the problem,
- the cost of data, and
- the need for accuracy.
Defining the Goal of a Machine Learning Project
The goal of a machine learning project is to build a model that solves, or helps solve, a
business problem. Within a project, the model is often seen as a black box described by the
structure of its input (or inputs) and output (or outputs), and the minimum acceptable level
of performance (as measured by accuracy of prediction or another performance metric).
What a Model Can Do
- automate (for example, by taking action on the user’s behalf or by starting or stopping a specific activity on a server),
- alert or prompt (for example, by asking the user if an action should be taken or by asking a system administrator if the traffic seems suspicious),
- organize, by presenting a set of items in an order that might be useful for a user (for example, by sorting pictures or documents in the order of similarity to a query or according to the user’s preferences),
- annotate (for instance, by adding contextual annotations to displayed information, or by highlighting, in a text, phrases relevant to the user’s task),
- extract (for example, by detecting smaller pieces of relevant information in a larger input, such as named entities in the text: proper names, companies, or locations),
- recommend (for example, by detecting and showing to a user highly relevant items in a large collection based on the item’s content or the user’s reaction to past recommendations),
- classify (for example, by dispatching input examples into one, or several, of a predefined set of distinctly-named groups),
- quantify (for example, by assigning a number, such as a price, to an object, such as a house),
- synthesize (for example, by generating new text, image, sound, or another object similar to the objects in a collection),
- answer an explicit question (for example, “Does this text describe that image?” or “Are these two images similar?”),
- transform its input (for example, by reducing its dimensionality for visualization purposes, paraphrasing a long text as a short abstract, translating a sentence into another language, or augmenting an image by applying a filter to it),
- detect a novelty or an anomaly.
Properties of a Successful Model
- it respects the input and output specifications and the performance requirement,
- it benefits the organization (measured via cost reduction, increased sales or profit),
- it helps the user (measured via productivity, engagement, and sentiment),
- it is scientifically rigorous.
Structuring a Machine Learning Team
Two Cultures
One culture says that a machine learning team has to be composed of data analysts who
collaborate closely with software engineers. In such a culture, a software engineer doesn’t
need to have deep expertise in machine learning, but has to understand the vocabulary of
their fellow data analysts.
According to the other culture, all engineers in a machine learning team must have a combination
of machine learning and software engineering skills.
Data engineers
Data engineers are software engineers responsible for ETL (for Extract, Transform, Load).
These three conceptual steps are part of a typical data pipeline. Data engineers use ETL
techniques and create an automated pipeline, in which raw data is transformed into analysis-
ready data. Data engineers design how to structure the data and how to integrate it from
various sources. They write on-demand queries on that data, or wrap the most frequent
queries into fast application programming interfaces (APIs) to make sure that the data is
easily accessible by analysts and other data consumers. Typically, data engineers are not
expected to know any machine learning.
labeler
A labeler is a person responsible for assigning labels to unlabeled examples. Again, in big
companies, data labeling experts may be organized in two or three different teams: one or
two teams of labelers (for example, one local and one outsourced) and a team of software
engineers, plus a user experience (UX) specialist, responsible for building labeling tools.
machine learning projects can fail for many reasons
- lack of experienced talent,
- lack of support by the leadership,
- missing data infrastructure,
- data labeling challenge,
- siloed organizations and lack of collaboration,
- technically infeasible projects, and
- lack of alignment between technical and business teams.
Is the Data Sizeable?
Check whether the number of examples is big enough. There are some common rules of thumb:
* 10 times the number of features (this often exaggerates the size of the training set, but
works well as an upper bound),
* 100 or 1000 times the number of classes (this often underestimates the size), or
* ten times the number of trainable parameters (usually applied to neural networks).
Keep in mind that just because you have big data does not mean that you should use all of
it. A smaller sample of big data can give good results in practice and accelerate the search
for a better model. It’s important to ensure, though, that the sample is representative of the
whole big dataset. Sampling strategies such as stratified and systematic sampling can
lead to better results. We consider data sampling strategies in Section 3.10.
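A hedged sketch of stratified sampling (an illustration, not the book's implementation): take the same fraction from each class so the smaller sample stays representative of the full dataset.

    import random
    from collections import defaultdict

    def stratified_sample(examples, labels, fraction, seed=0):
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for example, label in zip(examples, labels):
            by_class[label].append(example)
        sample = []
        for label, items in by_class.items():
            rng.shuffle(items)
            n = max(1, int(len(items) * fraction))  # keep at least one per class
            sample.extend((x, label) for x in items[:n])
        return sample

    examples = list(range(1000))
    labels = ["rare" if i % 10 == 0 else "common" for i in examples]
    sample = stratified_sample(examples, labels, fraction=0.1)
    print(sum(1 for _, lbl in sample if lbl == "rare"), len(sample))  # 10 100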
data leakage (also known as target leakage)
Data leakage (also known as target leakage) occurs when the training data contains information
that will not be available in production at prediction time. For example, suppose a model was
trained to predict house selling prices. After a more careful examination of the dataset, you
realize that one of the columns in the spreadsheet contained the real estate agent’s commission.
Of course, the model easily learned to convert this attribute into the house price perfectly.
However, this information is not available in the production environment before the house is
sold, because the commission depends on the selling price. In Section 3.2.8, we will consider
the problem of data leakage in more detail.
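An illustrative sketch of the fix (column names are made up), assuming pandas is available: the leaky column must be dropped before training because it is derived from the target.

    import pandas as pd

    data = pd.DataFrame({
        "square_meters": [50, 80, 120],
        "agent_commission": [9000, 15000, 24000],  # 3% of the selling price: leaks the target
        "selling_price": [300000, 500000, 800000],
    })

    target = data["selling_price"]
    features = data.drop(columns=["selling_price", "agent_commission"])  # remove the leaky column
    print(features.columns.tolist())  # ['square_meters']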