Machine learning engineering (book) Flashcards
Reinforcement Learning
Reinforcement learning is a subfield of machine learning where the machine (called an
agent) “lives” in an environment and is capable of perceiving the state of that environment
as a vector of features. The machine can execute actions in non-terminal states. Different
actions bring different rewards and could also move the machine to another state of the
environment. A common goal of a reinforcement learning algorithm is to learn an optimal
policy.
An optimal policy is a function (similar to the model in supervised learning) that takes the
feature vector of a state as input and outputs an optimal action to execute in that state. The
action is optimal if it maximizes the expected average long-term reward.
Tidy data
Tidy data can be seen as a spreadsheet, in which each row represents one
example, and columns represent various attributes of an example, as shown in Figure 3.
Sometimes raw data can be tidy, e.g., provided to you in the form of a spreadsheet. However,
in practice, to obtain tidy data from raw data, data analysts often resort to the procedure
called feature engineering, which is applied to the direct and, optionally, indirect data
with the goal to transform each raw example into a feature vector x. Chapter 4 is devoted
entirely to feature engineering.
training, validation, and test datasets
In practice, the data is split into three subsets: training, validation, and test. The training set is usually the biggest one;
the learning algorithm uses the training set to produce the model. The validation and test
sets are roughly the same size, much smaller than the size of the training set. The learning
algorithm is not allowed to use examples from the validation or test sets to train the model.
That is why those two sets are also called holdout sets.
Baseline
In machine learning, a baseline is a simple algorithm for solving a problem, usually based
on a heuristic, simple summary statistics, randomization, or very basic machine learning
algorithm. For example, if your problem is classification, you can pick a baseline classifier
and measure its performance. This baseline performance will then become what you compare
any future model to (usually, built using a more sophisticated approach).
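As an illustration (a minimal sketch, assuming scikit-learn and a synthetic dataset; all names and numbers are made up), a most-frequent-class baseline can be built and measured like this:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in practice, use your own features and labels.
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: always predict the most frequent class, regardless of the input.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))

Any more sophisticated model is then expected to beat this number.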
Machine Learning Pipeline
A machine learning pipeline is a sequence of operations on the dataset that goes from its
initial state to the model.
A pipeline can include, among others, such stages as data partitioning, missing data im-
putation, feature extraction, data augmentation, class imbalance reduction, dimensionality
reduction, and model training.
In practice, when we deploy a model in production, we usually deploy an entire pipeline.
Furthermore, an entire pipeline is usually optimized when hyperparameters are tuned.
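A minimal sketch of such a pipeline, assuming scikit-learn (the stages and parameter values are illustrative, not the book's):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The whole sequence of operations is trained, tuned, and deployed as one object.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),    # missing data imputation
    ("scale", StandardScaler()),                   # feature scaling
    ("reduce", PCA(n_components=10)),              # dimensionality reduction
    ("model", LogisticRegression(max_iter=1000)),  # model training
])
pipe.fit(X_train, y_train)
print("test accuracy:", pipe.score(X_test, y_test))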
Hyperparameters
Hyperparameters are inputs of machine learning algorithms or pipelines that influence
the performance of the model. They don’t belong to the training data and cannot be
learned from it. For example, the maximum depth of the tree in the decision tree learning
algorithm, the misclassification penalty in support vector machines, k in the k-nearest
neighbors algorithm, the target dimensionality in dimensionality reduction, and the choice of
the missing data imputation technique are all examples of hyperparameters.
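For instance, a hedged sketch of tuning one such hyperparameter (the maximum tree depth) with grid search, assuming scikit-learn; the candidate values are arbitrary:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# max_depth is a hyperparameter: it is not learned from the training data,
# but it influences the performance of the trained model.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 8, 16, None]},
    cv=5,
)
search.fit(X, y)
print("best max_depth:", search.best_params_["max_depth"])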
Parameters,
Parameters, on the other hand, are variables that define the model trained by the learning
algorithm. Parameters are directly modified by the learning algorithm based on the training
data. The goal of learning is to find such values of parameters that make the model optimal
in a certain sense. Examples of parameters are w and b in the equation of linear regression
y = wx + b. In this equation, x is the input of the model, and y is its output (the prediction).
Model-Based
Most supervised learning algorithms are model-based. A typical model is a support
vector machine (SVM). Model-based learning algorithms use the training data to create a
model with parameters learned from the training data. In SVM, the two parameters are w
(a vector) and b (a real number). After the model is trained, it can be saved on disk while
the training data can be discarded.
Instance-based learning algorithms
Instance-based learning algorithms use the whole dataset as the model. One instance-
based algorithm frequently used in practice is k-Nearest Neighbors (kNN). In classification,
to predict a label for an input example, the kNN algorithm looks at the close neighborhood
of the input example in the space of feature vectors and outputs the label that it saw most
often in this close neighborhood.
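A minimal sketch, assuming scikit-learn and synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# kNN keeps the training examples and, at prediction time, outputs the label
# seen most often among the k closest training examples.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)  # "fit" here mostly stores the training data
print(knn.predict(X_test[:3]))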
Shallow vs. Deep Learning
Most supervised learning algorithms are shallow: they learn the parameters of the model directly from the features of the training examples. Deep learning algorithms train deep neural networks, where the parameters of most layers are learned not directly from the features of the training examples, but from the outputs of the preceding layers.
Training vs. Scoring
When we use a dataset of examples to build a model, we talk about training. When we apply a trained model to an input example (or, sometimes, a sequence of examples) in order to obtain a prediction (or predictions), or to somehow transform an input, we talk about scoring.
When to Use Machine Learning
When the Problem Is Too Complex for Coding
When the Problem Is Constantly Changing
When It Is a Perceptive Problem - image recognition…
When It Is an Unstudied Phenomenon
When the Problem Has a Simple Objective
When It Is Cost-Effective
When Not to Use Machine Learning
- every action of the system or a decision made by it must be explainable,
- every change in the system’s behavior compared to its past behavior in a similar situation must be explainable,
- the cost of an error made by the system is too high,
- you want to get to the market as fast as possible,
- getting the right data is too hard or impossible,
- you can solve the problem using traditional software development at a lower cost,
- a simple heuristic would work reasonably well,
- the phenomenon has too many outcomes while you cannot get a sufficient amount of examples to represent them (like in video games or word processing software),
- you build a system that will not have to be improved frequently over time,
- you can manually fill an exhaustive lookup table by providing the expected output for any input (that is, the number of possible input values is not too large, or getting outputs is fast and cheap).
Machine learning engineering
Machine learning engineering (MLE) is the use of scientific principles, tools, and techniques of
machine learning and traditional software engineering to design and build complex computing
systems. MLE encompasses all stages from data collection, to model training, to making the
model available for use by the product or the customers.
In other words, MLE includes any activity that lets machine learning algorithms be imple-
mented as a part of an effective production system.
Three factors highly influence the cost of a machine learning project
- the difficulty of the problem,
- the cost of data, and
- the need for accuracy.
Defining the Goal of a Machine Learning Project
The goal of a machine learning project is to build a model that solves, or helps solve, a
business problem. Within a project, the model is often seen as a black box described by the
structure of its input (or inputs) and output (or outputs), and the minimum acceptable level
of performance (as measured by accuracy of prediction or another performance metric).
What a Model Can Do
- automate (for example, by taking action on the user’s behalf or by starting or stopping a specific activity on a server),
- alert or prompt (for example, by asking the user if an action should be taken or by asking a system administrator if the traffic seems suspicious),
- organize, by presenting a set of items in an order that might be useful for a user (for example, by sorting pictures or documents in the order of similarity to a query or according to the user’s preferences),
- annotate (for instance, by adding contextual annotations to displayed information, or by highlighting, in a text, phrases relevant to the user’s task),
- extract (for example, by detecting smaller pieces of relevant information in a larger input, such as named entities in the text: proper names, companies, or locations),
- recommend (for example, by detecting and showing to a user highly relevant items in a large collection based on item’s content or user’s reaction to the past recommendations),
- classify (for example, by dispatching input examples into one, or several, of a predefined set of distinctly-named groups),
- quantify (for example, by assigning a number, such as a price, to an object, such as a house),
- synthesize (for example, by generating new text, image, sound, or another object similar to the objects in a collection),
- answer an explicit question (for example, “Does this text describe that image?” or “Are these two images similar?”),
- transform its input (for example, by reducing its dimensionality for visualization purposes, paraphrasing a long text as a short abstract, translating a sentence into another language, or augmenting an image by applying a filter to it),
- detect a novelty or an anomaly.
Properties of a Successful Model
- it respects the input and output specifications and the performance requirement,
- it benefits the organization (measured via cost reduction, increased sales or profit),
- it helps the user (measured via productivity, engagement, and sentiment),
- it is scientifically rigorous.
Structuring a Machine Learning Team
Two Cultures
One culture says that a machine learning team has to be composed of data analysts who
collaborate closely with software engineers. In such a culture, a software engineer doesn’t
need to have deep expertise in machine learning, but has to understand the vocabulary of
their fellow data analysts.
According to the other culture, all engineers in a machine learning team must have a combination
of machine learning and software engineering skills.
Data engineers
Data engineers are software engineers responsible for ETL (for Extract, Transform, Load).
These three conceptual steps are part of a typical data pipeline. Data engineers use ETL
techniques and create an automated pipeline, in which raw data is transformed into analysis-
ready data. Data engineers design how to structure the data and how to integrate it from
various sources. They write on-demand queries on that data, or wrap the most frequent
queries into fast application programming interfaces (APIs) to make sure that the data is
easily accessible by analysts and other data consumers. Typically, data engineers are not
expected to know any machine learning.
labeler
A labeler is a person responsible for assigning labels to unlabeled examples. Again, in big
companies, data labeling experts may be organized in two or three different teams: one or
two teams of labelers (for example, one local and one outsourced) and a team of software
engineers, plus a user experience (UX) specialist, responsible for building labeling tools.
machine learning projects can fail for many reasons
- lack of experienced talent,
- lack of support by the leadership,
- missing data infrastructure,
- data labeling challenge,
- siloed organizations and lack of collaboration,
- technically infeasible projects, and
- lack of alignment between technical and business teams.
Is the Data Sizeable?
Check whether the number of examples is big enough. There are some common rules of thumb:
* 10 times the amount of features (this often exaggerates the size of the training set, but
works well as an upper bound),
* 100 or 1000 times the number of classes (this often underestimates the size), or
* ten times the number of trainable parameters (usually applied to neural networks).
Keep in mind that just because you have big data does not mean that you should use all of
it. A smaller sample of big data can give good results in practice and accelerate the search
for a better model. It’s important to ensure, though, that the sample is representative of the
whole big dataset. Sampling strategies such as stratified and systematic sampling can
lead to better results. We consider data sampling strategies in Section 3.10.
data leakage (also known as target leakage)
What happened here is called data leakage (also known as target leakage). After a more
careful examination of the dataset, you realize that one of the columns in the spreadsheet
contained the real estate agent’s commission. Of course, the model easily learned to convert
this attribute into the house price perfectly. However, this information is not available in the
production environment before the house is sold, because the commission depends on the
selling price. In Section 3.2.8, we will consider the problem of data leakage in more detail.
Common Problems With Data
High Cost - Getting unlabeled data can be expensive; however, labeling data is the most expensive work,
especially if the work is done manually.
Bias? And its types?
Bias in data is an inconsistency with the phenomenon that data represents. This inconsistency
may occur for a number of reasons (which are not mutually exclusive).
- Selection bias
- Self-selection bias
- Omitted variable bias
- Sponsorship or funding bias
- Sampling bias
- Prejudice or stereotype bias
- Systematic value distortion
- Experimenter bias
- Labeling bias
Selection bias
Selection bias is the tendency to skew your choice of data sources to those that are easily
available, convenient, and/or cost-effective. For example, you might want to know the opinion
of the readers on your new book.
Self-selection bias
Self-selection bias is a form of selection bias where you get the data from sources that
“volunteered” to provide it. Most poll data has this type of bias. For example, you want to
train a model that predicts the behavior of successful entrepreneurs. You decide to first ask
entrepreneurs whether they are successful or not. Then you only keep the data obtained
from those who declared themselves successful. The problem here is that most likely, really
successful entrepreneurs don’t have time to answer your questions, while those who claim
themselves successful can be wrong on that matter.
Omitted variable bias
Omitted variable bias happens when your featurized data doesn’t have a feature necessary
for accurate prediction. For example, let’s assume that you are working on a churn prediction
model and you want to predict whether a customer cancels their subscription within six
months. You train a model, and it’s accurate enough; however, several weeks after deployment
you see many unexpected false negatives. You investigate the decreased model performance
and discover a new competitor now offers a very similar service for a lower price. This
feature wasn’t initially available to your model, therefore important information for accurate
prediction was missing.
Sponsorship or funding bias
Sponsorship or funding bias affects the data produced by a sponsored agency. For example, suppose a famous video game company sponsors a news agency to provide news about the video game industry. If you try to make predictions about the video game industry, you might include in your data the stories produced by this sponsored agency.
Sampling bias (also known as distribution shift)
occurs when the distribution of examples
used for training doesn’t reflect the distribution of the inputs the model will receive in
production. This type of bias is frequently observed in practice. For example, you are working
on a system that classifies documents according to a taxonomy of several hundred topics.
You might decide to create a collection of documents in which an equal number of documents
represents each topic. Once you finish the work on the model, you observe 5% error. Soon after
deployment, you see the wrong assignment to about 30% of documents. Why did this happen?
One of the possible reasons is sampling bias: one or two frequent topics in production data
might account for 80% of all input. If your model doesn’t perform well for these frequent
topics, then your system will make more errors in production than you initially expected.
Prejudice or stereotype bias
Prejudice or stereotype bias is often observed in data obtained from historical sources, such as books or photo archives, or from online activity such as social media, online forums,
and comments to online publications. Using a photo archive to train a model that distinguishes men from women might show, for
example, men more frequently in work or outdoor contexts, and women more often at home
indoors. If we use such biased data, our model will have more difficulty recognizing a woman
outdoors or a man at home.
Systematic value distortion
Systematic value distortion is bias usually occurring with the device making measurements
or observations. This results in a machine learning model making suboptimal predictions
when deployed in the production environment.
Experimenter bias
Experimenter bias is the tendency to search for, interpret, favor, or recall information in a
way that affirms one’s prior beliefs or hypotheses. Applied to machine learning, experimenter
bias often occurs when each example in the dataset is obtained from the answers to a survey
given by a particular person, one example per person.
Labeling bias
Labeling bias happens when labels are assigned to unlabeled examples by a biased process
or person.
Low predictive power
Low predictive power is an issue that you often don’t consider until you have spent
fruitless energy trying to train a good model. Does the model underperform because it is not
expressive enough? Does the data not contain enough information from which to learn? You
don’t know.
concept drift.
Concept drift is a fundamental
change in the statistical relationship between the features and the label.
Outliers
Outliers are examples that look dissimilar to the majority of examples from the dataset. It’s
up to the data analyst to define “dissimilar.” Typically, dissimilarity is measured by some
distance metric, such as Euclidean distance.
Data Leakage
Data leakage, also called target leakage, is a problem affecting several stages of the
machine learning life cycle, from data collection to model evaluation. In this section, I will
only describe how this problem manifests itself at the data collection and preparation stages.
In the subsequent chapters, I will describe its other forms.
Summary of Good Data
For the convenience of future reference, let me once again repeat the properties of good data:
* it contains enough information that can be used for modeling,
* it has good coverage of what you want to do with the model,
* it reflects real inputs that the model will see in production,
* it is as unbiased as possible,
* it is not a result of the model itself,
* it has consistent labels, and
* it is big enough to allow generalization.
Dealing With Interaction Data
Interaction data is the data you can collect from user interactions with the system your
model supports. You are considered lucky if you can gather good data from interactions of
the user with the system.
Good interaction data contains information on three aspects:
* context of interaction,
* action of the user in that context, and
* outcome of interaction.
As an example, assume that you build a search engine, and your model reranks search results
for each user individually. A reranking model takes as input the list of links returned by the search engine for the keywords provided by the user, and outputs another list in which the items change order. Usually, a reranking model “knows” something about the user and their preferences and can reorder the generic search results for each user individually according
to that user’s learned preferences. The context here is the search query and the hundred
documents presented to the user in a specific order. The action is a click of the user on
a particular document link. The outcome is how much time the user spent reading the
document and whether the user hit “back.” Another action is the click on the “next page”
link.
three most frequent causes of data leakage that can happen during data
collection and preparation:
Data leakage occurs when information from outside the training dataset is used to create the model. The three most frequent causes are: 1) the target being a function of a feature, 2) a feature hiding the target, and 3) a feature coming from the future.
Data leakage - Target is a Function of a Feature
If you don’t do a careful analysis of each attribute and
its relation to GDP, you might let a leakage happen: in the data in Figure 9, two columns,
Population and GDP per capita, multiplied, equal GDP. The model you will train will
perfectly predict GDP by looking at these two columns only. The fact that you let GDP be
one of the features, though in a slightly modified form (divided by the population), constitutes
contamination and, therefore, leads to data leakage.
Data leakage - Feature Hides the Target
If the data about
a customer’s gender and age is factual (as opposed to being guessed by another model that
might be available in production), then the column Group constitutes a form of data leakage,
where the value you want to predict is “hidden” in the value of a feature.
Data leakage - Feature From the Future
Here is another example. Let’s say you have a news website and you want to predict the
ranking of news you serve to the user, so as to maximize the number of clicks on stories. If
in your training data, you have positional features for each news item served in the past (e.g.,
the x − y position of the title, and the abstract block on the webpage), such information will
not be available at serving time, because you don’t know the positions of articles on the
page before you rank them.
Data Partitioning
The training set is used by the machine learning algorithm to train the model.
The validation set is needed to find the best values for the hyperparameters of the machine
learning pipeline. The analyst tries different combinations of hyperparameter values one by
one, trains a model by using each combination, and notes the model performance on the
validation set. The hyperparameters that maximize the model performance are then used to
train the model for production. We consider techniques of hyperparameter tuning in more
detail in Section ?? of Chapter 5.
The test set is used for reporting: once you have your best model, you test its performance
on the test set and report the results.
To obtain a good partition of your entire dataset into these three disjoint sets (training, validation, and test), the partitioning has to satisfy several conditions.
Condition 1: Split was applied to raw data.
Once you have access to raw examples, and before anything else, do the split. This will help you avoid data leakage, as we will see later.
Condition 2: Data was randomized before the split.
Randomly shuffle your examples first, then do the split.
Condition 3: Validation and test sets follow the same distribution.
When you select the best values of hyperparameters using the validation set, you want this selection to yield a model that works well in production. The examples in the
test set are your best representatives of the production data. Hence the need for the
validation and test sets to follow the same distribution.
Condition 4: Leakage during the split was avoided.
Data leakage can happen even during the data partitioning. Below, we will see what
forms of leakage can happen at that stage.
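A minimal sketch of a three-way split that respects the first three conditions, assuming scikit-learn and stand-in arrays (the 70%/15%/15% ratio and all names are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
raw_X = rng.normal(size=(1000, 5))     # stand-in for raw feature vectors
raw_y = rng.integers(0, 2, size=1000)  # stand-in for labels

# Shuffle and split the raw data first: 70% train, then 15% validation, 15% test.
X_train, X_hold, y_train, y_hold = train_test_split(
    raw_X, raw_y, test_size=0.30, shuffle=True, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, shuffle=True, random_state=42)
print(len(X_train), len(X_val), len(X_test))

Because the validation and test sets are carved out of the same shuffled holdout pool, they follow the same distribution.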
Ratio of data partitioning
There is no ideal ratio for the split. In older literature (pre-big data), you might find the
recommended splits of either 70%/15%/15% or 80%/10%/10% (for training, validation, and
test sets, respectively, in proportion to the entire dataset).
Today, in the era of the Internet and cheap labor (e.g., Mechanical Turk or crowdsourcing),
organizations, scientists, and even enthusiasts at home can get access to millions of training
examples. That makes it wasteful to use only 70% or 80% of the available data for training.
A small dataset of less than a thousand examples would do best with 90% of the data used for training. In this case, you might decide not to have a distinct validation set, and instead simulate one with the cross-validation technique.
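A hedged sketch of that option, assuming scikit-learn (the model and fold count are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=42)

# 5-fold cross-validation: every example is used for both training and
# validation across the folds, so no distinct validation set is needed.
scores = cross_val_score(SVC(C=1.0), X, y, cv=5)
print("mean validation accuracy:", scores.mean())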
Leakage During Partitioning
Group leakage may occur during partitioning. Imagine you have magnetic resonance images
of the brains of multiple patients. Each image is labeled with a certain brain disease, and the
same patient may be represented by several images taken at different times. If you apply the
partitioning technique discussed above (shuffle, then split), images of the same patient might
appear in both the training and holdout data.
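One way to avoid this, sketched here with scikit-learn's GroupShuffleSplit (the patient identifiers and arrays are illustrative stand-ins), is to split by patient so that all images of one patient land on the same side of the split:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))     # stand-in for image features
y = rng.integers(0, 2, size=12)  # stand-in for disease labels
patient_id = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

# All examples sharing a patient_id go either to train or to holdout, never both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, hold_idx = next(splitter.split(X, y, groups=patient_id))
assert not set(patient_id[train_idx]) & set(patient_id[hold_idx])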
Dealing with Missing Attributes
- removing the examples with missing attributes from the dataset (this can be done if your dataset is big enough to safely sacrifice some data);
- using a learning algorithm that can deal with missing attribute values (such as the decision tree learning algorithm);
- using a data imputation technique.
Data Imputation Techniques
-To impute the value of a missing numerical attribute, one technique consists in replacing the
missing value by the average value of this attribute in the rest of the dataset.
-Another technique is to replace the missing value with a value outside the normal range of
values. For example, if the regular range is [0, 1], you can set the missing value to 2 or −1; if
the attribute is categorical, such as days of the week, then a missing value can be replaced
by the value “Unknown.” Here, the learning algorithm learns what to do when the attribute
has a value different from regular values.
-A more advanced technique is to use the missing value as the target variable for a regression
problem.
-Finally, if you have a significantly large dataset and just a few attributes with missing values,
you can add a synthetic binary indicator attribute for each original attribute with missing
values. Let’s say that examples in your dataset are D-dimensional, and attribute at position
j = 12 has missing values. For each example x, you then add the attribute at position
j = D + 1, which is equal to 1 if the value of the attribute at position 12 is present in x and
0 otherwise. The missing value then can be replaced by 0 or any value of your choice.
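A minimal sketch of two of these techniques, assuming scikit-learn (the tiny matrix is illustrative): mean replacement with a binary missingness indicator, and replacement by an out-of-range constant:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[0.2, 7.0],
              [np.nan, 5.0],
              [0.9, np.nan]])

# Replace missing values by the column mean and append a binary indicator
# column for each attribute that had missing values.
mean_imputer = SimpleImputer(strategy="mean", add_indicator=True)
print(mean_imputer.fit_transform(X))

# Alternatively, replace missing values by a value outside the normal range.
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1)
print(constant_imputer.fit_transform(X))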
Leakage During Imputation
If you use the imputation techniques that compute some statistic of one attribute (such as
average) or several attributes (by solving the regression problem), the leakage happens if you
use the whole dataset to compute this statistic. Using all available examples, you contaminate
the training data with information obtained from the validation and test examples.
This type of leakage is not as significant as other types discussed earlier. However, you still
have to be aware of it and avoid it by partitioning first, and then computing the imputation
statistic only on the training set.
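A minimal sketch of the leak-free pattern, assuming scikit-learn (the arrays are stand-ins for sets obtained from a split done before any imputation):

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])
X_val = np.array([[np.nan, 1.0]])
X_test = np.array([[2.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)                  # statistic computed on training data only
X_val_imp = imputer.transform(X_val)  # validation and test reuse the training mean
X_test_imp = imputer.transform(X_test)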
Data Augmentation for Images
In Figure 14, you can see examples of operations that can be easily applied to a given image
to obtain one or more new images: flip, rotation, crop, color shift, noise addition, perspective
change, contrast change, and information loss.
A data augmentation technique that seems counterintuitive, but works very well in practice, is mixup.
As the name suggests, the technique consists of training the
model on a mix of the images from the training set. More precisely, instead of training the
model on the raw images, we take two images (that could be of the same class or not) and
use for training their linear combination:
mixup_image = t × image1 + (1 − t) × image2,
where t is a real number between 0 and 1. The target of that mixup image is a combination
of the original targets obtained using the same value of t:
mixup_target = t × target1 + (1 − t) × target2
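A minimal NumPy sketch for a single pair of images with one-hot targets (the shapes are illustrative; t is drawn uniformly here, although in practice it is often drawn from a Beta distribution):

import numpy as np

rng = np.random.default_rng(0)
image1, image2 = rng.random((28, 28)), rng.random((28, 28))  # stand-in images
target1 = np.array([1.0, 0.0])                               # one-hot labels
target2 = np.array([0.0, 1.0])

t = rng.uniform(0.0, 1.0)  # a real number between 0 and 1
mixup_image = t * image1 + (1 - t) * image2
mixup_target = t * target1 + (1 - t) * target2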
Data Augmentation for Text
When it comes to text data augmentations, it is not as straightforward. We need to use
appropriate transformation techniques to preserve the contextual and grammatical structure
of natural language texts.
-One technique involves replacing random words in a sentence with their close synonyms.
For the sentence, “The car stopped near a shopping mall.” some equivalent sentences are:
“The automobile stopped near a shopping mall.”
“The car stopped near a shopping center.”
“The auto stopped near a mall.”
-A similar technique uses hypernyms instead of synonyms. A hypernym is a word that
has more general meaning. For example, “mammal” is a hypernym for “whale” and “cat”;
“vehicle” is a hypernym for “car” and “bus.” From our example above, we could create the fol-
lowing sentences:
“The vehicle stopped near a shopping mall.”
“The car stopped near a building.”
-A modern alternative to the synonym- and hypernym-replacement techniques described above is to use a deep pre-trained model such as Bidirectional Encoder Representations from Transformers (BERT). Models like BERT are trained to predict a masked word given the other words in a sentence. One can use BERT to generate the k most likely predictions for a masked word and then use them as synonyms for data augmentation (a short sketch follows after this list).
-Another useful text data augmentation technique is back translation. To create a new
example from a text written in English (it can be a sentence or a document), first translate
it into another language l using a machine translation system. Then translate it back from l
into English. If the text obtained through back translation is different from the original text,
you add it to the dataset by assigning the same label as the original text.
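For the BERT-based trick mentioned above, here is a hedged sketch using the Hugging Face transformers library (the library and model must be installed and downloaded separately; the model name and sentence are illustrative):

from transformers import pipeline

# Ask a masked language model for likely replacements of one word.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
candidates = unmasker("The car stopped near a shopping [MASK].")
for c in candidates[:3]:
    print(c["token_str"], round(c["score"], 3))

The top predictions can then be substituted into the original sentence to create new training examples with the same label.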
Class imbalance
Class imbalance is a condition in the data that can significantly affect the performance of
the model, independently of the chosen learning algorithm. The problem is a very uneven
distribution of labels in the training data.
Typically, a machine learning algorithm tries to classify most training examples
correctly. The algorithm is pushed to do so because it needs to minimize a cost function
that typically assigns a positive loss value to each misclassified example. If the loss is the
same for the misclassification of a minority class example as it is for the misclassification of a
majority class, then it’s very likely that the learning algorithm decides to “give up” on many
minority class examples in order to make fewer mistakes in the majority class.
Oversampling
A technique used frequently to mitigate class imbalance is oversampling. By making multi-
ple copies of minority class examples, it increases their weight, as illustrated in Figure 15a.
You might also create synthetic examples by sampling feature values of several examples of the
minority class and combining them to obtain a new example of that class. Two popular algorithms that oversample the minority class by creating synthetic examples are the Synthetic Minority Oversampling Technique (SMOTE) and the Adaptive Synthetic Sampling Method (ADASYN).
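A minimal sketch with the imbalanced-learn library (the synthetic 9:1 dataset is illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# Create synthetic minority-class examples until the classes are balanced.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))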
Undersampling
Undersampling can be done randomly; that is, the examples to remove from the majority
class can be chosen at random. Alternatively, examples to withdraw from the majority class
can be selected based on some property.
Cluster-based undersampling
Cluster-based undersampling works as follows. Decide on the number of examples you want to have in the majority class after undersampling. Let that number be k. Run a centroid-based clustering algorithm on the majority class examples only, with k being the desired number of clusters. Then replace all examples in the majority class with the k centroids. An example of a centroid-based clustering algorithm is k-means.
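A minimal sketch of this idea with k-means from scikit-learn (the data and the target size k are illustrative):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_majority = rng.normal(size=(500, 2))  # stand-in for majority class examples
k = 50                                  # desired size of the undersampled class

# Replace the majority class examples with the k cluster centroids.
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_majority)
X_majority_reduced = kmeans.cluster_centers_
print(X_majority_reduced.shape)  # (50, 2)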
Hybrid Strategies for sampling
You can develop your hybrid strategies (by combining both over- and undersampling) and
possibly get better results. One such strategy consists of using ADASYN to oversample, and
then Tomek links to undersample.
two main data sampling strategies
There are two main strategies: probability sampling and nonprobability sampling. In
probability sampling, all examples have a chance to be selected. These techniques involve
randomness.
Nonprobability sampling is not random. To build a sample, it follows a fixed deterministic
sequence of heuristic actions. This means that some examples don’t have a chance of being
selected, no matter how many samples you build.
The main drawback of nonprobability sampling
methods is that they include non-representative samples and might systematically exclude
important examples. These drawbacks outweigh the possible advantages of nonprobability
sampling methods. Therefore, in this book I will only present probability sampling methods.
Simple random sampling
Simple random sampling is the most straightforward method, and the one I refer to when
I say “sample randomly.” Here, each example from the entire dataset is chosen purely by
chance; each example has an equal chance of being selected.
systematic sampling
To implement systematic sampling (also known as interval sampling), you create a list containing all examples. From that list, you randomly select the first example x_start from the first k elements on the list. Then, you select every k-th item on the list, starting from x_start. You choose a value of k that will give you a sample of the desired size.
An advantage of systematic sampling over simple random sampling is that it draws examples from the whole range of values. However, systematic sampling is inappropriate if the list of examples has periodicity or repetitive patterns; in that case, the obtained sample can exhibit a bias. If, on the other hand, the list of examples is randomized, then systematic sampling often results in a better sample than simple random sampling.
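A minimal NumPy sketch (the dataset size and the desired sample size are illustrative):

import numpy as np

rng = np.random.default_rng(42)
examples = np.arange(1000)        # stand-in for the list of all examples
sample_size = 100
k = len(examples) // sample_size  # interval that yields the desired sample size

start = rng.integers(0, k)        # random start among the first k elements
sample = examples[start::k]       # every k-th element from the start
print(len(sample), sample[:5])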
Stratified Sampling
In stratified
sampling, you first divide your dataset into groups (called strata) and then randomly select
examples from each stratum, like in simple random sampling. The number of examples to
select from each stratum is proportional to the size of the stratum.
Stratified sampling often improves the representativeness of the sample by reducing its bias;
in the worst case, the resulting sample is of no lower quality than the result of simple
random sampling. However, to define strata, the analyst has to understand the properties of
the dataset. Furthermore, it can be difficult to decide which attributes will define the strata.
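A minimal pandas sketch, where the strata are defined by a categorical column and each stratum contributes proportionally (the column names and the 10% fraction are illustrative):

import pandas as pd

df = pd.DataFrame({
    "topic": ["sports"] * 800 + ["politics"] * 150 + ["science"] * 50,
    "value": range(1000),
})

# Draw 10% within each stratum, so strata stay proportionally represented.
sample = df.groupby("topic").sample(frac=0.10, random_state=42)
print(sample["topic"].value_counts())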
what is data serialization
Data serialization is the process of converting complex data structures, such as objects or data collections, into a format that can be easily stored, transmitted, or reconstructed. The serialized data can later be deserialized, which means it’s converted back into its original form, allowing it to be used in the same way as before serialization. Serialization is commonly used in various scenarios, such as when saving data to files, sending data over networks, or storing data in databases.
Serialization is important because it enables data to be transported or stored in a standardized format that can be understood by different systems or programming languages. It also helps preserve the structure and relationships within the data. Different serialization formats exist, each with its own characteristics and use cases.
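A minimal Python sketch of serializing and deserializing a small data structure with the standard json and pickle modules:

import json
import pickle

record = {"id": 42, "features": [0.1, 0.7, 0.2], "label": "cat"}

# JSON: a text format that other languages and systems can read.
text = json.dumps(record)
restored_from_json = json.loads(text)

# pickle: a Python-specific binary format, convenient for Python objects.
blob = pickle.dumps(record)
restored_from_pickle = pickle.loads(blob)

assert restored_from_json == restored_from_pickle == record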
Reproducibility
Reproducibility should be an important concern in everything you do, including data
collection and preparation. You should avoid transforming data manually, or using powerful
tools included in text editors or command line shells, such as regular expressions, “quick and
dirty” ad hoc awk or sed commands, and piped expressions.
Usually, the data collection and transformation activities consist of multiple stages. These
include downloading data from web APIs or databases, replacing multiword expressions by
unique tokens, removing stop-words and noise, cropping and unblurring images, imputation
of missing values, and so on. Each step in this multistage process has to be implemented
as a software script, such as a Python or R script, with its inputs and outputs. If your work is organized this way, you will be able to keep track of all changes in the data.
Data First, Algorithm Second
Remember that in industry, contrary to academia, it’s “data first, algorithm second,” so
focus most of your effort and time on getting more data of wide variety and high quality,
instead of trying to squeeze the maximum out of a learning algorithm.
Data augmentation, when implemented well, will most likely contribute more to the quality
of the model than the search for the best hyperparameter values or model architecture.
To obtain a good partition of your entire dataset into training, validation and test sets, the
process of partitioning has to satisfy several conditions:
1) data was randomized before the
split, 2) split was applied to raw data, 3) validation and test sets follow the same distribution,
and 4) leakage was avoided.
Feature Engineering
Feature engineering is a process of first conceptually and then programmatically transforming
a raw example into a feature vector. It consists of conceptualizing a feature and then writing
the programming code that would transform the entire raw example, potentially with the help of some indirect data, into a feature.
Feature Engineering for Text
When it comes to text, scientists and engineers often use simple feature engineering tricks.
Two such tricks are one-hot encoding and bag-of-words.
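A minimal sketch of both tricks, assuming scikit-learn (the tiny corpus and category list are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Bag-of-words: each document becomes a vector of word counts.
corpus = ["the car stopped near a mall",
          "the car stopped near a shopping center"]
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# One-hot encoding: each categorical value becomes its own binary feature.
colors = [["red"], ["green"], ["blue"], ["green"]]
onehot = OneHotEncoder()
print(onehot.fit_transform(colors).toarray())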
Mean encoding,
Mean encoding, also known as bin counting or feature calibration, is another technique.
First, the sample mean of the label is calculated using all examples where the feature has
value z. Each value z of the categorical feature is then replaced by that sample mean value.
The advantage of this technique is that the data dimensionality doesn’t increase, and by
design, the numerical value contains some information about the label.
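A minimal pandas sketch (the column names are illustrative; to avoid leakage, compute the means on the training set only and reuse them for the holdout sets):

import pandas as pd

train = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "label": [1, 0, 1, 1, 1, 0],
})

# Replace each categorical value by the sample mean of the label for that value.
means = train.groupby("city")["label"].mean()
train["city_mean_enc"] = train["city"].map(means)
print(train)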
sine-cosine transformation.
It converts a cyclical feature into two
synthetic features. Let p denote the integer value of our cyclical feature. Replace the value p
of the cyclical feature with the following two values:
p_sin = sin(2 × π × p / max(p)),   p_cos = cos(2 × π × p / max(p))
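A minimal pandas/NumPy sketch for a day-of-week feature taking values 1 through 7 (the column names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"day_of_week": [1, 2, 3, 4, 5, 6, 7]})
p = df["day_of_week"]

# Two synthetic features that preserve the cyclical nature of the attribute.
df["day_sin"] = np.sin(2 * np.pi * p / p.max())
df["day_cos"] = np.cos(2 * np.pi * p / p.max())
print(df)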