Machine learning engineering (book) Flashcards

1
Q

Reinforcement Learning

A

Reinforcement learning is a subfield of machine learning where the machine (called an
agent) “lives” in an environment and is capable of perceiving the state of that environment
as a vector of features. The machine can execute actions in non-terminal states. Different
actions bring different rewards and could also move the machine to another state of the
environment. A common goal of a reinforcement learning algorithm is to learn an optimal
policy.

An optimal policy is a function (similar to the model in supervised learning) that takes the
feature vector of a state as input and outputs an optimal action to execute in that state. The
action is optimal if it maximizes the expected average long-term reward.

2
Q

Tidy data

A

Tidy data can be seen as a spreadsheet, in which each row represents one
example, and columns represent various attributes of an example, as shown in Figure 3.
Sometimes raw data can be tidy, e.g., provided to you in the form of a spreadsheet. However,
in practice, to obtain tidy data from raw data, data analysts often resort to the procedure
called feature engineering, which is applied to the direct and, optionally, indirect data
with the goal of transforming each raw example into a feature vector x. Chapter 4 is devoted
entirely to feature engineering.

3
Q

training, validation, and test datasets

A

training, validation, and test. The training set is usually the biggest one;
the learning algorithm uses the training set to produce the model. The validation and test
sets are roughly the same size, much smaller than the size of the training set. The learning
algorithm is not allowed to use examples from the validation or test sets to train the model.
That is why those two sets are also called holdout sets.

4
Q

Baseline

A

In machine learning, a baseline is a simple algorithm for solving a problem, usually based
on a heuristic, simple summary statistics, randomization, or a very basic machine learning
algorithm. For example, if your problem is classification, you can pick a baseline classifier
and measure its performance. This baseline performance then becomes the reference against
which you compare any future model (usually one built using a more sophisticated approach).

5
Q

Machine Learning Pipeline

A

A machine learning pipeline is a sequence of operations on the dataset that goes from its
initial state to the model.

A pipeline can include, among others, such stages as data partitioning, missing data imputation,
feature extraction, data augmentation, class imbalance reduction, dimensionality reduction,
and model training.
In practice, when we deploy a model in production, we usually deploy an entire pipeline.
Furthermore, an entire pipeline is usually optimized when hyperparameters are tuned.

6
Q

Hyperparameters

A

Hyperparameters are inputs of machine learning algorithms or pipelines that influence
the performance of the model. They don’t belong to the training data and cannot be
learned from it. For example, the maximum depth of the tree in the decision tree learning
algorithm, the misclassification penalty in support vector machines, k in the k-nearest
neighbors algorithm, the target dimensionality in dimensionality reduction, and the choice of
the missing data imputation technique are all examples of hyperparameters.

7
Q

Parameters

A

Parameters, on the other hand, are variables that define the model trained by the learning
algorithm. Parameters are directly modified by the learning algorithm based on the training
data. The goal of learning is to find such values of parameters that make the model optimal
in a certain sense. Examples of parameters are w and b in the equation of linear regression
y = wx + b. In this equation, x is the input of the model, and y is its output (the prediction).
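
For illustration, here is a minimal sketch of parameters being learned for y = wx + b, assuming scikit-learn is available; the data values and variable names are made up:

    # Minimal sketch: parameters vs. hyperparameters, assuming scikit-learn is installed.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])   # inputs x
    y = np.array([2.1, 3.9, 6.2, 8.1])           # targets

    model = LinearRegression()   # hyperparameters (e.g., fit_intercept) are set here, before training
    model.fit(X, y)

    # w and b are the parameters learned from the training data
    w, b = model.coef_[0], model.intercept_
    print(f"y = {w:.2f} * x + {b:.2f}")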

8
Q

Model-Based

A

Most supervised learning algorithms are model-based. A typical model is a support
vector machine (SVM). Model-based learning algorithms use the training data to create a
model with parameters learned from the training data. In SVM, the two parameters are w
(a vector) and b (a real number). After the model is trained, it can be saved on disk while
the training data can be discarded.

9
Q

Instance-based learning algorithms

A

Instance-based learning algorithms use the whole dataset as the model. One instance-based
algorithm frequently used in practice is k-Nearest Neighbors (kNN). In classification, to
predict a label for an input example, the kNN algorithm looks at the close neighborhood
of the input example in the space of feature vectors and outputs the label that it saw most
often in this close neighborhood.

10
Q

Shallow vs. Deep Learning

A

A shallow learning algorithm learns the parameters of the model directly from the features of
the training examples. A deep learning algorithm trains a neural network with more than one
layer between input and output; most of its parameters are learned not directly from the
features of the training examples, but from the outputs of the preceding layers.

11
Q

Training vs. Scoring

A

When we apply a learning algorithm to a dataset in order to obtain a model, we talk about
training. When we apply a trained model to an input example (or, sometimes, a sequence of
examples) in order to obtain a prediction (or predictions) or to somehow transform an input,
we talk about scoring.

12
Q

When to Use Machine Learning

A

When the Problem Is Too Complex for Coding
When the Problem Is Constantly Changing
When It Is a Perceptive Problem - image recognition…
When It Is an Unstudied Phenomenon
When the Problem Has a Simple Objective
When It Is Cost-Effective

13
Q

When Not to Use Machine Learning

A
  • every action of the system or a decision made by it must be explainable,
  • every change in the system’s behavior compared to its past behavior in a similar
    situation must be explainable,
  • the cost of an error made by the system is too high,
  • you want to get to the market as fast as possible,
  • getting the right data is too hard or impossible,
  • you can solve the problem using traditional software development at a lower cost,
  • a simple heuristic would work reasonably well,
  • the phenomenon has too many outcomes while you cannot get a sufficient amount of
    examples to represent them (like in video games or word processing software),
  • you build a system that will not have to be improved frequently over time,
  • you can manually fill an exhaustive lookup table by providing the expected output
    for any input (that is, the number of possible input values is not too large, or getting
    outputs is fast and cheap).
14
Q

Machine learning engineering

A

Machine learning engineering (MLE) is the use of scientific principles, tools, and techniques of
machine learning and traditional software engineering to design and build complex computing
systems. MLE encompasses all stages from data collection, to model training, to making the
model available for use by the product or the customers.
In other words, MLE includes any activity that lets machine learning algorithms be
implemented as a part of an effective production system.

15
Q

Three factors highly influence the cost of a machine learning project

A
  • the difficulty of the problem,
  • the cost of data, and
  • the need for accuracy.
16
Q

Defining the Goal of a Machine Learning Project

A

The goal of a machine learning project is to build a model that solves, or helps solve, a
business problem. Within a project, the model is often seen as a black box described by the
structure of its input (or inputs) and output (or outputs), and the minimum acceptable level
of performance (as measured by accuracy of prediction or another performance metric).

17
Q

What a Model Can Do

A
  • automate (for example, by taking action on the user’s behalf or by starting or stopping
    a specific activity on a server),
  • alert or prompt (for example, by asking the user if an action should be taken or by
    asking a system administrator if the traffic seems suspicious),
  • organize, by presenting a set of items in an order that might be useful for a user (for
    example, by sorting pictures or documents in the order of similarity to a query or
    according to the user’s preferences),
  • annotate (for instance, by adding contextual annotations to displayed information, or
    by highlighting, in a text, phrases relevant to the user’s task),
  • extract (for example, by detecting smaller pieces of relevant information in a larger
    input, such as named entities in the text: proper names, companies, or locations),
  • recommend (for example, by detecting and showing to a user highly relevant items in a
    large collection based on item’s content or user’s reaction to the past recommendations),
  • classify (for example, by dispatching input examples into one, or several, of a predefined
    set of distinctly-named groups),
  • quantify (for example, by assigning a number, such as a price, to an object, such
    as a house),
  • synthesize (for example, by generating new text, image, sound, or another object similar
    to the objects in a collection),
  • answer an explicit question (for example, “Does this text describe that image?” or
    “Are these two images similar?”),
  • transform its input (for example, by reducing its dimensionality for visualization
    purposes, paraphrasing a long text as a short abstract, translating a sentence into
    another language, or augmenting an image by applying a filter to it),
  • detect a novelty or an anomaly.
18
Q

Properties of a Successful Model

A
  • it respects the input and output specifications and the performance requirement,
  • it benefits the organization (measured via cost reduction, increased sales or profit),
  • it helps the user (measured via productivity, engagement, and sentiment),
  • it is scientifically rigorous.
19
Q

Structuring a Machine Learning Team

A

Two Cultures

One culture says that a machine learning team has to be composed of data analysts who
collaborate closely with software engineers. In such a culture, a software engineer doesn’t
need to have deep expertise in machine learning, but has to understand the vocabulary of
their fellow data analysts.

According to the other culture, all engineers in a machine learning team must have a combination
of machine learning and software engineering skills.

20
Q

Data engineers

A

Data engineers are software engineers responsible for ETL (for Extract, Transform, Load).
These three conceptual steps are part of a typical data pipeline. Data engineers use ETL
techniques and create an automated pipeline, in which raw data is transformed into
analysis-ready data. Data engineers design how to structure the data and how to integrate it
from various resources. They write on-demand queries on that data, or wrap the most frequent
queries into fast application programming interfaces (APIs) to make sure that the data is
easily accessible by analysts and other data consumers. Typically, data engineers are not
expected to know any machine learning.

21
Q

labeler

A

A labeler is a person responsible for assigning labels to unlabeled examples. Again, in big
companies, data labeling experts may be organized in two or three different teams: one or
two teams of labelers (for example, one local and one outsourced) and a team of software
engineers, plus a user experience (UX) specialist, responsible for building labeling tools.

22
Q

machine learning projects can fail for many reasons

A
  • lack of experienced talent,
  • lack of support by the leadership,
  • missing data infrastructure,
  • data labeling challenge,
  • siloed organizations and lack of collaboration,
  • technically infeasible projects, and
  • lack of alignment between technical and business teams.
23
Q

Is the Data Sizeable?

A

Check whether the number of examples is big enough. There are some rules of thumb:
* 10 times the number of features (this often exaggerates the size of the training set, but
works well as an upper bound),
* 100 or 1000 times the number of classes (this often underestimates the size), or
* ten times the number of trainable parameters (usually applied to neural networks).

Keep in mind that just because you have big data does not mean that you should use all of
it. A smaller sample of big data can give good results in practice and accelerate the search
for a better model. It’s important to ensure, though, that the sample is representative of the
whole big dataset. Sampling strategies such as stratified and systematic sampling can
lead to better results. We consider data sampling strategies in Section 3.10.

24
Q

data leakage (also known as target leakage)

A

What happened is called data leakage (also known as target leakage). After a more
careful examination of the dataset, you realize that one of the columns in the spreadsheet
contained the real estate agent’s commission. Of course, the model easily learned to convert
this attribute into the house price perfectly. However, this information is not available in the
production environment before the house is sold, because the commission depends on the
selling price. In Section 3.2.8, we will consider the problem of data leakage in more detail.

25
Common Problems With Data
High Cost - Getting unlabeled data can be expensive; however, labeling data is the most expensive work, especially if the work is done manually.
26
Bias? And its types?
Bias in data is an inconsistency with the phenomenon that data represents. This inconsistency may occur for a number of reasons (which are not mutually exclusive): selection bias, self-selection bias, omitted variable bias, sponsorship or funding bias, sampling bias, prejudice or stereotype bias, systematic value distortion, experimenter bias, and labeling bias.
27
Selection bias
Selection bias is the tendency to skew your choice of data sources to those that are easily available, convenient, and/or cost-effective. For example, you might want to know the opinion of the readers on your new book; if you only survey the readers who are easiest to reach, such as friends or online followers, their opinions will not represent the whole readership.
28
Self-selection bias
Self-selection bias is a form of selection bias where you get the data from sources that “volunteered” to provide it. Most poll data has this type of bias. For example, you want to train a model that predicts the behavior of successful entrepreneurs. You decide to first ask entrepreneurs whether they are successful or not. Then you only keep the data obtained from those who declared themselves successful. The problem here is that most likely, really successful entrepreneurs don’t have time to answer your questions, while those who claim themselves successful can be wrong on that matter.
29
Omitted variable bias
Omitted variable bias happens when your featurized data doesn’t have a feature necessary for accurate prediction. For example, let’s assume that you are working on a churn prediction model and you want to predict whether a customer cancels their subscription within six months. You train a model, and it’s accurate enough; however, several weeks after deployment you see many unexpected false negatives. You investigate the decreased model performance and discover a new competitor now offers a very similar service for a lower price. This feature wasn’t initially available to your model, therefore important information for accurate prediction was missing.
30
Sponsorship or funding bias
Sponsorship or funding bias affects the data produced by a sponsored agency. For example, suppose a famous video game company sponsors a news agency to provide news about the video game industry. If you try to make a prediction about the video game industry, you might include in your data the stories produced by this sponsored agency.
31
Sampling bias (also known as distribution shift)
occurs when the distribution of examples used for training doesn’t reflect the distribution of the inputs the model will receive in production. This type of bias is frequently observed in practice. For example, you are working on a system that classifies documents according to a taxonomy of several hundred topics. You might decide to create a collection of documents in which an equal amount of documents represents each topic. Once you finish the work on the model, you observe 5% error. Soon after deployment, you see the wrong assignment to about 30% of documents. Why did this happen? One of the possible reasons is sampling bias: one or two frequent topics in production data might account for 80% of all input. If your model doesn’t perform well for these frequent topics, then your system will make more errors in production than you initially expected.
32
Prejudice or stereotype bias
Prejudice or stereotype bias is often observed in data obtained from historical sources, such as books or photo archives, or from online activity such as social media, online forums, and comments to online publications. Using a photo archive to train a model that distinguishes men from women might show, for example, men more frequently in work or outdoor contexts, and women more often at home indoors. If we use such biased data, our model will have more difficulty recognizing a woman outdoors or a man at home.
33
Systematic value distortion
Systematic value distortion is a bias usually introduced by the device making measurements or observations: the measured values systematically deviate from the true values. This can result in a machine learning model making suboptimal predictions when deployed in the production environment.
34
Experimenter bias
Experimenter bias is the tendency to search for, interpret, favor, or recall information in a way that affirms one’s prior beliefs or hypotheses. Applied to machine learning, experimenter bias often occurs when each example in the dataset is obtained from the answers to a survey given by a particular person, one example per person.
35
Labeling bias
Labeling bias happens when labels are assigned to unlabeled examples by a biased process or person.
36
Low predictive power
Low predictive power is an issue that you often don’t consider until you have spent fruitless energy trying to train a good model. Does the model underperform because it is not expressive enough? Does the data not contain enough information from which to learn? You don’t know.
37
concept drift.
Concept drift is a fundamental change in the statistical relationship between the features and the label.
38
Outliers
Outliers are examples that look dissimilar to the majority of examples from the dataset. It’s up to the data analyst to define “dissimilar.” Typically, dissimilarity is measured by some distance metric, such as Euclidean distance.
39
Data Leakage
Data leakage, also called target leakage, is a problem affecting several stages of the machine learning life cycle, from data collection to model evaluation. In this section, I will only describe how this problem manifests itself at the data collection and preparation stages. In the subsequent chapters, I will describe its other forms.
40
Summary of Good Data
For the convenience of future reference, let me once again repeat the properties of good data: * it contains enough information that can be used for modeling, * it has good coverage of what you want to do with the model, * it reflects real inputs that the model will see in production, * it is as unbiased as possible, * it is not a result of the model itself, * it has consistent labels, and * it is big enough to allow generalization.
41
Dealing With Interaction Data
Interaction data is the data you can collect from user interactions with the system your model supports. You are considered lucky if you can gather good data from interactions of the user with the system. Good interaction data contains information on three aspects: * context of interaction, * action of the user in that context, and * outcome of interaction. As an example, assume that you build a search engine, and your model reranks search results for each user individually. A reranking model takes as input the list of links returned by the search engine, based on the keywords provided by the user, and outputs another list in which the items change order. Usually, a reranking model “knows” something about the user and their preferences and can reorder the generic search results for each user individually according to that user’s learned preferences. The context here is the search query and the hundred documents presented to the user in a specific order. The action is a click of the user on a particular document link. The outcome is how much time the user spent reading the document and whether the user hit “back.” Another action is the click on the “next page” link.
42
three most frequent causes of data leakage that can happen during data collection and preparation:
Data leakage is when information from outside the training dataset is used to create the model. The three most frequent causes during data collection and preparation are: 1) the target being a function of a feature, 2) a feature hiding the target, and 3) a feature coming from the future.
43
Data leakage - Target is a Function of a Feature
If you don’t do a careful analysis of each attribute and its relation to GDP, you might let a leakage happen: in the data in Figure 9, two columns, Population and GDP per capita, multiplied, equal GDP. The model you will train will perfectly predict GDP by looking at these two columns only. The fact that you let GDP be one of the features, though in a slightly modified form (divided by the population), constitutes contamination and, therefore, leads to data leakage.
44
Data leakage - Feature Hides the Target
If the data about a customer’s gender and age is factual (as opposed to being guessed by another model that might be available in production), then the column Group constitutes a form of data leakage, when the value you want to predict is “hidden” in the value of a feature.
45
Data leakage - Feature From the Future
Here is another example. Let’s say you have a news website and you want to predict the ranking of news you serve to the user, so as to maximize the number of clicks on stories. If, in your training data, you have positional features for each news item served in the past (e.g., the x − y position of the title, and the abstract block on the webpage), such information will not be available at serving time, because you don’t know the positions of articles on the page before you rank them.
46
Data Partitioning
The training set is used by the machine learning algorithm to train the model. The validation set is needed to find the best values for the hyperparameters of the machine learning pipeline. The analyst tries different combinations of hyperparameter values one by one, trains a model by using each combination, and notes the model performance on the validation set. The hyperparameters that maximize the model performance are then used to train the model for production. We consider techniques of hyperparameter tuning in more detail in Section ?? of Chapter 5. The test set is used for reporting: once you have your best model, you test its performance on the test set and report the results.
47
To obtain good partitions of your entire dataset into these three disjoint sets (test, val, train) partitioning has to satisfy several conditions.
Condition 1: Split was applied to raw data. Once you have access to raw examples, and before everything else, do the split. This will allow avoiding data leakage, as we will see later. Condition 2: Data was randomized before the split. Randomly shuffle your examples first, then do the split. Condition 3: Validation and test sets follow the same distribution. When you select the best values of hyperparameters using the validation set, you want that this selection yields a model that works well in production. The examples in the test set are your best representatives of the production data. Hence the need for the validation and test sets to follow the same distribution. Condition 4: Leakage during the split was avoided. Data leakage can happen even during the data partitioning. Below, we will see what forms of leakage can happen at that stage.
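A minimal sketch of the shuffle-then-split procedure described above, assuming NumPy is available; the 70%/15%/15% ratio and the variable names are only illustrative:

    import numpy as np

    def partition(examples, seed=42, train_frac=0.70, val_frac=0.15):
        """Shuffle raw examples, then split into training, validation, and test sets."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(examples))            # randomize before the split
        n_train = int(train_frac * len(examples))
        n_val = int(val_frac * len(examples))
        train = [examples[i] for i in idx[:n_train]]
        val = [examples[i] for i in idx[n_train:n_train + n_val]]
        test = [examples[i] for i in idx[n_train + n_val:]]
        return train, val, test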
48
Ratio of data partitioning
There is no ideal ratio for the split. In older literature (pre-big data), you might find the recommended splits of either 70%/15%/15% or 80%/10%/10% (for training, validation, and test sets, respectively, in proportion to the entire dataset). Today, in the era of the Internet and cheap labor (e.g., Mechanical Turk or crowdsourcing), organizations, scientists, and even enthusiasts at home can get access to millions of training examples. That makes it wasteful to use only 70% or 80% of the available data for training. A small dataset of less than a thousand examples would do best with 90% of the data used for training. In this case, you might decide to not have a distinct validation set, and instead simulate it with the cross-validation technique.
49
Leakage During Partitioning
Group leakage may occur during partitioning. Imagine you have magnetic resonance images of the brains of multiple patients. Each image is labeled with a certain brain disease, and the same patient may be represented by several images taken at different times. If you apply the partitioning technique discussed above (shuffle, then split), images of the same patient might appear in both the training and holdout data. As a result, the model can partially learn to recognize the patient rather than the disease, and its performance on the holdout data will be optimistically biased.
50
Dealing with Missing Attributes
* removing the examples with missing attributes from the dataset (this can be done if your dataset is big enough to safely sacrifice some data); * using a learning algorithm that can deal with missing attribute values (such as the decision tree learning algorithm); * using a data imputation technique.
51
Data Imputation Techniques
-To impute the value of a missing numerical attribute, one technique consists in replacing the missing value by the average value of this attribute in the rest of the dataset. -Another technique is to replace the missing value with a value outside the normal range of values. For example, if the regular range is [0, 1], you can set the missing value to 2 or −1; if the attribute is categorical, such as days of the week, then a missing value can be replaced by the value “Unknown.” Here, the learning algorithm learns what to do when the attribute has a value different from regular values. -A more advanced technique is to use the missing value as the target variable for a regression problem. -Finally, if you have a significantly large dataset and just a few attributes with missing values, you can add a synthetic binary indicator attribute for each original attribute with missing values. Let’s say that examples in your dataset are D-dimensional, and attribute at position j = 12 has missing values. For each example x, you then add the attribute at position j = D + 1, which is equal to 1 if the value of the attribute at position 12 is present in x and 0 otherwise. The missing value then can be replaced by 0 or any value of your choice.
52
Leakage During Imputation
If you use the imputation techniques that compute some statistic of one attribute (such as average) or several attributes (by solving the regression problem), the leakage happens if you use the whole dataset to compute this statistic. Using all available examples, you contaminate the training data with information obtained from the validation and test examples. This type of leakage is not as significant as other types discussed earlier. However, you still have to be aware of it and avoid it by partitioning first, and then computing the imputation statistic only on the training set.
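A minimal sketch of leakage-free imputation, assuming scikit-learn is available: the imputation statistic is computed on the training set only and then reused for the holdout data. The toy matrices are made up.

    import numpy as np
    from sklearn.impute import SimpleImputer

    # Hypothetical feature matrices obtained after partitioning the raw data.
    X_train = np.array([[1.0, 7.0], [np.nan, 3.0], [3.0, np.nan]])
    X_val = np.array([[np.nan, 5.0]])

    imputer = SimpleImputer(strategy="mean")
    X_train_imp = imputer.fit_transform(X_train)   # statistic computed on training data only
    X_val_imp = imputer.transform(X_val)           # the same training-set means reused here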
53
Data Augmentation for Images
In Figure 14, you can see examples of operations that can be easily applied to a given image to obtain one or more new images: flip, rotation, crop, color shift, noise addition, perspective change, contrast change, and information loss.
54
data augmentation that seems counterintuitive, but works very well in practice, is mixup.
As the name suggests, the technique consists of training the model on a mix of the images from the training set. More precisely, instead of training the model on the raw images, we take two images (that could be of the same class or not) and use for training their linear combination: mixup_image = t × image1 + (1 − t) × image2 , where t is a real number between 0 and 1. The target of that mixup image is a combination of the original targets obtained using the same value of t: mixup_target = t × target1 + (1 − t) × target2
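A minimal sketch of mixup on two images represented as NumPy arrays; the one-hot targets and the Beta-distributed t are illustrative choices, not prescribed by this card:

    import numpy as np

    def mixup(image1, target1, image2, target2, t):
        """Linearly combine two training images and their (one-hot) targets."""
        mix_image = t * image1 + (1.0 - t) * image2
        mix_target = t * target1 + (1.0 - t) * target2
        return mix_image, mix_target

    # t is a real number in [0, 1]; in practice it is often drawn from a Beta distribution.
    t = np.random.beta(0.2, 0.2)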
55
Data Augmentation for Text
When it comes to text data augmentations, it is not as straightforward. We need to use appropriate transformation techniques to preserve the contextual and grammatical structure of natural language texts. -One technique involves replacing random words in a sentence with their close synonyms. For the sentence, “The car stopped near a shopping mall.” some equivalent sentences are: “The automobile stopped near a shopping mall.” “The car stopped near a shopping center.” “The auto stopped near a mall.” -A similar technique uses hypernyms instead of synonyms. A hypernym is a word that has more general meaning. For example, “mammal” is a hypernym for “whale” and “cat”; “vehicle” is a hypernym for “car” and “bus.” From our example above, we could create the fol- lowing sentences: “The vehicle stopped near a shopping mall.” “The car stopped near a building.” -A modern alternative to the k-nearest-neighbors approach described above is to use a deep pre-trained model such as Bidirectional Encoder Representations from Transformers (BERT). Models like BERT are trained to predict a masked word given other words in a sentence. One can use BERT to generate k most likely predictions for a masked word and then use them as synonyms for data augmentation. -Another useful text data augmentation technique is back translation. To create a new example from a text written in English (it can be a sentence or a document), first translate it into another language l using a machine translation system. Then translate it back from l into English. If the text obtained through back translation is different from the original text, you add it to the dataset by assigning the same label as the original text.
56
Class imbalance
Class imbalance is a condition in the data that can significantly affect the performance of the model, independently of the chosen learning algorithm. The problem is a very uneven distribution of labels in the training data. Typically, a machine learning algorithm tries to classify most training examples correctly. The algorithm is pushed to do so because it needs to minimize a cost function that typically assigns a positive loss value to each misclassified example. If the loss is the same for the misclassification of a minority class example as it is for the misclassification of a majority class, then it’s very likely that the learning algorithm decides to “give up” on many minority class examples in order to make fewer mistakes in the majority class.
57
Oversampling
A technique used frequently to mitigate class imbalance is oversampling. By making multiple copies of minority class examples, it increases their weight, as illustrated in Figure 15a. You might also create synthetic examples by sampling feature values of several examples of the minority class and combining them to obtain a new example of that class. Two popular algorithms that oversample the minority class by creating synthetic examples are the Synthetic Minority Oversampling Technique (SMOTE) and the Adaptive Synthetic Sampling Method (ADASYN).
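A minimal sketch of SMOTE, assuming the imbalanced-learn package is installed; the toy dataset and class weights are illustrative:

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Toy imbalanced dataset: roughly 95% majority class, 5% minority class.
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
    print(Counter(y))                         # e.g. {0: 947, 1: 53}

    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y_res))                     # minority class oversampled to match the majority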
58
Undersampling
The undersampling can be done randomly; that is, the examples to remove from the majority class can be chosen at random. Alternatively, examples to withdraw from the majority class can be selected based on some property.
59
Cluster-based undersampling
Cluster-based undersampling works as follows. Decide on the number of examples you want to have in the majority class after undersampling. Let that number be k. Run a centroid-based clustering algorithm on the majority class examples only, with k being the desired number of clusters. Then replace all examples in the majority class with the k centroids. An example of a centroid-based clustering algorithm is k-means.
60
Hybrid Strategies for sampling
You can develop your hybrid strategies (by combining both over- and undersampling) and possibly get better results. One such strategy consists of using ADASYN to oversample, and then Tomek links to undersample.
61
two main data sampling strategies
There are two main strategies: probability sampling and nonprobability sampling. In probability sampling, all examples have a chance to be selected. These techniques involve randomness. Nonprobability sampling is not random. To build a sample, it follows a fixed deterministic sequence of heuristic actions. This means that some examples don’t have a chance of being selected, no matter how many samples you build. The main drawback of nonprobability sampling methods is that they include non-representative samples and might systematically exclude important examples. These drawbacks outweigh the possible advantages of nonprobability sampling methods. Therefore, in this book I will only present probability sampling methods.
62
Simple random sampling
Simple random sampling is the most straightforward method, and the one I refer to when I say “sample randomly.” Here, each example from the entire dataset is chosen purely by chance; each example has an equal chance of being selected.
63
systematic sampling
To implement systematic sampling (also known as interval sampling), you create a list containing all examples. From that list, you randomly select the first example x_start from the first k elements on the list. Then, you select every kth item on the list, starting from x_start. You choose a value of k that gives you a sample of the desired size. An advantage of systematic sampling over simple random sampling is that it draws examples from the whole range of values. However, systematic sampling is inappropriate if the list of examples has periodicity or repetitive patterns; in that case, the obtained sample can exhibit a bias. If, on the other hand, the list of examples is randomized, then systematic sampling often results in a better sample than simple random sampling.
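A minimal sketch of systematic sampling; the function name and the choice of k are illustrative:

    import random

    def systematic_sample(examples, k, seed=0):
        """Pick a random start within the first k items, then take every kth item."""
        random.seed(seed)
        start = random.randrange(k)
        return examples[start::k]

    # Example: a sample of roughly len(data) / 5 items.
    data = list(range(100))
    sample = systematic_sample(data, k=5)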
64
Stratified Sampling
In stratified sampling, you first divide your dataset into groups (called strata) and then randomly select examples from each stratum, like in simple random sampling. The number of examples to select from each stratum is proportional to the size of the stratum. Stratified sampling often improves the representativeness of the sample by reducing its bias; in the worst of cases, the resulting sample is of no less quality than the results of simple random sampling. However, to define strata, the analyst has to understand the properties of the dataset. Furthermore, it can be difficult to decide which attributes will define the strata.
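A minimal sketch of a stratified split, assuming scikit-learn is available and that the class labels y define the strata; each class keeps its proportion in both parts:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Take a 20% sample while preserving the class proportions in both parts.
    X_sample, X_rest, y_sample, y_rest = train_test_split(
        X, y, train_size=0.2, stratify=y, random_state=0)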
65
what is data serialization
Data serialization is the process of converting complex data structures, such as objects or data collections, into a format that can be easily stored, transmitted, or reconstructed. The serialized data can later be deserialized, which means it's converted back into its original form, allowing it to be used in the same way as before serialization. Serialization is commonly used in various scenarios, such as when saving data to files, sending data over networks, or storing data in databases. Serialization is important because it enables data to be transported or stored in a standardized format that can be understood by different systems or programming languages. It also helps preserve the structure and relationships within the data. Different serialization formats exist, each with its own characteristics and use cases.
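A minimal sketch of serializing and deserializing a simple record, here with Python's standard json module; the record and format choice are only illustrative:

    import json

    example = {"height": 172.5, "color": "red", "labels": [1, 0, 1]}

    serialized = json.dumps(example)    # object -> string that can be stored or transmitted
    restored = json.loads(serialized)   # string -> equivalent object
    assert restored == example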
66
Reproducibility
Reproducibility should be an important concern in everything you do, including data collection and preparation. You should avoid transforming data manually, or using powerful tools included in text editors or command line shells, such as regular expressions, “quick and dirty” ad hoc awk or sed commands, and piped expressions. Usually, the data collection and transformation activities consist of multiple stages. These include downloading data from web APIs or databases, replacing multiword expressions by unique tokens, removing stop-words and noise, cropping and unblurring images, imputation of missing values, and so on. Each step in this multistage process has to be implemented as a software script, such as Python or R script with their inputs and outputs. If you are organized like that in your work, it will allow you to keep track of all changes in the data.
67
Data First, Algorithm Second
Remember that in the industry, contrary to academia, “data first, algorithm second,” so focus most of your effort and time on getting more data of wide variety and high quality, instead of trying to squeeze the maximum out of a learning algorithm. Data augmentation, when implemented well, will most likely contribute more to the quality of the model than the search for the best hyperparameter values or model architecture.
68
To obtain a good partition of your entire dataset into training, validation and test sets, the process of partitioning has to satisfy several conditions:
1) data was randomized before the split, 2) split was applied to raw data, 3) validation and test sets follow the same distribution, and 4) leakage was avoided.
69
Feature Engineering
Feature engineering is the process of first conceptually and then programmatically transforming a raw example into a feature vector. It consists of conceptualizing a feature and then writing the programming code that transforms the entire raw example, potentially with the help of some indirect data, into a feature.
70
Feature Engineering for Text
When it comes to text, scientists and engineers often use simple feature engineering tricks. Two such tricks are one-hot encoding and bag-of-words.
71
Mean encoding,
Mean encoding, also known as bin counting or feature calibration, is another technique. First, the sample mean of the label is calculated using all examples where the feature has value z. Each value z of the categorical feature is then replaced by that sample mean value. The advantage of this technique is that the data dimensionality doesn’t increase, and by design, the numerical value contains some information about the label.
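A minimal sketch of mean encoding with pandas, assuming a binary label; to avoid leakage, the per-value means are computed on the training split only (the city names are made up):

    import pandas as pd

    train = pd.DataFrame({"city": ["Rome", "Rome", "Oslo", "Oslo", "Oslo"],
                          "label": [1, 0, 1, 1, 0]})
    test = pd.DataFrame({"city": ["Rome", "Oslo", "Paris"]})

    means = train.groupby("city")["label"].mean()     # sample mean of the label per value
    train["city_enc"] = train["city"].map(means)
    # Values unseen in training fall back to the global label mean.
    test["city_enc"] = test["city"].map(means).fillna(train["label"].mean())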
72
sine-cosine transformation.
It converts a cyclical feature into two synthetic features. Let p denote the integer value of our cyclical feature. Replace the value p of the cyclical feature with the following two values: p_sin = sin(2 × π × p / max(p)) and p_cos = cos(2 × π × p / max(p)).
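A minimal sketch for a cyclical day-of-week feature encoded as 1-7, so max(p) = 7; the helper name is illustrative:

    import numpy as np

    def sine_cosine(p, p_max):
        """Encode a cyclical integer feature as two continuous features on a circle."""
        angle = 2.0 * np.pi * p / p_max
        return np.sin(angle), np.cos(angle)

    d_sin, d_cos = sine_cosine(7, 7)   # day 7 and day 1 become neighbors on the circle,
                                       # unlike in the raw encoding where they are 6 apart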
73
Feature Hashing
Feature hashing, or hashing trick, converts text data, or categorical attributes with many values, into a feature vector of arbitrary dimensionality. One-hot encoding and bag-of-words have a drawback: many unique values will create high-dimensional feature vectors. using a hash function, you first convert all values of your categorical attribute (or all tokens in your collection of documents) into a number, and then you convert this number into an index of your feature vector.
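A minimal sketch of the hashing trick for a bag of tokens, using a standard hash function; the dimensionality of 10 is arbitrary:

    import hashlib

    def hash_index(token, dim=10):
        """Map a token to a stable index in [0, dim) via a hash function."""
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        return int(digest, 16) % dim

    def hash_features(tokens, dim=10):
        """Count tokens into a fixed-size feature vector, regardless of vocabulary size."""
        vec = [0] * dim
        for token in tokens:
            vec[hash_index(token, dim)] += 1
        return vec

    print(hash_features("the cat sat on the mat".split()))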
74
Topic Modeling
Topic modeling is a family of techniques that uses unlabeled data, typically in the form of natural language text documents. The model learns to represent a document as a vector of topics. For example, in a collection of news articles, the five major topics could be “sports,” “politics,” “entertainment,” “finance,” and “technology”.
75
Features for Time-Series
Time-series data is different from the traditional supervised learning data, which has a form of unordered collections of independent observations. A time series is an ordered sequence of observations, and each is marked with a time-related attribute, such as timestamp, date, month-year, year, and so on. Analysts typically use time-series data to solve two kinds of prediction problems. Given a sequence of recent observations: * predict something about the next observation (for example, given the stock price and the value of stock indices for the last seven days, predict the stock price for tomorrow), or * predict something about the phenomenon that generated that sequence (for example, given a user’s connection log to a software system, predict whether they are likely to cancel their subscription during the current quarter).
76
Stacking Features
In our movie title classification problem, we first collect all the left contexts. We then apply bag-of-words to transform each left context into a binary feature vector. Next, collect all extractions and, using bag-of-words, transform each extraction into a binary feature vector. Then we collect all the right contexts and apply bag-of-words to transform each right context into a binary feature vector. Finally, we concatenate each example, joining the feature vectors of the left context, the extraction, and the right context.
77
Properties of Good Features
High predictive power, fast computability, reliability, and uncorrelatedness.
78
Uncorrelatedness
Correlation of two features means their values are related. If the growth of one feature implies the growth of the other, and the inverse is also true, then the two features are correlated. Once the model is in production, its performance may change because the input data’s properties may change over time. When many of your features are highly correlated, even a minor change in the input data’s properties may result in significant changes in the model’s behavior. Sometimes the model was built under strict time constraints, so the developer used all possible sources of features. With time, maintaining those sources can become costly. It’s generally recommended to eliminate redundant or highly correlated features. Feature selection techniques help reduce such features.
79
Cutting the Long Tail
Typically, if a feature contains information (e.g., a non-zero value) only for a handful of examples, such a feature could be removed from the feature vector. In bag-of-words, you can build a graph with the distribution of token counts, and then cut off the so-called long tail, as shown in Figure 15.
80
Properties of Good Features
High predictive power, fast computability, reliability, and uncorrelatedness. Also: if you apply a model built on historical tweets to predict something about current tweets, the date of your production examples will always be outside the training distribution, which can result in a significant error.
81
Boruta algorithm for assessing importance of features
Boruta is a feature selection method built around random forests. It creates shuffled “shadow” copies of all features, trains a random forest on the extended dataset, and iteratively keeps only those original features whose importance is consistently higher than the importance of the best shadow feature.
82
l1-regularization
L1 regularization (lasso) adds the sum of the absolute values of the model parameters, multiplied by a regularization coefficient, to the loss being minimized. It pushes many parameters to exactly zero, which yields sparse models and can serve as an implicit feature selection technique.
83
stop words.
Stop words are the words that are too generic or common for the problem we are trying to solve. Frequent examples of stop words are articles, prepositions, and pronouns. Dictionaries of stop words for most languages are available online.
84
Feature Discretization
The reasons to discretize a real-valued numerical feature can be numerous. For example, some feature selection techniques only apply to categorical features. A successful discretization adds useful information to the learning algorithm when the training dataset is relatively small. Numerous studies show that discretization can lead to improved predictive accuracy. It is also simpler for a human to interpret a model’s prediction if it is based on discrete groups of values, such as age groups or salary ranges.
85
What is binning or bucketing?
Binning, also known as bucketing, is a popular technique that allows transforming a numerical feature into a categorical one by replacing numerical values in a specific range by a constant categorical value. There are three typical approaches to binning: * uniform binning, * k-means-based binning, and * quantile-based binning.
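A minimal sketch of the three approaches, assuming pandas and scikit-learn are available; the data and bin counts are illustrative:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import KBinsDiscretizer

    ages = np.array([18, 22, 25, 31, 40, 47, 53, 60, 72, 85], dtype=float)

    uniform = pd.cut(ages, bins=3)      # uniform binning: equal-width ranges
    quantile = pd.qcut(ages, q=3)       # quantile-based binning: roughly equal-sized bins

    kmeans = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
    kmeans_bins = kmeans.fit_transform(ages.reshape(-1, 1))   # k-means-based binning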
86
elbow method (PCA..)
Plot the quantity of interest (for example, the explained variance in PCA, or the clustering objective) as a function of k, and pick the value of k at which the curve bends sharply, the “elbow”: beyond that point, increasing k brings only marginal improvement.
87
skipgram vs cbow
Both are word2vec architectures. Skip-gram predicts the surrounding context words from the center word, while CBOW (continuous bag-of-words) predicts the center word from its averaged context. CBOW trains faster; skip-gram tends to work better for rare words and smaller corpora.
88
what is fastext improvement over word to vec
fastText represents each word as a bag of character n-grams (subword units) in addition to the word itself, so it can build embeddings for out-of-vocabulary and rare words and better captures morphology. word2vec, by contrast, learns one vector per whole word.
89
doc2vec
doc2vec (paragraph vectors) extends word2vec by learning a fixed-length vector for an entire document: a document identifier vector is trained jointly with the word vectors and then serves as the document’s embedding.
90
choosing embedding dimentionality
A rule of thumb suggested by Google is d = D^(1/4), the fourth root of the number of categories D. The principled way is to treat the embedding dimensionality as a hyperparameter and tune it.
91
Feature scaling
Feature scaling is bringing all your features to the same, or very similar, ranges of values or distributions. Multiple experiments demonstrated that a learning algorithm applied to scaled features might produce a better model. While there’s no guarantee that scaling will have a positive impact on the quality of your model, it’s considered a best practice. Scaling can also increase the training speed of deep neural networks. It also assures that no individual feature dominates, especially in the initial iterations of gradient descent or other iterative optimization algorithms. Finally, scaling reduces the risk of numerical overflow, the problem that computers have when working with very small or very big numbers.
92
Normalization
Normalization is the process of converting an actual range of values, which a numerical feature can take, into a predefined and artificial range of values, typically in the interval [−1, 1] or [0, 1].
93
winsorization.
Winsorization consists of setting all outliers to a specified percentile of the data; for example, a 90% winsorization would see all data below the 5th percentile set to the 5th percentile, and data above the 95th percentile set to the 95th percentile.
94
Standardization
Standardization (also known as z-score normalization) is the procedure during which the feature values are rescaled so that they have the properties of a standard normal distribution, with μ = 0 and σ = 1, where μ is the sample mean (the average value of the feature, averaged over all examples in the training data) and σ is the standard deviation from the sample mean.
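A minimal sketch of both rescalings with NumPy (the values are made up): min-max normalization maps values into [0, 1], standardization produces mean 0 and standard deviation 1.

    import numpy as np

    x = np.array([50.0, 60.0, 80.0, 100.0, 300.0])

    # Min-max normalization into [0, 1]
    x_norm = (x - x.min()) / (x.max() - x.min())

    # Standardization (z-scores): mean 0, standard deviation 1
    x_std = (x - x.mean()) / x.std()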
95
Data Leakage in Feature Engineering
Now imagine you are working with text, and that you use bag-of-words to create features with the entire dataset. After building the vocabulary, you split your data into the three sets. In this situation, the learning algorithm will be exposed to features based on tokens only present in the holdout sets. Again, the model will display artificially better performance than had you divided your data before feature engineering. A solution, as you might have guessed, is first to split the entire dataset into training and holdout sets, and only do feature engineering on the training data. This also applies when you use mean encoding to transform a categorical feature to a number: split the data first and then compute the sample mean of the label, based on the training data only.
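A minimal sketch with scikit-learn's bag-of-words vectorizer: the vocabulary is built on the training texts only and then reused for the holdout texts (the example sentences are made up).

    from sklearn.feature_extraction.text import CountVectorizer

    train_texts = ["the cat sat on the mat", "dogs chase cats"]
    holdout_texts = ["a parrot on the mat"]

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)   # vocabulary comes from training data only
    X_holdout = vectorizer.transform(holdout_texts)   # tokens unseen in training are ignored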
96
Schema file: what is it and why is it useful?
A schema file describes each feature of the dataset: its name, type, expected range or distribution of values, whether zeroes and undefined values are allowed, and how often the feature is populated. It is useful because incoming data can be validated against these expectations automatically, before training or scoring. An example:

    feature {
      name : "height"
      type : float
      min : 50.0
      max : 300.0
      mean : 160.0
      variance : 17.0
      zeroes : false
      undefined : false
      popularity : 1.0
    }

    feature {
      name : "color_red"
      type : binary
      zeroes : true
      undefined : false
      popularity : 0.76
    }

    feature {
      name : "color_green"
      type : binary
      zeroes : true
      undefined : false
      popularity : 0.65
    }

    feature {
      name : "color_blue"
      type : binary
      zeroes : true
      undefined : false
      popularity : 0.81
    }
97
random prediction algorithm
The random prediction algorithm makes a prediction by randomly choosing a label from the collection of labels assigned to the training examples. In the classification problem, it corresponds to randomly picking one class from all classes in the problem. In the regression problem it means selecting from all unique target values in the training data.
98
zero rule algorithm
The zero rule algorithm yields a tighter baseline than the random prediction algorithm. This means that it usually improves the value of the metric as compared to random prediction. To make predictions, the zero rule algorithm uses more information about the problem. In classification, the zero rule algorithm strategy is to always predict the class most common in the training set, independently of the input value. It can look ineffective, but consider the following problem. Let the training data for your classification problem contain 800 examples of the positive class, and 200 examples of the negative class. The zero rule algorithm will predict the positive class all the time, and the accuracy (one of the popular performance metrics that we will consider in Section 5.5.2) of the baseline will be 800/1000 = 0.8 or 80%, which is not bad for such a simple classifier. Now you know that your statistical model, independently of how close it is to the optimum, must have an accuracy of at least 80%. Now, let’s consider the zero rule algorithm for regression. According to the zero rule algorithm, the strategy for regression is to predict the sample average of the target values observed in the training data. This strategy will likely have a lower error rate than random prediction.
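A minimal sketch of the zero rule baseline for classification, using the 800/200 split from this card; the function names are illustrative:

    from collections import Counter

    def zero_rule_fit(train_labels):
        """Return the class most common in the training labels."""
        return Counter(train_labels).most_common(1)[0][0]

    train_labels = [1] * 800 + [0] * 200
    majority = zero_rule_fit(train_labels)       # -> 1

    test_labels = [1] * 80 + [0] * 20
    accuracy = sum(1 for y in test_labels if y == majority) / len(test_labels)
    print(accuracy)                              # 0.8 on data with the same class proportions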
99
What is Amazon Mechanical Turk service?
Mechanical Turk (MT) is a web-platform where people solve simple tasks for a reward. MT provides an API that you can call to get human predictions. The quality of such predictions can vary from very low to relatively high, depending on the task and the reward. MT is relatively inexpensive, so you can get predictions fast and in large numbers.
100
distribution shift
The distribution shift can be a hard problem to tackle. Using a different data distribution for training could be a conscious choice because of the data availability. However, the analyst may be unaware that the statistical properties of the training and development data are different. This often happens when the model is frequently updated after production deployment, and new examples are added to the training set. The properties of the data used to train the model, and that of the data used to validate and test it, can diverge over time. Section ?? in the next chapter provides guidance on how to handle that problem.
101
Preconditions for Supervised Learning
Before you start working on your model, make sure the following conditions are satisfied: 1. You have a labeled dataset. 2. You have split the dataset into three subsets: training, validation, and test. 3. Examples in the validation and test sets are statistically similar. 4. You engineered features and filled missing values using only the training data. 5. You converted all examples into numerical feature vectors. 6. You have selected a performance metric that returns a single number (see Section 5.5). 7. You have a baseline.
102
Main Properties of a Learning Algorithm: Explainability of learning algorithm
Do the model predictions require explanation for a non-technical audience? The most accurate machine learning algorithms and models are so-called “black boxes.” They make very few prediction errors, but it may be difficult to understand, and even harder to explain, why a model or an algorithm made a specific prediction. Examples of such models are deep neural networks and ensemble models. In contrast, kNN, linear regression, and decision tree learning algorithms are not always the most accurate. However, their predictions are easy to interpret by a non-expert.
103
Main Properties of a Learning Algorithm: In-memory vs. out-of-memory
Can your dataset be fully loaded into the RAM of your laptop or server? If yes, then you can choose from a wide variety of algorithms. Otherwise, you would prefer incremental learning algorithms that can improve the model by reading data gradually. Examples of such algorithms are Naïve Bayes and the algorithms for training neural networks.
104
Main Properties of a Learning Algorithm: Number of features and examples
How many training examples do you have in your dataset? How many features does each example have? Some algorithms, including those used for training neural networks and random forests, can handle a huge number of examples and millions of features. Others, like the algorithms for training support vector machines (SVM), can be relatively modest in their capacity.
105
Main Properties of a Learning Algorithm: Nonlinearity of the data
Is your data linearly separable? Can it be modeled using a linear model? If yes, SVM with the linear kernel, linear and logistic regression can be good choices. Otherwise, deep neural networks or ensemble models might work better.
106
Main Properties of a Learning Algorithm: Training speed
How much time is a learning algorithm allowed to use to build a model, and how often you will need to retrain the model on updated data? If training takes two days, and you need to retrain your model every 4 hours, then your model will never be up to date. Neural networks are slow to train. Simple algorithms like linear and logistic regression, or decision trees, are much faster. Specialized libraries contain very efficient implementations of some algorithms. You may prefer to do research online to find such libraries. Some algorithms, such as random forest learning, benefit from multiple CPU cores, so their training time can be significantly reduced on a machine with dozens of cores. Some machine learning libraries leverage GPU (graphics processing unit) to speed up training.
107
Main Properties of a Learning Algorithm: Prediction speed
How fast must the model be when generating predictions? Will your model be used in a production environment where very high throughput is required? Models like SVMs and linear and logistic regression models, and not-very-deep feedforward neural networks, are extremely fast at prediction time. Others, like kNN, ensemble algorithms, and very deep or recurrent neural networks, are slower.
108
algorithm spot-checking.
Shortlisting candidate learning algorithms for a given problem is sometimes called algorithm spot-checking. For the most effective spot-checking, it is recommended to: * select algorithms based on different principles (sometimes called orthogonal), such as instance-based algorithms, kernel-based, shallow learning, deep learning, ensembles; * try each algorithm with 3 − 5 different values of the most sensitive hyperparameters (such as the number of neighbors k in k-nearest neighbors, penalty C in support vector machines, or decision threshold in logistic regression); * use the same training/validation split for all experiments, * if the learning algorithm is not deterministic (such as the learning algorithms for neural networks and random forests), run several experiments, and then average the results; * once the project is over, note which algorithms performed the best, and use this information when working on a similar problem in the future.
109
Pipeline
Many modern machine learning packages and frameworks support the notion of a pipeline. A pipeline is a sequence of transformations the training data goes through, before it becomes a model.
110
Performance Metrics for Regression
Regression and classification models are assessed using different metrics. Let’s first consider performance metrics for regression: mean squared error (MSE), median absolute error (MAE), and almost correct predictions error rate (ACPER).
111
When is median absolute error better than MSE?
If the data contains outliers, the examples very far from the “true” regression line, they can significantly affect the value of MSE. By definition, the squared error for such outlying examples will be high. In such situations, it is better to apply a different metric, the median absolute error.
112
What is almost correct predictions error rate (ACPER)
The almost correct predictions error rate (ACPER) is the percentage of predictions that is within p percentage of the true value. To calculate ACPER, proceed as follows: 1. Define a threshold percentage error that you consider acceptable (let’s say 2%). 2. For each true value of the target yi , the desired prediction should be between yi + 0.02yi and yi − 0.02yi . 3. By using all examples i = 1, . . . , N, calculate the percentage of predicted values fulfilling the above rule. This will give the value of the ACPER metric for your model.
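A minimal sketch of ACPER with NumPy, using the 2% threshold from the steps above; the sample values are made up:

    import numpy as np

    def acper(y_true, y_pred, p=0.02):
        """Share of predictions within a relative threshold p of the true values."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        within = np.abs(y_pred - y_true) <= p * np.abs(y_true)
        return within.mean()

    print(acper([100, 200, 300], [101, 210, 299]))   # 2 of 3 predictions within 2% -> 0.667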
113
For classification, things are a little more complicated. The most widely used metrics to assess a classification model are:
* precision-recall,
* accuracy,
* cost-sensitive accuracy, and
* area under the ROC curve (AUC).
114
F-measure, also known as F-score.
The traditional F-measure, or F1-score, is the harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall).
115
Accuracy?
Accuracy is given by the number of correctly classified examples, divided by the total number of classified examples. In terms of the confusion matrix, it is given by: accuracy = (TP + TN) / (TP + TN + FP + FN).
116
When is accuracy not a good measure?
Accuracy measures the performance of the model for all classes at once, and it conveniently returns a single number. However, accuracy is not a good performance metric when the data is imbalanced. In an imbalanced dataset, examples belonging to one class or a few classes constitute the vast majority, while other classes include very few examples. Imbalanced training data can significantly and adversely affect the model. We will talk more about dealing with imbalanced data in Section ?? of Chapter 6. For imbalanced data, a better metric is per-class accuracy: first calculate the accuracy of prediction for each class in {1, . . . , C}, and then take the average of the C individual accuracy values. For the confusion matrix of the spam detection problem above, the accuracy for the class “spam” is 23/(23 + 1) ≈ 0.96, and the accuracy for the class “not_spam” is 556/(12 + 556) ≈ 0.98. The per-class accuracy is then (0.96 + 0.98)/2 = 0.97.
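The same per-class accuracy arithmetic as plain Python, using the counts quoted in the card:

```python
tp_spam, fn_spam = 23, 1    # spam predicted as spam / spam predicted as not_spam
tn_ham, fp_ham = 556, 12    # not_spam predicted correctly / not_spam predicted as spam

acc_spam = tp_spam / (tp_spam + fn_spam)        # 23 / 24  ~= 0.96
acc_ham = tn_ham / (tn_ham + fp_ham)            # 556 / 568 ~= 0.98
per_class_accuracy = (acc_spam + acc_ham) / 2   # ~= 0.97
print(per_class_accuracy)
```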
117
Cohen’s kappa statistic
Cohen’s kappa statistic is a performance metric that applies to both multiclass and imbalanced learning problems. The advantage of this metric over accuracy is that Cohen’s kappa tells you how much better your classification model is performing, compared to a classifier that randomly guesses a class according to the frequency of each class.
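A one-line check with scikit-learn, assuming label arrays y_true and y_pred exist:

```python
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(y_true, y_pred)  # 1.0 = perfect agreement, 0.0 = no better than chance-level guessing
```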
118
The ROC curve
The ROC curve (ROC stands for “receiver operating characteristic;” the term comes from radar engineering) is a commonly used method of assessing classification models. ROC curves use a combination of the true positive rate (defined exactly as recall) and the false positive rate (the proportion of negative examples predicted incorrectly) to build up a summary picture of classification performance.
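A short scikit-learn sketch, assuming a trained model that exposes predict_proba and a held-out test set:

```python
from sklearn.metrics import roc_curve, roc_auc_score

scores = model.predict_proba(X_test)[:, 1]        # score of the positive class
fpr, tpr, thresholds = roc_curve(y_test, scores)  # points of the ROC curve
auc = roc_auc_score(y_test, scores)               # area under that curve
```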
119
Grid search
Grid search is the simplest hyperparameter tuning technique. It is used when the number of hyperparameters and their ranges are not too large. For the problem of tuning two numerical hyperparameters, the technique consists of discretizing each of the two hyperparameters, and then training and evaluating a model for every pair of the resulting discrete values.
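A grid search sketch with scikit-learn for two SVM hyperparameters; note that GridSearchCV scores each pair with cross-validation rather than a single fixed validation split, and the grids below are illustrative:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C":     [0.1, 1, 10, 100],        # discretized penalty values
    "gamma": [0.001, 0.01, 0.1, 1],    # discretized RBF kernel width
}
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)           # X_train, y_train assumed to exist
print(search.best_params_, search.best_score_)
```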
120
Random search
Random search differs from grid search in that you do not provide a discrete set of values to explore for each hyperparameter. Instead, you provide a statistical distribution for each hyperparameter from which values are randomly sampled, and you set the total number of combinations you want to evaluate.
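The same tuning problem as a random search sketch: distributions instead of grids, and a fixed budget of sampled combinations (the distributions and budget are illustrative):

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

param_distributions = {
    "C":     loguniform(1e-2, 1e2),   # sample the penalty on a log scale
    "gamma": loguniform(1e-4, 1e0),
}
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=25,  # total number of sampled combinations
                            scoring="accuracy", cv=5, random_state=42)
search.fit(X_train, y_train)            # X_train, y_train assumed to exist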
121
Hyperparameter tuning: Coarse-to-Fine Search
In practice, analysts often use a combination of grid search and random search called coarse-to-fine search. This technique uses a coarse random search to first find the regions of high potential. Then, using a fine grid search in these regions, one finds the best values for hyperparameters, as shown in Figure 8. You can decide to only explore one high-potential region or several such regions, depending on the available time and computational resources.
122
Hyperparameter tuning: Bayesian techniques
Bayesian techniques differ from random and grid searches in that they use past evaluation results to choose the next values to evaluate. In practice, this allows Bayesian hyperparameter optimization techniques to find better values of hyperparameters in less time.
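A hedged sketch using Optuna, one third-party library whose default TPE sampler proposes the next hyperparameter values based on past trial results; this assumes Optuna is installed and X_train, y_train exist, and the search ranges are illustrative:

```python
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(trial):
    # Each trial's suggestions are informed by the outcomes of earlier trials.
    c = trial.suggest_float("C", 1e-3, 1e3, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e0, log=True)
    return cross_val_score(SVC(C=c, gamma=gamma), X_train, y_train, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```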
123
Cross-Validation
In k-fold cross-validation, the training data is split into k folds of roughly equal size (k is typically 5 or 10). The model is then trained k times: each time, k − 1 folds are used for training and the remaining fold for validation, and the k validation scores are averaged. Cross-validation is especially useful when the dataset is too small to afford a fixed validation set.
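A minimal 5-fold cross-validation sketch with scikit-learn, assuming X and y exist:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # one score per fold
print(scores.mean(), scores.std())
```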
124
Shallow models?
Shallow models make predictions based directly on the values in the input feature vector. Most popular machine learning algorithms produce shallow models. The only kind of deep model in common use is the deep neural network. We consider a strategy to train them in Section ?? of the next chapter.
125
A typical model training strategy for shallow learning algorithms looks as follows:
1. Define a performance metric P.
2. Shortlist learning algorithms.
3. Choose a hyperparameter tuning strategy T.
4. Pick a learning algorithm A.
5. Pick a combination H of hyperparameter values for algorithm A using strategy T.
6. Use the training set and train a model M using algorithm A parametrized with hyperparameter values H.
7. Use the validation set and calculate the value of metric P for model M.
8. Decide:
a. If there are still untested hyperparameter values, pick another combination H of hyperparameter values using strategy T and go back to step 6.
b. Otherwise, pick a different learning algorithm A and go back to step 5, or proceed to step 9 if there are no more learning algorithms to try.
9. Return the model for which the value of metric P is maximized.
126
If the model makes too many mistakes on the training data, we say that it has a high bias, or that the model underfits the training data. There could be several reasons for underfitting:
* the model is too simple for the data (for example, linear models often underfit);
* the features are not informative enough; or
* you regularize too much (we talk about regularization in the next section).
The possible solutions to the problem of underfitting include:
* trying a more complex model,
* engineering features with higher predictive power,
* adding more training data, when possible, and
* reducing regularization.
127
Several reasons can lead to overfitting:
* the model is too complex for the data; very tall decision trees or a very deep neural network often overfit;
* there are too many features and few training examples; and
* you don’t regularize enough.
Several solutions to overfitting are possible:
* use a simpler model: try linear instead of polynomial regression, an SVM with a linear kernel instead of a radial basis function (RBF) kernel, or a neural network with fewer layers/units;
* reduce the dimensionality of examples in the dataset;
* add more training data, if possible; and
* regularize the model.
128
Regularization
Regularization is an umbrella term for methods that force a learning algorithm to train a less complex model. In practice, it leads to higher bias, but significantly reduces the variance.
129
lasso/ L1 and ridge regularization / L2.
L1 (lasso) regularization adds the sum of the absolute values of the model parameters to the cost function, scaled by a regularization hyperparameter. It tends to drive some parameters to exactly zero, so it also performs feature selection. L2 (ridge) regularization adds the sum of the squared parameter values instead; it shrinks parameters towards zero without eliminating them. Elastic net combines both penalties.
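A quick comparison in scikit-learn; alpha controls the penalty strength, and X_train, y_train are assumed to exist:

```python
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # L1: some coefficients become exactly 0
ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # L2: coefficients are shrunk, not zeroed
print(sum(c == 0 for c in lasso.coef_))          # number of features eliminated by the L1 penalty
```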
130
A common strategy to build a neural network looks as follows:
1. Define a performance metric P.
2. Define the cost function C.
3. Pick a parameter-initialization strategy W.
4. Pick a cost-function optimization algorithm A.
5. Choose a hyperparameter tuning strategy T.
6. Pick a combination H of hyperparameter values using the tuning strategy T.
7. Train model M, using algorithm A, parametrized with hyperparameters H, to optimize cost function C.
8. If there are still untested hyperparameter values, pick another combination H of hyperparameter values using strategy T, and repeat step 7.
9. Return the model for which the metric P was optimized.
131
categorical cross-entropy (for multiclass classification) or binary cross-entropy (for binary and multi-label classification).
These are the standard classification cost functions. Cross-entropy measures the dissimilarity between the predicted probability distribution and the true labels: categorical cross-entropy compares the softmax output with the one-hot encoded true class, while binary cross-entropy is applied to each logistic (sigmoid) output independently, which is why it also works for multi-label problems.
132
Multi-class vs multi label classification
Note that the output layers in multiclass and multi-label classification are different. In multiclass classification, one softmax unit is used. It generates a C-dimensional vector whose values are bounded by the range (0, 1), and whose sum equals 1. In multi-label classification, the output layer contains C logistic units whose values also lie in the range (0, 1), but their sum lies in the range (0, C).
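The two output layers expressed in tf.keras (the rest of the network is assumed to be defined elsewhere; C = 10 is illustrative):

```python
import tensorflow as tf

C = 10
multiclass_output = tf.keras.layers.Dense(C, activation="softmax")   # C values that sum to 1
multilabel_output = tf.keras.layers.Dense(C, activation="sigmoid")   # C independent values in (0, 1)
# Matching losses: "categorical_crossentropy" for multiclass, "binary_crossentropy" for multi-label.
```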
133
Parameter-Initialization Strategies
* ones: all parameters are initialized to 1;
* zeros: all parameters are initialized to 0;
* random normal: parameters are initialized to values sampled from the normal distribution, typically with a mean of 0 and a standard deviation of 0.05;
* random uniform: parameters are initialized to values sampled from the uniform distribution on the range [−0.05, 0.05];
* Xavier normal: parameters are initialized to values sampled from the truncated normal distribution, centered on 0, with standard deviation equal to sqrt(2/(in + out)), where “in” is the number of units in the preceding layer to which the current unit is connected (the one whose parameters you initialize), and “out” is the number of units in the subsequent layer to which the current unit is connected; and
* Xavier uniform: parameters are initialized to values sampled from a uniform distribution within [−limit, limit], where limit is sqrt(6/(in + out)), and “in” and “out” are defined as in Xavier normal, above.
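A small NumPy sketch of the Xavier uniform rule above (the function name and seed are illustrative); libraries such as tf.keras expose the same strategies as built-in initializers, for example GlorotUniform and GlorotNormal:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, seed=0):
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))           # the "limit" from the definition above
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)   # weight matrix for a 256-unit layer feeding a 128-unit layer
```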
134
We say that f(x) has a local minimum at x = c if?
We say that f(x) has a local minimum at x = c if f(x) ≥ f(c) for every x in some open interval around x = c.
135
Learning Rate Decay Schedules
Learning rate decay consists of gradually reducing the value of the learning rate α as the epochs progress; consequently, the parameter updates become finer. There are several common schedules: time-based, step-based, and exponential learning rate decay.
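Sketches of the three schedule families, written as plain functions of the epoch number; alpha0 is the initial learning rate, and the decay constants are illustrative:

```python
import math

def time_based(alpha0, epoch, decay=0.01):
    return alpha0 / (1.0 + decay * epoch)

def step_based(alpha0, epoch, drop=0.5, epochs_per_drop=10):
    return alpha0 * drop ** math.floor(epoch / epochs_per_drop)

def exponential(alpha0, epoch, k=0.1):
    return alpha0 * math.exp(-k * epoch)
```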
136
There are several popular upgrades to minibatch SGD, such as Momentum, Root Mean Squared Propagation (RMSProp), and Adam.
Momentum accumulates an exponentially decaying moving average of past gradients and moves the parameters in that direction, which damps oscillations and speeds up progress along consistent directions. RMSProp adapts the learning rate of each parameter by dividing it by a running average of the magnitudes of that parameter’s recent gradients. Adam combines the ideas of momentum and RMSProp and is a common default choice.
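The three upgrades as configured in tf.keras; the learning rates and momentum value are just common defaults:

```python
import tensorflow as tf

sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)
# model.compile(optimizer=adam, loss="binary_crossentropy", metrics=["accuracy"])
```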
137
dropout
The concept of dropout is very simple. Each time you “run” a training example through the network, you temporarily exclude at random some units from the computation. The higher the percentage of units excluded, the stronger the regularization effect. Popular neural network libraries allow you to add a dropout layer between two successive layers, or you can specify the dropout hyperparameter for a layer. The dropout hyperparameter varies in the range [0, 1] and characterizes the fraction of units to randomly exclude from computation. The value of the hyperparameter has to be found experimentally. While simple, dropout’s flexibility and regularizing effect are phenomenal.
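A dropout layer between two dense layers in tf.keras; 0.5 means half the units are dropped at random on each training pass (dropout is disabled automatically at prediction time):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),                    # the dropout hyperparameter in [0, 1]
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```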
138
Early stopping
Early stopping trains a neural network by saving the preliminary model after every epoch. Models saved after each epoch are called checkpoints. Then it assesses each checkpoint’s performance on the validation set. You’ll find during gradient descent that the cost decreases as the number of epochs increases. After some epoch, the model can start overfitting, and the model’s performance on the validation data can deteriorate. Remember the bias-variance illustration in Figure ?? in Chapter 5. By keeping a version of the model after each epoch, you can stop the training once you start observing a decreased performance on the validation set. Alternatively, you can keep running the training process for a fixed number of epochs, and then pick the best checkpoint. Some machine learning practitioners rely on this technique. Others try to properly regularize the model using appropriate techniques.
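A sketch of both variants with tf.keras callbacks, assuming a compiled model and training/validation arrays exist (the file name pattern and patience value are illustrative):

```python
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.ModelCheckpoint("checkpoint_{epoch:02d}.keras"),   # save a checkpoint every epoch
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),          # stop when validation loss stalls
]
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, callbacks=callbacks)
```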
139
Batch normalization
Batch normalization (which rather should be called batch standardization) consists of standardizing the outputs of each layer before the next layer receives them as input. In practice, batch normalization results in faster and more stable training, as well as some regularization effect. So, it’s always a good idea to use batch normalization. In popular neural network libraries, you can often insert a batch normalization layer between two subsequent layers.
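A batch normalization layer inserted between two layers in tf.keras:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.BatchNormalization(),            # standardize the previous layer's outputs per batch
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```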
140
A pre-trained model can be used in two ways:
1) its learned parameters can be used to initialize your own model, or 2) it can be used as a feature extractor for your model.
141
pre-trained model as feature extractors
In practice, it means that you only keep several initial layers of the pre-trained model, those closest to and including the input layer. You keep their parameters “frozen,” that is, unchanged and unchangeable. Then you add new layers on top of the frozen layers, including the output layer appropriate for your task. Only the parameters of the new layers will be updated by gradient descent during training on your data.
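A feature-extractor sketch in tf.keras; MobileNetV2 with ImageNet weights is just one possible pre-trained base, and the input shape and head are illustrative:

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False                               # freeze the pre-trained parameters

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new output layer for your task; only new layers train
])
```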
142
use a pre-trained model as an initializer
If you use a pre-trained model as an initializer for your own model, you get more flexibility: gradient descent will modify the parameters in all layers and can potentially reach a better performance for your problem. The downside is that you will often end up training a very deep neural network.
143
Adversarial Validation
Adversarial validation is a very clever and very simple way to find out whether our test data and our training data are similar. We combine the training and test data, labeling the training rows with 0 and the test rows with 1, shuffle them, and then check whether a binary classifier can re-identify them. If it cannot, that is, if we obtain an area under the receiver operating characteristic curve (AUC) of about 0.5, the two sets are indistinguishable and we are good to go. However, if the classifier can separate them (AUC > 0.5), we have a problem, either with the whole dataset or, more likely, with some features in particular, which probably come from different distributions in the test and training sets. In that case, we can look at the feature that was most out of place; the problem may be that some values were seen only in the training data but not in the test data. If one feature contributes very strongly to the AUC, it may well be a good idea to remove that feature from the model. We can also use this technique to improve the data used for learning.
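A minimal adversarial-validation sketch, assuming X_train and X_test are feature matrices with the same columns; the classifier choice is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_all = np.vstack([X_train, X_test])
y_all = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])  # 0 = train row, 1 = test row

clf = RandomForestClassifier(n_estimators=100, random_state=42)
auc = cross_val_score(clf, X_all, y_all, cv=5, scoring="roc_auc").mean()
print(auc)  # ~0.5: distributions look alike; clearly above 0.5: inspect the feature importances
```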
144
Handling Imbalanced Datasets - class weight
Class weighting makes the learning algorithm pay more attention to the minority class by multiplying the contribution of its examples to the cost function by a higher weight, without changing the data itself. Many libraries expose this directly, for example as a class_weight argument.
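A scikit-learn sketch: "balanced" weighs each class inversely to its frequency, and explicit weights are also possible (X_train, y_train assumed):

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
# or explicit weights, e.g. class_weight={0: 1.0, 1: 10.0} to penalize minority-class errors 10x
```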
145
Handling Imbalanced Datasets - Ensemble of Resampled Datasets
Instead of training one model on the full imbalanced dataset, split the majority class into several subsets, pair each subset with all the minority-class examples to obtain several roughly balanced datasets, train one model on each, and combine their predictions (for example, by voting or averaging).
146
Handling Imbalanced Datasets - different learning rates for different classes:
If you use stochastic gradient descent, the class imbalance can be tackled in several ways. First, you can have different learning rates for different classes: a lower value for the examples of the majority class, and a higher value otherwise. Second, you can make several consecutive updates of the model parameters each time you encounter an example of a minority class.
147
There are two techniques often used to calibrate a binary model: Platt scaling and isotonic regression.
Platt scaling fits a logistic (sigmoid) function that maps the model’s raw scores to calibrated probabilities. Isotonic regression fits a non-parametric, non-decreasing step function for the same purpose; it is more flexible but needs more data to avoid overfitting.
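Both calibration methods are available in scikit-learn through CalibratedClassifierCV ("sigmoid" is Platt scaling, "isotonic" is isotonic regression); the base model and data are assumed:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
probs = calibrated.predict_proba(X_val)[:, 1]   # calibrated probabilities of the positive class
```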
148
If your model does poorly on the training data (underfits it), common reasons are:
* the model architecture or learning algorithm is not expressive enough (try a more advanced learning algorithm, an ensemble method, or a deeper neural network);
* you regularize too much (reduce regularization);
* you have chosen suboptimal values for hyperparameters (tune hyperparameters);
* the features you engineered don’t have enough predictive power (add more informative features);
* you don’t have enough data for the model to generalize (try to get more data, use data augmentation, or apply transfer learning); or
* you have a bug in your code (debug the code that defines and trains the model).
149
If your model does well on the training data, but poorly on the holdout data (overfits the training data), common reasons are:
* you don’t have enough data for generalization (add more data or use data augmentation);
* your model is under-regularized (add regularization or, for neural networks, both regularization and batch normalization);
* your training data distribution is different from the holdout data distribution (reduce the distribution shift);
* you have chosen suboptimal values for hyperparameters (tune hyperparameters); or
* your features have low predictive power (add features with high predictive power).
150
Iterative Model Refinement If you have access to new labeled data (for example, you can label examples yourself, or easily request the help of a labeler) then, you can refine the model using a simple iterative process:
1. Train the model using the best values of hyperparameters identified so far.
2. Test the model by applying it to a small subset of the validation set (100–300 examples).
3. Find the most frequent error patterns on that small validation set. Remove those examples from the validation set, because your model will now overfit to them.
4. Generate new features, or add more training data, to fix the observed error patterns.
5. Repeat until no frequent error patterns are observed (most errors look dissimilar).
151
Fixing Wrong Labels
Here is a simple way to identify the examples that have wrong labels. Apply the model to the training data from which it was built, and analyze the examples for which it made a different prediction as compared to the labels provided by humans. If you see that some predictions are indeed correct, change those labels. If you have time and resources, you could also examine the predictions with the score close to the decision threshold. Those are often mislabeled cases too.
152
Finding Additional Examples to Label in best way possible?
As discussed above, error analysis can reveal that more labeled data is needed from specific regions of the feature space. You might have an abundance of unlabeled examples; how should you decide which of them to label so as to maximize the positive impact on the model? If your model returns a prediction score, an effective way is to use your best model to score the unlabeled examples, and then label those examples whose prediction score is close to the prediction threshold.
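A minimal sketch of picking the examples closest to a 0.5 decision threshold, assuming a trained model with predict_proba and an unlabeled pool X_unlabeled; the batch size of 200 is illustrative:

```python
import numpy as np

scores = model.predict_proba(X_unlabeled)[:, 1]   # score the pool with the best current model
uncertainty = np.abs(scores - 0.5)                # 0 means right on the decision threshold
to_label = np.argsort(uncertainty)[:200]          # indices of the 200 most uncertain examples to send to labelers
```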
153
Troubleshooting Deep Learning
look in the book: Machine learning engineering
154
What is a good model?
A good model has two properties:
* it has the desired quality according to the performance metric; and
* it is safe to serve in a production environment.
For a model to be safe-to-serve means satisfying the following requirements:
* it will not crash or cause errors in the serving system when being loaded, or when loaded with bad or unexpected inputs; and
* it will not use an unreasonable amount of resources (such as CPU, GPU, or RAM).
155
catastrophic forgetting
Furthermore, frequent model upgrades without retraining from scratch can lead to catastrophic forgetting. It’s a situation in which the model that was once capable of something “forgets” that capability because of learning something new.
156
warm-starting
However, avoid the practice of warm-starting. It consists of iteratively upgrading the existing model by using only new training examples and running additional training iterations.
157
Correction Cascades
You might have a model mA that solves problem A, but you need a solution mB for a slightly different problem B. It can be tempting to use the output of mA as input for mB, and to train mB only on a small sample of examples that “correct” the output of mA for solving problem B. Such a technique is called correction cascading, and it is not recommended. It’s important to note that model cascading is not always a bad practice: using the output of one model as one of many inputs for another model is common, and it might significantly reduce time to market. However, cascading must be used with caution, because an update of one model in a cascade must involve an update of all models in the cascade, which can end up being costly in the long term.
158
glue code
Reduce glue code to a minimum. This is how Google engineers put it: machine learning researchers tend to develop general-purpose solutions as self-contained packages. A wide variety of these are available as open-source packages, in-house code, proprietary packages, and cloud-based platforms. Using generic packages often results in a glue-code system design pattern, in which a massive amount of supporting code is written to get data into and out of the general-purpose packages.
159
More Data Beats Cleverer Algorithm
In practice, however, better results often come from getting more data, specifically, more labeled examples. If designed well, the data labeling process can allow a labeler to produce several thousand training examples daily. It can also be less expensive, compared to the expertise needed to invent a more advanced machine learning algorithm.
160
New Data Beats Cleverer Features
If, despite adding more training examples and designing clever features, the performance of your model plateaus, think about different information sources. For example, if you want to predict whether user U will like a news article, try to add historical data about the user U as features. Or cluster all the users, and use the information on the k-nearest users to user U as new features. This is a simpler approach compared to programming very complex features, or combining existing features in a complex way.
161
Facilitate Reproducibility
The random seed can be set as np.random.seed(15) (in NumPy and scikit-learn), tf.random.set_seed(15) (in TensorFlow), torch.manual_seed(15) (in PyTorch), and set.seed(15) (in R). The seed value doesn’t matter as long as it remains constant.
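One small helper that bundles the calls listed above; only the libraries you actually use need to be imported, and the helper name is illustrative:

```python
import random
import numpy as np
import tensorflow as tf
import torch

def set_seeds(seed=15):
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy and scikit-learn
    tf.random.set_seed(seed) # TensorFlow
    torch.manual_seed(seed)  # PyTorch

set_seeds(15)
```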
162
When delivering the model, make sure it’s accompanied by all relevant information for reproducibility.
Besides the description of the dataset and features, such as the documentation and metadata considered in Sections ?? and ??, each model should be accompanied by documentation with the following details:
* a specification of all hyperparameters, including the ranges considered and the default values used;
* the method used to select the best hyperparameter configuration;
* the definition of the specific measure or statistic used to evaluate the candidate models, and its value for the best model;
* a description of the computing infrastructure used; and
* the average runtime for each trained model, and an estimated cost of the training.
163
There are different forms of online evaluation, each serving a different purpose.
Runtime monitoring checks whether the running system meets the runtime requirements. Another common scenario is to monitor user behavior in response to different versions of the model. One popular technique used in this scenario is A/B testing: we split the users of the system into two groups, A and B, serve the old model to one group and the new model to the other, and then apply a statistical significance test to decide whether the performance of the new model is better than that of the old one. Multi-armed bandit (MAB) is another popular technique for online model evaluation. Similar to A/B testing, it identifies the best-performing model by exposing the candidate models to a fraction of users; it then gradually exposes the best model to more users, while continuing to gather performance statistics until they are reliable.
164
why A/B testing
To decide, based on the behavior of live users and a statistical significance test, whether the new model actually performs better than the old one.
165
A/B testing - G-Test
The G-test is a likelihood-ratio statistical significance test applied to count data. In A/B testing, it can compare the observed counts of outcomes (for example, clicks and non-clicks) in groups A and B with the counts expected if both models performed the same; a small p-value indicates that the difference between the two models is unlikely to be due to chance.
166
A/B testing - Z-Test
The Z-test compares the means or proportions measured in groups A and B (for example, conversion rates), assuming the test statistic is approximately normally distributed, which holds for large samples. If the resulting p-value is below the chosen significance level, the difference between the two models is considered statistically significant.
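A hedged sketch of a two-proportion Z-test using statsmodels; the counts are made-up conversion numbers for groups A and B:

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [45, 60]      # successes observed in group A and group B
users = [1000, 1000]        # users routed to each group

stat, p_value = proportions_ztest(count=conversions, nobs=users)
print(stat, p_value)        # p_value < 0.05 would suggest a real difference between the models
```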
167
Multi-Armed Bandit
A more advanced, and often preferable, way of online model evaluation and selection is the multi-armed bandit (MAB). A/B testing has one major drawback: the number of test results you need to collect in groups A and B before the test becomes conclusive is high, so a significant portion of users routed to the suboptimal model would experience suboptimal behavior for a long time.
168
UCB1
UCB1 (for Upper Confidence Bound) is a popular algorithm for solving the multi-armed bandit problem. The algorithm dynamically chooses an arm, based on the performance of that arm in the past, and how much the algorithm knows about it. In other words, UCB1 routes the user to the best performing model more often when its confidence about the model performance is high. Otherwise, UCB1 might route the user to a suboptimal model so as to get a more confident estimate of that model’s performance. Once the algorithm is confident enough about the performance of each model, it almost always routes users to the best performing model.
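A minimal UCB1 sketch for choosing between candidate models ("arms"); successes could be clicks or any per-request success signal, and the counts below are made up:

```python
import math

def ucb1_choose(successes, pulls):
    # Try every arm once before applying the UCB1 formula.
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm
    total = sum(pulls)
    # Score = observed mean reward + exploration bonus that shrinks as an arm is tried more.
    scores = [successes[a] / pulls[a] + math.sqrt(2 * math.log(total) / pulls[a])
              for a in range(len(pulls))]
    return max(range(len(pulls)), key=lambda a: scores[a])

arm = ucb1_choose(successes=[30, 42], pulls=[100, 100])   # route the next user to this model
```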
169
Neuron Coverage
When we evaluate a neural network, especially one to be used in a mission-critical scenario, such as a self-driving car or a space rocket, our test set must have good coverage. Neuron coverage of a test set for a neural network model is defined as the ratio of the units (neurons) activated by the examples from the test set, to the total number of units. A good test set has close to 100% neuron coverage. A unit is considered activated when its output is above a certain threshold. For ReLU, it’s usually zero; for a logistic sigmoid, it’s 0.5.
170
Mutation Testing
In software engineering, good test coverage for the software under test (SUT) can be determined using the approach known as mutation testing. Suppose we have a set of tests designed to test an SUT. We generate several “mutants” of the SUT; a mutant is a version of the SUT in which we randomly make some modifications, such as replacing, in the source code, a “+” with a “−” or a “<” with a “>”, deleting the else branch of an if-else statement, and so on. Then we apply the test set to each mutant and see whether at least one test breaks on that mutant; if so, we say that we kill the mutant. We then compute the ratio of killed mutants in the entire collection of mutants. A good test set makes this ratio equal to 100%. In machine learning, a similar approach can be followed. However, to create a mutant statistical model, instead of modifying the code, we modify the training data. If the model is deep, we can also randomly remove or add a layer, or remove or replace an activation function. The training data can be modified by:
* adding duplicated examples,
* falsifying the labels of some examples,
* removing some examples, or
* adding random noise to the values of some features.
We say that we kill a mutant if at least one test example gets a wrong prediction from that mutant statistical model.
171
Robustness of a model
The robustness of a machine learning model refers to the stability of the model performance after adding some noise to the input data. A robust model would exhibit the following behavior. If the input example is perturbed by adding random noise, the performance of the model would degrade proportionally to the level of noise.
172
Fairness
Machine learning algorithms tend to learn what humans are teaching them. The teaching comes in the form of training examples. Humans have biases which may affect how they collect and label data. Sometimes, bias is present in historical, cultural, or geographical data. This, in turn, as we have seen in Section ?? in Chapter 3, may lead to biased models.
173
A model can be deployed following several patterns:
* statically, as a part of an installable software package,
* dynamically on the user’s device,
* dynamically on a server, or
* via model streaming.
174
Static Deployment
The static deployment of a machine learning model is very similar to traditional software deployment: you prepare an installable binary of the entire software, and the model is packaged as a resource available at runtime. Depending on the operating system and the runtime environment, the objects of both the model and the feature extractor can be packaged as part of a dynamic-link library (DLL on Windows) or shared objects (*.so files on Linux), or be serialized and saved in the standard resource location for virtual-machine-based systems, such as Java and .NET. Static deployment has many advantages:
* the software has direct access to the model, so the execution time is fast for the user;
* the user data doesn’t have to be uploaded to the server at prediction time, which saves time and preserves privacy;
* the model can be called when the user is offline; and
* the software vendor doesn’t have to care about keeping the model operational; it becomes the user’s responsibility.
175
what is load balancer?
A load balancer dispatches the incoming requests to a specific virtual machine, depending on its availability. The virtual machines can be added and closed manually, or be a part of an autoscaling group that launches or terminates virtual machines based on their usage. Figure 2 illustrates that deployment pattern. Each instance, denoted as an orange square, contains all the code needed to run the feature extractor and the model. The instance also contains a web service that has access to that code.
176
example of a hidden feedback loop.
Model mB used the output of model mA as a feature, without knowing that model mA also used the output of model mB as its feature. Another kind of hidden feedback loop involves only one model. Say we have a model that classifies incoming email messages as spam or not spam, and the user interface allows the user to mark messages as spam or not spam. Obviously, we want to use those marked messages to improve our model. However, by doing so, we risk creating a hidden feedback loop, and here is why: in our application, the user will only mark a message as spam when they see it, but users only see the messages that our model classified as not spam. It is also unlikely that the user will regularly go to the spam folder and mark some messages as not spam. So, the action of the user is significantly affected by our model, which makes the data we get from the user skewed: we influence the phenomenon from which we learn.
177
what is message broker?
To deal with such situations, on-demand architectures include a message broker, such as RabbitMQ or Apache Kafka. A message broker allows one process to write messages in a queue, and another to read from that queue. On-demand requests are placed in the input queue. The model runtime process periodically connects to the broker. It reads a batch of input data elements from the input queue and generates predictions for each element in batch mode. It then writes the predictions to the output queue. Another process periodically connects to the broker, reads the predictions from the output queue, and pushes them to users who sent the requests (Figure 3). In addition to allowing us to cope with demand spikes, such an approach is more resource-efficient.
178
There are three “cannots” we must accept and embrace:
1. We cannot always explain why an error happened.
2. We cannot reliably predict when it will happen, and even a high-confidence prediction can be false.
3. We cannot always know how to fix a specific error, or, if it is fixable, what kind of training data, and how much of it, is needed.
179
thundersvm and cuML
Modern libraries, such as thundersvm and cuML, allow the analyst to run shallow learning algorithms on GPUs, with a significant gain in training time. If you cannot afford to wait for days or weeks to get an updated model, using a less complex (and, therefore, less accurate) model might be your only choice.
180