CH2 End-to-end ML project Flashcards

Question 1

Q

What is the first question to ask your boss when building a model?

Why is this important?

Answer

A

How does the company expect to use and benefit from this model?

This is important because it will determine how you frame the problem, what algorithms you will select, what performance measure you will use to
evaluate your model, and how much effort you should spend tweaking it.

Question 2

Q

What is a signal?

Answer

A

A piece of information fed to a Machine Learning system is often called a signal, in reference to Shannon’s information theory: you want a high signal-to-noise ratio.

Question 3

Q

What are pipelines?

Answer

A

A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply.

Question 4

Q

How are components run asynchronously?

Answer

A

Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and writes the result to another data store. Some time later, the next component in the pipeline pulls this data and produces its own output, and so on.

Question 5

Q

What kind of questions do you ask when framing a problem?

Answer

A

Is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques?

Question 6

Q

What is the next step after framing the problem?

Answer

A

Your next step is to select a performance measure.

Question 7

Q

What is the RMSE?

Answer

A

A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight for
large errors.
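
A minimal NumPy sketch of the RMSE computation. The function name rmse and the sample values are hypothetical, chosen only for illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean of squared errors."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Hypothetical labels and predictions; the last prediction is off by 4.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]
score = rmse(y_true, y_pred)  # errors -1, 0, 1, 4 give sqrt(18 / 4)
```

Because the errors are squared before averaging, the single error of 4 contributes 16 of the 18 units under the square root.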

Question 8

Q

When do you use the MAE?

Answer

A

Suppose that there are many outlier districts. In that case, you may consider using the Mean Absolute Error (also called the Average Absolute Deviation), which is less sensitive to outliers than the RMSE.
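
A small sketch comparing the two measures on the same hypothetical predictions. The helper names mae and rmse and the data are assumptions made for illustration:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute errors."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(errors)))

def rmse(y_true, y_pred):
    """Root Mean Square Error, shown here only for comparison."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Hypothetical data where one prediction is far off (an outlier error of 4).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]
mae_score = mae(y_true, y_pred)    # (1 + 0 + 1 + 4) / 4
rmse_score = rmse(y_true, y_pred)  # sqrt((1 + 0 + 1 + 16) / 4)
```

On this data the MAE is lower than the RMSE, since the large error is not squared.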

Question 9

Q

What is data snooping bias?

Answer

A

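If you look at the test set before choosing a model, you may spot a seemingly interesting pattern that leads you to select a particular kind of Machine Learning model. When you then estimate the generalization error using the test set, your estimate will be too optimistic, and you will launch a system that will not perform as well as expected. This is called data snooping bias.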

Question 10

Q

Which function can you use to split a dataset?

Answer

A

Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest is train_test_split. It has a random_state parameter that lets you set the random generator seed, and you can pass it multiple datasets with an identical number of rows, and it will split them on the same indices (this is very useful, for example, if you have a separate DataFrame for labels).
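
A short sketch of train_test_split on hypothetical arrays (X and y are made up for illustration), passing features and labels together so both are split on the same indices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with 2 features each, and one label per row.
X = np.arange(20).reshape(10, 2)  # row i is [2*i, 2*i + 1]
y = np.arange(10)                 # label i belongs to row i

# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With test_size=0.2, two of the ten rows go to the test set, and each row keeps its own label.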

Question 11

Q

What is stratified sampling?

Answer

A

In stratified sampling, the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population.
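
One possible way to sketch this is the stratify parameter of Scikit-Learn's train_test_split. The data below are hypothetical, and this is just one of several ways to stratify:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical population: 80 instances in stratum 0 and 20 in stratum 1.
strata = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=strata keeps the stratum proportions the same in both subsets.
X_train, X_test, s_train, s_test = train_test_split(
    X, strata, test_size=0.2, stratify=strata, random_state=42
)
test_share_of_stratum_1 = float(np.mean(s_test == 1))
```

The test set holds 20 instances, and the share of stratum 1 in it matches the 20% share in the full population.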

Question 12

Q

What do the different correlation coefficients mean?

Answer

A

The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that there is a strong positive correlation.

When the coefficient is close to –1, it means that there is a strong negative correlation.

Finally, coefficients close to zero mean that there is no linear correlation.
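
The three cases can be illustrated with NumPy's corrcoef on small hypothetical arrays:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient of x vs y.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]    # y rises with x: close to 1
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]  # y falls as x rises: close to -1

# Hypothetical values with no linear trend: close to 0.
y_none = np.array([2.0, 1.0, 3.0, 1.0, 2.0])
r_none = np.corrcoef(x, y_none)[0, 1]
```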

Question 13

Q

What relations can the correlation coefficient miss?

Answer

A

The correlation coefficient only measures linear correlations (“if x goes up, then y generally goes up/down”). It may completely miss out on nonlinear relationships (e.g., “if x is close to zero, then y generally goes up”).
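
A tiny sketch of such a miss, using hypothetical values where y is fully determined by x yet the coefficient is zero:

```python
import numpy as np

# x is symmetric around zero and y depends on x exactly, but not linearly.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

r = np.corrcoef(x, y)[0, 1]  # zero up to floating-point error
```

The increases on the right side of zero cancel the decreases on the left, so no linear trend remains.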

Question 14

Q

What is one-hot encoding?

Answer

A

One-hot encoding creates one binary attribute per category: for each instance, only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called dummy attributes.
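
A minimal NumPy sketch of the idea. The category values are hypothetical, and in practice Scikit-Learn's OneHotEncoder class performs this kind of transformation:

```python
import numpy as np

# Hypothetical categorical attribute with three distinct categories.
categories = np.array(["INLAND", "NEAR BAY", "INLAND", "ISLAND"])

# labels holds the sorted unique categories; codes maps each row to its label index.
labels, codes = np.unique(categories, return_inverse=True)

# Row k of the identity matrix has a single 1 in column k.
one_hot = np.eye(len(labels), dtype=int)[codes]
```

Each row of one_hot contains exactly one 1, in the column of that instance's category.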

Question 15

Q

What are two common ways to perform scaling?

Answer

A

Min-max scaling and standardization.

Question 16

Q

What is min-max scaling?

Answer

A

Min-max scaling (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min.
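
A minimal NumPy sketch of min-max scaling on hypothetical values:

```python
import numpy as np

values = np.array([10.0, 20.0, 25.0, 50.0])  # hypothetical attribute values

# Subtract the min, then divide by (max - min).
scaled = (values - values.min()) / (values.max() - values.min())
```

The smallest value maps to 0 and the largest to 1.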

Question 17

Q

What is standardization?

Answer

A

Standardization first subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance.
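
A minimal NumPy sketch of standardization (subtract the mean, divide by the standard deviation), using hypothetical values:

```python
import numpy as np

values = np.array([10.0, 20.0, 25.0, 50.0])  # hypothetical attribute values

# The result has zero mean and unit standard deviation.
standardized = (values - values.mean()) / values.std()
```

Unlike min-max scaling, the results are not confined to the 0 to 1 range: values below the mean come out negative.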

Question 18

Q

What are the advantages and disadvantages of min-max scaling and standardization?

Answer

A

Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers.
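
A rough sketch of the outlier effect under assumed data: values 1 to 99 plus a single hypothetical outlier of 1000. Each scaler divides by a scale factor (the range for min-max, the standard deviation for standardization), so the sketch measures how much one outlier inflates each factor:

```python
import numpy as np

inliers = np.arange(1.0, 100.0)            # hypothetical values 1..99
with_outlier = np.append(inliers, 1000.0)  # same values plus one outlier

def value_range(a):
    """Divisor used by min-max scaling."""
    return a.max() - a.min()

# Growth of each divisor once the outlier is added.
range_growth = value_range(with_outlier) / value_range(inliers)
std_growth = with_outlier.std() / inliers.std()
```

On this data the range grows roughly tenfold, while the standard deviation grows by a factor of about 3.5, so the ordinary values are squeezed together less by standardization.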

Question 19

Q

How can you perform the data transformation steps easily?

Answer

A

There are many data transformation steps that need to be executed in the right order. Scikit-Learn provides the Pipeline class to help with such sequences of transformations.
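
A short sketch of a Pipeline that chains two transformations. The step names, the choice of transformers, and the data are assumptions made for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical data with one missing value.
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0], [4.0, 600.0]])

# Steps run in order: fill missing values first, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
X_prepared = num_pipeline.fit_transform(X)
```

Calling fit_transform on the pipeline runs every step in sequence, so the scaler only ever sees data whose missing values have already been filled.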

Question 20

Q

What is ensemble learning?

Answer

A

Building a model on top of many other models is called Ensemble Learning, and it is often a great way to push ML algorithms even further.
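
As one hedged illustration, Scikit-Learn's RandomForestRegressor builds a model on top of many decision trees and averages their predictions. The synthetic data and parameter values below are assumptions chosen only for the sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic, hypothetical regression data: y is roughly 3 * x plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=200)

# An ensemble of 10 decision trees.
forest = RandomForestRegressor(n_estimators=10, random_state=42)
forest.fit(X, y)

X_new = np.array([[2.0], [5.0], [8.0]])
# Predictions of each individual tree, shape (10, 3).
tree_preds = np.array([tree.predict(X_new) for tree in forest.estimators_])
# The forest's prediction is the average over its trees.
forest_pred = forest.predict(X_new)
```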
