CH2 End-to-end ML project Flashcards

1
Q

What is the first question to ask your boss when building a model?

Why is this important?

A

How does the company expect to use and benefit from this model?

This is important because it will determine how you frame the problem, what algorithms you will select, what performance measure you will use to
evaluate your model, and how much effort you should spend tweaking it.

2
Q

What is a signal?

A

A piece of information fed to a Machine Learning system is often called a signal, in reference to Shannon’s information theory: you want a high signal/noise ratio.

3
Q

What are pipelines?

A

A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply.

4
Q

How do pipeline components run asynchronously?

A

Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store, and then some time later the next component in the pipeline pulls this data and spits out its own output,
and so on.
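
The hand-off through a data store can be sketched in a few lines of Python. This is only an illustration: the dictionary `data_store` stands in for a real database or file store, and both component functions are hypothetical names.

```python
# A plain dict stands in for the shared data store (an assumption for this sketch).
data_store = {"raw": [3, 1, 2]}

def cleaning_component(store):
    """Pull the raw data, process it, and write the result back to the store."""
    store["clean"] = sorted(store["raw"])

def feature_component(store):
    """Some time later, pull the cleaned data and write its own output."""
    store["features"] = [x * 10 for x in store["clean"]]

# The components only communicate through the store, so each one could run
# on its own schedule; here they are simply called one after the other.
cleaning_component(data_store)
feature_component(data_store)
```

Because the only interface between components is the data store, one component can be changed or restarted without touching the others.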

5
Q

What kind of questions do you ask when framing a problem?

A

First, frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should
you use batch learning or online learning techniques?

6
Q

What is the next step after framing the problem?

A

Your next step is to select a performance measure.

7
Q

What is the RMSE?

A

A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight for
large errors.
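
A minimal sketch of the RMSE in plain Python, using made-up numbers. Both prediction lists below have the same total absolute error (8), but the one that concentrates it in a single large error gets the higher RMSE. The function name `rmse` and the data are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

y_true = [0, 0, 0, 0]
even_errors = [2, 2, 2, 2]       # four errors of size 2
one_large_error = [0, 0, 0, 8]   # a single error of size 8

print(rmse(y_true, even_errors))      # 2.0
print(rmse(y_true, one_large_error))  # 4.0
```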

8
Q

When do you use the MAE?

A

Suppose that there are many outlier districts. In that case, you may consider using the Mean
Absolute Error (MAE, also called the Average Absolute Deviation), which is less sensitive to outliers than the RMSE.
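
A small sketch of the difference, in plain Python with invented values: four predictions are close and one target is an outlier. The helper names `mae` and `rmse` are illustrative.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error, shown for comparison."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [100, 110, 120, 130, 500]   # the last value is an outlier
y_pred = [102, 108, 121, 129, 140]

mae_value = mae(y_true, y_pred)      # (2 + 2 + 1 + 1 + 360) / 5 = 73.2
rmse_value = rmse(y_true, y_pred)    # dominated by the squared outlier error
print(mae_value, rmse_value)
```

The single outlier pulls the RMSE up much further than the MAE, because its error is squared before averaging.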

9
Q

What is data snooping bias?

A

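If you look at the test set before choosing a model, you may stumble upon some seemingly interesting pattern in it that leads you to select a particular kind of model. When you then estimate the generalization error using the test set, your estimate will be too optimistic and you will launch a system that will not
perform as well as expected. This is called data snooping bias.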

10
Q

Which function do you use to split a dataset?

A

Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest is train_test_split, which randomly splits a dataset into a training set and a test set. It has two useful features. First, a random_state parameter lets you set the random generator seed, so the split is reproducible. Second, you can pass it multiple datasets with an identical number of rows, and it will split them on the same indices (this is
very useful, for example, if you have a separate DataFrame for labels).
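
A short sketch of both features, assuming Scikit-Learn and NumPy are installed. The arrays `X` and `y` are toy data invented for the example; the 20% test size and the seed 42 are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 rows, 2 features; row i is [2i, 2i + 1]
y = np.arange(10)                  # label i belongs to row i

# Passing X and y together splits both on the same row indices;
# random_state fixes the seed so the split is the same on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Since row i of `X` starts with 2i, checking `X_test[:, 0] // 2` against `y_test` confirms that features and labels stayed aligned.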

11
Q

What is stratified sampling?

A

In stratified sampling, the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the
test set is representative of the overall population.
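
A minimal sketch of the idea in plain Python. The function `stratified_split`, the stratum labels, and the 20% test ratio are all illustrative assumptions; in practice a library routine would normally do this.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio, seed=42):
    """Return (train_idx, test_idx), taking test_ratio of each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for idx, label in enumerate(labels):
        strata[label].append(idx)          # group row indices by stratum

    train_idx, test_idx = [], []
    for indices in strata.values():
        rng.shuffle(indices)               # random sample within the stratum
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["low"] * 50 + ["mid"] * 30 + ["high"] * 20
train_idx, test_idx = stratified_split(labels, test_ratio=0.2)
test_labels = [labels[i] for i in test_idx]

# The test set keeps the 50/30/20 proportions of the full population.
print(test_labels.count("low"), test_labels.count("mid"), test_labels.count("high"))  # 10 6 4
```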

12
Q

What do the different correlation coefficients mean?

A

The correlation coefficient ranges from –1 to 1. When it is close to 1, there is a strong positive correlation.

When the coefficient is close to –1, there is a strong negative correlation.

Finally, coefficients close to zero mean that there is no linear correlation.

13
Q

What relations can the correlation coefficient miss?

A

The correlation coefficient only measures linear correlations (“if x goes up, then y generally goes up/down”). It may completely miss
out on nonlinear relationships (e.g., “if x is close to zero, then y generally goes up”).
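
A small worked example in plain Python, with invented data: a perfectly linear relationship gives a coefficient of 1 or –1, while y = x² on a symmetric range gives 0 even though y is fully determined by x. The function name `pearson` is illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
increasing = [2 * x + 1 for x in xs]    # linear, goes up with x
decreasing = [-2 * x + 1 for x in xs]   # linear, goes down with x
quadratic = [x ** 2 for x in xs]        # nonlinear, symmetric around 0

r_up = pearson(xs, increasing)      # 1.0
r_down = pearson(xs, decreasing)    # -1.0
r_quad = pearson(xs, quadratic)     # 0.0: the linear measure sees nothing
print(r_up, r_down, r_quad)
```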

14
Q

What is one-hot encoding?

A

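Create one binary attribute per category: for each instance, the attribute matching its category is 1 and all the others are 0. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new
attributes are sometimes called dummy attributes.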
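
A minimal sketch in plain Python. The function `one_hot_encode` and the category values are made up for illustration; ordering the categories alphabetically is just one possible convention.

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded

categories, encoded = one_hot_encode(["INLAND", "NEAR BAY", "INLAND", "ISLAND"])

print(categories)  # ['INLAND', 'ISLAND', 'NEAR BAY']
print(encoded)     # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Each row contains exactly one 1, in the column of that instance's category.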

15
Q

What are two common ways to perform scaling?

A

min-max scaling and standardization

16
Q

What is min-max scaling?

A

Min-max scaling (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting
the min value and dividing by the max minus the min.
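
The formula in plain Python, as a sketch on made-up values. The function name `min_max_scale` is illustrative, and the code assumes the values are not all identical (otherwise max minus min would be zero).

```python
def min_max_scale(values):
    """Shift and rescale values to the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30, 50])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```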

17
Q

What is standardization?

A

Standardization first subtracts the mean value (so standardized values always have a zero mean), and then divides by the standard deviation so that the resulting
distribution has unit variance.
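
Standardization can be sketched in plain Python as follows: subtract the mean, then divide by the standard deviation. The function name `standardize` and the data are illustrative, and the sketch assumes the population standard deviation (dividing by n) and values that are not all identical.

```python
import math

def standardize(values):
    """Subtract the mean, then divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Mean is 5 and standard deviation is 2 for this toy data.
standardized = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(standardized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The result has zero mean and unit variance, but its values are not confined to a fixed range.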

18
Q

What are the advantages and disadvantages of min-max scaling and standardization?

A

Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However,
standardization is much less affected by outliers.

19
Q

How can you perform the data transformation steps easily?

A

There are many data transformation steps that need to be executed in the right order. Scikit-Learn provides the Pipeline class to help with
such sequences of transformations.
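
A short sketch of a two-step Pipeline, assuming Scikit-Learn and NumPy are installed. The choice of steps (median imputation, then standardization), the step names, and the toy array are assumptions made for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, transformer) pair; steps run in the order listed.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),   # fill missing values first
    ("std_scaler", StandardScaler()),                # then standardize each column
])

X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],   # missing value, replaced by the column median
    [3.0, 30.0],
    [4.0, 40.0],
])

X_prepared = num_pipeline.fit_transform(X)
print(X_prepared.shape)  # (4, 2)
```

Calling `fit_transform` on the pipeline runs every step in sequence, so the order of the transformations is fixed in one place.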

20
Q

What is ensemble learning?

A

Building a model on top of many other models is called Ensemble Learning, and it is often a great way to push ML algorithms
even further.
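
One simple way to combine models is a majority vote over their predictions. The sketch below uses hand-picked toy outputs rather than trained models, and the function `majority_vote` is an illustrative name; it shows the idea only and is not the only form of ensemble.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the individual model predictions."""
    return Counter(predictions).most_common(1)[0][0]

true_labels = [1, 0, 1, 1, 0]
model_outputs = [
    [1, 0, 1, 0, 0],   # model A: wrong on sample 3
    [1, 1, 1, 1, 0],   # model B: wrong on sample 1
    [0, 0, 1, 1, 0],   # model C: wrong on sample 0
]

# zip(*...) groups the three predictions made for each sample.
ensemble = [majority_vote(votes) for votes in zip(*model_outputs)]
print(ensemble)  # [1, 0, 1, 1, 0]
```

Each model makes one mistake, but because the mistakes fall on different samples, the vote recovers the correct label every time in this toy case.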
