CH2 End-to-end ML project Flashcards
What is the first question to ask your boss when building a model?
Why is this important?
How does the company expect to use and benefit from this model?
This is important because it will determine how you frame the problem, what algorithms you will select, what performance measure you will use to
evaluate your model, and how much effort you should spend tweaking it.
What is a signal?
A piece of information fed to a Machine Learning system is often called a signal, in reference to Shannon’s information theory: you want a high signal/noise ratio.
What are pipelines?
A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply.
How do pipeline components run asynchronously?
Each component pulls in a large amount of data, processes it, and writes the result to another data store. Some time later, the next component in the pipeline pulls this data and produces its own output, and so on.
What kind of questions do you ask when framing a problem?
Is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques?
What is the next step after framing the problem?
Your next step is to select a performance measure.
What is the RMSE?
A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight for
large errors.
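A minimal sketch of the RMSE in plain Python, to show the heavier weight given to large errors. The helper name rmse and all the values are hypothetical, chosen only for illustration. Both sets of predictions have the same total absolute error (40), but the one that concentrates it in a single prediction gets twice the RMSE.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean of the squared errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical target values and two sets of predictions.
y_true = [100.0, 200.0, 300.0, 400.0]
small_errors = [110.0, 190.0, 310.0, 390.0]   # every prediction is off by 10
one_big_error = [100.0, 200.0, 300.0, 440.0]  # a single prediction is off by 40

rmse_small = rmse(y_true, small_errors)   # sqrt(400 / 4)  = 10.0
rmse_big = rmse(y_true, one_big_error)    # sqrt(1600 / 4) = 20.0
print(rmse_small, rmse_big)
```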
When do you use the MAE?
Suppose that there are many outliers (for example, many outlier districts). In that case, you may consider using the Mean Absolute Error (MAE, also called the Average Absolute Deviation), which is less sensitive to outliers than the RMSE.
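A small sketch, with made-up error values, of why the MAE may be preferred when outliers are present: a single large error inflates the RMSE much more than the MAE.

```python
import math

# Hypothetical prediction errors: four small ones and one outlier.
errors = [2.0, -2.0, 2.0, -2.0, 100.0]

# Mean Absolute Error: mean of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

# Root Mean Square Error, computed on the same errors for comparison.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE:  {mae:.1f}")
print(f"RMSE: {rmse:.1f}")
```

Here the MAE is 21.6 while the RMSE is roughly twice that, because squaring lets the outlier dominate the sum.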
What is data snooping bias?
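If you look at the test set before choosing a model, patterns you notice in it may influence your choice. When you then estimate the generalization error using the test set, your estimate will be too optimistic, and you will launch a system that will not perform as well as expected. This is called data snooping bias.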
Which function do you use to split a dataset?
Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split, which randomly splits a dataset into a training set and a test set. It has two useful features. First, there is a random_state parameter that allows you to set the random generator seed, so the split is reproducible. Second, you can pass it multiple datasets with an identical number of rows, and it will split them on the same indices (this is very useful, for example, if you have a separate DataFrame for labels).
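A short sketch of train_test_split with a separate DataFrame for labels. The column names and values are hypothetical toy data. The point is that both DataFrames are split on the same indices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows of features and a separate DataFrame of labels.
features = pd.DataFrame({"rooms": range(10), "income": [v * 1.5 for v in range(10)]})
labels = pd.DataFrame({"price": [v * 100 for v in range(10)]})

# random_state fixes the seed so the split is the same on every run.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# Features and labels keep matching row indices in each subset.
print(list(X_test.index), list(y_test.index))
```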
What is stratified sampling?
In stratified sampling, the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population.
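One possible way to do this in Scikit-Learn is the stratify parameter of train_test_split (the StratifiedShuffleSplit class is an alternative). This is a sketch on hypothetical data with an 80/20 split between two strata, and the column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical population: 80 instances in stratum "low", 20 in stratum "high".
data = pd.DataFrame({
    "value": range(100),
    "stratum": ["low"] * 80 + ["high"] * 20,
})

# stratify= keeps the stratum proportions the same in both subsets.
train_set, test_set = train_test_split(
    data, test_size=0.2, stratify=data["stratum"], random_state=42
)

test_counts = test_set["stratum"].value_counts()
print(test_counts)
```

The 20-instance test set keeps the population's 80/20 ratio: 16 "low" and 4 "high".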
What do the different correlation coefficients mean?
The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that there is a strong positive correlation.
When the coefficient is close to –1, it means that there is a strong negative correlation.
Finally, coefficients close to zero mean that there is no linear correlation.
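The three cases can be illustrated with NumPy's corrcoef, which returns the matrix of Pearson correlation coefficients. The data below is made up so that each case comes out exactly.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y increases linearly with x: coefficient of 1.
r_positive = np.corrcoef(x, 2 * x + 1)[0, 1]

# y decreases linearly with x: coefficient of -1.
r_negative = np.corrcoef(x, -3 * x + 10)[0, 1]

# Hypothetical y with no linear trend in x: coefficient of 0.
r_none = np.corrcoef(x, np.array([2.0, 1.0, 3.0, 1.0, 2.0]))[0, 1]

print(r_positive, r_negative, r_none)
```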
What relations can the correlation coefficient miss?
The correlation coefficient only measures linear correlations (“if x goes up, then y generally goes up/down”). It may completely miss out on nonlinear relationships.
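A minimal illustration with made-up data: y below is fully determined by x (y = x²), yet the correlation coefficient is zero because the relationship is not linear.

```python
import numpy as np

# x is symmetric around zero and y depends on x only through x squared.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"correlation between x and x**2: {r:.3f}")
```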
What is one-hot encoding?
One-hot encoding creates one binary attribute per category. It is called one-hot encoding because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called dummy attributes.
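A sketch using Scikit-Learn's OneHotEncoder on a hypothetical categorical attribute with three categories. The category values are assumptions for illustration. The encoder returns a sparse matrix by default, so it is converted to a dense array for display.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical attribute, one value per instance (2D, one column).
categories = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(categories).toarray()

# One column per category (sorted), exactly one 1 per row.
print(encoder.categories_)
print(one_hot)
```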
What are two common ways to perform scaling?
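Min-max scaling (often called normalization), which shifts and rescales values so they end up ranging from 0 to 1, and standardization, which subtracts the mean and then divides by the standard deviation so the result has zero mean and unit variance.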
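Both approaches are available in Scikit-Learn as MinMaxScaler and StandardScaler. This is a sketch on a made-up single-feature dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature column with one large value.
values = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Min-max scaling: (x - min) / (max - min), so values end up in [0, 1].
min_max_scaled = MinMaxScaler().fit_transform(values)

# Standardization: (x - mean) / std, so values get zero mean and unit variance.
standardized = StandardScaler().fit_transform(values)

print(min_max_scaled.ravel())
print(standardized.ravel())
```

Unlike min-max scaling, standardization does not bound values to a fixed range.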