CHAPTER 2 End-to-End Machine Learning Project Flashcards

1
Q

What are the main steps we need to take for an End-to-End ML project? P 65

A
  1. Look at the big picture.
  2. Get the data.
  3. Discover and visualize the data to gain insights.
  4. Prepare the data for Machine Learning algorithms.
  5. Select a model and train it.
  6. Fine-tune your model.
  7. Present your solution.
  8. Launch, monitor, and maintain your system.
    On page 67 there’s a link to an appendix explaining the details of each step.
2
Q

What is the first question to ask when beginning an ML project? P 67

A

The first question to ask your boss is what exactly the business objective is.

3
Q

What are upstream and downstream systems? External

A

An upstream system is any system that sends data to the system under consideration, while a downstream system is any system that receives data from it. For example, a data warehouse that feeds your training pipeline is upstream of it, and a dashboard that consumes your model’s predictions is downstream.

4
Q

What’s a pipeline? P 68

A

A sequence of data processing components is called a data pipeline.
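
For instance, here is a minimal sketch of the idea using scikit-learn’s Pipeline class (the specific steps are illustrative assumptions, not from this page of the book):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# each component transforms the data and passes its output to the next one
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill in missing values
    ("scaler", StandardScaler()),                   # standardize the features
])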

5
Q

What’s the difference between online learning and batch learning? External

A

In computer science, online machine learning is a method in which data becomes available in sequential order and is used to update the best predictor for future data at each step. This is in contrast to batch learning techniques, which generate the best predictor by learning on the entire training data set (or on batches of it). (Mahsa: basically, online learning updates the weights one instance at a time, e.g. with stochastic gradient descent, while batch learning updates the weights using batches of data.)
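
A minimal sketch of the contrast using scikit-learn’s SGDRegressor (the synthetic data, model choice, and chunk size are illustrative assumptions):

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(42)
X = rng.rand(1000, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.randn(1000)

# Batch learning: train on the entire dataset at once.
batch_model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
batch_model.fit(X, y)

# Online learning: update the model incrementally as data arrives.
online_model = SGDRegressor(random_state=42)
for i in range(0, len(X), 100):  # simulate data arriving in chunks
    online_model.partial_fit(X[i:i+100], y[i:i+100])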

6
Q

What should we do after defining the problem? P 69

A

Your next step is to select a performance measure.

7
Q

How is Root Mean Square Error (RMSE) calculated? P 70

A

It’s the square root of the Mean Squared Error:

RMSE = sqrt( (1/m) * Σ_{i=1..m} ( ŷ(i) - y(i) )² )

where m is the number of instances, ŷ(i) is the prediction for the i-th instance, and y(i) is the corresponding label.
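
For example, a quick NumPy computation (the arrays are made-up values):

import numpy as np

y = np.array([3.0, 5.0, 2.5])            # labels
predictions = np.array([2.5, 5.0, 4.0])  # model predictions
rmse = np.sqrt(np.mean((predictions - y) ** 2))  # sqrt of the mean squared error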

8
Q

What is the most common performance metric for regression problems? What performance metric do we use when we have many outliers? P 71

A

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. For example, suppose that there are many outlier districts. In that case, you may consider using the mean absolute error (MAE, also called the average absolute deviation).

9
Q

Computing the root of a sum of squares (RMSE) corresponds to the … norm: this is the notion of distance you are familiar with. It is also called the …, denoted as … P 71

A

Euclidean, ℓ2 norm, ||·||2 (or just ||·||)

10
Q

Computing the sum of absolutes (MAE) corresponds to the …, denoted as …. This is sometimes called the … norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks. P 71

A

ℓ1 norm, ||·||1, Manhattan

11
Q

How is MAE calculated? P 71

A

MAE = (1/m) * Σ_{i=1..m} | ŷ(i) - y(i) |
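
As a quick NumPy check (made-up arrays, mirroring the RMSE example above):

import numpy as np

y = np.array([3.0, 5.0, 2.5])            # labels
predictions = np.array([2.5, 5.0, 4.0])  # model predictions
mae = np.mean(np.abs(predictions - y))   # mean of the absolute errors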

12
Q

More generally, the ℓk norm of a vector v containing n elements is defined as: ____ P 71

A

||v||_k = ( |v_0|^k + |v_1|^k + … + |v_{n-1}|^k )^(1/k)

13
Q

Why the RMSE is more sensitive to outliers than the MAE? P 71

A

The higher the norm index k (recall ||v||_k = ( |v_0|^k + |v_1|^k + … + |v_{n-1}|^k )^(1/k)), the more it focuses on large values and neglects small ones. Since the RMSE corresponds to the ℓ2 norm and the MAE to the ℓ1 norm, the RMSE gives more weight to large errors and is therefore more sensitive to outliers. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
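
A small NumPy illustration of this effect (the vector is a made-up example with one large component):

import numpy as np

v = np.array([1.0, 1.0, 10.0])
print(np.linalg.norm(v, ord=1))  # l1 norm: 12.0 (all components count equally)
print(np.linalg.norm(v, ord=2))  # l2 norm: about 10.1 (dominated by the large value)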

14
Q

Besides using sklearn, what can we do to have a stable train/test split even after updating the dataset? Code P 82

A

To have a stable train/test split even after updating the dataset, a common solution is to use each instance’s identifier to decide whether it should go in the test set (assuming instances have a unique and immutable identifier). For example, you could:
1- ✨ compute a hash of each instance’s identifier ✨
2- ✨ put that instance in the test set if the hash is lower than or equal to 20% of the maximum hash value. ✨
This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset.
The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
Here is a possible implementation:

from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    # keep instances whose 32-bit hash falls below test_ratio of the hash range
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
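
A possible companion helper for pandas DataFrames (a sketch only; it assumes the test_set_check() above and a unique, immutable id column):

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# usage sketch: train_set, test_set = split_train_test_by_id(housing, 0.2, "index")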

15
Q

When a company wants to sample 1000 people, do they do it purely randomly? Use an example to explain P 83

A

When a survey company decides to call 1,000 people to ask them a few questions, they don’t just pick 1,000 people randomly from a phone book. They try to ensure that these 1,000 people are representative of the whole population. For example, the US population is 51.3% female and 48.7% male, so a well-conducted survey in the US would try to maintain this ratio in the sample: 513 female and 487 male. This is called stratified sampling.

1/ The population is divided into homogeneous subgroups called ✨strata✨,
2/ The right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population.

“Stratum” refers to a single subgroup or category, while “Strata” can mean several, or all, groups
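
A minimal sketch with scikit-learn (assuming the book’s housing DataFrame and an “income_cat” column to stratify on):

from sklearn.model_selection import train_test_split

strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)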

16
Q

In stratified sampling, “Stratum” refers to a single subgroup or category, while “Strata” can mean several, or all, groups.
You should have many strata, and each stratum should be large enough. True/False P 84

A

False. You should not have too many strata, and each stratum should be large enough.

17
Q

How can we turn continuous features into categorical ones using Pandas and sklearn? External

A

Pandas: pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

Sklearn: sklearn.preprocessing.KBinsDiscretizer(n_bins=5, *, encode='onehot', strategy='quantile', dtype=None, subsample='warn', random_state=None)
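
For example (the bin edges and labels follow the book’s income-category example; the housing DataFrame is an assumption here):

import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# pandas: explicit bin edges and labels
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

# scikit-learn: five equal-frequency bins, ordinal-encoded
discretizer = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
income_cat = discretizer.fit_transform(housing[["median_income"]])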

18
Q

What is the parameter in a scatter plot that helps us see high-density parts? P 86

A

Setting the alpha option to 0.1 makes it much easier to visualize the places where there is a high density of data points. For example:
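
A one-line illustration (assuming the book’s housing DataFrame is loaded):

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)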

19
Q

How can we set the radius and color of the circles in a scatter plot? P 88

A

✨ The radius of each circle represents the district’s population (option s),
✨ The color represents the price (option c).
✨ We will use a predefined color map (option cmap) called jet, which ranges from blue (low prices) to red (high prices):

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()

20
Q

Let’s say we have achieved a result on the test set using the tuned model. How can we be sure that it’s better than the previous model in use and that the result is not based on chance? P 110

A

In some cases, such a point estimate of the generalization error will not be quite enough to convince you to launch: what if it is just 0.1% better than the model currently in production? You might want to have an idea of how precise this estimate is.
For this, you can compute a 95% confidence interval for the generalization error using ✨ scipy.stats.t.interval(): ✨
>>> from scipy import stats
>>> confidence = 0.95
>>> squared_errors = (final_predictions - y_test) ** 2
>>> np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
...                          loc=squared_errors.mean(),
...                          scale=stats.sem(squared_errors)))
array([45685.10470776, 49691.25001878])

21
Q

What data do we need to compute the confidence interval? External

A

To compute a 95% confidence interval, you need three pieces of data:
✨ The mean (for continuous data) or the proportion (for binary data).
✨ The standard deviation, which describes how dispersed the data is around the average.
✨ The sample size.
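
A minimal sketch computing a 95% interval from those three ingredients (the sample values are made up):

import numpy as np
from scipy import stats

sample = np.array([2.9, 3.1, 3.0, 3.4, 2.8, 3.2])
mean = sample.mean()               # 1) the mean
sem = stats.sem(sample)            # combines 2) the std and 3) the sample size
interval = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)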