CHAPTER 2 End-to-End Machine Learning Project Flashcards
What are the main steps we need to take for an End-to-End ML project? P 65
- Look at the big picture.
- Get the data.
- Discover and visualize the data to gain insights.
- Prepare the data for Machine Learning algorithms.
- Select a model and train it.
- Fine-tune your model.
- Present your solution.
- Launch, monitor, and maintain your system.
On page 67 there is a link to an appendix explaining the details of each step.
What is the first question to ask when beginning an ML project? P 67
The first question to ask your boss is what exactly the business objective is.
What are upstream and downstream systems? External
An upstream system is any system that sends data into the system in question. A downstream system is any system that receives data from it.
What’s a pipeline? P 68
A sequence of data processing components is called a data pipeline.
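As a minimal sketch of the idea (assuming scikit-learn; the component names here are illustrative, not from the book's housing example), each component's output feeds the next component's input:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# A two-step pipeline: fill missing values, then standardize features.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])
X_prepared = num_pipeline.fit_transform(X)
print(X_prepared.shape)  # (3, 2)
```

Calling fit_transform on the pipeline runs fit_transform on each component in sequence, passing each output to the next step.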
What’s the difference between online learning and batch learning? External
In computer science, online machine learning is a method in which data becomes available in sequential order and is used to update the best predictor for future data at each step. Batch learning techniques, in contrast, generate the best predictor by learning on the entire training data set (or batches of it) at once. (Mahsa: basically, online learning updates the weights with stochastic gradient descent as samples arrive, while batch learning updates the weights using batches.)
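A small sketch of the contrast (the choice of SGDRegressor is an assumption for illustration; any estimator supporting partial_fit works for the online case):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(42)
X = rng.rand(100, 1)
y = 3 * X.ravel() + rng.randn(100) * 0.1

# Batch learning: a single fit() call sees the whole training set.
batch_model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
batch_model.fit(X, y)

# Online learning: data arrives in chunks; each chunk updates the model.
online_model = SGDRegressor(random_state=42)
for chunk in np.array_split(np.arange(100), 10):
    online_model.partial_fit(X[chunk], y[chunk])

print(batch_model.coef_, online_model.coef_)
```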
What should we do after defining the problem? P 69
Your next step is to select a performance measure.
How is Root Mean Square Error (RMSE) calculated? P 70
It is the square root of the Mean Squared Error:
RMSE = sqrt( (1/m) Σ (prediction − y)² )
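A direct NumPy translation of the formula (the function name is illustrative):

```python
import numpy as np

def rmse(predictions, targets):
    # Square root of the mean of the squared errors.
    errors = np.asarray(predictions) - np.asarray(targets)
    return np.sqrt(np.mean(errors ** 2))

print(rmse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))  # ≈ 0.408
```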
What is the most common performance metric for regression problems? What performance metric do we use when we have many outliers? P 71
Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer another function. For example, if there are many outlier districts, you may consider using the mean absolute error (MAE, also called the average absolute deviation).
Computing the root of a sum of squares (RMSE) corresponds to the …norm: this is the notion of distance you are familiar with. It is also called the …, denoted as … P 71
Euclidean, ℓ2 norm, || ·||2 (or just || ·||)
Computing the sum of absolutes (MAE) corresponds to the…, denoted as …. This is sometimes called the …norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks. P 71
ℓ1 norm,||·||1, Manhattan
How is MAE calculated? P 71
MAE = (1/m) Σ |prediction − y|
More generally, the ℓk norm of a vector v containing n elements is defined as: ____ P 71
||v||_k = ( |v0|^k + |v1|^k + … + |vn|^k ) ^ (1/k)
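This definition translates directly to code; a quick sketch showing that k = 1 gives the Manhattan (ℓ1) norm and k = 2 gives the Euclidean (ℓ2) norm:

```python
import numpy as np

def lk_norm(v, k):
    # Sum of |v_i|^k, then take the k-th root.
    return np.sum(np.abs(v) ** k) ** (1.0 / k)

v = np.array([3.0, -4.0])
print(lk_norm(v, 1))  # 7.0 (Manhattan, the MAE's sum of absolutes)
print(lk_norm(v, 2))  # 5.0 (Euclidean, the RMSE's root of sum of squares)
```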
Why is the RMSE more sensitive to outliers than the MAE? P 71
The higher the norm index k in ∥v∥_k = ( |v0|^k + |v1|^k + … + |vn|^k ) ^ (1/k), the more it focuses on large values and neglects small ones. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
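A small numeric demonstration of this sensitivity: with a single large outlier error among 100 predictions, the RMSE jumps while the MAE barely moves.

```python
import numpy as np

def rmse(p, y): return np.sqrt(np.mean((p - y) ** 2))
def mae(p, y): return np.mean(np.abs(p - y))

targets = np.zeros(100)
clean = np.full(100, 1.0)   # every prediction off by exactly 1
outlier = clean.copy()
outlier[0] = 50.0           # one large outlier error

print(rmse(clean, targets), mae(clean, targets))      # 1.0   1.0
print(rmse(outlier, targets), mae(outlier, targets))  # ≈5.1  1.49
```

Squaring the errors magnifies the single 50-unit error far more than taking its absolute value does.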
Besides using Scikit-Learn, what can we do to have a stable train/test split even after updating the dataset? Code P 82
To have a stable train/test split even after updating the dataset, a common solution is to use each instance’s identifier to decide whether it should go in the test set (assuming instances have a unique and immutable identifier). For example, you could:
1- ✨ compute a hash of each instance’s identifier ✨
2-✨ put that instance in the test set if the hash is lower than or equal to 20% of the maximum hash value.✨
This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset.
The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
Here is a possible implementation:
from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
When a company wants to sample 1000 people, do they do it purely randomly? Use an example to explain P 83
When a survey company decides to call 1,000 people to ask them a few questions, they don’t just pick 1,000 people randomly in a phone book. They try to ensure that these 1,000 people are representative of the whole population. For example, the US population is 51.3%
females and 48.7% males, so a well-conducted survey in the US would try to maintain this ratio in the sample: 513 female and 487 male. This is called stratified sampling.
1/ The population is divided into homogeneous subgroups called ✨strata✨,
2/ The right number of instances are sampled from each stratum to guarantee that the test set is representative of the overall population.
“Stratum” refers to a single subgroup or category, while “strata” is its plural, referring to several (or all) of the subgroups.
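The survey example above can be sketched with scikit-learn's train_test_split, whose stratify argument preserves each stratum's ratio in both splits (the 51.3%/48.7% numbers come from the card; the rest is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative population of 1,000: 513 in stratum 0, 487 in stratum 1.
strata = np.array([0] * 513 + [1] * 487)
X = np.arange(1000).reshape(-1, 1)

X_train, X_test, s_train, s_test = train_test_split(
    X, strata, test_size=0.2, stratify=strata, random_state=42)

print(np.mean(s_test == 0))  # ≈ 0.513, matching the population ratio
```

Without stratify, a purely random 200-person sample could easily drift away from the 513/487 ratio by chance.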