Chapter 2: End-to-End Machine Learning Project Flashcards
In the machine learning project checklist, what is the first step? (Hint: Look at…)
…the big picture.
In the machine learning project checklist, what is the second step? (Hint: Get…)
…the data.
In the machine learning project checklist, what is the third step? (Hint: Explore and visualize…)
…the data to gain insights.
In the machine learning project checklist, what is the fourth step? (Hint: Prepare the…)
…data for machine learning algorithms.
In the machine learning project checklist, what is the fifth step? (Hint: Select a…)
…model and train it.
In the machine learning project checklist, what is the sixth step? (Hint: Fine…)
…-tune your model.
In the machine learning project checklist, what is the seventh step? (Hint: Present…)
…your solution.
In the machine learning project checklist, what is the eighth step? (Hint: Launch…)
…, monitor, and maintain your system.
What should you consider when deciding between batch and online learning?
- Is there a continuous flow of data coming into the system? No = Batch / Yes = Online.
- Will the model need to adjust to rapidly changing data? No = Batch / Yes = Online.
- Is the data small enough to fit into memory? Yes = Batch / No = Online (out-of-core learning).
What is a suitable performance metric for a linear regression?
Root mean squared error (RMSE). It gives an idea of how much error the system typically makes in its predictions, giving more weight to larger errors because each error is squared before averaging.
What are the advantages and disadvantages of mean absolute error (MAE)?
Advantages:
- Robust to outliers.
- Same units as the output variable.
Disadvantages:
- MAE is not differentiable at zero error, so there is no closed-form solution like the Normal Equation for least squares; it has to be minimised with iterative methods such as (sub)gradient descent.
What are the advantages and disadvantages of root mean squared error (RMSE)?
Advantages:
- Same units as the output variable.
Disadvantages:
- Not robust to outliers, because squaring the errors biases the metric towards large error values.
When is RMSE preferred to MAE?
When errors follow a Gaussian distribution and outliers are exponentially rare.
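The difference between the two metrics can be illustrated with a minimal sketch. The function names and the small data set below are hypothetical and chosen only for illustration; both metrics are written in plain Python so the arithmetic is visible.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: the average of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

# Hypothetical targets and predictions; every error is exactly 1.
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 11.0, 15.0, 15.0]
clean = (mae(y_true, y_pred), rmse(y_true, y_pred))

# Same data, but the last prediction is now off by 9 (an outlier error).
y_pred_outlier = [11.0, 11.0, 15.0, 25.0]
with_outlier = (mae(y_true, y_pred_outlier), rmse(y_true, y_pred_outlier))

print("MAE, RMSE without outlier:", clean)
print("MAE, RMSE with outlier:   ", with_outlier)
```

When all errors have the same size the two metrics agree. A single large error moves RMSE further than MAE, which is the sense in which RMSE is less robust to outliers.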
What is an example of checking the assumptions on a machine learning project?
An assumption on a machine learning project could be that the output of the model is going to be used as a numerical value, e.g. when predicting price it will be a $ value rather than a category such as high price/medium price/low price.
Checking this assumption is very important as it guides the solution architecture: a $ value would require a regression model, a category a classification model.
When you start to explore the data in a machine learning project, what characteristics of each attribute should you look at?
- Name.
- Data Type (categorical, int/float, bounded/unbounded, structured/unstructured).
- % null values.
- Noisiness and type of noise (stochastic, outliers, rounding errors).
- Usefulness for the task.
- Distribution (Gaussian/Uniform/Logarithmic).
Before performing any Exploratory Data Analysis for a machine learning task, what should you do?
Split the data into train, test, and dry-run sets (for example 84%/15%/1%). Splitting off the test set and hiding it prevents data snooping. The dry-run set is a small sample used for testing code pipelines.
When creating a train/test split, why is using a hash of an identifier preferable over a random split?
With a random split and no fixed seed, the data gets shuffled differently each time the script is run, so eventually all of the data will be seen by the data scientist and the ML algorithm, potentially leading to data snooping. Fixing the seed helps only until the dataset is updated.
Using a hash of a unique ID creates a split that stays stable across runs and when new records are added.
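A hash-based split can be sketched as follows. This is one possible approach, assuming each record carries a unique and immutable ID; the helper names (`is_in_test_set`, `split_by_id`), the CRC32 hash, and the 15% ratio are illustrative choices rather than a prescribed method.

```python
import zlib

def is_in_test_set(identifier, test_ratio):
    """Decide test-set membership from a hash of the record's ID."""
    # zlib.crc32 returns an unsigned 32-bit integer, so comparing it with
    # test_ratio * 2**32 selects roughly test_ratio of all possible IDs.
    return zlib.crc32(str(identifier).encode("utf-8")) < test_ratio * 2**32

def split_by_id(records, id_key, test_ratio=0.15):
    """Split records into (train, test) using only each record's ID."""
    train, test = [], []
    for rec in records:
        (test if is_in_test_set(rec[id_key], test_ratio) else train).append(rec)
    return train, test

# Hypothetical dataset with integer IDs.
records = [{"id": i, "value": i * 2} for i in range(1000)]
train, test = split_by_id(records, "id")

# Later the dataset grows; existing records keep their original assignment.
more_records = records + [{"id": i, "value": i * 2} for i in range(1000, 1200)]
train2, test2 = split_by_id(more_records, "id")

print(len(train), len(test), len(train2), len(test2))
```

Because membership depends only on the ID, a record that was in the test set stays there in every later run, so it can never leak into training.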
When exploring the data through visualisations, how should you treat geographical data such as longitude and latitude?
- Visualise it using a scatterplot or a mapping library such as Plotly.
- Overlay different attributes, using colour or density, from the dataset and inspect for any patterns that appear.
What is useful to compute between pairs of continuous attributes?
The correlation matrix or the scatter matrix.
Why should we be cautious when looking at the correlation between continuous features?
- It only measures linear correlations and would completely miss non-linear relationships.
- Two attributes that record the same quantity in different units are perfectly correlated, so a very high correlation may only indicate redundant features.
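Both cautions can be checked numerically. The sketch below implements the Pearson correlation coefficient in plain Python; the function name `pearson` and the sample values are hypothetical and used only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson (linear) correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# y is completely determined by x, but the relationship is not linear.
x = [-3, -2, -1, 0, 1, 2, 3]
y_quadratic = [v ** 2 for v in x]
r_nonlinear = pearson(x, y_quadratic)

# The same lengths expressed in metres and in feet.
metres = [1.0, 2.5, 4.0, 7.5]
feet = [m * 3.28084 for m in metres]
r_units = pearson(metres, feet)

print("x vs x squared: ", r_nonlinear)
print("metres vs feet: ", r_units)
```

The first coefficient is zero even though y depends entirely on x, while the second is (up to floating-point error) one even though the two attributes carry no independent information.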
How can experimenting with attribute combinations be helpful when exploring the data? Can you give an example?
Some attributes in a dataset may provide more information when combined. For example, when including tenure and promotions in a risk model for attrition, dividing the number of promotions by tenure gives the number of promotions per year of tenure, which can be more informative than the raw number of promotions.
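The promotions-per-year example could be computed along these lines. The field names and the employee records are hypothetical, and treating zero tenure as a rate of 0.0 is an assumption made here to avoid division by zero, not a rule from the text.

```python
# Hypothetical employee records.
employees = [
    {"name": "A", "tenure_years": 10, "promotions": 2},
    {"name": "B", "tenure_years": 2, "promotions": 2},
    {"name": "C", "tenure_years": 0, "promotions": 0},
]

def add_promotions_per_year(rows):
    """Return copies of the rows with a combined promotions_per_year attribute."""
    enriched_rows = []
    for row in rows:
        tenure = row["tenure_years"]
        # Assumption: new joiners with zero tenure get a rate of 0.0.
        rate = row["promotions"] / tenure if tenure > 0 else 0.0
        enriched_rows.append({**row, "promotions_per_year": rate})
    return enriched_rows

enriched = add_promotions_per_year(employees)
for row in enriched:
    print(row["name"], row["promotions_per_year"])
```

Employees A and B have the same raw promotion count, but the combined attribute separates them clearly.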
When cleaning the data when preparing it for machine learning, what strategies can be used for dealing with missing data?
- Remove the rows with missing data, if the number of rows affected is low.
- Remove the attribute with missing data, if the number of rows affected is high.
- Impute the missing values using an appropriate statistic (mean/median) or zero.
- Impute the missing values using another machine learning model, with the attribute that has missing values as the target. Examples of these are KNNImputer, which uses the k-nearest neighbors algorithm, and IterativeImputer, which trains a regression model per feature.
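The first three strategies above can be sketched in plain Python. The helper names and the sample rows are hypothetical; in practice a library imputer would normally be used, and the imputation statistic should be computed on the training set only and then reused on the test set.

```python
import statistics

# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 41, "income": None},
    {"age": 33, "income": 61000},
]

def drop_rows_with_missing(rows):
    """Strategy 1: keep only rows where no attribute is missing."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_attribute(rows, key):
    """Strategy 2: remove an attribute from every row."""
    return [{k: v for k, v in r.items() if k != key} for r in rows]

def impute_median(rows, key):
    """Strategy 3: replace missing values of one attribute with its median."""
    median = statistics.median(r[key] for r in rows if r[key] is not None)
    return [{**r, key: median if r[key] is None else r[key]} for r in rows]

complete_rows = drop_rows_with_missing(rows)
without_age = drop_attribute(rows, "age")
age_imputed = impute_median(rows, "age")

print(complete_rows)
print(without_age)
print(age_imputed)
```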
When dealing with missing values, what is important to consider?
It is important to consider if the values missing are random or systematic.
For example, in an unbalanced classification task, an attribute may be missing for only 1% of records, but those records may all be instances of the target class. In that case, dropping the rows would not be suitable.
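One simple way to check for systematic missingness is to compare the missing rate per class. The sketch below uses a hypothetical fraud dataset and a hypothetical helper named `missing_rate_by_class`.

```python
from collections import defaultdict

def missing_rate_by_class(rows, attribute, target):
    """Fraction of rows with a missing attribute value, per target label."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        totals[row[target]] += 1
        if row[attribute] is None:
            missing[row[target]] += 1
    return {label: missing[label] / totals[label] for label in totals}

# Hypothetical data: 100 rows, 2 of them in the rare positive class.
# Only 1% of all rows have a missing amount, but it is a positive-class row.
rows = (
    [{"amount": 10.0, "fraud": 0} for _ in range(98)]
    + [{"amount": None, "fraud": 1}, {"amount": 5.0, "fraud": 1}]
)

rates = missing_rate_by_class(rows, "amount", "fraud")
print(rates)
```

An overall missing rate of 1% looks harmless, yet dropping those rows here would remove half of the positive class.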
How can you process categorical variables into numbers for machine learning?
- Ordinal encoding: This replaces each category with an integer from 0 to the number of categories minus 1. This works best for categories with a logical order, e.g. a rating scale of “bad”, “average”, “good”, and “excellent”.
- One-hot encoding: This creates a new binary attribute for each category, equal to 1 when the original attribute has that category and 0 otherwise.
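Both encodings can be sketched in a few lines of plain Python. The helper names and sample values are hypothetical; a library encoder would usually be preferred in a real pipeline so that the same category mapping is applied to training and test data.

```python
def ordinal_encode(values, ordered_categories):
    """Map each value to its position in an explicitly ordered category list."""
    index = {cat: i for i, cat in enumerate(ordered_categories)}
    return [index[v] for v in values]

def one_hot_encode(values):
    """Return the sorted categories and one binary column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == cat else 0 for cat in categories] for v in values]
    return categories, encoded

# Ordered categories: the integer codes preserve the ranking.
ratings = ["good", "bad", "excellent", "average"]
rating_codes = ordinal_encode(ratings, ["bad", "average", "good", "excellent"])

# Unordered categories: one-hot encoding avoids implying a ranking.
colours = ["red", "green", "red", "blue"]
colour_categories, colour_one_hot = one_hot_encode(colours)

print(rating_codes)
print(colour_categories)
print(colour_one_hot)
```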