Chapter 2: End-to-End Machine Learning Project Flashcards
In the machine learning project checklist, what is the first step? (Hint: Look at…)
…the big picture.
In the machine learning project checklist, what is the second step? (Hint: Get…)
…the data.
In the machine learning project checklist, what is the third step? (Hint: Explore and visualize…)
…the data to gain insights.
In the machine learning project checklist, what is the fourth step? (Hint: Prepare the…)
…data for machine learning algorithms.
In the machine learning project checklist, what is the fifth step? (Hint: Select a…)
…model and train it.
In the machine learning project checklist, what is the sixth step? (Hint: Fine…)
…-tune your model.
In the machine learning project checklist, what is the seventh step? (Hint: Present…)
…your solution.
In the machine learning project checklist, what is the eighth step? (Hint: Launch…)
…, monitor, and maintain your system.
What should you consider when deciding between batch and online learning?
- Is there a continuous flow of data coming into the system? No = Batch / Yes = Online.
- Will the model need to adapt to rapidly changing data? No = Batch / Yes = Online.
- Is the data small enough to fit into memory? Yes = Batch / No = Online.
What is a suitable performance metric for a linear regression?
Root mean squared error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight given to larger errors because each error is squared before averaging.
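The definition above can be sketched in a few lines of plain Python. This is only an illustration: the function name `rmse` and the sample prices are made up for the example and do not come from the chapter.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices (in $1000s) and predictions, for illustration only.
actual = [200.0, 150.0, 300.0, 250.0]
predicted = [210.0, 140.0, 330.0, 250.0]

# Residuals are -10, 10, -30, 0; the single 30-unit miss dominates the result.
print(rmse(actual, predicted))
```

The result is in the same units as the target (here, $1000s), which is what makes it easy to interpret.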
What are the advantages and disadvantages of mean absolute error (MAE)?
Advantages:
- Robust to outliers.
- Same units as the output variable.
Disadvantages:
- MAE is not differentiable where the error is zero, so using it as a loss function requires an iterative optimizer such as (sub)gradient descent rather than a closed-form solution.
What are the advantages and disadvantages of root mean squared error (RMSE)?
Advantages:
- Same units as the output variable.
Disadvantages:
- Not robust to outliers, because squaring the errors biases it towards larger error values.
When is RMSE preferred to MAE?
When errors follow a Gaussian distribution and outliers are exponentially rare.
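A small sketch of why outliers matter when choosing between the two metrics. The error values below are invented for illustration, and the helper names `mean_abs_error` and `root_mean_sq_error` are hypothetical.

```python
import math

def mean_abs_error(errors):
    """MAE computed directly from a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

def root_mean_sq_error(errors):
    """RMSE computed directly from a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Invented residuals: the second list swaps one small error for an outlier.
typical = [1.0, -1.0, 1.0, -1.0]
with_outlier = [1.0, -1.0, 1.0, -9.0]

for label, errs in [("typical", typical), ("with outlier", with_outlier)]:
    print(label, mean_abs_error(errs), root_mean_sq_error(errs))
```

With no outlier both metrics agree; with one large residual the RMSE grows faster than the MAE, which is the sensitivity described in the cards above.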
What is an example of checking the assumptions on a machine learning project?
One assumption could be that the output of the model will be used as a numerical value, e.g. when predicting price, a $ value rather than a category such as high/medium/low price.
Checking this assumption is very important because it guides the solution architecture: a $ value requires a regression model, whereas a category requires a classification model.
When you start to explore the data in a machine learning project, what characteristics of each attribute should you look at?
- Name.
- Data Type (categorical, int/float, bounded/unbounded, structured/unstructured).
- % null values.
- Noisiness and type of noise (stochastic, outliers, rounding errors).
- Usefulness for the task.
- Distribution (Gaussian/Uniform/Logarithmic).
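Several of the characteristics listed above can be collected programmatically. The following is a minimal sketch using only the standard library; the function `profile_attribute`, the attribute names, and the sample values are all hypothetical, and judging noisiness, usefulness, and distribution shape still needs a human looking at plots.

```python
import statistics

def profile_attribute(name, values):
    """Summarise one attribute: inferred type, % missing, basic statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    present = [v for v in values if v is not None]
    pct_null = 100.0 * (len(values) - len(present)) / len(values)
    # Treat the attribute as numeric only if every present value is int/float.
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    summary = {
        "name": name,
        "type": "numeric" if numeric else "categorical",
        "pct_null": pct_null,
    }
    if numeric:
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.mean(present)
        summary["median"] = statistics.median(present)
    else:
        summary["n_unique"] = len(set(present))
    return summary

# Hypothetical columns, with None standing in for a missing value.
income = profile_attribute("median_income", [3.5, None, 8.0, 2.5])
proximity = profile_attribute("ocean_proximity", ["INLAND", "NEAR BAY", "INLAND", None])
print(income)
print(proximity)
```

A mean well above the median, as in the numeric column here, is one quick hint that the distribution is skewed and worth plotting.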
Before performing any Exploratory Data Analysis for a machine learning task, what should you do?
Split the data into full train/test/dry-run sets (84%/15%/1%). Splitting the data and hiding the test set prevents data snooping. The dry-run set is used for testing code pipelines.
When creating a train/test split, why is using a hash of an identifier preferable over a random split?
With a random split, the data is reshuffled each time the script runs (unless the random seed is fixed), so over repeated runs the data scientist and the ML algorithm eventually see the whole dataset, which can lead to data snooping.
Using a hash of a unique ID creates a stable split: each instance stays in the same set across runs, even after the dataset is updated.
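One way to sketch a hash-based split uses `zlib.crc32` from the standard library. The function names, the 15% test ratio, and the integer IDs are assumptions made for this example; any stable hash of a unique, immutable identifier would serve the same purpose.

```python
from zlib import crc32

def is_in_test_set(identifier, test_ratio):
    """Return True if the hash of the ID falls in the lowest test_ratio
    fraction of the 32-bit hash range."""
    hashed = crc32(str(identifier).encode("utf-8"))  # unsigned 32-bit int
    return hashed < test_ratio * 2**32

def split_by_id(ids, test_ratio=0.15):
    """Partition IDs into (train_ids, test_ids) deterministically."""
    train_ids = [i for i in ids if not is_in_test_set(i, test_ratio)]
    test_ids = [i for i in ids if is_in_test_set(i, test_ratio)]
    return train_ids, test_ids

# Hypothetical record IDs; later, more records arrive.
original_ids = list(range(1000))
updated_ids = list(range(1200))

train_a, test_a = split_by_id(original_ids)
train_b, test_b = split_by_id(updated_ids)

# Every ID that was in the test set before the update is still in it.
print(set(test_a) <= set(test_b))
```

Because the assignment depends only on the ID itself, no record can migrate from the test set into the training set when the script is rerun or new data is appended. The realised test fraction is only approximately `test_ratio`.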
When exploring the data through visualisations, how should you treat geographical data such as longitude and latitude?
- Visualise it using a scatterplot or a mapping library such as Plotly.
- Overlay different attributes, using colour or density, from the dataset and inspect for any patterns that appear.