Chapter 2: End-to-End Machine Learning Project Flashcards

1
Q

In the machine learning project checklist, what is the first step? (Hint: Look at…)

A

…the big picture.

2
Q

In the machine learning project checklist, what is the second step? (Hint: Get…)

A

…the data.

3
Q

In the machine learning project checklist, what is the third step? (Hint: Explore and visualize…)

A

…the data to gain insights.

4
Q

In the machine learning project checklist, what is the fourth step? (Hint: Prepare the…)

A

…data for machine learning algorithms.

5
Q

In the machine learning project checklist, what is the fifth step? (Hint: Select a…)

A

…model and train it.

6
Q

In the machine learning project checklist, what is the sixth step? (Hint: Fine…)

A

… tune your model.

7
Q

In the machine learning project checklist, what is the seventh step? (Hint: Present…)

A

… your solution.

8
Q

In the machine learning project checklist, what is the eighth step? (Hint: Launch…)

A

…, monitor, and maintain your system.

9
Q

What considerations should you take into account when deciding between batch and online learning?

A
  • Is there a continuous flow of data coming into the system?
    No = Batch / Yes = Online.
  • Will the model need to adjust to rapidly changing data?
    No = Batch / Yes = Online.
  • Is the data small enough to fit into memory?
    Yes = Batch / No = Online (out-of-core learning).
10
Q

What is a suitable performance metric for a linear regression?

A

Root Mean Squared Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight given to larger errors because the errors are squared before being averaged.
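
A quick reference sketch (not part of the card): both RMSE and the MAE discussed in the next cards can be computed with NumPy.

    import numpy as np

    def rmse(y_true, y_pred):
        # Square the errors, average them, then take the square root:
        # larger errors are weighted more heavily.
        return np.sqrt(np.mean((y_pred - y_true) ** 2))

    def mae(y_true, y_pred):
        # Average of the absolute errors: every error is weighted equally.
        return np.mean(np.abs(y_pred - y_true))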

11
Q

What are the advantages and disadvantages of mean absolute error (MAE)?

A

Advantages:
- Robust to outliers.
- Same units as the output variable.

Disadvantages:
- The MAE graph is not differentiable at zero, which complicates using it as a loss function with gradient-based optimizers (subgradient methods are needed).

12
Q

What are the advantages and disadvantages of root mean squared error (RMSE)?

A

Advantages:
- Same units as the output variable.

Disadvantages:
- Not robust to outliers, as squaring biases it towards larger error values.

13
Q

When is RMSE preferred to MAE?

A

When errors follow a Gaussian distribution and outliers are exponentially rare.

14
Q

What is an example of checking the assumptions on a machine learning project?

A

An assumption on a machine learning project could be that the output of the model is going to be used as a numerical value, e.g. when predicting price it will be a $ value rather than a category such as high/medium/low price.

Checking this assumption is very important as it guides the solution architecture: a $ value would require a regression model, a category a classification model.

15
Q

When you start to explore the data in a machine learning project, what characteristics of each attribute should you look at?

A
  1. Name.
  2. Data Type (categorical, int/float, bounded/unbounded, structured/unstructured).
  3. % null values.
  4. Noisiness/type of noise (stochastic, outliers, rounding errors).
  5. Usefulness for the task.
  6. Distribution (Gaussian/Uniform/Logarithmic).
16
Q

Before performing any Exploratory Data Analysis for a machine learning task, what should you do?

A

Split the data into train/test/dry-run sets (84%/15%/1%). Splitting the data and hiding the test set prevents data snooping. The dry-run set is used for testing code pipelines.

17
Q

When creating a train/test split, why is using a hash of an identifier preferable over a random split?

A

When using a random split, each time the script is run the data gets randomly shuffled again, and eventually all of the data will be seen by the data scientist and the ML algorithm, potentially leading to data snooping.

Using a hash of a unique ID will create a stable dataset split.
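
A minimal sketch of such a split, assuming a pandas DataFrame with a stable numeric identifier column (the column name here is illustrative):

    from zlib import crc32

    import numpy as np
    import pandas as pd

    def is_id_in_test_set(identifier, test_ratio):
        # An instance goes to the test set if its hash lands in the lowest
        # test_ratio fraction of the 32-bit hash space.
        return crc32(np.int64(identifier)) < test_ratio * 2**32

    def split_data_with_id_hash(data, test_ratio, id_column):
        ids = data[id_column]
        in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
        return data.loc[~in_test_set], data.loc[in_test_set]

    df = pd.DataFrame({"row_id": range(10), "value": range(10)})
    train_set, test_set = split_data_with_id_hash(df, 0.2, "row_id")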

18
Q

When exploring the data through visualisations, how should you treat geographical data such as longitude and latitude?

A
  • Visualise it using a scatterplot or a mapping library such as plotly.
  • Overlay other attributes from the dataset, using colour or marker size/density, and inspect for any patterns that appear (see the sketch after this list).
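
A minimal plotting sketch using toy stand-in data (in practice the DataFrame and column names would come from your real dataset):

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    housing = pd.DataFrame({
        "longitude": rng.uniform(-124, -114, 500),
        "latitude": rng.uniform(32, 42, 500),
        "population": rng.integers(100, 5000, 500),
        "median_house_value": rng.uniform(50_000, 500_000, 500),
    })

    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                 s=housing["population"] / 100,       # marker size shows density
                 c="median_house_value", cmap="jet",  # colour overlays an attribute
                 colorbar=True)
    plt.show()
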
19
Q

What is useful to compute between pairs of continuous attributes?

A

The correlation matrix or the scatter matrix.
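
A quick sketch on toy data (attribute names are illustrative):

    import numpy as np
    import pandas as pd
    from pandas.plotting import scatter_matrix

    rng = np.random.default_rng(42)
    df = pd.DataFrame({"attr_a": rng.normal(size=200)})
    df["attr_b"] = df["attr_a"] * 2 + rng.normal(size=200)
    df["target"] = df["attr_a"] + rng.normal(size=200)

    # Correlation of every attribute with the target, strongest first.
    corr_matrix = df.corr()
    print(corr_matrix["target"].sort_values(ascending=False))

    # Scatter matrix: plots every attribute against every other one.
    scatter_matrix(df[["attr_a", "attr_b", "target"]], figsize=(8, 6))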

20
Q

Why should we be cautious when looking at the correlation between continuous features?

A
  • It only measures linear correlations and would completely miss non-linear relationships.
  • A perfect correlation score would be given to two copies of the same variable expressed in different units, even though the second copy adds no information.
21
Q

How can experimenting with attribute combinations be helpful when exploring the data? Can you give an example?

A

Some attributes in a dataset may provide more information value when combined. For example, when including tenure and promotions in an attrition risk model, dividing the number of promotions by tenure gives the promotions per year of tenure, which can be more informative than the raw number of promotions.
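
A one-line pandas sketch of this example (column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({"promotions": [2, 0, 5], "tenure": [4.0, 2.0, 10.0]})
    # Promotions per year of tenure, often more informative than the raw count.
    df["promotions_per_year"] = df["promotions"] / df["tenure"]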

22
Q

When cleaning the data in preparation for machine learning, what strategies can be used for dealing with missing data?

A
  • Remove the rows with missing data, if the number of rows affected is low.
  • Remove the attribute with missing data, if the number of rows affected is high.
  • Impute the missing values using an appropriate statistic (mean/median) or zero.
  • Impute the missing values using another machine learning model, with the missing values as the target. Examples are KNNImputer, which uses the k-nearest neighbors algorithm, and IterativeImputer, which trains a regression model per feature (see the sketch after this list).
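
A minimal scikit-learn sketch on a toy array:

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    X_num = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

    # Statistic-based imputation (here the median of each column).
    imputer = SimpleImputer(strategy="median")
    X_filled = imputer.fit_transform(X_num)

    # Model-based imputation using the k-nearest neighbours algorithm.
    knn_imputer = KNNImputer(n_neighbors=2)
    X_filled_knn = knn_imputer.fit_transform(X_num)
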
23
Q

When dealing with missing values, what is it important to consider?

A

It is important to consider whether the missing values are random or systematic.

For example, in an unbalanced classification task there may be 1% of records missing a value for an attribute, but those records may all belong to the target class. Dropping those rows would therefore not be suitable.

24
Q

How can you convert categorical variables into numbers for machine learning?

A
  • Ordinal encoding: This replaces each category with a number from 0 to k-1 for k categories. This works best for categories with a logical order, e.g. a rating scale of “bad”, “average”, “good”, and “excellent”.
  • One hot encoding: This creates a new binary attribute for each category, e.g. an attribute equal to 1 when the instance has that category and zero otherwise (see the sketch after this list).
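
A minimal scikit-learn sketch, reusing the rating-scale example above:

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

    df = pd.DataFrame({"rating": ["bad", "average", "good", "excellent"]})

    # Ordinal encoding: one number per category; categories= fixes the order.
    ordinal_encoder = OrdinalEncoder(
        categories=[["bad", "average", "good", "excellent"]])
    X_ord = ordinal_encoder.fit_transform(df[["rating"]])

    # One-hot encoding: one binary column per category (sparse matrix by default).
    onehot_encoder = OneHotEncoder()
    X_1hot = onehot_encoder.fit_transform(df[["rating"]])
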
25
Q

When dealing with categorical attributes which contain a large number of categories, why is using one hot encoding not optimal? What are alternative methods?

A

One-hot encoding creates a new feature per category (k features for k categories, or k-1 if one is dropped); if k is large, this results in a large number of new features, leading to slow training and degraded performance.

Alternative methods include:
- Converting the categories to a useful numerical feature related to the category.
- Creating a learnable, low-dimensional vector representation called an embedding.

26
Q

When preparing data for machine learning, why do we scale numerical features?

A

Machine learning algorithms don’t perform well when the input numerical attributes have very different scales; most models are biased towards features with larger scales. Feature scaling transforms all the features to the same scale.

27
Q

What are two common feature scaling methods?

A
  • MinMax scaling (normalisation): Values are shifted and rescaled so that they end up ranging from 0 to 1 by subtracting the min value and dividing by max - min.
  • Standardization: The mean of the feature is subtracted from each value and the result is divided by the standard deviation (see the sketch after this list).
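
A minimal scikit-learn sketch on a toy array:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X_num = np.array([[1.0], [5.0], [10.0]])

    X_minmax = MinMaxScaler().fit_transform(X_num)  # rescaled to the 0-1 range
    X_std = StandardScaler().fit_transform(X_num)   # zero mean, unit variance
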
28
Q

What is a drawback of using MinMax scaling?

A

MinMax scaling is affected by outliers. For example, if a feature has values from 0 to 15 and a value of 100 is entered by mistake, MinMax would scale the 100 down to 1 and crush all of the other values down to 0 - 0.15.

29
Q

What does it mean for a feature to have a heavy tail?

A

When values far from the mean are not exponentially rare.

30
Q

What will happen if you scale a distribution with a heavy tail? Are there any steps you should take before scaling a heavy tailed distribution?

A

MinMax scaling and standardization will squash most values in the distribution into a very small range, which is not optimal for ML models.

Before scaling the feature, you should transform it to shrink the heavy tail and ideally make the distribution symmetrical.

31
Q

What are common transformations to shrink a heavy tailed distribution?

A
  • Square root transformation (positive features, tail to the right).
  • Logarithmic transformation (long heavy tails, power law distribution).
  • Bucketizing the feature and replacing the feature value with the index of the bucket it belongs to i.e. replace each value with its percentile.
32
Q

What transformation can be applied to multimodal distributions?

A

Bucketize the feature and use the bucket IDs as a category and one hot encode.

This approach allows regression models to more easily learn different rules for different ranges of the feature value.

33
Q

What does RBF refer to?

A

Radial Basis Function: a similarity measure that depends on the distance between an input value and a fixed point.

The most common is the Gaussian RBF, whose output decays exponentially as the input value moves away from the fixed point.
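
A minimal sketch using scikit-learn's Gaussian RBF; the fixed point (35) and the gamma value are illustrative:

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    ages = np.array([[18.0], [35.0], [60.0]])
    # Similarity of each age to the fixed point 35; gamma controls how fast
    # the similarity decays as the input moves away from the fixed point.
    age_simil_35 = rbf_kernel(ages, [[35.0]], gamma=0.1)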

34
Q

How can an RBF be used in feature engineering?

A

If you have multimodal data, you can create an RBF similarity feature with the fixed point set at one of the modes. If proximity to that fixed point is correlated with the target value, the new feature will help the model.

35
Q

Should transformation and scaling be applied to the target variable?

A

Yes, when suitable, e.g. a heavy-tailed target variable can be log-transformed. However, the regression will then predict the log of the target variable, and its output will have to be inverse-transformed back to the business-related value.
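
scikit-learn can automate the inverse transform; a minimal sketch on a toy heavy-tailed target:

    import numpy as np
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.linear_model import LinearRegression

    X = np.arange(1, 6).reshape(-1, 1).astype(float)
    y = np.exp(X.ravel())  # toy heavy-tailed target

    # Fits on log(y) and maps predictions back with exp automatically.
    model = TransformedTargetRegressor(LinearRegression(),
                                       func=np.log, inverse_func=np.exp)
    model.fit(X, y)
    predictions = model.predict(X)  # already back on the original scale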

36
Q

What does sklearn.utils.validation contain?

A

The sklearn.utils.validation module contains functions for validating data inputs, such as check_array (validates that an input is a well-formed 2D numeric array) and check_is_fitted (raises an error if an estimator has not yet been fitted).
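
A minimal usage sketch:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.utils.validation import check_array, check_is_fitted

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    X_valid = check_array(X)  # ensures a 2D numeric array with finite values

    estimator = LinearRegression().fit(X_valid, [1.0, 2.0])
    check_is_fitted(estimator)  # raises NotFittedError if fit was never called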

37
Q

When using a model performance measure such as RMSE, how should you put it into business context?

A

Think about the error term in the context of the range of the value being predicted, and consider what it would mean in practice if your model made an error of this magnitude.

For example, if house prices range from £120k - £265k and a model has an RMSE of £65k, this would not be acceptable.

38
Q

When selecting and training a model, what is a short checklist to follow?

A
  • Select a model, starting with the simplest (e.g. linear regression).
  • Train on the dataset and evaluate using k-fold cross-validation (see the sketch after this list).
  • Check the evaluation score, inspecting the mean/std yielded by the cross-validation.
  • Judge the score and repeat the previous steps with different varieties of models.
  • Shortlist two to five of the best-performing models for further fine-tuning.
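
A minimal cross-validation sketch on toy data:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100)

    scores = cross_val_score(LinearRegression(), X, y,
                             scoring="neg_root_mean_squared_error", cv=10)
    rmse_scores = -scores  # sklearn returns negated errors (higher = better)
    print(rmse_scores.mean(), rmse_scores.std())
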
39
Q

What are two different types of hyperparameters that can be explored in hyperparameter tuning?

A

Model hyperparameters: These are external configuration variables of the model and can be thought of as “settings” on the model. Examples of model hyperparameters are the number of branches in a decision tree or the regularisation term in a regression.

Feature engineering as a hyperparameter: Preprocessing choices can be treated as hyperparameters; for example, discretising vs not discretising a continuous variable, or two different imputation methods, can be compared.

40
Q

What are two methods of automated hyperparameter tuning?

A

Grid search: This method takes a dictionary of hyperparameters and a range of values for each, evaluates the model's performance using every combination of these values, and returns the optimal set of hyperparameter values.

Random search: This method selects a random value for each hyperparameter at each iteration, chosen from a list of possible values or from a probability distribution.
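
A minimal grid-search sketch on toy data (the parameter grid is illustrative):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 8))
    y = X[:, 0] + rng.normal(size=100)

    param_grid = {"n_estimators": [20, 50], "max_features": [2, 4, 8]}
    grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                               param_grid, cv=3,
                               scoring="neg_root_mean_squared_error")
    grid_search.fit(X, y)
    print(grid_search.best_params_)  # best of the 6 candidate combinations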

41
Q

When would one use a random search over a grid search?

A

When the hyperparameter search space is very large.

42
Q

How does RandomizedSearchCV work?

A

For a given number of iterations, for example 100, RandomizedSearchCV will evaluate 100 random combinations of values for the specified hyperparameters.

43
Q

What are the benefits of Random Search vs Grid Search?

A
  • Random search time is determined by the number of iterations specified by the data scientist, whereas grid search time is determined by the number of combinations of hyperparameter values.
44
Q

What are Ensemble Methods?

A

Ensemble methods are a fine-tuning technique in which the best-performing models are combined to produce better performance than the best individual model. This works best when the models in the ensemble make different types of errors.

45
Q

When analysing the best models, what checks should be performed?

A

Feature importance:
- Check the feature importance scores.
- Drop less useful features.
- Perform sanity checks on the features to see if they make sense and are not inadvertently causing target leakage.

Prediction Errors:
- Check where the system is making errors.
- Try to understand why it makes them.
- Try to fix the problem (adding extra features, cleaning up outliers etc).

Systematic performance concerns:
- Check to see if there are any subsets of categories in the dataset the model systematically performs badly on.
- If it does, the model is not ready for deployment until this is resolved, or predictions for that category should not be used.

46
Q

After fine tuning the model to a point where it performs sufficiently well on the validation set, what is the next step?

A

Evaluate the model on the completely unseen test set.

47
Q

When evaluating the performance on the test set, what more information would you want in addition to a point estimate of the generalisation error?

A

An idea of the precision of the point estimate of the generalisation error, which can be obtained by computing a 95% confidence interval using scipy.stats.t.interval.
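
A minimal sketch with toy stand-ins for the test-set targets and predictions:

    import numpy as np
    from scipy import stats

    y_test = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
    final_predictions = np.array([2.8, 5.3, 6.5, 9.4, 10.7])

    # 95% confidence interval for the generalisation RMSE.
    confidence = 0.95
    squared_errors = (final_predictions - y_test) ** 2
    rmse_interval = np.sqrt(stats.t.interval(confidence,
                                             len(squared_errors) - 1,
                                             loc=squared_errors.mean(),
                                             scale=stats.sem(squared_errors)))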