Chapter 2: End-to-End Machine Learning Project Flashcards

1
Q

In the machine learning project checklist, what is the first step? (Hint: Look at…)

A

…the big picture.

2
Q

In the machine learning project checklist, what is the second step? (Hint: Get…)

A

…the data.

3
Q

In the machine learning project checklist, what is the third step? (Hint: Explore and visualize…)

A

…the data to gain insights.

4
Q

In the machine learning project checklist, what is the fourth step? (Hint: Prepare the…)

A

…data for machine learning algorithms.

5
Q

In the machine learning project checklist, what is the fifth step? (Hint: Select a…)

A

…model and train it.

6
Q

In the machine learning project checklist, what is the sixth step? (Hint: Fine…)

A

… tune your model.

7
Q

In the machine learning project checklist, what is the seventh step? (Hint: Present…)

A

… your solution.

8
Q

In the machine learning project checklist, what is the eighth step? (Hint: Launch…)

A

…, monitor, and maintain your system.

9
Q

What considerations should you take into account when deciding between batch and online learning?

A
  • Is there a continuous flow of data coming into the system?
    No = Batch / Yes = Online.
  • Will the model need to adjust to rapidly changing data?
    No = Batch / Yes = Online.
  • Is the data small enough to fit into memory?
    Yes = Batch / No = Online (out-of-core learning).
10
Q

What is a suitable performance metric for a linear regression?

A

Root Mean Squared Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight given to larger errors because the errors are squared before being averaged.
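
A quick reference sketch (not part of the card): both RMSE and the MAE discussed in the next cards can be computed with NumPy.

    import numpy as np

    def rmse(y_true, y_pred):
        # Square the errors, average them, then take the square root:
        # larger errors are weighted more heavily.
        return np.sqrt(np.mean((y_pred - y_true) ** 2))

    def mae(y_true, y_pred):
        # Average of the absolute errors: every error is weighted equally.
        return np.mean(np.abs(y_pred - y_true))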

11
Q

What are the advantages and disadvantages of mean absolute error (MAE)?

A

Advantages:
- Robust to outliers.
- Same units as the output variable.

Disadvantages:
- The MAE graph is not differentiable at zero, which complicates using it as a loss function with gradient-based optimizers (subgradient methods are needed).

12
Q

What are the advantages and disadvantages of root mean squared error (RMSE)?

A

Advantages:
- Same units as the output variable.

Disadvantages:
- Not robust to outliers, as squaring biases it towards larger error values.

13
Q

When is RMSE preferred to MAE?

A

When errors follow a Gaussian distribution and outliers are exponentially rare.

14
Q

What is an example of checking the assumptions on a machine learning project?

A

An assumption on a machine learning project could be that the output of the model is going to be used as a numerical value, e.g. when predicting price it will be a $ value rather than a category such as high/medium/low price.

Checking this assumption is very important as it guides the solution architecture: a $ value would require a regression model, a category a classification model.

15
Q

When you start to explore the data in a machine learning project, what characteristics of each attribute should you look at?

A
  1. Name.
  2. Data Type (categorical, int/float, bounded/unbounded, structured/unstructured).
  3. % null values.
  4. Noisiness/type of noise (stochastic, outliers, rounding errors).
  5. Usefulness for the task.
  6. Distribution (Gaussian/Uniform/Logarithmic).
16
Q

Before performing any Exploratory Data Analysis for a machine learning task, what should you do?

A

Split the data into train/test/dry-run sets (84%/15%/1%). Splitting the data and hiding the test set prevents data snooping. The dry-run set is used for testing code pipelines.

17
Q

When creating a train/test split, why is using a hash of an identifier preferable over a random split?

A

When using a random split, each time the script is run the data gets randomly shuffled again, and eventually all of the data will be seen by the data scientist and the ML algorithm, potentially leading to data snooping.

Using a hash of a unique ID will create a stable dataset split.
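
A minimal sketch of such a split, assuming a pandas DataFrame with a stable numeric identifier column (the column name here is illustrative):

    from zlib import crc32

    import numpy as np
    import pandas as pd

    def is_id_in_test_set(identifier, test_ratio):
        # An instance goes to the test set if its hash lands in the lowest
        # test_ratio fraction of the 32-bit hash space.
        return crc32(np.int64(identifier)) < test_ratio * 2**32

    def split_data_with_id_hash(data, test_ratio, id_column):
        ids = data[id_column]
        in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
        return data.loc[~in_test_set], data.loc[in_test_set]

    df = pd.DataFrame({"row_id": range(10), "value": range(10)})
    train_set, test_set = split_data_with_id_hash(df, 0.2, "row_id")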

18
Q

When exploring the data through visualisations, how should you treat geographical data such as longitude and latitude?

A
  • Visualise it using a scatterplot or a mapping library such as plotly.
  • Overlay other attributes from the dataset, using colour or marker size/density, and inspect for any patterns that appear (see the sketch after this list).
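
A minimal plotting sketch using toy stand-in data (in practice the DataFrame and column names would come from your real dataset):

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    housing = pd.DataFrame({
        "longitude": rng.uniform(-124, -114, 500),
        "latitude": rng.uniform(32, 42, 500),
        "population": rng.integers(100, 5000, 500),
        "median_house_value": rng.uniform(50_000, 500_000, 500),
    })

    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                 s=housing["population"] / 100,       # marker size shows density
                 c="median_house_value", cmap="jet",  # colour overlays an attribute
                 colorbar=True)
    plt.show()
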
19
Q

What is useful to compute between pairs of continuous attributes?

A

The correlation matrix or the scatter matrix.
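
A quick sketch on toy data (attribute names are illustrative):

    import numpy as np
    import pandas as pd
    from pandas.plotting import scatter_matrix

    rng = np.random.default_rng(42)
    df = pd.DataFrame({"attr_a": rng.normal(size=200)})
    df["attr_b"] = df["attr_a"] * 2 + rng.normal(size=200)
    df["target"] = df["attr_a"] + rng.normal(size=200)

    # Correlation of every attribute with the target, strongest first.
    corr_matrix = df.corr()
    print(corr_matrix["target"].sort_values(ascending=False))

    # Scatter matrix: plots every attribute against every other one.
    scatter_matrix(df[["attr_a", "attr_b", "target"]], figsize=(8, 6))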

20
Q

Why should we be cautious when looking at the correlation between continuous features?

A
  • It only measures linear correlations and would completely miss non-linear relationships.
  • A perfect correlation score would be given to two copies of the same variable expressed in different units, even though the second copy adds no information.
21
Q

How can experimenting with attribute combinations be helpful when exploring the data? Can you give an example?

A

Some attributes in a dataset may provide more information value when combined. For example, when including tenure and promotions in an attrition risk model, dividing the number of promotions by tenure gives the promotions per year of tenure, which can be more informative than the raw number of promotions.
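
A one-line pandas sketch of this example (column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({"promotions": [2, 0, 5], "tenure": [4.0, 2.0, 10.0]})
    # Promotions per year of tenure, often more informative than the raw count.
    df["promotions_per_year"] = df["promotions"] / df["tenure"]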

22
Q

When cleaning the data in preparation for machine learning, what strategies can be used for dealing with missing data?

A
  • Remove the rows with missing data, if the number of rows affected is low.
  • Remove the attribute with missing data, if the number of rows affected is high.
  • Impute the missing values using an appropriate statistic (mean/median) or zero.
  • Impute the missing values using another machine learning model, with the missing values as the target. Examples are KNNImputer, which uses the k-nearest neighbors algorithm, and IterativeImputer, which trains a regression model per feature (see the sketch after this list).
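
A minimal scikit-learn sketch on a toy array:

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    X_num = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

    # Statistic-based imputation (here the median of each column).
    imputer = SimpleImputer(strategy="median")
    X_filled = imputer.fit_transform(X_num)

    # Model-based imputation using the k-nearest neighbours algorithm.
    knn_imputer = KNNImputer(n_neighbors=2)
    X_filled_knn = knn_imputer.fit_transform(X_num)
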
23
Q

When dealing with missing values, what is it important to consider?

A

It is important to consider whether the missing values are random or systematic.

For example, in an unbalanced classification task there may be 1% of records missing a value for an attribute, but those records may all belong to the target class. Dropping those rows would therefore not be suitable.

24
Q

How can you convert categorical variables into numbers for machine learning?

A
  • Ordinal encoding: This replaces each category with a number from 0 to k-1 for k categories. This works best for categories with a logical order, e.g. a rating scale of “bad”, “average”, “good”, and “excellent”.
  • One hot encoding: This creates a new binary attribute for each category, e.g. an attribute equal to 1 when the instance has that category and zero otherwise (see the sketch after this list).
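
A minimal scikit-learn sketch, reusing the rating-scale example above:

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

    df = pd.DataFrame({"rating": ["bad", "average", "good", "excellent"]})

    # Ordinal encoding: one number per category; categories= fixes the order.
    ordinal_encoder = OrdinalEncoder(
        categories=[["bad", "average", "good", "excellent"]])
    X_ord = ordinal_encoder.fit_transform(df[["rating"]])

    # One-hot encoding: one binary column per category (sparse matrix by default).
    onehot_encoder = OneHotEncoder()
    X_1hot = onehot_encoder.fit_transform(df[["rating"]])
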
25
Q

When dealing with categorical attributes which contain a large number of categories, why is using one hot encoding not optimal? What are alternative methods?

A

One-hot encoding creates a new feature per category (k features for k categories, or k-1 if one is dropped); if k is large, this results in a large number of new features, leading to slow training and degraded performance.

Alternative methods include:
- Converting the categories to a useful numerical feature related to the category.
- Creating a learnable, low-dimensional vector representation called an embedding.

26
Q

When preparing data for machine learning, why do we scale numerical features?

A

Machine learning algorithms don’t perform well when the input numerical attributes have very different scales; most models are biased towards features with larger scales. Feature scaling transforms all the features to the same scale.

27
Q

What are two common feature scaling methods?

A
  • MinMax scaling (normalisation): Values are shifted and rescaled so that they end up ranging from 0 to 1 by subtracting the min value and dividing by max - min.
  • Standardization: The mean of the feature is subtracted from each value and the result is divided by the standard deviation (see the sketch after this list).
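
A minimal scikit-learn sketch on a toy array:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X_num = np.array([[1.0], [5.0], [10.0]])

    X_minmax = MinMaxScaler().fit_transform(X_num)  # rescaled to the 0-1 range
    X_std = StandardScaler().fit_transform(X_num)   # zero mean, unit variance
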
28
Q

What is a drawback of using MinMax scaling?

A

MinMax scaling is affected by outliers. For example, if a feature has values from 0 to 15 and a value of 100 is entered by mistake, MinMax would scale the 100 down to 1 and crush all of the other values down to 0 - 0.15.

29
Q

What does it mean for a feature to have a heavy tail?

A

When values far from the mean are not exponentially rare.

30
Q

What will happen if you scale a distribution with a heavy tail? Are there any steps you should take before scaling a heavy tailed distribution?

A

MinMax scaling and standardization will squash most values in the distribution into a very small range, which is not optimal for ML models.

Before scaling the feature, you should transform it to shrink the heavy tail and ideally make the distribution symmetrical.

31
Q

What are common transformations to shrink a heavy tailed distribution?

A
  • Square root transformation (positive features, tail to the right).
  • Logarithmic transformation (long heavy tails, power law distribution).
  • Bucketizing the feature and replacing the feature value with the index of the bucket it belongs to i.e. replace each value with its percentile.
32
Q

What transformation can be applied to multimodal distributions?

A

Bucketize the feature and use the bucket IDs as a category and one hot encode.

This approach allows regression models to more easily learn different rules for different ranges of the feature value.

33
Q

What does RBF refer to?

A

Radial Basis Function: a similarity measure that depends on the distance between an input value and a fixed point.

The most common is the Gaussian RBF, whose output decays exponentially as the input value moves away from the fixed point.
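
A minimal sketch using scikit-learn's Gaussian RBF; the fixed point (35) and the gamma value are illustrative:

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    ages = np.array([[18.0], [35.0], [60.0]])
    # Similarity of each age to the fixed point 35; gamma controls how fast
    # the similarity decays as the input moves away from the fixed point.
    age_simil_35 = rbf_kernel(ages, [[35.0]], gamma=0.1)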

34
Q

How can an RBF be used in feature engineering?

A

If you have multimodal data, you can create an RBF similarity feature with the fixed point set at one of the modes. If proximity to that fixed point is correlated with the target value, the new feature will help the model.

35
Q

Should transformation and scaling be applied to the target variable?

A

Yes, when suitable, e.g. a heavy-tailed target variable can be log-transformed. However, the regression will then predict the log of the target variable, and its output will have to be inverse-transformed back to the business-related value.
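
scikit-learn can automate the inverse transform; a minimal sketch on a toy heavy-tailed target:

    import numpy as np
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.linear_model import LinearRegression

    X = np.arange(1, 6).reshape(-1, 1).astype(float)
    y = np.exp(X.ravel())  # toy heavy-tailed target

    # Fits on log(y) and maps predictions back with exp automatically.
    model = TransformedTargetRegressor(LinearRegression(),
                                       func=np.log, inverse_func=np.exp)
    model.fit(X, y)
    predictions = model.predict(X)  # already back on the original scale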

36
Q

What does sklearn.utils.validation contain?

A

The sklearn.utils.validation module contains functions for validating data inputs, such as check_array (validates that an input is a well-formed 2D numeric array) and check_is_fitted (raises an error if an estimator has not yet been fitted).
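
A minimal usage sketch:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.utils.validation import check_array, check_is_fitted

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    X_valid = check_array(X)  # ensures a 2D numeric array with finite values

    estimator = LinearRegression().fit(X_valid, [1.0, 2.0])
    check_is_fitted(estimator)  # raises NotFittedError if fit was never called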

37
Q

When using a model performance measure such as RMSE, how should you put it into business context?

A

Think about the error term in the context of the range of the value being predicted, and consider what it would mean in practice if your model made an error of this magnitude.

For example, if house prices range from £120k - £265k and a model has an RMSE of £65k, this would not be acceptable.

38
Q

When selecting and training a model, what is a short checklist to follow?

A
  • Select a model, starting with the simplest (e.g. linear regression).
  • Train on the dataset and evaluate using k-fold cross-validation (see the sketch after this list).
  • Check the evaluation score, inspecting the mean/std yielded by the cross-validation.
  • Judge the score and repeat the previous steps with different varieties of models.
  • Shortlist two to five of the best-performing models for further fine-tuning.
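
A minimal cross-validation sketch on toy data:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100)

    scores = cross_val_score(LinearRegression(), X, y,
                             scoring="neg_root_mean_squared_error", cv=10)
    rmse_scores = -scores  # sklearn returns negated errors (higher = better)
    print(rmse_scores.mean(), rmse_scores.std())
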
39
Q

What are two different types of hyperparameters that can be explored in hyperparameter tuning?

A

Model hyperparameters: These are external configuration variables of the model and can be thought of as “settings” on the model. Examples of model hyperparameters are the number of branches in a decision tree or the regularisation term in a regression.

Feature engineering as a hyperparameter: Preprocessing choices can be treated as hyperparameters; for example, discretising vs not discretising a continuous variable, or two different imputation methods, can be compared.

40
Q

What are two methods of automated hyperparameter tuning?

A

Grid search: This method takes a dictionary of hyperparameters and a range of values for each, evaluates the model's performance using every combination of these values, and returns the optimal set of hyperparameter values.

Random search: This method selects a random value for each hyperparameter at each iteration, chosen from a list of possible values or from a probability distribution.
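
A minimal grid-search sketch on toy data (the parameter grid is illustrative):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 8))
    y = X[:, 0] + rng.normal(size=100)

    param_grid = {"n_estimators": [20, 50], "max_features": [2, 4, 8]}
    grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                               param_grid, cv=3,
                               scoring="neg_root_mean_squared_error")
    grid_search.fit(X, y)
    print(grid_search.best_params_)  # best of the 6 candidate combinations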

41
Q

When would one use a random search over a grid search?

A

When the hyperparameter search space is very large.

42
Q

How does RandomizedSearchCV work?

A

For a given number of iterations, for example 100, RandomizedSearchCV will evaluate 100 random combinations of values for the specified hyperparameters.

43
Q

What are the benefits of Random Search vs Grid Search?

A
  • Random search time is determined by the number of iterations specified by the data scientist, whereas grid search time is determined by the number of combinations of hyperparameter values.
44
Q

What are Ensemble Methods?

A

Ensemble methods are a fine-tuning technique in which the best-performing models are combined to produce better performance than the best individual model. This works best when the models in the ensemble make different types of errors.

45
Q

When analysing the best models, what checks should be performed?

A

Feature importance:
- Check the feature importance scores.
- Drop less useful features.
- Perform sanity checks on the features to see if they make sense and are not inadvertently causing target leakage.

Prediction Errors:
- Check where the system is making errors.
- Try to understand why it makes them.
- Try to fix the problem (adding extra features, cleaning up outliers etc).

Systematic performance concerns:
- Check to see if there are any subsets of categories in the dataset the model systematically performs badly on.
- If it does, the model is not ready for deployment until this is resolved, or predictions for that category should not be used.

46
Q

After fine tuning the model to a point where it performs sufficiently well on the validation set, what is the next step?

A

Evaluate the model on the completely unseen test set.

47
Q

When evaluating the performance on the test set, what more information would you want in addition to a point estimate of the generalisation error?

A

An idea of the precision of the point estimate of the generalisation error, which can be obtained by computing a 95% confidence interval using scipy.stats.t.interval.
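
A minimal sketch with toy stand-ins for the test-set targets and predictions:

    import numpy as np
    from scipy import stats

    y_test = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
    final_predictions = np.array([2.8, 5.3, 6.5, 9.4, 10.7])

    # 95% confidence interval for the generalisation RMSE.
    confidence = 0.95
    squared_errors = (final_predictions - y_test) ** 2
    rmse_interval = np.sqrt(stats.t.interval(confidence,
                                             len(squared_errors) - 1,
                                             loc=squared_errors.mean(),
                                             scale=stats.sem(squared_errors)))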