The machine learning process Flashcards
What is feature engineering?
The process of crafting features from existing columns using domain knowledge and intuition.
Give an example of where feature engineering might improve a machine learning model
E.g. predicting life satisfaction from GDP per capita. Intuitively, there are diminishing returns: the relationship is more likely logarithmic, i.e. a doubling of GDP leads to a fixed increase in satisfaction. So use log(GDP) as the feature.
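A minimal sketch of the transform (data and column names hypothetical):

    import numpy as np
    import pandas as pd

    # Hypothetical data: GDP per capita and a life-satisfaction score.
    df = pd.DataFrame({
        "gdp_per_capita": [1_000, 2_000, 4_000, 8_000, 16_000],
        "life_satisfaction": [4.0, 4.7, 5.4, 6.1, 6.8],
    })

    # Each doubling of GDP adds the same fixed amount to log(GDP),
    # matching the diminishing-returns intuition.
    df["log_gdp"] = np.log(df["gdp_per_capita"])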
Define polynomial regression.
The use of polynomial features, e.g. x^2, x^3 etc. Draws from Taylor's theorem: any analytic function has a convergent power-series (polynomial) expansion.
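A minimal sketch using scikit-learn, assuming a quadratic trend in toy data:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Toy data with a quadratic trend plus noise.
    rng = np.random.default_rng(0)
    X = np.linspace(-3, 3, 100).reshape(-1, 1)
    y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.2, size=100)

    # Degree-2 polynomial regression: expand x into [x, x^2], then fit OLS.
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(X, y)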
What is linear truncation and how is it implemented?
Introduce a kink point where the trend in the data appears to change, then fit a linear regression with two slope weights: one governing the data before the kink and one governing the data after it.
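One common implementation is a hinge feature; a sketch with an assumed kink at x = 5 (note that in this parameterisation the second coefficient is the change in slope after the kink, rather than a fully separate slope):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical 1-D data whose slope changes at the kink.
    kink = 5.0
    x = np.linspace(0, 10, 50)
    y = np.where(x < kink, 2 * x, 2 * kink + 0.5 * (x - kink))

    # Two slope weights: one on x itself, one on the part of x past the kink.
    X = np.column_stack([x, np.maximum(x - kink, 0)])
    LinearRegression().fit(X, y)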
How can you deal with multiple categorical variables?
One-hot encoding: a binary indicator variable for each category, i.e. IsDog, IsCat and IsMouse.
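A minimal sketch with pandas (column and category names hypothetical):

    import pandas as pd

    df = pd.DataFrame({"animal": ["dog", "cat", "mouse", "dog"]})

    # One binary indicator column per category: Is_cat, Is_dog, Is_mouse.
    pd.get_dummies(df["animal"], prefix="Is")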
Explain feature selection
The process of selecting a good subset of features to use in a model.
How might you manually select features?
If the features are standardized, the magnitudes of the OLS coefficients can indicate how important each feature is.
We could also look for features that are strongly correlated with the outcome.
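A sketch of both checks on toy data (names hypothetical):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
    y = 2 * X["a"] + 0.1 * X["b"] + rng.normal(size=100)

    # Correlation of each feature with the outcome.
    print(X.corrwith(y))

    # OLS coefficients on standardized features are on a comparable scale.
    coefs = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_
    print(dict(zip(X.columns, coefs)))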
Explain best subset selection.
Exhaustively evaluate every subset of features and pick the one that minimises the (estimated) generalisation error. With p features there are 2^p subsets, so this is only feasible for small p.
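A brute-force sketch, using cross-validated error as a stand-in for the generalisation error:

    from itertools import combinations

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=100)

    best_score, best_subset = -np.inf, None
    for k in range(1, X.shape[1] + 1):
        for subset in combinations(range(X.shape[1]), k):
            # Higher neg-MSE is better (it is the negated error).
            score = cross_val_score(LinearRegression(), X[:, list(subset)], y,
                                    scoring="neg_mean_squared_error").mean()
            if score > best_score:
                best_score, best_subset = score, subset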
What are the steps of the forward subset selection algorithm?
Start with a constant model. At each step, add the single remaining feature that yields the greatest risk reduction, and store the resulting model. This produces a sequence of nested models; select the one with the lowest estimated generalisation error.
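A sketch of the greedy loop; for simplicity it uses cross-validated error both to pick each feature and to choose among the stored models:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    y = X[:, 1] + 0.5 * X[:, 3] + rng.normal(size=100)

    selected, remaining, sequence = [], list(range(X.shape[1])), []
    while remaining:
        # Greedily add the feature giving the best score when added.
        scores = {j: cross_val_score(LinearRegression(), X[:, selected + [j]], y,
                                     scoring="neg_mean_squared_error").mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        sequence.append((list(selected), scores[best]))

    # Pick the model in the sequence with the best estimated error.
    best_subset, _ = max(sequence, key=lambda t: t[1])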
In machine learning it is common to split your data into three, what are each of these subsets called and what are they used for?
Training: this set is used to fit the model by minimising the empirical risk on the labelled examples.
Validation: this set is unseen in the training stage, but is used to tune hyperparameters (e.g. number of features used).
Test: this set is used to measure model metrics such as accuracy, loss etc.
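A common way to produce the three sets is two successive splits; a sketch giving a 60/20/20 division (data hypothetical):

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 3)), rng.normal(size=100)

    # First hold out 40%, then halve it into validation and test sets.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, random_state=0)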
What is a hyperparameter?
A hyperparameter is a model parameter that is set before training rather than learned from the data. For example: learning rate, tree depth etc. It should be tuned (e.g. on the validation set) for optimal generalisation performance.
Explain the idea of k-fold cross-validation
Useful for hyperparameter tuning when the data set is small. Divide the data into k partitions (folds) of similar size. For each fold, train the model on all data not in that fold, then evaluate it on that fold. Report statistics (e.g. mean and standard deviation) of the metrics over the k folds.
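A 5-fold sketch with scikit-learn (model and data hypothetical):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=100, n_features=5, noise=1.0,
                           random_state=0)

    # Train on 4 folds, evaluate on the held-out fold, 5 times over.
    scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(scores.mean(), scores.std())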
Why is it important to consider the sampling for splitting the data?
Training data should be representative of the population; careful sampling may be necessary to avoid over- or under-representing some classes.
What is stratified sampling?
Data is grouped into classes (strata); points are sampled from each class and recombined so that the training and test sets preserve the class proportions.
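In scikit-learn this is the stratify argument; a sketch with hypothetical classes:

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = np.array(["dog"] * 60 + ["cat"] * 30 + ["mouse"] * 10)

    # stratify=y keeps the class proportions the same in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)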
How can this be implemented on multiple columns?
Create a new column whose values combine the relevant columns (e.g. by concatenating their categories), then stratify on that combined column.
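A sketch with two hypothetical categorical columns:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.DataFrame({"sex": ["m", "f"] * 20,
                       "region": ["north", "north", "south", "south"] * 10})

    # Combine the columns into one key and stratify on it.
    key = df["sex"] + "_" + df["region"]
    train, test = train_test_split(df, test_size=0.25, stratify=key,
                                   random_state=0)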