The machine learning process Flashcards

1
Q

What is feature engineering?

A

The process of crafting features from existing columns using domain knowledge and intuition.

2
Q

Give an example of where feature engineering might improve a machine learning model

A

E.g. prediction of life satisfaction using GDP. Intuitively, there are diminishing returns: the relationship is more likely to be logarithmic, i.e. a doubling of wealth leads to a fixed increase in happiness. So use log(GDP).
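The diminishing-returns intuition can be checked with a quick sketch (the GDP figures below are illustrative, not real data):

```python
import math

# Hypothetical GDP-per-capita values: each entry doubles the previous one.
gdp = [10_000, 20_000, 40_000, 80_000]

# After the log transform, each doubling of GDP adds the same fixed
# amount to the engineered feature, matching the intuition above.
log_gdp = [math.log(x) for x in gdp]
diffs = [b - a for a, b in zip(log_gdp, log_gdp[1:])]
```

Every entry of `diffs` equals log(2), so a linear model on log(GDP) encodes "a doubling of wealth gives a fixed increase".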

3
Q

Define polynomial regression.

A

The use of polynomial features, e.g. x^2, x^3, etc. This draws on Taylor's theorem: any analytic function has a polynomial series expansion.
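A minimal sketch of expanding a single feature into polynomial features by hand (no library assumed):

```python
# Expand each value x into the polynomial features [x, x^2, ..., x^degree],
# which can then be fed to an ordinary linear regression.
def polynomial_features(xs, degree):
    return [[x ** d for d in range(1, degree + 1)] for x in xs]

rows = polynomial_features([1.0, 2.0, 3.0], degree=3)
```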

4
Q

What is linear truncation and how is it implemented?

A

Introduce a kink point where the trend in the data seems to change, then fit a linear regression with two slope weights: one for the data before the kink and one for the data after.
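One common way to implement this is a hinge feature max(x − kink, 0), so a single linear fit gets one slope before the kink and another after. A sketch (the kink location 5.0 is an assumed, hand-picked value):

```python
KINK = 5.0  # assumed kink location, chosen by eye from the data

def truncation_features(x):
    # Base feature plus a hinge that is zero before the kink.
    return [x, max(x - KINK, 0.0)]

# With base slope w1 and extra slope w2, the fitted trend has slope w1
# before the kink and slope w1 + w2 after it.
def predict(x, w1, w2, bias=0.0):
    f1, f2 = truncation_features(x)
    return bias + w1 * f1 + w2 * f2
```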

5
Q

How can you deal with multiple categorical variables?

A

One-hot encoding: a binary indicator variable for each category, i.e. IsDog, IsCat and IsMouse.
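A minimal one-hot encoding sketch for a single categorical column (in practice a library routine would be used):

```python
# Map each value to a binary indicator vector over the known categories.
def one_hot(values, categories):
    return [[1 if v == c else 0 for c in categories] for v in values]

animals = ["dog", "cat", "mouse", "dog"]
encoded = one_hot(animals, categories=["dog", "cat", "mouse"])
```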

6
Q

Explain feature selection

A

The process of selecting a good subset of features to use in a model.

7
Q

How might you manually select features?

A

If features are standardized, OLS coefficients might indicate how important each feature is.
We could look for features that are correlated with the outcome.

8
Q

Explain best subset selection.

A

Find the subset of features such that the generalisation error is minimised.

9
Q

What are the steps of the forward subset selection algorithm?

A

Start with a constant model. For each additional feature, select only the one which results in the greatest risk reduction, and store that model. This yields a sequence of models; choose the one with the lowest generalisation error.
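The steps above can be sketched as follows. Here `fit_and_risk` is a hypothetical stand-in for fitting a model on a feature subset and returning its empirical risk (lower is better); the toy risk at the bottom is purely illustrative:

```python
def forward_selection(features, fit_and_risk):
    selected, models = [], []
    remaining = list(features)
    while remaining:
        # Greedily add whichever remaining feature reduces risk the most.
        best = min(remaining, key=lambda f: fit_and_risk(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        models.append(list(selected))
    # In practice, pick the stored model with the lowest validation error.
    return models

# Toy risk: pretend each feature independently removes a fixed amount of risk.
gains = {"a": 3.0, "b": 1.0, "c": 2.0}
risk = lambda subset: 10.0 - sum(gains[f] for f in subset)
models = forward_selection(["a", "b", "c"], risk)
```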

10
Q

In machine learning it is common to split your data into three subsets. What is each of these subsets called, and what is it used for?

A

Training: this set is used to train the model by minimising the risk given the labels.

Validation: this set is unseen in the training stage, but is used to tune hyperparameters (e.g. the number of features used).

Test: this set is used to measure model metrics such as accuracy and loss on unseen data.
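A three-way split can be sketched by shuffling indices and carving them up (the 60/20/20 ratios below are illustrative, not prescribed):

```python
import random

def three_way_split(n, seed=0):
    # Shuffle indices deterministically, then slice into train/val/test.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

train, val, test = three_way_split(100)
```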

11
Q

What is a hyperparameter?

A

A hyperparameter is a model parameter that is set before training rather than learned from the data. For example: learning rate, tree depth, etc. It should be tuned for optimal generalisation performance.

12
Q

Explain the idea of k-fold cross-validation

A

Useful for hyperparameter tuning when the data set is small. Divide the data into k partitions (folds) of similar size. For each fold, train the model on all data not in that fold, then evaluate the model on that fold. Report statistics for the metrics over the folds.
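Generating the fold indices can be sketched as below; each fold is held out exactly once:

```python
def k_folds(n, k):
    """Yield (train_idx, val_idx) pairs, holding out each fold in turn."""
    idx = list(range(n))
    fold_size = n // k
    for i in range(k):
        # The last fold absorbs any remainder when n is not divisible by k.
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        val = idx[start:stop]
        train = idx[:start] + idx[stop:]
        yield train, val

folds = list(k_folds(10, 5))
```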

13
Q

Why is it important to consider the sampling for splitting the data?

A

Training data should be representative of the population. Sampling might be necessary to avoid over/under representation among classes.

14
Q

What is stratified sampling?

A

Data is grouped into classes, and points are sampled from each class and recombined into training and test sets.
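The idea can be sketched as follows: split each class separately, then recombine, so class proportions are preserved in both sets (a minimal illustration, not a library implementation):

```python
def stratified_split(labels, test_fraction):
    # Group indices by class.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    # Take the same fraction of each class for the test set.
    train, test = [], []
    for cls, idx in by_class.items():
        n_test = int(round(test_fraction * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test

labels = ["a"] * 8 + ["b"] * 2   # imbalanced classes
train, test = stratified_split(labels, test_fraction=0.5)
```

Both splits keep the 4:1 class ratio, which a naive random split over ten points can easily break.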

15
Q

How can this be implemented on multiple columns?

A

Make a new column containing the combinations of the features, then stratify on that.

16
Q

How can it be implemented on a continuous variable?

A

The data can be grouped by quantiles and stratified.
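Quantile binning can be sketched by ranking the values and cutting the ranks into equal-sized groups (two bins here; the number of bins is a free choice):

```python
def quantile_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 by rank (equal-sized bins)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / n_bins
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n_bins - 1)
    return bins

# Median split: the two smallest values go to bin 0, the rest to bin 1.
bins = quantile_bins([5.0, 1.0, 9.0, 3.0], n_bins=2)
```

The resulting bin labels can then be used as strata exactly as with a categorical column.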

17
Q

What is the R-squared and for what type of model can it be used as an evaluation metric?

A

1 - (residual sum of squares over total sum of squares). This can be used on regression models.
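The definition translates directly into code:

```python
def r_squared(y_true, y_pred):
    # R^2 = 1 - RSS/TSS, where TSS is taken about the mean of y.
    mean_y = sum(y_true) / len(y_true)
    rss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    tss = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - rss / tss

# Perfect predictions give R^2 = 1; always predicting the mean gives 0.
score = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```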

18
Q

What are the disadvantages to using average log loss and accuracy score to evaluate classification models?

A

These errors do not illuminate how the model is misclassifying, which could be important information depending on the business problem.

19
Q

What does the confusion matrix show?

A

The counts of true negatives, false negatives, false positives and true positives.
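Tallying the four cells for binary labels (1 = positive) can be sketched as:

```python
def confusion_matrix(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
    }

cm = confusion_matrix([1, 0, 1, 0, 1], [1, 0, 0, 1, 1])
```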

20
Q

Which parts of the confusion matrix might be more important in spam classification? What about cancer screening?

A

In spam classification, we care more about minimising FP than FN, because missing a non-spam email is worse than seeing a spam in your inbox.

Cancer screening: we minimise FN over FP, because being sent for further tests without actually having cancer is much preferable to missing a sign of cancer.

21
Q

What are the true positive/true negative rates?

A

TPR = TP / (All positives) = TP / (TP + FN)
TNR = TN / (All negatives) = TN / (TN + FP)

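The two rates follow directly from confusion-matrix counts (the counts below are made up for illustration):

```python
def rates(tp, tn, fp, fn):
    tpr = tp / (tp + fn)   # fraction of actual positives correctly caught
    tnr = tn / (tn + fp)   # fraction of actual negatives correctly caught
    return tpr, tnr

tpr, tnr = rates(tp=8, tn=90, fp=10, fn=2)
```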
22
Q

For classification prediction functions, what is the cutoff hyperparameter?

A

For binary classification, a prediction is a probability of membership of each class. The cut-off is the probability above which a point is assigned to the positive class.

23
Q

What is an ROC curve?

A

This is a curve in TPR-TNR space, parameterised by the cut off, for a given classification model.
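Tracing the curve by sweeping the cut-off can be sketched as below; points are (TNR, TPR) pairs following the TPR-TNR convention of this card, and the labels, scores and cut-offs are illustrative:

```python
def roc_points(y_true, scores, cutoffs):
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for c in cutoffs:
        # Classify as positive when the score reaches the cut-off.
        preds = [1 if s >= c else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
        tn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
        points.append((tn / neg, tp / pos))
    return points

pts = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8],
                 cutoffs=[0.0, 0.5, 1.1])
```

A cut-off of 0 labels everything positive (TNR 0, TPR 1); a cut-off above every score labels everything negative (TNR 1, TPR 0); the curve interpolates between these extremes.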

24
Q

How does the ROC curve change when the scores/probabilities are rescaled?

A

The curve remains unchanged: its parameterisation is different but it occupies the same region of TPR-TNR space.

25
Q

What does the area under the ROC curve represent?

A

This measures the extent to which, with the ideal choice of cut-off, you can correctly separate classes.

26
Q

Is a model with an ROC AUC of 0.5 better or worse than one with an AUC of 0.1?

A

An AUC of 1 means that there is a cut-off that perfectly separates the classes, while 0.5 is the AUC of a random classifier. So 0.1 is actually better: flipping the predicted classes would give an AUC of 0.9.

27
Q

What is a calibration curve?

A

It shows how well the model's predicted probabilities match the observed frequencies of the classes.

28
Q

What does a well-calibrated classification prediction function look like?

A

A line with gradient 1: the predicted probabilities match the observed frequencies.

29
Q

Explain the aim behind regularisation. Which trade-off does it hope to optimise?

A

Regularisation aims to reduce overfitting by simplifying the model and reducing the hypothesis space, optimising the trade-off between bias and variance.

30
Q

What is a complexity measure?

A

A complexity measure is a positive-valued functional which quantifies how complex a prediction function is.

31
Q

What is Ivanov regularisation?

A

The minimisation of the risk functional subject to the constraint that the complexity measure is below some threshold.

32
Q

What is Tykhonov regularisation?

A

Minimisation of a modified risk functional containing a penalty term which is related to the complexity measure.

33
Q

What is the ridge regression objective function?

A

The usual regression risk functional, plus an L2-norm penalty on the weights.
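A sketch of the objective, using mean squared error as the risk term; the penalty weight `lam` is the hyperparameter, and the bias is typically excluded from the penalty:

```python
def ridge_objective(y_true, y_pred, weights, lam):
    n = len(y_true)
    # Empirical risk: mean squared error on the training set.
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n
    # L2 penalty: sum of squared weights, scaled by lam.
    penalty = lam * sum(w ** 2 for w in weights)
    return mse + penalty

# Perfect fit, so the objective is purely the penalty: 0.1 * (9 + 16) = 2.5.
obj = ridge_objective([1.0, 2.0], [1.0, 2.0], weights=[3.0, 4.0], lam=0.1)
```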

34
Q

What is the reason for this objective function?

A

By minimising this risk functional, the trade off between training set accuracy and model complexity is balanced, since a smaller L2-norm means the model is less sensitive to inputs.

35
Q

What must we do to the features before performing ridge regression?

A

We must standardize them, so that the penalty treats all coefficients on a comparable scale.

36
Q

What is the difference between ridge and lasso regression?

A

Lasso is the same, but uses an L1-norm penalty.
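The contrast is just the penalty term:

```python
def l1_penalty(weights, lam):
    # Lasso: sum of absolute weights.
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # Ridge: sum of squared weights.
    return lam * sum(w ** 2 for w in weights)

w = [0.0, -3.0, 4.0]
```

The L1 penalty grows linearly near zero, which is what pushes small coefficients exactly to zero (sparsity); the L2 penalty flattens out near zero and merely shrinks them.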

37
Q

What are three pros to using lasso regression over ridge regression?

A

Lasso gives sparse solutions: good for model interpretability and maintenance. Sparse models are also quicker to compute.

38
Q

What is one pro to using ridge regression over lasso regression?

A

Ridge gives less sparse solutions, meaning the model is less dependent on a few features.