Model Selection & Evaluation Flashcards

1
Q

Write the mathematical description of supervised learning.

A

(x_i, y_i) ~ p(x, y) i.i.d., where x_i is a feature vector and y_i is a label.
We want to find f such that f(x_i) ≈ y_i.

2
Q

What is the goal of unsupervised learning?

A

Finding underlying structure/pattern in the data that can be used for tasks such as clustering, outlier/anomaly detection and dimensionality reduction.

3
Q

List pre-processing steps that can be taken with numerical data.

A

Centering, scaling, ranking, non-linear transformations.
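
As a minimal sketch, each of these steps has a scikit-learn transformer (the toy array and the particular transformer choices are illustrative, not part of the card):

import numpy as np
from sklearn.preprocessing import StandardScaler, QuantileTransformer, PowerTransformer

X = np.array([[1.0], [2.0], [4.0], [8.0], [16.0]])  # toy numeric feature

# Centering + scaling: zero mean, unit variance
X_std = StandardScaler().fit_transform(X)

# Ranking: map each value to its quantile (rank-based transform)
X_rank = QuantileTransformer(n_quantiles=5).fit_transform(X)

# Non-linear transformation: e.g. Box-Cox to make the feature more Gaussian
X_boxcox = PowerTransformer(method="box-cox").fit_transform(X)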

4
Q

List 5 types of encoding that can be used for categorical features.

A

Ordinal, one-hot, target, frequency, hierarchical.
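
A sketch of the first four, assuming a recent scikit-learn (TargetEncoder needs ≥ 1.3) and invented data; hierarchical encoding has no standard transformer, so it is left out:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, TargetEncoder

df = pd.DataFrame({"city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Paris"]})
y = [1, 0, 1, 0, 1, 0]

# Ordinal: each category becomes an integer (implies an order!)
X_ord = OrdinalEncoder().fit_transform(df)

# One-hot: one binary column per category
X_oh = OneHotEncoder(sparse_output=False).fit_transform(df)

# Target: each category becomes a (smoothed, cross-fitted) mean of y within it
X_tgt = TargetEncoder().fit_transform(df, y)

# Frequency: each category becomes its relative frequency in the data
X_freq = df["city"].map(df["city"].value_counts(normalize=True))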

5
Q

Mention 2 possible methods for dealing with missing data.

A

Replacing missing values with the feature mean (mean imputation); k-nearest-neighbour (kNN) imputation, which fills values in from the most similar samples.
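
A minimal sketch of both with scikit-learn's imputers (the toy matrix is invented):

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan], [5.0, 6.0]])

# Mean imputation: fill each missing value with its column mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# kNN imputation: fill each missing value from the k most similar rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)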

6
Q

Define overfitting.

A

Overfitting occurs when a model learns the training data too well, including its noise and outliers, resulting in poor performance on new, unseen data. Essentially, the model memorizes the training data instead of learning generalizable patterns.
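
A quick way to see it, as a sketch on synthetic data: an unconstrained decision tree can memorize the training set yet score worse on held-out data.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
print(tree.score(X_tr, y_tr))  # 1.0: the training set is memorized
print(tree.score(X_te, y_te))  # typically clearly lower on unseen data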

7
Q

Write the expected test error and decompose into bias and variance terms.

A

err(x_0) = E[(y − f̂(x_0))²] = σ² + Bias(f̂(x_0))² + Var(f̂(x_0))

where f̂ is the fitted model and σ² is the irreducible noise variance.
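
The decomposition can be checked numerically by refitting on many resampled training sets and inspecting the predictions at x_0. A sketch under arbitrary choices (sine truth, deliberately biased linear model, Gaussian noise):

import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3          # noise std, so the irreducible error is sigma**2
f_true = np.sin      # true regression function
x0 = 1.0             # test point

preds = []
for _ in range(2000):                       # many independent training sets
    x = rng.uniform(0, 3, 30)
    y = f_true(x) + rng.normal(0, sigma, 30)
    coef = np.polyfit(x, y, deg=1)          # fit a straight line
    preds.append(np.polyval(coef, x0))

preds = np.asarray(preds)
bias2 = (preds.mean() - f_true(x0)) ** 2    # squared bias at x0
var = preds.var()                           # variance of the fit at x0
print(sigma**2 + bias2 + var)               # ≈ expected test error at x0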

8
Q

Explain how the bias/variance tradeoff relates to model complexity and training samples.

A

The more complex a model is, the lower its bias but the higher its variance: a flexible model can fit the noise in the training data, reducing its generalisation capacity.

For a fixed model, more training samples reduce the variance, so the gap between training and test error shrinks (training error rises toward test error, which can look like increasing bias). With few training samples the model can fit the training set almost perfectly (low training error), but the train/test gap, i.e. the variance, is large.
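
A sketch of the sample-size effect with scikit-learn's learning_curve (model and sizes are arbitrary choices): as the training set grows, the train/test gap shrinks.

from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5, random_state=0), X, y,
    train_sizes=[0.1, 0.3, 0.6, 1.0], cv=5)

# Training score typically drifts down toward the test score (the gap is
# the variance part), while the test score improves with more data.
for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(n, round(tr, 3), round(te, 3))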

9
Q

Define model selection and model assessment.

A

Model selection: estimating the performance of different models in order to choose the best one (e.g. hyperparameter optimization)

Model assessment (model evaluation): having chosen a final model (e.g. optimal hyper-parameter), estimating its generalization (test error) on new data.

They should be done on different partitions of the data.

10
Q

Define nested cross validation.

A

Nested cross-validation consists of two cross-validation loops: an outer loop that splits the dataset into training and test folds and an inner loop that further splits the training data for model selection and hyperparameter tuning. The model is trained and validated within the inner loop, and its final performance is evaluated on the held-out test folds from the outer loop.
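
The standard scikit-learn idiom, as a sketch (the estimator and parameter grid are placeholders): a GridSearchCV inner loop wrapped in a cross_val_score outer loop.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: model selection (hyperparameter tuning)
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)

# Outer loop: model assessment on folds the inner search never sees
scores = cross_val_score(inner, X, y, cv=5)
print(scores.mean())  # estimate of the tuned model's generalization performance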

11
Q

Explain the difference between k-fold and stratified k-fold

A

KFold splits the samples by index into contiguous folds, while StratifiedKFold ensures that the relative class frequencies in each fold reflect those of the whole dataset.
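
A small sketch on invented, imbalanced labels; note how plain KFold can produce folds dominated by one class:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0] * 9 + [1] * 3)   # imbalanced, minority class at the end
X = np.zeros((12, 1))

for name, cv in [("KFold", KFold(3)), ("StratifiedKFold", StratifiedKFold(3))]:
    for _, test_idx in cv.split(X, y):
        print(name, "test labels:", y[test_idx])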

12
Q

Why would you use GroupKFold?

A

Use it when samples are grouped, ensuring all samples from a group stay in the same fold. This prevents data leakage and ensures evaluation on unseen groups.
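
A minimal sketch with invented groups (think repeated measurements per patient):

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.zeros((8, 1))
y = np.zeros(8)
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])   # e.g. patient IDs

for train_idx, test_idx in GroupKFold(n_splits=2).split(X, y, groups):
    print("test groups:", set(groups[test_idx]))  # no group is ever split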

13
Q

What is accuracy? And balanced accuracy? Write formulas for both.

A

Accuracy: fraction of examples that are correctly classified.
(TP + TN) / N

Balanced accuracy: average of the per-class accuracies (the mean recall over classes).
(1/2) * ( TP / (TP + FN) + TN / (TN + FP) )
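
Both formulas, checked against scikit-learn on invented, imbalanced labels:

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])   # mostly negatives
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print((tp + tn) / len(y_true), accuracy_score(y_true, y_pred))   # 0.875
print(0.5 * (tp / (tp + fn) + tn / (tn + fp)),
      balanced_accuracy_score(y_true, y_pred))                   # 0.75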

14
Q

Define and formulate Precision

A

Precision: “How many of those predicted positive are actually positive?”
TP / (TP + FP)

15
Q

Define and formulate Recall

A

“How many of those which are actually positive are correctly predicted as positive?”
TP / (TP + FN)

16
Q

Define and formulate Specificity

A

“How many of those which are actually negative are correctly predicted as negative?”
TN / (TN + FP)
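
The metrics from cards 14–16 all fall out of the confusion matrix; a sketch on invented labels:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:  ", tp / (tp + fp))   # of predicted positives, how many are real
print("recall:     ", tp / (tp + fn))   # of real positives, how many were found
print("specificity:", tn / (tn + fp))   # of real negatives, how many were found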

17
Q

What is a soft classifier?

A

A soft classifier outputs a continuous score (e.g. a class probability) rather than a hard label. A hard prediction is then obtained by thresholding: scores above the threshold are classified as 1, those below as 0.
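
A sketch with logistic regression (dataset and threshold are arbitrary): the soft output is a probability, and moving the threshold changes the hard labels.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
clf = LogisticRegression().fit(X, y)

scores = clf.predict_proba(X)[:, 1]       # soft output: P(y = 1 | x)
y_hard = (scores >= 0.3).astype(int)      # hard labels at threshold 0.3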

18
Q

Write the formula for MSE

A

MSE = (1/n) * Σ_i (f(x_i) − y_i)^2

19
Q

Write the formula for the coefficient of determination

A

R^2 = 1 − Σ_i (y_i − ŷ_i)^2 / Σ_i (y_i − ȳ)^2, where ŷ_i = f(x_i) is the prediction and ȳ is the mean of the y_i.
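
Both regression metrics from the last two cards, computed by hand and checked against scikit-learn (the numbers are invented):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y_pred - y_true) ** 2)
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

print(mse, mean_squared_error(y_true, y_pred))  # 0.375
print(r2, r2_score(y_true, y_pred))             # ≈ 0.949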