Model Selection & Evaluation Flashcards
Write the mathematical description of supervised learning.
(x_i, y_i) ~ p(x, y) i.i.d., where x_i is a feature vector and y_i is a label.
We want to find f such that f(x_i) ≈ y_i.
What is the goal of unsupervised learning?
Finding underlying structure/pattern in the data that can be used for tasks such as clustering, outlier/anomaly detection and dimensionality reduction.
List pre-processing steps that can be taken with numerical data.
Centering, scaling, ranking, non-linear transformations.
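A minimal sketch of these transforms with scikit-learn (the data here is a made-up toy matrix):

    import numpy as np
    from sklearn.preprocessing import PowerTransformer, QuantileTransformer, StandardScaler

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])

    # Centering + scaling: subtract the mean, divide by the standard deviation.
    X_scaled = StandardScaler().fit_transform(X)

    # Ranking: replace each value by its (normalised) rank, robust to outliers.
    X_ranked = QuantileTransformer(n_quantiles=3).fit_transform(X)

    # Non-linear transformation, e.g. a Yeo-Johnson power transform.
    X_power = PowerTransformer(method="yeo-johnson").fit_transform(X)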
List 5 types of encoding that can be used for categorical features.
Ordinal, one-hot, target, frequency, hierarchical.
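A sketch of the first two with scikit-learn (target encoding exists as TargetEncoder in recent scikit-learn versions; frequency and hierarchical encoding usually need third-party or hand-rolled code):

    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    colors = [["red"], ["green"], ["blue"], ["green"]]

    # Ordinal: each category becomes an integer (beware: this implies an order).
    ordinal = OrdinalEncoder().fit_transform(colors)

    # One-hot: one binary column per category, no implied order.
    # sparse_output needs sklearn >= 1.2; older versions use sparse=False.
    onehot = OneHotEncoder(sparse_output=False).fit_transform(colors)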
Mention 2 possible methods for dealing with missing data.
Replacing missing values with the feature mean (mean imputation); imputing from the most similar samples (e.g. k-NN imputation).
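A sketch with scikit-learn's imputers, assuming the neighbour-based method is what the second option refers to:

    import numpy as np
    from sklearn.impute import KNNImputer, SimpleImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

    # Mean imputation: replace each NaN with its column mean.
    X_mean = SimpleImputer(strategy="mean").fit_transform(X)

    # k-NN imputation: replace each NaN with the mean of the k nearest rows.
    X_knn = KNNImputer(n_neighbors=2).fit_transform(X)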
Define overfitting.
Overfitting occurs when a model learns the training data too well, including its noise and outliers, resulting in poor performance on new, unseen data. Essentially, the model memorizes the training data instead of learning generalizable patterns.
Write the expected test error and decompose into bias and variance terms.
err(x_0) = E[(y - f̂(x_0))^2 | x = x_0] = sigma^2 + bias(f̂(x_0))^2 + var(f̂(x_0)), where f̂ is the fitted model and sigma^2 is the irreducible noise variance.
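A short derivation sketch in LaTeX, assuming y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and writing f̂ for the model fitted on a random training set (the cross terms vanish because ε is independent of f̂):

    \begin{aligned}
    \operatorname{err}(x_0)
      &= \mathbb{E}\big[(y - \hat{f}(x_0))^2 \mid x = x_0\big] \\
      &= \mathbb{E}\big[(f(x_0) + \varepsilon - \hat{f}(x_0))^2\big] \\
      &= \underbrace{\sigma^2}_{\text{noise}}
       + \underbrace{\big(f(x_0) - \mathbb{E}[\hat{f}(x_0)]\big)^2}_{\operatorname{bias}^2(\hat{f}(x_0))}
       + \underbrace{\mathbb{E}\big[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2\big]}_{\operatorname{var}(\hat{f}(x_0))}
    \end{aligned}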
Explain how the bias/variance tradeoff relates to model complexity and training samples.
The more complex a model is, the lower its bias, but its variance increases as it starts to fit the noise, reducing its generalisation capacity.
For a fixed model, more training samples reduce variance: training error rises towards test error and the gap between them shrinks, while bias stays roughly constant. With few training samples, the model can fit the training data (including its noise) almost perfectly, so training error is low but the gap between train and test error is large.
Define model selection and model assessment.
Model selection: estimating the performance of different models in order to choose the best one (e.g. hyperparameter optimization)
Model assessment (model evaluation): having chosen a final model (e.g. optimal hyper-parameter), estimating its generalization (test error) on new data.
They should be done on different partitions of the data.
Define nested cross validation.
Nested cross-validation consists of two cross-validation loops: an outer loop that splits the dataset into training and test folds and an inner loop that further splits the training data for model selection and hyperparameter tuning. The model is trained and validated within the inner loop, and its final performance is evaluated on the held-out test folds from the outer loop.
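A minimal sketch with scikit-learn, assuming an SVM whose C is the hyperparameter being tuned; the dataset and grid are placeholders:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    inner = KFold(n_splits=3, shuffle=True, random_state=0)  # model selection
    outer = KFold(n_splits=5, shuffle=True, random_state=0)  # model assessment

    # Inner loop: GridSearchCV picks the best C on each outer training split.
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner)

    # Outer loop: estimates the generalization error of the whole procedure.
    scores = cross_val_score(search, X, y, cv=outer)
    print(scores.mean())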
Explain the difference between k-fold and stratified k-fold
KFold splits samples by index (in order, or shuffled), ignoring the labels; StratifiedKFold ensures that the relative class frequencies in each fold reflect the class frequencies of the whole dataset.
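A quick illustration on a toy imbalanced label vector; with plain KFold a fold can end up containing a single class, while StratifiedKFold keeps the 2:1 ratio in every fold:

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    X = np.zeros((12, 1))
    y = np.array([0] * 8 + [1] * 4)  # imbalanced labels, sorted by class

    for name, cv in [("KFold", KFold(3)), ("StratifiedKFold", StratifiedKFold(3))]:
        for _, test in cv.split(X, y):
            print(name, np.bincount(y[test], minlength=2))  # class counts per test fold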
Why would you use GroupKFold?
Use it when samples are grouped, ensuring all samples from a group stay in the same fold. This prevents data leakage and ensures evaluation on unseen groups.
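A sketch, assuming samples grouped by, say, patient id (the groups array is hypothetical):

    import numpy as np
    from sklearn.model_selection import GroupKFold

    X = np.zeros((6, 1))
    y = np.array([0, 1, 0, 1, 0, 1])
    groups = np.array([1, 1, 2, 2, 3, 3])  # e.g. two samples per patient

    for train, test in GroupKFold(n_splits=3).split(X, y, groups):
        print(sorted(set(groups[test])))  # each group lands in exactly one test fold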
What is accuracy? And balanced accuracy? Write formulas for both.
Accuracy: fraction of examples that are correctly classified.
(TP + TN) / N
Balanced accuracy: the average of the per-class accuracies (the mean of recall over the classes).
(1/2) * ( TP / (TP + FN) + TN / (TN + FP) )
Define and formulate Precision
Precision: “How many of those predicted positive are actually positive?”
TP / (TP + FP)
Define and formulate Recall
Recall: “How many of those which are actually positive are correctly predicted as positive?”
TP / (TP + FN)
Define and formulate Specificity
Specificity: “How many of those which are actually negative are correctly predicted as negative?”
TN / (TN + FP)
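All five metrics on a small hand-made example, computed with scikit-learn and checked against the confusion-matrix formulas above (specificity has no dedicated scorer in sklearn, so it is computed by hand):

    from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                                 confusion_matrix, precision_score, recall_score)

    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    print(accuracy_score(y_true, y_pred))           # (TP + TN) / N
    print(balanced_accuracy_score(y_true, y_pred))  # mean of per-class recalls
    print(precision_score(y_true, y_pred))          # TP / (TP + FP)
    print(recall_score(y_true, y_pred))             # TP / (TP + FN)
    print(tn / (tn + fp))                           # specificity: TN / (TN + FP)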
What is a soft classifier?
A soft classifier outputs a continuous score or probability rather than a hard label; a hard prediction is then obtained by thresholding, classifying all scores above the threshold as 1 and all those below as 0.
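A sketch with logistic regression, whose predict_proba gives the soft scores; 0.5 is the default cut-off, but the threshold can be moved to trade precision against recall:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(random_state=0)
    clf = LogisticRegression().fit(X, y)

    scores = clf.predict_proba(X)[:, 1]  # soft output: estimated P(y = 1 | x)
    hard = (scores >= 0.3).astype(int)   # hard labels from a custom threshold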
Write the formula for MSE
MSE = (1/n) * sum over i of (f(x_i) - y_i)^2
Write the formula for the coefficient of determination
R^2 = 1 - ( sum over i of (y_i - f(x_i))^2 ) / ( sum over i of (y_i - ȳ)^2 ), where ȳ is the mean of the y_i.
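Both formulas computed directly in NumPy on made-up numbers (f_x stands for the predictions f(x_i)); sklearn's mean_squared_error and r2_score give the same values:

    import numpy as np

    y = np.array([3.0, 1.0, 4.0, 1.5])
    f_x = np.array([2.5, 0.5, 4.0, 2.0])  # hypothetical model predictions

    mse = np.mean((f_x - y) ** 2)
    r2 = 1 - np.sum((y - f_x) ** 2) / np.sum((y - np.mean(y)) ** 2)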