Model Selection & Evaluation Flashcards
Write the mathematical description of supervised learning.
(x_i, y_i) ~ p(x, y) i.i.d., where x_i is a feature vector and y_i is a label.
We want to find f such that f(x_i) ≈ y_i.
What is the goal of unsupervised learning?
Finding underlying structure/pattern in the data that can be used for tasks such as clustering, outlier/anomaly detection and dimensionality reduction.
List pre-processing steps that can be taken with numerical data.
Centering, scaling, ranking, non-linear transformations.
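A minimal sketch of these transforms with scikit-learn (the data here is a made-up toy matrix):

    import numpy as np
    from sklearn.preprocessing import PowerTransformer, QuantileTransformer, StandardScaler

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])

    # Centering + scaling: subtract the mean, divide by the standard deviation.
    X_scaled = StandardScaler().fit_transform(X)

    # Ranking: replace each value by its (normalised) rank, robust to outliers.
    X_ranked = QuantileTransformer(n_quantiles=3).fit_transform(X)

    # Non-linear transformation, e.g. a Yeo-Johnson power transform.
    X_power = PowerTransformer(method="yeo-johnson").fit_transform(X)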
List 5 types of encoding that can be used for categorical features.
Ordinal, one-hot, target, frequency, hierarchical.
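A sketch of the first two with scikit-learn (target encoding exists as TargetEncoder in recent scikit-learn versions; frequency and hierarchical encoding usually need third-party or hand-rolled code):

    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    colors = [["red"], ["green"], ["blue"], ["green"]]

    # Ordinal: each category becomes an integer (beware: this implies an order).
    ordinal = OrdinalEncoder().fit_transform(colors)

    # One-hot: one binary column per category, no implied order.
    # sparse_output needs sklearn >= 1.2; older versions use sparse=False.
    onehot = OneHotEncoder(sparse_output=False).fit_transform(colors)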
Mention 2 possible methods for dealing with missing data.
Replacing missing values with the feature mean (mean imputation); imputing from the most similar samples (e.g. k-NN imputation).
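A sketch with scikit-learn's imputers, assuming the neighbour-based method is what the second option refers to:

    import numpy as np
    from sklearn.impute import KNNImputer, SimpleImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

    # Mean imputation: replace each NaN with its column mean.
    X_mean = SimpleImputer(strategy="mean").fit_transform(X)

    # k-NN imputation: replace each NaN with the mean of the k nearest rows.
    X_knn = KNNImputer(n_neighbors=2).fit_transform(X)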
Define overfitting.
Overfitting occurs when a model learns the training data too well, including its noise and outliers, resulting in poor performance on new, unseen data. Essentially, the model memorizes the training data instead of learning generalizable patterns.
Write the expected test error and decompose into bias and variance terms.
err(x_0) = E[(y - f̂(x_0))^2 | x = x_0] = sigma^2 + bias(f̂(x_0))^2 + var(f̂(x_0)), where f̂ is the fitted model and sigma^2 is the irreducible noise variance.
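A short derivation sketch in LaTeX, assuming y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and writing f̂ for the model fitted on a random training set (the cross terms vanish because ε is independent of f̂):

    \begin{aligned}
    \operatorname{err}(x_0)
      &= \mathbb{E}\big[(y - \hat{f}(x_0))^2 \mid x = x_0\big] \\
      &= \mathbb{E}\big[(f(x_0) + \varepsilon - \hat{f}(x_0))^2\big] \\
      &= \underbrace{\sigma^2}_{\text{noise}}
       + \underbrace{\big(f(x_0) - \mathbb{E}[\hat{f}(x_0)]\big)^2}_{\operatorname{bias}^2(\hat{f}(x_0))}
       + \underbrace{\mathbb{E}\big[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2\big]}_{\operatorname{var}(\hat{f}(x_0))}
    \end{aligned}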
Explain how the bias/variance tradeoff relates to model complexity and training samples.
The more complex a model is, the lower its bias, but its variance increases as it starts to fit the noise, reducing its generalisation capacity.
For a fixed model, more training samples reduce variance: training error rises towards test error and the gap between them shrinks, while bias stays roughly constant. With few training samples, the model can fit the training data (including its noise) almost perfectly, so training error is low but the gap between train and test error is large.
Define model selection and model assessment.
Model selection: estimating the performance of different models in order to choose the best one (e.g. hyperparameter optimization)
Model assessment (model evaluation): having chosen a final model (e.g. optimal hyper-parameter), estimating its generalization (test error) on new data.
They should be done on different partitions of the data.
Define nested cross validation.
Nested cross-validation consists of two cross-validation loops: an outer loop that splits the dataset into training and test folds and an inner loop that further splits the training data for model selection and hyperparameter tuning. The model is trained and validated within the inner loop, and its final performance is evaluated on the held-out test folds from the outer loop.
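A minimal sketch with scikit-learn, assuming an SVM whose C is the hyperparameter being tuned; the dataset and grid are placeholders:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    inner = KFold(n_splits=3, shuffle=True, random_state=0)  # model selection
    outer = KFold(n_splits=5, shuffle=True, random_state=0)  # model assessment

    # Inner loop: GridSearchCV picks the best C on each outer training split.
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner)

    # Outer loop: estimates the generalization error of the whole procedure.
    scores = cross_val_score(search, X, y, cv=outer)
    print(scores.mean())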
Explain the difference between k-fold and stratified k-fold
KFold splits samples by index (in order, or shuffled), ignoring the labels; StratifiedKFold ensures that the relative class frequencies in each fold reflect the class frequencies of the whole dataset.
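A quick illustration on a toy imbalanced label vector; with plain KFold a fold can end up containing a single class, while StratifiedKFold keeps the 2:1 ratio in every fold:

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    X = np.zeros((12, 1))
    y = np.array([0] * 8 + [1] * 4)  # imbalanced labels, sorted by class

    for name, cv in [("KFold", KFold(3)), ("StratifiedKFold", StratifiedKFold(3))]:
        for _, test in cv.split(X, y):
            print(name, np.bincount(y[test], minlength=2))  # class counts per test fold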
Why would you use GroupKFold?
Use it when samples are grouped, ensuring all samples from a group stay in the same fold. This prevents data leakage and ensures evaluation on unseen groups.
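A sketch, assuming samples grouped by, say, patient id (the groups array is hypothetical):

    import numpy as np
    from sklearn.model_selection import GroupKFold

    X = np.zeros((6, 1))
    y = np.array([0, 1, 0, 1, 0, 1])
    groups = np.array([1, 1, 2, 2, 3, 3])  # e.g. two samples per patient

    for train, test in GroupKFold(n_splits=3).split(X, y, groups):
        print(sorted(set(groups[test])))  # each group lands in exactly one test fold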
What is accuracy? And balanced accuracy? Write formulas for both.
Accuracy: fraction of examples that are correctly classified.
(TP + TN) / N
Balanced accuracy: the average of the per-class accuracies (the mean of recall over the classes).
(1/2) * ( TP / (TP + FN) + TN / (TN + FP) )
Define and formulate Precision
Precision: “How many of those predicted positive are actually positive?”
TP / (TP + FP)
Define and formulate Recall
Recall: “How many of those which are actually positive are correctly predicted as positive?”
TP / (TP + FN)
Define and formulate Specificity
Specificity: “How many of those which are actually negative are correctly predicted as negative?”
TN / (TN + FP)
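All five metrics on a small hand-made example, computed with scikit-learn and checked against the confusion-matrix formulas above (specificity has no dedicated scorer in sklearn, so it is computed by hand):

    from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                                 confusion_matrix, precision_score, recall_score)

    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    print(accuracy_score(y_true, y_pred))           # (TP + TN) / N
    print(balanced_accuracy_score(y_true, y_pred))  # mean of per-class recalls
    print(precision_score(y_true, y_pred))          # TP / (TP + FP)
    print(recall_score(y_true, y_pred))             # TP / (TP + FN)
    print(tn / (tn + fp))                           # specificity: TN / (TN + FP)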
What is a soft classifier?
A soft classifier outputs a continuous score or probability rather than a hard label; a hard prediction is then obtained by thresholding, classifying all scores above the threshold as 1 and all those below as 0.
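A sketch with logistic regression, whose predict_proba gives the soft scores; 0.5 is the default cut-off, but the threshold can be moved to trade precision against recall:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(random_state=0)
    clf = LogisticRegression().fit(X, y)

    scores = clf.predict_proba(X)[:, 1]  # soft output: estimated P(y = 1 | x)
    hard = (scores >= 0.3).astype(int)   # hard labels from a custom threshold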
Write the formula for MSE
MSE = (1/n) * sum over i of (f(x_i) - y_i)^2
Write the formula for the coefficient of determination
R^2 = 1 - ( sum over i of (y_i - f(x_i))^2 ) / ( sum over i of (y_i - ȳ)^2 ), where ȳ is the mean of the y_i.
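Both formulas computed directly in NumPy on made-up numbers (f_x stands for the predictions f(x_i)); sklearn's mean_squared_error and r2_score give the same values:

    import numpy as np

    y = np.array([3.0, 1.0, 4.0, 1.5])
    f_x = np.array([2.5, 0.5, 4.0, 2.0])  # hypothetical model predictions

    mse = np.mean((f_x - y) ** 2)
    r2 = 1 - np.sum((y - f_x) ** 2) / np.sum((y - np.mean(y)) ** 2)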