Week 3: Performance Evaluation Flashcards
What is the error function for a classification problem?
The misclassification (0-1) error, i.e., the indicator function of y.hat ≠ y: value 1 if the prediction is misclassified, 0 if the predicted class equals the observed outcome.
What is the difference between the error function and the loss function?
The loss function is used when learning/training the model while the error function is used to evaluate the performance of an already learned model.
What is leave-one-out cross-validation?
A special case of the k-fold cross-validation, where the number of batches to validate on (k) equals the number of observations (n). k = n.
What is cross-validation?
A way of estimating E_new by repeatedly splitting the available data into training and hold-out validation parts, training the model on one part, evaluating it on the other, and averaging the resulting hold-out errors.
What is the error function for a regression problem?
The squared error;
(y.hat - y)^2.
Why can we not approximate E_new with E_train?
Because E_train is computed on the very data the model was trained on: the model has adapted to exactly those points, so E_train is systematically optimistic (generally lower than E_new) and says little about the performance on new, unseen data.
Why are we interested in estimating E_new? (4)
1) Judge if performance is satisfactory (E_new small), 2) Help choose between different methods and models, 3) Choose hyperparameters (e.g., k in kNN, regularisation parameter in ridge reg. or no of hidden layers in DL) in order to minimize E_new, 4) Serve as a good measure of the expected performance when presented to a customer.
Which value of E_new and E_train will, in general, be lower?
E_train.
What is VERY important when splitting the complete dataset into training, validation (and possibly also test) parts?
To split the data randomly: e.g., shuffle the data points before selecting some percentage as training and the rest as validation, in order to avoid potential (sorted) patterns or trends in the dataset.
This is most illustrative in the case of classification: if the data are sorted by class, the training part could end up containing (almost) only one class while the hold-out validation part contains the other.
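A minimal NumPy sketch of such a shuffled split; the toy data, the variable names and the 80/20 ratio are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy data, deliberately sorted by class to mimic an ordered dataset.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Shuffle the indices so the class ordering is destroyed before splitting.
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))              # e.g. 80 % training, 20 % hold-out validation

X_train, y_train = X[idx[:n_train]], y[idx[:n_train]]
X_val, y_val = X[idx[n_train:]], y[idx[n_train:]]
```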
What is the training error (E_train) for a classification problem with k = 1? Why?
Zero.
When k = 1, the prediction for a point is the label of its single closest training sample. When the model is evaluated on the training data itself, every point is its own closest neighbour, so it is always classified correctly and never makes a mistake. For this reason, the training error is zero when k = 1, irrespective of the dataset.
Note, there is one logical assumption here: the training set must not contain identical inputs belonging to different classes, i.e., conflicting information.
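A small scikit-learn sketch illustrating this; the random toy data and the choice of KNeighborsClassifier are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 2))            # toy inputs (no duplicate points)
y = rng.integers(0, 2, size=100)         # even random labels are "memorised" by 1-NN

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
E_train = np.mean(knn.predict(X) != y)   # fraction of misclassified training points
print(E_train)                           # 0.0, since every point is its own nearest neighbour
```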
What is the overall goal in supervised machine learning?
To minimize E_new (expected new data error).
What value of k is usually used for k-fold cross-validation?
k = 5 or 10.
Which two ways of cross-validation can we use?
1) k-fold cross-validation, 2) leave-one-out cross-validation.
What does the training error, E_train, tell us?
How well the method is able to “reproduce” the data from which it was learned (not very informative in ML).
Does the chosen loss function and chosen error function need to be the same?
No!
In which situations does it hold that E_hold-out is an unbiased estimate of E_new?
When we assume all data points (training+valid) to be drawn from p(x,y).
If the entire procedure (hold-out) is repeated multiple times, each time with new data, the average E_hold-out will be E_new.
What is the trade-off between a large validation data chunk and a large training data chunk?
If we increase the hold-out validation data, the variance of E_hold-out will decrease -> we expect it to be closer to E_new for a single experiment.
However, increasing the validation data, we also have less data to train the model on, possibly and commonly resulting in a larger E_new than we want.
Use k-fold C-V instead!
Describe the procedure of k-fold cross-validation.
1) Split all available data into k batches of similar size (l = 1, ..., k batches in total),
2) take batch l as hold-out validation data and the remaining k − 1 batches as training data,
3) train the model on the training data and compute E^(l)_hold-out as the average error on the hold-out validation data for batch l,
4) if l < k, set l <- l + 1 and repeat 2-3. If l = k, compute the k-fold cross-validation error as the average (1/k * sum) of E^(l)_hold-out over all k batches (l = 1, ..., k),
5) train the model again, using the entire dataset.
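A minimal sketch of this procedure; the toy regression data, the ridge-regression model and the squared-error loss are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge   # stand-in for whatever model is being evaluated

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

k = 5
idx = rng.permutation(len(X))            # shuffle first (data assumed independent, unordered)
batches = np.array_split(idx, k)         # step 1: k batches of similar size

E_hold_out = []
for l in range(k):                       # steps 2-4: each batch is hold-out validation once
    val = batches[l]
    train = np.concatenate([batches[j] for j in range(k) if j != l])
    model = Ridge(alpha=1.0).fit(X[train], y[train])
    E_hold_out.append(np.mean((model.predict(X[val]) - y[val]) ** 2))

E_k_fold = np.mean(E_hold_out)           # step 4: average of the k hold-out errors
final_model = Ridge(alpha=1.0).fit(X, y) # step 5: retrain on the entire dataset
```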
What do we take the expectation with respect to when creating E_new?
Expectation over all possible test data points w/ respect to the (unknown, theoretical) distribution p(x, y).
What is the risk of using E_k-fold (or E_hold-out) to choose the ultimate hyperparameter?
It will invalidate E_k-fold (or E_hold-out) as an estimator of E_new as it tends to overfit to the validation data. The result is that E_k-fold is overly optimistic as an estimator of the new data error.
What is the ultimate task if wanting to use E_k-fold as approximator of E_new?
To set aside another (besides the validation data) hold-out dataset, i.e., the test set. Test set should only be used ONCE (after selecting hyperparameters) to estimate E_new for the final model.
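A sketch of that workflow with scikit-learn, where k-fold CV selects the hyperparameter on the training+validation data and the test set is touched exactly once at the end; the kNN model, the candidate values and the split sizes are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy classification labels

# Set the test set aside first; it is used exactly once, at the very end.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Choose the hyperparameter (number of neighbours) with 5-fold CV on the remaining data.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X_rest, y_rest)

# Single use of the test set: estimate E_new for the selected final model.
E_test = np.mean(search.predict(X_test) != y_test)
```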
What are the three common techniques for creating artificial datasets?
1) To duplicate data and add noise to the duplicates,
2) Simulate data,
3) Use data from a different but related problem.
What is the E.bar_NEW?
The average E_new if we were to train the model multiple times using different datasets (all size n), i.e., E.bar_NEW is average over all possible training datasets.
E.bar_NEW = Exp_TRAIN [E_NEW(TRAIN)]
Why do we use E.bar_NEW and E.bar_TRAIN, rather than E_NEW and E_TRAIN?
Because it is easier to reason about the average behaviour of E.bar_NEW and E.bar_TRAIN than about the errors E_NEW and E_TRAIN obtained when the model is trained on one specific training dataset.
What does it mean (in ML lingo) for a model to generalise from training data?
To be able to perform well on unseen data after being trained.
What is the (expected) generalisation gap?
The difference E.bar_new - E.bar_train.
I.e., the difference between the expected performance on training data and the expected performance “in production” on new, previously unseen data.
Which are the two decompositions of E.bar_NEW we can do?
1) Training Error-Generalisation Gap decomposition and 2) Bias-Variance decomposition.
Why don't we want to increase the model complexity in order to let E.bar_train decrease further?
Due to the risk of overfitting.
What is a rough definition of model complexity?
The number of parameters that are learned from the data.
How can we get a value for E_train in a classification problem using kNN?
By calculating the fraction of misclassified training data points.
Draw the graph of E.bar_new and E.bar_train with model complexity on x, error on y.
Why is the bias-variance composition of E.bar_new irrelevant when comparing different (ML) algorithms, but relevant for choosing (among) tuning parameters for one single method?
Irrelevant when comparing methods because different methods adapt to the data in different ways, so their bias and variance terms are not directly comparable. Relevant when tuning a single method because its tuning parameter typically controls the model complexity, i.e., it moves the model along the bias-variance trade-off.
Which three components can we split E.bar_new into?
1) variance term, 2) (squared) bias term and 3) irreducible error term
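Written out in the notation used elsewhere in these cards (the standard decomposition, with sigma^2 denoting the irreducible error): E.bar_new = (bias)^2 + variance + sigma^2, where sigma^2 comes from the noise in the data itself and cannot be removed by any choice of model.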
Why can't we use k-fold CV on time series data?
Because the data points in a time series are dependent and have a natural (temporal) order; shuffling and splitting them randomly would let the model be trained on observations from the future of the points it is validated on, so the estimate of E_new would be misleading.
What type of cross-validation should we use on time series data?
Rolling cross-validation.
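A minimal sketch of rolling (expanding-window) cross-validation using scikit-learn's TimeSeriesSplit; the toy series and the ridge-regression model are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(seed=0)
t = np.arange(200)
y = np.sin(0.1 * t) + rng.normal(scale=0.1, size=200)  # toy time series
X = t.reshape(-1, 1).astype(float)

# Each split trains on an initial segment and validates on the block that follows,
# so the model never sees "future" observations during training.
errors = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    errors.append(np.mean((model.predict(X[val_idx]) - y[val_idx]) ** 2))

E_rolling = np.mean(errors)
```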
Which loss function do we actually use when training a classifier? Why not the misclassification?
The cross-entropy loss. Instead of formulating the loss only in terms of the hard class prediction y.hat, it also incorporates the predicted class probability g(x).
We do not use misclassification since:
1) using cross-entropy can result in a model that generalizes better from the training data, since the final prediction y.hat does not reveal all aspects of the classifier (intuitively, cross-entropy pushes the decision boundaries further away from the training data points),
2) misclassification would give us a piecewise constant cost function, which is unusable for numerical optimization since the gradient is zero everywhere (except where it is undefined).
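For a binary classifier that outputs the class probability g(x) = P(y = 1 | x), a minimal sketch of the cross-entropy loss; the 0/1 label coding and the toy numbers are illustrative assumptions:

```python
import numpy as np

def cross_entropy(y, g, eps=1e-12):
    """Binary cross-entropy for labels y in {0, 1} and predicted probabilities g = P(y=1|x)."""
    g = np.clip(g, eps, 1 - eps)         # avoid log(0) for overconfident predictions
    return -np.mean(y * np.log(g) + (1 - y) * np.log(1 - g))

y = np.array([1, 0, 1, 1])
g = np.array([0.9, 0.2, 0.6, 0.99])      # predicted class probabilities, not hard predictions
print(cross_entropy(y, g))               # small loss: confident and mostly correct predictions
```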
Name the major downside with k-fold CV.
It is computationally very costly.
Does k-fold CV improve the upward bias situation that we have for E_hold out when estimating E_new?
No, but it reduces variance.
What are the two conflicting goals regarding T_v and T_train?
That we need a large T_train to reduce the (upward) bias of the estimate, but we also need a large T_v to reduce the variance of the estimate, i.e., to make E_hold-out less uncertain.
When do we use the rolling CV?
For time series data which is dependent and has a natural order.
What is a necessary assumption for the data points when using k-fold CV?
That the data points are independent and unordered. Therefore, assign data randomly before any splitting into training and validation data.
What is the TPR (true pos. rate)?
TPR is the probability that an actual positive will be predicted as positive.
What is the TPR also called?
The sensitivity.
What is the TNR (true negative rate)?
The probability that an actual negative will be predicted as negative.
What is the TNR also called?
The specificity.
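A small sketch computing TPR (sensitivity) and TNR (specificity) from a set of predictions; the toy labels and predictions are illustrative:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # actual classes
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])   # predicted classes

TP = np.sum((y_pred == 1) & (y_true == 1))    # true positives
FN = np.sum((y_pred == 0) & (y_true == 1))    # false negatives
TN = np.sum((y_pred == 0) & (y_true == 0))    # true negatives
FP = np.sum((y_pred == 1) & (y_true == 0))    # false positives

TPR = TP / (TP + FN)   # sensitivity: P(predicted positive | actually positive)
TNR = TN / (TN + FP)   # specificity: P(predicted negative | actually negative)
print(TPR, TNR)        # 0.75 0.75 for this toy example
```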
What is the fundamental principle of data usage in ML?
Don’t use the same data for two different purposes.
What does it mean that E_hold out is an upward biased estimate of E_new? Why?
That the estimate E_hold out is expected to be larger than the true value of E_new.
Why: the hold-out estimate is computed for a model trained on only part of the data (T_train), whereas the final model is trained on all available data. With less training data the model performs somewhat worse on average, so E_hold-out tends to overestimate the E_new of the final model.
Definition of the training error?
It is the error obtained when the model is evaluated on the same data it was learned from.
Does the generalization gap decrease or increase with model complexity (think of plot with E.new and E.train difference)? Draw a graph!
Increase.
What is true regarding (sq) bias and variance for a model with low complexity?
It has low variance but high bias.
What is true regarding (sq) bias and variance for a model with high complexity?
It has high variance but low bias.
Which error functions do we use?
The misclassification error for classification problems and the squared error for regression problems.
What is the definition of the error function?
A measure of how much our output prediction for a single data point misses the true value of exactly this data point.
What is the definition of E.bar_new?
The average E_new if we were to train the model on different training data sets.
Illustrate the generalization gap.