Week 3: Performance Evaluation Flashcards

1
Q

What is the error function for a classification problem?

A

The misclassification (0-1) error: the indicator function of y.hat != y, i.e., value 1 if the prediction is wrong (misclassified), 0 if the prediction equals the observed outcome.

2
Q

What is the difference between the error function and the loss function?

A

The loss function is used when learning/training the model while the error function is used to evaluate the performance of an already learned model.

3
Q

What is leave-one-out cross-validation?

A

A special case of k-fold cross-validation in which the number of batches to validate on (k) equals the number of observations (n), i.e., k = n.
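
A minimal sketch (assuming scikit-learn and NumPy; the data and model are purely illustrative) showing that leave-one-out is just k-fold cross-validation with k = n:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical): 20 observations, 2 inputs, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = KNeighborsClassifier(n_neighbors=3)

# Leave-one-out: the number of validation batches equals the number of observations n.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

# Equivalent formulation: k-fold cross-validation with k = n.
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=len(X)))

print(1 - loo_scores.mean(), 1 - kfold_scores.mean())  # identical error estimates
```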

4
Q

What is cross-validation?

A

A technique for estimating E_new: split the available data into training and validation parts, learn the model on the training part, evaluate it on the held-out validation part, and (typically) repeat this over several splits, averaging the resulting hold-out errors.

5
Q

What is the error function for a regression problem?

A

The squared error: (y.hat - y)^2.
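
A minimal NumPy sketch of both error functions used in this deck (misclassification for classification, squared error for regression); the names and data are illustrative:

```python
import numpy as np

def misclassification_error(y_hat, y):
    # 0-1 error per data point (1 if wrong, 0 if correct), averaged over the data.
    return np.mean(y_hat != y)

def squared_error(y_hat, y):
    # (y.hat - y)^2, averaged over the data.
    return np.mean((y_hat - y) ** 2)

print(misclassification_error(np.array([0, 1, 1]), np.array([0, 0, 1])))  # 0.333...
print(squared_error(np.array([1.2, 0.8]), np.array([1.0, 1.0])))          # 0.04
```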

6
Q

Why can we not approximate E_new with E_train?

A

Because E_train is computed on the very same data the model was fitted to, so it is systematically optimistic (in general lower than E_new) and says little about how the model performs on new, unseen data; a flexible model can make E_train small simply by overfitting the training data.

7
Q

Why are we interested in estimating E_new? (4)

A

1) To judge whether the performance is satisfactory (E_new small),

2) to help choose between different methods and models,

3) to choose hyperparameters (e.g., k in kNN, the regularisation parameter in ridge regression or the number of hidden layers in deep learning) in order to minimize E_new,

4) to serve as a good measure of the expected performance when the model is presented to a customer.

8
Q

Which of E_new and E_train will, in general, be lower?

A

E_train.

9
Q

What is VERY important when splitting the complete dataset into training, validation (and possibly also test) parts?

A

To split the data randomly. E.g., shuffle the data points before selecting some percentage as training and the rest as validation, in order to avoid potential (sorted) patterns or trends in the dataset.

This is easiest to see for classification: if the data are sorted by class, the training data could contain only one class while the hold-out validation data contain only the other.
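
A minimal NumPy sketch of such a random hold-out split; the 80/20 proportion and the names are just illustrative:

```python
import numpy as np

def random_split(X, y, train_frac=0.8, seed=0):
    # Shuffle the indices so that any sorting in the original dataset
    # (e.g., by class) does not end up entirely in one of the two parts.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    return (X[idx[:n_train]], y[idx[:n_train]],   # training data
            X[idx[n_train:]], y[idx[n_train:]])   # hold-out validation data
```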

10
Q

What is the training error (E_train) for a classification problem with k = 1? Why?

A

Zero.

When k = 1, the classifier picks the closest training sample to the query point. Since each training sample is itself part of the training dataset, it will choose itself as the closest point and never make a mistake. For this reason, the training error is zero when k = 1, irrespective of the dataset.

Note that there is one logical assumption here: the training set must not contain identical samples belonging to different classes, i.e., conflicting information.
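
A small sketch of this effect (assuming scikit-learn; the random data are purely illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 2))     # continuous inputs, so no duplicated samples
y_train = rng.integers(0, 2, size=50)  # arbitrary binary labels

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
E_train = np.mean(knn.predict(X_train) != y_train)
print(E_train)  # 0.0 -- every training point is its own nearest neighbour
```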

11
Q

What is the overall goal in supervised machine learning?

A

To minimize E_new (expected new data error).

12
Q

What value of k is usually used for k-fold cross-validation?

A

k = 5 or 10.

13
Q

Which two ways of cross-validation can we use?

A

1) k-fold cross-validation, 2) leave-one-out cross-validation.

14
Q

What does the training error, E_train, tell us?

A

How well the method is able to “reproduce” the data from which it was learned (not very informative in ML).

15
Q

Does the chosen loss function and chosen error function need to be the same?

A

No!

16
Q

In which situations does it hold that E_hold-out is an unbiased estimate of E_new?

A

When we assume all data points (training+valid) to be drawn from p(x,y).

If the entire procedure (hold-out) is repeated multiple times, each time with new data, the average E_hold-out will be E_new.

17
Q

What is the trade-off between a large validation data chunk and a large training data chunk?

A

If we increase the hold-out validation data, the variance of E_hold-out will decrease, so we expect it to be closer to E_new for a single experiment.

However, by increasing the validation data we also have less data to train the model on, commonly resulting in a larger E_new than we want.

Use k-fold CV instead!

18
Q

Describe the procedure of k-fold cross-validation.

A

1) Split all available data into k batches of similar size (l = 1, ..., k),

2) take batch l as hold-out validation data and the remaining k-1 batches as training data,

3) train the model on the training data and compute E^(l)_hold-out as the average error on the hold-out validation data for batch l,

4) if l < k, set l <- l + 1 and repeat steps 2-3. If l = k, compute the k-fold cross-validation error as the average (1/k * sum) of E^(l)_hold-out over all k batches (see the code sketch below),

5) train the model again, using the entire dataset.
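
A minimal NumPy sketch of steps 1-4; the model is assumed to be any object with fit/predict methods and the error function here is the misclassification error (both are illustrative assumptions):

```python
import numpy as np

def k_fold_cv_error(model, X, y, k=10, seed=0):
    # 1) Split all data into k batches of similar size (after a random shuffle).
    rng = np.random.default_rng(seed)
    batches = np.array_split(rng.permutation(len(X)), k)

    fold_errors = []
    for l in range(k):
        # 2) Batch l is hold-out validation data, the remaining k-1 batches are training data.
        val_idx = batches[l]
        train_idx = np.concatenate([batches[j] for j in range(k) if j != l])

        # 3) Train on the training batches and compute E^(l)_hold-out.
        model.fit(X[train_idx], y[train_idx])
        fold_errors.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))

    # 4) The k-fold cross-validation error is the average over the k folds.
    return np.mean(fold_errors)
```

Step 5 then amounts to a final model.fit(X, y) on the entire dataset.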

19
Q

What do we take the expectation with respect to when creating E_new?

A

Expectation over all possible test data points with respect to the (unknown, theoretical) distribution p(x, y).

20
Q

What is the risk of using E_k-fold (or E_hold-out) to choose the ultimate hyperparameter?

A

It will invalidate E_k-fold (or E_hold-out) as an estimator of E_new as it tends to overfit to the validation data. The result is that E_k-fold is overly optimistic as an estimator of the new data error.

24
Q

What is ultimately required if we want to use E_k-fold as an approximation of E_new?

A

To set aside another hold-out dataset (besides the validation data), i.e., the test set. The test set should only be used ONCE (after selecting hyperparameters) to estimate E_new for the final model.

25
Q

What are the three common techniques for creating artificial datasets?

A

1) To duplicate data and add noise to the duplicates,

2) Simulate data,

3) Use data from a different but related problem.

26
Q

What is E.bar_NEW?

A

The average E_new if we were to train the model multiple times using different training datasets (all of size n), i.e., E.bar_NEW is the average over all possible training datasets.

E.bar_NEW = Exp_TRAIN [E_NEW(TRAIN)]

27
Q

Why do we use E.bar_NEW and E.bar_TRAIN, rather than E_NEW and E_TRAIN?

A

Because it is easier to reason about the average behaviour of E.bar_NEW and E.bar_TRAIN than about the errors E_NEW and E_TRAIN obtained when the model is trained on one specific training dataset.

28
Q

What does it mean (in ML lingo) for a model to generalise from training data?

A

To be able to perform well on unseen data after being trained.

29
Q

What is the (expected) generalisation gap?

A

The difference E.bar_new - E.bar_train.

I.e., the difference between the expected performance “in production” on new, previously unseen data and the expected performance on the training data.

30
Q

Which are the two decompositions of E.bar_NEW we can do?

A

1) Training Error-Generalisation Gap decomposition and 2) Bias-Variance decomposition.

31
Q

Why don’t we want to increase model complexity in order to let E.bar_train decrease further?

A

Due to the risk of overfitting.

32
Q

What is a rough definition of model complexity?

A

The number of parameters that are learned from the data.

33
Q

How can we get a value for E_train in a classification problem using kNN?

A

By calculating the fraction of misclassified training data points.

34
Q

Draw the graph of E.bar_new and E.bar_train with model complexity on x, error on y.

A

E.bar_train decreases monotonically as model complexity increases, while E.bar_new first decreases and then increases again (a U-shape); the vertical distance between the two curves, the generalisation gap, grows with complexity.

35
Q

Why is the bias-variance decomposition of E.bar_new irrelevant when comparing different (ML) algorithms, but relevant for choosing (among) tuning parameters for one single method?

A

Irrelevant because different methods will have different ways of adapting to the data.

36
Q

Which three components can we split E.bar_new into?

A

1) a variance term, 2) a (squared) bias term and 3) an irreducible error term.
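
In the deck's notation, this standard decomposition reads E.bar_new = bias^2 + variance + sigma^2, where sigma^2 denotes the irreducible error, i.e., the noise variance in the data that no choice of model can remove.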

37
Q

Why can’t we use k-fold CV on time series data?

A

Because k-fold CV assumes the data points to be independent and unordered, whereas time series data are dependent and have a natural (temporal) order; splitting them randomly into batches breaks that order and lets the model be validated on past data while being trained on data from the future.

38
Q

What type of cross-validation should we use on time series data?

A

Rolling cross-validation.
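
A minimal sketch of one common variant, an expanding-window scheme in which each fold trains on the past and validates on the block that follows; the splitting details and the squared-error metric are assumptions for illustration (scikit-learn's TimeSeriesSplit provides a similar scheme):

```python
import numpy as np

def rolling_cv_error(model, X, y, n_splits=5, min_train=20):
    # Keep the natural time order: never validate on data that precedes the training data.
    split_points = np.linspace(min_train, len(X), n_splits + 1, dtype=int)
    errors = []
    for i in range(n_splits):
        train_idx = np.arange(0, split_points[i])                   # all past data
        val_idx = np.arange(split_points[i], split_points[i + 1])   # the next block in time
        model.fit(X[train_idx], y[train_idx])
        errors.append(np.mean((model.predict(X[val_idx]) - y[val_idx]) ** 2))
    return np.mean(errors)
```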

39
Q

Which loss function do we actually use when training a classifier? Why not the misclassification error?

A

The cross-entropy loss. Unlike the misclassification error, it is formulated not only in terms of the hard class prediction y.hat but also incorporates the predicted class probability g(x) (see the sketch below).

We do not use the misclassification error since:
1) using cross-entropy can result in a model that generalizes better from the training data, as the final prediction y.hat does not reveal all aspects of the classifier (intuitively, this pushes the decision boundaries further away from the training data points),

2) misclassification would give us a piecewise constant cost function, which is unsuitable for numerical optimization as the gradient is zero everywhere (except where it is undefined).
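
A minimal NumPy sketch of the binary cross-entropy loss, where g is the predicted probability that y = 1 (names and numbers are illustrative):

```python
import numpy as np

def binary_cross_entropy(g, y, eps=1e-12):
    # g: predicted probability of the positive class, y: observed labels in {0, 1}.
    g = np.clip(g, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(g) + (1 - y) * np.log(1 - g))

# Confident correct predictions give a small loss, confident mistakes a large one.
print(binary_cross_entropy(np.array([0.9, 0.2]), np.array([1, 0])))  # ~0.16
print(binary_cross_entropy(np.array([0.1, 0.8]), np.array([1, 0])))  # ~1.96
```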

41
Q

Name the major downside with k-fold CV.

A

It is computationally very costly.

42
Q

Does k-fold CV improve the upward bias that we have for E_hold-out when estimating E_new?

A

No, but it reduces variance.

43
Q

What are the two conflicting goals regarding T_v and T_train?

A

That we need a large T_train to reduce bias, but we also need a large T_v to reduce the variance of the estimate (a small validation set gives an uncertain estimate of E_new).

44
Q

When do we use rolling CV?

A

For time series data, which are dependent and have a natural order.

45
Q

What is a necessary assumption for the data points when using k-fold CV?

A

That the data points are independent and unordered. Therefore, assign the data points randomly when splitting into training and validation data.

46
Q

What is the TPR (true pos. rate)?

A

TPR is the probability that an actual positive will be predicted as positive.

47
Q

What is the TPR also called?

A

The sensitivity.

48
Q

What is the TNR (true negative rate)?

A

The probability that an actual negative will be predicted as negative.
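
A small sketch computing both the TPR from the earlier cards and the TNR, given predicted and observed binary labels (1 = positive, 0 = negative; the data are illustrative):

```python
import numpy as np

def tpr_tnr(y_hat, y):
    tpr = np.mean(y_hat[y == 1] == 1)  # sensitivity: actual positives predicted as positive
    tnr = np.mean(y_hat[y == 0] == 0)  # specificity: actual negatives predicted as negative
    return tpr, tnr

y     = np.array([1, 1, 1, 0, 0])
y_hat = np.array([1, 0, 1, 0, 1])
print(tpr_tnr(y_hat, y))  # (0.666..., 0.5)
```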

49
Q

What is the TNR also called?

A

The specificity.

50
Q

What is the fundamental principle of data usage in ML?

A

Don’t use the same data for two different purposes.

51
Q

What does it mean that E_hold-out is an upward biased estimate of E_new? Why?

A

That the estimate E_hold-out is expected to be larger than the true value of E_new.

Why? Because the model being evaluated is trained on only the training part of the data, not all of it; with less training data it typically performs somewhat worse than the final model trained on the whole dataset, so E_hold-out tends to overestimate E_new for that final model.

52
Q

Definition of the training error?

A

It is the error obtained by evaluating the model on the same data it was learned from.

53
Q

Does the generalization gap decrease or increase with model complexity (think of plot with E.new and E.train difference)? Draw a graph!

A

Increase.

54
Q

What is true regarding (sq) bias and variance for a model with low complexity?

A

It has low variance but high bias.

55
Q

What is true regarding (sq) bias and variance for a model with high complexity?

A

It has high variance but low bias.

56
Q

Which error functions do we use?

A

The misclassification error for classification problems and the squared error for regression problems.

57
Q

What is the definition of the error function?

A

A measure of how much our prediction for a single data point misses the observed value for that same data point.

58
Q

What is the definition of E.bar_new?

A

The average E_new if we were to train the model on different training data sets.

59
Q

Illustrate the generalization gap.

A