ML midterm Flashcards

questions

1
Q

Consider a set of classifiers that includes all linear classifiers that use different choices of strict subsets of the components of the input vectors x ∈ R^d. Claim: the VC-dimension of this combined set cannot be more than d+1.

A

True

2
Q

It is impossible to overfit to the test set if one only uses it for evaluation.

A

False

3
Q

If the target function is deterministic, overfitting cannot occur.

A

False

4
Q

For a given linear regression problem with weight decay, there is a λ>0 regularization coefficient such that the optimal weight vector of the regularized problem is the same as for the unregularized problem.

A

True
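
For reference, both problems have closed-form solutions, so the claim can be checked directly for any λ. A minimal numpy sketch with hypothetical toy data; w_lin is the unregularized least-squares solution and w_reg the weight-decay (ridge) solution:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))                      # hypothetical inputs
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
    lam = 0.1                                         # regularization coefficient lambda

    # unregularized least squares: w = (X^T X)^{-1} X^T y
    w_lin = np.linalg.solve(X.T @ X, X.T @ y)

    # weight decay (ridge): w = (X^T X + lambda I)^{-1} X^T y
    w_reg = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

    print(w_lin, w_reg)                               # compare the two solutions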

5
Q

If the VC dimension of a model is infinite, then VC theory still guarantees generalization, but the bound is looser.

A

False

6
Q

If a data set of size k cannot be shattered by a hypothesis set H, then k is H’s break point.

A

True

7
Q

The cross-entropy error for a logistic regression model has an upper bound when all the samples are classified correctly.

A

True

8
Q

Consider the following hypotheses: h1(x) = w1 x1 and h2(x) = 1 − w1 x1, where x1 is the first component of the input vector x. Then, for any dataset, the absolute difference between Eout and Ein can be proven to be the same for the two hypotheses.

A

False

9
Q

Underfitting occurs when the in sample error gets larger than the out of sample error.

A

False

10
Q

Hard Support Vector Machines work by trying to find a separating hyperplane that correctly classifies all elements in the training dataset while maximizing the distance of the nearest points to this separator.

A

True
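
A minimal sketch of this idea, assuming scikit-learn is available: the hard margin can be approximated with a linear SVC and a very large penalty C, so that no training error is tolerated (the toy data and variable names are hypothetical):

    import numpy as np
    from sklearn.svm import SVC

    # linearly separable toy data
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
    y = np.array([1, 1, -1, -1])

    # very large C approximates the hard-margin SVM: classify all points
    # correctly while maximizing the distance of the nearest points
    clf = SVC(kernel="linear", C=1e10)
    clf.fit(X, y)

    print(clf.coef_, clf.intercept_)   # separating hyperplane w, b
    print(clf.support_vectors_)        # the nearest points (support vectors)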

11
Q

Stochastic Gradient Descent is made stochastic with respect to normal Gradient Descent by adding a randomized extra noise term to the estimation of the gradient.

A

False
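
The actual difference is illustrated by this minimal numpy sketch (hypothetical squared-error setup): SGD evaluates the same gradient formula as batch Gradient Descent, but on a single randomly chosen sample per step, with no extra noise term added anywhere:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -1.0, 2.0])
    w = np.zeros(3)
    eta = 0.01                                   # step size

    for step in range(2000):
        i = rng.integers(len(X))                 # randomness: pick one sample
        grad_i = 2 * (X[i] @ w - y[i]) * X[i]    # gradient of the error on that sample
        w -= eta * grad_i                        # same update rule as GD, no added noise

    print(w)                                     # approaches [1, -1, 2]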

12
Q

If the target function is not a linear function, then the training dataset cannot be linearly separable.

A

False

13
Q

Polynomial regression is not a linear problem, so Stochastic Gradient Descent cannot be used.

A

False
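
Polynomial regression is linear in the weights once the inputs are mapped to polynomial features, so gradient-based solvers such as SGD apply directly. A sketch under that view (hypothetical data, scikit-learn assumed available):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import SGDRegressor

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=(200, 1))
    y = 1.0 - 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2       # quadratic target

    # nonlinear transform of the input; the model stays linear in w
    Z = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

    # Stochastic Gradient Descent on the transformed (linear) problem
    model = SGDRegressor(max_iter=2000, tol=1e-6).fit(Z, y)
    print(model.intercept_, model.coef_)               # roughly [1.0], [-2.0, 3.0]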

14
Q

The Hoeffding Inequality implies that the increase of the generalization error is bounded from below by a logarithmic function of the growth of the size of the training set.

A

False
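
For reference, the Hoeffding bound for a single hypothesis, with dataset size N and tolerance ε, has the form

    \mathbb{P}\big[\,|E_{\text{in}} - E_{\text{out}}| > \epsilon\,\big] \le 2\,e^{-2\epsilon^2 N}

so the probability of a large gap between Ein and Eout decays exponentially as N grows; the inequality gives no lower bound on the generalization error.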

15
Q

The test set can be used to estimate the out of sample error.

A

True

16
Q

The following model is linear (with respect to the parameters w):

h(x) = w1 sin(x1) + e^(w2 x2)

A

False

17
Q

There exists a dataset of size dVC+1 for which Ein>0.

A

True

18
Q

If the training set is linearly separable, then the pocket algorithm returns a worse model (in approximating the target function) than PLA.

A

False

19
Q

The Ein obtained by normal linear regression is not larger than the Ein obtained by linear regression with weight decay.

A

True

20
Q

The gradient of the error with respect to the weights can be estimated on one sample and this estimation is unbiased.

A

True
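
A quick numerical check of this claim (numpy, hypothetical squared-error setup): averaging the per-sample gradients over all samples recovers the full-batch gradient, so a single-sample gradient drawn uniformly at random is an unbiased estimate of it:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = rng.normal(size=50)
    w = rng.normal(size=3)

    # full-batch gradient of the mean squared error
    batch_grad = 2 * X.T @ (X @ w - y) / len(X)

    # expectation of the single-sample gradient under uniform sampling
    per_sample = np.array([2 * (X[i] @ w - y[i]) * X[i] for i in range(len(X))])
    print(np.allclose(per_sample.mean(axis=0), batch_grad))   # True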

21
Q

The following model is linear (with respect to the parameters):

h(x) = w1 sin(x1) + w2 e^(x2) − w3 x1 x2

A

True
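
Because this model is linear in (w1, w2, w3), its weights can be fit by ordinary least squares on the transformed features; a minimal numpy sketch with hypothetical data generated from known weights:

    import numpy as np

    rng = np.random.default_rng(0)
    x1, x2 = rng.uniform(-1, 1, size=(2, 100))
    y = 0.5 * np.sin(x1) + 2.0 * np.exp(x2) - 1.5 * x1 * x2

    # one column per fixed nonlinear feature; the model is h(x) = Z @ w
    Z = np.column_stack([np.sin(x1), np.exp(x2), -x1 * x2])
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    print(w)   # recovers approximately [0.5, 2.0, 1.5]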

22
Q

By using the cross-entropy error measure, logistic regression is guaranteed not to get stuck in a local minimum of the error function.

A

True

23
Q

Ein=0 for all datasets of size at most dVC.

A

True

24
Q

It is possible for a growth function to have more than one break point.

A

False

25
Q

In Gradient Descent, a step size that is too small can lead to slow learning.

A

True

26
Q

The training set can be used to estimate the in sample error

A

True

27
Q

The squared error measure can be decomposed into a bias and a variance term.

A

True
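
The decomposition in question, written with the average hypothesis \bar{h}(x) = E_D[h^{(D)}(x)] over datasets D and a deterministic target f, is

    \mathbb{E}_D\big[(h^{(D)}(x) - f(x))^2\big]
      = \underbrace{\mathbb{E}_D\big[(h^{(D)}(x) - \bar{h}(x))^2\big]}_{\text{variance}}
      + \underbrace{\big(\bar{h}(x) - f(x)\big)^2}_{\text{bias}}

with an additional noise term when the target itself is noisy.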

28
Q

In Gradient Descent, the step size should be as small as possible to avoid instability.

A

False

29
Q

The PLA is guaranteed to terminate in a finite number of steps

A

False

30
Q

In each step of Gradient Descent, the gradient of the error is added to the weights.

A

False

31
Q

The variance of a model measures the expected squared difference between the output of the model and the target value.

A

False

32
Q

When considering a misclassified sample, the PLA algorithm will move the separator in the direction of classifying that sample correctly.

A

True

33
Q

The Hoeffding Inequality implies that the generalisation error decreases as the model complexity is increased

A

False

34
Q

If we decrease the regularization coefficient, the generalization error decreases when using weight decay.

A

False

35
Q

Nonlinear transformation of the inputs cannot increase the VC dimension.

A

False

36
Q

There exists no closed analytical formula to determine the weights of a linear regression problem that minimise the in sample error.

A

False

37
Q

The Gradient Descent algorithm is used to find a local minimum in a logistic regression problem.

A

True
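
A minimal sketch of that use (numpy, hypothetical data): Gradient Descent on the logistic regression cross-entropy error; since this error is convex, the local minimum it finds is also the global one:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)      # labels in {-1, +1}
    w = np.zeros(2)
    eta = 0.5                                              # step size

    for _ in range(500):
        # gradient of (1/N) sum ln(1 + exp(-y_n w.x_n))
        s = y * (X @ w)
        grad = -(X * (y / (1 + np.exp(s)))[:, None]).mean(axis=0)
        w -= eta * grad                                    # move against the gradient

    print(w)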

38
Q

The Hoeffding inequality implies that the generalisation error grows as the size of the dataset grows

A

False

39
Q

The best stopping criterion for an iterative classification algorithm is to stop only when Ein stops decreasing.

A

False

40
Q

It is impossible for any learning algorithm to generalise to unseen examples, i.e. the output for those will be random.

A

False

41
Q

Stochastic Gradient Descent differs from normal Gradient Descent in that it adds an extra noise term to the estimated gradient.

A

False

42
Q

Regularization is commonly used to speed up the optimization process.

A

False

43
Q

Regularization can improve generalisation.

A

True

44
Q

Using a model with the same complexity as the target function does not prevent overfitting.

A

True

45
Q

When considering a misclassified sample, the PLA algorithm will move the separator in a way that ensures that after this update the sample will be classified correctly.

A

False

46
Q

Increasing the amount of testing data reduces overfitting.

A

False

47
Q

After hyperparameter optimization, the validation error is an unbiased estimate for Eout.

A

False

48
Q

For a hypothesis space H with VC dimension dVC: dVC ≥ N implies that there exists a dataset of size N that H shatters.

A

True

49
Q

Increasing the amount of training data reduces overfitting.

A

True

50
Q

A growth function with a break point k can be bounded by a polynomial of degree k−1.

A

True
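
The bound in question (Sauer's lemma) states that a growth function with break point k satisfies

    m_{\mathcal{H}}(N) \le \sum_{i=0}^{k-1} \binom{N}{i}

and the right-hand side is a polynomial in N of degree k−1.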

51
Q

Overfitting happens when the in sample error is larger (“goes over”) than the out of sample error.

A

False

52
Q

The validation set can be used to optimize hyperparameters.

A

True

53
Q

Polynomial regression is not a linear problem, so Stochastic Gradient Descent cannot be used.

A

False

54
Q

The perceptron model is a linear separator.

A

True

55
Q

Since the (logistic) sigmoid function is not linear, the logistic regression model is also nonlinear.

A

False

56
Q

Overfitting occurs when the out of sample error gets larger than the in sample error.

A

False

57
Q

The cross-entropy error for a logistic regression model does not have an upper bound.

A

True

58
Q

The Rademacher bound can be used to bound the 0-1 loss of a classification
model.

A

True

59
Q

The Ein obtained by normal linear regression is not larger than the Ein obtained by linear regression with weight decay.

A

True

60
Q

In Gradient Descent, as a general rule of thumb, the step size should be chosen to be small enough to keep the magnitude of the gradient under to avoid instability.

A

False

61
Q

When considering a misclassified sample, the PLA algorithm will move the separator in the direction of classifying that sample correctly; however, the sample might still be classified incorrectly after the weight update.

A

True
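
A tiny numpy sketch of one PLA step on a misclassified sample (hypothetical values for w, x and the label y): the update w ← w + y·x moves the separator toward classifying the sample correctly, yet in this instance the sample is still misclassified afterwards:

    import numpy as np

    w = np.array([5.0, 0.0, 0.0])      # current weights
    x = np.array([1.0, 0.1, 0.1])      # sample (with bias component)
    y = -1                             # its true label

    print(np.sign(w @ x))              # +1: misclassified, since y = -1
    w = w + y * x                      # PLA update toward the correct side
    print(np.sign(w @ x))              # still +1: not yet fixed after one update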

62
Q

The weight vector (the normal of the separating hyperplane) in an SVM is in the span of the input samples x_i.

A

True
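
This is visible from the form of the SVM solution: with the multipliers α_i ≥ 0 obtained from the dual problem,

    w = \sum_i \alpha_i\, y_i\, x_i

so w is a linear combination of the training inputs, with nonzero α_i only for the support vectors.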

63
Q

A not linearly separable dataset can become linearly separable if further features are added.

A

True
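
A standard illustration of this (numpy, hypothetical data): points labeled by whether they lie inside a circle are not linearly separable in (x1, x2), but adding the feature x1² + x2² makes them separable by a simple threshold on that new coordinate:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)   # inside/outside a circle

    # add a feature: squared distance from the origin
    r2 = (X ** 2).sum(axis=1)
    Z = np.column_stack([X, r2])

    # in the extended space the hyperplane r2 = 0.5 separates the classes
    y_hat = np.where(Z[:, 2] < 0.5, 1, -1)
    print((y_hat == y).all())   # True: linearly separable after the transform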

64
Q

In each step of Gradient Descent, the gradient of the error is subtracted from the weights.

A

True

65
Q

If the training set is not linearly separable, then the perceptron learning algorithm always goes into an infinite loop.

A

True
