MCQ Flashcards

1
Q

Which statement is true about 1-vs-1 classification?

A

It needs as many models as there are class pairs, and each model predicts which class an observation is more likely to belong to

2
Q

Which statement is true about 1-vs-all classification?

A

It needs as many models as there are classes, and each model predicts the probability that an observation belongs to a given class

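As a rough illustration of the two schemes (a sketch assuming scikit-learn is available; the data set is synthetic), 1-vs-1 fits one model per pair of classes, while 1-vs-all fits one model per class:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

    # Synthetic 4-class problem, purely for illustration.
    X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                               n_classes=4, random_state=0)
    ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    print(len(ovo.estimators_))  # 6 models: one per class pair, 4 * 3 / 2
    print(len(ovr.estimators_))  # 4 models: one per class
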
3
Q

In a binary classification setting, accuracy is…

A

The ratio of the number of correct predictions to the number of observations

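For a concrete feel of this ratio, a tiny example with made-up labels:

    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1]
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    print(correct / len(y_true))  # 5 correct predictions out of 6 observations -> 0.833...
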
4
Q

An advantage of using the bootstrap method over other model selection methods is….

A

Each model is trained on a bootstrap sample of the same size as the original data set, so it does not use fewer data points for training

5
Q

The classical estimate of the model error obtained using the bootstrap method is

A

An overestimate of the real error, because each model is trained, on average, on only 63.2% of the distinct observations in the data set

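The 63.2% figure comes from the fact that a given observation appears in a bootstrap sample of size n with probability 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632. A small simulation sketch (assuming NumPy is available):

    import numpy as np

    rng = np.random.default_rng(0)
    n, fractions = 1000, []
    for _ in range(200):
        sample = rng.integers(0, n, size=n)            # draw n indices with replacement
        fractions.append(len(np.unique(sample)) / n)   # fraction of distinct points in the sample
    print(np.mean(fractions))  # ≈ 0.632, i.e. roughly 1 - 1/e
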
6
Q

What are typical assumptions of the error term ε, when the relation between inputs and
output is written as y = f(X) + ε (where y is the output, X is the input, and f is the
real relationship between input and output)?

A

Its expected value is zero

7
Q

Which one of the following about the F1 score is true?

A

It is close to zero for models that are bad at discriminating the positive class

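A quick calculation with hypothetical confusion counts shows why F1 collapses for a model that barely finds the positive class:

    tp, fp, fn = 2, 1, 40                               # made-up counts: almost all positives are missed
    precision = tp / (tp + fp)                          # ≈ 0.67
    recall = tp / (tp + fn)                             # ≈ 0.05
    f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.09, close to zero
    print(precision, recall, f1)
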
8
Q

Given data with 3 numeric features, 2 categorical features and 1 numeric label, how many subsets of features should I consider if I do feature selection via an exhaustive search of all possible combinations?

A

32

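The count follows because there are 3 + 2 = 5 features in total (the label is not a feature) and each feature is either in or out of a subset, giving 2^5 = 32 subsets, the empty set included. A short check (the feature names are made up):

    from itertools import combinations

    features = ["num1", "num2", "num3", "cat1", "cat2"]  # hypothetical names: 3 numeric + 2 categorical
    subsets = [c for r in range(len(features) + 1)
               for c in combinations(features, r)]
    print(len(subsets))  # 32 == 2 ** 5, including the empty subset
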
9
Q

What is the synonym of “feature”?

A

Independent variable

10
Q

What is true about forward stepwise feature selection?

A

It starts by considering no features, simply predicting the mean label

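A minimal sketch of the idea (assuming scikit-learn is available, X is a NumPy array, and the helper name forward_stepwise is made up): start from the empty feature set, whose best prediction is simply the mean label, then greedily add the single feature that most improves the cross-validated score.

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def forward_stepwise(X, y, n_select):
        selected, remaining = [], list(range(X.shape[1]))
        # Step 0: no features selected; the model would just predict the mean of y.
        while remaining and len(selected) < n_select:
            scores = {j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean()
                      for j in remaining}
            best = max(scores, key=scores.get)  # feature giving the largest improvement
            selected.append(best)
            remaining.remove(best)
        return selected
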
11
Q

The gradient descent method

A

Moves against the direction of the gradient in every iteration

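A bare-bones sketch on the one-dimensional function f(w) = (w - 3)^2, whose gradient is 2(w - 3); the learning rate controls how far each step moves against that gradient:

    w, learning_rate = 0.0, 0.1
    for _ in range(100):
        grad = 2 * (w - 3)             # gradient of f(w) = (w - 3) ** 2
        w = w - learning_rate * grad   # step *against* the gradient direction
    print(w)  # ≈ 3.0, the minimiser, where the gradient is zero
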
12
Q

In the hold-out validation method

A

We divide our data into a large training set and a small test set

13
Q

After selecting a winning model in the hold-out validation method, it’s a good idea to

A

Retrain it on the combined training + test data before using it in production

14
Q

The estimate of the error of the model we get using the hold-out validation method

A

Is an overestimation of the true model error because we train on fewer data points

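A sketch of the whole hold-out workflow from the three cards above (assuming scikit-learn; the data is synthetic): split into a large training set and a small test set, estimate the error on the test set, then retrain the winning model on training + test data before production.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=5, noise=1.0, random_state=0)
    # Large training set (80%), small test set (20%).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))          # hold-out estimate; slightly pessimistic, since training used fewer points
    final_model = LinearRegression().fit(X, y)  # retrain on training + test data before deployment
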
15
Q

The hyperparameters of a model

A

Are chosen by the user, usually via hyperparameter tuning

16
Q

A hyperplane in R^p can be described with…

A

One linear equation
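
For example, a hyperplane in R^p is the set of points x = (x_1, …, x_p) satisfying a single linear equation w_1 x_1 + … + w_p x_p + b = 0 (equivalently w·x + b = 0) for some fixed non-zero vector w and scalar b.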

17
Q

What is true about irreducible error?

A

It depends on the data noise and cannot be reduced by selecting a better model

18
Q

The k-fold cross validation when k=n (n is the number of data points), becomes

A

The leave-one-out cross-validation method
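
A quick check of this equivalence (assuming scikit-learn and NumPy are available):

    import numpy as np
    from sklearn.model_selection import KFold, LeaveOneOut

    X = np.arange(10).reshape(-1, 1)               # toy data set with n = 10 points
    kfold = list(KFold(n_splits=len(X)).split(X))  # k = n folds
    loo = list(LeaveOneOut().split(X))
    print(len(kfold), len(loo))  # 10 splits each, with a single held-out point per split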

19
Q

An advantage of the k-fold cross validation is that

A

It allows a trade-off between accuracy and the computational power required, which we can tune by varying k

20
Q

Increasing k in k-fold cross-validation is going to

A

Reduce the bias of the estimate of the model error, because training sets get larger

21
Q

Which of the following properties are commonly ascribed to Lasso Regularisation?
(a) It increases the variance of a model but, in exchange, drastically reduces the bias
(b) It penalises simple models, with few parameters
(c) It can help the interpretability of the model, because it tends to set many parameters
to zero
(d) It makes training a model easier by smoothing out irregularities in the objective
function

A

It can help improve the interpretability as it sets many parameters to zero

22
Q

Which of the following statements about Lasso regularisation is true?

(a) It penalises the number of non-zero parameters
(b) It penalises the 1-norm of the parameter vector (i.e., the sum of absolute values)
(c) It penalises the 2-norm of the parameter vector (i.e., the sum of squares)
(d) It penalises the infinite-norm of the parameter vector

A

It penalises the 1-norm of the parameter vector
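
A sketch of both the 1-norm penalty and the sparsity it induces (assuming scikit-learn and NumPy; the data is synthetic):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                           noise=1.0, random_state=0)
    # scikit-learn's Lasso minimises ||y - Xw||^2 / (2n) + alpha * ||w||_1 (a 1-norm penalty).
    lasso = Lasso(alpha=1.0).fit(X, y)
    print(np.sum(lasso.coef_ == 0))  # most of the 20 coefficients end up exactly zero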

23
Q

Which of the following about Lasso and Ridge Regularisation is true?
(a) Lasso always dominates Ridge regularisation, giving smaller test errors
(b) Ridge always dominates Lasso regularisation, giving smaller test errors
(c) In general neither method dominates the other, although there are rules of thumb
on when we might expect dominance
(d) Neither method dominates the other in terms of test error, but Lasso always gives
smaller training errors

A

In general neither method dominates the other, although there are rules of thumb

24
Q

As a rule Lasso tends to perform better than Ridge when

(a) A small number of predictors have large coefficients, and the remaining ones have coefficients that are very small
(b) The response is a function of many predictors, all with coefficients of roughly equal size
(c) A small number of predictors have negative coefficients, and the remaining ones have coefficients that are very small and positive
(d) A small number of predictors have positive coefficients, and the remaining ones have coefficients that are very small and negative

A

A small number of predictors have large coefficients, and the remaining ones have coefficients that are very small

25
Q

In gradient descent, the learning rate

(a) Indicates by how much we move against the gradient
(b) Helps forgetting about outliers and concentrating on typical samples
(c) Can be chosen appropriately, so as to normalise the inputs
(d) Has no impact on the number of iterations required for convergence

A

Indicates by how much we move against the gradient

26
Q

Two classes are called linearly separable if

(a) There is a third class which lies between them
(b) They are separated by a margin wider than the average inter-class distance
(c) They both lie on the same line
(d) There is a hyperplane that separates them

A

There is a hyperplane that separates them

27
Q

At a local minimum of the average loss function

(a) The gradient with respect to the model’s parameters is zero
(b) The gradient with respect to the model parameters is strictly positive
(c) The gradient with respect to the model parameters is strictly negative
(d) The gradient with respect to the model’s parameters can have any sign

A

The gradient with respect to the model’s parameters is zero

28
Q

The estimate of the accuracy of a model obtained using Leave-One-Out Cross-Validation has…

(a) Higher bias than the estimate provided by hold-out cross-validation because we test on a single point
(b) Higher bias than the estimate provided by k-fold cross-validation because we test on a single point
(c) Higher bias than the estimate provided by hold-out cross validation because we train on a smaller set
(d) Lower bias than the estimate provided by hold-out cross validation because we train on a larger set

A

Lower bias than the estimate provided by hold-out cross validation because we train on a larger set

29
Q

A disadvantage of the Leave-One-Out Cross-Validation is that

(a) The estimate of the model’s error we obtain is a gross overestimation of the real error
(b) The estimate of the model’s error we obtain is a gross underestimation of the real error
(c) The estimate of the model’s error we obtain is affected by roughly twice as much variance compared to k-fold CV
(d) We have to train each model as many times as we have points in the dataset, and this can be computationally expensive

A

We have to train each model as many times as there are points in the dataset and this can be computationally expensive

30
Q

Which of the following is a desirable property of the loss function?

(a) It achieves its minimum when the prediction is correct
(b) It has many discontinuities, to facilitate finding a global optimum
(c) It’s neither concave nor convex
(d) It decreases the further the prediction is from the ground truth

A

It achieves its minimum when the prediction is correct
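
As a minimal illustration, using the squared loss purely as an example:

    def squared_loss(y_true, y_pred):
        # Zero exactly when the prediction matches the ground truth, growing as it moves away.
        return (y_true - y_pred) ** 2

    print(squared_loss(2.0, 2.0))  # 0.0 -> minimum at a correct prediction
    print(squared_loss(2.0, 5.0))  # 9.0 -> larger loss for a worse prediction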