MCQ Flashcards

1
Q

Which statement is true about 1-vs-1 classification?

A

It needs as many models as there are class pairs, and each model predicts which class an observation is more likely to belong to

2
Q

Which statement is true about 1-vs-all classification?

A

It needs as many models as there are classes, and each model predicts the probability that an observation belongs to a given class

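As a rough illustration of the two schemes (a sketch assuming scikit-learn is available; the data set is synthetic), 1-vs-1 fits one model per pair of classes, while 1-vs-all fits one model per class:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

    # Synthetic 4-class problem, purely for illustration.
    X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                               n_classes=4, random_state=0)
    ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    print(len(ovo.estimators_))  # 6 models: one per class pair, 4 * 3 / 2
    print(len(ovr.estimators_))  # 4 models: one per class
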
3
Q

In a binary classification setting, accuracy is…

A

The ratio of the number of correct predictions to the number of observations

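For a concrete feel of this ratio, a tiny example with made-up labels:

    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1]
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    print(correct / len(y_true))  # 5 correct predictions out of 6 observations -> 0.833...
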
4
Q

An advantage of using the bootstrap method over other model selection methods is….

A

Each model is trained on a bootstrap sample of the same size as the original data set, so it does not use fewer data points for training

5
Q

The classical estimate of the model error obtained using the bootstrap method is

A

An overestimate of the real error, because each model is trained, on average, on only 63.2% of the distinct observations in the data set

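The 63.2% figure comes from the fact that a given observation appears in a bootstrap sample of size n with probability 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632. A small simulation sketch (assuming NumPy is available):

    import numpy as np

    rng = np.random.default_rng(0)
    n, fractions = 1000, []
    for _ in range(200):
        sample = rng.integers(0, n, size=n)            # draw n indices with replacement
        fractions.append(len(np.unique(sample)) / n)   # fraction of distinct points in the sample
    print(np.mean(fractions))  # ≈ 0.632, i.e. roughly 1 - 1/e
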
6
Q

What are typical assumptions of the error term ε, when the relation between inputs and
output is written as y = f(X) + ε (where y is the output, X is the input, and f is the
real relationship between input and output)?

A

Its expected value is zero

7
Q

Which one of the following about the F1 score is true?

A

It is close to zero for models that are bad at discriminating the positive class

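A quick calculation with hypothetical confusion counts shows why F1 collapses for a model that barely finds the positive class:

    tp, fp, fn = 2, 1, 40                               # made-up counts: almost all positives are missed
    precision = tp / (tp + fp)                          # ≈ 0.67
    recall = tp / (tp + fn)                             # ≈ 0.05
    f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.09, close to zero
    print(precision, recall, f1)
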
8
Q

Given data with 3 numeric features, 2 categorical features and 1 numeric label, how many subsets of features should I consider if I do feature selection via an exhaustive search of all possible combinations?

A

32

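The count follows because there are 3 + 2 = 5 features in total (the label is not a feature) and each feature is either in or out of a subset, giving 2^5 = 32 subsets, the empty set included. A short check (the feature names are made up):

    from itertools import combinations

    features = ["num1", "num2", "num3", "cat1", "cat2"]  # hypothetical names: 3 numeric + 2 categorical
    subsets = [c for r in range(len(features) + 1)
               for c in combinations(features, r)]
    print(len(subsets))  # 32 == 2 ** 5, including the empty subset
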
9
Q

What is the synonym of “feature”?

A

Independent variable

10
Q

What is true about forward stepwise feature selection?

A

It starts by considering no features, simply predicting the mean label

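A minimal sketch of the idea (assuming scikit-learn is available, X is a NumPy array, and the helper name forward_stepwise is made up): start from the empty feature set, whose best prediction is simply the mean label, then greedily add the single feature that most improves the cross-validated score.

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def forward_stepwise(X, y, n_select):
        selected, remaining = [], list(range(X.shape[1]))
        # Step 0: no features selected; the model would just predict the mean of y.
        while remaining and len(selected) < n_select:
            scores = {j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean()
                      for j in remaining}
            best = max(scores, key=scores.get)  # feature giving the largest improvement
            selected.append(best)
            remaining.remove(best)
        return selected
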
11
Q

The gradient descent method

A

Moves against the direction of the gradient in every iteration

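A bare-bones sketch on the one-dimensional function f(w) = (w - 3)^2, whose gradient is 2(w - 3); the learning rate controls how far each step moves against that gradient:

    w, learning_rate = 0.0, 0.1
    for _ in range(100):
        grad = 2 * (w - 3)             # gradient of f(w) = (w - 3) ** 2
        w = w - learning_rate * grad   # step *against* the gradient direction
    print(w)  # ≈ 3.0, the minimiser, where the gradient is zero
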
12
Q

In the hold-out validation method

A

We divide our data into a large training set and a small test set

13
Q

After selecting a winning model in the hold-out validation method, it’s a good idea to

A

Retrain it on the combined training + test data before using it in production

14
Q

The estimate of the error of the model we get using the hold-out validation method

A

Is an overestimation of the true model error because we train on fewer data points

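A sketch of the whole hold-out workflow from the three cards above (assuming scikit-learn; the data is synthetic): split into a large training set and a small test set, estimate the error on the test set, then retrain the winning model on training + test data before production.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=5, noise=1.0, random_state=0)
    # Large training set (80%), small test set (20%).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))          # hold-out estimate; slightly pessimistic, since training used fewer points
    final_model = LinearRegression().fit(X, y)  # retrain on training + test data before deployment
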
15
Q

The hyperparameters of a model

A

Are chosen by the user, usually via hyperparameter tuning

16
Q

A hyperplane in R^p can be described with…

A

One linear equation
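
For example, a hyperplane in R^p is the set of points x = (x_1, …, x_p) satisfying a single linear equation w_1 x_1 + … + w_p x_p + b = 0 (equivalently w·x + b = 0) for some fixed non-zero vector w and scalar b.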

17
Q

What is true about irreducible error?

A

It depends on the data noise and cannot be reduced by selecting a better model

18
Q

The k-fold cross validation when k=n (n is the number of data points), becomes

A

The leave-one-out cross-validation method
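
A quick check of this equivalence (assuming scikit-learn and NumPy are available):

    import numpy as np
    from sklearn.model_selection import KFold, LeaveOneOut

    X = np.arange(10).reshape(-1, 1)               # toy data set with n = 10 points
    kfold = list(KFold(n_splits=len(X)).split(X))  # k = n folds
    loo = list(LeaveOneOut().split(X))
    print(len(kfold), len(loo))  # 10 splits each, with a single held-out point per split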

19
Q

An advantage of the k-fold cross validation is that

A

It allows a trade-off between accuracy and the computational power required, which we can tune by varying k

20
Q

Increasing k in k-fold cross-validation is going to

A

Reduce the bias of the estimate of the model error, because training sets get larger

21
Q

Which of the following properties are commonly ascribed to Lasso Regularisation?
(a) It increases the variance of a model but, in exchange, drastically reduces the bias
(b) It penalises simple models, with few parameters
(c) It can help the interpretability of the model, because it tends to set many parameters
to zero
(d) It makes training a model easier by smoothing out irregularities in the objective
function

A

It can help improve the interpretability as it sets many parameters to zero

22
Q

Which of the following statements about Lasso regularisation is true?

(a) It penalises the number of non-zero parameters
(b) It penalises the 1-norm of the parameter vector (i.e., the sum of absolute values)
(c) It penalises the 2-norm of the parameter vector (i.e., the sum of squares)
(d) It penalises the infinite-norm of the parameter vector

A

It penalises the 1-norm of the parameter vector
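
A sketch of both the 1-norm penalty and the sparsity it induces (assuming scikit-learn and NumPy; the data is synthetic):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                           noise=1.0, random_state=0)
    # scikit-learn's Lasso minimises ||y - Xw||^2 / (2n) + alpha * ||w||_1 (a 1-norm penalty).
    lasso = Lasso(alpha=1.0).fit(X, y)
    print(np.sum(lasso.coef_ == 0))  # most of the 20 coefficients end up exactly zero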

23
Q

Which of the following about Lasso and Ridge Regularisation is true?
(a) Lasso always dominates Ridge regularisation, giving smaller test errors
(b) Ridge always dominates Lasso regularisation, giving smaller test errors
(c) In general neither method dominates the other, although there are rules of thumb
on when we might expect dominance
(d) Neither method dominates the other in terms of test error, but Lasso always gives
smaller training errors

A

In general neither method dominates the other, although there are rules of thumb

24
Q

As a rule Lasso tends to perform better than Ridge when

(a) A small number of predictors have large coefficients, and the remaining ones have coefficients that are very small
(b) The response is a function of many predictors, all with coefficients of roughly equal size
(c) A small number of predictors have negative coefficients, and the remaining ones have coefficients that are very small and positive
(d) A small number of predictors have positive coefficients, and the remaining ones have coefficients that are very small and negative

A

A small number of predictors have large coefficients, and the remaining ones have coefficients that are very small

25
Q

In gradient descent, the learning rate

(a) Indicates by how much we move against the gradient
(b) Helps forgetting about outliers and concentrating on typical samples
(c) Can be chosen appropriately, so as to normalise the inputs
(d) Has no impact on the number of iterations required for convergence

A

Indicates by how much we move against the gradient

26
Q

Two classes are called linearly separable if

(a) There is a third class which lies between them
(b) They are separated by a margin wider than the average inter-class distance
(c) They both lie on the same line
(d) There is a hyperplane that separates them

A

There is a hyperplane that separates them

27
Q

At a local minimum of the average loss function

(a) The gradient with respect to the model’s parameters is zero
(b) The gradient with respect to the model parameters is strictly positive
(c) The gradient with respect to the model parameters is strictly negative
(d) The gradient with respect to the model’s parameters can have any sign

A

The gradient with respect to the model’s parameters is zero

28
Q

The estimate of the accuracy of a model obtained using Leave-One-Out Cross-Validation has…

(a) Higher bias than the estimate provided by hold-out cross-validation because we test on a single point
(b) Higher bias than the estimate provided by k-fold cross-validation because we test on a single point
(c) Higher bias than the estimate provided by hold-out cross validation because we train on a smaller set
(d) Lower bias than the estimate provided by hold-out cross validation because we train on a larger set

A

Lower bias than the estimate provided by hold-out cross validation because we train on a larger set

29
Q

A disadvantage of the Leave-One-Out Cross-Validation is that

(a) The estimate of the model’s error we obtain is a gross overestimation of the real error
(b) The estimate of the model’s error we obtain is a gross underestimation of the real error
(c) The estimate of the model’s error we obtain is affected by roughly twice as much variance compared to k-fold CV
(d) We have to train each model as many times as we have points in the dataset, and this can be computationally expensive

A

We have to train each model as many times as there are points in the dataset and this can be computationally expensive

30
Q

Which of the following is a desirable property of the loss function?

(a) It achieves its minimum when the prediction is correct
(b) It has many discontinuities, to facilitate finding a global optimum
(c) It’s neither concave nor convex
(d) It decreases the further the prediction is from the ground truth

A

It achieves its minimum when the prediction is correct
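
As a minimal illustration, using the squared loss purely as an example:

    def squared_loss(y_true, y_pred):
        # Zero exactly when the prediction matches the ground truth, growing as it moves away.
        return (y_true - y_pred) ** 2

    print(squared_loss(2.0, 2.0))  # 0.0 -> minimum at a correct prediction
    print(squared_loss(2.0, 5.0))  # 9.0 -> larger loss for a worse prediction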