MCQ Flashcards
Which statement is true about 1-vs-1 classification?
It needs as many models as there are class pairs, and each model predicts which class an observation is more likely to belong to
Which statement is true about 1-vs-all classification?
It needs as many models as there are classes, and each model predicts the probability that an observation belongs to a given class
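A quick sketch of the model counts for the two schemes (plain Python; the function names are illustrative):

```python
from math import comb

def models_one_vs_one(n_classes: int) -> int:
    # One model per unordered pair of classes.
    return comb(n_classes, 2)

def models_one_vs_all(n_classes: int) -> int:
    # One model per class, each scoring "this class vs the rest".
    return n_classes

print(models_one_vs_one(4))  # 6 pairwise models
print(models_one_vs_all(4))  # 4 models
```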
In a binary classification setting, accuracy is…
The ratio of the number of correct predictions to the total number of observations
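In code (plain Python; `accuracy` is an illustrative helper name):

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the ground truth.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```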
An advantage of using the bootstrap method over other model selection methods is….
Each bootstrap training set is the same size as the original data set, so it works well even when data is scarce
The classical estimate of the model error obtained using the bootstrap method is
It overestimates the real error, because each model is trained, on average, on only 63.2% of the data set
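The 63.2% figure (≈ 1 − 1/e) can be checked with a quick simulation (plain Python, standard library only; illustrative sketch):

```python
import random

random.seed(0)
n = 100_000
# Draw a bootstrap sample: n indices sampled with replacement.
sample = [random.randrange(n) for _ in range(n)]
# Fraction of distinct original points that appear in the sample.
unique_fraction = len(set(sample)) / n
# For large n this approaches 1 - 1/e ~ 0.632.
print(round(unique_fraction, 3))
```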
What are typical assumptions of the error term ε, when the relation between inputs and output is written as y = f(X) + ε (where y is the output, X is the input, and f is the real relationship between input and output)?
Its expected value is zero
Which one of the following about the F1 score is true?
It is close to zero for models that are bad at discriminating the positive class
Given data with 3 numeric features, 2 categorical features and 1 numeric label, how many subset of features should I consider, if I do feature selection via an exhaustive search of all possible combinations?
32
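The count can be verified by enumerating all subsets of the 5 features (plain Python; the feature names are placeholders):

```python
from itertools import combinations

features = ["num1", "num2", "num3", "cat1", "cat2"]  # 3 numeric + 2 categorical
# All subsets, including the empty set (the model that just predicts the mean label).
subsets = [c for r in range(len(features) + 1)
           for c in combinations(features, r)]
print(len(subsets))  # 32 == 2 ** 5
```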
What is the synonym of “feature”?
Independent variable
What is true about forward stepwise feature selection?
It starts by considering no features, predicting just the mean of the label
The gradient descent method
Moves against the direction of the gradient in every iteration
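A minimal gradient-descent sketch on a one-dimensional quadratic (plain Python; the function and parameter names are illustrative):

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_iters=200):
    # Repeatedly step against the gradient direction.
    x = x0
    for _ in range(n_iters):
        x = x - learning_rate * grad(x)
    return x

# Minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # close to 3, the minimiser
```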
In the hold out validation method
We divide our data into a large training set and a small test set
After selecting a winning model in the hold out validation method, it’s a good idea to
Retrain it on the combined training + test data before using it in production
The estimate of the error of the model we get using the hold out validation method
Is an overestimate of the true model error, because we train on fewer data points
The hyperparameters of a model
Are chosen by the user, usually via hyperparameter tuning
A hyperplane in R^p can be described with…
One linear equation
What is true about irreducible error?
It depends on the data noise and cannot be reduced by selecting a better model
The k-fold cross validation when k=n (n is the number of data points), becomes
The leave-one-out cross-validation method
An advantage of the k-fold cross validation is that
It allows a trade-off between accuracy and the computational power required, which we can tune by varying k
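A minimal sketch of how k-fold splitting partitions the data (plain Python; `kfold_indices` is a made-up helper name):

```python
def kfold_indices(n, k):
    # Partition indices 0..n-1 into k contiguous, near-equal folds;
    # each fold serves once as the test set, the rest as training.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(10, 3)
print(folds)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

With k = n each fold holds a single point, which recovers leave-one-out cross-validation.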
Increasing k in k-fold cross-validation is going to
reduce the bias of the estimate of the model error, because training sets get larger
Which of the following properties are commonly ascribed to Lasso Regularisation?
(a) It increases the variance of a model but, in exchange, drastically reduces the bias
(b) It penalises simple models, with few parameters
(c) It can help the interpretability of the model, because it tends to set many parameters to zero
(d) It makes training a model easier by smoothing out irregularities in the objective function
It can help improve the interpretability as it sets many parameters to zero
Which of the following statement about Lasso regularisation is true?
(a) It penalises the number of non-zero parameters
(b) It penalises the 1-norm of the parameter vector (i.e., the sum of absolute values)
(c) It penalises the 2-norm of the parameter vector (i.e., the sum of squares)
(d) It penalises the infinite-norm of the parameter vector
It penalises the 1-norm of the parameter vector
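The 1-norm penalty is what produces the zeroed-out coefficients mentioned in the previous card. A minimal sketch of the soft-thresholding operator that Lasso solvers (e.g. coordinate descent) apply to each coefficient; the name `soft_threshold` is just illustrative:

```python
def soft_threshold(z, t):
    # Proximal operator of the 1-norm penalty with strength t:
    # shrinks z towards zero, and sets any |z| <= t exactly to zero.
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

print(soft_threshold(2.0, 0.5))   # 1.5
print(soft_threshold(0.3, 0.5))   # 0.0  (zeroed out -> sparsity)
print(soft_threshold(-1.2, 0.5))  # -0.7
```

The exact zero at small inputs is why Lasso yields sparse, more interpretable models, whereas the 2-norm (Ridge) penalty only shrinks coefficients without zeroing them.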
Which of the following about Lasso and Ridge Regularisation is true?
(a) Lasso always dominates Ridge regularisation, giving smaller test errors
(b) Ridge always dominates Lasso regularisation, giving smaller test errors
(c) In general neither method dominates the other, although there are rules of thumb
on when we might expect dominance
(d) Neither method dominates the other in terms of test error, but Lasso always gives
smaller training errors
In general neither method dominates the other, although there are rules of thumb
As a rule Lasso tends to perform better than Ridge when
(a) A small number of predictors have large coefficients, and the remaining ones have coefficients that are very small
(b) The response is a function of many predictors, all with coefficients of roughly equal size
(c) A small number of predictors have negative coefficients, and the remaining ones have coefficients that are very small and positive
(d) A small number of predictors have positive coefficients, and the remaining ones have coefficients that are very small and negative
A small number of predictors have large coefficients, and the remaining ones have coefficients that are very small
In gradient descent, the learning rate
(a) Indicates by how much we move against the gradient
(b) Helps forgetting about outliers and concentrating on typical samples
(c) Can be chosen appropriately, so to normalise the inputs
(d) Has no impact on the number of iterations required for convergence
Indicates by how much we move against the gradient
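A small sketch of how the learning rate affects convergence speed, which also shows why option (d) is wrong (plain Python; names are illustrative):

```python
def steps_to_converge(learning_rate, tol=1e-6, max_steps=10_000):
    # Minimise f(x) = x^2 (gradient 2x) from x0 = 1 and count iterations.
    x, steps = 1.0, 0
    while abs(x) > tol and steps < max_steps:
        x -= learning_rate * 2 * x
        steps += 1
    return steps

for lr in (0.05, 0.2, 0.45):
    print(lr, steps_to_converge(lr))
# Larger steps converge in fewer iterations here; too large a rate
# (learning_rate >= 1 for this function) would overshoot and diverge.
```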
Two classes are called linearly separable if
(a) There is a third class which lies between them
(b) They are separated by a margin wider than the average inter-class distance
(c) They both lie on the same line
(d) There is a hyperplane that separates them
There is a hyperplane that separates them
At a local minimum of the average loss function
(a) The gradient with respect to the model’s parameters is zero
(b) The gradient with respect to the model parameters is strictly positive
(c) The gradient with respect to the model parameters is strictly negative
(d) The gradient with respect to the model’s parameters can have any sign
The gradient with respect to the model’s parameters is zero
The estimate of the accuracy of a model obtained using Leave-One-Out Cross-Validation has
(a) Higher bias than the estimate provided by hold-out cross-validation because we test on a single point
(b) Higher bias than the estimate provided by k-fold cross-validation because we test on a single point
(c) Higher bias than the estimate provided by hold-out cross validation because we train on a smaller set
(d) Lower bias than the estimate provided by hold-out cross validation because we train on a larger set
Lower bias than the estimate provided by hold-out cross validation because we train on a larger set
A disadvantage of the Leave-One-Out Cross-Validation is that
(a) The estimate of the model’s error we obtain is a gross overestimation of the real error
(b) The estimate of the model’s error we obtain is a gross underestimation of the real error
(c) The estimate of the model’s error we obtain is affected by roughly twice as much variance compared to k-fold CV
(d) We have to train each model as many times as we have points in the dataset, and this can be computationally expensive
We have to train each model as many times as there are points in the dataset and this can be computationally expensive
Which of the following is a desirable property of the loss function?
(a) It achieves its minimum when the prediction is correct
(b) It has many discontinuities, to facilitate finding a global optimum
(c) It’s neither concave nor convex
(d) It decreases the further the prediction is from the ground truth
It achieves its minimum when the prediction is correct