Exam questions Flashcards
What is the difference between classification and regression?
A Classification predicts an item from a finite set, regression predicts a
numeric value.
B Regression predicts an item from a finite set, classification predicts a
numeric value.
C Classification is unsupervised, regression is supervised.
D Regression is unsupervised, classification is supervised.
A Classification predicts an item from a finite set, regression predicts a
numeric value.
What is a valid reason to prefer gradient descent over random search?
A My model is easily differentiable.
B I need to be sure that I find the global minimum.
C My loss function is not smooth.
D There is some computation between the output of my model and my
loss function, which I do not control.
A My model is easily differentiable.
We are training a classification model by gradient descent, and we want to figure out which learning rate to use, before comparing the model to other classifiers. We try five learning rate values, resulting in five different models. How do we choose among these five models?
A We measure the accuracy of each model on the training set.
B We measure the accuracy of each model on the validation set.
C We measure the accuracy of each model on the test set.
D We measure the accuracy of each model on the full dataset.
B We measure the accuracy of each model on the validation set.
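A minimal sketch of this selection loop with scikit-learn's SGDClassifier (the five learning rates and the split variables X_train/y_train, X_val/y_val, X_test/y_test are assumed to exist already); the point is that only the validation split is used to compare the five candidate models:

    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import accuracy_score

    best_lr, best_val_acc, best_model = None, -1.0, None
    for lr in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
        model = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=lr, random_state=0)
        model.fit(X_train, y_train)                             # fit on the training set
        val_acc = accuracy_score(y_val, model.predict(X_val))   # compare on the validation set
        if val_acc > best_val_acc:
            best_lr, best_val_acc, best_model = lr, val_acc, model

    # The test set is touched only once, at the end, when comparing against other classifiers.
    test_acc = accuracy_score(y_test, best_model.predict(X_test))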
Undersampling and oversampling are ways to deal with imbalanced classes. Which is true?
A You oversample your majority class.
B You undersample your minority class.
C Undersampling leads to duplicate instances in your data.
D Oversampling leads to duplicate instances in your data.
D Oversampling leads to duplicate instances in your data.
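A small illustration of option D with plain Python lists; the class sizes are made up:

    import random
    random.seed(0)

    majority = [("maj_%d" % i, 0) for i in range(1000)]  # 1000 instances of the majority class
    minority = [("min_%d" % i, 1) for i in range(50)]    # 50 instances of the minority class

    # Oversampling: draw extra copies of *minority* instances (with replacement) until the
    # classes are balanced, so the same instance can appear several times in the data.
    oversampled = majority + [random.choice(minority) for _ in range(len(majority))]

    # Undersampling: throw away part of the *majority* class instead; no duplicates, but data is lost.
    undersampled = random.sample(majority, len(minority)) + minority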
The slides mention two ways to adapt a categoric feature for a classifier that only accepts numeric features: integer coding and one-hot coding. Which is true?
A Integer coding always turns one categoric feature into multiple numeric
features.
B One-hot coding always turns one categoric feature into multiple numeric features.
C Integer coding becomes inefficient if there are too many categories.
D One-hot coding becomes inefficient if there are too few categories.
B One-hot coding always turns one categoric feature into multiple numeric features.
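A small sketch of both encodings for a single categoric feature with three made-up values, showing why one-hot coding always produces multiple numeric features:

    categories = ["red", "green", "blue"]   # the values of one categoric feature

    # Integer coding: the categoric feature stays a single numeric feature.
    integer_code = {c: i for i, c in enumerate(categories)}   # {'red': 0, 'green': 1, 'blue': 2}

    # One-hot coding: the categoric feature becomes len(categories) numeric features.
    def one_hot(value):
        return [1.0 if value == c else 0.0 for c in categories]

    print(integer_code["green"])   # 1
    print(one_hot("green"))        # [0.0, 1.0, 0.0]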
If somebody says: “There is a high probability that the mean height of Italian women is below 2 meters.” Which is true?
A A strict subjectivist would consider this an improper use of the term probability.
B A strict Bayesian would consider this an improper use of the term probability.
C A strict frequentist would consider this an improper use of the term
probability.
D In machine learning, we would consider this an improper use of the term probability.
C A strict frequentist would consider this an improper use of the term
probability.
How does dropout help with the overfitting problem?
A By propagating the gradient of the loss back down the network.
B By randomly disabling nodes in a neural network, to eliminate solutions that require highly specific configurations.
C By ensuring that the output distribution of a neural network is normally distributed if the input distribution is.
D By converting the scalar backpropagation algorithm to work with tensors.
B By randomly disabling nodes in a neural network, to eliminate solutions that require highly specific configurations.
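A minimal numpy sketch of (inverted) dropout on one layer's activations during training; the batch size, layer width and dropout rate are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    activations = rng.standard_normal((32, 128))   # a batch of hidden-layer activations
    p_drop = 0.5

    # Randomly disable nodes: each unit is zeroed with probability p_drop for this batch,
    # so no solution can depend on one highly specific unit always being present.
    mask = rng.random(activations.shape) >= p_drop
    dropped = activations * mask / (1.0 - p_drop)  # rescale to keep the expected activation

    # At test time dropout is switched off and all units are used.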
The soft margin SVM loss is defined as a constrained optimization objective. We can rewrite this in two ways. Which is true?
A We can rewrite to an unconstrained problem. This allows us to use the kernel trick.
B We can rewrite to an unconstrained problem. This expresses the solution purely in terms of the dot product between pairs of instances.
C We can rewrite using KKT multipliers. This allows us to use the kernel trick.
D We can rewrite using KKT multipliers. This allows some instances to fall inside the margin.
C We can rewrite using KKT multipliers. This allows us to use the kernel trick.
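For reference, a sketch of the rewritten (dual) objective with KKT multipliers $\alpha_i$, slack penalty $C$, labels $y_i \in \{-1, +1\}$ and instances $x_i$; the data enters only through dot products, which is exactly what lets us substitute a kernel $k(x_i, x_j)$:

$$ \max_{\alpha} \; \sum_i \alpha_i \;-\; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0 $$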
Neural networks usually contain activation functions. What is their purpose?
A They are used to compute a local approximation of the gradient.
B They are applied after a linear transformation, so that the network can learn nonlinear functions.
C They control the magnitude of the step taken during an iteration of gradient descent.
D They function as a regularizer, to combat overfitting.
B They are applied after a linear transformation, so that the network can learn nonlinear functions.
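A tiny numpy sketch of why this matters: without an activation between them, two linear layers collapse into a single linear map (the weight shapes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    W1 = rng.standard_normal((4, 3))   # first linear layer
    W2 = rng.standard_normal((2, 4))   # second linear layer
    x = rng.standard_normal(3)

    relu = lambda z: np.maximum(z, 0.0)

    linear_only = W2 @ (W1 @ x)            # equal to (W2 @ W1) @ x: still one linear function
    with_activation = W2 @ relu(W1 @ x)    # the nonlinearity lets the network learn nonlinear functions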
By what method do variational autoencoders avoid mode collapse?
A By training the “decoder” network through a discriminator.
B By using a regularizer to steer the network toward the data average.
C By feeding the discriminator network pairs of inputs.
D By learning the latent representation of an instance through an “encoder” network.
D By learning the latent representation of an instance through an “encoder” network.
I am training a generator network to generate faces. I take a random sample, compare it to a randomly chosen image from the data, and backpropagate the error. When training is finished, all samples from
the network look like the average over all faces in the dataset. What name do we have for this phenomenon?
A Multiple testing
B Overfitting
C Dropout
D Mode collapse
D Mode collapse
In some machine learning settings it is said that we must make a tradeoff between exploration and exploitation. What do we mean by this?
A That hyperparameter selection (exploration) uses computational resources that could also be used in training the model (exploitation).
B That an online algorithm needs to balance optimization of its expected reward with exploring to learn more about its environment.
C That an insufficiently thoroughly trained model may be biased against minorities.
D This refers to the problem of balancing the loss function with the regularization terms in matrix factorization.
B That an online algorithm needs to balance optimization of its expected
reward with exploring to learn more about its environment.
Which is false?
A To use decision trees on data with categorical features, we must convert those features to one-hot vectors.
B To use decision trees on data with numeric features, we must choose a threshold value to split on, for every split.
C When training a decision tree on only categorical features, there’s no
use in splitting again on a feature you’ve already split on.
D When training a decision tree on numeric features, it can often be useful to split on a feature you’ve already used before.
A To use decision trees on data with categorical features, we must convert those features to one-hot vectors.
Some models are built on the Markov assumption. What do we mean by this?
A We can apply backpropagation to neural networks by unrolling them.
B The probability of a word does not depend on the current class for which we are predicting the probability.
C The operation of an LSTM cell depends only on its predecessors through two inputs.
D A word is conditionally dependent only on a finite number of words preceding it.
D A word is conditionally dependent only on a finite number of words
preceding it.
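Written as a formula for a sequence of words $w_1, \dots, w_T$, with the context limited to the $n-1$ preceding words (as in an n-gram model; the exact order is a modelling choice):

$$ p(w_1, \dots, w_T) \;=\; \prod_t p(w_t \mid w_1, \dots, w_{t-1}) \;\approx\; \prod_t p(w_t \mid w_{t-n+1}, \dots, w_{t-1}) $$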
Which is (primarily) a supervised machine learning method?
A Principal Component Analysis
B Support Vector Machines
C Variational Autoencoders
D None of the above
B Support Vector Machines
We are fitting a regression model using the least squares loss. We have seen two different forms of the loss function:
$\sum_i (y_i - t_i)^2$ and
$\frac{1}{2}\sum_i (y_i - t_i)^2$
(where $y_i$ is the model output and $t_i$ is the true value given by the data). Which is true?
A The global minima of these two loss functions occur at different points in the model space.
B If we work out the solution analytically, when we set the gradient equal to zero, the constant factor $\frac{1}{2}$ in the second loss function changes the parameters of the optimal solution.
C If we use these loss functions with gradient descent, it makes no difference which we use; the behavior is exactly the same.
D If we use these loss functions with gradient descent, it makes a small difference which we use, but if we scale the learning rate appropriately, the difference will disappear.
D If we use these loss functions with gradient descent, it makes a small difference which we use, but if we scale the learning rate appropriately, the difference will disappear.
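A short check, writing $\theta$ for the model parameters, of why the factor $\frac{1}{2}$ only rescales the gradient:

$$ \nabla_\theta \, \frac{1}{2}\sum_i (y_i - t_i)^2 \;=\; \sum_i (y_i - t_i)\,\nabla_\theta y_i \;=\; \frac{1}{2} \, \nabla_\theta \sum_i (y_i - t_i)^2 $$

Both gradients vanish at exactly the same points, so the analytic solution is unchanged (A and B are false), and in gradient descent the halved gradient is compensated exactly by doubling the learning rate (hence D rather than C).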
We are choosing a new basis for our data. We decide to use an orthonormal basis. What is the advantage of having an orthonormal basis?
A It ensures that the basis vectors are equal to the principal components.
B It ensures that the inverse of the basis matrix is equal to its transpose.
C It ensures that the basis vectors are orthogonal to the principal components.
D It ensures that the data is automatically whitened in the new basis.
B It ensures that the inverse of the basis matrix is equal to its transpose.
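In symbols: if the columns of the basis matrix $A$ are orthonormal, i.e. $a_i \cdot a_j = 1$ when $i = j$ and $0$ otherwise, then

$$ A^{\top} A = I \quad\Rightarrow\quad A^{-1} = A^{\top}, $$

so mapping back out of the new basis is just a transpose instead of a costly matrix inversion.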
We are considering using either gradient descent or random search for a problem. Which is true?
A For both, which optimum they find depends on the initial starting point.
B Gradient descent can get stuck in a local optimum, random search cannot.
C Gradient descent cannot get stuck in a local optimum, random search can.
D Gradient descent is more efficient than random search and can always be applied, so we always prefer gradient descent over random search.
A For both, which optimum they find depends on the initial starting point.
Which property is common to both logistic regression and support vector machines?
A For both, the decision boundary is chosen by minimizing the number of misclassified examples.
B Both are usually optimized by alternating optimization.
C Both require backpropagation to work out the gradient efficiently.
D They both focus mostly or only on the points closest to the decision boundary.
D They both focus mostly or only on the points closest to the decision
boundary.
Imagine we have a naive Bayes classifier. In our dataset we have two binary features (categorical with two possible values) and two classes.
How many pseudo-observations do we need to add if we want to apply Laplace smoothing?
A 1
B 2
C 4
D 8
C 4
One can choose between the likelihood function or the log likelihood function as a loss function. Which is usually preferred, and why?
A Both result in a maximum at the same point in model space, but the log-likelihood is often easier to work with.
B The likelihood function. When this is maximized, we have the best-fitting model, whereas the log likelihood results in a worse model.
C The log likelihood function. The squared errors are minimized only when the log-likelihood is maximized.
D The likelihood function. The squared errors are minimized only when the likelihood is maximized.
A Both result in a maximum at the same point in model space, but the log-likelihood is often easier to work with.
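The reason both peak at the same model: the logarithm is strictly increasing, so it does not move the maximum, and for independently sampled data it turns an unwieldy product into a sum:

$$ \arg\max_\theta \prod_i p(x_i \mid \theta) \;=\; \arg\max_\theta \sum_i \log p(x_i \mid \theta) $$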
We have a logistic regression model for a binary classification problem, which predicts class probabilities q. We compare these to the true class probabilities p, which are always 1 for the correct class and 0 for the incorrect class. The slides mention two loss functions for this purpose: logarithmic loss and binary cross-entropy. Which is true?
A Log-loss does not lead to a smooth loss landscape, so we approximate it by cross-entropy so that we can search for a good model using gradient descent.
B Cross-entropy loss does not lead to a smooth loss landscape, so we approximate it by log-loss so that we can search for a good model using gradient descent.
C Log-loss is equal to the binary cross-entropy H(p, q).
D Log-loss is equal to the binary cross-entropy H(q, p).
C Log-loss is equal to the binary cross-entropy H(p, q).
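Writing out the cross-entropy for one instance, with $p$ the true class distribution (probability 1 on the correct class) and $q$ the model's predicted distribution:

$$ H(p, q) \;=\; -\sum_c p(c) \log q(c) \;=\; -\log q(\text{correct class}), $$

which is exactly the logarithmic loss for that instance; note the asymmetry, so $H(q, p)$ would be a different quantity.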
We want to represent color videos in a deep learning system. Each is a series of frames, with each frame an RGB image. Which is the most natural representation for one such video?
A As a 1-tensor.
B As a 2-tensor.
C As a 3-tensor.
D As a 4-tensor.
D As a 4-tensor.
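A quick shape check with numpy (the frame count and resolution are arbitrary):

    import numpy as np

    # One color video: (frames, height, width, color channels) -> a 4-tensor.
    video = np.zeros((120, 64, 64, 3))
    print(video.ndim)   # 4

    # A single RGB frame is a 3-tensor: (height, width, channels).
    frame = video[0]
    print(frame.ndim)   # 3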
In support vector machines, how is the maximum margin hyperplane criterion (MMC) related to the support vectors?
A The support vectors can be removed from the data once the maximum margin hyperplane has been found.
B The support vectors determine the hyperplane that satisfies the MMC.
C The MMC and the support vectors describe different loss functions that we can use to fit a hyperplane.
D The support vectors provide an approximation to the hyperplane that satisfies the MMC.
B The support vectors determine the hyperplane that satisfies the MMC.
Which of the following is not a method to prevent overfitting?
A Boosting
B Bagging
C Dropout
D L1 regularization
A Boosting
What problem, if it exists for a single model, cannot be solved by training an ensemble of such models?
A High bias.
B High variance.
C High overfitting.
D High training time.
D High training time.
I have a dataset of politicians in the European Parliament and which past laws they voted for and against. The record is incomplete, but I have some votes for every law and for every politician. I would like to predict, for future laws, which politicians will vote for
and which will vote against. I plan to model this as a recommender system using matrix factorization.
Which is true?
A This is not a good model, because there are too many classes and not
enough instances.
B This is not a good model, because there are not enough classes, and too many instances.
C I would have to deal with the cold start problem, because for the future laws I don’t have any voting information.
D I would have to deal with the cold start problem, because the voting record for past laws is incomplete.
C I would have to deal with the cold start problem, because for the future laws I don’t have any voting information.
Which answer contains only unsupervised methods and tasks?
A k-Means, Clustering, Density estimation
B Clustering, Linear regression, Generative modelling
C Classification, Clustering, k-Means
D k-NN, Density estimation, Clustering
A k-Means, Clustering, Density estimation