ML past exam Questions Flashcards by chris Unknown

How are hyperparameters defined?

A They are parameters of a learning algorithm that are set before we start
training on the data.
B They are parameters of which the values are found by the training algorithm itself.
C They are parameters that make the models train faster.
D They are parameters of which the values are set after training the model.

A They are parameters of a learning algorithm that are set before we start
training on the data.

In principal component analysis, which is true?

A The first principal component is the direction in which the variance is
the smallest.
B The first principal component is the direction that maximizes the reconstruction loss.
C The first principal component is the direction that minimizes the reconstruction loss.
D The first principal component is the direction in which the bias is the smallest.

C The first principal component is the direction that minimizes the reconstruction loss.

What is a valid reason to prefer gradient descent over random search?

A My model is easily differentiable.
B I need to be sure that I find the global minimum.
C My loss function is not smooth.
D There is some computation between the output of my model and my loss function, which I do not control.

A My model is easily differentiable.

If we have a feature that is categorical, but our model requires numeric features, we can turn the categoric feature into one or more numeric features either by integer coding, or by one-hot coding. Which is true?

A The benefit of integer coding is that it assigns each value a separate feature.
B Integer coding results in more features than one-hot coding.
C The feature(s) introduced by integer coding are valued between 0 and 1.
D The feature(s) introduced by one-hot coding are valued 0 or 1.

D The feature(s) introduced by one-hot coding are valued 0 or 1.
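
To make the difference between the two codings concrete, here is a minimal sketch in plain Python. The helper names `integer_code` and `one_hot_code` and the example feature `colors` are hypothetical, chosen only for illustration.

```python
def integer_code(values):
    """Map each distinct category to an integer index (a single numeric feature)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]


def one_hot_code(values):
    """Give each category its own feature, valued 0 or 1."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature

print(integer_code(colors))  # [2, 1, 0, 1]
print(one_hot_code(colors))  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Integer coding yields one feature whose values are arbitrary integers; one-hot coding yields one 0/1 feature per category.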

Two classes that are not linearly separable may become linearly separable if we add a new feature that is derived from one or more of the other features. Why?

A This approach removes outliers.
B A linear decision boundary in the new space may represent a nonlinear boundary in the old one.
C Adding a new feature derived from the existing ones normalizes the data.
D This approach makes a non-differentiable loss function differentiable.

B A linear decision boundary in the new space may represent a nonlinear boundary in the old one.
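
A small sketch of this effect, assuming a made-up one-dimensional dataset in which the positive class lies on both sides of the negative class. The data and the helper `threshold_separates` are hypothetical.

```python
xs = [-3, -2, -1, 0, 1, 2, 3]      # hypothetical single feature
labels = [1, 1, 0, 0, 0, 1, 1]     # positives lie on both sides of the negatives


def threshold_separates(feature, labels):
    """Return True if a single threshold on this feature splits the classes perfectly."""
    pos = [f for f, l in zip(feature, labels) if l == 1]
    neg = [f for f, l in zip(feature, labels) if l == 0]
    return max(neg) < min(pos) or max(pos) < min(neg)


squared = [x * x for x in xs]       # derived feature x^2

print(threshold_separates(xs, labels))       # False
print(threshold_separates(squared, labels))  # True
```

A threshold such as x^2 = 2.5 is a linear boundary on the derived feature, but in terms of the original feature it corresponds to the two points x = ±sqrt(2.5), which no single threshold on x can express.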

In the Expectation-Maximization algorithm for Gaussian mixture models, we train a model consisting of several components, and we compute various responsibilities. Which is true?

A Each component is a normal distribution. Its responsibility for a point is the normalized probability density it assigns the point.
B Each component is a normal distribution. Its responsibility is the integer number of points it generates.
C Each component is one of the features of the data. Its responsibility for a point is the normalized probability density it assigns the point.
D Each component is one of the features of the data. Its responsibility is the integer number of points it generates.

A Each component is a normal distribution. Its responsibility for a point is the normalized probability density it assigns the point.
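
In the usual EM formulation for Gaussian mixtures, a component's responsibility for a point is the (weighted) density it assigns to that point, divided by the sum of those values over all components. The sketch below assumes one-dimensional components given as hypothetical `(weight, mean, std)` tuples.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def responsibilities(x, components):
    """Return one responsibility per component for the point x.

    components: list of (weight, mean, std) tuples (an assumed representation).
    """
    weighted = [w * normal_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [d / total for d in weighted]


components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]  # hypothetical two-component mixture
r = responsibilities(1.0, components)
print(r)  # the first component takes most of the responsibility for x = 1.0
```

The responsibilities for a point are real numbers that sum to one, not integer counts of points.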

Lise is given an imbalanced dataset for a binary classification task. Which one of the following options is a useful solution to this problem?

A She can oversample her majority class by sampling with replacement.
B She can use accuracy to measure the performance of her classifier.
C She can use SMOTE to augment her training data.
D She can augment the feature set by adding the multiplication of some features.

C She can use SMOTE to augment her training data.

What is true about logistic regression?

A Logistic regression is a linear, discriminative classifier with a cross-entropy loss function.
B Logistic regression finds the maximum margin hyperplane when applied to data with well-separated classes.
C Logistic regression is identical to the least-square linear regression.
D Thanks to the log-loss function, logistic regression can learn nonlinear decision boundaries.

A Logistic regression is a linear, discriminative classifier with a cross-entropy loss function.

Neural networks usually contain activation functions. What is their purpose?

A They are used to compute a local approximation of the gradient.
B They are applied after a linear transformation, so that the network can
learn nonlinear functions.
C They control the magnitude of the step taken during an iteration of gradient descent.
D They function as a regularizer, to combat overfitting.

B They are applied after a linear transformation, so that the network can
learn nonlinear functions.

I am training a generator network to generate faces. I take a random sample, compare it to a randomly chosen image from the data, and backpropagate the error.
When training is finished, all samples from the network look like the average over all faces in the dataset.
What name do we have for this phenomenon?

A Multiple testing
B Overfitting
C Dropout
D Mode collapse

D Mode collapse

Which is true?

A A maximum likelihood objective for least-squares regression does not provide a smooth loss surface.
B The least-squares loss for linear regression can be derived from a maximum likelihood objective.
C Linear regression can be performed with a maximum likelihood objective but the results will be different from the least-squares loss.
D The loss function used in logistic regression is derived from assuming a normal distribution on the residuals.

B The least-squares loss for linear regression can be derived from a maximum likelihood objective.

What is an important difference between regular recurrent neural networks (RNNs) and LSTM neural networks?

A All LSTMs have a forget gate, allowing them to ignore parts of the cell state right away.
B All RNNs have a forget gate, allowing them to ignore parts of the cell state right away.
C LSTMs have a vanishing gradient problem, RNNs don’t.
D RNNs can be turned into variational autoencoders, LSTMs can’t.

A All LSTMs have a forget gate, allowing them to ignore parts of the cell state right away.

Which of the following is an ensemble method?

A Random forest
B Gradient boosting
C AdaBoost
D All of the above

D All of the above

William has a dataset of many recipes and many ingredients. He doesn’t know anything about the recipes except which ingredients occur in each, and he doesn’t know anything about the ingredients except in which recipes they occur.
He’d like to predict for a pair of an ingredient and a recipe (both already in the data) whether the recipe would likely be improved by adding that ingredient.
Which is true?

A He could model the recipes as instances with their ingredients as a single categorical feature, and solve the problem with a decision tree.
B He could model the ingredients as instances and their recipes as a single categorical feature, and solve the problem with a decision tree.
C He could model this as a matrix decomposition problem.
D This problem requires a sequence-to-sequence model.

C He could model this as a matrix decomposition problem.

What separates offline learning from reinforcement learning?

A In reinforcement learning the training labels are reinforced through
boosting.
B Offline learning can be done without connection to the internet. Reinforcement learning requires reinforcement from a separate server.
C In reinforcement learning, the learner takes actions and receives feedback from the environment. In offline learning we learn from a fixed dataset.
D Reinforcement learning uses backpropagation to approximate the gradient, whereas offline learning uses symbolic computation.

C In reinforcement learning, the learner takes actions and receives feedback from the environment. In offline learning we learn from a fixed dataset.

The most important rule in machine learning is “never judge your performance on the training data.” If we break this rule, what can happen as a consequence?

A The loss surface no longer provides an informative gradient.
B We get cost imbalance.
C We end up choosing a model that overfits the training data.
D We commit multiple testing.

C We end up choosing a model that overfits the training data.

We have a classifier c and a test set. Which is true?

A To compute the precision for c on the test set, we must define how to turn it into a ranking classifier.
B To compute the false positive rate for c on the test set, we must define how to turn it into a ranking classifier.
C To compute the confusion matrix for c on the test set, we must define how to turn it into a ranking classifier.
D To compute the ROC area under the curve for c on the test set, we must define how to turn it into a ranking classifier.

D To compute the ROC area under the curve for c on the test set, we must define how to turn it into a ranking classifier.

Testing too many times on the test set increases the chance of random effects influencing your choice of model. Nevertheless, we may need to test many different models and many different hyperparameters. What is the solution suggested in the lectures?

A To withhold the test set, and use a train/validation split on the remainder to evaluate model choices and hyperparameters.
B To normalize the data so that they appear normally distributed. Normalization will not help with this problem.
C To use bootstrap sampling to gauge the variance of the model. Bootstrap sampling (lecture 3 and lecture 10) will help you gauge the variance. But that will not solve this problem.
D To use a boosted ensemble, to reduce the variance of the model, and with it, the probability of random effects.

A To withhold the test set, and use a train/validation split on the remainder to evaluate model choices and hyperparameters.

Which answer contains only unsupervised methods and tasks?

A Clustering, Linear regression, Generative modeling
B k-Means, Clustering, Density estimation
C Classification, Clustering, Expectation-Maximization
D Logistic regression, Density estimation, Clustering

B k-Means, Clustering, Density estimation

The two most important conceptual spaces in machine learning are the model space and the feature space. Which is true?

A Every point in the model space represents a loss function that we can choose for our task.
B Every point in the model space represents a single instance in the dataset.
C Every point in the feature space represents a single feature of a single instance in the dataset.
D Every point in the feature space represents a single instance in the dataset.

D Every point in the feature space represents a single instance in the dataset.

The ALVINN system from 1995 was a self-driving car system implemented as a classifier: a grayscale camera was pointed at the road and a classifier was trained to predict the correct position of the steering wheel based on the behavior of a human driver. In this example, which are the instances and which are the features?

A The instances are the different cars the system is deployed in and the features are the angles of the steering wheel.
B The instances are the angles of the steering wheel and the features are the different cars the system is deployed in.
C The instances are the frames produced by the camera and the features are the pixel values.
D The instances are the pixel values and the features are the frames produced by the camera.

C The instances are the frames produced by the camera and the features are the pixel values.

What is the relation between the loss landscape and its gradient?

A The gradient points in the direction that the loss increases the fastest.
B The gradient points in the direction that the loss decreases the fastest.
C The gradient is the region of the loss landscape where the loss is the lowest.
D The gradient is the region of the loss landscape where the loss is the highest.

A The gradient points in the direction that the loss increases the fastest.
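
A quick numeric check of this, using a hypothetical two-parameter loss whose gradient is easy to write by hand. The function, the starting point and the step size are all illustrative assumptions.

```python
def loss(w1, w2):
    """A hypothetical bowl-shaped loss surface."""
    return w1 ** 2 + 3 * w2 ** 2


def gradient(w1, w2):
    """Analytic gradient of the loss above."""
    return (2 * w1, 6 * w2)


w = (1.0, 1.0)
g = gradient(*w)
step = 0.01

up = loss(w[0] + step * g[0], w[1] + step * g[1])    # step along the gradient
down = loss(w[0] - step * g[0], w[1] - step * g[1])  # step against the gradient

print(up > loss(*w) > down)  # True
```

This is why gradient descent subtracts the gradient: the direction of steepest decrease is the opposite of the gradient.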

Which is a legitimate reason to prefer random search over gradient descent as a search method?

A The loss surface is complicated, so I want the size of the steps to change as I approach a minimum.
B I need to be sure that I’ve found a global minimum.
C My model has multiple layers, so I want to use backpropagation.
D My model is not differentiable.

D My model is not differentiable.

What is the benefit of a convex loss surface?

A It allows us to use the backpropagation algorithm.
B It allows us to use evolutionary methods.
C It ensures that there are no local minima other than the global minimum.
D It allows gradient descent to escape local maxima.

C It ensures that there are no local minima other than the global minimum.

Rob trains a k-nearest neighbors classifier. He withholds 20% of his data as a test set and uses the rest as his training data. He runs the training algorithm twenty times, for k = 1 to k = 20. For each, he computes the accuracy on the test set. He gets the best accuracy for k = 17, so he reports this accuracy as an estimate of the performance of k-nearest neighbors on his data. What fundamental mistake has Rob made?

A Rob is checking a linear range of values for hyperparameter k when a logarithmic range would be better.
B He is using arbitrary values for k. He should use a grid search.
C The test set should always be bigger than the training set.
D By reusing his test set, he may be inflating his performance estimate and overfitting to an arbitrary value of k.

D By reusing his test set, he may be inflating his performance estimate and overfitting to an arbitrary value of k.

Accuracy is a very simple and effective performance metric, but in certain cases, we should be careful. Imagine a spam classifier that automatically deletes emails detected as spam. The user receives about one spam email for every legitimate email. Why should we be careful optimizing for accuracy?

A Because we have very high class imbalance.
B Because we have very high cost imbalance.
C Because the data arrives irregularly.
D Because this is an online learning problem.

B Because we have very high cost imbalance.

Maria is fitting a regression model to predict the year in which a particular piece of instrumental music was written. The prediction is based on various features like average and variance of loudness, rhythm, key etc. She realizes that she has many outliers: for instance, the atonal music of the 1920s produces extreme variations in loudness, and John Cage’s piece 4’33” from 1952 is entirely silent. What should she do?

A She should remove these instances entirely. Removing outliers will make it easier to fit the data with a normal distribution.
B She should remove these instances from the training data, but leave them in the test data.
C She should leave these instances in the training data, but remove them from the test data.
D She should leave these instances in. They are important examples of the data distribution.

D She should leave these instances in. They are important examples of the data distribution.

When we want to model the throwing of a single die, using probability theory, we define a sample space and an event space. Which is true?

A “Rolling an even number” is an element of the event space. “Rolling a 1” is an element of the sample space.
B “Rolling an even number” is an element of the sample space. “Rolling a 1” is an element of the event space.
C For this example, the event space is continuous, the sample space is discrete.
D For this example, the sample space is continuous, the event space is discrete.

A “Rolling an even number” is an element of the event space. “Rolling a 1” is an element of the sample space.

Let f(x) = σ(wx + b) be a logistic regression model. We interpret f(x) as the probability that x has the positive class. If x actually has the negative class, what is the cross-entropy loss for this single example?

A −log f(x)
B −log(1−f(x))
C −log f(x) − log(1−f(x))
D log f(x) − log(1−f(x))

B −log(1−f(x))
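
The loss for a single example can be sketched as follows. The parameter values `w`, `b` and the input `x` are made up for illustration, and `cross_entropy` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(f_x, is_positive):
    """Cross-entropy loss for one example, given the predicted positive-class probability f_x."""
    return -math.log(f_x) if is_positive else -math.log(1.0 - f_x)


w, b = 2.0, -1.0          # hypothetical model parameters
x = 1.5                   # hypothetical input
f_x = sigmoid(w * x + b)  # predicted probability of the positive class

loss_if_negative = cross_entropy(f_x, is_positive=False)
loss_if_positive = cross_entropy(f_x, is_positive=True)
print(loss_if_negative > loss_if_positive)  # True: a confident positive prediction is costly if x is negative
```

Only the term for the true class contributes: for a negative example the model assigns probability 1 − f(x) to the correct class, so the loss is −log(1−f(x)).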

Frank is a researcher in the 1960s. He’s just read about a new model called the perceptron, which is a highly simplified simulation of a single brain cell. Frank decides that if a brain is powerful because it chains together multiple brain cells, he should try to chain together multiple perceptrons, to make a network that is more powerful than a single perceptron. Why won’t chaining perceptrons together work in this way?

A A GPU is needed to compute the output of such a function.
B The perceptron is a linear function, and the composition of linear functions is still a linear function.
C Such a model would suffer from vanishing gradients.
D This is equivalent to hypothesis boosting, which has been proven to be impossible.

B The perceptron is a linear function, and the composition of linear functions is still a linear function.
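
The claim about composing linear functions can be checked with a small sketch. It assumes one-dimensional functions of the form a·x + b with made-up coefficients; `linear` is a hypothetical helper.

```python
def linear(a, b):
    """Return the function x -> a * x + b."""
    return lambda x: a * x + b


f = linear(2.0, 1.0)     # first "perceptron" (no activation function)
g = linear(-3.0, 0.5)    # second "perceptron"


def chained(x):
    """Feed the output of f into g."""
    return g(f(x))


# g(f(x)) = -3 * (2x + 1) + 0.5 = -6x - 2.5, which is again of the form a * x + b
collapsed = linear(-3.0 * 2.0, -3.0 * 1.0 + 0.5)

for x in [-2.0, 0.0, 1.5, 10.0]:
    print(x, chained(x), collapsed(x))
```

The chain collapses into a single linear function, so stacking adds no expressive power unless a nonlinearity is placed between the layers.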

In a support vector machine, what are the support vectors?

A The parameters w^T that are multiplied by the input x to produce a classification.
B The parameters b that are added to the input x to produce a classification.
C The positive and negative data points that are allowed to fall inside the margin.
D The positive and negative data points that are closest to the decision boundary.

D The positive and negative data points that are closest to the decision boundary.

In deep learning, what is the difference between lazy and eager execution (or evaluation)?

A In lazy execution, the computation graph is compiled and kept static during training. In eager execution, it is built up for each forward pass.
B In eager execution, the computation graph is compiled and kept static during training. In lazy execution, it is built up for each forward pass.
C In lazy execution, the gradient is computed by numeric approximation, while eager execution uses the backpropagation algorithm.
D In eager execution, the gradient is computed by numeric approximation, while lazy execution uses the backpropagation algorithm.

A In lazy execution, the computation graph is compiled and kept static during training. In eager execution, it is built up for each forward pass.

Why is the ReLU activation function often preferred over the sigmoid activation function, for hidden nodes?

A It causes more vanishing gradients, which help learning.
B The sigmoid function cannot be used with gradient descent.
C Its derivative is almost always either 0 or 1, reducing vanishing gradients.
D The sigmoid function contains a point where the gradient is not defined.

C Its derivative is almost always either 0 or 1, reducing vanishing gradients.

When we apply the chain rule to a complex operation involving tensors, in order to use the backpropagation algorithm, the local derivatives might be something like the derivative of a vector with respect to a matrix. The result is a 3-tensor which is complex to work out, and expensive to store in memory. How do modern machine learning frameworks avoid this problem in their implementation of backpropagation?

A They approximate the local derivative using random search.
B They approximate the local derivative using the EM algorithm.
C They don’t compute the local derivative, but the product of the upstream derivative (the loss over the module outputs) with the local derivative.
D They don’t compute the local derivative, but the product of the downstream derivative (the module inputs over the network inputs) with the local derivative.

C They don’t compute the local derivative, but the product of the upstream derivative (the loss over the module outputs) with the local derivative.

How is the log-likelihood like a loss function?

A The log-likelihood is an approximation of the loss function.
B The loss function is an approximation of the log-likelihood.
C When we train a model, we minimize the loss, and when we fit a distribution, we often minimize the log-likelihood.
D When we train a model, we minimize the loss, and when we fit a distribution, we often maximize the log-likelihood.

D When we train a model, we minimize the loss, and when we fit a distribution, we often maximize the log-likelihood.

If we train a generator network by comparing a random output to a random target example from the data and backpropagating the difference, we get mode collapse. The problem is that we don’t know which random input corresponds to which item in the dataset. How do GANs solve this problem?

A By training a second network to map the target example to a distribution on the input space.
B By adding a KL-loss term on the random inputs of the generator network.
C By training the generator to generate outputs that a second network recognizes as real, and training the second network to distinguish generated outputs from real data.
D By adding a cycle-consistency loss-term.

C By training the generator to generate outputs that a second network recognizes as real, and training the second network to distinguish generated outputs from real data.

Sometimes we want to learn a model that maps some input to some output, but in some aspects we also want the model to behave like a generator. For instance, if we train a model to colorize photographs, we don’t want it to be purely deterministic: to colorize the label of a beer bottle, it should randomly imagine some colors even if it can’t infer the correct colors from the input. Which GAN approach is designed to accomplish this?

A Vanilla GAN
B Conditional GAN
C CycleGAN
D StyleGAN

B Conditional GAN

The Variational Autoencoder (VAE) differs from a regular autoencoder in several aspects. Which is not one of them?

A It includes a discriminator, which tries to tell the difference between data points and samples from the generator.
B It has an added loss term that ensures that the data looks like a standard normal distribution in the latent space.
C For a given instance, the encoder produces a distribution on the latent space, instead of a single point.
D It includes a sampling step in the middle, where a latent vector is sampled from the distribution provided by the encoder.

A It includes a discriminator, which tries to tell the difference between data points and samples from the generator.

The standard decision tree algorithm doesn’t stop adding nodes until all leaves either contain no data instances, or only instances with the same label (or all features have been used). Why is this a problem, and what is the default solution (mentioned in the slides)?

A It’s a problem because the algorithm may never terminate. To solve it, we can use a validation set to see if removing nodes improves performance.
B It’s a problem because the algorithm may never terminate. To solve it, we can remove features from the data, so that fewer splits are available.
C It’s a problem because we may be overfitting on the training set. To solve it, we can use a validation set to see if removing nodes improves performance.
D It’s a problem because we may be overfitting on the training set. To solve it, we can add features to the data, so that more splits are available.

C It’s a problem because we may be overfitting on the training set. To solve it, we can use a validation set to see if removing nodes improves performance.

Boosting is a popular method to improve a model’s performance. Why do we rarely see boosting used in research settings (unless specifically studying ensembling methods)?

A In research we want to measure the relative performance with respect to the baseline. If we apply boosting to our model, we should apply boosting to the baseline as well.
B Boosting cannot be applied in combination with a validation split, which is required in research.
C Boosting makes it difficult to compute a confidence interval over the accuracy, which is required in research.
D Boosting requires some information from the test set to be used in training, which is not allowed in research.

A In research we want to measure the relative performance with respect to the baseline. If we apply boosting to our model, we should apply boosting to the baseline as well.

In recommender systems, what is implicit feedback?

A Ratings given by a single “like” button rather than a more fine-grained system.
B Recommendations that take the temporal structure of the data into account.
C Associations between users and items assumed from user behavior.
D Recommendations derived from manually crafted item features rather than learned ones.

C Associations between users and items assumed from user behavior.

Word2Vec and matrix factorization are both embedding methods that make it possible to learn about a large set of featureless objects. How do they do this?

A By taking known features for each object and mapping these to a low-dimensional representation.
B By taking known features for each object and mapping these to a high-dimensional representation.
C By embedding these objects into a Euclidean space, with each object represented by a vector.
D By embedding these objects into a Euclidean space, with each object represented by a scalar.

C By embedding these objects into a Euclidean space, with each object represented by a vector.

What is batch normalization?

A An operation in a neural network that normalizes the output of a layer so that it is normally distributed over the current batch.
B An operation in a neural network that normalizes the output of a layer so that it is uniformly distributed over the batch.
C A hyperparameter selection technique that sets the batch size to a value that ensures a normal distribution in the gradients of a neural network.
D A hyperparameter selection technique that sets the batch size to a value that ensures a uniform distribution in the gradients of a neural network.

A An operation in a neural network that normalizes the output of a layer so that it is normally distributed over the current batch.

The squared error loss function looks like this: Σi(yi − ti)^2, where the sum is over all instances, yi is the model output for instance i and ti is the training label. Which is not a reason for squaring the difference between the two?

A It ensures that negative and positive differences don’t cancel out in the sum.
B It ensures that large errors count very heavily towards the total loss.
C When used in classification, it ensures that points near the decision boundary weigh most heavily.
D It is a consequence of assuming normally distributed errors, and deriving the maximum likelihood solution.

C When used in classification, it ensures that points near the decision boundary weigh most heavily.

We have a classifier c and a test set. Which is true?

A To compute the precision for c on the test set, we must define how to turn it into a ranking classifier.
B To compute the false positive rate for c on the test set, we must define how to turn it into a ranking classifier.
C To compute the confusion matrix for c on the test set, we must define how to turn it into a ranking classifier.
D To compute the area under the curve for c on the test set, we must define how to turn it into a ranking classifier.

D To compute the area under the curve for c on the test set, we must define how to turn it into a ranking classifier.

Testing too many times on the test set increases the chance of random effects influencing your choice of model. What is the solution suggested in the lectures?

A To withhold the test set until a hypothesis is established, and use a train/validation split on the remainder to evaluate model choice and hyperparameters.
B To perform cross-validation on the training data, so that all instances are used as training data at least once.
C To use bootstrap sampling to gauge the variance of the model.
D To use a boosted ensemble, to reduce the variance of the model, and with it, the probability of random effects.

A To withhold the test set until a hypothesis is established, and use a train/validation split on the remainder to evaluate model choice and hyperparameters.

Different features in our data may have wildly different scales: a person’s age may fall in the range from 0 to 100, while their savings can fall in the range from 0 to 100 000. For many machine learning algorithms, we need to modify the data so that all features have roughly the same scale. Which is not a method to achieve this?

A Imputation
B Standardization
C Normalization
D Principal Component Analysis

A Imputation
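
Two of the rescaling options can be sketched in plain Python. The sketch assumes "normalization" means min-max scaling to the range 0 to 1 and "standardization" means rescaling to zero mean and unit variance; the `ages` and `savings` values are made up.

```python
import statistics


def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


ages = [20, 35, 50, 80]                # hypothetical feature on a small scale
savings = [0, 1000, 25000, 100000]     # hypothetical feature on a large scale

print(normalize(ages))
print(normalize(savings))
print(standardize(savings))
```

After either transformation the two features live on comparable scales. Imputation, by contrast, fills in missing values and does not change the scale of a feature.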

Sophie and Emma are doing a machine learning project together, and training the single-feature regression model y = w1 x^2 + w2 x + b. Sophie says this is a non-linear model, because it learns a parabola not a line. Emma says it is a linear model, but on the features x and x^2, derived from the original single feature.

A Sophie is right. Emma is wrong.
B Emma is right. Sophie is wrong.
C Both are right.
D Both are wrong.

C Both are right.

You finish this exam and hand it in. You say to your fellow students: “The probability that I’ve passed this exam is 60%.” Which is true?

A This is not a frequentist use of the word probability, because it uses a percentage instead of a frequency.
B This is not a Bayesian use of the word probability because it expresses a belief, not a result of repeated experiments.
C This is a frequentist use of the word probability.
D This is a Bayesian use of the word probability.

D This is a Bayesian use of the word probability.

We have a dataset with a number of categoric features, each of which takes one of two values. In naive Bayes, probability estimates can go to zero if we see a feature take a value that it doesn’t take in the training data. We can solve this by Laplace smoothing, which we can interpret as adding pseudo-observations. Which is true?

A The number of pseudo-observations we need to add is the number of classes times two.
B The number of pseudo-observations we need to add is the number of classes, times two to the power of the number of features.
C After adding the pseudo-observations, we must change the denominator for the probability estimate to ensure that all probabilities still add up to one.
D After adding the pseudo-observations, we must change the numerator for the probability estimate to ensure that all probabilities still add up to one.

A The number of pseudo-observations we need to add is the number of classes times two.

We have two discrete random variables: A with outcomes 1, 2, 3 and B with outcomes a, b, c. We are given the joint probability p(A, B) in a table, with the outcomes of A enumerated along the rows (vertically), and the outcomes of B enumerated along the columns (horizontally). How do we compute the probability p(A = 1 | B = a)? A We find the probability in the first column and the first row. B We find the probability in the first column and the first row, and divide it by the sum over the first column. C We find the probability in the first column and the first row, and divide it by the sum over the first row. D We sum the probabilities over the first column and the first row.

B We find the probability in the first column and the first row, and divide it by the sum over the first column.
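The recipe in the answer can be sketched on a made-up joint table. The probabilities below are hypothetical and chosen only so that they sum to one.

```python
# Hypothetical joint table p(A, B): rows are outcomes of A, columns are outcomes of B.
joint = {
    1: {"a": 0.10, "b": 0.05, "c": 0.15},
    2: {"a": 0.20, "b": 0.10, "c": 0.05},
    3: {"a": 0.10, "b": 0.15, "c": 0.10},
}

def conditional(joint, a, b):
    """p(A = a | B = b) = p(a, b) / p(b), where p(b) is the sum over column b."""
    column_sum = sum(row[b] for row in joint.values())
    return joint[a][b] / column_sum

p = conditional(joint, 1, "a")
```

With these numbers the first column sums to 0.4, so p(A = 1 | B = a) is 0.10 / 0.4.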

The maximum margin objective that leads to the support vector machine can be developed in two different ways. Which is false? A We can rewrite the objective so that it can be trained with basic gradient descent, but then we can’t use the kernel trick. B We can rewrite the objective so that we can use the kernel trick, but then we can’t train it with basic gradient descent. C To be able to use basic gradient descent, we must rewrite the objective function using Lagrange multipliers. D To be able to use the kernel trick, we must rewrite the objective function using Lagrange multipliers.

C To be able to use basic gradient descent, we must rewrite the objective function using Lagrange multipliers.

L1, L2, and Dropout are forms of regularization. Which is true? A L1 and L2 work by randomly disabling hidden nodes. B Dropout works by adding a term to the loss function that represents the complexity of the model. C L1 is a sparsity-enforcing regularizer. It makes it more likely that weights become exactly 0. D L2 is a sparsity-enforcing regularizer. It makes it more likely that weights become exactly 0.

C L1 is a sparsity-enforcing regularizer. It makes it more likely that weights become exactly 0.

A convolutional layer is defined by four parameters: stride, padding, kernel size and the number of output channels. What should we do if we want the resolution of the output to be the same as the input? A Don’t use any padding, and make the stride as big as the number of output channels. B Make the kernel size the same as the image resolution, and set padding and stride to 0. C Set the kernel size to 1 by 1, use a stride of 1 and set padding to 3. D Set the stride to 1, and the padding to half the kernel size minus one.

D Set the stride to 1, and the padding to half the kernel size minus one.
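The answer can be checked with the standard output-size formula for a convolution along one dimension. The sketch below reads "half the kernel size minus one" as (k - 1) / 2 and assumes an odd kernel size. The function names are made up for the example.

```python
def conv_output_size(n, kernel, stride, padding):
    """Output resolution along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def same_padding(kernel):
    """Padding that preserves the resolution at stride 1, for an odd kernel size."""
    return (kernel - 1) // 2

# With stride 1 and padding (k - 1) / 2, a 32-pixel input stays 32 pixels wide.
sizes = [conv_output_size(32, k, 1, same_padding(k)) for k in (1, 3, 5, 7)]

# Without padding, a 3-wide kernel shrinks the output by 2 pixels.
unpadded = conv_output_size(32, 3, 1, 0)
```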

The Expectation-Maximization algorithm is used to approximate the maximum likelihood fit for a probability model. For which of the following models does it make sense to use it? A A univariate normal distribution B A multivariate normal distribution C A Gaussian mixture model D A generator neural network

C A Gaussian mixture model

The slides mention four types of GANs (generative adversarial networks). One of them trains two generators: to map from domain A to B and from B to A. It then adds a term to the loss function that ensures that mapping from A to B and back again results in as little change as possible. Which GAN is this? A Vanilla GAN B Conditional GAN C CycleGAN D StyleGAN

C CycleGAN

What is the cold start problem? A The situation where a neural network is initialized so that its sigmoid activations are saturated, and it has no gradient to start learning with. B The situation in sequence-to-label learning where there is a long distance between the start of the sequence and the label, so that the model only learns from the end of the sequence. C The situation where we can sample from a generator network but we don’t know which instance from the data to compare it to, because we don’t have a mapping from the data to the latent space. D The situation where a new item (user, movie, etc.) is added to a recommender system, and we have no ratings to build an embedding from.

D The situation where a new item (user, movie, etc.) is added to a recommender system, and we have no ratings to build an embedding from.

What is mode collapse? A When the loss surface of a generator network flattens out, so the gradient becomes zero. B When a network has a sampling step in the middle, so we cannot backpropagate down to the input. C When the distribution in the latent space is similar to a hypersphere, so we should interpolate along an arc instead of a line. D When a generator network outputs the mean of the data, instead of providing different samples with variation between samples.

D When a generator network outputs the mean of the data, instead of providing different samples with variation between samples.

The variational autoencoder adapts the regular autoencoder in a number of ways. Which is not one of them? A It adds a sampling step in the middle. B It makes the outputs of the encoder and decoder parameters of probability distributions. C It adds a loss term to ensure that the latent space is laid out like a standard normal distribution. D It adds a discriminator that learns to separate generated examples from those in the dataset.

D It adds a discriminator that learns to separate generated examples from those in the dataset.

In a tree model (a decision or regression tree), when does it make sense to split on the same feature twice for the same instance? A This never makes sense. B When the feature is categoric, but not when it is numeric. C When the feature is numeric, but not when it is categoric. D When we are training a regression tree, but not when we are training a decision tree.

C When the feature is numeric, but not when it is categoric.

Which is true? A Bagging reduces variance. To reduce bias, we can use boosting. B Boosting reduces variance. To reduce bias, we can use bagging. C It is not possible to reduce bias through ensembling: hypothesis boosting has been proven to be impossible. D Ensembling methods reduce neither bias nor variance. They just allow us to estimate our bias and variance more accurately.

A Bagging reduces variance. To reduce bias, we can use boosting.

Why is the Markov assumption like the Naive Bayes assumption? A They both translate a probability model to a differentiable network, so that we can use backpropagation to approximate a solution. B They both use approximations to conditional probability which are not actually valid probabilities, but which work in practice. C They both make assumptions which we know to be untrue, but which simplify the model a lot. D They both enable us to convert numeric data to categoric data.

C They both make assumptions which we know to be untrue, but which simplify the model a lot.

In the context of recurrent neural networks (RNNs), what is unrolling? A A process that turns an RNN into a generator by providing it with a randomly sampled input. B A process that turns an RNN into a non-recurrent network by making a copy of the network for each timestep in the input sequence. C A process that samples a random sequence by sequentially sampling from the probabilities predicted by the network, and feeding it back the resulting sample. D A process that eliminates recurrent connections, by decaying particular weights of the network.

B A process that turns an RNN into a non-recurrent network by making a copy of the network for each timestep in the input sequence.

We are building a recommender system for movies, based on matrix factorization. We decide to withhold some of the users and movies as a test set. Will this work? A No, different users may have different average ratings. Discarding some users may change the distribution. B No, the model trains embeddings for the users and the movies. We can’t make predictions for users and movies that we haven’t seen during training. C No, more obscure movies tend to get higher ratings on average, because only people who like them are aware of them. Withholding these movies will change the data distribution. D Yes, but only if we make sure that users in the test set joined after users in the training set.

B No, the model trains embeddings for the users and the movies. We can’t make predictions for users and movies that we haven’t seen during training.

Why is gradient descent difficult to apply in a reinforcement learning setting? A The loss surface is flat in most places, so the gradient is zero almost everywhere. B The backpropagation algorithm doesn’t apply if the output of a model is a probability distribution. C There is a non-differentiable step between the model input and the model output. D There is a non-differentiable step between the model output and the reward.

D There is a non-differentiable step between the model output and the reward.

Which statement is false? A PCA provides a normalisation that accounts for correlations between features. B PCA finds a change of basis for the data. C The first principal component is the direction in which the variance is highest. D The second principal component is the direction in which the bias is highest.

D The second principal component is the direction in which the bias is highest.

What is the advantage of gradient descent over random search? A In gradient descent, parallel searches are allowed to communicate. B Gradient descent is less likely to get stuck in local minima. C Gradient descent computes the direction of steepest descent, random search approximates it. D Gradient descent is easier to parallelise.

C Gradient descent computes the direction of steepest descent, random search approximates it.

Why is accuracy a bad loss function to use in gradient descent? A It is expensive to compute. B It makes the gradient zero almost everywhere. C It is unreliable in situations with high class imbalance. D The confidence interval is high on small test sets.

B It makes the gradient zero almost everywhere.
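A small numerical check makes this answer concrete. The sketch below uses a made-up one-dimensional dataset and a finite-difference slope estimate. The logistic loss is included only as a smooth point of comparison.

```python
import math

# Tiny hypothetical 1D dataset: (feature, label in {0, 1}).
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (3.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(w):
    """Fraction classified correctly when we threshold w * x at 0."""
    return sum((w * x > 0) == bool(y) for x, y in data) / len(data)

def log_loss(w):
    """Mean cross-entropy loss of the logistic model sigmoid(w * x)."""
    return -sum(
        y * math.log(sigmoid(w * x)) + (1 - y) * math.log(1 - sigmoid(w * x))
        for x, y in data
    ) / len(data)

def slope(f, w, eps=1e-4):
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

acc_slope = slope(accuracy, 0.5)    # accuracy does not change for a tiny step in w
loss_slope = slope(log_loss, 0.5)   # the smooth loss still tells us which way to move
```

Accuracy is piecewise constant in the weights, so its slope is zero wherever it is defined and gradient descent gets no signal from it.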

I’m performing spam classification. I represent each email by three numbers: how often the word pill occurs, how often the word hello occurs and how often the word congratulations occurs. What are these three attributes called? A The instances B The classes C The features D The principal components

C The features

There are many classification problems that are not linearly separable. Which trick often allows us to still separate the classes with a linear classifier? A Adding a regularizer. B Deriving new features from the existing ones. C Using least-squares-loss instead of accuracy-loss. D Reducing the dimensionality with PCA.

B Deriving new features from the existing ones.

Which is true? A Grouping models have a higher refinement than grading models. B kNN has a higher refinement than linear classification. C Linear classification has a higher refinement than decision tree classification. D We can estimate the refinement from the confusion matrix.

C Linear classification has a higher refinement than decision tree classification.

Why is the naive Bayes classifier called naive? A It doesn’t follow Bayes’ rule, but an approximation. B It assumes that all features are independent. C It assumes that all features are independent, conditional on the class. D It assumes that all features are independent, conditional on the hyperparameters.

C It assumes that all features are independent, conditional on the class.

Which statement is true? A The posterior is equal to the prior multiplied by the likelihood, divided by the data distribution. B The prior is equal to the posterior multiplied by the likelihood, divided by the data distribution. C The prior and the posterior are both equal to the likelihood divided by the data distribution. D A prior and a posterior are strictly frequentist concepts that cannot be related by probability.

A The posterior is equal to the prior multiplied by the likelihood, divided by the data distribution.
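The rule in the answer can be worked through with numbers. The spam probabilities below are invented for illustration. The data distribution (evidence) is computed by summing over both classes.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' rule: posterior = prior * likelihood / evidence."""
    return prior * likelihood / evidence

# Hypothetical numbers: p(spam), p(word | spam), p(word | ham).
prior_spam = 0.2
lik_spam, lik_ham = 0.5, 0.05

# p(word): the data distribution, marginalized over both classes.
evidence = prior_spam * lik_spam + (1 - prior_spam) * lik_ham

post_spam = posterior(prior_spam, lik_spam, evidence)
post_ham = posterior(1 - prior_spam, lik_ham, evidence)
```

Dividing by the evidence is what makes the two posteriors sum to one.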

What is the difference between a discriminative and a generative classifier? A A generative classifier learns a distribution on the data, a discriminative classifier doesn’t. B A generative classifier is frequentist, a discriminative classifier is Bayesian. C A generative classifier can’t be learned by gradient descent. D A discriminative classifier must be constructed from neural networks.

A A generative classifier learns a distribution on the data, a discriminative classifier doesn’t.

We build a Bayes classifier by fitting an MVN to each class. What would happen to the MVNs if we made this classifier naive? A The decision boundary would become linear. B Their covariance matrices would become diagonal. C They would become Bayes optimal. D Their covariance matrices would have unit eigenvalues.

B Their covariance matrices would become diagonal.

Define two random variables: Age, with outcomes {child, teenager, adult} and Wealth, with outcomes {poor, rich}. Let these represent two attributes of the same person, selected at random from the Dutch population. Which entropies can we compute over these random variables? A We can compute the conditional entropy (of Age given Wealth and vice versa) and the cross-entropy of (Age vs. Wealth and Wealth vs. Age). B We can compute the conditional entropy, but not the cross-entropy. C We can compute the cross-entropy, but not the conditional entropy. D We can compute neither.

B We can compute the conditional entropy, but not the cross-entropy.

What happens if I build a two-layer neural network with no activation function on the units in the hidden layer? A The whole network becomes equivalent to a linear single layer network. B It becomes impossible to backpropagate a gradient to the first layer. C The optimization problem becomes non-convex. D We must use autodiff to compute the gradient.

A The whole network becomes equivalent to a linear single layer network.
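The collapse can be verified on a small example. In this sketch the weight matrices and biases are arbitrary made-up values. The combined layer uses W = W2 W1 and b = W2 b1 + b2.

```python
def matvec(m, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product for matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output, no activation.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [0.5, 0.0, -1.0]
W2 = [[1.0, -2.0, 0.5]]
b2 = [2.0]

def two_layer(x):
    h = [a + c for a, c in zip(matvec(W1, x), b1)]
    return [a + c for a, c in zip(matvec(W2, h), b2)]

# Fold both layers into a single linear layer.
W = matmul(W2, W1)
b = [a + c for a, c in zip(matvec(W2, b1), b2)]

def one_layer(x):
    return [a + c for a, c in zip(matvec(W, x), b)]
```

Both functions give the same output for any input, which is why a non-linear activation is needed for the hidden layer to add expressive power.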

Which is false? A Softmax is an activation function that allows us to perform multiclass classification. B The derivative of the sigmoid activation is always either 0 or 1. C A single-layer neural network with a sigmoid output is the same model as logistic regression. D A single-layer neural network with a linear output is the same model as basic linear regression.

B The derivative of the sigmoid activation is always either 0 or 1.
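To see why option B is false, we can evaluate the derivative of the sigmoid, s(z)(1 - s(z)), at a few points. The sample points in this minimal sketch are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1 - s)

grads = [sigmoid_grad(z) for z in (-10, -1, 0, 1, 10)]
```

The derivative peaks at 0.25 for z = 0 and shrinks toward 0 in both tails, so it is never exactly 0 or 1.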

When do we require the multivariate chain rule in automatic differentiation? A When the computation graph is not a line. B When there are multiple paths between the output of the computation graph and one of the weights. C When the loss node in the computation graph has multiple inputs. D When there are latent variables in our model.

B When there are multiple paths between the output of the computation graph and one of the weights.

Which is false? A Backpropagation is symbolic differentiation, like Wolfram Alpha does. B Backpropagation is a mixture of symbolic and numeric differentiation. C Backpropagation applies the chain rule locally to each node in the computation graph. D Backpropagation distributes error back down the network based on weights and activations used during the forward pass.

A Backpropagation is symbolic differentiation, like Wolfram Alpha does.

What is the relation between the maximum margin hyperplane criterion (MMC) and the support vectors? A The support vectors can be removed from the data once the maximum margin hyperplane has been found. B The support vectors determine the hyperplane that satisfies the MMC. C The MMC and the support vectors describe different loss functions that we can use to fit a hyperplane. D The support vectors provide an approximation to the hyperplane that satisfies the MMC.

B The support vectors determine the hyperplane that satisfies the MMC.

Which is false? A The kernel trick can be applied if we rephrase the SVM solution in terms of Lagrange multipliers. B The kernel trick allows us to extend our feature space without explicitly computing the extensions. C The kernel trick allows us to phrase the SVM algorithm purely in terms of dot products of pairs of instances. D The kernel trick allows us to use the SVM loss function in neural networks.

D The kernel trick allows us to use the SVM loss function in neural networks.

What is the purpose of a regularizer? A It determines the number of layers your neural network should have. B It adds nonlinearity to your neural network. C It negates the need for a validation set. D It reduces overfitting.

D It reduces overfitting.

What is the credit assignment problem? A The problem of how to train weights based on incomplete data. B The issue of how to propagate a negative error down a neural network. C The problem in reinforcement learning that rewards often come long after the action that was responsible for them. D The issue that unsupervised learning problems do not have training labels, so the learning algorithm cannot be rewarded for correct behavior.

C The problem in reinforcement learning that rewards often come long after the action that was responsible for them.

We train a generative model through plain gradient descent, using random examples from the data as targets, comparing these against random samples from the models, and backpropagating the error. After training, all samples from the model look like the average over the dataset. We can avoid this by training in a different way. Which is not a training method that allows us to avoid this problem? A Expectation-maximization B Random search C Variational autoencoders D Generative adversarial networks

B Random search

How do cycleGANs avoid mode collapse? A By adding a cycle-consistency term to the loss function. B By cycling through different hyperparameters during training. C By adding a “cycle” network that infers a latent representation. D By adding a cycle-regularization term to the latent representation.

A By adding a cycle-consistency term to the loss function.

I want to test a matrix factorization method for a basic movie recommendation task. I hold out some users and some movies as a test set. Why won’t this work? A You will not learn representations during training for the users and movies in your test set. B You will be committing multiple testing. C Matrix factorization is an unsupervised task, so it cannot be evaluated with a test set. D The training data will contain examples that are in the future from the perspective of your test set.

A You will not learn representations during training for the users and movies in your test set.

A lazy algorithm is a machine learning method that simply stores the data and refers back to it during evaluation, instead of training to establish a good model which can be stored independent of the data. Which of the following methods is a lazy algorithm? A Linear classification B Decision trees C k-Nearest neighbors D None of the above

C k-Nearest neighbors

I want to predict house prices, from a set of examples, based on two attributes: surface area and the local crime rate. I create a scatterplot with the surface area of the house on the horizontal axis and the crime rate on the vertical. I plot each house in my dataset as a point in these axes. What have I drawn? A the model space B the loss curve C the feature space D the output space

C the feature space

How are random search and gradient descent related? A Gradient descent is an approximation to random search. B Random search is an approximation to gradient descent. C Gradient descent is like random search but with a smoothed loss surface. D Random search is like gradient descent but with a smoothed loss surface.

B Random search is an approximation to gradient descent.

In the slides, we get the advice that “sometimes your loss function should not be the same as your evaluation function.” Why not? A The evaluation function may not provide a smooth loss surface. B The evaluation function may be poorly chosen. C The evaluation function may not be linear. D The evaluation function may not be computable

A The evaluation function may not provide a smooth loss surface.

It is common practice in machine learning to separate out a training set and a test set. Often, we then split the training data again, to get a validation set. Which is false? A The validation set is not used until the end of the project. B We do this to avoid multiple testing on the test set. C We use the validation set for hyperparameter optimization. D The test set is ideally used only once.

A The validation set is not used until the end of the project.

Which answer describes the precision? A The proportion of the actual positives that were classified as positive. B The proportion of the instances classified as positive that are actually positive. C The proportion of the actual negatives that were classified as negative. D The proportion of the instances classified as negative that are actually negative.

B The proportion of the instances classified as positive that are actually positive.
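The definitions in options A and B (recall and precision) can be computed from true and predicted labels as follows. The label lists in this minimal sketch are made up.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)  # of everything classified positive, how much is positive
    recall = tp / (tp + fn)     # of all actual positives, how many were found
    return precision, recall

# Hypothetical labels: 4 actual positives, 3 predicted positives, 2 of them correct.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
```

Here precision is 2/3 (two of the three predicted positives are correct) and recall is 2/4.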

Imagine a machine learning task where the instances are customers. You know the phone number for each customer and their occupation (one of seven categories). You’re wondering how to turn these into features. Which is false? A You can extract several useful categoric features from the phone number. B The phone number is an integer, so you should use it as a numeric feature. C Whether to use the occupation directly or turn it into a numeric feature depends on the model. D For some models, you may want to turn the occupation into several numeric features.

B The phone number is an integer, so you should use it as a numeric feature.

The slides mention two ways to adapt a categoric feature for a classifier that only accepts numeric features: integer coding and one-hot coding. Which is true? A One-hot coding always turns one categoric feature into one numeric feature. B Integer coding always turns one categoric feature into one numeric feature. C Integer coding becomes inefficient if there are too many values. D One-hot coding becomes inefficient if there are too few categories.

B Integer coding always turns one categoric feature into one numeric feature.
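The difference between the two codings is easiest to see side by side. The helper functions and the `occupations` list in this short sketch are illustrative, not taken from the slides.

```python
def integer_code(values):
    """Map each category to one integer: one categoric feature -> one numeric feature."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]

def one_hot_code(values):
    """Map each category to a 0/1 vector: one categoric feature -> one feature per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

occupations = ["nurse", "teacher", "baker", "nurse"]

as_integers = integer_code(occupations)
as_one_hot = one_hot_code(occupations)
```

Integer coding keeps a single column but imposes an arbitrary order on the categories. One-hot coding avoids that order at the cost of one column per category.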

Which is false? A In PCA, the first principal component provides the direction of greatest variance. B PCA is a supervised method. C PCA can be used for dimensionality reduction. D PCA can be used for data preprocessing.

B PCA is a supervised method.

We are performing classification. We represent our instance by the random variable X and its class by the random variable Y. Which is true? A Generative modeling is training a model for p(X | Y) and computing p(Y | X) from that. B Discriminative modeling is training a model for p(X | Y) and computing p(Y | X) from that. C Generative modeling can only be done through the EM algorithm. D Discriminative modeling can only be done through the EM algorithm.

A Generative modeling is training a model for p(X | Y) and computing p(Y | X) from that.

Toxoplasmosis is a relatively harmless parasitic infection that usually causes no obvious symptoms. Which statement is acceptable from a Bayesian perspective, but not from a frequentist perspective? Note that we don’t care whether the statement is correct, just whether it fits these frameworks. A One in five Dutch people has toxoplasmosis. B Being Dutch, the probability that Fred has toxoplasmosis is 0.2. C The mean age of people with toxoplasmosis is 54. D The probability that a person chosen at random from the Dutch population has toxoplasmosis is 0.2.

B Being Dutch, the probability that Fred has toxoplasmosis is 0.2.

How does stochastic gradient descent (SGD) differ from regular gradient descent? A SGD is used to train stochastic models instead of deterministic ones. B SGD trains in epochs, regular gradient descent doesn’t. C SGD uses the loss over a small subset of the data. D SGD only works on neural networks.

C SGD uses the loss over a small subset of the data.
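The contrast can be sketched on a toy regression problem. The example below fits y = w * x to noiseless made-up data generated with w = 3. The learning rate, batch size and number of epochs are arbitrary choices.

```python
import random

# Hypothetical noiseless data from y = 3 * x.
data = [(i / 10, 3.0 * i / 10) for i in range(1, 101)]

def grad(w, batch):
    """Gradient of the mean squared error of y = w * x over the given batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def gradient_descent(steps=200, lr=0.005):
    """Regular gradient descent: every step uses the loss over the full dataset."""
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w, data)
    return w

def sgd(epochs=5, batch_size=10, lr=0.005, seed=0):
    """Minibatch SGD: every step uses the loss over a small random subset."""
    rng = random.Random(seed)
    shuffled = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(shuffled)
        for i in range(0, len(shuffled), batch_size):
            w -= lr * grad(w, shuffled[i:i + batch_size])
    return w

w_gd = gradient_descent()
w_sgd = sgd()
```

Both runs end up close to w = 3. The SGD steps are noisier but each one only touches ten instances.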

Which is false? A Autodiff combines aspects of symbolic differentiation and numeric differentiation. B Autodiff computes the gradient but only for a specific input. C Autodiff is an alternative to backpropagation. D Autodiff boils down to repeated application of the chain rule.

C Autodiff is an alternative to backpropagation.

Which is false? A The kernel trick allows us to use support vector machines as a loss function in neural networks. B The kernel trick allows us to compute SVMs in a high dimensional space. C The SVM algorithm computes the maximum margin hyperplane. D The SVM algorithm can be computed without using the kernel trick.

A The kernel trick allows us to use support vector machines as a loss function in neural networks.

Which is true? A A maximum likelihood objective for least-squares regression does not provide a smooth loss surface. B The least-squares loss function for linear regression can be derived from a maximum likelihood objective. C Linear regression can be performed with a maximum likelihood objective but the results will be different from the least-squares version. D The loss function for logistic regression is derived from assuming a normal distribution on the residuals.

B The least-squares loss function for linear regression can be derived from a maximum likelihood objective.

Which statement is false? [bonus question, due to multiple correct answers] A The entropy is the expected codelength using an optimal code. B The relative entropy is the KL divergence minus the entropy. C The KL divergence is the difference in expected codelength between the optimal code and another. D The KL divergence is the relative entropy minus the entropy.

B The relative entropy is the KL divergence minus the entropy. D The KL divergence is the relative entropy minus the entropy.
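For reference, the standard relations are that the KL divergence and the relative entropy are the same quantity, and that it equals the cross-entropy minus the entropy. Below is a minimal sketch with made-up distributions `p` and `q`, measured in bits.

```python
import math

def entropy(p):
    """Expected codelength (in bits) under the optimal code for p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected codelength when data from p is encoded with the optimal code for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL divergence (relative entropy): the extra codelength from using q's code."""
    return cross_entropy(p, q) - entropy(p)

# Hypothetical distributions over three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)
```

With these values the entropy is 1.5 bits, the cross-entropy is 1.75 bits and the KL divergence is 0.25 bits.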

What is the relation between the k-Means algorithm and the EM algorithm? A The EM algorithm is a simplified version of the k-Means algorithm. B k-Means is a simplified version of the EM algorithm. C k-Means is to k-Nearest neighbors as the EM algorithm is to Support Vector Machines. D k-Means is to k-Nearest neighbors as Support Vector Machines are to the EM algorithm.

B k-Means is a simplified version of the EM algorithm.

I’m training a neural network. I notice that during training, the loss on the training data goes to zero, but the loss on the validation set doesn’t get any better than chance. Which is true? A The model is overfitting. A good solution is to increase the model capacity. B The model is overfitting. A good solution is to add L2-regularization. C The model is suffering from vanishing gradients. A good solution is to use sigmoid activations. D The model is suffering from vanishing gradients. A good solution is to increase the batch size.

B The model is overfitting. A good solution is to add L2-regularization.

When training generative models, mode collapse is an important problem. Which is false? A Generative Adversarial Networks are a way to train generative models, while avoiding mode collapse. B Variational Autoencoders are a way to train generative models, while avoiding mode collapse. C Generative Adversarial Networks avoid mode collapse by learning a network that maps each instance to a latent variable. D Variational Autoencoders avoid mode collapse by learning a network that maps each instance to a latent variable.

C Generative Adversarial Networks avoid mode collapse by learning a network that maps each instance to a latent variable.

Which is false? A Decision trees do not deal with categorical data naturally. To use such data we must convert it to one-hot vectors. B Decision trees do not deal with numeric data naturally. To use such data we must choose a value to split on. C The standard decision algorithm (without pruning) operates greedily: once it has chosen a split, it will never reconsider that decision. D When splitting a numeric feature, we must choose a threshold value to split on.

A Decision trees do not deal with categorical data naturally. To use such data we must convert it to one-hot vectors.

When building a decision tree, we choose which feature to split on, one after the other. With categoric features it makes no sense to split on the same feature twice. With numeric features it does. Which explanation is entirely correct? Let C be a categoric feature and N be a numeric feature. A In the second split on C there will be no examples left. In the second split on N we can add noise to change the values. B In the second split on C there will be no examples left. In the second split on N, we can set the threshold at a different value from the first. C In the second split on C, all examples reaching that node will have the same value for C. In the second split on N we can add noise to change the values. D In the second split on C, all examples reaching that node will have the same value for C. In the second split on N, we can set the threshold at a different value from the first.

D In the second split on C, all examples reaching that node will have the same value for C. In the second split on N, we can set the threshold at a different value from the first.

Sarah has a large dataset of many recipes and many ingredients. She doesn’t know anything about the recipes except which ingredients occur in each, and she doesn’t know anything about the ingredients except in which recipes they occur. She would like to predict new recipe/ingredient pairs for ingredients that could be added to existing recipes. Which is true? A She could model the recipes as instances with their ingredients as a single categorical feature, and solve the problem with a decision tree. B She could model the ingredients as instances and their recipes as a single categorical feature, and solve the problem with a decision tree. C She could model this as a matrix decomposition problem. D None of the algorithms described in the course are applicable.

C She could model this as a matrix decomposition problem.

Which is false? A Word2Vec creates embedding vectors of tokens in a sequence. B Recurrent Neural Networks are Neural Networks with cycles, that allow them to operate on sequences. C The advantage of Markov models over Recurrent Neural Nets is that they have a potentially unbounded memory. D LSTMs tend to have better memories than plain RNNs because of the use of forget gates.

C The advantage of Markov models over Recurrent Neural Nets is that they have a potentially unbounded memory.

Why is it especially important to choose your test and validation sets carefully when training on sequential data? A Unlike with non-sequential models, if you evaluate on your training data, you risk overfitting. B If your learning rate is too low, you risk selecting hyperparameters that gave good performance by random chance. C If you remove users randomly from the recommendation set, you will be training on incomplete movie representations (and vice versa). D If you sample your test data randomly from the sequence, you may be training on data that is in the future compared to some of your test instances.

D If you sample your test data randomly from the sequence, you may be training on data that is in the future compared to some of your test instances.
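One common way to avoid this leak is to split chronologically, so that every test instance comes after every training instance. In the sketch below the function name and the toy `series` are made up. The sequence is assumed to be ordered by time.

```python
def chronological_split(sequence, test_fraction=0.2):
    """Split a time-ordered sequence so that the test set is strictly later than the training set."""
    cut = len(sequence) - int(len(sequence) * test_fraction)
    return sequence[:cut], sequence[cut:]

# Hypothetical series: the values stand in for timestamps 0..9.
series = list(range(10))
train, test = chronological_split(series)
```

Because no training instance is later than any test instance, the model cannot learn from the future of the instances it is evaluated on.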

Shortly after AlphaGo beat Lee Sedol, DeepMind introduced AlphaGo Zero, a Go engine that could learn from only self-play. Which of the following was not a change introduced to make AlphaGo better? A Use random search instead of policy gradients. B Introduce residual connections between blocks of layers. C Combine the policy net and the value net into a single network with two outputs. D Use Monte Carlo Tree-Search as a way to generate a better policy than the one implemented by the neural net.

A Use random search instead of policy gradients.

Which answer contains only unsupervised methods and tasks?

A k-Means, Clustering, Density estimation
B Clustering, Linear regression, Generative modelling
C Classification, Clustering, k-Means
D k-NN, Density estimation, Clustering

A k-Means, Clustering, Density estimation

In the book, Flach makes a distinction between grouping and grading models. Which statement is false?

A Grouping models segment the feature space.
B Grading models combine other classifiers, assigning a grade to each.
C Grading models can assign each element in the feature space a different prediction.
D Grouping models can only assign a finite number of predictions.

B Grading models combine other classifiers, assigning a grade to each.

We plot the ROC curve for a ranking classifier. What does the area under the curve estimate?

A The probability of a ranking error
B The accuracy
C The sum of squared errors
D The probability of a misclassification

A The probability of a ranking error

You want to search for a model in a discrete model space. Which search method is the least applicable?

A Random search
B Simulated annealing
C Evolutionary methods
D Gradient descent

D Gradient descent

In bar charts, what do error bars represent?

A Standard deviation
B Standard error
C A confidence interval
D All are possible

D All are possible

We can decompose the sample covariance matrix S into a transformation matrix as follows: S = AA^T. This allows us to transform normally distributed data into standard normally distributed data. However, the Principal Component Analysis doesn’t use this decomposition, but the Singular Value Decomposition (S = UZU^T). Why?

A It’s easier to compute.
B There isn’t always an A such that S = AA^T.
C It makes the loss surface more smooth.
D It ensures the first axis has the highest eigenvalue.

D It ensures the first axis has the highest eigenvalue.

What is the relation between an ROC curve and a coverage matrix?

A Normalizing the axes of the coverage matrix gives an ROC curve.
B Normalizing the axes of the ROC curve matrix gives a coverage matrix.
C Dividing the values in the coverage matrix by the ranking error gives the coverage matrix.
D The ROC curve is the transpose of the coverage matrix.

A Normalizing the axes of the coverage matrix gives an ROC curve.

Which statement is true?

A The average error of many models with high bias is low.
B The average error of many models with high variance is low.
C A model with high bias has low variance.
D High bias is an indication of overfitting.

B The average error of many models with high variance is low.

The mean squared error (MSE) loss function Σi (yi − ti)^2 and the mean absolute error (MAE) loss function Σi |yi − ti| are two popular loss functions. Here the sum is over n instances, yi is the model output for instance i and ti is the training label. Which would be a reason for preferring the MAE over the MSE?

A In MAE, the negative and positive differences do not cancel out in the sum.
B The mean of the error yi − ti minimizes the MAE.
C The MAE is less sensitive to outliers than the MSE.
D There is no advantage in using the MAE over the MSE.

C The MAE is less sensitive to outliers than the MSE.
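A small illustration of this answer, assuming a made-up set of five labels with one outlier and a model that outputs a single constant. The losses follow the card's definitions (plain sums over the instances). The constant that minimizes the squared loss is the mean, which the outlier drags upward, while the constant that minimizes the absolute loss is the median, which stays with the bulk of the data.

```python
from statistics import mean, median

def squared_loss(prediction, targets):
    # Sum of squared differences, as written on the card.
    return sum((prediction - t) ** 2 for t in targets)

def absolute_loss(prediction, targets):
    # Sum of absolute differences, as written on the card.
    return sum(abs(prediction - t) for t in targets)

# Hypothetical labels; the last one is an outlier.
targets = [1.0, 2.0, 3.0, 4.0, 100.0]

best_for_squared = mean(targets)     # minimizer of the squared loss
best_for_absolute = median(targets)  # minimizer of the absolute loss

print(best_for_squared, best_for_absolute)
```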

We are choosing a new basis for our data. We decide to use an orthonormal basis. What is the advantage of having an orthonormal basis?

A It ensures that the basis vectors are equal to the principal components.
B It ensures that the basis vectors are orthogonal to the principal components.
C It ensures that the inverse of the basis matrix is equal to its transpose.
D It ensures that the data is automatically whitened in the new basis.

C It ensures that the inverse of the basis matrix is equal to its transpose.
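A quick numerical check of this property, assuming a 2D rotation matrix as the example of an orthonormal basis (the angle 0.3 is arbitrary, and the helper functions are written out here only to keep the sketch free of dependencies). If the transpose really is the inverse, multiplying the transpose by the matrix should give the identity, up to rounding.

```python
import math

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

theta = 0.3  # arbitrary angle, chosen only for illustration
# The columns of B are orthonormal: a rotation of the standard basis.
B = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

product = matmul(transpose(B), B)  # expected: the 2x2 identity, up to rounding
print(product)
```

In practice this means changing to and from an orthonormal basis needs no matrix inversion, only a transpose.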

Which of these statements about principal component analysis is false?

A The first principal component provides the direction of greatest variance.
B It is a supervised method.
C It can be used for dimensionality reduction.
D It can be used for data pre-processing.

B It is a supervised method.

We want to represent color videos in a deep learning system. Each is a series of frames, with each frame an RGB image. Which is the most natural representation for one such video?

A As a 1-tensor.
B As a 2-tensor.
C As a 3-tensor.
D As a 4-tensor.

D As a 4-tensor.

What is not true about the naive Bayes classifier?

A The naive Bayes classifier can be applied to multi-class classification.
B The naive Bayes classifier assumes that the features are independent.
C To avoid zero probabilities in the naive Bayes classifier, one can use the Laplace smoothing technique.
D The naive Bayes classifier is a probabilistic generative classifier.

B The naive Bayes classifier assumes that the features are independent.

One way to think of a convolution layer is as a fully connected layer, with some extra constraints. Which is not one of these constraints?

A Some of the weights are forced to have the same value.
B Some of the connections are removed.
C The L2 norm of the weights is limited to a maximum value.
D All of the above are part of a convolution layer.

C The L2 norm of the weights is limited to a maximum value.
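A sketch of the two constraints that do apply (options A and B), assuming a 1D input, a single kernel, no padding and stride 1. The function names are hypothetical, and the sliding window is the unflipped one commonly used in deep learning. The weight matrix built here has zeros where connections are removed and the same kernel values repeated in every row, and applying it as a fully connected layer gives the same output as the sliding-window convolution.

```python
def conv1d(x, kernel):
    # Slide the kernel over the input (no padding, stride 1).
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

def conv_as_dense_matrix(kernel, input_len):
    # Build the weight matrix of the equivalent fully connected layer.
    k = len(kernel)
    rows = []
    for i in range(input_len - k + 1):
        row = [0.0] * input_len   # removed connections stay at zero
        row[i:i + k] = kernel     # the same weights are reused in every row
        rows.append(row)
    return rows

def dense(weights, x):
    # Plain fully connected layer without bias or activation.
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]
W = conv_as_dense_matrix(kernel, len(x))

print(conv1d(x, kernel))
print(dense(W, x))
```

Nothing in this construction bounds the norm of the weights, which is why option C is the odd one out.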

Which answer contains methods that can all be used as sequence-to-sequence layers?

A Convolutions, RNN, LSTM
B Gradient boosting, LSTM, Deep Q-Learning
C Convolutions, Word2Vec, Gradient boosting
D RNN, Deep Q-Learning, Word2Vec

A Convolutions, RNN, LSTM

How are the Gaussian Mixture Model (GMM) and the Mixture Density Network (MDN) related?

A The MDN is an alternative to the GMM that can also describe complex distributions, but that doesn’t use Gaussians.
B The GMM is an alternative to the MDN that can also describe complex distributions, but that isn’t a mixture model.
C The MDN is a neural network which uses the GMM as its output distribution.
D The GMM is a neural network which uses the MDN as its output distribution.

C The MDN is a neural network which uses the GMM as its output distribution.

I have a dataset of politicians in the European parliament and which laws they voted for and against. The record is incomplete, but I have some votes for every law and for every politician. I would like to predict, for a new law, which politicians will vote for and which will vote against. I plan to model this as a recommender system using matrix factorization. Which is true?

A This is not a good model, because there are too many classes and not enough instances.
B This is not a good model, because there are not enough classes, and too many instances.
C I would have to deal with the cold start problem, because for the new law I don’t have any voting information.
D I would have to deal with the cold start problem, because the voting record is incomplete.

C I would have to deal with the cold start problem, because for the new law I don’t have any voting information.

Which is a good principle to decide what feature to use for the next split in a decision tree?

A After the split we want the probability of the majority class to be as low as possible.
B After the split we want the probability of the majority class to be as high as possible.
C After the split we want the entropy of the class distribution to be as low as possible.
D After the split we want the entropy of the class distribution to be as high as possible.

C After the split we want the entropy of the class distribution to be as low as possible.
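A minimal sketch of this criterion, assuming a made-up node with four "yes" and four "no" labels and two hypothetical candidate splits. The size-weighted average over the resulting groups is one common way to score a split. The candidate whose groups have the lower weighted entropy would be preferred.

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy (in bits) of the class distribution in one group.
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def split_entropy(groups):
    # Average entropy of the groups after a split, weighted by group size.
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * entropy(g) for g in groups)

# Hypothetical candidate splits of a node holding 4 "yes" and 4 "no" labels.
split_a = [["yes"] * 4, ["no"] * 4]              # separates the classes perfectly
split_b = [["yes", "yes", "no", "no"],
           ["yes", "yes", "no", "no"]]           # leaves both groups fully mixed

print(split_entropy(split_a), split_entropy(split_b))
```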

What is the difference between inductive and transductive learning?

A In transductive learning the model is allowed to see the labels of the test data during training.
B In inductive learning the model is allowed to see the labels of the test data during training.
C In transductive learning, the model is allowed to see the features of the test data during training.
D In inductive learning, the model is allowed to see the features of the test data during training.

C In transductive learning, the model is allowed to see the features of the test data during training.

ML past exam Questions Flashcards

(134 cards)