Exam questions Flashcards

1
Q
What is the difference between classification and regression?

A Classification predicts an item from a finite set, regression predicts a numeric value.
B Regression predicts an item from a finite set, classification predicts a numeric value.
C Classification is unsupervised, regression is supervised.
D Regression is unsupervised, classification is supervised.

A

A Classification predicts an item from a finite set, regression predicts a numeric value.

2
Q

What is a valid reason to prefer gradient descent over random search?

A My model is easily differentiable.
B I need to be sure that I find the global minimum.
C My loss function is not smooth.
D There is some computation between the output of my model and my loss function, which I do not control.

A

A My model is easily differentiable.

3
Q

We are training a classification model by gradient descent, and we want to figure out which learning rate to use, before comparing the model to other classifiers. We try five learning rate values, resulting in five different models. How do we choose among these five models?

A We measure the accuracy of each model on the training set.
B We measure the accuracy of each model on the validation set.
C We measure the accuracy of each model on the test set.
D We measure the accuracy of each model on the full dataset.

A

B We measure the accuracy of each model on the validation set.
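
As an illustration (not from the course material), a minimal Python sketch of this selection procedure; the dataset, split sizes, and the five candidate rates are arbitrary:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# Split off a test set (for the later comparison) and a validation set
# (for choosing the learning rate).
X, y = make_classification(n_samples=1000, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

best_lr, best_acc = None, -np.inf
for lr in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:              # five candidate learning rates
    model = SGDClassifier(learning_rate="constant", eta0=lr, random_state=0)
    model.fit(X_train, y_train)
    acc = model.score(X_val, y_val)                   # validation accuracy, not test accuracy
    if acc > best_acc:
        best_lr, best_acc = lr, acc
print(best_lr, best_acc)                              # the test set stays untouched until the final comparison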

4
Q

Undersampling and oversampling are ways to deal with imbalanced classes. Which is true?

A You oversample your majority class.
B You undersample your minority class.
C Undersampling leads to duplicate instances in your data.
D Oversampling leads to duplicate instances in your data.

A

D Oversampling leads to duplicate instances in your data.
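
A minimal sketch (illustrative numbers) of why oversampling duplicates data: to balance the classes, the minority class must be drawn with replacement.

import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)                 # imbalanced labels: class 1 is the minority
minority_idx = np.flatnonzero(y == 1)

# Oversample the minority class to 90 instances by drawing with replacement.
oversampled_idx = rng.choice(minority_idx, size=90, replace=True)
print(len(oversampled_idx), len(np.unique(oversampled_idx)))   # 90 draws, only 10 unique instances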

5
Q

The slides mention two ways to adapt a categoric feature for a classifier that only accepts numeric features: integer coding and one-hot coding. Which is true?

A Integer coding always turns one categoric feature into multiple numeric features.
B One-hot coding always turns one categoric feature into multiple numeric features.
C Integer coding becomes inefficient if there are too many categories.
D One-hot coding becomes inefficient if there are too few categories.

A

B One-hot coding always turns one categoric feature into multiple numeric features.
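
A small illustrative sketch (the feature and its values are made up): integer coding keeps a single column, one-hot coding expands one categoric feature into one column per category.

colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))                                # ['blue', 'green', 'red']

integer_coded = [categories.index(c) for c in colors]           # one number per instance
one_hot_coded = [[1 if c == cat else 0 for cat in categories]   # one column per category
                 for c in colors]
print(integer_coded)   # [2, 1, 0, 1]
print(one_hot_coded)   # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]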

6
Q

If somebody says: “There is a high probability that the mean height of Italian women is below 2 meters.” Which is true?

A A strict subjectivist would consider this an improper use of the term probability.
B A strict Bayesian would consider this an improper use of the term probability.
C A strict frequentist would consider this an improper use of the term probability.
D In machine learning, we would consider this an improper use of the term probability.

A

C A strict frequentist would consider this an improper use of the term probability.

7
Q

How does dropout help with the overfitting problem?

A By propagating the gradient of the loss back down the network.
B By randomly disabling nodes in a neural network, to eliminate solutions that require highly specific configurations.
C By ensuring that the output distribution of a neural network is normally distributed if the input distribution is.
D By converting the scalar backpropagation algorithm to work with tensors.

A

B By randomly disabling nodes in a neural network, to eliminate solutions that require highly specific configurations.
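
A minimal sketch of the mechanism, assuming the usual "inverted dropout" formulation (the rescaling by 1 − p is a common convention, not something stated in the question):

import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5):
    mask = rng.random(activations.shape) >= p      # each unit is kept with probability 1 - p
    return activations * mask / (1.0 - p)          # rescale so the expected activation is unchanged

h = np.array([0.2, 1.5, -0.7, 3.0])
print(dropout(h))                                  # some entries are zeroed at random (training time only)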

8
Q

The soft margin SVM loss is defined as a constrained optimization objective. We can rewrite this in two ways. Which is true?

A We can rewrite to an unconstrained problem. This allows us to use the kernel trick.
B We can rewrite to an unconstrained problem. This expresses the solution purely in terms of the dot product between pairs of instances.
C We can rewrite using KKT multipliers. This allows us to use the kernel trick.
D We can rewrite using KKT multipliers. This allows some instances to fall inside the margin.

A

C We can rewrite using KKT multipliers. This allows us to use the kernel trick.

9
Q

Neural networks usually contain activation functions. What is their purpose?

A They are used to compute a local approximation of the gradient.
B They are applied after a linear transformation, so that the network can learn nonlinear functions.
C They control the magnitude of the step taken during an iteration of gradient descent.
D They function as a regularizer, to combat overfitting.

A

B They are applied after a linear transformation, so that the network can learn nonlinear functions.

10
Q

By what method do variational autoencoders avoid mode collapse?

A By training the “decoder” network through a discriminator.
B By using a regularizer to steer the network toward the data average.
C By feeding the discriminator network pairs of inputs.
D By learning the latent representation of an instance through an “encoder” network.

A

D By learning the latent representation of an instance through an “encoder” network.

11
Q

I am training a generator network to generate faces. I take a random sample, compare it to a randomly chosen image from the data, and backpropagate the error. When training is finished, all samples from
the network look like the average over all faces in the dataset. What name do we have for this phenomenon?

A Multiple testing
B Overfitting
C Dropout
D Mode collapse

A

D Mode collapse

12
Q

In some machine learning settings it is said that we must make a tradeoff between exploration and exploitation. What do we mean by this?

A That hyperparameter selection (exploration) uses computational resources that can also be used in training the model (exploitation).
B That an online algorithm needs to balance optimization of its expected reward with exploring to learn more about its environment.
C That an insufficiently thoroughly trained model may be biased against minorities.
D This refers to the problem of balancing the loss function with the regularization terms in matrix factorization.

A

B That an online algorithm needs to balance optimization of its expected reward with exploring to learn more about its environment.

13
Q

Which is false?

A To use decision trees on data with categorical features, we must convert those features to one-hot vectors.
B To use decision trees on data with numeric features, we must choose a threshold value to split on, for every split.
C When training a decision tree on only categorical features, there’s no use in splitting again on a feature you’ve already split on.
D When training a decision tree on numeric features, it can often be useful to split on a feature you’ve already used before.

A

A To use decision trees on data with categorical features, we must convert those features to one-hot vectors.

14
Q

Some models are built on the Markov assumption. What do we mean by this?

A We can apply backpropagation to neural networks by unrolling them.
B The probability of a word does not depend on the current class for which we are predicting the probability.
C The operation of an LSTM cell depends only on its predecessors through two inputs.
D A word is conditionally dependent only on a finite number of words preceding it.

A

D A word is conditionally dependent only on a finite number of words preceding it.

15
Q

Which is (primarily) a supervised machine learning method?
A Principal Component Analysis
B Support Vector Machines
C Variational Autoencoders
D None of the above

A

B Support Vector Machines

16
Q

We are fitting a regression model using the least squares loss. We have seen two different forms of the loss function:
sum_i (y_i − t_i)^2 and
0.5 · sum_i (y_i − t_i)^2
(where y_i is the model output and t_i is the true value given by the data). Which is true?

A The global minima of these two loss functions occur at different points in the model space.
B If we work out the solution analytically, when we set the gradient equal to zero, the constant factor 1/2 in the second loss function changes the parameters of the optimal solution.
C If we use these loss functions with gradient descent, it makes no difference which we use; the behavior is exactly the same.
D If we use these loss functions with gradient descent, there is a small difference depending on which we use, but if we scale the learning rate appropriately, the difference will disappear.

A

D If we use these loss functions with gradient descent, there is a small difference depending on which we use, but if we scale the learning rate appropriately, the difference will disappear.
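
A small numeric check (toy model y_i = w · x_i, made-up numbers): the two gradients differ only by a constant factor 2, which rescaling the learning rate absorbs.

import numpy as np

x = np.array([1.0, 2.0, 3.0])
t = np.array([2.0, 4.0, 6.0])
w = 1.5                                   # toy model: y_i = w * x_i

grad_full = np.sum(2 * (w * x - t) * x)   # gradient of sum_i (y_i - t_i)^2
grad_half = np.sum((w * x - t) * x)       # gradient of 0.5 * sum_i (y_i - t_i)^2

lr = 0.01
print(w - lr * grad_full)                 # identical updates:
print(w - (2 * lr) * grad_half)           # doubling the learning rate compensates for the 0.5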

17
Q

We are choosing a new basis for our data. We decide to use an orthonormal basis. What is the advantage of having an orthonormal basis?
A It ensures that the basis vectors are equal to the principal components.
B It ensures that the inverse of the basis matrix is equal to its transpose.
C It ensures that the basis vectors are orthogonal to the principal components.
D It ensures that the data is automatically whitened in the new basis.

A

B It ensures that the inverse of the basis matrix is equal to its transpose.
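
A quick numpy check (illustrative only): for an orthonormal basis matrix Q we have Q^T Q = I, so the inverse is simply the transpose.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Q, _ = np.linalg.qr(A)                       # QR factorization yields an orthonormal basis Q

print(np.allclose(Q.T @ Q, np.eye(3)))       # True
print(np.allclose(np.linalg.inv(Q), Q.T))    # True: the inverse equals the transpose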

18
Q

We are considering using either gradient descent or random search for a problem. Which is true?

A For both, which optimum they find depends on the initial starting point.
B Gradient descent can get stuck in a local optimum, random search cannot.
C Gradient descent cannot get stuck in a local optimum, random search can.
D Gradient descent is more efficient than random search and can always be applied, so we always prefer gradient descent over random search.

A

A For both, which optimum they find depends on the initial starting point.

19
Q

Which property is common to both logistic regression and support vector machines?

A For both, the decision boundary is chosen by minimizing the number of misclassified examples.
B Both are usually optimized by alternating optimization.
C Both require backpropagation to work out the gradient efficiently.
D They both focus mostly or only on the points closest to the decision boundary.

A

D They both focus mostly or only on the points closest to the decision boundary.

20
Q

Imagine we have a naive Bayes classifier. In our dataset we have two binary features (categorical with two possible values) and two classes.
How many pseudo-observations do we need to add if we want to apply Laplace smoothing?

A 1
B 2
C 4
D 8

A

C 4

21
Q

One can choose between the likelihood function or the log likelihood function as a loss function. Which is usually preferred, and why?

A Both result in a maximum at the same point in model space, but the log-likelihood is often easier to work with.
B The likelihood function. When this is maximised we have the best fitting model whereas the log likelihood results in a worse model.
C The log likelihood function. The squared errors are minimized only when the log-likelihood is maximized.
D The likelihood function. The squared errors are minimized only when the likelihood is maximized.

A

A Both result in a maximum at the same point in model space, but the log-likelihood is often easier to work with.

22
Q

We have a logistic regression model for a binary classification problem, which predicts class probabilities q. We compare these to the true class probabilities p, which are always 1 for the correct class and 0 for the incorrect class. The slides mention two loss functions for this purpose: logarithmic loss and binary cross-entropy. Which is true?

A Log-loss does not lead to a smooth loss landscape, so we approximate it by cross-entropy so that we can search for a good model using gradient descent.
B Cross-entropy loss does not lead to a smooth loss landscape, so we approximate it by log-loss so that we can search for a good model using gradient descent.
C Log-loss is equal to the binary cross-entropy H(p, q).
D Log-loss is equal to the binary cross-entropy H(q, p).

A

C Log-loss is equal to the binary cross-entropy H(p, q).
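
A small numeric check (made-up probabilities): with p one-hot, the binary cross-entropy H(p, q) = −∑_c p_c log q_c reduces to −log q_correct, which is exactly the log-loss.

import numpy as np

p = np.array([1.0, 0.0])            # true class is class 0
q = np.array([0.8, 0.2])            # predicted class probabilities

cross_entropy = -np.sum(p * np.log(q))
log_loss = -np.log(q[0])            # minus the log-probability of the correct class
print(cross_entropy, log_loss)      # both ~0.223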

23
Q

We want to represent color videos in a deep learning system. Each is a series of frames, with each frame an RGB image. Which is the most natural representation for one such video?

A As a 1-tensor.
B As a 2-tensor.
C As a 3-tensor.
D As a 4-tensor.

A

D As a 4-tensor.
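
A one-line illustration (the axis order frames/channels/height/width is just one common convention):

import numpy as np

video = np.zeros((30, 3, 64, 64))   # (frames, color channels, height, width)
print(video.ndim)                   # 4 -> a 4-tensor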

24
Q

In support vector machines, how is the maximum margin hyperplane criterion (MMC) related to the support vectors?

A The support vectors can be removed from the data once the maximum margin hyperplane has been found.
B The support vectors determine the hyperplane that satisfies the MMC.
C The MMC and the support vectors describe different loss functions that we can use to fit a hyperplane.
D The support vectors provide an approximation to the hyperplane that satisfies the MMC.

A

B The support vectors determine the hyperplane that satisfies the MMC.

25
Q

Which of the following is not a method to prevent overfitting?

A Boosting
B Bagging
C Dropout
D L1 regularization

A

A Boosting

26
Q

What problem, if it exists for a single model, cannot be solved by training an ensemble of such models?

A High bias.
B High variance.
C High overfitting.
D High training time.

A

D High training time.

27
Q
I have a dataset of politicians in the European parliament and which past laws they voted for and against. The record is incomplete, but I have some votes for every law and for every politician. I would like to predict, for future laws, which politicians will vote for and which will vote against. I plan to model this as a recommender system using matrix factorization. Which is true?

A This is not a good model, because there are too many classes and not enough instances.
B This is not a good model, because there are not enough classes, and too many instances.
C I would have to deal with the cold start problem, because for the future laws I don’t have any voting information.
D I would have to deal with the cold start problem, because the voting record for past laws is incomplete.

A

C I would have to deal with the cold start problem, because for the future laws I don’t have any voting information.

28
Q

Which answer contains only unsupervised methods and tasks?
A k-Means, Clustering, Density estimation
B Clustering, Linear regression, Generative modelling
C Classification, Clustering, k-Means
D k-NN, Density estimation, Clustering

A

A k-Means, Clustering, Density estimation

29
Q

In the book, Flach makes a distinction between grouping and grading models. Which statement is false?

A Grouping models segment the feature space.
B Grading models combine other classifiers, assigning a grade to each.
C Grading models can assign each element in the feature space a different prediction.
D Grouping models can only assign a finite number of predictions.

A

B Grading models combine other classifiers, assigning a grade to each.

30
Q

We plot the ROC curve for a ranking classifier. What does the area
under the curve estimate?
A The probability of a ranking error
B The accuracy
C The sum of squared errors
D The probability of a misclassification

A

A The probability of a ranking error

31
Q

You want to search for a model in a discrete model space. Which
search method is the least applicable?
A Random search
B Simulated annealing
C Evolutionary methods
D Gradient descent

A

D Gradient descent

32
Q

In bar charts, what do error bars represent?
A Standard deviation
B Standard error
C A confidence interval
D All are possible

A

D All are possible

33
Q

We can decompose the sample covariance matrix S into a transformation matrix as follows: S = AA^T. This allows us to transform normally distributed data into standard normally distributed data. However, Principal Component Analysis doesn’t use this decomposition, but the singular value decomposition (S = UZU^T). Why?

A It’s easier to compute.
B There isn’t always an A such that S = AA^T.
C It makes the loss surface more smooth.
D It ensures the first axis has the highest eigenvalue.

A

D It ensures the first axis has the highest eigenvalue.
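
An illustrative numpy sketch (made-up data): decompose the sample covariance as S = UZU^T and order the axes by eigenvalue, so the first axis is the direction of highest variance.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.3])   # data with unequal spread per axis
X = X - X.mean(axis=0)
S = (X.T @ X) / (len(X) - 1)                                   # sample covariance matrix

eigvals, U = np.linalg.eigh(S)                                 # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]                              # reorder: largest eigenvalue first
eigvals, U = eigvals[order], U[:, order]
print(eigvals)                                                 # first entry is the largest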

34
Q

What is the relation between an ROC curve and a coverage matrix?

A Normalizing the axes of the coverage matrix gives an ROC curve.
B Normalizing the axes of the ROC curve gives a coverage matrix.
C Dividing the values in the coverage matrix by the ranking error gives the coverage matrix.
D The ROC curve is the transpose of the coverage matrix

A

A Normalizing the axes of the coverage matrix gives an ROC curve.

35
Q

Which statement is true?
A The average error of many models with high bias is low.
B The average error of many models with high variance is low.
C A model with high bias has low variance
D High bias is an indication of overfitting

A

B The average error of many models with high variance is low.

36
Q

Which statement is false?

A Entropy is an expression of the uniformity of a probability distribution.
B Entropy is an expectation of a codelength.
C Entropy is expressed in bits.
D The KL divergence is the least squares distance between two distributions.

A

D The KL divergence is the least squares distance between two distributions.
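
For reference (standard definition, not quoted from the slides), the KL divergence is an expected difference in code lengths rather than a squared distance:

\mathrm{KL}(p \,\|\, q) \;=\; \sum_x p(x) \log \frac{p(x)}{q(x)} \;=\; \mathbb{E}_{x \sim p}\bigl[\log p(x) - \log q(x)\bigr]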

37
Q

Logistic regression fits a linear classifier by passing the result through ⟨1⟩, and applying a ⟨2⟩ loss. Fill in the blanks.

A 1: a linear rectifier, 2: cross-entropy
B 1: a linear rectifier, 2:least-squares
C 1: a sigmoid function, 2: cross-entropy
D 1: a sigmoid function, 2:least-squares

A

C 1: a sigmoid function, 2: cross-entropy
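
A minimal sketch of what the blanks stand for (toy data, no training loop): a linear score passed through a sigmoid, scored with the cross-entropy loss.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, b, X, y):                  # labels y in {0, 1}
    q = sigmoid(X @ w + b)                      # predicted probability of class 1
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))   # binary cross-entropy

X = np.array([[0.5, 1.0], [-1.0, 0.2], [2.0, -0.3]])
y = np.array([1, 0, 1])
print(logistic_loss(np.array([0.1, -0.2]), 0.0, X, y))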

38
Q

A convolutional layer reduces the number of weights (compared to a fully connected network). Which is false?

A It does this by setting the weights on connections to be equal.
B It does this by including forget gates.
C It does this by not connecting every input to every node in the hidden layer.
D It does this by exploiting the locality encoded in the input.

A

B It does this by including forget gates.

39
Q

It is difficult to find the maximum likelihood parameters for hidden variable models. Which is false?

A If we marginalise out the hidden variable, we get a sum with a huge number of terms.
B Optimizing the parameters directly by gradient descent often leads to mode collapse.
C The EM algorithm often works but is not guaranteed to converge.
D If the values of the hidden variables are given, the problem becomes tractable.

A

C The EM algorithm often works but is not guaranteed to converge.

40
Q

Which is false?

A From a conditional distribution and a marginal distribution we can compute the joint distribution.
B From the joint distribution we can always compute any conditional distribution.
C From a conditional distribution we can always compute the joint distribution.
D For independent random variables we can compute the joint distribution from their marginal distributions.

A

C From a conditional distribution we can always compute the joint distribution.

41
Q

What is special about a denoising autoencoder?

A It allows us to encode to a hidden layer bigger than the input.
B It directly optimizes the maximum likelihood.
C It borrows ideas from the Expectation-Maximization algorithm.
D It can be used for generative modeling.

A

A It allows us to encode to a hidden layer bigger than the input.

42
Q

I have an irrational fear of Lagrange multipliers. Can I derive the SVM algorithm without them?

A Yes, but you can’t use the kernel trick.
B Yes, but you’ll have to make do with an approximation.
C Yes, but you can’t use the result as the top layer of a neural network.
D No.

A

A Yes, but you can’t use the kernel trick.

43
Q

I have a binary classification task. I build a Bayes classifier by fitting an MVN to each class. Which is true?

A The decision boundary is linear only if both MVNs have diagonal covariance matrices.
B The decision boundary is always linear.
C If the covariance matrices of both MVNs are the same, the decision boundary is linear.
D If the covariance matrices of both MVNs are the same, the decision boundary is a (nonlinear) hyperbola.

A

A The decision boundary is linear only if both MVNs have diagonal covariance matrices.

44
Q

We have a classification problem where the dataset is arranged on a straight line in a 3 dimensional space. The points are linearly separable along the line. Near the point of separation, some 3D Gaussian noise has been applied. Which classifier is most appropriate?

A A logistic regression classifier. The noised points are outliers to which the LR classifier is robust.
B A basic linear classifier. Others would pay too much attention to the points near the boundary, where the noise is.
C A kernel SVM. A kernel could project these points into a higher dimensional space, where they are linearly separable.
D A linear SVM. The maximum margin hyperplane criterion ensures that the linear nature of the dataset is taken into account.

A

B A basic linear classifier. Others would pay too much attention to the points near the boundary, where the noise is.

45
Q

Which does not describe the purpose of a regularizer?

A It simplifies selection of hyperparameters.
B It reduces overfitting.
C It functions as a definition of what constitutes a simple model.
D It biases the search algorithm towards simple models.

A

A It simplifies selection of hyperparameters.

46
Q

How do variational autoencoders avoid mode collapse?

A By training the “decoder” network through a discriminator.
B By using a regularizer to steer the network away from the data average.
C By feeding the discriminator network pairs of inputs.
D By learning the latent representation of an instance through an “encoder” network.

A

D By learning the latent representation of an instance through an “encoder” network.

47
Q

What is an n-gram?

A A short sequence of n tokens in sequential data.
B A latent representation of n dimensions.
C A way of storing data in a neural network.
D A batch of n instances.

A

A A short sequence of n tokens in sequential data.
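
A tiny illustration (the sentence is made up):

def ngrams(tokens, n):
    # all length-n windows of consecutive tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))   # bigrams: ('the', 'cat'), ('cat', 'sat'), ...
print(ngrams(tokens, 3))   # trigrams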

48
Q

What is the purpose of the word2vec method?

A It creates vector representations of words that embed semantics.
B It creates one-hot representations of words.
C It is an RNN model with increased ability to forget useless information.
D It ensures that Markov models can deal with unseen n-grams.

A

A It creates vector representations of words that embed semantics.

49
Q

We train a generative model through plain gradient descent, using random examples from the data as targets, comparing these against random samples from the model, and backpropagating the error. After training, all samples from the model look like the average over the dataset. What is this phenomenon called?

A Multiple testing
B Overfitting
C Dropout
D Mode collapse

A

D Mode collapse

50
Q

What is meant by the “exploration vs. exploitation” tradeoff?

A That hyperparameter selection (exploration) uses computational resources that can also be used in training the model (exploitation).
B That an online algorithm needs to balance optimization of its expected reward with exploring to learn more about its environment.
C That an insufficiently thoroughly trained model may be biased against minorities.
D Balancing the loss function with the regularization terms in matrix factorization.

A

B That an online algorithm needs to balance optimization of its expected reward with exploring to learn more about its environment.

51
Q

In recommender systems, what is implicit feedback?

A Associations between users and items assumed from user behavior.
B Ratings given by a single “like” button rather than a more fine-grained system.
C Recommendations that take the temporal structure of the data into account.
D Recommendations derived from manually crafted item features rather than learned ones.

A

A Associations between users and items assumed from user behavior.

52
Q

The variational autoencoder adapts the regular autoencoder in a number of ways. Which is not one of them?

A It adds a sampling step in the middle.
B It makes the outputs of the encoder and decoder parameters of probability distributions.
C It adds a loss term to ensure that the latent space is laid out like a standard normal distribution.
D It adds a discriminator that learns to separate generated examples from those in the dataset.

A

D It adds a discriminator that learns to separate generated examples from those in the dataset.

53
Q

Why is gradient descent difficult to apply in a reinforcement learning setting?

A The loss surface is flat in most places, so the gradient is zero almost everywhere.
B The backpropagation algorithm doesn’t apply if the output of a model is a probability distribution.
C There is a non-differentiable step between the model input and the model output.
D There is a non-differentiable step between the model output and the reward.

A

D There is a non-differentiable step between the model output and the reward.

54
Q

The two most important conceptual spaces in machine learning are the model space and the feature space. Which is true?

A Every point in the model space represents a loss function that we can choose for our task.
B Every point in the model space represents a single instance in the dataset.
C Every point in the feature space represents a single feature of a single instance in the dataset.
D Every point in the feature space represents a single instance in the dataset.

A

D Every point in the feature space represents a single instance in the dataset.

55
Q
The ALVINN system from 1995 was a self-driving car system implemented as a classifier: a grayscale camera was pointed at the road and a classifier was trained to predict the correct position of the steering wheel based on the behavior of a human driver. In this example, which are the instances and which are the features?

A The instances are the different cars the system is deployed in and the features are the angles of the steering wheel.
B The instances are the angles of the steering wheel and the features are the different cars the system is deployed in.
C The instances are the frames produced by the camera and the features are the pixel values.
D The instances are the pixel values and the features are the frames produced by the camera.

A

C The instances are the frames produced by the camera and the features are the pixel values.

56
Q

Maria is fitting a regression model to predict the year in which a particular piece of instrumental music was written. The prediction is based on various features like average and variance of loudness, rhythm, key etc. She realizes that she has many outliers: for instance, the atonal music of the 1920s produces extreme variations in loudness, and John Cage’s piece 4’33” from 1952 is entirely silent. What should she do?

A She should remove these instances entirely. Removing outliers will make it easier to fit the data with a normal distribution.
B She should remove these instances from the training data, but leave them in the test data.
C She should leave these instances in the training data, but remove them from the test data.
D She should leave these instances in. They are important examples of the data distribution.

A

D She should leave these instances in. They are important examples of the data distribution.

57
Q

When we apply the chain rule to a complex operation involving tensors, in order to use the backpropagation algorithm, the local derivatives might be something like the derivative of a vector with respect to a matrix. The result is a 3-tensor which is complex to work out, and expensive to store in memory. How do modern machine learning frameworks avoid this problem in their implementation of backpropagation?

A They approximate the local derivative using random search.
B They approximate the local derivative using the EM algorithm.
C They don’t compute the local derivative, but the product of the upstream derivative (the loss over the module outputs) with the local derivative.
D They don’t compute the local derivative, but the product of the downstream derivative (the module inputs over the network inputs) with the local derivative.

A

C They don’t compute the local derivative, but the product of the upstream derivative (the loss over the module outputs) with the local derivative.
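
A minimal numpy sketch of the idea for a single linear module y = Wx (shapes are illustrative): the full local derivative dy/dW would be a 3-tensor, but its product with the upstream gradient dL/dy collapses to an outer product, so the 3-tensor is never materialized.

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
upstream = rng.standard_normal(4)          # dL/dy, supplied by the module above

dL_dW = np.outer(upstream, x)              # shape (4, 3), same as W; no 3-tensor is ever formed
dL_dx = W.T @ upstream                     # gradient passed further down the network
print(dL_dW.shape, dL_dx.shape)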

58
Q

If we train a generator network by comparing a random output to a random target example from the data and backpropagating the difference, we get mode collapse. The problem is that we don’t know which random input corresponds to which item in the dataset. How do GANs solve this problem?

A By training a second network to map the target example to a distribution on the input space.
B By adding a KL-loss term on the random inputs of the generator network.
C By training the generator to generate outputs that a second network recognizes as real, and training the second network to distinguish generated outputs from real data.
D By adding a cycle-consistency loss-term.

A

C By training the generator to generate outputs that a second network recognizes as real, and training the second network to distinguish generated outputs from real data.

59
Q

Sometimes we want to learn a model that maps some input to some output, but in some aspects we also want the model to behave like a generator. For instance, if we train a model to colorize photographs, we don’t want it to be purely deterministic: to colorize the label of a beer bottle, it should randomly imagine some colors even if it can’t infer the correct colors from the input. Which GAN approach is designed to accomplish this?

A Vanilla GAN
B Conditional GAN
C CycleGAN
D StyleGAN

A

B Conditional GAN

60
Q

The Variational Autoencoder (VAE) differs from a regular autoencoder in several aspects. Which is not one of them?

A It includes a discriminator, which tries to tell the difference between data points and samples from the generator.
B It has an added loss term that ensures that the data looks like a standard normal distribution in the latent space.
C For a given instance, the encoder produces a distribution on the latent space, instead of a single point.
D It includes a sampling step in the middle, where a latent vector is sampled from the distribution provided by the encoder.

A

D It includes a sampling step in the middle, where a latent vector is sampled from the distribution provided by the encoder.

61
Q

Boosting is a popular method to improve a model’s performance. Why do we rarely see boosting used in research settings (unless specifically studying ensembling methods)?

A In research we want to measure the relative performance with respect to the baseline. If we apply boosting to our model, we should apply boosting to the baseline as well.
B Boosting cannot be applied in combination with a validation split, which is required in research.
C Boosting makes it difficult to compute a confidence interval over the accuracy, which is required in research.
D Boosting requires some information from the test set to be used in training, which is not allowed in research.

A

A In research we want to measure the relative performance with respect to the baseline. If we apply boosting to our model, we should apply boosting to the baseline as well.

62
Q

In recommender systems, what is implicit feedback?

A Ratings given by a single “like” button rather than a more fine-grained system.
B Recommendations that take the temporal structure of the data into account.
C Associations between users and items assumed from user behavior.
D Recommendations derived from manually crafted item features rather than learned ones.

A

C Associations between users and items assumed from user behavior.

63
Q

Word2Vec and matrix factorization are both embedding methods that make it possible to learn about a large set of featureless objects. How do they do this?

A By taking known features for each object and mapping these to a low-dimensional representation.
B By taking known features for each object and mapping these to a high-dimensional representation.
C By embedding these objects into a Euclidean space, with each object represented by a vector.
D By embedding these objects into a Euclidean space, with each object represented by a scalar.

A

C By embedding these objects into a Euclidean space, with each object represented by a vector.

64
Q

What is batch normalization?

A An operation in a neural network that normalizes the output of a layer so that it is normally distributed over the current batch.
B An operation in a neural network that normalizes the output of a layer so that it is uniformly distributed over the batch.
C A hyperparameter selection technique that sets the batch size to a value that ensures a normal distribution in the gradients of a neural network.
D A hyperparameter selection technique that sets the batch size to a value that ensures a uniform distribution in the gradients of a neural network.

A

A An operation in a neural network that normalizes the output of a layer so that it is normally distributed over the current batch.
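
A minimal sketch (without the learned scale and shift parameters that the full layer also has): each feature is standardized over the current batch.

import numpy as np

def batch_norm(h, eps=1e-5):
    mean = h.mean(axis=0)                  # statistics taken over the batch dimension
    var = h.var(axis=0)
    return (h - mean) / np.sqrt(var + eps)

batch = np.random.default_rng(0).standard_normal((32, 8)) * 5 + 3   # batch of 32 instances, 8 features
normalized = batch_norm(batch)
print(normalized.mean(axis=0).round(3))    # ~0 per feature
print(normalized.std(axis=0).round(3))     # ~1 per feature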

65
Q

If we have a classification problem that is not linearly separable, which trick allows us to still separate the classes with a linear classifier?

A Using MSE loss instead of log-loss.
B Adding a regulariser term.
C Deriving novel features from existing features.
D Reducing the dimensionality with PCA.

A

D Reducing the dimensionality with PCA.

66
Q

Which statement is false?

A The second principal component is the direction in which the bias is highest.
B PCA computes a change of basis for the data.
C PCA provides a normalization that accounts for correlations between features.
D The first principal component is the direction in which the variance is highest.

A

D The first principal component is the direction in which the variance is highest.

67
Q

If we want to maximize the log-probability for a model, but the model is too complex to make that computationally feasible, we can break up the log-probability into two terms as follows:
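
(The equation itself did not survive the export; presumably it is the standard decomposition into an evidence lower bound plus a KL term, with q any distribution over the hidden variable z:)

\log p(x) \;=\; \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z)}{q(z)}\right]}_{\text{lower bound}} \;+\; \underbrace{\mathrm{KL}\bigl(q(z) \,\|\, p(z \mid x)\bigr)}_{\geq\, 0}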

What does this buy us?

A We can optimize each term alternately, as done in GANs, or we can use the first term as a lower bound for the log probability, as done in SVMs.

B We can optimize each term alternately, as done in the VAE, or we can use the first term as a lower bound for the log probability, as done in the EM algorithm.

C We can optimize each term alternately, as done in SVMs, or we can use the first term as a lower bound for the log probability, as done in GANs.

D We can optimize each term alternately, as done in the EM algorithm, or we can use the first term as a lower bound for the log probability, as done in the VAE.

A
68
Q

Which is false?

A A Gaussian mixture model can be used for density estimation.
B It can be shown that the EM algorithm always converges to the global optimum.
C A Gaussian Mixture Model is a latent variable model: the latent variable indicates which component generated each point.
D EM is a softened version of k-means: a point can belong to multiple components at once.

A

A A Gaussian mixture model can be used for density estimation.

69
Q

When we apply the chain rule to a complex operation involving tensors, in order to use the backpropagation algorithm, the local derivatives might be something like the derivative of a vector with respect to a matrix.

The result is a 3-tensor which is complex to work out, and expensive to store in memory. How do modern machine learning frameworks avoid this problem in their implementation of backpropagation?

A They don’t compute the local derivative, but the product of the loss over the module outputs with the local derivative.
B They don’t compute the local derivative, but the product of the module inputs over the network inputs with the local derivative.
C They approximate the local derivative using the EM algorithm
D They approximate the local derivative using random search.

A

D They approximate the local derivative using random search.

70
Q

Boosting is a popular method to improve a model’s performance. Why do we rarely see boosting used in research settings (unless specifically studying ensembling methods)?

A: Boosting cannot be applied in combination with a validation split, which is required in research.
B: In research we want to measure the relative performance with respect to the baseline. If we apply boosting to our model, we should apply boosting to the baseline as well.
C: Boosting requires some information from the test set to be used in training, which is not allowed in research.
D: Boosting makes it difficult to compute a confidence interval over the accuracy, which is required in research.

A

C: Boosting requires some information from the test set to be used in training, which is not allowed in research.

71
Q

What is the relation between the k-Means algorithm and the EM algorithm?

A Both are instances of alternating optimization.
B The EM algorithm is a simplified version of the k-Means algorithm.
C k-Means is to k-Nearest neighbors as the EM algorithm is to Support Vector Machines.
D k-Means is to k-Nearest neighbors as Support Vector Machines are to the EM algorithm.

A

B The EM algorithm is a simplified version of the k-Means algorithm.

72
Q

What is the Markov assumption?

Does the LSTM make the Markov assumption?

A That it is more likely that a word should be forgotten than that it should be remembered. The LSTM does not make a Markov assumption.
B That it is more likely that a word should be forgotten than that it should be remembered. The LSTM makes a Markov assumption.
C That the probability of a word depends only on a fixed number of words before it. The LSTM does not make a Markov assumption.
D That the probability of a word depends only on a fixed number of words before it. The LSTM makes a Markov assumption.

A

D That the probability of a word depends only on a fixed number of words before it. The LSTM makes a Markov assumption.