ML past exam Questions Flashcards
How are hyperparameters defined?
A They are parameters of a learning algorithm that are set before we start training on the data.
B They are parameters of which the values are found by the training algorithm itself.
C They are parameters that make the models train faster.
D They are parameters of which the values are set after training the model.
A They are parameters of a learning algorithm that are set before we start training on the data.
In principal component analysis, which is true?
A The first principal component is the direction in which the variance is the smallest.
B The first principal component is the direction that maximizes the reconstruction loss.
C The first principal component is the direction that minimizes the reconstruction loss.
D The first principal component is the direction in which the bias is the smallest.
C The first principal component is the direction that minimizes the reconstruction loss.
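The link between maximal variance and minimal reconstruction loss can be checked numerically. The sketch below is only an illustration in plain Python: the four mean-centered 2-D points and the helper `variance_and_loss` are hypothetical, and the search scans candidate directions one degree apart instead of using an eigendecomposition.

```python
import math

# Hypothetical mean-centered 2-D data (each coordinate sums to zero).
points = [(2.0, 1.9), (-1.0, -1.2), (0.5, 0.7), (-1.5, -1.4)]

def variance_and_loss(angle):
    """Return (summed squared projection, summed squared reconstruction error)
    for the unit direction at the given angle."""
    ux, uy = math.cos(angle), math.sin(angle)
    variance = loss = 0.0
    for x, y in points:
        proj = x * ux + y * uy  # coordinate of the point along the direction
        variance += proj ** 2
        loss += (x - proj * ux) ** 2 + (y - proj * uy) ** 2
    return variance, loss

# Scan directions from 0 to 179 degrees.
angles = [math.radians(k) for k in range(180)]
max_variance_angle = max(angles, key=lambda a: variance_and_loss(a)[0])
min_loss_angle = min(angles, key=lambda a: variance_and_loss(a)[1])

print(math.degrees(max_variance_angle), math.degrees(min_loss_angle))
```

Both searches should land on the same direction, because for every point the squared projection and the squared reconstruction error add up to its fixed squared length.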
What is a valid reason to prefer gradient descent over random search?
A My model is easily differentiable.
B I need to be sure that I find the global minimum.
C My loss function is not smooth.
D There is some computation between the output of my model and my loss function, which I do not control.
A My model is easily differentiable.
If we have a feature that is categorical, but our model requires numeric features, we can turn the categoric feature into one or more numeric features either by integer coding, or by one-hot coding. Which is true?
A The benefit of integer coding is that it assigns each value a separate feature.
B Integer coding results in more features than one-hot coding.
C The feature(s) introduced by integer coding are valued between 0 and 1.
D The feature(s) introduced by one-hot coding are valued 0 or 1.
D The feature(s) introduced by one-hot coding are valued 0 or 1.
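A minimal sketch of the two encodings in plain Python. The category list and the helper names `integer_code` and `one_hot_code` are hypothetical, chosen only for illustration.

```python
# Hypothetical categorical feature with three possible values.
categories = ["red", "green", "blue"]

def integer_code(value):
    """Map a category to a single integer feature (its index in the list)."""
    return categories.index(value)

def one_hot_code(value):
    """Map a category to one 0/1 feature per possible value."""
    return [1 if value == c else 0 for c in categories]

print(integer_code("blue"))  # one feature: 2
print(one_hot_code("blue"))  # three features: [0, 0, 1]
```

Integer coding yields one feature whose range grows with the number of categories, while one-hot coding yields one feature per category, each valued 0 or 1.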
Two classes that are not linearly separable may become linearly separable if we add a new feature that is derived from one or more of the other features. Why?
A This approach removes outliers.
B A linear decision boundary in the new space may represent a nonlinear boundary in the old one.
C Adding a new feature derived from the existing ones normalizes the data.
D This approach makes a non-differentiable loss function differentiable.
B A linear decision boundary in the new space may represent a nonlinear boundary in the old one.
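As a small illustration, consider a hypothetical 1-D dataset in which the positive class lies on both sides of the negative class. No single threshold on x separates the classes, but a single threshold on the derived feature x^2 does, and that threshold corresponds to the nonlinear boundary |x| = c in the original space. The data and the helper `separable_by_threshold` are assumptions made for this sketch.

```python
# Hypothetical 1-D data: positives on both sides of the negatives.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]

def separable_by_threshold(values, labels):
    """True if a single threshold on `values` puts each class on its own side."""
    sorted_labels = [label for _, label in sorted(zip(values, labels))]
    # Separable iff the labels, sorted by value, switch class at most once.
    changes = sum(a != b for a, b in zip(sorted_labels, sorted_labels[1:]))
    return changes <= 1

print(separable_by_threshold(xs, labels))                   # False
print(separable_by_threshold([x * x for x in xs], labels))  # True
```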
In the Expectation-Maximization algorithm for Gaussian mixture models, we train a model consisting of several components, and we compute various responsibilities. Which is true?
A Each component is a normal distribution. Its responsibility for a point is the normalized probability density it assigns the point.
B Each component is a normal distribution. Its responsibility is the integer number of points it generates.
C Each component is one of the features of the data. Its responsibility for a point is the normalized probability density it assigns the point.
D Each component is one of the features of the data. Its responsibility is the integer number of points it generates.
A Each component is a normal distribution. Its responsibility for a point is the normalized probability density it assigns the point.
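A hedged sketch of how responsibilities can be computed for a 1-D Gaussian mixture. It assumes each component's density is multiplied by its mixture weight before normalizing; the component parameters and the function names are hypothetical.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """components: list of (weight, mean, std) tuples.
    Returns one responsibility per component; they sum to 1 for the point x."""
    weighted = [w * normal_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [d / total for d in weighted]

# Hypothetical mixture of two equally weighted unit-variance components.
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
r = responsibilities(1.0, components)
print(r)  # the component centered at 0 takes most of the responsibility
```

The responsibilities are real-valued and sum to one per point, so they are not integer counts of generated points.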
Lise is given an imbalanced dataset for a binary classification task. Which one of the following options is a useful solution to this problem?
A She can oversample her majority class by sampling with replacement.
B She can use accuracy to measure the performance of her classifier.
C She can use SMOTE to augment her training data.
D She can augment the feature set by adding the multiplication of some features.
C She can use SMOTE to augment her training data.
What is true about logistic regression?
A Logistic regression is a linear, discriminative classifier with a cross-entropy loss function.
B Logistic regression finds the maximum margin hyperplane when applied to data with well-separated classes.
C Logistic regression is identical to the least-square linear regression.
D Thanks to the log-loss function, logistic regression can learn nonlinear decision boundaries.
A Logistic regression is a linear, discriminative classifier with a cross-entropy loss function.
Neural networks usually contain activation functions. What is their purpose?
A They are used to compute a local approximation of the gradient.
B They are applied after a linear transformation, so that the network can learn nonlinear functions.
C They control the magnitude of the step taken during an iteration of gradient descent.
D They function as a regularizer, to combat overfitting.
B They are applied after a linear transformation, so that the network can learn nonlinear functions.
I am training a generator network to generate faces. I take a random sample, compare it to a randomly chosen image from the data, and backpropagate the error.
When training is finished, all samples from the network look like the average over all faces in the dataset.
What name do we have for this phenomenon?
A Multiple testing
B Overfitting
C Dropout
D Mode collapse
D Mode collapse
Which is true?
A A maximum likelihood objective for least-squares regression does not provide a smooth loss surface.
B The least-squares loss for linear regression can be derived from a maximum likelihood objective.
C Linear regression can be performed with a maximum likelihood objective but the results will be different from the least-squares loss.
D The loss function used in logistic regression is derived from assuming a normal distribution on the residuals.
B The least-squares loss for linear regression can be derived from a maximum likelihood objective.
What is an important difference between regular recurrent neural networks (RNNs) and LSTM neural networks?
A All LSTMs have a forget gate, allowing them to ignore parts of the cell state right away.
B All RNNs have a forget gate, allowing them to ignore parts of the cell state right away.
C LSTMs have a vanishing gradient problem, RNNs don’t.
D RNNs can be turned into variational autoencoders, LSTMs can’t.
A All LSTMs have a forget gate, allowing them to ignore parts of the cell state right away.
Which of the following is an ensemble method?
A Random forest
B Gradient boosting
C AdaBoost
D All of the above
D All of the above
William has a dataset of many recipes and many ingredients. He doesn’t know anything about the recipes except which ingredients occur in each, and he doesn’t know anything about the ingredients except in which recipes they occur.
He’d like to predict for a pair of an ingredient and a recipe (both already in the data) whether the recipe would likely be improved by adding that ingredient.
Which is true?
A He could model the recipes as instances with their ingredients as a single categorical feature, and solve the problem with a decision tree.
B He could model the ingredients as instances and their recipes as a single categorical feature, and solve the problem with a decision tree.
C He could model this as a matrix decomposition problem.
D This problem requires a sequence-to-sequence model.
C He could model this as a matrix decomposition problem.
What separates offline learning from reinforcement learning?
A In reinforcement learning the training labels are reinforced through boosting.
B Offline learning can be done without connection to the internet. Reinforcement learning requires reinforcement from a separate server.
C In reinforcement learning, the learner takes actions and receives feedback from the environment. In offline learning we learn from a fixed dataset.
D Reinforcement learning uses backpropagation to approximate the gradient, whereas offline learning uses symbolic computation.
C In reinforcement learning, the learner takes actions and receives feedback from the environment. In offline learning we learn from a fixed dataset.
The most important rule in machine learning is “never judge your performance on the training data.” If we break this rule, what can happen as a consequence?
A The loss surface no longer provides an informative gradient.
B We get cost imbalance.
C We end up choosing a model that overfits the training data.
D We commit multiple testing.
C We end up choosing a model that overfits the training data.
We have a classifier c and a test set. Which is true?
A To compute the precision for c on the test set, we must define how to turn it into a ranking classifier.
B To compute the false positive rate for c on the test set, we must define how to turn it into a ranking classifier.
C To compute the confusion matrix for c on the test set, we must define how to turn it into a ranking classifier.
D To compute the ROC area under the curve for c on the test set, we must define how to turn it into a ranking classifier.
D To compute the ROC area under the curve for c on the test set, we must define how to turn it into a ranking classifier.
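The sketch below illustrates why the ROC area under the curve needs a ranking: it is computed from scores, not from hard class predictions. It uses the pairwise formulation (the fraction of positive/negative pairs in which the positive is ranked higher, with ties counting half). The scores, labels, and the function name `roc_auc` are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    `scores` must induce a ranking; labels are 1 (positive) or 0 (negative)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for four test instances.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))  # 0.75: three of the four pairs are ordered correctly
```

Precision, the false positive rate, and the confusion matrix, by contrast, can all be computed from hard predictions alone.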
Testing too many times on the test set increases the chance of random effects influencing your choice of model. Nevertheless, we may need to test many different models and many different hyperparameters. What is the solution suggested in the lectures?
A To withhold the test set, and use a train/validation split on the remainder to evaluate model choices and hyperparameters.
B To normalize the data so that they appear normally distributed. Normalization will not help with this problem.
C To use bootstrap sampling to gauge the variance of the model. Bootstrap sampling (lecture 3 and lecture 10) will help you gauge the variance. But that will not solve this problem.
D To use a boosted ensemble, to reduce the variance of the model, and with it, the probability of random effects.
A To withhold the test set, and use a train/validation split on the remainder to evaluate model choices and hyperparameters.
Which answer contains only unsupervised methods and tasks?
A Clustering, Linear regression, Generative modeling
B k-Means, Clustering, Density estimation
C Classification, Clustering, Expectation-Maximization
D Logistic regression, Density estimation, Clustering
B k-Means, Clustering, Density estimation
The two most important conceptual spaces in machine learning are the model space and the feature space. Which is true?
A Every point in the model space represents a loss function that we can choose for our task.
B Every point in the model space represents a single instance in the dataset.
C Every point in the feature space represents a single feature of a single instance in the dataset.
D Every point in the feature space represents a single instance in the dataset.
D Every point in the feature space represents a single instance in the dataset.
The ALVINN system from 1995 was a self-driving car system implemented as a classifier: a grayscale camera was pointed at the road and a classifier was trained to predict the correct position of the steering wheel based on the behavior of a human driver. In this example, which are the instances and which are the features?
A The instances are the different cars the system is deployed in and the features are the angles of the steering wheel.
B The instances are the angles of the steering wheel and the features are the different cars the system is deployed in.
C The instances are the frames produced by the camera and the features are the pixel values.
D The instances are the pixel values and the features are the frames produced by the camera.
C The instances are the frames produced by the camera and the features are the pixel values.
What is the relation between the loss landscape and its gradient?
A The gradient points in the direction that the loss increases the fastest.
B The gradient points in the direction that the loss decreases the fastest.
C The gradient is the region of the loss landscape where the loss is the lowest.
D The gradient is the region of the loss landscape where the loss is the highest.
A The gradient points in the direction that the loss increases the fastest.
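A quick numerical check of this relation, using a hypothetical quadratic loss over two parameters. The functions `loss` and `gradient` and the step size are assumptions made for the sketch.

```python
def loss(w):
    """Hypothetical bowl-shaped loss with its minimum at (1.0, -0.5)."""
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2

def gradient(w):
    """Analytic gradient of the loss above."""
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)]

w = [3.0, 2.0]
g = gradient(w)
step = 0.01

uphill = [wi + step * gi for wi, gi in zip(w, g)]    # move along the gradient
downhill = [wi - step * gi for wi, gi in zip(w, g)]  # move against the gradient

print(loss(uphill) > loss(w))    # True: the loss increases along the gradient
print(loss(downhill) < loss(w))  # True: gradient descent steps the other way
```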
Which is a legitimate reason to prefer random search over gradient descent as a search method?
A The loss surface is complicated, so I want the size of the steps to change as I approach a minimum.
B I need to be sure that I’ve found a global minimum.
C My model has multiple layers, so I want to use backpropagation.
D My model is not differentiable.
D My model is not differentiable.
What is the benefit of a convex loss surface?
A It allows us to use the backpropagation algorithm.
B It allows us to use evolutionary methods.
C It ensures that there are no local minima other than the global minimum.
D It allows gradient descent to escape local maxima.
C It ensures that there are no local minima other than the global minimum.
Rob trains a k-nearest neighbors classifier. He withholds 20% of his data as a test set and uses the rest as his training data. He runs the training algorithm twenty times, for k = 1 to k = 20. For each, he computes the accuracy on the test set. He gets the best accuracy for k = 17, so he reports this accuracy as an estimate of the performance of k-nearest neighbors on his data. What fundamental mistake has Rob made?
A Rob is checking a linear range of values for hyperparameter k when a logarithmic range would be better.
B He is using arbitrary values for k. He should use a grid search.
C The test set should always be bigger than the training set.
D By reusing his test set, he may be inflating his performance estimate and overfitting to an arbitrary value of k.
D By reusing his test set, he may be inflating his performance estimate and overfitting to an arbitrary value of k.
Accuracy is a very simple and effective performance metric, but in certain cases, we should be careful. Imagine a spam classifier that automatically deletes emails detected as spam. The user receives about one spam email for every legitimate email. Why should we be careful optimizing for accuracy?
A Because we have very high class imbalance.
B Because we have very high cost imbalance.
C Because the data arrives irregularly.
D Because this is an online learning problem.
B Because we have very high cost imbalance.
Maria is fitting a regression model to predict the year in which a particular piece of instrumental music was written. The prediction is based on various features like average and variance of loudness, rhythm, key etc. She realizes that she has many outliers: for instance, the atonal music of the 1920s produces extreme variations in loudness, and John Cage’s piece 4’33” from 1952 is entirely silent. What should she do?
A She should remove these instances entirely. Removing outliers will make it easier to fit the data with a normal distribution.
B She should remove these instances from the training data, but leave them in the test data.
C She should leave these instances in the training data, but remove them from the test data.
D She should leave these instances in. They are important examples of the data distribution.
D She should leave these instances in. They are important examples of the data distribution.
When we want to model the throwing of a single die, using probability theory, we define a sample space and an event space. Which is true?
A “Rolling an even number” is an element of the event space. “Rolling a 1” is an element of the sample space.
B “Rolling an even number” is an element of the sample space. “Rolling a 1” is an element of the event space.
C For this example, the event space is continuous, the sample space is discrete.
D For this example, the sample space is continuous, the event space is discrete.
A “Rolling an even number” is an element of the event space. “Rolling a 1” is an element of the sample space.
Let f(x) = σ(wx + b) be a logistic regression model. We interpret f(x) as the probability that x has the positive class. If x actually has the negative class, what is the cross-entropy loss for this single example?
A −log f(x)
B −log(1 − f(x))
C −log f(x) − log(1 − f(x))
D log f(x) − log(1 − f(x))
B −log(1 − f(x))
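A small sketch of the per-example cross-entropy loss for a logistic regression model. The parameter values and the helper names `sigmoid` and `cross_entropy` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(p_positive, is_positive):
    """Negative log of the probability the model assigns to the true class."""
    return -math.log(p_positive) if is_positive else -math.log(1.0 - p_positive)

# Hypothetical parameters and input.
w, b, x = 2.0, -1.0, 1.5
f = sigmoid(w * x + b)  # model's probability that x has the positive class

loss_if_negative = cross_entropy(f, is_positive=False)  # -log(1 - f(x))
loss_if_positive = cross_entropy(f, is_positive=True)   # -log f(x)
print(loss_if_negative > loss_if_positive)  # True: the model leans positive here
```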
Frank is a researcher in the 1960s. He’s just read about a new model called the perceptron, which is a highly simplified simulation of a single brain cell. Frank decides that if a brain is powerful because it chains together multiple brain cells, he should try to chain together multiple perceptrons, to make a network that is more powerful than a single perceptron. Why won’t chaining perceptrons together work in this way?
A A GPU is needed to compute the output of such a function.
B The perceptron is a linear function, and the composition of linear functions is still a linear function.
C Such a model would suffer from vanishing gradients.
D This is equivalent to hypothesis boosting, which has been proven to be impossible.
B The perceptron is a linear function, and the composition of linear functions is still a linear function.
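The collapse can be illustrated in one dimension. The sketch assumes each "perceptron" is just a function of the form a·x + b with no nonlinearity in between; the coefficients and the helper `make_linear` are hypothetical.

```python
def make_linear(a, b):
    """Return the function x -> a * x + b."""
    return lambda x: a * x + b

f = make_linear(2.0, 1.0)   # first unit
g = make_linear(-3.0, 0.5)  # second unit, fed with the output of the first

# g(f(x)) = -3 * (2x + 1) + 0.5 = -6x - 2.5, again of the form a * x + b.
collapsed = make_linear(-3.0 * 2.0, -3.0 * 1.0 + 0.5)

for x in [-2.0, 0.0, 1.5]:
    print(g(f(x)), collapsed(x))  # the two values agree for every x
```

Chaining adds no expressive power here, which is why a nonlinear activation between the layers is needed.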
In a support vector machine, what are the support vectors?
A The parameters w^T that are multiplied by the input x to produce a classification.
B The parameters b that are added to the input x to produce a classification.
C The positive and negative data points that are allowed to fall inside the margin.
D The positive and negative data points that are closest to the decision boundary.
D The positive and negative data points that are closest to the decision boundary.
In deep learning, what is the difference between lazy and eager execution (or evaluation)?
A In lazy execution, the computation graph is compiled and kept static during training. In eager execution, it is built up for each forward pass.
B In eager execution, the computation graph is compiled and kept static during training. In lazy execution, it is built up for each forward pass.
C In lazy execution, the gradient is computed by numeric approximation, while eager execution uses the backpropagation algorithm.
D In eager execution, the gradient is computed by numeric approximation, while lazy execution uses the backpropagation algorithm.
A In lazy execution, the computation graph is compiled and kept static during training. In eager execution, it is built up for each forward pass.
Why is the ReLU activation function often preferred over the sigmoid activation function, for hidden nodes?
A It causes more vanishing gradients, which help learning.
B The sigmoid function cannot be used with gradient descent.
C Its derivative is almost always either 0 or 1, reducing vanishing gradients.
D The sigmoid function contains a point where the gradient is not defined.
C Its derivative is almost always either 0 or 1, reducing vanishing gradients.
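A rough illustration of the effect on gradients. By the chain rule, backpropagation multiplies one activation derivative per layer; the sketch compares the best case for the sigmoid (derivative 0.25, at input 0) with an active ReLU unit (derivative 1). The depth of 10 layers and the function names are assumptions made for the example.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # never larger than 0.25

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # undefined at exactly 0; 0.0 is used here

depth = 10  # hypothetical number of layers
sigmoid_chain = sigmoid_grad(0.0) ** depth  # 0.25 ** 10, roughly 1e-6
relu_chain = relu_grad(1.0) ** depth        # stays at 1.0

print(sigmoid_chain, relu_chain)
```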
When we apply the chain rule to a complex operation involving tensors, in order to use the backpropagation algorithm, the local derivatives might be something like the derivative of a vector with respect to a matrix. The result is a 3-tensor which is complex to work out, and expensive to store in memory. How do modern machine learning frameworks avoid this problem in their implementation of backpropagation?
A They approximate the local derivative using random search.
B They approximate the local derivative using the EM algorithm.
C They don’t compute the local derivative, but the product of the upstream derivative (the loss over the module outputs) with the local derivative.
D They don’t compute the local derivative, but the product of the downstream derivative (the module inputs over the network inputs) with the local derivative.
C They don’t compute the local derivative, but the product of the upstream derivative (the loss over the module outputs) with the local derivative.
How is the log-likelihood like a loss function?
A The log-likelihood is an approximation of the loss function.
B The loss function is an approximation of the log-likelihood.
C When we train a model, we minimize the loss, and when we fit a distribution, we often minimize the log-likelihood.
D When we train a model, we minimize the loss, and when we fit a distribution, we often maximize the log-likelihood.
D When we train a model, we minimize the loss, and when we fit a distribution, we often maximize the log-likelihood.
If we train a generator network by comparing a random output to a random target example from the data and backpropagating the difference, we get mode collapse. The problem is that we don’t know which random input corresponds to which item in the dataset. How do GANs solve this problem?
A By training a second network to map the target example to a distribution on the input space.
B By adding a KL-loss term on the random inputs of the generator network.
C By training the generator to generate outputs that a second network recognizes as real, and training the second network to distinguish generated outputs from real data.
D By adding a cycle-consistency loss-term.
C By training the generator to generate outputs that a second network recognizes as real, and training the second network to distinguish generated outputs from real data.
Sometimes we want to learn a model that maps some input to some output, but in some aspects we also want the model to behave like a generator. For instance, if we train a model to colorize photographs, we don’t want it to be purely deterministic: to colorize the label of a beer bottle, it should randomly imagine some colors even if it can’t infer the correct colors from the input. Which GAN approach is designed to accomplish this?
A Vanilla GAN
B Conditional GAN
C CycleGAN
D StyleGAN
B Conditional GAN
The Variational Autoencoder (VAE) differs from a regular autoencoder in several aspects. Which is not one of them?
A It includes a discriminator, which tries to tell the difference between data points and samples from the generator.
B It has an added loss term that ensures that the data looks like a standard normal distribution in the latent space.
C For a given instance, the encoder produces a distribution on the latent space, instead of a single point.
D It includes a sampling step in the middle, where a latent vector is sampled from the distribution provided by the encoder.
A It includes a discriminator, which tries to tell the difference between data points and samples from the generator.
The standard decision tree algorithm doesn’t stop adding nodes until all leaves either contain no data instances, or only instances with the same label (or all features have been used). Why is this a problem, and what is the default solution (mentioned in the slides)?
A It’s a problem because the algorithm may never terminate. To solve it, we can use a validation set to see if removing nodes improves performance.
B It’s a problem because the algorithm may never terminate. To solve it, we can remove features from the data, so that fewer splits are available.
C It’s a problem because we may be overfitting on the training set. To solve it, we can use a validation set to see if removing nodes improves performance.
D It’s a problem because we may be overfitting on the training set. To solve it, we can add features to the data, so that more splits are available.
C It’s a problem because we may be overfitting on the training set. To solve it, we can use a validation set to see if removing nodes improves performance.
Boosting is a popular method to improve a model’s performance. Why do we rarely see boosting used in research settings (unless specifically studying ensembling methods)?
A In research we want to measure the relative performance with respect to the baseline. If we apply boosting to our model, we should apply boosting to the baseline as well.
B Boosting cannot be applied in combination with a validation split, which is required in research.
C Boosting makes it difficult to compute a confidence interval over the accuracy, which is required in research.
D Boosting requires some information from the test set to be used in training, which is not allowed in research.
A In research we want to measure the relative performance with respect to the baseline. If we apply boosting to our model, we should apply boosting to the baseline as well.
In recommender systems, what is implicit feedback?
A Ratings given by a single “like” button rather than a more fine-grained system.
B Recommendations that take the temporal structure of the data into account.
C Associations between users and items assumed from user behavior.
D Recommendations derived from manually crafted item features rather than learned ones.
C Associations between users and items assumed from user behavior.
Word2Vec and matrix factorization are both embedding methods that make it possible to learn about a large set of featureless objects. How do they do this?
A By taking known features for each object and mapping these to a low-dimensional representation.
B By taking known features for each object and mapping these to a high-dimensional representation.
C By embedding these objects into a Euclidean space, with each object represented by a vector.
D By embedding these objects into a Euclidean space, with each object represented by a scalar.
C By embedding these objects into a Euclidean space, with each object represented by a vector.
What is batch normalization?
A An operation in a neural network that normalizes the output of a layer so that it is normally distributed over the current batch.
B An operation in a neural network that normalizes the output of a layer so that it is uniformly distributed over the batch.
C A hyperparameter selection technique that sets the batch size to a value that ensures a normal distribution in the gradients of a neural network.
D A hyperparameter selection technique that sets the batch size to a value that ensures a uniform distribution in the gradients of a neural network.
A An operation in a neural network that normalizes the output of a layer so that it is normally distributed over the current batch.
The squared error loss function looks like this: Σi(yi − ti)^2, where the sum is over all instances, yi is the model output for instance i and ti is the training label. Which is not a reason for squaring the difference between the two?
A It ensures that negative and positive differences don’t cancel out in the sum.
B It ensures that large errors count very heavily towards the total loss.
C When used in classification, it ensures that points near the decision boundary weigh most heavily.
D It is a consequence of assuming normally distributed errors, and deriving the maximum likelihood solution.
C When used in classification, it ensures that points near the decision boundary weigh most heavily.
We have a classifier c and a test set. Which is true?
A To compute the precision for c on the test set, we must define how to turn it into a ranking classifier.
B To compute the false positive rate for c on the test set, we must define how to turn it into a ranking classifier.
C To compute the confusion matrix for c on the test set, we must define how to turn it into a ranking classifier.
D To compute the area under the curve for c on the test set, we must define how to turn it into a ranking classifier.
D To compute the area under the curve for c on the test set, we must define how to turn it into a ranking classifier.
Testing too many times on the test set increases the chance of random effects influencing your choice of model. What is the solution suggested in the lectures?
A To withhold the test set until a hypothesis is established, and use a train/validation split on the remainder to evaluate model choice and hyperparameters.
B To perform cross-validation on the training data, so that all instances are used as training data at least once.
C To use bootstrap sampling to gauge the variance of the model.
D To use a boosted ensemble, to reduce the variance of the model, and with it, the probability of random effects.
A To withhold the test set until a hypothesis is established, and use a train/validation split on the remainder to evaluate model choice and hyperparameters.
Different features in our data may have wildly different scales: a person’s age may fall in the range from 0 to 100, while their savings can fall in the range from 0 to 100 000. For many machine learning algorithms, we need to modify the data so that all features have roughly the same scale. Which is not a method to achieve this?
A Imputation
B Standardization
C Normalization
D Principal Component Analysis
A Imputation
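A minimal sketch of two of the rescaling options, using only the Python standard library. Here "normalization" is taken to mean min-max scaling to the range [0, 1] and "standardization" to mean rescaling to zero mean and unit standard deviation; the sample values and function names are hypothetical.

```python
import statistics

def normalize(values):
    """Min-max scaling: the smallest value maps to 0, the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical features on very different scales.
ages = [20, 35, 50, 65]
savings = [500, 12000, 40000, 98000]

print(normalize(ages), normalize(savings))      # both now lie in [0, 1]
print(standardize(ages), standardize(savings))  # both now have mean 0, std 1
```

Imputation, by contrast, fills in missing values and does not change the scale of a feature.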
Sophie and Emma are doing a machine learning project together, and training the single-feature regression model y = w1 x^2 + w2 x + b. Sophie says this is a non-linear model, because it learns a parabola not a line. Emma says it is a linear model, but on the features x and x^2, derived from the original single feature.
A Sophie is right. Emma is wrong.
B Emma is right. Sophie is wrong.
C Both are right.
D Both are wrong.
C Both are right.
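Both views can be seen in a few lines of code. In the sketch below the prediction is a plain weighted sum over the derived features (x^2, x), so it is linear in those features, yet as a function of the original x it traces a parabola. The weights and helper names are hypothetical.

```python
def derived_features(x):
    """Expand the single original feature into the two derived features."""
    return [x * x, x]

def predict(x, weights, b):
    """A linear model on the derived features: a weighted sum plus a bias."""
    return sum(w * f for w, f in zip(weights, derived_features(x))) + b

weights, b = [1.0, -2.0], 1.0  # hypothetical: y = x^2 - 2x + 1 = (x - 1)^2

print([predict(x, weights, b) for x in [-1.0, 0.0, 1.0, 2.0, 3.0]])
# [4.0, 1.0, 0.0, 1.0, 4.0]: a parabola in the original feature x
```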
You finish this exam and hand it in. You say to your fellow students: “The probability that I’ve passed this exam is 60%.” Which is true?
A This is not a frequentist use of the word probability, because it uses a percentage instead of a frequency.
B This is not a Bayesian use of the word probability because it expresses a belief, not a result of repeated experiments.
C This is a frequentist use of the word probability.
D This is a Bayesian use of the word probability.
D This is a Bayesian use of the word probability.
We have a dataset with a number of categoric features, each of which takes one of two values. In naive Bayes, probability estimates can go to zero if we see a feature take a value that it doesn’t take in the training data. We can solve this by Laplace smoothing, which we can interpret as adding pseudo-observations. Which is true?
A The number of pseudo-observations we need to add is the number of classes times two.
B The number of pseudo-observations we need to add is the number of classes, times two to the power of the number of features.
C After adding the pseudo-observations, we must change the denominator for the probability estimate to ensure that all probabilities still add up to one.
D After adding the pseudo-observations, we must change the numerator for the probability estimate to ensure that all probabilities still add up to one.
A The number of pseudo-observations we need to add is the number of classes times two.
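A hedged sketch of the smoothed estimate for one binary feature within one class. It assumes one pseudo-observation per feature value per class, so with two values each class receives two pseudo-observations; the counts and the function name are hypothetical.

```python
def smoothed_estimate(count_value_in_class, count_class, num_values=2, pseudo=1):
    """Estimate p(feature = value | class) with `pseudo` pseudo-observations
    added for each of the `num_values` possible feature values."""
    return (count_value_in_class + pseudo) / (count_class + pseudo * num_values)

# Hypothetical counts: 10 training instances of a class, all with value "yes".
p_yes = smoothed_estimate(10, 10)
p_no = smoothed_estimate(0, 10)

print(p_yes, p_no)   # 11/12 and 1/12: the unseen value no longer gets probability 0
print(p_yes + p_no)  # the two estimates still sum to 1 (up to rounding)
```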
We have two discrete random variables: A with outcomes 1, 2, 3 and B with outcomes a, b, c. We are given the joint probability p(A, B) in a table, with the outcomes of A enumerated along the rows (vertically), and the outcomes of B enumerated along the columns (horizontally). How do we compute the probability p(A = 1 | B = a)?
A We find the probability in the first column and the first row.
B We find the probability in the first column and the first row, and divide it by the sum over the first column.
C We find the probability in the first column and the first row, and divide it by the sum over the first row.
D We sum the probabilities over the first column and the first row.
B We find the probability in the first column and the first row, and divide it by the sum over the first column.
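The computation can be sketched with a small joint table stored as a list of rows. The probabilities below are hypothetical and were chosen only so that the table sums to one.

```python
# Hypothetical joint table p(A, B): rows are A = 1, 2, 3; columns are B = a, b, c.
joint = [
    [0.10, 0.05, 0.15],  # A = 1
    [0.20, 0.10, 0.05],  # A = 2
    [0.10, 0.15, 0.10],  # A = 3
]

# p(B = a) is the sum over the first column (marginalizing out A).
p_b_is_a = sum(row[0] for row in joint)

# p(A = 1 | B = a) = p(A = 1, B = a) / p(B = a).
p_a1_given_ba = joint[0][0] / p_b_is_a

print(p_a1_given_ba)  # approximately 0.25
```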