ML Past Exam Questions Flashcards
How are hyperparameters defined?
A They are parameters of a learning algorithm that are set before we start training on the data.
B They are parameters of which the values are found by the training algorithm itself.
C They are parameters that make the models train faster.
D They are parameters of which the values are set after training the model.
A They are parameters of a learning algorithm that are set before we start training on the data.
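As an illustration (a minimal sketch, assuming scikit-learn is available; data and values are made up): the number of neighbors in k-NN is a hyperparameter, chosen before training, while everything the model knows about the data is learned inside fit().

    # n_neighbors is a hyperparameter: fixed before fit() is called.
    from sklearn.neighbors import KNeighborsClassifier

    X_train = [[0.0], [1.0], [2.0], [3.0]]
    y_train = [0, 0, 1, 1]

    clf = KNeighborsClassifier(n_neighbors=3)  # set before training
    clf.fit(X_train, y_train)                  # the rest is learned from data
    print(clf.predict([[1.6]]))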
In principal component analysis, which is true?
A The first principal component is the direction in which the variance is the smallest.
B The first principal component is the direction that maximizes the reconstruction loss.
C The first principal component is the direction that minimizes the reconstruction loss.
D The first principal component is the direction in which the bias is the smallest.
C The first principal component is the direction that minimizes the reconstruction loss.
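A minimal numpy sketch of the answer (synthetic data): the first principal component is the top right-singular vector of the centered data, and projecting onto it gives the one-dimensional reconstruction with the smallest squared error (equivalently, the largest variance).

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
    X = X - X.mean(axis=0)  # PCA assumes centered data

    # First principal component = top right-singular vector of the data.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    pc1 = Vt[0]

    # Project onto pc1 and reconstruct; among all directions, pc1 minimizes
    # this squared reconstruction error (equivalently, maximizes variance).
    X_rec = np.outer(X @ pc1, pc1)
    print(np.mean(np.sum((X - X_rec) ** 2, axis=1)))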
What is a valid reason to prefer gradient descent over random search?
A My model is easily differentiable.
B I need to be sure that I find the global minimum.
C My loss function is not smooth.
D There is some computation between the output of my model and my loss function, which I do not control.
A My model is easily differentiable.
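A toy sketch of why differentiability is the relevant property (the loss function here is made up): when the loss is differentiable, each step can follow the exact local slope instead of exploring model space at random.

    # Gradient descent on a differentiable loss: each step follows the
    # analytic gradient, so no random exploration is needed.
    def loss(w):
        return (w - 3.0) ** 2

    def grad(w):
        return 2.0 * (w - 3.0)

    w, lr = 0.0, 0.1
    for _ in range(100):
        w -= lr * grad(w)
    print(w, loss(w))  # converges to the minimum at w = 3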
If we have a feature that is categorical, but our model requires numeric features, we can turn the categorical feature into one or more numeric features either by integer coding or by one-hot coding. Which is true?
A The benefit of integer coding is that it assigns each value a separate feature.
B Integer coding results in more features than one-hot coding.
C The feature(s) introduced by integer coding are valued between 0 and 1.
D The feature(s) introduced by one-hot coding are valued 0 or 1.
D The feature(s) introduced by one-hot coding are valued 0 or 1.
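A small numpy sketch contrasting the two codings (the color feature is a made-up example): integer coding yields one numeric feature with values 0..(k-1), while one-hot coding yields k features valued 0 or 1.

    import numpy as np

    # Hypothetical categorical feature with three values.
    colors = np.array(["red", "green", "blue", "green"])

    # Integer coding: one numeric feature, values 0..(k-1), which imposes
    # a spurious ordering on the categories.
    values, int_codes = np.unique(colors, return_inverse=True)
    print(int_codes)          # e.g. [2 1 0 1]

    # One-hot coding: k features, each valued 0 or 1.
    one_hot = np.eye(len(values))[int_codes]
    print(one_hot)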
Two classes that are not linearly separable may become linearly separable if we add a new feature that is derived from one or more of the other features. Why?
A This approach removes outliers.
B A linear decision boundary in the new space may represent a nonlinear boundary in the old one.
C Adding a new feature derived from the existing ones normalizes the data.
D This approach makes a non-differentiable loss function differentiable.
B A linear decision boundary in the new space may represent a nonlinear boundary in the old one.
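A numpy sketch of the idea (synthetic data): two concentric rings are not linearly separable in the original two features, but adding the derived feature x1^2 + x2^2 makes a linear (planar) boundary in the new three-dimensional space separate them exactly.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two concentric rings: not linearly separable in (x1, x2).
    angles = rng.uniform(0, 2 * np.pi, 200)
    radii = np.where(rng.uniform(size=200) < 0.5, 1.0, 3.0)
    X = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    y = (radii > 2.0).astype(int)

    # Add a derived feature x1^2 + x2^2; in the new space a linear
    # boundary (the plane x3 = 4) separates the classes exactly.
    x3 = (X ** 2).sum(axis=1)
    print(np.all((x3 > 4.0) == (y == 1)))  # True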
In the Expectation-Maximization algorithm for Gaussian mixture models, we train a model consisting of several components, and we compute various responsibilities. Which is true?
A Each component is a normal distribution. Its responsibility for a point is the normalized probability density it assigns the point.
B Each component is a normal distribution. Its responsibility is the integer number of points it generates.
C Each component is one of the features of the data. Its responsibility for a point is the normalized probability density it assigns the point.
D Each component is one of the features of the data. Its responsibility is the integer number of points it generates.
A Each component is a normal distribution. Its responsibility for a point is the normalized probability density it assigns the point.
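A minimal sketch of the E-step (assuming numpy and scipy; the mixture parameters are made up): each responsibility is a component's weighted density at the point, normalized over all components, so responsibilities are soft values that sum to one, not integer counts.

    import numpy as np
    from scipy.stats import norm

    # Two 1D Gaussian components with mixture weights.
    weights = np.array([0.4, 0.6])
    means = np.array([0.0, 5.0])
    stds = np.array([1.0, 2.0])

    x = 1.5
    densities = weights * norm.pdf(x, loc=means, scale=stds)
    responsibilities = densities / densities.sum()
    print(responsibilities)  # sums to 1: the E-step soft assignments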
Lise is given an imbalanced dataset for a binary classification task. Which one of the following options is a useful solution to this problem?
A She can oversample her majority class by sampling with replacement.
B She can use accuracy to measure the performance of her classifier.
C She can use SMOTE to augment her training data.
D She can augment the feature set by adding the multiplication of some features.
C She can use SMOTE to augment her training data.
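A minimal sketch using the imbalanced-learn package (an assumption: it is installed as imbalanced-learn; the data is synthetic): SMOTE synthesizes new minority-class points by interpolating between existing minority neighbors, balancing the training data.

    import numpy as np
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(3, 1, (10, 2))])
    y = np.array([0] * 95 + [1] * 10)

    # SMOTE interpolates between minority-class neighbors to create
    # new synthetic minority samples.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(np.bincount(y_res))  # classes are now balanced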
What is true about logistic regression?
A Logistic regression is a linear, discriminative classifier with a cross-entropy loss function.
B Logistic regression finds the maximum margin hyperplane when applied to data with well-separated classes.
C Logistic regression is identical to the least-square linear regression.
D Thanks to the log-loss function, logistic regression can learn nonlinear decision boundaries.
A Logistic regression is a linear, discriminative classifier with a cross-entropy loss function.
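A numpy sketch of the pieces named in the answer (data and weights are made up): a linear score passed through the sigmoid, scored with the cross-entropy (log) loss. Note the decision boundary w.x + b = 0 stays linear regardless of the loss, which is why D is wrong.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy(w, b, X, y):
        # Predicted probability of the positive class, then log loss.
        p = sigmoid(X @ w + b)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])
    print(cross_entropy(np.array([1.0]), -1.5, X, y))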
Neural networks usually contain activation functions. What is their purpose?
A They are used to compute a local approximation of the gradient.
B They are applied after a linear transformation, so that the network can learn nonlinear functions.
C They control the magnitude of the step taken during an iteration of gradient descent.
D They function as a regularizer, to combat overfitting.
B They are applied after a linear transformation, so that the network can learn nonlinear functions.
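A numpy sketch of why the nonlinearity is needed (random weights, purely illustrative): without an activation, stacked linear layers collapse into a single linear map.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 3))
    W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))

    # Without an activation, two layers collapse into one linear map ...
    linear_out = (x @ W1) @ W2
    collapsed = x @ (W1 @ W2)
    print(np.allclose(linear_out, collapsed))  # True: still a linear model

    # ... but a nonlinearity between them breaks the collapse.
    relu_out = np.maximum(x @ W1, 0.0) @ W2
    print(np.allclose(relu_out, collapsed))    # False: now nonlinear in x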
I am training a generator network to generate faces. I take a random sample, compare it to a randomly chosen image from the data, and backpropagate the error.
When training is finished, all samples from the network look like the average over all faces in the dataset.
What name do we have for this phenomenon?
A Multiple testing
B Overfitting
C Dropout
D Mode collapse
D Mode collapse
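A toy numpy sketch of why this training scheme produces average-looking samples (everything here is made up for illustration; the "generator" is just a single constant output): minimizing squared error against randomly paired data points is minimized by outputting the data mean.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=1.0, size=1000)  # stand-in for "faces"

    # Train a constant output g to minimize the squared error against
    # randomly chosen data points, as described in the question.
    g, lr = 0.0, 0.01
    for _ in range(5000):
        target = rng.choice(data)     # randomly chosen data point
        g -= lr * 2.0 * (g - target)  # gradient of (g - target)^2
    print(g, data.mean())  # g settles near the mean: every sample looks average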
Which is true?
A A maximum likelihood objective for least-squares regression does not provide a smooth loss surface.
B The least-squares loss for linear regression can be derived from a maximum likelihood objective.
C Linear regression can be performed with a maximum likelihood objective but the results will be different from the least-squares loss.
D The loss function used in logistic regression is derived from assuming a normal distribution on the residuals.
B The least-squares loss for linear regression can be derived from a maximum likelihood objective.
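A compact version of the standard derivation (textbook reasoning, added here for context): assume y_i = w·x_i + ε_i with Gaussian noise ε_i ~ N(0, σ²). The negative log-likelihood of the data is

    −Σ_i log N(y_i | w·x_i, σ²) = (1 / (2σ²)) Σ_i (y_i − w·x_i)² + constant,

so maximizing the likelihood and minimizing the sum of squared errors pick out the same w. (Logistic regression instead assumes a Bernoulli likelihood on the labels, which yields the cross-entropy loss; no normal distribution on the residuals is involved, which is why D is wrong.)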
What is an important difference between regular recurrent neural networks (RNNs) and LSTM neural networks?
A All LSTMs have a forget gate, allowing them to ignore parts of the cell state right away.
B All RNNs have a forget gate, allowing them to ignore parts of the cell state right away.
C LSTMs have a vanishing gradient problem, RNNs don’t.
D RNNs can be turned into variational autoencoders, LSTMs can’t.
A All LSTMs have a forget gate, allowing them to ignore parts of the cell state right away.
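An illustrative numpy sketch of the forget gate (shapes and weights are made up): the gate produces values in (0, 1) that scale down, or erase, parts of the previous cell state.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    d_h, d_x = 4, 3
    W_f = rng.normal(size=(d_h, d_h + d_x))
    b_f = np.zeros(d_h)

    h_prev, x_t = rng.normal(size=d_h), rng.normal(size=d_x)
    c_prev = rng.normal(size=d_h)

    # Forget gate: values in (0, 1) that scale down (or erase) parts of
    # the previous cell state before new information is added.
    f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
    c_t = f_t * c_prev  # plus an input-gate term, omitted for brevity
    print(f_t)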
Which of the following is an ensemble method?
A Random forest
B Gradient boosting
C AdaBoost
D All of the above
D All of the above
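All three are available as scikit-learn estimators; a minimal sketch with synthetic data: a random forest bags many decorrelated trees, while gradient boosting and AdaBoost fit learners sequentially, each correcting the errors of the ensemble so far.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (AdaBoostClassifier,
                                  GradientBoostingClassifier,
                                  RandomForestClassifier)

    X, y = make_classification(n_samples=200, random_state=0)
    for model in (RandomForestClassifier(n_estimators=50, random_state=0),
                  GradientBoostingClassifier(random_state=0),
                  AdaBoostClassifier(random_state=0)):
        print(type(model).__name__, model.fit(X, y).score(X, y))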
William has a dataset of many recipes and many ingredients. He doesn’t know anything about the recipes except which ingredients occur in each, and he doesn’t know anything about the ingredients except in which recipes they occur.
He’d like to predict for a pair of an ingredient and a recipe (both already in the data) whether the recipe would likely be improved by adding that ingredient.
Which is true?
A He could model the recipes as instances with their ingredients as a single categorical feature, and solve the problem with a decision tree.
B He could model the ingredients as instances and their recipes as a single categorical feature, and solve the problem with a decision tree.
C He could model this as a matrix decomposition problem.
D This problem requires a sequence-to-sequence model.
C He could model this as a matrix decomposition problem.
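A numpy sketch of the matrix-decomposition view (the recipe-by-ingredient matrix is made up): a low-rank factorization embeds recipes and ingredients in a shared latent space, and the reconstructed score for a pair not in the data predicts how well the ingredient fits the recipe.

    import numpy as np

    # R[i, j] = 1 if recipe i contains ingredient j.
    R = np.array([[1, 1, 0, 0],
                  [1, 1, 1, 0],
                  [0, 0, 1, 1],
                  [0, 1, 1, 1]], dtype=float)

    k = 2  # number of latent factors
    U_, S, Vt = np.linalg.svd(R, full_matrices=False)
    U = U_[:, :k] * S[:k]  # recipe embeddings
    V = Vt[:k, :]          # ingredient embeddings

    # High reconstructed values at zero positions suggest good additions.
    print(np.round(U @ V, 2))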
What separates offline learning from reinforcement learning?
A In reinforcement learning the training labels are reinforced through boosting.
B Offline learning can be done without connection to the internet. Reinforcement learning requires reinforcement from a separate server.
C In reinforcement learning, the learner takes actions and receives feedback from the environment. In offline learning we learn from a fixed dataset.
D Reinforcement learning uses backpropagation to approximate the gradient, whereas offline learning uses symbolic computation.
C In reinforcement learning, the learner takes actions and receives feedback from the environment. In offline learning we learn from a fixed dataset.
The most important rule in machine learning is “never judge your performance on the training data.” If we break this rule, what can happen as a consequence?
A The loss surface no longer provides an informative gradient.
B We get cost imbalance.
C We end up choosing a model that overfits the training data.
D We commit multiple testing.
C We end up choosing a model that overfits the training data.
We have a classifier c and a test set. Which is true?
A To compute the precision for c on the test set, we must define how to turn it into a ranking classifier.
B To compute the false positive rate for c on the test set, we must define how to turn it into a ranking classifier.
C To compute the confusion matrix for c on the test set, we must define how to turn it into a ranking classifier.
D To compute the ROC area under the curve for c on the test set, we must define how to turn it into a ranking classifier.
D To compute the ROC area under the curve for c on the test set, we must define how to turn it into a ranking classifier.
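A scikit-learn sketch of the distinction (synthetic data; the classifier choice is arbitrary): precision, the false positive rate, and the confusion matrix need only hard labels, while ROC AUC needs per-instance scores, i.e. a ranking of the test instances.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (confusion_matrix, precision_score,
                                 roc_auc_score)
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=300, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression().fit(X_tr, y_tr)

    # Precision, FPR, and the confusion matrix need only hard labels ...
    y_hat = clf.predict(X_te)
    print(precision_score(y_te, y_hat))
    print(confusion_matrix(y_te, y_hat))  # FPR follows from this matrix

    # ... but ROC AUC needs a ranking of the instances: per-class scores.
    scores = clf.predict_proba(X_te)[:, 1]
    print(roc_auc_score(y_te, scores))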
Testing too many times on the test set increases the chance of random effects influencing your choice of model. Nevertheless, we may need to test many different models and many different hyperparameters. What is the solution suggested in the lectures?
A To withhold the test set, and use a train/validation split on the remainder to evaluate model choices and hyperparameters.
B To normalize the data so that they appear normally distributed. Normalization will not help with this problem.
C To use bootstrap sampling to gauge the variance of the model. Bootstrap sampling (lecture 3 and lecture 10) will help you gauge the variance, but that will not solve this problem.
D To use a boosted ensemble, to reduce the variance of the model, and with it, the probability of random effects.
A To withhold the test set, and use a train/validation split on the remainder to evaluate model choices and hyperparameters.
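A minimal scikit-learn sketch of this protocol (synthetic data; the split proportions are arbitrary): withhold the test set once, do all model and hyperparameter selection on a train/validation split of the remainder, and touch the test set only for the final evaluation.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)

    # Withhold the test set first; never use it for model selection.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Split the remainder into training and validation data; evaluate
    # every candidate model/hyperparameter setting on the validation set.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=0)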
Which answer contains only unsupervised methods and tasks?
A Clustering, Linear regression, Generative modeling
B k-Means, Clustering, Density estimation
C Classification, Clustering, Expectation-Maximization
D Logistic regression, Density estimation, Clustering
B k-Means, Clustering, Density estimation
The two most important conceptual spaces in machine learning are the model space and the feature space. Which is true?
A Every point in the model space represents a loss function that we can choose for our task.
B Every point in the model space represents a single instance in the dataset.
C Every point in the feature space represents a single feature of a single instance in the dataset.
D Every point in the feature space represents a single instance in the dataset.
D Every point in the feature space represents a single instance in the dataset.
The ALVINN system from 1995 was a self-driving car system implemented as a classifier: a grayscale camera was pointed at the road and a classifier was trained to predict the correct position of the steering wheel based on the behavior of a human driver. In this example, which are the instances and which are the features?
A The instances are the different cars the system is deployed in and the features are the angles of the steering wheel.
B The instances are the angles of the steering wheel and the features are the different cars the system is deployed in.
C The instances are the frames produced by the camera and the features are the pixel values.
D The instances are the pixel values and the features are the frames produced by the camera.
C The instances are the frames produced by the camera and the features are the pixel values.
What is the relation between the loss landscape and its gradient?
A The gradient points in the direction that the loss increases the fastest.
B The gradient points in the direction that the loss decreases the fastest.
C The gradient is the region of the loss landscape where the loss is the lowest.
D The gradient is the region of the loss landscape where the loss is the highest.
A The gradient points in the direction that the loss increases the fastest.
Which is a legitimate reason to prefer random search over gradient descent as a search method?
A The loss surface is complicated, so I want the size of the steps to change as I approach a minimum.
B I need to be sure that I’ve found a global minimum.
C My model has multiple layers, so I want to use backpropagation.
D My model is not differentiable.
D My model is not differentiable.
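A numpy sketch of why (the toy loss here is made up): random search only evaluates the loss, never differentiates it, so it copes with a piecewise-constant loss where gradient descent gets no useful signal.

    import numpy as np

    # Piecewise-constant loss: the gradient is 0 or undefined everywhere,
    # so gradient descent cannot make progress. Random search can.
    def loss(w):
        return abs(round(w) - 3)

    rng = np.random.default_rng(0)
    w_best, l_best = 0.0, loss(0.0)
    for _ in range(1000):
        w_new = w_best + rng.normal(scale=1.0)  # random step from the best
        if loss(w_new) < l_best:
            w_best, l_best = w_new, loss(w_new)
    print(w_best, l_best)  # ends near w = 3 with loss 0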
What is the benefit of a convex loss surface?
A It allows us to use the backpropagation algorithm.
B It allows us to use evolutionary methods.
C It ensures that there are no local minima other than the global minimum.
D It allows gradient descent to escape local maxima.
C It ensures that there are no local minima other than the global minimum.