Recall Questions Flashcards

1
Q

A lazy algorithm is a machine learning method that simply stores the training data and refers back to it at evaluation time, instead of learning a model that can be stored independently of the data.

Which of the following methods is a lazy algorithm?

A Linear classification
B Decision trees
C k-Nearest neighbors
D None of the above

A

C k-Nearest neighbors
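
To make the "lazy" idea concrete, here is a minimal k-NN sketch in plain NumPy (all names are illustrative): fit() only stores the data, and all the work is deferred to prediction time.

import numpy as np

class LazyKNN:
    """k-nearest neighbors: 'training' just memorizes the data."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = np.asarray(X), np.asarray(y)  # store, nothing else
        return self

    def predict(self, x):
        # all computation happens at evaluation time
        dists = np.linalg.norm(self.X - x, axis=1)
        nearest = np.argsort(dists)[:self.k]
        vals, counts = np.unique(self.y[nearest], return_counts=True)
        return vals[np.argmax(counts)]

model = LazyKNN(k=3).fit([[0, 0], [1, 1], [2, 2], [9, 9]], [0, 0, 1, 1])
print(model.predict(np.array([1.5, 1.5])))  # -> 0 (majority vote of the 3 nearest points)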

2
Q

I want to predict house prices from a set of examples, based on two attributes: surface area and the local crime rate.
I create a scatterplot with the surface area of the house on the horizontal axis and the crime rate on the vertical. I plot each house in my dataset as a point in these axes.

What have I drawn?
A the model space
B the loss curve
C the feature space
D the output space

A

C the feature space
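
A minimal matplotlib sketch of such a feature-space plot (the data values are made up): every instance becomes one point, with one axis per feature.

import matplotlib.pyplot as plt

surface_area = [50, 80, 120, 200]   # m^2, made-up values
crime_rate = [3.1, 1.2, 0.8, 0.5]   # local crime rate, made-up values

plt.scatter(surface_area, crime_rate)  # one point per house: this is the feature space
plt.xlabel("surface area")
plt.ylabel("crime rate")
plt.show()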

3
Q

How are random search and gradient descent related?

A Gradient descent is an approximation to random search.
B Random search is an approximation to gradient descent.
C Gradient descent is like random search but with a smoothed loss surface.
D Random search is like gradient descent but with a smoothed loss surface.

A

B Random search is an approximation to gradient descent.
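
A sketch of why B holds, on a toy one-dimensional loss (names and numbers are illustrative): random search proposes a small random step and keeps it only when the loss improves, which on average follows the same downhill direction that gradient descent computes exactly.

import random

def loss(w):                # toy loss surface with its minimum at w = 3
    return (w - 3.0) ** 2

w, step = 0.0, 0.1
for _ in range(1000):
    candidate = w + random.uniform(-step, step)  # random perturbation
    if loss(candidate) < loss(w):                # keep only improvements
        w = candidate
print(round(w, 2))  # ends up very close to the minimum at 3.0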

4
Q

In the slides, we get the advice that “sometimes your loss function should not be the same as your evaluation function.”

Why not?

A The evaluation function may not provide a smooth loss surface.
B The evaluation function may be poorly chosen.
C The evaluation function may not be linear.
D The evaluation function may not be computable.

A

A The evaluation function may not provide a smooth loss surface.
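
A concrete example of the problem: accuracy, a common evaluation function, is piecewise constant in the parameters, so its gradient is zero almost everywhere; training instead uses a smooth surrogate such as log loss. A minimal sketch (illustrative data):

import numpy as np

def accuracy(w, X, y):
    # evaluation function: flat almost everywhere, so gradients are useless
    return np.mean((X @ w > 0).astype(int) == y)

def log_loss(w, X, y):
    # training loss: smooth in w, so gradient descent can follow it downhill
    p = 1 / (1 + np.exp(-(X @ w)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 1.5], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])
w = np.array([0.5, 0.1])
print(accuracy(w, X, y), log_loss(w, X, y))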

5
Q

It is common practice in machine learning to separate out a training set and a test set. Often, we then split the training data again, to get a validation set. Which is false?

A The validation set is not used until the end of the project.
B We do this to avoid multiple testing on the test set.
C We use the validation set for hyperparameter optimization.
D The test set is ideally used only once.

A

A The validation set is not used until the end of the project.
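
One common way to produce the three sets, sketched with scikit-learn's train_test_split (the 60/20/20 proportions are just an example):

from sklearn.model_selection import train_test_split
import numpy as np

X, y = np.arange(100).reshape(50, 2), np.arange(50) % 2

# first split off the test set, then split the remainder again
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
# use X_val for hyperparameter tuning; touch X_test ideally only once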

6
Q

Which answer describes the precision?

A The proportion of the actual positives that were classified as positive.
B The proportion of the instances classified as positive that are actually positive.
C The proportion of the actual negatives that were classified as negative.
D The proportion of the instances classified as negative that are actually negative.

A

B The proportion of the instances classified as positive that are actually positive.
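
In confusion-matrix terms: precision = TP / (TP + FP), conditioning on the prediction being positive, whereas recall = TP / (TP + FN) conditions on the actual class. A tiny sketch:

def precision(tp, fp):
    # of everything we *called* positive, how much really was?
    return tp / (tp + fp)

def recall(tp, fn):
    # of everything that *is* positive, how much did we find?
    return tp / (tp + fn)

print(precision(tp=8, fp=2), recall(tp=8, fn=4))  # 0.8 and ~0.67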

7
Q

Imagine a machine learning task where the instances are customers.

You know the phone number for each customer and their occupation (one of seven categories). You’re wondering how to turn these into features.

Which is false?

A You can extract several useful categoric features from the phone number.
B The phone number is an integer, so you should use it as a numeric feature.
C Whether to use the occupation directly or turn it into a numeric feature depends on the model.
D For some models, you may want to turn the occupation into several numeric features.

A

B The phone number is an integer, so you should use it as a numeric feature.
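
The reason B is false: the digits of a phone number have no meaningful magnitude or ordering, but parts of the number do carry categoric information. An illustrative sketch (the parsing rule is invented for the example):

def phone_features(number: str) -> dict:
    # treat parts of the number as categories, never as one big integer
    return {
        "country_code": number[:3],  # e.g. "+31": a useful categoric feature
        "area_code": number[3:5],    # region: also categoric
    }

print(phone_features("+31201234567"))  # {'country_code': '+31', 'area_code': '20'}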

8
Q

The slides mention two ways to adapt a categoric feature for a classifier that only accepts numeric features: integer coding and one-hot coding.
Which is true?

A One-hot coding always turns one categoric feature into one numeric feature.
B Integer coding always turns one categoric feature into one numeric feature.
C Integer coding becomes inefficient if there are too many values.
D One-hot coding becomes inefficient if there are too few categories.

A

B Integer coding always turns one categoric feature into one numeric feature.
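
A sketch of both codings for a feature with three values (plain Python, illustrative categories). Note how integer coding always yields exactly one number, while one-hot yields one 0/1 feature per category, which is why one-hot grows expensive with many categories.

categories = ["student", "nurse", "farmer"]

def integer_code(value):
    # one categoric feature -> exactly one number (but imposes a fake ordering)
    return categories.index(value)

def one_hot_code(value):
    # one categoric feature -> as many 0/1 features as there are categories
    return [1 if c == value else 0 for c in categories]

print(integer_code("nurse"))  # 1
print(one_hot_code("nurse"))  # [0, 1, 0]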

9
Q

Which is false?
A In PCA, the first principal component provides the direction of greatest variance.
B PCA is a supervised method.
C PCA can be used for dimensionality reduction.
D PCA can be used for data preprocessing.

A

B PCA is a supervised method.
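
A minimal scikit-learn sketch showing why B is false: PCA's fit never sees labels (it is unsupervised), and it is routinely used to reduce dimensionality as a preprocessing step.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 5))  # unlabeled data

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # no labels passed: unsupervised
print(X_reduced.shape)            # (100, 2): dimensionality reduced from 5 to 2
print(pca.components_[0])         # first principal component: direction of greatest variance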

10
Q

We are performing classification.
We represent our instance by the random variable X and its class by the random variable Y.

Which is true?

A Generative modeling is training a model for p(X | Y) and computing p(Y | X) from that.
B Discriminative modeling is training a model for p(X | Y) and computing p(Y | X) from that.
C Generative modeling can only be done through the EM algorithm.
D Discriminative modeling can only be done through the EM algorithm.

A

A Generative modeling is training a model for p(X | Y) and computing p(Y | X) from that.
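
The computation behind answer A is Bayes' rule: having modeled the class-conditional distribution p(X | Y) and the class prior p(Y), the posterior follows as

p(Y | X) = p(X | Y) p(Y) / p(X),

where p(X) is obtained by summing p(X | Y) p(Y) over all classes. A discriminative model skips this and fits p(Y | X) directly.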

11
Q

Toxoplasmosis is a relatively harmless parasitic infection that usually causes no obvious symptoms.

Which statement is acceptable from a Bayesian perspective, but not from a frequentist perspective?

Note that we don’t care whether the statement is correct, just whether it fits these frameworks.

A One in five Dutch people has toxoplasmosis.
B Being Dutch, the probability that Fred has toxoplasmosis is 0.2.
C The mean age of people with toxoplasmosis is 54.
D The probability that a person chosen at random from the Dutch population has toxoplasmosis is 0.2.

A

B Being Dutch, the probability that Fred has toxoplasmosis is 0.2.

12
Q

How does stochastic gradient descent (SGD) differ from regular gradient descent?

A SGD is used to train stochastic models instead of deterministic ones.
B SGD trains in epochs, regular gradient descent doesn’t.
C SGD uses the loss over a small subset of the data.
D SGD only works on neural networks.

A

C SGD uses the loss over a small subset of the data.
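
A sketch of the difference, assuming some gradient function grad(w, X, y) is available (that function is a placeholder, not defined here): regular gradient descent computes the loss gradient over all the data for every step, while SGD uses a small random batch per step.

import numpy as np

def sgd_epoch(w, X, y, grad, lr=0.01, batch_size=16):
    """One epoch of SGD: each step uses the loss over a small batch only."""
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w = w - lr * grad(w, X[batch], y[batch])  # vs. grad(w, X, y) in full gradient descent
    return w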

13
Q

Which is false?

A Autodiff combines aspects of symbolic differentiation and numeric differentiation.
B Autodiff computes the gradient but only for a specific input.
C Autodiff is an alternative to backpropagation.
D Autodiff boils down to repeated application of the chain rule.

A

C Autodiff is an alternative to backpropagation.
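
Why C is false: backpropagation is (reverse-mode) autodiff, not an alternative to it. And a minimal forward-mode sketch of the point in B: the derivative comes out as a number, alongside the value, for one specific input, with each operation applying the chain rule.

class Dual:
    """Forward-mode autodiff: carry (value, derivative) through each operation."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        # product rule: one application of the chain rule per operation
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

x = Dual(3.0, 1.0)   # seed dx/dx = 1: the gradient is for this input only
y = x * x + x        # f(x) = x^2 + x
print(y.val, y.dot)  # 12.0 7.0  (f(3) = 12, f'(3) = 7)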

14
Q

Which is false?

A The kernel trick allows us to use support vector machines as a loss function in neural networks.
B The kernel trick allows us to compute SVMs in a high dimensional space.
C The SVM algorithm computes the maximum margin hyperplane.
D The SVM algorithm can be computed without using the kernel trick.

A

A The kernel trick allows us to use support vector machines as a loss function in neural networks.
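
A scikit-learn sketch of the two true mechanics: an SVM trained with the kernel trick (implicitly working in a high-dimensional space) and one trained without it (a plain linear kernel, still computing a maximum-margin hyperplane).

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.1, random_state=0)

rbf = SVC(kernel="rbf").fit(X, y)        # kernel trick: implicit high-dimensional feature space
lin = SVC(kernel="linear").fit(X, y)     # no kernel trick; still a maximum-margin hyperplane
print(rbf.score(X, y), lin.score(X, y))  # the kernelized SVM fits this non-linear data better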

15
Q

Which is true?

A A maximum likelihood objective for least-squares regression does not provide a smooth loss surface.
B The least-squares loss function for linear regression can be derived from a maximum likelihood objective.
C Linear regression can be performed with a maximum likelihood objective but the results will be different from the least-squares version.
D The loss function for logistic regression is derived from assuming a normal distribution on the residuals.

A

B The least-squares loss function for linear regression can be derived from a maximum likelihood objective.
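
The derivation behind B, in brief: assume y_i = w · x_i + b + ε_i with Gaussian noise ε_i ~ N(0, σ²). The log-likelihood of the data is

log p(y | x, w, b) = Σ_i [ −log(σ√(2π)) − (y_i − w · x_i − b)² / (2σ²) ],

and since the first term and the factor 1/(2σ²) do not depend on w and b, maximizing the likelihood is exactly minimizing the sum of squared residuals. This is also why C is false: the two objectives give the same solution.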

16
Q

Which statement is false? [bonus question, due to multiple correct answers]

A The entropy is the expected codelength using an optimal code.
B The relative entropy is the KL divergence minus the entropy.
C The KL divergence is the difference in expected codelength between the optimal code and another.
D The KL divergence is the relative entropy minus the entropy.

A

B The relative entropy is the KL divergence minus the entropy.

D The KL divergence is the relative entropy minus the entropy.
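
A small numeric check of why B and D are false: "relative entropy" is simply another name for the KL divergence, and the KL divergence equals the cross-entropy minus the entropy, i.e. the extra expected codelength paid for using the wrong code.

import math

p = [0.5, 0.25, 0.25]  # true distribution
q = [0.25, 0.5, 0.25]  # distribution the code was optimized for

entropy = -sum(pi * math.log2(pi) for pi in p)                    # optimal expected codelength
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))  # expected codelength using q's code
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(math.isclose(kl, cross_entropy - entropy))  # True: KL = the extra bits paid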

17
Q

What is the relation between the k-Means algorithm and the EM algorithm?

A The EM algorithm is a simplified version of the k-Means algorithm.
B k-Means is a simplified version of the EM algorithm.
C k-Means is to k-Nearest neighbors as the EM algorithm is to Support Vector Machines.
D k-Means is to k-Nearest neighbors as Support Vector Machines are to the EM algorithm.

A

B k-Means is a simplified version of the EM algorithm.
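
A sketch of the correspondence (naive initialization, illustrative data): k-means alternates hard assignments, a simplified E step, with mean updates, the M step; full EM for a Gaussian mixture would instead use soft responsibilities.

import numpy as np

def kmeans(X, k, iters=10):
    means = X[:k].astype(float)  # naive init; real k-means picks starting means randomly
    for _ in range(iters):
        # "E step", simplified: hard-assign each point to its nearest mean
        labels = np.argmin(np.linalg.norm(X[:, None] - means[None], axis=2), axis=1)
        # "M step": each mean becomes the average of its assigned points
        means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return means, labels

X = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], dtype=float)
print(kmeans(X, k=2)[0])  # two means, one per blob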

18
Q

I’m training a neural network. I notice that during training, the loss on the training data goes to zero, but the loss on the validation set doesn’t get any better than chance.

Which is true?

A The model is overfitting. A good solution is to increase the model capacity.
B The model is overfitting. A good solution is to add L2-regularization.
C The model is suffering from vanishing gradients. A good solution is to use sigmoid activations.
D The model is suffering from vanishing gradients. A good solution is to increase the batch size.

A

B The model is overfitting. A good solution is to add L2-regularization.
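
L2 regularization in one line: add a penalty on the squared weight norm to the training loss, so the model is discouraged from fitting the training data with large, memorizing weights. A minimal sketch (lam and the data loss are placeholders):

import numpy as np

def regularized_loss(w, data_loss, lam=0.01):
    # data_loss: the ordinary training loss; the penalty shrinks the weights
    return data_loss + lam * np.sum(w ** 2)

print(regularized_loss(np.array([3.0, -2.0]), data_loss=1.25))  # 1.25 + 0.01 * 13 = 1.38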

19
Q

When training generative models, mode collapse is an important problem.

Which is false?

A Generative Adversarial Networks are a way to train generative models, while avoiding mode collapse.
B Variational Autoencoders are a way to train generative models, while avoiding mode collapse.
C Generative Adversarial Networks avoid mode collapse by learning a network that maps each instance to a latent variable.
D Variational Autoencoders avoid mode collapse by learning a network that maps each instance to a latent variable.

A

C Generative Adversarial Networks avoid mode collapse by learning a network that maps each instance to a latent variable.

20
Q

Which is false?

A Decision trees do not deal with categorical data naturally. To use such data we must convert it to one-hot vectors.
B Decision trees do not deal with numeric data naturally. To use such data we must choose a value to split on.
C The standard decision tree algorithm (without pruning) operates greedily: once it has chosen a split, it will never reconsider that decision.
D When splitting a numeric feature, we must choose a threshold value to split on.

A

A Decision trees do not deal with categorical data naturally. To use such data we must convert it to one-hot vectors.
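
A sketch of the numeric-split point in B and D: try candidate thresholds on the feature and keep the one that best separates the labels, here scored by weighted Gini impurity (all names are illustrative).

import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Pick the threshold on numeric feature x that best separates labels y."""
    best_t, best_score = None, float("inf")
    for t in np.unique(x)[:-1]:  # candidate split points
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))  # 3.0: cleanly separates the two classes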