Recall Questions Flashcards
A lazy algorithm is a machine learning method that simply stores the data and refers back to it during evaluation, instead of training to establish a model that can be stored independently of the data.
Which of the following methods is a lazy algorithm?
A Linear classification
B Decision trees
C k-Nearest neighbors
D None of the above
C k-Nearest neighbors
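To make the "lazy" idea concrete, here is a minimal k-NN sketch (class and method names are my own, not from the slides): fit() does no training at all, it only stores the data, and all computation is deferred to prediction time.

import numpy as np

class KNNClassifier:
    # lazy learner: "training" just stores the data
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = np.asarray(X), np.asarray(y)  # no model is built here
        return self

    def predict(self, x):
        # all real work happens at evaluation time, by referring back to the data
        dists = np.linalg.norm(self.X - x, axis=1)
        nearest = self.y[np.argsort(dists)[:self.k]]
        values, counts = np.unique(nearest, return_counts=True)
        return values[np.argmax(counts)]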
I want to predict house prices from a set of examples, based on two attributes: surface area and the local crime rate.
I create a scatterplot with the surface area of the house on the horizontal axis and the crime rate on the vertical. I plot each house in my dataset as a point in these axes.
What have I drawn?
A the model space
B the loss curve
C the feature space
D the output space
C the feature space
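As a sketch (with made-up values), the plot described above is just a scatterplot over the two features; the price, the thing we want to predict, is not one of the axes.

import matplotlib.pyplot as plt

surface_area = [120, 85, 200, 60]  # hypothetical example values
crime_rate = [0.2, 0.5, 0.1, 0.8]

# each house is one point in the two-dimensional feature space
plt.scatter(surface_area, crime_rate)
plt.xlabel("surface area")
plt.ylabel("local crime rate")
plt.show()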
How are random search and gradient descent related?
A Gradient descent is an approximation to random search.
B Random search is an approximation to gradient descent.
C Gradient descent is like random search but with a smoothed loss surface.
D Random search is like gradient descent but with a smoothed loss surface.
B Random search is an approximation to gradient descent.
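A minimal sketch of why random search approximates gradient descent (the toy loss is my own construction): it tries a small random step and keeps it only if the loss improves, so on average the accepted steps point roughly along the negative gradient.

import numpy as np

def loss(w):
    return (w[0] - 1) ** 2 + (w[1] + 2) ** 2  # a toy, smooth loss surface

w = np.zeros(2)
for _ in range(10_000):
    candidate = w + 0.01 * np.random.randn(2)  # small random step
    if loss(candidate) < loss(w):              # keep only improvements
        w = candidate
# w ends up near the minimum (1, -2), much as gradient descent would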
In the slides, we get the advice that “sometimes your loss function should not be the same as your evaluation function.”
Why not?
A The evaluation function may not provide a smooth loss surface.
B The evaluation function may be poorly chosen.
C The evaluation function may not be linear.
D The evaluation function may not be computable.
A The evaluation function may not provide a smooth loss surface.
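For example (a sketch, with made-up numbers): accuracy, a typical evaluation function, is a step function of the model parameters, so it gives a loss surface that is flat almost everywhere, while a loss like log-loss changes smoothly and can be followed by gradient descent.

import numpy as np

y = np.array([0, 1, 1, 0])           # true classes
p = np.array([0.4, 0.6, 0.7, 0.2])   # predicted probabilities from some model

accuracy = np.mean((p > 0.5) == y)   # jumps in steps as the parameters change
log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # smooth in p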
It is common practice in machine learning to separate out a training set and a test set. Often, we then split the training data again, to get a validation set. Which is false?
A The validation set is not used until the end of the project.
B We do this to avoid multiple testing on the test set.
C We use the validation set for hyperparameter optimization.
D The test set is ideally used only once.
A The validation set is not used until the end of the project.
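A common pattern with scikit-learn, splitting twice (the data and proportions here are just placeholders):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.randn(100, 4), np.random.randint(0, 2, 100)

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25)

# tune hyperparameters against the validation set throughout the project;
# ideally, touch the test set only once, at the very end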
Which answer describes the precision?
A The proportion of the actual positives that were classified as positive.
B The proportion of the instances classified as positive that are actually positive.
C The proportion of the actual negatives that were classified as negative.
D The proportion of the instances classified as negative that are actually negative.
B The proportion of the instances classified as positive that are actually positive.
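In terms of confusion-matrix counts (hypothetical numbers), option B is precision and option A is recall:

TP, FP, FN, TN = 40, 10, 5, 45  # made-up counts

precision = TP / (TP + FP)  # of everything classified positive, the truly positive fraction
recall = TP / (TP + FN)     # of the actual positives, the fraction we found (option A)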
Imagine a machine learning task where the instances are customers.
You know the phone number for each customer and their occupation (one of seven categories). You’re wondering how to turn these into features.
Which is false?
A You can extract several useful categoric features from the phone number.
B The phone number is an integer, so you should use it as a numeric feature.
C Whether to use the occupation directly or turn it into a numeric feature depends on the model.
D For some models, you may want to turn the occupation into several numeric features.
B The phone number is an integer, so you should use it as a numeric feature.
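A sketch of option A, with hypothetical fields: the useful information in a phone number is categoric, and treating the whole number as one integer would impose a meaningless ordering on customers.

phone = "+31 20 1234567"  # made-up number

country_code = phone.split()[0]  # e.g. "+31": a categoric feature
area_code = phone.split()[1]     # e.g. "20": another categoric feature
# by contrast, int("31201234567") as a numeric feature would suggest that
# customers with "nearby" numbers are similar, which is meaningless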
The slides mention two ways to adapt a categoric feature for a classifier that only accepts numeric features: integer coding and one-hot coding.
Which is true?
A One-hot coding always turns one categoric feature into one numeric feature.
B Integer coding always turns one categoric feature into one numeric feature.
C Integer coding becomes inefficient if there are too many values.
D One-hot coding becomes inefficient if there are too few categories.
B Integer coding always turns one categoric feature into one numeric feature.
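A minimal sketch of both codings for the occupation feature (category names made up):

occupations = ["teacher", "nurse", "farmer", "teacher"]
categories = sorted(set(occupations))  # seven possible values in the task above

# integer coding: one categoric feature becomes one numeric feature (option B)
integer_coded = [categories.index(o) for o in occupations]

# one-hot coding: one categoric feature becomes one numeric feature per category,
# which grows inefficient when there are many categories
one_hot = [[1 if o == c else 0 for c in categories] for o in occupations]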
Which is false?
A In PCA, the first principal component provides the direction of greatest variance.
B PCA is a supervised method.
C PCA can be used for dimensionality reduction.
D PCA can be used for data preprocessing.
B PCA is a supervised method.
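A quick scikit-learn sketch showing why B is false: PCA is fit on X alone, with no labels anywhere.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(100, 5)  # unlabeled data

pca = PCA(n_components=2)
Z = pca.fit_transform(X)     # dimensionality reduction: 5 features down to 2
# pca.components_[0] is the direction of greatest variance (option A)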
We are performing classification.
We represent our instance by the random variable X and its class by the random variable Y.
Which is true?
A Generative modeling is training a model for p(X | Y) and computing p(Y | X) from that.
B Discriminative modeling is training a model for p(X | Y) and computing p(Y | X) from that.
C Generative modeling can only be done through the EM algorithm.
D Discriminative modeling can only be done through the EM algorithm.
A Generative modeling is training a model for p(X | Y) and computing p(Y | X) from that.
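The computation in option A is Bayes' rule. A sketch for a discrete class variable, with made-up probabilities:

import numpy as np

p_y = np.array([0.3, 0.7])           # class prior p(Y)
p_x_given_y = np.array([0.1, 0.4])   # generative model p(X = x | Y), for one observed x

# Bayes' rule: p(Y | X = x) is proportional to p(X = x | Y) * p(Y)
joint = p_x_given_y * p_y
p_y_given_x = joint / joint.sum()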
Toxoplasmosis is a relatively harmless parasitic infection that usually causes no obvious symptoms.
Which statement is acceptable from a Bayesian perspective, but not from a frequentist perspective?
Note that we don’t care whether the statement is correct, just whether it fits these frameworks.
A One in five Dutch people has toxoplasmosis.
B Being Dutch, the probability that Fred has toxoplasmosis is 0.2.
C The mean age of people with toxoplasmosis is 54.
D The probability that a person chosen at random from the Dutch population has toxoplasmosis is 0.2.
B Being Dutch, the probability that Fred has toxoplasmosis is 0.2.
How does stochastic gradient descent (SGD) differ from regular gradient descent?
A SGD is used to train stochastic models instead of deterministic ones.
B SGD trains in epochs, regular gradient descent doesn’t.
C SGD uses the loss over a small subset of the data.
D SGD only works on neural networks.
C SGD uses the loss over a small subset of the data.
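A sketch of the difference (the linear model and MSE gradient are chosen just for illustration): regular gradient descent would use all of X for every step; SGD uses one small batch at a time.

import numpy as np

X, y = np.random.randn(1000, 3), np.random.randn(1000)
w, lr, batch_size = np.zeros(3), 0.01, 32

def grad(w, Xb, yb):
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)  # MSE gradient on a batch

for epoch in range(10):
    idx = np.random.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]  # a small subset of the data
        w -= lr * grad(w, X[b], y[b])      # gradient of the loss over the batch only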
Which is false?
A Autodiff combines aspects of symbolic differentiation and numeric differentiation.
B Autodiff computes the gradient but only for a specific input.
C Autodiff is an alternative to backpropagation.
D Autodiff boils down to repeated application of the chain rule.
C Autodiff is an alternative to backpropagation.
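A PyTorch sketch of options B and D: autodiff applies the chain rule through the recorded operations and yields the exact gradient, but only at a specific input, not as a symbolic formula. (Backpropagation is a special case of autodiff, not an alternative.)

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.sin(x) * x ** 2  # operations are recorded as a computation graph
y.backward()               # repeated chain rule, backwards through the graph
print(x.grad)              # dy/dx evaluated at x = 2.0 only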
Which is false?
A The kernel trick allows us to use support vector machines as a loss function in neural networks.
B The kernel trick allows us to compute SVMs in a high dimensional space.
C The SVM algorithm computes the maximum margin hyperplane.
D The SVM algorithm can be computed without using the kernel trick.
A The kernel trick allows us to use support vector machines as a loss function in neural networks.
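A sketch of the trick itself, using the quadratic kernel in two dimensions: the kernel gives the dot product in a higher-dimensional space without ever constructing that space.

import numpy as np

def phi(x):
    # explicit feature map belonging to the kernel k(x, z) = (x . z) ** 2
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])

explicit = phi(x) @ phi(z)  # dot product in the expanded 3D space
trick = (x @ z) ** 2        # same value, computed in the original 2D space
assert np.isclose(explicit, trick)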
Which is true?
A A maximum likelihood objective for least-squares regression does not provide a smooth loss surface.
B The least-squares loss function for linear regression can be derived from a maximum likelihood objective.
C Linear regression can be performed with a maximum likelihood objective but the results will be different from the least-squares version.
D The loss function for logistic regression is derived from assuming a normal distribution on the residuals.
B The least-squares loss function for linear regression can be derived from a maximum likelihood objective.
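The derivation behind option B, in outline: assume y_i = w^T x_i + e_i with Gaussian noise e_i ~ N(0, sigma^2). Maximizing the log-likelihood then reduces to minimizing the sum of squared residuals:

\arg\max_{w} \sum_i \log N(y_i \mid w^\top x_i, \sigma^2)
  = \arg\max_{w} \sum_i \left( -\frac{(y_i - w^\top x_i)^2}{2\sigma^2} - \log\left(\sigma\sqrt{2\pi}\right) \right)
  = \arg\min_{w} \sum_i (y_i - w^\top x_i)^2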