Chapter 1: The Machine Learning Landscape Flashcards

Question 1

Q

How would you define Machine Learning?

Answer

A

Machine Learning is about building systems that can learn from data. Learning
means getting better at some task, given some performance measure.

Question 2

Q

Can you name four types of problems where Machine Learning shines?

Answer

A

Machine Learning is great for complex problems for which we have no algorith‐
mic solution, to replace long lists of hand-tuned rules, to build systems that adapt
to fluctuating environments, and finally to help humans learn (e.g., data mining).

Question 3

Q

What is a labeled training set?

Answer

A

A labeled training set is a training set that contains the desired solution (a.k.a. a
label) for each instance.

Question 4

Q

What are the two most common supervised tasks?

Answer

A

The two most common supervised tasks are regression and classification.

Question 5

Q

Can you name four common unsupervised tasks?

Answer

A

Common unsupervised tasks include clustering, visualization, dimensionality
reduction, and association rule learning.

Question 6

Q

What type of Machine Learning algorithm would you use to allow a robot to
walk in various unknown terrains?

Answer

A

Reinforcement Learning is likely to perform best if we want a robot to learn to
walk in various unknown terrains, since this is typically the type of problem that
Reinforcement Learning tackles. It might be possible to express the problem as a
supervised or semisupervised learning problem, but it would be less natural.

Question 7

Q

What type of algorithm would you use to segment your customers into multiple
groups?

Answer

A

If you don’t know how to define the groups, then you can use a clustering algo‐
rithm (unsupervised learning) to segment your customers into clusters of similar
customers. However, if you know what groups you would like to have, then you can feed many examples of each group to a classification algorithm (supervised
learning), and it will classify all your customers into these groups

Question 8

Q

Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?

Answer

A

Spam detection is a typical supervised learning problem: the algorithm is fed
many emails along with their labels (spam or not spam).

Question 9

Q

What is an online learning system?

Answer

A

An online learning system can learn incrementally, as opposed to a batch learn‐
ing system. This makes it capable of adapting rapidly to both changing data and
autonomous systems, and of training on very large quantities of data.

Question 10

Q

What is out-of-core learning?

Answer

A

Out-of-core algorithms can handle vast quantities of data that cannot fit in a
computer’s main memory. An out-of-core learning algorithm chops the data into
mini-batches and uses online learning techniques to learn from these minibatches

Question 11

Q

What type of learning algorithm relies on a similarity measure to make predictions?

Answer

A

An instance-based learning system learns the training data by heart; then, when
given a new instance, it uses a similarity measure to find the most similar learned
instances and uses them to make predictions.

Question 12

Q

What is the difference between a model parameter and a learning algorithm’s
hyperparameter?

Answer

A

A model has one or more model parameters that determine what it will predict
given a new instance (e.g., the slope of a linear model). A learning algorithm tries
to find optimal values for these parameters such that the model generalizes well
to new instances. A hyperparameter is a parameter of the learning algorithm
itself, not of the model (e.g., the amount of regularization to apply).

Question 13

Q

What do model-based learning algorithms search for? What is the most common
strategy they use to succeed? How do they make predictions?

Answer

A

Model-based learning algorithms search for an optimal value for the model
parameters such that the model will generalize well to new instances. We usually
train such systems by minimizing a cost function that measures how bad the sys‐
tem is at making predictions on the training data, plus a penalty for model com‐
plexity if the model is regularized. To make predictions, we feed the new
instance’s features into the model’s prediction function, using the parameter val‐
ues found by the learning algorithm.

Question 14

Q

Can you name four of the main challenges in Machine Learning?

Answer

A

Some of the main challenges in Machine Learning are the lack of data, poor data
quality, nonrepresentative data, uninformative features, excessively simple mod‐
els that underfit the training data, and excessively complex models that overfit
the data.

Question 15

Q

If your model performs great on the training data but generalizes poorly to new
instances, what is happening? Can you name three possible solutions?

Answer

A

If a model performs great on the training data but generalizes poorly to new
instances, the model is likely overfitting the training data (or we got extremely
lucky on the training data). Possible solutions to overfitting are getting more
data, simplifying the model (selecting a simpler algorithm, reducing the number
of parameters or features used, or regularizing the model), or reducing the noise
in the training data.

Question 16

Q

What is a test set, and why would you want to use it?

Answer

Study These Flashcards

A

A test set is used to estimate the generalization error that a model will make on
new instances, before the model is launched in production.

Question 17

Q

What is the purpose of a validation set?

Answer

Study These Flashcards

A

A validation set is used to compare models. It makes it possible to select the best
model and tune the hyperparameters.

Question 18

Q

What is the train-dev set, when do you need it, and how do you use it?

Answer

Study These Flashcards

A

The train-dev set is used when there is a risk of mismatch between the training
data and the data used in the validation and test datasets (which should always be
as close as possible to the data used once the model is in production). The traindev set is a part of the training set that’s held out (the model is not trained on it).
The model is trained on the rest of the training set, and evaluated on both the
train-dev set and the validation set. If the model performs well on the training set
but not on the train-dev set, then the model is likely overfitting the training set. If
it performs well on both the training set and the train-dev set, but not on the val‐
idation set, then there is probably a significant data mismatch between the train‐
ing data and the validation + test data, and you should try to improve the
training data to make it look more like the validation + test data.

Question 19

Q

What can go wrong if you tune hyperparameters using the test set?

Answer

Study These Flashcards

A

If you tune hyperparameters using the test set, you risk overfitting the test set,
and the generalization error you measure will be optimistic (you may launch a
model that performs worse than you expect).

Chapter 1: The Machine Learning Landscape Flashcards

(19 cards)