Fundamentals Flashcards

Question 1

Q

Training Set

Answer

A

The examples that the system uses to learn are called the training set. Each training example is called a training instance (or sample).

Question 2

Q

Why Machine Learning?

Answer

A

ML programs are often much shorter, easier to maintain, and most likely more accurate than hand-written rule-based programs.
The ML program learns automatically.
ML solves problems that are either too complex for traditional approaches or have no known algorithm.
ML can help humans learn.

Question 3

Q

What is Machine Learning great for?

Answer

A

Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
Fluctuating environments: a Machine Learning system can adapt to new data.
Getting insights about complex problems and large amounts of data, or finding patterns in them.

Question 4

Q

How are Machine Learning systems categorized?

Answer

A

Whether or not they are trained with human supervision (supervised, unsupervised, semi-supervised, and Reinforcement Learning)
Whether or not they can learn incrementally on the fly (online versus batch learning)
Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)

Question 5

Q

What are typical supervised learning tasks?

Answer

A

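Classification (to predict the category an instance belongs to)
Regression (to predict a numeric target value)
Anomaly detection (to detect outliers) and association rule learning (to discover interesting relationships between attributes) are sometimes listed here too, but they are usually treated as unsupervised tasks because they do not require labeled data.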

Question 6

Q

What is the difference between attribute and feature?

Answer

A

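An attribute is a data type (e.g., "mileage"), while a feature generally means an attribute plus its value (e.g., "mileage = 15,000"). In practice, the two words are often used interchangeably.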

Question 7

Q

Which are the most important supervised learning algorithms?

Answer

A

k-Nearest Neighbors
Linear Regression
Logistic Regression
Support Vector Machines (SVMs)
Decision Trees and Random Forests
Neural networks

Question 8

Q

Which are the most important unsupervised learning algorithms?

Answer

A

Clustering (groups similar data)
- k-Means
- Hierarchical Cluster Analysis (HCA)
- Expectation Maximization
Visualization (plots 2D or 3D representations) and dimensionality reduction (data simplification)
- Principal Component Analysis (PCA)
- Kernel PCA
- Locally-Linear Embedding (LLE)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
Association rule learning
- Apriori
- Eclat

Question 9

Q

What is feature extraction?

Answer

A

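Merging several correlated features into one that carries most of the same information. It is a form of dimensionality reduction: the data gets simpler while losing as little information as possible.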
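
As a rough illustration, the sketch below merges two correlated features into one. The car data, the function names `standardize` and `extract_wear`, and the choice to average standardized values are all assumptions made for this example; a dimensionality reduction algorithm such as PCA would learn the combination from the data instead of hard-coding it.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def extract_wear(mileages, ages):
    """Merge two correlated features into a single hypothetical 'wear' feature.

    Both inputs are standardized first so that neither dominates the average.
    """
    z_mileage = standardize(mileages)
    z_age = standardize(ages)
    return [(m + a) / 2 for m, a in zip(z_mileage, z_age)]

# Made-up data: older cars tend to have a higher mileage.
mileages = [10_000, 50_000, 90_000, 130_000]
ages = [1, 4, 8, 11]

wear = extract_wear(mileages, ages)
print([round(w, 2) for w in wear])
```

The model can then be trained on the single `wear` column instead of the two original ones.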

Question 10

Q

When should you use online learning algorithms?

Answer

A

When you need a reactive system, e.g., a stock price predictor.
When autonomous learning is needed, e.g., a rover on Mars.
When resources are limited, e.g., a smartphone app.

Question 11

Q

What is the learning rate?

Answer

A

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate.
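
A minimal sketch of the idea, assuming a toy "model" that is just a running estimate of a stream of numbers (the function names and the data are made up for illustration). A high learning rate adapts quickly to new data but also forgets old data quickly; a low one has more inertia.

```python
def online_update(estimate, new_value, learning_rate):
    """Move the current estimate toward the new observation.

    learning_rate is the fraction of the gap that is closed at each step.
    """
    return estimate + learning_rate * (new_value - estimate)

def run_stream(stream, learning_rate, start=0.0):
    """Feed the observations one at a time, as an online learner would."""
    estimate = start
    for value in stream:
        estimate = online_update(estimate, value, learning_rate)
    return estimate

# The data shifts abruptly from 10 to 20 halfway through the stream.
stream = [10.0] * 5 + [20.0] * 5

fast = run_stream(stream, learning_rate=0.9)
slow = run_stream(stream, learning_rate=0.1)
print(f"fast learner: {fast:.2f}, slow learner: {slow:.2f}")
```

With these numbers the fast learner ends very close to 20, while the slow learner is still roughly halfway between the old and the new level.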

Question 12

Q

What is instance-based learning?

Answer

A

The system learns the examples by heart, then generalizes to new cases using a similarity measure.
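
One common instance-based method is k-Nearest Neighbors. The sketch below is a simplified version, assuming Euclidean distance as the similarity measure (a smaller distance means more similar) and a made-up two-feature dataset; the function names are illustrative only.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Distance between two feature tuples; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(training_set, new_instance, k=3):
    """Predict a label by majority vote among the k most similar stored instances.

    training_set is a list of (features, label) pairs kept as-is:
    "learning" here is nothing more than memorizing the examples.
    """
    neighbors = sorted(
        training_set,
        key=lambda pair: euclidean_distance(pair[0], new_instance),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up training instances with two features each.
training_set = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((2.0, 1.5), "small"),
    ((8.0, 8.0), "large"),
    ((8.5, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

print(knn_predict(training_set, (1.2, 1.4)))  # small
print(knn_predict(training_set, (8.7, 8.2)))  # large
```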

Question 13

Q

What is model-based learning?

Answer

A

A way to generalize from a set of examples: build a model of these examples, then use that model to make predictions.

Question 14

Q

How do you define the performance measure of your algorithms?

Answer

A

You can either define a utility function (or fitness function) that measures how good your model is, or you can define a cost function that measures how bad it is.
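
For illustration, here is one example of each kind. Mean squared error and accuracy are common choices, picked here as assumptions rather than taken from the card; the data is made up.

```python
def mean_squared_error(y_true, y_pred):
    """Cost function: average squared gap between targets and predictions.

    Lower is better; 0 means perfect predictions.
    """
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Utility function: fraction of predictions that are exactly right.

    Higher is better; 1.0 means every prediction is correct.
    """
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Regression example: how bad are the predictions?
cost = mean_squared_error([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])

# Classification example: how good are the predictions?
utility = accuracy(["spam", "ham", "spam", "ham"], ["spam", "ham", "ham", "ham"])

print(f"cost (MSE): {cost:.4f}, utility (accuracy): {utility:.2f}")
```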

Question 15

Q

What is the lifecycle of a typical ML project?

Answer

A

You study the data.
You select a model.
You train it on the training data (i.e., the learning algorithm searches for the model parameter values that minimize a cost function).
Finally, you apply the model to make predictions on new cases (this is called inference), hoping that this model will generalize well.
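
The four steps above can be sketched end to end with a very small model. This example assumes a one-feature linear model fitted by ordinary least squares, and the data is made up; `fit_linear` and `predict` are illustrative names, not a library API.

```python
from statistics import mean

def fit_linear(xs, ys):
    """Return (intercept, slope) minimizing the mean squared error on (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def predict(params, x):
    """Apply the trained model to a new instance (inference)."""
    intercept, slope = params
    return intercept + slope * x

# 1. Study the data (made-up numbers, roughly y = 2x + 1).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

# 2. Select a model: a straight line y = intercept + slope * x.
# 3. Train it: find the parameter values that minimize the cost function.
params = fit_linear(xs, ys)

# 4. Make a prediction on a new case.
prediction = predict(params, 6.0)
print(f"intercept={params[0]:.2f}, slope={params[1]:.2f}, prediction={prediction:.2f}")
```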

Question 16

Q

What are the main challenges of ML?

Answer

A

Insufficient quantity of training data. It takes a lot of data for most ML algorithms to work properly.
Non-representative training data. If the sample size is too small you can have sampling noise. If the sampling method is flawed you can have a sampling bias.
Poor quality data. If your training data is full of errors, outliers, and noise, it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well.
Irrelevant features.
Overfitting the training data.
Underfitting the training data.

Question 17

Q

What is overfitting?

Answer

A

The model performs well on the training data, but it does not generalize well.

Question 18

Q

What is feature engineering?

Answer

A

Feature engineering is coming up with a good set of features to train on. It involves:

Feature selection: selecting the most useful features to train on among existing features.
Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).
Creating new features by gathering new data.

Question 19

Q

What is regularization?

Answer

A

Constraining the parameters of the learning model to prevent it from overfitting.

Question 20

Q

What are the solutions for overfitting?

Answer

A

To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data or by constraining the model
To gather more training data
To reduce the noise in the training data (e.g., fix data errors and remove outliers)

Question 21

Q

What is a regularization hyperparameter?

Answer

A

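A hyperparameter is a parameter of the learning algorithm, not of the model. The regularization hyperparameter controls the amount of regularization applied during learning; it is set before training and stays constant while the model is trained.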
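
A small sketch of how such a hyperparameter acts, assuming ridge-style (squared) regularization on the slope of a one-feature linear model. The name `alpha`, the function `ridge_slope`, and the data are assumptions for this example. Minimizing the sum of squared errors plus `alpha * slope**2` has a closed-form solution in this simple case.

```python
from statistics import mean

def ridge_slope(xs, ys, alpha):
    """Slope minimizing sum((y - b - w*x)**2) + alpha * w**2, with b unpenalized.

    alpha is the regularization hyperparameter: it is chosen before training
    and is not learned from the data. alpha = 0 gives ordinary least squares.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / (sxx + alpha)

# Made-up data, roughly y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

# The larger alpha is, the more the slope is pulled toward zero.
slopes = {alpha: ridge_slope(xs, ys, alpha) for alpha in (0.0, 10.0, 90.0)}
for alpha, slope in slopes.items():
    print(f"alpha={alpha:5.1f} -> slope={slope:.3f}")
```

A very large `alpha` yields an almost flat line, which is the constrained, simpler model; a value of zero removes the constraint entirely.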

Question 22

Q

What is underfitting?

Answer

A

It occurs when your model is too simple to learn the underlying structure of the data.

Question 23

Q

How do you fix underfitting?

Answer

A

Selecting a more powerful model, with more parameters
Feeding better features to the learning algorithm (feature engineering)
Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)

Question 24

Q

What is machine learning?

Answer

A

Machine Learning is about building systems that can learn from data. Learning means getting better at some task, given some performance measure.

Question 25

Q

What are common unsupervised learning tasks?

Answer

A

Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.

Question 26

Q

What type of Machine Learning algorithm would you use to allow a robot to walk in various unknown terrains?

Answer

A

Reinforcement Learning is likely to perform best if we want a robot to learn to walk in various unknown terrains since this is typically the type of problem that Reinforcement Learning tackles. It might be possible to express the problem as a supervised or semi-supervised learning problem, but it would be less natural.

Question 27

Q

What type of algorithm would you use to segment your customers into multiple groups?

Answer

A

If you don’t know how to define the groups, then you can use a clustering algorithm (unsupervised learning) to segment your customers into clusters of similar customers. However, if you know what groups you would like to have, then you can feed many examples of each group to a classification algorithm (supervised learning), and it will classify all your customers into these groups.
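
For the clustering case, a minimal k-Means sketch on a single feature might look like this. The customer spend figures, the initial centroids, and the function name `k_means_1d` are all hypothetical; real implementations handle several features and choose the initial centroids more carefully.

```python
def k_means_1d(values, initial_centroids, iterations=10):
    """Group one-dimensional values around k centroids.

    Each iteration assigns every value to its nearest centroid, then moves
    each centroid to the mean of the values assigned to it.
    """
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)),
                key=lambda i: abs(value - centroids[i]),
            )
            clusters[nearest].append(value)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical yearly spend per customer; no group labels are given.
spend = [120.0, 150.0, 130.0, 900.0, 950.0, 1000.0]

centroids, clusters = k_means_1d(spend, initial_centroids=[100.0, 1000.0])
print(centroids)
print(clusters)
```

No labels are needed: the algorithm discovers a low-spend group and a high-spend group on its own.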

Question 28

Q

What is out-of-core learning?

Answer

A

Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer’s main memory. An out-of-core learning algorithm chops the data into mini-batches and uses online learning techniques to learn from these mini-batches.
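
The sketch below illustrates the pattern under simplifying assumptions: the "dataset" is a generator standing in for a file too large to load, the model is a single weight, and each mini-batch triggers one gradient step. All names and numbers are made up for the example.

```python
from itertools import islice

def read_instances(n):
    """Stand-in for reading from disk: yields one (x, y) pair at a time.

    The full dataset is never held in memory. Here y is exactly 3 * x.
    """
    for i in range(n):
        x = (i % 10) / 10
        yield x, 3.0 * x

def mini_batches(instances, batch_size):
    """Chop a stream of instances into lists of at most batch_size items."""
    iterator = iter(instances)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def train_out_of_core(instances, batch_size=4, learning_rate=0.1):
    """Learn the weight of y = weight * x, one mini-batch at a time."""
    weight = 0.0
    for batch in mini_batches(instances, batch_size):
        # Gradient of the mean squared error over this mini-batch only.
        gradient = sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)
        weight -= learning_rate * gradient
    return weight

weight = train_out_of_core(read_instances(1000))
print(f"learned weight: {weight:.4f}")
```

Only one mini-batch is in memory at any time, so the same loop works no matter how many instances the stream produces.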

Question 29

Q

What type of learning algorithm relies on a similarity measure to make predictions?

Answer

A

An instance-based learning system learns the training data by heart; then, when given a new instance, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.

Question 30

Q

What is the difference between a model parameter and a learning algorithm’s hyperparameter?

Answer

A

A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters such that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).

Question 31

Q

What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?

Answer

A

Model-based learning algorithms search for an optimal value for the model parameters such that the model will generalize well to new instances. We usually train such systems by minimizing a cost function that measures how bad the system is at making predictions on the training data, plus a penalty for model complexity if the model is regularized. To make predictions, we feed the new instance’s features into the model’s prediction function, using the parameter values found by the learning algorithm.

Question 32

Q

What is a test set and why would you want to use it?

Answer

A

A test set is used to estimate the generalization error that a model will make on new instances before the model is launched in production.
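
A minimal sketch of holding out a test set, assuming a random 80/20 split (a common convention, not something the card prescribes) and a fixed seed so that the split is reproducible. The function name mirrors a familiar library helper, but this is a hand-written stand-in.

```python
import random

def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle the instances and set aside a fraction of them as the test set.

    Returns (train_set, test_set). The test set must not be used for training,
    otherwise the error measured on it no longer estimates the
    generalization error.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]

data = list(range(100))  # stand-in for 100 labeled instances
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))
```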

Question 33

Q

What is the purpose of a validation set?

Answer

A

A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.

Question 34

Q

What can go wrong if you tune hyperparameters using the test set?

Answer

A

If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).

Question 35

Q

What is cross-validation and why would you prefer it to a validation set?

Answer

A

Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. This saves precious training data.
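
As an illustration, the sketch below compares two candidate models with k-fold cross-validation. It assumes contiguous folds without shuffling (in practice the data is usually shuffled first), mean squared error as the score, and made-up data; all function names are illustrative.

```python
from statistics import mean

def k_fold_splits(n_instances, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_instances))
    fold_sizes = [
        n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def fit_constant(xs, ys):
    """Candidate model 1: always predict the mean target."""
    y_bar = mean(ys)
    return lambda x: y_bar

def fit_line(xs, ys):
    """Candidate model 2: least-squares straight line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: y_bar + slope * (x - x_bar)

def cross_val_mse(fit, xs, ys, k=3):
    """Average validation MSE over k folds; every instance is validated once."""
    fold_errors = []
    for train, validation in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - model(xs[i])) ** 2 for i in validation]
        fold_errors.append(mean(errors))
    return mean(fold_errors)

# Made-up data, roughly y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0, 13.1]

constant_score = cross_val_mse(fit_constant, xs, ys)
line_score = cross_val_mse(fit_line, xs, ys)
print(f"constant model: {constant_score:.3f}, linear model: {line_score:.3f}")
```

Each instance serves for training in some folds and for validation in exactly one, so no data has to be permanently set aside for model comparison. The model with the lower cross-validated error (here the line) would be selected.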

Keep the fundamentals of machine learning at your fingertips.