Machine Learning Landscape Flashcards
Supervised/Unsupervised Learning
Question 01
Define supervised learning.
In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels.

Question 02
Supervised learning
What are the two main tasks of supervised learning?
Classification (e.g., spam or ham)
Regression (predicting a numeric value from a set of predictors)
Question 03
Define unsupervised learning.
In unsupervised learning, as you might guess, the training data is unlabeled.
The system tries to learn without a teacher.
Question 04
Give an example of an unsupervised learning algorithm.
For example, say you have a lot of data about your blog’s visitors. You may want to run a clustering algorithm to try to detect groups of similar visitors.
At no point do you tell the algorithm which group a visitor belongs to: it finds those connections without your help.
For example, it might notice that 40% of your visitors are males who love comic books and generally read your blog in the evening, while 20% are young sci-fi lovers who visit during the weekends, and so on.
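The clustering idea above can be sketched with a minimal k-means loop. This is an illustrative sketch only: the function names, the visitor features (age and usual visit hour), and the deterministic initialization are assumptions made for the example, not something the original card specifies.

```python
# Minimal k-means sketch in pure Python. The data and features are hypothetical.

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=10):
    # Deterministic initialization (an assumption): first k points are centroids.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical visitors: (age, usual visit hour). No group labels are given.
visitors = [(16, 21), (35, 9), (17, 22), (40, 10), (15, 20), (38, 8)]
groups, centers = k_means(visitors, k=2)
print(groups)  # one group index per visitor, found without any labels
```

The algorithm is never told which group a visitor belongs to; the groups emerge from the distances alone.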
Question 05
Unsupervised learning
What is the second application of unsupervised learning?
Visualization algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted.

Question 06
Unsupervised learning
What is the fourth application of unsupervised learning?
Dimensionality reduction, in which the goal is to simplify the data without losing too much information.
One way to do this is to merge several correlated features into one.
For example, a car’s mileage may be very correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear. This is called feature extraction.
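Merging two correlated features can be sketched as follows. The car data and the function names are hypothetical. For two standardized, positively correlated features, the first principal component points along (1, 1)/√2, so projecting onto it amounts to a scaled sum. This is one simple way to build a "wear and tear" feature, not necessarily the method a given library would use.

```python
import math

def standardize(values):
    # Rescale to zero mean and unit variance so both features are comparable.
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def merge_correlated(feature_a, feature_b):
    # Project the standardized pair onto the direction (1, 1) / sqrt(2).
    za, zb = standardize(feature_a), standardize(feature_b)
    return [(a + b) / math.sqrt(2) for a, b in zip(za, zb)]

# Hypothetical cars: age in years and mileage in thousands of km.
age = [1, 3, 5, 8, 12]
mileage = [15, 40, 70, 110, 160]
wear = merge_correlated(age, mileage)  # one feature instead of two
print(wear)
```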
Question 07
Unsupervised learning
What is the first application of unsupervised learning?
For example, say you have a lot of data about your blog’s visitors. You may want to run a clustering algorithm to try to detect groups of similar visitors.
If you use a hierarchical clustering algorithm, it may also subdivide each group into smaller groups. This may help you target your posts for each group.

Question 08
Define semi-supervised learning.
Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data.
Question 09
Semi-supervised learning
Give an application of semi-supervised learning.
Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all the system needs is for you to tell it who these people are. Just one label per person, and it is able to name everyone in every photo, which is useful for searching photos.
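The "one label per person" step can be sketched as label propagation over clusters. The sketch assumes the clustering has already been done. The cluster ids, names, and the `name_photos` helper are hypothetical and do not describe how any real photo service works.

```python
from collections import defaultdict

def name_photos(faces, cluster_names):
    """Propagate one user-provided name per cluster to every face in it."""
    names_by_photo = defaultdict(list)
    for photo_id, cluster_id in faces:
        names_by_photo[photo_id].append(cluster_names.get(cluster_id, "unknown"))
    return dict(names_by_photo)

# Output of the unsupervised step: (photo number, face cluster) pairs.
faces = [(1, "A"), (5, "A"), (11, "A"), (2, "B"), (5, "B"), (7, "B")]
# The supervised part: a single label per person, supplied by the user.
cluster_names = {"A": "Alice", "B": "Bob"}

names = name_photos(faces, cluster_names)
print(names[5])
```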

Question 10
What is a batch learning system?
In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data.
This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning.
Question 11
What is an online learning system?
In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.
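A minimal sketch of incremental training, assuming a one-feature linear model updated by stochastic gradient descent. The class name, learning rate, and simulated stream are illustrative choices, not part of the original card.

```python
class OnlineLinearRegressor:
    """Linear model y = w * x + b, updated one instance at a time."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        # One cheap gradient step on the squared error of a single instance.
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error

model = OnlineLinearRegressor()
# Simulated stream: instances arrive one by one from y = 2x + 1.
for step in range(5000):
    x = (step % 5) / 4.0  # x cycles through 0, 0.25, 0.5, 0.75, 1
    model.learn_one(x, 2 * x + 1)
print(model.w, model.b)  # should end up close to 2 and 1
```

Each instance can be discarded right after `learn_one` has used it, which is what keeps every step fast and cheap.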

Question 12
What is out-of-core learning?
Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.
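The load-train-repeat loop can be sketched as follows. `read_chunks` is a hypothetical stand-in for reading a large file piece by piece; the data (y = 3x), the chunk size, and the learning rate are assumptions for illustration.

```python
def read_chunks(num_rows, chunk_size):
    """Stand-in for a huge file: yields one chunk of (x, y) rows at a time."""
    for start in range(0, num_rows, chunk_size):
        rows = range(start, min(start + chunk_size, num_rows))
        yield [((i % 10) / 10.0, 3 * ((i % 10) / 10.0)) for i in rows]

def train_out_of_core(chunks, learning_rate=0.5):
    w = 0.0  # single parameter of the model y = w * x
    for chunk in chunks:
        # One training step on the chunk currently held in memory.
        gradient = sum((w * x - y) * x for x, y in chunk) / len(chunk)
        w -= learning_rate * gradient
    return w

# Only one chunk of 100 rows is in memory at any time.
w = train_out_of_core(read_chunks(num_rows=10_000, chunk_size=100))
print(w)  # should be close to 3
```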

Question 13
What type of learning algorithm relies on a similarity measure to make predictions?
Instance-based learning system
The system learns the examples by heart, then generalizes to new cases using a similarity measure (for example, in spam detection, counting the number of words two emails have in common).
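The word-count similarity mentioned above can be sketched as a nearest-example classifier. The example emails and function names are hypothetical, and real spam filters use far richer similarity measures.

```python
def common_word_count(text_a, text_b):
    # Similarity measure: number of distinct words the two texts share.
    return len(set(text_a.lower().split()) & set(text_b.lower().split()))

def classify(email, labeled_examples):
    # Predict the label of the most similar memorized example.
    best_text, best_label = max(
        labeled_examples, key=lambda example: common_word_count(email, example[0])
    )
    return best_label

# The "learned by heart" examples (hypothetical).
examples = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

print(classify("claim your free prize now", examples))
print(classify("team meeting on monday", examples))
```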
Question 14
What do model-based learning algorithms search for?
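They generalize from a set of examples by building a model of these examples, then use that model to make predictions. Training searches for the model parameter values that minimize a cost function on the training data, in the hope that the model generalizes well to new cases.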
Question 15
What is the most common strategy model-based learning algorithms use to succeed? How do they make predictions?
You studied the data.
You selected a model.
You trained it on the training data (i.e., the learning algorithm searched for the model parameter values that minimize a cost function).
Finally, you applied the model to make predictions on new cases (this is called inference), hoping that this model will generalize well.
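The four steps above can be sketched with a one-feature linear model whose cost function is the mean squared error. The data and function names are hypothetical, and the closed-form least-squares fit is just one possible learning algorithm.

```python
def fit_linear(xs, ys):
    """Return (slope, intercept) minimizing the mean squared error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def cost(slope, intercept, xs, ys):
    # Cost function: mean squared error of the model on the given data.
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# 1. Study the data (hypothetical, roughly linear).
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
# 2. Select a model: y = slope * x + intercept.
# 3. Train it: find the parameter values that minimize the cost.
slope, intercept = fit_linear(xs, ys)
# 4. Inference: apply the trained model to a new case.
prediction = slope * 5 + intercept
print(slope, intercept, prediction)
```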
Question 16
Can you name four of the main challenges in Machine Learning?
Insufficient Quantity of Training Data
Nonrepresentative Training Data
Poor-Quality Data
Irrelevant Features
Question 17
Define overfitting.
It means that the model performs well on the training data, but it does not generalize well.
Question 18
What are the solutions to overfitting?
- To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model
- To gather more training data
- To reduce the noise in the training data (e.g., fix data errors and remove outliers)
Question 19
Constraining a model to make it simpler and reduce the risk of overfitting is called …
Constraining a model to make it simpler and reduce the risk of overfitting is called regularization.
Question 20
The amount of regularization to apply during learning can be controlled by a …
The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm (not of the model).
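To make the distinction concrete, here is a minimal sketch using ridge-style regularization on a model y = w * x. The hyperparameter name `alpha` and the data are assumptions for the example. `alpha` is set before training and is not learned, while `w` is the model parameter the algorithm finds.

```python
def fit_ridge_slope(xs, ys, alpha):
    """Slope w minimizing sum((w * x - y) ** 2) + alpha * w ** 2."""
    # Setting the derivative to zero gives w = sum(x * y) / (sum(x * x) + alpha).
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs, ys = [1, 2, 3], [2, 4, 6]  # hypothetical data, exactly y = 2x

# A larger alpha constrains the model more, pulling the slope toward zero.
slopes = [fit_ridge_slope(xs, ys, alpha) for alpha in (0, 1, 14)]
print(slopes)
```

With `alpha = 0` there is no regularization and the fit recovers the unconstrained slope; increasing `alpha` trades training fit for a simpler model.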
Question 21
What is a test set and why would you want to use it?
A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.
Question 22
What is the purpose of a validation set?
A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
Question 23
What can go wrong if you tune hyperparameters using the test set?
If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).
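A minimal sketch of keeping the three sets separate, so that hyperparameters are tuned on the validation set and the test set is touched only once at the end. The split ratios, the seed, and the function name are assumptions for illustration.

```python
import random

def split_dataset(instances, validation_ratio=0.2, test_ratio=0.2, seed=42):
    """Shuffle once, then carve out disjoint test, validation, and training sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    n_val = int(len(shuffled) * validation_ratio)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test

train, validation, test = split_dataset(range(100))
# Train on `train`, compare models and tune hyperparameters on `validation`,
# and estimate the generalization error on `test` only once, at the very end.
print(len(train), len(validation), len(test))
```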
Question 24
What is cross-validation and why would you prefer it to a validation set?
Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. This saves precious training data.
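The idea can be sketched with a k-fold index generator: every instance is used for validation exactly once and for training in the other folds. The function name and the choice of contiguous, unshuffled folds are simplifying assumptions.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_instances))
    fold_size = n_instances // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        stop = n_instances if fold == k - 1 else start + fold_size
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation

splits = list(k_fold_splits(10, 3))
for train_idx, val_idx in splits:
    # Train a candidate model on train_idx, score it on val_idx,
    # then average the k scores to compare candidates.
    print(len(train_idx), val_idx)
```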