Introduction Flashcards
Machine Learning
It is the process of extracting patterns from the data.
Data includes features and a target.
If an expert can deduce a pattern from the data, so can ML.
Features
All the information we have about an object, e.g. the characteristics of a car: year, make, model, etc.
Target
Output or labels we want to predict based on the features
Training
Features + Target = Model (this is supervised learning)
Training results in a model that has learned the patterns relating the features to the target.
Predictions
Features + Model = Predictions
We feed features into the model, which predicts the target variable/label.
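As a rough illustration of the Training and Predictions cards, the sketch below trains a toy model and then uses it to predict. The `train` and `predict` helpers, the car records and the prices are all made up for this example; the "model" is simply the average price per make, standing in for whatever a real learning algorithm would produce.

```python
from collections import defaultdict

def train(features, target):
    """Features + Target -> Model (here: the mean price per car make)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row, price in zip(features, target):
        totals[row["make"]] += price
        counts[row["make"]] += 1
    return {make: totals[make] / counts[make] for make in totals}

def predict(model, features):
    """Features + Model -> Predictions."""
    # Unknown makes fall back to the average of the per-make means.
    fallback = sum(model.values()) / len(model)
    return [model.get(row["make"], fallback) for row in features]

# Hypothetical training data (the "year" feature is ignored by this toy model)
cars = [
    {"make": "toyota", "year": 2015},
    {"make": "toyota", "year": 2018},
    {"make": "bmw", "year": 2016},
]
prices = [10000.0, 14000.0, 20000.0]

model = train(cars, prices)
new_cars = [{"make": "toyota", "year": 2020}, {"make": "audi", "year": 2019}]
predictions = predict(model, new_cars)
print(predictions)  # [12000.0, 16000.0]
```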
Why not use Rule-based Systems?
It is difficult to write rules for every possible scenario, e.g. for spam emails, hand-written rules quickly turn into huge and messy code.
With ML we simply provide the data, i.e. the features and the target (spam/not spam). We train a model and then use it on new data. Most of the hand-written rules can be reused as features.
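One way to picture rules turning into features is the sketch below. The three rules, the `extract_features` helper and the example emails are hypothetical; each rule becomes one binary column that a model could then learn to weigh.

```python
def extract_features(email):
    """Turn hand-written spam rules into binary features (1 = rule fires)."""
    subject = email["subject"].lower()
    body = email["body"].lower()
    return [
        int(len(subject) > 10),                           # rule: long subject line
        int("deposit" in body),                           # rule: body mentions "deposit"
        int(email["sender"].endswith("@promo.example")),  # rule: suspicious sender domain
    ]

emails = [
    {"sender": "offers@promo.example",
     "subject": "Claim your prize today",
     "body": "Send a deposit to unlock it."},
    {"sender": "friend@mail.example",
     "subject": "Lunch?",
     "body": "Are you free at noon?"},
]
labels = [1, 0]  # target: 1 = spam, 0 = not spam

X = [extract_features(e) for e in emails]
print(X)  # [[1, 1, 1], [0, 0, 0]]
```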
What does a model output?
A classification model typically outputs probabilities. A threshold can be defined to turn those probabilities into a final decision.
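A minimal sketch of applying a threshold to predicted probabilities. The `decide` helper, the probability values and the two threshold settings are assumptions chosen for illustration.

```python
def decide(probabilities, threshold=0.5):
    """Convert predicted probabilities into final yes/no decisions."""
    return [p >= threshold for p in probabilities]

spam_probabilities = [0.05, 0.48, 0.50, 0.93]  # made-up model outputs

default_decisions = decide(spam_probabilities)
strict_decisions = decide(spam_probabilities, threshold=0.9)

print(default_decisions)  # [False, False, True, True]
print(strict_decisions)   # [False, False, False, True]
```

Raising the threshold flags fewer emails as spam, which trades missed spam for fewer false alarms.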
Rule based systems
Software takes in data plus hand-written code (rules) and outputs the outcome. The rules can become difficult to maintain.
Supervised ML
We show the model examples, i.e. labelled data, so that it can learn patterns from the features and labels using mathematics and statistics.
Feature Matrix
A two-dimensional array (matrix) with features as columns and the objects/observations we want to predict for as rows. It is usually denoted by X.
Target Vector
A vector with one target value for each row of the feature matrix X. It is denoted by y.
Mathematical Expression for Model
g(X) ≈ y
g is the model, X is the feature matrix and y is the target vector.
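To make the notation concrete, here is a small sketch with a feature matrix X, a target vector y and a model g. The cars, the prices and the weights inside `g` are invented for illustration; a real g would be learned from data rather than written by hand.

```python
# Feature matrix X: one row per car, columns = [age in years, mileage in 10,000 km]
X = [
    [1, 2],
    [5, 10],
    [10, 15],
]
# Target vector y: one known price per row of X
y = [18500, 9500, 3000]

def g(X):
    """A hypothetical, already-trained model: price falls with age and mileage."""
    return [20000 - 1000 * age - 500 * mileage for age, mileage in X]

predictions = g(X)
# How closely does g(X) approximate y? Use the mean absolute error.
mean_abs_error = sum(abs(p - t) for p, t in zip(predictions, y)) / len(y)

print(predictions)     # [18000, 10000, 2500]
print(mean_abs_error)  # 500.0
```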
Types of Supervised ML are based on
the output of the model and the type of the target variable
Regression
A type of supervised machine learning where the model returns a number. For a price this is a value between 0 and infinity; in general it can be any real number.
E.g. prediction of car prices, prediction of house prices
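A minimal regression sketch, assuming a single feature and an ordinary least-squares line as the model. The `fit_line` helper and the house data are made up; real price prediction would use more features and a library implementation.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ≈ slope * x + intercept for a single feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: house size (in units of 10 m^2) and price (in units of 10,000)
sizes = [1, 2, 3, 4]
prices = [3, 5, 7, 9]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 5 + intercept  # prediction for an unseen size of 5

print(slope, intercept, predicted_price)  # 2.0 1.0 11.0
```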
Classification
A type of supervised machine learning where the output is a category, e.g. the output for an image is "car", or the output for an email is spam/not spam.
Multi Class Classification
A type of supervised machine learning and a subtype of classification where the output is one of several categories, e.g. a car, a dog or a cat. There can be as many categories as you need, as long as there are more than 2.
Binary Classification
A type of supervised machine learning and a sub type of classification where the target variable can only be either of two categories e.g. Spam/Not Spam.
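The classification cards can be sketched as picking the most probable category. The `predict_class` helper and the probability values below are assumptions for illustration; binary classification is simply the case with two categories.

```python
def predict_class(class_probabilities):
    """Pick the category with the highest predicted probability."""
    return max(class_probabilities, key=class_probabilities.get)

# Made-up model outputs
multi_class_output = {"car": 0.7, "dog": 0.2, "cat": 0.1}  # multi-class: 3 categories
binary_output = {"spam": 0.35, "not spam": 0.65}           # binary: 2 categories

image_label = predict_class(multi_class_output)
email_label = predict_class(binary_output)

print(image_label)  # car
print(email_label)  # not spam
```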
Ranking
A type of supervised machine learning where you want to rank items, e.g. in a recommendation system. When we search for something on Google, it ranks web pages by their relevance to the user and the search query.
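In its simplest form, ranking means scoring every item and sorting by the score. The sketch below assumes the relevance scores have already been produced by some model; the `rank` helper, page names and scores are hypothetical.

```python
def rank(items, scores):
    """Return items ordered from highest to lowest predicted relevance."""
    ordered = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ordered]

pages = ["page_a", "page_b", "page_c", "page_d"]
relevance_scores = [0.12, 0.87, 0.45, 0.66]  # made-up model outputs

top_results = rank(pages, relevance_scores)[:3]
print(top_results)  # ['page_b', 'page_d', 'page_c']
```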
CRISP-DM
A methodology for organizing ML projects. It stands for Cross-Industry Standard Process for Data Mining.
It covers everything from problem understanding to deployment. It’s an old methodology from the 90s, developed by an industry consortium and later promoted by IBM.
ML Projects (6 steps of CRISP-DM)
- Business understanding of the problem
+Do we need ML to solve the problem? If not, what is the alternative solution?
+Identify the problem and whether it is important
+What’s the measurable goal? e.g. reducing the number of spam messages.
- Data understanding
+What data is available to solve the problem?
+How can we get the data? Buy or maybe collect the dataset.
+Is it reliable?
+Do we track it correctly?
+Is the dataset large enough? Do we need to get more data?
+Sometimes we go back to the first step if the problem or data is not suitable.
- Data preparation
+Extracting features
+Cleaning data
+Pipelines that transform raw data into suitable features
- Modelling
+Train the model
+Try different models and choose the best one
+Add new features or fix data issues
- Evaluation
+Check the results against the business goal: do our metrics meet it or not?
+Maybe we need to go back to the business understanding and start again
- Evaluation + Deployment
+Online evaluation with live users
+Deploy the model and evaluate it
+Evaluate it on a small number of users
+Roll the model out to all users, with proper monitoring, ensuring quality and maintainability
- Iterate
+Is it good? Should we improve it or not?
Selecting the best model
We need to estimate how the model will perform on real, unseen data. This can be done by setting aside 20% of the data and training the model on the remaining 80%. The held-out 20% is our validation dataset. We apply g to the validation features to get predictions and compare them with the actual values to see what fraction is correct (the accuracy). We repeat this for different models and choose the one with the best validation accuracy. Candidate model types include logistic regression, decision trees, random forests, neural networks, etc.
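The hold-out idea can be sketched with the standard library alone. The `holdout_split` and `accuracy` helpers, the toy dataset and the hand-written model `g` are assumptions for illustration; libraries such as scikit-learn provide equivalent utilities.

```python
import random

def holdout_split(rows, labels, validation_fraction=0.2, seed=42):
    """Shuffle the data and hold out a fraction of it for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(rows) * validation_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(idx, seq):
        return [seq[i] for i in idx]

    return (pick(train_idx, rows), pick(train_idx, labels),
            pick(val_idx, rows), pick(val_idx, labels))

def accuracy(predictions, actual):
    """Fraction of predictions that match the actual values."""
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct / len(actual)

# Toy dataset: one numeric feature, label is True for values of 5 or more
rows = list(range(10))
labels = [r >= 5 for r in rows]
X_train, y_train, X_val, y_val = holdout_split(rows, labels)

def g(xs):
    """Hypothetical model: predicts True above a slightly wrong cutoff."""
    return [x >= 4 for x in xs]

val_accuracy = accuracy(g(X_val), y_val)
print(len(X_train), len(X_val), val_accuracy)
```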
Multiple Comparisons Problem
If we try many different models, one of them may score well on a particular validation dataset just by luck. If we took another 20% of the data, the results could be totally different. This is a well-known statistical problem.
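A small simulation can show how "getting lucky" happens. Everything here is artificial: the labels are coin flips and each "model" is a fixed list of random guesses, so no model is genuinely better than chance. The sizes and the seed are arbitrary choices.

```python
import random

rng = random.Random(0)
n_examples, n_models = 20, 200

# Labels are pure coin flips, so no model can genuinely beat about 50%.
val_labels = [rng.random() < 0.5 for _ in range(n_examples)]
test_labels = [rng.random() < 0.5 for _ in range(n_examples)]

def accuracy(predictions, actual):
    """Fraction of predictions that match the actual values."""
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)

# Each "model" is just a fixed list of random guesses.
models = [[rng.random() < 0.5 for _ in range(n_examples)] for _ in range(n_models)]
val_scores = [accuracy(m, val_labels) for m in models]

# Select the model that looks best on the validation labels...
best_index = max(range(n_models), key=lambda i: val_scores[i])
best_val_accuracy = val_scores[best_index]
average_val_accuracy = sum(val_scores) / n_models

# ...and check the same model against fresh labels.
best_test_accuracy = accuracy(models[best_index], test_labels)

print(best_val_accuracy, average_val_accuracy, best_test_accuracy)
```

With this many candidates, the best validation score usually sits well above the average, while the same model's score on fresh labels tends to fall back toward 0.5.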
Validation / Test dataset Split
To guard against the multiple comparisons problem, we can use three non-overlapping datasets, e.g. 60% training, 20% validation and 20% testing. We select the best model on the validation dataset and then check it on the test dataset to make sure it didn’t just get lucky.
Steps of Model selection
1) Split the dataset into train, validation and test
2) Train a model
3) Validate it
4) Repeat 2 and 3 for different models and select the best one
5) Apply the best model to the test dataset and check
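The steps above can be sketched end to end as follows. The dataset, the two candidate models (`train_majority` and `train_threshold`) and the 60/20/20 slicing are all assumptions made for this example, not a prescribed implementation.

```python
import random

def accuracy(predictions, actual):
    """Fraction of predictions that match the actual values."""
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)

def train_majority(xs, ys):
    """Candidate 1: always predict the most common training label."""
    majority = sum(ys) * 2 >= len(ys)
    return lambda x: majority

def train_threshold(xs, ys):
    """Candidate 2: pick the cutoff on x with the best training accuracy."""
    best_cutoff = max(xs, key=lambda c: accuracy([x >= c for x in xs], ys))
    return lambda x: x >= best_cutoff

# Made-up dataset: one feature, label is True when the feature is large
rng = random.Random(1)
xs = [rng.random() for _ in range(100)]
ys = [x >= 0.6 for x in xs]

# Split into 60% train, 20% validation, 20% test (rows are already in random order)
x_train, y_train = xs[:60], ys[:60]
x_val, y_val = xs[60:80], ys[60:80]
x_test, y_test = xs[80:], ys[80:]

# Train each candidate, validate it, and keep the best one
candidates = {"majority": train_majority, "threshold": train_threshold}
models, val_scores = {}, {}
for name, train in candidates.items():
    models[name] = train(x_train, y_train)
    val_scores[name] = accuracy([models[name](x) for x in x_val], y_val)
best_name = max(val_scores, key=val_scores.get)

# Check the selected model once on the untouched test set
test_accuracy = accuracy([models[best_name](x) for x in x_test], y_test)
print(best_name, val_scores, test_accuracy)
```

The test set is used only once, after the choice is made, so its score is not affected by the comparison between candidates.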
Examples of Supervised Machine Learning
1) Spam filtering: features from an email dataset, labels spam or not spam
2) Online advertising: ad and user information as input data, labels whether a user is likely to click on an ad or not
3) Self-driving cars: image and radar information as input, positions of other cars as labels
4) Fuel optimization: the ship route as input, fuel consumed as labels
5) Visual inspection: an image of a phone as input, detecting whether there is a defect in the image
6) Restaurant reputation monitoring: restaurant reviews as input, detecting positive or negative sentiment
[2010-2020] Large Scale Supervised Learning
If you trained an AI model on progressively larger datasets with limited compute, e.g. on your local machine, its performance plateaued. If you trained it on the same growing datasets with large-scale computation, performance kept improving significantly.