L7: Supervised machine learning Flashcards
What is machine learning?
A branch of AI
What does a machine learn from?
* Learning from data
* Discovering hidden patterns
* Essential for data-driven decisions
What questions can we ask?
Predictive
Examples of applied ML
Examples:
* Credit card fraud detection in financial institutions
* Recommendation systems on websites for personalization
* Customer segmentation for marketing strategies
* Customer churn prediction to foresee service cancellations
* Predictive maintenance in manufacturing companies
* Sentiment analysis of social media data
* Health diagnosis to aid doctors
ML Pipeline
Acquire → Prepare → Analyze → Report → Act
ML Pipeline step 1: Acquire data
- Identify data sources: Check the question that needs to be addressed
- Collect data: Record the necessary data
- Integrate data (data wrangling): Merge/join data, if needed
ML Pipeline step 2: Prepare Data
- Explore: Understand your data e.g.,
- Check the structure and variable types
- Check for outliers, missing values etc.
- Pre-process: Prepare your data for analysis e.g.,
- Clean (missing values, mistakes etc.)
- Feature selection (e.g., combine, remove, add)
- Feature transformation (e.g., scaling, dimensionality reduction, filtering)
ML Pipeline step 3: Analyze data
- Select analytical techniques
- Build models
- Assess results
ML Pipeline step 4: Report results
- Communicate results
- Recommend actions
ML Pipeline step 5: Act
- Apply the results
- Implement, maintain, and assess the impact
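To make the five steps concrete, here is a minimal sketch in Python with pandas and scikit-learn; the customer table and its columns (age, income, churn) are invented for illustration and are not from the flashcards.

```python
# A minimal sketch of Acquire -> Prepare -> Analyze -> Report
# on a synthetic (hypothetical) customer churn table.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Acquire: in practice, read from a file or database; here, a synthetic table
df = pd.DataFrame({
    "age":    [25, 52, 37, 45, 29, 61, 33, 48] * 10,
    "income": [40, 85, 60, 72, 38, 90, 55, 67] * 10,
    "churn":  [0, 1, 0, 1, 0, 1, 0, 1] * 10,
})

# Prepare: clean the data and select features
df = df.dropna()
X, y = df[["age", "income"]], df["churn"]

# Analyze: build and assess a model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Report: communicate the results (Act would mean deploying the model)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```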
Supervised ML
The goal is to estimate a model from a selection of input variables that gives
the best estimate of the target (i.e., the outcome variable). It predicts something
we have seen before (i.e., data labels guide the learning process).
Requires:
* A range of input variables
* An outcome variable
DATA LABELS
The process of adding informative labels or tags to our data.
* Think of it as the “ground truth” for the target variable/outcome variable
* Necessary for a supervised ML algorithm
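As a small illustration of labeled data, a sketch with invented numbers: X holds the input variables and y holds the data labels (the ground truth).

```python
# A minimal sketch of labeled data for supervised ML: each row of X is an
# observation (input variables), each entry of y is its label (ground truth).
import numpy as np

X = np.array([[25, 40_000],   # age, income (illustrative input variables)
              [52, 85_000],
              [37, 60_000]])
y = np.array([0, 1, 0])       # data labels, e.g., 0 = "no churn", 1 = "churn"
```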
Types of supervised ML
Regression and classification
Regression
Given input variables, predict a numeric (continuous) value.
Examples:
* Estimate average house price for a region
* Determine demand for a new product
* Predict power usage
CLASSIFICATION
Given input variables, predict a categorical variable.
Examples:
* Predict if it will rain tomorrow
* Determine if a loan application is high-, medium-, or low-risk
* Identify sentiment as positive, negative, or neutral
MACHINE LEARNING ALGORITHMS
Some examples:
* Linear regression
* Logistic regression
* K-nearest neighbor
* Decision trees
* Support vector machines
PARAMETRIC/NON-PARAMETRIC ALGORITHMS
PARAMETRIC: Pre-known functional form f()
* Restrictive assumptions
* Non-flexible
* Pre-determined number of parameters
NON-PARAMETRIC: Any functional form f()
* No assumptions
* Flexible
* Parameters learned from data
LINEAR REGRESSION
Linear regression models the linear relationship between a dependent
variable and one or more independent variables based on a fixed
functional form f(x). The simplest form uses a single independent variable
(simple linear regression); with more than one independent variable, it is
called multiple linear regression.
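A minimal sketch of simple linear regression with scikit-learn on synthetic data; the true slope (3.0) and intercept (5.0) are invented so that the learned parameters can be checked against them.

```python
# A minimal sketch: fit a linear functional form f(x) to noisy synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # one independent variable
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, 100)    # linear relationship + noise

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # estimates should be close to 3.0 and 5.0
```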
LOGISTIC REGRESSION
Logistic regression predicts the probability of an event occurring (binary
outcome variable) based on a number of input variables. It has a fixed
functional form for f(x), and can accommodate a range of input variables.
✓ Classification
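A minimal sketch of logistic regression with scikit-learn; the built-in breast cancer dataset stands in for any problem with a binary outcome variable.

```python
# A minimal sketch: predict class probabilities for a binary outcome.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(model.predict_proba(X_test[:3]))  # probability of each class per observation
```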
K-NEAREST NEIGHBOUR (KNN)
KNN is an algorithm that works locally because it uses a pre-specified
number of observations (k = the number of nearest neighbours) to make
the prediction. For regression, the average of the neighbours' values is
used, whereas in classification the majority class wins.
✓ Regression
✓ Classification
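A minimal sketch of KNN with scikit-learn, showing both the classification (majority vote) and regression (neighbour average) variants; the iris dataset and k = 5 are arbitrary illustrative choices.

```python
# A minimal sketch of KNN: k nearest training observations make the prediction.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X, y = load_iris(return_X_y=True)

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # classification: majority vote
print(clf.predict(X[:2]))

# Regression: average of the neighbours (here predicting the 4th feature
# from the first three, purely for illustration)
reg = KNeighborsRegressor(n_neighbors=5).fit(X[:, :3], X[:, 3])
print(reg.predict(X[:2, :3]))
```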
DECISION TREES
Decision trees are a global approach that uses all observations to make a
prediction. The tree-like structure shows that the functional form f(x) is
approximated in a step-wise manner by means of recursive binary splitting.
✓ Regression
✓ Classification
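A minimal sketch of a decision tree classifier with scikit-learn; the depth limit of 3 is an arbitrary illustration of how the recursive binary splitting is controlled.

```python
# A minimal sketch: fit a depth-limited tree and print its learned splits.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
print(export_text(tree))  # one binary split (rule) per line of the tree
```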
SUPPORT VECTOR MACHINES
Support vector machines estimate the optimal decision boundary (i.e., the
line/plane/hyperplane that separates our data) by applying the kernel trick
(i.e., placing the data in higher dimensions). The data points nearest the
decision boundary are referred to as support vectors, and they form
important margins.
✓ Regression
✓ Classification
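A minimal sketch of a support vector classifier with a radial (RBF) kernel in scikit-learn; the values of C and gamma are illustrative defaults, not tuned.

```python
# A minimal sketch: fit an RBF-kernel SVM and inspect its support vectors.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("Support vectors per class:", svm.n_support_)  # points defining the margins
print("Test accuracy:", svm.score(X_test, y_test))
```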
Unsupervised ML
The goal is to derive associations and patterns based on a selection of input
variables without knowing the target (outcome variable) i.e., we have no
ground truth.
Requires:
* A range of input variables
* No outcome variable
What is a model?
A simplified representation of reality created for a specific purpose based
on some assumptions.
Example: Customer churn
* Create a “formula” for predicting the probability of customer attrition at
contract expiration
How to build a model
- Consider the domain and your problem statement
- Consider the requirement for explainability
- Choose the type of algorithm
- Establish success criteria i.e., definition of success
- Train models
- Model selection
Curse of dimensionality
The curse of dimensionality refers to the situation where we keep adding
more input variables to our data, which creates high-dimensional data.
High-dimensional data = # of input variables ≥ # of observations
The amount of training data needs to grow exponentially to maintain the
same coverage!
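A back-of-the-envelope sketch of the exponential growth: if each input variable is divided into 10 bins (an arbitrary resolution), the number of cells the training data must cover grows as 10^d with the number of dimensions d.

```python
# A minimal sketch of the curse of dimensionality: cells to cover at a fixed
# per-variable resolution grow exponentially with the number of variables.
for d in [1, 2, 5, 10]:
    print(f"{d} input variable(s) -> {10 ** d:,} cells at the same resolution")
```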
Black box ML model
“Black box” ML models are too complex for humans to understand or
interpret. A limitation some ML algorithms suffer from, but not all!
- A complex decision process made by the algorithm
- Difficult to trace back from the predictions to the origin
- Hard to determine why an action was taken
- Model parameters that are non-interpretable
Think carefully about explainability (Can your stakeholders understand the
results of the chosen model?).
In general, it is good practice to use simpler and more interpretable models
when there is no significant benefit gained from choosing a more complex
alternative, an idea also known as Occam’s Razor.
Overfitting
When you learn patterns in the training data that are only there by
chance, i.e., patterns that are not present in new, unseen data.
Non-parametric and non-linear models are prone to overfitting
because they have more flexibility when they approximate the
functional form of f(x).
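A minimal sketch of overfitting with scikit-learn: a fully grown (very flexible) decision tree memorizes the training data but scores noticeably worse on held-out test data.

```python
# A minimal sketch: an unconstrained tree fits training data (almost)
# perfectly but generalizes worse to unseen test data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)  # no depth limit
print("Train accuracy:", tree.score(X_train, y_train))  # ~1.0
print("Test accuracy:", tree.score(X_test, y_test))     # noticeably lower
```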
Underfitting
When you fail to learn important patterns in the training data and
therefore also miss generalizable patterns in new, unseen data. It
will be obvious from the chosen performance metric (on the training
data), and the remedy is to move on and try to estimate
alternative models.
BIAS-VARIANCE TRADE-OFF
Prediction error as a function of model complexity: increasing complexity
reduces bias but increases variance, so the error on unseen data is typically
U-shaped and minimized at an intermediate level of complexity.
Data splitting
The goal: Split the data into a training data set and a test data set.
Why?
- Training a model and predicting with it are two separate things.
- Avoid prediction bias when assessing the accuracy of the model.
Requirements:
* Independent (observations are independent of each other)
* Mutually exclusive (an observation appears in only one of the two sets)
* Completely exhaustive (all observations are allocated)
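A minimal sketch of an 80/20 random split with scikit-learn's train_test_split, which is mutually exclusive and completely exhaustive by construction.

```python
# A minimal sketch: split 150 observations into 120 train and 30 test.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # random 80/20 split
print(len(X_train), len(X_test))  # 120 + 30 = all 150 observations allocated
```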
Data splitting strategies
Random split with 80% train (in-sample) and 20% test data (out-of-sample)
* Stratified random splitting
* Train data set / Validation (tuning) data set / Test data set
* Cross-validation
* Leave-One-Out
* K-fold
Not enough data?
- Use a resampling technique, e.g., bootstrapping
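A minimal sketch of two of these strategies in scikit-learn, 5-fold cross-validation and a bootstrap resample; the dataset and model are illustrative choices.

```python
# A minimal sketch of k-fold cross-validation and bootstrap resampling.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each observation is used for testing exactly once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Accuracy per fold:", np.round(scores, 3))

# Bootstrapping: sample with replacement when there is not enough data
X_boot, y_boot = resample(X, y, replace=True, n_samples=len(X), random_state=42)
```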
Objective functions
How successful is the chosen algorithm? To measure this, you need to
choose an objective (loss) function that represents your goal.
Examples:
* Mean Squared Error (MSE): The average of the squared differences
between your predictions and the actual observations.
* Mean Absolute Error (MAE): The average of the absolute differences
between your predictions and the actual observations.
* Misclassification rate: The number of incorrect predictions out of the
total number of predictions.
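A minimal sketch computing the three losses on invented toy predictions, using scikit-learn and NumPy.

```python
# A minimal sketch of MSE, MAE, and misclassification rate on toy data.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 9.0])
print("MSE:", mean_squared_error(y_true, y_pred))   # mean of squared differences
print("MAE:", mean_absolute_error(y_true, y_pred))  # mean of absolute differences

labels_true = np.array([0, 1, 1, 0])
labels_pred = np.array([0, 1, 0, 0])
print("Misclassification rate:", np.mean(labels_true != labels_pred))
```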
Model tuning
The goal is to establish different versions (candidate models) of the basic
model by tuning the hyperparameters. Hyperparameters are parameters
that are not part of the model itself but impact the training of the model (e.g.,
the k in KNN, the depth of a decision tree, or C and γ in a radial kernel for
SVMs).
How to fine-tune the hyperparameters?
* Run a grid search & do k-fold cross-validation, or use the validation set
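A minimal sketch of a grid search with 5-fold cross-validation in scikit-learn, tuning C and gamma for an RBF-kernel SVM; the grid values are arbitrary illustrations.

```python
# A minimal sketch: search a small grid of C and gamma values, scoring each
# candidate model with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]},
    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))  # best candidate model
```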
FINAL MODEL SELECTION
When selecting the final model (model selection), we look at the fitted
candidate models and choose the best one based on the out-of-sample error,
i.e., the error calculated on data points that were not used in the training
process (e.g., the validation set or cross-validation folds).