Fundamentals of ML Flashcards
What is ML?
- teach a computer model to make predictions and draw conclusions from data
- building computer systems that learn from data
- ML algorithms are trained to find relationships and patterns in data
- intersection of Data Science and Software Engineering
- Data Scientist: explore and prepare data, train ML model
- Software Engineer: integrate models in applications
ML as a function
- An ML model is a software application that encapsulates a function to calculate an output value based on one or more input values
- Training = defining the function from data
- Inferencing = using the function to predict new values
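As a minimal sketch of this idea in Python (the parameter values w and b are made up, as if they had been learned from data):

```python
# A trained model encapsulates a function y = f(x).
# w and b are hypothetical learned parameters (e.g. from
# ice-cream sales observed at different temperatures).
def f(x, w=2.5, b=10.0):
    return w * x + b

y_hat = f(28.0)  # inference: calculate an output for a new input
```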
Steps of Training and Inference
- Data = past observations
- x = observed attributes / features
- y = known value of prediction / label
- x can be a vector of multiple features
- An algorithm is applied to determine the relationship between x and y
- Result of the algorithm is a model that encapsulates a calculation on x to calculate y
- the calculation is a function y = f(x)
- The trained model can be used for inference
- predictions are ŷ (y-hat)
- Trained models are used to draw conclusions from new data
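A minimal sketch of both steps, with numpy's polyfit standing in for the training algorithm (all data values here are invented):

```python
import numpy as np

# Training: past observations with features x and known labels y
x = np.array([20.0, 24.0, 28.0, 32.0])  # observed attribute (e.g. temperature)
y = np.array([60.0, 72.0, 85.0, 98.0])  # known label (e.g. ice-cream sales)

# The algorithm determines the relationship between x and y;
# here: a straight-line fit y = w*x + b
w, b = np.polyfit(x, y, deg=1)

# Inference: use the trained model to predict y-hat for new data
x_new = 30.0
y_hat = w * x_new + b
print(f"prediction for x={x_new}: {y_hat:.1f}")
```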
Types of ML
- Supervised ML
a) Regression
b) Classification
ba) binary classification
bb) multiclass classification
- Unsupervised ML
a) Clustering
Supervised ML
Training data with known features and known label values (= labeled dataset)
- most common type
- label can be anything from a category label to a real-valued number
- model learns a mapping between the input (features) and the output (label) during the training process
- once trained, model can predict the output for new, unseen data
Common Examples for supervised ML
- linear regression for regression problems
- logistic regression for binary classification
- decision trees
- support vector machines for classification problems
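A minimal sketch of one of these, logistic regression for binary classification, using scikit-learn as an example library (the toy data is invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Labeled dataset: features X and known labels y
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# The model learns the mapping between input (features) and output (label)
model = LogisticRegression().fit(X, y)

# Once trained, the model can predict the output for new, unseen data
print(model.predict([[2.5]]))
```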
Unsupervised ML
Only features, no known labels (= unlabeled dataset)
- Model finds patterns and relationships between features
Common Examples of unsupervised ML
- Clustering (grouping similar data points together)
- Dimensionality reduction (reducing the number of random variables under consideration by obtaining a set of principal variables)
- k-means for clustering problems
- Principal Component Analysis (PCA) for dimensionality reduction problems
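A minimal sketch of clustering and dimensionality reduction, again using scikit-learn as an example library (toy data invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Unlabeled dataset: features only, no labels
X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.2, 8.8]])

# k-means: group similar data points together
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# PCA: reduce two features to one principal component
X_reduced = PCA(n_components=1).fit_transform(X)

print(clusters, X_reduced.ravel())
```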
Regression
Models are trained to predict numeric label values based on training data that includes both features and known labels
e.g. predicting ice-cream sales (y) based on temperature (x)
Regression: elements of the training process
- Split the training data randomly (train and validate subsets)
- Use an algorithm to fit the training data to a model
- Validate by predicting values for the validation subset
- compare actual labels to predictions
- aggregate the differences to calculate a metric of accuracy
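A minimal sketch of these steps with scikit-learn (invented ice-cream data; train_test_split performs the random split):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Toy data: temperature (x) and ice-cream sales (y)
X = np.array([[20], [22], [24], [26], [28], [30], [32], [34]], dtype=float)
y = np.array([60, 65, 72, 78, 85, 90, 98, 104], dtype=float)

# 1. Split the training data randomly into train and validate subsets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# 2. Use an algorithm to fit the training data to a model
model = LinearRegression().fit(X_train, y_train)

# 3. Validate by predicting values for the validation subset
y_pred = model.predict(X_val)

# 4. Compare actual labels to predictions, aggregate into a metric
print("MAE:", mean_absolute_error(y_val, y_pred))
```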
Regression Evaluation Metrics
- MAE (Mean Absolute Error)
- MSE (Mean Squared Error)
- RMSE (Root Mean Squared Error)
- R2 (Coefficient of Determination)
MAE
Mean absolute error
- average error magnitude (by how much was each prediction wrong)
- doesn't matter if the error is + or -
- MAE = ∑|y-ŷ| ÷ n
MSE
Mean squared error
- amplifies larger errors by squaring them
- no longer represents the error as a quantity in the label's units
- useful when a model that is consistently slightly wrong is preferable to one with fewer but larger errors
- MSE = ∑(y-ŷ)² ÷ n
RMSE
Root mean squared error
- square root of MSE, so the error is expressed as a quantity in the label's units again
- RMSE = √MSE
R2
Coefficient of determination
- proportion of variance in the validation results that can be explained by the model
- separates natural random variance from anomalous aspects of the validation data
R² = 1 - ∑(y-ŷ)² ÷ ∑(y-ȳ)²
ȳ = mean of actual value labels
- result between 0 and 1
- the closer to 1 the better the model is fitting the validation data
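A minimal sketch computing all four metrics with scikit-learn (the label values are invented):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([60.0, 72.0, 85.0, 98.0])  # actual labels y
y_pred = np.array([62.0, 70.0, 88.0, 95.0])  # predictions ŷ

mae = mean_absolute_error(y_true, y_pred)  # ∑|y-ŷ| ÷ n
mse = mean_squared_error(y_true, y_pred)   # ∑(y-ŷ)² ÷ n
rmse = np.sqrt(mse)                        # back in the label's units
r2 = r2_score(y_true, y_pred)              # 1 - ∑(y-ŷ)² ÷ ∑(y-ȳ)²

print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```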
Binary Classification
Calculates probability values for class assignments
whether the observed item is or isn't an instance of a specific class
e.g. predicting Diabetes yes or no
Steps of Binary Classification
- Algorithm calculates probability values for class assignments
- Evaluation metrics compare predicted to actual classes
- Probability is measured as a value between 0.0 and 1.0
- Function describes the probability of y being true (=1) for a given value of x: f(x) = P(y=1|x)
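A minimal sketch of these steps with scikit-learn's LogisticRegression (toy data invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy labeled data: one feature, binary label (e.g. diabetes yes/no)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# f(x) = P(y=1|x): a probability between 0.0 and 1.0
p = model.predict_proba([[3.5]])[0, 1]

# A threshold (here 0.5) turns the probability into a class
y_hat = int(p >= 0.5)
print(f"P(y=1|x)={p:.2f} -> predicted class {y_hat}")
```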
Classification evaluation metrics
- Confusion matrix
- Accuracy
- Recall (TPR true positive rate)
- Precision
- F1-Score
- AUC (area under the curve)
Confusion Matrix
Matrix of number of correct and incorrect predictions for each possible class label.
columns = ŷ (0 and 1)
rows = y (0 and 1)
TN (true negative): ŷ=0 and y=0
FP (false positive): ŷ=1 and y=0
FN (false negative): ŷ=0 and y=1
TP (true positive): ŷ=1 and y=1
The arrangement of the confusion matrix is such that correct (true) predictions are shown in a diagonal line from top-left to bottom-right
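A minimal sketch with scikit-learn's confusion_matrix, which uses the same layout, rows = actual y and columns = predicted ŷ (toy labels invented):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]  # actual labels
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]  # predicted labels

# Output layout:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```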
Accuracy
Proportion of predictions that were correct
(TN+TP) ÷ (TN+FN+FP+TP)
Recall
TPR true positive rate
Measures the proportion of actual positive cases that were identified correctly
TP ÷ (TP+FN)
e.g. of all patients who actually have diabetes, how many were identified correctly
Precision
Proportion of predicted positive cases where label is actually positive
TP ÷ (TP+FP)
e.g. what proportion of predicted positive cases actually have diabetes
F1-Score
Combines recall and precision
(2 x Precision x Recall) ÷ (Precision + Recall)
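A minimal sketch computing accuracy, recall, precision, and F1 with scikit-learn (same toy labels as in the confusion matrix sketch above):

```python
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TN+TP) ÷ total
print("Recall   :", recall_score(y_true, y_pred))     # TP ÷ (TP+FN)
print("Precision:", precision_score(y_true, y_pred))  # TP ÷ (TP+FP)
print("F1       :", f1_score(y_true, y_pred))         # 2PR ÷ (P+R)
```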
AUC (Area under the curve)
Plotting a ROC curve (receiver operating characteristic)
Shows the TPR and FPR (true/false positive rate) for every possible threshold (the decision point for yes or no)
AUC = area under the ROC curve; the diagonal straight line represents random guessing (AUC = 0.5)
e.g. an AUC of 0.875 means the model works better than random guessing (over 0.5)
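A minimal sketch computing the ROC curve and AUC with scikit-learn (the probability scores are invented):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]                       # actual classes
scores = [0.1, 0.3, 0.6, 0.2, 0.8, 0.4, 0.9, 0.3, 0.7, 0.95]  # P(y=1|x)

# ROC curve: TPR vs. FPR at every possible threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

# AUC: area under that curve; 0.5 = random guessing, 1.0 = perfect
print("AUC:", roc_auc_score(y_true, scores))
```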