Fundamentals of ML Flashcards
What is ML?
- teach a computer model to make predictions and draw conclusions from data
- building computer systems that learn from data
- ML algorithms are trained to find relationships and patterns in data
- intersection of Data Science and Software Engineering
- Data Scientist: explore and prepare data, train ML model
- Software Engineer: integrate models in applications
ML as a function
- A ML model is a software application that encapsulates a function to calculate an output value based on one or more input values
- Training = defining the function from known data
- Inferencing = using the function to predict new values
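A minimal conceptual sketch of a model as a function y = f(x) (the parameters below are illustrative, not learned from real data):

```python
# Hedged sketch: a trained model behaves like a function y = f(x).
def f(x: float) -> float:
    """Hypothetical trained model: calculates an output value from one input value."""
    return 2.5 * x + 10.0   # assumed parameters "defined" by training

# Inferencing: apply the encapsulated function to a new input value
y_hat = f(4.0)
print(y_hat)   # 20.0
```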
Steps of Training and Inference
- Data = past observations
- x = observed attributes / features
- y = known value of prediction / label
- x can be a vector of multiple features
- Algorithm is applied to determine the relationship between x and y
- Result of algorithm is a model that encapsulates a calculation on x to calculate y
- calculation is a function y = f(x)
- Trained model can be used for inference
- predictions are ŷ (y-hat)
- Trained models are used to draw conclusions from new data
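A hedged sketch of these steps using NumPy (the data points and the use of np.polyfit as the training algorithm are assumptions for illustration):

```python
import numpy as np

# Past observations: features x and known labels y
x = np.array([10, 15, 20, 25, 30], dtype=float)       # observed attribute (feature)
y = np.array([95, 140, 180, 230, 270], dtype=float)   # known label

# Training: an algorithm determines the relationship between x and y
slope, intercept = np.polyfit(x, y, deg=1)             # fits y ≈ slope*x + intercept

# The result is a model that encapsulates the calculation y = f(x)
f = lambda x_new: slope * x_new + intercept

# Inference: use the trained model to predict ŷ for new data
y_hat = f(22.0)
print(round(y_hat, 1))
```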
Types of ML
- Supervised ML
a) Regression
b) Classification
ba) binary classification
bb) multiclass classification
- Unsupervised ML
a) Clustering
Supervised ML
Training data with known feature and label values (= labeled dataset)
- most common type
- label can be anything from a category label to a real-valued number
- model learns a mapping between the input (features) and the output (label) during the training process
- once trained, model can predict the output for new, unseen data
Common Examples of supervised ML
- linear regression for regression problems
- logistic regression for binary classification
- decision trees
- support vector machines for classification problems
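A hedged sketch of two of these algorithms with scikit-learn (the tiny datasets are made up for illustration):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: numeric label
X_reg = [[10], [15], [20], [25]]          # feature: e.g. temperature
y_reg = [95, 140, 180, 230]               # label: e.g. sales
reg_model = LinearRegression().fit(X_reg, y_reg)
print(reg_model.predict([[22]]))          # predicted numeric value

# Binary classification: categorical label (0 / 1)
X_clf = [[1], [2], [8], [9]]              # feature: e.g. hours studied
y_clf = [0, 0, 1, 1]                      # label: fail / pass
clf_model = LogisticRegression().fit(X_clf, y_clf)
print(clf_model.predict([[5]]))           # predicted class
```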
Unsupervised ML
Only features, no known labels (= unlabeled dataset)
- Model finds patterns and relationships between features
Common Examples of unsupervised ML
- Clustering (grouping similar data points together)
- Dimensionality reduction (reducing the number of random variables under consideration by obtaining a set of principal variables)
- k-means for clustering problems
- Principal Component Analysis (PCA) for dimensionality reduction problems
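A hedged sketch of k-means and PCA with scikit-learn (the four 2-D points are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Unlabeled data: only features, no known labels
X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])

# Clustering: group similar data points together
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment per data point

# Dimensionality reduction: project the features onto principal components
pca = PCA(n_components=1).fit(X)
print(pca.transform(X))          # each point expressed by one principal variable
```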
Regression
Models are trained to predict numeric label values based on training data that includes both features and known labels
e.g. predicting ice-cream sales (y) based on temperature (x)
Regression elements of training process
- Split the training data randomly into train and validate subsets
- Use an algorithm to fit the training data to a model
- Validate by predicting label values for the validation subset
- compare actual labels to predictions
- aggregate the differences to calculate a metric of accuracy
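A hedged sketch of this workflow with scikit-learn (the temperature/sales values are made up for illustration):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = [[10], [12], [15], [18], [20], [23], [25], [28], [30], [33]]  # temperature
y = [95, 110, 140, 160, 180, 205, 230, 250, 270, 300]             # ice-cream sales

# 1. Split the data randomly into train and validate subsets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Use an algorithm to fit the training data to a model
model = LinearRegression().fit(X_train, y_train)

# 3. Validate by predicting label values for the held-out subset
y_pred = model.predict(X_val)

# 4. Compare actual labels to predictions (the differences feed the metrics below)
for actual, predicted in zip(y_val, y_pred):
    print(actual, round(predicted, 1))
```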
Regression Evaluation Metrics
- MAE (Mean Absolute Error)
- MSE (Mean Squared Error)
- RMSE (Root Mean Squared Error)
- R2 (Coefficient of Determination)
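A hedged sketch of computing all four metrics with scikit-learn and NumPy (the y_val / y_pred values are assumed); each metric is described in the cards below:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_val  = [120, 150, 180, 210]        # assumed actual labels
y_pred = [110, 155, 170, 225]        # assumed predictions

mae  = mean_absolute_error(y_val, y_pred)
mse  = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)                  # root of MSE, back in the label's units
r2   = r2_score(y_val, y_pred)
print(mae, mse, rmse, r2)
```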
MAE
Mean absolute error
- average absolute difference (by how much was each prediction wrong)
- doesn’t matter if + or -
MSE
Mean squared error
- amplifies larger errors
- no longer expressed in the label's unit of quantity
- favors a model that is consistently slightly wrong over one that makes fewer but larger errors
RMSE
Root mean squared error
- square root of MSE, so the error is expressed in the label's unit of quantity again
R2
Coefficient of determination
- proportion of variance in the validation labels that the model explains
- distinguishes natural random variance from anomalous aspects of the model
R² = 1 − ∑(y − ŷ)² ÷ ∑(y − ȳ)²
ȳ = mean of the actual label values
- result is usually between 0 and 1 (can be negative for a very poor fit)
- the closer to 1 the better the model is fitting the validation data
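A minimal sketch of the R² formula above in plain NumPy (the label and prediction values are assumed):

```python
import numpy as np

y     = np.array([120, 150, 180, 210], dtype=float)   # actual labels
y_hat = np.array([110, 155, 170, 225], dtype=float)   # predictions

ss_res = np.sum((y - y_hat) ** 2)        # ∑(y − ŷ)²  residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)     # ∑(y − ȳ)²  variance around the mean
r2 = 1 - ss_res / ss_tot
print(r2)                                # closer to 1 → better fit on validation data
```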