ML System Design Flashcards
Elements of ML System Design
Clarify functional requirements (business objective)
Performance requirements
Frame as an ML Problem (inputs and outputs)
What data do we have access to for training
Feature engineering
Choose a model
Prediction Pipeline
Training pipeline
Offline & Online Metrics
Questions to ask
- How much training data do we have access to?
- What is the state of the data (already in the form of features, event data, log data)?
- What’s more important: accuracy or response time?
- Hardware constraints?
- Time constraints?
- How often does the model need retraining? (e.g., spam detection vs. recommendation systems)
confusion matrix
- summary of prediction results from a classification model
- predicted values on y axis
- actual values on x axis
false positive rate
false positive / total negatives
FP / (FP + TN)
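A minimal sketch computing the confusion matrix and FPR from toy labels with scikit-learn (the labels are illustrative):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0]  # actual labels (toy data)
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]  # model predictions

# Note: scikit-learn's convention puts actual values on rows
# and predicted values on columns.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

fpr = fp / (fp + tn)  # FPR = FP / (FP + TN)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}, FPR={fpr:.2f}")
```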
Imbalanced dataset
A classification data set with skewed class proportions (far more positives than negatives or vice versa)
Difference between precision and FPR
Precision measures the probability that a sample classified as positive is actually positive, while the FPR measures the ratio of false positives to total negatives.
Precision is the better metric for datasets with a large number of negative samples, because a huge pool of true negatives makes even many false positives yield a tiny FPR (see the worked example below).
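A hypothetical worked example on a heavily imbalanced dataset:

```python
# Hypothetical counts: 1,000,000 negatives, 100 positives.
tp, fn = 90, 10
fp, tn = 1_000, 999_000

precision = tp / (tp + fp)  # 90 / 1,090 ≈ 0.083 -> clearly weak
fpr = fp / (fp + tn)        # 1,000 / 1,000,000 = 0.001 -> looks deceptively good
```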
Training set
Examples used for learning to fit the parameters of the model
Validation set
Set of examples used to tune hyperparameters, e.g., the number of layers of a neural network or the batch size
Test set
Used to assess the performance of a fully trained model
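A minimal sketch of a three-way split using scikit-learn's train_test_split (the 60/20/20 ratios and toy arrays are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix
y = np.arange(10) % 2             # toy binary labels

# 60% train, then split the remaining 40% evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
```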
sensitivity
measures the model’s ability to correctly identify the positives in each available category; also called recall or the true positive rate: sensitivity = TP / (TP + FN)
specificity
measures the model’s ability to correctly identify the negatives for each available category; also called the true negative rate: specificity = TN / (TN + FP)
System design: data
Identify target variables
implicit (putting an item in your shopping cart) vs explicit (buying an item)
Example features
user-location
user-age
aggregate features like user-candidate total likes
What to do about missing data and outliers
If the dataset is large enough, you can drop them
If you can’t afford to drop any data, you can impute feature values by replacing them with a default (typically the mean, median, or mode)
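A minimal sketch with pandas showing both options (the column name and values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35, np.nan]})

dropped = df.dropna()                             # large dataset: drop missing rows
df["age"] = df["age"].fillna(df["age"].median())  # otherwise: impute the median
```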
sample bias
Happens when the collected data doesn’t accurately represent the environment the program is expected to run in.
e.g. training a facial recognition model on daytime lighting conditions only
exclusion bias
Happens as a result of excluding some feature(s) from the dataset, usually during data cleaning.
Use feature importance tools. Don’t guess.
Measurement bias
Systematic value distortion that happens when there’s an issue with the device used to observe or measure.
Prejudice bias
Happens as a result of cultural influences or stereotypes.
e.g., image datasets where nurses are mostly pictured as women, or wedding dresses are all Western-style
Ranking model (recommendation systems)
Scores each candidate, e.g., by estimating the probability that a video will be watched
Feature Scaling Techniques
- Normalization (Min/Max Scaling)
- Standardization (Z-score normalization)
- Log scaling
- Discretization (Bucketing)
- Encoding categorical features (integer encoding, one-hot encoding, embedding learning)
Normalization
- Min/max scaling
- Values are mapped to the range [0, 1]: x' = (x - min) / (max - min)
- Normalization does not change the shape of a feature’s distribution
Standardization (Z-Score Normalization)
- Rescales a feature to have a mean of 0 and a standard deviation of 1: z = (x - μ) / σ
Log Scaling
- Mitigates skew in long-tailed features, e.g., x' = log(1 + x) (see the combined sketch below)
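A NumPy sketch of all three transforms on a toy feature with one large outlier (values are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # toy values with one large outlier

normalized = (x - x.min()) / (x.max() - x.min())  # min/max -> [0, 1]
standardized = (x - x.mean()) / x.std()           # z-score -> mean 0, std 1
log_scaled = np.log1p(x)                          # log(1 + x) compresses the tail
```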
Discretization
- Bucketing
- The process of converting a continuous feature into a categorical feature
- Age -> age bucket (reduces a continuous feature to a small number of categories)
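A sketch of age bucketing with pandas.cut (the bucket edges and labels are illustrative assumptions):

```python
import pandas as pd

ages = pd.Series([4, 17, 25, 42, 70])
# Each continuous age falls into exactly one labeled bucket.
buckets = pd.cut(ages, bins=[0, 12, 18, 35, 60, 120],
                 labels=["child", "teen", "young_adult", "adult", "senior"])
```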
Techniques for Encoding Categorical Features
- Integer encoding
- One-hot encoding
- Embedding learning
When is integer encoding a good choice?
When there is an ordinal relationship between the categories. Integer encoding works well for ratings, e.g., Excellent = 5 stars, Good = 4 stars.
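A minimal sketch of integer-encoding an ordinal rating scale (the full mapping is an assumed extension of the example above):

```python
# Ordinal mapping: the integer order mirrors the category order.
rating_to_int = {"Terrible": 1, "Poor": 2, "Fair": 3, "Good": 4, "Excellent": 5}
encoded = [rating_to_int[r] for r in ["Good", "Excellent", "Fair"]]  # [4, 5, 3]
```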
One hot encoding
A new binary feature is created for each unique value
If your feature has 3 colors, convert that to three booleans (isRed, isGreen, isBlue).
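A sketch using pandas.get_dummies on the three-color example:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
# One binary column per unique value: is_blue, is_green, is_red
one_hot = pd.get_dummies(df["color"], prefix="is")
```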
Embedding Learning
- Map a categorical feature into an N-dimensional vector
- Useful when the number of unique values of a feature is very large
- Learn an N-dimensional vector for each unique value the categorical feature may take; the resulting vectors are far smaller than their one-hot encoded equivalents
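A minimal sketch with PyTorch's nn.Embedding (the vocabulary size and dimension are illustrative assumptions):

```python
import torch
import torch.nn as nn

# 10,000 unique category values, each mapped to a learned 4-d vector.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=4)

ids = torch.tensor([7, 42, 9_999])  # integer-encoded category values
vectors = embedding(ids)            # shape (3, 4); one-hot would be (3, 10000)
```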
Harmonic mean
More heavily weighted toward the smallest numbers, which mitigates the impact of large outliers: H = n / (1/x1 + ... + 1/xn). This is why the F1 score uses the harmonic mean of precision and recall.
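A quick comparison of harmonic vs. arithmetic means on hypothetical precision/recall values:

```python
precision, recall = 0.9, 0.1  # hypothetical: great precision, poor recall

harmonic = 2 / (1 / precision + 1 / recall)  # 0.18 -- dragged toward the 0.1
arithmetic = (precision + recall) / 2        # 0.50 -- hides the weak recall
```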