ML System Design Flashcards
Elements of ML System Design
Clarify functional requirements (business objective)
Performance requirements
Frame as an ML Problem (inputs and outputs)
What data do we have access to for training
Feature engineering
Choose a model
Prediction Pipeline
Training pipeline
Offline & Online Metrics
Questions to ask
- How much training data do we have access to?
- What is the state of the data (already in the form of features, event data, log data)?
- What’s more important: accuracy or response time?
- Hardware constraints?
- Time constraints?
- How often does the model need to be retrained (e.g. spam detection vs recommendation systems)?
confusion matrix
- summary of prediction results from a classification model
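- predicted values on the y axis
- actual values on the x axis (one common layout; some tools, such as scikit-learn, put actual values on the rows instead)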
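A minimal sketch of how the four cells of a binary confusion matrix can be tallied from labels. The function name `confusion_counts` and the sample labels are made up for illustration, and the positive class is assumed to be encoded as 1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, FP, TN, FN for a binary classifier (illustrative helper)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical labels, not taken from any real dataset.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```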
false positive rate
false positive / total negatives
FP / (FP + TN)
Imbalanced dataset
A classification data set with skewed class proportions (far more positives than negatives or vice versa)
Difference between precision and FPR
Precision, TP / (TP + FP), measures the probability that a sample classified as positive is actually positive, while the FPR measures the ratio of false positives to total negatives.
Precision is a better metric for datasets with a large number of negative samples: the large TN count keeps the FPR low even when false positives outnumber true positives.
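A worked example of the difference, using hypothetical counts (100 positives, 10,000 negatives) chosen only to illustrate the point. The FPR looks excellent while precision shows that most positive predictions are wrong.

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def false_positive_rate(fp, tn):
    """Fraction of actual negatives that were flagged as positive."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical imbalanced dataset: 100 positives, 10,000 negatives.
tp, fn = 80, 20
fp, tn = 200, 9800

p = precision(tp, fp)              # 80 / 280, about 0.286
fpr = false_positive_rate(fp, tn)  # 200 / 10000 = 0.02
print(f"precision={p:.3f}, FPR={fpr:.3f}")
```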
Training set
Examples used for learning to fit the parameters of the model
Validation set
Set of examples used to tune hyperparameters, for example the number of layers of a neural network or the batch size
Test set
Used to assess the performance of a fully trained model
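A minimal sketch of a three-way split, assuming a simple random shuffle is appropriate (time-ordered or grouped data would need a different strategy). The function name and the default fractions are illustrative choices, not a standard.

```python
import random


def train_val_test_split(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then slice into train / validation / test lists."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100), val_frac=0.1, test_frac=0.2)
print(len(train), len(val), len(test))  # 70 10 20
```

The three lists are disjoint, so the test set stays untouched while hyperparameters are tuned on the validation set.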
sensitivity
true positive rate (recall): the fraction of actual positives the model correctly identifies, TP / (TP + FN); for multiclass problems it is computed per category
specificity
true negative rate: the fraction of actual negatives the model correctly identifies, TN / (TN + FP); for multiclass problems it is computed per category
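Both metrics can be computed directly from confusion-matrix counts. The sketch below uses hypothetical counts for a binary problem; for multiclass problems the same functions would be applied once per category, treating that category as the positive class.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else 0.0


# Hypothetical counts: 50 actual positives, 100 actual negatives.
tp, fn, tn, fp = 40, 10, 90, 10

sens = sensitivity(tp, fn)  # 40 / 50 = 0.8
spec = specificity(tn, fp)  # 90 / 100 = 0.9
print(sens, spec)
```

Specificity and the false positive rate are complements: FPR = 1 - specificity.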
System design: data
Identify target variables
implicit (putting an item in your shopping cart) vs explicit (buying an item)
Example features
user-location
user-age
aggregate features like user-candidate total likes
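A sketch of how an aggregate feature such as user-candidate total likes might be derived from raw event data. The event tuple layout `(user_id, candidate_id, event_type)` and the sample events are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical event log: (user_id, candidate_id, event_type).
events = [
    ("u1", "c1", "like"),
    ("u1", "c1", "like"),
    ("u1", "c2", "view"),
    ("u2", "c1", "like"),
    ("u1", "c2", "like"),
]


def total_likes_per_pair(events):
    """Count 'like' events for each (user, candidate) pair."""
    return Counter(
        (user, candidate)
        for user, candidate, kind in events
        if kind == "like"
    )


likes = total_likes_per_pair(events)
print(likes[("u1", "c1")])  # 2
print(likes[("u2", "c2")])  # 0 (Counter returns 0 for unseen pairs)
```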
What to do about missing data and outliers
If the dataset is large enough, you can drop the affected rows
If you can’t afford to drop any data, you can impute missing feature values by replacing them with a default (typically the mean, median, or mode of that feature)
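A minimal sketch of median imputation for a single numeric feature, assuming missing values are represented as `None`. The function name and the sample ages are illustrative, and outlier handling is not covered here.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical user-age feature with two missing entries.
ages = [25, None, 31, 40, None, 29]
imputed = impute_median(ages)
print(imputed)  # [25, 30.0, 31, 40, 30.0, 29]
```

The fill value should be computed on the training set only and reused at prediction time, so that validation and test data do not leak into the features.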
sample bias
Happens when the collected data doesn’t accurately represent the environment the program is expected to run in.
e.g. training facial recognition in daytime lighting conditions only