Data Flashcards
Data components
data modeling
data infrastructure
statistical models
machine learning
heuristic vs. machine learning
Productizing Machine learning models
Model flexibility
- Prototype with real data
- Backtesting
- Create components
Code quality
- Iteration
- Version control
Computational performance
- Run time
- Output frequency
Connectivity
- Databases and API
- Data validation
Scored in real time vs. prescored offline
Retraining models
Similarity between data scientist and PM
Decision making with data.
ML Model evaluation
1- Process the error rate and compare it against a baseline. Overfitting must be nonexistent.
2- Make trade-offs between time, cost and accuracy.
3- Soft launch.
4- Have a backup for degrading gracefully: a back-up model (albeit probably less accurate) or even a rule-based system, ready to be deployed in place of your model of choice when predictions go south.
https://towardsdatascience.com/on-being-a-data-science-product-manager-5c8baf42e0a7
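A minimal sketch of steps 1 and 4 above, assuming a classification task where predictions are compared to known labels. The function names, the sample data and the `"model"` / `"baseline"` return values are illustrative assumptions, not part of the source article.

```python
def error_rate(predictions, labels):
    """Fraction of predictions that do not match the true labels."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


def choose_model(model_preds, baseline_preds, labels):
    """Serve the model only if its error rate beats the baseline's.

    Otherwise fall back to the baseline (for example a rule-based system).
    """
    model_err = error_rate(model_preds, labels)
    baseline_err = error_rate(baseline_preds, labels)
    return "model" if model_err < baseline_err else "baseline"


# Hypothetical held-out data: true labels and two sets of predictions.
labels = [1, 0, 1, 1, 0]
model_preds = [1, 0, 1, 0, 0]      # one mistake
baseline_preds = [1, 1, 1, 1, 1]   # always predicts 1, two mistakes

print(choose_model(model_preds, baseline_preds, labels))
```

Running the comparison on held-out data rather than training data is what makes it a check against overfitting.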
Root Mean Square Error
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).
Used to evaluate how well a regression model fits the data: the lower the RMSE, the closer the predictions are to the observed values.
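As an illustration, RMSE can be computed by squaring each residual, averaging, and taking the square root. The function name and the sample values below are made up for the example.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical observed values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

print(rmse(actual, predicted))
```

The result is in the same unit as the predicted quantity, which makes it easier to interpret than the mean squared error.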
Predictive Model Markup Language
The PMML file format (XML-based) specifies the data fields to use for the model, the type of calculation to perform (e.g. regression), and the structure of the model.
Dataflow
Read
Apply
Write
P value
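Probability of seeing a difference at least as large as the one observed if the variations actually performed the same. Used to determine whether the results of an A/B test are reliable; by convention it must be below 0.05.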
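One common way to get a p value for an A/B test on conversion rates is a two-proportion z-test. The sketch below assumes that test applies (large samples, independent visitors); the function name and the visitor counts are hypothetical.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p value for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical test: 200/2000 conversions for A, 260/2000 for B.
p_value = two_proportion_p_value(200, 2000, 260, 2000)
print(p_value, p_value < 0.05)
```

The sketch does not guard against a pooled rate of exactly 0 or 1, where the standard error is zero.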
Statistical significance
Degree of certainty that the results are not due to chance. The conventional threshold is 95% (p value below 0.05).
A/B test framework
1- Collect data 2- Identify goals 3- Generate hypothesis 4- Create variations 5- Run experiments 6- Analyze results.
Multivariate testing
The goal of multivariate testing is to determine which combination of variations performs the best out of all of the possible combinations.
Each combination tested is a new group, so more traffic volume is needed.
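To see how quickly the number of groups grows, the combinations can be enumerated. The page elements and the per-group visitor count below are invented for the example.

```python
from itertools import product

# Hypothetical page elements, each with its own variations.
headlines = ["A", "B"]
button_colors = ["green", "blue", "red"]
images = ["photo", "illustration"]

# Every combination of one headline, one color and one image is a group.
combinations = list(product(headlines, button_colors, images))
print(len(combinations))  # 2 * 3 * 2 = 12

# Assumed minimum number of visitors needed in each group.
visitors_per_group = 400
print(len(combinations) * visitors_per_group)  # 12 * 400 = 4800
```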
Minimum sample size
Minimum traffic required to reach statistical significance.
For example, if you want a 5% margin of error (at a 95% confidence level), your sample size will be approximately 1/(0.05²) = 400.
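The 1/e² shortcut can be compared with the usual formula for a proportion, n = z²·p·(1-p)/e². The sketch assumes a 95% confidence level (z = 1.96) and the worst case p = 0.5; the function names are illustrative.

```python
import math


def sample_size_rule_of_thumb(margin_of_error):
    """Approximate sample size using the 1/e^2 shortcut."""
    return round(1 / margin_of_error ** 2)


def sample_size_proportion(margin_of_error, p=0.5, z=1.96):
    """Sample size for estimating a proportion; z = 1.96 assumes 95% confidence."""
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


print(sample_size_rule_of_thumb(0.05))  # 400
print(sample_size_proportion(0.05))     # 385
```

The shortcut slightly overestimates because it rounds z up from 1.96 to 2.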
Margin of error
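Precision of the test results: the range around the measured value x within which the true value is expected to fall at the chosen confidence level. With a 5% margin of error, the true value is expected to lie between x-5% and x+5%.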
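For a measured proportion, the margin of error can be estimated from the sample size. This is a sketch assuming a 95% confidence level (z = 1.96) and a normal approximation; the function name and inputs are hypothetical.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Margin of error for a proportion p measured on n samples."""
    return z * math.sqrt(p * (1 - p) / n)


# Worst case p = 0.5 with 400 samples gives roughly a 5% margin.
moe = margin_of_error(0.5, 400)
print(moe)
print(0.5 - moe, 0.5 + moe)  # range expected to contain the true value
```

Because the margin shrinks with the square root of n, halving it requires about four times as many samples.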
Machine learning models
Supervised
- Classification
- Regression (e.g. linear regression)
Unsupervised
- Clustering