Developing ML Solutions Flashcards
What are the different stages in the ML Lifecycle?
The end-to-end machine learning lifecycle process includes the following phases:
- Business goal identification - what’s the business objective, what does success look like, what are the metrics? Budget? Value?
- ML problem framing - convert the business problem into an ML problem. Is ML appropriate for this business problem?
- Data processing (data collection, data preprocessing, and feature engineering) - collect data, convert it into a usable format, and engineer features.
- Model development (training, tuning, and evaluation) - iterative, can be performed multiple times with additional feature engineering each time.
- Model deployment (inference and prediction)
- Model monitoring
- Model retraining - needed if it does not meet business goals or as new data becomes available. Needed to ensure model remains accurate over time.
What is Feature Engineering?
It is a step in the data processing phase of the ML lifecycle.
Feature engineering is the process of creating, transforming, extracting, and selecting variables from data.
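As an illustration (the dataset and column names below are invented), a minimal pandas sketch of the create/transform/select steps:

```python
# A minimal sketch of feature engineering with pandas
# (columns and values are hypothetical, for illustration only).
import pandas as pd

df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-05", "2023-03-20"]),
    "last_login": pd.to_datetime(["2023-06-01", "2023-06-10"]),
    "plan": ["free", "pro"],
})

# Create: derive a new variable from existing ones.
df["days_active"] = (df["last_login"] - df["signup_date"]).dt.days

# Transform: encode a categorical variable as numeric columns.
df = pd.get_dummies(df, columns=["plan"])

# Select: keep only the engineered features for training.
features = df[["days_active", "plan_free", "plan_pro"]]
```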
In a ML model, what are hyperparameters?
Hyperparameters are external configuration variables that data scientists use to manage machine learning model training. Sometimes called model hyperparameters, they are set manually before training a model. They're different from parameters, which are internal values derived automatically during the learning process rather than set by data scientists.
Examples of hyperparameters include the number of nodes and layers in a neural network and the number of branches in a decision tree. Hyperparameters determine key features such as model architecture, learning rate, and model complexity.
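A small scikit-learn sketch contrasting the two, using a decision tree:

```python
# Hyperparameters vs. learned parameters, illustrated with
# scikit-learn's DecisionTreeClassifier.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters: set manually BEFORE training (tree depth, split criterion).
model = DecisionTreeClassifier(max_depth=3, criterion="gini")

# Parameters: the split thresholds in the fitted tree are derived
# automatically from the data during fit().
model.fit(X, y)
print(model.get_depth())       # bounded by the max_depth hyperparameter
print(model.tree_.node_count)  # structure learned during training
```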
How do you prep data for ML model training?
Split your data as follows:
80% of data to train the model
10% to validate (improve the model with each training iteration)
10% to test.
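A minimal scikit-learn sketch, assuming a feature matrix X and labels y are already loaded:

```python
# One way to get an 80/10/10 split: hold out 20% first,
# then split the hold-out in half into validation and test.
from sklearn.model_selection import train_test_split

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42)
# X_train is 80% of the data; X_val and X_test are 10% each.
```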
How can Amazon SageMaker help with the ML Lifecycle?
In a single unified visual interface, you can perform the following tasks:
- Collect and prepare data.
- Build and train machine learning models.
- Deploy the models and monitor the performance of their predictions.
What is Amazon SM Data Wrangler?
- A low-code/no-code (LCNC) tool.
- It provides an end-to-end solution to import, prepare, transform, featurize, and analyze data by using a web interface.
- SageMaker also integrates with Amazon EMR and AWS Glue.
- With the SageMaker Processing API, customers can run scripts and notebooks to transform datasets.
- This data analysis helps customers decide which features to define for the model and which data to train it on.
What is Amazon SM Feature Store?
Helps data scientists, machine learning engineers, and general practitioners to create, share, and manage features for ML development.
What are Amazon SM model training and evaluation features?
SM offers features to train models using built-in algorithms (via an SM Training Job).
SM launches compute instances and uses your training code and data to train the model; the trained model is stored in an S3 bucket once training completes (see the sketch after this list).
SM JumpStart - provides pretrained models.
SM Canvas - LCNC for business analysts to build ML Models.
SM Experiments - experiment with different combinations of data, algorithms, and parameters to observe impact on accuracy.
SM Model Tuning - runs many versions with different hyperparameters and measures performance using a metric
SM can also deploy the models in a production environment.
SM Model Monitor - observe the quality of SageMaker ML models in production; continuous or on-schedule monitoring
SM Model Registry - catalog and manage model versions.
SM Pipelines - model-building pipelines for end-to-end ML workflows.
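A hedged sketch of an SM Training Job using the SageMaker Python SDK with the built-in XGBoost algorithm - the role ARN, bucket paths, and version string below are placeholders, not verified values:

```python
# Sketch: launch a SageMaker training job with a built-in algorithm.
# Placeholders: role ARN, S3 paths; check current AWS docs for versions.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder role

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost", region=session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/models/",  # trained model artifact lands here
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# SageMaker launches the compute, runs the algorithm against the data,
# and writes the trained model to the S3 output path.
estimator.fit({"train": "s3://my-bucket/data/train/"})
```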
What is SM Studio?
- Best way to access SM.
- Web-based interface for developing ML applications: preparing data and training, deploying, and monitoring models.
What are the different kinds of ML algorithms that SageMaker provides?
- Supervised learning (e.g. Regression, Classification, K-Nearest Neighbor)
- Unsupervised Learning (e.g. Clustering, Dimensionality Reduction, Embeddings, Anomaly Detection, etc.)
- Image Processing (e.g. Image Classification, Object Detection)
- Text Analysis
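As an example of the supervised-learning category above, a k-nearest-neighbors classifier sketched in scikit-learn (a stand-in for SM's built-in k-NN algorithm, not the SageMaker implementation itself):

```python
# Minimal supervised-learning sketch: k-nearest neighbors classification.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # k is a hyperparameter
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # classification accuracy on held-out data
```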
How do you evaluate ML Models?
- Split the data into Training, Validation and Test sets (80-10-10 rule).
- Model fit - e.g. is it overfitted, underfitted, or balanced?
- Specific Metrics
  - For classification problems, this could be Accuracy, Precision, Recall, F1, or AUC-ROC.
  - For regression problems, Mean Squared Error (MSE) and R-squared.
What is the difference between Bias and Variance?
Bias - the difference between the predicted values and the true values; a systematic error in the model's assumptions.
Variance - how dispersed the predictions are; how sensitive the model is to the particular training data it saw.
Analogy: a bull's-eye target - bias is how far the shots land from the center on average; variance is how spread out the shots are.
An overfitted model has high variance
An underfitted model has high bias
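An illustrative scikit-learn sketch (synthetic data) showing the trade-off by varying model complexity:

```python
# Vary model complexity (polynomial degree) and compare train vs.
# validation scores to see the bias/variance trade-off.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    # degree 1: high bias (underfit) -> typically poor on both sets.
    # degree 15: high variance (overfit) -> train score higher than validation.
    print(degree, model.score(X_tr, y_tr), model.score(X_val, y_val))
```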
What is a confusion matrix?
Evaluates model performance by classifying the predictions as:
True Positive
True Negative
False Positive
False Negative
So, a cat that is predicted as a cat = True Positive
So, a NOT cat that is predicted as a cat = False Positive
A NOT cat that is predicted as NOT cat = True Negative
A cat that is predicted as a NOT cat = False Negative
TPs and TNs are desirable. You want them to be as high as possible.
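A small scikit-learn sketch of the cat example (the labels below are invented for illustration):

```python
# Compute a confusion matrix for the cat / NOT-cat example.
from sklearn.metrics import confusion_matrix

y_true = ["cat", "cat", "not_cat", "not_cat", "cat", "not_cat"]
y_pred = ["cat", "not_cat", "cat", "not_cat", "cat", "not_cat"]

# Rows = actual class, columns = predicted class.
tn, fp, fn, tp = confusion_matrix(
    y_true, y_pred, labels=["not_cat", "cat"]).ravel()
print(tp, tn, fp, fn)  # TP=2, TN=2, FP=1, FN=1
```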
What are Accuracy, Precision, and Recall?
These are different ways by which you evaluate model performance
Accuracy - the proportion of predictions the model got right (i.e. TP and TN as a proportion of the total number of predictions).
Precision - considers only the positive predictions: of everything the model labeled positive, how many were actually positive (TP/(TP+FP))? In email spam detection this may be important - you do not want your model labeling a legitimate email as spam and preventing your users from seeing that email. Use precision when the impact of a FP is high.
Recall - also called sensitivity - the proportion of actual positives that are correctly identified (TP/(TP+FN)). Think about a model that needs to predict whether a patient has a terminal illness. You want as few FNs as possible (i.e. don't classify someone as OK if they have a terminal illness). Use recall when the impact of a FN is high.
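Computing all three directly from the confusion-matrix counts (using the TP=2, TN=2, FP=1, FN=1 values from the sketch above):

```python
# Metrics from raw confusion-matrix counts.
tp, tn, fp, fn = 2, 2, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that were right
precision = tp / (tp + fp)                  # of predicted positives, how many were real
recall = tp / (tp + fn)                     # of actual positives, how many were found
print(accuracy, precision, recall)          # 0.666..., 0.666..., 0.666...
```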
What is AUC-ROC?
Area Under the Curve - Receiver Operating Characteristic (ROC) curve.
Essentially, it measures separability: how well the model distinguishes the positive class from the negative class across all classification thresholds.
The ROC curve plots the true positive rate against the false positive rate (e.g. email spam classification):
True positive rate = the percentage of spam you capture.
False positive rate = the cost of over-filtering (legitimate emails your users never see).
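A minimal scikit-learn sketch (the scores below are invented; 1 = spam, 0 = not spam):

```python
# AUC-ROC from predicted probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_scores = [0.1, 0.4, 0.8, 0.9, 0.65, 0.3]  # model's predicted spam probability

# 1.0 = perfect separability; 0.5 = no better than random guessing.
print(roc_auc_score(y_true, y_scores))
```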