Developing Machine Learning Solutions Flashcards
ML Lifecycle
Business goal identification
ML problem framing
Data processing (data collection, data preprocessing, and feature engineering)
Model development (training, tuning, and evaluation)
Model deployment (inference and prediction)
Model monitoring
Model retraining
Feature engineering is the process of creating, transforming, extracting, and selecting variables from data.
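The four feature engineering activities can be sketched in plain Python on a toy record. The field names and values here are illustrative assumptions, not from any particular dataset:

```python
# Sketch of feature engineering on a toy record (field names and
# values are illustrative assumptions, not from a real dataset).

def engineer_features(record):
    """Create, transform, extract, and select variables from a raw record."""
    features = {}
    # Creating: derive a new ratio variable from two raw fields.
    features["income_per_dependent"] = record["income"] / max(record["dependents"], 1)
    # Transforming: bucket a continuous variable into a category.
    features["age_group"] = "senior" if record["age"] >= 65 else "adult"
    # Extracting: pull a component out of a composite field.
    features["signup_year"] = int(record["signup_date"].split("-")[0])
    # Selecting: return only the variables the model will use.
    return features

raw = {"income": 60000, "dependents": 2, "age": 70, "signup_date": "2021-03-14"}
print(engineer_features(raw))
# {'income_per_dependent': 30000.0, 'age_group': 'senior', 'signup_year': 2021}
```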
Model Development
Initially, upon training, the model will not yield the expected results. Therefore, developers will do additional feature engineering and tune the model’s hyperparameters before retraining.
Amazon SageMaker Data Wrangler is a
low-code/no-code (LCNC) tool. It provides an end-to-end solution to import, prepare, transform, featurize, and analyze data by using a web interface. Customers can add their own Python scripts and transformations to customize workflows.
For more advanced users and data preparation at scale,
Amazon SageMaker Studio Classic comes with built-in integration of Amazon EMR and AWS Glue interactive sessions to handle large-scale interactive data preparation and machine learning workflows within your SageMaker Studio Classic notebook.
Finally, by using the SageMaker Processing API, customers can run their own scripts to preprocess, transform, and analyze data at scale on fully managed SageMaker infrastructure.
Amazon SageMaker Feature Store helps data scientists, machine learning engineers, and general practitioners to
create, share, and manage features for ML development.
Features stored in the store can be retrieved and enriched before being served to the ML models for inference
Customers aiming for an LCNC option can use Amazon SageMaker Canvas.
With SageMaker Canvas, they can use machine learning to generate predictions without needing to write any code.
Amazon SageMaker JumpStart provides
pretrained, open source models that customers can use for a wide range of problem types.
Customers can use Amazon SageMaker Experiments to
experiment with multiple combinations of data, algorithms, and parameters, all while observing the impact of incremental changes on model accuracy.
Amazon SageMaker Automatic Model Tuning
Hyperparameter tuning is a way to find the best version of your model. SageMaker Automatic Model Tuning does that by running many training jobs with different hyperparameter combinations and measuring each job against a metric that you choose.
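The core tuning loop can be illustrated in plain Python. This is a conceptual sketch only, not the SageMaker API: the objective function and parameter ranges are assumptions, and each loop iteration plays the role of one training job:

```python
import random

# Conceptual sketch of hyperparameter tuning via random search.
# The objective function and ranges are assumptions for illustration,
# not the SageMaker Automatic Model Tuning API.

def validation_error(learning_rate, depth):
    """Stand-in for the metric a training job would report."""
    return (learning_rate - 0.1) ** 2 + (depth - 5) ** 2 * 0.01

random.seed(0)
best = None
for _ in range(20):  # each iteration plays the role of one training job
    params = {"learning_rate": random.uniform(0.001, 0.5),
              "depth": random.randint(1, 10)}
    score = validation_error(**params)
    if best is None or score < best[0]:
        best = (score, params)

print(best[1])  # hyperparameters of the best "job" so far
```

SageMaker runs this loop as managed training jobs and supports smarter search strategies (such as Bayesian optimization) instead of pure random sampling.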
With Amazon SageMaker Model Monitor, customers can
observe the quality of SageMaker ML models in production. They can set up continuous monitoring or on-schedule monitoring. SageMaker Model Monitor helps maintain model quality by detecting violations of user-defined thresholds for data quality, model quality, bias drift, and feature attribution drift.
SageMaker JumpStart provides
pretrained open source models for a range of problem types to help you get started with machine learning. These models are ready to deploy or to fine-tune.
AutoML is available in SageMaker Canvas. It
simplifies ML development by automating the process of building and deploying machine learning models.
Built-in models available in SageMaker require more
effort than LCNC options, and if the dataset is large, significant resources are needed to train and deploy the model.
If there is no built-in solution that works, try to develop one that uses
pre-made Docker images for supported machine learning and deep learning frameworks, such as scikit-learn, TensorFlow, PyTorch, MXNet, or Chainer.
You can build your own custom Docker image that is configured to
install the necessary packages or software.
Think about bias as the gap between your predicted value and the actual value, whereas variance describes how dispersed your predicted values are.
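These two definitions can be made concrete with a tiny numeric sketch. The actual value and the repeated predictions below are made-up example numbers:

```python
# Numeric illustration of bias and variance with made-up values.
actual = 10.0
predictions = [12.0, 13.0, 11.0, 12.0]  # one model's repeated predictions

mean_pred = sum(predictions) / len(predictions)
bias = mean_pred - actual  # gap between predicted and actual values
variance = sum((p - mean_pred) ** 2 for p in predictions) / len(predictions)

print(bias)      # 2.0 -> predictions are systematically too high
print(variance)  # 0.5 -> predictions are tightly clustered
```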
Classification Metrics
Accuracy
Precision
Recall
F1
AUC-ROC
Regression Metrics
Mean squared error
R squared
Accuracy
To calculate the model’s accuracy, also known as its score, add up the correct predictions and then divide that number by the total number of predictions.
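That calculation is a one-liner from confusion-matrix counts. The counts below are assumed example numbers:

```python
# Accuracy from confusion-matrix counts (the counts are assumed examples).
tp, tn, fp, fn = 40, 50, 5, 5  # true/false positives and negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.9 -> 90 correct predictions out of 100
```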
Problem with Accuracy
Although accuracy is a widely used metric for classification problems, it has limitations. This metric is less effective when there are a lot of true negative cases in your dataset. This is why two other metrics are often used in these situations: precision and recall.
Precision
Precision removes the negative predictions from the picture. Precision is the proportion of positive predictions that are actually correct. You can calculate it by taking the true positive count and dividing it by the total number of positives.
When the cost of false positives is high in your particular business situation, precision can be a good metric. Think about a classification model that identifies emails as spam or not. In this case, you do not want your model labeling a legitimate email as spam and preventing your users from seeing that email.
Recall
recall (or sensitivity). In recall, you are looking at the proportion of actual positives that are correctly identified. Recall is calculated by dividing the true positive count by the sum of the true positives and false negatives. By looking at that ratio, you get an idea of how good the algorithm is at detecting, for example, cats.
When to use recall
Think about a model that needs to predict whether a patient has a terminal illness or not. In this case, using precision as your evaluation metric does not account for the false negatives in your model. It is vital to the success of the model that it not give false negative results. A false negative would be not identifying a patient as having a terminal illness when the patient actually does have a terminal illness. In this situation, recall is a better metric to use.
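Precision, recall, and their harmonic mean F1 follow directly from the confusion-matrix counts. The counts below are assumed example numbers:

```python
# Precision, recall, and F1 from assumed confusion-matrix counts.
tp, fp, fn = 40, 5, 10  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # how many predicted positives were right
recall = tp / (tp + fn)     # how many actual positives were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both

print(round(precision, 3))  # 0.889
print(round(recall, 3))     # 0.8
print(round(f1, 3))         # 0.842
```

For the terminal-illness example, you would watch recall (driven by false negatives); for the spam example, precision (driven by false positives).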
Area under the curve
In general, AUC-ROC can show what the curve for true positive compared to false positive looks like at various thresholds.
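AUC has a useful pairwise interpretation: it is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. That interpretation can be computed directly; the scores and labels below are assumed example data:

```python
# AUC estimated from its pairwise interpretation: the probability that a
# randomly chosen positive outscores a randomly chosen negative.
# Scores and labels are assumed example data.

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive-vs-negative pairs won (ties count half).
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0  -> perfect ranking
print(auc([0.9, 0.2, 0.8, 0.3], [1, 0, 0, 1]))  # 0.75 -> one pair misranked
```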
Regression metric Mean squared error
The general purpose of mean squared error (MSE) is the same as the classification metrics. You determine the prediction from the model and compare the difference between the prediction and the actual outcome. The smaller the MSE, the better the model’s predictive accuracy.
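The calculation averages the squared differences between predictions and actuals. The values below are assumed examples:

```python
# Mean squared error on assumed example values.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 3.0, 6.0]

# Average of squared prediction errors.
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
print(mse)  # 0.5625 -> smaller is better
```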
Regression metric r squared
R squared explains the fraction of variance accounted for by the model. It’s like a percentage, reporting a number from 0 to 1. When R squared is close to 1, it usually indicates that a lot of the variance in the data can be explained by the model itself.
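R squared compares the model's squared error against the variance of the actual values around their mean. The values below are assumed examples:

```python
# R squared on assumed example values.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 3.0, 6.0]

mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # model error
ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variance
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 3))  # 0.847 -> most of the variance is explained
```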
Two model deployment options
Self-hosted API
Managed API (SageMaker is an example)
SageMaker Deployment provides
SageMaker provides the following:
Deployment with one click or a single API call
Automatic scaling
Model hosting services
HTTPS endpoints that can host multiple models
SageMaker Asynchronous Deploy
SageMaker asynchronous inference is a capability in SageMaker that queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1GB), long processing times (up to one hour), and near real-time latency requirements
SageMaker Serverless Deploy
On-demand serverless inference is ideal for workloads that have idle periods between traffic spurts and can tolerate cold starts. It is a purpose-built inference option that you can use to deploy and scale ML models without configuring or managing any of the underlying infrastructure.
Four Key Principles of MLOps
Version Control
Automation
CI/CD
Model Governance
MLOps: SageMaker training jobs
SageMaker provides a training job feature to train models using built-in algorithms or custom algorithms.
MLOps: SageMaker Experiments
Use SageMaker Experiments to experiment with multiple combinations of data, algorithms, and parameters, all while observing the impact of incremental changes on model accuracy.
MLOps: SageMaker Processing Job
SageMaker Processing refers to the capabilities to run data pre-processing and post-processing, feature engineering, and model evaluation tasks on the SageMaker fully managed infrastructure.
MLOps: SageMaker Model Registry
With SageMaker Model Registry you can catalog models, manage model versions, manage the approval status of a model, or deploy models to production.
bias
(the gap between your predicted value and the actual value)
variance
(how dispersed your predicted values are)