Developing Machine Learning Solutions Flashcards

1
Q

ML Lifecycle

A

Business goal identification

ML problem framing

Data processing (data collection, data preprocessing, and feature engineering)

Model development (training, tuning, and evaluation)

Model deployment (inference and prediction)

Model monitoring

Model retraining

2
Q

Feature engineering is

A

the process of creating, transforming, extracting, and selecting variables from data.

3
Q

Model Development

A

The first training run typically does not yield the expected results, so developers do additional feature engineering and tune the model’s hyperparameters before retraining.

4
Q

Amazon SageMaker Data Wrangler is a

A

low-code/no-code (LCNC) tool. It provides an end-to-end solution to import, prepare, transform, featurize, and analyze data by using a web interface. Customers can add their own Python scripts and transformations to customize workflows.

5
Q

For more advanced users and data preparation at scale,

A

Amazon SageMaker Studio Classic comes with built-in integration of Amazon EMR and AWS Glue interactive sessions to handle large-scale interactive data preparation and machine learning workflows within your SageMaker Studio Classic notebook.

Finally, by using the SageMaker Processing API, customers can run scripts for data pre-processing and post-processing, feature engineering, and model evaluation on fully managed SageMaker infrastructure.

6
Q

Amazon SageMaker Feature Store helps data scientists, machine learning engineers, and general practitioners to

A

create, share, and manage features for ML development.

Stored features can be retrieved and enriched before being served to ML models for inference.
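
As a rough sketch of that retrieval step, the snippet below reads one record from an online feature group with boto3; the feature group name and record ID are hypothetical.

    import boto3

    # Read a single record from a hypothetical online feature group.
    runtime = boto3.client("sagemaker-featurestore-runtime")
    record = runtime.get_record(
        FeatureGroupName="customer-features",   # hypothetical group name
        RecordIdentifierValueAsString="12345",  # hypothetical record ID
    )
    for feature in record["Record"]:
        print(feature["FeatureName"], feature["ValueAsString"])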

7
Q

Customers aiming at a LCNC option can use Amazon SageMaker Canvas.

A

With SageMaker Canvas, they can use machine learning to generate predictions without needing to write any code.

8
Q

Amazon SageMaker JumpStart provides

A

pretrained, open source models that customers can use for a wide range of problem types.

9
Q

Customers can use Amazon SageMaker Experiments to

A

experiment with multiple combinations of data, algorithms, and parameters, all while observing the impact of incremental changes on model accuracy.

10
Q

Amazon SageMaker Automatic Model Tuning

A

Hyperparameter tuning is a way to find the best version of your model. Automatic Model Tuning does this by running many training jobs with different hyperparameter combinations and measuring each one against a metric that you choose.
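
A minimal sketch with the SageMaker Python SDK, assuming estimator is an already-configured Estimator; the metric name, ranges, job counts, and S3 path are illustrative.

    from sagemaker.tuner import (
        ContinuousParameter,
        HyperparameterTuner,
        IntegerParameter,
    )

    # Run many training jobs over the ranges and keep the best model
    # as judged by the chosen objective metric.
    tuner = HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:auc",
        hyperparameter_ranges={
            "eta": ContinuousParameter(0.01, 0.3),
            "max_depth": IntegerParameter(3, 10),
        },
        max_jobs=20,          # total jobs to run
        max_parallel_jobs=4,  # jobs run at the same time
    )
    tuner.fit({"train": "s3://my-bucket/train/"})  # hypothetical S3 path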

11
Q

With Amazon SageMaker Model Monitor, customers can

A

observe the quality of SageMaker ML models in production. They can set up continuous monitoring or on-schedule monitoring. SageMaker Model Monitor helps maintain model quality by detecting violations of user-defined thresholds for data quality, model quality, bias drift, and feature attribution drift.
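
As a hedged sketch, a daily data-quality schedule might be set up with the SDK's DefaultModelMonitor; the role, endpoint name, and S3 path are placeholders.

    from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor

    # Monitor data quality on a deployed endpoint once per day.
    monitor = DefaultModelMonitor(
        role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
        instance_count=1,
        instance_type="ml.m5.xlarge",
    )
    monitor.create_monitoring_schedule(
        endpoint_input="my-endpoint",                # placeholder endpoint
        output_s3_uri="s3://my-bucket/monitoring/",  # placeholder path
        schedule_cron_expression=CronExpressionGenerator.daily(),
    )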

12
Q

SageMaker JumpStart provides

A

pretrained, open source models for a range of problem types to help you get started with machine learning. These models are ready to deploy or to fine-tune.

13
Q

AutoML is available in SageMaker Canvas. It

A

simplifies ML development by automating the process of building and deploying machine learning models.

14
Q

Built-in models available in SageMaker require more

A

effort, but they scale well if the dataset is large and significant resources are needed to train and deploy the model.

15
Q

If there is no built-in solution that works, try to develop one that uses

A

pre-made images for supported machine learning and deep learning frameworks such as scikit-learn, TensorFlow, PyTorch, MXNet, or Chainer.

16
Q

You can build your own custom Docker image that is configured to

A

install the necessary packages or software.

17
Q

Bias vs. variance

A

Think about bias as the gap between your predicted value and the actual value, whereas variance describes how dispersed your predicted values are.

18
Q

Classification Metrics

A

Accuracy
Precision
Recall
F1
AUC-ROC

19
Q

Regression Metrics

A

Mean squared error
R squared

20
Q

Accuracy

A

To calculate the model’s accuracy, also known as its score, add up the correct predictions and then divide that number by the total number of predictions.
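
A quick worked example in Python (illustrative labels only):

    # Accuracy = correct predictions / total predictions
    y_true = [1, 0, 1, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1]
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    print(correct / len(y_true))  # 4 correct out of 6, about 0.67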

21
Q

Problem with Accuracy

A

Although accuracy is a widely used metric for classification problems, it has limitations. This metric is less effective when there are a lot of true negative cases in your dataset. This is why two other metrics are often used in these situations: precision and recall.

22
Q

Precision

A

Precision removes the negative predictions from the picture. Precision is the proportion of positive predictions that are actually correct. You can calculate it by taking the true positive count and dividing it by the total number of positive predictions (true positives plus false positives).
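
A quick worked example (illustrative counts):

    # Precision = TP / (TP + FP)
    tp, fp = 8, 2
    print(tp / (tp + fp))  # 0.8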

23
Q

When the cost of false positives is high in your particular business situation,

A

precision can be a good metric. Think about a classification model that identifies emails as spam or not. In this case, you do not want your model labeling a legitimate email as spam and preventing your users from seeing that email.

24
Q

Recall

A

Recall (or sensitivity) is the proportion of actual positive cases that the model identifies as positive. Recall is calculated by dividing the true positive count by the sum of the true positives and false negatives. By looking at that ratio, you get an idea of how good the algorithm is at detecting, for example, cats.
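
A quick worked example (illustrative counts):

    # Recall = TP / (TP + FN)
    tp, fn = 8, 4
    print(tp / (tp + fn))  # about 0.67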

25
Q

When to use recall

A

Think about a model that needs to predict whether a patient has a terminal illness. In this case, using precision as your evaluation metric does not account for the false negatives in your model. It is vital to the success of the model that it not give false negative results. A false negative would be failing to identify a patient as having a terminal illness when the patient actually does. In this situation, recall is a better metric to use.

26
Q

Area under the curve

A

In general, AUC-ROC shows how the true positive rate compares to the false positive rate at various classification thresholds.
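
For instance, scikit-learn computes this directly from labels and predicted scores (the values below are illustrative):

    from sklearn.metrics import roc_auc_score

    # Scores are the model's predicted probabilities for the positive class.
    y_true = [0, 0, 1, 1]
    y_scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc_score(y_true, y_scores))  # 0.75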

27
Q

Regression metric Mean squared error

A

The general purpose of mean squared error (MSE) is the same as that of the classification metrics: you determine the prediction from the model, take the difference between the prediction and the actual outcome, square it, and average the squared differences across the dataset. The smaller the MSE, the better the model’s predictive accuracy.
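
A quick worked example (illustrative values):

    # MSE = average of squared (actual - predicted) differences
    y_true = [3.0, 5.0, 2.5]
    y_pred = [2.5, 5.0, 3.5]
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    print(mse)  # (0.25 + 0.0 + 1.0) / 3, about 0.417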

28
Q

Regression metric r squared

A

R squared explains the fraction of variance accounted for by the model. It’s like a percentage, reporting a number from 0 to 1. When R squared is close to 1, it usually indicates that a lot of the variance in the data can be explained by the model itself.
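
Continuing the MSE example above, R squared can be computed as follows:

    # R squared = 1 - (residual sum of squares / total sum of squares)
    y_true = [3.0, 5.0, 2.5]
    y_pred = [2.5, 5.0, 3.5]
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    print(1 - ss_res / ss_tot)  # about 0.64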

29
Q

Two model deployment options

A

Self-hosted API
Managed API (SageMaker is an example)

30
Q

SageMaker Deployment provides

A

SageMaker provides the following:

Deployment with one click or a single API call
Automatic scaling
Model hosting services
HTTPS endpoints that can host multiple models
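
As a sketch of the single-call deployment, assuming model is an already-created SageMaker Model (instance settings and payload are illustrative):

    # Create a real-time HTTPS endpoint with one API call.
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",
    )
    result = predictor.predict(payload)  # payload: your serialized input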

31
Q

SageMaker Asynchronous Deploy

A

SageMaker asynchronous inference is a capability in SageMaker that queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1 GB), long processing times (up to one hour), and near real-time latency requirements.
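
A hedged sketch of deploying to an asynchronous endpoint; the S3 output path is a placeholder, and model is assumed to already exist.

    from sagemaker.async_inference import AsyncInferenceConfig

    # Queue requests and write results to S3 when processing finishes.
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",
        async_inference_config=AsyncInferenceConfig(
            output_path="s3://my-bucket/async-results/",  # placeholder
        ),
    )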

32
Q

SageMaker Serverless Deploy

A

On-demand serverless inference is ideal for workloads that have idle periods between traffic spurts and can tolerate cold starts. It is a purpose-built inference option that you can use to deploy and scale ML models without configuring or managing any of the underlying infrastructure.
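
A hedged sketch of a serverless deployment, again assuming model already exists; the memory and concurrency settings are illustrative.

    from sagemaker.serverless import ServerlessInferenceConfig

    # No instances to manage; capacity scales with traffic and can idle.
    predictor = model.deploy(
        serverless_inference_config=ServerlessInferenceConfig(
            memory_size_in_mb=2048,  # illustrative memory setting
            max_concurrency=5,       # illustrative concurrency cap
        ),
    )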

33
Q

Four key principles of MLOps

A

Version Control
Automation
CI/CD
Model Governance

34
Q

MLOps: SageMaker training jobs

A

SageMaker provides a training job feature to train models using built-in algorithms or custom algorithms.
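
A minimal sketch of launching a training job with the SageMaker Python SDK; the image URI, role, and S3 paths are placeholders.

    from sagemaker.estimator import Estimator

    # Train a built-in algorithm container on data in S3.
    estimator = Estimator(
        image_uri=algorithm_image_uri,  # placeholder: built-in algorithm image
        role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://my-bucket/model-artifacts/",  # placeholder
    )
    estimator.fit({"train": "s3://my-bucket/train/"})  # placeholder channel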

35
Q

MLOps: SageMaker Experiments

A

Use SageMaker Experiments to experiment with multiple combinations of data, algorithms, and parameters, all while observing the impact of incremental changes on model accuracy.

36
Q

MLOps: SageMaker Processing Job

A

SageMaker Processing refers to the capability to run data pre-processing and post-processing, feature engineering, and model evaluation tasks on fully managed SageMaker infrastructure.
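
A hedged sketch using the SDK's SKLearnProcessor; the role, script name, and S3 paths are placeholders.

    from sagemaker.processing import ProcessingInput, ProcessingOutput
    from sagemaker.sklearn.processing import SKLearnProcessor

    # Run a preprocessing script on managed infrastructure.
    processor = SKLearnProcessor(
        framework_version="1.2-1",
        role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
        instance_count=1,
        instance_type="ml.m5.xlarge",
    )
    processor.run(
        code="preprocess.py",  # placeholder script
        inputs=[ProcessingInput(
            source="s3://my-bucket/raw/",            # placeholder
            destination="/opt/ml/processing/input",
        )],
        outputs=[ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://my-bucket/processed/",  # placeholder
        )],
    )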

37
Q

MLOps: SageMaker Model Registry

A

With SageMaker Model Registry, you can catalog models, manage model versions, manage the approval status of a model, or deploy models to production.
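
A hedged sketch of registering a model version, assuming model is an existing SageMaker Model; the group name, types, and status are illustrative.

    # Add a model version to a model package group for governance.
    model_package = model.register(
        model_package_group_name="churn-models",  # illustrative group
        content_types=["text/csv"],
        response_types=["text/csv"],
        approval_status="PendingManualApproval",
    )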

38
Q

bias

A

(the gap between your predicted value and the actual value)

39
Q

variance

A

(how dispersed your predicted values are)