ML Practice Test #1 Flashcards

1
Q

Amazon SageMaker Feature Store

A

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference. For example, in an application that recommends a music playlist, features could include song ratings, listening duration, and listener demographics.

You can ingest data into SageMaker Feature Store from a variety of sources, such as application and service logs, clickstreams, sensors, and tabular data from Amazon Simple Storage Service (Amazon S3), Amazon Redshift, AWS Lake Formation, Snowflake, and Databricks Delta Lake.

How Feature Store works: via - https://aws.amazon.com/sagemaker/feature-store/

2
Q

Amazon SageMaker Clarify

A

SageMaker Clarify helps identify potential bias during data preparation without writing code. You specify input features, such as gender or age, and SageMaker Clarify runs an analysis job to detect potential bias in those features.

3
Q

Amazon SageMaker Data Wrangler

A

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface.

4
Q

Amazon SageMaker Ground Truth

A

Amazon SageMaker Ground Truth offers the most comprehensive set of human-in-the-loop capabilities, allowing you to harness the power of human feedback across the ML lifecycle to improve the accuracy and relevancy of models. You can complete a variety of human-in-the-loop tasks with SageMaker Ground Truth, from data generation and annotation to model review, customization, and evaluation, either through a self-service or an AWS-managed offering.

5
Q

Data Science Strategies - Data augmentation

A

data augmentation to increase the diversity of the training data

Data augmentation is a machine learning technique that changes the sample data slightly every time the model processes it. You can do this by changing the input data in small ways. When done in moderation, data augmentation makes the training sets appear unique to the model and prevents the model from learning their characteristics. For example, applying transformations such as translation, flipping, and rotation to input images.
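
For illustration only (not part of the original card), a minimal sketch of image augmentation using torchvision (an assumed library choice; the transform values are arbitrary):

```python
# Minimal augmentation sketch: each time an image is sampled, a slightly
# different variant is produced, discouraging memorization of individual samples.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                     # flipping
    transforms.RandomRotation(degrees=15),                      # rotation
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # translation
    transforms.ToTensor(),
])
# Used inside a Dataset/DataLoader, the same source image yields a new variant
# on every epoch, effectively increasing the diversity of the training data.
```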

6
Q

Data Science Strategies - Early stopping

A

early stopping to prevent overfitting

Early stopping pauses the training phase before the machine learning model learns the noise in the data. However, getting the timing right is important; if training stops too early, the model will still not give accurate results.
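
For illustration only, a hedged sketch of early stopping using XGBoost's native training API (the dataset variables X_train, y_train, X_val, y_val are assumed to exist):

```python
import xgboost as xgb

# Hold out a validation set so training can stop once the metric stops improving.
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

booster = xgb.train(
    params={"objective": "binary:logistic", "eval_metric": "auc"},
    dtrain=dtrain,
    num_boost_round=1000,
    evals=[(dval, "validation")],
    early_stopping_rounds=20,   # stop if validation AUC fails to improve for 20 rounds
)
print("best iteration:", booster.best_iteration)
```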

7
Q

Data Science Strategies - Ensembling

A

use ensembling to average predictions from multiple models

Ensembling combines predictions from several separate machine learning algorithms. Some models are called weak learners because their results are often inaccurate. Ensemble methods combine all the weak learners to get more accurate results. They use multiple models to analyze sample data and pick the most accurate outcomes. The two main ensemble methods are bagging and boosting. Boosting trains different machine learning models one after another to get the final result, while bagging trains them in parallel.
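
For illustration only, a minimal scikit-learn sketch (an assumed library choice; X and y are placeholder training data) showing bagging and boosting side by side:

```python
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# Bagging: many decision trees trained in parallel on bootstrap samples.
bagging = BaggingClassifier(n_estimators=50)

# Boosting: trees trained sequentially, each correcting the previous ones' errors.
boosting = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1)

bagging.fit(X, y)
boosting.fit(X, y)
```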

8
Q

Data Science Strategies - Pruning

A

You might identify several features or parameters that impact the final prediction when you build a model. Feature selection—or pruning—identifies the most important features within the training set and eliminates irrelevant ones. For example, to predict if an image is an animal or human, you can look at various input parameters like face shape, ear position, body structure, etc. You may prioritize face shape and ignore the shape of the eyes.
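
For illustration only, a hedged scikit-learn sketch of feature selection (X, y, and k=10 are placeholders):

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Keep only the k features with the strongest statistical relationship to the label.
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(selector.get_support())   # boolean mask of the retained ("pruned-in") features
```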

9
Q

Data Science Strategies - Regularization

A

Regularization is a collection of training/optimization techniques that seek to reduce overfitting. These methods try to eliminate those factors that do not impact the prediction outcomes by grading features based on importance. For example, mathematical calculations apply a penalty value to features with minimal impact. Consider a statistical model attempting to predict the housing prices of a city in 20 years. Regularization would give a lower penalty value to features like population growth and average annual income but a higher penalty value to the average annual temperature of the city.
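
For illustration only, a minimal scikit-learn sketch of L1/L2 regularization (the alpha values and the X_train/y_train variables are placeholders):

```python
from sklearn.linear_model import Lasso, Ridge

ridge = Ridge(alpha=1.0)   # L2 penalty: shrinks low-impact coefficients toward zero
lasso = Lasso(alpha=0.1)   # L1 penalty: can drive irrelevant coefficients exactly to zero

ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)
print(lasso.coef_)          # zeroed coefficients mark features the model effectively ignores
```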

10
Q

AWS CloudFormation

A

Use AWS CloudFormation with nested stacks to automate the provisioning of SageMaker, EC2, and RDS resources, and configure outputs from one stack as inputs to another to enable communication between them

AWS CloudFormation with nested stacks allows you to modularize your infrastructure, making it easier to manage and reuse components. By passing outputs from one stack as inputs to another, you can automate the provisioning of resources while ensuring that all stacks can communicate effectively. This approach also enables consistent and scalable deployments across environments.

11
Q

AWS CDK (Cloud Development Kit)

A

AWS CDK allows you to define infrastructure using high-level programming languages, which is flexible and powerful. However, failing to configure inter-stack communication would lead to a disjointed deployment, where services may not function together as required.

12
Q

AWS Elastic Beanstalk

A

AWS Elastic Beanstalk is a managed service for deploying applications, but it is not designed for orchestrating complex ML workflows with multiple resource types like SageMaker, EC2, and RDS. It also lacks fine-grained control over resource provisioning and inter-stack communication.

13
Q

Boosting

A

Boosting is a method used in machine learning to reduce errors in predictive data analysis. Data scientists train machine learning software, called machine learning models, on labeled data to make guesses about unlabeled data. A single machine learning model might make prediction errors depending on the accuracy of the training dataset.

https://aws.amazon.com/what-is/boosting/

14
Q

Boosting - Extreme Gradient Boosting (XGBoost)

A

Apply Extreme Gradient Boosting (XGBoost) for its ability to handle imbalanced datasets effectively through regularization, weighted classes, and optimized computational efficiency

XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that tries to accurately predict a target variable by combining multiple estimates from a set of simpler models. The XGBoost algorithm performs well in machine learning competitions for the following reasons:

Its robust handling of a variety of data types, relationships, and distributions.

The variety of hyperparameters that you can fine-tune.

XGBoost is an extension of Gradient Boosting that includes additional features such as regularization, handling of missing values, and support for weighted classes, making it particularly well-suited for imbalanced datasets like fraud detection. It also offers significant computational efficiency, which is beneficial when working with large datasets.

via - https://aws.amazon.com/what-is/boosting/

https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html

XGBoost is known for its ability to deliver high performance with relatively efficient training times, especially with techniques like early stopping and hyperparameter tuning. This approach balances the need for accuracy with reduced computational cost and training time, making it an ideal choice for this scenario.

XGBoost is a powerful gradient boosting algorithm that excels in structured data problems, such as fraud detection. It allows for custom objective functions, making it highly suitable for optimizing precision and recall, which are critical in imbalanced datasets. Additionally, XGBoost has built-in techniques for handling class imbalance, such as scale_pos_weight.
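
For illustration only, a hedged sketch of weighting the rare positive class with scale_pos_weight (y_train is assumed to be a 0/1 label array; all values are placeholders):

```python
import numpy as np
from xgboost import XGBClassifier

neg, pos = np.bincount(y_train)          # counts of the negative and positive class

model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    reg_lambda=1.0,                      # L2 regularization
    scale_pos_weight=neg / pos,          # up-weight the rare (fraud) class
    eval_metric="aucpr",                 # precision-recall AUC suits imbalanced data
)
model.fit(X_train, y_train)
```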

15
Q

Boosting - Adaptive Boosting (AdaBoost)

A

AdaBoost works by focusing on correcting the errors of weak classifiers, assigning more weight to misclassified instances in each iteration. However, it may struggle with noisy data and extreme class imbalance, as it can overemphasize hard-to-classify instances.

16
Q

Boosting - Gradient Boosting

A

Gradient Boosting is a powerful technique that uses the gradient of the loss function to improve the model iteratively. While it can be adapted to handle class imbalance, it does not inherently provide the same level of flexibility and computational optimization as XGBoost for this specific problem.

17
Q

Amazon SageMaker JumpStart

A

Use SageMaker JumpStart to deploy a pre-trained NLP model and use the built-in fine-tuning functionality with your custom dataset to create a customized sentiment analysis model

Amazon Bedrock is the easiest way to build and scale generative AI applications with foundation models. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Amazon SageMaker JumpStart is a machine learning (ML) hub that can help you accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select FMs quickly based on pre-defined quality and responsibility metrics to perform tasks like article summarization and image generation. SageMaker JumpStart provides managed infrastructure and tools to accelerate scalable, reliable, and secure model building, training, and deployment of ML models.

Fine-tuning trains a pretrained model on a new dataset without training from scratch. This process, also known as transfer learning, can produce accurate models with smaller datasets and less training time.

SageMaker JumpStart is specifically designed for scenarios like this, where you can quickly deploy a pre-trained model and fine-tune it using your custom dataset. This approach allows you to leverage existing NLP models, reducing both development time and computational resources needed for training from scratch.

via - https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-fine-tune.html
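
For illustration only, a hedged sketch of JumpStart fine-tuning with the SageMaker Python SDK; the model_id, channel name, role ARN, and S3 path are placeholders, not values from this card:

```python
from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="huggingface-tc-example-model",   # hypothetical text-classification model id
    role="arn:aws:iam::123456789012:role/SageMakerRole",
)
estimator.fit({"training": "s3://my-bucket/sentiment-dataset/"})   # fine-tune on custom data
predictor = estimator.deploy()                                     # host the fine-tuned model
```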

18
Q

confusion matrix

A

The confusion matrix illustrates in a table the number or percentage of correct and incorrect predictions for each class by comparing an observation’s predicted class and its true class. The confusion matrix is crucial for understanding the detailed performance of your model, especially in an imbalanced dataset. It allows you to calculate additional metrics such as precision, recall, and F1 score, which are essential for understanding how well your model handles false positives and false negatives.
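
For illustration only, a minimal scikit-learn sketch that derives the related metrics from a binary confusion matrix (y_true and y_pred are placeholder arrays):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1:       ", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```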

19
Q

Amazon SageMaker Autopilot

A

Automatically create machine learning models with full visibility. Autopilot is now in SageMaker Canvas with integrated data preparation, multi-modality support, built-in visualizations and what-if analysis, and automation support for predictions.

Amazon SageMaker Autopilot is a feature set that simplifies and accelerates various stages of the machine learning workflow by automating the process of building and deploying machine learning models (AutoML).

20
Q

accuracy

A

While accuracy is a common metric, it is not suitable for imbalanced datasets because it can be misleading. A model predicting the majority class most of the time can achieve high accuracy without effectively capturing the minority class (e.g., customers who make a purchase).

21
Q

Precision and recall

A

Precision and recall are particularly important in an imbalanced dataset. Precision measures the proportion of true positive predictions among all positive predictions, while recall measures the proportion of actual positives that are correctly identified. Focusing on these metrics helps in assessing how well the model avoids false positives and false negatives, which is critical in your scenario.

22
Q

RMSE

A

Root mean squared error

RMSE is a regression metric, not suitable for classification problems. In this scenario, you are dealing with a classification task, so metrics like precision, recall, and F1 score are more appropriate.

23
Q

AUC

A

The area under the curve (AUC) metric is used to compare and evaluate binary classification by algorithms that return probabilities, such as logistic regression. To map the probabilities into classifications, these are compared against a threshold value.

The relevant curve is the receiver operating characteristic curve. The curve plots the true positive rate (TPR) of predictions (or recall) against the false positive rate (FPR) as a function of the threshold value, above which a prediction is considered positive. Increasing the threshold results in fewer false positives, but more false negatives.

AUC is the area under this receiver operating characteristic curve. Therefore, AUC provides an aggregated measure of the model performance across all possible classification thresholds. AUC scores vary between 0 and 1. A score of 1 indicates perfect accuracy, and a score of one half (0.5) indicates that the prediction is not better than a random classifier.
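
For illustration only, a minimal scikit-learn sketch (y_true and y_scores are placeholder arrays of labels and positive-class probabilities):

```python
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # TPR vs. FPR at each threshold
auc = roc_auc_score(y_true, y_scores)
print(f"AUC = {auc:.3f}")   # ~1.0 is excellent; ~0.5 is no better than random guessing
```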

24
Q

Deep Neural Network

A

A deep neural network may provide high accuracy but typically requires significant computational resources and longer training times, leading to higher costs. This approach may not be feasible within a limited budget, especially with the need for frequent retraining.

25
Q

Logistic regression

A

Logistic regression is simple and cost-effective but may not achieve the level of accuracy required for a critical application like fraud detection. This tradeoff might be too significant if accuracy is compromised.

26
Q

SVM

A

support vector machine

SVMs with nonlinear kernels can be very accurate but are computationally intensive, particularly with large datasets. The increased training time and cost might outweigh the benefits, especially when there are more cost-effective alternatives like XGBoost.

Reference:

https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

27
Q

Data drift

A

Data drift occurs when the distribution of the input data changes over time, which can lead to the model receiving data that is different from what it was trained on.

To address data drift, you should use SageMaker Model Monitor to track changes in input data distribution
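
For illustration only, a hedged sketch of what setting this up might look like with the SageMaker Python SDK; the role ARN, bucket paths, endpoint name, and schedule are placeholders:

```python
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Baseline statistics and constraints computed from the training data distribution.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline",
)

# Hourly job that compares captured endpoint traffic against the baseline.
monitor.create_monitoring_schedule(
    endpoint_input="my-endpoint",
    output_s3_uri="s3://my-bucket/monitor/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression="cron(0 * ? * * *)",
)
```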

28
Q

Model drift

A

Model drift happens when the model’s underlying assumptions or parameters become outdated

Model drift occurs when the model’s performance degrades because its assumptions or parameters no longer align with the real-world data.

For model drift, you should periodically retrain the model using the latest data

29
Q

A transparent and explainable machine learning model benefits?

A

They facilitate easier debugging and optimization

Transparent models allow developers to understand how inputs are transformed into outputs, making it easier to identify and correct errors or inefficiencies in the model. This capability is crucial for optimizing the model’s performance and ensuring it behaves as expected.

They foster trust and confidence in model predictions

When stakeholders can understand the decision-making process of a model, it builds trust in its predictions. Transparency is key in high-stakes scenarios, such as healthcare or finance, where understanding the rationale behind predictions is critical for acceptance and trust.

30
Q

Opaque models

A

Opaque models, not transparent ones, are typically associated with enhanced security through obscurity.

31
Q

Amazon SageMaker Model Registry

A

Leverage the SageMaker Model Registry to register, track, and manage different versions of models, capturing all relevant metadata, including data sources, hyperparameters, and training code

The SageMaker Model Registry is specifically designed for managing model versions in a systematic and organized manner. It allows you to register different versions of a model, track metadata such as data sources, hyperparameters, and training code, and ensure that each version is easily reproducible. This approach is ideal for regulatory environments where audit trails and model governance are critical.

With the Amazon SageMaker Model Registry you can do the following:

Catalog models for production.

Manage model versions.

Associate metadata, such as training metrics, with a model.

View information from Amazon SageMaker Model Cards in your registered models.

Manage the approval status of a model.

Deploy models to production.

Automate model deployment with CI/CD.

Share models with other users.

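For illustration only, a hedged sketch of registering a model version with the SageMaker Python SDK; the group name, image URI, role ARN, and S3 path are placeholders:

```python
from sagemaker.model import Model

model = Model(
    image_uri="<inference-image-uri>",
    model_data="s3://my-bucket/model/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
)

model_package = model.register(
    model_package_group_name="fraud-detection-models",
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    approval_status="PendingManualApproval",   # gate deployment behind an approval step
)
```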

32
Q

Create a Docker container with the required environment, push the container image to Amazon ECR (Elastic Container Registry), and use SageMaker’s Script Mode to execute the training script within the container

A

Script mode enables you to write custom training and inference code while still utilizing common ML framework containers maintained by AWS.

SageMaker supports most of the popular ML frameworks through pre-built containers, and has taken the extra step to optimize them to work especially well on AWS compute and network infrastructure in order to achieve near-linear scaling efficiency. These pre-built containers also provide some additional Python packages, such as Pandas and NumPy, so you can write your own code for training an algorithm. These frameworks also allow you to install any Python package hosted on PyPi by including a requirements.txt file with your training code or to include your own code directories.

This is the correct approach for using the BYOC strategy with SageMaker. You build a Docker container that includes the required TensorFlow version and custom dependencies, then push the image to Amazon ECR. SageMaker can reference this image to create training jobs and deploy endpoints. By using Script Mode, you can execute your custom training script within the container, ensuring compatibility with your specific environment.

via - https://aws.amazon.com/blogs/machine-learning/bring-your-own-model-with-amazon-sagemaker-script-mode/
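
For illustration only, a hedged sketch of a training job that uses a custom ECR image together with Script Mode; the image URI, script, role ARN, and paths are placeholders, and the image is assumed to include the SageMaker Training Toolkit so that entry_point can be executed:

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-tf-custom:latest",
    entry_point="train.py",       # Script Mode training script
    source_dir="src",             # local code directory packaged with the job
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
)
estimator.fit({"train": "s3://my-bucket/training-data/"})
```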

33
Q

Use Amazon SageMaker’s built-in versioning to manage different versions of the model, and deploy the new version in a canary release by redirecting a small percentage of traffic to it initially

A

Amazon SageMaker supports model versioning, which is crucial for tracking different iterations of your model. A canary release allows you to deploy the new model version to a small portion of users, minimizing risk by limiting exposure in case of issues. If the new version performs well, you can gradually increase traffic to it.

34
Q

Utilize Amazon SageMaker’s blue/green deployment strategy to shift traffic gradually from the old model to the new one, ensuring that you can monitor performance and quickly revert if needed

A

A blue/green deployment strategy is a best practice in model deployment. It allows you to deploy the new model version in parallel with the existing one, gradually shifting traffic to the new version while monitoring its performance. If issues are detected, you can quickly roll back to the previous version without disrupting service.

In a blue/green deployment, SageMaker provisions a new fleet with the updates (the green fleet). Then, SageMaker shifts traffic from the old fleet (the blue fleet) to the green fleet. Once the green fleet operates smoothly for a set evaluation period (called the baking period), SageMaker terminates the blue fleet. You can specify Amazon CloudWatch alarms that SageMaker uses to monitor the green fleet. If an issue with the updated code trips any of the alarms, SageMaker initiates an auto-rollback to the blue fleet in order to maintain availability thereby minimizing risk.

via - https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-guardrails-blue-green.html
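
For illustration only, a hedged boto3 sketch of a blue/green update with canary traffic shifting and alarm-based auto-rollback; the endpoint, config, and alarm names are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")
sm.update_endpoint(
    EndpointName="recommender-endpoint",
    EndpointConfigName="recommender-config-v2",   # the new ("green") fleet configuration
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                "WaitIntervalInSeconds": 600,     # baking period before shifting the rest
            },
            "TerminationWaitInSeconds": 600,      # keep the blue fleet around for rollback
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "Recommender-5xx-Alarm"}]
        },
    },
)
```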

35
Q

Amazon SageMaker Data Wrangler

A

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface. You can use SQL to select the data that you want from various data sources and import it quickly. Next, you can use the data quality and insights report to automatically verify data quality and detect anomalies, such as duplicate rows and target leakage. SageMaker Data Wrangler contains over 300 built-in data transformations, so you can quickly transform data without writing code.

With the SageMaker Data Wrangler data selection tool, you can quickly access and select your tabular and image data from various popular sources - such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, AWS Lake Formation, Snowflake, and Databricks - and over 50 other third-party sources - such as Salesforce, SAP, Facebook Ads, and Google Analytics. You can also write queries for data sources using SQL and import data directly into SageMaker from various file formats, such as CSV, Parquet, JSON, and database tables.

How Data Wrangler works: via - https://aws.amazon.com/sagemaker/data-wrangler/

36
Q

SageMaker Model Dashboard

A

Amazon SageMaker Model Dashboard is a centralized portal, accessible from the SageMaker console, where you can view, search, and explore all of the models in your account. You can track which models are deployed for inference and if they are used in batch transform jobs or hosted on endpoints.

37
Q

Amazon SageMaker Clarify

A

SageMaker Clarify helps identify potential bias during data preparation without writing code. You specify input features, such as gender or age, and SageMaker Clarify runs an analysis job to detect potential bias in those features.

38
Q

Amazon SageMaker Feature Store

A

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference.

39
Q

Implement continuous monitoring of input data features and model predictions using statistical tests to detect shifts in data distribution or performance, triggering an alert when drift is detected

Retrain the model on the most recent data to ensure it captures current trends, and use model versioning to track performance improvements over time

A

For a model to predict accurately, the data that it is making predictions on must have a similar distribution as the data on which the model was trained. Because data distributions can be expected to drift over time, deploying a model is not a one-time exercise but rather a continuous process. It is a good practice to continuously monitor the incoming data and retrain your model on newer data if you find that the data distribution has deviated significantly from the original training data distribution. If monitoring data to detect a change in the data distribution has a high overhead, then a simpler strategy is to retrain the model periodically, for example, daily, weekly, or monthly.

40
Q

An AUC close to 1.0 indicates that the model has excellent discriminatory power, effectively distinguishing between defaulters and non-defaulters

A

Area Under the (Receiver Operating Characteristic) Curve (AUC) represents an industry-standard accuracy metric for binary classification models. AUC measures the ability of the model to predict a higher score for positive examples as compared to negative examples. Because it is independent of the score cut-off, you can get a sense of the prediction accuracy of your model from the AUC metric without picking a threshold.

The AUC metric returns a decimal value from 0 to 1. AUC values near 1 indicate an ML model that is highly accurate. Values near 0.5 indicate an ML model that is no better than guessing at random. Values near 0 are unusual to see, and typically indicate a problem with the data. Essentially, an AUC near 0 says that the ML model has learned the correct patterns, but is using them to make predictions that are flipped from reality (‘0’s are predicted as ‘1’s and vice versa). The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at each threshold setting.

via - https://aws.amazon.com/blogs/machine-learning/is-your-model-good-a-deep-dive-into-amazon-sagemaker-canvas-advanced-metrics/

An AUC close to 1.0 signifies that the model has excellent discriminatory power, meaning it can effectively distinguish between the positive class (defaulters) and the negative class (non-defaulters) across all thresholds. This is desirable in a classification task, especially in scenarios with class imbalance.

via - https://docs.aws.amazon.com/machine-learning/latest/dg/binary-model-insights.html

41
Q

ROC curve

A

A ROC curve close to the diagonal line (AUC ~ 0.5) indicates that the model has no discriminatory power and is performing no better than random guessing. This suggests poor model performance, not that the model performs well across all thresholds.

42
Q

Conditional Demographic Disparity (CDD)

A

CDD evaluates the disparity in positive prediction rates across demographic groups, conditioned on a specific feature like income, to detect bias that may not be apparent when only considering overall outcomes

Conditional Demographic Disparity (CDD) measures the difference in positive prediction rates between demographic groups, while conditioning on relevant features like income. This allows you to identify subtle biases that might be masked when looking only at overall predictions, ensuring that the model’s decisions are fair across different groups given their specific circumstances.

via - https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-data-bias-metric-cddl.html

43
Q

Amazon SageMaker Pipelines

A

Implement the entire ML workflow using Amazon SageMaker Pipelines, which provides integrated orchestration for data processing, model training, tuning, and deployment

Amazon SageMaker Pipelines is a purpose-built workflow orchestration service to automate machine learning (ML) development. SageMaker Pipelines is specifically designed to orchestrate end-to-end ML workflows, integrating data processing, model training, hyperparameter tuning, and deployment in a seamless manner. It provides built-in versioning, lineage tracking, and support for continuous integration and delivery (CI/CD), making it the best choice for this use case.

via - https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html
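
For illustration only, a hedged sketch of a two-step pipeline; the processor and estimator objects (sklearn_processor, xgb_estimator) and the role variable are assumed to be defined elsewhere:

```python
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

process_step = ProcessingStep(
    name="PrepareData",
    processor=sklearn_processor,   # e.g., an SKLearnProcessor (assumed)
    code="preprocess.py",
)
train_step = TrainingStep(
    name="TrainModel",
    estimator=xgb_estimator,       # any SageMaker estimator (assumed)
    depends_on=[process_step],
)

pipeline = Pipeline(name="fraud-detection-pipeline", steps=[process_step, train_step])
pipeline.upsert(role_arn=role)     # create or update the pipeline definition
pipeline.start()                   # run the workflow
```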

44
Q

AWS Step Functions

A

AWS Step Functions is a powerful service for orchestrating workflows, and it can integrate with SageMaker and Lambda. However, using Step Functions for the entire ML workflow adds complexity since it requires coordinating multiple services, whereas SageMaker Pipelines provides a more seamless, integrated solution for ML-specific workflows.

45
Q

AWS Glue

A

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.

https://aws.amazon.com/glue/

46
Q

Amazon SageMaker Managed Spot Training

A

Use Amazon SageMaker Managed Spot Training to dynamically allocate Spot Instances for the training job, automatically retrying any interrupted instances via checkpoints

Managed Spot Training uses Amazon EC2 Spot Instances to run training jobs instead of On-Demand Instances. You can specify which training jobs use Spot Instances and a stopping condition that specifies how long SageMaker waits for a job to run on Spot capacity. Spot Instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints: SageMaker copies checkpoint data from a local path to Amazon S3, and when the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting.

via - https://aws.amazon.com/blogs/aws/managed-spot-training-save-up-to-90-on-your-amazon-sagemaker-training-jobs/
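
For illustration only, a hedged sketch of enabling Managed Spot Training with checkpointing on a generic estimator; the image URI, role ARN, time limits, and S3 paths are placeholders:

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,                           # run on Spot capacity
    max_run=3600,                                      # max training time (seconds)
    max_wait=7200,                                     # total wait for Spot capacity (>= max_run)
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",   # resume from here after interruptions
)
estimator.fit("s3://my-bucket/training-data/")
```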

47
Q

Amazon SageMaker endpoints

A

Deploy the real-time recommendation model using Amazon SageMaker endpoints to ensure low-latency, high-availability, and managed infrastructure for real-time inference

Real-time inference is ideal for inference workloads where you have real-time, interactive, low latency requirements. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support autoscaling.

This makes it an ideal choice for the recommendation model, which must provide fast responses to user interactions with minimal downtime.
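
For illustration only, a hedged sketch of deploying to a real-time endpoint with the SageMaker Python SDK; the image URI, model artifact, role ARN, endpoint name, and payload are placeholders:

```python
from sagemaker.model import Model

model = Model(
    image_uri="<inference-image-uri>",
    model_data="s3://my-bucket/model/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
)
predictor = model.deploy(
    initial_instance_count=2,
    instance_type="ml.c5.xlarge",
    endpoint_name="recommendation-endpoint",
)
result = predictor.predict(payload)   # synchronous, low-latency inference call
```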

48
Q

Amazon EKS

A

Deploy the generative AI model using Amazon Elastic Kubernetes Service (Amazon EKS) to leverage containerized microservices for high scalability and control over the deployment environment

Amazon EKS is designed for containerized applications that need high scalability and flexibility. It is suitable for the generative AI model, which may require complex orchestration and scaling in response to varying demand, while giving you full control over the deployment environment.

via - https://aws.amazon.com/blogs/containers/deploy-generative-ai-models-on-amazon-eks/

49
Q

AWS Lambda

A

While AWS Lambda is excellent for serverless applications, it may not be the best choice for a fraud detection model if it requires continuous, low-latency processing or needs to handle very high throughput. Lambda is better suited for lightweight, event-driven tasks rather than long-running, complex inference jobs.

50
Q

Amazon ECS

A

Amazon ECS is a good choice for containerized workloads but is generally more appropriate for batch processing or large-scale, stateless applications. It might not provide the low-latency and real-time capabilities needed for the recommendation model.

51
Q

Amazon SageMaker’s multi-model endpoint

A

Use Amazon SageMaker’s multi-model endpoint to deploy multiple models on a single instance, reducing costs by sharing resources

Amazon SageMaker’s multi-model endpoint allows you to deploy multiple models on a single instance. This can significantly reduce costs by sharing resources among models, but it may introduce slight increases in latency due to the need to load the correct model into memory. This tradeoff can be acceptable if cost savings are a priority and latency requirements are not ultra-strict.

via - https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html
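
For illustration only, a hedged sketch of a multi-model endpoint with the SageMaker Python SDK; the names, image URI, role ARN, S3 prefix, and payload are placeholders:

```python
from sagemaker.multidatamodel import MultiDataModel

mme = MultiDataModel(
    name="recommender-mme",
    model_data_prefix="s3://my-bucket/models/",   # all model.tar.gz artifacts live here
    image_uri="<inference-image-uri>",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
)
predictor = mme.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

# The target model is loaded into memory on demand, which is where the extra latency comes from.
prediction = predictor.predict(payload, target_model="model-a.tar.gz")
```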

52
Q

Auto-scaling

A

Implement auto-scaling on a fleet of medium-sized instances, allowing the system to adjust resources based on real-time demand, balancing cost and performance dynamically

Auto-scaling allows you to dynamically adjust the number of instances based on demand, which helps balance performance and cost. During peak times, more instances can be provisioned to maintain low latency, while during off-peak times, fewer instances are used, reducing costs. This strategy offers a flexible way to manage the tradeoffs between performance, cost, and latency.

https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html
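
For illustration only, a hedged boto3 sketch of target-tracking auto scaling for an endpoint variant; the endpoint and variant names, capacity limits, and target value are placeholders:

```python
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/recommender-endpoint/variant/AllTraffic"

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=8,
)
aas.put_scaling_policy(
    PolicyName="InvocationsTargetTracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,   # invocations per instance (illustrative target)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```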

53
Q

Amazon SageMaker Neo

A

While Amazon SageMaker Neo can optimize models for deployment on edge devices, it is not the best fit for this scenario. Neo is more suitable for low-latency, cost-effective deployments on devices with limited resources. In this scenario, the need for scalable, cloud-based infrastructure is more important.

54
Q

Amazon SageMaker Algorithm: Linear Learner algorithm

A

The Linear Learner algorithm can handle classification tasks, and weighting classes can help with imbalance. However, it may not be as effective in capturing complex patterns in the data as more sophisticated algorithms like XGBoost.

Linear Learner could be used for classification tasks, but predicting maintenance needs often involves detecting subtle anomalies rather than simple classification. Additionally, a binary classification model might not capture the complex patterns associated with potential failures.

55
Q

Amazon SageMaker Algorithm: Random Cut Forest (RCF) algorithm

A

Random Cut Forest (RCF) is designed for anomaly detection, which can be relevant for fraud detection. However, RCF is unsupervised and may not leverage the labeled data effectively, leading to suboptimal results in a supervised classification task like this.

Random Cut Forest (RCF) Algorithm to detect anomalies in sensor data that may indicate impending failures

Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the “regular” data. Including these anomalies in a data set can drastically increase the complexity of a machine learning task since the “regular” data can often be described with a simple model.

Random Cut Forest (RCF) is specifically designed for detecting anomalies in data. This algorithm excels at identifying unexpected patterns in sensor data that could indicate the early stages of equipment failure. It’s particularly well-suited for scenarios where you need to react to unusual behaviors in near-real-time.

Mapping use cases to built-in algorithms:

via - https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
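
For illustration only, a hedged sketch of training the built-in RCF algorithm with the SageMaker Python SDK; the role ARN, hyperparameter values, and the sensor_readings array are placeholders:

```python
from sagemaker import RandomCutForest

rcf = RandomCutForest(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_samples_per_tree=512,
    num_trees=50,
)
# sensor_readings is assumed to be a 2-D numpy array of numeric sensor features.
rcf.fit(rcf.record_set(sensor_readings))
```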

56
Q

Amazon SageMaker Algorithm: K-Nearest Neighbors (k-NN) algorithm

A

K-Nearest Neighbors (k-NN) can classify based on similarity, but it does not scale well with large datasets and may struggle with the high-dimensional, imbalanced nature of the data in this context.

57
Q

Exploratory data analysis (EDA)

A

Conduct exploratory data analysis (EDA) to understand the data distribution, address missing values, and assess the class imbalance before determining if an ML solution is feasible

Conducting exploratory data analysis (EDA) is the most appropriate first step. EDA allows you to understand the data distribution, identify and address missing values, and assess the extent of the class imbalance. This process helps determine whether the available data is sufficient to build a reliable model and what preprocessing steps might be necessary.

via - https://aws.amazon.com/blogs/machine-learning/exploratory-data-analysis-feature-engineering-and-operationalizing-your-data-flow-into-your-ml-pipeline-with-amazon-sagemaker-data-wrangler/
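
For illustration only, a minimal pandas sketch of a first EDA pass; the file path, DataFrame, and the "purchased" label column are assumed names:

```python
import pandas as pd

df = pd.read_csv("customers.csv")                      # placeholder path
print(df.describe(include="all"))                      # distributions and summary statistics
print(df.isnull().sum())                               # missing values per column
print(df["purchased"].value_counts(normalize=True))    # check the extent of class imbalance
```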

58
Q

Amazon SageMaker Algorithms

A

https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

59
Q

Amazon SageMaker Algorithms: DeepAR Algorithm

A

DeepAR is designed for forecasting future time series data, which could be useful for predicting future equipment behavior. However, it is not primarily used for anomaly detection, which is critical for identifying unusual patterns that precede failures.

60
Q

Amazon SageMaker Algorithms: Time Series K-Means Algorithm

A

Time Series K-Means can cluster similar time series patterns, but clustering alone does not provide the precision needed for real-time anomaly detection, which is crucial for predictive maintenance.

61
Q

Hyperparameter automated tuning method: Bayesian Optimization

A

Bayesian Optimization is a technique based on Bayes’ theorem, which describes the probability of an event given current knowledge. When applied to hyperparameter optimization, the algorithm builds a probabilistic model of how hyperparameter choices affect a specific metric and uses regression analysis to iteratively choose the most promising set of hyperparameters to evaluate next.

Bayesian Optimization is more efficient than Random Search for hyperparameter tuning, especially when dealing with complex models and large hyperparameter spaces. It learns from previous trials to predict the best set of hyperparameters, thus focusing the search more effectively. Narrowing the range of critical hyperparameters can further improve the chances of finding the optimal values, leading to better model convergence and performance.

https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html
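
For illustration only, a hedged sketch of Bayesian tuning with SageMaker automatic model tuning; the estimator, metric name, ranges, and S3 paths are placeholders:

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

tuner = HyperparameterTuner(
    estimator=xgb_estimator,                 # a SageMaker estimator defined elsewhere (assumed)
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),   # narrowed ranges for critical hyperparameters
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",                     # the default tuning strategy
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})
```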

62
Q

Hyperparameter automated tuning method: Random Search

A

Random Search selects groups of hyperparameters randomly on each iteration. It works well when a relatively small number of the hyperparameters primarily determine the model outcome.

63
Q

Hyperparameter automated tuning method: Grid Search

A

Grid Search works well, but it’s relatively tedious and computationally intensive, especially with large numbers of hyperparameters. It is less efficient than Bayesian Optimization for complex models. A wide range of hyperparameters without focus would result in more trials, but it is not guaranteed to find the best values, especially with a larger search space.

64
Q

Amazon SageMaker Serverless Inference

A

Use Amazon SageMaker Serverless Inference that minimizes costs during low-traffic periods while managing large infrequent spikes of requests efficiently

via - https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html

Serverless Inference is designed to automatically scale the compute resources based on incoming requests, making it highly efficient for handling varying levels of traffic. It is cost-effective because you only pay for the compute time used when requests are being processed. This makes it an excellent choice for scenarios where traffic is unpredictable, with periods of low or no traffic. It is ideal for workloads that have idle periods between traffic spikes and can tolerate cold starts.

via - https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html
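
For illustration only, a hedged sketch of a serverless endpoint; the memory and concurrency values are illustrative, and the model object is assumed to be defined elsewhere:

```python
from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,   # memory allocated per invocation
    max_concurrency=10,       # cap on concurrent invocations
)
predictor = model.deploy(serverless_inference_config=serverless_config)
```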

65
Q

Amazon SageMaker Asynchronous Inference

A

Asynchronous Inference is ideal for handling large and long-running inference requests that do not require an immediate response. However, it may not be as cost-effective for handling fluctuating traffic where immediate scaling and low-latency are priorities.

66
Q

Amazon SageMaker Real-time Inference

A

Real-time inference is ideal for inference workloads where you have real-time, interactive, low latency requirements.

67
Q

Amazon SageMaker Batch transform

A

To get predictions for an entire dataset, you can use Batch transform with Amazon SageMaker.
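
For illustration only, a hedged sketch of a Batch Transform job; the paths and instance type are placeholders, and the model object is assumed to exist:

```python
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-output/",
)
transformer.transform(
    data="s3://my-bucket/batch-input/records.csv",
    content_type="text/csv",
    split_type="Line",      # score the dataset record by record
)
transformer.wait()
```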

68
Q

Amazon SageMaker Pipelines

A

Use Amazon SageMaker Pipelines to orchestrate the entire ML workflow, leveraging its built-in integration with SageMaker features like training, tuning, and deployment

Amazon SageMaker Pipelines is a purpose-built workflow orchestration service to automate machine learning (ML) development.

SageMaker Pipelines is specifically designed for orchestrating ML workflows. It provides native integration with SageMaker features like model training, tuning, and deployment. It also supports versioning, lineage tracking, and automatic execution of workflows, making it the ideal choice for managing end-to-end ML workflows in AWS.

via - https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html

69
Q

Apache Airflow

A

Airflow is a powerful orchestration tool that allows you to define complex workflows using custom DAGs. However, it requires significant setup and maintenance, and while it can integrate with AWS services, it does not provide the seamless, built-in integration with SageMaker that SageMaker Pipelines offers.

Amazon Managed Workflows for Apache Airflow (Amazon MWAA):

via - https://aws.amazon.com/managed-workflows-for-apache-airflow/

70
Q

AWS Step Functions

A

AWS Step Functions is a serverless orchestration service that can integrate with SageMaker and other AWS services. However, it is more general-purpose and lacks some of the ML-specific features, such as model lineage tracking and hyperparameter tuning, that are built into SageMaker Pipelines.

71
Q

AWS Lambda functions

A

AWS Lambda is useful for triggering specific tasks, but manually managing each step of a complex ML workflow without a comprehensive orchestration tool is not scalable or maintainable. It does not provide the task dependency management, monitoring, and versioning required for an end-to-end ML workflow.

72
Q

Use on-demand instances for training, allowing the flexibility to scale resources as needed, and use provisioned resources with auto-scaling for inference to handle varying traffic while controlling costs

A

Using on-demand instances for training offers flexibility, allowing you to allocate resources only when needed, which is ideal for sporadic training jobs. For inference, provisioned resources with auto-scaling ensure that the system can handle varying traffic while controlling costs, as it can scale down during periods of low demand.

via - https://aws.amazon.com/ec2/pricing/

73
Q

Applying multiple layers of security measures including input validation, access controls, and continuous monitoring to address vulnerabilities

A

Architecting a defense-in-depth security approach involves implementing multiple layers of security to protect generative AI applications. This includes input validation to prevent malicious data inputs, strict access controls to limit who can interact with the AI models, and continuous monitoring to detect and respond to security incidents. These measures can help address common vulnerabilities and meet the best practices for securing generative AI applications on AWS.

74
Q

SageMaker input mode: Pipe

A

Select the Pipe input mode to stream the data directly from Amazon S3 to the training instances, allowing the model to start processing data immediately without requiring local storage for the entire dataset

In Pipe mode, data is pre-fetched from Amazon S3 at high concurrency and throughput and streamed into a named pipe, which is also known as a First-In-First-Out (FIFO) pipe because of its behavior. Each pipe may only be read by a single process.

Pipe input mode is designed for large datasets, allowing data to be streamed directly from Amazon S3 into the training instances. This minimizes disk usage and allows training to begin immediately as the data streams in, making it ideal for your scenario where high throughput and efficiency are critical.

via - https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html
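
For illustration only, a hedged sketch of requesting Pipe mode so data streams from S3 instead of being downloaded first; the image URI, role ARN, and paths are placeholders:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    input_mode="Pipe",
)
estimator.fit({"train": TrainingInput("s3://my-bucket/large-dataset/", input_mode="Pipe")})
```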

75
Q

SageMaker input mode: File

A

The File input mode downloads the entire dataset to the training instance before starting the training job.

76
Q

SageMaker input mode: FastFile

A

FastFile mode is useful for scenarios where you need rapid access to data with low latency, but it is best suited for workloads with many small files.

You should note that FastFile mode can be used only while accessing data from Amazon S3 and not with Amazon FSx for Lustre.

77
Q

Amazon SageMaker

A

Use Amazon SageMaker for both training and deployment, leverage auto-scaling endpoints for real-time inference, and apply SageMaker Pipelines for orchestrating end-to-end ML workflows, ensuring scalability and automation

Amazon SageMaker provides a managed service for both training and deployment, which simplifies the infrastructure and reduces operational overhead. Auto-scaling endpoints in SageMaker ensure the system can handle increasing demand without manual intervention. SageMaker Pipelines automates the entire ML workflow, enabling continuous integration and delivery (CI/CD) practices, making the infrastructure scalable, maintainable, and cost-effective.

78
Q

Amazon SageMaker Debugger

A

Use Amazon SageMaker Debugger to debug and improve model performance by addressing underlying problems such as overfitting, saturated activation functions, and vanishing gradients

A machine learning (ML) training job can have problems such as overfitting, saturated activation functions, and vanishing gradients, which can compromise model performance.

SageMaker Debugger provides tools to debug training jobs and resolve such problems to improve the performance of your model. Debugger also offers tools to send alerts when training anomalies are found, take actions against the problems, and identify the root cause of them by visualizing collected metrics and tensors.

SageMaker Debugger:

via - https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html
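
For illustration only, a hedged sketch of attaching built-in Debugger rules for the problems listed above; the image URI, role ARN, and data path are placeholders:

```python
from sagemaker.debugger import Rule, rule_configs
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    rules=[
        Rule.sagemaker(rule_configs.overfit()),
        Rule.sagemaker(rule_configs.vanishing_gradient()),
        Rule.sagemaker(rule_configs.saturated_activation()),
    ],
)
estimator.fit("s3://my-bucket/training-data/")   # rule evaluation jobs run alongside training
```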

79
Q

Amazon SageMaker Clarify

A

Focus on feature engineering by creating domain-specific features and use SageMaker Clarify to evaluate feature importance

Feature engineering is one of the most effective ways to boost model performance, particularly in domain-specific applications like credit risk modeling. By creating more informative features, you can provide the model with better signals for prediction. SageMaker Clarify can be used to evaluate feature importance, helping you identify the most impactful features and further refine the model.

via - https://aws.amazon.com/sagemaker/clarify/
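
For illustration only, a hedged sketch of computing SHAP-based feature importance with SageMaker Clarify; the session, role ARN, model name, label column, and S3 paths are placeholders:

```python
from sagemaker import clarify

processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,            # a sagemaker.Session assumed to exist
)
data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/credit-features.csv",
    s3_output_path="s3://my-bucket/clarify-output/",
    label="default",
    dataset_type="text/csv",
)
model_config = clarify.ModelConfig(
    model_name="credit-risk-model",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
shap_config = clarify.SHAPConfig(num_samples=100, agg_method="mean_abs")

processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,   # feature importance report written to s3_output_path
)
```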