Domain 4: ML Implementation and Operations Flashcards
_____is a service that provides a record of actions taken by a user, role, or an AWS service. It simplifies compliance audits, security analysis, and operational troubleshooting by enabling event history, which allows you to view, search, and download recent AWS account activity.
AWS CloudTrail
Event history: View the most recent account activity across your AWS infrastructure and troubleshoot operational issues.
CloudTrail Insights: Automatic detection of unusual activity in your account.
Data events: Record API calls made to specific AWS services such as Amazon S3 object-level APIs or AWS Lambda function execution APIs.
Management events: Record API calls that manage the AWS resources.
Key features of CloudTrail
_____ monitors API calls made to your AWS account and delivers log files to an Amazon S3 bucket that you specify. These logs include details such as the identity of the API caller, the time of the API call, the source IP address, and the request parameters.
CloudTrail
_____ is a monitoring and observability service for AWS cloud resources and the applications you run on AWS. It can monitor AWS resources, such as EC2 instances, Amazon DynamoDB tables, and Lambda functions, and you can collect and access all your performance and operational data in the form of logs and metrics from a single platform.
Amazon CloudWatch
Metrics: Collect and store key metrics, which are variables you can measure for your resources and applications.
Logs: Collect, monitor, and analyze log files from different AWS services.
Alarms: Watch for specific metrics and automatically react to changes.
Events: Respond to state changes in your AWS resources with EventBridge.
Key features of CloudWatch
This service allows you to set alarms and automatically react to changes in your AWS resources, and it also integrates with Amazon SNS to notify you when certain thresholds are breached.
CloudWatch
Enable
Choose events
Specify S3 bucket
Turn on insights
How to get started w/ CloudTrail monitoring
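The steps above map onto the parameters the CloudTrail CreateTrail API expects. A minimal sketch, assuming illustrative names for the trail and the (pre-existing) S3 bucket:

```python
# Sketch of CreateTrail parameters for the steps above; the trail name
# and bucket name are made-up examples.
trail_params = {
    "Name": "ml-audit-trail",
    "S3BucketName": "my-cloudtrail-logs-bucket",  # bucket must already exist
    "IsMultiRegionTrail": True,                   # capture activity in all Regions
    "EnableLogFileValidation": True,              # detect tampering with log files
}

# With credentials configured, the trail would be created and started with:
#   import boto3
#   ct = boto3.client("cloudtrail")
#   ct.create_trail(**trail_params)
#   ct.start_logging(Name=trail_params["Name"])
```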
Set up metrics
Create alarms
Configure logging
Design dashboard
How to implement monitoring solutions with CloudWatch
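The "create alarms" step above can be sketched as a CloudWatch alarm definition. The alarm name, threshold, and SNS topic ARN below are illustrative:

```python
# Sketch of a CloudWatch alarm that notifies an SNS topic when average
# CPU on ML training instances stays high; all names are placeholders.
alarm = {
    "AlarmName": "high-cpu-ml-training",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Period": 300,                        # evaluate in 5-minute windows
    "EvaluationPeriods": 2,               # two consecutive breaches required
    "Threshold": 80.0,                    # percent CPU
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
}
# With boto3: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```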
To effectively monitor for errors and anomalies within your machine learning environment, you could set up a combination of _____ and _____.
CloudTrail and CloudWatch
By deploying applications across multiple Availability Zones, you can protect your applications from the failure of a single location.
High Availability
Multi-Region deployments can provide a backup in case of a regional service disruption.
Fault Tolerance
Different regions can serve users from geographically closer endpoints, reducing latency and improving the user experience.
Scalability
For machine learning applications, having data processing and storage close to the data sources can reduce transfer times and comply with data sovereignty laws.
Data Locality
One or more discrete data centers within a Region with redundant power, networking, and connectivity, physically separated from any other by a meaningful distance (many kilometers).
Availability Zone
You can deploy machine learning models using Amazon EC2 instances configured with _____, which can launch instances across multiple Availability Zones to ensure your application can withstand the loss of an AZ.
Auto Scaling
For databases backing machine learning applications, a Multi-AZ deployment with _____ can provide high availability and automatic failover support.
Amazon RDS
Deploying applications _____ can protect against regional outages and provide geographic redundancy.
across multiple AWS Regions
_____ allows you to replicate data between distant AWS Regions.
S3 cross-region replication (CRR)
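A CRR rule can be sketched as an S3 replication configuration (V2 schema). The role ARN, bucket names, and prefix are placeholders, and versioning must be enabled on both buckets before replication works:

```python
# Sketch of an S3 cross-region replication configuration; all ARNs,
# bucket names, and prefixes here are illustrative.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
    "Rules": [
        {
            "ID": "replicate-training-data",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": "training-data/"},        # only this prefix
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::my-dr-bucket-eu-west-1"},
        }
    ],
}
# With boto3: s3.put_bucket_replication(
#     Bucket="my-source-bucket", ReplicationConfiguration=replication_config)
```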
_____ can route traffic to different regions based on geographical location, which can reduce latency for end-users, and its Geoproximity routing lets you balance traffic loads across multiple regions.
Amazon Route 53
Test Failover Mechanisms: Regularly test your failover to ensure that the systems switch to new regions or zones without issues.
Data Synchronization: Keep data synchronized across regions, considering the cost and traffic implications.
Latency: Use services such as Amazon CloudFront to cache data at edge locations and reduce latency.
Compliance and Data Residency: Be aware of compliance requirements and data residency regulations that may impact data storage and transfer.
Cost Management: Consider the additional costs associated with cross-region data transfer and storage.
Best Practices for Multi-Region and Multi-AZ Deployments
A _____ can be used to package and deploy machine learning applications consistently across different environments. By containerizing machine learning applications, you ensure that the application runs the same way, regardless of where it is deployed.
Docker container
_____ provide a lightweight, standalone, and executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings.
Docker containers
Docker containers can be created and managed using services like _____.
Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), or even directly on EC2 instances
A text document that contains all the commands a user could call on the command line to assemble an image.
Dockerfile
An immutable file that’s essentially a snapshot of a container. Images are built from Dockerfile instructions into a complete, executable version of an application that relies on the host OS kernel.
Docker Image
A runtime instance of a Docker image.
Docker Container
Steps to create and deploy Docker containers in AWS for machine learning applications
- Install Docker on your local machine or AWS EC2 instance.
- Write a Dockerfile for your machine learning application.
- Build your Docker image with the docker build command.
- Run your Docker container locally to test with the docker run command.
- Push the Docker image to Amazon ECR.
- Deploy your Docker container on Amazon ECS or EKS.
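The steps above can be sketched in code. This writes out a minimal Dockerfile for a Python-based inference service (the base image and file names are illustrative) and lists the corresponding build/push commands in comments:

```python
# Minimal example Dockerfile for a Python inference service; the base
# image, requirements file, and entry point are illustrative.
dockerfile = """\
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "serve.py"]
"""

with open("Dockerfile.example", "w") as f:
    f.write(dockerfile)

# The build/test/push/deploy steps from the card then map to commands like:
#   docker build -t ml-app .
#   docker run -p 8080:8080 ml-app
#   aws ecr get-login-password | docker login --username AWS \
#       --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
#   docker tag ml-app <account-id>.dkr.ecr.<region>.amazonaws.com/ml-app:latest
#   docker push <account-id>.dkr.ecr.<region>.amazonaws.com/ml-app:latest
```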
_______ are collections of EC2 instances that are treated as a logical grouping for the purposes of automatic scaling and management.
Auto Scaling groups
_____ monitors your applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.
AWS Auto Scaling
T/F: Auto Scaling groups can help manage workload variability (training may need significant computational resources for short periods, while inference may see spikes in demand) by adding/removing resources as needed.
True
How do you deploy Auto Scaling groups?
- Create a launch template or launch configuration: Define the EC2 instance configuration that will be used by the Auto Scaling group. This includes the AMI, instance type, key pair, security groups, etc.
- Define the Auto Scaling group: Specify the minimum, maximum, and desired number of instances, as well as the availability zones for your instances. Attach the previously created launch template or launch configuration.
- Configure scaling policies: Establish the guidelines under which the Auto Scaling group will scale out (add more instances) or scale in (remove instances). Scaling policies can be based on criteria such as CPU utilization, network usage, or custom metrics.
- Set up notifications (optional): Create notifications to alert you when the Auto Scaling group launches or terminates instances, or when instances fail health checks.
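The "configure scaling policies" step above can be sketched as a target tracking policy. The group and policy names are illustrative:

```python
# Sketch of a target tracking scaling policy for an Auto Scaling group;
# the group name, policy name, and target value are illustrative.
policy = {
    "AutoScalingGroupName": "ml-inference-asg",
    "PolicyName": "keep-cpu-near-60",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,  # add/remove instances to hold ~60% average CPU
    },
}
# With boto3: boto3.client("autoscaling").put_scaling_policy(**policy)
```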
AWS provides _____ for Auto Scaling groups that you can use to monitor events such as:
- EC2 instance launch or termination
- Failed health checks
- Scaling activities triggered by your policies
CloudWatch metrics
What are the main resources you’ll be rightsizing?
EC2 Instances: Virtual servers in Amazon’s Elastic Compute Cloud (EC2) service.
Provisioned IOPS: The input/output operations per second that a storage volume can handle.
EBS Volumes: Elastic Block Store (EBS) provides persistent block storage volumes for EC2 instances.
For machine learning tasks, Amazon has specific instances like _____ that are optimized for GPU-based computations, which are ideal for training and inference in deep learning models.
the P and G series
When rightsizing EC2 instances, consider:
Compute Power: Match your instance’s CPU and GPU power to the training and inference needs of your machine learning model.
Memory: Choose an instance with enough RAM to handle your model and dataset.
Network Performance: Ensure the instance provides adequate network bandwidth for data transfer.
_____ are integral to the performance of the storage system, affecting how quickly data can be written to and read from the storage media.
IOPS
ML workloads can be I/O-intensive, particularly during _____ phases.
dataset loading and model training
You should choose the volume type and size based on your _____ requirements.
ML workload
_____is an ongoing process. You should continuously monitor and analyze your AWS resource usage and performance metrics to identify opportunities to resize instances and storage options.
Rightsizing
AWS offers tools such as _____ and _____ to track resource utilization and identify instances that are either over or under-utilized.
Amazon CloudWatch and AWS Trusted Advisor
Assess Regularly: Workloads can change over time, requiring different resources.
Use Managed Services: Managed services like Amazon SageMaker can automatically handle some rightsizing for you.
Consider Spot Instances: For flexible workloads, consider using spot instances, which can be cheaper but less reliable than on-demand instances.
Take Advantage of Autoscaling: Use autoscaling to adjust resources in response to changes in demand.
Practical Tips for Rightsizing
What all is involved in rightsizing AWS resources?
- Choosing the appropriate instance types, provisioned IOPS, and EBS volumes for your specific ML workloads.
- Balancing between performance needs and cost optimization, ensuring that you’re using just the right amount of resources without underutilizing or overpaying.
- Regular monitoring and adjustment are key to maintaining an efficient and cost-effective AWS environment for machine learning applications.
_____ plays a crucial role when it comes to managing incoming traffic across multiple targets, such as EC2 instances, in a reliable and efficient manner.
Load balancing
Suitable for simple load balancing of traffic across multiple EC2 instances.
Classic Load Balancer (CLB)
Best for advanced load balancing of HTTP and HTTPS traffic, providing advanced routing capabilities tailored to application-level content.
Application Load Balancer (ALB)
Ideal for handling volatile traffic patterns and large numbers of TCP flows, offering low-latency performance.
Network Load Balancer (NLB)
Helps you deploy, scale, and manage a fleet of third-party virtual appliances (such as firewalls and intrusion detection/prevention systems).
Gateway Load Balancer (GWLB)
What is the most commonly used load balancer, and why?
Application Load Balancer due to its ability to make routing decisions at the application layer
When deploying ML models, what does distributing inference requests across multiple model servers accomplish?
- Ensures high availability and reduces latency for end users.
- Helps distribute traffic across instances in different Availability Zones for fault tolerance.
How is load balancing typically implemented in ML scenarios?
- Deployment of ML Models: You may have multiple instances of Amazon SageMaker endpoints or EC2 instances serving your machine learning models.
- Configuration of Load Balancer: An Application Load Balancer is configured to sit in front of these instances. ALB supports content-based routing, and with well-defined rules, you can direct traffic based on the inference request content.
- Auto Scaling: You can set up AWS Auto Scaling to automatically adjust the number of inference instances in response to the incoming application load.
To ensure high availability, deploy your EC2 instances across multiple _____.
Availability Zones
The ALB periodically performs health checks on the registered instances and only routes traffic to healthy instances, ensuring reliability.
Health Checks
The EC2 instances will be part of an _____, which can automatically scale the number of instances up or down based on defined metrics such as CPU utilization or the number of incoming requests.
Auto Scaling group
_____ adjusts the number of instances based on a target value for a specific metric.
Target tracking scaling policy
_____ increases or decreases the number of instances based on a set of scaling adjustments.
Step scaling policy
Monitors your load balancer and managed instances, providing metrics such as request count, latency, and error codes.
CloudWatch
ALB can log each request it processes, which can be stored in S3 and used for analysis.
Access Logs
ALB supports request tracing to track HTTP requests from clients to targets.
Request Tracing
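Enabling ALB access logs is an attribute change on the load balancer. A sketch, with the bucket name and prefix as placeholders (the bucket also needs a policy allowing the ELB log-delivery service to write to it):

```python
# Sketch of ALB attributes that turn on access logging to S3; the bucket
# name and prefix are illustrative.
attributes = [
    {"Key": "access_logs.s3.enabled", "Value": "true"},
    {"Key": "access_logs.s3.bucket", "Value": "my-alb-logs-bucket"},
    {"Key": "access_logs.s3.prefix", "Value": "ml-inference"},
]
# With boto3: boto3.client("elbv2").modify_load_balancer_attributes(
#     LoadBalancerArn="arn:aws:elasticloadbalancing:...",
#     Attributes=attributes)
```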
_____ serve as templates for virtual servers on the AWS platform and are crucial for the rapid deployment of scalable, reliable, and secure applications.
Amazon Machine Images (AMIs)
An _____ contains all the necessary information to launch a virtual machine (VM) in AWS, including the operating system (OS), application server, applications, and associated configurations.
AMI
Benefits of Using AMIs
- Consistency: Ensures that each instance you launch has the same setup, reducing variability which leads to fewer errors.
- Scalability: Streamlines the process of scaling applications by allowing new instances to be spun up with the same configuration.
- Security: By pre-installing security patches and configuring security settings, you ensure compliance from the moment each instance is launched.
- Version Control: You can maintain different versions of AMIs to rollback or forward to different configurations if needed.
A _____ is a type of AMI that is pre-configured with an optimal set of software and settings for a particular use case. It’s considered such because it’s a tested and proven baseline which teams can use as a stable starting point.
golden image
Best Practices for Golden Images
- Automation: Automate the creation and maintenance of golden images to reduce manual errors and save time.
- Security Hardening: Implement security best practices within the image, including minimizing unnecessary software to reduce vulnerabilities.
- Regular Updates: Continuously integrate latest security patches and updates.
- Versioning: Maintain versions of golden images to track changes over time and for audit purposes.
- Immutable Infrastructure: Treat golden images as immutable; any change requires creating a new image rather than updating an existing one.
When building an AMI for machine learning workloads, you may need to include:
- Machine Learning Frameworks: Like TensorFlow, Keras, or PyTorch pre-installed and configured.
- GPU Drivers: If leveraging GPUs for computation, ensure proper drivers and libraries are installed.
- Data Processing Tools: Pre-installation of tools like Apache Spark or Hadoop if needed for data processing.
- Optimized Libraries: Depending on your machine learning tasks, you might need optimized math libraries such as Intel MKL.
Data, both in transit and at rest, should be _____.
encrypted
AWS offers several mechanisms for encryption, such as _____ for managing keys and _____ for managing SSL/TLS certificates.
AWS KMS / AWS Certificate Manager
Use the _____ to identify and right-size underutilized instances.
AWS Cost Explorer
_____ to optimize hyperparameters instead of manual experimentation.
SageMaker Automatic Model Tuning
When training machine learning models, _____ can significantly improve input/output operations and reduce training time.
caching data
Leverage _____, which are designed to be more efficient and scalable than their open-source equivalents.
AWS-optimized ML algorithms
Implement _____ and _____ for your machine learning models and datasets to facilitate recovery in case of failures.
automated backups and versioning
For long-running training jobs, use _____to save interim model states, which will allow you to resume from the last checkpoint rather than starting over in the event of a failure.
checkpointing
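In SageMaker, checkpointing is configured per training job. A sketch of the CheckpointConfig block, with the S3 URI as a placeholder; SageMaker syncs the local path to S3 during training so a restarted job can resume from the last saved state:

```python
# Sketch of a SageMaker training job checkpoint configuration; the S3
# URI is illustrative.
checkpoint_config = {
    "S3Uri": "s3://my-ml-bucket/checkpoints/job-42/",
    "LocalPath": "/opt/ml/checkpoints",  # default path inside the container
}
# Passed as CheckpointConfig=checkpoint_config to create_training_job, or as
# checkpoint_s3_uri / checkpoint_local_path on a SageMaker SDK Estimator.
```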
Implement _____ for automated testing, building, and deployment of ML models.
continuous integration and continuous delivery (CI/CD) pipelines
Use _____ and _____ to automate the deployment of machine learning models trained with SageMaker.
AWS CodePipeline and CodeBuild
Set up _____alarms on SageMaker endpoints to monitor the performance of your deployed models and trigger retraining workflows with ______ if necessary.
CloudWatch / AWS Step Functions
Use _____ in multiple AWS Regions and ____ to route traffic for high availability.
SageMaker Model Hosting Services / Route 53
_____ is a service that turns text into lifelike speech. It utilizes advanced deep learning technologies to synthesize speech that sounds like a human voice, and it supports multiple languages and includes a variety of lifelike voices.
Amazon Polly
- Text-to-speech (TTS) in a variety of voices and languages.
- Real-time streaming or batch processing of speech files.
- Support for Speech Synthesis Markup Language (SSML) for adjusting speech parameters like pitch, speed, and volume.
Key features of Amazon Polly
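The SSML support above can be sketched as a synthesize_speech request that slows the speaking rate; the sample text is illustrative:

```python
# Sketch of a Polly synthesize_speech request using SSML to adjust the
# speaking rate; the text is illustrative.
ssml = '<speak><prosody rate="slow">Welcome to the course.</prosody></speak>'
request = {
    "Text": ssml,
    "TextType": "ssml",      # tell Polly the input is SSML, not plain text
    "OutputFormat": "mp3",
    "VoiceId": "Joanna",     # one of Polly's built-in voices
}
# With boto3: resp = boto3.client("polly").synthesize_speech(**request)
# and resp["AudioStream"].read() yields the MP3 bytes.
```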
- Creating applications that read out text, such as automated newsreaders or e-learning platforms.
- Generating voiceovers for videos.
- Creating conversational interfaces for devices and applications.
Use cases for Amazon Polly
_____ is an AWS service for building conversational interfaces using voice and text. Powered by the same technology that drives Amazon Alexa, it provides an easy-to-use console for creating sophisticated, natural language chatbots.
Amazon Lex
- Natural language understanding (NLU) and automatic speech recognition (ASR) to interpret user intent.
- Integration with AWS Lambda to execute business logic or fetch data dynamically.
- Seamless deployment across multiple platforms such as mobile apps, web applications, and messaging platforms.
Key features of Amazon Lex
- Customer service chatbots to assist with common requests or questions.
- Voice-enabled application interfaces that allow for hands-free operation.
- Enterprise productivity bots integrated with platforms like Slack or Facebook Messenger.
Amazon Lex use cases
_____ uses deep learning processes to convert speech to text quickly and accurately. It can be used to transcribe customer service calls, automate subtitling, and generate metadata for media assets to create a fully searchable archive.
Amazon Transcribe
- High-quality speech recognition that supports various audio formats.
- Identification of different speakers (speaker diarization) within the audio.
- Supports custom vocabulary and terms specific to particular domains or industries.
Key features of Amazon Transcribe
- Transcribing recorded audio from customer service calls for analysis and insight.
- Automated generation of subtitles for videos.
- Creating text-based records of meetings or legal proceedings.
Amazon Transcribe use cases
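Speaker diarization from the features above is enabled per transcription job. A sketch, with the job name and media URI as placeholders:

```python
# Sketch of a Transcribe job request with speaker diarization enabled;
# the job name and S3 media URI are illustrative.
job = {
    "TranscriptionJobName": "support-call-0001",
    "LanguageCode": "en-US",
    "MediaFormat": "mp3",
    "Media": {"MediaFileUri": "s3://my-ml-bucket/calls/call-0001.mp3"},
    "Settings": {
        "ShowSpeakerLabels": True,  # label each speaker in the transcript
        "MaxSpeakerLabels": 2,      # e.g. agent and customer
    },
}
# With boto3: boto3.client("transcribe").start_transcription_job(**job)
```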
T/F: SageMaker covers classification, regression, and clustering.
True
Advantages of using Amazon SageMaker built-in algorithms include:
Ease of Use: These algorithms are pre-implemented and optimized for performance, allowing you to focus on model training and deployment without worrying about the underlying code.
Performance: Amazon SageMaker algorithms are designed to be highly scalable and performant, benefiting from AWS optimizations.
Integration: Built-in algorithms are tightly integrated with other SageMaker features, including model tuning and deployment.
Cost-Effectiveness: They can offer a cost advantage for certain tasks due to the efficiencies gained from optimization.
T/F: Amazon SageMaker supports various built-in algorithms like Linear Learner, XGBoost, and Random Cut Forest, among others.
True
T/F: Building custom machine learning models allows for greater flexibility and control over the architecture, features, and hyperparameters.
True
Building custom models is useful when:
Unique Requirements: Pre-built algorithms might not be suitable for specific tasks or data types.
Innovative Research: Custom experiments and novel architectures are necessary for cutting-edge machine learning research.
Domain Specialization: Highly specialized tasks may require custom-tailored solutions.
Performance Tuning: When the utmost performance is required, and you need to optimize every aspect of the model yourself.
Ease of Use: High
Model Complexity: Low to moderate
Specificity of Application: General use cases
Data Volume: High (optimized for scalability)
Performance Optimization: Pre-optimized, may be limited
Development Time: Shorter
Cost: Potentially lower
Integration with SageMaker: Full
When to use built-in algorithms
Ease of Use: Low to moderate
Model Complexity: High
Specificity of Application: Specialized/niche use cases
Data Volume: Variable
Performance Optimization: Full control
Development Time: Longer
Cost: Potentially higher
Integration with SageMaker: Requires custom setup
When to use custom models
_____, also referred to as limits, are the maximum number of resources you can create in an AWS service. They are set by AWS to help with resource optimization, ensuring availability and preventing abuse of services. They can vary by service, and also by Region within a service.
Service quotas
You can view and manage your AWS service quotas using _____, which provides a central location to manage quotas across your account.
the AWS Management Console, AWS CLI, or AWS Service Quotas API.
What command do you use to describe the service quotas using AWS CLI?
aws service-quotas list-service-quotas --service-code <service-code>