Domain 4: ML Implementation and Operations Flashcards
_____ is a service that provides a record of actions taken by a user, role, or an AWS service. It simplifies compliance audits, security analysis, and operational troubleshooting by enabling event history, which allows you to view, search, and download recent AWS account activity.
AWS CloudTrail
Event history: View the most recent account activity across your AWS infrastructure and troubleshoot operational issues.
CloudTrail Insights: Automatic detection of unusual activity in your account.
Data events: Record API calls made to specific AWS services such as Amazon S3 object-level APIs or AWS Lambda function execution APIs.
Management events: Record API calls that manage the AWS resources.
Key features of CloudTrail
_____ records every API call made to your AWS account and delivers a log file to an Amazon S3 bucket that you specify. These logs include details such as the identity of the API caller, the time of the API call, the source IP address, and the request parameters.
CloudTrail
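As a rough sketch of how a trail might be set up programmatically with boto3 (the trail and bucket names below are placeholders, and the bucket needs a policy that allows CloudTrail to write to it):

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Create a trail that delivers log files to an existing S3 bucket.
cloudtrail.create_trail(
    Name="ml-audit-trail",
    S3BucketName="my-cloudtrail-logs",
    IsMultiRegionTrail=True,
)

# A trail records nothing until logging is explicitly started.
cloudtrail.start_logging(Name="ml-audit-trail")
```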
_____ is a monitoring and observability service for AWS cloud resources and the applications you run on AWS. It can monitor AWS resources, such as EC2 instances, Amazon DynamoDB tables, and Lambda functions, and you can collect and access all your performance and operational data in the form of logs and metrics from a single platform.
Amazon CloudWatch
Metrics: Collect and store key metrics, which are variables you can measure for your resources and applications.
Logs: Collect, monitor, and analyze log files from different AWS services.
Alarms: Watch for specific metrics and automatically react to changes.
Events: Respond to state changes in your AWS resources with EventBridge.
Key features of CloudWatch
This service allows you to set alarms and automatically react to changes in your AWS resources, and it also integrates with Amazon SNS to notify you when certain thresholds are breached.
CloudWatch
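A minimal boto3 sketch of the alarm-plus-SNS pattern described above; the instance ID and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average CPU on one instance stays above 80% for two
# consecutive 5-minute periods, then notify an SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-ml-training",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```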
Enable
Choose events
Specify S3 bucket
Turn on insights
How to get started with CloudTrail monitoring
Set up metrics
Create alarms
Configure logging
Design dashboard
How to implement monitoring solutions with CloudWatch
To effectively monitor for errors and anomalies within your machine learning environment, you could set up a combination of _____ and _____.
CloudTrail and CloudWatch
By deploying applications across multiple Availability Zones, you can protect your applications from the failure of a single location.
High Availability
Multi-Region deployments can provide a backup in case of a regional service disruption.
Fault Tolerance
Different regions can serve users from geographically closer endpoints, reducing latency and improving the user experience.
Latency Reduction
For machine learning applications, having data processing and storage close to the data sources can reduce transfer times and comply with data sovereignty laws.
Data Locality
One or more discrete data centers with redundant power, networking, and connectivity within an AWS Region, physically separated from one another by a meaningful distance (many kilometers).
Availability Zone
You can deploy machine learning models using Amazon EC2 instances configured with _____, which can launch instances across multiple Availability Zones to ensure your application can withstand the loss of an AZ.
Auto Scaling
For databases backing machine learning applications, a Multi-AZ deployment with _____ can provide high availability and automatic failover support.
Amazon RDS
Deploying applications _____ can protect against regional outages and provide geographic redundancy.
across multiple AWS Regions
_____ allows you to replicate data between distant AWS Regions.
S3 cross-region replication (CRR)
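One way CRR could be enabled with boto3, assuming both buckets already exist with versioning enabled and the replication role ARN (a placeholder below) grants S3 the needed permissions:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="ml-data-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
        "Rules": [
            {
                "Status": "Enabled",
                "Prefix": "",  # replicate every object in the bucket
                "Destination": {"Bucket": "arn:aws:s3:::ml-data-eu-west-1"},
            }
        ],
    },
)
```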
_____ can route traffic to different regions based on geographical location, which can reduce latency for end-users, and its Geoproximity routing lets you balance traffic loads across multiple regions.
Amazon Route 53
Test Failover Mechanisms: Regularly test your failover to ensure that the systems switch to new regions or zones without issues.
Data Synchronization: Keep data synchronized across regions, considering the cost and traffic implications.
Latency: Use services such as Amazon CloudFront to cache data at edge locations and reduce latency.
Compliance and Data Residency: Be aware of compliance requirements and data residency regulations that may impact data storage and transfer.
Cost Management: Consider the additional costs associated with cross-region data transfer and storage.
Best Practices for Multi-Region and Multi-AZ Deployments
_____ can be used to package and deploy machine learning applications consistently across different environments. By containerizing machine learning applications, you ensure that the application runs the same way, regardless of where it is deployed.
Docker
_____ provide a lightweight, standalone, and executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings.
Docker containers
Docker containers can be created and managed using services like _____.
Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), or even directly on EC2 instances
A text document that contains all the commands a user could call on the command line to assemble an image.
Dockerfile
An immutable file that’s essentially a snapshot of a container. Images are built from the instructions for a complete and executable version of an application, which relies on the host OS kernel.
Docker Image
A runtime instance of a Docker image.
Docker Container
Steps to create and deploy Docker containers in AWS for machine learning applications
- Install Docker on your local machine or AWS EC2 instance.
- Write a Dockerfile for your machine learning application.
- Build your Docker image with the docker build command.
- Run your Docker container locally to test with the docker run command.
- Push the Docker image to Amazon ECR.
- Deploy your Docker container on Amazon ECS or EKS.
_______ are collections of EC2 instances that are treated as a logical grouping for the purposes of automatic scaling and management.
Auto Scaling groups
_____ monitors your applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.
AWS Auto Scaling
T/F: Auto Scaling groups can help manage workload variability (training may require significant computational resources for short periods, while inference may see spikes in demand) by adding or removing resources as needed.
True
How do you deploy Auto Scaling groups?
- Create a launch template or launch configuration: Define the EC2 instance configuration that will be used by the Auto Scaling group. This includes the AMI, instance type, key pair, security groups, etc.
- Define the Auto Scaling group: Specify the minimum, maximum, and desired number of instances, as well as the availability zones for your instances. Attach the previously created launch template or launch configuration.
- Configure scaling policies: Establish the guidelines under which the Auto Scaling group will scale out (add more instances) or scale in (remove instances). Scaling policies can be based on criteria such as CPU utilization, network usage, or custom metrics.
- Set up notifications (optional): Create notifications to alert you when the Auto Scaling group launches or terminates instances, or when instances fail health checks.
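To make the launch-template and group-definition steps concrete, here is a hedged boto3 sketch; the template name, subnet IDs, and sizing values are placeholder assumptions:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Assumes a launch template named "ml-inference-template" already exists.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="ml-inference-asg",
    LaunchTemplate={
        "LaunchTemplateName": "ml-inference-template",
        "Version": "$Latest",
    },
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=2,
    # Subnets in different Availability Zones give AZ-level redundancy.
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
)
```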
AWS provides _____ for Auto Scaling groups that you can use to monitor events such as:
- EC2 instance launch or termination
- Failed health checks
- Scaling activities triggered by your policies
CloudWatch metrics
What are the main resources you’ll be rightsizing?
EC2 Instances: Virtual servers in Amazon’s Elastic Compute Cloud (EC2) service.
Provisioned IOPS: The input/output operations per second that a storage volume can handle.
EBS Volumes: Elastic Block Store (EBS) provides persistent block storage volumes for EC2 instances.
For machine learning tasks, Amazon has specific instances like _____ that are optimized for GPU-based computations, which are ideal for training and inference in deep learning models.
the P and G series
When rightsizing EC2 instances, consider:
Compute Power: Match your instance’s CPU and GPU power to the training and inference needs of your machine learning model.
Memory: Choose an instance with enough RAM to handle your model and dataset.
Network Performance: Ensure the instance provides adequate network bandwidth for data transfer.
_____ are integral to the performance of the storage system, affecting how quickly data can be written to and read from the storage media.
IOPS
ML workloads can be I/O-intensive, particularly during _____ phases.
dataset loading and model training
You should choose the volume type and size based on your _____ requirements.
ML workload
_____ is an ongoing process. You should continuously monitor and analyze your AWS resource usage and performance metrics to identify opportunities to resize instances and storage options.
Rightsizing
AWS offers tools such as _____ and _____ to track resource utilization and identify instances that are either over or under-utilized.
AWS CloudWatch and AWS Trusted Advisor
Assess Regularly: Workloads can change over time, requiring different resources.
Use Managed Services: Managed services like Amazon SageMaker can automatically handle some rightsizing for you.
Consider Spot Instances: For flexible workloads, consider using spot instances, which can be cheaper but less reliable than on-demand instances.
Take Advantage of Autoscaling: Use autoscaling to adjust resources in response to changes in demand.
Practical Tips for Rightsizing
What all is involved in rightsizing AWS resources?
- Choosing the appropriate instance types, provisioned IOPS, and EBS volumes for your specific ML workloads.
- Balancing between performance needs and cost optimization, ensuring that you’re using just the right amount of resources without underutilizing or overpaying.
- Regularly monitoring and adjustments are key to maintaining an efficient and cost-effective AWS environment for machine learning applications.
_____ plays a crucial role when it comes to managing incoming traffic across multiple targets, such as EC2 instances, in a reliable and efficient manner.
Load balancing
Suitable for simple load balancing of traffic across multiple EC2 instances.
Classic Load Balancer (CLB)
Best for advanced load balancing of HTTP and HTTPS traffic, providing advanced routing capabilities tailored to application-level content.
Application Load Balancer (ALB)
Ideal for handling volatile traffic patterns and large numbers of TCP flows, offering low-latency performance.
Network Load Balancer (NLB)
Helps you deploy, scale, and manage a fleet of third-party virtual appliances (such as firewalls and intrusion detection/prevention systems).
Gateway Load Balancer (GWLB)
What is the most commonly used load balancer, and why?
Application Load Balancer due to its ability to make routing decisions at the application layer
When deploying ML models, what does distributing inference requests across multiple model servers accomplish?
- Ensures high availability and reduces latency for end users.
- Helps distribute traffic across instances in different Availability Zones for fault tolerance.
How is load balancing typically implemented in ML scenarios?
- Deployment of ML Models: You may have multiple instances of Amazon SageMaker endpoints or EC2 instances serving your machine learning models.
- Configuration of Load Balancer: An Application Load Balancer is configured to sit in front of these instances. ALB supports content-based routing, and with well-defined rules, you can direct traffic based on the inference request content.
- Auto Scaling: You can set up AWS Auto Scaling to automatically adjust the number of inference instances in response to the incoming application load.
To ensure high availability, you will deploy your EC2 instances across multiple _____.
Availability Zones
The ALB periodically performs _____ on the registered instances and only routes traffic to healthy instances, ensuring reliability.
Health Checks
The EC2 instances will be part of an _____, which can automatically scale the number of instances up or down based on defined metrics such as CPU utilization or the number of incoming requests.
Auto Scaling group
_____ adjusts the number of instances based on a target value for a specific metric.
Target tracking scaling policy
_____ increases or decreases the number of instances based on a set of scaling adjustments.
Step scaling policy
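A target tracking policy, for example, can be attached to an Auto Scaling group with a single boto3 call; the group name and the 50% CPU target below are illustrative assumptions:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Scale the group in and out to hold average CPU near 50%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="ml-inference-asg",
    PolicyName="target-50-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```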
Monitors your load balancer and managed instances, providing metrics such as request count, latency, and error codes.
CloudWatch
ALB can log each request it processes, which can be stored in S3 and used for analysis.
Access Logs
ALB supports request tracing to track HTTP requests from clients to targets.
Request Tracing
_____ serve as the templates for virtual servers on the AWS platform and are crucial for the rapid deployment of scalable, reliable, and secure applications.
Amazon Machine Images (AMIs)
An _____ contains all the necessary information to launch a virtual machine (VM) in AWS, including the operating system (OS), application server, applications, and associated configurations.
AMI
Benefits of Using AMIs
- Consistency: Ensures that each instance you launch has the same setup, reducing variability which leads to fewer errors.
- Scalability: Streamlines the process of scaling applications by allowing new instances to be spun up with the same configuration.
- Security: By pre-installing security patches and configuring security settings, you ensure compliance from the moment each instance is launched.
- Version Control: You can maintain different versions of AMIs to rollback or forward to different configurations if needed.
A _____ is a type of AMI that is pre-configured with an optimal set of software and settings for a particular use case. It’s considered such because it’s a tested and proven baseline which teams can use as a stable starting point.
golden image
Best Practices for Golden Images
- Automation: Automate the creation and maintenance of golden images to reduce manual errors and save time.
- Security Hardening: Implement security best practices within the image, including minimizing unnecessary software to reduce vulnerabilities.
- Regular Updates: Continuously integrate latest security patches and updates.
- Versioning: Maintain versions of golden images to track changes over time and for audit purposes.
- Immutable Infrastructure: Treat golden images as immutable; any change requires creating a new image rather than updating an existing one.
When working with AMIs for machine learning, you may need to include:
- Machine Learning Frameworks: Like TensorFlow, Keras, or PyTorch pre-installed and configured.
- GPU Drivers: If leveraging GPUs for computation, ensure proper drivers and libraries are installed.
- Data Processing Tools: Pre-installation of tools like Apache Spark or Hadoop if needed for data processing.
- Optimized Libraries: Depending on your machine learning tasks, you might need optimized math libraries such as Intel MKL.
Data, both in transit and at rest, should be _____.
encrypted
AWS offers several mechanisms for encryption, such as _____ for managing keys and _____ for managing SSL/TLS certificates.
AWS KMS / AWS Certificate Manager
Use the _____ to identify and right-size underutilized instances.
AWS Cost Explorer
Use _____ to optimize hyperparameters instead of manual experimentation.
SageMaker Automatic Model Tuning
When training machine learning models, _____ can significantly improve input/output operations and reduce training time.
caching data
_____, which are designed to be more efficient and scalable than their open-source equivalents.
Leverage AWS-optimized ML algorithms
Implement _____ and _____ for your machine learning models and datasets to facilitate recovery in case of failures.
automated backups and versioning
For long-running training jobs, use _____ to save interim model states, which will allow you to resume from the last checkpoint rather than starting over in the event of a failure.
checkpointing
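A framework-agnostic sketch of the idea: after a training step writes a checkpoint file locally, copy it to S3 so an interrupted job can resume. The bucket name and key layout are placeholders:

```python
import boto3

s3 = boto3.client("s3")

def save_checkpoint(local_path: str, bucket: str, step: int) -> None:
    """Upload an interim model state so training can resume later."""
    s3.upload_file(local_path, bucket, f"checkpoints/model-step-{step}.ckpt")

# e.g., inside a training loop, every N steps:
# save_checkpoint("/tmp/model.ckpt", "my-training-bucket", step)
```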
Implement _____ for automated testing, building, and deployment of ML models.
continuous integration and continuous delivery (CI/CD) pipelines
Use _____ and _____ to automate the deployment of machine learning models trained with SageMaker.
AWS CodePipeline and CodeBuild
Set up _____ alarms on SageMaker endpoints to monitor the performance of your deployed models and trigger retraining workflows with _____ if necessary.
CloudWatch / AWS Step Functions
Use _____ in multiple AWS Regions and ____ to route traffic for high availability.
SageMaker Model Hosting Services / Route 53
_____ is a service that turns text into lifelike speech. It utilizes advanced deep learning technologies to synthesize speech that sounds like a human voice, and it supports multiple languages and includes a variety of lifelike voices.
Amazon Polly
- Text-to-speech (TTS) in a variety of voices and languages.
- Real-time streaming or batch processing of speech files.
- Support for Speech Synthesis Markup Language (SSML) for adjusting speech parameters like pitch, speed, and volume.
Key features of Amazon Polly
- Creating applications that read out text, such as automated newsreaders or e-learning platforms.
- Generating voiceovers for videos.
- Creating conversational interfaces for devices and applications.
Use cases for Amazon Polly
_____ is an AWS service for building conversational interfaces using voice and text. Powered by the same technology that drives Amazon Alexa, it provides an easy-to-use console for creating sophisticated, natural language chatbots.
Amazon Lex
- Natural language understanding (NLU) and automatic speech recognition (ASR) to interpret user intent.
- Integration with AWS Lambda to execute business logic or fetch data dynamically.
- Seamless deployment across multiple platforms such as mobile apps, web applications, and messaging platforms.
Key features of Amazon Lex
- Customer service chatbots to assist with common requests or questions.
- Voice-enabled application interfaces that allow for hands-free operation.
- Enterprise productivity bots integrated with platforms like Slack or Facebook Messenger.
Amazon Lex use cases
_____ uses deep learning processes to convert speech to text quickly and accurately. It can be used to transcribe customer service calls, automate subtitling, and generate metadata for media assets to create a fully searchable archive.
Amazon Transcribe
- High-quality speech recognition that supports various audio formats.
- Identification of different speakers (speaker diarization) within the audio.
- Supports custom vocabulary and terms specific to particular domains or industries.
Key features of Amazon Transcribe
- Transcribing recorded audio from customer service calls for analysis and insight.
- Automated generation of subtitles for videos.
- Creating text-based records of meetings or legal proceedings.
Amazon Transcribe use cases
T/F: SageMaker's built-in algorithms cover classification, regression, and clustering.
True
Advantages of using Amazon SageMaker built-in algorithms include:
Ease of Use: These algorithms are pre-implemented and optimized for performance, allowing you to focus on model training and deployment without worrying about the underlying code.
Performance: Amazon SageMaker algorithms are designed to be highly scalable and performant, benefiting from AWS optimizations.
Integration: Built-in algorithms are tightly integrated with other SageMaker features, including model tuning and deployment.
Cost-Effectiveness: They can offer a cost advantage for certain tasks due to the efficiencies gained from optimization.
T/F: Amazon SageMaker supports various built-in algorithms like Linear Learner, XGBoost, and Random Cut Forest, among others.
True
T/F: Building custom machine learning models allows for greater flexibility and control over the architecture, features, and hyperparameters.
True
Building custom models is useful when:
Unique Requirements: Pre-built algorithms might not be suitable for specific tasks or data types.
Innovative Research: Custom experiments and novel architectures are necessary for cutting-edge machine learning research.
Domain Specialization: Highly specialized tasks may require custom-tailored solutions.
Performance Tuning: When the utmost performance is required, and you need to optimize every aspect of the model yourself.
Ease of Use: High
Model Complexity: Low to moderate
Specificity of Application: General use cases
Data Volume: High (optimized for scalability)
Performance Optimization: Pre-optimized, may be limited
Development Time: Shorter
Cost: Potentially lower
Integration with SageMaker: Full
When to use built-in algorithms
Ease of Use: Low to moderate
Model Complexity: High
Specificity of Application: Specialized/niche use cases
Data Volume: Variable
Performance Optimization: Full control
Development Time: Longer
Cost: Potentially higher
Integration with SageMaker: Requires custom setup
When to use custom models
_____, also referred to as limits, are the maximum number of resources you can create in an AWS service. They're set by AWS to help with resource optimization, ensuring availability and preventing abuse of services. They can vary by service, and also by Region within a service.
Service quotas
You can view and manage your AWS service quotas using _____, which provides a central location to manage quotas across your account.
the AWS Management Console, AWS CLI, or AWS Service Quotas API.
What command do you use to list service quotas using the AWS CLI?
aws service-quotas list-service-quotas
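The equivalent call is also available through boto3; "ec2" is a real service code, and pagination is omitted for brevity:

```python
import boto3

quotas = boto3.client("service-quotas")

# List the quotas for a given service (here, EC2) with their current values.
for quota in quotas.list_service_quotas(ServiceCode="ec2")["Quotas"]:
    print(quota["QuotaName"], quota["Value"])
```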
AWS provides two ways to request a quota increase:
through the AWS Service Quotas console or by opening a support case
- Navigate to the Service Quotas console.
- Select the service you need a quota increase for.
- Choose the specific quota.
- Click on the “Request quota increase” button and fill out the required form.
How to increase quota using AWS Service Quota
Best Practices for Managing Quotas
Monitor Usage: Regularly monitor your usage against your service quotas with CloudWatch metrics, AWS Budgets, or custom scripts.
Set Alarms: Create alarms in CloudWatch to notify you when you are approaching a service quota.
Clean Up Resources: Terminate resources you no longer need to free up your service quota for other tasks.
Implement Cost Control: By staying within limits, you also keep costs down, ensuring that you are only paying for the resources you need.
Plan for Scale: Understand how your quotas scale across multiple regions, as some resources might have regional service quotas.
When selecting an instance for machine learning workloads on AWS, it’s important to _____.
match the instance type to the specific needs of your application
Perfect for a variety of workloads, these instances offer a balance of compute, memory, and networking resources. The T and M series fall into this category and can be a cost-effective solution for smaller machine learning tasks.
General Purpose Instances
The R and X series fall under this umbrella and are designed for memory-intensive applications like large in-memory databases. They can also be useful for ML tasks that require substantial memory, such as certain types of clustering and large-scale linear algebra operations.
Memory Optimized Instances
For ML workloads involving deep learning and large-scale tensor calculations, GPU instances like the P and G series are highly recommended. They are equipped with powerful GPUs that significantly accelerate model training and inference.
GPU Instances
These instances are billed by the second, with no long-term commitments or upfront payments. While they offer flexibility for sporadic or unpredictable workloads, they are often the most expensive option over time.
On-demand instances
By committing to a one- or three-year term, you can save up to 75% over equivalent on-demand capacity, and these instances are ideal for predictable workloads with steady-state usage.
Reserved Instances
Spot instances offer the opportunity to take advantage of unused EC2 capacity at discounts of up to 90% compared to on-demand prices. These instances can be interrupted by AWS with two-minute notification, making them suitable for flexible or fault-tolerant ML workloads.
Spot Instances
- Rightsizing instances by benchmarking your ML workloads to select the instance that best matches your performance and cost requirements.
- Use managed services like Amazon SageMaker that can abstract away the complexities of managing infrastructure and provide cost efficiencies through features like SageMaker Managed Spot Training.
- Continuously monitor and adjust your usage with tools like AWS Cost Explorer and AWS Budgets.
Cost Optimization Strategies for ML
Using ____ is a cost-effective strategy for training deep learning models, especially when the workload has flexible start and end times.
AWS Spot Instances
_____ simplifies the process of deploying batch computing jobs on AWS. Combined with Spot Instances, it provides an efficient and scalable approach to handling extensive computational tasks such as deep learning training at a fraction of the cost.
AWS Batch
_____ allow you to take advantage of unused EC2 computing capacity at up to a 90% discount compared to On-Demand prices. However, these instances can be interrupted by AWS with two minutes of notification when AWS needs the capacity back.
Spot Instances
Despite the possibility of interruptions, _____ are well-suited for deep learning workloads as they are often resilient to interruptions – models can save checkpoints to persistent storage, allowing training to resume from the last saved state.
Spot Instances
_____ automates the deployment, management, and scaling of batch jobs. It dynamically provisions the optimal quantity and type of compute resources based on the volume and specific resource requirements of the batch jobs submitted.
AWS Batch
- Define a compute environment: Choose ‘Spot’ as the ‘Provisioning model’. Set the ‘Maximum price’ you’re willing to pay per instance hour, which can be up to the On-Demand rate.
- Create a job queue: Link your compute environment to a job queue by specifying priority levels.
- Define job definitions: Specify the Docker image to use, vCPUs, memory requirements, and the job role, which should include necessary permissions for the AWS resources your job will access.
- Submit jobs to the queue: Jobs submitted to this queue are then placed into the Spot Instance-based compute environment you’ve configured.
How to configure your compute environment within AWS Batch to use spot resources
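A sketch of the compute-environment step with boto3; the role ARNs, subnets, security group, and instance types are placeholder assumptions:

```python
import boto3

batch = boto3.client("batch")

batch.create_compute_environment(
    computeEnvironmentName="ml-spot-env",
    type="MANAGED",
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
    computeResources={
        "type": "SPOT",
        "bidPercentage": 60,  # pay at most 60% of the On-Demand price
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["p3.2xlarge", "g4dn.xlarge"],
        "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "ecsInstanceRole",
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/SpotFleetRole",
    },
)
```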
Implement _____ in your training code so that your models can resume from the last saved state if interrupted. Store _____ in Amazon S3 or EFS for durability.
Checkpoints
Use Amazon S3 for storing your datasets. It’s a highly available and durable storage service that integrates well with AWS Batch and Spot Instances.
Data Locality
Set your maximum spot bid price. If the spot market price exceeds your bid, your Spot Instance may be reclaimed.
Bid Pricing
Spread your spot requests across multiple instance types and Availability Zones to reduce the likelihood of simultaneous interruptions.
Diverse Spot Requests
Use Spot Fleet to manage a group of Spot Instances and On-Demand Instances to optimize for cost and availability.
Spot Fleet
Have a strategy to automatically fall back to On-Demand Instances when Spot Instances are not available for extended periods
Fallback to On-Demand
Monitor your Spot Instances and job execution using _____ to alert you when critical events occur (e.g., Spot Interruption notices).
AWS CloudWatch
Use _____ in conjunction with CloudWatch alarms to automate checkpointing and job restarts.
AWS Lambda functions
_____ control inbound and outbound traffic and ensure that the resources, such as Amazon SageMaker instances, Amazon EC2 instances hosting ML models, or databases storing training data, are secure and accessible only by authorized entities.
Security groups
_____ control the traffic based on rules that you define. You can specify rules that allow traffic to and from your instances, typically configured as a list of allowed IP protocols, ports, and source or destination IP ranges.
Security groups
T/F: By default, security groups deny all inbound traffic and allow all outbound traffic.
True
These rules govern incoming traffic to your instance.
Inbound rules
These rules control the network traffic that leaves your instance, which may include allowing instances to call external APIs or access other AWS services.
Outbound
Protocol: TCP
Port Range: 22
Source: Your IP
For secure shell access to an instance.
SSH
Protocol: TCP
Port Range: 8888
Source: Specific IP Range
Jupyter notebooks for ML.
Custom TCP
Protocol: TCP
Port Range: 80
Source: 0.0.0.0/0
Allow web access to ML Dashboards.
HTTP
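The three rules above could be added to a security group with boto3 roughly like this; the group ID and CIDR ranges are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[
        # SSH from a single admin IP only.
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "203.0.113.10/32"}]},
        # Jupyter notebook access from a specific office range.
        {"IpProtocol": "tcp", "FromPort": 8888, "ToPort": 8888,
         "IpRanges": [{"CidrIp": "203.0.113.0/24"}]},
        # Public web access to an ML dashboard.
        {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
)
```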
T/F: It is a common practice to allow all outbound traffic, but for enhanced security, you should limit it to only the ports and protocols necessary for your application to function.
True
Only open up the ports that are necessary for your application to function. For instance, if your ML model only requires HTTP access, avoid opening the SSH port.
Principle of Least Privilege
Restrict the IP addresses able to access your instance. For business-critical ML systems, access should ideally be from known IP ranges.
IP Restrictions
Use different security groups for different roles within your infrastructure. For example, an Amazon RDS instance holding your data might have different security requirements compared to your Amazon SageMaker endpoint.
Separate Groups for Different Roles
Regularly review and update your security group rules to ensure they reflect your current requirements and are free from any legacy configurations that may introduce risks.
Regular Reviews and Updates
Some AWS services, like AWS PrivateLink for Amazon SageMaker, allow you to keep traffic between your VPC and the service within the AWS network, which reduces exposure to the internet and improves security.
Integration with AWS Services
Security groups are _____, meaning that if you send a request from your instance, the response traffic for that request is allowed to flow in regardless of inbound security group rules.
stateful
Additionally, security groups are associated with _____, which means that you can assign multiple security groups to a single network interface for granular control.
network interfaces
_____ enables you to manage access to AWS services and resources securely by creating and managing AWS users and groups, and use permissions to allow and deny their access to AWS resources.
IAM
Individuals or services who are granted access to resources in your AWS account.
Users
A collection of users under a set of permissions. Adding a user to a _____ grants them the permissions of that _____.
Groups
A set of permissions that grant access to actions and resources in AWS. It does not have standard long-term credentials (password or access keys) associated with it. Instead, when you assume a _____, it provides you with temporary security credentials for your _____ session.
Roles
Documents that define permissions and can be attached to users, groups, or roles. They are written in JSON and specify what actions are allowed or denied.
Policies
- Least Privilege
- Rotate Credentials
- Enable MFA
- Audit/Log IAM Events
Security best practices for IAM
- Use IAM access advisor to check service last accessed information, thereby identifying unused permissions that can be revoked.
- Conditionally apply permissions based on tags attached to users or resources, minimizing overly broad permissions.
How to maintain compliant and secure ML environments
_____ are resource-based policies that allow you to manage permissions for your S3 resources. They enable you to grant or deny access to your S3 buckets and objects to both AWS accounts and AWS services. Typically, these permissions revolve around operations such as s3:GetObject, s3:PutObject, and s3:DeleteObject, which relate to reading, writing, and deleting objects within an S3 bucket.
S3 bucket policies
- Granting cross-account access: With bucket policies, you can define permissions that allow these external entities to access the required resources.
- Restricting access based on IP address: You might want to restrict access to your ML resources to requests originating from specific IP ranges, especially when dealing with sensitive data.
- Enforcing data encryption: Enforcing the use of encryption on uploads ensures that your machine learning data remains secure at rest, which is essential for maintaining data privacy and compliance with various regulations.
- Preserving data integrity and versioning: For ML models that rely on consistent data, you may use bucket policies to prevent accidental deletion and ensure versioning is enabled, keeping a record of all changes to the objects in an S3 bucket.
Use Cases for S3 Bucket Policies in ML
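As an example of the encryption-enforcement use case, a bucket policy that denies unencrypted uploads might be applied like this (the bucket name is a placeholder):

```python
import json
import boto3

s3 = boto3.client("s3")

# Deny any PutObject request that does not use KMS server-side encryption.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnencryptedUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::ml-training-data/*",
        "Condition": {
            "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
        },
    }],
}
s3.put_bucket_policy(Bucket="ml-training-data", Policy=json.dumps(policy))
```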
Attached directly to S3 buckets vs.
Attached to IAM users, groups, or roles
S3/IAM Policy difference on scope
Broad access control, Cross-account sharing vs.
Fine-grained permissions, User-specific access
S3/IAM Policy difference on use case
Implicit (the attached bucket) vs.
Must explicitly include S3 resource ARNs
S3/IAM Policy difference on resource definition
Up to 20 KB per bucket policy vs.
Up to 2 KB for an inline user policy, about 6 KB for a managed policy
S3/IAM Policy difference on size limit
- Grant least privilege
- Regularly audit and rotate keys
- Validate your JSON
- Use conditions for extra security
S3 Bucket Policy Best Practices
A _____ is an isolated section of the AWS cloud where you can launch AWS resources in a virtual network that you define, and it closely resembles a traditional network that you might operate in your own data center, with the benefits of using the scalable infrastructure of AWS.
VPC
- Isolation: They provide a logically isolated area within the AWS cloud, ensuring that resources launched within them are not accessible by other _____ by default.
- Customization: Users have complete control over the virtual networking environment, including selection of IP address range, creation of subnets, and configuration of route tables and network gateways.
- Security: Groups of rules, known as security groups and network access control lists (ACLs), provide security at the protocol and port access level.
- Connectivity: Options include connecting to the internet, to your own data centers, or to other VPCs, providing flexibility for various deployment scenarios.
Key Features of VPCs
Dividing a VPC into _____ allows for efficient allocation of IP ranges based on the network design. These can be public (internet-facing) or private (no direct internet access).
subnets
_____ act as a virtual firewall for instances, controlling inbound and outbound traffic. _____ provide an additional layer of security, controlling traffic at the subnet level.
Security groups / Network ACLs
To access AWS services securely without traversing the internet, _____ can be used which allows private connections to AWS services.
VPC endpoints
For instances in a private subnet that need to initiate outbound internet traffic, a _____ is necessary, ensuring that the instances can connect to the internet while remaining private and secure.
NAT (Network Address Translation) instance or gateway
Data Transfer Costs: Transferring data between different AWS services or the internet can incur costs. _____ can help reduce data transfer charges.
Efficient VPC design
Performance: The choice of VPC components can impact the performance of ML models, especially during training and inference. Carefully consider the _____.
placement of resources and routing choices
T/F: As machine learning workloads grow, the VPC design should facilitate easy scalability, without compromising on security or performance.
True
_____ is the process of converting data into a code to prevent unauthorized access. On AWS, encryption ensures the confidentiality and integrity of your data both at rest and in transit.
Encryption
_____ provides server-side encryption with Amazon S3-managed keys (SSE-S3), AWS KMS-managed keys (SSE-KMS), or customer-provided keys (SSE-C).
Amazon S3
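For instance, a single object can be uploaded with SSE-KMS via boto3; the bucket, key, and KMS key ID below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Upload an object encrypted at rest with a customer-managed KMS key.
with open("train.csv", "rb") as data:
    s3.put_object(
        Bucket="ml-training-data",
        Key="datasets/train.csv",
        Body=data,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="1234abcd-12ab-34cd-56ef-1234567890ab",
    )
```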
_____ encrypts volumes with keys managed by the AWS Key Management Service (KMS) or customer-managed keys.
Amazon EBS
Amazon RDS and Amazon Redshift also support encryption at rest using _____.
AWS KMS
Encrypting data in transit protects your data as it moves between services or locations. Common protocols include _____.
Secure Sockets Layer (SSL) or Transport Layer Security (TLS)
Amazon API Gateway for encrypting API calls
Amazon Elastic Load Balancing (ELB) for SSL/TLS encryption
AWS Direct Connect with VPN for secure connections to AWS
AWS services that support encryption in transit
_____ is the process of either encrypting or removing personally identifiable information from a dataset so that the identity of data subjects cannot be readily inferred.
Anonymization
Replacing sensitive data with unique identification symbols that retain essential information without compromising its security.
Tokenization
Obfuscation of specific data within a database so that the data structure remains intact but the information is not easily identifiable.
Masking
Reducing the granularity of the data, for example, by reporting age in ranges rather than specific values.
Generalization
The process of replacing private identifiers with fake identifiers or pseudonyms.
Pseudonymization
By comparing encryption with anonymization, we can see that _____ is reversible, provided you have the necessary keys, while _____ is designed to be irreversible in order to protect identity.
encryption / anonymization
Common encryption algorithms:
AES, RSA, ECC
Common anonymization techniques:
Tokenization, Masking, Generalization
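A toy pandas sketch of two of these techniques — pseudonymization via hashing and generalization via age buckets; the column names and bin edges are arbitrary assumptions:

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"email": ["jane@example.com"], "age": [34]})

# Pseudonymization: replace the identifier with a deterministic token.
df["email"] = df["email"].map(
    lambda e: hashlib.sha256(e.encode()).hexdigest()[:12]
)

# Generalization: report age as a range instead of an exact value.
df["age_range"] = pd.cut(df["age"], bins=[0, 18, 35, 55, 120],
                         labels=["0-18", "19-35", "36-55", "56+"])
df = df.drop(columns=["age"])
```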
Before you expose an endpoint, you first need to _____.
Train a model
Once the endpoint is active, you can _____.
Send data for real-time predictions
Use _____ functions to manage endpoints.
Boto3
Boto3 functions can:
- list all endpoints
- describe a specific endpoint
- update an endpoint
- delete an endpoint
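A compact boto3 sketch of those management calls, plus a real-time invocation; the endpoint and config names are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

# List all endpoints in the current account and region.
for ep in sm.list_endpoints()["Endpoints"]:
    print(ep["EndpointName"], ep["EndpointStatus"])

# Describe, update, and delete a specific endpoint.
sm.describe_endpoint(EndpointName="my-model-endpoint")
sm.update_endpoint(EndpointName="my-model-endpoint",
                   EndpointConfigName="my-new-endpoint-config")
sm.delete_endpoint(EndpointName="my-model-endpoint")

# Invoke an active endpoint for a real-time prediction.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(EndpointName="my-model-endpoint",
                                   ContentType="text/csv",
                                   Body="1.5,2.3,0.7")
print(response["Body"].read())
```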
T/F: SageMaker endpoints come with IAM role-based access control and data passed to and from is encrypted.
True
You can run endpoints within a _____ for additional network isolation.
VPC
Exposing endpoints within AWS using Amazon SageMaker consists of _____, _____, _____, and _____.
- training models
- deploying them to endpoints
- interacting with these endpoints for predictions
- managing and securing these endpoints
Machine learning models can be broadly classified into three categories:
supervised, unsupervised, and reinforcement learning models. These categories are based on how the models interact with the data presented to them.
- Utilize labeled datasets to predict outcomes.
- Frequently used models include linear regression for continuous outputs, logistic regression for classification tasks, decision trees, and neural networks.
Example: Predicting house prices based on features such as square footage, number of bedrooms, and location.
Supervised Learning Models
- Work with unlabeled data to uncover hidden patterns.
- Common models are clustering algorithms like K-means and hierarchical clustering, and dimensionality reduction techniques like PCA (Principal Component Analysis).
Example: Segmenting customers into groups based on purchasing behavior.
Unsupervised Learning Models
- Learn optimal actions through trial and error by maximizing a reward function.
- Used in scenarios where decision-making is sequential and the environment is dynamic.
Example: A chess-playing AI that improves by playing numerous games.
Reinforcement Learning Models
For _____ tasks, you could use metrics like accuracy, precision, recall, F1 score, and ROC AUC (Receiver Operating Characteristic Area Under the Curve).
classification
For _____ tasks, common metrics include mean squared error (MSE), mean absolute error (MAE), and R-squared.
regression
_____ are the parameters of the learning algorithm itself, and tuning them is essential to optimize model performance.
Hyperparameters
_____ can be used to perform hyperparameter optimization by running multiple training jobs with different hyperparameter combinations to find the best version of a model.
AWS SageMaker Automatic Model Tuning
- Provides a fully managed service for building, training, and deploying machine learning models.
- Includes features like Jupyter notebook instances, built-in high-performance algorithms, model tuning, and automatic model deployment in a scalable environment.
Amazon SageMaker
- Allows running code in response to events without provisioning or managing servers.
- Can be used to trigger machine learning model inferences based on real-time data.
AWS Lambda
- Provides the ability to attach low-cost GPU-powered inference acceleration to Amazon SageMaker instances or EC2 instances.
- Useful for reducing costs for compute-intensive inference workloads.
AWS Elastic Inference
- Preprocess data efficiently using Amazon SageMaker Processing.
- Use AWS Glue for data cataloging and ETL (extract, transform, load) processes.
- Store and retrieve datasets with Amazon S3 (Simple Storage Service).
- Monitor model performance over time with Amazon SageMaker Model Monitor.
- Enhance security by using AWS Identity and Access Management (IAM) to control access to AWS resources.
Best Practices for ML Models on AWS
_____ is commonly used to validate the effectiveness of predictive models. It’s a way to compare two or more versions of a model in parallel by exposing them to a real-time environment where they can be evaluated based on actual performance metrics.
A/B testing
Before you can perform _____, you must train at least two variants of your machine learning model. Using SageMaker, you can train models using built-in algorithms or bring your own custom algorithms.
A/B testing
After training your models, you can deploy them to an _____ for A/B testing. SageMaker lets you deploy multiple models to a single endpoint and split the traffic between them.
endpoint
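A hedged boto3 sketch of such a traffic split, assuming two models ("churn-model-v1" and "churn-model-v2", placeholder names) have already been created in SageMaker:

```python
import boto3

sm = boto3.client("sagemaker")

# Two production variants behind one endpoint, with a 70/30 traffic split.
sm.create_endpoint_config(
    EndpointConfigName="ab-test-config",
    ProductionVariants=[
        {"VariantName": "model-a", "ModelName": "churn-model-v1",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.7},
        {"VariantName": "model-b", "ModelName": "churn-model-v2",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.3},
    ],
)
sm.create_endpoint(EndpointName="churn-ab-endpoint",
                   EndpointConfigName="ab-test-config")
```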
Use _____ to monitor their performance in terms of accuracy, latencies, error rates, and other relevant metrics.
CloudWatch
What are the three steps involved in implementing A/B testing?
- Model training
- Model deployment for A/B testing
- Monitor and analyze results
A _____ is a sequence of steps designed to automatically refresh your machine learning model with new data, helping to keep the model relevant as the data changes.
retraining pipeline
- Data Collection & Preprocessing: Collecting new data samples and applying the same preprocessing steps as the initial model training.
- Model Retraining: Using the updated dataset to retrain the model or incrementally update it using online learning techniques.
- Validation & Testing: Evaluating the performance of the model on a validation set to ensure it meets performance thresholds before deployment.
- Deployment: Replacing the existing model with the updated one in the production environment.
Steps involved in retraining pipeline
Stores data and model artifacts securely.
Amazon S3
Runs code in response to triggers such as a schedule event or a change in data.
AWS Lambda
Offers a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning models.
Amazon SageMaker
Coordinates multiple AWS services into serverless workflows so you can build and update apps quickly.
AWS Step Functions
Monitors your pipeline, triggering retraining based on a schedule or a specific event.
Amazon CloudWatch
Prepares and loads data, perfect for the ETL (extract, transform, load) jobs required during preprocessing.
AWS Glue
- New data arrives in an Amazon S3 bucket, triggering an AWS Lambda function.
- The Lambda function invokes an AWS Step Functions workflow which starts the retraining process.
- AWS Glue is used to prepare the data by performing ETL tasks and depositing the processed data back into S3.
- An Amazon SageMaker training job is initiated by Step Functions to retrain the model with the new data. The model artifacts are stored in another S3 bucket.
- Once the model is retrained, validation and testing are performed using SageMaker’s batch transform feature or live endpoint testing.
- If the performance of the new model meets predefined thresholds, a deployment is initiated, where the new model replaces the old one at the SageMaker endpoint.
- Amazon CloudWatch is used for monitoring the model’s performance and logging the entire pipeline’s steps for compliance and debugging purposes.
How a retraining pipeline might be implemented on AWS
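The first two steps of that flow might look like the following Lambda handler — a sketch that assumes an S3 event trigger and a placeholder state machine ARN:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    """Triggered by S3 object creation; starts the retraining workflow."""
    sfn.start_execution(
        stateMachineArn=(
            "arn:aws:states:us-east-1:123456789012:stateMachine:retrain"
        ),
        input=json.dumps(event),  # pass the S3 event details downstream
    )
    return {"statusCode": 200}
```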
Periodically retrain the model with a batch of new data. This can be scheduled daily, weekly, etc., based on the problem requirements.
Batch Retraining
Implement a streaming data solution where the model is continuously updated in near-real-time.
Continuous Retraining
Use triggers such as deterioration in model performance or significant changes in incoming data to initiate retraining.
Trigger-based Retraining
AWS CodePipeline and AWS CodeBuild, combined with SageMaker, can facilitate _____.
CI/CD for ML models, ensuring that retraining pipelines have an automated, reliable flow
Tools like _____ and _____ can help detect when models start to perform poorly compared to their benchmarks, signaling when a retraining cycle should be initiated.
Amazon CloudWatch and SageMaker Model Monitor
- Missing Values: Ensure that your dataset does not have significant amounts of missing data. If it does, you can use techniques like imputation to fill in the gaps.
- Outliers: Extreme values can distort training and lead to poor performance. Detecting and handling outliers is a crucial part of the data preprocessing step.
- Feature Distribution: Check if your features have a distribution that your algorithm can work with effectively. Some algorithms, like neural networks, may require data normalization or standardization.
Various data issues
_____ occurs when the model performs well on training data but poorly on unseen data.
Overfitting
_____ is when the model is too simple to capture the patterns in the data.
Underfitting
Analyze _____ to diagnose underfitting or overfitting.
Learning Curves
Plotting training and validation accuracy or loss over epochs can tell you _____.
if the model is learning as expected
The learning rate in gradient descent algorithms must be chosen carefully to avoid _____.
overshooting the minimum; this is why tuning hyperparameters is crucial for model performance
Debugging tools
SageMaker Debugger
CloudWatch Logs
_____ makes it easy to monitor and visualize the training of machine learning models in real-time. It allows you to detect and analyze issues like vanishing gradients, overfitting, and poor weight initialization.
SageMaker Debugger
_____ helps you monitor and troubleshoot your models. It can collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources.
CloudWatch Logs
- Divergent loss
- Slow training
- Poor generalization
Common problems when training models
If the loss diverges instead of converging, it’s often due to _____.
a high learning rate or unstable optimization algorithm
Slow training could be due to _____. Optimizing the computation graph or using a more powerful instance type may help.
inefficient data loading, suboptimal model architecture, or lack of hardware resources
If the model performs well on the training data but poorly on test data, consider _____.
using regularization techniques, obtaining more training data, or simplifying the model
Changes in data input patterns, which can degrade model performance over time.
data drift
When model predictions become less accurate due to real-world changes.
model drift
_____ is a monitoring service that provides you with data and actionable insights to monitor your applications. You can set alarms to notify you when certain thresholds are breached, and use it to monitor the compute resources your models are using, like CPU and memory utilization.
AWS CloudWatch
For monitoring machine learning models, _____ can detect and alert on data drift and other issues that may impact model performance. It continually checks deployed models against a baseline to detect deviations in data quality and automatically alerts you.
Amazon SageMaker Model Monitor
You can also create customized alarms by _____.
defining specific metrics that are most indicative of your application’s health
If the performance drop is due to resource constraints, you can use _____ to adjust the number of instances or compute capacity based on demand automatically. This can respond to increased latency or load by adding additional resources.
AWS Auto Scaling
Data drift or changes in the external environment could lead to performance degradation. Amazon SageMaker can be set up for automatic retraining pipelines using _____. You can set triggers based on drift detection metrics indicating when retraining should occur.
SageMaker Pipelines
You may need to update your model endpoints if a new model has been trained that better reflects current data trends. _____ make it possible to perform A/B testing or directly replace the existing model with minimal downtime.
Amazon SageMaker Endpoints
If cost-related performance drops are an issue (e.g., due to downscaling instances for budget reasons), you can use _____for training or inferencing tasks, and this will let you take advantage of unused EC2 capacity at a discount.
Amazon EC2 Spot Instances
- Accuracy: The ratio of correctly predicted instances to total instances.
- Precision: The ratio of true positives to the sum of true and false positives.
- Recall: The ratio of true positives to the sum of true positives and false negatives.
- F1 score: The harmonic mean of precision and recall.
- AUC-ROC: Area Under the Receiver Operating Characteristic Curve.
Metrics used to measure performance
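These metrics map directly onto scikit-learn helpers; a small self-contained example with made-up labels and scores:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 1, 1, 0, 1]             # ground-truth labels
y_pred = [0, 1, 0, 0, 1]             # hard predictions
y_score = [0.2, 0.9, 0.4, 0.1, 0.8]  # predicted probabilities for class 1

print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
print("auc-roc:", roc_auc_score(y_true, y_score))
```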
- Invocations: The number of times the model endpoint is called.
- Invocation errors: Errors encountered when calling a model.
- Latency: The response time of invocations.
Metrics monitored by CloudWatch
Amazon SageMaker is an end-to-end machine learning service that also offers a specific feature for monitoring models called _____. It enables continuous monitoring of your machine learning models for data quality, model quality, and operational aspects, alerting on issues in real time.
Model Monitor
- Data drift
- Model performance (e.g., accuracy, AUC-ROC)
- Feature attribute importance
Metrics tracked by Model Monitor
_____ helps improve model transparency and explains predictions by identifying feature attributions that contribute to the prediction outcomes. Such insights can be instrumental in monitoring if the model is relying on the correct features and if it’s making predictions for the right reasons, which can impact performance.
Amazon SageMaker Clarify
- Define clear metrics and thresholds: Know what “good” performance means for your particular model and use-case.
- Monitor in real time: Catch issues as they arise by monitoring model performance continuously.
- Implement robust logging: Capture all relevant data points to facilitate thorough analysis later.
- Handle model drift: Re-evaluate and re-train your models if the input data changes considerably over time.
- Ensure model explainability: Being able to explain your model’s decisions is essential for end-user trust and regulatory compliance.
Best practices for model monitoring