Well Architected Framework WP - Operational Excellence Flashcards
Operational Excellence
practices and procedures for managing production workloads
how planned changes are executed and responses to unexpected events
change execution and responses should be automated. All processes should be documented, tested, and reviewed
Design Principles (PAMRAL)
Perform operations with code
Annotate documentation
Make frequent, small reversible changes
Refine operations procedures frequently
Anticipate Failure
Learn from all operational failures
Best practice areas of Operational Excellence (POE)
Prepare
Operate
Evolve
Preparation for Operational Excellence
To prepare consider:
operational priorities
design for operations
operational readiness
======
use checklists to ensure workloads are ready for production
Workloads should have runbooks and playbooks
runbook - operations guidance for routine procedures
playbook - for responding to unexpected events
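A production-readiness checklist like the one described above can also be expressed as code. The following is a minimal sketch, not part of the whitepaper; the Workload fields and the checklist items are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Workload:
    """Hypothetical description of a workload under review."""
    name: str
    has_runbook: bool = False
    has_playbook: bool = False
    tags: dict = field(default_factory=dict)


# Hypothetical checklist: each entry is (description, check function).
CHECKLIST = [
    ("runbook documented", lambda w: w.has_runbook),
    ("playbook documented", lambda w: w.has_playbook),
    ("owner tag present", lambda w: "owner" in w.tags),
]


def failed_checks(workload):
    """Return the descriptions of checklist items the workload fails."""
    return [desc for desc, check in CHECKLIST if not check(workload)]


def ready_for_production(workload):
    """A workload is ready only when no checklist item fails."""
    return not failed_checks(workload)
```

Calling failed_checks on a workload that only has a runbook would list the missing playbook and owner tag, which gives reviewers a concrete list of gaps before launch.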
Preparation best practices
In AWS use CloudFormation to ensure environments have all required resources and configuration is based on tested best practices
Use Auto Scaling
Use AWS Config to make rules for automatically tracking and responding to changes
Use tagging
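To make the CloudFormation and tagging points concrete, here is a minimal sketch that builds a small template as a Python dictionary and renders it to JSON. The logical ID LogArchiveBucket, the tag names, and the build_template helper are assumptions made up for this example; only the template keys and the AWS::S3::Bucket resource type come from CloudFormation itself.

```python
import json


def build_template(bucket_logical_id, tags):
    """Return a CloudFormation template (as a dict) with one tagged S3 bucket."""
    # CloudFormation expects tags as a list of Key/Value objects.
    tag_list = [{"Key": k, "Value": v} for k, v in sorted(tags.items())]
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative template: one tagged S3 bucket",
        "Resources": {
            bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {"Tags": tag_list},
            }
        },
    }


# Hypothetical values, for illustration only.
template = build_template(
    "LogArchiveBucket", {"environment": "prod", "owner": "ops-team"}
)
template_json = json.dumps(template, indent=2)
```

Keeping the template in version control alongside application code is what lets every environment be created from the same tested definition.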
Preparation questions
what best practices are you using
how are you doing configuration management
Keep documentation current
Operational Excellence - Operations
operations should be standardized and manageable
Focus on automation, small frequent changes, QA testing
Use logs and metrics
Set up pipelines for continuous integration and deployment
Should be able to revert changes
Operations - questions
How are you evolving your workload while minimizing impact of change
how do you monitor workload
Operational Excellence - Responses
responses should be automated
for alerting, mitigation, remediation, rollback and recovery
responses should follow a predefined playbook
in AWS you can use SNS for some of this
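One way to picture an automated, playbook-driven response is a Lambda-style handler that receives a CloudWatch alarm notification through SNS and picks a predefined action. This is a sketch under assumptions: the PLAYBOOK mapping, the alarm names, and the action names are hypothetical. The Records / Sns / Message layout and the AlarmName and NewStateValue fields follow the shape of SNS-delivered alarm notifications.

```python
import json

# Hypothetical mapping from alarm name to the playbook step to run.
PLAYBOOK = {
    "HighCpuAlarm": "scale_out",
    "DiskFullAlarm": "rotate_logs",
}


def handler(event, context):
    """Return the playbook actions chosen for each alarm in the SNS event."""
    actions = []
    for record in event.get("Records", []):
        # The alarm details arrive as a JSON string in the SNS message body.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue  # ignore OK / INSUFFICIENT_DATA transitions
        # Alarms without a predefined step fall back to human escalation.
        actions.append(
            PLAYBOOK.get(message.get("AlarmName"), "escalate_to_on_call")
        )
    return actions
```

The fallback to escalation reflects the idea that anything not covered by the playbook should reach a person rather than be silently dropped.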
responses questions
how do you respond to unplanned events
how is escalation managed when responding to unplanned events
Key AWS Services for defining priorities
AWS Config inventories your AWS resources and configurations
Service Catalog creates a standard set of service offerings
Use Auto Scaling and SQS to increase automation
Key AWS Services for Operations
CodeCommit
CodeDeploy
CodePipeline to manage code changes
CloudTrail to audit
Key AWS Services for Responses
CloudWatch alarms for setting thresholds for alerting, notification
CloudWatch Events for triggering notifications and automated responses
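The idea of alerting on a threshold can be illustrated with a simplified model of alarm evaluation. This is not how CloudWatch is implemented, and it ignores options such as missing-data handling. It only assumes an alarm fires when the most recent evaluation_periods datapoints all exceed the threshold. The state names mirror CloudWatch's OK, ALARM and INSUFFICIENT_DATA.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Simplified alarm model: ALARM if the latest N datapoints all breach."""
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Only the most recent N datapoints are considered.
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"
```

Requiring several consecutive breaches is a common way to avoid alerting on a single short spike.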
Key AWS Services for defining priorities / preparation
AWS Support, including support center. Business and Enterprise Support customers get access to additional checks and reviews
AWS Cloud Compliance for regulatory, compliance requirements
AWS Trusted Advisor for optimizations
Key AWS Services for designing for operations
CloudWatch to monitor resources and applications
CloudFormation to create version-controlled templates for your infrastructure
AWS Developer Tools to enable safe, rapid delivery of software
AWS X-Ray to trace user requests through entire application for analysis, debugging
Design for Operations - Key Points
View entire workload as code, define and update as code.
Align engineering practices for defect reduction, rapid fixes. Use logging for visibility into architecture
Key AWS Services for operational readiness
AWS Lambda to enable operational procedures as code that can be triggered by events
AWS Config to track changes to CloudFormation Stacks
EC2 Systems Manager to automate management tasks on EC2 instances
2 Considerations for Operational Success
Understanding operational health
responding to events
Operate - Understanding Operational Health
To understand Operational Health, use metrics to implement dashboards
Send log data to CloudWatch Logs, define baseline metrics
Send CloudWatch Logs to Elasticsearch and use Kibana
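Defining a baseline metric can be as simple as summarizing recent samples and flagging values that drift too far from them. The sketch below is one possible approach, not something prescribed by the whitepaper; the three-standard-deviation tolerance and the sample values are assumptions for illustration.

```python
from statistics import mean, pstdev


def baseline(samples):
    """Return (average, population standard deviation) of the samples."""
    return mean(samples), pstdev(samples)


def is_anomalous(value, samples, tolerance=3.0):
    """Flag a value more than `tolerance` standard deviations from the mean."""
    avg, spread = baseline(samples)
    return abs(value - avg) > tolerance * spread


# Hypothetical request-latency samples in milliseconds.
latency_ms = [100, 110, 90, 105, 95]
```

With these samples the average is 100 ms, so a reading of 150 ms would be flagged while 105 ms would not.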
AWS Service health and personal health dashboards
Under the AWS Shared Responsibility Model, these dashboards handle part of the monitoring for you, providing alerts and remediation guidance when AWS experiences events
Key AWS Services for understanding Operational Health
CloudWatch - metrics, dashboards
CloudWatch Logs - monitor, store logs from various sources
Amazon Elasticsearch Service (ES) - use for log analytics, monitoring
Personal Health Dashboard - alerts, remediation when AWS experiences issues
Service Health Dashboard - shows realtime AWS service availability
Operate - Responding to Events
Anticipate planned and unplanned operational events
AWS lets you script responses and trigger their execution via code
Automate execution of runbook and playbook actions
Ways to automatically respond to events
Create CloudWatch Events rules that trigger responses through targets such as Lambda functions, SNS topics, ECS tasks
Create CloudWatch alarms that perform EC2 actions, Auto Scaling actions, or send notifications to an SNS topic
Use SNS to invoke Lambda
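Rules that match events and route them to targets can be pictured with a small matcher. The event pattern below uses the documented shape for EC2 state-change events, but the matches function is a deliberately simplified stand-in for the real matching logic (exact values and nested objects only), and the instance ID is a made-up example.

```python
# Pattern in the style of a CloudWatch Events rule: each field lists
# the values that are allowed to match.
STOPPED_INSTANCE_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}


def matches(pattern, event):
    """Simplified matcher: every pattern field must match the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical event, for illustration only.
example_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
```

Fields present in the event but absent from the pattern are ignored, which is why the extra instance-id field does not prevent a match.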
Key AWS Service for responding to events
AWS Lambda to define operational procedures as code that can be triggered
also:
CloudWatch - collect logs, metrics, enables triggered execution of events
CloudWatch Events - deliver realtime stream of events that can be matched to rules
SNS - lets you invoke Lambda
Auto Scaling
EC2 Systems Manager - automate management tasks on EC2 instances
Evolve
Continuously improve over time
implement small, frequent changes
Learn from experience
Share learnings
Evolve - what to do with aggregated logs in AWS?
create detailed history of all your operational activities, workloads and infrastructure to analyze operations over time
Evolve - how to use CloudTrail?
track API activity to know what’s happening across your accounts
Track AWS developer tools activities with CloudTrail and CloudWatch
These add detailed activity history of deployments and outcomes to CloudWatch Logs data
Evolve - why ingest CloudWatch Logs data into Elasticsearch?
to use built in support for Kibana to create visualizations and perform analysis
Evolve - why export CloudWatch data to S3?
To analyze it with Amazon Athena and use QuickSight to perform analysis, create visualizations
Key AWS Services for Evolving
Elasticsearch to analyze log data and gain insights
also:
Amazon QuickSight - business analytics (BA) service for visualization, analysis
Amazon Athena - serverless interactive query service to analyze data in S3
S3 - collect and archive logs
CloudWatch - collect logs and metrics, create dashboards
Evolve - ways to share learnings
Use IAM to give access to resources across accounts
Use AWS CodeCommit to share applications, procedures, libraries, documentation
Share compute standards by giving access to AMIs
Share CloudFormation templates
Authorize Lambda functions across accounts
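Cross-account sharing through IAM usually starts with a role whose trust policy names the other account. The sketch below only builds that policy document as JSON and does not call AWS. The cross_account_trust_policy helper is hypothetical, and 111122223333 is a placeholder account ID.

```python
import json


def cross_account_trust_policy(trusted_account_id):
    """Build a trust policy letting another account assume a role."""
    if not (trusted_account_id.isdigit() and len(trusted_account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit strings")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Trust the other account; that account's own IAM policies
                # decide which of its users or roles may assume this role.
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Placeholder account ID, for illustration only.
policy_json = json.dumps(cross_account_trust_policy("111122223333"), indent=2)
```

The permissions attached to the role itself (for example, read access to shared templates or repositories) would be defined separately and are not shown here.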
Key AWS Services for sharing learnings
IAM
also:
SNS - notify subscribers when resources are published
CodeCommit - version controlled repository for operations as code
Lambda
CloudFormation - standardized templates
AMIs