Well Architected Framework - Operational Excellence Flashcards
What is the Operational Excellence Pillar?
It includes the ability to support development and run workloads effectively, gain insight into their operations, and continuously improve supporting processes and procedures to deliver business value.
The operational excellence pillar provides an overview of design principles, best practices, and questions.
You can find prescriptive guidance on implementation in the Operational Excellence Pillar whitepaper.
What are the Operational Excellence Design Principles?
- Perform operations as code
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from all operational failures
Perform Operations as code
In the cloud, you can apply the same engineering discipline that you use for application code to your entire environment. You can define your entire workload (applications, infrastructure) as code and update it with code. You can implement your operations procedures as code and automate their execution by triggering them in response to events. By performing operations as code, you limit human error and enable consistent responses to events.
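As a rough illustration of procedures triggered by events, here is a minimal Python sketch. The event types, the `procedure` decorator, and the handler names are all hypothetical and exist only for this example; a real setup would call your own tooling inside each handler.

```python
from typing import Callable, Dict

# Hypothetical registry mapping event types to automated procedures.
PROCEDURES: Dict[str, Callable[[dict], str]] = {}


def procedure(event_type: str):
    """Register a function as the automated response to one event type."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        PROCEDURES[event_type] = func
        return func
    return register


@procedure("disk_usage_high")
def rotate_logs(event: dict) -> str:
    # A real procedure would invoke tooling here; this only reports the action.
    return f"rotated logs on {event['host']}"


@procedure("instance_unhealthy")
def replace_instance(event: dict) -> str:
    return f"replaced instance {event['instance_id']}"


def handle(event: dict) -> str:
    """Run the registered procedure for an event, or escalate if none exists."""
    handler = PROCEDURES.get(event["type"])
    if handler is None:
        return f"no procedure for {event['type']}; escalate to on-call"
    return handler(event)
```

Because every event of a given type runs the same function, the response is consistent and does not depend on who is on call.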
Make frequent, small, reversible changes
Design workloads to allow components to be updated regularly. Make changes in small increments that can be reversed if they fail (without affecting customers when possible).
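One way to picture a small, reversible change is the sketch below. It assumes a configuration held in a dictionary and a caller-supplied health check; `apply_change` and `is_healthy` are illustrative names, not part of any AWS API.

```python
import copy
from typing import Callable


def apply_change(config: dict, change: dict,
                 is_healthy: Callable[[dict], bool]) -> dict:
    """Apply one small change; keep the previous config if the check fails."""
    previous = copy.deepcopy(config)   # snapshot to fall back to
    candidate = {**config, **change}   # the change touches only the keys it names
    if is_healthy(candidate):
        return candidate
    return previous                    # reverse the change
```

Keeping each change to a few keys makes the snapshot cheap and makes it obvious what was reversed.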
Refine operations procedures frequently
As you use operations procedures, look for opportunities to improve them. As you evolve your workload, evolve your procedures appropriately. Set up regular game days to review and validate that all procedures are effective and that teams are familiar with them.
Anticipate failure
Perform “pre-mortem” exercises to identify potential sources of failure so that they can be removed or mitigated. Test your failure scenarios and validate your understanding of their impact. Test your response procedures to ensure that they are effective, and that teams are familiar with their execution. Set up regular game days to test workloads and team responses to simulated events.
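Testing a failure scenario can be as simple as injecting a simulated fault and checking that the response behaves as expected. The sketch below is a hypothetical example: `fetch_price`, `failing_primary`, and the fallback are invented for illustration.

```python
from typing import Callable


def fetch_price(primary: Callable[[], float],
                fallback: Callable[[], float]) -> float:
    """Try the primary source; use the fallback if the connection fails."""
    try:
        return primary()
    except ConnectionError:
        return fallback()


def failing_primary() -> float:
    # Simulated outage injected for the exercise.
    raise ConnectionError("simulated outage")
```

Calling `fetch_price(failing_primary, cached_price)` during a game day confirms that the fallback path actually works before a real outage depends on it (`cached_price` stands in for whatever fallback you use).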
Learn from all operational failures
Drive improvement through lessons learned from all operational events and failures. Share what is learned across teams and through the entire organization.
Four best practice areas for operational excellence in the cloud
- Organization
- Prepare
- Operate
- Evolve
OPS 1: How do you determine what your priorities are?
Everyone needs to understand their part in enabling business success. Have shared goals in order to set priorities for resources. This will maximize the benefits of your efforts.
OPS 2: How do you structure your organization to support your business outcomes?
Your teams must understand their part in achieving business outcomes. Teams need to understand their roles in the success of other teams, the role of other teams in their success, and have shared goals. Understanding responsibility, ownership, how decisions are made, and who has authority to make decisions will help focus efforts and maximize the benefits from your teams.
OPS 3: How does your organizational culture support your business outcomes?
Provide support for your team members so that they can be more effective in taking action and supporting your business outcome.
Prepare for operational excellence
Understand your workloads and their expected behaviors
Design your workload so that it provides the information needed to understand its internal state (for example, metrics, logs, events, and traces).
Iterate to develop monitoring for the health of your workload, identify when outcomes are at risk, and enable effective responses.
Enable situational awareness (for example, changes in state, user activity, privileged access, and utilization counters).
Adopt approaches that improve the flow of changes into production and that enable refactoring, fast feedback on quality, and bug fixing.
Provide fast feedback on quality and enable rapid recovery from changes that do not have desired outcomes.
Mitigate the impact of issues introduced through the deployment of changes.
Plan for unsuccessful changes so that you can respond faster if necessary.
Test and validate the changes you make.
Be aware of planned activities in your environments to manage risk.
Emphasize frequent, small, reversible changes to limit the scope of change.
Evaluate the operational readiness of your workload, processes, procedures, and personnel to understand the operational risks related to your workload. You should use a consistent process (including manual or automated checklists) to know when you are ready to go live with your workload or a change.
This will also enable you to find any areas that you need to make plans to address. Have runbooks that document your routine activities and playbooks that guide your processes for issue resolution.
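An automated readiness checklist could look roughly like the following sketch. The check names are hypothetical; each check is just a function that returns `True` when its item is satisfied.

```python
from typing import Callable, Dict, List, Tuple


def readiness_report(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every check; return (ready, names of the checks that failed)."""
    failed = [name for name, check in checks.items() if not check()]
    return (not failed, failed)
```

For example, `readiness_report({"runbook_exists": lambda: True, "alarms_configured": lambda: False})` reports that the workload is not ready and lists `alarms_configured` as the area that still needs a plan.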
Understand the benefits and risks to make informed decisions to allow changes to enter production.
AWS enables you to view your entire workload (applications, infrastructure, policy, governance, and operations) as code. This means you can apply the same engineering discipline that you use for application code to every element of your stack and share these across teams or organizations to magnify the benefits of development efforts. Use operations as code in the cloud and the ability to safely experiment to develop your workload, your operations procedures, and practice failure. Using AWS CloudFormation enables you to have consistent, templated, sandbox development, test, and production environments with increasing levels of operations control.
OPS 4: How do you design your workload so that you can understand its state?
Design your workload so that it provides the information necessary across all components (for example, metrics, logs, and traces) for you to understand its internal state. This enables you to provide effective responses when appropriate.
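A minimal sketch of a component emitting that information, assuming structured JSON log lines that carry a trace ID and metric values. The field names and the `log_event` helper are assumptions made for this example, not a prescribed format.

```python
import json
import time
import uuid


def new_trace_id() -> str:
    """Create an ID that ties together all records for one request."""
    return uuid.uuid4().hex


def log_event(trace_id: str, component: str, message: str, **metrics) -> str:
    """Emit one structured record combining log, trace, and metric data."""
    record = {
        "timestamp": time.time(),
        "trace_id": trace_id,
        "component": component,
        "message": message,
        "metrics": metrics,
    }
    line = json.dumps(record, sort_keys=True)
    print(line)  # in practice this would go to your log pipeline
    return line
```

Passing the same `trace_id` through every component that handles a request lets you reconstruct the request's path from the logs alone.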
OPS 5: How do you reduce defects, ease remediation, and improve flow into production?
Adopt approaches that improve flow of changes into production, that enable refactoring, fast feedback on quality, and bug fixing. These accelerate beneficial changes entering production, limit issues deployed, and enable rapid identification and remediation of issues introduced through deployment activities.
OPS 6: How do you mitigate deployment risks?
Adopt approaches that provide fast feedback on quality and enable rapid recovery from changes that do not have desired outcomes. Using these practices mitigates the impact of issues introduced through the deployment of changes.
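One common way to get fast feedback and rapid recovery is a staged rollout that stops at the first sign of trouble. The sketch below is illustrative only: the stage percentages, the `error_rate_at` callback, and the 1% threshold are assumptions, not values from the framework.

```python
from typing import Callable, Sequence


def staged_rollout(stages: Sequence[int],
                   error_rate_at: Callable[[int], float],
                   max_error_rate: float = 0.01) -> str:
    """Shift traffic stage by stage; roll back as soon as errors exceed the limit."""
    for percent in stages:
        if error_rate_at(percent) > max_error_rate:
            return f"rolled back at {percent}%"
    return "fully deployed"
```

Starting with a small percentage means a bad change is caught while it affects only a fraction of traffic.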