Well-Architected Framework - Operational Excellence Flashcards

1
Q

What is the Operational Excellence Pillar?

A

It includes the ability to support development and run workloads effectively, gain insight into their operations, and continuously improve supporting processes and procedures to deliver business value.

The operational excellence pillar provides an overview of design principles, best practices, and questions.

You can find prescriptive guidance on implementation in the Operational Excellence Pillar whitepaper.

2
Q

What are the Operational Excellence Design Principles?

A
Perform operations as code
Make frequent, small, reversible changes
Refine operations procedures frequently
Anticipate failure
Learn from all operational failures
3
Q

Perform operations as code

A

In the cloud, you can apply the same engineering discipline that you use for application code to your entire environment. You can define your entire workload (applications, infrastructure) as code and update it with code. You can implement your operations procedures as code and automate their execution by triggering them in response to events. By performing operations as code, you limit human error and enable consistent responses to events.
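
As a minimal sketch of the idea, the Python below registers runbook steps as functions and triggers them from incoming events. The event name, the `on_event` decorator, and the handler body are hypothetical; a real setup would call your own tooling instead of returning a string.

```python
from typing import Callable, Dict

# Hypothetical registry mapping an event type to the runbook step that handles it.
RUNBOOKS: Dict[str, Callable[[dict], str]] = {}


def on_event(event_type: str):
    """Register a function as the automated response to one event type."""
    def register(func):
        RUNBOOKS[event_type] = func
        return func
    return register


@on_event("disk_usage_high")  # hypothetical event name
def rotate_logs(event: dict) -> str:
    # Placeholder for the real remediation; here we only describe the action.
    return f"rotated logs on {event['host']}"


def dispatch(event: dict) -> str:
    """Run the registered runbook step, or escalate when none exists."""
    handler = RUNBOOKS.get(event["type"])
    if handler is None:
        return "no runbook registered; escalate to on-call"
    return handler(event)


print(dispatch({"type": "disk_usage_high", "host": "web-1"}))
```

Because the response lives in code, the same event always produces the same action, and the procedure can be reviewed and versioned like any other change.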

4
Q

Make frequent, small, reversible changes

A

Design workloads to allow components to be updated regularly. Make changes in small increments that can be reversed if they fail (without affecting customers when possible).
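
One way to picture a reversible change is the sketch below: a small configuration change is applied to a copy and kept only if a health check passes. The `apply_change` helper, the config keys, and the `enough_instances` check are assumptions made for illustration, not part of any AWS API.

```python
from typing import Callable, Dict, Tuple

Config = Dict[str, object]


def apply_change(
    current: Config,
    change: Config,
    is_healthy: Callable[[Config], bool],
) -> Tuple[Config, bool]:
    """Apply one small change; keep it only if the health check passes."""
    candidate = {**current, **change}  # the original config is left untouched
    if is_healthy(candidate):
        return candidate, True
    return current, False  # roll back by returning the previous config


# Hypothetical health check: the workload needs at least two instances.
def enough_instances(config: Config) -> bool:
    return int(config["instances"]) >= 2


config = {"instances": 3, "version": "1.0"}
config, applied = apply_change(config, {"version": "1.1"}, enough_instances)
config, applied_second = apply_change(config, {"instances": 1}, enough_instances)
print(config, applied, applied_second)
```

Keeping each change small means a failed check discards only that one change, while earlier successful changes stay in place.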

5
Q

Refine operations procedures frequently

A

As you use operations procedures, look for opportunities to improve them. As you evolve your workload, evolve your procedures appropriately. Set up regular game days to review and validate that all procedures are effective and that teams are familiar with them.

6
Q

Anticipate failure

A

Perform “pre-mortem” exercises to identify potential sources of failure so that they can be removed or mitigated. Test your failure scenarios and validate your understanding of their impact. Test your response procedures to ensure that they are effective, and that teams are familiar with their execution. Set up regular game days to test workloads and team responses to simulated events.
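
Testing a failure scenario can be as small as injecting a simulated outage and checking that the fallback behaves as expected. The sketch below assumes a hypothetical pricing dependency and cached fallback value; the function names are illustrative only.

```python
from typing import Callable


def fetch_price(lookup: Callable[[str], float], item: str, cached: float) -> float:
    """Return the live price, falling back to a cached value on failure."""
    try:
        return lookup(item)
    except ConnectionError:
        return cached  # degraded but still serving


def failing_lookup(item: str) -> float:
    """Simulated dependency outage used only for the exercise."""
    raise ConnectionError(f"pricing service unavailable for {item}")


def healthy_lookup(item: str) -> float:
    return 9.99


# Game-day style checks: the normal path and the injected-failure path.
normal = fetch_price(healthy_lookup, "widget", cached=9.50)
degraded = fetch_price(failing_lookup, "widget", cached=9.50)
print(normal, degraded)
```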

7
Q

Learn from all operational failures

A

Drive improvement through lessons learned from all operational events and failures. Share what is learned across teams and through the entire organization.

8
Q

Four best practice areas for operational excellence in the cloud

A
  • Organization
  • Prepare
  • Operate
  • Evolve
9
Q

OPS 1: How do you determine what your priorities are?

A

Everyone needs to understand their part in enabling business success. Have shared goals in order to set priorities for resources. This will maximize the benefits of your efforts.

10
Q

OPS 2: How do you structure your organization to support your business outcomes?

A

Your teams must understand their part in achieving business outcomes. Teams need to understand their roles in the success of other teams, the role of other teams in their success, and have shared goals. Understanding responsibility, ownership, how decisions are made, and who has authority to make decisions will help focus efforts and maximize the benefits from your teams.

11
Q

OPS 3: How does your organizational culture support your business outcomes?

A

Provide support for your team members so that they can be more effective in taking action and supporting your business outcomes.

12
Q

Prepare for operational excellence

A

Understand your workloads and their expected behaviors.
Design your workload so that it provides the information necessary to understand its internal state (for example, metrics, logs, events, and traces).
Iterate to develop the monitoring needed for the health of your workload, identify when outcomes are at risk, and enable effective responses.
Enable situational awareness (for example, changes in state, user activity, privilege access, and utilization counters).
Adopt approaches that improve the flow of changes into production and that enable refactoring, fast feedback on quality, and bug fixing.
Adopt approaches that provide fast feedback on quality and enable rapid recovery from changes that do not have desired outcomes.
Mitigate the impact of issues introduced through the deployment of changes.
Plan for unsuccessful changes so that you are able to respond faster if necessary.
Test and validate the changes you make.
Be aware of planned activities in your environments so that you can manage risk.
Emphasize frequent, small, reversible changes to limit the scope of change.
Evaluate the operational readiness of your workload, processes, procedures, and personnel to understand the operational risks related to your workload. Use a consistent process (including manual or automated checklists) to know when you are ready to go live with your workload or a change. This also helps you find areas that you need to plan to address.
Have runbooks that document your routine activities and playbooks that guide your processes for issue resolution.
Understand the benefits and risks to make informed decisions to allow changes to enter production.
AWS enables you to view your entire workload (applications, infrastructure, policy, governance, and operations) as code. This means you can apply the same engineering discipline that you use for application code to every element of your stack and share these across teams or organizations to magnify the benefits of development efforts. Use operations as code in the cloud and the ability to safely experiment to develop your workload, your operations procedures, and practice failure. Using AWS CloudFormation enables you to have consistent, templated, sandbox development, test, and production environments with increasing levels of operations control.
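
The automated checklist mentioned above could look something like the sketch below. The check names and the `readiness_review` helper are hypothetical, and each lambda stands in for a real verification step.

```python
from typing import Callable, Dict, List, Tuple


def readiness_review(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every check and return (ready, names of failed checks)."""
    failed = [name for name, check in checks.items() if not check()]
    return (not failed, failed)


# Hypothetical checklist items; each lambda stands in for a real verification.
checks = {
    "runbooks documented": lambda: True,
    "alarms configured": lambda: True,
    "rollback plan tested": lambda: False,
}

ready, gaps = readiness_review(checks)
print("ready to go live" if ready else f"not ready, address: {gaps}")
```

Returning the failed items, not just a yes or no, gives the team the list of areas it still needs to plan for.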

13
Q

OPS 4: How do you design your workload so that you can understand its state?

A

Design your workload so that it provides the information necessary across all components (for example, metrics, logs, and traces) for you to understand its internal state. This enables you to provide effective responses when appropriate.
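
A small illustration of a component exposing its internal state is a structured log line that tools can parse and aggregate. The sketch below uses only the Python standard library; the `telemetry_record` helper and the field names are assumptions chosen for the example.

```python
import json
import time


def telemetry_record(component: str, event: str, **fields) -> str:
    """Build one structured log line that tools can parse and aggregate."""
    record = {
        "timestamp": time.time(),
        "component": component,
        "event": event,
        **fields,  # extra measurements, such as latency or status code
    }
    return json.dumps(record, sort_keys=True)


line = telemetry_record("checkout", "order_placed", latency_ms=182, status=200)
print(line)
```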

14
Q

OPS 5: How do you reduce defects, ease remediation, and improve flow into production?

A

Adopt approaches that improve flow of changes into production, that enable refactoring, fast feedback on quality, and bug fixing. These accelerate beneficial changes entering production, limit issues deployed, and enable rapid identification and remediation of issues introduced through deployment activities.

15
Q

OPS 6: How do you mitigate deployment risks?

A

Adopt approaches that provide fast feedback on quality and enable rapid recovery from changes that do not have desired outcomes. Using these practices mitigates the impact of issues introduced through the deployment of changes.

16
Q

OPS 7: How do you know that you are ready to support a workload?

A

Evaluate the operational readiness of your workload, processes and procedures, and personnel to understand the operational risks related to your workload.

17
Q

OPS 8: How do you understand the health of your workload?

A

Define, capture, and analyze workload metrics to gain visibility to workload events so that you can take appropriate action.
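
To make "define, capture, and analyze" concrete, the sketch below defines one metric (the share of 5xx responses in a window), computes it from captured status codes, and flags when action is needed. The 5% threshold and the sample window are assumptions for illustration, not recommended values.

```python
from typing import Sequence


def error_rate(status_codes: Sequence[int]) -> float:
    """Fraction of requests in the window that returned a 5xx status."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def needs_action(status_codes: Sequence[int], threshold: float = 0.05) -> bool:
    """True when the error rate exceeds the (assumed) 5% threshold."""
    return error_rate(status_codes) > threshold


# Hypothetical window of captured HTTP status codes.
window = [200, 200, 500, 200, 503, 200, 200, 200, 200, 200]
print(error_rate(window), needs_action(window))
```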

18
Q

OPS 9: How do you understand the health of your operations?

A

Define, capture, and analyze operations metrics to gain visibility to operations events so that you can take appropriate action.

19
Q

OPS 10: How do you manage workload and operations events?

A

Prepare and validate procedures for responding to events to minimize their disruption to your workload.

20
Q

OPS 11: How do you evolve operations?

A

Dedicate time and resources for continuous incremental improvement to evolve the effectiveness and efficiency of your operations.