Operational Excellence Flashcards

1
Q

Design principles

A
  1. Perform operations as code: run infrastructure as code and scripts.
  2. Make frequent, small, reversible changes.
  3. Refine operations procedures frequently.
  4. Anticipate failure: test failure scenarios, fail fast.
  5. Learn from all operational failures: document and share.
2
Q

Practice areas

A
  1. Organization
  2. Prepare
  3. Operate
  4. Evolve
3
Q

Organization practice area

A

Understand the organization's priorities, its structure, and how it supports team members so that they can support business outcomes.

4
Q

Organization Priorities

A
  1. Evaluate external customer needs.
  2. Evaluate internal customer needs.
  3. Evaluate governance requirements.
  4. Evaluate compliance requirements.
  5. Evaluate the threat landscape.
  6. Evaluate tradeoffs.
  7. Manage benefits and risks.
5
Q

Organization operating model

A

Understand roles, responsibilities, and how decisions are made: the models that govern how the company operates.

6
Q

Operating model 2 by 2 representations

A

Understand the relationships between teams in your environment: who does what.

7
Q

Operating model - Fully separated model

A

Application and platform are each managed by fully separate teams. Work is passed between teams through mechanisms such as work requests, work queues, tickets, or an IT service management (ITSM) system.

8
Q

Operating model - Separated AEO and IEO with centralized governance

A

Here we follow the “you build it, you run it” methodology. The engineers are responsible for both the engineering and the operation of their workload. To organize the teams, use AWS Organizations and AWS Control Tower. The platform engineering team provides the application team with a standardized set of services (for example, development or monitoring tools) and access to cloud services. AWS Service Catalog can be used to govern the tooling.

PRO
Standards are distributed, provided, or shared
Strong feedback loop
Platform team supports Application team
Adopting standards may reduce reviews to enter production

CON
Any change or addition requires the application team to consult the platform team

9
Q

AEO

A

Application Engineering and Operations

10
Q

IEO

A

Infrastructure Engineering and Operations

11
Q

Operating model - Separated AEO and IEO with centralized governance and a Service Provider

A

Similar to the centralized governance model, but you offload some operations tasks, such as patching and updating, to Managed Services. These services are handled by AWS, which takes care of those tasks.

PRO
Offload “boring” operational tasks
Gain advantage of your providers’ standards, best practices, processes, and expertise
Latest service offerings

CON
Does not address the bottlenecks and delays created by transition of tasks between teams

12
Q

Operating model - Separated AEO and IEO with centralized governance and an internal service provider consulting partner

A

This model also follows the “you build it, you run it” methodology. Unlike the previous model, it adds a Cloud Operations and Platform Enablement (COPE) team that supports the other teams on cloud-related topics. It provides a forum to ask questions, discuss needs, and identify solutions. The platform engineering team builds the core shared platform capabilities and governs them via AWS Service Catalog.

PRO
Adopts more of a DevOps culture
Enables cloud transformation for teams, establishes centralized cloud governance, and defines account and organization management standards
Application team gets a CI/CD pipeline from COPE
Removes barriers that slow application team adoption of beneficial cloud capabilities

CON
Involves significant effort to facilitate cloud adoption and establish organization standards

13
Q

CCoE

A

Cloud Center of Enablement

14
Q

COPE

A

Cloud Operations and Platform Enablement

15
Q

Operating model - Separated AEO and IEO with decentralized governance

A

In this model, the application engineers and developers perform both the engineering and the operation of their workloads. Standards are still distributed by the platform team, but the application teams are freer to engineer and operate their own capabilities in support of their workload.

PRO
Fewer constraints
More freedom to choose their own tooling

CON
Application engineers carry more responsibility
Higher risk of rework
Policies still need to be enforced (governance via AWS Organizations and AWS Control Tower)

16
Q

Operating model - relationship and ownership - Resources have identified owners

A

Understand who has ownership of each application, workload, platform, and infrastructure component, what business value is provided by that component, and why that ownership exists.

  1. Define forms of ownership and how they are assigned
  2. Define who owns an organization, account, collection of resources, or individual components
  3. Capture ownership in the metadata for the resources
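
Capturing ownership in resource metadata can be checked automatically. The following is a minimal sketch, assuming ownership is stored as key/value tags; the `Resource` class, the tag keys `owner` and `cost-center`, and the sample inventory are all hypothetical and not part of any AWS API.

```python
from dataclasses import dataclass, field

# Hypothetical tag keys; use whatever your organization standardizes on.
REQUIRED_OWNERSHIP_TAGS = ("owner", "cost-center")


@dataclass
class Resource:
    """Minimal stand-in for a cloud resource and its metadata tags."""
    resource_id: str
    tags: dict = field(default_factory=dict)


def missing_ownership_tags(resource: Resource) -> list:
    """Return the required ownership tags that are absent or empty."""
    return [key for key in REQUIRED_OWNERSHIP_TAGS
            if not resource.tags.get(key, "").strip()]


def unowned_resources(resources: list) -> dict:
    """Map each non-compliant resource ID to the tags it is missing."""
    report = {}
    for resource in resources:
        missing = missing_ownership_tags(resource)
        if missing:
            report[resource.resource_id] = missing
    return report


# Hypothetical inventory used only to exercise the check.
inventory = [
    Resource("vm-1", {"owner": "payments-team", "cost-center": "cc-42"}),
    Resource("bucket-7", {"owner": "data-team"}),
    Resource("queue-3"),
]
report = unowned_resources(inventory)
print(report)
```

A report like this can feed the escalation path for resources that have no identified owner.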

17
Q

Operating model - relationship and ownership - Processes and procedures have identified owners

A

Understand who has ownership of the definition of individual processes and procedures, why those specific processes and procedures are used, and why that ownership exists.

  1. Identify processes and procedures
  2. Define who owns the definition of a process or procedure
  3. Capture ownership in the metadata of the activity artifact

18
Q

Operating model - relationship and ownership - Operations activities have identified owners responsible for their performance

A

Understand who has responsibility to perform specific activities on defined workloads and why that responsibility exists. Understanding who has responsibility to perform activities informs who will conduct the activity, validate the result, and provide feedback to the owner of the activity.

19
Q

Operating model - relationship and ownership - Team members know what they are responsible for

A

Understanding the responsibilities of your role and how you contribute to business outcomes informs the prioritization of your tasks and why your role is important. This enables team members to recognize needs and respond appropriately.

20
Q

Operating model - relationship and ownership - Mechanisms exist to identify responsibility and ownership

A

Where no individual or team is identified, there are defined escalation paths to someone with the authority to assign ownership or plan for that need to be addressed.

21
Q

Operating model - relationship and ownership - Mechanisms exist to request additions, changes, and exceptions

A

You are able to make requests to owners of processes, procedures, and resources. Make informed decisions to approve requests where viable and determined to be appropriate after an evaluation of benefits and risks.

22
Q

Operating model - relationship and ownership - Responsibilities between teams are predefined or negotiated

A

Have defined or negotiated agreements between teams describing how they work with and support each other (for example, response times, service level objectives, or service level agreements).

23
Q

Organizational culture

A
  1. Executive Sponsorship
  2. Team members are empowered to take action when outcomes are at risk
  3. Escalation is encouraged
  4. Communications are timely, clear, and actionable
  5. Experimentation is encouraged
  6. Team members are enabled and encouraged to maintain and grow their skill sets
  7. Resource teams appropriately
  8. Diverse opinions are encouraged and sought within and across teams
24
Q

Prepare

A

Understand your workloads and their expected behaviors. You will then be able to design them to provide insight into their status and build the procedures to support them.

To prepare for operational excellence, you need to perform the following:

  1. Design telemetry
  2. Design for operations
  3. Mitigate deployment risks
  4. Operational readiness and change management

25
Q

Prepare - Design telemetry

A

Design your workload so that it provides the information necessary for you to understand its internal state (for example, metrics, logs, events, and traces) across all components, in support of observability and investigating issues.

26
Q

Prepare - Design telemetry - Implement application telemetry

A

Application telemetry is the foundation for observability of your workload.
Application telemetry consists of metrics and logs.
Metrics are diagnostic information.
Logs are messages that the application sends about its internal state or events that occur.

27
Q

Implementing application telemetry: three steps

A

Implementing application telemetry consists of three steps: identifying a location to store telemetry, identifying telemetry that describes the state of the application, and instrumenting the application to emit telemetry.
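
The three steps can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not a prescribed implementation: the in-memory buffer stands in for a real telemetry store, and the application name `orders_app`, the `place_order` function, and the metric names are hypothetical.

```python
import io
import json
import logging
import time
from collections import Counter

# Step 1: a location to store telemetry. An in-memory buffer stands in
# for a real log/metric store (assumption for this sketch).
telemetry_store = io.StringIO()
handler = logging.StreamHandler(telemetry_store)
handler.setFormatter(logging.Formatter("%(message)s"))
logger = logging.getLogger("orders_app")  # hypothetical application name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Step 2: telemetry that describes the state of the application.
metrics = Counter()  # hypothetical metric names: orders_ok, orders_failed


def emit(event: str, **fields) -> None:
    """Step 3: emit one structured (JSON) log line per event."""
    logger.info(json.dumps({"ts": time.time(), "event": event, **fields}))


def place_order(order_id: str, amount: float) -> bool:
    """Hypothetical business operation instrumented with logs and metrics."""
    if amount <= 0:
        metrics["orders_failed"] += 1
        emit("order_rejected", order_id=order_id, reason="invalid_amount")
        return False
    metrics["orders_ok"] += 1
    emit("order_placed", order_id=order_id, amount=amount)
    return True


place_order("o-1", 19.99)
place_order("o-2", 0)
log_lines = [json.loads(line) for line in telemetry_store.getvalue().splitlines()]
```

Structured log lines keep the events machine-readable, so the same output can later drive metrics, dashboards, or alerts.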

28
Q

Prepare - Design telemetry - Implement and configure workload telemetry

A

Design and configure your workload to emit information about its internal state and current status, for example, API call volume, HTTP status codes, and scaling events

29
Q

Prepare - Design telemetry - Implement user activity telemetry

A

Instrument your application code to emit information about user activity, for example, click streams, or started, abandoned, and completed transactions. Use this information to help understand how the application is used, patterns of usage, and to determine when a response is required.

30
Q

Prepare - Design telemetry - Implement dependency telemetry

A

Implement dependency telemetry: Design and configure your workload to emit information about the state and status of systems it depends on. Some examples include: external databases, DNS, network connectivity, and external credit card processing services.

31
Q

Prepare - Design telemetry - Implement transaction traceability

A

Implement transaction traceability: Design your application and workload to emit information about the flow of transactions across system components, such as transaction stage, active component, and time to complete activity. Use this information to determine what is in progress, what is complete, and what the results of completed activities are.

Related service: AWS X-Ray
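
To make the idea concrete, here is a minimal sketch of transaction traceability in plain Python. It does not use the X-Ray SDK; the `span` helper, the component names, and the in-memory `trace_records` list are hypothetical stand-ins for a real tracing backend.

```python
import time
import uuid
from contextlib import contextmanager

# In-memory sink; a real system would ship these records to a tracing backend.
trace_records = []


@contextmanager
def span(trace_id: str, component: str, stage: str):
    """Record which component handled which stage, its status, and its duration."""
    start = time.perf_counter()
    record = {"trace_id": trace_id, "component": component,
              "stage": stage, "status": "in_progress"}
    trace_records.append(record)
    try:
        yield record
        record["status"] = "complete"
    except Exception:
        record["status"] = "failed"
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start


def handle_checkout() -> str:
    """Hypothetical transaction that crosses three components."""
    trace_id = uuid.uuid4().hex  # one ID follows the transaction end to end
    with span(trace_id, "api-gateway", "receive"):
        pass
    with span(trace_id, "payment-service", "charge"):
        pass
    with span(trace_id, "order-service", "persist"):
        pass
    return trace_id


tid = handle_checkout()
stages = [(r["component"], r["status"]) for r in trace_records if r["trace_id"] == tid]
```

Filtering the records by trace ID answers the questions on the card: what is in progress, what is complete, and how long each stage took.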

32
Q

Prepare - Design for operations

A

Adopt approaches that improve the flow of changes into production and that enable refactoring, fast feedback on quality, and bug fixing.

  1. Use version control: AWS CodeCommit
  2. Test and validate changes: AWS CodeBuild
  3. Use configuration management systems: AWS AppConfig, AWS Config
  4. Use build and deployment management systems: (CI/CD) pipelines -> AWS CodeCommit, AWS CodeBuild, AWS CodePipeline, AWS CodeDeploy, and AWS CodeStar.
  5. Perform patch management: AWS Systems Manager Patch Manager
  6. Share design standards.
  7. Implement practices to improve code quality: Amazon CodeGuru
  8. Use multiple environments: sandbox environments with minimized controls to enable experimentation.
  9. Make frequent, small, reversible changes
  10. Fully automate integration and deployment
33
Q

Prepare - Mitigate deployment risks

A

Adopt approaches that provide fast feedback on quality and enable rapid recovery from changes that do not have desired outcomes.

  1. Plan for unsuccessful changes: Plan to revert to a known good state (that is, roll back the change), or remediate in the production environment (that is, roll forward the change)
  2. Test and validate changes: Test changes and validate the results at all lifecycle stages (for example, development, test, and production), to confirm new features and minimize the risk and impact of failed deployments.
  3. Use deployment management systems: Continuous Integration/Continuous Deployment (CI/CD) pipelines
  4. Test using limited deployments: Test with limited deployments alongside existing systems to confirm desired outcomes prior to full scale deployment. For example, use deployment canary testing or one-box deployments.
  5. Deploy using parallel environments: Implement changes onto parallel environments, and transition or cut over to the new environment. Maintain the prior environment until there is confirmation of successful deployment. This minimizes recovery time by enabling rollback to the previous environment. For example, use immutable infrastructures with blue/green deployments.
  6. Deploy frequent, small, reversible changes: Use frequent, small, and reversible changes to reduce the scope of a change. This results in easier troubleshooting and faster remediation with the option to roll back a change.
  7. Fully automate integration and deployment: Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation.
  8. Automate testing and rollback: Automate testing of deployed environments to confirm desired outcomes. Automate rollback to a previous known good state when outcomes are not achieved to minimize recovery time and reduce errors caused by manual processes.
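
Limited deployments and automated rollback (items 4 and 8) can be combined in one loop. The sketch below is an assumption-laden illustration, not a real deployment tool: the traffic percentages, the `health_check` callback, and the version names are hypothetical.

```python
from typing import Callable


def deploy_with_canary(
    current_version: str,
    new_version: str,
    health_check: Callable[[str], bool],
    canary_steps: tuple = (5, 25, 100),
) -> tuple:
    """Shift traffic to new_version in steps; roll back on the first failed check.

    Returns the version serving traffic at the end and a log of actions taken.
    """
    log = []
    for percent in canary_steps:
        log.append(f"route {percent}% to {new_version}")
        if not health_check(new_version):
            # Automated rollback: return all traffic to the known good version.
            log.append(f"rollback to {current_version}")
            return current_version, log
    log.append(f"retire {current_version}")
    return new_version, log


def healthy(version: str) -> bool:
    """Hypothetical health check: every version except v3 is healthy."""
    return version != "v3"


live, actions = deploy_with_canary("v1", "v2", healthy)
live_after_bad, bad_actions = deploy_with_canary("v2", "v3", healthy)
```

Because the failing version only ever receives the first small slice of traffic, the impact of a bad change stays limited and recovery needs no manual step.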
34
Q

Prepare - Operational readiness and change management

A

You should use a consistent process (including manual or automated checklists) to know when you are ready to go live with your workload or a change. You will have runbooks that document your routine activities and playbooks that guide your processes for issue resolution.

  1. Ensure personnel capability
  2. Ensure consistent review of operational readiness
  3. Use runbooks to perform procedures
  4. Use playbooks to investigate issues
  5. Make informed decisions to deploy systems and changes
35
Q

Operate

A

By understanding the health of your workload and operations, you can identify when organizational and business outcomes may become at risk, or are at risk, and respond appropriately.

36
Q

Operate - Understanding workload health

A

Define, capture, and analyze workload metrics to gain visibility to workload events so that you can take appropriate action.

  1. Identify key performance indicators.
  2. Define workload metrics.
  3. Collect and analyze workload metrics.
  4. Establish workload metrics baselines.
  5. Learn expected patterns of activity for the workload.
  6. Alert when workload outcomes are at risk.
  7. Alert when workload anomalies are detected.
  8. Validate the achievement of outcomes and the effectiveness of KPIs and metrics.
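
Baselines and anomaly alerts (items 4 and 7) can be illustrated with a simple statistical check. This is one possible approach, assuming a mean and standard-deviation baseline with a three-sigma threshold; the latency samples are made up for the example.

```python
from statistics import mean, stdev


def build_baseline(samples: list) -> tuple:
    """Baseline = mean and sample standard deviation of historical values."""
    return mean(samples), stdev(samples)


def is_anomaly(value: float, baseline: tuple, threshold: float = 3.0) -> bool:
    """Flag values more than `threshold` standard deviations from the mean."""
    avg, sd = baseline
    if sd == 0:
        return value != avg
    return abs(value - avg) / sd > threshold


# Hypothetical latency samples in milliseconds.
history = [100, 102, 98, 101, 99, 100, 103, 97]
baseline = build_baseline(history)
alerts = [v for v in (101, 140, 99) if is_anomaly(v, baseline)]
print(alerts)
```

Real workloads often have daily or weekly patterns, so a production baseline would usually be computed per time window rather than over one flat history.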
37
Q

KPI

A

key performance indicators

38
Q

Operate - Understanding operational health

A
  1. Identify key performance indicators.
  2. Define operations metrics.
  3. Collect and analyze operations metrics.
  4. Establish operations metrics baselines.
  5. Learn the expected patterns of activity for operations.
  6. Alert when operations outcomes are at risk.
  7. Alert when operations anomalies are detected.
  8. Validate the achievement of outcomes and the effectiveness of KPIs and metrics.

39
Q

Operate - Responding to events

A
  1. Use processes for event, incident, and problem management.
  2. Have a process per alert.
  3. Prioritize operational events based on business impact.
  4. Define escalation paths.
  5. Enable push notifications.
  6. Communicate status through dashboards.
  7. Automate responses to events.
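
Prioritizing by business impact and defining escalation paths (items 3 and 4) can be sketched with a priority queue. The impact levels, the escalation roles, and the sample events below are hypothetical; real values would come from your own incident process.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical impact levels; a lower number is handled first.
IMPACT_PRIORITY = {"revenue_loss": 0, "customer_facing": 1, "internal_only": 2}
# Hypothetical escalation path per impact level, in escalation order.
ESCALATION_PATH = {
    "revenue_loss": ["on-call engineer", "incident manager", "VP engineering"],
    "customer_facing": ["on-call engineer", "incident manager"],
    "internal_only": ["on-call engineer"],
}


@dataclass(order=True)
class QueuedEvent:
    priority: int
    sequence: int
    name: str = field(compare=False)
    impact: str = field(compare=False)


class EventQueue:
    """Orders events by business impact, then by arrival order."""

    def __init__(self) -> None:
        self._heap = []
        self._sequence = count()

    def push(self, name: str, impact: str) -> None:
        event = QueuedEvent(IMPACT_PRIORITY[impact], next(self._sequence), name, impact)
        heapq.heappush(self._heap, event)

    def pop(self) -> tuple:
        """Return the most urgent event and who to escalate it to, in order."""
        event = heapq.heappop(self._heap)
        return event.name, ESCALATION_PATH[event.impact]


queue = EventQueue()
queue.push("wiki slow", "internal_only")
queue.push("checkout failing", "revenue_loss")
queue.push("login errors", "customer_facing")
first, escalation = queue.pop()
```

The sequence counter breaks ties, so two events with the same impact are handled in the order they arrived.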
40
Q

Evolve

A
  1. Have a process for continuous improvement.
  2. Perform post-incident analysis.
  3. Implement feedback loops.
  4. Perform knowledge management.
  5. Define drivers for improvement.
  6. Validate insights.
  7. Perform operations metrics reviews.
  8. Document and share lessons learned.
  9. Allocate time to make improvements