Operational Excellence Flashcards

Question 1

Q

Design principles

Answer

A

Perform operations as code: run infrastructure as code and scripted procedures.
Make frequent, small, reversible changes.
Refine operations procedures frequently.
Anticipate failure: test failure scenarios, fail fast.
Learn from all operational failures: document and share.

Question 2

Q

Practice areas

Answer

A

Organization
Prepare
Operate
Evolve

Question 3

Q

Organization practice area

Answer

A

Understand your organization's priorities, its structure, and how it supports team members so that they can support business outcomes.

Question 4

Q

Organization Priorities

Answer

A

Evaluate external customer needs.
Evaluate internal customer needs.
Evaluate governance requirements.
Evaluate compliance requirements.
Evaluate threat landscape.
Evaluate tradeoffs.
Manage benefits and risks.

Question 5

Q

Organization Operation model

Answer

A

Understand roles, responsibilities, and how decisions are made: the models that govern how the company operates.

Question 6

Q

Operating model 2 by 2 representations

Answer

A

Understand the relationships between teams in your environment: WHO does WHAT.

Question 7

Q

Operating model - Fully separated model

Answer

A

Applications and platforms are managed by fully separate teams. Work is passed between teams through mechanisms such as work requests, work queues, tickets, or an IT service management (ITSM) system.

Question 8

Q

Operating model - Separated AEO and IEO with centralized governance

Answer

A

Here we follow the “you build it, you run it” methodology. The engineers are responsible for the engineering and operation of their workload. To organize the teams, you should use AWS Organizations and AWS Control Tower. The platform engineering team provides a standardized set of services (e.g. development or monitoring tools) and access to cloud services to the application team. The AWS Service Catalog can be used to govern the tooling.

PRO
Standards are distributed, provided, or shared
Strong feedback loop
Platform team supports Application team
Adopting standards may reduce reviews to enter production

CON
For any change or addition, the application team must coordinate with the platform team

Question 9

Q

AEO

Answer

A

Application Engineering and Operations

Question 10

Q

IEO

Answer

A

Infrastructure Engineering and Operations

Question 11

Q

Operating model - Separated AEO and IEO with centralized governance and a Service Provider

Answer

A

Similar to the centralized governance model, but you offload some operations tasks, such as patching and updating, to Managed Services. This service is run by AWS, which takes care of these tasks for you.

PRO
Offload “boring” operational tasks
Gain advantage of your providers’ standards, best practices, processes, and expertise
Latest service offerings

CON
Does not address the bottlenecks and delays created by transition of tasks between teams

Question 12

Q

Operating model - Separated AEO and IEO with centralized governance and an internal service provider consulting partner

Answer

A

This model also establishes the “you build it, you run it” methodology. The difference from the previous model is that it adds a Cloud Operations and Platform Enablement (COPE) team, which supports the other teams on cloud-related topics. It provides a forum to ask questions, discuss needs, and identify solutions. The platform engineering team builds the core shared platform capabilities and governs them via AWS Service Catalog.

PRO
Adopts more of a DevOps culture
Enables cloud transformation for teams, establishes centralized cloud governance, and defines account and organization management standards
Application teams get a CI/CD pipeline from COPE
Removes barriers that slow application team adoption of beneficial cloud capabilities

CON
Involves significant effort to facilitate cloud adoption and establish organization standards

Question 13

Q

CCoE

Answer

A

Cloud Center of Enablement

Question 14

Q

COPE

Answer

A

Cloud Operations and Platform Enablement

Question 15

Q

Operating model - Separated AEO and IEO with decentralized governance

Answer

A

In this model, the application engineers and developers perform both the engineering and the operation of their workloads. Standards are still distributed by the platform team, but the application teams have more freedom to engineer and operate their own capabilities in support of their workload.

PRO
Fewer constraints
More freedom to choose their own tooling

CON
More responsibility for application engineers
Risk of rework is higher
Policies still need to be enforced (governance via AWS Organizations and AWS Control Tower)

Question 16

Q

Operating model - relationship and ownership - Resources have identified owners

Answer

A

Understand who has ownership of each application, workload, platform, and infrastructure component, what business value is provided by that component, and why that ownership exists.

1. Define forms of ownership and how they are assigned
2. Define who owns an organization, account, collection of resources, or individual components
3. Capture ownership in the metadata for the resources
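
As a rough illustration of step 3, the sketch below checks that every resource record carries ownership metadata. The tag keys (`owner`, `team`), the record layout, and the function name `find_unowned` are assumptions made for this example, not an AWS convention.

```python
# Assumed ownership tag keys; pick whatever your organization standardizes on.
REQUIRED_TAGS = ("owner", "team")


def find_unowned(resources, required=REQUIRED_TAGS):
    """Return the ids of resources missing any required ownership tag."""
    unowned = []
    for res in resources:
        tags = res.get("tags", {})
        # A missing key or an empty value both count as "no owner recorded".
        if any(not tags.get(key) for key in required):
            unowned.append(res["id"])
    return unowned


# Hypothetical inventory records used only for this example.
resources = [
    {"id": "db-1", "tags": {"owner": "alice", "team": "payments"}},
    {"id": "queue-7", "tags": {"team": "payments"}},
    {"id": "bucket-3", "tags": {}},
]
print(find_unowned(resources))
```

A check like this could run on a schedule so that resources without an owner are escalated rather than discovered during an incident.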

Question 17

Q

Operating model - relationship and ownership - Processes and procedures have identified owners

Answer

A

Understand who has ownership of the definition of individual processes and procedures, why those specific processes and procedures are used, and why that ownership exists.

1. Identify processes and procedures
2. Define who owns the definition of a process or procedure
3. Capture ownership in the metadata of the activity artifact

Question 18

Q

Operating model - relationship and ownership - Operations activities have identified owners responsible for their performance

Answer

A

Understand who has responsibility to perform specific activities on defined workloads and why that responsibility exists. Understanding who has responsibility to perform activities informs who will conduct the activity, validate the result, and provide feedback to the owner of the activity.

Question 19

Q

Operating model - relationship and ownership - Team members know what they are responsible for

Answer

A

Understanding the responsibilities of your role and how you contribute to business outcomes informs the prioritization of your tasks and why your role is important. This enables team members to recognize needs and respond appropriately.

Question 20

Q

Operating model - relationship and ownership - Mechanisms exist to identify responsibility and ownership

Answer

A

Where no individual or team is identified, there are defined escalation paths to someone with the authority to assign ownership or plan for that need to be addressed.

Question 21

Q

Operating model - relationship and ownership - Mechanisms exist to request additions, changes, and exceptions

Answer

A

You are able to make requests to owners of processes, procedures, and resources. Make informed decisions to approve requests where viable and determined to be appropriate after an evaluation of benefits and risks.

Question 22

Q

Operating model - relationship and ownership - Responsibilities between teams are predefined or negotiated

Answer

A

Have defined or negotiated agreements between teams describing how they work with and support each other (for example, response times, service level objectives, or service level agreements).

Question 23

Q

Organizational culture

Answer

A

Executive Sponsorship
Team members are empowered to take action when outcomes are at risk
Escalation is encouraged
Communications are timely, clear, and actionable
Experimentation is encouraged
Team members are enabled and encouraged to maintain and grow their skill sets
Resource teams appropriately
Diverse opinions are encouraged and sought within and across teams

Question 24

Q

Prepare

Answer

A

Understand your workloads and their expected behaviors. You will then be able to design them to provide insight into their status and build the procedures to support them.

To prepare for operational excellence, you need to perform the following:

1. Design telemetry
2. Design for operations
3. Mitigate deployment risks
4. Operational readiness and change management

Question 25

Q

Prepare - Design telemetry

Answer

A

Design your workload so that it provides the information necessary for you to understand its internal state (for example, metrics, logs, events, and traces) across all components, in support of observability and investigating issues.

Question 26

Q

Prepare - Design telemetry - Implement application telemetry

Answer

A

Application telemetry is the foundation for observability of your workload.
Application telemetry consists of metrics and logs.
Metrics are diagnostic information.
Logs are messages that the application sends about its internal state or events that occur.

Question 27

Q

Three steps for implementing application telemetry

Answer

A

Implementing application telemetry consists of three steps: identifying a location to store telemetry, identifying telemetry that describes the state of the application, and instrumenting the application to emit telemetry.
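
The three steps can be sketched with the Python standard library alone. Everything named here is an assumption for illustration: the in-memory buffer stands in for a real log store, and the `orders` logger, the metric names, and `place_order` are hypothetical.

```python
import io
import logging
from collections import Counter

# Step 1: a location to store telemetry (an in-memory buffer stands in for a log store).
log_store = io.StringIO()
handler = logging.StreamHandler(log_store)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger

# Step 2: telemetry that describes the state of the application (hypothetical metrics).
metrics = Counter()


# Step 3: instrument the application to emit that telemetry.
def place_order(order_id, quantity):
    metrics["orders_received"] += 1
    if quantity <= 0:
        metrics["orders_rejected"] += 1
        logger.warning("order %s rejected: quantity=%d", order_id, quantity)
        return False
    metrics["orders_accepted"] += 1
    logger.info("order %s accepted", order_id)
    return True


place_order("A1", 2)
place_order("A2", 0)
print(dict(metrics))
print(log_store.getvalue())
```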

Question 28

Q

Prepare - Design telemetry - Implement and configure workload telemetry

Answer

A

Design and configure your workload to emit information about its internal state and current status, for example, API call volume, HTTP status codes, and scaling events.
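
A minimal sketch of turning raw HTTP status codes into workload telemetry, assuming the codes are already being collected somewhere. The function name and the shape of the summary are illustrative choices.

```python
from collections import Counter


def summarize_status_codes(codes):
    """Group HTTP status codes by class (2xx, 4xx, ...) and compute the 5xx rate."""
    by_class = Counter(f"{code // 100}xx" for code in codes)
    total = len(codes)
    error_rate = by_class["5xx"] / total if total else 0.0
    return {
        "calls": total,  # API call volume
        "by_class": dict(by_class),
        "server_error_rate": error_rate,
    }


summary = summarize_status_codes([200, 200, 404, 500, 201, 503, 200, 200])
print(summary)
```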

Question 29

Q

Prepare - Design telemetry - Implement user activity telemetry

Answer

A

Instrument your application code to emit information about user activity, for example, click streams, or started, abandoned, and completed transactions. Use this information to help understand how the application is used, patterns of usage, and to determine when a response is required.
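
One way to use started, abandoned, and completed transaction events is a simple funnel summary. This is a sketch under assumptions: events arrive as hypothetical `(transaction_id, event_name)` pairs, and the abandonment rate is defined here as abandoned over started.

```python
def transaction_funnel(events):
    """Summarize user transactions from (transaction_id, event_name) pairs."""
    buckets = {"started": set(), "completed": set(), "abandoned": set()}
    for txn_id, event in events:
        if event in buckets:  # ignore event names this sketch does not track
            buckets[event].add(txn_id)
    started = len(buckets["started"])
    abandoned = len(buckets["abandoned"])
    return {
        "started": started,
        "completed": len(buckets["completed"]),
        "abandoned": abandoned,
        "abandonment_rate": abandoned / started if started else 0.0,
    }


# Hypothetical event stream used only for this example.
events = [
    ("t1", "started"), ("t2", "started"), ("t1", "completed"),
    ("t3", "started"), ("t2", "abandoned"), ("t4", "started"),
]
funnel = transaction_funnel(events)
print(funnel)
```

A rising abandonment rate is one example of a usage pattern that may call for a response.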

Question 30

Q

Prepare - Design telemetry - Implement dependency telemetry

Answer

A

Implement dependency telemetry: Design and configure your workload to emit information about the state and status of systems it depends on. Some examples include: external databases, DNS, network connectivity, and external credit card processing services.
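
A minimal sketch of dependency telemetry: run a probe per dependency and record its status and latency. The probe functions below are stand-ins; real probes would open a database connection, resolve a DNS name, and so on.

```python
import time


def check_dependencies(probes):
    """Run each probe and record whether it succeeded and how long it took."""
    report = {}
    for name, probe in probes.items():
        start = time.perf_counter()
        try:
            probe()
            status = "ok"
        except Exception as exc:  # any failure is reported, not raised
            status = f"error: {exc}"
        report[name] = {
            "status": status,
            "latency_s": time.perf_counter() - start,
        }
    return report


# Hypothetical probes used only for this example.
def database_probe():
    return True


def dns_probe():
    raise TimeoutError("lookup timed out")


report = check_dependencies({"database": database_probe, "dns": dns_probe})
print(report)
```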

Question 31

Q

Prepare - Design telemetry - Implement transaction traceability

Answer

A

Implement transaction traceability: Design your application and workload to emit information about the flow of transactions across system components, such as transaction stage, active component, and time to complete activity. Use this information to determine what is in progress, what is complete, and what the results of completed activities are.

AWS service: AWS X-Ray
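
The idea of transaction traceability can be sketched without any tracing SDK: every step of a transaction emits a record that shares one trace id. This is a standard-library illustration, not the AWS X-Ray API; `span`, `spans`, and `handle_checkout` are hypothetical names.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # emitted trace records; a real system would ship these to a tracing backend


@contextmanager
def span(trace_id, component, stage):
    """Record the active component, transaction stage, and time to complete."""
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "failed"
        raise
    finally:
        spans.append({
            "trace_id": trace_id,
            "component": component,
            "stage": stage,
            "status": status,
            "duration_s": time.perf_counter() - start,
        })


def handle_checkout():
    trace_id = uuid.uuid4().hex  # one id follows the transaction across components
    with span(trace_id, "api", "validate"):
        pass  # request validation would happen here
    with span(trace_id, "payments", "charge"):
        pass  # the payment call would happen here
    return trace_id


trace_id = handle_checkout()
print(spans)
```

Filtering the records by trace id shows what is in progress, what is complete, and how long each stage took.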

Question 32

Q

Prepare - Design for operations

Answer

A

Adopt approaches that improve the flow of changes into production and that enable refactoring, fast feedback on quality, and bug fixing.

Use version control: AWS CodeCommit
Test and validate changes: AWS CodeBuild
Use configuration management systems: AWS AppConfig, AWS Config
Use build and deployment management systems: CI/CD pipelines built with AWS CodeCommit, AWS CodeBuild, AWS CodePipeline, AWS CodeDeploy, and AWS CodeStar.
Perform patch management: AWS Systems Manager Patch Manager
Share design standards.
Implement practices to improve code quality: Amazon CodeGuru
Use multiple environments: sandbox environments with minimized controls to enable experimentation.
Make frequent, small, reversible changes
Fully automate integration and deployment

Question 33

Q

Prepare - Mitigate deployment risks

Answer

A

Adopt approaches that provide fast feedback on quality and enable rapid recovery from changes that do not have desired outcomes.

Plan for unsuccessful changes: Plan to revert to a known good state (that is, roll back the change), or remediate in the production environment (that is, roll forward the change)
Test and validate changes: Test changes and validate the results at all lifecycle stages (for example, development, test, and production), to confirm new features and minimize the risk and impact of failed deployments.
Use deployment management systems: Continuous Integration/Continuous Deployment (CI/CD) pipelines
Test using limited deployments: Test with limited deployments alongside existing systems to confirm desired outcomes prior to full scale deployment. For example, use deployment canary testing or one-box deployments.
Deploy using parallel environments: Implement changes onto parallel environments, and transition or cut over to the new environment. Maintain the prior environment until there is confirmation of successful deployment. This minimizes recovery time by enabling rollback to the previous environment. For example, use immutable infrastructures with blue/green deployments.
Deploy frequent, small, reversible changes: Use frequent, small, and reversible changes to reduce the scope of a change. This results in easier troubleshooting and faster remediation with the option to roll back a change.
Fully automate integration and deployment: Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation.
Automate testing and rollback: Automate testing of deployed environments to confirm desired outcomes. Automate rollback to a previous known good state when outcomes are not achieved to minimize recovery time and reduce errors caused by manual processes.
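
Automated testing and rollback can be sketched as a small function: deploy, validate, and revert to the known good version when validation fails. The `deploy` and `validate` callables and the version strings are placeholders for whatever your pipeline actually uses.

```python
def deploy_with_rollback(current_version, new_version, deploy, validate):
    """Deploy new_version; roll back to current_version if validation fails.

    Returns the version that is active when the function finishes.
    """
    deploy(new_version)
    if validate(new_version):
        return new_version
    deploy(current_version)  # automated rollback to the known good state
    return current_version


# Hypothetical stand-ins: record deployments in a list, fail one specific version.
deployed = []
active = deploy_with_rollback(
    "v1",
    "v2-broken",
    deploy=deployed.append,
    validate=lambda version: version != "v2-broken",
)
print(active, deployed)
```

Because the rollback path is code, it runs the same way every time, which is the point of removing manual steps from recovery.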

Question 34

Q

Prepare - Operational readiness and change management

Answer

A

You should use a consistent process (including manual or automated checklists) to know when you are ready to go live with your workload or a change. You will have runbooks that document your routine activities and playbooks that guide your processes for issue resolution.

Ensure personnel capability
Ensure consistent review of operational readiness
Use runbooks to perform procedures
Use playbooks to investigate issues
Make informed decisions to deploy systems and changes
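
A runbook can be expressed as code so that a routine procedure runs the same way each time. The sketch below is one possible shape, assuming each step is a named callable that returns True on success; the step names are invented for the example.

```python
def run_runbook(steps):
    """Execute runbook steps in order and stop at the first failing step."""
    log = []
    for name, action in steps:
        ok = bool(action())
        log.append((name, "done" if ok else "failed"))
        if not ok:
            break  # later steps are skipped so a person can investigate
    return log


# Hypothetical restart procedure; the second step simulates a failure.
steps = [
    ("drain traffic", lambda: True),
    ("restart service", lambda: False),
    ("restore traffic", lambda: True),
]
result = run_runbook(steps)
print(result)
```

A failed step is a natural hand-off point to a playbook, which guides the investigation.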

Question 35

Q

Operate

Answer

A

By understanding the health of your workload and operations, you can identify when organizational and business outcomes may become at risk, or are at risk, and respond appropriately.

Question 36

Q

Operate - Understanding workload health

Answer

A

Define, capture, and analyze workload metrics to gain visibility to workload events so that you can take appropriate action.

Identify key performance indicators.
Define workload metrics.
Collect and analyze workload metrics.
Establish workload metrics baselines.
Learn expected patterns of activity for the workload.
Alert when workload outcomes are at risk.
Alert when workload anomalies are detected.
Validate the achievement of outcomes and the effectiveness of KPIs and metrics.
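
Establishing a baseline and alerting on anomalies can be illustrated with a simple z-score check. This is only one possible technique, and the three-standard-deviation threshold and the sample latencies are assumptions for the example.

```python
import statistics


def build_baseline(samples):
    """Return (mean, sample standard deviation) for a series of metric values."""
    return statistics.mean(samples), statistics.stdev(samples)


def is_anomaly(value, baseline, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean, stdev = baseline
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical latency samples in milliseconds.
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97]
baseline = build_baseline(latencies_ms)

print(is_anomaly(104, baseline))
print(is_anomaly(120, baseline))
```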

Question 37

Q

KPI

Answer

A

Key performance indicators

Question 38

Q

Operate - Understanding operational health

Answer

A

Identify key performance indicators.
Define operations metrics.
Collect and analyze operations metrics.
Establish operations metrics baselines.
Learn the expected patterns of activity for operations.
Alert when operations outcomes are at risk.
Alert when operations anomalies are detected.
Validate the achievement of outcomes and the effectiveness of KPIs and metrics.

Question 40

Q

Operate - Responding to events

Answer

A

Use processes for event, incident, and problem management.
Have a process per alert.
Prioritize operational events based on business impact.
Define escalation paths.
Enable push notifications.
Communicate status through dashboards.
Automate responses to events.
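
Prioritizing by business impact and following an escalation path can both be written down as code. In this sketch the impact levels, the escalation roles, and the 15-minute step are hypothetical values chosen for illustration.

```python
import heapq

# Assumed impact levels (lower number = handled first) and escalation roles.
IMPACT_PRIORITY = {"critical": 0, "high": 1, "medium": 2, "low": 3}
ESCALATION_PATH = ["on-call engineer", "team lead", "incident manager"]


def triage(events):
    """Return event names ordered by business impact, most severe first."""
    heap = [
        (IMPACT_PRIORITY[impact], order, name)  # order keeps ties in arrival order
        for order, (name, impact) in enumerate(events)
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


def escalate(minutes_unacknowledged, step_minutes=15):
    """Pick who to notify based on how long an alert has gone unacknowledged."""
    level = min(minutes_unacknowledged // step_minutes, len(ESCALATION_PATH) - 1)
    return ESCALATION_PATH[level]


events = [
    ("disk 80% full", "medium"),
    ("checkout down", "critical"),
    ("slow report", "low"),
    ("login errors", "high"),
]
print(triage(events))
print(escalate(20))
```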

Question 41

Q

Evolve

Answer

A

Have a process for continuous improvement.
Perform post-incident analysis.
Implement feedback.
Perform knowledge management.
Define drivers for improvement.
Validate insights.
Perform operations metrics reviews
Document and share lessons learned.
Allocate time to make improvements