Operational Excellence Flashcards
What are the benefits of infrastructure as code?
IaC provides consistency - admins run the same automated steps for each deploy and can add comments in source control, making it self-documenting as well.
It also simplifies the deployment of complex infrastructure and provides scalability (you can push out a new instance within minutes or seconds).
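The consistency point can be illustrated with a minimal sketch of the "desired state" idea behind many IaC tools: the same declared configuration always produces the same plan. This is not the API of any real tool. The plan function and the resource names (web-1, web-2, old-1) are hypothetical and exist only for illustration.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compute the changes needed to move `actual` to `desired`."""
    shared = desired.keys() & actual.keys()
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "delete": sorted(actual.keys() - desired.keys()),
        "update": sorted(k for k in shared if desired[k] != actual[k]),
    }

# Hypothetical declared state (what is checked into source control).
desired = {
    "web-1": {"machine_type": "e2-small"},
    "web-2": {"machine_type": "e2-small"},
}
# Hypothetical current state of the environment.
actual = {
    "web-1": {"machine_type": "e2-micro"},
    "old-1": {"machine_type": "e2-small"},
}

changes = plan(desired, actual)
print(changes)
```

Once the environment matches the declared state, running the same plan again yields no changes, which is what makes repeated deploys consistent.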
What are the benefits of microservices?
Microservices are smaller and more modular than monolithic applications, so each service can be developed, deployed, and scaled independently.
What is meant by canary testing? What tools does GC provide to help with this?
Canary testing is the practice of rolling out changes to just a small sample of users to reduce risk and validate functionality.
You can use GCE managed instance groups (collections of VM instances that are managed as a single entity) to roll a change out to a subset of instances first.
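As a rough sketch of how a "small sample of users" might be selected, the following assigns each user to a stable bucket by hashing their ID. This is an application-level illustration, not how managed instance groups split traffic. The in_canary function, the salt value, and the user IDs are all assumptions made for the example.

```python
import hashlib

def in_canary(user_id: str, percent: float, salt: str = "release-1") -> bool:
    """Return True if this user falls in the canary sample.

    Hashing makes the choice deterministic, so a given user sees the
    same version on every request for the life of the rollout.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=5 selects buckets 0..499

# Hypothetical user population.
users = [f"user-{i}" for i in range(10_000)]
canary_users = [u for u in users if in_canary(u, 5)]
print(f"{len(canary_users)} of {len(users)} users routed to the canary")
```

Raising percent step by step widens the rollout, and setting it to 0 acts as a rollback for everyone.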
What are questions to ask around release engineering?
How does your dev team manage builds and releases?
What’s your process for rolling back changes?
How do you test your applications before deployment?
What are some strategies for achieving operational excellence?
Automate your build, test, and deploy processes - perform operations as code; make frequent, small, reversible changes (CI/CD practices)
Monitor business-driven metrics (and system health metrics that align)
Refine operations/processes frequently
Conduct DR testing regularly
Review lessons learned
What tools help with CI/CD?
Cloud Source Repositories, Container Registry, Cloud Build
What are the four golden signals for monitoring your system?
Latency - time it takes to service a request
Traffic - how much demand is being placed on your system
Errors - rate of requests that fail
Saturation - how full your service is (e.g., I/O-constrained or memory-constrained)
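To make the four signals concrete, here is a minimal sketch that derives them from a batch of request records. The Request type, the golden_signals function, the use of memory as the saturation resource, and treating only 5xx responses as errors are assumptions for illustration. A real system would read these values from a monitoring service rather than compute them by hand.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def golden_signals(requests, window_s, memory_used_bytes, memory_total_bytes):
    """Summarize latency, traffic, errors, and saturation for one window."""
    latencies = [r.latency_ms for r in requests]
    failures = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p99_ms": percentile(latencies, 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": failures / len(requests),
        "saturation": memory_used_bytes / memory_total_bytes,
    }

# Hypothetical two-second window with four requests.
sample = [Request(20, 200), Request(35, 200), Request(50, 200), Request(900, 500)]
signals = golden_signals(sample, window_s=2,
                         memory_used_bytes=6, memory_total_bytes=8)
print(signals)
```

Reporting a high percentile alongside the median matters because a single slow request (900 ms here) is invisible in the median.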
What tools help with monitoring business/system health?
Cloud Monitoring - metrics collection/aggregation, dashboards, alerts
Cloud Logging - search and export to BigQuery, Cloud Storage, or Pub/Sub
Cloud Trace
What metrics are key to DR planning?
Recovery time objective (RTO) - the maximum acceptable length of time that your application can be offline. Usually defined as part of an SLA.
Recovery point objective (RPO) - the maximum acceptable length of time during which data might be lost from your application
What’s the relationship between cost to run an application and RTO/RPO values?
The smaller the RTO/RPO values, the more your application costs to run.
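A small sketch can show why tighter targets cost more: meeting a smaller RPO requires more frequent backups, and meeting a smaller RTO requires faster detection and restore. The function names and the breakdown of recovery into detect, restore, and validate phases are assumptions for illustration, not part of any GC service.

```python
from datetime import timedelta

def meets_rpo(backup_interval, rpo, replication_lag=timedelta(0)):
    """Worst-case data loss is the time since the last usable backup."""
    worst_case_loss = backup_interval + replication_lag
    return worst_case_loss <= rpo

def meets_rto(detect, restore, validate, rto):
    """Total downtime is the sum of the (assumed) recovery phases."""
    return detect + restore + validate <= rto

# Daily backups cannot satisfy a one-hour RPO...
print(meets_rpo(timedelta(hours=24), timedelta(hours=1)))
# ...but backups every 15 minutes, copied off-site within 5 minutes, can.
print(meets_rpo(timedelta(minutes=15), timedelta(hours=1),
                replication_lag=timedelta(minutes=5)))
# A 70-minute recovery misses a one-hour RTO.
print(meets_rto(timedelta(minutes=10), timedelta(minutes=40),
                timedelta(minutes=20), rto=timedelta(hours=1)))
```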
What’s the difference between SLAs and SLOs?
An SLA is the entire agreement that specifies what services are to be provided and details around support, cost, performance, penalties, times, etc.
SLOs are specific, measurable characteristics of the SLA, such as availability, throughput, response time, quality.
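Because an SLO is measurable, it can be turned into a number to track. The sketch below converts an availability SLO into allowed downtime for a window, often called an error budget (a term these cards do not use). The function names and the 30-day window are assumptions for illustration.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Downtime allowed by an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100

def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Minutes of downtime still available; negative means the SLO is missed."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes

budget = error_budget_minutes(99.9)
print(f"Allowed downtime: {budget:.1f} minutes")
print(f"Remaining after a 30-minute outage: {budget_remaining(99.9, 30):.1f} minutes")
```

A 99.9% availability SLO over 30 days allows about 43.2 minutes of downtime, so a single 30-minute outage uses most of it.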
How does GC help with DR planning?
GCE offers incremental backups (snapshots) of Persistent Disk volumes that you can copy across regions and restore from in the event of a disaster.
Live Migration keeps your VMs running even when a host system event occurs, such as a software or hardware update.
Cloud Storage offers object storage in different classes, such as Nearline and Coldline, for backup.
Cloud DNS uses Google’s global network to serve DNS zones from redundant locations around the world. It lets you manage DNS entries during the recovery process.
The Story
Cloud service providers can help make sure your people, processes, and technology run effectively to meet your business objectives. GC provides services that help reduce the complexity and cost of deploying applications, monitoring business- and system-level KPIs, managing risk, and planning for business continuity.