Module 6b - Azure SLAs and Service Lifecycle Flashcards
What is an SLA?
What are the min and max SLA uptimes?
A formal agreement between a service provider and a customer, guaranteeing a level of service (the % of UPTIME) and defining what counts as DOWNTIME and how it is compensated
Minimum: 99%
Maximum: 99.999%
Why are SLAs important?
Understanding them helps you understand what guarantees you can expect w.r.t. uptime, downtime, compensation for service outages, etc.
EACH Azure Service has its own SLA. Be familiar with it BEFORE provisioning a Service.
What are the Contents of an SLA (3 items)? What details does each contain?
- Introduction:
- Expectation overview
- Scope
- How subscription renewals can affect the terms
- General Terms:
- Defines Vocabulary of Terms
- Claim Submission Process
- Limitations, downtime compensation processes, etc.
- SLA Details:
- Percentage commitments (the Nines). Generally you want the 3-Nines or 4-Nines uptimes.
What do percentages (i.e. the “nines”) indicate for uptime and downtime respectively?
What four (4) types of events are considered downtime?
Is it possible to achieve 100% uptime?
The more “9”s the SLA has, the better it is w.r.t. uptime. For example: 99.999% Downtime per week is 6 SECONDS whereas 99.99% Downtime per week is 1.01 MINUTES…
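A quick way to sanity-check those downtime figures: a minimal Python sketch that converts an uptime percentage into the maximum downtime per week. The function name weekly_downtime_seconds is made up for illustration, and it assumes a plain 7-day week with no exclusions.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds in a week


def weekly_downtime_seconds(sla_percent: float) -> float:
    """Maximum downtime per week (in seconds) allowed by an uptime percentage."""
    return (1 - sla_percent / 100) * SECONDS_PER_WEEK


# Illustrative uptime tiers; each extra nine cuts the allowed downtime tenfold.
for sla in (99.0, 99.9, 99.99, 99.999):
    seconds = weekly_downtime_seconds(sla)
    print(f"{sla}% uptime -> {seconds:.2f} s/week ({seconds / 60:.2f} min)")
```

99.999% works out to about 6 seconds per week and 99.99% to about 1.01 minutes, matching the numbers above.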
Downtime:
- Time to upgrade
- Time to restart/redeploy
- Disaster recovery time
- System failure recovery time
Given the above, it is nearly impossible and far too expensive to guarantee 100% uptime
What’s the SLA for Free Services?
Typically NONE and it’s not recommended you use Free Services in production since it makes SLA determinations impossible, and prevents you from recovering any compensation for SLA violations on production downtime. FREE == No financially backed SLA
What’s the diff between Azure Status vs Azure Service Health?
Azure Status provides a global “Status” view for ALL Services across ALL Regions. You can subscribe to the RSS feed for updates.
Azure Service Health provides a personalized view of Azure Services health for Services YOU ARE USING in Regions containing those Services. The dashboard in Azure Portal provides access to Service Issues, health and security advisories and health alerts.
Azure Status provides direct access to Azure Service Health via button on its front page LOL~
Hint: 9*9
What is Application SLA?
SLA requirements for a specific application, referring to applications that YOU build on Azure. It is the aggregate of SLAs across all Services used to create that application, deployed to ONE Region… Can’t emphasize that “ONE Region” enough. Deploying to multiple Regions is a different calculation that improves overall SLA…
What are Usage Patterns w.r.t. SLAs?
Defines when and how your users access your application. This is important when determining the appropriate SLA and how to support it…
Ex.
- Tax-filing applications must have a 99.999% SLA uptime during Tax Season, when usage peaks
- Saturday nights for online banking applications; people are sleeping/partying. They aren’t doing bills. 99.9% SLA uptime is appropriate as downtime is feasible
KNOW THIS
How are Composite SLAs calculated? What makes Application SLA worse than individual Service SLA?
When you finish determining ALL the Services you’ll need to provision, you need to aggregate their respective SLA uptimes.
Composite SLA == Multiply ALL SLAs together for each provisioned instance. For example:
2 VMs: 99.9% each VM
1 SQL DB: 99.99%
1 Load Balancer: 99.99%
0.999 * 0.999 * 0.9999 * 0.9999 == 0.997801 == 99.78%
If your required SLA was 99.9% for your entire application, well then you’re SOL ~ LOL
Why is it lower though when the individual Services all have a better SLA than your target one? Because each provisioned Service == an extra level of complexity with increased risk of failure. So you may want to customize some options to get the aggregate to fit your required SLA
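The multiplication above can be sketched in a few lines of Python. This is only an illustration using the example figures from this card (not official SLA values), and composite_sla is a made-up helper name.

```python
from math import prod


def composite_sla(slas):
    """Multiply the uptime fractions of every provisioned instance together."""
    return prod(slas)


# Example figures from the card: 2 VMs, 1 SQL DB, 1 Load Balancer.
services = [0.999, 0.999, 0.9999, 0.9999]
result = composite_sla(services)

print(f"Composite SLA: {result:.6f} ({result * 100:.2f}%)")
print("Meets a 99.9% target?", result >= 0.999)
```

Every factor is below 1, so each extra Service can only pull the product down. That is why the composite ends up under 99.9% even though no single Service is.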
KNOW THIS - Hint: “Regional” Composite SLA
How does building availability requirements into your design help improve SLA?
What’s the formula for calculating SLAs under this scheme?
Because you’re increasing AVAILABILITY of Resources instead of CREATING NEW Resources.
For example, don’t create new VMs, but instead deploy one or more instances of the SAME VM across Availability Zones in the same Region. You then use Azure Traffic Manager to fail over if a Zone fails (or Region fails if you do this with Regions).
The formula for this type of deployment: (1 - (1-N)^R)
Where:
- N == The Composite SLA for the application in ONE Region (Ahh…remember what I said before about “ONE Region”?)
- R == the number of Regions or Zones deployed to
So a single VM is 99.95% SLA. If we deploy that VM to 2 different Zones, plugging that in:
1 - (1 - 0.9995)^2 == 0.99999975 == 99.9999% Sextuple 9’s baby!
So that single VM’s SLA, because the same instance was deployed to TWO different Zones, bumps from 99.95% to roughly 99.9999% under this formula
- Note Microsoft updated SLAs for VMs such that deploying 2 or more VMs across Availability Zones delivers an SLA of Quad 9’s: https://azure.microsoft.com/en-us/support/legal/sla/virtual-machines/v1_9/
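The 1 - (1-N)^R formula can be checked with a short Python sketch. The function name multi_region_sla is hypothetical, and the result is the theoretical figure the formula gives, assuming Zones or Regions fail independently. It is not the SLA Microsoft actually publishes.

```python
def multi_region_sla(n: float, r: int) -> float:
    """Theoretical availability of one deployment replicated across r Zones/Regions.

    n is the composite SLA (as a fraction) in ONE Region or Zone.
    Assumes failures are independent, which is a simplification.
    """
    return 1 - (1 - n) ** r


single_vm = 0.9995  # the 99.95% single-VM figure used on this card
two_zones = multi_region_sla(single_vm, 2)

print(f"One Zone:  {single_vm * 100:.4f}%")
print(f"Two Zones: {two_zones * 100:.6f}%")
```

With R == 1 the formula just returns N, so adding Zones or Regions can only help.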
How does increasing Redundancy increase Availability?
Redundancy is duplicating components across several Regions. Duplicate every part of the application.
Conversely you’d run your application in a single Region during off-peak hours to reduce costs.
Achieving higher than 99.99% is easy (Y/N)?
How much downtime does 99.99% (Quad Nines) equate to?
Nope. 99.99% means 1.01 minutes of downtime per week. Automating recovery through self diagnosis and self healing during an outage is the best approach.
What are the three (3) phases of the Service Lifecycle?
The release process for new services:
Development => Public Preview => General Availability (GA)
Public Preview allows public use and experimentation to garner public feedback and feature requests
GA comes after the service is validated and tested. Then it’s released to all customers as a production-ready service
Hint: ASDD
What is a Workload?
What four (4) items do Workloads define for said requirements?
A distinct capability or task that’s LOGICALLY SEPARATED from other tasks in terms of business logic and data storage (ex. tasks regarding Functions vs tasks regarding Cosmos DB)
Each workload defines requirements for:
- Availability
- Scalability
- Data Consistency
- Disaster Recovery
How do you receive Service Credit for outages?
Where is the process documented?
What’s the typical timeline for this process?
File a claim.
Each SLA documents the claim process and the deadline by which a claim must be submitted (typically the end of the calendar month following the month of the incident).