How to think about metrics Flashcards
What are work metrics?
Work metrics indicate the top-level health of your system by measuring its useful output.
What are the four subtypes of work metrics?
Throughput
Success
Error
Performance
What is throughput in relation to metrics?
The amount of work the system is doing per unit time. Throughput is usually recorded as an absolute number.
What is success in relation to metrics?
Metrics that represent the percentage of work that was executed successfully.
What is error in relation to metrics?
Error metrics capture the number of erroneous results, usually expressed as a rate of errors per unit time or normalized by the throughput to yield errors per unit of work. Error metrics are often captured separately from success metrics when there are several potential sources of error, some of which are more serious or actionable than others.
What is performance in relation to metrics?
Performance metrics quantify how efficiently a component is doing its work. The most common performance metric is latency, which represents the time required to complete a unit of work. Latency can be expressed as an average or as a percentile, such as “99% of requests returned within 0.1s”.
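As a rough illustration of the four work metric subtypes, the sketch below derives throughput, success percentage, error rates, and latency (average and 99th percentile) from a small batch of request records. The `Request` layout, the `summarize` and `percentile` names, the sample data, and the nearest-rank percentile method are all assumptions made for this example, not part of any particular monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_s: float  # time taken to complete this unit of work
    ok: bool          # True if the request succeeded


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(requests, window_s):
    """Compute work metrics for requests observed during a window of window_s seconds."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    latencies = [r.latency_s for r in requests]
    return {
        "throughput_per_s": total / window_s,          # work per unit time
        "success_pct": 100 * (total - errors) / total,  # share of work that succeeded
        "errors_per_s": errors / window_s,              # error rate per unit time
        "errors_per_request": errors / total,           # errors normalized by throughput
        "latency_avg_s": sum(latencies) / total,
        "latency_p99_s": percentile(latencies, 99),
    }


# Hypothetical sample: 98 fast successes, 1 slow success, 1 failure, over 10 seconds.
sample = [Request(0.05, True)] * 98 + [Request(0.90, True), Request(0.20, False)]
metrics = summarize(sample, window_s=10)
```

With this sample the average latency is about 0.06 s while the 99th percentile is 0.20 s, which shows why a percentile can reveal slow requests that an average smooths over.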
What are resource metrics?
Resource metrics can help you reconstruct a detailed picture of a system’s state, making them especially valuable for investigation and diagnosis of problems.
What are the four key areas for each system resource that you should try to collect?
Utilization
Saturation
Errors
Availability
What are some examples of other metrics?
Common examples include counts of cache hits or database locks. When in doubt, capture the data.
How are events defined?
Events are discrete, infrequent occurrences that provide crucial context for understanding what changed in your system’s behavior.
What are some examples of events?
Changes: Internal code releases, builds, and build failures
Alerts: Internally generated alerts or third-party notifications
Scaling events: Adding or subtracting hosts
What are the four characteristics of good data?
Well-understood: You should be able to quickly determine how each metric or event was captured and what it represents.
Granular: If you collect metrics too infrequently or average values over long windows of time, you may lose the ability to accurately reconstruct a system’s behavior.
Tagged by scope: Each of your hosts operates simultaneously in multiple scopes, and you may want to check on the aggregate health of any of these scopes, or their combinations.
Long-lived: If you discard data too soon, or if after a period of time your monitoring system aggregates your metrics to reduce storage costs, then you lose important information about what happened in the past.
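To make the "Granular" characteristic concrete, here is a small sketch of how averaging over a long window can hide a short spike. The utilization values and the `downsample` helper are hypothetical and exist only for this illustration.

```python
# Hypothetical per-second CPU utilization (%) for one minute:
# steady at 20%, then a 5-second spike to 100%.
per_second = [20.0] * 55 + [100.0] * 5


def downsample(values, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    averaged = []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


one_minute_avg = downsample(per_second, 60)  # a single point for the whole minute
five_second_avg = downsample(per_second, 5)  # twelve points, one per 5 seconds
```

The one-minute average comes out near 26.7%, which looks unremarkable, while the five-second series still contains a point at 100%. The same loss happens when a monitoring system rolls up old data, which is the concern behind the "Long-lived" characteristic.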
What is a record?
In short, a record is a low-urgency alert that does not notify anyone automatically but is recorded in a monitoring system in case it becomes useful for later analysis or investigation.
What is a notification?
A notification is a moderate-urgency alert that notifies someone who can fix the problem in a non-interrupting way such as email or chat.
What is a page?
A page is an urgent alert that interrupts a recipient’s work, sleep, or personal time, whatever the hour.
What type of alert should work metrics receive?
Page
What type of alert should resource metrics other than utilization receive?
Record
What type of alert should resource metrics for utilization receive?
Notification or page depending on severity
What type of alert should a failed work-related event receive?
Page
What level of severity applies to alerts handled as records?
Low
Many alerts will not be associated with a service problem, so a human may never even need to be aware of them. For instance, when a data store that supports a user-facing service starts serving queries much slower than usual, but not slow enough to make an appreciable difference in the overall service’s response time, that should generate a low-urgency alert that is recorded in your monitoring system for future reference or investigation but does not interrupt anyone’s work. After all, transient issues that could be to blame, such as network congestion, often go away on their own. But should the service start returning a large number of timeouts, that alert-based data will provide invaluable context for your investigation.
What level of severity applies to alerts handled as notifications?
Moderate
The next tier of alerting urgency is for issues that do require intervention, but not right away. Perhaps the data store is running low on disk space and should be scaled out in the next several days. Sending an email and/or posting a notification in the service owner’s chat room is a perfect way to deliver these alerts—both message types are highly visible, but they won’t wake anyone in the middle of the night or disrupt an engineer’s flow.
What level of severity applies to alerts handled as pages?
High
The most urgent alerts should receive special treatment and be escalated to a page (as in “pager”) to urgently request human attention. Response times for your web application, for instance, should have an internal SLA that is at least as aggressive as your strictest customer-facing SLA. Any instance of response times exceeding your internal SLA would warrant immediate attention, whatever the hour.
What are the three questions to determine an alert’s level of urgency?
- Is this issue real? If the issue is indeed real, it should generate an alert.
- Does this issue require attention?
- Is this issue urgent?
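One way to read the three questions is as a small decision procedure that maps the answers onto the record, notification, and page levels defined in the earlier cards. The sketch below is an interpretation under that assumption; the `AlertLevel` enum, the `classify` function, and the `NONE` level for issues that are not real are hypothetical names introduced for illustration.

```python
from enum import Enum


class AlertLevel(Enum):
    NONE = "none"                  # not a real issue: no alert at all (assumed)
    RECORD = "record"              # low urgency: stored, nobody is notified
    NOTIFICATION = "notification"  # moderate urgency: email or chat
    PAGE = "page"                  # high urgency: interrupts a person


def classify(is_real, needs_attention, is_urgent):
    """Answer the three questions in order and stop at the first 'no'."""
    if not is_real:
        return AlertLevel.NONE
    if not needs_attention:
        return AlertLevel.RECORD
    if not is_urgent:
        return AlertLevel.NOTIFICATION
    return AlertLevel.PAGE


# A data store running low on disk space that can wait a few days:
disk_space_alert = classify(is_real=True, needs_attention=True, is_urgent=False)
```

Here `disk_space_alert` is `AlertLevel.NOTIFICATION`, which matches the disk space example in the notification card above.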
Should you build pages on symptoms or causes?
Symptoms
Further Reading: https://www.datadoghq.com/blog/monitoring-101-collecting-data/
When is the only time you should send a page?
When symptoms of urgent problems in your system’s work are detected, or if a critical and finite resource limit is about to be reached
When should you record alerts?
Whenever your monitoring system detects real issues in your infrastructure, even if those issues have not yet affected overall performance.
What is the workflow for investigating an issue?
Start at the top with work metrics >
Dig into resources >
Consider alerts and other events that may be correlated with your metrics >
Fix the issue, then add instrumentation for any data you were missing during diagnosis
What is the standardized monitoring framework that allows you to investigate problems more systematically?
For each system in your infrastructure, set up a dashboard ahead of time that displays all its key metrics, with relevant events overlaid.
Investigate causes of problems by starting with the highest-level system that is showing symptoms, reviewing its work and resource metrics and any associated events.
If problematic resources are detected, apply the same investigation pattern to the resource (and its constituent resources) until your root problem is discovered and corrected.
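The recursive pattern in these three steps can be sketched as a walk over a tree of systems and the resources they depend on. This is a simplified model under stated assumptions: the `System` class, the boolean `healthy` flag (a stand-in for a person reviewing work metrics, resource metrics, and events on a dashboard), and the example topology are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class System:
    name: str
    healthy: bool  # stand-in for "its metrics and events look normal"
    resources: List["System"] = field(default_factory=list)


def find_root_problems(system):
    """Start at a symptomatic system and descend into its unhealthy resources."""
    if system.healthy:
        return []
    unhealthy = [r for r in system.resources if not r.healthy]
    if not unhealthy:
        # Nothing beneath this system looks wrong, so the problem is here.
        return [system.name]
    roots = []
    for resource in unhealthy:
        roots.extend(find_root_problems(resource))
    return roots


# Hypothetical topology: a web app depends on a database and a cache.
db = System("database", False, [System("db-disk", False), System("db-cpu", True)])
web = System("web-app", False, [db, System("cache", True)])
roots = find_root_problems(web)
```

For this topology `roots` is `["db-disk"]`: the investigation starts at the web app showing symptoms, moves to the database, and ends at the disk, where no lower-level resource is implicated.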
What are the three pillars of observability?
Metrics, logs, and traces