How to think about metrics Flashcards
What are work metrics?
Work metrics indicate the top-level health of your system by measuring its useful output.
What are the four subtypes of work metrics?
Throughput
Success
Error
Performance
What is throughput in relation to metrics?
The amount of work the system is doing per unit time. Throughput is usually recorded as an absolute number.
What is success in relation to metrics?
Success metrics represent the percentage of work that was executed successfully.
What is error in relation to metrics?
Error metrics capture the number of erroneous results, usually expressed as a rate of errors per unit time or normalized by the throughput to yield errors per unit of work. Error metrics are often captured separately from success metrics when there are several potential sources of error, some of which are more serious or actionable than others.
What is performance in relation to metrics?
Performance metrics quantify how efficiently a component is doing its work. The most common performance metric is latency, which represents the time required to complete a unit of work. Latency can be expressed as an average or as a percentile, such as “99% of requests returned within 0.1s”.
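The four work metric subtypes can be derived from the same raw request data. The sketch below is illustrative only: the `Request` record, the `work_metrics` function, and the field names are hypothetical, and the percentile uses the simple nearest-rank method, which is one of several common definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one unit of work."""
    latency_s: float
    ok: bool


def work_metrics(requests, window_s):
    """Summarize a non-empty list of requests observed over window_s seconds."""
    total = len(requests)
    if total == 0:
        raise ValueError("need at least one request")
    errors = sum(1 for r in requests if not r.ok)
    latencies = sorted(r.latency_s for r in requests)
    # Nearest-rank 99th percentile: the smallest latency such that
    # at least 99% of requests were at or below it.
    rank = max(1, math.ceil(99 * total / 100))
    return {
        "throughput_per_s": total / window_s,             # throughput
        "success_pct": 100.0 * (total - errors) / total,  # success
        "errors_per_s": errors / window_s,                # error rate over time
        "errors_per_request": errors / total,             # error normalized by throughput
        "avg_latency_s": sum(latencies) / total,          # performance (average)
        "p99_latency_s": latencies[rank - 1],             # performance (percentile)
    }
```

Reporting both the average and the 99th percentile is deliberate: a handful of slow requests barely moves the average but shows up clearly in the percentile.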
What are resource metrics?
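Resource metrics describe the state of the resources a system depends on, such as CPU, memory, disks, or a database. They can help you reconstruct a detailed picture of a system’s state, making them especially valuable for investigation and diagnosis of problems.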
What are the four key areas for each system resource that you should try to collect?
Utilization
Saturation
Errors
Availability
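As a rough illustration of what collecting these four areas might look like for a single resource, the sketch below summarizes one sampling interval. The `ResourceSample` fields and the `summarize` function are hypothetical. It assumes utilization is the share of time the resource was busy, saturation is the amount of queued work, and availability is the share of health checks the resource answered.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """Hypothetical measurements for one resource over one interval."""
    interval_s: float          # length of the sampling interval
    busy_s: float              # time the resource spent doing work
    queued_requests: int       # work waiting that could not yet be serviced
    internal_errors: int       # errors raised by the resource itself
    health_checks_passed: int  # checks the resource responded to
    health_checks_total: int   # checks attempted


def summarize(sample):
    """Return the four resource metrics for one sample."""
    return {
        "utilization_pct": 100.0 * sample.busy_s / sample.interval_s,
        "saturation": sample.queued_requests,
        "errors": sample.internal_errors,
        "availability_pct": 100.0 * sample.health_checks_passed
        / sample.health_checks_total,
    }
```

For example, `summarize(ResourceSample(60.0, 45.0, 3, 1, 57, 60))` reports 75% utilization with three requests queued, one internal error, and 95% availability.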
What are some examples of other metrics?
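These are metrics that are neither work nor resource metrics but can still help in diagnosing problems. Common examples include counts of cache hits or database locks. When in doubt, capture the data.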
How are events defined?
Events are discrete, infrequent occurrences that provide crucial context for understanding what changed in your system’s behavior.
What are some examples of events?
Changes: Internal code releases, builds, and build failures
Alerts: Internally generated alerts or third-party notifications
Scaling events: Adding or subtracting hosts
What are the four characteristics of good data?
Well-understood: You should be able to quickly determine how each metric or event was captured and what it represents.
Granular: If you collect metrics too infrequently or average values over long windows of time, you may lose the ability to accurately reconstruct a system’s behavior.
Tagged by scope: Each of your hosts operates simultaneously in multiple scopes, and you may want to check on the aggregate health of any of these scopes, or their combinations.
Long-lived: If you discard data too soon, or if after a period of time your monitoring system aggregates your metrics to reduce storage costs, then you lose important information about what happened in the past.
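Two of these characteristics, granularity and tagging by scope, are easy to see with a few numbers. The sketch below uses made-up data points and a hypothetical `mean_by_tag` helper. It first aggregates the same points along two different scopes, then shows how a one-minute average can hide a one-second spike.

```python
from collections import defaultdict
from statistics import mean


def mean_by_tag(points, tag_key):
    """Average metric values grouped by the value of one tag."""
    groups = defaultdict(list)
    for tags, value in points:
        if tag_key in tags:
            groups[tags[tag_key]].append(value)
    return {tag_value: mean(values) for tag_value, values in groups.items()}


# Made-up CPU utilization readings, each tagged with the scopes its host belongs to.
cpu_points = [
    ({"region": "us-east", "role": "web"}, 40.0),
    ({"region": "us-east", "role": "db"}, 80.0),
    ({"region": "eu-west", "role": "web"}, 20.0),
]
by_region = mean_by_tag(cpu_points, "region")  # same data, sliced by region
by_role = mean_by_tag(cpu_points, "role")      # same data, sliced by role

# Granularity: 59 fast seconds and one very slow second.
per_second_latency = [0.1] * 59 + [6.0]
one_minute_average = mean(per_second_latency)  # stays below 0.2 s
worst_second = max(per_second_latency)         # the 6 s spike is only visible here
```

Because tags travel with each data point, new slices such as region and role combined need no change to how the data is collected.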
What is a record?
A record is a low-urgency alert that does not notify anyone automatically but is recorded in a monitoring system in case it becomes useful for later analysis or investigation.
What is a notification?
A notification is a moderate-urgency alert sent in a non-interrupting way, such as email or chat, to someone who can fix the problem.
What is a page?
A page is an urgent alert that interrupts a recipient’s work, sleep, or personal time, whatever the hour.
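The three urgency levels can be thought of as a routing decision. The sketch below is only one possible arrangement: the `Urgency` enum, the `dispatch` function, and the `send_chat` and `send_page` callbacks are hypothetical names, and writing every alert to the log (not only records) is an assumption made here for later investigation.

```python
from enum import Enum


class Urgency(Enum):
    RECORD = "record"              # low urgency: stored, nobody notified
    NOTIFICATION = "notification"  # moderate urgency: email or chat
    PAGE = "page"                  # high urgency: interrupts a person


def dispatch(urgency, message, log, send_chat, send_page):
    """Route one alert according to its urgency level."""
    # Assumption: every alert is stored so it is available for later analysis.
    log.append((urgency.value, message))
    if urgency is Urgency.NOTIFICATION:
        send_chat(message)  # non-interrupting channel
    elif urgency is Urgency.PAGE:
        send_page(message)  # interrupting channel, whatever the hour
```

Passing the delivery functions in as arguments keeps the routing rule independent of any particular chat or paging service.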