How to think about metrics Flashcards by Collin Sanford

What are work metrics?

Work metrics indicate the top-level health of your system by measuring its useful output.

What are the four subtypes of work metrics?

Throughput
Success
Error
Performance

What is throughput in relation to metrics?

The amount of work the system is doing per unit time. Throughput is usually recorded as an absolute number.
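
As a minimal sketch of this definition, throughput can be computed as a count of completed work divided by the length of the measurement window. The function name and the sample numbers below are hypothetical.

```python
def throughput(completed_count: int, window_seconds: float) -> float:
    """Return units of work completed per second over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return completed_count / window_seconds


# Hypothetical sample: 1,200 requests handled in a 60-second window.
requests_per_second = throughput(1200, 60)
print(requests_per_second)  # 20.0
```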

What is success in relation to metrics?

Metrics that represent the percentage of work that was executed successfully.
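
A small sketch of a success metric, assuming you already count successful and total units of work over the same window. The function name and figures are illustrative only.

```python
def success_percentage(successes: int, total: int) -> float:
    """Return the percentage of work units that completed successfully."""
    if total <= 0:
        # With no work done, a success percentage is undefined.
        raise ValueError("total must be positive")
    return 100.0 * successes / total


# Hypothetical sample: 985 of 1,000 requests succeeded.
print(success_percentage(985, 1000))  # 98.5
```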

What is error in relation to metrics?

Error metrics capture the number of erroneous results, usually expressed as a rate of errors per unit time or normalized by the throughput to yield errors per unit of work. Error metrics are often captured separately from success metrics when there are several potential sources of error, some of which are more serious or actionable than others.
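
The two ways of expressing an error metric, plus a per-source breakdown, might look like the sketch below. The function names, error labels, and sample counts are all hypothetical.

```python
from collections import Counter


def errors_per_second(error_count: int, window_seconds: float) -> float:
    """Error rate expressed per unit time."""
    return error_count / window_seconds


def errors_per_unit_of_work(error_count: int, work_count: int) -> float:
    """Error rate normalized by throughput (errors per request)."""
    return error_count / work_count


# Hypothetical window: 15 errors among 1,200 requests in 60 seconds.
print(errors_per_second(15, 60))          # 0.25
print(errors_per_unit_of_work(15, 1200))  # 0.0125

# Counting errors by source keeps the more actionable kinds visible.
errors_by_kind = Counter(["timeout", "timeout", "bad_request", "timeout"])
print(errors_by_kind["timeout"])  # 3
```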

What is performance in relation to metrics?

Performance metrics quantify how efficiently a component is doing its work. The most common performance metric is latency, which represents the time required to complete a unit of work. Latency can be expressed as an average or as a percentile, such as “99% of requests returned within 0.1s”.
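
To see why the choice between an average and a percentile matters, the sketch below computes both for a hypothetical set of request latencies. It uses the nearest-rank percentile method, which is one common definition among several; the helper name and sample values are assumptions.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in seconds; one request is a slow outlier.
latencies = [0.02, 0.03, 0.03, 0.04, 0.05, 0.05, 0.06, 0.08, 0.09, 1.50]

average_latency = mean(latencies)      # pulled upward by the single outlier
median_latency = percentile(latencies, 50)
p99_latency = percentile(latencies, 99)

print(median_latency)  # 0.05
print(p99_latency)     # 1.5
```

Here the median stays at 0.05s while the average is several times higher, so reporting both gives a fuller picture of latency than either alone.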

What are resource metrics?

Resource metrics can help you reconstruct a detailed picture of a system’s state, making them especially valuable for investigation and diagnosis of problems.

What are the four key areas for each system resource that you should try to collect?

Utilization
Saturation
Errors
Availability
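
As a rough illustration of the first two areas, the sketch below assumes that utilization means the percentage of time a resource was busy and that saturation means the amount of queued work the resource could not yet service. Both definitions, the function names, and the sample values are assumptions rather than something stated on the cards.

```python
def utilization_percent(busy_seconds: float, window_seconds: float) -> float:
    """Assumed definition: share of the window during which the resource was busy."""
    return 100.0 * busy_seconds / window_seconds


def mean_queue_depth(queue_depth_samples) -> float:
    """Assumed saturation measure: average amount of work waiting for the resource."""
    return sum(queue_depth_samples) / len(queue_depth_samples)


# Hypothetical disk: busy for 45 of 60 seconds, with four queue-depth samples.
print(utilization_percent(45, 60))     # 75.0
print(mean_queue_depth([0, 2, 4, 2]))  # 2.0
```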

What are some examples of other metrics?

Common examples include counts of cache hits or database locks. When in doubt, capture the data.

How are events defined?

Events are discrete, infrequent occurrences that can provide crucial context for understanding what changed in your system’s behavior.

What are some examples of events?

Changes: Internal code releases, builds, and build failures
Alerts: Internally generated alerts or third-party notifications
Scaling events: Adding or subtracting hosts

What are the four characteristics of good data?

Well-understood: You should be able to quickly determine how each metric or event was captured and what it represents.

Granular: If you collect metrics too infrequently or average values over long windows of time, you may lose the ability to accurately reconstruct a system’s behavior.

Tagged by scope: Each of your hosts operates simultaneously in multiple scopes, and you may want to check on the aggregate health of any of these scopes, or their combinations.

Long-lived: If you discard data too soon, or if after a period of time your monitoring system aggregates your metrics to reduce storage costs, then you lose important information about what happened in the past.
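
The "tagged by scope" characteristic can be sketched as follows: if every data point carries tags, the same data can be aggregated over any scope or combination of scopes. The tag names (`region`, `role`), the sample values, and the `average_by` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical CPU readings, each tagged with the scopes its host belongs to.
points = [
    ({"region": "us-east", "role": "web"}, 40.0),
    ({"region": "us-east", "role": "db"}, 70.0),
    ({"region": "eu-west", "role": "web"}, 60.0),
    ({"region": "eu-west", "role": "web"}, 20.0),
]


def average_by(points, *tag_keys):
    """Average the values of all points that share the same values for tag_keys."""
    totals = defaultdict(lambda: [0.0, 0])  # scope -> [sum, count]
    for tags, value in points:
        scope = tuple(tags[key] for key in tag_keys)
        totals[scope][0] += value
        totals[scope][1] += 1
    return {scope: total / count for scope, (total, count) in totals.items()}


print(average_by(points, "role"))
# {('web',): 40.0, ('db',): 70.0}
print(average_by(points, "region", "role"))
# {('us-east', 'web'): 40.0, ('us-east', 'db'): 70.0, ('eu-west', 'web'): 40.0}
```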

What is a record?

In short, a record is a low-urgency alert that does not notify anyone automatically but is recorded in a monitoring system in case it becomes useful for later analysis or investigation.

What is a notification?

A notification is a moderate-urgency alert that notifies someone who can fix the problem in a non-interrupting way such as email or chat.

What is a page?

A page is an urgent alert that interrupts a recipient’s work, sleep, or personal time, whatever the hour.

What type of alert should work metrics receive?

Page

What type of alert should resource metrics other than utilization receive?

Record

What type of alert should resource metrics for utilization receive?

Notification or page depending on severity

What type of alert should a failed work-related event receive?

Page
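
The four preceding cards amount to a small lookup from data type to alert type, which could be sketched as below. The string labels, the `severe` flag, and the `alert_for` function are hypothetical names chosen for illustration.

```python
from enum import Enum


class Alert(Enum):
    RECORD = "record"
    NOTIFICATION = "notification"
    PAGE = "page"


def alert_for(data_kind: str, subtype: str = "", severe: bool = False) -> Alert:
    """Map a kind of monitoring data to the alert type suggested by the cards."""
    if data_kind == "work_metric":
        return Alert.PAGE
    if data_kind == "resource_metric":
        if subtype == "utilization":
            # Notification or page, depending on severity.
            return Alert.PAGE if severe else Alert.NOTIFICATION
        return Alert.RECORD
    if data_kind == "failed_work_event":
        return Alert.PAGE
    raise ValueError(f"unknown data kind: {data_kind}")


print(alert_for("resource_metric", "saturation"))   # Alert.RECORD
print(alert_for("resource_metric", "utilization"))  # Alert.NOTIFICATION
```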

What level of severity should alerts handled as records be categorized as?

Low

Many alerts will not be associated with a service problem, so a human may never even need to be aware of them. For instance, when a data store that supports a user-facing service starts serving queries much slower than usual, but not slow enough to make an appreciable difference in the overall service’s response time, that should generate a low-urgency alert that is recorded in your monitoring system for future reference or investigation but does not interrupt anyone’s work. After all, transient issues that could be to blame, such as network congestion, often go away on their own. But should the service start returning a large number of timeouts, that alert-based data will provide invaluable context for your investigation.

What level of severity should alerts handled as notifications be categorized as?

Moderate

The next tier of alerting urgency is for issues that do require intervention, but not right away. Perhaps the data store is running low on disk space and should be scaled out in the next several days. Sending an email or posting a notification in the service owner’s chat room is a perfect way to deliver these alerts: both message types are highly visible, but they won’t wake anyone in the middle of the night or disrupt an engineer’s flow.

What level of severity should alerts handled as pages be categorized as?

High

The most urgent alerts should receive special treatment and be escalated to a page (as in “pager”) to urgently request human attention. Response times for your web application, for instance, should have an internal SLA that is at least as aggressive as your strictest customer-facing SLA. Any instance of response times exceeding your internal SLA would warrant immediate attention, whatever the hour.
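
One way to read the SLA guidance above is sketched here: the internal threshold must be no looser than the strictest customer-facing one, and any response time beyond the internal threshold warrants a page. The function names and the threshold values are hypothetical.

```python
def is_internal_sla_valid(internal_sla_s: float, customer_slas_s) -> bool:
    """True if the internal SLA is at least as aggressive as the strictest customer SLA."""
    return internal_sla_s <= min(customer_slas_s)


def should_page(response_time_s: float, internal_sla_s: float) -> bool:
    """Any response time exceeding the internal SLA warrants immediate attention."""
    return response_time_s > internal_sla_s


# Hypothetical thresholds in seconds.
customer_slas = [2.0, 1.0, 1.5]
internal_sla = 0.8

print(is_internal_sla_valid(internal_sla, customer_slas))  # True
print(should_page(0.95, internal_sla))                     # True
print(should_page(0.40, internal_sla))                     # False
```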

What are the three questions that determine an alert’s level of urgency?

Is this issue real? If the issue is indeed real, it should generate an alert.
Does this issue require attention?
Is this issue urgent?
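
The three questions can be read as a short decision chain. The sketch below assumes that a real issue needing no attention becomes a record, one needing non-urgent attention becomes a notification, and an urgent one becomes a page; that mapping is inferred from the definitions of record, notification, and page on the earlier cards, and the function name is hypothetical.

```python
from typing import Optional


def alert_urgency(is_real: bool, needs_attention: bool, is_urgent: bool) -> Optional[str]:
    """Return the alert type implied by the three questions, or None for no alert."""
    if not is_real:
        return None            # not a real issue: no alert at all
    if not needs_attention:
        return "record"        # real, but nobody has to act on it
    if not is_urgent:
        return "notification"  # needs intervention, but not right away
    return "page"              # needs immediate human attention


print(alert_urgency(True, False, False))  # record
print(alert_urgency(True, True, True))    # page
```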

Should you build pages on symptoms or causes?

Symptoms

Further Reading: https://www.datadoghq.com/blog/monitoring-101-collecting-data/

When is the only time you should send a page?

When symptoms of urgent problems in your system’s work are detected, or when a critical and finite resource limit is about to be reached.

When should you record alerts?

Whenever your monitoring system detects real issues in your infrastructure, even if those issues have not yet affected overall performance.

What is the workflow for investigating an issue?

Start at the top with work metrics > Dig into resources > Consider alerts and other events that may be correlated with your metric > Fix the issue with more instrumentation and metrics

What is the standardized monitoring framework that allows you to investigate problems more systematically?

For each system in your infrastructure, set up a dashboard ahead of time that displays all its key metrics, with relevant events overlaid. Investigate causes of problems by starting with the highest-level system that is showing symptoms, reviewing its work and resource metrics and any associated events. If problematic resources are detected, apply the same investigation pattern to the resource (and its constituent resources) until your root problem is discovered and corrected.
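
The recursive pattern described above (start at the highest-level system showing symptoms, then repeat the same investigation on each problematic resource) might be sketched like this. The `System` class, its single `healthy` flag, and the example topology are hypothetical simplifications; a real investigation would review dashboards of work metrics, resource metrics, and events rather than one boolean.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class System:
    name: str
    healthy: bool
    resources: List["System"] = field(default_factory=list)


def find_root_problems(system: System) -> List[str]:
    """Follow unhealthy resources downward and return the deepest unhealthy ones."""
    if system.healthy:
        return []
    unhealthy = [r for r in system.resources if not r.healthy]
    if not unhealthy:
        # No problematic resource below this one: treat it as a root problem.
        return [system.name]
    roots: List[str] = []
    for resource in unhealthy:
        roots.extend(find_root_problems(resource))
    return roots


# Hypothetical topology: a slow web app whose database sits on a failing disk.
disk = System("disk", healthy=False)
database = System("database", healthy=False, resources=[disk])
cache = System("cache", healthy=True)
web_app = System("web_app", healthy=False, resources=[database, cache])

print(find_root_problems(web_app))  # ['disk']
```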

What are the three pillars of observability?

Metrics, logs, and traces

(29 cards)