Observability Flashcards
4 Primary Aspects of Observability
Logs, Metrics, Alarms, Events
Logs
a record of application information, system performance, or user activity that provides a verbose, running commentary on what the system is doing
What interface do we use to query and visualize the log data stored in Elasticsearch?
Kibana
What are logs most useful for?
Logs are very useful for diagnosing errors because they contain detailed data about activity as well as stack traces.
Are logs or metrics more costly? Why?
Logs contain detailed information which makes them much more costly than metrics, so we should use logs sparingly. For example, we should use a metric to record that a request succeeded. We should use a log when a request fails, because then the stack trace and other context will be useful.
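A minimal sketch of the rule above, in Python: successes only increment a counter, while failures also write a log entry with the stack trace. The names `handle_request`, `metrics`, `request.success`, and `request.failure` are hypothetical, and the in-memory `Counter` only stands in for a real metrics client.

```python
import logging
from collections import Counter

logger = logging.getLogger("requests")
# Hypothetical in-memory stand-in for a metrics client.
metrics = Counter()


def handle_request(handler, request):
    """Run handler(request); count successes cheaply, log failures in detail."""
    try:
        result = handler(request)
    except Exception:
        # Failure: the stack trace and request context are worth the cost of a log.
        logger.exception("request failed: %r", request)
        metrics["request.failure"] += 1
        return None
    # Success: a single counter increment is enough.
    metrics["request.success"] += 1
    return result
```

Calling `handle_request(int, "42")` only bumps the success counter; calling `handle_request(int, "abc")` logs the `ValueError` with its traceback and bumps the failure counter.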
Metrics
a set of numbers, measured over intervals of time, that give information about a particular process or activity
3 Metric Types
Counters, Timers, Gauges
Counters
statsd events that count occurrences. The two most useful outputs are count:sum (the sum of all increments) and count:rate (the sum of all increments, normalized to increments per second)
How many times did an event occur?
e.g. How many times was the database accessed?
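A rough illustration of how the two counter outputs relate, in Python. The function name `summarize_counter` and the idea of a fixed reporting interval are assumptions for the sketch, not details of how statsd is configured here.

```python
def summarize_counter(increments, interval_seconds):
    """Return count:sum and count:rate for one reporting interval.

    increments: the increment values received during the interval.
    interval_seconds: assumed length of the interval, in seconds.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    total = sum(increments)
    return {
        "count:sum": total,                      # sum of all increments
        "count:rate": total / interval_seconds,  # increments per second
    }
```

For example, three database accesses reported as increments of 1, 1, and 3 over an assumed 10-second interval give a count:sum of 5 and a count:rate of 0.5 per second.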
Timers
pre-computed quartiles/percentiles based on events coming into statsd. Despite the name, timers need not measure time; they are used whenever you need to understand the distribution of a piece of data (such as finding the median or the 95th percentile). Timer metric names are suffixed with timer:[computed value] (for example, timer:p99).
e.g. What was the average response time when accessing the database?
e.g. How many miles was the average ride?
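A small Python sketch of what "pre-computed percentiles" means. It uses the nearest-rank method, which is an assumption chosen for simplicity and may differ from how statsd computes its values. The key `timer:p50` is likewise hypothetical, modeled on the `timer:p99` example.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_timer(values):
    """Summarize one interval of timer samples (response times, ride miles, ...)."""
    return {
        "timer:p50": percentile(values, 50),  # hypothetical name for the median
        "timer:p99": percentile(values, 99),
    }
```

With ten response times where nine fall between 11 and 19 ms and one is 240 ms, the median stays at 15 ms while p99 is 240 ms, which is why percentiles say more about a distribution than a single average.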
Gauges
views over an internal value of an application. For most purposes, gauge:mean will be the most useful, representing the arithmetic mean of all received gauge values.
e.g. How many database connections are in our connection pool?
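As a sketch, gauge:mean over one interval is just the arithmetic mean of the gauge values received in that interval. The function name `gauge_mean` is hypothetical.

```python
from statistics import fmean


def gauge_mean(samples):
    """Arithmetic mean of all gauge values received in one interval."""
    if not samples:
        raise ValueError("no gauge values received")
    return fmean(samples)
```

If the connection pool reported 8, 10, and 12 open connections during the interval, `gauge_mean([8, 10, 12])` is 10.0.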
Metric names all start with ______________
the environment they are in, specifically production or staging
Alarms / Alerts
notifications sent to the on-call engineer when the data emitted by one or more metrics passes a certain threshold. The terms “alert” and “alarm” are used interchangeably.
Every alert has a condition that determines when it will fire, and that condition is based on ________
one or more metrics.
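One way to picture an alert condition, sketched in Python. Requiring several consecutive samples above the threshold is an assumption added to avoid firing on a single spike; the function name and parameters are hypothetical and do not describe how Lighthouse evaluates alerts.

```python
def alert_should_fire(samples, threshold, min_consecutive=3):
    """Return True when the most recent min_consecutive samples all exceed threshold.

    samples: metric values in time order, oldest first.
    threshold: the value the metric must pass for the alert to fire.
    min_consecutive: assumed number of consecutive breaching samples required.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])
```

For an error-rate metric with a threshold of 0.5, the samples 0.1, 0.6, 0.7, 0.9 would fire the alert, while 0.6, 0.7, 0.2 would not, because the latest sample is back under the threshold.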
What tool do we use to view alerts?
Lighthouse
What tool do we use to manage incidents?
PagerDuty