Observability Design Patterns Flashcards
Log Aggregation
Summary:
Centralized logging for multiple services. Examples: AWS CloudWatch, Splunk, etc.
Detail:
Consider a use case where an application consists of multiple service instances that are running on multiple machines. Requests often span multiple service instances. Each service instance generates a log file in a standardized format. How can we understand the application behavior through logs for a particular request?
We need a centralized logging service that aggregates logs from each service instance. Users can search and analyze the logs. They can configure alerts that are triggered when certain messages appear in the logs. For example, Pivotal Cloud Foundry (PCF) has Loggregator, which collects logs from each component of the PCF platform (router, controller, Diego, etc.) as well as from applications. AWS CloudWatch does the same.
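A minimal sketch of the "standardized format" half of this pattern: each service emits one JSON log line per event, which a centralized aggregator (CloudWatch, Splunk, etc.) can then ship and index. The function name and field names here are illustrative assumptions, not a real library API.

```python
import json
import time

def log_event(service, level, message, request_id=None):
    """Emit one JSON log line in a standardized format.

    An aggregation agent would tail stdout (or a log file) and
    forward each line to the central logging service.
    """
    record = {
        "ts": time.time(),     # epoch timestamp for ordering across machines
        "service": service,    # which service instance produced this line
        "level": level,
        "message": message,
    }
    if request_id:
        record["request_id"] = request_id
    line = json.dumps(record)
    print(line)  # in practice: write to stdout/file for the shipping agent
    return line

log_event("orders", "INFO", "order created", request_id="abc-123")
```

Because every service uses the same field names, the aggregator can filter all lines for one `request_id` across many machines.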
Performance Metrics
Summary:
Centralized monitoring of services. Examples: Datadog, New Relic, Prometheus. Usually requires agents; collection can be push or pull.
Detail:
When the service portfolio grows under a microservice architecture, it becomes critical to keep watch on transactions so that patterns can be monitored and alerts sent when an issue occurs. How should we collect metrics to monitor application performance?
A metrics service is required to gather statistics about individual operations and aggregate them per application service, providing reporting and alerting. There are two models for aggregating metrics:
Push — the service pushes metrics to the metrics service, e.g. New Relic, AppDynamics
Pull — the metrics service pulls metrics from the service, e.g. Prometheus
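The two models above can be sketched with a tiny in-process counter registry (a hypothetical sketch, not a real client library): services increment counters locally; a pull-based collector scrapes `render()` over an HTTP endpoint (Prometheus-style), while a push-based agent would periodically send the same data out (New Relic-style).

```python
from collections import defaultdict

class MetricsRegistry:
    """Minimal in-process metrics registry (illustrative only)."""

    def __init__(self):
        self._counters = defaultdict(int)

    def inc(self, name, value=1):
        # Services call this on each operation they want to measure.
        self._counters[name] += value

    def render(self):
        # Expose counters as plain text, one metric per line --
        # similar in spirit to the Prometheus exposition format.
        return "\n".join(f"{k} {v}" for k, v in sorted(self._counters.items()))

registry = MetricsRegistry()
registry.inc("http_requests_total")
registry.inc("http_requests_total")
registry.inc("http_errors_total")
```

In the pull model, `render()` backs a `/metrics` endpoint the collector scrapes on a schedule; in the push model, a background thread would POST the same output to the metrics service.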
Distributed Tracing
Summary:
“Radioactive Isotope” to track calls across multiple services. Generally achieved with a common transaction ID reported by services via logging.
Detail:
In a microservice architecture, requests often span multiple services. Each service handles a request by performing one or more operations, which may in turn call other services. When troubleshooting, it is valuable to have a trace ID so that a request can be traced end-to-end.
The solution is to introduce a transaction ID. The following approach can be used:
Assign each external request a unique external request ID.
Pass the external request ID to all services involved in handling the request.
Include the external request ID in all log messages.
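The three steps above can be sketched as follows. The header name `X-Request-Id` is a common convention, not mandated by the pattern, and the functions here are illustrative assumptions rather than a real tracing library.

```python
import uuid

def log(message, request_id):
    # Step 3: include the request id in every log line, so the log
    # aggregator can stitch together a full end-to-end trace.
    print(f"request_id={request_id} {message}")

def handle_edge_request(headers):
    """At the edge (e.g. an API gateway), assign a unique request id
    if the caller did not supply one (step 1), then attach it to the
    headers forwarded to downstream services (step 2)."""
    request_id = headers.get("X-Request-Id") or str(uuid.uuid4())
    log("edge received request", request_id)
    return {**headers, "X-Request-Id": request_id}
```

Downstream services repeat the same pattern: read the ID from the incoming headers, log with it, and forward it on every outgoing call.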
Health Check
Summary:
API method that returns the health of a service. Should also check the service's critical dependencies (e.g. a service is unhealthy if it cannot reach its data store).
Detail:
When a microservice architecture has been implemented, there is a chance that a service is up but unable to handle transactions. Each service needs an endpoint, such as /health, that can be used to check the health of the application. This API should check the status of the host, the connections to other services and infrastructure, and any service-specific logic.
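A minimal sketch of such an endpoint's logic, assuming each dependency is represented by a zero-argument probe callable (a hypothetical API) that returns True when the dependency is reachable. The service is only reported "UP" when every dependency check passes, which captures the "up but unable to handle transactions" case.

```python
def health_check(checks):
    """Run each dependency probe and report overall status.

    `checks` maps a dependency name (e.g. "database", "cache") to a
    zero-arg callable returning True when that dependency is healthy.
    A /health HTTP handler would serialize this result as JSON and
    return 200 for "UP" or 503 for "DOWN".
    """
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises counts as a failed dependency.
            results[name] = False
    status = "UP" if all(results.values()) else "DOWN"
    return {"status": status, "checks": results}
```

A load balancer or orchestrator (e.g. Kubernetes) can then poll this endpoint and route traffic away from instances that report "DOWN".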