Metrics Monitoring and Alerting System Flashcards

1
Q

What is a metrics monitoring and alerting system?

A

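A metrics monitoring and alerting system collects metrics from the infrastructure, stores and visualizes them, and raises alerts when something looks wrong. A well-designed one provides clear visibility into the health of the infrastructure, which helps ensure high availability and reliability.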

2
Q

What are the high-level requirements?

A

The requirements are:

  • The infrastructure being monitored is large-scale.
    • 100 million daily active users
    • Assume we have 1,000 server pools, 100 machines per pool, 100 metrics per machine => ~10 million metrics
    • 1-year data retention
    • Data retention policy: raw form for 7 days, 1-minute resolution for 30 days, 1-hour resolution for 1 year.
  • A variety of metrics can be monitored, for example:
    • CPU usage
    • Request count
    • Memory usage
    • Message count in message queues
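The scale estimate and the retention policy above can be checked with a small sketch. The function names `downsample` and `resolution_for_age` are hypothetical, averaging is only one possible aggregation, and the exact handling of the day boundaries is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Scale estimate taken from the requirements.
SERVER_POOLS = 1_000
MACHINES_PER_POOL = 100
METRICS_PER_MACHINE = 100
TOTAL_METRICS = SERVER_POOLS * MACHINES_PER_POOL * METRICS_PER_MACHINE


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    Averaging is an assumption; a real system might also keep min, max, or sum.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def resolution_for_age(age_days):
    """Map the age of a data point to the resolution the policy keeps it at."""
    if age_days < 7:
        return "raw"
    if age_days < 30:
        return "1m"
    if age_days < 365:
        return "1h"
    return "expired"  # past the 1-year retention window


raw_points = [(0, 1.0), (10, 3.0), (60, 5.0), (70, 7.0)]
one_minute = downsample(raw_points, 60)  # two buckets, starting at 0 and 60
```

Rolling raw data up into 1-minute and then 1-hour buckets is what keeps a full year of history affordable at roughly 10 million metrics.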
3
Q

What are the non-functional requirements?

A
  • Scalability. The system should be scalable to accommodate growing metrics and alert volume.
  • Low latency. The system needs to have low query latency for dashboards and alerts.
  • Reliability. The system should be highly reliable to avoid missing critical alerts.
  • Flexibility. Technology keeps changing, so the pipeline should be flexible enough to easily integrate new technologies in the future.
4
Q

What are the components of the system?

A
  • Data collection: collect metric data from different sources.
  • Data transmission: transfer data from sources to the metrics monitoring system.
  • Data storage: organize and store incoming data.
  • Alerting: analyze incoming data, detect anomalies, and generate alerts. The system must be able to send alerts to different communication channels.
  • Visualization: present data in graphs, charts, etc. Engineers are better at identifying patterns, trends, or problems when data is presented visually, so we need visualization functionality.
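The alerting component can be illustrated with a minimal sketch. It assumes a simple threshold rule that fires once a number of consecutive samples breach the threshold; `AlertRule`, `should_fire`, and `evaluate` are hypothetical names, and each channel is modeled as a plain callable rather than a real email or paging integration.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    threshold: float
    min_consecutive: int  # assumed >= 1: breaching samples required before firing


def should_fire(rule, values):
    """Return True if the last `min_consecutive` values all exceed the threshold."""
    if len(values) < rule.min_consecutive:
        return False
    recent = values[-rule.min_consecutive:]
    return all(v > rule.threshold for v in recent)


def evaluate(rule, values, channels):
    """Send one message to every channel if the rule fires; return whether it fired."""
    if not should_fire(rule, values):
        return False
    message = (
        f"ALERT {rule.name}: last {rule.min_consecutive} samples "
        f"above {rule.threshold}"
    )
    for send in channels:
        send(message)
    return True


# Usage: a list's append method stands in for a notification channel.
sent = []
rule = AlertRule(name="high_cpu", threshold=0.9, min_consecutive=3)
fired = evaluate(rule, [0.5, 0.95, 0.96, 0.97], [sent.append])
```

Requiring several consecutive breaches is one common way to avoid alerting on a single noisy sample; other designs use time windows or aggregates instead.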
5
Q

What is a time-series database?

A

A time series database (TSDB) is a database optimized for time-stamped or time series data. Time series data are simply measurements or events that are tracked, monitored, downsampled, and aggregated over time. This could be server metrics, application performance monitoring, network data, sensor data, events, clicks, trades in a market, and many other types of analytics data.

6
Q

What are the high-level design components?

A

  • Metrics source. This can be application servers, SQL databases, message queues, etc.
  • Metrics collector. It gathers metrics data and writes data into the time-series database.
  • Time-series database. This stores metrics data as time series. It usually provides a custom query interface for analyzing and summarizing a large amount of time-series data. It maintains indexes on labels to facilitate the fast lookup of time-series data by labels.
  • Query service. The query service makes it easy to query and retrieve data from the time-series database. This should be a very thin wrapper if we choose a good time-series database. It could also be entirely replaced by the time-series database’s own query interface.
  • Alerting system. This sends alert notifications to various alerting destinations.
  • Visualization system. This shows metrics in the form of various graphs/charts.
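The label index mentioned for the time-series database can be sketched as an in-memory inverted index. This is only an illustration, not how any particular database implements it; `TinySeriesStore` is a hypothetical class, and storing the metric name under a reserved `__name__` label is an assumption borrowed from common practice.

```python
from collections import defaultdict


class TinySeriesStore:
    """Toy time-series store with an inverted index on labels."""

    def __init__(self):
        self._series = {}               # series key -> list of (timestamp, value)
        self._index = defaultdict(set)  # (label, value) -> set of series keys

    @staticmethod
    def _key(name, labels):
        # A series is identified by its metric name plus its sorted label pairs.
        return (name, tuple(sorted(labels.items())))

    def write(self, name, labels, timestamp, value):
        key = self._key(name, labels)
        if key not in self._series:
            # First point of a new series: register it in the index.
            self._series[key] = []
            self._index[("__name__", name)].add(key)
            for pair in labels.items():
                self._index[pair].add(key)
        self._series[key].append((timestamp, value))

    def select(self, name, **matchers):
        """Return {series key: points} matching the name and every label matcher."""
        candidates = set(self._index.get(("__name__", name), set()))
        for pair in matchers.items():
            candidates &= self._index.get(pair, set())
        return {key: self._series[key] for key in candidates}


store = TinySeriesStore()
store.write("cpu_usage", {"host": "a", "region": "us"}, 1, 0.5)
store.write("cpu_usage", {"host": "b", "region": "eu"}, 1, 0.7)
store.write("cpu_usage", {"host": "a", "region": "us"}, 2, 0.6)
us_series = store.select("cpu_usage", region="us")
```

A lookup intersects small sets of series keys instead of scanning every series, which is what makes label-based queries fast.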

7
Q

What are the metrics collection models?

A

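Push vs. pull. In the pull model, the metrics collector periodically fetches metrics from the sources. In the push model, the sources (or agents running on them) send metrics to the collector.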
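The difference between the two collection models comes down to who initiates the transfer. The sketch below shows that with plain in-process method calls; `Collector` and `Source` are hypothetical classes, and a real system would use network requests instead.

```python
class Collector:
    """Toy metrics collector supporting both collection models."""

    def __init__(self):
        self.received = []

    def scrape(self, sources):
        """Pull model: the collector asks each known source for its metrics."""
        for source in sources:
            self.received.extend(source.read_metrics())

    def ingest(self, samples):
        """Push model: a source calls this to deliver its samples."""
        self.received.extend(samples)


class Source:
    """Toy metrics source exposing a single CPU usage metric."""

    def __init__(self, host, cpu):
        self.host = host
        self.cpu = cpu

    def read_metrics(self):
        return [(self.host, "cpu_usage", self.cpu)]

    def push_to(self, collector):
        collector.ingest(self.read_metrics())


web1 = Source("web-1", 0.42)
web2 = Source("web-2", 0.73)

pull_collector = Collector()
pull_collector.scrape([web1, web2])  # collector initiates

push_collector = Collector()
web1.push_to(push_collector)         # sources initiate
web2.push_to(push_collector)
```

Both collectors end up with the same samples. In the pull model the collector must know the list of sources, while in the push model each source must know where the collector is.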

8
Q

What is Kafka?

A

Kafka is a distributed messaging platform. In this design it is used as a highly reliable and scalable queueing system.

9
Q

Describe the query service.

A

The query service comprises a cluster of query servers that access the time-series databases and handle requests from the visualization and alerting systems. A dedicated set of query servers decouples the time-series databases from these clients, which gives us the flexibility to change the time-series database or the visualization and alerting systems whenever needed.
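The decoupling described here can be sketched as a thin wrapper around an interchangeable backend. `TimeSeriesBackend`, `InMemoryBackend`, and `QueryService` are hypothetical names used only for illustration; a real query service would sit in front of an actual time-series database.

```python
from typing import Dict, List, Optional, Protocol, Tuple

Point = Tuple[int, float]  # (timestamp, value)


class TimeSeriesBackend(Protocol):
    """Anything that can return points for a metric within a time range."""

    def fetch(self, metric: str, start: int, end: int) -> List[Point]: ...


class InMemoryBackend:
    """Stand-in for a real time-series database."""

    def __init__(self, data: Dict[str, List[Point]]):
        self._data = data

    def fetch(self, metric: str, start: int, end: int) -> List[Point]:
        return [(ts, v) for ts, v in self._data.get(metric, []) if start <= ts <= end]


class QueryService:
    """Thin wrapper: clients depend on this class, not on the database."""

    def __init__(self, backend: TimeSeriesBackend):
        self._backend = backend

    def range_query(self, metric: str, start: int, end: int) -> List[Point]:
        return self._backend.fetch(metric, start, end)

    def latest(self, metric: str, start: int, end: int) -> Optional[Point]:
        points = self.range_query(metric, start, end)
        return points[-1] if points else None


backend = InMemoryBackend({"cpu_usage": [(1, 0.5), (2, 0.6), (3, 0.9)]})
service = QueryService(backend)
window = service.range_query("cpu_usage", 2, 3)
```

Swapping the database then means writing another class with the same `fetch` method; the visualization and alerting clients keep calling `QueryService` unchanged.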
