Chapter 7 Flashcards

Question 1

Q

Explain what the Monitoring step in the DevOps lifecycle is and how it can be tied to the Plan step

Answer

A

Monitoring step: essential for tracking the performance and health of applications and systems. It enables teams to quickly identify and act on potential issues, and to keep an eye on the daily operation of an application or system

How it can be tied to the Plan step: when a bug, … is detected, teams can go back to the Plan step of the DevOps lifecycle and start planning new code or components that will solve the issue

Question 2

Q

Explain the use and importance of logs in the context of monitoring

Answer

A

use: logs play a critical role in recording and preserving the sequence of events that occur within an application or system. They are a fundamental component for maintaining the health, security and efficiency of any digital system. They serve as a vital diagnostic tool and a historical record that developers and operations teams rely on to understand the behavior of the system and diagnose issues

importance:
debugging, performance checking, security auditing, compliance, business analytics, incident response

debugging: logs are invaluable when it comes to debugging. They point to where and when something went wrong, which allows a targeted approach to problem-solving rather than a broad sweep

performance checking: logs can also be used to look at an application's performance over time. By analyzing logs, teams can identify performance trends

security auditing: security is another area where logs play a vital role. They can record access attempts, user transactions, and changes to the system, providing a trail that can be used to detect unauthorized access or other security breaches

compliance: many industries are subject to regulations that require the retention of logs for a certain period. These logs must be stored securely and often need to be readily accessible for auditing purposes

business analytics: logs can also be mined for business insights. For example, web server logs can reveal user behavior patterns, popular content, and potential areas for site improvement

incident response: in the event of a system failure or breach, logs are often the first place responders will look to establish a timeline and understand the size of the incident

Question 3

Q

Describe how the Python logging module makes logging possible and uses different levels and handlers to work

Answer

A

The Python logging module is a versatile and widely used facility that provides a flexible way for applications and libraries to handle logging. It is part of the Python standard library, which means it is readily available and does not require additional installation.
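Levels (DEBUG, INFO, WARNING, ERROR, CRITICAL) say how severe a message is, and handlers decide where a message goes. The following is a minimal sketch of how the two could be combined, not a recommended configuration. The logger name `api`, the messages, and the in-memory buffer standing in for a log file are all assumptions made for the example.

```python
import io
import logging

# Hypothetical logger name chosen for this example.
logger = logging.getLogger("api")
logger.setLevel(logging.DEBUG)   # the logger itself lets every level through
logger.propagate = False         # keep the example's output away from the root logger

# Handler 1: console (stderr), only WARNING and above.
console = logging.StreamHandler()
console.setLevel(logging.WARNING)

# Handler 2: in-memory buffer standing in for a log file, DEBUG and above.
buffer = io.StringIO()
archive = logging.StreamHandler(buffer)
archive.setLevel(logging.DEBUG)

# Both handlers share one message layout.
formatter = logging.Formatter("%(levelname)s:%(name)s:%(message)s")
console.setFormatter(formatter)
archive.setFormatter(formatter)

logger.addHandler(console)
logger.addHandler(archive)

logger.debug("cache miss for key=42")
logger.info("request handled")
logger.warning("response took longer than expected")
logger.error("database connection failed")
```

With this setup the console only shows the warning and the error, while the buffer receives all four messages. Each handler filters by its own level after the logger's level has let the record through.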

Question 4

Q

Explain the use and importance of metrics in the context of monitoring

Answer

A

Metrics are the quantifiable measures that track the performance and health of applications and systems in real time

Question 5

Q

Give examples of basic operational metrics

Answer

A

CPU usage: measures the percentage of CPU resources being used by the API service

Memory consumption: tracks the amount of memory the API service is using

Network I/O: monitors the amount of data being sent and received by the API service

Disk I/O: observes the read and write operations on the disk where the API service is hosted

Container health: checks the status of the containers running the API service

Question 6

Q

Describe what Prometheus is and what its main features are

Answer

A

Prometheus is an open-source systems monitoring and alerting toolkit. It’s a standalone project designed to collect and store metrics as time series data, which means that metrics information is stored with the timestamp at which it was recorded, along with optional key-value pairs called labels

Main features:
metric collection, flexible queries (PromQL), stand-alone and local storage, pull model, configurable targets

Metric collection: Prometheus collects data from various targets (services, applications, servers,…) over time. These targets expose metrics via an HTTP endpoint

Flexible queries (PromQL): PromQL allows you to query and manipulate your collected data. Example: you can use PromQL to calculate the average response time over the last hour or find the top 5 endpoints with the highest error rate

Stand-alone and local storage: Prometheus doesn’t rely on external databases or distributed storage. It stores data locally in its own time series database. Example: even if other services in your network go down, a Prometheus server keeps collecting and storing metrics from every target it can still reach

Pull model: Prometheus pulls data from targets at regular intervals. Example: every 15 seconds Prometheus fetches CPU usage, memory consumption, and other metrics from your application

Configurable targets: you define which targets Prometheus should scrape. This can be done statically or via service discovery. Example: in your prometheus.yml, you specify that Prometheus should scrape metrics from your FastAPI app at localhost:8000
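The pull model can be illustrated with the standard library alone. The sketch below is not Prometheus itself. It is a toy target that exposes a `/metrics` endpoint in the Prometheus text exposition format, plus a toy scraper that pulls it once. The metric name `http_requests_total`, its labels, and its values are hypothetical.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric values an application might track.
METRICS = {
    'http_requests_total{method="GET",path="/items"}': 128,
    'http_requests_total{method="POST",path="/items"}': 17,
}


def render_metrics() -> str:
    """Render the metrics in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total number of HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    lines += [f"{series} {value}" for series, value in METRICS.items()]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging for the example


def scrape(url: str) -> dict:
    """Pull the endpoint once and parse the sample lines into a dict."""
    # Bypass any system proxy, since the target in this example is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        text = response.read().decode("utf-8")
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):   # skip HELP/TYPE comments
            series, value = line.rsplit(" ", 1)
            samples[series] = float(value)
    return samples


def demo() -> dict:
    # Port 0 lets the OS pick a free port; a real target would use a fixed one.
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return scrape(f"http://127.0.0.1:{server.server_port}/metrics")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The target never sends anything on its own initiative. It only answers when the scraper asks, which is the essence of the pull model. A real Prometheus server repeats the scrape at the configured interval and stores every sample with a timestamp.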

Question 7

Q

Explain how Prometheus works in a deployment with a FastAPI container that it scrapes and a Grafana container that builds a dashboard from its data

Answer

A

FastAPI application: exposes metrics at localhost:8000/metrics. Prometheus can scrape this endpoint to collect data

Prometheus to collect metrics: it scrapes data from predefined targets, in this case our FastAPI application, and stores this data for querying and alerting

Grafana for visualization: connects to Prometheus and queries the stored data. Lets you create dashboards to visualize the data
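A dashboard tool reads from Prometheus through its HTTP API. The sketch below shows roughly what one instant query looks like: building the request URL for `/api/v1/query` and unpacking a vector result. The base URL (Prometheus on its default port 9090 on localhost), the job label `fastapi`, and the sample response are assumptions; the response is hand-written in the API's documented shape and is not real data.

```python
import json
import urllib.parse

# Assumption: Prometheus is reachable on its default port on localhost.
PROMETHEUS_URL = "http://localhost:9090"


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(payload: str) -> list:
    """Turn an instant-vector response into (labels, value) pairs."""
    doc = json.loads(payload)
    if doc.get("status") != "success":
        raise ValueError(f"query failed: {doc.get('error', 'unknown error')}")
    # Each sample carries its labels and a [timestamp, "value"] pair.
    return [(s["metric"], float(s["value"][1])) for s in doc["data"]["result"]]


# Hand-written sample response (hypothetical job name, not real data).
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "fastapi"},
             "value": [1700000000.0, "1"]},
        ],
    },
})

print(instant_query_url('up{job="fastapi"}'))
print(parse_vector(SAMPLE))
```

The `up` metric is recorded by Prometheus for every target and is 1 when the last scrape succeeded, so it makes a convenient first dashboard panel for checking that the FastAPI container is being scraped.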

Question 8

Q

Explain what KPIs and service levels in SLAs are and how they relate to monitoring

Answer

A

KPIs (key performance indicators) are tied to the demands the customer has for our applications and are crucial for evaluating how well the API meets those goals for its users. Three important KPIs for API monitoring might include:
response time, error rate, availability

Response time: a direct indicator of user experience, as it measures the time an API takes to process a request and return a response

Error rate: represents the stability and reliability of the API by showing the percentage of requests that result in an error

Availability: indicates the percentage of time the API is operational and accessible, which is crucial for user reliance

Service level agreements (SLAs) are built upon service levels and provide concrete examples of these operational standards. An SLA may specify these service levels:
The response time of an API call should not exceed 300 milliseconds

The error rate should be below 1% for all API calls measured over a month

The availability should be at least 99.95% measured over a month

KPIs are explicitly tied to service levels like these

Service levels, especially through SLAs, help quantify the expectations and obligations of both the IT provider and the customer, which makes KPIs and monitoring a critical component in managing the contract
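The three example service levels can be checked against measured KPIs in a few lines. This is a minimal sketch under stated assumptions: the `Request` record, the sample data, the 30-day month, and the choice to count only 5xx status codes as errors are all hypothetical and not taken from any real SLA.

```python
from dataclasses import dataclass

# Service levels taken from the example SLA above.
MAX_RESPONSE_MS = 300
MAX_ERROR_RATE = 0.01
MIN_AVAILABILITY = 0.9995


@dataclass
class Request:
    duration_ms: float
    status_code: int


def error_rate(requests):
    """Share of requests that ended in a server error (assumed: 5xx)."""
    failed = sum(1 for r in requests if r.status_code >= 500)
    return failed / len(requests)


def slowest_response_ms(requests):
    """Longest observed response time in milliseconds."""
    return max(r.duration_ms for r in requests)


def availability(uptime_minutes, total_minutes):
    """Fraction of the measurement period the API was up."""
    return uptime_minutes / total_minutes


def check_sla(requests, uptime_minutes, total_minutes):
    """Map each service level to True (met) or False (violated)."""
    return {
        "response_time": slowest_response_ms(requests) <= MAX_RESPONSE_MS,
        "error_rate": error_rate(requests) < MAX_ERROR_RATE,
        "availability": availability(uptime_minutes, total_minutes) >= MIN_AVAILABILITY,
    }


# Hypothetical month: 200 requests, one failure, 30 minutes of downtime.
sample = [Request(120.0, 200)] * 199 + [Request(280.0, 500)]
minutes_in_30_days = 30 * 24 * 60
report = check_sla(sample, minutes_in_30_days - 30, minutes_in_30_days)
print(report)
```

In this sample the response time and error rate levels are met, but availability is not: 99.95% over 30 days leaves room for about 21.6 minutes of downtime, and the sample has 30.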

Question 9

Q

Describe the use of alerts in the context of monitoring

Answer

A

Alerts can be used on both metrics and logs. They often rely on metrics to trigger notifications when performance indicators fall outside of acceptable ranges.

Similarly, alerts can be generated from logs when specific error messages or event patterns are detected, indicating a potential issue that requires attention.

Example: an alert may be set up to notify the team if CPU usage exceeds 90% for a certain period.
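The "for a certain period" part matters, because it keeps a brief spike from triggering a notification. A minimal sketch of such a rule follows, assuming CPU readings arrive at a fixed interval. The function name, the readings, and the window of four samples are hypothetical.

```python
def should_alert(samples, threshold=90.0, for_samples=4):
    """Return True if the last `for_samples` readings are all above `threshold`."""
    if len(samples) < for_samples:
        return False  # not enough history to judge a sustained breach
    return all(value > threshold for value in samples[-for_samples:])


# Hypothetical CPU readings in percent, one every 15 seconds.
brief_spike = [40.0, 95.0, 42.0, 41.0, 43.0]
sustained = [60.0, 91.0, 93.5, 97.0, 92.2]

print(should_alert(brief_spike))
print(should_alert(sustained))
```

Only the second series triggers the alert, since its last four readings all stay above 90%. With one reading every 15 seconds, a window of four samples corresponds to a full minute of sustained high CPU usage.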

(9 cards)