Chapter 7 Flashcards
Explain what the Monitoring step in the DevOps lifecycle is and how it can be tied to the Plan step
Monitoring step: essential for tracking the performance and health of applications and systems. It enables teams to quickly identify and act on potential issues, or otherwise keep an eye on the daily operation of an application or system.
How it ties to the Plan step: when a bug,… gets detected, teams can go back to the Plan step in the DevOps lifecycle to start planning new code or components that will solve the issue.
Explain the use and importance of logs in the context of monitoring
Use: logs play a critical role in recording and preserving the sequence of events that occur within an application or system. They are a fundamental component for maintaining the health, security, and efficiency of any digital system. They serve as a vital diagnostic tool and a historical record that developers and operations teams rely on to understand the behavior of the system and diagnose issues.
importance:
debugging, performance checking, security auditing, compliance, business analytics, incident response
debugging: logs are invaluable when it comes to debugging. They allow a targeted approach to problem-solving, rather than a broad sweep
performance checking: logs can also be used to look at an application's performance over time. By analyzing logs, teams can identify performance trends
security auditing: security is another area where logs play a vital role. They can record access attempts, user transactions, and changes to the system, providing a trail that can be used to detect unauthorized access or other security breaches
compliance: many industries are subject to regulations that require the retention of logs for a certain period. These logs must be stored securely and often need to be readily accessible for auditing purposes
business analytics: logs can also be mined for business insights. For example, web server logs can reveal user behavior patterns, popular content, and potential areas for site improvement
incident response: in the event of a system failure or breach, logs are often the first place responders will look to establish a timeline and understand the size of the incident
Describe how the Python logging module makes logging possible and uses different levels and handlers to work
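The Python logging module is a versatile and widely used facility that provides a flexible way for applications and libraries to handle logging. It is part of the Python standard library, which means it is readily available and does not require additional installation. Each message is logged at a severity level (DEBUG, INFO, WARNING, ERROR, or CRITICAL), and handlers decide where the records that pass their level go, for example the console or a file.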
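A minimal sketch of levels and handlers working together. The logger name and the messages are made up for illustration, and both handlers write to in-memory buffers so the result is easy to inspect; in a real application they would more likely be a console handler and a file handler.

```python
import io
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("flashcards.demo")
logger.setLevel(logging.DEBUG)  # the logger itself passes DEBUG and above
logger.propagate = False        # do not forward records to the root logger

formatter = logging.Formatter("%(levelname)s:%(name)s:%(message)s")

# Handler 1 receives every record the logger passes on (DEBUG and above).
everything = io.StringIO()
verbose_handler = logging.StreamHandler(everything)
verbose_handler.setLevel(logging.DEBUG)
verbose_handler.setFormatter(formatter)
logger.addHandler(verbose_handler)

# Handler 2 only accepts WARNING and above.
warnings_only = io.StringIO()
warning_handler = logging.StreamHandler(warnings_only)
warning_handler.setLevel(logging.WARNING)
warning_handler.setFormatter(formatter)
logger.addHandler(warning_handler)

logger.debug("cache lookup took 3 ms")
logger.info("request handled")
logger.warning("disk almost full")
logger.error("request failed")

# Only the WARNING and ERROR records reach the second handler.
print(warnings_only.getvalue(), end="")
```

The same record is offered to every handler, and each handler applies its own level, so one logger can feed a verbose destination and a quieter one at the same time.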
Explain the use and importance of metrics in the context of monitoring
Metrics are the quantifiable measures that track the performance and health of applications and systems in real time.
Give examples of basic operational metrics
CPU usage: measures the percentage of CPU resources being used by the API service
Memory consumption: tracks the amount of memory the API service is using
Network I/O: monitors the amount of data being sent and received by the API service
Disk I/O: observes the read and write operations on the disk where the API service is hosted
Container health: checks the status of containers running the API service
Describe what Prometheus and its main features are
Prometheus is an open-source systems monitoring and alerting toolkit. It’s a standalone project designed to collect and store metrics as time series data, which means that metrics information is stored with the timestamp at which it was recorded, along with optional key-value pairs called labels
Main features:
metric collection, flexible queries (PromQL), stand-alone and local storage, pull model, configurable targets
Metric collection: Prometheus collects data from various targets (services, applications, servers,…) over time. These targets expose metrics via an HTTP endpoint
Flexible queries (PromQL): PromQL allows you to query and manipulate your collected data. Example: You can use PromQL to calculate the average response time over the last hour or find the top 5 endpoints with the highest error rate
Stand-alone and local storage: Prometheus doesn’t rely on external databases. It stores data locally in its own time series database. Example: Even if your network goes down, Prometheus continues to collect and store metrics
Pull model: Prometheus pulls data from targets at regular intervals. Example: Every 15 seconds Prometheus fetches cpu usage, memory consumption, and other metrics from your application
Configurable targets: you define which targets Prometheus should scrape. This can be done statically or via service discovery. Example: in your prometheus.yml, you specify that Prometheus should scrape metrics from your FastAPI app at localhost:8000
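To make "targets expose metrics via an HTTP endpoint" concrete, here is a rough sketch of the plain-text lines such an endpoint returns: a metric name, optional labels in braces, and a value. The metric names and values are invented for illustration, and `render_metrics` is a hypothetical helper; in practice a client library such as `prometheus_client` would normally produce this output.

```python
def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus-style text lines."""
    lines = []
    for name, labels, value in samples:
        if labels:
            # Labels are key="value" pairs inside braces, separated by commas.
            label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical samples; a real service would read these from live counters.
samples = [
    ("http_requests_total", {"method": "GET", "path": "/items"}, 42),
    ("process_cpu_percent", {}, 12.5),
]

print(render_metrics(samples), end="")
```

Prometheus attaches the timestamp itself when it scrapes the endpoint, which is why the lines here carry only a name, labels, and a value.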
Explain how Prometheus works in a deployment with a FastAPI container it scrapes and Grafana container that makes a dashboard of its data
FastAPI application: exposes metrics at localhost:8000/metrics. Prometheus can scrape this endpoint to collect data
Prometheus to collect metrics: it scrapes data from predefined targets, in this case our FastAPI application, and stores this data for querying and alerting
Grafana for visualization: connects to Prometheus and queries the stored data. Lets you create dashboards to visualize the data
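The last step, a dashboard querying the stored data, goes through Prometheus's HTTP API. The sketch below builds an instant-query URL and parses a response of the shape that API returns. It assumes Prometheus listens on its default port 9090; the sample response is hand-written for illustration, not captured from a real server, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumption: Prometheus is reachable on its default port.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql, base_url=PROMETHEUS_URL):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{query_string}"


def extract_values(response_text):
    """Map each series' instance label to its current value."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


# Hand-written example response for the query `up` (1 means the target is reachable).
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "instance": "localhost:8000", "job": "fastapi"},
                "value": [1700000000.0, "1"],
            }
        ],
    },
})

print(build_query_url("up"))
print(extract_values(sample_response))
```

Grafana hides these requests behind its Prometheus data source, so in day-to-day use you only write the PromQL expression in a dashboard panel.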
Explain what KPIs and service levels in SLAs are and how they relate to monitoring
KPIs (key performance indicators) are tied to the demands the customer has for our applications and are crucial for evaluating how well the API meets those goals for its users. Three important KPIs for API monitoring might include:
response time, error rate, availability
Response time: a direct indicator of user experience, as it measures the time an API takes to process and return a response
Error rate: represents the stability and reliability of the API by showing the percentage of requests that result in an error
Availability: indicates the percentage of time the API is operational and accessible, which is crucial for user reliance
Service level agreements (SLAs) are built upon service levels and provide concrete examples for these operational standards. An SLA may specify these service levels:
The response time of an API call should not exceed 300 milliseconds
The error rate should be below 1% for all API calls measured over a month
The availability should be at least 99.95% measured over a month
KPIs are explicitly tied to service levels like these
Service levels, especially through SLAs, help in quantifying the expectations and obligations of both the IT provider and the customer, making the KPIs and monitoring a critical component in managing the contract
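As a worked example, the three service levels above can be checked against a month of measured KPIs. This is a sketch under stated assumptions: a 30-day month, hypothetical class, function, and field names, and made-up measurements.

```python
from dataclasses import dataclass


@dataclass
class MonthlyKpis:
    slowest_response_ms: float
    total_requests: int
    failed_requests: int
    downtime_minutes: float
    minutes_in_month: int = 30 * 24 * 60  # assumption: a 30-day month


def allowed_downtime_minutes(availability_pct, minutes_in_month=30 * 24 * 60):
    """Downtime budget implied by an availability target."""
    return minutes_in_month * (1 - availability_pct / 100)


def check_service_levels(kpis):
    """Compare measured KPIs with the example SLA thresholds."""
    error_rate_pct = kpis.failed_requests / kpis.total_requests * 100
    availability_pct = (1 - kpis.downtime_minutes / kpis.minutes_in_month) * 100
    return {
        "response_time": kpis.slowest_response_ms <= 300,
        "error_rate": error_rate_pct < 1.0,
        "availability": availability_pct >= 99.95,
    }


# Made-up measurements for one month.
good_month = MonthlyKpis(slowest_response_ms=280, total_requests=100_000,
                         failed_requests=500, downtime_minutes=10)
bad_month = MonthlyKpis(slowest_response_ms=350, total_requests=100_000,
                        failed_requests=1_500, downtime_minutes=30)

print(check_service_levels(good_month))
print(check_service_levels(bad_month))
# 99.95% over a 30-day month leaves roughly 21.6 minutes of downtime.
print(round(allowed_downtime_minutes(99.95), 1))
```

Turning the availability percentage into a downtime budget shows how strict the target is: at 99.95%, a single half-hour outage already breaks the month.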
Describe the use of alerts in the context of monitoring
Alerts can be used on both metrics and logs. They often rely on metrics to trigger notifications when performance indicators fall outside of acceptable ranges.
Similarly, alerts can be generated from logs when specific error messages or event patterns are detected, indicating a potential issue that requires attention.
Example: an alert may be set up to notify the team if CPU usage exceeds 90% for a certain period.
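The CPU example can be sketched as a small rule that only fires when the threshold is exceeded for several samples in a row, so a brief spike does not notify anyone. The function name and the sample values are hypothetical; with one sample every 15 seconds, four consecutive samples would correspond to about one minute.

```python
def cpu_alert_firing(samples, threshold=90.0, min_consecutive=4):
    """Return True if usage stays above the threshold for enough samples in a row."""
    run = 0
    for value in samples:
        # Extend the current run while above the threshold, otherwise reset it.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Made-up CPU percentages, one per scrape.
sustained_load = [50, 95, 96, 97, 98]
brief_spikes = [95, 96, 50, 97, 98]

print(cpu_alert_firing(sustained_load))
print(cpu_alert_firing(brief_spikes))
```

Requiring a sustained breach is a common way to keep alerts actionable; the right duration depends on how quickly the team needs to react.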