l10 observability Flashcards
Observability
Telemetry Collection:
* Collecting metrics from various sources across all
layers (L3 – L7)
* Metrics must be gathered from the infrastructure as well as
from application elements (deployments & services)
Analytics and Visibility
* Visualizing and Analyzing Metrics
* Mechanism for reporting anomalies
* Pod-to-pod packet capture is necessary
Security and Troubleshooting
* Tracing or Service Meshing
* Mechanisms for Prediction and Analytics
eBPF
Extended Berkeley Packet Filter:
* Sandboxed programs, loaded from user space,
that run inside the kernel
* Cilium (as one example) uses it for better
networking instead of iptables
Target of Observability, Critical Path
A span A is part of the critical path if and
only if:
– A’s parent is blocked on A’s completion
at time t
– A is not blocked on any child span’s
completion at time t
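One way to approximate this rule in code is to walk a finished trace backwards from the end of the root span. The sketch below is a heuristic, not a definitive implementation: it assumes every span has a start and an end time, that children lie within their parent, and that a parent is blocked on whichever child finished last before the parent continued. The `Span` class, the function names and the example spans are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    start: float
    end: float
    children: list = field(default_factory=list)


def _walk(span):
    """Return critical-path segments (name, start, end) of span, latest first."""
    segments = []
    t = span.end  # point in time we are walking back from
    # Visit children from the one that finished last to the one that finished first.
    for child in sorted(span.children, key=lambda c: c.end, reverse=True):
        if child.end > t:
            # Finished after t: ran in parallel with a span already on the
            # path (or outlived the parent), so the parent was not waiting on it.
            continue
        if child.end < t:
            # Between child.end and t the parent itself was doing the work.
            segments.append((span.name, child.end, t))
        # Assumption: the parent was blocked on this child while it ran.
        segments.extend(_walk(child))
        t = child.start
    if span.start < t:
        segments.append((span.name, span.start, t))
    return segments


def critical_path(root):
    """Return the critical-path segments of a trace in chronological order."""
    return list(reversed(_walk(root)))


# Hypothetical trace: "cache" runs in parallel with "db" and finishes earlier.
root = Span("root", 0, 100, [
    Span("auth", 10, 30),
    Span("db", 30, 80),
    Span("cache", 35, 50),
])
path = critical_path(root)
for name, start, end in path:
    print(f"{name}: {start} to {end}")
```

In this example "cache" never appears on the path: it overlaps "db" and finishes first, so under the stated assumption speeding it up would not shorten the request.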
SLI
Service Level Indicator
What are we measuring?
E.g. How long does it take to return the search results?
- Base for defining availability
- For one specific action/attribute
- Of one specific service
- Examples to be defined
- Golden Signals of one specific operation of
one specific service (the more concise the better) for
network services
- Durability for storage
- Correctness for computation
- Just the metrics, no thresholds / rules to meet
- Need to be derived automatically
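As a small illustration of an SLI that is "just the metric", the sketch below derives an availability indicator for one service from observed HTTP status codes. Counting only 5xx responses as failures is an assumption made for this example, and the function name and sample data are hypothetical.

```python
def availability_sli(status_codes):
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not status_codes:
        raise ValueError("no requests observed")
    # Assumption: 4xx responses are client errors and still count as "good".
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


# Hypothetical sample of ten requests to one operation of one service.
codes = [200, 200, 503, 200, 404, 200, 500, 200, 200, 200]
sli = availability_sli(codes)
print(f"availability SLI: {sli:.2f}")
```

The function only reports a number. Whether that number is good enough is decided by the SLO, not by the SLI.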
SLO
Service Level Objective
How well do we perform on the SLI?
E.g. Queries should return results within 500 ms
- Threshold to be met for the defined SLIs:
SLI <= target threshold
- Technically hard to define, needs to be refined
- Wrong SLI → no use
- Threshold too low → customers/services affected
- Threshold too high → too many incidents, false alarms
- Must be simple yet holistic
- Avoid absolutes (always available, for all data accesses, etc.)
- Organizationally hard to develop
- Must be defined with product management
- Have as few SLOs as possible (but as many as necessary)
p95(http_latency{path=webappl/impressum}) < 50
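A check like the p95 expression above can be sketched in a few lines. The example assumes the nearest-rank percentile definition and a list of raw latency samples in the same (unspecified) unit as the threshold. The function names and the sample values are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def slo_met(latencies, threshold=50, p=95):
    """True if the p-th percentile latency is strictly below the threshold."""
    return percentile(latencies, p) < threshold


# Hypothetical latency samples for one path; one slow outlier (120).
latencies = [12, 18, 22, 25, 31, 33, 38, 41, 44, 47,
             48, 49, 35, 29, 27, 19, 16, 14, 120, 45]
print("p95:", percentile(latencies, 95))
print("SLO met:", slo_met(latencies))
```

With these samples the single outlier falls above the 95th percentile, so the objective still holds. This is one reason percentile-based objectives are preferred over absolutes.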
SLA
Service Level Agreement
Consequences for missing objectives
E.g. apology, refund, …
- Result if SLO is not met
- Legal yet easy-to-understand language with fixed, defined consequences
- Promise to the customer, defined by Product Management (no longer by DevOps)
- Not of interest for the rest of the lecture, since it is defined in contracts rather than in source code
4 causes for failure
Internal system changes,
Changes in user behaviour,
Changes in dependencies,
Changes in platform.
These are the system boundaries.
Availability, Parallel vs Serial / Sequential
Parallel:
* HA-Setup of same services
* E.g. Horizontal Scaling
Parallel Component = 1 - (1 - C1) * (1 - C2)
Serial:
Series Component = C1 * C2 * C3 * C4…
* Different Services of different kinds
* E.g. Database and Messaging and Load Balancer
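The two availability formulas can be tried out with a short sketch. It assumes that components fail independently of each other, and it generalizes the parallel case from two components to any number. The function names and the availability figures are illustrative only.

```python
import math


def serial(*components):
    """Availability of a chain: every component must be up."""
    return math.prod(components)


def parallel(*components):
    """Availability of redundant components: at least one must be up."""
    return 1 - math.prod(1 - c for c in components)


# Two redundant instances at 99% each are better than either one alone.
print(f"parallel: {parallel(0.99, 0.99):.4f}")   # 0.9999
# Three different services in series at 99% each are worse than any one of them.
print(f"serial:   {serial(0.99, 0.99, 0.99):.4f}")  # 0.9703
```

Redundancy raises availability above that of a single component, while every additional service in a serial chain lowers it.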