Lecture 7 Flashcards
Observable system
Exposes enough data about itself that generating information from it, and accessing that information, is easy
Purposes of monitoring: infrastructure level
Resource management
Incident detection
Root cause analysis
Auditing
Intrusion detection
Purposes of monitoring: application level
Performance analysis
Resource management
Failure detection
SLA verification
Auditing
Target system: parallel system
• Batch system
• Data are collected during an application run.
• Analysis happens post mortem.
• Execution is reproducible.
Target system: cloud
• Interactive system
• Data are continuously produced (real-time data)
• Real-time analysis
Data used for
• Immediate action or
• Study past system behavior
Three pillars of monitoring
Metrics (data to use for monitoring)
Logs
Traces
Important metrics
Latency: time it takes to serve a request
Throughput or traffic
Error rate
Utilization or saturation
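As a minimal sketch of how these four metrics could be derived from raw request records: the `Request` fields, the `summarize` helper, and the single-worker capacity model are all assumptions made for illustration, not part of the lecture material.

```python
from dataclasses import dataclass


@dataclass
class Request:
    start: float     # seconds since the start of the observation window
    duration: float  # seconds spent serving the request
    ok: bool         # False if the request failed


def summarize(requests, window_seconds, capacity_seconds):
    """Compute latency, throughput, error rate and utilization for one window.

    capacity_seconds is the total worker time available in the window
    (hypothetical model: one worker for 10 s gives 10 worker-seconds).
    """
    n = len(requests)
    busy = sum(r.duration for r in requests)
    return {
        "mean_latency_s": busy / n if n else 0.0,
        "throughput_rps": n / window_seconds,
        "error_rate": sum(not r.ok for r in requests) / n if n else 0.0,
        "utilization": busy / capacity_seconds,
    }


requests = [
    Request(0.0, 0.25, True),
    Request(1.0, 0.5, True),
    Request(2.0, 0.25, False),
    Request(3.0, 1.0, True),
]
summary = summarize(requests, window_seconds=10.0, capacity_seconds=10.0)
print(summary)
```

In practice latency is usually reported as percentiles rather than a mean, since a mean hides slow outliers; the mean is used here only to keep the sketch short.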
Monitoring system requirements
• Comprehensive (collect everything that is available)
• Low intrusion
• Extensibility
• Scalability
• Elasticity
• Accuracy
• Resilience
Blackbox monitoring
(Cannot see what happens inside the system while a request is processed)
• The monitored system is handled as a black box.
• No data are gained from the inside of the system.
• E.g., only the request interface of a service is visible; nothing is known about its internal structure.
White box monitoring
• Data are also collected from inside the system.
• This gives more context and more detailed insights.
• E.g., the internal organization of a service is visible, such as asynchronous internal handling of requests.
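The difference between the two views can be sketched as follows. The `Service` class, its cache counters, and the two helper functions are hypothetical names chosen for illustration; the point is only that the black-box probe uses nothing but the request interface, while the white-box view reads internal state.

```python
import time


class Service:
    """Toy service with an internal cache (hypothetical example)."""

    def __init__(self):
        self.cache = {}
        self.cache_hits = 0    # internal counters: visible only to
        self.cache_misses = 0  # white-box monitoring

    def handle(self, key):
        if key in self.cache:
            self.cache_hits += 1
        else:
            self.cache_misses += 1
            self.cache[key] = key.upper()
        return self.cache[key]


def blackbox_probe(service, key):
    """Uses only the request interface: did it answer, and how fast?"""
    start = time.perf_counter()
    try:
        service.handle(key)
        ok = True
    except Exception:
        ok = False
    return {"ok": ok, "latency_s": time.perf_counter() - start}


def whitebox_metrics(service):
    """Reads data from inside the service, giving more context."""
    return {"cache_hits": service.cache_hits,
            "cache_misses": service.cache_misses}


service = Service()
blackbox_probe(service, "a")
blackbox_probe(service, "a")
print(whitebox_metrics(service))
```

The probe can report that the second request succeeded, but only the white-box counters reveal that it was served from the cache.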
Overheads
• Lead to intrusion
Reasons:
• Instrumentation
• Computation for aggregations
• Memory overhead for buffering
• Time to push to disk or transfer to collector
• Storage overhead for long-term storage
Reduction techniques:
• Number of metrics
• Measurement frequency
• Representation
• Batching
• Sampling
• Long-term coarsening
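Three of the reduction techniques above (sampling, batching, long-term coarsening) can be sketched in a few lines. The function names and the choice of a per-bucket mean for coarsening are assumptions for illustration; real systems may keep min, max, or percentiles instead.

```python
import random


def sample(events, rate, seed=0):
    """Sampling: keep each event with probability `rate`."""
    rng = random.Random(seed)
    return [e for e in events if rng.random() < rate]


def batches(events, size):
    """Batching: group events so that one transfer carries up to `size` of them."""
    return [events[i:i + size] for i in range(0, len(events), size)]


def coarsen(points, bucket_seconds):
    """Long-term coarsening: replace raw (timestamp, value) points
    with one mean value per time bucket."""
    buckets = {}
    for ts, value in points:
        bucket_start = ts // bucket_seconds * bucket_seconds
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]


# One point per second for 12 seconds, values cycling 0, 1, 2, 3.
points = [(t, float(t % 4)) for t in range(12)]

print(len(sample(points, rate=0.5)))  # roughly half of the points survive
print(len(batches(points, 5)))        # 12 points sent in 3 transfers
print(coarsen(points, 4))             # 12 raw points become 3 stored points
```

Each technique trades detail for overhead: sampling drops events, batching delays them, and coarsening loses resolution for old data.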
Event logs: form
Plaintext
Structured (typically JSON)
Binary
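To make the three forms concrete, the sketch below writes the same event in each of them. The event fields and the binary layout (a fixed header followed by the message bytes) are hypothetical; the binary record omits the log level to keep the layout short.

```python
import json
import struct

event = {"ts": 1700000000, "level": "ERROR", "status": 503,
         "msg": "upstream timeout"}

# Plaintext: easy for humans to read, needs ad hoc parsing by tools.
plain = f'{event["ts"]} {event["level"]} status={event["status"]} {event["msg"]}'

# Structured: self-describing fields that tools can parse directly.
structured = json.dumps(event)

# Binary: compact and fast to parse, but unreadable without the layout.
# Hypothetical layout: uint64 timestamp, uint16 status, uint16 message
# length (little-endian), followed by the UTF-8 message bytes.
header = struct.Struct("<QHH")
msg_bytes = event["msg"].encode("utf-8")
binary = header.pack(event["ts"], event["status"], len(msg_bytes)) + msg_bytes

# Decoding the binary record requires knowing the same layout.
ts, status, msg_len = header.unpack_from(binary)
decoded_msg = binary[header.size:header.size + msg_len].decode("utf-8")

print(plain)
print(structured)
print(len(binary), ts, status, decoded_msg)
```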
Prometheus
Open source monitoring system
• Features
• Metric collection in form of time series
• Storage by a time series database
• Query language for accessing the time series
• Alerting
• Visualization
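The features listed above (time series collection, storage, querying, alerting) can be illustrated with a toy in-memory store. This is not Prometheus code or its API: `TinyTSDB`, `rate`, and `should_alert` are hypothetical names, and the sketch assumes samples are appended in time order and that the counter never resets.

```python
from collections import defaultdict


class TinyTSDB:
    """Toy time series store: one series per (metric name, label set)."""

    def __init__(self):
        self.series = defaultdict(list)  # key -> list of (timestamp, value)

    @staticmethod
    def _key(name, labels):
        return (name, tuple(sorted(labels.items())))

    def append(self, name, labels, ts, value):
        self.series[self._key(name, labels)].append((ts, value))

    def rate(self, name, labels, start, end):
        """Per-second increase of a counter between the first and last
        sample inside [start, end]; None if fewer than two samples."""
        window = [(t, v) for t, v in self.series[self._key(name, labels)]
                  if start <= t <= end]
        if len(window) < 2:
            return None
        (t0, v0), (t1, v1) = window[0], window[-1]
        return (v1 - v0) / (t1 - t0)


def should_alert(db, threshold):
    """Alerting rule: fire when the request rate exceeds the threshold."""
    r = db.rate("http_requests_total", {"job": "api"}, 0, 45)
    return r is not None and r > threshold


db = TinyTSDB()
# Samples of a cumulative request counter, collected every 15 seconds.
for ts, total in [(0, 0), (15, 30), (30, 75), (45, 90)]:
    db.append("http_requests_total", {"job": "api"}, ts, total)

print(db.rate("http_requests_total", {"job": "api"}, 0, 45))
print(should_alert(db, threshold=1.5))
```

Identifying a series by metric name plus labels is what lets a query select or aggregate across many instances of the same metric.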
Borgmon
• Provides measurement of metrics
• Storing as time series
• Rules for aggregation
• Hierarchical design for scalability
Usage
• Alerting
• Dashboard
Distributed Tracing - Google Dapper
• Capture the interaction of different services
• Capture the individual events, e.g., submit a request, receive the request, start processing, …, submit answer, receive answer
• Associate events with a given request to be able to analyze the execution of this request.
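A minimal sketch of the idea, assuming each request carries a trace ID that every service attaches to the events it records. This is not Dapper's implementation: the `frontend` and `backend` functions, the shared `events` list, and the simulated timestamps are hypothetical.

```python
import itertools
from collections import defaultdict

_trace_ids = itertools.count(1)
events = []  # shared event log; real services would each log locally


def record(trace_id, service, name, ts):
    events.append({"trace_id": trace_id, "service": service,
                   "event": name, "ts": ts})


def backend(trace_id, ts):
    """Second service: receives the trace ID along with the request."""
    record(trace_id, "backend", "receive request", ts)
    record(trace_id, "backend", "start processing", ts + 1)
    record(trace_id, "backend", "submit answer", ts + 5)
    return ts + 5


def frontend(ts):
    """First service: creates the trace ID and passes it downstream."""
    trace_id = next(_trace_ids)
    record(trace_id, "frontend", "submit request", ts)
    answer_ts = backend(trace_id, ts + 1)
    record(trace_id, "frontend", "receive answer", answer_ts + 1)
    return trace_id


def events_by_trace(all_events):
    """Associate events with their request and order them by time."""
    traces = defaultdict(list)
    for e in all_events:
        traces[e["trace_id"]].append(e)
    for trace_events in traces.values():
        trace_events.sort(key=lambda e: e["ts"])
    return traces


first = frontend(ts=0)
second = frontend(ts=2)
traces = events_by_trace(events)

for e in traces[first]:
    print(e["ts"], e["service"], e["event"])
```

Because both requests write into the same event stream, the trace ID is what allows the events of one request to be separated from the other and its end-to-end latency to be computed.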