Lecture 7 Flashcards
Observable system
Exposes enough data about itself that generating information from it and accessing that information is easy
Purposes of monitoring: infrastructure level
Resource management
Incident detection
Root cause analysis
Auditing
Intrusion detection
Purposes of monitoring: application level
Performance analysis
Resource management
Failure detection
SLA verification
Auditing
Target system: parallel system
• Batch system
• Data are collected during an application run.
• Analysis happens post mortem.
• Execution is reproducible.
Target system: cloud
• Interactive system
• Data are continuously produced (real-time data)
• Real-time analysis
Data used for
• Immediate action or
• Study past system behavior
Three pillars of monitoring
Metrics (data to use for monitoring)
Logs
Traces
Important metrics
Latency: time it takes to serve a request
Throughput or traffic
Error rate
Utilization or saturation
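The first three metrics above can be derived from per-request records. The following is a minimal sketch, assuming each request is recorded with a start time, an end time, and a success flag; the `Request` class, the `summarize` function, and the sample values are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    start: float  # time the request arrived, in seconds
    end: float    # time the answer was sent, in seconds
    ok: bool      # False if the request failed


def summarize(requests, window_seconds):
    """Compute latency, throughput, and error rate for one observation window."""
    latencies = [r.end - r.start for r in requests]
    n = len(latencies)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "mean_latency_s": sum(latencies) / n if n else 0.0,
        "max_latency_s": max(latencies) if n else 0.0,
        "throughput_rps": n / window_seconds,      # requests per second
        "error_rate": errors / n if n else 0.0,    # fraction of failed requests
    }


# Hypothetical sample: four requests observed in a 4-second window.
sample = [
    Request(0.0, 0.5, True),
    Request(1.0, 1.25, True),
    Request(2.0, 3.0, False),
    Request(3.0, 3.25, True),
]
summary = summarize(sample, window_seconds=4.0)
```

Utilization or saturation is not covered here because it needs resource data (e.g., CPU or queue length) rather than request records.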
Monitoring system requirements
• Comprehensive (collect everything that is available)
• Low intrusion
• Extensibility
• Scalability
• Elasticity
• Accuracy
• Resilience
Blackbox monitoring
(Cannot see what happened inside the system while it processed a request)
• The monitored system is handled as a black box.
• No data are gained from the inside of the system.
• E.g., only the request interface of a service is visible; nothing is known about the internal structure.
White box monitoring
• Data are also collected from the inside of the system.
• This gives more context and more detailed insights.
• E.g. Internal organization of a service is visible, e.g., asynchronous internal handling of requests.
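The difference between the two views can be sketched with a toy service. This is an illustrative sketch, not a real monitoring setup: the `Service` class, its internal counters, and `blackbox_probe` are all hypothetical names. The black-box probe sees only the request interface (success and latency), while the white-box view reads counters the service exposes about its internals.

```python
import time


class Service:
    """Toy service with an internal cache (hypothetical example)."""

    def __init__(self):
        # White-box data: counters the service exposes about its internals.
        self.internal = {"cache_hits": 0, "cache_misses": 0}
        self._cache = {}

    def handle(self, key):
        if key in self._cache:
            self.internal["cache_hits"] += 1
        else:
            self.internal["cache_misses"] += 1
            self._cache[key] = key.upper()
        return self._cache[key]


def blackbox_probe(service, key):
    """Observe only what is visible at the request interface."""
    start = time.perf_counter()
    response = service.handle(key)
    latency = time.perf_counter() - start
    return {"ok": response is not None, "latency_s": latency}


svc = Service()
first = blackbox_probe(svc, "a")   # cache miss, but the probe cannot tell
second = blackbox_probe(svc, "a")  # cache hit, but the probe cannot tell
whitebox_view = dict(svc.internal)  # internal context: one hit, one miss
```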
Overheads
• Lead to intrusion
Reasons:
• Instrumentation
• Computation for aggregations
• Memory overhead for buffering
• Time to push to disk or transfer to collector
• Storage overhead for long-term storage
Reduction techniques:
• Number of metrics
• Measurement frequency
• Representation
• Batching
• Sampling
• Long-term coarsening
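Long-term coarsening from the list above can be sketched as downsampling: old, fine-grained samples are replaced by one average per larger time bucket. This is a minimal sketch under the assumption that samples are `(timestamp_seconds, value)` pairs; the `coarsen` function and the sample data are hypothetical.

```python
def coarsen(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = {}
    for ts, value in samples:
        # Align the timestamp to the start of its bucket.
        start = int(ts // bucket_seconds) * bucket_seconds
        total, count = buckets.get(start, (0.0, 0))
        buckets[start] = (total + value, count + 1)
    return [
        (start, total / count)
        for start, (total, count) in sorted(buckets.items())
    ]


# Hypothetical raw series sampled every 10 seconds, coarsened to 1-minute buckets.
raw = [(0, 10.0), (10, 20.0), (20, 30.0), (60, 40.0), (70, 60.0)]
coarse = coarsen(raw, bucket_seconds=60)
```

Five stored samples shrink to two, at the cost of losing the variation within each minute.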
Event logs: form
Plaintext
Structured (typically JSON)
Binary
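To make the first two forms concrete, the sketch below renders one event both as a plaintext line and as a structured JSON record. The event fields (`ts`, `level`, `service`, `msg`, `order_id`) and their values are made up for illustration.

```python
import json

# Hypothetical event; field names and values are illustrative only.
event = {
    "ts": "2024-01-01T12:00:00Z",
    "level": "ERROR",
    "service": "checkout",
    "msg": "payment failed",
    "order_id": 42,
}

# Plaintext: easy for humans to read, but needs ad hoc parsing by tools.
plaintext = (
    f'{event["ts"]} {event["level"]} {event["service"]}: '
    f'{event["msg"]} (order_id={event["order_id"]})'
)

# Structured: one JSON object per line, machine-parseable without guessing.
structured = json.dumps(event, sort_keys=True)

# A consumer recovers the typed fields directly from the structured form.
parsed = json.loads(structured)
```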
Prometheus
Open source monitoring system
• Features
• Metric collection in form of time series
• Storage by a time series database
• Query language for accessing the time series
• Alerting
• Visualization
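The idea of metrics stored as labelled time series and then queried can be sketched with a toy in-memory store. This is not Prometheus's API or its query language; `TinyTSDB`, `per_second_increase`, and the metric and label names are hypothetical and only mimic the concept.

```python
from collections import defaultdict


class TinyTSDB:
    """Toy time series store: (metric name, labels) -> list of (ts, value)."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, name, labels, ts, value):
        key = (name, tuple(sorted(labels.items())))
        self._series[key].append((ts, value))

    def select(self, name, **matchers):
        """Return all series of a metric whose labels match the given values."""
        result = {}
        for (metric, labels), samples in self._series.items():
            label_map = dict(labels)
            if metric == name and all(
                label_map.get(k) == v for k, v in matchers.items()
            ):
                result[labels] = samples
        return result


def per_second_increase(samples):
    """Average per-second increase of a counter between first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


db = TinyTSDB()
for ts, value in [(0, 0), (15, 30), (30, 60)]:
    db.append("http_requests_total", {"job": "api", "code": "200"}, ts, value)
for ts, value in [(0, 0), (30, 3)]:
    db.append("http_requests_total", {"job": "api", "code": "500"}, ts, value)

ok_series = db.select("http_requests_total", code="200")
ok_rate = per_second_increase(next(iter(ok_series.values())))
all_api = db.select("http_requests_total", job="api")
```

An alerting rule in this toy model would simply compare such a query result against a threshold.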
Borgmon
• Provides measurement of metrics
• Storing as time series
• Rules for aggregation
• Hierarchical design for scalability
Usage
• Alerting
• Dashboard
Distributed Tracing - Google Dapper
• Capture the interaction of different services
• Capture the individual events, e.g., submit a request, receive the request, start processing, ..., submit answer, receive answer
Associate events with a given request to be able to analyze the execution of this request.
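Associating events with a request is commonly done by tagging every unit of work with a shared trace identifier and a reference to its parent. The sketch below is a simplified illustration of that idea, not Dapper's actual interface; `Span`, `start_span`, `spans_of`, and the service names are hypothetical.

```python
import itertools
from dataclasses import dataclass, field
from typing import List, Optional

_ids = itertools.count(1)  # simple ID source for this sketch


@dataclass
class Span:
    trace_id: int             # shared by all spans of one request
    span_id: int              # unique per unit of work
    parent_id: Optional[int]  # None for the root span
    name: str
    events: List[str] = field(default_factory=list)


def start_span(name, parent=None):
    """Create a span; child spans inherit the trace ID of their parent."""
    span_id = next(_ids)
    trace_id = parent.trace_id if parent else span_id
    parent_id = parent.span_id if parent else None
    return Span(trace_id, span_id, parent_id, name)


def spans_of(trace_id, spans):
    """Collect every span that belongs to one request."""
    return [s for s in spans if s.trace_id == trace_id]


# One request crossing two hypothetical services, plus an unrelated request.
root = start_span("frontend: GET /checkout")
root.events.append("request received")
child = start_span("payment: charge", parent=root)
child.events += ["start processing", "answer submitted"]
root.events.append("answer received")
other = start_span("frontend: GET /health")

collected = [root, child, other]
checkout_trace = spans_of(root.trace_id, collected)
```

In a real deployment the trace and parent identifiers would travel with the request between processes, for example in RPC metadata.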
Google dapper: designed for
• Continuous and ubiquitous tracing
• Low-overhead
• Application transparency
• Scalability
Dapper: security and privacy
Doesn’t collect any payload data
Can be used to enforce security policies
Such runtime verification provides greater assurance than source code audits
Managing overheads
Coalescing events
Asynchronous writes
Adaptive sampling at the application
Adaptive sampling at collection time
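Adaptive sampling can be sketched as a sampler that lowers its sampling probability when traffic is high and raises it when traffic is low, so that roughly a target number of traces is kept per time window. This is one possible scheme under that assumption, not Dapper's implementation; `AdaptiveSampler` and its parameters are hypothetical.

```python
import random


class AdaptiveSampler:
    """Keep roughly `target_per_window` traces per window (illustrative)."""

    def __init__(self, target_per_window, rng=None):
        self.target = target_per_window
        self.rate = 1.0  # start by sampling everything
        self.seen = 0    # requests observed in the current window
        self.rng = rng or random.Random()

    def should_sample(self):
        self.seen += 1
        return self.rng.random() < self.rate

    def end_window(self):
        """Recompute the rate from the traffic seen in the window just ended."""
        if self.seen > 0:
            self.rate = min(1.0, self.target / self.seen)
        self.seen = 0


sampler = AdaptiveSampler(target_per_window=10, rng=random.Random(0))

# Busy window: 1000 requests arrive while the rate is still 1.0.
kept_first = sum(sampler.should_sample() for _ in range(1000))
sampler.end_window()
rate_after_busy = sampler.rate  # 10 / 1000

# Next busy window is sampled at the reduced rate.
kept_second = sum(sampler.should_sample() for _ in range(1000))
sampler.end_window()

# Quiet window: only 5 requests, so the rate recovers to 1.0.
for _ in range(5):
    sampler.should_sample()
sampler.end_window()
rate_after_quiet = sampler.rate
```

The same idea applies at collection time: a collector can drop a fraction of already recorded traces when its incoming volume is too high.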