Making decisions: Log Diagnostics Flashcards
Situation
Teams weren’t tapping into the diagnostic potential of our log aggregation tool.
Task
Investigate the barriers to entry.
Action - Phase 1 (research)
- Conducted user research into how people used our log aggregation tool and the querying tools (Athena and QuickSight).
- Discovered developers knew how to use the tools but felt the querying tools were too slow.
- Based on the research, pursued a spike of debugging UI tools.
Action - Phase 2 (spike)
• Optimised for fast feedback loop and set-up speed.
• Had four options:
1. CloudWatch Logs Agent → CloudWatch → Lambda → AWS Elasticsearch → Kibana
2. Flume Agent (Log Extractor) → Elasticsearch → Kibana
3. Flume Agent (Log Extractor) → S3 → Lambda → AWS Elasticsearch → Kibana
4. Build an ELK stack from scratch using a subset of the data.
• Spiked option 3 because options 1 and 2 had version conflicts, and option 4 did not optimise for fast feedback or set-up speed.
• Implemented the spike and conducted follow-up user research on how developers interacted with Kibana.
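A minimal sketch of what the Lambda step in option 3 might look like, assuming Python and standard-library code only. The function names, the index name, and the single "message" field are hypothetical and are not taken from the original spike. It shows how an S3 notification event could be unpacked and how raw log lines could be turned into an Elasticsearch bulk request body. Fetching the object from S3 and posting the body to the cluster are deliberately left out.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Return (bucket, key) pairs from an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        pairs.append((s3["bucket"]["name"], key))
    return pairs


def build_bulk_payload(lines, index_name):
    """Turn raw log lines into an Elasticsearch bulk request body (NDJSON).

    Each document is preceded by an action line naming the target index.
    The "message" field is an assumption made for this sketch.
    """
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than indexing empty documents
        out.append(json.dumps({"index": {"_index": index_name}}))
        out.append(json.dumps({"message": line}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(out) + "\n" if out else ""
```

In a real handler, the lines would come from the S3 object named in the event, and the returned body would be sent to the cluster's bulk endpoint.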
Result
- Discovered that the lack of standardised logging across teams hindered log fidelity.
- Developers confirmed they would keep using ssh/grep until the log diagnostic tool was easier to use than ssh/grep.
- Proposed work to standardise logging on shared infrastructure.
- Stopped work on delivering an ELK-stack-based set of features.
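To illustrate what standardised logging could mean in practice, here is a small sketch using Python's standard logging module. The field names (timestamp, level, service, logger, message) and the JsonFormatter class are assumptions chosen for illustration, not the schema that was actually proposed. The idea is that every team emits one JSON object per line with the same keys, so a tool such as Kibana can filter on them consistently.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object.

    The set of keys is a hypothetical shared schema for this sketch.
    """

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": self.service,
            "logger": record.name,
            "message": record.getMessage(),
        })


def configure_logger(name, service):
    """Attach the JSON formatter to a stream handler on the named logger."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A team would call configure_logger once at start-up, for example configure_logger("payments.api", "payments"), and then log as usual.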
Reflection
- Strengths: user research, focus on fast feedback, pros/cons analysis.
- Differently: would have explored other log diagnostic tooling besides AWS Elasticsearch.
- Would have pushed harder to continue work on log standardisation.
- Would have created a milestone system for log diagnostics.