Making decisions: Log Diagnostics Flashcards

1
Q

Situation

A

Teams weren’t tapping into the diagnostic potential of our log aggregation tool.

2
Q

Task

A

Investigate the barriers to entry that kept teams from using the tool.

3
Q

Action - Phase 1 (research)

A
  • Conducted user research to discover how users used our log aggregation tool and the querying tools (Athena and QuickSight).
  • Discovered developers knew how to use the tools but felt the querying tools were too slow (a sketch of that querying workflow follows this list).
  • Based on this research, pursued a spike of debugging UI tools.
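
To make the "too slow" finding concrete, here is a minimal sketch of the querying workflow developers described, assuming logs were queried through Athena over S3; the database, table, bucket, and query are hypothetical placeholders, not the team's actual setup.

```python
# A minimal sketch, assuming logs are queried through Athena over S3 data;
# database, table, and result-bucket names are hypothetical placeholders.
import time
import boto3

athena = boto3.client("athena")

def run_log_query(query: str) -> list:
    """Start an Athena query, poll until it finishes, then return the result rows."""
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "app_logs"},  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # hypothetical bucket
    )
    query_id = execution["QueryExecutionId"]

    # Polling like this is the feedback loop developers found too slow for debugging.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]

rows = run_log_query(
    "SELECT timestamp, level, message FROM logs WHERE level = 'ERROR' ORDER BY timestamp DESC LIMIT 50"
)
```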
4
Q

Action - Phase 2 (spike)

A

• Optimised for a fast feedback loop and set-up speed.
• Had four options:
1. CloudWatch Logs Agent → CloudWatch → Lambda → AWS Elasticsearch → Kibana
2. Flume Agent (Log Extractor) → Elasticsearch → Kibana
3. Flume Agent (Log Extractor) → S3 → Lambda → AWS Elasticsearch → Kibana
4. Build an ELK stack from scratch from a subset of the data.
• Spiked option 3 because options 1 and 2 had version conflicts, and option 4 did not optimise for feedback or set-up speed (a sketch of the Lambda step follows this list).
• Implemented the spike and conducted follow-up user research on how developers interacted with Kibana.
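
To make option 3 concrete, here is a minimal sketch of the Lambda step, assuming an S3 PUT trigger on the Flume sink bucket and plain HTTP access to the Elasticsearch domain; the endpoint, index name, and bucket layout are assumptions, not the actual spike configuration.

```python
# A minimal sketch of the Lambda step in option 3 (S3 event -> index into Elasticsearch).
# The Elasticsearch endpoint, index name, and bucket layout are assumptions.
import json
import boto3
import requests  # plain HTTP here; the real spike may have needed a signed/managed client

s3 = boto3.client("s3")
ES_ENDPOINT = "https://example-es-domain.eu-west-1.es.amazonaws.com"  # hypothetical domain

def handler(event, context):
    """Triggered by S3 PUT events from the Flume sink; indexes each log line into Elasticsearch."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Build a bulk request: one action line plus one document line per log entry.
        bulk_lines = []
        for line in body.splitlines():
            if not line.strip():
                continue
            bulk_lines.append(json.dumps({"index": {"_index": "app-logs"}}))
            bulk_lines.append(json.dumps({"raw_message": line, "source_key": key}))

        response = requests.post(
            f"{ES_ENDPOINT}/_bulk",
            data="\n".join(bulk_lines) + "\n",
            headers={"Content-Type": "application/x-ndjson"},
        )
        response.raise_for_status()
```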

5
Q

Result

A
  • Discovered that the lack of standardised logging across teams hindered log fidelity.
  • Developers confirmed they would keep using ssh and grep until the log diagnostic tooling was easier than that workflow.
  • Proposed work to standardise logging on shared infrastructure (a sketch of what that could look like follows this list).
  • Stopped work on delivering an ELK-stack-based set of features.
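
For context on the standardisation proposal, here is a minimal sketch of what consistent, structured logging across teams could look like; the field names and service label are illustrative assumptions, not the schema that was proposed.

```python
# A minimal sketch of standardised, structured logging; field names and the
# service label are illustrative, not an agreed schema.
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit every record as one JSON object so fields stay consistent across teams."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "orders-service",  # hypothetical service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment accepted")  # -> {"timestamp": "...", "level": "INFO", "service": "...", "message": "..."}
```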
6
Q

Reflection

A
  • Strengths: user research, focus on fast feedback, pros/cons analysis.
  • Would do differently: explore other log diagnostic tooling besides AWS Elasticsearch.
  • Would push harder to continue the log standardisation work.
  • Would create a milestone system for log diagnostics.