Troubleshooting Flashcards
Users are complaining about slow response times for a critical application. Walk us through your approach to diagnosing the source of latency.
- Gather Initial Information (patterns, changes, consistency)
- Define Performance Baseline
- Network Analysis (packet loss, bandwidth utilisation; use ping or traceroute)
- Server Health Check (utilisation)
- Database Analysis (utilisation, queries, index usage)
- Application Profiling (inefficient code, memory leaks)
- Application Dependencies (changes)
- Application Logs (errors, tracing)
- Web Server Analysis (logs, response times, load)
- Load Balancer Examination (configuration, performance)
- Client-Side Investigation (browser compatibility)
- Performance Monitoring (utilisation, latency, tracing, load testing)
- Security (firewalls)
- Comparative Analysis (normal vs. slow, patterns)
- Collaboration (dev, DB, sys admins)
- Testing and Validation (test hypothesis)
- Communication and Resolution (stakeholders)
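
The baseline and comparative-analysis steps above can be made concrete with a small calculation. The following is a minimal sketch, not a prescribed method: it assumes latency samples are already collected in milliseconds, and the function names, the nearest-rank percentile, the sample values, and the 1.5x tolerance are all illustrative assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies (ms)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_regressed(baseline_ms, current_ms, pct=95, tolerance=1.5):
    """True if the current percentile exceeds tolerance x the baseline percentile."""
    return percentile(current_ms, pct) > tolerance * percentile(baseline_ms, pct)


# Hypothetical samples: a normal period versus a period users reported as slow.
baseline = [110, 120, 125, 130, 140, 150, 155, 160, 170, 180]
current = [115, 130, 200, 240, 310, 420, 480, 510, 640, 900]

print("baseline p95:", percentile(baseline, 95))
print("current p95:", percentile(current, 95))
print("regressed:", latency_regressed(baseline, current))
```

Comparing a high percentile rather than the mean is a deliberate choice here, since a handful of very slow requests can be hidden by an average.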
A database serving an essential application goes down unexpectedly. How would you handle this incident? Describe the steps you’d take to bring the database back online while minimizing data loss and service disruption.
- Initiate Incident Response Process
- Communicate with Stakeholders
- Assess Impact and Scope
- Isolate Cause (logs, metrics)
- Implement Immediate Fixes (patch, unblock bottleneck, restart service)
- Restore from Backups (use backup/restore plan)
- Data Recovery (perform point-in-time recovery)
- Testing and Verification
- Monitor and Stabilise
- Identify Preventive Measures (post incident retro, comms, documentation)
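
For the restore and point-in-time recovery steps above, the first decision is which backup to start from. The sketch below is only an illustration under stated assumptions: the backup timestamps, the recovery target, and the helper names `pick_backup` and `replay_window` are hypothetical, and the actual restore and log replay would be done with the database's own tooling.

```python
from datetime import datetime


def pick_backup(backup_times, target):
    """Return the most recent backup taken at or before target, or None."""
    candidates = [t for t in backup_times if t <= target]
    return max(candidates) if candidates else None


def replay_window(backup_times, target):
    """Return (start, end) of the transaction-log range to replay after restore."""
    backup = pick_backup(backup_times, target)
    if backup is None:
        raise ValueError("no backup precedes the recovery target")
    return backup, target


# Hypothetical nightly backups and a recovery target just before the outage.
backups = [
    datetime(2024, 1, 1, 2, 0),
    datetime(2024, 1, 2, 2, 0),
    datetime(2024, 1, 3, 2, 0),
]
target = datetime(2024, 1, 2, 14, 30)

start, end = replay_window(backups, target)
print("restore backup from:", start)
print("replay logs until:", end)
```

The length of the replay window gives a rough sense of how long recovery may take, which is useful when updating stakeholders.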
The application is experiencing an increase in HTTP 500 internal server errors. Outline your process for investigating and resolving these errors, including the possible factors you’d consider and the strategies you’d employ to mitigate the issue.
- Initial Assessment
- Monitoring and Alerting
- Error Logs Analysis
- Identify Patterns
- Code Review
- Database Inspection
- Infrastructure Assessment
- Third-Party Services
- Server Configuration
- Testing and Reproduction
- Rollback Recent Changes
- Code Debugging
- Error Handling and Logging
- Load and Performance Testing
- Bug Fixing and Code Deployment
- Communication
- Post-Incident Review
- Documentation
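
The log-analysis and pattern-identification steps above often start with counting which endpoints return 500s. This is a minimal sketch assuming access logs in a Common Log Format style; the regular expression, the sample lines, and the function name `count_500s_by_path` are illustrative assumptions, and real log formats may differ.

```python
import re
from collections import Counter

# Matches the quoted request and the status code, e.g. "GET /checkout HTTP/1.1" 500
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3})')


def count_500s_by_path(lines):
    """Count HTTP 500 responses per request path; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status") == "500":
            counts[match.group("path")] += 1
    return counts


# Hypothetical log lines for illustration only.
sample_log = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /checkout HTTP/1.1" 500 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET /home HTTP/1.1" 200 1024',
    '10.0.0.3 - - [01/Jan/2024:10:00:02 +0000] "POST /checkout HTTP/1.1" 500 512',
    '10.0.0.4 - - [01/Jan/2024:10:00:03 +0000] "POST /login HTTP/1.1" 500 256',
]

for path, count in count_500s_by_path(sample_log).most_common():
    print(path, count)
```

If the errors cluster on one path, that narrows the code review and rollback candidates; if they are spread evenly, infrastructure or a shared dependency becomes the more likely suspect.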
Users are reporting intermittent connectivity issues, and you suspect a misconfiguration in the load balancer settings. Describe how you would verify the load balancer configuration, identify any misconfigurations, and rectify the issue to restore proper traffic distribution.
- Gather Information
- Logging and Monitoring
- Access Load Balancer Configuration
- Review Load Balancer Configuration
- Check Health Checks
- Session Persistence (misconfigured session persistence can lead to uneven distribution of traffic)
- Connection Limits and Timeouts
- Protocol and Port Settings
- Compare with Best Practices
- Network Topology and Routing
- Backup Configuration
- Rectify Misconfigurations
- Testing
- Verification and Validation
- User Feedback
- Post-Incident Review
- Documentation
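
One way to check for the uneven traffic distribution mentioned above is to compare per-backend request counts against an even split. The sketch below assumes equally weighted backends and request counts already pulled from load balancer metrics; the backend names, the counts, the 50% tolerance, and the function name `find_skewed_backends` are hypothetical.

```python
def find_skewed_backends(request_counts, tolerance=0.5):
    """Return backends whose count deviates from an even share by more than
    tolerance (as a fraction of the even share). Assumes equal weights."""
    total = sum(request_counts.values())
    if not request_counts or total == 0:
        return {}
    expected = total / len(request_counts)
    return {
        name: count
        for name, count in request_counts.items()
        if abs(count - expected) > tolerance * expected
    }


# Hypothetical per-backend request counts over the same time window.
counts = {"app-1": 340, "app-2": 330, "app-3": 30, "app-4": 300}

print(find_skewed_backends(counts))
```

A backend receiving far less traffic than its peers may be failing health checks intermittently, while one receiving far more may point at session persistence pinning clients to it.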
One of the microservices in a distributed application is exhibiting a memory leak, causing it to gradually consume more memory over time. How would you troubleshoot this issue, identify the service with the leak, and implement a solution to prevent further memory consumption?
- Gather Information
- Examine Monitoring and Logs (memory usage, garbage collection, heap utilisation)
- Analyse Memory Dumps (capture dumps at intervals while the leak is suspected, then compare them)
- Identify the Leaking Code (inefficient memory management practices, unclosed resources, excessive object creation)
- Analyse Dependencies
- Memory Profiling (identify memory hotspots, anything consuming excessive memory)
- Heap Analysis (visualise the memory usage patterns)
- Testing and Isolation
- Fix the Code
- Retest and Validate
- Post-Incident Review
- Documentation