Resilience Engineer Interview Flashcards
How do you decide whether to build a custom solution or use an existing SaaS tool?
Evaluate cost, team expertise, time to market, scalability, and maintenance burden. At Eficens, I chose Sage Intacct over a custom Spring Boot build because it offered faster integration, compliance readiness, and better use of team resources.
Describe a scenario where you integrated legacy systems with new cloud-native solutions.
Integrated an on-prem HRIS with Azure AD using Terraform and Python middleware. Deployed changes incrementally with feature flags and achieved seamless SSO without service disruption.
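The incremental rollout with feature flags can be illustrated with a minimal sketch. It assumes a percentage-based flag keyed on a user ID; the function name `is_enabled`, the flag name, and the hashing scheme are hypothetical and not taken from the original project.

```python
import hashlib


def is_enabled(flag_name, user_id, rollout_percent):
    """Return True if this user falls inside the flag's rollout percentage.

    Hypothetical helper: hashing flag name + user ID gives each user a
    stable bucket, so raising the percentage only ever adds users.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < rollout_percent
```

Because the bucket is derived from a hash rather than a random draw, a user who gets the new SSO path at 10% keeps it when the rollout moves to 50%.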
Describe your experience automating data flows using APIs or scripting.
Used AWS Lambda and Python to automate pulling logs from CloudTrail and Security Hub into Elasticsearch for real-time analysis. This eliminated manual log processing entirely.
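One step of such a pipeline, sketched under assumptions: turning a CloudTrail log payload (a JSON object with a `Records` list) into a newline-delimited body for the Elasticsearch `_bulk` API. The function name `to_bulk_body`, the index name `cloudtrail-logs`, and the choice of fields are illustrative, not the original implementation.

```python
import json


def to_bulk_body(cloudtrail_payload, index_name="cloudtrail-logs"):
    """Convert a CloudTrail log payload into an Elasticsearch _bulk request body.

    The index name and the selected fields are assumptions for illustration.
    """
    lines = []
    for record in cloudtrail_payload.get("Records", []):
        # Action line: index the next document into the target index.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: keep a few fields that are useful for analysis.
        lines.append(json.dumps({
            "@timestamp": record.get("eventTime"),
            "event_name": record.get("eventName"),
            "event_source": record.get("eventSource"),
            "aws_region": record.get("awsRegion"),
        }))
    # The bulk API expects newline-delimited JSON that ends with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

A Lambda handler would read the log file from its trigger, call a function like this, and send the result to the cluster's `_bulk` endpoint.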
How have you contributed to system resilience and fault tolerance?
Containerized services with Docker, deployed them on AWS Lambda, and implemented retries with exponential backoff. Added Splunk alerts for failures. Result: 99.9% uptime and 60% faster recovery time.
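Retries with exponential backoff can be sketched as follows. This is a minimal illustration, not the production code: the function name `retry_with_backoff`, its default values, and the use of full jitter are assumptions.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(); on an exception, wait and retry with exponential backoff.

    All defaults are illustrative. `sleep` is injectable so the wait can be
    replaced in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time in [0, ceiling] so that many
            # clients failing together do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

In real use, catch only the exception types known to be transient (timeouts, throttling) instead of every `Exception`, so that genuine bugs fail fast.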
How have you used PostgreSQL, Elasticsearch, or Snowflake in your work?
Used PostgreSQL for microservice data with optimized queries and indexing, and Elasticsearch for log analysis in an AWS SOC project. No direct experience with Snowflake, but a strong SQL/ETL background.
How do you ensure backward compatibility in system changes?
Use feature flags, schema versioning, regression tests, and sandbox testing, then deploy with a blue-green or canary strategy to avoid breaking changes.
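Schema versioning can be illustrated with a small reader that accepts old and new record shapes. The record layout here (a v1 `name` field split into `first_name` and `last_name` in v2) and the function name `normalize_user` are hypothetical examples.

```python
def normalize_user(record):
    """Upgrade a user record of any known schema version to the latest shape.

    The field names and version numbers are hypothetical.
    """
    version = record.get("schema_version", 1)  # unversioned records are treated as v1
    if version == 1:
        # v1 stored a single "name" field; v2 splits it at the first space.
        first, _, last = record.get("name", "").partition(" ")
        return {"schema_version": 2, "first_name": first, "last_name": last}
    if version == 2:
        return dict(record)  # already current; return a copy
    raise ValueError(f"unsupported schema_version: {version}")
```

Old producers can keep writing v1 records while consumers move to v2, which is what keeps the change backward compatible.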
Tell me about a time you worked with non-technical stakeholders.
Worked with HR to simplify IAM role assignment. Built a Flask-based web GUI that managed AWS IAM roles via Python scripts. Reduced IT dependency by 70% and cut provisioning time from 2 days to 1 hour.
Describe a time you had to manage competing priorities.
Balanced backend optimization against a compliance deadline at TCS. Prioritized based on business impact, split tasks across sprints, and maintained open communication. Delivered both with minimal delay.
If your MVP solution starts failing in production, how would you handle it?
Diagnose with logs (Splunk), isolate the issue, roll back or retry, notify stakeholders, and write a postmortem with long-term fixes.
What’s your approach to identifying and clearing technical roadblocks?
Analyze performance metrics, use tracing (like AWS X-Ray), refactor for efficiency, consult documentation, and validate fixes with stress testing. Example: resolved a Lambda memory bottleneck by batching work and refactoring code.
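Batching to limit memory use can be sketched as a generator that hands out fixed-size chunks. The helper name `batched` and the batch size in the usage note are illustrative; the original Lambda code is not shown in these notes.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield lists of up to batch_size items from any iterable.

    Only one batch is held in memory at a time, which is the point when the
    full input would not fit comfortably in a Lambda's memory.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch
```

For example, `for chunk in batched(records, 500): process(chunk)` handles 500 records at a time, where `records` and `process` stand in for the real input and handler.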