Tell me about a time: tech Flashcards
Tell me about a time you worked with observability – especially a time when you had to improve the observability of a system.
S: The core component of our wifi authentication service, the open-source FreeRADIUS server, lacked visibility and had only minimal diagnostic tools. When authentications failed, we had little to no insight into why: we could not trace where the request originated or identify the failure reason.
T: Establish how important this was to our end users and stakeholders, what could be done to improve the observability of this component (esp. given it’s a niche tool in a niche language), and what value improving the observability would deliver.
A: Spoke with the PM to figure out the first and third points. Auth failures occurred infrequently but took a week to resolve and detracted from feature development. Researched monitoring and logging options in the FreeRADIUS community. Identified a solution developed by eduroam which we could tweak for our configuration. The effort exceeded the value, so the work was deprioritised. Used 10% time to develop a solution: building our FreeRADIUS server locally to test the eduroam configuration. Spiked the implementation, documented the approach, and gave a talk on the solution.
R: Team had the tools, the solution, and the documentation to move the work forward when it was prioritised.
Tell us about a time you had to debug a severe problem in a production environment
S: We started receiving user support tickets about problems authenticating with GovWifi. We hadn't introduced a change to the FreeRADIUS server, so the cause was a mystery. FreeRADIUS was a black-box component: we had little to no visibility into why users couldn't authenticate.
T: Triage the situation, declare an incident, diagnose root cause, and implement a fix.
A: I was on support and it immediately concerned me that multiple public sector buildings were reporting failures. I flagged this with our product manager and tech lead and we declared an incident. In my 10% time I'd created template CloudWatch Logs Insights queries for situations like this (a sketch of the kind of query follows this card) and noticed a pattern in the FreeRADIUS logs which indicated the issue was with the version of FreeRADIUS we were running. I cross-checked the timestamps of those log lines against when I knew our automatic restarter would restart the FreeRADIUS servers from the Dockerfile. I ssh'd onto a running FreeRADIUS server to check the FreeRADIUS version, then reviewed the FreeRADIUS changelog. There had been a breaking change in version 3.0.25, which we were pulling in because we didn't pin the FreeRADIUS version in the Dockerfile.
R: We pinned the FreeRADIUS version in the Dockerfile and rolled out the change.
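A minimal sketch of what a templated CloudWatch Logs Insights query looks like when run from Python with boto3, which is how I keep this kind of triage repeatable. The log group name and the filter pattern are placeholders, not the real GovWifi values.

```python
# Run a Logs Insights query template against the RADIUS log group and print the rows.
import time
import boto3

logs = boto3.client("logs")

QUERY = """
fields @timestamp, @message
| filter @message like /Access-Reject/
| sort @timestamp desc
| limit 50
"""

resp = logs.start_query(
    logGroupName="/govwifi/frontend/radius",  # placeholder log group name
    startTime=int(time.time()) - 3600,        # last hour
    endTime=int(time.time()),
    queryString=QUERY,
)

# Poll until the query finishes, then print each result row as a dict.
while True:
    result = logs.get_query_results(queryId=resp["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```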
Tell me about a time you were on call, how you approach this in general
S: A robo-marketing SMS system discovered our automated SMS help phone number, and the two automated systems got into a vicious text loop which was costing us a lot of money (by government standards).
T: Liaise with the Notify team (in charge of SMS for GovWifi), who notified us of the issue. Take on the role of incident lead and delegate tasks to the relevant people.
A: Declared an incident and started the incident documentation and a group call with the PM, a Notify team rep, and developers. Delegated stakeholder comms to the PM, who used our comms system to send a service update. Delegated diagnostics to a developer pair. Once we identified the root cause, we implemented a hotfix.
R: Fixed issue in under two hours. Rolled out permanent fix the next day.
My goal during incidents is to get the service back to a healthy state as soon as possible and to communicate with stakeholders. I prefer fixes which return the service to a healthy state and can then be iterated on. My process for on-call is: triage, initiate the incident, discovery/diagnosis, test/validate, implement a fix or roll back, monitor. Document and communicate throughout.
I’m best at leading an incident: making high-level decisions, leading comms, contacting/inviting the right people, pointing people in the right direction, and keeping a log of the incident.
Tell me about a time when you’ve had a tricky technical decision to make, how you approached this
Sit: As part of GovWifi, the wifi authentication team, I needed to develop availability SLXs for a key stakeholder. We had an existing availability SLA when I joined, but something didn’t feel right about it.
Task: Assess our current availability SLX, ascertain whether it was fit for purpose, and if not, develop a new one.
Act: Deep dive into CloudWatch metrics, SLX formulas, and how the FreeRADIUS wifi authentication server worked. Discovered my hunch was correct: the SLX formula was not fit for purpose. Needed more data. Spiked Prometheus and a FreeRADIUS Prometheus exporter to get more accurate data about requests. Discovered that implementing a traditional error-budget formula would take a lot of effort (a sketch of the kind of formula involved follows this card). Had a choice: carry on or step back. I stepped back. Returned to the stakeholders to dig deeper into their SLX requirement and realised we could use an existing metric.
Result: Created an SLA which I and the rest of the team felt confident in, saved the team at least 172 hours of work, and ended up rolling back the Prometheus spike.
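To illustrate the kind of error-budget arithmetic the spike was heading towards, here is a minimal request-based availability sketch. The numbers are invented for the example and are not GovWifi figures.

```python
# Request-based availability SLO and error budget, with made-up numbers.
slo_target = 0.999                  # e.g. 99.9% of auth requests should succeed
total_requests = 10_000_000         # requests in the SLO window
failed_requests = 7_200             # server-side failures in the same window

availability = 1 - failed_requests / total_requests
error_budget = (1 - slo_target) * total_requests   # failures we can "afford"
budget_remaining = error_budget - failed_requests

print(f"availability:     {availability:.4%}")            # 99.9280%
print(f"error budget:     {error_budget:,.0f} requests")  # 10,000
print(f"budget remaining: {budget_remaining:,.0f} requests")
```

Getting trustworthy numbers for total_requests and failed_requests was exactly the expensive part, which is why stepping back to an existing metric was the better call.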
What has your approach been to automation – when have you decided to automate processes, why, and what’s the impact been? When have you chosen not to?
Sit: At Unruly our off-boarding process used to take around 4 days, with lots of manual steps. While resignations were infrequent we could absorb this cost into our prioritisation, but the company went through a big cultural change and resignations jumped to 1-2 every week. It was pulling time away from priority work.
Task: Automate the off-boarding process and reduce the time to 1 day at most.
Act: I was at an advantage because I practise “leave the campsite better than you found it”, so for every resignation before the Great Leaving we’d documented the manual steps. I took this documentation and turned it into Python and/or Bash scripts (Python for third-party APIs, Bash for OS changes; a sketch of one such step follows this card). It was a tough decision, but I advocated for improving the process one resignation at a time instead of redirecting engineering effort to improving it all at once: we’d off-board someone and then spend a day improving the process.
Result: I reduced the off-boarding time to half a day.
General: I automate wherever possible (especially in 10% time, or when I join a team, to help learn about a system). However, there’s an effort-versus-value calculation to be made: if the effort to automate exceeds the value delivered, the work has to be prioritised alongside everything else. On/off-boarding on the GovWifi team also took 1-2 days, but resignations were so infrequent it didn’t make sense to automate the process relative to other priorities.
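A hedged sketch of one automated off-boarding step of the kind those scripts handled: deactivating a leaver’s account in a third-party tool over its REST API. The endpoint, token, and field names are illustrative placeholders, not the real services we used.

```python
# Deactivate a leaver's account in a (hypothetical) third-party SaaS tool.
import os
import sys

import requests

API_BASE = "https://api.example-saas.com/v1"   # placeholder API
TOKEN = os.environ["SAAS_API_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}


def deactivate_user(email: str) -> None:
    # Look the user up by email, then mark them inactive.
    resp = requests.get(f"{API_BASE}/users", params={"email": email},
                        headers=HEADERS, timeout=10)
    resp.raise_for_status()
    users = resp.json().get("users", [])
    if not users:
        print(f"no account found for {email}, skipping")
        return
    user_id = users[0]["id"]
    resp = requests.patch(f"{API_BASE}/users/{user_id}", json={"active": False},
                          headers=HEADERS, timeout=10)
    resp.raise_for_status()
    print(f"deactivated {email}")


if __name__ == "__main__":
    deactivate_user(sys.argv[1])
```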
Tell me about a time you’ve had to introduce an unpopular or difficult change with a big impact radius
Sit: Cabinet Office Digital (COD) senior management had a goal to align infrastructure tooling across COD teams – an admirable and necessary goal. It meant switching between teams would require less upskilling on new tools and could potentially reduce costs.
Task: Migrate our self-managed, self-hosted Concourse CI/CD pipelines to AWS CodePipeline and GitHub Actions AND convince the team to follow a decision made for them rather than by them.
Act: Organised sessions between senior managers and the team to discuss the problem, ask questions, raise concerns, and give the team a sense of contribution to the decision. Demonstrated the various tech stacks across COD and asked SREs to visualise joining another team and the upfront work needed to upskill.
Then the migration. I led brainstorming and roadmap sessions to figure out the path. I advocated for tackling the highest-risk, highest-priority pipeline first so we could discover quickly whether the migration was feasible. I proposed a rotating pair system so every engineer on the team would be exposed to the migration process, learn the tooling, and contribute to the migration. I Terraformed the new pipeline and problem-solved the differences between Concourse and AWS (Concourse handles complicated workflows better, in my opinion; in CodePipeline you need to add Lambdas and Step Functions – a sketch follows this card). We kept our Concourse pipelines alive while migrating so we always had a way to fall back on them.
Result: We migrated our CI/CD platform but then I gave birth! I wasn’t there for the final decommissioning.
Reflect: It took longer than anticipated. Communication with delivery and stakeholders broke down a bit. I wish we’d asked for pairing help from another team who’d completed their migration.
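An example of the “Lambda glue” point above: custom steps that Concourse expressed natively in its pipeline config need a Lambda invoke action in CodePipeline, and that Lambda has to report its result back to the pipeline. The event shape and the put_job_* calls are standard CodePipeline; the custom check itself is a placeholder.

```python
# Lambda handler used as a custom action in an AWS CodePipeline stage.
import boto3

codepipeline = boto3.client("codepipeline")


def run_custom_checks() -> bool:
    # Placeholder gate, e.g. "is the staging smoke test green?"
    return True


def handler(event, context):
    job_id = event["CodePipeline.job"]["id"]   # CodePipeline passes the job in the event
    try:
        if run_custom_checks():
            codepipeline.put_job_success_result(jobId=job_id)
        else:
            codepipeline.put_job_failure_result(
                jobId=job_id,
                failureDetails={"type": "JobFailed", "message": "custom checks failed"},
            )
    except Exception as exc:
        codepipeline.put_job_failure_result(
            jobId=job_id,
            failureDetails={"type": "JobFailed", "message": str(exc)},
        )
```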
Tell me about a time you’ve broken something in production
Sit: I can’t remember a specific example, which I think speaks to a few things: 1) I use small, frequent commits that turn into small, frequent releases; 2) I’ve worked in places where we either relied on feature toggles or A/B testing, or had a robust staging environment to catch errors. I’m sure I have broken things, but nothing large enough to remember. I think I once disabled a button when I meant to reroute it to another path.
My approach to production breaks is to roll back ASAP. Small, frequent commits/releases and robust automated pipelines make this a lot easier.
How you approach measurement and metrics
General: Work with stakeholders and the team to establish and prioritise non-functional requirements. What are we optimising for? Can we measure indicators with existing tools? Is our product/system complex enough to require monitoring AND observability, or just monitoring? What’s the maturity level and/or experience of using measurements/metrics on the team? Any pain points or incident scars which would influence people’s perspective? Any habits people feel attached to? Any skills gaps? Anything people feel vulnerable or ashamed about?
At a minimum I want to implement structured logging, some type of traceability, and a way to visualise system health and data (see the sketch after the tools list below). I also like to survey the team before we implement changes so I can compare how people felt before and after. If the system is complex enough, it’s worth investing in structured events, distributed tracing, and ensuring events emit telemetry.
I believe in experimenting (take a short before/after survey), letting go of an approach that doesn’t work, making change easy, getting everyone on board, disagreeing and committing, and seeking first to understand and then to be understood.
The tools I’ve used: Grafana, Graphite, Splunk, Elastic, Log.io, CloudWatch, Prometheus.
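As a concrete example of the “at a minimum” list above, here is a minimal structured-logging sketch in Python: every log line is a single JSON object so it can be queried in CloudWatch Logs Insights, Splunk, or Elastic. The field names are illustrative.

```python
# Emit JSON-structured log lines to stdout.
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge in any structured context passed via `extra={"context": {...}}`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("authentication failed",
            extra={"context": {"site": "building-42", "reason": "unknown-user"}})
```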
Tell me about an experience upgrading or migrating a complex system (e.g., OS upgrades, on-prem to cloud, tooling changes)
Sit: At Unruly we needed to migrate all our systems to CentOS 7, since CentOS 6 was end of life. This would impact all workstations and pairs. It was a big task. The previous team had attempted the work but struggled to communicate deadlines, their approach, and the slow pace, and trust between the team and stakeholders had suffered.
Task: Mend the relationship between the team and stakeholders. Figure out what had worked and what hadn’t. Use those lessons to start from scratch. Migrate to CentOS 7.
Act: I met 1-1 with the stakeholders to listen to their concerns and get a sense of their expectations. We also held a postmortem on the work stream. From those conversations I created a communication strategy which included a funny visual updated daily, a weekly email update, and bi-weekly planning check-ins. As co-lead I advocated for committing to the upfront cost of delving into our Puppet code and its dependencies in order to build an iterative migration model (I still have the model if you’re curious; a sketch of the dependency-ordering idea follows this card).
Result: We migrated to CentOS 7 in line with our deadlines and with minimal disruption to teams’ workflows. Stakeholders fed back that the communication strategy was really appreciated and that they felt trust had been repaired.
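A sketch of the dependency-ordering idea behind the iterative model: extract which Puppet-managed roles depend on which, then derive a migration order so low-level building blocks move to CentOS 7 before the things built on top of them. The role names and dependencies here are invented for illustration.

```python
# Derive a migration order from (illustrative) role dependencies.
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# role -> set of roles it depends on
dependencies = {
    "base-os": set(),
    "monitoring-agent": {"base-os"},
    "ci-worker": {"base-os", "monitoring-agent"},
    "dev-workstation": {"base-os", "monitoring-agent"},
}

migration_order = list(TopologicalSorter(dependencies).static_order())
print(migration_order)
# e.g. ['base-os', 'monitoring-agent', 'ci-worker', 'dev-workstation']
```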
Tell me about a time you’ve worked with complex legacy code
Sit: At Unruly when I was a software engineer I worked on a complex legacy Java codebase (a monolith). It had over 800 unit tests, took 5 minutes to deploy, and was very, very difficult to introduce changes to.
Task: Get familiar with the code base and work to improve it.
Act: I started by diagramming the project structure and the various parts of the code, taking special note of the connections between them. I created a “dictionary” in a spreadsheet of all the unfamiliar ad-tech jargon and its meanings. I asked for extra pairing time with members of the team to explain anything I didn’t understand. While pairing I’d note areas of the codebase I could refactor (starting REALLY small), and on my 10% days I’d spend time refactoring. We had an existing “improvement plan” which we never made time for as a team, so I created a process for tracking improvement cards with a visual goal on the board (cookies). I also advocated for splitting improvement cards into smaller, more manageable pieces in order to progress the work. Something is better than nothing.
Result: We went from completing one improvement card every 2 weeks to completing 4. I became relatively confident in the codebase, and I learned how to refactor in Java and apply fancy new Java 8 techniques.
Tell me about a time you’ve improved performance or optimised a system
Sit: At Unruly our metrics aggregator, Graphite, was getting slow and clunky; it couldn’t handle the load.
Task: Develop a solution to handle the increased load and improve performance.
Act: It was a classic vertical-vs-horizontal scaling question. I advocated for horizontal scaling since availability was the priority. I spiked vertically scaling the machine, but that wouldn’t solve our availability problem, had cost implications, and didn’t deliver the performance gains I’d hoped for. I re-architected the Graphite setup by introducing a network load balancer and carbon-c-relay to handle the increased load and enable horizontal scaling (a sketch of the client side follows this card).
Result: Our system was horizontally scaled and could now handle the increased load.
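A small sketch of what the client side looks like after that re-architecture: application hosts speak Graphite’s plaintext protocol (port 2003) to the load-balanced relay endpoint rather than to a single Graphite box, so carbon nodes can be added behind the relay without client changes. The hostname and metric path are placeholders.

```python
# Send one metric to Graphite via the load-balanced carbon-c-relay endpoint.
import socket
import time

RELAY_HOST = "graphite-relay.internal.example"  # NLB in front of carbon-c-relay
RELAY_PORT = 2003                               # Graphite plaintext protocol


def send_metric(path, value, timestamp=None):
    timestamp = timestamp or int(time.time())
    line = f"{path} {value} {timestamp}\n"
    with socket.create_connection((RELAY_HOST, RELAY_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


send_metric("unruly.metrics.example.count", 42)  # illustrative metric path
```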