Week 10 - Postmortem Examples Flashcards
Give two examples of companies that have published postmortems publicly.
GitLab and Monzo
Talk through Monzo’s public postmortem.
Monzo's microservices store their data in a shared central NoSQL database (Cassandra)
Cassandra replicates each piece of data across three of the cluster's 21 servers, chosen by partition key; any server in the cluster can translate a partition key to the same three servers to locate the data (see the sketch after this answer)
During peak load, the Cassandra cluster was running closer to its limits than Monzo would like - the plan was to increase the cluster size to spread the load across more servers
During the incident, Monzo customers were unable to log in, see balances and transactions, send or receive payments, withdraw cash, or get in touch via in-app chat or by phone
The scaling work began with the addition of six new servers to the Cassandra cluster; while this was underway, automated systems detected a Mastercard issue and the correct teams were informed
Customer operations then got in touch with reports of customers having problems with the app
An incident was declared and on-call engineers began investigating
The payments team found a small error in how the Mastercard service handled a particular error case, and work started on a fix
Engineers next noticed 404 responses from internal services
The public status page was updated to let customers know
The Mastercard fix was deployed
The 404s were traced back to Cassandra, and after investigation the new servers were decommissioned one by one as the fix
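To make the partition-key idea concrete, here is a minimal Python sketch of how every server can deterministically map a key to the same replica set. It is an illustration under simplified assumptions (a plain hash modulo the server count, with consecutive replicas); real Cassandra uses Murmur3 partitioner tokens and a configurable replication strategy. Only the server count (21) and replication factor (3) come from the postmortem; the key names are made up.

```python
import hashlib

# Simplified illustration only: real Cassandra uses Murmur3 tokens and a
# replication strategy, not a plain hash-modulo ring like this.
NUM_SERVERS = 21         # cluster size mentioned in the postmortem
REPLICATION_FACTOR = 3   # each partition lives on three servers

def replicas_for(partition_key):
    """Return the three server indices responsible for a partition key.

    The mapping is a pure function of the key, so every server computes
    the same answer and can route a request without a central coordinator.
    """
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    first = token % NUM_SERVERS
    return [(first + i) % NUM_SERVERS for i in range(REPLICATION_FACTOR)]

# Any server (or client) agrees on where a key lives:
print(replicas_for("account:1234"))   # hypothetical key, e.g. [5, 6, 7]
print(replicas_for("account:1234"))   # same three servers every time
```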
Talk through GitLab’s public postmortem.
At the time of the incident, GitLab used a single PostgreSQL primary database and a single PostgreSQL secondary database for failover purposes
The issue happened while an engineer was setting up multiple PostgreSQL servers in the staging environment; prior to starting work, the engineer took an LVM snapshot of the production database to load into staging
GitLab started experiencing increased database load. One of the problems this caused was that users were unable to post comments on issues and merge requests
Part of the increased load was spam, and part was a deletion process removing a GitLab employee and their associated data after they had been reported for abuse by a troll
Due to the increased load, the secondary database's replication began to lag behind the primary. Eventually replication failed outright, because WAL segments the secondary still needed had already been removed by the primary (see the sketch after this answer)
Manually resynchronising replication involved removing the data directory on the secondary so a fresh copy could be taken from the primary, but the engineer removed the data directory on the primary by mistake, wiping around 300 GB of production data
The regular database backups also turned out to be empty: the backup job had been running pg_dump with a mismatched PostgreSQL version, so it silently produced nothing
The decision was taken to recover by copying back the earlier LVM snapshot, and all IDs were increased by 100,000 to avoid reuse problems
This took around 18 hours!
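To illustrate why a lagging secondary can stop being recoverable through normal replication, here is a toy Python model of the failure mode: the primary retains only a bounded window of recent WAL segments (compare PostgreSQL's wal_keep_segments setting), so once the secondary falls further behind than that window, the segment it needs has been recycled and only a full resynchronisation will do. The numbers and names below are made up for illustration; this is not how PostgreSQL is actually implemented.

```python
from collections import deque

WAL_KEEP_SEGMENTS = 4   # how many old WAL segments the primary retains (toy value)

# The primary keeps only the most recent segments; older ones are recycled.
primary_wal = deque(maxlen=WAL_KEEP_SEGMENTS)
secondary_applied = 0   # id of the last segment the secondary has replayed

def primary_write(segment_id):
    """Primary produces a new WAL segment; segments beyond the window fall away."""
    primary_wal.append(segment_id)

def secondary_catch_up():
    """Secondary requests the next segment it needs; fails if it was recycled."""
    global secondary_applied
    needed = secondary_applied + 1
    if needed not in primary_wal:
        raise RuntimeError(
            f"WAL segment {needed} already recycled on the primary: "
            "replication is broken and a full resync (empty data directory "
            "on the secondary, fresh copy from the primary) is needed"
        )
    secondary_applied = needed

# Under heavy load the primary races ahead while the secondary stalls:
for segment in range(1, 10):
    primary_write(segment)

try:
    secondary_catch_up()
except RuntimeError as error:
    print(error)   # the needed segment is long gone
```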