Week 10 - Postmortem Examples Flashcards

1
Q

Give two examples of companies that have published postmortems publicly.

A

GitLab and Monzo

2
Q

Talk through Monzo’s public postmortem.

A

Monzo's microservices store their data in a central shared NoSQL database (Cassandra)

Cassandra replicates each piece of data across three of the cluster's 21 servers, chosen according to its partition key. Any server in the cluster can map a given partition key to the same three servers in order to read or write the data
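
To make the replica-selection idea concrete, here is a toy sketch in Python, in the spirit of Cassandra's token ring but not its real partitioner (and not Monzo's code): a partition key is hashed onto a ring of 21 named servers, and every node can compute the same three replicas for it.

```python
import hashlib
from bisect import bisect_right

# Toy model of replica selection: 21 servers, replication factor 3.
# Cassandra really uses a Murmur3 partitioner and virtual nodes; md5 and
# one token per server are simplifications for illustration.
NODES = [f"cassandra-{i:02d}" for i in range(21)]
REPLICATION_FACTOR = 3

def _token(value: str) -> int:
    """Hash a string onto the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# Every server owns a position (token) on the same ring.
RING = sorted((_token(node), node) for node in NODES)

def replicas_for(partition_key: str) -> list[str]:
    """Walk clockwise from the key's token and take the next three servers.

    Because the ring is identical everywhere, any server in the cluster maps
    the same partition key to the same three replicas.
    """
    start = bisect_right(RING, (_token(partition_key),))
    return [RING[(start + i) % len(RING)][1] for i in range(REPLICATION_FACTOR)]

print(replicas_for("account:1234"))   # always the same three servers
```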

During peak load, the Cassandra cluster was running closer to its limits than Monzo would have liked, so the plan was to increase the cluster size and spread the load across more servers

During the incident, Monzo customers were unable to log in, see their balances and transactions, send or receive payments, withdraw cash, or get in touch via in-app chat or by phone

The scale-up began with six new servers being added to the Cassandra cluster; as this was happening, automated systems detected a Mastercard issue and the relevant teams were informed

Customer operations then got in touch with reports from customers of issues with the app

An incident was declared and on-call engineers began investigating

The payments team found a small error in how the Mastercard service handled one error case, and work on a fix was started

Next, 404 (not found) responses were noticed from internal services

The public status page was updated to let customers know

The Mastercard fix was deployed

It was found that the 404s were caused by Cassandra returning no data for existing keys: the new servers had taken ownership of their share of the partitions as soon as they joined, before the data had been streamed across to them. After investigation, the new servers were decommissioned one by one as the fix, handing ownership back to the servers that actually held the data
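
A toy model (purely illustrative, nothing like production code) of the failure mode described above: once a new server owns a key range but has not yet received the data, reads for those keys come back empty, and handing ownership back fixes it.

```python
# Two "servers": the old one holds the data, the new one joined empty.
data_on = {
    "old-server": {"account:1234": {"balance_pence": 4200}},
    "new-server": {},
}
owner_of = {"account:1234": "old-server"}   # ownership before the scale-up

def read(key):
    """Route a read to whichever server currently owns the key."""
    return data_on[owner_of[key]].get(key)  # None models "not found"

print(read("account:1234"))              # {'balance_pence': 4200}

owner_of["account:1234"] = "new-server"  # new server takes the range before any data arrives
print(read("account:1234"))              # None -- customers see errors

owner_of["account:1234"] = "old-server"  # decommissioning hands ownership back
print(read("account:1234"))              # served correctly again
```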

3
Q

Talk through GitLab’s public postmortem.

A

At the time of the incident, GitLab used a single PostgreSQL primary database and a single PostgreSQL secondary database for failover purposes
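
As a rough illustration of that setup, the sketch below (Python with psycopg2, hypothetical hostnames, and a read-only monitoring user that is assumed to exist) shows how an operator might confirm that the single secondary is attached and streaming from the primary; it is illustrative monitoring glue, not GitLab's actual tooling.

```python
import psycopg2

# Hypothetical connection details for the primary.
conn = psycopg2.connect(host="db-primary.example.internal",
                        dbname="postgres", user="monitor")

with conn, conn.cursor() as cur:
    # On the primary, pg_stat_replication has one row per connected standby.
    cur.execute("SELECT client_addr, state, sync_state FROM pg_stat_replication;")
    standbys = cur.fetchall()

if not standbys:
    print("WARNING: no secondary attached -- failover is not possible")
for addr, state, sync_state in standbys:
    print(f"standby {addr}: state={state}, sync={sync_state}")
```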

The issue happened while an engineer was setting up multiple PostgreSQL servers in the staging environment; prior to starting the work, the engineer took an LVM snapshot of the production database and loaded it into the staging environment

GitLab started experiencing increased database load. One of the problems this caused was that users were unable to post comments on issues and merge requests

Part of the increased load was spam, and part was a deletion process removing a GitLab employee's account and associated data after the employee had been reported for abuse by a troll

Due to the increased load, the secondary database's replication began to lag behind the primary. Eventually replication failed, because WAL segments the secondary still needed had already been removed by the primary
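
One way to watch that lag is sketched below (again psycopg2 with a hypothetical hostname): on a standby, the gap between now and the last replayed transaction approximates the replication lag, and if it outgrows the window of WAL the primary keeps around, the standby can no longer catch up.

```python
import psycopg2

# Hypothetical connection details for the secondary.
standby = psycopg2.connect(host="db-secondary.example.internal",
                           dbname="postgres", user="monitor")

with standby, standby.cursor() as cur:
    # pg_last_xact_replay_timestamp() is the commit time of the last
    # transaction the standby has replayed from the primary's WAL stream.
    cur.execute("SELECT now() - pg_last_xact_replay_timestamp() AS lag;")
    (lag,) = cur.fetchone()

print(f"replication lag on the secondary: {lag}")
```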

Manually resynchronising replication involved removing the data directory on the secondary, but the engineer deleted the data directory on the primary by mistake, destroying around 300 GB of production data
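
A sketch of the kind of guard that catches exactly this class of mistake: refuse to wipe a data directory unless the local server reports that it is a standby. The hostname, user, and data-directory path are hypothetical, not GitLab's.

```python
import shutil
import psycopg2

PGDATA = "/var/opt/postgresql/data"   # hypothetical data directory

def wipe_data_dir_if_standby():
    conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres")
    with conn, conn.cursor() as cur:
        # pg_is_in_recovery() is true on a standby and false on the primary.
        cur.execute("SELECT pg_is_in_recovery();")
        (is_standby,) = cur.fetchone()
    conn.close()

    if not is_standby:
        raise SystemExit("Refusing to delete the data directory: this server is the PRIMARY")

    # Only reached on a genuine secondary, just before re-seeding it from the primary.
    shutil.rmtree(PGDATA)
    print(f"Removed {PGDATA}; ready to take a fresh base backup")
```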

The regular database backups also turned out to be empty: pg_dump had been running with a mismatched PostgreSQL version, so it failed silently and produced essentially empty dump files
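
Two sanity checks would have surfaced this silent failure: compare the pg_dump and server major versions, and refuse to trust a dump file that is effectively empty. The sketch below assumes a hypothetical backup path and an illustrative size threshold.

```python
import os
import subprocess

DUMP_PATH = "/backups/gitlab-db.sql.gz"   # hypothetical backup location
MIN_EXPECTED_BYTES = 1_000_000            # anything smaller is suspicious

def check_backup():
    # A pg_dump binary from an older major version cannot dump a newer server,
    # so a client/server version mismatch is a red flag in itself.
    client = subprocess.run(["pg_dump", "--version"],
                            capture_output=True, text=True).stdout.strip()
    server = subprocess.run(["psql", "-tAc", "SHOW server_version;"],
                            capture_output=True, text=True).stdout.strip()
    print(f"pg_dump: {client!r}  server: {server!r}")

    size = os.path.getsize(DUMP_PATH) if os.path.exists(DUMP_PATH) else 0
    if size < MIN_EXPECTED_BYTES:
        raise SystemExit(f"Backup at {DUMP_PATH} is only {size} bytes -- treat it as FAILED")
    print(f"Backup looks plausible ({size} bytes)")
```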

The decision was taken to recover by copying back the earlier LVM snapshot, and all IDs were increased by 100,000 to avoid reuse problems
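
In practice, "increase all IDs by 100,000" can be done by jumping every auto-increment sequence forward so that rows created after the restore cannot collide with IDs handed out since the snapshot. The sketch below is one way to do that; the pg_sequences view assumes PostgreSQL 10 or later and the connection details are hypothetical (only the 100,000 figure comes from the postmortem).

```python
import psycopg2
from psycopg2 import sql

GAP = 100_000

conn = psycopg2.connect(host="db-primary.example.internal",   # hypothetical
                        dbname="gitlabhq_production", user="postgres")

with conn, conn.cursor() as cur:
    cur.execute("SELECT schemaname, sequencename FROM pg_sequences;")
    for schema, seq in cur.fetchall():
        # setval() moves the sequence so the next nextval() starts past the gap.
        cur.execute(
            sql.SQL("SELECT setval({name}, (SELECT last_value FROM {ident}) + %s);").format(
                name=sql.Literal(f"{schema}.{seq}"),
                ident=sql.Identifier(schema, seq),
            ),
            (GAP,),
        )

print("All sequences advanced by", GAP)
```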

This took around 18 hours!
