Chapter 5 - Replication Flashcards
What is replication?
Keeping the same data on multiple machines that are connected via a network
What are the reasons one may want to replicate data?
- Reduce latency by keeping data geographically close to users
- Increase availability, as the system can continue to work even if some parts fail
- Increase read throughput by scaling out the number of machines that can serve read queries
What are the three main approaches to replicating changes between nodes?
- Single leader
- Multi-leader
- Leaderless replication
What is a replica?
A node/server that stores a copy of the database.
Every write needs to be processed by every replica, otherwise the replicas no longer contain the same data.
What is leader-based replication?
- One replica is designated the leader
- All client write queries go to the leader
- The other replicas are followers
- When the leader writes new data to its local storage, it also sends the data change to its followers as part of the change stream
- Clients can read from any replica
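The flow above can be sketched in a few lines of Python. This is only an illustrative, in-memory model: the `Replica` and `Leader` classes, their method names, and the use of a dict as "local storage" are assumptions made for the example, not the API of any real database.

```python
class Replica:
    """A node holding a copy of the data (a plain dict stands in for storage)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, change):
        key, value = change
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Leader(Replica):
    """The one replica that accepts writes and forwards them to followers."""

    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers
        self.log = []  # hypothetical change stream kept by the leader

    def write(self, key, value):
        change = (key, value)
        self.apply(change)           # write to local storage first
        self.log.append(change)      # record the change in the stream
        for follower in self.followers:
            follower.apply(change)   # forward the same change to each follower


followers = [Replica("f1"), Replica("f2")]
leader = Leader("leader", followers)
leader.write("user:1", "alice")

# Reads can be served by any replica.
reads = [node.read("user:1") for node in [leader, *followers]]
print(reads)  # ['alice', 'alice', 'alice']
```

Only `Leader` exposes `write`, which mirrors the rule that all client writes go to the leader while reads may go anywhere.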
What are synchronous and asynchronous replication?
Synchronous: The leader waits for the follower to confirm it has received the write before reporting success to the user
Asynchronous: The leader sends the message to the follower replica but does not wait for a response
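The difference can be made concrete with a small simulation. In this sketch, a follower's `inbox` models changes that have been sent but not yet applied; the `Follower` class and the `replicate` function are hypothetical names used only for illustration, and "waiting for confirmation" is modelled simply by processing the inbox before returning.

```python
from collections import deque


class Follower:
    def __init__(self):
        self.data = {}
        self.inbox = deque()  # changes sent by the leader but not yet applied

    def receive(self, change):
        self.inbox.append(change)

    def process_inbox(self):
        while self.inbox:
            key, value = self.inbox.popleft()
            self.data[key] = value


def replicate(change, sync_followers, async_followers):
    for follower in sync_followers:
        follower.receive(change)
        follower.process_inbox()  # leader waits until the follower has applied it
    for follower in async_followers:
        follower.receive(change)  # sent, but the leader does not wait
    return "ok"                   # success is reported to the client here


sync_f, async_f = Follower(), Follower()
status = replicate(("x", 1), [sync_f], [async_f])

sync_state_at_ack = dict(sync_f.data)    # already up to date
async_state_at_ack = dict(async_f.data)  # still empty: the change is in flight

async_f.process_inbox()                  # the async follower catches up later
async_state_later = dict(async_f.data)
```

At the moment the client sees `"ok"`, only the synchronous follower is guaranteed to hold the write.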
What are the disadvantages of synchronous replication?
- Synchronous replication may slow down the entire system if the follower is recovering from a failure, the system is near capacity, or there are networking problems
- Impractical for all followers to be synchronous: any single node outage would cause the system to grind to a halt
What are the advantages of synchronous replication?
- Synchronous replication guarantees the follower has an up-to-date copy of the data consistent with the leader
- One synchronous follower can be upgraded to leader if leader fails
What are the advantages of asynchronous replication?
- The leader can continue to process writes even if all followers are down
How can we add new followers in leader-based replication?
- Take a consistent snapshot of the leader's database without taking a lock on the database (most DBs have this feature)
- Copy snapshot to follower node
- Follower requests all data changes that have happened since the snapshot was taken
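A minimal sketch of these three steps, assuming the snapshot is tagged with the position in the leader's change log that it corresponds to. The classes, the list-based log, and the integer position are all assumptions for illustration; real databases expose their own snapshot and log-position mechanisms.

```python
import copy


class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # Return a copy of the data plus the log position it corresponds to.
        return copy.deepcopy(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


class Follower:
    def __init__(self, snapshot_data, position):
        self.data = snapshot_data
        self.position = position  # last log position reflected in self.data

    def catch_up(self, leader):
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1


leader = Leader()
leader.write("a", 1)

data, pos = leader.snapshot()        # step 1: consistent snapshot
leader.write("b", 2)                 # writes keep arriving while it is copied
leader.write("a", 3)

new_follower = Follower(data, pos)   # step 2: copy snapshot to the new node
new_follower.catch_up(leader)        # step 3: request changes since the snapshot
```

After `catch_up`, the new follower holds the same data as the leader and can keep consuming the change stream from its recorded position.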
How can we handle node outages for followers in leader-based replication?
- Once the follower has restarted, it checks its log for the last transaction it processed
- Follower can request all the data changes that occurred since then
- Can continue receiving a stream of data changes as before
How can we handle node outages for leader in leader-based replication?
- Controller node appoints new leader (may be the load balancer?)
- No easy way to decide how to recover unreplicated writes
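To see why unreplicated writes are a problem, consider a rough sketch in which the controller promotes the follower with the most up-to-date log. That selection policy, the dict-based node records, and the `elect_new_leader` function are assumptions for illustration only; actual systems use their own election or consensus mechanisms.

```python
def elect_new_leader(followers):
    """Pick the follower whose log position is furthest along (assumed policy)."""
    return max(followers, key=lambda node: node["position"])


# The failed leader had accepted 10 writes; replication was asynchronous.
old_leader = {"name": "n1", "position": 10}
followers = [
    {"name": "n2", "position": 9},
    {"name": "n3", "position": 7},
]

new_leader = elect_new_leader(followers)

# Writes the old leader acknowledged but never shipped to any follower.
lost_writes = old_leader["position"] - new_leader["position"]
print(new_leader["name"], lost_writes)  # n2 1
```

Even the best candidate is missing one write here, and nothing in the new leader's log says what that write was.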
What is statement-based replication?
- Leader logs every write request, a statement, that it executes
- Leader sends that statement log to its followers
- For relational databases this means every literal SQL statement (INSERT, DELETE, UPDATE) is forwarded to followers
- The followers parse and execute the statement as if it had been received from a client
What are the potential pitfalls of statement-based replication?
- Statements that call non-deterministic functions, such as NOW() or RAND(), would generate a different value on each replica
- If statements use autoincrementing columns or depend on existing data they must be executed in the EXACT same order on each replica
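The first pitfall can be reproduced with Python's built-in `sqlite3` module, using two in-memory databases as stand-ins for a leader and a follower. SQLite's `random()` plays the role of RAND() here; the table name and the helper functions are made up for the example. The second half shows one possible workaround, where the leader evaluates the function once and ships the resulting value instead.

```python
import sqlite3


def new_replica():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, token INTEGER)")
    return conn


def tokens(conn):
    return [row[0] for row in conn.execute("SELECT token FROM events ORDER BY id")]


leader, follower = new_replica(), new_replica()

# Naive statement-based replication: ship the literal SQL text to every replica.
statement = "INSERT INTO events (token) VALUES (random())"
for replica in (leader, follower):
    replica.execute(statement)

# Each replica evaluated random() itself, so the stored values differ.
diverged = tokens(leader) != tokens(follower)

# Possible workaround: evaluate once on the leader, then ship a fixed value.
value = leader.execute("SELECT random()").fetchone()[0]
for replica in (leader, follower):
    replica.execute("INSERT INTO events (token) VALUES (?)", (value,))

second_row_matches = tokens(leader)[1] == tokens(follower)[1]
```

The first row differs between the two replicas, while the second row, inserted with a precomputed value, is identical on both.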
What is write-ahead log shipping?
- For both log-structured storage engines and B-trees, an append-only log is stored on disk
- The leader sends the log to its followers, which use it to build a copy of the exact same data structures found on the leader
What are the disadvantages of write-ahead log shipping?
- Write-ahead log contains details of which bytes were changed in which disk blocks
- Closely coupled to the storage engine
- Typically not possible to run different versions of the database software on the leader and the followers
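A toy model helps show why this log is so tightly coupled to the storage engine. Below, each log record is a hypothetical `(block, offset, payload)` tuple and the "disk" is a `bytearray`; the record layout and `BLOCK_SIZE` are invented for the example and do not match any real write-ahead log format.

```python
BLOCK_SIZE = 8  # assumed block size, tiny for readability


def apply_wal(disk, wal):
    """Replay (block, offset, payload) records onto a bytearray 'disk'."""
    for block, offset, payload in wal:
        start = block * BLOCK_SIZE + offset
        disk[start:start + len(payload)] = payload  # overwrite bytes in place


wal = [
    (0, 0, b"abcd"),  # write 4 bytes at the start of block 0
    (1, 2, b"xy"),    # write 2 bytes at offset 2 of block 1
    (0, 2, b"ZZ"),    # overwrite part of the first write
]

leader_disk = bytearray(2 * BLOCK_SIZE)
follower_disk = bytearray(2 * BLOCK_SIZE)

apply_wal(leader_disk, wal)
apply_wal(follower_disk, wal)  # the follower replays the shipped log verbatim

print(leader_disk == follower_disk)  # True
```

The follower ends up byte-for-byte identical to the leader, but only because both interpret block numbers and offsets the same way. If a new software version changed the on-disk layout, the same records would no longer mean the same thing.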