Chapter 5 - Replication Flashcards

Question 1

Q

What is replication?

Answer

A

Keeping the same data on multiple machines that are connected via a network

Question 2

Q

What are the reasons one may want to replicate data?

Answer

A

Reduce latency by keeping data geographically close to users
Increase availability, as the system can continue to work even if some parts fail
Increase read throughput by scaling out the number of machines that can serve read queries

Question 3

Q

What are the three main approaches to replicating changes between nodes?

Answer

A

Single leader
Multi-leader
Leaderless replication

Question 4

Q

What is a replica?

Answer

A

A node/server that stores a copy of the database.
Every write needs to be processed by every replica, otherwise the replicas no longer contain the same data.

Question 5

Q

What is leader-based replication?

Answer

A

One replica is designated the leader
All client write queries go to the leader
The other replicas are followers
When the leader writes new data to its local storage, it also sends the data change to its followers as part of the change stream
Clients can read from any replica
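
The flow above can be sketched in a few lines of Python. This is a minimal illustration that assumes an in-memory key-value store; the Leader and Follower classes and their method names are hypothetical and do not correspond to any database's API.

```python
class Follower:
    """A replica that only applies changes received from the leader."""

    def __init__(self):
        self.data = {}

    def apply(self, change):
        key, value = change
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Leader(Follower):
    """The replica that accepts client writes and streams them to followers."""

    def __init__(self, followers):
        super().__init__()
        self.followers = followers
        self.log = []  # change stream: ordered list of changes

    def write(self, key, value):
        change = (key, value)
        self.apply(change)          # write to local storage first
        self.log.append(change)     # record the change in the stream
        for follower in self.followers:
            follower.apply(change)  # send the change to every follower


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("title", "Replication")
```

After the write, a read of "title" on the leader or on either follower returns the same value.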

Question 6

Q

What are synchronous and asynchronous replication?

Answer

A

Synchronous: The leader waits for the follower to confirm it has received the write before reporting success to the user
Asynchronous: The leader sends the message to the follower replica but does not wait for a response
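
A rough Python sketch of the difference, assuming each follower has an inbox of changes it has been sent but not yet applied. The Follower class and the leader_write helper are hypothetical names used only for illustration.

```python
from collections import deque


class Follower:
    def __init__(self):
        self.data = {}
        self.inbox = deque()  # changes sent but not yet applied

    def receive(self, key, value):
        self.inbox.append((key, value))

    def process_inbox(self):
        while self.inbox:
            key, value = self.inbox.popleft()
            self.data[key] = value


def leader_write(leader_data, key, value, sync_followers, async_followers):
    """Apply a write on the leader and replicate it (hypothetical helper)."""
    leader_data[key] = value
    for f in sync_followers:
        f.receive(key, value)
        f.process_inbox()      # synchronous: wait until the follower has the write
    for f in async_followers:
        f.receive(key, value)  # asynchronous: send and move on
    return "ok"                # success is reported to the client here


leader_data = {}
sync_f, async_f = Follower(), Follower()
status = leader_write(leader_data, "x", 1, [sync_f], [async_f])
```

When leader_write returns, the synchronous follower already holds the write, while the asynchronous follower only catches up once it processes its inbox.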

Question 7

Q

What are the disadvantages of synchronous replication?

Answer

A

If the synchronous follower does not respond (it is recovering from a failure, the system is near capacity, or there are network problems), the leader must block all writes until the follower is available again
Making all followers synchronous is impractical: any single node outage would cause the system to grind to a halt

Question 8

Q

What are the advantages of synchronous replication?

Answer

A

Synchronous replication guarantees the follower has an up-to-date copy of the data consistent with the leader
A synchronous follower can be promoted to leader if the leader fails

Question 9

Q

What are the advantages of asynchronous replication?

Answer

A

The leader can continue to process writes even if all followers are down

Question 10

Q

How can we add new followers in leader-based replication?

Answer

A

Take a consistent snapshot of the leader's database without taking a lock on the database (most DBs have this feature)
Copy snapshot to follower node
Follower requests all data changes that have happened since the snapshot was taken
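
The three steps above can be sketched as follows. This is a simplified in-memory model: it assumes the snapshot is tagged with the leader's log position at the moment it was taken, and the class and method names (snapshot, changes_since, catch_up) are hypothetical.

```python
import copy


class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # every change, in order

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        """Return a copy of the data plus the log position it corresponds to."""
        return copy.deepcopy(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


class Follower:
    def __init__(self, snapshot_data, position):
        self.data = snapshot_data
        self.position = position  # number of log entries already applied

    def catch_up(self, leader):
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1


leader = Leader()
leader.write("a", 1)
data, pos = leader.snapshot()    # step 1: consistent snapshot
leader.write("b", 2)             # writes continue while the snapshot is copied
follower = Follower(data, pos)   # step 2: copy the snapshot to the new node
follower.catch_up(leader)        # step 3: request changes since the snapshot
```

Because the follower remembers its log position, calling catch_up again later only applies changes it has not yet seen.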

Question 11

Q

How can we handle node outages for followers in leader-based replication?

Answer

A

Once the follower has restarted, it checks its log for the last transaction it processed before the fault
The follower can then request all the data changes that occurred since then
Once caught up, it continues receiving a stream of data changes as before

Question 12

Q

How can we handle node outages for the leader in leader-based replication?

Answer

A

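Failover: the leader's failure is detected (typically by a timeout), a new leader is chosen by election among the replicas or appointed by a previously elected controller node, and clients and followers are reconfigured to use the new leader
With asynchronous replication the new leader may not have received all of the old leader's writes; there is no easy way to recover these unreplicated writes, and they are commonly discarded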

Question 13

Q

What is statement-based replication?

Answer

A

The leader logs every write request (statement) that it executes
The leader sends that statement log to its followers
For relational databases this means every literal SQL statement (INSERT, DELETE, UPDATE) is forwarded to followers
The followers parse and execute each statement as if it had been received from a client

Question 14

Q

What are the potential pitfalls of statement-based replication?

Answer

A

Statements that call non-deterministic functions, such as NOW() or RAND(), are likely to generate a different value on each replica
If statements use autoincrementing columns or depend on existing data, they must be executed in the EXACT same order on each replica
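
The first pitfall can be reproduced with a small Python sketch. Here random.Random stands in for a database's RAND(), and the different seeds are an assumption that models each node calling its own independent random function. The helper names are hypothetical. Shipping the value the leader already computed, rather than the statement, is one way to sidestep the problem.

```python
import random


def execute_statement(db, rng):
    """Statement-based: each replica re-runs the statement itself."""
    db["token"] = rng.random()  # stands in for a non-deterministic call such as RAND()


def apply_row_change(db, key, value):
    """Value-based: the replica applies the value the leader already computed."""
    db[key] = value


leader, follower_a, follower_b = {}, {}, {}
leader_rng = random.Random(1)
follower_rng = random.Random(2)  # different seed: an independent RAND() on another node

execute_statement(leader, leader_rng)
execute_statement(follower_a, follower_rng)             # statement replayed on the follower
apply_row_change(follower_b, "token", leader["token"])  # leader ships the resulting value
```

After this runs, follower_a has diverged from the leader, while follower_b holds exactly the leader's value.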

Question 15

Q

What is write-ahead log shipping?

Answer

A

Both log-structured storage engines and B-trees write every change to an append-only log on disk
The leader sends this log to its followers, which process it to build a copy of the exact same data structures found on the leader

Question 16

Q

What are the disadvantages of write-ahead log shipping?

Answer

A

The write-ahead log contains details of which bytes were changed in which disk blocks
This makes replication closely coupled to the storage engine
It is typically not possible to run different versions of the database software on the leader and the followers

Question 17

Q

What is logical (row-based) log replication?

Answer

A

Different log formats for replication and for the storage engine
Logical log is a sequence of records describing the writes to database tables at the granularity of a row
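Because the logical log is decoupled from the storage engine internals, the leader and followers can run different versions of the database software, or even different storage engines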

Question 18

Q

What is trigger-based replication?

Answer

A

A trigger lets you register custom application code that is automatically executed when a data change (write transaction) occurs in a database system.
This custom application code or external process can then replicate the data change to another system

Question 19

Q

What is read-after-write or read-your-write consistency?

Answer

A

A guarantee that if a user writes a change to the database they will always see any updates they submitted themselves
Cross-device read-after-write consistency may also need to be considered (the same user on several devices)
Can be implemented by forwarding the reads of a user that has recently written to the leader or a sufficiently updated follower
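
A minimal sketch of the routing idea, assuming the application remembers when each user last wrote and sends that user's reads to the leader for a fixed window afterwards. The 60-second window, the in-process dictionary, and the function names are all assumptions made for illustration.

```python
import time

RECENT_WRITE_WINDOW = 60.0  # seconds; an assumed threshold, to be tuned to replication lag

last_write_at = {}  # user_id -> time of that user's most recent write


def record_write(user_id, now=None):
    last_write_at[user_id] = time.monotonic() if now is None else now


def choose_replica(user_id, now=None):
    """Return "leader" for users who wrote recently, otherwise "follower"."""
    now = time.monotonic() if now is None else now
    wrote_at = last_write_at.get(user_id)
    if wrote_at is not None and now - wrote_at < RECENT_WRITE_WINDOW:
        return "leader"
    return "follower"
```

The optional now argument exists only to make the example easy to exercise with fixed timestamps. In a real deployment the last-write time would have to live somewhere shared, since the same user may hit different application servers.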

Question 20

Q

What are monotonic reads?

Answer

A

A guarantee that if a user makes several reads in sequence, they won’t read older data after having previously read newer data
e.g. a user reads from one follower and sees 2 comments, then reads from another follower with more lag and sees only the 1st comment
Can be implemented by making sure users always read from the same replica
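
One way to pin a user to a replica is to choose it from a hash of the user ID rather than at random. The sketch below assumes a fixed list of replicas; the replica names and the replica_for function are hypothetical.

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]  # hypothetical replica names


def replica_for(user_id, replicas=REPLICAS):
    """Pick a replica from a stable hash of the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]
```

The same user ID always maps to the same replica as long as the replica list is unchanged. If that replica fails, the user's reads have to be rerouted to another one.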

Question 21

Q

What are consistent prefix reads?

Answer

A

A guarantee that if a sequence of writes happens in a certain order, anyone reading those writes will see them appear in the same order
If the database always applies writes in the same order, reads always see a consistent prefix. The anomaly mainly arises in partitioned databases, where partitions operate independently and there is no global ordering of writes

Question 22

Q

What is a multi-leader configuration?

Answer

A

There are multiple leaders in the database topology; each leader accepts writes and also acts as a follower to the other leaders
Within a single datacenter, the benefits rarely outweigh the added complexity

Question 23

Q

What are some disadvantages of multi-leader replication?

Answer

A

The same data may be concurrently modified in two different datacenters
Those write conflicts must be resolved

Question 24

Q

Describe a simple write conflict in a multi-leader database

Answer

A

A wiki page is being simultaneously edited by two users
User 1 changes the title from A to B
User 2 changes the title from A to C
Each user's change is successfully applied to their local leader, but when the changes are asynchronously replicated a conflict is detected

Question 25

Q

How can we avoid or resolve write conflicts?

Answer

A

Method 1: Changes to a certain page, for example, are always sent to the same leader
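Method 2: Last write wins (LWW): give each write a unique ID such as a timestamp and keep the write with the highest ID, discarding the others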
Method 3: Give each replica a unique ID, writes from higher-numbered replica take precedence
Method 4: Record the conflict in an explicit data structure and write application code that resolves the conflict on read
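
As an illustration of last write wins, the sketch below assumes each write carries a (timestamp, replica_id, value) tuple. The tuple layout and the last_write_wins function are hypothetical. Note that the losing write is silently dropped, which is why this approach is prone to data loss.

```python
def last_write_wins(versions):
    """Pick the winning write from (timestamp, replica_id, value) tuples.

    The highest timestamp wins; replica_id breaks ties so that every
    replica picks the same winner. All other writes are discarded.
    """
    return max(versions, key=lambda v: (v[0], v[1]))[2]


conflicting = [
    (1000, 1, "B"),  # user 1 renames the page on leader 1
    (1005, 2, "C"),  # user 2 renames the page on leader 2 slightly later
]
winner = last_write_wins(conflicting)
```

Every replica that sees both writes converges on the same title regardless of the order in which the writes arrive.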

Question 26

Q

What is a replication topology?

Answer

A

The communication path along which writes are propagated from one node to another

Question 27

Q

What is the all-to-all multi-leader replication topology?

Answer

A

Every leader sends its writes to every other leader

Question 28

Q

What is the circular multi-leader replication topology?

Answer

A

Each node receives writes from one node and forwards those writes, plus any writes of its own, to one other node

Question 29

Q

What is the star multi-leader replication topology?

Answer

A

One node is designated as the root node, which forwards writes to all other nodes

Question 30

Q

What problems may arise with the star and circular topology?

Answer

A

If just one node fails, it can interrupt the flow of replication messages between the other nodes

Question 31

Q

What problems may arise with the all-to-all topology?

Answer

A

Client A inserts a row into a table on leader 1
Client B updates the row on leader 3
Leader 2 receives the writes in a different order and is asked to update a row that does not yet exist

Question 32

Q

What is leaderless replication?

Answer

A

Also known as Dynamo-style. Any node can process client write requests. A coordinator node may send write requests to other nodes on behalf of clients.

Question 33

Q

What are quorum writes and quorum reads?

Answer

A

Quorum write: Clients send their write requests to all/multiple replicas. If the number of nodes that respond successfully reaches a certain threshold (w), the write is considered successful.
Quorum read: Clients send read requests to several nodes in parallel; version numbers are used to determine which value is newer.

Question 34

Q

What is read repair?

Answer

A

Clients make reads from several nodes in parallel (quorum reads)
If the client sees that one of the responses is stale, it writes the newer value back to that replica
Good for data that is frequently read
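
A small sketch of read repair, assuming each replica is modeled as a dictionary that maps a key to a (version, value) pair. The function name and data layout are hypothetical.

```python
def quorum_read_with_repair(replicas, key):
    """Read `key` from every replica in `replicas`, return the newest value,
    and write it back to any replica that returned a stale version.

    Each replica is a dict mapping key -> (version, value).
    """
    responses = [(replica, replica.get(key, (0, None))) for replica in replicas]
    newest_version, newest_value = max(
        (resp for _, resp in responses), key=lambda r: r[0]
    )
    for replica, (version, _) in responses:
        if version < newest_version:
            replica[key] = (newest_version, newest_value)  # read repair
    return newest_value


r1 = {"user:1": (7, "new")}
r2 = {"user:1": (7, "new")}
r3 = {"user:1": (6, "old")}  # missed the latest write
value = quorum_read_with_repair([r1, r2, r3], "user:1")
```

The read returns the newest value, and as a side effect the stale replica r3 is brought up to version 7. Keys that are rarely read never get repaired this way.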

Question 35

Q

What is an anti-entropy process?

Answer

A

A background process that looks for differences in data and copies missing data from one replica to another

Question 36

Q

What is the quorum condition?

Answer

A

If there are n replicas
Every write must be confirmed by w nodes to be considered successful
And we must query at least r nodes for each read
As long as w + r > n we expect to get at least one up-to-date value when reading
Think about it… set of nodes written and set of nodes read must overlap
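
The overlap argument can be checked by brute force for small clusters. The sketch below compares the formula against an exhaustive check over every possible write set and read set; both function names are made up for this example.

```python
from itertools import combinations


def is_quorum(n, w, r):
    """True when the quorum condition w + r > n holds."""
    return w + r > n


def every_read_overlaps_write(n, w, r):
    """Brute force: does every set of r nodes share a node with every set of w nodes?"""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )
```

For example, with n = 3, w = 2, r = 2 both functions return True, while with n = 3, w = 1, r = 2 both return False, because a read of two nodes can miss the single node that accepted the write.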

Question 37

Q

How can stale values be returned even if the quorum condition is met?

Answer

A

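Two writes occur concurrently and it is unclear which happened first; if last write wins is used, a write can be lost due to clock skew
A write happens concurrently with a read, so the write may be reflected on only some of the replicas
A write succeeded on some replicas but failed on others and was not rolled back, so subsequent reads may or may not return its value
A node carrying the new value fails and is restored from a replica carrying the old value, so fewer than w replicas hold the new value and the quorum condition is broken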

Question 38

Q

What is a sloppy quorum and hinted handoff?

Answer

A

The client cannot connect to enough of the usual n nodes on which the data is stored
The data can be written to any w reachable nodes, which may include nodes that are not where the data is usually stored (sloppy quorum)
Once the usual nodes can be reached again, that data is sent back to them (hinted handoff)
Useful for increasing write availability

Question 39

Q

What are concurrent operations? (tricky)

Answer

A

Two operations that are unaware of each other.
There is no happens-before relationship between them.
e.g. User 1 changes title A to B while User 2 changes title A to C

Question 40

Q

Describe a versioning algorithm to capture the happens-before relationship and deal with concurrent writes

Answer

A

Page 187-188 haha
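
For quick review, here is a sketch of one common single-replica scheme: the server keeps a version number per key, clients must read before writing, and a write overwrites only the values the client had already seen. It is not a transcription of the referenced pages. The VersionedStore class, its method names, and the shopping-cart data are assumptions made for illustration.

```python
class VersionedStore:
    """Single-replica store that keeps concurrent values ("siblings") per key."""

    def __init__(self):
        self.version = {}   # key -> latest version number
        self.siblings = {}  # key -> list of (version, value) not yet overwritten

    def read(self, key):
        """Return (latest version, all values that have not been overwritten)."""
        values = [value for _, value in self.siblings.get(key, [])]
        return self.version.get(key, 0), values

    def write(self, key, value, based_on_version=0):
        """Store `value`, which the client merged from a read at `based_on_version`."""
        new_version = self.version.get(key, 0) + 1
        self.version[key] = new_version
        # Values at or below based_on_version were seen by the client, so they
        # are overwritten; higher versions are concurrent and must be kept.
        kept = [(v, val) for v, val in self.siblings.get(key, []) if v > based_on_version]
        self.siblings[key] = kept + [(new_version, value)]
        return new_version


store = VersionedStore()
v1 = store.write("cart", ["milk"])               # client 1, no prior read
v2 = store.write("cart", ["eggs"])               # client 2, concurrent with client 1
v3 = store.write("cart", ["milk", "flour"], v1)  # client 1 builds on what it saw at v1
version, values = store.read("cart")
```

After these three writes the store holds two siblings, ["eggs"] and ["milk", "flour"]: the third write replaced ["milk"] (which client 1 had seen) but kept ["eggs"] (which it had not). A later reader is expected to merge the siblings and write the result back with the version it read.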
