Chapter 5 - Replication Flashcards

1
Q

What is replication?

A

Keeping the same data on multiple machines that are connected via a network

2
Q

What are the reasons one may want to replicate data?

A
  • Reduce latency by keeping data geographically close to users
  • Increase availability, as the system can continue to work even if some parts fail
  • Increase read throughput by scaling out the number of machines that can serve read queries
3
Q

What are the three main approaches to replicating changes between nodes?

A
  • Single leader
  • Multi-leader
  • Leaderless replication
4
Q

What is a replica?

A

A node/server that stores a copy of the database.
Every write needs to be processed by every replica, otherwise the replicas no longer contain the same data.

5
Q

What is leader-based replication?

A
  • One replica is designated the leader
  • All client write queries go to the leader
  • The other replicas are followers
  • When the leader writes new data to its local storage, it also sends the data change to its followers as part of the change stream
  • Clients can read from any replica
6
Q

What are synchronous and asynchronous replication?

A

Synchronous: The leader waits for the follower to confirm it has received the write before reporting success to the user
Asynchronous: The leader sends the message to the follower replica but does not wait for a response
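A minimal sketch of the difference, assuming in-memory stand-ins for the nodes. The `Leader` and `Follower` classes, the `pending` queue and `flush` are hypothetical names for illustration, not any real database's API.

```python
class Follower:
    def __init__(self, reachable=True):
        self.log = []
        self.reachable = reachable

    def apply(self, record):
        """Store the record and acknowledge it; fail if the node is unreachable."""
        if not self.reachable:
            raise ConnectionError("follower unreachable")
        self.log.append(record)
        return True


class Leader:
    def __init__(self, sync_followers, async_followers):
        self.log = []
        self.sync_followers = sync_followers
        self.async_followers = async_followers
        self.pending = []  # (follower, record) pairs not yet delivered

    def write(self, record):
        self.log.append(record)
        # Synchronous: wait for each sync follower to confirm before returning.
        for follower in self.sync_followers:
            follower.apply(record)  # raises if the follower cannot confirm
        # Asynchronous: queue the record and return without waiting.
        for follower in self.async_followers:
            self.pending.append((follower, record))
        return "ok"

    def flush(self):
        """Deliver queued records later; keep the ones that still fail."""
        still_pending = []
        for follower, record in self.pending:
            try:
                follower.apply(record)
            except ConnectionError:
                still_pending.append((follower, record))
        self.pending = still_pending
```

In this sketch an unreachable asynchronous follower does not block `write`, while an unreachable synchronous follower makes `write` raise.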

7
Q

What are the disadvantages of synchronous replication?

A
  • Synchronous replication may slow down the entire system if the follower is recovering from a failure, the system is near capacity, or there are network problems
  • It is impractical for all followers to be synchronous: any single node outage would cause the system to grind to a halt
8
Q

What are the advantages of synchronous replication?

A
  • Synchronous replication guarantees the follower has an up-to-date copy of the data that is consistent with the leader
  • A synchronous follower can be promoted to leader if the leader fails
9
Q

What are the advantages of asynchronous replication?

A
  • The leader can continue to process writes even if all followers are down
10
Q

How can we add new followers in leader-based replication?

A
  • Take a consistent snapshot of the leader's database without taking a lock on the database (most DBs have this feature)
  • Copy snapshot to follower node
  • Follower requests all data changes that have happened since the snapshot was taken
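The steps above can be sketched as follows, assuming the snapshot is tagged with the leader's log position at the moment it was taken. `LeaderNode`, `FollowerNode`, `changes_since` and `catch_up` are hypothetical names for illustration.

```python
class LeaderNode:
    def __init__(self):
        self.data = {}
        self.log = []  # list of (key, value); len(log) is the current log position

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        """Return a copy of the data plus the log position it corresponds to."""
        return dict(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


class FollowerNode:
    def __init__(self, snapshot, position):
        self.data = snapshot      # step 2: snapshot copied to the follower
        self.position = position  # log position the snapshot corresponds to

    def catch_up(self, leader):
        # Step 3: request and apply every change made since the snapshot.
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1
```

The same `catch_up` call also models a follower recovering from an outage: it resumes from the last position it processed.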
11
Q

How can we handle node outages for followers in leader-based replication?

A
  • Once the follower has restarted, it checks its log for the last transaction it processed
  • The follower requests all the data changes that occurred since then
  • It can then continue receiving a stream of data changes as before
12
Q

How can we handle node outages for leader in leader-based replication?

A
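  • A new leader is chosen, either by an election among the replicas or by a previously elected controller node, and clients are reconfigured to send their writes to it
  • With asynchronous replication the new leader may be missing some of the old leader's writes, and there is no easy way to decide how to recover those unreplicated writes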
13
Q

What is statement-based replication?

A
  • The leader logs every write request (statement) that it executes
  • The leader sends that statement log to its followers
  • For relational databases this means every literal SQL statement (INSERT, DELETE, UPDATE) is forwarded to followers
  • The followers parse and execute each statement as if it had been received from a client
14
Q

What are the potential pitfalls of statement-based replication?

A
  • Statements that call non-deterministic functions, such as NOW() or RAND(), would generate a different value on each replica
  • If statements use autoincrementing columns or depend on existing data, they must be executed in the EXACT same order on each replica
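A small sketch of the first pitfall, using Python's `random` module as a stand-in for SQL's RAND(). The two seeded generators are an assumption that models two machines with different random state; `apply_statement` and `apply_row_change` are hypothetical helpers.

```python
import random


def apply_statement(db, rng):
    # Stands in for re-executing: UPDATE t SET token = RAND() WHERE id = 1
    db["token"] = rng.random()


def apply_row_change(db, row_change):
    # Stands in for a log record that carries the concrete new value
    db.update(row_change)


leader_rng = random.Random(1)
follower_rng = random.Random(2)  # different state, as on a separate machine

leader, follower_stmt, follower_row = {}, {}, {}
apply_statement(leader, leader_rng)
apply_statement(follower_stmt, follower_rng)  # re-runs RAND() locally
apply_row_change(follower_row, {"token": leader["token"]})  # ships the value

print(leader == follower_stmt)  # replicas diverge
print(leader == follower_row)   # replicas agree
```

Shipping the computed value instead of the statement is what keeps the second follower in step with the leader.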
15
Q

What is write-ahead log shipping?

A
  • For both log-structured storage engines and B-trees, an append-only log of every write is stored on disk
  • The leader sends this log to its followers, which process it to build a copy of the exact same data structures found on the leader
16
Q

What are the disadvantages of write-ahead log shipping?

A
  • The write-ahead log contains details of which bytes were changed in which disk blocks
  • This makes replication closely coupled to the storage engine
  • It is typically not possible to run different versions of the database software on the leader and the followers
17
Q

What is logical (row-based) log replication?

A
  • Different log formats for replication and for the storage engine
  • Logical log is a sequence of records describing the writes to database tables at the granularity of a row
  • Allows the leader and followers to run different versions of the database software, or even different storage engines
18
Q

What is trigger-based replication?

A

Lets you register custom application code that is automatically executed when a data change (write transaction) occurs in a database system.
This custom application code or external process can then replicate the data change to another system

19
Q

What is read-after-write or read-your-write consistency?

A
  • A guarantee that if a user writes a change to the database, they will always see any updates they submitted themselves
  • Cross-device read-after-write consistency also needs to be considered
  • Can be implemented by routing the reads of a user who has recently written to the leader or to a sufficiently up-to-date follower
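One way the routing rule could look, as a sketch only. The 60-second window, the `ReadRouter` class and its method names are assumptions for illustration; the caller passes the current time explicitly so the behaviour is easy to follow.

```python
class ReadRouter:
    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.last_write = {}  # user_id -> time of that user's most recent write

    def record_write(self, user_id, now):
        self.last_write[user_id] = now

    def choose_replica(self, user_id, now):
        """Send recent writers to the leader, everyone else to a follower."""
        last = self.last_write.get(user_id)
        if last is not None and now - last < self.window:
            return "leader"
        return "follower"


router = ReadRouter(window_seconds=60.0)
router.record_write("alice", now=100.0)
print(router.choose_replica("alice", now=130.0))  # within the window
print(router.choose_replica("alice", now=200.0))  # window has passed
```

Because `last_write` lives in one router process here, this sketch does not by itself cover the cross-device case; that would need the timestamp stored somewhere all of the user's devices share.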
20
Q

What are monotonic reads?

A

A guarantee that if a user makes several reads in sequence, they won’t read older data after having previously read newer data
e.g. without this guarantee, a user could read from one follower and see 2 comments, then read from another follower with more lag and see only the 1st comment
Can be implemented by making sure users always read from the same replica
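A sketch of pinning each user to one replica by hashing the user ID. The replica names and the `replica_for_user` helper are hypothetical; a real system would also need to reroute the user if that replica fails.

```python
import hashlib


def replica_for_user(user_id, replicas):
    """Map a user ID to the same replica on every call."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["replica-a", "replica-b", "replica-c"]
first = replica_for_user("user-42", replicas)
second = replica_for_user("user-42", replicas)
print(first == second)  # the user always lands on the same replica
```

A stable hash is used rather than Python's built-in `hash()`, which is randomized per process for strings.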

21
Q

What are consistent prefix reads?

A

A guarantee that if a sequence of writes happens in a certain order, anyone reading those writes will see them appear in the same order
If the database always applies writes in the same order, reads always see a consistent prefix; this is more of a problem for partitioned databases, where there is no global ordering of writes across partitions

22
Q

What is a multi-leader configuration?

A

There are multiple leaders in the database topology; each leader accepts writes and also acts as a follower to the other leaders
Within a single datacenter the benefits rarely outweigh the added complexity

23
Q

What are some disadvantages of multi-leader replication?

A

The same data may be concurrently modified in two different datacenters
Those write conflicts must be resolved

24
Q

Describe a simple write conflict in a multi-leader database

A

A wiki page is being simultaneously edited by two users
User 1 changes the title from A to B
User 2 changes the title from A to C
Each user's change is successfully applied to their local leader, but when the changes are asynchronously replicated a conflict is detected

25
Q

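How can we avoid or resolve conflicts?

A

Method 1 (avoidance): All changes to a given record, for example a particular page, are always sent to the same leader
Method 2 (resolution): Last write wins: give each write a unique ID such as a timestamp and keep the write with the highest ID, discarding the others
Method 3 (resolution): Give each replica a unique ID; writes from the higher-numbered replica take precedence
Method 4 (resolution): Record the conflict in an explicit data structure and write application code that resolves the conflict on read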
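A sketch of last write wins, with the replica ID used as a tie-breaker when timestamps are equal. The `Write` record and `last_write_wins` function are hypothetical names; combining the two rules this way is an illustration, not something the card prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp: float
    replica_id: int


def last_write_wins(a, b):
    """Keep the write with the highest (timestamp, replica_id); drop the other."""
    return max(a, b, key=lambda w: (w.timestamp, w.replica_id))


w1 = Write(value="B", timestamp=10.0, replica_id=1)
w2 = Write(value="C", timestamp=10.5, replica_id=2)
print(last_write_wins(w1, w2).value)  # C
```

Every replica that sees both writes picks the same winner regardless of arrival order, but the losing write is silently discarded, so this approach can lose data.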

26
Q

What is a replication topology?

A

The communication paths along which writes are propagated from one node to another

27
Q

What is the all-to-all multi-leader replication topology?

A

Every leader sends its writes to every other leader

28
Q

What is the circular multi-leader replication topology?

A

Each node receives writes from one node and forwards those writes, plus any writes of its own, to one other node

29
Q

What is the star multi-leader replication topology?

A

One node is designated as the root node, which forwards writes to all of the other nodes

30
Q

What problems may arise with the star and circular topology?

A

If just one node fails, it can interrupt the flow of replication messages between the other nodes

31
Q

What problems may arise with the all-to-all topology?

A

Client A inserts a row into a table on leader 1
Client B updates that row on leader 3
Leader 2 receives the writes in a different order and is asked to update a row that does not yet exist

32
Q

What is leaderless replication?

A

Also known as Dynamo-style. Any node can process client write requests. A coordinator node may send write requests to other nodes on behalf of clients.

33
Q

What are quorum writes and quorum reads?

A

Quorum write: Clients send their write requests to all/multiple replicas. If the number of nodes that respond successfully reaches a certain threshold (w), the write is considered successful.
Quorum read: Clients send read requests to several nodes in parallel; version numbers are used to determine which value is newer.
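A minimal in-memory sketch of both operations. The `Replica` class, the `up` flag used to simulate outages, and the convention that version 0 means "no value" are assumptions for illustration.

```python
class Replica:
    def __init__(self):
        self.store = {}  # key -> (version, value)
        self.up = True

    def put(self, key, version, value):
        if not self.up:
            return False
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)
        return True

    def get(self, key):
        if not self.up:
            return None
        return self.store.get(key, (0, None))


def quorum_write(replicas, key, version, value, w):
    """Send the write to every replica; succeed if at least w acknowledge."""
    acks = sum(1 for replica in replicas if replica.put(key, version, value))
    return acks >= w


def quorum_read(replicas, key, r):
    """Query the replicas; return the (version, value) with the highest version."""
    responses = [resp for resp in (rep.get(key) for rep in replicas)
                 if resp is not None]
    if len(responses) < r:
        raise RuntimeError("not enough replicas responded")
    return max(responses, key=lambda resp: resp[0])
```

With three replicas and w = r = 2, a read still finds the newest value even if the replica that missed the latest write is one of the two that respond.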

34
Q

What is read repair?

A

Clients make reads from several nodes in parallel (quorum reads)
If the client sees that one of the responses is stale, it can write the newer value back to that replica
Good for data that is frequently read
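A sketch of read repair in which each replica is modelled as a plain dict mapping a key to a `(version, value)` pair. That representation and the `read_with_repair` name are assumptions for illustration.

```python
def read_with_repair(replicas, key):
    """Read from all given replicas, repair stale ones, return the newest value."""
    responses = [(rep, rep.get(key, (0, None))) for rep in replicas]
    newest = max((resp for _, resp in responses), key=lambda resp: resp[0])
    for rep, resp in responses:
        if resp[0] < newest[0]:
            rep[key] = newest  # write the newer value back to the stale replica
    return newest[1]


a = {"k": (2, "new")}
b = {"k": (1, "old")}
c = {}
print(read_with_repair([a, b, c], "k"))  # new
print(b["k"], c["k"])                    # both now hold version 2
```

Values that are never read are never repaired this way, which is why a background process is usually needed as well.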

35
Q

What is an anti-entropy process?

A

A background process that looks for differences in data and copies missing data from one replica to another

36
Q

What is the quorum condition?

A

If there are n replicas
Every write must be confirmed by w nodes to be considered successful
And we must query at least r nodes for each read
As long as w + r > n we expect to get at least one up-to-date value when reading
Intuition: the set of nodes written to and the set of nodes read from must overlap in at least one node
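The overlap claim can be checked by brute force for small clusters. This sketch enumerates every possible set of w written nodes and r read nodes; `always_overlap` is a hypothetical helper name.

```python
from itertools import combinations


def always_overlap(n, w, r):
    """True if every w-node write set shares a node with every r-node read set."""
    nodes = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(nodes, w)
        for read in combinations(nodes, r)
    )


print(always_overlap(3, 2, 2))  # True:  2 + 2 > 3
print(always_overlap(3, 1, 2))  # False: 1 + 2 = 3, so a read can miss the write
```

The general argument is the pigeonhole principle: w + r > n nodes cannot be chosen from n nodes without picking at least one node twice.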

37
Q

How can stale values be returned even if the quorum condition is met?

A
  • Two writes occur concurrently, especially if last write wins is used
  • A write happens concurrently with a read
  • A write succeeds on some replicas but fails on others (fewer than w in total) and is not rolled back, so later reads may or may not return its value
  • A node carrying the new value fails and is restored from a replica carrying the old value, so fewer than w replicas hold the new value, breaking the quorum condition
38
Q

What is a sloppy quorum and hinted handoff?

A
  • The client cannot connect to the usual n nodes on which the data is stored
  • The data can be written to any w nodes, which may include nodes that are not where the data is usually stored
  • Once the usual nodes can be reached again, the data is sent back to them (hinted handoff)
  • Useful for increasing write availability
39
Q

What are concurrent operations? (tricky)

A

Two operations that are unaware of each other.
There is no happens-before relationship between them.
e.g. User 1 changes the title from A to B while User 2 changes the title from A to C

40
Q

Describe a versioning algorithm to capture the happens-before relationship and deal with concurrent writes

A

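See pages 187-188. In outline:
  • The server keeps a version number for every key, increments it on every write, and stores the new version number with the value written
  • A read returns all values that have not been overwritten, plus the latest version number
  • A client must read a key before writing it; the write includes the version number from that read and merges all the values the read returned
  • On receiving a write with a version number, the server overwrites all values with that version number or below, but keeps all values with a higher version number, since those are concurrent with the incoming write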
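A minimal sketch of one such algorithm, assuming a single replica that keeps a version number per key and retains concurrent values as siblings. The `VersionedKey` class is a hypothetical name, and merging siblings is left to the client, which passes in the already-merged value.

```python
class VersionedKey:
    def __init__(self):
        self.version = 0
        self.siblings = []  # (version, value) pairs not yet overwritten

    def read(self):
        """Return the latest version number and all current sibling values."""
        return self.version, [value for _, value in self.siblings]

    def write(self, value, based_on_version):
        """Apply a write that was based on the given previously read version."""
        self.version += 1
        # Values at or below the client's version were seen and merged by the
        # client, so they are overwritten. Higher versions are concurrent: keep.
        self.siblings = [(v, val) for v, val in self.siblings
                         if v > based_on_version]
        self.siblings.append((self.version, value))
        return self.version, [val for _, val in self.siblings]


cart = VersionedKey()
v1, _ = cart.write(["milk"], based_on_version=0)  # client 1, no prior read
v2, _ = cart.write(["eggs"], based_on_version=0)  # client 2, concurrent
v3, values = cart.write(["milk", "flour"], based_on_version=v1)  # client 1 again
print(v3, values)  # 3 [['eggs'], ['milk', 'flour']]
```

Client 1's second write replaces its own earlier value but leaves client 2's value in place, because client 1 had not seen version 2 when it wrote.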