Data Intensive Ch5 - Replication Flashcards
Replication
Keep a copy of the same data on multiple machines connected via a network
Why?
- keep data geo-close to user (lowers LATENCY)
- allow system to work even if facing a failure (increases AVAILABILITY)
- scale out number of machines which can serve requests (increases READ THROUGHPUT)
Leaders and Followers
Replication approach.
Leader (aka master, primary) is the only node that accepts writes - ALL writes go through it
Follower (aka slave, secondary) receives a stream of data changes from the leader and applies them to its local copy of the data. Writes must be applied in the same order as they happened on the leader.
Clients can read from leader or followers.
Sync vs Async replication
Sync - the leader blocks the client and waits until ALL followers ack the change before confirming the write
Drawback:
The leader must block and wait for the ack; if a follower has failed or is recovering, it is unknown whether the change was saved (could be just a network issue, could be the node under heavy load)
That makes write latency unbounded
There is no telling how long a failed follower will take to recover, so a single unresponsive follower can stall all writes
Advantage:
once the leader has acked the write, a read from any replica is guaranteed to return the fresh value
Async - the leader sends the change to followers and acks the client without waiting for them
Drawback:
if the leader fails before the writes have been replicated, they may be lost
Advantage:
leader accepts writes even if followers can’t keep up with applying new changes
A hybrid approach is also possible (one or a few followers are synchronous, the rest async - aka semi-synchronous)
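A minimal single-process sketch of the difference (class and function names are hypothetical, not from the book): the only thing that changes between sync and async is whether the leader waits for followers to apply the change before acking the client.

```python
# Hypothetical sketch: sync vs async replication in a single-leader setup.
class Follower:
    def __init__(self):
        self.data, self.lag = {}, []

    def receive(self, key, value):
        self.lag.append((key, value))      # change received but not yet applied

    def apply_pending(self):               # stands in for "follower catches up"
        while self.lag:
            key, value = self.lag.pop(0)
            self.data[key] = value

class Leader:
    def __init__(self, followers):
        self.data, self.followers = {}, followers

    def write(self, key, value, synchronous):
        self.data[key] = value                     # apply locally
        for f in self.followers:
            f.receive(key, value)                  # stream the change
        if synchronous:
            for f in self.followers:
                f.apply_pending()                  # block until follower has applied
        return "ok"                                # async: ack before followers applied

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("x", 1, synchronous=False)
print(followers[0].data.get("x"))   # None: async follower still lags behind
leader.write("y", 2, synchronous=True)
print(followers[0].data.get("y"))   # 2: sync write is on the follower before the ack
```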
How to add a new follower in single-leader DB.
Take a point-in-time snapshot of the leader's database
Ideally this happens without taking any locks on DB
New follower loads the snapshot
Then it asks the leader for a STREAM of changes that happened AFTER the snapshot -> the snapshot must be associated with a position in the leader's replication log
Once follower consumes the stream it is ready to serve requests.
Follower catchup recovery
The follower keeps a log of the data changes received from the leader on its local disk
If the follower crashes or is taken down, after restart it reads from that log the last position it processed and requests from the leader all changes since that position
Once it has applied the backlog it has caught up and continues consuming the stream as before
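A rough sketch of the log-position mechanism behind both bootstrapping a new follower and catch-up recovery (the log format and names are invented for illustration):

```python
# Hypothetical replication log: a list of (op, key, value) changes.
replication_log = [("set", "a", 1), ("set", "b", 2), ("set", "a", 3), ("del", "b", None)]

def take_snapshot(leader_state, log_position):
    # a snapshot is only useful together with the log position it corresponds to
    return dict(leader_state), log_position

def catch_up(follower_state, last_applied_position):
    # replay every change after the last position this follower has processed
    for op, key, value in replication_log[last_applied_position:]:
        if op == "set":
            follower_state[key] = value
        else:
            follower_state.pop(key, None)
    return len(replication_log)   # new position, persisted for the next restart

# Bootstrap a new follower from a snapshot taken after the first two writes:
snapshot, position = take_snapshot({"a": 1, "b": 2}, log_position=2)
follower = dict(snapshot)
position = catch_up(follower, position)   # replays ("set","a",3) and ("del","b")
print(follower)                           # {'a': 3}
```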
Failover with single-leader replication
-> Difficult problem because:
Some follower must be promoted to be the new leader
Clients must start sending writes to the new leader
The other followers must start consuming changes from the new leader
-> Manual - admin does the switch
-> Automatic - adds challenges:
How to detect failure? (can be due to app crash (OOM, bug), can be due to power cut, can be due to network split…)
No foolproof method - timeouts and heartbeats are usually good enough (see the heartbeat sketch after this section)
How to elect new leader?
- election by majority
- arbitrary election by controller node
Ideally take the node with the most up-to-date state
Consensus problem - all nodes must agree on which node is the leader
How to switch clients to a new leader?
Writes should be automatically rerouted to the new leader
The old leader must step down after recovery and become a follower replica
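A toy sketch of heartbeat/timeout failure detection as mentioned above (the timeout value and names are illustrative, not prescribed by the book):

```python
import time

# Hypothetical failure detector: the leader is presumed dead only if no
# heartbeat has been seen for longer than the failover timeout.
FAILOVER_TIMEOUT_S = 30.0   # too short -> spurious failovers; too long -> long write outage
last_heartbeat = time.monotonic()

def on_heartbeat():
    global last_heartbeat
    last_heartbeat = time.monotonic()

def leader_presumed_dead():
    return time.monotonic() - last_heartbeat > FAILOVER_TIMEOUT_S

if leader_presumed_dead():
    # trigger failover: promote the most up-to-date follower, reroute clients,
    # demote the old leader to a follower when it comes back
    pass
```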
Issues with leader failovers
Pending writes
The old leader may hold writes that were never replicated to the new leader
In the meantime the new leader may have accepted conflicting writes
The pending writes can no longer simply be applied
Usually - discard them but this violates durability so case-by-case
Caution is necessary if data is used for coordination of business process with external system
GitHub incident
An out-of-date follower was promoted to leader; its autoincrement counter lagged behind the old leader's, so it reused primary keys that had already been assigned
Those keys were also referenced in a Redis cache, so the reuse exposed some users' data to the wrong users. Data leak.
Split brain
After recovery the old leader may still believe it is the leader
Both nodes accept writes - data conflicts are bound to happen
Some systems shut down one node when two leaders are detected, but if that mechanism is not carefully designed you can end up with both nodes shut down
Failover timeout selection
If too short - unnecessary failovers (a load spike or network glitch can make a healthy leader look dead), and failovers are expensive
If too long - it takes longer to react to a real failure, so the system stays unable to accept writes for longer
Types of replication log
Statement-based
Leader forwards the database query-language statements it executes to followers (SQL statements if a SQL DB, etc.)
Caveats
- nondeterministic functions like rand() or now() evaluate to a different value on each replica (see the sketch after this list)
The leader could theoretically replace the nondeterministic expression with the concrete value it produced, but this is still error-prone
- statements that depend on existing data, e.g. autoincrementing columns
Such statements must execute in exactly the same order on every replica, which is limiting when there are concurrently executing transactions
- side-effects from triggers, stored procedures
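A tiny illustration of the nondeterministic-function caveat above; the apply function is a made-up stand-in for "execute the statement text on a replica":

```python
import time

# Hypothetical sketch: statement-based replication re-evaluates now()/rand()
# on each replica, so replicas can diverge.
def apply_statement_based(replica, statement):
    replica["last_login"] = statement()   # "execute" the statement on this replica

replica_a, replica_b = {}, {}
stmt = lambda: time.time()                # stands in for UPDATE ... SET last_login = now()
apply_statement_based(replica_a, stmt)
time.sleep(0.01)                          # replicas rarely execute at the same instant
apply_statement_based(replica_b, stmt)
print(replica_a["last_login"] == replica_b["last_login"])   # False: divergence

# Logical (row-based) replication ships the *evaluated* value instead:
value = time.time()
replica_a["last_login"] = replica_b["last_login"] = value   # stays consistent
```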
Write-ahead log shipping
Leader sends WAL to replicas
The WAL is a low-level byte representation of the change (literally what the row looks like on disk)
That couples replication to the storage engine and DB version, typically ruling out running different versions on the leader and followers (and thus zero-downtime rolling upgrades)
Logical-row-based log replication
Independent from byte-representation of data
An insert is a tuple of all the columns' new values
A delete is the primary key of the deleted row
An update is the primary key plus the changed columns' values
Useful for Change Data Capture
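A sketch of what logical (row-based) log records might look like and how a replica or CDC consumer could apply them; the record shape here is invented for illustration:

```python
# Hypothetical logical log: records are independent of the on-disk byte format.
log = [
    {"op": "insert", "table": "users", "row": {"id": 42, "name": "Ada", "plan": "pro"}},
    {"op": "update", "table": "users", "key": {"id": 42}, "changed": {"plan": "free"}},
    {"op": "delete", "table": "users", "key": {"id": 42}},
]

def apply(replica, record):
    table = replica.setdefault(record["table"], {})
    if record["op"] == "insert":
        table[record["row"]["id"]] = dict(record["row"])
    elif record["op"] == "update":
        table[record["key"]["id"]].update(record["changed"])
    elif record["op"] == "delete":
        del table[record["key"]["id"]]

replica = {}
for record in log:
    apply(replica, record)   # a CDC consumer could apply the same records elsewhere
print(replica)               # {'users': {}}
```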
Trigger based
Useful if replicating only subset of data
Useful if conflict resolution logic needs to be added at low level
A trigger registers custom application code that runs when a change happens in a specified table (e.g. a row added to users), typically logging the change to a separate table from which an external process replicates it
Replication lag and eventual consistency
Replication lag is how far behind the leader the replicas are
Sync replication does not scale well
The more (synchronous) followers, the higher the write latency and the lower the availability
More nodes in the system - higher probability of failure
Async is the way then but this brings eventual consistency
Basically, if the leader receives no new writes, then after some UNSPECIFIED time the followers will EVENTUALLY catch up and become CONSISTENT with the leader
Eventual is vague - could be order of milliseconds, could be order of minutes (failures etc)
What is read-after-write consistency problem?
AKA reading your own writes
Results in poor UX
Symptoms:
the user writes some data and then reads it back from a replica that does not have it yet
Example: posting a comment on Facebook, then the comment is invisible to its author on some page load
Remedies:
Read data modified by current user only from the leader
Applicable to specific kinds of data, like a user profile, that are only ever edited by their owner
Ineffective if most data is editable by the user - most reads would then have to go to the leader, negating read scaling
-> Can use a heuristic like for X minutes after user makes an edit read from the leader only
-> Monitor replication lag and disable reads from replicas which are X behind
Client passes the timestamp of its most recent write to the system
Routing then picks a replica that is up to date at least to that timestamp (see the routing sketch after this section)
The read can also block until a replica catches up to the timestamp, but that is obviously asking for latency problems
The timestamp can be a logical one (e.g. a log sequence number) or an actual clock time (logical being better)
-> Cross-device consistency is hard - the timestamps remembered on different devices are not synchronized with each other
The last-write timestamp would need to be centralized
-> Multi-data center replicas
Different devices may be routed to different datacenters
Each datacenter has its own set of leader/followers
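A sketch of the last-write-timestamp routing idea (hypothetical names; a real system would likely use log sequence numbers from the replication stream):

```python
# Hypothetical read routing for read-after-write consistency.
class Replica:
    def __init__(self, applied_position):
        self.applied_position = applied_position   # how far this replica has replayed the log

def route_read(replicas, leader, last_write_position):
    for r in replicas:
        if r.applied_position >= last_write_position:
            return r                                # fresh enough for this user's last write
    return leader                                   # fall back to the leader (or wait)

leader = Replica(applied_position=100)
replicas = [Replica(applied_position=95), Replica(applied_position=100)]
target = route_read(replicas, leader, last_write_position=98)   # picks the caught-up replica
print(target.applied_position)                                   # 100
```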
What is monotonic reads problem?
Symptoms:
the user experiences going back in time
Data that was already observed seemingly disappears on a subsequent read
Example: the user sees a post on FB, but after a refresh it is gone because the second read hit a replica that does not have it yet
How to counteract:
each user session should read from the same replica (e.g. pick it by hashing the user/session ID)
If that replica fails, re-routing to another replica is necessary
TLDR
Monotonic reads (read-after-read) -> never read data older than what was already observed
e.g. return a logical timestamp with each read and only accept subsequent reads from replicas at least that fresh, or…
couple the user session with one replica (see the hashing sketch below)
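A sketch of sticky replica selection by hashing the session ID (hypothetical; any stable hash works):

```python
import hashlib

# Hypothetical sticky routing: the same session always reads from the same replica.
replicas = ["replica-0", "replica-1", "replica-2"]

def replica_for(session_id: str) -> str:
    digest = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
    return replicas[digest % len(replicas)]

print(replica_for("user-123"))   # always the same replica for this session...
print(replica_for("user-123"))   # ...so reads never "go back in time"
# If that replica fails, requests must be re-routed (and the guarantee restarts).
```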
What is consistent prefix reads problem?
Violation of causality due to replication lag
Symptoms:
the user sees the answer before the question
Especially impactful with a partitioned DB
Without partitioning we can assume all replicas apply writes in the same order, so there is no way the answer-write is applied before the question-write
With partitioning, if the answer lands on a different partition than the question, it can become visible sooner
Multi-Leader Replication
AKA Active/active, master-master
Leaders are followers to each other
Usually implemented across datacenters (each DC has its own leader and set of followers)
Pros:
- writes have lower latency, since there is a leader close to the user's geo-location
- fault tolerance in the datacenter-failure scenario - each DC keeps operating independently; when the failed one recovers, it catches up
- tolerance of network problems - with a single leader for multiple datacenters, every write must cross the less reliable inter-datacenter network; with a leader per datacenter, a temporary inter-DC outage does not block local writes
Cons:
- concurrent, conflicting writes on different leaders
Conflict Resolution Mechanism required
- autoincrementing IDs, constraints are tricky
Offline mobile clients is an example of multi-leader replication
In offline mode client-device becomes a local DB leader
Syncing back to the remote DB is a kind of (asynchronous, multi-leader) replication
Replication lag can be order of days
Collaborative editing (Google Docs) also bears a lot of similarities to multi-leader replication
Handling conflict resolution in multi-leader replication
TLDR:
Avoid conflicts - couple a given record/session with one leader (e.g. the user's home datacenter)
Last write wins (by write ID, leader ID etc.) - converges to eventual consistency but risks silently discarding writes
Waiting for acks from all leaders before acknowledging a write
- loses the main benefit of multi-leader; might as well use single-leader replication
Avoiding conflicts
For a given record, writes always go through one specific datacenter's leader (e.g. the user's home datacenter)
Does not work if leader changes
Does not work if user changes geo-location and different DC is close
On-read conflict resolution - the conflicting versions are stored and the client resolves the conflict manually
Converging towards consistent state
All writes get timestamp/id
Select which conflicting write wins (risk of discarding some writes):
- the write with the greater ID/timestamp wins (last write wins, LWW)
- writes originating from the replica with the greater ID win
Alternatively:
- Store the conflict in an explicit data structure and let the user resolve it later - no data loss
- Merge the data (e.g. concatenate the two values in alphabetical order)
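A sketch of last-write-wins convergence, assuming each write carries a hypothetical (timestamp, leader_id) tag used as the tiebreaker:

```python
# Hypothetical LWW merge: all leaders keep the version with the greatest tag,
# so they converge on the same value at the cost of discarding the "losing" write.
def merge_lww(current, incoming):
    # each value is (payload, (timestamp, leader_id))
    return incoming if incoming[1] > current[1] else current

leader_a = ("alice@old.example", (10, "dc-a"))
leader_b = ("alice@new.example", (12, "dc-b"))
print(merge_lww(leader_a, leader_b))   # both sides end up with the dc-b write
print(merge_lww(leader_b, leader_a))   # merge order does not matter -> convergence
```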
Conflict resolution logic in replication tools:
On write
The DB runs a conflict-handler script as soon as a conflict is detected
The script runs in the background, so it cannot prompt the user
Bucardo - conflict handlers written in Perl
On read
Conflicting values are read
On read, the multiple versions are presented to the user in the application
User must resolve conflict manually and make a write
CouchDB
Issues with transactions - a transaction may write several records, but conflict resolution is applied per row/document, so each write is resolved separately
Automatic conflict resolution
Still an immature area
-> Conflict-free replicated datatypes in Riak 2
Special data structures (counters, sets, maps, ordered lists) that can be concurrently edited and automatically resolve conflicts
Two-way merge
-> Mergeable persistent data structure
Tracks history like Git
Three way merge
-> Operational transformation
Used in Google Docs
Designed for concurrent editing of an ordered list of items, such as the characters in a text document
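A sketch of one of the simplest CRDTs, a grow-only counter (G-counter); this is a standard textbook example, not necessarily how Riak implements it:

```python
# Hypothetical G-counter CRDT: each replica increments only its own slot;
# merging takes the per-slot max, so concurrent increments are never lost.
def increment(counter, replica_id):
    counter[replica_id] = counter.get(replica_id, 0) + 1

def merge(a, b):
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

dc_a, dc_b = {}, {}
increment(dc_a, "dc-a"); increment(dc_a, "dc-a")   # two increments in one datacenter
increment(dc_b, "dc-b")                            # one concurrent increment elsewhere
merged = merge(dc_a, dc_b)
print(value(merged))                               # 3: no increments were lost
```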