System Design Fundamentals Flashcards

Question

What should you consider in practice when applying the CAP theorem to real-world distributed systems?

Answer 1

I think part of the problem with practical interpretations of the CAP theorem, especially with Gilbert and Lynch’s formulation, is the fact that most real distributed systems do not require atomic consistency or perfect availability and will never be called upon to perform on a network suffering from arbitrary message loss. Consistency, Availability, and Partition Tolerance are the Platonic ideals of a distributed system–we can partake of them enough to meet business requirements, but the nature of reality is such that there will always be compromises.

Answer 2

Despite your best efforts, your system will experience enough faults that it will have to make a choice between reducing yield (i.e., stop answering requests) and reducing harvest (i.e., giving answers based on incomplete data). This decision should be based on business requirements.
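
As a rough worked example of the yield/harvest trade-off, the sketch below assumes a hypothetical dataset spread over 4 shards, one of which is unreachable while 40 of 100 requests arrive. The function names and numbers are illustrative assumptions, not part of the original card.

```python
def yield_fraction(answered_requests: int, total_requests: int) -> float:
    """Yield: the fraction of requests that received an answer at all."""
    return answered_requests / total_requests


def harvest_fraction(shards_reached: int, total_shards: int) -> float:
    """Harvest: the fraction of the data reflected in an answer."""
    return shards_reached / total_shards


# Option A (assumed): reject the 40 requests that arrive during the outage.
# Every answer that is given is complete, but fewer requests get an answer.
option_a = {"yield": yield_fraction(60, 100), "harvest": harvest_fraction(4, 4)}

# Option B (assumed): answer every request from the 3 reachable shards.
# Every request gets an answer, but answers are based on incomplete data.
option_b = {"yield": yield_fraction(100, 100), "harvest": harvest_fraction(3, 4)}
```

Which option is acceptable depends on the business requirement: a search page may tolerate option B, while a bank balance probably cannot.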

Answer 3

CAP theorem doesn’t just describe any old system, but a very specific model of a system: The CAP system model is a single, read-write register – that’s all. For example, the CAP theorem says nothing about transactions that touch multiple objects: they are simply out of scope of the theorem, unless you can somehow reduce them down to a single register. The only fault considered by the CAP theorem is a network partition (i.e. nodes remain up, but the network between some of them is not working). That kind of fault absolutely does happen, but it’s not the only kind of thing that can go wrong: nodes can crash or be rebooted, you can run out of disk space, you can hit a bug in the software, etc. In building distributed systems, you need to consider a much wider range of trade-offs, and focussing too much on the CAP theorem leads to ignoring other important issues. Also, the CAP theorem says nothing about latency, which people tend to care about more than availability. In fact, CAP-available systems are allowed to be arbitrarily slow to respond, and can still be called “available”. Going out on a limb, I’d guess that your users wouldn’t call your system “available” if it takes 2 minutes to load a page.

Answer 4

An example of a network partition is an interrupted network link between two datacenters whose replica sets need to stay in sync.

Answer 5

CP/AP: a false dichotomy

The fact that we haven’t been able to classify even one datastore as unambiguously “AP” or “CP” should be telling us something: those are simply not the right labels to describe systems. I believe that we should stop putting datastores into the “AP” or “CP” buckets, because:

- Within one piece of software, you may well have various operations with different consistency characteristics.
- Many systems are neither consistent nor available under the CAP theorem’s definitions. However, I’ve never heard anyone call their system just “P”, presumably because it looks bad. But it’s not bad – it may be a perfectly reasonable design, it just doesn’t fit one of the two CP/AP buckets.
- Even though most software doesn’t neatly fit one of those two buckets, people try to shoehorn software into one of the two buckets anyway, thereby inevitably changing the meaning of “consistency” or “availability” to whatever definition suits them. Unfortunately, if the meaning of the words is changed, the CAP theorem no longer applies, and thus the CP/AP distinction is rendered completely meaningless.
- A huge amount of subtlety is lost by putting a system in one of two buckets. There are many considerations of fault-tolerance, latency, simplicity of programming model, operability, etc. that feed into the design of a distributed system. It is simply not possible to encode this subtlety in one bit of information. For example, even though ZooKeeper has an “AP” read-only mode, this mode still provides a total ordering of historical writes, which is a vastly stronger guarantee than the “AP” in a system like Riak or Cassandra – so it’s ridiculous to throw them into the same bucket.
- Even Eric Brewer admits that CAP is misleading and oversimplified. In 2000, it was meant to start a discussion about trade-offs in distributed data systems, and it did that very well. It wasn’t intended to be a breakthrough formal result, nor was it meant to be a rigorous classification scheme for data systems. 15 years later, we now have a much greater range of tools with different consistency and fault-tolerance models to choose from.

CAP has served its purpose, and now it’s time to move on.

Answer 6

Horizontal scaling means running clones of a server that all execute the same code: you create an image of one server and use it to add a bunch of new machines. You want these services to be stateless and to store data externally in some cache or database. For example, user data would live in a database, and user session information would live in a persistent cache like Redis for better performance. "External" means that the data does not live on the application servers. The servers themselves hold only the code base: the logic and rules that pull data out of the DB or other types of data stores.

Answer 7

What is ZooKeeper?
- It helps sync information across the nodes of a cluster when your cluster or infrastructure has multiple nodes.

When would you use it?
- Generally for keeping data services in sync.

What problems does ZooKeeper solve?
Give an example of when you'd want to use ZooKeeper.
Why does ZooKeeper exist?
How does ZooKeeper work?

Answer 8

Cache invalidation is needed when an entry in the cache no longer contains the correct result and we have to replace or remove it.

One strategy is TTL (time to live). Its drawback is that correctness depends on the TTL we set for each entry, and a good value can be tricky to get right.
-------------------
(1) Write-Through Strategy

After we update our DB, we update our cache. An example would be a basic system such as Client -> App Service -> DB, and after the DB state is written we update the cache so that it reflects the state in the DB.

Because the write-through strategy updates the cache only as part of a state change made through our service, no other service may change this DB directly. Every writer must use the API we provide to update the DB, so that the cache is also updated through the write-through path. Defining boundaries is really important in cache invalidation.
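
The two strategies above can be sketched together in a few lines. This is a minimal illustration under stated assumptions: `TTLCache` and `UserService` are hypothetical names, the "DB" is a plain dict, and the clock is injectable so that expiry can be demonstrated without sleeping.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # TTL passed: invalidate the entry
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)


class UserService:
    """The only component allowed to write to the DB (the boundary)."""

    def __init__(self, db, cache):
        self.db = db
        self.cache = cache

    def update_user(self, user_id, record):
        self.db[user_id] = record         # 1. write the DB state
        self.cache.put(user_id, record)   # 2. then refresh the cache

    def get_user(self, user_id):
        cached = self.cache.get(user_id)
        if cached is not None:
            return cached
        record = self.db.get(user_id)     # cache miss: fall back to the DB
        if record is not None:
            self.cache.put(user_id, record)
        return record
```

If some other service writes to the dict behind `UserService`'s back, `get_user` keeps returning the stale cached record until the TTL expires, which is why the boundary matters.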

Answer 9

First of all, the main purpose of caching is speed. The basic idea is simple: if you know you are going to compute the same thing, you may just load the result saved from the previous run, and skip the computing this time. There are two keywords here: “the same thing”, and “the saved result”. The latter means you are essentially trading (more) storage space for (less) time. That is the price to pay for caching, and also an important fact to be aware of when you use caching (i.e., caching is definitely not free, and sometimes the price can be fairly high). The tricky thing is “the same thing”. How do you know that you are computing the same thing? That is all “cache invalidation” is about. When things become different, you have to invalidate the cache, and do the (presumably time-consuming) computing again.

Answer 10

Replication lag is the delay between the time a transaction or operation executes on the primary (master) and the time it executes on the standby (slave) node. In other words, the read (secondary) nodes of the primary can't keep in sync and take a bit of time to get the latest data. The lag can come from either the I/O thread or the SQL thread.
----
To work out what's causing the lag, you must determine which replication thread is getting backed up. Replication relies on three threads per master/slave connection: one is created on the master and two are created on the slave.
- The Slave I/O Thread. When you issue START SLAVE on a slave server, the slave creates this thread, which connects to the master and requests a copy of the master's binary log.
- The Binlog Dump Thread. When the slave connects to the master, the master uses this thread to send the slave the contents of its binary log.
- The Slave SQL Thread. The slave creates this SQL (or applier) thread to read the contents of the retrieved binary log and apply its contents.

Answer 11

The binary log is a set of log files that contain information about data modifications made to a MySQL server instance. The log is enabled by starting the server with the --log-bin option. Basically it is a list of the state transitions of the database: a log of all the transactions that can be applied to a database.

There are 2 reasons why a binary log exists:
1. Replication for child nodes
2. Backups
--------
The binary log has two important purposes:
- For replication, the binary log is used on master replication servers as a record of the statements to be sent to slave servers. Many details of binary log format and handling are specific to this purpose. The master server sends the events contained in its binary log to its slaves, which execute those events to make the same data changes that were made on the master. A slave stores events received from the master in its relay log until they can be executed. The relay log has the same format as the binary log.
- Certain data recovery operations require use of the binary log. After a backup file has been restored, the events in the binary log that were recorded after the backup was made are re-executed. These events bring databases up to date from the point of the backup.
-------------
There are two types of binary logging:
- Statement-based logging: Events contain SQL statements that produce data changes (inserts, updates, deletes)
- Row-based logging: Events describe changes to individual rows

Reference: https://dev.mysql.com/doc/internals/en/binary-log-overview.html

Answer 12

A content delivery network (CDN) is basically a bunch of caches around the world: proxy servers that serve content from a location closer to the user. It is typically used for faster response times for static assets such as HTML/CSS/JS, and videos too.

It helps increase performance in 2 ways:
1. Users are closer to the datacenter that has the static content (we may still need some time to hit a server for dynamic content)
2. Our servers don't need to fulfil the requests that the CDN can

Pros:
- Faster responses, and load is taken off our servers because the CDN stores copies of the content

Cons:
1. Costs can be high if traffic is high
2. Content might be stale if it is updated before the TTL expires
3. CDNs require changing URLs for static content to point to the CDN
---------
The 2 types of CDNs are push and pull CDNs.
- Push: receives new content whenever changes occur on our server. You can push content to a push CDN right away.
- Pull: grabs new content from our server when the first user requests that content. You leave the content on your server and rewrite URLs to point to the CDN. This results in slower responses until the content is cached in the CDN.

There are trade-offs to both depending on the situation. Push seems better in that the CDN already has the latest content a user can get from the server, whereas with Pull the initial requests may be slow.
--------
The decision on which CDN type to go with revolves in large part around traffic and downloads. Travel blogs that are hosting videos and podcasts (aka. large downloads) will find a push CDN cheaper and more efficient in the long run since the CDN won’t re-download content until you actively push it to the CDN. A pull CDN can help high-traffic-small-download sites by keeping the most popular content on CDN servers. Subsequent updates (or “pulls”) for content aren’t frequent enough to drive up costs past that of a push CDN.

Answer 13

An HTTP request is **made by a client, to a named host, which is located on a server**. The aim of the request is to access a resource on the server. To make the request, the client uses components of a URL (Uniform Resource Locator), which includes the information needed to access the resource.

HTTP itself is a protocol, or a set of rules for accessing resources on the internet in a client-server model. Resources can be HTML files, JSON, media, etc. We make requests to APIs using the HTTP protocol, which allows developers to access these resources.

Reference: [https://www.freecodecamp.org/news/http-request-methods-explained/](https://www.freecodecamp.org/news/http-request-methods-explained/)

Answer 14

- A GET request (the HTTP method here is GET) is used to read/retrieve a resource. A successful GET request returns a response containing the information you requested.
- A POST request, on the other hand, is used when we want to create a new resource. A POST request normally carries a body in which you define the data that you want to create. On success we usually get back a 2xx response code, most commonly 201 Created (some APIs return 200 OK).
- A PUT request is used when we want to modify/update an existing resource. It is also **idempotent**, meaning that repeating the same request leaves the server in the same state, whereas if we call a POST request repeatedly we will create the same resource multiple times. PUT is usually associated with creating/updating an item by some explicit id, where the resource is known by the client, such as `PUT /users/1`. A `POST /users` request would typically hit the collection URL and let the server handle the logic from there, and often the client doesn’t know the exact URL of the resource.
- When it comes to POST vs. PUT, you often see both opinions of using one over the other. From my current experience, I think it depends on the idempotence of the action: depending on the problem, which method is safer for us to use? Does POST create side effects in cases where we may need to make a request multiple times and the requirement is simply to replace an item if needed? POST CAN be idempotent, it's just not guaranteed.
- A DELETE request is **used to delete a resource from the server**. Unlike GET and HEAD requests, DELETE requests may change the server state. Sending a message body on a DELETE request might cause some servers to reject the request, but you can still send data to the server using URL parameters. You also have to know the resource in order to specify exactly what you want to delete.
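
The POST vs. PUT idempotence point can be illustrated without a real HTTP server. The sketch below is an assumption-laden stand-in: `UserStore` is a hypothetical in-memory handler whose `post` and `put` methods mimic what a typical `POST /users` and `PUT /users/<id>` endpoint would do.

```python
import itertools


class UserStore:
    """In-memory stand-in for the server side of /users."""

    def __init__(self):
        self._users = {}
        self._next_id = itertools.count(1)

    def post(self, body):
        """Like POST /users: the server picks the id, so every call creates a new user."""
        user_id = next(self._next_id)
        self._users[user_id] = dict(body)
        return user_id

    def put(self, user_id, body):
        """Like PUT /users/<id>: the client names the resource, so repeats overwrite it."""
        self._users[user_id] = dict(body)

    def count(self):
        return len(self._users)


store = UserStore()

# Retrying the same POST twice leaves two separate resources behind.
first_id = store.post({"name": "Ada"})
second_id = store.post({"name": "Ada"})

# Retrying the same PUT twice leaves the server in the same state as doing it once.
store.put(42, {"name": "Grace"})
store.put(42, {"name": "Grace"})
```

After these calls the store holds three users: two created by the repeated POST and one by the repeated PUT.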

Answer 15

TCP is a connection-oriented protocol, whereas UDP is a connectionless protocol. A key difference between TCP and UDP is speed, **as TCP is comparatively slower than UDP**. Overall, UDP is a faster, simpler, and more efficient protocol; however, retransmission of lost data packets is only possible with TCP.

When to use TCP or UDP
- TCP: Use TCP when you want a connection and a guarantee that your packets will actually arrive. That includes web pages and web API calls.
- UDP: Use UDP when you care most about having the latest data quickly, as in video livestreams or high-frequency trading (HFT). UDP is datagram based, not connection based.

Advantages of TCP
- Guarantees resending of lost packets, ignoring of duplicates, reordering of packets, and control over the rate of packets sent.
- Can do read() and write() on the TCP file descriptor

Disadvantages of TCP
- More resource usage and overhead

Advantages of UDP
- Lightweight
- Gets the most up-to-date information, and fast. No 3-way handshake.
- UDP is stateless, with no setup to be done.

Disadvantages of UDP
- No delivery guarantee
- Duplicate, missing, and out-of-order packets

Answer 16

Oftentimes in an interview we will need to estimate the performance of a system; however, there are many ways to measure this. We can look at average performance, but an average does not describe the potential variation in calls to our service.

Three important metrics:
1. Throughput - The number of records processed per second (good for batch jobs)
2. Response time - The time between the client sending a request and receiving a response. More important when talking about online systems
3. Latency - The duration that a request is waiting to be handled

Another important concept is tail latency, which describes the latency at a high percentile of requests. For example, a p95 latency of 1 second means that 95% of requests take less than 1 second, and the other 5% do not.
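
To see why the average hides tail latency, here is a small sketch that computes percentiles with the nearest-rank method (one of several common definitions; monitoring tools may interpolate differently). The sample data, 19 fast requests and one slow one, is made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds: 19 fast requests, 1 very slow one.
response_times_ms = [100] * 19 + [2000]

mean_ms = sum(response_times_ms) / len(response_times_ms)
p50_ms = percentile(response_times_ms, 50)
p95_ms = percentile(response_times_ms, 95)
p99_ms = percentile(response_times_ms, 99)
```

Here the mean comes out at 195 ms even though no request took that long; p50 and p95 are both 100 ms, and only p99 exposes the 2000 ms outlier.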

Answer 17

- Consists of tables holding many rows of structured data
- Rows of one table can be related to rows of another table if both rows share a common key
- Has a built-in query optimizer that picks the implementation it estimates to be quickest for a SQL statement (SQL is a declarative language)

Answer 18

Imagine a basic database where you want to fetch a record when making a read. If all of the rows are just stored on a hard drive, every single time a read call is made, you would need to search for the row on the disk, resulting in a very slow O(n) time complexity. Instead, this process can be sped up by creating an index, which allows a database to quickly search for rows based on certain values of the tuple that defines a record.

While indexes sound great, they also have tradeoffs:
- Pros: Having an index speeds up reads, if it is frequently used (the application often queries for values based on the column the index corresponds to)
- Cons: Having an index slows down writes, because on every write additional work must be done behind the scenes to maintain the proper data formatting for the index
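
The trade-off can be seen in a toy table. This sketch assumes a hypothetical `Table` class with a hash-map index on a single `email` column; real databases more commonly use B-trees or similar structures, so treat this only as an illustration of the read/write trade-off.

```python
class Table:
    """Rows stored in insertion order, plus an index on the 'email' column."""

    def __init__(self):
        self.rows = []
        self.email_index = {}  # email -> positions of matching rows

    def insert(self, row):
        self.rows.append(row)
        # Extra work on every write: keep the index up to date.
        position = len(self.rows) - 1
        self.email_index.setdefault(row["email"], []).append(position)

    def find_by_scan(self, email):
        """No index: inspect every row, O(n)."""
        return [row for row in self.rows if row["email"] == email]

    def find_by_index(self, email):
        """With index: jump straight to the matching rows."""
        return [self.rows[i] for i in self.email_index.get(email, [])]


table = Table()
table.insert({"id": 1, "email": "ada@example.com"})
table.insert({"id": 2, "email": "grace@example.com"})
table.insert({"id": 3, "email": "ada@example.com"})
```

Both lookups return the same rows; the indexed one avoids touching rows that cannot match, at the price of the bookkeeping done in `insert`.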

Answer 19

Note that all writes to databases (assuming no indexes) are done by just appending to a log, as this is the quickest way to write to disk (sequential writes). It also makes concurrency much easier to deal with, because a crash cannot leave one value partially overwritten.
--------
Every SQL Server database has a transaction log that records all transactions and the database modifications made by each transaction. The transaction log is a critical component of the database. If there is a system failure, you will need that log to bring your database back to a consistent state. For information about the transaction log architecture and internals, see the SQL Server Transaction Log Architecture and Management Guide.
-------
Reference: https://docs.microsoft.com/en-us/sql/relational-databases/logs/the-transaction-log-sql-server?view=sql-server-ver15

Answer 20

Replication is the process of storing multiple copies of the same data on multiple different computers. It serves three main purposes. Firstly, the redundancy of the data means that a service can still serve reads and writes if one of the database nodes crashes. Secondly, replication can actually speed up the process of reading or writing if the operation is performed by a database node that is geographically closer to the client. Finally, replicating data to many databases allows the reduction of load on each database. However, there are many different ways of implementing replication, each with their own benefits and drawbacks, and we will read about them below.
-----
Types
```
Single leader replication
Replication log implementation
Multi leader replication
Leaderless replication
Sloppy Quorums
```

Answer 21

A JOIN clause is used to combine rows from two or more tables, based on a related column between them. EXAMPLE
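
A minimal runnable sketch of a JOIN, using Python's built-in `sqlite3` module with an in-memory database. The `users` and `scores` tables and their rows are made-up sample data; the related column is `scores.user_id`, which refers to `users.id`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE scores (id INTEGER PRIMARY KEY, user_id INTEGER, score INTEGER);
    INSERT INTO users VALUES (1, 'ada'), (2, 'linus');
    INSERT INTO scores VALUES (1, 1, 300), (2, 1, 450), (3, 2, 120);
    """
)

# Combine each score row with the user row that shares its key.
rows = conn.execute(
    """
    SELECT users.username, scores.score
    FROM scores
    JOIN users ON users.id = scores.user_id
    ORDER BY scores.score DESC
    """
).fetchall()
# rows == [('ada', 450), ('ada', 300), ('linus', 120)]
```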

Answer 22

SELECT user_id, game_id, score, creation_date FROM scores WHERE (user_id = '0877') AND (game_id = '9322')
----------
3 tables: Users, Scores, and Games

SCORE
- id (bigint auto_increment)
- user_id
- game_id
- creation_date
- score (bigint)

USER
- id (bigint auto_increment)
- username

GAME
- id (bigint auto_increment)
- title: varchar(255)
- description: text

Get all the score history for a given user_id and a given game_id:
GET /leaderboard/:game_id/:user_id

Answer 23

A replication strategy used for scaling databases.
--------------
One of the nodes is designated to be the leader. All writes are sent to this leader and are written to the leader's local storage.

All of the other replicas are known as followers.
- Data is sent to the followers from the leader via a replication log
- Each follower takes the log and updates its local data in the same order that the log specifies

Clients can perform reads from either the leader or a follower.

Replication can be performed either synchronously or asynchronously.
- Synchronous replication is when the client only receives a message that a given write was successful once the changes have been propagated to all of the replicas (strong consistency)
- Asynchronous replication is when the client receives a message saying that their write was successful the moment it reaches the leader database; all changes to the replicas are propagated in the background (eventual consistency)
- While synchronous replication ensures that all followers have up-to-date data, it is impractical because a crash on one of the followers, or just a follower operating slowly, slows or breaks the whole system
- Typically, synchronous replication means that only one follower is synchronous while the rest are asynchronous; if the synchronous follower fails, another one of the followers is made synchronous
- In a fully asynchronous system, writes to the leader that have yet to be propagated are lost on a crash

Setting up new followers is a relatively easy process, and does not affect the write throughput of the system.
- Take a consistent snapshot of the leader database, and copy this to the follower node
- Then connect to the leader, and use the replication log to catch up from the position of the snapshot in the log
- Once caught up, start acting as a normal follower

It is very easy to recover from a follower crashing.
- Check the follower's log of changes to see what point it was up to when it crashed, then connect to the leader and apply all of the changes since then
- After catching up, continue to act as a normal follower

If the leader fails in this configuration, the system must perform a failover.
- First the system must determine that the leader has actually failed. This is impossible to do with complete certainty as there are a variety of things that can go wrong (crashes, power outage, network issues), so most systems have databases frequently communicate with one another and use a timeout to determine if a node is dead
- Use some sort of consensus mechanism (will talk about this in more detail later) to determine a follower node that will become the new leader; typically a good choice is the most up-to-date follower
- Configure clients to send their write requests to the new leader, and make sure that if the old leader comes back it realizes it is now a follower
- Failover can be a dangerous situation because you may need to discard some writes from the old leader, which can lead to inconsistencies if other systems (not the database, such as a cache) had already propagated those changes internally
- In some scenarios two nodes may end up thinking they are the leader, which could lead to corrupted data (split brain, can be dealt with)
- If the timeout for determining failover is too small, the system may perform unnecessary failovers and introduce extra load
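
The write path, follower setup, and catch-up described above can be sketched in memory. This is a simplified model under stated assumptions: `Leader` and `Follower` are hypothetical classes, the replication log is a Python list of `(key, value)` pairs, and failover is left out entirely.

```python
class Follower:
    def __init__(self, snapshot=None, position=0):
        self.data = dict(snapshot or {})  # copy of the leader's snapshot
        self.position = position          # how far into the log we have applied

    def catch_up(self, log):
        """Apply, in log order, every entry we have not applied yet."""
        for key, value in log[self.position:]:
            self.data[key] = value
        self.position = len(log)


class Leader:
    def __init__(self):
        self.data = {}
        self.log = []        # replication log: ordered (key, value) writes
        self.followers = []

    def add_follower(self):
        """New follower: consistent snapshot plus the log position it reflects."""
        follower = Follower(snapshot=self.data, position=len(self.log))
        self.followers.append(follower)
        return follower

    def write(self, key, value, synchronous=False):
        self.data[key] = value
        self.log.append((key, value))
        if synchronous:
            # Synchronous: do not acknowledge until every follower has the write.
            for follower in self.followers:
                follower.catch_up(self.log)
        # Asynchronous: acknowledge now; followers catch up in the background.


leader = Leader()
leader.write("a", 1)
follower = leader.add_follower()   # starts from the snapshot {"a": 1}
leader.write("a", 2)               # asynchronous: follower is now lagging
lagging_value = follower.data["a"]
follower.catch_up(leader.log)      # the same call serves crash recovery
```

The same `catch_up` call covers both a new follower joining from a snapshot and a crashed follower resuming from its last applied position.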

Answer 24

A log/list of transactions applied to a database. The easiest implementation is to just copy over the SQL statements. Another is to use a write-ahead log, the same way that databases do for indexing.

The simplest way is to just copy over the SQL statements used by the leader.
- However, this is a problem because some SQL commands are nondeterministic (for example NOW() or RAND())
- While these values could be replaced with deterministic values by the original database, better solutions have been developed

Another option is to use a write-ahead log in the same way that databases do for indexing.
- Append-only sequence of bytes containing all writes to the database
- This log already exists on disk, so just send it over the network to followers
- The disadvantage is that a write-ahead log records which bytes were changed, so a change in the storage engine on a replica may make the log unusable if things are stored in different locations. This makes rolling upgrades impossible and requires downtime

A logical log describes all the changes made to a given row (usually identified by primary key).
- Decoupling the storage engine and the log allows for rolling upgrades and backwards compatibility

Answer 25

Here are some problems caused by replication lag, since reads from replicas are only eventually consistent:

1. Reading your own writes
- After writing data and refreshing, you may still see the old data since the changes you have made have not yet been propagated to the replica you are reading from.
- Requires read-after-write consistency, which says that after reloading a page you will see the writes that you have just made
- Can either always query the leader for areas of the application that are editable by the user, or keep track of the last write on the client, and for some period of time afterwards only read from the leader (or a replica that is up to date as of that timestamp)

2. Monotonic reads (problem)
- Reads occurring on several different replicas can make it seem as if you are moving back in time
- To guarantee monotonic reads, one way is to make sure that each user always reads from the same replica; this can be done based on a hash of the user ID (this can break down if said replica fails)

3. Consistent prefix reads
- When two things in the database have a causal relationship, but the one that precedes the other has a greater replication lag, to another user it seems like the later write comes before the preceding one (happens when they are on different partitions, otherwise the log would maintain order)
- Could perhaps make sure causally related writes are on the same partition, but this is not always possible, so you may have to explicitly keep track of causal dependencies
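
The first two mitigations can be sketched as a small read router. Everything here is a hypothetical illustration: the `ReadRouter` class, the 60-second window, and the replica names are assumptions, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
import hashlib


def replica_for_user(user_id, replicas):
    """Monotonic reads: always map the same user to the same replica."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]


class ReadRouter:
    def __init__(self, leader, replicas, window_seconds=60.0):
        self.leader = leader
        self.replicas = replicas
        self.window_seconds = window_seconds
        self._last_write_at = {}  # user_id -> time of that user's last write

    def record_write(self, user_id, now):
        self._last_write_at[user_id] = now

    def choose(self, user_id, now):
        """Read-your-writes: shortly after a write, read from the leader."""
        last_write = self._last_write_at.get(user_id)
        if last_write is not None and now - last_write < self.window_seconds:
            return self.leader
        return replica_for_user(user_id, self.replicas)


router = ReadRouter("leader", ["replica-1", "replica-2", "replica-3"])
usual_replica = router.choose("user-17", now=0.0)
router.record_write("user-17", now=10.0)
right_after_write = router.choose("user-17", now=20.0)   # inside the window
later = router.choose("user-17", now=100.0)              # window has passed
```

As the card notes, pinning a user to one replica breaks down if that replica fails; a real router would need a fallback, which this sketch omits.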

Answer 26

High-availability writes in a distributed database with leaderless replication (both Dynamo and Cassandra employ leaderless replication) require a heuristic for conflict resolution between concurrent writes. This is essential because every replica of the data is considered equal, and concurrent writes on the same record at two different replicas are considered perfectly valid.

Example: Dynamo, Cassandra
-----
Pros and Cons

Any replica can accept writes from any of the clients.

There is no such thing as failover. Simply set a threshold for the number of nodes that need to accept the write for the write to be successful, and the same for reads.
- If an unavailable node comes back online, a client may read from many nodes simultaneously, realize the previously offline node has an outdated value, and update it accordingly (use version numbers to check which values are out of date); this process is known as read repair
- Another way of ensuring up-to-date data is anti-entropy, which is a background process that looks for data differences between replicas and copies the correct data over; however, the writes are not copied in any particular order

If we can only write to a fraction of nodes at a time and read from a fraction, we can use a quorum in order to ensure that we always read from at least one node with a most up-to-date copy of the data.
- This occurs when the number of nodes successfully written to plus the number of nodes read from is greater than the total number of replicas
- Typically reads and writes are sent to all replicas in parallel

There are still cases where quorum reads and writes are not perfect.
- Even if writes do not succeed on the specified number of nodes, they will not be rolled back on the nodes where they have been written
- In the event that sloppy quorums are used, the writes may end up on different nodes than the reads, such that there is no overlap between them
- If a node with a new value fails and its data is restored using a node with an old value, the new value will be lost

Works well with multi-datacenter operation.
- Send writes to all nodes, but have the acknowledgements from the client’s local datacenter be sufficient to fulfill a quorum write, in order to reduce the high cross-datacenter latency of writes
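
The quorum condition and read repair can be sketched as follows. This is a simplified model under stated assumptions: each replica is a dict mapping a key to a `(version, value)` pair, the function names are hypothetical, and the read simply contacts the first `r` replicas instead of querying all of them in parallel.

```python
def is_strict_quorum(n, w, r):
    """Reads overlap with the latest successful write when w + r > n."""
    return w + r > n


def read_with_repair(replicas, key, r):
    """Read from r replicas, return the newest value, and repair stale copies."""
    contacted = replicas[:r]
    found = [replica[key] for replica in contacted if key in replica]
    if not found:
        return None
    newest = max(found, key=lambda versioned: versioned[0])
    for replica in contacted:
        current = replica.get(key)
        if current is None or current[0] < newest[0]:
            replica[key] = newest  # read repair: overwrite the outdated copy
    return newest[1]


# n = 3 replicas; the last write (version 2) reached w = 2 of them.
replicas = [
    {"cart": (2, "new")},
    {"cart": (1, "old")},   # this node was offline during the write
    {"cart": (2, "new")},
]
value = read_with_repair(replicas, "cart", r=2)
```

With n = 3, w = 2, r = 2 the condition holds, so any two replicas read must include at least one holding version 2; the stale second replica is repaired as a side effect of the read.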

Answer 27

Concurrent writes = multiple writes on a few different nodes which are not communicating. Leaderless and multi-leader replication have this issue (more than 1 write node).

A problem that occurs in both multi-leader and leaderless implementations of replication is detecting concurrent writes. Concurrent writes occur when two writes to the database from different clients do not know about each other. While it is most important that the database replicas all converge to a consistent state, there are certain ways of dealing with concurrency that improve durability by not arbitrarily picking one write to keep and throwing out the others.
---------
How to detect them? [INCOMPLETE]

Answer 28

When dealing with large systems, a common issue is that a single database table becomes too big to store on a single machine. As a result, the table must be partitioned, or split, onto multiple different nodes. How exactly this splitting is done is an implementation detail, but being able to partition a database greatly increases the scalability of a system by allowing a given database table to get arbitrarily big, and perhaps even to store more relevant data in nodes closer to the users accessing it. That being said, partitioning, also known as sharding, comes with many complications.
----
Dealing with partitioning data

1. Want to split up keys so that each partition has relatively even load on it (both in data and queries), otherwise the result is hot spot partitions

2. Can partition keys by range chunks, not necessarily even ranges because some ranges will have more data and activity than others
- These ranges can be chosen manually or automatically by the database
- Keep keys in sorted order within the partition
- In certain scenarios, such as partitioning by timestamp ranges, this can easily lead to hot spots if most of the queries want recent data

3. Can partition by hash of key and split by a range of hashes; good hash functions will uniformly distribute the keys
- Loses the ability to do fast range queries, have to check all partitions
- Helps reduce hot spots, but if all of the activity is on one key, then hot spots will still occur. This can perhaps be mitigated for certain keys by adding a random number to the key on every write, thus spreading the operations for it across partitions, but it makes reads slow because they need to check all of the partitions for the key's data

4. Certain databases allow partitioning by a hash of one key (for example a user id), but then allow you to do efficient range queries on other columns of the data (such as a timestamp)
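
Options 2 and 3 can be compared with two small routing functions. This is a sketch under stated assumptions: the boundary keys, the choice of SHA-256, and the use of only the first 32 bits of the digest are illustrative choices, not how any particular database does it.

```python
import bisect
import hashlib


def range_partition(key, boundaries):
    """Key-range partitioning: boundaries are the sorted split points.
    Keys below boundaries[0] go to partition 0, and so on."""
    return bisect.bisect_right(boundaries, key)


def hash_partition(key, num_partitions):
    """Hash partitioning: hash the key into [0, 2**32) and give each
    partition an equal, contiguous range of hash values."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:4], "big")
    return hash_value * num_partitions // 2**32


# Range partitioning keeps neighbouring keys together (good for range scans).
boundaries = ["h", "q"]  # partitions: [..h), [h..q), [q..]
range_assignments = [range_partition(k, boundaries) for k in ["apple", "kiwi", "zebra"]]

# Hash partitioning scatters neighbouring keys (good for even load).
hash_assignments = [hash_partition(k, 4) for k in ["user:1", "user:2", "user:3"]]
```

With range partitioning a scan over keys from "i" to "p" touches only partition 1, whereas with hash partitioning the same scan has to ask every partition.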

Answer 29

Transactions are an abstraction used by databases to reduce all writes to either a successful one that can be committed, or an erroneous one that can be aborted. While transactions are somewhat hard to implement in distributed systems (we will discuss later), in a single database they can be rather useful. They aim to provide the safety guarantees outlined by ACID.
--
The meaning of ACID:
- Atomicity - If a client makes several writes, but a fault occurs after only some of the writes are completed, the writes already completed will be rolled back
- Consistency - The application can rely on the properties of the database to ensure that invariants about the data will hold (in the face of faults)
- Isolation - Concurrently executing transactions are isolated from one another (serializability); each transaction can pretend it is the only one running on the database. Most databases do not implement this due to performance penalties, and instead use weak isolation levels
- Durability - Once a transaction is completed, the data will never be forgotten, even in the face of faults

In single-object writes, almost all database engines provide guarantees about atomicity and isolation so that the data for an individual key does not become corrupted or somehow mixed with the previous value. Atomicity can be implemented using a log for crash recovery, and isolation can be done using a lock on each object.
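
Atomicity is easy to observe with Python's built-in `sqlite3` module. The sketch below uses a hypothetical `accounts` table with a `CHECK (balance >= 0)` invariant; the transfer credits the destination first, so when the debit violates the invariant the already-completed credit must be rolled back too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint failed; the credit was rolled back


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


failed = transfer(conn, "alice", "bob", 500)   # would overdraw alice
after_failed = balances(conn)
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
```

After the failed transfer the balances are unchanged (alice 100, bob 0), even though the credit to bob had already executed inside the transaction.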

Answer 30

Although we have now spoken about some problems that can be reduced to consensus, it now seems best to actually discuss some ways that consensus can be achieved. Firstly, we can talk about two phase commit, which is somewhat inefficient, but solves the problem of atomic commit (getting all replicas to agree on whether a transaction should be committed or aborted).

Two phase commit:
- Algorithm used to solve the atomic commit problem
- Coordinator node (the application) sends writes to each node
- Coordinator then sends each node a prepare request, to which each node responds saying whether it will be able to commit
- If all the nodes can commit, the coordinator tells them to do so, otherwise it tells them all to abort
- Coordinator has an internal log with its decisions for each transaction in the event that it crashes
- If the request to commit or abort does not reach all the participants, the coordinator must keep retrying on all nodes until they get the message; it cannot accept a timeout

Two points of no return:
- Participants (database replicas) that say yes in the prepare stage must eventually commit the write and are not allowed to abort it later
- Once the coordinator decides to commit or abort, it must get this through to all of the participant nodes

The coordinator is a single point of failure, and if it crashes none of the nodes can abort or commit after they have done their preparations (so it should be replicated).
- To avoid this we would need a perfect failure detector to perform some sort of failover, which is impossible due to unbounded network delay
- When this happens the replicas often hold locks on many rows, which may block a significant number of transactions until the coordinator node is back

Database-internal distributed transactions (transactions using only the same database technology) can actually be pretty quick and optimized. However, when using multiple different types of data systems (like databases, message brokers, email services), you need a transaction API (such as XA), which is often quite slow.

Unlike two phase commit, good consensus algorithms reach agreement by using a majority (quorum) of nodes, in order to improve availability. Each new leader is elected in a new epoch (epoch numbers increase monotonically in order to prevent split brain), and consensus algorithms define a recovery process which nodes can use to get into a consistent state.

Coordination services such as ZooKeeper are used internally in many other popular libraries, and are a replicated in-memory key-value store that allows total order broadcast to your database replicas.
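
The happy path and the abort path of two phase commit can be sketched in memory. This is a toy model under stated assumptions: `Participant` and `Coordinator` are hypothetical classes, the coordinator's log is a plain list rather than durable storage, and crashes, retries, and lock handling are deliberately left out.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        """Phase 1: vote. A yes vote is a promise to commit if told to."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


class Coordinator:
    def __init__(self, participants):
        self.participants = participants
        self.log = []  # decisions; a real coordinator writes this durably

    def run(self, transaction_id):
        # Phase 1: ask every participant whether it can commit.
        votes = [p.prepare() for p in self.participants]
        decision = "commit" if all(votes) else "abort"
        # Point of no return: record the decision before telling anyone.
        self.log.append((transaction_id, decision))
        # Phase 2: deliver the decision to every participant.
        for p in self.participants:
            if decision == "commit":
                p.commit()
            else:
                p.abort()
        return decision


all_yes = [Participant("db-1"), Participant("db-2")]
outcome_ok = Coordinator(all_yes).run("tx-1")

one_no = [Participant("db-1"), Participant("db-2", can_commit=False)]
outcome_failed = Coordinator(one_no).run("tx-2")
```

A single no vote is enough to abort the transaction on every participant, including those that had already voted yes.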


(54 cards)