System Design Interviews Flashcards
How to Design A Key-Value Store:
How to Design a System that Supports Millions of Users
Tell me about CAP Theorem
CAP Theorem is one of the most fundamental theorem in distributed computing.
Consistency - Every node on a network will have access to the same data
Availability - System is always available to the users
Partition Tolerance - In case of a failure in the network, the system will always work
Tell me about APIs
REST
GRPC
GraphQL
Tell me about Databases
SQL
NoSQL
Tell me about Scaling
Vertical Scaling
Horizontal Scaling
Replication vs Sharding
Tell me about Web authentication and basic security
The question of security is all about the trade off between total safety (a wall) vs total convenience (a hole in the wall).
Tell me about Load Balancers and different algorithms
Load balancers are an important component of any distributed system that helps distribute traffic across the machines. As distributed systems are designed to scale, the basic requirement is to add or remove machines in case of increased load or in case of failures. Load balancers also help manage this.
Tell me about Caching
Caching is storing the result of an expensive computation so that we don’t need to do it again.
Tell me about Message Queues
An asynchronous system is one where the client sends the request and does not wait for the message to be delivered. For the client, sending the message is like any other function or an API call, but in contrast to the APIs, the clients do not expect a response.
Tell me about Indexing
Indexing is a mechanism by which the underlying data is mapped for faster retrieval.
Tell me about Failover
The leader fails in a cluster of hosts. One rule about it, the leader’s failures should be tracked.
When a leader node fails:
- One of the follower nodes needs to be promoted to the leader.
- Client node(s) must be reconfigured to send the write request to the new leader.
- Other followers need to be reconfigured to consume data from the new leader.
Complications of Failover
Failover can lead to some tricky issues:
- Failover leads to lost updates in the case of asynchronous replication when the leader goes down.
How to detect if the leader has gone down? - Deciding the threshold for when to mark the leader as unavailable can be challenging. Sometimes it can happen that the traffic on the system is high, and that’s why it is taking longer to respond. In such scenarios, if we bring down the leader, the system would be even more stressed.
- The problem of having more than one leader.
Tell me about Replication
In system design, replication refers to making multiple copies of the same data on different machines.
Why replication? Replication is done to achieve one or more of the following goals:
- To avoid a single point of failure and increase availability when machines go down.
- To better serve the global users by organizing copies by distinct geological locations in order to serve users from copies that are close by.
- To increase throughput. With more machines, more requests can be served.
Synchronous replication: The leader ensures the write is done to all the replicas.
Async replication: The leader does not wait for acknowledgement of write events in replicas.
Sync replication ensures guaranteed delivery to all the followers, while async replication is less time-consuming for the client.
Most common types of Replication:
** 1. Single leader:** In system design, a single machine acts as a leader, and all write requests (or updates to the data store) go through that machine. All the other machines are used to cater to the read requests. It’s known as “primary-standby” or “active-passive” replication.
The leader also needs to pass down the information about all the writes to the follower nodes to keep them up to date. In case the leader goes down, one of the follower nodes (mostly with the most up-to-date data) is promoted to be the leader. This is called failover.
** 2. Multi leader: ** In system design, this means that more than one machine can take the write requests. This makes the system more reliable in case a leader goes down. This also means that every machine (including leaders) needs to catch up with the writes that happen over other machines. This is called “active-active” replication.
Conflict resolution for concurrent writes:
Keeping the update with the largest client timestamp.
Sticky routing—writes from same client/index go to the same leader.
Keeping and returning all the updates.
** 3. Leaderless replication: ** In such a system, all machines can cater to write and read requests. In some cases, the client directly writes to all the machines, and requests are read from all the machines based on quorum. Quorum refers to the minimum number of acknowledgements (for writes) and consistent data values (for reads) for the action to be valid. In other cases, the client request reaches the coordinator that broadcasts the request to all the nodes.
Tell me about “Consistent Hashing”?
Consistent hashing is a way to effectively distribute the keys in any distributed storage system—cache, database, or otherwise—to a large number of nodes or servers while allowing us to add or remove nodes without incurring a large performance hit.