System Design Fundamentals Flashcards
What questions should you ask in the first step of designing a system?
Who is going to use it?
How are they going to use it?
How many users are there?
What does the system do?
What are the inputs and outputs of the system?
How much data do we expect to handle?
How many requests per second do we expect?
What is the expected read to write ratio?
What are the functional requirements?
What are the non-functional requirements, such as scale? Based on the functional requirements, which do we care about most: consistency, availability, or partition tolerance? What about durability and fault tolerance?
When would you use a NoSQL database vs. a SQL database?
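The questions about users, data volume, requests per second, and read to write ratio can be turned into a quick back-of-the-envelope estimate. The sketch below is only an illustration: the function name `estimate_load` and every input number are assumptions made up for the arithmetic, not figures from any real system.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate_load(daily_active_users, reads_per_user, writes_per_user, bytes_per_write):
    """Return rough average QPS and daily storage growth from assumed usage numbers."""
    reads_per_day = daily_active_users * reads_per_user
    writes_per_day = daily_active_users * writes_per_user
    return {
        "read_qps": reads_per_day / SECONDS_PER_DAY,
        "write_qps": writes_per_day / SECONDS_PER_DAY,
        "read_write_ratio": reads_per_day / writes_per_day,
        "storage_bytes_per_day": writes_per_day * bytes_per_write,
    }


# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
estimate = estimate_load(
    daily_active_users=1_000_000,
    reads_per_user=20,
    writes_per_user=2,
    bytes_per_write=1_000,
)
print(estimate)
```

These are averages; peak traffic is usually higher, so treat the result as a lower bound when sizing the system.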
Reasons for SQL:
- Structured data
- Strict schema
- Relational data
- Need for complex joins
- Transactions
- Clear patterns for scaling
- More established: developers, community, code, tools, etc.
- Lookups by index are very fast
Reasons for NoSQL:
- Semi-structured data
- Dynamic or flexible schema
- Non-relational data
- No need for complex joins
- Store many TB (or PB) of data
- Very data-intensive workload
- Very high throughput for IOPS
Sample data well-suited for NoSQL:
- Rapid ingest of clickstream and log data
- Leaderboard or scoring data
- Temporary data, such as a shopping cart
- Frequently accessed ('hot') tables
- Metadata/lookup tables
The benefits of NoSQL:
NoSQL has better locality than SQL
- What this means is that a large document, for example some sort of social media profile, can all be stored in one area
- Therefore pulling this document is much faster than having to fetch multiple related rows in a SQL table
- However, in the event that we only need a certain part of this information, we will be sending more data over the network resulting in a slower call
- Since sequential disk reads are faster, because the disk head does not have to move around, having lots of data stored in the same place results in a faster query
NoSQL is easier to shard
- To give a quick summary of sharding (will go into more detail later), it is taking a database that is too big and splitting it up over multiple machines
- This is complicated with a SQL database, because when using a join call, you may potentially have to access many partitions resulting in lots of network calls
- On the other hand, the increased locality of NoSQL means that the entire document will likely be on one machine and can be quickly accessed with a single network call
NoSQL data does not need a fixed format (schema)
- Makes it a bit more maintainable when adding new features to an object, or just having related data with slightly different structures
- Not needing to format data in rows allows the stored format to more closely reflect the in-memory data structure that holds the object
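The sharding and locality points above can be sketched with a toy document store that routes each whole document to one shard by hashing its key. This is a minimal illustration under assumptions: `shard_for` and `ShardedDocumentStore` are hypothetical names, the shards are in-memory dictionaries standing in for machines, and plain modulo hashing is used for simplicity (it reshuffles most keys when the shard count changes).

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (same key, same shard)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedDocumentStore:
    """Each dict stands in for one machine holding whole documents."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, doc_id: str, document: dict) -> None:
        self.shards[shard_for(doc_id, len(self.shards))][doc_id] = document

    def get(self, doc_id: str):
        # One lookup on one shard returns the entire document.
        return self.shards[shard_for(doc_id, len(self.shards))].get(doc_id)


store = ShardedDocumentStore(num_shards=4)
profile = {"name": "ada", "posts": ["first", "second"], "friends": ["grace"]}
store.put("user:42", profile)
print(store.get("user:42"))
```

Because the profile, its posts, and its friends live in one document, a read touches exactly one shard, which is the locality benefit described above.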
The benefits of SQL:
SQL allows joins, whereas most NoSQL databases do not (or only in a limited form)
- Generally speaking, SQL makes it easy to fetch multiple related rows from various tables using the join command, whereas NoSQL databases typically offer little or no support for this
- You can do it in application code, but it will require many network calls and result in a slow query
- As a result, SQL allows the data to be a bit more modular: you can request only certain parts of a potentially large document, at the tradeoff that fetching the entire document may take a long time
SQL has transactions
- Transactions are something that will be covered later, but the gist is that they are an abstraction on top of database writes to provide some guarantees about them and simplify the edge cases a programmer must consider
- However, in a distributed setting transactions that span multiple machines are expensive and often avoided, so this benefit is diminished
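Both SQL benefits above, joins and transactions, can be tried with Python's built-in `sqlite3` module. This is a small sketch, not a recommendation of SQLite for production; the `users` and `posts` tables and their contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada')")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, 1, "first"), (2, 1, "second")],
)
conn.commit()

# Join: fetch related rows from two tables in a single query.
rows = conn.execute(
    "SELECT users.name, posts.title "
    "FROM users JOIN posts ON posts.user_id = users.id "
    "ORDER BY posts.id"
).fetchall()

# Transaction: either both inserts succeed or neither is kept.
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO users VALUES (2, 'grace')")
        conn.execute("INSERT INTO users VALUES (1, 'duplicate')")  # primary key clash
except sqlite3.IntegrityError:
    pass

user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(rows, user_count)
```

The second insert fails, so the first insert in the same transaction is rolled back too, and the programmer does not have to clean up a half-finished write by hand.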
What is reliability?
To work correctly even in the face of adversity.
So if a machine goes down, does it still work? What if a connection goes down?
What is Scalability?
Reasonable ways of dealing with growth.
What is maintainability?
Making it easy for many different people to keep working on the system productively over time.
What do you use to store data?
A database, such as a SQL or NoSQL database
How would you speed up a read?
Use a cache
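A minimal sketch of how a cache speeds up repeated reads: check the cache first and only fall back to the database on a miss. `CachedReader` and the dictionary standing in for a slow database are hypothetical, and the sketch ignores eviction and invalidation, which a real cache needs.

```python
class CachedReader:
    """Serve repeated reads from memory and count how often the database is hit."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db
        self._cache = {}
        self.db_reads = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]  # cache hit: no database call
        self.db_reads += 1  # cache miss: go to the database once
        value = self._load_from_db(key)
        self._cache[key] = value
        return value


slow_db = {"user:1": "ada"}  # stand-in for a real database
reader = CachedReader(slow_db.get)
first = reader.get("user:1")
second = reader.get("user:1")
print(first, second, reader.db_reads)
```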
How would you search data?
Use a search index
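One common form of search index is an inverted index, which maps each word to the documents containing it. The sketch below assumes whitespace tokenization and exact word matches; `build_index` and `search` are illustrative names, not the API of any search engine.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


documents = {
    1: "cache speeds up reads",
    2: "search index speeds up search",
    3: "batch jobs crunch data",
}
index = build_index(documents)
print(search(index, "speeds up"))
```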
What is stream processing?
Send a message to another process to be handled asynchronously
What do you call the action of “periodically crunch data”?
Batch processing
Can you describe the differences between Performance vs. Scalability?
A service is scalable if it results in increased performance in a manner that is proportional to the resources added.
Generally, increasing performance means serving more units of work, but it can also mean handling larger units of work.
Performance Problem - the system is slow for a single user (for example, because the server handles a unit of compute work inefficiently due to a bad algorithm, so both the number and the size of work units it can handle are poor)
Scalability Problem - system is fast for a single user but slow under heavy load
Which properties does a SQL database (RDBMS) fulfill when it comes to the CAP Theorem?
C and A
C - consistency
A - availability
A single-node SQL database has no network partitions (it can still partition its data, which is a different thing)
We fulfill availability and consistency with ACID transactions in a SQL database
ACID: A - Atomic, C - Consistent, I - Isolated, D - Durable
Describe the type of systems that we can design in regards to the CAP Theorem?
CP - Consistency and Partition tolerance
AP - Availability and Partition tolerance
In the case of a network partition there is only one choice to make: do you sacrifice availability or consistency?
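The choice above can be illustrated with a toy replica that is cut off from the primary: in CP mode it refuses to answer, in AP mode it answers with whatever it last saw. `Replica` and `read_during_partition` are hypothetical names, and a single in-memory value stands in for real replication.

```python
class Replica:
    """A toy replica that holds one value and may be cut off from the primary."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False

    def apply(self, value):
        # Replicated writes only arrive while the network is healthy.
        if not self.partitioned:
            self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Sacrifice availability: better an error than a possibly stale value.
            raise RuntimeError("unavailable during partition")
        # Sacrifice consistency: answer with the last value this node saw.
        return self.value


def read_during_partition(mode):
    replica = Replica(mode)
    replica.apply("v1")
    replica.partitioned = True
    replica.apply("v2")  # never reaches the partitioned replica
    return replica.read()


print(read_during_partition("AP"))
```

In AP mode the read returns the stale value `v1`; in CP mode the same call raises an error instead.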
Give an example of a database that is consistent and available? (no partition)
Partition here means a network partition.
Postgres on a single node fulfills CA. To scale, you can use replication for reads and sharding for writes. With asynchronous replication you lose strong consistency, and the system behaves like an AP system at that point.
Describe what a Consistent/Partition database has to do which sacrifices Availability?
CP databases enable consistency and partition tolerance, but not availability.
Be careful when labeling a database, or giving an example of one, as CP, because it depends on how the database is configured and what the setup looks like.
For example, by default MongoDB on a single-node instance is strongly consistent, so you get consistent reads and writes.
MongoDB is strongly consistent when you use a single connection or the correct write/read concern level (which costs execution speed). As soon as those conditions are not met, especially when reading from a secondary replica, MongoDB becomes eventually consistent. So MongoDB is eventually consistent if you read from secondary nodes while a single primary node takes the writes.
That is why the answer depends on which database we are talking about and how it is configured.
When a partition occurs, the system has to turn off inconsistent nodes until the partition can be fixed. MongoDB is an example of a CP database. It’s a NoSQL database management system (DBMS) that uses documents for data storage. It’s considered schema-less, which means that it doesn’t require a defined database schema. It’s commonly used in big data and applications running in different locations. The CP system is structured so that there’s only one primary node that receives all of the write requests in a given replica set.
Secondary nodes replicate the data in the primary node, so if the primary node fails, a secondary node can stand in.
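The primary/secondary setup described above can be sketched as a toy replica set: writes go to one primary, secondaries catch up later, and a secondary can be promoted if the primary fails. This is not MongoDB's API; `ReplicaSet` and its methods are hypothetical, and replication is triggered by hand to make the lag visible.

```python
class ReplicaSet:
    """One primary takes all writes; secondaries copy its data asynchronously."""

    def __init__(self, secondary_count):
        self.primary = {}
        self.secondaries = [{} for _ in range(secondary_count)]

    def write(self, key, value):
        self.primary[key] = value  # only the primary accepts writes

    def replicate(self):
        # Stand-in for asynchronous replication catching up.
        for secondary in self.secondaries:
            secondary.update(self.primary)

    def read(self, key, from_secondary=None):
        node = self.primary if from_secondary is None else self.secondaries[from_secondary]
        return node.get(key)

    def fail_over(self):
        # The primary is gone: promote the first secondary in its place.
        self.primary = self.secondaries.pop(0)


replica_set = ReplicaSet(secondary_count=2)
replica_set.write("balance", 100)
stale_read = replica_set.read("balance", from_secondary=0)  # replication has not run yet
replica_set.replicate()
fresh_read = replica_set.read("balance", from_secondary=0)
replica_set.fail_over()
after_failover = replica_set.read("balance")
print(stale_read, fresh_read, after_failover)
```

The read from the secondary before `replicate()` runs is the eventually consistent case; reading only from the primary avoids it at the cost of putting all reads on one node.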
Describe and give an example of an AP Database? What happens in an Available-Partition database when a partition is happening? (eventual consistency = AP database)
AP databases enable availability and partition tolerance, but not consistency. In the event of a partition, all nodes are available, but they’re not all updated. For example, if a user tries to access data from a bad node, they won’t receive the most up-to-date version of the data. When the partition is eventually resolved, most AP databases will sync the nodes to ensure consistency across them.
Apache Cassandra is an example of an AP database. It’s a NoSQL database with no primary node, meaning that all of the nodes remain available. Cassandra provides eventual consistency because nodes can resync their data once a partition is resolved.
In most large-scale systems you’ll be trading off consistency against availability. Why?
Networks aren’t reliable, so you’ll need to support partition tolerance. You’ll need to make a software tradeoff between consistency and availability.
When should you use a CP approach to a system?
Consistency means: every read receives the most recent write or an error.
CP is a good choice if your business needs require atomic reads and writes.
The cost is that waiting for a response from the partitioned node might result in a timeout error.
When should you use an AP approach to designing a system? (Availability and Partition tolerance)
Availability means: Every request receives a response, without guarantee that it contains the most recent version of the information
Responses return the most readily available version of the data on any node, which might not be the latest. Writes might take some time to propagate when the partition is resolved.
AP is a good choice if the business needs allow for eventual consistency or when the system needs to continue working despite external errors.
When should you decide whether your system favors consistency or availability?
In the requirements (non-functional requirements) stage. The decision guides which databases we use and which approaches we need to take in order to fulfill the requirements properly.
That means thinking about consistency and availability trade-offs early on.
CP System: Bank application, Stocks
AP System: Leaderboard, Analytics systems (can have some lag time)
How would you choose between a SQL and a NoSQL database to store chat history in a chat system, for example?
- Examine the data types and read/write patterns
- The first kind of data is generic data such as user profiles, settings, and friend lists. This data is stored in a robust, reliable relational database for consistency and partition tolerance. Replication and sharding are common techniques to satisfy the availability and scalability requirements of a SQL database: read replicas serve reads, and writes scale by routing each write to the appropriate shard.
The second kind of data, which is unique to chat systems, is chat history data. Chat history data has a different read/write pattern:
- Chat history data is enormous; it is a LOT of messages
- Only recent chats are accessed frequently, not old messages
- Recent chat history is viewed in most cases, but users might use features to jump to specific messages, and these need to be supported by the data access layer.
- The read to write ratio is about 1:1 for 1-on-1 chat apps. They usually don’t have a heavy-read, light-write pattern; it is generally even. For this, we will use a key-value store.
Why a key-value store for chat history?
- Key-value stores allow for easier horizontal scaling
- We don’t need joins, so a key-value store is enough
- Key-value stores provide low latency to access the data
- Relational databases do not handle a long tail of data well; when indexes grow large, random access becomes expensive.
- Key-value stores have been used by Facebook and Discord for this. Facebook used HBase, and Discord used Cassandra (an AP database system)
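The access patterns above (append messages, read the most recent ones, jump to a specific message) fit a key made of a channel id plus a sortable message id. The sketch below is an in-memory stand-in, not HBase or Cassandra; `ChatHistoryStore`, the key layout, and the integer message ids are assumptions for illustration.

```python
import bisect
from collections import defaultdict


class ChatHistoryStore:
    """Keeps each channel's messages sorted by message id, like a (channel, id) key."""

    def __init__(self):
        self._messages = defaultdict(list)  # channel_id -> sorted [(message_id, text)]

    def append(self, channel_id, message_id, text):
        bisect.insort(self._messages[channel_id], (message_id, text))

    def recent(self, channel_id, limit):
        """Return up to `limit` of the newest messages, oldest first."""
        if limit <= 0:
            return []
        return self._messages.get(channel_id, [])[-limit:]

    def from_message(self, channel_id, message_id, limit):
        """Jump to a specific message and return it plus the ones after it."""
        messages = self._messages.get(channel_id, [])
        start = bisect.bisect_left(messages, (message_id,))
        return messages[start:start + limit]


store = ChatHistoryStore()
store.append("room-1", 1, "hi")
store.append("room-1", 3, "bye")
store.append("room-1", 2, "how are you")  # arrives out of order, still sorted by id
print(store.recent("room-1", 2))
```

No joins are involved: every query names one channel and a range of message ids, which is why a key-value store with sorted keys is enough here.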