System Design Flashcards
7 steps of systems design problems
- Requirements clarification
- Back-of-the-envelope estimation
- System interface definition
- Defining data model
- High-level design
- Detailed design
- Identifying and resolving bottlenecks
Step 1: Requirements Clarification
- Determine exact scope
- Define end goals of system
- Clarify which parts of system to focus on (e.g. back-end vs. front-end)
Step 2: Back-of-the-envelope estimation
- Estimate and quantify scale of system
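A hedged, worked example of this step (every number below is a made-up assumption, purely to illustrate the arithmetic):

```python
# Hypothetical estimate for a read-heavy service; all inputs are assumptions.
daily_active_users = 100_000_000     # assumed DAU
reads_per_user_per_day = 10          # assumed read actions per user
writes_per_read = 1 / 100            # assumed 1 write per 100 reads
avg_object_size_bytes = 1_000        # assumed ~1 KB per stored object
seconds_per_day = 24 * 60 * 60       # 86,400

read_qps = daily_active_users * reads_per_user_per_day / seconds_per_day
write_qps = read_qps * writes_per_read
new_storage_per_day = write_qps * seconds_per_day * avg_object_size_bytes

print(f"~{read_qps:,.0f} read QPS, ~{write_qps:,.0f} write QPS")
print(f"~{new_storage_per_day / 1e9:.0f} GB of new data per day")
```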
Step 3: System interface definition
Define what APIs are expected from the system
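As a hedged illustration only, for a hypothetical URL-shortening service the expected APIs might be sketched like this (the function names, parameters, and return types are assumptions, not a prescribed interface):

```python
from typing import Optional

# Illustrative API sketch for a hypothetical URL-shortening service.
def create_url(api_key: str, original_url: str,
               custom_alias: Optional[str] = None,
               expire_date: Optional[str] = None) -> str:
    """Returns the shortened URL, or raises an error on failure."""
    ...

def delete_url(api_key: str, short_url: str) -> bool:
    """Deletes a previously created short URL; returns True on success."""
    ...
```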
Step 4: Define data model
- How will data flow between components?
- How will different entities interact with each other?
- How will we partition and manage data? Specific choices:
- Which database system should we use? NoSQL vs SQL?
- What kind of blob/object storage should we use to store files? (e.g. multimedia)
Step 5: High-level design
- Draw a block diagram representing core components to solve the actual problem from end-to-end.
- Alternatively, describe the system verbally or outline it in list form
Step 6: Detailed design
- Dig deeper into 2-3 major components, guided by interviewer feedback.
- Consider tradeoffs between different approaches.
Step 7: Identifying and resolving bottlenecks
- Identify any single points of failure and discuss mitigation
- Discuss redundancy and backup plans for data and services
- Discuss performance monitoring
key characteristics of distributed systems
- Scalability
- Reliability
- Availability
- Efficiency
- Serviceability or Manageability
Scalability
capability of a system, process, or network to grow and manage increased demand
Horizontal vs vertical scaling
- Horizontal is easier to scale dynamically by adding machines
- Vertical scaling is upper limited and may involve downtime
- Horizontal scaling examples: Cassandra, MongoDB
- Vertical scaling examples: MySQL
Reliability
- Probability that a system will fail in a given period
- Distributed system: keeps delivering services with one or several component failures
- Availability over time
Availability
- Percentage of time a system remains operational over a specific period
* Accounts for maintainability and repair
Efficiency
- Latency / response time to requests (correlates to number of messages)
- throughput / bandwidth (correlates to size of messages)
Serviceability / manageability
- Ease to operate and maintain
- Simplicity and speed of repair (as time to repair grows, availability and reliability shrink)
- Considerations: ease of diagnostics, ease of updates
- Automated fault detection
Load balancer
- component to spread traffic across a cluster of servers
- improves responsiveness and availability
- crucial to horizontal scaling
- Performs health checks on backend servers
- Routes each request only to a healthy server
Load balancing placements
- Between client and web server
- between web servers and internal layer (app servers)
- between internal layer and database
Load balancer: least connection method
- directs traffic to server with fewest connections
* good for large number of persistent connections
Load balancer: least response time method
- directs traffic to the server with the lowest response time
Load balancer: least bandwidth method
- selects the server that is currently serving the least amount of traffic
Load balancer: round robin method
- cycles through available servers and sends each request to the next server
- good for servers of equal specs and few persistent requests
Load balancer: weighted round robin method
- round robin but with weights on different servers based upon processing capacity
Load balancer: IP hash method
- client IP address is hashed and servers are each assigned blocks of hashes
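A minimal sketch of two of the load balancing methods above, round robin and IP hash (the server names and client IP are made up):

```python
import hashlib
from itertools import cycle

servers = ["app1.internal", "app2.internal", "app3.internal"]  # hypothetical pool

# Round robin: cycle through the pool, one request per server in turn.
rr = cycle(servers)
def round_robin() -> str:
    return next(rr)

# IP hash: the same client IP always maps to the same server
# (as long as the pool does not change).
def ip_hash(client_ip: str) -> str:
    digest = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

print(round_robin())             # app1.internal
print(ip_hash("203.0.113.7"))    # deterministic choice for this client
```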
Load balancer redundancy
- can be a single point of failure
- can add more LBs to form a cluster of active/passive instances
- clustered LBs monitor each other and passive takes over if active fails
Caching
- locality of reference principle: recently requested data is likely to be requested again
- often implemented near front end to reduce downstream traffic
Application server cache
- cache placed directly on request layer, check if each request is in the cache before fetching from disk
- can have multiple layers, e.g. node memory -> node disk (still faster than network)
Content Delivery Network (CDN)
- cache for sites serving large amounts of static media
Cache invalidation
- maintenance to keep cache consistent with database when data changes
Write-through cache
- Data is written into the cache and DB simultaneously
- Maximizes consistency, minimizes risk of loss
- Higher latency as a result of double write operations
Write-around cache
- Data is written directly to permanent storage, bypassing the cache
- avoids flooding the cache with writes
- increases chance of cache misses, which must then be read from back end with higher latency
Write-back cache
- Data is written to cache alone
- Write to permanent storage is done in intervals or chunks
- Lowest latency, highest throughput for write-intensive apps
- Risk of data loss if the cache crashes, since recent writes exist only in the cache
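A toy sketch contrasting the three write policies above, with plain dicts standing in for the cache and the database (an illustrative assumption, not a production pattern):

```python
# Dicts stand in for a cache and a database.
cache: dict = {}
db: dict = {}
dirty_keys: set = set()  # keys written to the cache but not yet persisted

def write_through(key, value):
    cache[key] = value   # write both places before acknowledging:
    db[key] = value      # consistent, but pays for two writes

def write_around(key, value):
    db[key] = value      # skip the cache; the next read of key will miss

def write_back(key, value):
    cache[key] = value   # fastest path: cache only
    dirty_keys.add(key)  # must be flushed later; lost if the cache crashes

def flush():
    for key in list(dirty_keys):
        db[key] = cache[key]
    dirty_keys.clear()
```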
Cache eviction policies (6)
- FIFO - evicts oldest block first
- LIFO - evicts newest block first
- LRU - evicts least recently used items first
- MRU - evicts most recently used items first
- LFU - evicts least frequently used items first
- RR - random replacement
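As a concrete example of one policy above, a minimal LRU cache can be sketched with an OrderedDict (the capacity and keys are arbitrary):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used key when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None                      # cache miss
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used
```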
Importance of estimation
- Estimates inform later decisions about scaling, partitioning, load balancing, and caching
Examples of system parameters to estimate
- Number of actions (e.g. requests per second), amount of storage required, expected network bandwidth usage
Components to include in high level design
* Clients, load balancers, application servers, databases, file storage
Examples of detailed design topics
- How will we partition data types between multiple databases?
- How should we optimize data storage further (e.g. recency)?
- How much and at what layer should we implement caching?
- Which components need load balancing?
Horizontal scaling
Adding more servers to your resource pool
Vertical scaling
Adding more power/capacity to an existing server
Cache misses
when request data is not found in the cache
Cache invalidation: 3 main schemas
- write-through
- write-around
- write-back
CDN for smaller systems
- for smaller systems, we can design for future transition to a CDN with a separate subdomain for static media
Data partitioning
technique to break up a big database into many smaller parts across multiple machines
Benefits of data partitioning
Improves the manageability, performance, availability, and load balancing of an application
Justification for data partitioning
After a certain scale point, it is cheaper and easier to scale horizontally by adding machines than it is to grow vertically
Popular data partitioning methods/schemes
- Horizontal partitioning
- Vertical partitioning
- Directory-based partitioning
Horizontal partitioning (range-based partitioning) (data sharding)
Putting different rows into different tables based upon the range of a certain value
Main risk with horizontal partitioning
If the value chosen for partitioning isn’t evenly distributed, then the scheme will lead to unbalanced servers
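A minimal sketch of range-based sharding (the shard names and boundaries are made-up assumptions); note how a skewed key distribution would overload a single shard:

```python
# Toy range-based (horizontal) partitioning on the first letter of a name.
SHARD_RANGES = [
    ("A", "H", "db-shard-0"),
    ("I", "P", "db-shard-1"),
    ("Q", "Z", "db-shard-2"),
]

def shard_for(name: str) -> str:
    first = name[0].upper()
    for low, high, shard in SHARD_RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard covers {name!r}")

print(shard_for("alice"))  # db-shard-0; mostly A-H names would overload this shard
```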
Vertical partitioning
- divide data to store tables related to a specific feature in their own server
- different types of data in different servers
Main risk with vertical partitioning
If the app grows, we may need to further/horizontally partition a feature-specific database
Directory Based Partitioning
- loosely coupled approach
- create a lookup service which knows your current partitioning scheme
- separates partitioning from the DB access code
- functionality: query the directory server which holds the mapping between key and DB server
Directory based partitioning benefits
- because it is loosely coupled, we can add servers to the DB pool or change the partitioning scheme without impacting the application
Key or hash-based partitioning
- apply a hash to some attributes of the entity we are storing, yielding a partition number
- problem: effectively fixes the total number of DB servers - workaround is to use consistent hashing
List partitioning
- each partition is assigned a list of values
* check each record against the list and store it in the relevant partition
Round-robin partitioning
- ensures uniform data distribution by rotating data assignment between partitions
Composite partitioning
- combination of any partitioning schemes to devise a new scheme
- e.g. first applying list partitioning then hash based
Consistent hashing
- hash-based approach that handles adding/removing servers
- hash the objects and the servers randomly to a unit circle
- e.g. position on the circle = hash(o) mod 360 (in degrees)
- assign each object to the next server in the circle in clockwise order
Hash-based partitioning algorithm
- in a system with n servers, place object o in server with id hash(o) mod n
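A minimal sketch of this scheme (the server count is an assumed constant), with the resizing problem noted in a comment:

```python
import hashlib

NUM_SERVERS = 4  # assumed fixed pool size

def server_for(key: str) -> int:
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    # Changing NUM_SERVERS remaps almost every key, which is why
    # consistent hashing (described above) is the usual workaround.
    return digest % NUM_SERVERS

print(server_for("user:42"))
```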
Consistent hashing benefits
If a server fails: only objects mapped to the failed server need to be reassigned to the next server clockwise
If a server is added: only objects mapped to the new server need to be moved
In either case, most objects maintain their prior assignments
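A compact ring sketch using bisect (server names are hypothetical; real implementations usually also place multiple virtual nodes per server to balance load):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring; one point per server for brevity."""

    def __init__(self, servers):
        self.points = sorted((ring_hash(s), s) for s in servers)
        self._hashes = [h for h, _ in self.points]

    def server_for(self, key: str) -> str:
        # Walk clockwise: first server point at or after the key's hash.
        idx = bisect.bisect_right(self._hashes, ring_hash(key))
        return self.points[idx % len(self.points)][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.server_for("user:42"))  # only nearby keys move if a server is added/removed
```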
Partitioning issues: joins and denormalization
Joins are often not feasible across partitions
Common workaround is to denormalize the DB by adding redundant copies across multiple databases
The downside of denormalization is an increased risk of data inconsistency
Partitioning issues: Referential integrity
Most RDBMS do not support foreign keys across DBs on different servers
Apps that require referential integrity across partitioned DBs often have to enforce it in application code
Partitioning issues: rebalancing
Rebalancing is difficult without incurring downtime since we have to move resources across partitions
Directory-based partitioning can make rebalancing easier at the cost of increased system complexity and a new single point of failure on the lookup service
Database indexing
Create an index on particular DB tables to make it faster to search.
Index can be created across one or more columns and includes a pointer to the full row
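A minimal SQLite illustration (the table and column names are made up) of creating an index on a frequently searched column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")

# Index on the column we search by most often; speeds up reads on email,
# at the cost of extra storage and slower writes (the index must be maintained).
conn.execute("CREATE INDEX idx_users_email ON users (email)")

conn.execute("INSERT INTO users (email, name) VALUES (?, ?)", ("a@example.com", "Alice"))
row = conn.execute("SELECT name FROM users WHERE email = ?", ("a@example.com",)).fetchone()
print(row)  # ('Alice',)
```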
Indexing use cases
- Finding a small payload in a large dataset
Indexing performance impacts
- Can speed up retrieval
- Increased write time
- Increased storage requirements
Proxy server
- intermediate server between the requests from clients and the servers that handle those requests
Proxy server use cases
- Filter, log, and transform requests
* Can serve requests from its cache to reduce downstream load
Open proxy
proxy server accessible to any internet user
Reverse proxy
Proxy within an internal network, not visible to the client. Can include load balancing, caching, security to protect internal servers from direct access
Forward proxy
Proxy managed by a client to handle requests to an external server
Redundancy
Duplication of critical components or functions in a system to increase the reliability or improve performance
Replication
Sharing information to ensure consistency between redundant resources.
Often used in DBMSs, where updates are written to a primary server, which then passes them on to the replica servers
Relational database (SQL)
- Structured with predefined schemas
- Each row contains information about an entity
- Each column contains a particular point of data
Non-relational database (NoSQL)
- Unstructured
- easily distributed
- dynamic schema
Common types of NoSQL DBs
- Key-value stores
- Document DBs
- wide-column databases
- graph databases
Key-value store and examples
- Stores an array of key-value pairs
* Examples: Redis
Document databases
- Data is stored in documents which are grouped into collections
- Each document can have a unique structure
- Example: MongoDB
Wide-column database
- Instead of tables, uses column families which are containers for rows
- Don’t need to know all the columns up front
- Each row can have different numbers of columns
- Best for analyzing large datasets
- Examples: Cassandra, HBase
Graph database
- Stores data whose relationships are best represented as a graph
- Data saved in the form of nodes and their properties, and lines/connections between nodes
- Examples: Neo4J
SQL vs NoSQL: Storage
SQL: data stored in tables where each row represents an entity
NoSQL: variety of storage models
SQL vs NoSQL: Schema
SQL: each record conforms to a fixed/predefined schema. Higher referential integrity.
NoSQL: schemas are dynamic and changeable
SQL vs NoSQL: querying
SQL: uses SQL to manipulate and precisely retrieve subsets of data
NoSQL: queries are focused on retrieving collections of full records
SQL vs NoSQL: scalability
SQL: vertically scalable; horizontal scaling is difficult and typically requires sharding or replication
NoSQL: highly horizontally scalable since referential integrity is less of a priority
SQL vs NoSQL: ACID compliance
ACID: Atomicity, Consistency, Isolation, Durability
SQL: usually ACID compliant therefore more reliable
NoSQL: sacrifices ACID compliance for performance and scalability
CAP Theorem
States that it is impossible for a distributed system to simultaneously provide more than two of the following three guarantees:
* consistency, availability, and partition tolerance
Consistency
Every read retrieves the most recent write
Availability
Every request receives a response
Partition tolerance
The system continues to work despite message loss or partial failure.
Data is sufficiently replicated across nodes and networks to handle intermittent/partial outages
standard HTTP web request sequence of events
- Client opens a connection and requests data from server
- The server calculates the response
- The server sends back the response
AJAX (Asynchronous JavaScript and XML) polling
Client repeatedly polls a server for data
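A client-side polling loop might look like the sketch below (the endpoint URL and interval are assumptions; uses the third-party requests library):

```python
import time
import requests  # third-party HTTP client

POLL_URL = "https://example.com/api/updates"  # hypothetical endpoint
POLL_INTERVAL_SECONDS = 5                     # assumed polling interval

def handle_updates(updates):
    print("received:", updates)               # application-defined handler

def poll_forever():
    while True:
        response = requests.get(POLL_URL, timeout=10)
        if response.ok and response.json():
            handle_updates(response.json())
        time.sleep(POLL_INTERVAL_SECONDS)      # most polls return nothing new

# poll_forever()  # would run indefinitely
```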
Long-polling (Hanging GET)
Client requests information from the server, server holds the request open and waits until data is available
WebSockets
- Full-duplex, persistent communication channel between client and server over a single TCP connection
- connection established via a WebSocket handshake
- allows for real-time data transfer
Server-sent events (SSEs)
- Client establishes a persistent, long-term connection to the server
- Server uses this connection to send data to a client
storage capacity model
margin of extra storage above anticipated needs
REST API
REpresentational State Transfer
4 possible API actions (CRUD) and associated methods
- Create (POST)
- Read (GET)
- Update (PUT/PATCH)
- Delete (DELETE)
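A hedged sketch of how these map onto HTTP calls for a hypothetical /users resource (the URL and payloads are assumptions; uses the third-party requests library):

```python
import requests

BASE = "https://api.example.com/users"  # hypothetical REST resource

created = requests.post(BASE, json={"name": "Alice"})             # Create
fetched = requests.get(f"{BASE}/42")                              # Read
updated = requests.put(f"{BASE}/42", json={"name": "Alice B."})   # Update (full replace)
patched = requests.patch(f"{BASE}/42", json={"name": "Al"})       # Update (partial)
deleted = requests.delete(f"{BASE}/42")                           # Delete
```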
HTTP Request headers
optional set of key-value properties sent from the client to the server
JSON document
JavaScript Object Notation. A document in a key-value format frequently used for data transfer via REST APIs
Ways to authenticate
- basic authentication (username/password)
* secret token (e.g. OAuth)
80-20 rule
20% of requests generate 80% of traffic
Key generation service (KGS)
- Service to generate random keys in advance of requests
* Removes risk of duplicates/key collisions
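A toy sketch of the idea (the key length and alphabet are assumptions): keys are generated and de-duplicated ahead of time, so request handlers never have to check for collisions:

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # assumed base-62 keys
KEY_LENGTH = 6                                   # assumed key length

class KeyGenerationService:
    """Pre-generates unique random keys for later use."""

    def __init__(self, batch_size: int = 1000):
        self.available = []      # keys ready to hand out
        self.issued = set()      # keys already given out
        self.batch_size = batch_size

    def _refill(self):
        while len(self.available) < self.batch_size:
            key = "".join(secrets.choice(ALPHABET) for _ in range(KEY_LENGTH))
            if key not in self.issued and key not in self.available:
                self.available.append(key)

    def next_key(self) -> str:
        if not self.available:
            self._refill()
        key = self.available.pop()
        self.issued.add(key)
        return key

kgs = KeyGenerationService()
print(kgs.next_key())
```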
Purging/DB cleanup considerations
- storage is cheap, low cost of storing things for a long time
- searching for expired data can be costly
- active cleanup logic should consider app traffic
- passive cleanup logic could recognize and delete expired data on request
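A minimal sketch of the passive approach, where expired entries are deleted only when they are next requested (the record shape and TTL are assumptions):

```python
import time

TTL_SECONDS = 30 * 24 * 3600          # assumed 30-day expiry
store = {}                            # key -> (value, created_at)

def put(key, value):
    store[key] = (value, time.time())

def get(key):
    entry = store.get(key)
    if entry is None:
        return None
    value, created_at = entry
    if time.time() - created_at > TTL_SECONDS:
        del store[key]                # passive cleanup: delete on access
        return None
    return value
```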