Systems Design: Basics, Load Balancing, Caching Flashcards
What are functional vs. non-functional requirements?
Functional requirements are the requirements that define what a system is supposed to do. They describe the various functions that the system must perform.
Non-functional requirements describe how the system performs a task, rather than what tasks it performs. They relate to the quality attributes of the system, such as performance, availability, and security.
What kinds of estimations might you need to make in a systems design interview?
In system design interviews, there are several types of estimations you may need to make:
Load estimation: Predict the expected number of requests per second, data volume, or user traffic for the system.
Storage estimation: Estimate the amount of storage required to handle the data generated by the system.
Bandwidth estimation: Determine the network bandwidth needed to support the expected traffic and data transfer.
Latency estimation: Predict the response time and latency of the system based on its architecture and components.
Resource estimation: Estimate the number of servers, CPUs, or memory required to handle the load and maintain desired performance levels.
Load Estimation
Suppose you’re asked to design a social media platform with 100 million daily active users (DAU) and an average of 10 posts per user per day. To estimate the load, you’d calculate the total number of posts generated daily:
100 million DAU * 10 posts/user = 1 billion posts/day
1 billion posts/day / 86,400 seconds/day ≈ 11,574 requests/second
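The load arithmetic above can be reproduced with a short sketch. The function name is illustrative, and the calculation assumes posts are spread evenly across the day, so it gives an average rate rather than a peak.

```python
SECONDS_PER_DAY = 86_400


def average_requests_per_second(daily_active_users: int,
                                posts_per_user_per_day: float) -> float:
    """Average write rate, assuming posts are spread evenly across the day."""
    posts_per_day = daily_active_users * posts_per_user_per_day
    return posts_per_day / SECONDS_PER_DAY


rate = average_requests_per_second(100_000_000, 10)
print(f"{rate:,.0f} requests/second")  # prints "11,574 requests/second"
```

Real traffic is rarely uniform, so a peak estimate would typically multiply this average by an assumed peak factor.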
Storage Estimation
Consider a photo-sharing app with 500 million users and an average of 2 photos uploaded per user per day. Each photo has an average size of 2 MB. To estimate the storage required for one day’s worth of photos, you’d calculate:
500 million users * 2 photos/user * 2 MB/photo = 2,000,000,000 MB/day ≈ 2 PB/day
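As a rough sketch, the same storage calculation with a unit conversion added. The function name is hypothetical, and decimal units are assumed (1 PB = 10^9 MB).

```python
MB_PER_PB = 1_000_000_000  # decimal units: 1 PB = 10^9 MB


def daily_photo_storage_mb(users: int, photos_per_user_per_day: int,
                           photo_size_mb: int) -> int:
    """Raw storage added per day, before replication or compression."""
    return users * photos_per_user_per_day * photo_size_mb


daily_mb = daily_photo_storage_mb(500_000_000, 2, 2)
print(daily_mb)              # 2000000000
print(daily_mb / MB_PER_PB)  # 2.0 (PB per day)
```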
Bandwidth Estimation
For a video streaming service with 10 million users concurrently streaming 1080p videos at 4 Mbps, you can estimate the required bandwidth:
10 million users * 4 Mbps = 40,000,000 Mbps = 40 Tbps
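A minimal sketch of the bandwidth calculation, assuming every user in the count is streaming at the same moment and at the same bitrate. The function name is illustrative.

```python
MBPS_PER_TBPS = 1_000_000  # decimal units: 1 Tbps = 10^6 Mbps


def aggregate_bandwidth_mbps(concurrent_streams: int, bitrate_mbps: float) -> float:
    """Total egress bandwidth if every stream runs at the given bitrate."""
    return concurrent_streams * bitrate_mbps


total_mbps = aggregate_bandwidth_mbps(10_000_000, 4)
print(total_mbps)                  # 40000000
print(total_mbps / MBPS_PER_TBPS)  # 40.0 (Tbps)
```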
Latency Estimation
Suppose you’re designing an API that fetches data from multiple sources, and you know that the average latency for each source is 50 ms, 100 ms, and 200 ms, respectively. If the data fetching process is sequential, you can estimate the total latency as follows:
50 ms + 100 ms + 200 ms = 350 ms
If the data fetching process is parallel, the total latency would be the maximum latency among the sources:
max(50 ms, 100 ms, 200 ms) = 200 ms
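The two cases above reduce to a sum and a maximum. The sketch below uses hypothetical function names and ignores any overhead for fanning out requests and merging results.

```python
def sequential_latency_ms(source_latencies_ms: list[float]) -> float:
    """Each call waits for the previous one, so the latencies add up."""
    return sum(source_latencies_ms)


def parallel_latency_ms(source_latencies_ms: list[float]) -> float:
    """All calls start together, so the slowest source sets the total."""
    return max(source_latencies_ms)


sources = [50, 100, 200]
print(sequential_latency_ms(sources))  # 350
print(parallel_latency_ms(sources))    # 200
```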
Resource Estimation
Imagine you’re designing a web application that receives 10,000 requests per second, with each request requiring 10 ms of CPU time. To estimate the number of CPU cores needed, you can calculate the total CPU time per second:
10,000 requests/second * 10 ms/request = 100,000 ms/second
Assuming each CPU core provides 1,000 ms of processing time per second (that is, it runs at 100% utilization), the number of cores required would be:
100,000 ms/second / 1,000 ms/second per core = 100 cores
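A small sketch of the core-count calculation. The `target_utilization` parameter is an assumption added here for illustration: at 1.0 it matches the card's figure, and a lower value leaves headroom for spikes.

```python
import math


def cores_needed(requests_per_second: float, cpu_ms_per_request: float,
                 target_utilization: float = 1.0) -> int:
    """Cores required, given that one core supplies 1,000 ms of CPU time per second."""
    cpu_ms_per_second = requests_per_second * cpu_ms_per_request
    cores_at_full_load = cpu_ms_per_second / 1_000
    # Round up: a fractional core still requires a whole core.
    return math.ceil(cores_at_full_load / target_utilization)


print(cores_needed(10_000, 10))       # 100
print(cores_needed(10_000, 10, 0.7))  # 143
```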
Estimation example: designing a messaging service
Number of users: Estimate the total number of users for the platform. This can be based on market research, competitor analysis, or historical data.
Messages per user per day: Estimate the average number of messages sent by each user per day. This can be based on user behavior patterns or industry benchmarks.
Message size: Estimate the average size of a message, considering text, images, videos, and other media content.
Storage requirements: Calculate the total storage needed to store messages for a specified retention period, taking into account the number of users, messages per user, message size, and data redundancy.
Bandwidth requirements: Estimate the bandwidth needed to handle the message traffic between users, considering the number of users, messages per user, and message size.
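The checklist above can be chained into one calculation. Every input value in this sketch is a made-up placeholder (10 million users, 50 messages per user per day, 200-byte messages, one year of retention, three replicas), and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def messaging_estimate(users: int, messages_per_user_per_day: int,
                       avg_message_bytes: int, retention_days: int,
                       replication_factor: int) -> dict:
    """Back-of-the-envelope message volume, storage, and average ingress."""
    messages_per_day = users * messages_per_user_per_day
    bytes_per_day = messages_per_day * avg_message_bytes
    return {
        "messages_per_day": messages_per_day,
        # Retained data, multiplied by the number of replicas kept.
        "storage_bytes": bytes_per_day * retention_days * replication_factor,
        # Average only; peak traffic would be higher.
        "avg_ingress_bytes_per_second": bytes_per_day / SECONDS_PER_DAY,
    }


estimate = messaging_estimate(
    users=10_000_000,              # placeholder value
    messages_per_user_per_day=50,  # placeholder value
    avg_message_bytes=200,         # placeholder value, text only
    retention_days=365,
    replication_factor=3,
)
print(estimate["messages_per_day"])            # 500000000
print(estimate["storage_bytes"] / 1e12, "TB")  # 109.5 TB
```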
Estimation example: designing a video streaming platform
Number of users: Estimate the total number of users for the platform based on market research, competitor analysis, or historical data.
Concurrent users: Estimate the number of users who will be streaming videos simultaneously during peak hours.
Video size and bitrate: Estimate the average size and bitrate of videos on the platform, considering various resolutions and encoding formats.
Storage requirements: Calculate the total storage needed to store the video content, taking into account the number of videos, their sizes, and data redundancy.
Bandwidth requirements: Estimate the bandwidth needed to handle the video streaming traffic, considering the number of concurrent users, video bitrates, and user locations.
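A rough sketch that ties the video estimates together. All input values are illustrative placeholders, the function name is hypothetical, and it assumes a single encoding per video (real platforms usually store several resolutions).

```python
def video_platform_estimate(video_count: int, avg_duration_seconds: int,
                            avg_bitrate_mbps: float, replication_factor: int,
                            peak_concurrent_viewers: int) -> dict:
    """Back-of-the-envelope storage and peak egress for a streaming platform."""
    megabits_per_video = avg_duration_seconds * avg_bitrate_mbps
    # Divide by 8 to convert megabits to megabytes.
    storage_mb = video_count * megabits_per_video * replication_factor / 8
    return {
        "storage_tb": storage_mb / 1_000_000,  # decimal units
        "peak_egress_tbps": peak_concurrent_viewers * avg_bitrate_mbps / 1_000_000,
    }


estimate = video_platform_estimate(
    video_count=1_000_000,              # placeholder value
    avg_duration_seconds=600,           # placeholder value, 10-minute videos
    avg_bitrate_mbps=4,
    replication_factor=3,
    peak_concurrent_viewers=1_000_000,  # placeholder value
)
print(estimate["storage_tb"])        # 900.0
print(estimate["peak_egress_tbps"])  # 4.0
```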
When designing a large system, what things do you need to consider?
https://www.designgurus.io/course-play/grokking-the-system-design-interview/doc/system-design-basics
- What are the different architectural pieces that can be used?
- How do these pieces work with each other?
- How can we best utilize these pieces: what are the right tradeoffs?
What are the key characteristics of distributed systems?
Scalability, Reliability, Availability, Efficiency, and Manageability
What is scalability?
Scalability is the capability of a system, process, or network to grow and manage increased demand. Any distributed system that can continuously evolve to support a growing amount of work is considered scalable.
What is reliability?
Reliability refers to the ability of a system to continue operating correctly and effectively in the presence of faults, errors, or failures. In simple terms, a distributed system is considered reliable if it keeps delivering its services even when one or several of its software or hardware components fail.
A related concept is Fault Tolerance, which is the system’s ability to continue operating (possibly at a reduced level) even when one or more of its components fail. In other words, it is the property that allows a system to absorb or recover from faults without total breakdown.
Reliability vs Fault Tolerance
Scope:
Reliability focuses on the end-to-end correctness and consistency of the entire system’s operation over time.
Fault tolerance focuses on the system’s ability to continue operating when individual components fail.
Perspective:
Reliability is primarily a user-centric concept: Can the system consistently meet the user’s expectations over time?
Fault tolerance is more of a system-centric concept: How does the system handle internal failures or component breakdowns?
Measurement:
Reliability is often measured in terms of uptime, error rates, or mean time between failures (MTBF).
Fault tolerance is often measured by how quickly and effectively the system detects, isolates, and recovers from failures (e.g., failover times).
What is efficiency?
Two standard measures of a distributed system's efficiency are response time (or latency), which denotes the delay to obtain the first item, and throughput (or bandwidth), which denotes the number of items delivered in a given time unit (e.g., a second).
These correspond to the following two unit costs:
* Number of messages globally sent by the nodes of the system regardless of the message size.
* Size of messages representing the volume of data exchanges.
What is availability?
By definition, availability is the proportion of time a system remains operational and able to perform its required function over a specific period. It is usually expressed as a percentage of uptime.
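Availability is commonly quoted as a percentage, which can be turned into a downtime budget. The sketch below uses hypothetical function names and assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the period during which the system was operational."""
    return uptime_hours / (uptime_hours + downtime_hours)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Downtime allowed per year at a given availability percentage."""
    return (100 - availability_percent) / 100 * HOURS_PER_YEAR


print(round(downtime_hours_per_year(99.9), 2))  # 8.76
print(round(availability(8751.24, 8.76), 4))    # 0.999
```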
Serviceability or Manageability
Serviceability or manageability is the simplicity and speed with which a system can be repaired or maintained.
What layer does AWS’s ALB operate on and what is its use case?
Layer 7, the Application Layer of the OSI model. Designed for HTTP/HTTPS and WebSocket traffic.
What is AWS’s Elastic Load Balancer?
Elastic Load Balancer (ELB)
This is the umbrella term for AWS’s load balancing service (officially named Elastic Load Balancing), which includes the Application Load Balancer (ALB), Network Load Balancer (NLB), and Gateway Load Balancer (GWLB). Initially, it referred to what is now called the Classic Load Balancer (CLB), a previous-generation option that AWS no longer recommends for new deployments.
What layer does the Network Load Balancer work at and what are its use cases?
Layer: Operates at Layer 4 (Transport Layer of the OSI model).
Use Case: Designed for TCP/UDP and TLS traffic with ultra-high performance and low latency requirements.