SD Flashcards

1
Q

Steps to System Design

A
  1. Clarify Requirements
  2. Capacity Estimation
  3. High Level Design
  4. Database Design
  5. API Design
  6. Dive into Key Components
  7. Address Key Issues: Scalability, Reliability

or

  1. Requirements clarifications: Always ask questions to find the exact scope of the problem you are solving.
  2. Back-of-the-envelope estimation: It’s always a good idea to estimate the scale of the system you are going to design. This will also help later, when you will be focusing on scaling, partitioning, load balancing, caching, etc.
  3. System interface definition: Define what APIs are expected from the system. This will not only establish the exact contract expected from the system but also ensure that you have not gotten any requirements wrong.
  4. Define data model: Defining the system data model early on will clarify how data will flow among different components of the system and later will also guide towards the data partitioning and management.
  5. High-level design: Draw a block diagram with 5–6 boxes representing the core components of your system. You should identify enough components that are needed to solve the actual problem from end to end.
  6. Detailed design: Dig deeper into 2–3 components; the interviewer's feedback should guide you towards which parts of the system she wants you to explain further. You should be able to present different options, their pros and cons, and why you chose them.
  7. Identifying and resolving bottlenecks: Try to discuss as many bottlenecks (and different approaches to mitigate them) as possible.
2
Q

Clarify Requirements

A

Understand the problem, clarify any ambiguities, and gather as much information as possible about the system.

Two types of requirements to clarify:
* Functional
* Non-functional

Understanding the scope early prevents you from heading in the wrong direction.

3
Q

Functional Requirements Questions

A

What are the core features that the system should support?

Are there any particular features that are more critical than others?

Who will use this system (customers, internal teams, etc.)?

What specific actions should users be able to perform on the system?

How will users interact with the system (web, mobile app, API, etc.)?

Does the system need to support multiple languages or locales?

What are the key data types the system must handle (text, images, structured data, etc.)? This can influence your database choices.

Are there any external systems or third-party services the system needs to integrate with?

4
Q

Non-Functional Requirements Questions

A

What is the expected scale of the system in terms of users and requests?

How much data volume is expected to be handled by the system?

What are the inputs and outputs of the system?

What is the expected read-to-write ratio?

Can the system have some downtime, or does it need to be highly available?

Are there any specific latency requirements?

How critical is data consistency? Can some eventual consistency be tolerated for the sake of availability?

Are there any specific non-functional requirements (performance, scalability, reliability) we should focus on?

5
Q

Capacity Estimation

A

After clarifying the requirements, you can do some calculations to estimate the capacity of the system you are going to design.

Note: Not every system design interview will require detailed capacity estimates. It’s always a good idea to check with your interviewer if it’s necessary.

That said, it’s usually helpful to at least get a rough idea of the number of requests and storage requirements.

Estimating the scale upfront helps guide your design decisions and ensures that the system can meet the desired criteria.

This can include things like expected daily/monthly users, read/write requests per second, data storage and network bandwidth needs.

USERS: Estimate the number of daily users and maximum concurrent users during peak hours.

TRAFFIC: Calculate expected read/write per second. Consider peak traffic periods and potential spikes in usage.

STORAGE: Consider the different types of data (structured, unstructured, multimedia) and estimate the total amount of storage required (and its growth rate).

MEMORY: Evaluate the potential benefits of caching to reduce latency and improve performance. Estimate how much memory you might need to store frequently accessed data.

NETWORK: Estimate bandwidth requirements based on the estimated traffic volume and data transfer sizes.

It is always a good idea to estimate the scale of the system we're going to design. This will also help later, when we focus on scaling, partitioning, load balancing, and caching.

  1. What scale is expected from the system (e.g., number of new tweets, number of tweet views, number of timeline generations per sec., etc.)?
  2. How much storage will we need? We will have different storage requirements if users can have photos and videos in their tweets.
  3. What network bandwidth usage are we expecting? This will be crucial in deciding how we will manage traffic and balance load between servers.
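To make this concrete, here is a minimal back-of-the-envelope sketch in Go for a Twitter-like service; every input number (tweets/day, read ratio, tweet size, peak multiplier) is an illustrative assumption, not a real figure:

package main

import "fmt"

// Illustrative back-of-the-envelope estimates for a Twitter-like service.
// Every constant below is an assumption for demonstration purposes.
func main() {
	const (
		tweetsPerDay      = 100_000_000 // writes per day (assumed)
		readsPerWrite     = 100         // assumed read-heavy ratio
		avgTweetSizeBytes = 300         // text plus metadata (assumed)
		secondsPerDay     = 86_400
	)

	writeQPS := tweetsPerDay / secondsPerDay
	readQPS := writeQPS * readsPerWrite
	peakWriteQPS := writeQPS * 2 // assume peak is roughly 2x average

	storagePerDayGB := float64(tweetsPerDay) * avgTweetSizeBytes / 1e9
	storagePerYearTB := storagePerDayGB * 365 / 1000

	fmt.Printf("write QPS ~%d (peak ~%d), read QPS ~%d\n", writeQPS, peakWriteQPS, readQPS)
	fmt.Printf("storage ~%.0f GB/day, ~%.1f TB/year (text only)\n", storagePerDayGB, storagePerYearTB)
}

Rounded numbers like these (roughly 1,200 writes/sec, 120K reads/sec, 30 GB/day of text) are usually all the interview needs; photos and videos would dominate real storage.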
6
Q

High Level Design

A

With the requirements and expected capacity in mind, start designing the high-level architecture of the system.

Break down the system into major components or modules, such as the frontend, backend, databases, caches, and external services.

Draw a simple BLOCK DIAGRAM with 5–6 boxes representing the core components of our system. Identify enough components to solve the actual problem from end to end, and outline the high-level flow of data and requests through the system, from the client to the backend and back.

  • Keep it simple and clean.
  • Use appropriate notations and symbols to represent the components, their interactions, and the data flow.
  • Use different colors, line styles, or symbols to differentiate between various types of components or interactions.
  • Stick with simple boxes representing components and arrows showing directional data flow.
  • Show how data flows through the system, from input to storage and retrieval using arrows.
  • Avoid cluttering the diagram with too much detail or unnecessary elements.
  • Don’t overthink the minor details, this is about the big picture.

For Twitter, at a high level, we will need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution. If we assume we will have a lot more read traffic than write traffic, we can decide to have separate servers for handling these scenarios. On the backend, we need an efficient database that can store all the tweets and support a huge number of reads. We will also need a distributed file storage system for storing photos and videos.

clients -> LB -> [appServer1, appServer2, …] <-> DB, File Storage

7
Q

What to include in High Level Design Diagram

A

CLIENT APPLICATIONS: Indicates how users will interact with the system (web browser, mobile app, desktop application, etc.).

WEB SERVERS: Servers that handle incoming requests from clients.

LOAD BALANCERS: Used to distribute traffic evenly across servers so that no single server is overwhelmed by significant traffic.

APPLICATION SERVICES: The backend logic layer where the core functionalities of the system are implemented.

DATABASES: Specify the type of database: SQL vs. NoSQL, and briefly explain why.

CACHING LAYER: Specify the caching technology (e.g., Redis, Memcached) if you're using one to reduce load on the database.

MESSAGE QUEUES: If using asynchronous communication.

EXTERNAL SERVICES: If the system relies on third-party APIs (e.g., payment gateways), include them.

For every component, make sure to consider trade-offs and justify why you picked specific technologies or architectures (e.g., “We need strong consistency, so a relational database is a good fit”).

8
Q

Database Design

A

This step involves modeling the data, choosing the right storage for the system, designing the database schema, and optimizing the storage and retrieval of data based on the access patterns.

  • Data Modeling
  • Choosing the Right Storage
  • Design The Database Schema
  • Define Data Access Patterns
9
Q

Data Modeling

A
  • Identify the main data entities or objects that the system needs to store and manage (e.g., users, products, orders).
  • Consider the relationships between these entities and how they interact with each other.
  • Determine the attributes or properties associated with each entity (e.g., a user has an email, name, address).
  • Identify any unique identifiers or primary keys for each entity.
  • Consider normalization techniques to ensure data integrity and minimize redundancy.

Defining the data model in the early part of the interview will clarify how data will flow between different components of the system. Later, it will guide data partitioning and management. The candidate should be able to identify various entities of the system, how they will interact with each other, and different aspects of data management like storage, transportation, encryption, etc. Here are some entities for our Twitter-like service:

  • User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc.
  • Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, etc.
  • UserFollows: UserID1, UserID2
  • FavoriteTweets: UserID, TweetID, TimeStamp
10
Q

Choosing the Right Storage

A
  • Evaluate the requirements and characteristics of the data to determine the most suitable database type.
  • Consider factors such as data structure, scalability, performance, consistency, and query patterns.
  • Relational databases (e.g., MySQL, PostgreSQL) are suitable for structured data with complex relationships and ACID properties.
  • NoSQL databases (e.g., MongoDB, Cassandra) are suitable for unstructured or semi-structured data, high scalability, and eventual consistency; they follow BASE properties (basically available, soft state, eventually consistent).
  • Consider using a combination of databases if different data subsets have distinct requirements.

Relational, Key-Value, Graph, Document, Column Store

11
Q

Design the Database Schema

A
  • Define the tables, columns, data types, and relationships based on the chosen database type.
  • Specify primary keys, foreign keys, and any necessary indexes to optimize query performance.
  • Consider denormalization techniques, such as duplication or pre-aggregation, to improve read performance if needed.

Table: User
- UserId: PK
- Name
- Email
- DOB

Table: Tweet
- TweetId: PK
- UserId
- Content
- Likes
- CreationTime

Table: UserFollow
- UserId1: FK
- UserId2: FK
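As one possible concrete rendering, here is the schema above as SQL DDL executed through Go's database/sql. The tables follow the card; the Postgres driver, DSN, column types, and constraints are illustrative assumptions, not a prescribed design:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // hypothetical choice of Postgres driver
)

// Illustrative DDL for the User, Tweet, and UserFollow tables sketched above.
const schema = `
CREATE TABLE users (
    user_id BIGSERIAL PRIMARY KEY,
    name    TEXT NOT NULL,
    email   TEXT UNIQUE NOT NULL,
    dob     DATE
);
CREATE TABLE tweets (
    tweet_id      BIGSERIAL PRIMARY KEY,
    user_id       BIGINT NOT NULL REFERENCES users (user_id),
    content       TEXT NOT NULL,
    likes         INT NOT NULL DEFAULT 0,
    creation_time TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE user_follows (
    user_id1 BIGINT NOT NULL REFERENCES users (user_id),
    user_id2 BIGINT NOT NULL REFERENCES users (user_id),
    PRIMARY KEY (user_id1, user_id2)
);`

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/twitter_demo?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if _, err := db.Exec(schema); err != nil {
		log.Fatal(err)
	}
}

Note the composite primary key on user_follows: it both deduplicates follow edges and acts as an index for "who does this user follow?" lookups.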

12
Q

What is Database Indexing?

A

Database indexing is a technique used to accelerate the retrieval of data within a database. Think of an index as the table of contents in a book 📖. Without it, the database would have to scan every row to find the needed data, which would be inefficient, especially as data volume grows.

When we create an index on a database column, we’re creating a structure that holds a sorted list of pointers to the rows where each unique value occurs. This makes retrieving rows by specific values (such as a specific user ID or product ID) significantly faster, especially as table size grows.

✨ Database indexing is a powerful technique for boosting query performance and scalability in large-scale systems. However, it requires a thoughtful and strategic approach to balance the costs and benefits.

🔍 By understanding the types of indexes available, their use cases, and the practical implications of index design, system architects and developers can build highly performant applications that scale gracefully under high traffic.
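To make the "sorted list of pointers" idea concrete, here is a toy in-memory index in Go that maps a column value to the positions of matching rows, so a lookup can skip the full scan. This is a conceptual sketch only, not how a real database engine implements indexes:

package main

import "fmt"

type Row struct {
	UserID int
	Name   string
}

// buildIndex maps a column value (UserID) to the positions of matching rows,
// standing in for the pointer list a real database index maintains.
func buildIndex(rows []Row) map[int][]int {
	idx := make(map[int][]int)
	for pos, r := range rows {
		idx[r.UserID] = append(idx[r.UserID], pos)
	}
	return idx
}

func main() {
	rows := []Row{{7, "ana"}, {3, "bo"}, {7, "cy"}}
	idx := buildIndex(rows)

	// Indexed lookup: jump straight to matching rows instead of scanning all of them.
	for _, pos := range idx[7] {
		fmt.Println(rows[pos])
	}
}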

13
Q

Why Database Indexing Is Essential

A
  • Improves Query Performance: Indexes drastically reduce the time needed for data retrieval, especially on large datasets.
  • Supports Query Optimization: Query optimizers in databases rely heavily on indexes to decide the most efficient path for retrieving data.
  • Enhances System Scalability: In systems where performance needs to scale with user growth, indexing is crucial for maintaining high query throughput.
  • Reduces Disk I/O: Since indexes allow databases to locate data with fewer reads, they decrease the amount of I/O operations, which is beneficial in both performance and cost.
14
Q

Monitoring and Best Practices for Indexing in Production

A

Index Monitoring Tools:

  • Database-Specific Tools: Most databases (e.g., MySQL’s EXPLAIN, PostgreSQL’s pg_stat_activity) provide tools for examining index usage and query plans.
  • Performance Monitoring Tools: Tools like Prometheus, Datadog, and New Relic allow monitoring query performance and identifying slow queries affected by indexing.
  • Automated Index Tuning: Cloud databases often have automatic index suggestions based on query patterns, helping optimize without manual intervention.
15
Q

Types of Database Indexing Techniques

A
  1. B-Tree Indexes 🌲
    * Ideal for: Range-based queries, ordered retrieval
    * Example: E-commerce product filtering by price
    * Performance impact: Moderate storage cost, good for range queries
  2. Hash Indexes 🔢
    * Ideal for: Exact-match lookups, unique constraints
    * Example: Social media authentication
    * Performance impact: Fast exact matches, no range support
  3. Bitmap Indexes
    * Ideal for: Low-cardinality columns, analytic queries with AND/OR
    * Example: Data warehouse analytics (e.g., status filters)
    * Performance impact: Efficient for AND/OR queries, less so for high-frequency updates
  4. Full-Text Indexes
    * Ideal for: Large text searches, natural-language queries
    * Example: Blog or article search
    * Performance impact: Resource-intensive, but powerful for text search
16
Q

What are the best practices for Database indexing

A

📊 Index Only When Necessary: Avoid over-indexing; create indexes based on query needs and frequency to maximize value.

🔄 Review Index Performance Regularly: Applications evolve, and so do query patterns. Routinely review and adjust your indexing strategies to align with current access patterns.

⚡ Use Covering Indexes for Common Queries: A covering index can fulfill a query directly, reducing I/O by avoiding main table access and improving response times.

17
Q

What are B-Tree Indexes

A

B-Tree (Balanced Tree) indexes are among the most commonly used indexing structures in relational databases. A B-Tree is a self-balancing tree that keeps keys in sorted order, where each node can have multiple children; this organization supports efficient range-based queries and keeps read and write performance consistent as data grows.

  • Advantages:
    • Excellent for range queries (e.g., finding all users aged between 25 and 30).
    • Self-balancing properties provide consistent access times.
  • Disadvantages:
    • Performance can degrade with high-frequency updates due to rebalancing.
    • More complex to maintain with heavy write loads.
  • Use Cases: Suited for large datasets in read-heavy applications (e.g., e-commerce product listings). For example, suppose an e-commerce platform has a products table that includes columns like product_id, price, and date_added. Users might want to filter products within a specific price range or list products added within a certain timeframe. A B-Tree index on the price or date_added column can enable this efficiently:
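A possible sketch of that index and a range query it accelerates, via Go's database/sql. Most relational databases back a plain CREATE INDEX with a B-Tree; the driver, DSN, and table layout here are illustrative assumptions:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // hypothetical choice of Postgres driver
)

func main() {
	// DSN is a placeholder; any relational database with B-Tree indexes works similarly.
	db, err := sql.Open("postgres", "postgres://localhost/shop_demo?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A plain CREATE INDEX is B-Tree-backed in most relational databases.
	if _, err := db.Exec(`CREATE INDEX IF NOT EXISTS idx_products_price ON products (price)`); err != nil {
		log.Fatal(err)
	}

	// Range query the index can serve without scanning the whole table.
	rows, err := db.Query(`SELECT product_id FROM products WHERE price BETWEEN $1 AND $2`, 10, 50)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
		log.Println("product:", id)
	}
}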
18
Q

What are Hash Indexes

A

Hash indexes use a hash function to convert a search key into a location in a table. These indexes work well for equality comparisons (e.g., finding a user by user ID) but are not effective for range queries.

  • Advantages:
    • Very fast for equality comparisons (e.g., SELECT * FROM users WHERE id = ?).
    • Less memory overhead compared to B-trees for single-column indexes.
  • Disadvantages:
    • Cannot handle range queries.
    • May have performance issues with collisions if hash values are not unique.
  • Use Cases: High-speed lookups in applications where queries are based primarily on unique IDs or keys (e.g., session token lookup). For example, consider a social media platform where user authentication checks if the provided username and password_hash match a stored record. Since this query only requires an exact match and doesn't involve any range-based searching, a hash index is ideal.
19
Q

What are Bitmap Indexes

A

Bitmap indexes store columns as binary strings (bitmaps), where each bit indicates the presence or absence of a particular value. Bitmap indexes are very efficient for columns with low cardinality (i.e., columns with a limited number of distinct values, like a “status” field).

  • Advantages:
    • Efficient for columns with low cardinality (e.g., Boolean or status fields).
    • Excellent for complex queries involving multiple fields.
  • Disadvantages:
    • Requires substantial storage space on high-cardinality fields.
    • Can slow down write operations due to the need to update multiple bitmaps.
  • Use Cases: Data warehouses and analytical databases where queries are read-intensive and based on low-cardinality fields. In a data warehouse storing millions of transactions for analysis, columns like status (with values like ‘completed,’ ‘pending,’ ‘failed’) or is_premium (yes/no) benefit from bitmap indexing. Analysts often need to filter and aggregate data based on these low-cardinality columns, and bitmap indexes allow for efficient query processing on them.
  • Ideal Use Cases : Data warehouses, OLAP systems, Report generation
  • Poor Use Cases : OLTP systems, High-cardinality columns, Frequent updates
20
Q

What are Full-Text Indexes

A

Full-text indexes are specialized for searching text-based fields using keywords. They are widely used in applications where searching text data is essential, like document management systems.

  • Advantages:
    • Highly optimized for text search queries.
    • Supports complex queries, including Boolean and proximity searches.
  • Disadvantages:
    • Can consume large amounts of storage and increase complexity.
    • Slower to maintain on fields with frequent text updates.
  • Use Cases: Search-heavy applications, such as social media and document search systems. For example, imagine a blog platform where users want to search articles based on keywords, titles, and body content. Full-text indexing on these columns can allow for efficient and flexible search functionality across large text fields.
21
Q

Define Data Access Patterns

A
  • Identify the common data access patterns and queries that the system will perform.
  • Optimize the database schema and indexes based on these access patterns to ensure efficient data retrieval.
  • Use appropriate caching mechanisms to store frequently accessed data and reduce database load.
  • For scalability, consider partitioning or sharding your data across multiple databases or tables.
22
Q

Design API and Communication Protocols

A

Designing the API (Application Programming Interface) and communication protocols defines how different components of the system interact with each other and how external clients can access the system’s functionality.

  • Identify the API Requirements:
  • Choose the API Style:
  • Define the API Endpoints:
  • Specify the Data Formats:
  • Choose Communication Protocols:

Define what APIs are expected from the system. This will not only establish the exact contract expected from the system but will also ensure that we haven’t gotten any requirements wrong. Some examples of APIs for our Twitter-like service will be:

postTweet(user_id, tweet_data, tweet_location, timestamp, …)

generateTimeline(user_id, current_time, user_location, …)

markTweetFavorite(user_id, tweet_id, timestamp, …)

23
Q

Identify the API Requirements

A
  • Determine the main functionalities and services that the system needs to expose through the API.
  • Consider the different types of clients (e.g., web, mobile, third-party services) that will interact with the API.
  • Identify the data inputs, outputs, and any specific requirements for each API endpoint.
24
Q

Choose the API Style:

A
  • Select an appropriate API style based on the system’s requirements and the clients’ needs.
  • RESTful APIs (Representational State Transfer) are commonly used for web-based systems and provide a uniform interface for resource manipulation.
  • GraphQL APIs offer a flexible and efficient approach for clients to query and retrieve specific data fields.
  • RPC (Remote Procedure Call) APIs are suitable for systems with well-defined procedures or functions.
25
Q

Define the API Endpoints:

A
  • Design clear and intuitive API endpoints based on the system’s functionalities and data model.
  • Use appropriate HTTP methods (e.g., GET, POST, PUT, DELETE) for each endpoint to indicate the desired action.
  • createProfile(name, email, password string)
  • postTweet(userID, content string, timestamp time.Time)
  • followUser(userID1, userID2 string)
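Mapped onto HTTP, postTweet might look like the net/http sketch below; the /tweets route, payload shape, and omitted persistence layer are illustrative assumptions:

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Tweet is the assumed request payload for POST /tweets.
type Tweet struct {
	UserID    string    `json:"user_id"`
	Content   string    `json:"content"`
	Timestamp time.Time `json:"timestamp"`
}

func main() {
	mux := http.NewServeMux()

	// POST /tweets corresponds to postTweet(userID, content, timestamp).
	mux.HandleFunc("/tweets", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		var t Tweet
		if err := json.NewDecoder(r.Body).Decode(&t); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Persistence omitted: a real handler would write to the tweet store.
		w.WriteHeader(http.StatusCreated)
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}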
26
Q

Specify the Data Formats:

A
  • Choose the data formats for the API requests and responses.
  • Common formats include JSON (JavaScript Object Notation) and XML (eXtensible Markup Language).
  • Consider factors such as readability, parsing efficiency, and compatibility with the clients and system components.
27
Q

Choose Communication Protocols:

A
  • HTTPS: Commonly used for RESTful APIs and web-based communication.
  • WebSockets: Useful for real-time, bidirectional communication between clients and servers (e.g., chat applications).
  • gRPC (gRPC Remote Procedure Call): Efficient for inter-service communication in microservices architectures.
  • Messaging Protocols: AMQP, MQTT for asynchronous messaging (often used with message queues).
28
Q

Dive Deeper into Key Components

A

Your interviewer will likely want to focus on specific areas, so pay attention and discuss those areas in more detail.

Common Areas for Deep Dives:

  • Databases: How would you handle a massive increase in data volume? Discuss sharding (splitting data across multiple databases), replication (read/write replicas).
  • Web Servers/Application Servers: How do you add more servers behind the load balancer for increased traffic?
  • Load Balancers: Which Load Balancing techniques and algorithms to use (e.g., round-robin, least connections).
  • Caching: Where would you add more cache layers (in front of web servers? in the application layer?), and how would you deal with cache invalidation?
  • Single Points of Failure: Identify components whose failure would take down the system and discuss how to address it.
  • Authentication/Authorization: How would you manage user access and permissions securely?
  • Rate Limiting: How would you prevent excessive use or abuse of your APIs? (A token-bucket sketch follows at the end of this card.)

The only important thing is to consider trade-offs between different options while keeping system constraints in mind.

  • Since we will be storing a massive amount of data, how should we partition our data to distribute it to multiple databases? Should we try to store all the data of a user on the same database? What issue could it cause?
  • How will we handle hot users who tweet a lot or follow lots of people?
  • Since users’ timelines will contain the most recent (and relevant) tweets, should we try to store our data in such a way that is optimized for scanning the latest tweets?
  • How much and at which layer should we introduce cache to speed things up?
  • What components need better load balancing?

clients
 -> LB
 -> [appServer1, 2, …]
 -> [AggSvc1, 2, …]
 -> [DB Shard1, 2, …]
<-> [CacheSvc1, 2, …]
 -> LB
 -> [CacheSvc1, 2, …]
 -> file storage
 -> [CacheSvc1, 2, …]
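For the rate-limiting bullet above, here is a minimal token-bucket sketch in Go. The capacity and refill rate are arbitrary choices, and a production limiter would usually keep its counters in a shared store such as Redis so the limit holds across all servers:

package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket allows bursts up to capacity and refills at rate tokens/second.
// This version is per-process only, not distributed.
type TokenBucket struct {
	mu       sync.Mutex
	tokens   float64
	capacity float64
	rate     float64
	last     time.Time
}

func NewTokenBucket(capacity, rate float64) *TokenBucket {
	return &TokenBucket{tokens: capacity, capacity: capacity, rate: rate, last: time.Now()}
}

// Allow spends one token if available; otherwise the request is rejected.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate // refill since last call
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	limiter := NewTokenBucket(5, 2) // burst of 5, 2 requests/sec sustained
	for i := 0; i < 8; i++ {
		fmt.Println(i, limiter.Allow()) // first 5 pass, the rest are throttled
	}
}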

29
Q

Address Key Issues

A

This step involves identifying and addressing the core challenges that your system design is likely to encounter.

These challenges can range from scalability and performance to reliability, security, and cost concerns.

  • Addressing Scalability and Performance Concerns
  • Addressing Reliability

Try to discuss as many bottlenecks as possible and different approaches to mitigate them.

  • Is there any single point of failure in our system? What are we doing to mitigate it?
  • Do we have enough replicas of the data so that if we lose a few servers, we can still serve our users?
  • Similarly, do we have enough copies of different services running such that a few failures will not cause a total system shutdown?
  • How are we monitoring the performance of our service? Do we get alerts whenever critical components fail, or their performance degrades?
30
Q

Addressing Scalability and Performance Concerns:

A
  • Scale horizontally (Scale-out) by adding more nodes and use load balancers to evenly distribute the traffic among the nodes.
    • This means adding more machines to your system to spread the workload across multiple servers. It’s often considered the most effective way to scale for large systems.
  • Scale vertically (Scale-up) by increasing the capacity of individual resources (e.g., CPU, memory, storage).
    • This means adding more power to your existing machines by upgrading server with more RAM, faster CPUs, or additional storage.
    • It's a good approach for simpler architectures, but there are limits to how far you can scale up, and it risks creating a single point of failure.
  • Load Balancing: Load balancing is the process of distributing traffic across multiple servers to ensure no single server becomes overwhelmed.
    • Google employs load balancing extensively across its global infrastructure to distribute search queries and traffic evenly across its massive server farms.
  • Implement caching to reduce the load on backend systems and improve response times.
    • Caching is a technique to store frequently accessed data in memory (like RAM) to reduce the load on the server or database. Implementing caching can dramatically improve response times (see the cache-aside sketch after this list).
      • Reddit uses caching to store frequently accessed content like hot posts and comments so that they can be served quickly without querying the database each time.
    • Consider using caching when all three of these are true:
      • Computing the result is costly
      • Once computed, the result tends to not change very often (or at all)
      • The objects we are caching are read often
  • Select efficient data structures and algorithms for critical operations.
  • Optimize database queries and indexes.
  • Denormalize data when necessary to reduce join operations.
  • Use database partitioning and sharding for improved query performance.
    • Partitioning means splitting data or functionality across multiple nodes/servers to distribute workload and avoid bottlenecks.
  • Implement content delivery networks (CDNs) to serve static assets from geographically distributed servers.
    • CDN distributes static assets (images, videos, etc.) closer to users. This can reduce latency and result in faster load times.
      • Cloudflare provides CDN services, speeding up website access for users worldwide by caching content in servers located close to users.
  • Utilize asynchronous programming models to handle concurrent requests efficiently.
    • Asynchronous communication means deferring long-running or non-critical tasks to background queues or message brokers.
    • This ensures your main application remains responsive to users.
      • Slack uses asynchronous communication for messaging. When a message is sent, the sender’s interface doesn’t freeze; it continues to be responsive while the message is processed and delivered in the background.
  • Microservices Architecture
    • Microservices architecture breaks the application down into smaller, independent services that can be scaled independently.
    • This improves resilience and allows teams to work on specific components in parallel.
      • Uber has evolved its architecture into microservices to handle different functions like billing, notifications, and ride matching independently, allowing for efficient scaling and rapid development.
  • Auto Scaling:
    • Automatically adjust the number of active servers based on the current load. This ensures that the system can handle spikes in traffic without manual intervention.
      • AWS Auto Scaling monitors applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.
  • Multi-region Deployment
    • Deploy the application in multiple data centers or cloud regions to reduce latency and improve redundancy.
      • Spotify uses multi-region deployments to ensure their music streaming service remains highly available and responsive to users all over the world, regardless of where they are located.
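A minimal cache-aside sketch in Go for the caching bullet above, using an in-process map as a stand-in for Redis or Memcached; a real deployment would also need TTLs and an invalidation strategy:

package main

import (
	"fmt"
	"sync"
)

type PostStore struct {
	mu    sync.RWMutex
	cache map[int]string // stand-in for Redis/Memcached
}

// loadFromDB simulates the expensive database query we want to avoid repeating.
func loadFromDB(id int) string {
	return fmt.Sprintf("post-%d body", id)
}

// GetPost is cache-aside: check the cache first, fall back to the DB on a miss,
// then populate the cache so the next read is served from memory.
func (s *PostStore) GetPost(id int) string {
	s.mu.RLock()
	if v, ok := s.cache[id]; ok {
		s.mu.RUnlock()
		return v // cache hit
	}
	s.mu.RUnlock()

	v := loadFromDB(id) // cache miss: hit the database
	s.mu.Lock()
	s.cache[id] = v
	s.mu.Unlock()
	return v
}

func main() {
	s := &PostStore{cache: make(map[int]string)}
	fmt.Println(s.GetPost(42)) // miss, loads from DB and caches
	fmt.Println(s.GetPost(42)) // hit, served from cache
}

This fits the three criteria in the caching bullet: the load is costly, the result rarely changes, and the object is read often.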
31
Q

Addressing Reliability

A

Reliability refers to a system’s ability to function correctly and consistently, even in the presence of failures or errors.

Here are some key considerations for making our system more reliable:

  • Analyze the system architecture and identify potential single points of failure.
  • Design redundancy into the system components (multiple load balancers, database replicas) to eliminate single points of failure.
  • Consider geographical redundancy to protect against regional failures or disasters.
  • Implement data replication strategies to ensure data availability and durability.
  • Implement circuit breaker patterns to prevent cascading failures and protect the system from overload. (A design pattern used in modern software development, applied to detect failures and encapsulate the logic of preventing a failure from constantly recurring.)
  • Implement retry mechanisms with exponential backoff to handle temporary failures and prevent overwhelming the system during recovery (see the sketch after this list).
  • Implement comprehensive monitoring and alerting systems to detect failures, performance issues, and anomalies.
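A minimal sketch of retry with exponential backoff (plus jitter, so synchronized clients don't retry in lockstep); the attempt count and base delay are arbitrary assumptions:

package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retry runs op up to maxAttempts times, doubling the delay after each failure
// and adding random jitter to spread retries out across clients.
func retry(maxAttempts int, base time.Duration, op func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		delay := base << attempt // exponential: base, 2x, 4x, ...
		jitter := time.Duration(rand.Int63n(int64(base)))
		time.Sleep(delay + jitter)
	}
	return fmt.Errorf("all %d attempts failed: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := retry(5, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("temporary failure")
		}
		return nil
	})
	fmt.Println(calls, err) // succeeds on the third attempt
}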
32
Q

Scalability

A

As a system grows, the performance starts to DEGRADE unless we adapt it to deal with that growth.

Scalability is the property of a system to handle a growing amount of load by ADDING RESOURCES to the system.

A system that can continuously evolve to support a growing amount of work is scalable.

33
Q

How can a system grow?

A
  1. Growth in User Base
    More users start using the system, leading to an increased number of requests.
    * Example: A social media platform experiencing a surge in new users.
  2. Growth in Features
    More features are introduced to expand the system's capabilities.
    * Example: An e-commerce website adding support for a new payment method.
  3. Growth in Data Volume
    The amount of data the system stores and manages grows due to user activity or logging.
    * Example: A video streaming platform like YouTube storing more video content over time.
  4. Growth in Complexity
    The system's architecture evolves to accommodate new features, scale, or integrations, resulting in additional components and dependencies.
    * Example: A system that started as a simple application is broken into smaller, independent systems.
  5. Growth in Geographic Reach
    The system is expanded to serve users in new regions or countries.
    * Example: An e-commerce company launching websites and distribution in new international markets.
34
Q

Common Components in System Design

A

5.1: Payment Service
Payment services handle transactions between customers and businesses. Integrating a reliable payment service is crucial for e-commerce and subscription-based platforms. Popular payment service providers include Stripe, PayPal, and Square. These services usually provide APIs to facilitate secure transactions and manage recurring payments, refunds, etc.

5.2: Analytic Service
Analytic services enable data collection, processing, and visualization to help businesses make informed decisions. These services can track user behaviour, monitor system performance, and analyze trends. Standard analytic service providers include Google Analytics, Mixpanel, and Amplitude. Integrating analytic services into a system can help businesses optimize their offerings and improve the user experience.

5.3: Notification
Notification services keep users informed about updates, alerts, and important information. These services can deliver notifications through various channels, such as email, SMS, and push notifications. Examples of notification service providers include Firebase Cloud Messaging (FCM), Amazon Simple Notification Service (SNS), and Twilio.

5.4: Search
Integrating a powerful search component is essential for systems with large amounts of data or content. A search service should provide fast, relevant, and scalable search capabilities. Elasticsearch, Apache Solr, and Amazon CloudSearch are popular choices for implementing search functionality. These services typically support full-text search, faceted search, and filtering, enabling users to find the information they’re looking for quickly and efficiently.

5.5: Recommendation Service
Recommendation services use algorithms to provide personalized suggestions to users based on their preferences, behaviour, and other factors. These services can significantly improve user engagement and satisfaction. Techniques for generating recommendations include collaborative filtering, content-based filtering, and hybrid approaches. Machine learning algorithms, such as matrix factorization and deep learning, can also be used to generate more sophisticated recommendations.

35
Q

Microservice Advantages

A
  • Scalability: Individual microservices can be scaled independently based on demand, optimizing resource usage.
  • Flexibility: Different microservices can be developed, tested, deployed, and maintained using different technologies.
  • Faster Development: Smaller, focused teams can work on separate microservices concurrently, speeding up development cycles and release times.
  • Resilience: Failures in one microservice are isolated and less likely to affect the entire system, improving overall reliability.
  • Easier Maintenance: Smaller codebases for each microservice are easier to understand, modify, and debug, reducing technical debt.
  • Flexible to outsourcing: Intellectual property protection can be a concern when outsourcing business functions to third-party partners. A microservices architecture can help by isolating partner-specific components, ensuring the core services remain secure and unaffected.
36
Q

Microservice Challenges

A
  • Complexity: Developing and maintaining a microservices-based application typically demands more effort than a monolithic approach. Each service requires its own codebase, testing, deployment pipeline, and documentation.
  • Inter-Service Communication: Microservices rely on network communication, which can introduce latency, failures, and complexities in handling inter-service communication.
  • Data Management: Distributed data management can be challenging, as each microservice may have its own database, leading to issues with consistency, data synchronization, and transactions.
  • Deployment Overhead: Managing the deployment, versioning, and scaling of multiple microservices can require sophisticated orchestration and automation tools like Kubernetes.
  • Security: Each microservice can introduce new potential vulnerabilities, increasing the attack surface and requiring careful attention to security practices.
37
Q

Microservice Patterns

A
  • Database Per Service Pattern
  • API Gateway Pattern
  • Backend For Frontend Pattern
  • Command Query Responsibility Segregation (CQRS)
  • Event Sourcing Pattern
  • Saga Pattern
  • Sidecar Pattern
  • Circuit Breaker Pattern
  • Anti-Corruption Layer
  • Aggregator Pattern

https://medium.com/@sylvain.tiset/top-10-microservices-design-patterns-you-should-know-1bac6a7d6218

38
Q

Database Per Service Microservice Pattern

A

The Database per Service pattern is a design approach in microservices architecture where each microservice has its own dedicated database, accessible only through that microservice's API. The service's database is effectively part of that service's implementation; it cannot be accessed directly by other services.

If a relational database is chosen, there are three ways to keep a service's data private from other services:

  • Private tables per service: Each service owns a set of tables that must only be accessed by that service.
  • Schema per service: Each service has a database schema that's private to that service.
  • Database server per service: Each service has its own database server.

Here are the main benefits of using this pattern:

  • Loose Coupling: Services are less dependent on each other, making the system more modular.
  • Technology Flexibility: Teams can choose the best database technology, and an appropriately sized database, for each microservice's specific requirements.

A design pattern always comes with trade-offs; here are some challenges that this pattern does not solve:

  • Complexity: Managing multiple databases, including backup, recovery, and scaling, adds complexity to the system.
  • Cross-Service Queries: Queries over data spread across multiple databases are hard to implement. The API Gateway or Aggregator pattern can be used to tackle this issue.
  • Data Consistency: Maintaining consistency across different services’ databases requires careful design and often involves other patterns like Event sourcing or Saga pattern.
39
Q

API Gateway Microservice Pattern

A

The API Gateway pattern is a design approach in microservices architecture where a single entry point (the API gateway) handles all client requests. The API gateway acts as an intermediary between clients and the microservices, routing requests to the appropriate service, aggregating responses, and often managing cross-cutting concerns like authentication, load balancing, logging, and rate limiting.

Here are the main advantages of using an API Gateway in a microservice architecture:

  • Simplified Client Interaction: Clients interact with a single, unified API instead of dealing directly with multiple microservices.
  • Centralized Management: Cross-cutting concerns are handled in one place, reducing duplication of code across services.
  • Improved Security: The API gateway can enforce security policies and access controls, protecting the underlying microservices.

Here are the main drawbacks:

  • Single Point of Failure: If the API gateway fails, the entire system could become inaccessible, so it must be highly available and resilient.
  • Performance Overhead: The gateway can introduce latency and become a bottleneck if not properly optimized when scaling.
40
Q

Backend For Frontend Microservice Pattern

A

The Backend for Frontend (BFF) pattern is a design approach where a dedicated backend service is created for each specific frontend or client application, such as a web app, mobile app, or desktop app. Each BFF is designed to respond to the specific needs of its corresponding frontend, handling data aggregation, transformation, and communication with underlying microservices or APIs. The BFF pattern is best used in situations where there are multiple front-end applications that have different requirements.

Here are the benefits of such a pattern:

  • Optimized Communication with Frontends: Frontends get precisely what they need, leading to faster load times and a better user experience.
  • Reduced Complexity for Frontends: The frontend is simplified as the BFF handles complex data aggregation, transformation, and business logic.
  • Independent Evolution: Each frontend and its corresponding BFF can evolve independently, allowing for more flexibility in development.

However, this pattern comes with these drawbacks:

  • Complexity: Maintaining separate BFFs for different frontends adds to the development and maintenance complexity.
  • Potential Duplication: Common functionality across BFFs might lead to code duplication if not managed properly.
  • Consistency: Ensuring consistent behavior across different BFFs can be challenging, especially in large systems.
41
Q

Command Query Responsibility Segregation (CQRS) Microservice Pattern

A

The CQRS pattern is a design approach where the responsibilities of reading data (queries) and writing data (commands) are separated into different models or services. The separation of concerns enables each model to be tailored to its specific function:

  • Command Model: Can be optimized for handling complex business logic and state changes.
  • Query Model: Can be optimized for efficient data retrieval and presentation, often using denormalized views or caches.

Communication between the read and write services can be handled in several ways, such as message queues or the Event Sourcing pattern described below.
Here are the main benefits of the CQRS pattern:

  • Performance Optimization: Each model can be optimized for its specific operations, enhancing overall system performance.
  • Scalability: Read and write operations can be scaled independently, improving resource utilization.
  • Maintainability: By separating command and query responsibilities, the codebase becomes more organized and easier to understand and modify.

Here are the challenges with this pattern:

  • Complexity: The need to manage and synchronize separate models for commands and queries adds complexity to the system.
  • Data Consistency: Ensuring consistency between the command and query models, especially in distributed systems where data updates may not be immediately propagated, can be challenging.
  • Data Synchronization: Synchronizing the read and write models can be challenging, particularly with large volumes of data or complex transformations. Techniques such as event sourcing or message queues can assist in managing this complexity.
42
Q

Event Sourcing Microservice Pattern

A

The Event Sourcing pattern captures state changes as a sequence of events stored in an event store, instead of directly saving the current state. The event store acts like a message broker, allowing services to subscribe to events via an API. When a service records an event, it is sent to all interested subscribers. To reconstruct the current state, the events in the event store are replayed in sequence. Replay can be optimized with snapshots, so that only the events after the latest snapshot need to be replayed.

Here are the main benefits of event sourcing pattern:

  • Audit Trail: Provides a complete history of changes, which is useful for auditing, debugging, and understanding how the system evolved over time.
  • Scalability: By storing only events, write operations can be easily scaled. This allows the system to handle a high volume of writes across multiple consumers without performance concerns.
  • Evolvability: New functionality can be added easily by introducing new event types, as the business logic for processing events is separated from the event storage.

It comes with these drawbacks:

  • Complexity: Managing event streams and reconstructing state can be more complex than a traditional approach, and there is a learning curve to master this practice.
  • Higher storage requirements: Event Sourcing usually demands more storage than traditional methods, as all events must be stored and retained for historical purposes.
  • Complex querying: Querying event data can be more challenging than with traditional databases because the current state must be reconstructed from events.
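A toy event-sourcing sketch in Go: state is never stored directly, only events, and the current state is rebuilt by replaying them. The account/balance example and event names are illustrative assumptions:

package main

import "fmt"

// Event is an immutable fact appended to the store; state is derived from events.
type Event struct {
	Type   string // "Deposited" or "Withdrawn" (illustrative event types)
	Amount int
}

type EventStore struct{ events []Event }

func (s *EventStore) Append(e Event) { s.events = append(s.events, e) }

// Replay reconstructs the current state by folding over all events in order.
// A snapshot would let the fold start from a saved intermediate state instead.
func (s *EventStore) Replay() int {
	balance := 0
	for _, e := range s.events {
		switch e.Type {
		case "Deposited":
			balance += e.Amount
		case "Withdrawn":
			balance -= e.Amount
		}
	}
	return balance
}

func main() {
	store := &EventStore{}
	store.Append(Event{"Deposited", 100})
	store.Append(Event{"Withdrawn", 30})
	fmt.Println("balance:", store.Replay()) // 70, derived purely from the event log
}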
43
Q

Saga Microservice Pattern

A

The Saga Pattern is used in distributed systems to manage long-running business transactions across multiple microservices or databases. It does this by breaking the transaction into a sequence of local transactions, each updating the database and triggering the next step via an event. If a transaction fails, the saga runs compensating transactions to undo the changes made by previous steps.

Sagas can be coordinated in two ways:

  • Choreography: Each service listens to events and triggers the next step in the saga. This is a decentralized approach where services communicate directly with each other.
  • Orchestration: A central orchestrator service directs the saga, telling each service when to perform its transaction and managing the flow of the entire process.

Here are the main benefits of the saga pattern:

  • Eventual data consistency: It enables an application to maintain data consistency across multiple services.
  • Improved Resilience: By breaking down transactions into smaller, independent steps with compensating actions, the Saga Pattern enhances the system’s ability to handle failures without losing data consistency.

It comes with its drawbacks:

  • Complexity: Implementing the Saga Pattern can add complexity, especially in managing compensating transactions and ensuring all steps are correctly coordinated.
  • Lack of automatic rollback: Unlike ACID transactions, sagas do not have automatic rollback, so developers must design compensating transactions to explicitly undo changes made earlier in the saga.
  • Lack of isolation: The absence of isolation (the “I” in ACID) in sagas increases the risk of data anomalies during concurrent saga execution.
44
Q

Sidecar Microservice Pattern

A

The Sidecar Pattern involves deploying an auxiliary service (the sidecar) alongside a primary application service within the same environment, such as a container or pod. The sidecar handles supporting tasks like logging, monitoring, or security, enhancing the primary service's functionality without modifying its code. This pattern promotes modularity and scalability by offloading non-core responsibilities to the sidecar, allowing the primary service to focus on its main functionality.

Before going into pros and cons of this pattern, let’s see some use cases of the pattern:

  • Logging and Monitoring: A sidecar can collect logs or metrics from the primary service and forward them to centralized systems for analysis.
  • Security: Sidecars can manage security functions like authentication, authorization, and encryption. Offloading these responsibilities to the sidecar allows the core service to concentrate on its business logic.

Here are the main advantages of this pattern:

  • Modularity and Extensibility: The Sidecar pattern allows developers to easily add or remove functionalities by attaching or detaching sidecar containers, enhancing code reuse and system maintainability without affecting the core service.
  • Isolation of Concerns: The sidecar operates separately from the core service, isolating auxiliary functions and minimizing the impact of sidecar failures.
  • Scalability: By decoupling the core service from the sidecar, each component can scale independently based on its specific needs, ensuring that scaling the core service or sidecar does not affect the other.

Here are the main disadvantages:

  • Increased Complexity: Adds a layer of complexity, requiring management and coordination of multiple containers, which can increase deployment and operational overhead.
  • Potential Single Point of Failure: The sidecar container can become a single point of failure, necessitating resilience mechanisms like redundancy and health checks.
  • Latency: Introduces additional communication overhead, which can affect performance, especially in latency-sensitive applications.
  • Synchronization and Coordination: Ensuring proper synchronization between the primary service and the sidecar can be challenging, particularly in dynamic environments.
45
Q

Circuit Breaker Microservice Pattern

A

The Circuit Breaker Pattern is a design approach used to enhance the resilience and stability of distributed systems by preventing cascading failures. It functions like an electrical circuit breaker: when a service encounters a threshold of consecutive failures, the circuit breaker trips, stopping all requests to the failing service for a timeout period. During this timeout, the system can recover without further strain. After the timeout, the circuit breaker allows a limited number of test requests to check if the service has recovered. If successful, normal operations resume; if not, the timeout resets. This pattern helps manage service availability, prevent system overload, and ensure graceful degradation in microservices environments.

The Circuit Breaker pattern typically operates in three main states: Closed, Open, and Half-Open. Each state represents a different phase in the management of interactions between services. Here’s an explanation for each state:

  • Closed: The circuit breaker allows requests to pass through to the service. It monitors the responses and failures. If failures exceed a predefined threshold, the circuit breaker transitions to the “Open” state.
  • Open: The circuit breaker prevents any requests from reaching the failing service, redirecting them to a fallback mechanism or returning an error. This state allows the service time to recover from its issues.
  • Half-Open: After a predefined recovery period, the circuit breaker transitions to the “Half-Open” state, where it allows a limited number of requests to test if the service has recovered. If these requests succeed, the circuit breaker returns to the “Closed” state; otherwise, it goes back to “Open.”
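A minimal sketch of these three states in Go; the failure threshold and timeout are arbitrary, and production systems usually reach for an existing library (e.g., sony/gobreaker) rather than hand-rolling this:

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

// ErrOpen is returned while the breaker is open and rejecting calls.
var ErrOpen = errors.New("circuit open: request rejected")

type CircuitBreaker struct {
	mu        sync.Mutex
	st        state
	failures  int           // consecutive failures seen
	threshold int           // failures that trip the breaker
	timeout   time.Duration // how long to stay open before a trial call
	openedAt  time.Time
}

func (cb *CircuitBreaker) Call(op func() error) error {
	cb.mu.Lock()
	if cb.st == open {
		if time.Since(cb.openedAt) < cb.timeout {
			cb.mu.Unlock()
			return ErrOpen // still open: fail fast
		}
		cb.st = halfOpen // timeout elapsed: let a trial request through
	}
	cb.mu.Unlock()

	err := op()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.st == halfOpen || cb.failures >= cb.threshold {
			cb.st = open // trip (or re-trip) the breaker
			cb.openedAt = time.Now()
		}
		return err
	}
	cb.failures = 0
	cb.st = closed // success closes the breaker again
	return nil
}

func main() {
	cb := &CircuitBreaker{threshold: 3, timeout: 2 * time.Second}
	for i := 0; i < 5; i++ {
		// After 3 consecutive failures the breaker opens and rejects calls fast.
		fmt.Println(cb.Call(func() error { return errors.New("downstream failure") }))
	}
}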

Here are the main benefits of this pattern:

  • Prevents Cascading Failures: By halting requests to a failing service, the pattern prevents the failure from affecting other parts of the system.
  • Improves System Resilience: Provides a mechanism for systems to handle failures gracefully and recover from issues without complete outages.
  • Enhances Reliability: Helps maintain system reliability and user experience by managing and isolating faults.

Here are the main challenges that come with this pattern:

  • Configuration Complexity: Setting appropriate thresholds and recovery periods requires careful tuning based on the system’s behavior and requirements.
  • Fallback Management: Ensuring effective fallback mechanisms that provide meaningful responses or handle requests appropriately is crucial.

Note that other design patterns exist to reduce the damage done by failures, such as the Bulkhead pattern, which isolates different parts of a system into separate pools to prevent a failure in one part from impacting others.

46
Q

Anti-Corruption Layer Microservice Pattern

A

The Anti-Corruption Layer (ACL) Pattern is a design pattern used to prevent the influence of external systems’ design and data models from corrupting the internal design and data models of a system. It acts as a barrier or translator between two systems, ensuring that the internal system remains isolated from and unaffected by the complexities or inconsistencies of external systems.

Here are the main benefits from the ACL pattern:

  • Protection: Shields the internal system from external changes and potential corruption.
  • Flexibility: Easier integration with external systems by managing differences in data models and protocols.
  • Maintainability: Simplifies modifications and updates to either the internal or external systems without affecting the other.

On the other hand, here are the main challenges of the ACL pattern:

  • Latency: Calls between the two systems add latency.
  • Scaling: Scaling the ACL alongside many microservices or monolithic applications can be a concern for the development team.
  • Added Complexity: Introduces additional complexity due to the need for translation and adaptation logic.
47
Q

Aggregator Microservice Pattern

A

The Aggregator Pattern is a design pattern used to consolidate data or responses from multiple sources into a single, unified result. An aggregator component or service manages the collection of data from different sources, coordinating the process of fetching, merging, and processing the data.

Here are the main benefits from the Aggregator pattern:

  • Simplified Client Interaction: Clients interact with one service or endpoint, reducing complexity and improving ease of use.
  • Reduced Network Calls: Aggregates data from multiple sources in one place, minimizing the number of calls or requests needed from clients and improving overall efficiency.
  • Centralized Data Processing: Handles data processing and transformation centrally, ensuring consistency and coherence across different data sources.

Here are the drawbacks of this pattern:

  • Added Complexity: Implementing the aggregation logic can be complex, especially when dealing with diverse data sources and formats.
  • Single Point of Failure: Since the aggregator serves as the central point for data collection, any issues or failures with the aggregator can impact the availability or functionality of the entire system.
  • Increased Latency: Aggregating data from multiple sources may introduce additional latency, particularly if the sources are distributed or if the aggregation involves complex processing.
  • Scalability Challenges: Scaling the aggregator to handle increasing amounts of data or requests can be challenging, requiring careful design to manage load and ensure responsiveness.
48
Q

12 fundamental (technical) system design concepts

A
  1. APIs
  2. Databases (SQL vs NoSQL)
  3. Scaling
  4. CAP theorem
  5. Web authentication and basic security
  6. Load balancers
  7. Caching
  8. Message queues
  9. Indexing
  10. Failovers
  11. Replication
  12. Consistent hashing
49
Q

Representational State Transfer (REST) Strengths vs Weaknesses

A

Strengths: This approach creates structured ways of getting and modifying information from your database. It is the most universally used and works for most circumstances. This method also tends to have tooling that supports generation of documentation that can make it easier for developers to understand, especially for external services accessing the API through network calls.

Weaknesses: It requires you to write requests for each type of entity in your database, in contrast to GraphQL, where the caller can grab all the data it needs in a single query. It also isn't as space-efficient as RPC.

50
Q

Remote Procedure Call (RPC) Strengths vs Weaknesses

A

RPC is like communication in a family or with close friends. When you are with family and you notice your favorite snacks in the fridge, you can usually skip a lot of communication and make assumptions that you can eat some without asking. Since you have close and frequent communication, you make certain processes more efficient.

RPC allows the execution of a procedure or command in a remote machine. In other words, you can write code that executes on another computer internally in the same way you write code that runs on the current machine. In this approach, the API is more thought of as an action or command. And it is easier to add these functions to extend the functionality.

RPC - /placeAnOrder (OrderDetails order)
REST - POST /order/orderNumber={} [Order body]

Strengths: It is more space efficient than REST, and it makes development easier since the code you write that requires communication to other computers does not require much special syntax.

Weaknesses: It can only be used for internal communication. Complications, such as timing issues, can occur when communicating between machines, and RPC makes the distinction between local and remote calls less clear, leading developers to miss corner cases that cause faults in the system.

51
Q

GraphQL Strengths vs Weaknesses

A

GraphQL can be thought of as those Amazon Go stores where you can walk in, grab what you need, and walk out. There are cameras that track what you took, and you are automatically charged for the items you left with. Items in Amazon Go stores are placed in a way that they can be easily discovered, allowing customers to decide what they need. Likewise, in GraphQL you structure the data in graph relationships and then leave it for those using your service to define what they need.

This modeling technique enables building a perfect request to fetch all the data that is needed by the client without making multiple calls.

Strengths: GraphQL works particularly well for customer-facing web and mobile applications, because once you set up the system, frontend developers can craft their own requests to get and modify information without requiring backend work to build more routes.

Weaknesses: There is initially some upfront development work required to set up this communication system, both on the frontend and backend. It is also less friendly for external users when compared with REST APIs, where documentation can be generated automatically. In addition, GraphQL is not suitable for use cases where certain data needs to be aggregated on the backend.
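To make the single-request idea concrete, here is a sketch of one query fetching nested data in a single round trip, sent from Python with only the standard library; the endpoint URL and the schema fields (user, followers, tweets) are hypothetical:

import json
import urllib.request

QUERY = """
{
  user(id: "42") {
    name
    followers { name }
    tweets(last: 10) { text likeCount }
  }
}
"""

req = urllib.request.Request(
    "https://example.com/graphql",  # hypothetical GraphQL endpoint
    data=json.dumps({"query": QUERY}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)  # user, followers, and tweets in one round trip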

52
Q

Cap Theorem

A

(C)onsistency means that every node in the network sees the same data at the same time.

(A)vailability means that every request to the system receives a response, even when some nodes are down.

(P)artition tolerance means that in case of a fault in the network or communication, the system will still work.

The theorem states that when a network partition does occur, the system must choose between consistency and availability; it cannot guarantee both.

53
Q

Authentication VS Authorization

A
  • Authentication refers to verifying the identity of our service’s users.
    • Passwords (hashed + salted)
      • Hash the password: store only the hash, never the plaintext password.
      • Salt the password: append some random, non-obvious value to the password before hashing, so that identical passwords produce different hashes. For example, instead of hashing groupPasswordRainToMainArea directly, you could salt it by adding “spider” to the password. This protects against rainbow tables: enormous lookup tables containing millions of common passwords and their variations, along with their hashes for a common hash algorithm.
  • Session Tokens: a classic, simple way to track authentication is to generate a token the user submits with subsequent requests to prove that they are, in fact, signed in. The session token is equivalent to a password, so it should come with an expiration date, as short as feasible.
  • JSON Web Tokens: rather than plain session tokens, you may also opt to use JSON Web Tokens, or JWTs. While a session token is an opaque string that means nothing without access to the session database, a JWT explicitly encodes the user’s access.
    1. Sign the payload
      * Attaching a signature from a private key held only by your service verifies the token’s legitimacy. You can use it on the client side to tell who’s logged in and, optionally, what permissions they have. The other advantage, and perhaps the strongest selling point for JWTs as a whole, is that other services can verify the token as well.
    2. Encrypt the payload
      * Not as useful, since it gets closer to session tokens and loses the advantages of signing the payload.

Cookies are used to store the token on the client side. Use Set-Cookie with the Secure flag (sent only over HTTPS) and the HttpOnly flag (inaccessible to client-side scripts).

  • Authorization is the related but separate concept of determining which users have permission to take which actions.
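A minimal sketch of salting and hashing with Python’s standard library (PBKDF2 shown here; a dedicated password-hashing library such as bcrypt or argon2 is usually preferred in production):

import hashlib
import hmac
import secrets

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, hash). Store both; never store the plaintext password."""
    salt = secrets.token_bytes(16)  # random per-user salt defeats rainbow tables
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(digest, stored)  # constant-time comparison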
54
Q

Web authentication and basic security

A
  1. The user signs up. At this point, we need to salt and hash their password and store those values (but not the password itself!).
  2. The user logs in with their username and password. We verify the password by hashing it with the stored salt and checking whether it matches the stored hash (ideally using a secure library to make the comparison). We then send some kind of identifying token, either a simple session token or a JWT, back to the client in a Set-Cookie header.
  3. On subsequent requests, the browser sends the cookie back to the server, where we can verify the session token or check the signature on (or decrypt) the JWT.
  4. Periodically, the session token or JWT should expire, and a new one should be generated and sent down to the client with a Set-Cookie header.
  5. Eventually, the user’s session may expire from inactivity. In this case, we go back to step 2.
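A sketch of the token half of this flow, assuming a simple in-memory session store (a real service would persist sessions and enforce expiry):

import secrets

sessions = {}  # token -> user_id; stands in for a real session store

def issue_session(user_id: int) -> str:
    """Step 2: after the password check passes, mint a session token."""
    token = secrets.token_urlsafe(32)  # unguessable; treat it like a password
    sessions[token] = user_id
    # Returned to the browser via:
    #   Set-Cookie: session=<token>; Secure; HttpOnly; Max-Age=3600
    return token

def verify_session(token: str) -> int | None:
    """Steps 3-5: look up the cookie the browser sends back."""
    return sessions.get(token)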
55
Q

Load Balancers

A

Helps distribute traffic across the machines. As distributed systems are designed to scale, the basic requirement is to add or remove machines in case of increased load or in case of failures. Load balancers also help manage this.

  • Round Robin
  • Least Connections / Least Response Time
  • Hashing
    • The key for hashing can be a request id for a given user, or the client's IP address
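A minimal sketch of two of these strategies in Python; the server addresses are hypothetical, and a real load balancer would also track health and connection counts:

import hashlib
import itertools

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical backends

# Round robin: hand out servers in a fixed rotation.
_rotation = itertools.cycle(servers)

def round_robin() -> str:
    return next(_rotation)

# Hashing: the same key (user id, request id, or client IP) always maps
# to the same server, which helps with session affinity and cache locality.
def by_hash(key: str) -> str:
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]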
56
Q

Caching Patterns

A

Cache-aside pattern
* This is the most popular cache pattern. The application first tries to fetch data from the cache; if the data is not found (a “cache miss”), it fetches the data from the database (or performs an expensive computation), puts that data into the cache, and then returns the result to the user. (A minimal sketch follows below.)
* Advantage:
  * We only cache the data that we actually need.
* Disadvantages:
  * Data can become stale if there are lots of updates to the database (mitigate with a TTL).
  * If there are a lot of cache misses, the application has to do much more work than in the regular flow of fetching data solely from the database.
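A minimal cache-aside sketch in Python; fetch_from_db stands in for whatever expensive lookup (database query or computation) the application performs:

import time

cache = {}  # key -> (value, expires_at); stands in for Redis or Memcached

def get_user(user_id, fetch_from_db, ttl=60):
    """Cache-aside: try the cache first, fall back to the database."""
    entry = cache.get(user_id)
    if entry and entry[1] > time.time():
        return entry[0]                          # cache hit
    value = fetch_from_db(user_id)               # cache miss: expensive call
    cache[user_id] = (value, time.time() + ttl)  # TTL mitigates staleness
    return value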

Write-through and write-back patterns
* The application writes data directly to the cache, and the cache then writes the data to the database, either synchronously (“write-through”) or asynchronously (“write-back”, also called “write-behind”). In the asynchronous case, the data is put on a queue that writes it back to the database, which improves the write latency seen by the application.
* If you’re experiencing slow writes, a quick fix is asynchronous writes to the database.

  • Advantages:
    • The cache always holds the latest written data, so subsequent reads are fast, and (with write-back) writes appear fast to the client.
  • Disadvantages:
    • We write all data to the cache, even data that might never be read, so we can overload the cache (or cache memory) with expensive calls that might not even be required.
    • With write-back, if the cache fails before the data has been written to the database, the data is lost, causing inconsistency.
57
Q

Cache invalidation

A

One of the problems we have observed with caching is that data can become stale if there are lots of updates to the database. Therefore, it is important to expire or invalidate data from the cache, so your data doesn’t get stale! It’s a good practice to include a brief point about cache invalidation during system design interviews.

Policies:
* Least Recently Used (LRU)

For most systems, 20% of the data accounts for 80% of the reads. Because of this 80/20 rule, we want to give special treatment to the most popular data, and that is exactly what LRU does: it keeps the recently (and therefore frequently) used entries in the cache, resulting in fewer cache misses and lower latency for the bulk of your requests.
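A minimal LRU cache sketch using Python’s OrderedDict:

from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # drop the least recently used entry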

58
Q

Queues

A

Advantages

  • A queue buffers messages that need to be written to a database. This is useful when a traffic spike could drive the database CPU up like crazy and kill the server, taking the database down and probably losing data. Instead… throw it in the queue!
  • If a message has to be processed by some very expensive code, you may also hold messages in a queue while previous messages are being processed, so you don’t overload (and potentially kill) servers.
  • Queues can deliver messages to multiple systems, instead of the client having to send them to all the required systems.
  • Queues decouple the client from the server by eliminating the need to know the server address.

Can have these properties:
* Guaranteed delivery.
* No duplicate messages are delivered. (dedupID)
* Ensure that the order of messages is maintained. (FIFO)
* At least once delivery with idempotent consumers.
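A minimal idempotent-consumer sketch; charge_customer and the message fields are hypothetical stand-ins for real business logic:

processed = set()  # dedup ids already handled; persist this in practice

def charge_customer(message: dict) -> None:
    # Hypothetical side effect standing in for real processing.
    print("charging", message["customer_id"], message["amount"])

def handle(message: dict) -> None:
    """At-least-once delivery means redeliveries happen; checking the
    dedup id makes the consumer idempotent, so retries are harmless."""
    if message["dedup_id"] in processed:
        return  # duplicate delivery: already processed, safe to skip
    charge_customer(message)
    processed.add(message["dedup_id"])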

59
Q

Failover

A

When a leader node fails:
1. One of the follower nodes needs to be promoted to the leader.
2. Client node(s) must be reconfigured to send the write request to the new leader.
3. Other followers need to be reconfigured to consume data from the new leader.

For failover to be triggered, a prerequisite is that the leader’s failures are tracked. We can do this by periodically sending health-status pings to the nodes and treating slow or missing responses as a sign of failure.

Failover can lead to some tricky issues:
1. Failover can lose updates: with asynchronous replication, writes that the old leader acknowledged but had not yet replicated disappear when it goes down.
2. How do we detect that the leader has gone down? Deciding the threshold for marking the leader as unavailable is challenging. Sometimes traffic on the system is high and the leader simply takes longer to respond; if we bring down the leader in that scenario, the system becomes even more stressed.
3. The problem of having more than one leader (“split brain”).
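A toy failure detector illustrating why the threshold is tricky; the 5-second timeout is an arbitrary number for the sketch:

import time

TIMEOUT = 5.0  # seconds without a heartbeat before suspecting the leader
last_heartbeat = time.monotonic()

def on_heartbeat() -> None:
    """Called whenever the leader answers a health-status ping."""
    global last_heartbeat
    last_heartbeat = time.monotonic()

def leader_suspected_down() -> bool:
    # Too small a TIMEOUT triggers failovers during ordinary load spikes;
    # too large a TIMEOUT delays recovery. Choosing it is the hard part.
    return time.monotonic() - last_heartbeat > TIMEOUT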

60
Q

Replication

A

Why replication?
Replication is done to achieve one or more of the following goals:

  1. To avoid a single point of failure and increase availability when machines go down.
  2. To better serve global users by keeping copies in distinct geographic locations, so users are served from copies that are close by.
  3. To increase throughput. With more machines, more requests can be served.

Some key terms to understand for replication
* Replica: Copy of data

  • Leader: Machine that handles write requests to the data store.
  • Followers: Machines that are replicas of the leader node, and cater to read requests.
61
Q

Synchronous vs asynchronous replication

A

Synchronous replication means the leader marks a client’s write request as successful only after the write has been acknowledged by the replicas; that is, the leader waits for an acknowledgment from all of the followers.

When the leader doesn’t wait for the acknowledgment from the followers before marking the client’s write requests as successful, it is called asynchronous replication.

Synchronous replication ensures that the information is replicated before moving on. This can be nice when it is vital that nothing is missed. The downside is that it slows down the stream of information being passed.

Sync replication ensures guaranteed delivery to all the followers, while async replication is less time-consuming for the client.

Sometimes a semi-synchronous approach is taken in the database, where only one follower is updated synchronously and the rest asynchronously. If the synchronous follower crashes, one of the asynchronous followers is promoted to take its place. This ensures that an up-to-date copy exists on at least two nodes, while the client is also not kept waiting for long.

62
Q

Most common types of Replication Systems

A

Single leader
* a single machine acts as a leader, and all write requests (or updates to the data store) go through that machine. All the other machines are used to cater to the read requests. This was previously known as “master-slave” replication, but it’s currently known as “primary-standby” or “active-passive” replication.
* The leader also needs to pass down the information about all the writes to the follower nodes to keep them up to date. In case the leader goes down, one of the follower nodes (mostly with the most up-to-date data) is promoted to be the leader. This is called failover.

Multi leader
* this means that more than one machine can take the write requests. This makes the system more reliable in case a leader goes down. This also means that every machine (including leaders) needs to catch up with the writes that happen over other machines.

Conflict resolution for concurrent writes:
1. Keeping the update with the largest client timestamp (last write wins).
2. Sticky routing—writes from same client/index go to the same leader.
3. Keeping and returning all the updates.

Leaderless replication
* all machines can cater to write and read requests. In some cases, the client directly writes to all the machines, and requests are read from all the machines based on quorum. Quorum refers to the minimum number of acknowledgements (for writes) and consistent data values (for reads) for the action to be valid. In other cases, the client request reaches the coordinator that broadcasts the request to all the nodes.
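A quick sketch of the quorum condition, with n replicas, w write acknowledgements, and r read responses:

def quorum_ok(n: int, w: int, r: int) -> bool:
    """If w + r > n, every read set overlaps every write set, so a read
    always sees at least one replica holding the latest written value."""
    return w + r > n

# Example: n=3, w=2, r=2 -> True; any 2 readers overlap any 2 writers.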

63
Q

What is consistent hashing?

A

Consistent hashing is a way to effectively distribute the keys in any distributed storage system—cache, database, or otherwise—to a large number of nodes or servers while allowing us to add or remove nodes without incurring a large performance hit.
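A minimal hash-ring sketch in Python; real implementations also place several “virtual nodes” per server to even out the key distribution:

import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Keys map to the first node clockwise on the ring; adding or
    removing a node only remaps the keys in its neighborhood."""

    def __init__(self, nodes=()):
        self.ring = sorted((_hash(n), n) for n in nodes)

    def add(self, node: str) -> None:
        bisect.insort(self.ring, (_hash(node), node))

    def remove(self, node: str) -> None:
        self.ring.remove((_hash(node), node))

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self.ring, (_hash(key), "")) % len(self.ring)
        return self.ring[i][1]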

64
Q

Functional Requirements

A
  1. Identify the main objects and their relations.
  2. What information do these objects hold? Are they mutable?
  3. Think about access patterns. “Given object X, return all related objects Y.” Consider the cross product of all related objects.
  4. List all the requirements you’ve identified and validate them with your interviewer.

You should start with the functional requirements first—that is, the core product features and use cases that the system needs to support.

  1. Identify the main business objects and their relations
    * Start by identifying the main business objects and their relations. For example, in the case of Twitter there are two main objects of interest: (1) Accounts and (2) Tweets.
    * Now think about clarifying the relation between these objects.
    • An account can follow other accounts (Account x Account)
    • An account can publish a tweet (Account x Tweet)
    • A tweet can reference another tweet, i.e., be a “retweet”. (Tweet x Tweet)
  2. Think about the possible access patterns for these objects
    * Access patterns are probably the single most influential part of design because they determine how data will be stored.
    * Let’s think about the cross product of our objects again. This time we want to identify how data will be retrieved from the system.
    * The general shape of an access pattern requirement is:
    • Given [object A], get all related [object B]
  • So, applying this idea to our Twitter example, we might end up with the following access patterns:
    • Given an account:
      • Get all of its followers. (Account → Account)
      • Get all the other accounts they follow. (Account → Account)
      • Get all of its tweets. (Account → Tweet)
      • Get a curated feed of tweets for accounts they follow. (Account → Tweet)
    • Given a tweet:
      • Get all accounts that liked it. (Tweet → Account)
      • Get all accounts that retweeted it. (Tweet → Account)

For these access patterns, you should also consider ranking. Are there any access patterns that require ranking the object? In this example, “creating a curated feed of tweets” will require further clarification. Strive for simplicity first. Can you return them sorted by chronological time? Identify these access patterns of interest, like the curated feed, and get a feel for what your interviewer is looking for: do they want you to suggest an algorithm for a feed?

  3. Consider mutability
    * Finally, as we do throughout this guide, you should always consider mutability. Can the objects the system holds be mutated? Or can they be assumed to be immutable?

For example: Can tweets be edited after they’re published?

Another flavor of mutability is deletion. Can these business objects be deleted? What would the consequences be?

For example: Can tweets be deleted? Can accounts be deleted? What happens to tweets when an account is deleted?

It might sound like a small detail at first, but mutability can limit our ability to use caching in our design (more on this in step 3).

65
Q

Non-Functional Requirements

A

Once functional requirements have been laid out, you should move onto non-functional requirements (NFRs). These are quality attributes that specify how the system should perform a certain function.

The most common non-functional requirements you should consider in a system design interview are:

  1. Performance: Which access patterns, if any, require good performance?
  2. Availability: What’s the cost of downtime for this system?
  3. Security: Is there any workflow that requires special security considerations (e.g., code execution)?
  4. Consistency

Good candidates view non-functional requirements mainly as opportunities to relax one specific requirement, for example: “We don’t need to focus on consistency as much in this case, because it’s okay in this TikTok scenario if some users get access to certain videos later than the rest of our users.”

If NFRs are over-specified, the solution may be too expensive to be practical; if they are under-specified, the system will not be suitable for its intended purpose. Use your common sense, and ask the right questions to land on a set of non-functional requirements that make sense for the system you are designing.

66
Q

NFR Performance

A

Performance is pretty straightforward. It’s the system’s ability to respond quickly to user requests. While speed is always welcome, it might not be the right thing to optimize for in every system. Better performance may come at the cost of consistency or just an overall more complex solution.

It makes the most sense when we have synchronous user-facing workflows. That is, the user is expecting an immediate response from the system. In addition, we want to optimize for the synchronous workflows that are accessed the most frequently.

67
Q

NFR Availability

A

Availability refers to how much downtime the service can tolerate. Just like with performance, we might not always want to optimize for availability. A good question to guide this decision is: What’s the cost of downtime? This is as easy as it sounds. If taking downtime will result in financial losses or correctness issues, we might want to put some thought into making the system highly available.

Think, for example, about a banking system. One of the most important mandates of such a system is consistency: operations need to be transactional. In this case, it might be acceptable for our system to be unavailable for small periods of time, as long as it stays consistent.

68
Q

NFR Security

A

We want to learn if there’s some workflow that might require a special design to account for security. For example, imagine you were designing LeetCode, an online judge for coding questions. One security constraint that would come to mind is that user-submitted code should be run in isolation. User submissions should run in some sort of sandbox where they get limited resources and are guaranteed not to affect or see other submissions.

Whenever there is user-generated code execution involved (aka low trust code), running it in isolation should be a non-functional security requirement.

69
Q

So what is “design”?

A

Design simply means two components:

  1. Data storage. We already know from previous steps “what” we are storing. Now the question is: where are we storing it?
  2. Microservices. How do we write our data to storage, and how do we retrieve it for the API? Think of these as the middlemen between storage and the API.

We know the what (Functional / Non-functional, API, Scale, data types), so now we focus on the where and the how. We will start with designing the data storage layer first and then think about the microservices that access this data.

70
Q

Data Types, Scale, and Access patterns

A

Once you know your requirements, it’s time to get specific.

  1. Data Types: Start by identifying the main business objects that you need to store.
  2. API: How are these going to be accessed?
  3. Scale: Is the system read-heavy or write-heavy?
71
Q

Data Storage

A

A blob (Binary Large Object) is basically just binary data. We store and retrieve these as a single item. For example, ZIP files or other binaries.

Say the generic name of the component, not the brand name. Unless you are very familiar with a specific brand (like S3), don’t say the specific brand. Instead, say “some kind of blob storage.” Because if you say, “we should use S3 here,” the next question out of your interviewer’s mouth will be, “why not Azure blob instead of S3?”

Database
There are a few considerations for this step:

  1. Relational vs. Non-Relational
  2. Entities to store
72
Q

Relational Vs Non-Relational Database Rule of Thumb

A

You need to store some data.
Is it important for your data to have structured relationships?
* Yes: SQL
* No:
  * Do you need strong consistency, with strong ACID guarantees?
    * Yes: SQL
    * No: NoSQL

If you picked relational:
”Although I think a relational database better fits this requirement, we should also be mindful of the downsides. For example, our database will have a more rigid structure and schema, so it might be harder for us to incorporate changes. We’ll also need to scale up vertically, meaning that as we get more load we’ll upscale existing servers rather than dividing the work over more servers.”

If you picked non-relational:
”Although I think a non-relational database better fits this requirement, we should also be mindful of the downsides. We’ll be able to scale horizontally at the cost of not having ACID guarantees. I’m assuming there will be no need for strong consistency in the future.”

Example: Design a banking system.
  This is a textbook example of strong consistency. Transactions in a banking system need ACID guarantees. As such, we are probably better off picking a relational database that can give us this strong consistency.

Be mindful of any “get all” access patterns. These usually need to be guarded by paging. You don’t want a single endpoint returning the entire tweet history of an account. Depending on the account, that might be a very expensive query, and degrade user experience. Usually these will be behind logic that pages the response. That’s why Twitter will load pages of tweets, even if it seems like an “infinite scroll” in the UI.

73
Q

How to Get Yourself Unstuck

A
  1. To simplify the problem, think about the data in the problem as immutable. You can choose to add mutability back in later after you have an initial working design.
  2. Not sure if you got all the requirements? Think you’re missing something important, but you don’t know what? Turn your requirements gathering into a conversation and get the interviewer involved. Ask your interviewer: “Are there any important requirements you have in mind that I’ve overlooked?” This is totally allowed!
  3. Some interviewers hate the capacity-estimation step and really don’t want to see you stumble through 5th-grade math calculations for 15 minutes. Similar to Tip #2, ask your interviewer if they’d like to see some calculations before jumping in and starting them; you might be able to skip these entirely if the interviewer doesn’t care! In quite a few system design interviews, you’d be fine as long as you mention that the system you plan to present will be durable, resilient, and scalable.
  4. Sometimes it’s difficult to know what to calculate during this part of the interview. If you’ve already confirmed that the interviewer wants to see calculations as mentioned in Tip #3, then follow these rough guides to get the basic estimates for any system.
  • Storage Estimation:
    • Storage = daily data used by 1 user * DAU count * length of time to store data
  • Bandwidth Estimation:
    • Bandwidth per second = (daily data used by 1 user * DAU count ) / total seconds in a day

Also, there are roughly 100K seconds in a day, which is five orders of magnitude. If your API gateway expects to see a billion requests on a busy day, that’s approximately 10K requests per second, as 9 zeroes minus 5 zeroes is 4 zeroes. The true figure is ~15% larger, as 100K / (60 * 60 * 24) is around 1.15.
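Putting the two formulas and the 100K-seconds shortcut together, here is a quick worked example; the user count, per-user data, and retention period are made-up numbers purely for illustration:

dau = 10_000_000          # hypothetical daily active users
data_per_user = 50_000    # hypothetical bytes written per user per day (~50 KB)
retention_days = 5 * 365  # hypothetical: keep data for 5 years

storage = data_per_user * dau * retention_days  # ~0.9 PB total
bandwidth = data_per_user * dau / 100_000       # ~100K seconds/day -> ~5 MB/s

print(f"storage ~{storage / 1e15:.1f} PB, write bandwidth ~{bandwidth / 1e6:.1f} MB/s")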

  5. Consistency is important because the order in which distributed messages are delivered matters. If you send your friend a diatribe about how Vim is superior to Emacs (why are we even still arguing this? 😩), and your long-winded chat messages are received out of order, then your friend will be totally lost and won’t comprehend your persuasive brilliance.
  6. Ask yourself, “Would it be fine if the data in my system was occasionally wrong for a split second or so?” If the answer is yes, then you probably want eventual consistency. If the answer is no, then you’re looking for a strong type of consistency called linearizability.
  7. Don’t confuse consistency terms. Besides being common terms talked about in system design, what do all three of these terms have in common?
    * ACID
    * CAP Theorem
    * BASE
    All three terms talk about consistency! The ‘C’ in both ACID and CAP stands for Consistency, and the ‘E’ in BASE stands for Eventual Consistency.

What makes matters worse is that the term means something different in each context. Be sure to separate these ideas in your head before you talk about consistency in an interview!

  • ACID consistency discusses transaction guarantees within the context of database constraints.
  • BASE eventual consistency discusses guarantees around objects being updated and what will be returned by all nodes when the same info is queried after an update.
  • CAP consistency is about the tradeoffs in distributed systems between partitioned networks, nodes always being up to date with the latest values, and the system always being available.
  8. It’s common to try to detail every part of the system’s design like you see people do on YouTube. Realistically, those videos are scripted, and the drawings are fast-forwarded. In a real interview, you won’t have time to detail every part of the system, and that’s OK! It’s expected that you’ll abstract away pieces that aren’t particularly relevant. It’s good practice to call out what you’re abstracting, but focus on the general data flow of the system.
  9. As usual, we begin from requirements. In fact, it’s best to postulate the problem right in the form of requirements! This way we also develop a habit of approaching system design problems from the standpoint of how to solve them, since asking the right questions is at least half of solving them.

Don’t worry if you don’t know in detail what the Ticketmaster problem is about. In fact, for any problem, if you don’t fully understand its statement, jump straight to functional requirements, and clarify them—with your interviewer or with your peers—until they are crystal clear!

  10. The best way to reason about the value of consistency is to think of what could possibly go wrong.