SD Flashcards

1
Q

Steps to System Design

A
  1. Clarify Requirements
  2. Capacity Estimation
  3. High Level Design
  4. Database Design
  5. API Design
  6. Dive into Key Components
  7. Address Key Issues: Scalability, Reliability

or

  1. Requirements clarifications: Always ask questions to find the exact scope of the problem you are solving.
  2. Back-of-the-envelope estimation: It’s always a good idea to estimate the scale of the system you are going to design. This will also help later, when you will be focusing on scaling, partitioning, load balancing, caching, etc.
  3. System interface definition: Define what APIs are expected from the system. This will not only establish the exact contract expected from the system but also ensure that you have not gotten any requirements wrong.
  4. Define data model: Defining the system data model early on will clarify how data will flow among different components of the system and later will also guide towards the data partitioning and management.
  5. High-level design: Draw a block diagram with 5–6 boxes representing the core components of your system. You should identify enough components that are needed to solve the actual problem from end to end.
  6. Detailed design: Dig deeper into 2–3 components; the interviewer's feedback should guide you towards which parts of the system she wants you to explain further. You should be able to present different options, their pros and cons, and why you chose them.
  7. Identifying and resolving bottlenecks: Try to discuss as many bottlenecks (and different approaches to mitigate them) as possible.
2
Q

Clarify Requirements

A

Understand the problem, clarify any ambiguities, and gather as much information as possible about the system.

Two types of requirements to clarify:
* Functional
* Non-functional

Understanding the scope early prevents you from heading in the wrong direction.

3
Q

Functional Requirements Questions

A

What are the core features that the system should support?

Are there any particular features that are more critical than others?

Who will use this system (customers, internal teams, etc.)?

What specific actions should users be able to perform on the system?

How will users interact with the system (web, mobile app, API, etc.)?

Does the system need to support multiple languages or locales?

What are the key data types the system must handle (text, images, structured data, etc.)? This can influence your database choices.

Are there any external systems or third-party services the system needs to integrate with?

4
Q

Non-Functional Requirements Questions

A

What is the expected scale of the system in terms of users and requests?

How much data volume is expected to be handled by the system?

What are the inputs and outputs of the system?

What is the expected read-to-write ratio?

Can the system have some downtime, or does it need to be highly available?

Are there any specific latency requirements?

How critical is data consistency? Can some eventual consistency be tolerated for the sake of availability?

Are there any specific non-functional requirements (performance, scalability, reliability) we should focus on?

5
Q

Capacity Estimation

A

After clarifying the requirements, you can do some calculations to estimate the capacity of the system you are going to design.

Note: Not every system design interview will require detailed capacity estimates. It’s always a good idea to check with your interviewer if it’s necessary.

That said, it’s usually helpful to at least get a rough idea of the number of requests and storage requirements.

Estimating the scale upfront helps guide your design decisions and ensures that the system can meet the desired criteria.

This can include things like expected daily/monthly users, read/write requests per second, data storage and network bandwidth needs.

USERS: Estimate the number of daily users and maximum concurrent users during peak hours.

TRAFFIC: Calculate expected read/write per second. Consider peak traffic periods and potential spikes in usage.

STORAGE: Consider the different types of data (structured, unstructured, multimedia) and estimate the total amount of storage required (and its growth rate).

MEMORY: Evaluate the potential benefits of caching to reduce latency and improve performance. Estimate how much memory you might need to store frequently accessed data.

NETWORK: Estimate bandwidth requirements based on the estimated traffic volume and data transfer sizes.

It is always a good idea to estimate the scale of the system we're going to design. This will also help later, when we focus on scaling, partitioning, load balancing, and caching.

  1. What scale is expected from the system (e.g., number of new tweets, number of tweet views, number of timeline generations per sec., etc.)?
  2. How much storage will we need? We will have different storage requirements if users can have photos and videos in their tweets.
  3. What network bandwidth usage are we expecting? This will be crucial in deciding how we will manage traffic and balance load between servers.
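To make this concrete, here is a minimal back-of-the-envelope sketch in Go for a Twitter-like service; every input number (tweets/day, read ratio, tweet size, peak multiplier) is an illustrative assumption, not a real figure:

package main

import "fmt"

// Illustrative back-of-the-envelope estimates for a Twitter-like service.
// Every constant below is an assumption for demonstration purposes.
func main() {
	const (
		tweetsPerDay      = 100_000_000 // writes per day (assumed)
		readsPerWrite     = 100         // assumed read-heavy ratio
		avgTweetSizeBytes = 300         // text plus metadata (assumed)
		secondsPerDay     = 86_400
	)

	writeQPS := tweetsPerDay / secondsPerDay
	readQPS := writeQPS * readsPerWrite
	peakWriteQPS := writeQPS * 2 // assume peak is roughly 2x average

	storagePerDayGB := float64(tweetsPerDay) * avgTweetSizeBytes / 1e9
	storagePerYearTB := storagePerDayGB * 365 / 1000

	fmt.Printf("write QPS ~%d (peak ~%d), read QPS ~%d\n", writeQPS, peakWriteQPS, readQPS)
	fmt.Printf("storage ~%.0f GB/day, ~%.1f TB/year (text only)\n", storagePerDayGB, storagePerYearTB)
}

Rounded numbers like these (roughly 1,200 writes/sec, 120K reads/sec, 30 GB/day of text) are usually all the interview needs; photos and videos would dominate real storage.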
6
Q

High Level Design

A

With the requirements and expected capacity in mind, start designing the high-level architecture of the system.

Break down the system into major components or modules, such as the frontend, backend, databases, caches, and external services.

Draw a simple BLOCK DIAGRAM with 5–6 boxes representing the core components of our system. Identify enough components to solve the actual problem from end to end, and outline the high-level flow of data and requests through the system, from the client to the backend and back.

  • Keep it simple and clean.
  • Use appropriate notations and symbols to represent the components, their interactions, and the data flow.
  • Use different colors, line styles, or symbols to differentiate between various types of components or interactions.
  • Stick with simple boxes representing components and arrows showing directional data flow.
  • Show how data flows through the system, from input to storage and retrieval using arrows.
  • Avoid cluttering the diagram with too much detail or unnecessary elements.
  • Don’t overthink the minor details, this is about the big picture.

For Twitter, at a high level, we will need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution. If we assume we will have a lot more read traffic than write traffic, we can decide to have separate servers for handling these scenarios. On the backend, we need an efficient database that can store all the tweets and support a huge number of reads. We will also need a distributed file storage system for storing photos and videos.

clients -> LB -> [appServer1, appServer2, …] <-> DB, File Storage

7
Q

What to include in High Level Design Diagram

A

CLIENT APPLICATIONS: Indicates how users will interact with the system (web browser, mobile app, desktop application, etc.).

WEB SERVERS: Servers that handle incoming requests from clients.

LOAD BALANCERS: Used to distribute traffic evenly across servers so that no single server is overwhelmed by significant traffic.

APPLICATION SERVICES: The backend logic layer where the core functionalities of the system are implemented.

DATABASES: Specify the type of database: SQL vs. NoSQL, and briefly explain why.

CACHING LAYER: Specify the caching technology (e.g., Redis, Memcached) if you're using one to reduce load on the database.

MESSAGE QUEUES: If using asynchronous communication.

EXTERNAL SERVICES: If the system relies on third-party APIs (e.g., payment gateways), include them.

For every component, make sure to consider trade-offs and justify why you picked specific technologies or architectures (e.g., “We need strong consistency, so a relational database is a good fit”).

8
Q

Database Design

A

This step involves modeling the data, choosing the right storage for the system, designing the database schema, and optimizing the storage and retrieval of data based on the access patterns.

  • Data Modeling
  • Choosing the Right Storage
  • Design The Database Schema
  • Define Data Access Patterns
9
Q

Data Modeling

A
  • Identify the main data entities or objects that the system needs to store and manage (e.g., users, products, orders).
  • Consider the relationships between these entities and how they interact with each other.
  • Determine the attributes or properties associated with each entity (e.g., a user has an email, name, address).
  • Identify any unique identifiers or primary keys for each entity.
  • Consider normalization techniques to ensure data integrity and minimize redundancy.

Defining the data model in the early part of the interview will clarify how data will flow between different components of the system. Later, it will guide data partitioning and management. The candidate should be able to identify various entities of the system, how they will interact with each other, and different aspects of data management like storage, transportation, encryption, etc. Here are some entities for our Twitter-like service:

  • User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc.
  • Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, etc.
  • UserFollows: UserID1, UserID2
  • FavoriteTweets: UserID, TweetID, TimeStamp
10
Q

Choosing the Right Storage

A
  • Evaluate the requirements and characteristics of the data to determine the most suitable database type.
  • Consider factors such as data structure, scalability, performance, consistency, and query patterns.
  • Relational databases (e.g., MySQL, PostgreSQL) are suitable for structured data with complex relationships and ACID properties.
  • NoSQL databases (e.g., MongoDB, Cassandra) are suitable for unstructured or semi-structured data, high scalability, and eventual consistency; they follow BASE properties (basically available, soft state, eventually consistent).
  • Consider using a combination of databases if different data subsets have distinct requirements.

Relational, Key-Value, Graph, Document, Column Store

11
Q

Design the Database Schema

A
  • Define the tables, columns, data types, and relationships based on the chosen database type.
  • Specify primary keys, foreign keys, and any necessary indexes to optimize query performance.
  • Consider denormalization techniques, such as duplication or pre-aggregation, to improve read performance if needed.

Table: User
- UserId: PK
- Name
- Email
- DOB

Table: Tweet
- TweetId: PK
- UserId
- Content
- Likes
- CreationTime

Table: UserFollow
- UserId1: FK
- UserId2: FK
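As one possible concrete rendering, here is the schema above as SQL DDL executed through Go's database/sql. The tables follow the card; the Postgres driver, DSN, column types, and constraints are illustrative assumptions, not a prescribed design:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // hypothetical choice of Postgres driver
)

// Illustrative DDL for the User, Tweet, and UserFollow tables sketched above.
const schema = `
CREATE TABLE users (
    user_id BIGSERIAL PRIMARY KEY,
    name    TEXT NOT NULL,
    email   TEXT UNIQUE NOT NULL,
    dob     DATE
);
CREATE TABLE tweets (
    tweet_id      BIGSERIAL PRIMARY KEY,
    user_id       BIGINT NOT NULL REFERENCES users (user_id),
    content       TEXT NOT NULL,
    likes         INT NOT NULL DEFAULT 0,
    creation_time TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE user_follows (
    user_id1 BIGINT NOT NULL REFERENCES users (user_id),
    user_id2 BIGINT NOT NULL REFERENCES users (user_id),
    PRIMARY KEY (user_id1, user_id2)
);`

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/twitter_demo?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if _, err := db.Exec(schema); err != nil {
		log.Fatal(err)
	}
}

Note the composite primary key on user_follows: it both deduplicates follow edges and acts as an index for "who does this user follow?" lookups.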

12
Q

What is Database Indexing?

A

Database indexing is a technique used to accelerate the retrieval of data within a database. Think of an index as the table of contents in a book 📖. Without it, the database would have to scan every row to find the needed data, which would be inefficient, especially as data volume grows.

When we create an index on a database column, we’re creating a structure that holds a sorted list of pointers to the rows where each unique value occurs. This makes retrieving rows by specific values (such as a specific user ID or product ID) significantly faster, especially as table size grows.

✨ Database indexing is a powerful technique for boosting query performance and scalability in large-scale systems. However, it requires a thoughtful and strategic approach to balance the costs and benefits.

🔍 By understanding the types of indexes available, their use cases, and the practical implications of index design, system architects and developers can build highly performant applications that scale gracefully under high traffic.
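To make the "sorted list of pointers" idea concrete, here is a toy in-memory index in Go that maps a column value to the positions of matching rows, so a lookup can skip the full scan. This is a conceptual sketch only, not how a real database engine implements indexes:

package main

import "fmt"

type Row struct {
	UserID int
	Name   string
}

// buildIndex maps a column value (UserID) to the positions of matching rows,
// standing in for the pointer list a real database index maintains.
func buildIndex(rows []Row) map[int][]int {
	idx := make(map[int][]int)
	for pos, r := range rows {
		idx[r.UserID] = append(idx[r.UserID], pos)
	}
	return idx
}

func main() {
	rows := []Row{{7, "ana"}, {3, "bo"}, {7, "cy"}}
	idx := buildIndex(rows)

	// Indexed lookup: jump straight to matching rows instead of scanning all of them.
	for _, pos := range idx[7] {
		fmt.Println(rows[pos])
	}
}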

13
Q

Why Database Indexing Is Essential

A
  • Improves Query Performance: Indexes drastically reduce the time needed for data retrieval, especially on large datasets.
  • Supports Query Optimization: Query optimizers in databases rely heavily on indexes to decide the most efficient path for retrieving data.
  • Enhances System Scalability: In systems where performance needs to scale with user growth, indexing is crucial for maintaining high query throughput.
  • Reduces Disk I/O: Since indexes allow databases to locate data with fewer reads, they decrease the amount of I/O operations, which is beneficial in both performance and cost.
14
Q

Monitoring and Best Practices for Indexing in Production

A

Index Monitoring Tools:

  • Database-Specific Tools: Most databases (e.g., MySQL’s EXPLAIN, PostgreSQL’s pg_stat_activity) provide tools for examining index usage and query plans.
  • Performance Monitoring Tools: Tools like Prometheus, Datadog, and New Relic allow monitoring query performance and identifying slow queries affected by indexing.
  • Automated Index Tuning: Cloud databases often have automatic index suggestions based on query patterns, helping optimize without manual intervention.
15
Q

Types of Database Indexing Techniques

A
  1. B-Tree Indexes 🌲
    * Ideal for: Range-based queries, ordered retrieval
    * Example: E-commerce product filtering by price
    * Performance impact: Moderate storage cost, good for range queries
  2. Hash Indexes 🔢
    * Ideal for: Exact-match lookups, unique constraints
    * Example: Social media authentication
    * Performance impact: Fast exact matches, no range support
  3. Bitmap Indexes
    * Ideal for: Low-cardinality columns, analytic queries with AND/OR
    * Example: Data warehouse analytics (e.g., status filters)
    * Performance impact: Efficient for AND/OR queries, less so for high-frequency updates
  4. Full-Text Indexes
    * Ideal for: Large text searches, natural-language queries
    * Example: Blog or article search
    * Performance impact: Resource-intensive, but powerful for text search
16
Q

What are the best practices for Database indexing

A

📊 Index Only When Necessary: Avoid over-indexing; create indexes based on query needs and frequency to maximize value.

🔄 Review Index Performance Regularly: Applications evolve, and so do query patterns. Routinely review and adjust your indexing strategies to align with current access patterns.

⚡ Use Covering Indexes for Common Queries: A covering index can fulfill a query directly, reducing I/O by avoiding main table access and improving response times.

17
Q

What are B-Tree Indexes

A

B-Tree (Balanced Tree) indexes are among the most commonly used indexing structures in relational databases. A B-Tree is a self-balancing tree that keeps keys in sorted order, where each node can have multiple children; this organization supports efficient range-based queries and keeps read and write performance consistent as data grows.

  • Advantages:
    • Excellent for range queries (e.g., finding all users aged between 25 and 30).
    • Self-balancing properties provide consistent access times.
  • Disadvantages:
    • Performance can degrade with high-frequency updates due to rebalancing.
    • More complex to maintain with heavy write loads.
  • Use Cases: Suited for large datasets in read-heavy applications (e.g., e-commerce product listings). For example, suppose an e-commerce platform has a products table that includes columns like product_id, price, and date_added. Users might want to filter products within a specific price range or list products added within a certain timeframe. A B-Tree index on the price or date_added column can enable this efficiently:
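A possible sketch of that index and a range query it accelerates, via Go's database/sql. Most relational databases back a plain CREATE INDEX with a B-Tree; the driver, DSN, and table layout here are illustrative assumptions:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // hypothetical choice of Postgres driver
)

func main() {
	// DSN is a placeholder; any relational database with B-Tree indexes works similarly.
	db, err := sql.Open("postgres", "postgres://localhost/shop_demo?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A plain CREATE INDEX is B-Tree-backed in most relational databases.
	if _, err := db.Exec(`CREATE INDEX IF NOT EXISTS idx_products_price ON products (price)`); err != nil {
		log.Fatal(err)
	}

	// Range query the index can serve without scanning the whole table.
	rows, err := db.Query(`SELECT product_id FROM products WHERE price BETWEEN $1 AND $2`, 10, 50)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
		log.Println("product:", id)
	}
}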
18
Q

What are Hash Indexes

A

Hash indexes use a hash function to convert a search key into a location in a table. These indexes work well for equality comparisons (e.g., finding a user by user ID) but are not effective for range queries.

  • Advantages:
    • Very fast for equality comparisons (e.g., SELECT * FROM users WHERE id = ?).
    • Less memory overhead compared to B-trees for single-column indexes.
  • Disadvantages:
    • Cannot handle range queries.
    • May have performance issues with collisions if hash values are not unique.
  • Use Cases: High-speed lookups in applications where queries are based primarily on unique IDs or keys (e.g., session token lookup). For example, consider a social media platform where user authentication checks if the provided username and password_hash match a stored record. Since this query only requires an exact match and doesn't involve any range-based searching, a hash index is ideal.
19
Q

What are Bitmap Indexes

A

Bitmap indexes store columns as binary strings (bitmaps), where each bit indicates the presence or absence of a particular value. Bitmap indexes are very efficient for columns with low cardinality (i.e., columns with a limited number of distinct values, like a “status” field).

  • Advantages:
    • Efficient for columns with low cardinality (e.g., Boolean or status fields).
    • Excellent for complex queries involving multiple fields.
  • Disadvantages:
    • Requires substantial storage space on high-cardinality fields.
    • Can slow down write operations due to the need to update multiple bitmaps.
  • Use Cases: Data warehouses and analytical databases where queries are read-intensive and based on low-cardinality fields. In a data warehouse storing millions of transactions for analysis, columns like status (with values like ‘completed,’ ‘pending,’ ‘failed’) or is_premium (yes/no) benefit from bitmap indexing. Analysts often need to filter and aggregate data based on these low-cardinality columns, and bitmap indexes allow for efficient query processing on them.
  • Ideal Use Cases : Data warehouses, OLAP systems, Report generation
  • Poor Use Cases : OLTP systems, High-cardinality columns, Frequent updates
20
Q

What are Full-Text Indexes

A

Full-text indexes are specialized for searching text-based fields using keywords. They are widely used in applications where searching text data is essential, like document management systems.

  • Advantages:
    • Highly optimized for text search queries.
    • Supports complex queries, including Boolean and proximity searches.
  • Disadvantages:
    • Can consume large amounts of storage and increase complexity.
    • Slower to maintain on fields with frequent text updates.
  • Use Cases: Search-heavy applications, such as social media and document search systems. For example, imagine a blog platform where users want to search articles based on keywords, titles, and body content. Full-text indexing on these columns can allow for efficient and flexible search functionality across large text fields.
21
Q

Define Data Access Patterns

A
  • Identify the common data access patterns and queries that the system will perform.
  • Optimize the database schema and indexes based on these access patterns to ensure efficient data retrieval.
  • Use appropriate caching mechanisms to store frequently accessed data and reduce database load.
  • For scalability, consider partitioning or sharding your data across multiple databases or tables.
22
Q

Design API and Communication Protocols

A

Designing the API (Application Programming Interface) and communication protocols defines how different components of the system interact with each other and how external clients can access the system’s functionality.

  • Identify the API Requirements:
  • Choose the API Style:
  • Define the API Endpoints:
  • Specify the Data Formats:
  • Choose Communication Protocols:

Define what APIs are expected from the system. This will not only establish the exact contract expected from the system but will also ensure that we haven’t gotten any requirements wrong. Some examples of APIs for our Twitter-like service will be:

postTweet(user_id, tweet_data, tweet_location, timestamp, …)

generateTimeline(user_id, current_time, user_location, …)

markTweetFavorite(user_id, tweet_id, timestamp, …)

23
Q

Identify the API Requirements

A
  • Determine the main functionalities and services that the system needs to expose through the API.
  • Consider the different types of clients (e.g., web, mobile, third-party services) that will interact with the API.
  • Identify the data inputs, outputs, and any specific requirements for each API endpoint.
24
Q

Choose the API Style:

A
  • Select an appropriate API style based on the system’s requirements and the clients’ needs.
  • RESTful APIs (Representational State Transfer) are commonly used for web-based systems and provide a uniform interface for resource manipulation.
  • GraphQL APIs offer a flexible and efficient approach for clients to query and retrieve specific data fields.
  • RPC (Remote Procedure Call) APIs are suitable for systems with well-defined procedures or functions.
25
Q

Define the API Endpoints:

A
  • Design clear and intuitive API endpoints based on the system’s functionalities and data model.
  • Use appropriate HTTP methods (e.g., GET, POST, PUT, DELETE) for each endpoint to indicate the desired action.
  • createProfile(name, email, password string)
  • postTweet(userID, content string, timestamp time.Time)
  • followUser(userID1, userID2 string)
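Mapped onto HTTP, postTweet might look like the net/http sketch below; the /tweets route, payload shape, and omitted persistence layer are illustrative assumptions:

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Tweet is the assumed request payload for POST /tweets.
type Tweet struct {
	UserID    string    `json:"user_id"`
	Content   string    `json:"content"`
	Timestamp time.Time `json:"timestamp"`
}

func main() {
	mux := http.NewServeMux()

	// POST /tweets corresponds to postTweet(userID, content, timestamp).
	mux.HandleFunc("/tweets", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		var t Tweet
		if err := json.NewDecoder(r.Body).Decode(&t); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Persistence omitted: a real handler would write to the tweet store.
		w.WriteHeader(http.StatusCreated)
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}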
26
Q

Specify the Data Formats:

A
  • Choose the data formats for the API requests and responses.
  • Common formats include JSON (JavaScript Object Notation) and XML (eXtensible Markup Language).
  • Consider factors such as readability, parsing efficiency, and compatibility with the clients and system components.
27
Q

Choose Communication Protocols:

A
  • HTTPS: Commonly used for RESTful APIs and web-based communication.
  • WebSockets: Useful for real-time, bidirectional communication between clients and servers (e.g., chat applications).
  • gRPC (gRPC Remote Procedure Call): Efficient for inter-service communication in microservices architectures.
  • Messaging Protocols: AMQP, MQTT for asynchronous messaging (often used with message queues).
28
Q

Dive Deeper into Key Components

A

Your interviewer will likely want to focus on specific areas, so pay attention and discuss those areas in more detail.

Common Areas for Deep Dives:

  • Databases: How would you handle a massive increase in data volume? Discuss sharding (splitting data across multiple databases), replication (read/write replicas).
  • Web Servers/Application Servers: How do you add more servers behind the load balancer for increased traffic?
  • Load Balancers: Which Load Balancing techniques and algorithms to use (e.g., round-robin, least connections).
  • Caching: Where would you add more cache layers (in front of web servers? in the application layer?), and how would you deal with cache invalidation?
  • Single Points of Failure: Identify components whose failure would take down the system and discuss how to address it.
  • Authentication/Authorization: How would you manage user access and permissions securely?
  • Rate Limiting: How would you prevent excessive use or abuse of your APIs? (A token-bucket sketch follows at the end of this card.)

The only important thing is to consider trade-offs between different options while keeping system constraints in mind.

  • Since we will be storing a massive amount of data, how should we partition our data to distribute it to multiple databases? Should we try to store all the data of a user on the same database? What issue could it cause?
  • How will we handle hot users who tweet a lot or follow lots of people?
  • Since users’ timelines will contain the most recent (and relevant) tweets, should we try to store our data in such a way that is optimized for scanning the latest tweets?
  • How much and at which layer should we introduce cache to speed things up?
  • What components need better load balancing?

clients
 -> LB
 -> [appServer1, 2, …]
 -> [AggSvc1, 2, …]
 -> [DB Shard1, 2, …]
<-> [CacheSvc1, 2, …]
 -> LB
 -> [CacheSvc1, 2, …]
 -> file storage
 -> [CacheSvc1, 2, …]
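For the rate-limiting bullet above, here is a minimal token-bucket sketch in Go. The capacity and refill rate are arbitrary choices, and a production limiter would usually keep its counters in a shared store such as Redis so the limit holds across all servers:

package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket allows bursts up to capacity and refills at rate tokens/second.
// This version is per-process only, not distributed.
type TokenBucket struct {
	mu       sync.Mutex
	tokens   float64
	capacity float64
	rate     float64
	last     time.Time
}

func NewTokenBucket(capacity, rate float64) *TokenBucket {
	return &TokenBucket{tokens: capacity, capacity: capacity, rate: rate, last: time.Now()}
}

// Allow spends one token if available; otherwise the request is rejected.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate // refill since last call
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	limiter := NewTokenBucket(5, 2) // burst of 5, 2 requests/sec sustained
	for i := 0; i < 8; i++ {
		fmt.Println(i, limiter.Allow()) // first 5 pass, the rest are throttled
	}
}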

29
Q

Address Key Issues

A

This step involves identifying and addressing the core challenges that your system design is likely to encounter.

These challenges can range from scalability and performance to reliability, security, and cost concerns.

  • Addressing Scalability and Performance Concerns
  • Addressing Reliability

Try to discuss as many bottlenecks as possible and different approaches to mitigate them.

  • Is there any single point of failure in our system? What are we doing to mitigate it?
  • Do we have enough replicas of the data so that if we lose a few servers, we can still serve our users?
  • Similarly, do we have enough copies of different services running such that a few failures will not cause a total system shutdown?
  • How are we monitoring the performance of our service? Do we get alerts whenever critical components fail, or their performance degrades?
30
Q

Addressing Scalability and Performance Concerns:

A
  • Scale horizontally (Scale-out) by adding more nodes and use load balancers to evenly distribute the traffic among the nodes.
    • This means adding more machines to your system to spread the workload across multiple servers. It’s often considered the most effective way to scale for large systems.
  • Scale vertically (Scale-up) by increasing the capacity of individual resources (e.g., CPU, memory, storage).
    • This means adding more power to your existing machines by upgrading server with more RAM, faster CPUs, or additional storage.
    • It's a good approach for simpler architectures, but there are limits to how far you can scale up, and it risks creating a single point of failure.
  • Load Balancing: Load balancing is the process of distributing traffic across multiple servers to ensure no single server becomes overwhelmed.
    • Google employs load balancing extensively across its global infrastructure to distribute search queries and traffic evenly across its massive server farms.
  • Implement caching to reduce the load on backend systems and improve response times.
    • Caching is a technique to store frequently accessed data in memory (like RAM) to reduce the load on the server or database. Implementing caching can dramatically improve response times (see the cache-aside sketch after this list).
      • Reddit uses caching to store frequently accessed content like hot posts and comments so that they can be served quickly without querying the database each time.
    • Consider using caching when all three of these are true:
      • Computing the result is costly
      • Once computed, the result tends to not change very often (or at all)
      • The objects we are caching are read often
  • Select efficient data structures and algorithms for critical operations.
  • Optimize database queries and indexes.
  • Denormalize data when necessary to reduce join operations.
  • Use database partitioning and sharding for improved query performance.
    • Partitioning means splitting data or functionality across multiple nodes/servers to distribute workload and avoid bottlenecks.
  • Implement content delivery networks (CDNs) to serve static assets from geographically distributed servers.
    • CDN distributes static assets (images, videos, etc.) closer to users. This can reduce latency and result in faster load times.
      • Cloudflare provides CDN services, speeding up website access for users worldwide by caching content in servers located close to users.
  • Utilize asynchronous programming models to handle concurrent requests efficiently.
    • Asynchronous communication means deferring long-running or non-critical tasks to background queues or message brokers.
    • This ensures your main application remains responsive to users.
      • Slack uses asynchronous communication for messaging. When a message is sent, the sender’s interface doesn’t freeze; it continues to be responsive while the message is processed and delivered in the background.
  • Microservices Architecture
    • Microservices architecture breaks the application down into smaller, independent services that can be scaled independently.
    • This improves resilience and allows teams to work on specific components in parallel.
      • Uber has evolved its architecture into microservices to handle different functions like billing, notifications, and ride matching independently, allowing for efficient scaling and rapid development.
  • Auto Scaling:
    • Automatically adjust the number of active servers based on the current load. This ensures that the system can handle spikes in traffic without manual intervention.
      • AWS Auto Scaling monitors applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.
  • Multi-region Deployment
    • Deploy the application in multiple data centers or cloud regions to reduce latency and improve redundancy.
      • Spotify uses multi-region deployments to ensure their music streaming service remains highly available and responsive to users all over the world, regardless of where they are located.
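A minimal cache-aside sketch in Go for the caching bullet above, using an in-process map as a stand-in for Redis or Memcached; a real deployment would also need TTLs and an invalidation strategy:

package main

import (
	"fmt"
	"sync"
)

type PostStore struct {
	mu    sync.RWMutex
	cache map[int]string // stand-in for Redis/Memcached
}

// loadFromDB simulates the expensive database query we want to avoid repeating.
func loadFromDB(id int) string {
	return fmt.Sprintf("post-%d body", id)
}

// GetPost is cache-aside: check the cache first, fall back to the DB on a miss,
// then populate the cache so the next read is served from memory.
func (s *PostStore) GetPost(id int) string {
	s.mu.RLock()
	if v, ok := s.cache[id]; ok {
		s.mu.RUnlock()
		return v // cache hit
	}
	s.mu.RUnlock()

	v := loadFromDB(id) // cache miss: hit the database
	s.mu.Lock()
	s.cache[id] = v
	s.mu.Unlock()
	return v
}

func main() {
	s := &PostStore{cache: make(map[int]string)}
	fmt.Println(s.GetPost(42)) // miss, loads from DB and caches
	fmt.Println(s.GetPost(42)) // hit, served from cache
}

This fits the three criteria in the caching bullet: the load is costly, the result rarely changes, and the object is read often.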
31
Q

Addressing Reliability

A

Reliability refers to a system’s ability to function correctly and consistently, even in the presence of failures or errors.

Here are some key considerations for making our system more reliable:

  • Analyze the system architecture and identify potential single points of failure.
  • Design redundancy into the system components (multiple load balancers, database replicas) to eliminate single points of failure.
  • Consider geographical redundancy to protect against regional failures or disasters.
  • Implement data replication strategies to ensure data availability and durability.
  • Implement circuit breaker patterns to prevent cascading failures and protect the system from overload. (A design pattern used in modern software development, applied to detect failures and encapsulate the logic of preventing a failure from constantly recurring.)
  • Implement retry mechanisms with exponential backoff to handle temporary failures and prevent overwhelming the system during recovery (see the sketch after this list).
  • Implement comprehensive monitoring and alerting systems to detect failures, performance issues, and anomalies.
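A minimal sketch of retry with exponential backoff (plus jitter, so synchronized clients don't retry in lockstep); the attempt count and base delay are arbitrary assumptions:

package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retry runs op up to maxAttempts times, doubling the delay after each failure
// and adding random jitter to spread retries out across clients.
func retry(maxAttempts int, base time.Duration, op func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		delay := base << attempt // exponential: base, 2x, 4x, ...
		jitter := time.Duration(rand.Int63n(int64(base)))
		time.Sleep(delay + jitter)
	}
	return fmt.Errorf("all %d attempts failed: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := retry(5, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("temporary failure")
		}
		return nil
	})
	fmt.Println(calls, err) // succeeds on the third attempt
}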
32
Q

Scalability

A

As a system grows, the performance starts to DEGRADE unless we adapt it to deal with that growth.

Scalability is the property of a system to handle a growing amount of load by ADDING RESOURCES to the system.

A system that can continuously evolve to support a growing amount of work is scalable.

33
Q

How can a system grow?

A
  1. Growth in User Base
    More users start using the system, leading to an increased number of requests.
    * Example: A social media platform experiencing a surge in new users.
  2. Growth in Features
    More features are introduced to expand the system's capabilities.
    * Example: An e-commerce website adding support for a new payment method.
  3. Growth in Data Volume
    The amount of data the system stores and manages grows due to user activity or logging.
    * Example: A video streaming platform like YouTube storing more video content over time.
  4. Growth in Complexity
    The system's architecture evolves to accommodate new features, scale, or integrations, resulting in additional components and dependencies.
    * Example: A system that started as a simple application is broken into smaller, independent systems.
  5. Growth in Geographic Reach
    The system is expanded to serve users in new regions or countries.
    * Example: An e-commerce company launching websites and distribution in new international markets.
34
Q

Common Components in System Design

A

5.1: Payment Service
Payment services handle transactions between customers and businesses. Integrating a reliable payment service is crucial for e-commerce and subscription-based platforms. Popular payment service providers include Stripe, PayPal, and Square. These services usually provide APIs to facilitate secure transactions and manage recurring payments, refunds, etc.

5.2: Analytic Service
Analytic services enable data collection, processing, and visualization to help businesses make informed decisions. These services can track user behaviour, monitor system performance, and analyze trends. Standard analytic service providers include Google Analytics, Mixpanel, and Amplitude. Integrating analytic services into a system can help businesses optimize their offerings and improve the user experience.

5.3: Notification
Notification services keep users informed about updates, alerts, and important information. These services can deliver notifications through various channels, such as email, SMS, and push notifications. Examples of notification service providers include Firebase Cloud Messaging (FCM), Amazon Simple Notification Service (SNS), and Twilio.

5.4: Search
Integrating a powerful search component is essential for systems with large amounts of data or content. A search service should provide fast, relevant, and scalable search capabilities. Elasticsearch, Apache Solr, and Amazon CloudSearch are popular choices for implementing search functionality. These services typically support full-text search, faceted search, and filtering, enabling users to find the information they’re looking for quickly and efficiently.

5.5: Recommendation Service
Recommendation services use algorithms to provide personalized suggestions to users based on their preferences, behaviour, and other factors. These services can significantly improve user engagement and satisfaction. Techniques for generating recommendations include collaborative filtering, content-based filtering, and hybrid approaches. Machine learning algorithms, such as matrix factorization and deep learning, can also be used to generate more sophisticated recommendations.

35
Q

Microservice Advantages

A
  • Scalability: Individual microservices can be scaled independently based on demand, optimizing resource usage.
  • Flexibility: Different microservices can be developed, tested, deployed, and maintained using different technologies.
  • Faster Development: Smaller, focused teams can work on separate microservices concurrently, speeding up development cycles and release times.
  • Resilience: Failures in one microservice are isolated and less likely to affect the entire system, improving overall reliability.
  • Easier Maintenance: Smaller codebases for each microservice are easier to understand, modify, and debug, reducing technical debt.
  • Flexible to outsourcing: Intellectual property protection can be a concern when outsourcing business functions to third-party partners. A microservices architecture can help by isolating partner-specific components, ensuring the core services remain secure and unaffected.
36
Q

Microservice Challenges

A
  • Complexity: Developing and maintaining a microservices-based application typically demands more effort than a monolithic approach. Each service requires its own codebase, testing, deployment pipeline, and documentation.
  • Inter-Service Communication: Microservices rely on network communication, which can introduce latency, failures, and complexities in handling inter-service communication.
  • Data Management: Distributed data management can be challenging, as each microservice may have its own database, leading to issues with consistency, data synchronization, and transactions.
  • Deployment Overhead: Managing the deployment, versioning, and scaling of multiple microservices can require sophisticated orchestration and automation tools like Kubernetes.
  • Security: Each microservice can introduce new potential vulnerabilities, increasing the attack surface and requiring careful attention to security practices.
37
Q

Microservice Patterns

A
  • Database Per Service Pattern
  • API Gateway Pattern
  • Backend For Frontend Pattern
  • Command Query Responsibility Segregation (CQRS)
  • Event Sourcing Pattern
  • Saga Pattern
  • Sidecar Pattern
  • Circuit Breaker Pattern
  • Anti-Corruption Layer
  • Aggregator Pattern

https://medium.com/@sylvain.tiset/top-10-microservices-design-patterns-you-should-know-1bac6a7d6218

38
Q

Database Per Service Microservice Pattern

A

The Database per Service pattern is a design approach in microservices architecture where each microservice has its own dedicated database, accessible only through that microservice's API. The service's database is effectively part of that service's implementation; it cannot be accessed directly by other services.

If a relational database is chosen, there are three ways to keep a service's data private from other services:

  • Private tables per service: Each service owns a set of tables that must only be accessed by that service.
  • Schema per service: Each service has a database schema that's private to that service.
  • Database server per service: Each service has its own database server.

Here are the main benefits of using this pattern:

  • Loose Coupling: Services are less dependent on each other, making the system more modular.
  • Technology Flexibility: Teams can choose the best database technology, and an appropriately sized database, for each microservice's specific requirements.

A design pattern always comes with trade-offs; here are some challenges that this pattern does not solve:

  • Complexity: Managing multiple databases, including backup, recovery, and scaling, adds complexity to the system.
  • Cross-Service Queries: Queries over data spread across multiple databases are hard to implement. The API Gateway or Aggregator pattern can be used to tackle this issue.
  • Data Consistency: Maintaining consistency across different services’ databases requires careful design and often involves other patterns like Event sourcing or Saga pattern.
39
Q

API Gateway Microservice Pattern

A

The API Gateway pattern is a design approach in microservices architecture where a single entry point (the API gateway) handles all client requests. The API gateway acts as an intermediary between clients and the microservices, routing requests to the appropriate service, aggregating responses, and often managing cross-cutting concerns like authentication, load balancing, logging, and rate limiting.

Here are the main advantages of using an API Gateway in a microservice architecture:

  • Simplified Client Interaction: Clients interact with a single, unified API instead of dealing directly with multiple microservices.
  • Centralized Management: Cross-cutting concerns are handled in one place, reducing duplication of code across services.
  • Improved Security: The API gateway can enforce security policies and access controls, protecting the underlying microservices.

Here are the main drawbacks:

  • Single Point of Failure: If the API gateway fails, the entire system could become inaccessible, so it must be highly available and resilient.
  • Performance Overhead: The gateway can introduce latency and become a bottleneck if not properly optimized when scaling.
40
Q

Backend For Frontend Microservice Pattern

A

The Backend for Frontend (BFF) pattern is a design approach where a dedicated backend service is created for each specific frontend or client application, such as a web app, mobile app, or desktop app. Each BFF is designed to respond to the specific needs of its corresponding frontend, handling data aggregation, transformation, and communication with underlying microservices or APIs. The BFF pattern is best used in situations where there are multiple front-end applications that have different requirements.

Here are the benefits of such a pattern:

  • Optimized Communication with Frontends: Frontends get precisely what they need, leading to faster load times and a better user experience.
  • Reduced Complexity for Frontends: The frontend is simplified as the BFF handles complex data aggregation, transformation, and business logic.
  • Independent Evolution: Each frontend and its corresponding BFF can evolve independently, allowing for more flexibility in development.

However, this pattern comes with these drawbacks:

  • Complexity: Maintaining separate BFFs for different frontends adds to the development and maintenance complexity.
  • Potential Duplication: Common functionality across BFFs might lead to code duplication if not managed properly.
  • Consistency: Ensuring consistent behavior across different BFFs can be challenging, especially in large systems.
41
Q

Command Query Responsibility Segregation (CQRS) Microservice Pattern

A

The CQRS pattern is a design approach where the responsibilities of reading data (queries) and writing data (commands) are separated into different models or services. The separation of concerns enables each model to be tailored to its specific function:

  • Command Model: Can be optimized for handling complex business logic and state changes.
  • Query Model: Can be optimized for efficient data retrieval and presentation, often using denormalized views or caches.

Communication between the read and write services can be handled in several ways, such as message queues or the Event Sourcing pattern described below.
Here are the main benefits of the CQRS pattern:

  • Performance Optimization: Each model can be optimized for its specific operations, enhancing overall system performance.
  • Scalability: Read and write operations can be scaled independently, improving resource utilization.
  • Maintainability: By separating command and query responsibilities, the codebase becomes more organized and easier to understand and modify.

Here are the challenges with this pattern:

  • Complexity: The need to manage and synchronize separate models for commands and queries adds complexity to the system.
  • Data Consistency: Ensuring consistency between the command and query models, especially in distributed systems where data updates may not be immediately propagated, can be challenging.
  • Data Synchronization: Synchronizing the read and write models can be challenging, particularly with large volumes of data or complex transformations. Techniques such as event sourcing or message queues can assist in managing this complexity.
42
Q

Event Sourcing Microservice Pattern

A

The Event Sourcing pattern captures state changes as a sequence of events stored in an event store, instead of directly saving the current state. The event store acts like a message broker, allowing services to subscribe to events via an API. When a service records an event, it is sent to all interested subscribers. To reconstruct the current state, the events in the event store are replayed in sequence. Replay can be optimized with snapshots, so that only the events after the latest snapshot need to be replayed.

Here are the main benefits of event sourcing pattern:

  • Audit Trail: Provides a complete history of changes, which is useful for auditing, debugging, and understanding how the system evolved over time.
  • Scalability: By storing only events, write operations can be easily scaled. This allows the system to handle a high volume of writes across multiple consumers without performance concerns.
  • Evolvability: New functionality can be added easily by introducing new event types, as the business logic for processing events is separated from the event storage.

It comes with these drawbacks:

  • Complexity: Managing event streams and reconstructing state can be more complex than a traditional approach, and there is a learning curve to master this practice.
  • Higher storage requirements: Event Sourcing usually demands more storage than traditional methods, as all events must be stored and retained for historical purposes.
  • Complex querying: Querying event data can be more challenging than with traditional databases because the current state must be reconstructed from events.
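A toy event-sourcing sketch in Go: state is never stored directly, only events, and the current state is rebuilt by replaying them. The account/balance example and event names are illustrative assumptions:

package main

import "fmt"

// Event is an immutable fact appended to the store; state is derived from events.
type Event struct {
	Type   string // "Deposited" or "Withdrawn" (illustrative event types)
	Amount int
}

type EventStore struct{ events []Event }

func (s *EventStore) Append(e Event) { s.events = append(s.events, e) }

// Replay reconstructs the current state by folding over all events in order.
// A snapshot would let the fold start from a saved intermediate state instead.
func (s *EventStore) Replay() int {
	balance := 0
	for _, e := range s.events {
		switch e.Type {
		case "Deposited":
			balance += e.Amount
		case "Withdrawn":
			balance -= e.Amount
		}
	}
	return balance
}

func main() {
	store := &EventStore{}
	store.Append(Event{"Deposited", 100})
	store.Append(Event{"Withdrawn", 30})
	fmt.Println("balance:", store.Replay()) // 70, derived purely from the event log
}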
43
Q

Saga Microservice Pattern

A

The Saga Pattern is used in distributed systems to manage long-running business transactions across multiple microservices or databases. It does this by breaking the transaction into a sequence of local transactions, each updating the database and triggering the next step via an event. If a transaction fails, the saga runs compensating transactions to undo the changes made by previous steps.

Sagas can be coordinated in two ways:

  • Choreography: Each service listens to events and triggers the next step in the saga. This is a decentralized approach where services communicate directly with each other.
  • Orchestration: A central orchestrator service directs the saga, telling each service when to perform its transaction and managing the flow of the entire process.

Here are the main benefits of the saga pattern:

  • Eventual data consistency: It enables an application to maintain data consistency across multiple services.
  • Improved Resilience: By breaking down transactions into smaller, independent steps with compensating actions, the Saga Pattern enhances the system’s ability to handle failures without losing data consistency.

It comes with its drawbacks:

  • Complexity: Implementing the Saga Pattern can add complexity, especially in managing compensating transactions and ensuring all steps are correctly coordinated.
  • Lack of automatic rollback: Unlike ACID transactions, sagas do not have automatic rollback, so developers must design compensating transactions to explicitly undo changes made earlier in the saga.
  • Lack of isolation: The absence of isolation (the “I” in ACID) in sagas increases the risk of data anomalies during concurrent saga execution.
44
Q

Sidecar Microservice Pattern

A

The Sidecar Pattern involves deploying an auxiliary service (the sidecar) alongside a primary application service within the same environment, such as a container or pod. The sidecar handles supporting tasks like logging, monitoring, or security, enhancing the primary service's functionality without modifying its code. This pattern promotes modularity and scalability by offloading non-core responsibilities to the sidecar, allowing the primary service to focus on its main functionality.

Before going into pros and cons of this pattern, let’s see some use cases of the pattern:

  • Logging and Monitoring: A sidecar can collect logs or metrics from the primary service and forward them to centralized systems for analysis.
  • Security: Sidecars can manage security functions like authentication, authorization, and encryption. Offloading these responsibilities to the sidecar allows the core service to concentrate on its business logic.

Here are the main advantages of this pattern:

  • Modularity and Extensibility: The Sidecar pattern allows developers to easily add or remove functionalities by attaching or detaching sidecar containers, enhancing code reuse and system maintainability without affecting the core service.
  • Isolation of Concerns: The sidecar operates separately from the core service, isolating auxiliary functions and minimizing the impact of sidecar failures.
  • Scalability: By decoupling the core service from the sidecar, each component can scale independently based on its specific needs, ensuring that scaling the core service or sidecar does not affect the other.

Here are the main disadvantages:

  • Increased Complexity: Adds a layer of complexity, requiring management and coordination of multiple containers, which can increase deployment and operational overhead.
  • Potential Single Point of Failure: The sidecar container can become a single point of failure, necessitating resilience mechanisms like redundancy and health checks.
  • Latency: Introduces additional communication overhead, which can affect performance, especially in latency-sensitive applications.
  • Synchronization and Coordination: Ensuring proper synchronization between the primary service and the sidecar can be challenging, particularly in dynamic environments.
45
Q

Circuit Breaker Microservice Pattern

A

The Circuit Breaker Pattern is a design approach used to enhance the resilience and stability of distributed systems by preventing cascading failures. It functions like an electrical circuit breaker: when a service encounters a threshold of consecutive failures, the circuit breaker trips, stopping all requests to the failing service for a timeout period. During this timeout, the system can recover without further strain. After the timeout, the circuit breaker allows a limited number of test requests to check if the service has recovered. If successful, normal operations resume; if not, the timeout resets. This pattern helps manage service availability, prevent system overload, and ensure graceful degradation in microservices environments.

The Circuit Breaker pattern typically operates in three main states: Closed, Open, and Half-Open. Each state represents a different phase in the management of interactions between services. Here’s an explanation for each state:

  • Closed: The circuit breaker allows requests to pass through to the service. It monitors the responses and failures. If failures exceed a predefined threshold, the circuit breaker transitions to the “Open” state.
  • Open: The circuit breaker prevents any requests from reaching the failing service, redirecting them to a fallback mechanism or returning an error. This state allows the service time to recover from its issues.
  • Half-Open: After a predefined recovery period, the circuit breaker transitions to the “Half-Open” state, where it allows a limited number of requests to test if the service has recovered. If these requests succeed, the circuit breaker returns to the “Closed” state; otherwise, it goes back to “Open.”
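A minimal sketch of these three states in Go; the failure threshold and timeout are arbitrary, and production systems usually reach for an existing library (e.g., sony/gobreaker) rather than hand-rolling this:

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

// ErrOpen is returned while the breaker is open and rejecting calls.
var ErrOpen = errors.New("circuit open: request rejected")

type CircuitBreaker struct {
	mu        sync.Mutex
	st        state
	failures  int           // consecutive failures seen
	threshold int           // failures that trip the breaker
	timeout   time.Duration // how long to stay open before a trial call
	openedAt  time.Time
}

func (cb *CircuitBreaker) Call(op func() error) error {
	cb.mu.Lock()
	if cb.st == open {
		if time.Since(cb.openedAt) < cb.timeout {
			cb.mu.Unlock()
			return ErrOpen // still open: fail fast
		}
		cb.st = halfOpen // timeout elapsed: let a trial request through
	}
	cb.mu.Unlock()

	err := op()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.st == halfOpen || cb.failures >= cb.threshold {
			cb.st = open // trip (or re-trip) the breaker
			cb.openedAt = time.Now()
		}
		return err
	}
	cb.failures = 0
	cb.st = closed // success closes the breaker again
	return nil
}

func main() {
	cb := &CircuitBreaker{threshold: 3, timeout: 2 * time.Second}
	for i := 0; i < 5; i++ {
		// After 3 consecutive failures the breaker opens and rejects calls fast.
		fmt.Println(cb.Call(func() error { return errors.New("downstream failure") }))
	}
}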

Here are the main benefits of this pattern:

  • Prevents Cascading Failures: By halting requests to a failing service, the pattern prevents the failure from affecting other parts of the system.
  • Improves System Resilience: Provides a mechanism for systems to handle failures gracefully and recover from issues without complete outages.
  • Enhances Reliability: Helps maintain system reliability and user experience by managing and isolating faults.

Here are the main challenges that come with this pattern:

  • Configuration Complexity: Setting appropriate thresholds and recovery periods requires careful tuning based on the system’s behavior and requirements.
  • Fallback Management: Ensuring effective fallback mechanisms that provide meaningful responses or handle requests appropriately is crucial.

Note that other design patterns exist to reduce the damage done by failures, such as the Bulkhead pattern, which isolates different parts of a system into separate pools to prevent a failure in one part from impacting others.

46
Q

Anti-Corruption Layer Microservice Pattern

A

The Anti-Corruption Layer (ACL) Pattern is a design pattern used to prevent the influence of external systems’ design and data models from corrupting the internal design and data models of a system. It acts as a barrier or translator between two systems, ensuring that the internal system remains isolated from and unaffected by the complexities or inconsistencies of external systems.

Here are the main benefits from the ACL pattern:

  • Protection: Shields the internal system from external changes and potential corruption.
  • Flexibility: Easier integration with external systems by managing differences in data models and protocols.
  • Maintainability: Simplifies modifications and updates to either the internal or external systems without affecting the other.

On the other hand, here are the main challenges of the ACL pattern:

  • Latency: Calls between the two systems add latency.
  • Scaling: Scaling the ACL alongside many microservices or monolithic applications can be a concern for the development team.
  • Added Complexity: Introduces additional complexity due to the need for translation and adaptation logic.
47
Q

Aggregator Microservice Pattern

A

The Aggregator Pattern is a design pattern used to consolidate data or responses from multiple sources into a single, unified result. An aggregator component or service manages the collection of data from different sources, coordinating the process of fetching, merging, and processing the data.

Here are the main benefits from the Aggregator pattern:

  • Simplified Client Interaction: Clients interact with one service or endpoint, reducing complexity and improving ease of use.
  • Reduced Network Calls: Aggregates data from multiple sources in one place, minimizing the number of calls or requests needed from clients and improving overall efficiency.
  • Centralized Data Processing: Handles data processing and transformation centrally, ensuring consistency and coherence across different data sources.

Here are the drawbacks of this pattern:

  • Added Complexity: Implementing the aggregation logic can be complex, especially when dealing with diverse data sources and formats.
  • Single Point of Failure: Since the aggregator serves as the central point for data collection, any issues or failures with the aggregator can impact the availability or functionality of the entire system.
  • Increased Latency: Aggregating data from multiple sources may introduce additional latency, particularly if the sources are distributed or if the aggregation involves complex processing.
  • Scalability Challenges: Scaling the aggregator to handle increasing amounts of data or requests can be challenging, requiring careful design to manage load and ensure responsiveness.
48
Q

12 fundamental (technical) system design concepts

A
  1. APIs
  2. Databases (SQL vs NoSQL)
  3. Scaling
  4. CAP theorem
  5. Web authentication and basic security
  6. Load balancers
  7. Caching
  8. Message queues
  9. Indexing
  10. Failovers
  11. Replication
  12. Consistent hashing
49
Q

Representational State Transfer (REST) Strengths vs Weaknesses

A

Strengths: This approach creates structured ways of getting and modifying information from your database. It is the most universally used and works for most circumstances. This method also tends to have tooling that supports generation of documentation that can make it easier for developers to understand, especially for external services accessing the API through network calls.

Weaknesses: It requires you to write requests for each type of entity in your database, in contrast to GraphQL, where the caller can grab all the data it needs in a single query. It also isn't as space-efficient as RPC.

50
Q

Remote Procedure Call (RPC) Strengths vs Weaknesses

A

RPC is like communication in a family or with close friends. When you are with family and you notice your favorite snacks in the fridge, you can usually skip a lot of communication and make assumptions that you can eat some without asking. Since you have close and frequent communication, you make certain processes more efficient.

RPC allows the execution of a procedure or command in a remote machine. In other words, you can write code that executes on another computer internally in the same way you write code that runs on the current machine. In this approach, the API is more thought of as an action or command. And it is easier to add these functions to extend the functionality.

RPC - /placeAnOrder (OrderDetails order)
REST - POST /order/orderNumber={} [Order body]

Strengths: It is more space efficient than REST, and it makes development easier since the code you write that requires communication to other computers does not require much special syntax.

Weaknesses: It can only be used for internal communication. Complications, such as timing issues, can occur when communicating between machines, and RPC makes the distinction between local and remote calls less clear, leading developers to miss corner cases that cause faults in the system.

51
Q

GraphQL Strengths vs Weaknesses

A

GraphQL can be thought of as those Amazon Go stores where you can walk in, grab what you need, and walk out. There are cameras that track what you took, and you are automatically charged for the items you left with. Items in Amazon Go stores are placed in a way that they can be easily discovered, allowing customers to decide what they need. Likewise, in GraphQL you structure the data in graph relationships and then leave it for those using your service to define what they need.

This modeling technique enables building a perfect request to fetch all the data that is needed by the client without making multiple calls.

Strengths: GraphQL works particularly well for customer-facing web and mobile applications, because once you set up the system, frontend developers can craft their own requests to get and modify information without requiring backend work to build more routes.

Weaknesses: There is initially some upfront development work required to set up this communication system, both on the frontend and backend. It is also less friendly for external users when compared with REST APIs, where documentation can be generated automatically. In addition, GraphQL is not suitable for use cases where certain data needs to be aggregated on the backend.
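To make the single-request idea concrete, here is a sketch of one query fetching nested data in a single round trip, sent from Python with only the standard library; the endpoint URL and the schema fields (user, followers, tweets) are hypothetical:

import json
import urllib.request

QUERY = """
{
  user(id: "42") {
    name
    followers { name }
    tweets(last: 10) { text likeCount }
  }
}
"""

req = urllib.request.Request(
    "https://example.com/graphql",  # hypothetical GraphQL endpoint
    data=json.dumps({"query": QUERY}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)  # user, followers, and tweets in one round trip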

52
Q

Cap Theorem

A

(C)onsistency means that every node in the network sees the same data at the same time.

(A)vailability means that every request to the system receives a response, even when some nodes are down.

(P)artition tolerance means that in case of a fault in the network or communication, the system will still work.

The theorem states that when a network partition does occur, the system must choose between consistency and availability; it cannot guarantee both.

53
Q

Authentication VS Authorization

A
  • Authentication refers to verifying the identity of our service’s users.
    • Passwords (hashed + salted)
      • Hash the password: store only the hash, never the plaintext password.
      • Salt the password: append some random, non-obvious value to the password before hashing, so that identical passwords produce different hashes. For example, instead of hashing groupPasswordRainToMainArea directly, you could salt it by adding “spider” to the password. This protects against rainbow tables: enormous lookup tables containing millions of common passwords and their variations, along with their hashes for a common hash algorithm.
  • Session Tokens: a classic, simple way to track authentication is to generate a token the user submits with subsequent requests to prove that they are, in fact, signed in. The session token is equivalent to a password, so it should come with an expiration date, as short as feasible.
  • JSON Web Tokens: rather than plain session tokens, you may also opt to use JSON Web Tokens, or JWTs. While a session token is an opaque string that means nothing without access to the session database, a JWT explicitly encodes the user’s access.
    1. Sign the payload
      * Attaching a signature from a private key held only by your service verifies the token’s legitimacy. You can use it on the client side to tell who’s logged in and, optionally, what permissions they have. The other advantage, and perhaps the strongest selling point for JWTs as a whole, is that other services can verify the token as well.
    2. Encrypt the payload
      * Not as useful, since it gets closer to session tokens and loses the advantages of signing the payload.

Cookies are used to store the token on the client side. Use Set-Cookie with the Secure flag (sent only over HTTPS) and the HttpOnly flag (inaccessible to client-side scripts).

  • Authorization is the related but separate concept of determining which users have permission to take which actions.
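A minimal sketch of salting and hashing with Python’s standard library (PBKDF2 shown here; a dedicated password-hashing library such as bcrypt or argon2 is usually preferred in production):

import hashlib
import hmac
import secrets

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, hash). Store both; never store the plaintext password."""
    salt = secrets.token_bytes(16)  # random per-user salt defeats rainbow tables
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(digest, stored)  # constant-time comparison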
54
Q

Web authentication and basic security

A
  1. The user signs up. At this point, we need to salt and hash their password and store those values (but not the password itself!).
  2. The user logs in with their username and password. We verify the password by hashing it with the stored salt and checking whether it matches the stored hash (ideally using a secure library to make the comparison). We then send some kind of identifying token, either a simple session token or a JWT, back to the client in a Set-Cookie header.
  3. On subsequent requests, the browser sends the cookie back to the server, where we can verify the session token or check the signature on (or decrypt) the JWT.
  4. Periodically, the session token or JWT should expire, and a new one should be generated and sent down to the client with a Set-Cookie header.
  5. Eventually, the user’s session may expire from inactivity. In this case, we go back to step 2.
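A sketch of the token half of this flow, assuming a simple in-memory session store (a real service would persist sessions and enforce expiry):

import secrets

sessions = {}  # token -> user_id; stands in for a real session store

def issue_session(user_id: int) -> str:
    """Step 2: after the password check passes, mint a session token."""
    token = secrets.token_urlsafe(32)  # unguessable; treat it like a password
    sessions[token] = user_id
    # Returned to the browser via:
    #   Set-Cookie: session=<token>; Secure; HttpOnly; Max-Age=3600
    return token

def verify_session(token: str) -> int | None:
    """Steps 3-5: look up the cookie the browser sends back."""
    return sessions.get(token)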
55
Q

Load Balancers

A

Helps distribute traffic across the machines. As distributed systems are designed to scale, the basic requirement is to add or remove machines in case of increased load or in case of failures. Load balancers also help manage this.

  • Round Robin
  • Least Connections / Least Response Time
  • Hashing
    • The key for hashing can be a request id for a given user, or the client's IP address
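A minimal sketch of two of these strategies in Python; the server addresses are hypothetical, and a real load balancer would also track health and connection counts:

import hashlib
import itertools

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical backends

# Round robin: hand out servers in a fixed rotation.
_rotation = itertools.cycle(servers)

def round_robin() -> str:
    return next(_rotation)

# Hashing: the same key (user id, request id, or client IP) always maps
# to the same server, which helps with session affinity and cache locality.
def by_hash(key: str) -> str:
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]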
56
Q

Caching Patterns

A

Cache-aside pattern
* This is the most popular cache pattern. The application first tries to fetch data from the cache; if the data is not found (a “cache miss”), it fetches the data from the database (or performs an expensive computation), puts that data into the cache, and then returns the result to the user. (A minimal sketch follows below.)
* Advantage:
  * We only cache the data that we actually need.
* Disadvantages:
  * Data can become stale if there are lots of updates to the database (mitigate with a TTL).
  * If there are a lot of cache misses, the application has to do much more work than in the regular flow of fetching data solely from the database.
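A minimal cache-aside sketch in Python; fetch_from_db stands in for whatever expensive lookup (database query or computation) the application performs:

import time

cache = {}  # key -> (value, expires_at); stands in for Redis or Memcached

def get_user(user_id, fetch_from_db, ttl=60):
    """Cache-aside: try the cache first, fall back to the database."""
    entry = cache.get(user_id)
    if entry and entry[1] > time.time():
        return entry[0]                          # cache hit
    value = fetch_from_db(user_id)               # cache miss: expensive call
    cache[user_id] = (value, time.time() + ttl)  # TTL mitigates staleness
    return value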

Write-through and write-back patterns
* The application writes data directly to the cache, and the cache then writes the data to the database, either synchronously (“write-through”) or asynchronously (“write-back”, also called “write-behind”). In the asynchronous case, the data is put on a queue that writes it back to the database, which improves the write latency seen by the application.
* If you’re experiencing slow writes, a quick fix is asynchronous writes to the database.

  • Advantages:
    • The cache always holds the latest written data, so subsequent reads are fast, and (with write-back) writes appear fast to the client.
  • Disadvantages:
    • We write all data to the cache, even data that might never be read, so we can overload the cache (or cache memory) with expensive calls that might not even be required.
    • With write-back, if the cache fails before the data has been written to the database, the data is lost, causing inconsistency.
57
Q

Cache invalidation

A

One of the problems we have observed with caching is that data can become stale if there are lots of updates to the database. Therefore, it is important to expire or invalidate data from the cache, so your data doesn’t get stale! It’s a good practice to include a brief point about cache invalidation during system design interviews.

Policies:
* Least Recently Used (LRU)

For most systems, 20% of the data accounts for 80% of the reads. Because of this 80/20 rule, we want to give special treatment to the most popular data, and that is exactly what LRU does: it keeps the recently (and therefore frequently) used entries in the cache, resulting in fewer cache misses and lower latency for the bulk of your requests.
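A minimal LRU cache sketch using Python’s OrderedDict:

from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # drop the least recently used entry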

58
Q

Queues

A

Advantages

  • A queue buffers messages that need to be written to a database. This is useful when a traffic spike could drive the database CPU up like crazy and kill the server, taking the database down and probably losing data. Instead… throw it in the queue!
  • If a message has to be processed by some very expensive code, you may also hold messages in a queue while previous messages are being processed, so you don’t overload (and potentially kill) servers.
  • Queues can deliver messages to multiple systems, instead of the client having to send them to all the required systems.
  • Queues decouple the client from the server by eliminating the need to know the server address.

Can have these properties:
* Guaranteed delivery.
* No duplicate messages are delivered. (dedupID)
* Ensure that the order of messages is maintained. (FIFO)
* At least once delivery with idempotent consumers.
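A minimal idempotent-consumer sketch; charge_customer and the message fields are hypothetical stand-ins for real business logic:

processed = set()  # dedup ids already handled; persist this in practice

def charge_customer(message: dict) -> None:
    # Hypothetical side effect standing in for real processing.
    print("charging", message["customer_id"], message["amount"])

def handle(message: dict) -> None:
    """At-least-once delivery means redeliveries happen; checking the
    dedup id makes the consumer idempotent, so retries are harmless."""
    if message["dedup_id"] in processed:
        return  # duplicate delivery: already processed, safe to skip
    charge_customer(message)
    processed.add(message["dedup_id"])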

59
Q

Failover

A

When a leader node fails:
1. One of the follower nodes needs to be promoted to the leader.
2. Client node(s) must be reconfigured to send the write request to the new leader.
3. Other followers need to be reconfigured to consume data from the new leader.

For failover to be triggered, a prerequisite is that the leader’s failures are tracked. We can do this by periodically sending health-status pings to the nodes and treating slow or missing responses as a sign of failure.

Failover can lead to some tricky issues:
1. Failover can lose updates: with asynchronous replication, writes that the old leader acknowledged but had not yet replicated disappear when it goes down.
2. How do we detect that the leader has gone down? Deciding the threshold for marking the leader as unavailable is challenging. Sometimes traffic on the system is high and the leader simply takes longer to respond; if we bring down the leader in that scenario, the system becomes even more stressed.
3. The problem of having more than one leader (“split brain”).
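A toy failure detector illustrating why the threshold is tricky; the 5-second timeout is an arbitrary number for the sketch:

import time

TIMEOUT = 5.0  # seconds without a heartbeat before suspecting the leader
last_heartbeat = time.monotonic()

def on_heartbeat() -> None:
    """Called whenever the leader answers a health-status ping."""
    global last_heartbeat
    last_heartbeat = time.monotonic()

def leader_suspected_down() -> bool:
    # Too small a TIMEOUT triggers failovers during ordinary load spikes;
    # too large a TIMEOUT delays recovery. Choosing it is the hard part.
    return time.monotonic() - last_heartbeat > TIMEOUT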

60
Q

Replication

A

Why replication?
Replication is done to achieve one or more of the following goals:

  1. To avoid a single point of failure and increase availability when machines go down.
  2. To better serve global users by keeping copies in distinct geographic locations, so users are served from copies that are close by.
  3. To increase throughput. With more machines, more requests can be served.

Some key terms to understand for replication
* Replica: Copy of data

  • Leader: Machine that handles write requests to the data store.
  • Followers: Machines that are replicas of the leader node, and cater to read requests.
61
Q

Synchronous vs asynchronous replication

A

Synchronous replication means the leader marks a client’s write request as successful only after the write has been acknowledged by the replicas; that is, the leader waits for an acknowledgment from all of the followers.

When the leader doesn’t wait for the acknowledgment from the followers before marking the client’s write requests as successful, it is called asynchronous replication.

Synchronous replication ensures that the information is replicated before moving on. This can be nice when it is vital that nothing is missed. The downside is that it slows down the stream of information being passed.

Sync replication ensures guaranteed delivery to all the followers, while async replication is less time-consuming for the client.

Sometimes a semi-synchronous approach is taken in the database, where only one follower is updated synchronously and the rest asynchronously. If the synchronous follower crashes, one of the asynchronous followers is promoted to take its place. This ensures that an up-to-date copy exists on at least two nodes, while the client is also not kept waiting for long.

62
Q

Most common types of Replication Systems

A

Single leader
* a single machine acts as a leader, and all write requests (or updates to the data store) go through that machine. All the other machines are used to cater to the read requests. This was previously known as “master-slave” replication, but it’s currently known as “primary-standby” or “active-passive” replication.
* The leader also needs to pass down the information about all the writes to the follower nodes to keep them up to date. In case the leader goes down, one of the follower nodes (mostly with the most up-to-date data) is promoted to be the leader. This is called failover.

Multi leader
* this means that more than one machine can take the write requests. This makes the system more reliable in case a leader goes down. This also means that every machine (including leaders) needs to catch up with the writes that happen over other machines.

Conflict resolution for concurrent writes:
1. Keeping the update with the largest client timestamp (last write wins).
2. Sticky routing—writes from same client/index go to the same leader.
3. Keeping and returning all the updates.

Leaderless replication
* all machines can cater to write and read requests. In some cases, the client directly writes to all the machines, and requests are read from all the machines based on quorum. Quorum refers to the minimum number of acknowledgements (for writes) and consistent data values (for reads) for the action to be valid. In other cases, the client request reaches the coordinator that broadcasts the request to all the nodes.
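A quick sketch of the quorum condition, with n replicas, w write acknowledgements, and r read responses:

def quorum_ok(n: int, w: int, r: int) -> bool:
    """If w + r > n, every read set overlaps every write set, so a read
    always sees at least one replica holding the latest written value."""
    return w + r > n

# Example: n=3, w=2, r=2 -> True; any 2 readers overlap any 2 writers.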

63
Q

What is consistent hashing?

A

Consistent hashing is a way to effectively distribute the keys in any distributed storage system—cache, database, or otherwise—to a large number of nodes or servers while allowing us to add or remove nodes without incurring a large performance hit.
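A minimal hash-ring sketch in Python; real implementations also place several “virtual nodes” per server to even out the key distribution:

import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Keys map to the first node clockwise on the ring; adding or
    removing a node only remaps the keys in its neighborhood."""

    def __init__(self, nodes=()):
        self.ring = sorted((_hash(n), n) for n in nodes)

    def add(self, node: str) -> None:
        bisect.insort(self.ring, (_hash(node), node))

    def remove(self, node: str) -> None:
        self.ring.remove((_hash(node), node))

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self.ring, (_hash(key), "")) % len(self.ring)
        return self.ring[i][1]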

64
Q

Functional Requirements

A
  1. Identify the main objects and their relations.
  2. What information do these objects hold? Are they mutable?
  3. Think about access patterns. “Given object X, return all related objects Y.” Consider the cross product of all related objects.
  4. List all the requirements you’ve identified and validate them with your interviewer.

You should start with the functional requirements first—that is, the core product features and use cases that the system needs to support.

  1. Identify the main business objects and their relations
    * Start by identifying the main business objects and their relations. For example, in the case of Twitter there are two main objects of interest: (1) Accounts and (2) Tweets.
    * Now think about clarifying the relation between these objects.
    • An account can follow other accounts (Account x Account)
    • An account can publish a tweet (Account x Tweet)
    • A tweet can reference another tweet, i.e., be a “retweet”. (Tweet x Tweet)
  2. Think about the possible access patterns for these objects
    * Access patterns are probably the single most influential part of design because they determine how data will be stored.
    * Let’s think about the cross product of our objects again. This time we want to identify how data will be retrieved from the system.
    * The general shape of an access pattern requirement is:
    • Given [object A], get all related [object B]
  • So, applying this idea to our Twitter example, we might end up with the following access patterns:
    • Given an account:
      • Get all of its followers. (Account → Account)
      • Get all the other accounts they follow. (Account → Account)
      • Get all of its tweets. (Account → Tweet)
      • Get a curated feed of tweets for accounts they follow. (Account → Tweet)
    • Given a tweet:
      • Get all accounts that liked it. (Tweet → Account)
      • Get all accounts that retweeted it. (Tweet → Account)

For these access patterns, you should also consider ranking. Are there any access patterns that require ranking the object? In this example, “creating a curated feed of tweets” will require further clarification. Strive for simplicity first. Can you return them sorted by chronological time? Identify these access patterns of interest, like the curated feed, and get a feel for what your interviewer is looking for: do they want you to suggest an algorithm for a feed?

  3. Consider mutability
    * Finally, as we do throughout this guide, you should always consider mutability. Can the objects the system holds be mutated? Or can they be assumed to be immutable?

For example: Can tweets be edited after they’re published?

Another flavor of mutability is deletion. Can these business objects be deleted? What would the consequences be?

For example: Can tweets be deleted? Can accounts be deleted? What happens to tweets when an account is deleted?

It might sound like a small detail at first, but mutability can limit our ability to use caching in our design (more on this in step 3).

65
Q

Non-Functional Requirements

A

Once functional requirements have been laid out, you should move onto non-functional requirements (NFRs). These are quality attributes that specify how the system should perform a certain function.

The most common non-functional requirements you should consider in a system design interview are:

  1. Performance: Which access patterns, if any, require good performance?
  2. Availability: What’s the cost of downtime for this system?
  3. Security: Is there any workflow that requires special security considerations (e.g., code execution)?
  4. Consistency

Good candidates view non-functional requirements mainly as opportunities to relax one specific requirement, for example: “We don’t need to focus on consistency as much in this case, because it’s okay in this TikTok scenario if some users get access to certain videos later than the rest of our users.”

If NFRs are over-specified, the solution may be too expensive to be practical; if they are under-specified, the system will not be suitable for its intended purpose. Use your common sense, and ask the right questions to land on a set of non-functional requirements that make sense for the system you are designing.

66
Q

NFR Performance

A

Performance is pretty straightforward. It’s the system’s ability to respond quickly to user requests. While speed is always welcome, it might not be the right thing to optimize for in every system. Better performance may come at the cost of consistency or just an overall more complex solution.

It makes the most sense when we have synchronous user-facing workflows. That is, the user is expecting an immediate response from the system. In addition, we want to optimize for the synchronous workflows that are accessed the most frequently.

67
Q

NFR Availability

A

Availability refers to how much downtime the service can tolerate. Just like with performance, we might not always want to optimize for availability. A good question to guide this decision is: What’s the cost of downtime? This is as easy as it sounds. If taking downtime will result in financial losses or correctness issues, we might want to put some thought into making the system highly available.

Think, for example, about a banking system. One of the most important mandates of such a system is consistency: operations need to be transactional. In this case, it might be acceptable for our system to be unavailable for small periods of time, as long as it stays consistent.

68
Q

NFR Security

A

We want to learn if there’s some workflow that might require a special design to account for security. For example, imagine you were designing LeetCode, an online judge for coding questions. One security constraint that would come to mind is that user-submitted code should be run in isolation. User submissions should run in some sort of sandbox where they get limited resources and are guaranteed not to affect or see other submissions.

Whenever there is user-generated code execution involved (aka low trust code), running it in isolation should be a non-functional security requirement.

69
Q

So what is “design”?

A

Design simply means two components:

  1. Data storage. We already know from previous steps “what” we are storing. Now the question is: where are we storing it?
  2. Microservices. How do we write our data to storage, and how do we retrieve it for the API? Think of these as the middlemen between storage and the API.

We know the what (Functional / Non-functional, API, Scale, data types), so now we focus on the where and the how. We will start with designing the data storage layer first and then think about the microservices that access this data.

70
Q

Data Types, Scale, and Access patterns

A

Once you know your requirements, it’s time to get specific.

  1. Data Types: Start by identifying the main business objects that you need to store.
  2. API: How are these going to be accessed?
  3. Scale: Is the system read-heavy or write-heavy?
71
Q

Data Storage

A

A blob (Binary Large Object) is basically just binary data. We store and retrieve these as a single item. For example, ZIP files or other binaries.

Say the generic name of the component, not the brand name. Unless you are very familiar with a specific brand (like S3), don’t say the specific brand. Instead, say “some kind of blob storage.” Because if you say, “we should use S3 here,” the next question out of your interviewer’s mouth will be, “why not Azure blob instead of S3?”

Database
There are a few considerations for this step:

  1. Relational vs. Non-Relational
  2. Entities to store
72
Q

Relational Vs Non-Relational Database Rule of Thumb

A

You need to store some data.
Is it important for your data to have structured relationships?
* Yes: SQL
* No:
  * Do you need strong consistency, with strong ACID guarantees?
    * Yes: SQL
    * No: NoSQL

If you picked relational:
”Although I think a relational database better fits this requirement, we should also be mindful of the downsides. For example, our database will have a more rigid structure and schema, so it might be harder for us to incorporate changes. We’ll also need to scale up vertically, meaning that as we get more load we’ll upscale existing servers rather than dividing the work over more servers.”

If you picked non-relational:
”Although I think a non-relational database better fits this requirement, we should also be mindful of the downsides. We’ll be able to scale horizontally at the cost of not having ACID guarantees. I’m assuming there will be no need for strong consistency in the future.”

Example: Design a banking system.
  This is a textbook example of strong consistency. Transactions in a banking system need ACID guarantees. As such, we are probably better off picking a relational database that can give us this strong consistency.

Be mindful of any “get all” access patterns. These usually need to be guarded by paging. You don’t want a single endpoint returning the entire tweet history of an account. Depending on the account, that might be a very expensive query, and degrade user experience. Usually these will be behind logic that pages the response. That’s why Twitter will load pages of tweets, even if it seems like an “infinite scroll” in the UI.

73
Q

How to Get Yourself Unstuck

A
  1. To simplify the problem, think about the data in the problem as immutable. You can choose to add mutability back in later after you have an initial working design.
  2. Not sure if you got all the requirements? Think you’re missing something important, but you don’t know what? Turn your requirements gathering into a conversation and get the interviewer involved. Ask your interviewer: “Are there any important requirements you have in mind that I’ve overlooked?” This is totally allowed!
  3. Some interviewers hate the capacity-estimation step and really don’t want to see you stumble through 5th-grade math calculations for 15 minutes. Similar to Tip #2, ask your interviewer if they’d like to see some calculations before jumping in and starting them; you might be able to skip these entirely if the interviewer doesn’t care! In quite a few system design interviews, you’d be fine as long as you mention that the system you plan to present will be durable, resilient, and scalable.
  4. Sometimes it’s difficult to know what to calculate during this part of the interview. If you’ve already confirmed that the interviewer wants to see calculations as mentioned in Tip #3, then follow these rough guides to get the basic estimates for any system.
  • Storage Estimation:
    • Storage = daily data used by 1 user * DAU count * length of time to store data
  • Bandwidth Estimation:
    • Bandwidth per second = (daily data used by 1 user * DAU count ) / total seconds in a day

Also, there are roughly 100K seconds in a day, which is five orders of magnitude. If your API gateway expects to see a billion requests on a busy day, that’s approximately 10K requests per second, as 9 zeroes minus 5 zeroes is 4 zeroes. The true figure is ~15% larger, as 100K / (60 * 60 * 24) is around 1.15.
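Putting the two formulas and the 100K-seconds shortcut together, here is a quick worked example; the user count, per-user data, and retention period are made-up numbers purely for illustration:

dau = 10_000_000          # hypothetical daily active users
data_per_user = 50_000    # hypothetical bytes written per user per day (~50 KB)
retention_days = 5 * 365  # hypothetical: keep data for 5 years

storage = data_per_user * dau * retention_days  # ~0.9 PB total
bandwidth = data_per_user * dau / 100_000       # ~100K seconds/day -> ~5 MB/s

print(f"storage ~{storage / 1e15:.1f} PB, write bandwidth ~{bandwidth / 1e6:.1f} MB/s")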

  5. Consistency is important because the order in which distributed messages are delivered matters. If you send your friend a diatribe about how Vim is superior to Emacs (why are we even still arguing this? 😩), and your long-winded chat messages are received out of order, then your friend will be totally lost and won’t comprehend your persuasive brilliance.
  6. Ask yourself, “Would it be fine if the data in my system was occasionally wrong for a split second or so?” If the answer is yes, then you probably want eventual consistency. If the answer is no, then you’re looking for a strong type of consistency called linearizability.
  7. Don’t confuse consistency terms. Besides being common terms talked about in system design, what do all three of these terms have in common?
    * ACID
    * CAP Theorem
    * BASE
    All three terms talk about consistency! The ‘C’ in both ACID and CAP stands for Consistency, and the ‘E’ in BASE stands for Eventual Consistency.

What makes matters worse is that the term means something different in each context. Be sure to separate these ideas in your head before you talk about consistency in an interview!

  • ACID consistency discusses transaction guarantees within the context of database constraints.
  • BASE eventual consistency discusses guarantees around objects being updated and what will be returned by all nodes when the same info is queried after an update.
  • CAP consistency is about the tradeoffs in distributed systems between partitioned networks, nodes always being up to date with the latest values, and the system always being available.
  8. It’s common to try to detail every part of the system’s design like you see people do on YouTube. Realistically, those videos are scripted, and the drawings are fast-forwarded. In a real interview, you won’t have time to detail every part of the system, and that’s OK! It’s expected that you’ll abstract away pieces that aren’t particularly relevant. It’s good practice to call out what you’re abstracting, but focus on the general data flow of the system.
  9. As usual, we begin from requirements. In fact, it’s best to postulate the problem right in the form of requirements! This way we also develop a habit of approaching system design problems from the standpoint of how to solve them, since asking the right questions is at least half of solving them.

Don’t worry if you don’t know in detail what the Ticketmaster problem is about. In fact, for any problem, if you don’t fully understand its statement, jump straight to functional requirements, and clarify them—with your interviewer or with your peers—until they are crystal clear!

  10. The best way to reason about the value of consistency is to think of what could possibly go wrong.