System Design & Optimization Flashcards
What is a cookie?
Imagine Bob goes to a coffee shop for the first time, orders a medium-sized espresso with two sugars. The cashier records Bob’s identity and preferences on a card and hands it over to Bob with a cup of coffee.
The next time Bob goes to the cafe, he shows the cashier the preference card. The cashier immediately knows who the customer is and what kind of coffee he likes.
A cookie acts as the preference card. When we log in to a website, the server issues a cookie to us with a small amount of data. The cookie is stored on the client side, so the next time we send a request to the server with the cookie, the server knows our identity and preferences immediately without looking into the database.
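Below is a minimal sketch of this flow using Flask; the route names and the session_id value are illustrative, not part of the original analogy:

```python
from flask import Flask, request, make_response

app = Flask(__name__)

@app.route("/login")
def login():
    # The server issues a cookie after identifying the user,
    # like the cashier handing Bob his preference card.
    resp = make_response("Logged in")
    resp.set_cookie("session_id", "abc123", httponly=True, secure=True)
    return resp

@app.route("/order")
def order():
    # On later requests the browser sends the cookie back automatically,
    # so the server recognizes the user right away.
    session_id = request.cookies.get("session_id")
    return f"Welcome back, session {session_id}"
```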
API Vs SDK
API (Application Programming Interface) and SDK (Software Development Kit) are essential tools in the software development world, but they serve distinct purposes:
API: An API is a set of rules and protocols that allows different software applications and services to communicate with each other.
- It defines how software components should interact.
- Facilitates data exchange and functionality access between software components.
- Typically consists of endpoints, requests, and responses.
SDK: An SDK is a comprehensive package of tools, libraries, sample code, and documentation that assists developers in building applications for a particular platform, framework, or hardware.
- Offers higher-level abstractions, simplifying development for a specific platform.
- Tailored to specific platforms or frameworks, ensuring compatibility and optimal performance on that platform.
- Offers access to advanced features and capabilities specific to the platform, which might otherwise be challenging to implement from scratch.
The choice between APIs and SDKs depends on the development goals and requirements of the project.
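As a rough sketch of the difference: calling an API directly means constructing HTTP requests yourself, while an SDK wraps the same calls behind a library. The api.example.com endpoint and the PaymentsClient class below are hypothetical:

```python
import requests

# Calling the API directly: we build the HTTP request ourselves.
# (api.example.com and the /v1/payments endpoint are hypothetical.)
resp = requests.get(
    "https://api.example.com/v1/payments/123",
    headers={"Authorization": "Bearer <token>"},
)
payment = resp.json()

# Using an SDK: a vendor library wraps the same API behind a
# higher-level abstraction. PaymentsClient is a hypothetical class.
# client = PaymentsClient(api_key="<token>")
# payment = client.payments.get("123")
```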
What is GraphQL? Is it a replacement for the REST API?
GraphQL is a query language for APIs and a runtime for executing those queries by using a type system you define for your data. It was developed internally by Meta in 2012 before being publicly released in 2015.
Unlike the more traditional REST API, GraphQL allows clients to request exactly the data they need, making it possible to fetch data from multiple sources with a single query. This efficiency in data retrieval can lead to improved performance for web and mobile applications. A GraphQL server sits between the client and the backend services; it can aggregate multiple REST requests into one query and organizes the resources as a graph.
GraphQL supports queries, mutations (applying data modifications to resources), and subscriptions (receiving real-time notifications when data changes).
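As a minimal sketch, a GraphQL query is sent to a single endpoint as a plain HTTP POST; the endpoint URL and the user/orders schema below are hypothetical:

```python
import requests

# One query fetches a user and their orders in a single round trip.
query = """
query {
  user(id: "42") {
    name
    orders { id total }
  }
}
"""

# The /graphql endpoint and schema are illustrative assumptions.
resp = requests.post("https://api.example.com/graphql", json={"query": query})
print(resp.json()["data"]["user"])
```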
Benefits of GraphQL:
1. GraphQL is more efficient in data fetching.
2. GraphQL returns more accurate results.
3. GraphQL has a strong type system to manage the structure of entities, reducing errors.
4. GraphQL is suitable for managing complex microservices.
Disadvantages of GraphQL
- Increased complexity.
- Over-fetching by design: resolvers may pull more data from backend services than a given query ultimately needs.
- Caching complexity.
Different monitoring infrastructure in cloud services
Let’s delve into the essential aspects of monitoring in cloud services:
- Data Collection: Gather information from diverse sources to enhance decision-making.
- Data Storage: Safely store and manage data for future analysis and reference.
- Data Analysis: Extract valuable insights from data to drive informed actions.
- Alerting: Receive real-time notifications about critical events or anomalies.
- Visualization: Present data in a visually comprehensible format for better understanding.
- Reporting and Compliance: Generate reports and ensure adherence to regulatory standards.
- Automation: Streamline processes and tasks through automated workflows.
- Integration: Seamlessly connect and exchange data between different systems or tools.
- Feedback Loops: Continuously refine strategies based on feedback and performance analysis.
System Design Blueprint: The Ultimate Guide
We’ve created a template to tackle various system design problems in interviews.
Hope this checklist is useful to guide your discussions during the interview process.
This briefly touches on the following discussion points:
- Load Balancing
- API Gateway
- Communication Protocols
- Content Delivery Network (CDN)
- Database
- Cache
- Message Queue
- Unique ID Generation
- Scalability
- Availability
- Performance
- Security
- Fault Tolerance and Resilience
- And more
REST API Vs. GraphQL
When it comes to API design, REST and GraphQL each have their own strengths and weaknesses.
REST
- Uses standard HTTP methods like GET, POST, PUT, DELETE for CRUD operations.
- Works well when you need simple, uniform interfaces between separate services/applications.
- Caching strategies are straightforward to implement.
- The downside is it may require multiple roundtrips to assemble related data from separate endpoints.
GraphQL
- Provides a single endpoint for clients to query for precisely the data they need.
- Clients specify the exact fields required in nested queries, and the server returns optimized payloads containing just those fields.
- Supports Mutations for modifying data and Subscriptions for real-time notifications.
- Great for aggregating data from multiple sources and works well with rapidly evolving frontend requirements.
- However, it shifts complexity to the client side and can allow abusive queries if not properly safeguarded.
- Caching strategies can be more complicated than with REST.
The best choice between REST and GraphQL depends on the specific requirements of the application and development team. GraphQL is a good fit for complex or frequently changing frontend needs, while REST suits applications where simple and consistent contracts are preferred.
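To make the roundtrip difference concrete, here is a hedged sketch; the example.com endpoints and fields are hypothetical:

```python
import requests

# REST: assembling related data can take multiple round trips
# to separate endpoints.
user = requests.get("https://example.com/api/users/42").json()
orders = requests.get("https://example.com/api/users/42/orders").json()

# GraphQL: a single request to one endpoint, naming exactly the
# fields the client needs.
query = '{ user(id: "42") { name orders { id total } } }'
data = requests.post("https://example.com/graphql", json={"query": query}).json()
```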
6 Key Use Cases for Load Balancers
● Traffic Distribution - Load balancers evenly distribute incoming traffic among multiple servers, preventing any single server from becoming overwhelmed. This helps maintain optimal performance, scalability, and reliability of applications or websites.
● High Availability - Load balancers enhance system availability by rerouting traffic away from failed or unhealthy servers to healthy ones. This ensures uninterrupted service even if certain servers experience issues.
● SSL Termination - Load balancers can offload SSL/TLS encryption and decryption tasks from backend servers, reducing their workload and improving overall performance.
● Session Persistence - For applications that require maintaining a user’s session on a specific server, load balancers can ensure that subsequent requests from a user are sent to the same server.
● Scalability - Load balancers facilitate horizontal scaling by effectively managing increased traffic. Additional servers can be easily added to the pool, and the load balancer will distribute traffic across all servers.
● Health Monitoring - Load balancers continuously monitor the health and performance of servers, removing failed or unhealthy servers from the pool to maintain optimal performance.
Top 6 Firewall Use Cases
● Port-Based Rules - Firewall rules can be set to allow or block traffic based on specific ports. For example, allowing only traffic on ports 80 (HTTP) and 443 (HTTPS) for web browsing.
● IP Address Filtering - Rules can be configured to allow or deny traffic based on source or destination IP addresses. This can include whitelisting trusted IP addresses or blacklisting known malicious ones.
● Protocol-Based Rules - Firewalls can be configured to allow or block traffic based on specific network protocols such as TCP, UDP, ICMP, etc. For instance, allowing only TCP traffic on port 22 (SSH).
● Time-Based Rules - Firewalls can be configured to enforce rules based on specific times or schedules. This can be useful for setting different access rules during business hours versus after-hours.
● Stateful Inspection - Stateful firewalls monitor the state of active connections and allow traffic only if it matches an established connection, preventing unauthorized access from the outside.
● Application-Based Rules - Some firewalls offer application-level control by allowing or blocking traffic based on specific applications or services. For instance, allowing or restricting access to certain applications like Skype, BitTorrent, etc.
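As a toy illustration of how port-, IP-, and protocol-based rules combine — real firewalls enforce this in the kernel or network fabric, not in application code, and the rules and addresses below are made up:

```python
# First matching rule wins; anything unmatched is denied by default.
RULES = [
    {"action": "deny",  "src_ip": "203.0.113.7"},          # blacklisted IP
    {"action": "allow", "protocol": "tcp", "port": 80},    # HTTP
    {"action": "allow", "protocol": "tcp", "port": 443},   # HTTPS
]

def evaluate(packet: dict) -> str:
    for rule in RULES:
        if all(packet.get(k) == v for k, v in rule.items() if k != "action"):
            return rule["action"]
    return "deny"

print(evaluate({"protocol": "tcp", "port": 443, "src_ip": "198.51.100.9"}))  # allow
print(evaluate({"protocol": "udp", "port": 53,  "src_ip": "198.51.100.9"}))  # deny
```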
Types of memory. Which ones do you know?
Memory types vary by speed, size, and function, creating a multi-layered architecture that balances cost with the need for rapid data access.
By grasping the roles and capabilities of each memory type, developers and system architects can design systems that effectively leverage the strengths of each storage layer, leading to improved overall system performance and user experience.
Some of the common Memory types are:
- Registers: Tiny, ultra-fast storage within the CPU for immediate data access.
- Caches: Small, quick memory located close to the CPU to speed up data retrieval.
- Main Memory (RAM): Larger, primary storage for currently executing programs and data.
- Solid-State Drives (SSDs): Fast, reliable storage with no moving parts, used for persistent data.
- Hard Disk Drives (HDDs): Mechanical drives with large capacities for long-term storage.
- Remote Secondary Storage: Offsite storage for data backup and archiving, accessible over a network.
Top 6 Load Balancing Algorithms
● Static Algorithms
- Round robin - The client requests are sent to different service instances in sequential order. The services are usually required to be stateless.
- Sticky round-robin - This is an improvement of the round-robin algorithm. If Alice’s first request goes to service A, the following requests go to service A as well.
- Weighted round-robin - The admin can specify the weight for each service. The ones with a higher weight handle more requests than others.
- Hash - This algorithm applies a hash function on the incoming requests’ IP or URL. The requests are routed to relevant instances based on the hash function result.
● Dynamic Algorithms
- Least connections - A new request is sent to the service instance with the least concurrent connections.
- Least response time - A new request is sent to the service instance with the fastest response time.
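A minimal sketch of one static and one dynamic algorithm; the instance names and connection counts are illustrative:

```python
import itertools

# Round robin: requests cycle through the instances in order.
instances = ["service-a", "service-b", "service-c"]
rr = itertools.cycle(instances)

def round_robin() -> str:
    return next(rr)

# Least connections: pick the instance with the fewest active connections.
active_connections = {"service-a": 12, "service-b": 3, "service-c": 7}

def least_connections() -> str:
    return min(active_connections, key=active_connections.get)

print(round_robin())        # service-a
print(least_connections())  # service-b
```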
How does Git work?
To begin with, it’s essential to identify where our code is stored. The common assumption is that there are only two locations - one on a remote server like GitHub and the other on our local machine.
However, this isn’t entirely accurate. Git maintains three local storage areas on our machine, which means that our code can be found in four places:
- Working directory: where we edit files
- Staging area: a temporary location where files are kept for the next commit
- Local repository: contains the code that has been committed
- Remote repository: the remote server that stores the code
Most Git commands primarily move files between these four locations. For example, git add moves changes from the working directory to the staging area, git commit moves them from the staging area into the local repository, and git push uploads local commits to the remote repository.
HTTP Cookies Explained
HTTP, the language of the web, is naturally “stateless.” But hey, we all want that seamless, continuous browsing experience, right? Enter the unsung heroes - Cookies!
So, here’s the scoop in this cookie flyer:
- HTTP is like a goldfish with no memory - it forgets you instantly! But cookies swoop in to the rescue, adding that “session secret sauce” to your web interactions.
- Cookies? Think of them as little notes you pass to the web server, saying, “Remember me, please!” And yes, they’re stored in your browser, like cherished mementos.
- Browsers are like cookie bouncers, making sure your cookies don’t crash the party at the wrong website.
- Finally, meet the cookie celebrities - SameSite, Name, Value, Secure, Domain, and HttpOnly. They’re the cool kids setting the rules in the cookie jar!
A cheat sheet for system designs - 15 core concepts when we design systems.
● Requirement gathering
● System architecture
● Data design
● Domain design
● Scalability
● Reliability
● Availability
● Performance
● Security
● Maintainability
● Testing
● User experience design
● Cost estimation
● Documentation
● Migration plan
Cloud Disaster Recovery Strategies
An effective Disaster Recovery (DR) plan is not just a precaution; it’s a necessity. The key to any robust DR strategy lies in understanding and setting two pivotal benchmarks:
Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- Recovery Time Objective (RTO) refers to the maximum acceptable length of time that your application or network can be offline after a disaster.
- Recovery Point Objective (RPO), on the other hand, indicates the maximum acceptable amount of data loss measured in time. For example, an RPO of 15 minutes means the system must never lose more than the last 15 minutes of data, so backups or replication must run at least that often.
Let’s explore four widely adopted DR strategies:
1. Backup and Restore Strategy:
This method involves regular backups of data and systems to facilitate post-disaster recovery.
- Typical RTO: From several hours to a few days.
- Typical RPO: From a few hours up to the time of the last successful backup.
2. Pilot Light Approach:
Maintains crucial components in a ready-to-activate mode, enabling rapid scaling in response to a disaster.
- Typical RTO: From a few minutes to several hours.
- Typical RPO: Depends on how often data is synchronized.
3. Warm Standby Solution:
Establishes a semi-active environment with current data to reduce recovery time.
- Typical RTO: Generally within a few minutes to hours.
- Typical RPO: Up to the last few minutes or hours.
4. Hot Site / Multi-Site Configuration:
Ensures a fully operational, duplicate environment that runs parallel to the primary system.
- Typical RTO: Almost immediate, often just a few minutes.
- Typical RPO: Extremely minimal, usually only a few seconds of data.
Polling Vs Webhooks
Polling
Polling involves repeatedly checking the external service or endpoint at fixed intervals to retrieve updated information.
It’s like constantly asking, “Do you have something new for me?” even when there might not be any update. This approach is resource-intensive and inefficient.
Also, you get updates only when you ask for them, so you may miss real-time information.
However, developers have more control over when and how the data is fetched.
Webhooks
Webhooks are like having a built-in notification system.
You don’t continuously ask for information.
Instead, you create an endpoint in your application server and provide it as a callback to the external service (such as a payment processor or a shipping vendor).
Every time something interesting happens, the external service calls the endpoint and provides the information.
This makes webhooks ideal for dealing with real-time updates because data is pushed to your application as soon as it’s available.
So, when to use Polling or Webhook?
Polling is a solid option when there is some infrastructural limitation that prevents the use of webhooks. Also, with webhooks there is a risk of missed notifications due to network issues, hence proper retry mechanisms are needed.
Webhooks are recommended for applications that need instant data delivery. Also, webhooks are efficient in terms of resource utilization especially in high throughput environments.
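Here is a hedged sketch of both styles using requests and Flask; the vendor URL, the 30-second interval, and the handle helper are all illustrative assumptions:

```python
import time
import requests
from flask import Flask, request

def handle(event):
    # Stand-in for real business logic.
    print("update received:", event)

# Polling: repeatedly ask the external service for updates.
def poll_for_updates():
    while True:
        status = requests.get("https://vendor.example.com/orders/42/status").json()
        if status.get("updated"):
            handle(status)
        time.sleep(30)  # this cost is paid even when nothing has changed

# Webhook: expose a callback endpoint and let the service push to us.
app = Flask(__name__)

@app.route("/webhooks/orders", methods=["POST"])
def order_webhook():
    handle(request.get_json())  # data arrives as soon as it's available
    return "", 204
```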
Explaining 9 types of API testing
● Smoke Testing - This is done after API development is complete. Simply validate if the APIs are working and nothing breaks.
● Functional Testing - This creates a test plan based on the functional requirements and compares the results with the expected results.
● Integration Testing - This test combines several API calls to perform end-to-end tests. The intra-service communications and data transmissions are tested.
● Regression Testing - This test ensures that bug fixes or new features don’t break the existing behaviors of APIs.
● Load Testing - This tests applications’ performance by simulating different loads. Then we can calculate the capacity of the application.
● Stress Testing - We deliberately create high loads to the APIs and test if the APIs are able to function normally.
● Security Testing - This tests the APIs against all possible external threats.
● UI Testing - This tests the UI interactions with the APIs to make sure the data can be displayed properly.
● Fuzz Testing - This injects invalid or unexpected input data into the API and tries to crash the API. In this way, it identifies the API vulnerabilities.
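As a sketch of what a few of these look like in practice, here are pytest-style tests against a hypothetical service; the BASE_URL and endpoints are assumptions:

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical service under test

def test_smoke_health_endpoint():
    # Smoke test: the API responds and nothing is obviously broken.
    assert requests.get(f"{BASE_URL}/health").status_code == 200

def test_functional_create_user():
    # Functional test: compare the actual result with the expected one.
    resp = requests.post(f"{BASE_URL}/users", json={"name": "Alice"})
    assert resp.status_code == 201
    assert resp.json()["name"] == "Alice"

def test_fuzz_rejects_malformed_input():
    # Fuzz test: unexpected input should be rejected, not crash the API.
    resp = requests.post(f"{BASE_URL}/users", json={"name": None, "junk": "\x00"})
    assert resp.status_code in (400, 422)
```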
Git Merge vs. Rebase vs. Squash Commit!
What are the differences?
When we 𝐦𝐞𝐫𝐠𝐞 𝐜𝐡𝐚𝐧𝐠𝐞𝐬 from one Git branch to another, we can use ‘git merge’ or ‘git rebase’. The diagram below shows how the two commands work.
𝐆𝐢𝐭 𝐌𝐞𝐫𝐠𝐞
This creates a new commit G’ in the main branch. G’ ties the histories of both main and feature branches.
Git merge is 𝐧𝐨𝐧-𝐝𝐞𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐯𝐞. Neither the main nor the feature branch is changed.
𝐆𝐢𝐭 𝐑𝐞𝐛𝐚𝐬𝐞
Git rebase moves the feature branch histories to the head of the main branch. It creates new commits E’, F’, and G’ for each commit in the feature branch.
The benefit of rebase is that it has 𝐥𝐢𝐧𝐞𝐚𝐫 𝐜𝐨𝐦𝐦𝐢𝐭 𝐡𝐢𝐬𝐭𝐨𝐫𝐲.
Rebase can be dangerous if “the golden rule of git rebase” is not followed.
𝐓𝐡𝐞 𝐆𝐨𝐥𝐝𝐞𝐧 𝐑𝐮𝐥𝐞 𝐨𝐟 𝐆𝐢𝐭 𝐑𝐞𝐛𝐚𝐬𝐞
Never use it on public branches!
𝐒𝐪𝐮𝐚𝐬𝐡 𝐂𝐨𝐦𝐦𝐢𝐭
A squash merge compresses all the commits on the feature branch into a single commit on the main branch, keeping the main branch history compact while still capturing the feature as one unit.
How are notifications pushed to our phones or PCs?
A messaging solution such as Firebase can be used to support the notification push.
The diagram below shows how Firebase Cloud Messaging (FCM) works.
FCM is a cross-platform messaging solution that can compose, send, queue, and route notifications reliably. It provides a unified API between message senders (app servers) and receivers (client apps). The app developer can use this solution to drive user retention.
Steps 1 - 2: When the client app starts for the first time, it sends credentials to FCM, including the Sender ID, API Key, and App ID. FCM generates a Registration Token for the client app instance (so the Registration Token is also called the Instance ID). This token must be included when sending notifications to the device.
Step 3: The client app sends the Registration Token to the app server. The app server caches the token for subsequent communications. Over time, the app server has too many tokens to maintain, so the recommended practice is to store the token with timestamps and to remove stale tokens from time to time.
Step 4: There are two ways to send messages. One is to compose messages directly in the console GUI (Step 4.1), and the other is to send them from the app server (Step 4.2). We can use the Firebase Admin SDK or the HTTP API for the latter.
Step 5: FCM receives the messages and queues them in storage if the devices are not online.
Step 6: FCM forwards the messages to platform-level transport. This transport layer handles platform-specific configurations.
Step 7: The messages are routed to the targeted devices. The notifications can be displayed according to the configurations sent from the app server [1].
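As a sketch of Step 4.2 using the Firebase Admin SDK for Python — the key path, registration token, and message content are placeholders:

```python
import firebase_admin
from firebase_admin import credentials, messaging

# Initialize the Admin SDK with a service-account key (placeholder path).
cred = credentials.Certificate("path/to/service-account.json")
firebase_admin.initialize_app(cred)

# The Registration Token the app server cached in Step 3 (placeholder).
registration_token = "<registration-token>"

message = messaging.Message(
    notification=messaging.Notification(
        title="Order shipped",
        body="Your package is on the way.",
    ),
    token=registration_token,
)

# FCM queues and routes the message to the target device (Steps 5-7).
print("Sent:", messaging.send(message))
```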
Over to you: We can also send messages to a “topic” (just like Kafka) in Step 4. When should the client app subscribe to the topic?
How do companies ship code to production?
Step 1: The process starts with a product owner creating user stories based on requirements.
Step 2: The dev team picks up the user stories from the backlog and puts them into a sprint for a two-week dev cycle.
Step 3: The developers commit source code into the Git code repository.
Step 4: A build is triggered in Jenkins. The source code must pass unit tests, the code coverage threshold, and quality gates in SonarQube.
Step 5: Once the build is successful, it is stored in Artifactory and then deployed into the dev environment.
Step 6: There might be multiple dev teams working on different features. The features need to be tested independently, so they are deployed to QA1 and QA2.
Step 7: The QA team picks up the new QA environments and performs QA testing, regression testing, and performance testing.
Step 8: Once the QA builds pass the QA team’s verification, they are deployed to the UAT environment.
Step 9: If the UAT testing is successful, the builds become release candidates and will be deployed to the production environment on schedule.
Step 10: The SRE (Site Reliability Engineering) team is responsible for production monitoring.
How does a VPN work?
A VPN, or Virtual Private Network, is a technology that creates a secure, encrypted connection over a less secure network, such as the public internet. The primary purpose of a VPN is to provide privacy and security to data and communications.
A VPN acts as a tunnel through which the encrypted data travels from one location to another. No external party can see the data in transit.
A VPN works in 4 steps:
● Step 1 - Establish a secure tunnel between our device and the VPN server.
● Step 2 - Encrypt the data transmitted.
● Step 3 - Mask our IP address, so it appears as if our internet activity is coming from the VPN server.
● Step 4 - Our internet traffic is routed through the VPN server.
Advantages of a VPN:
- Privacy
- Anonymity
- Security
- Encryption
- Masking the original IP address
Disadvantages of a VPN:
- VPN blocking
- Slower connections
- Required trust in the VPN provider
Encoding vs Encryption vs Tokenization
Encoding, encryption, and tokenization are three distinct processes that handle data in different ways for various purposes, including data transmission, security, and compliance.
In system designs, we need to select the right approach for handling sensitive information.
🔹 Encoding
Encoding converts data into a different format using a scheme that can be easily reversed.
Examples include Base64 encoding, which encodes binary data into ASCII characters, making it easier to transmit data over media that are designed to deal with textual data.
Encoding is not meant for securing data. The encoded data can be easily decoded using the same scheme without the need for a key.
🔹 Encryption
Encryption involves complex algorithms that use keys for transforming data. Encryption can be symmetric (using the same key for encryption and decryption) or asymmetric (using a public key for encryption and a private key for decryption).
Encryption is designed to protect data confidentiality by transforming readable data (plaintext) into an unreadable format (ciphertext) using an algorithm and a secret key. Only those with the correct key can decrypt and access the original data.
🔹 Tokenization
Tokenization is the process of substituting sensitive data with non-sensitive placeholders called tokens. The mapping between the original data and the token is stored securely in a token vault. These tokens can be used in various systems and processes without exposing the original data, reducing the risk of data breaches.
Tokenization is often used for protecting credit card information, personal identification numbers, and other sensitive data. It is highly secure, as the tokens do not contain any part of the original data and thus cannot be reverse-engineered to reveal it. It is particularly useful for compliance with regulations like PCI DSS.
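A short sketch contrasting the three; it assumes the third-party cryptography package, and the in-memory dict stands in for a real, securely stored token vault:

```python
import base64
import secrets
from cryptography.fernet import Fernet

# Encoding: reversible by anyone, no key involved.
encoded = base64.b64encode(b"card=4111111111111111")
print(base64.b64decode(encoded))  # trivially decoded with the same scheme

# Encryption: reversible only with the secret key (symmetric here).
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(b"card=4111111111111111")
plaintext = Fernet(key).decrypt(ciphertext)

# Tokenization: substitute a random token; the mapping lives in a vault.
vault = {}

def tokenize(sensitive: str) -> str:
    token = secrets.token_urlsafe(16)
    vault[token] = sensitive
    return token  # reveals nothing about the original data
```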
Where do we cache data?
There are 𝐦𝐮𝐥𝐭𝐢𝐩𝐥𝐞 𝐥𝐚𝐲𝐞𝐫𝐬 along the flow.
- Client apps: HTTP responses can be cached by the browser. The first time we request data over HTTP, it is returned with an expiry policy in the HTTP header; when we request the data again, the client app tries to retrieve it from the browser cache first.
- CDN: CDN caches static web resources. The clients can retrieve data from a nearby CDN node.
- Load Balancer: The load balancer can cache resources as well.
- Messaging infra: Message brokers store messages on disk first, and consumers retrieve them at their own pace. Depending on the retention policy, the data is cached in Kafka clusters for a period of time.
- Services: There are multiple layers of cache in a service. If the data is not in the CPU cache, the service will try to retrieve it from memory. Sometimes the service has a second-level cache that stores data on disk.
- Distributed Cache: A distributed cache like Redis holds key-value pairs for multiple services in memory. It provides much better read/write performance than the database.
- Full-text Search: We sometimes need full-text search engines like Elasticsearch for document search or log search. A copy of the data is indexed in the search engine as well.
- Database: Even in the database, we have different levels of caches:
  - WAL (Write-ahead Log): data is written to the WAL before the B-tree index is built
  - Bufferpool: a memory area allocated to cache query results
  - Materialized View: pre-computes query results and stores them in database tables for better query performance
  - Transaction log: records all the transactions and database updates
  - Replication Log: records the replication state in a database cluster
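The distributed-cache layer is often used with the cache-aside pattern; here is a minimal sketch assuming the redis-py client, a local Redis server, and a hypothetical query_database helper:

```python
import json
import redis  # assumes the redis-py package and a running Redis server

r = redis.Redis()

def query_database(user_id: str) -> dict:
    # Stand-in for a real database query.
    return {"id": user_id, "name": "Alice"}

def get_user(user_id: str) -> dict:
    # Cache-aside: check the distributed cache before the database.
    cached = r.get(f"user:{user_id}")
    if cached is not None:
        return json.loads(cached)
    user = query_database(user_id)
    r.setex(f"user:{user_id}", 300, json.dumps(user))  # 5-minute TTL
    return user
```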
Over to you: With the data cached at so many levels, how can we guarantee the 𝐬𝐞𝐧𝐬𝐢𝐭𝐢𝐯𝐞 𝐮𝐬𝐞𝐫 𝐝𝐚𝐭𝐚 is completely erased from the systems?
How does Docker work?
The diagram below shows the architecture of Docker and how it works when we run “docker build”, “docker pull” and “docker run”.
There are 3 components in Docker architecture:
🔹 Docker client
The docker client talks to the Docker daemon.
🔹 Docker host
The Docker daemon listens for Docker API requests and manages Docker objects such as images, containers, networks, and volumes.
🔹 Docker registry
A Docker registry stores Docker images. Docker Hub is a public registry that anyone can use.
Let’s take the “docker run” command as an example.
1. Docker pulls the image from the registry.
2. Docker creates a new container.
3. Docker allocates a read-write filesystem to the container.
4. Docker creates a network interface to connect the container to the default network.
5. Docker starts the container.
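The same flow can be driven programmatically; a minimal sketch assuming the Docker SDK for Python and a running Docker daemon:

```python
import docker  # assumes the docker package (Docker SDK for Python)

# The client talks to the Docker daemon over the Docker API.
client = docker.from_env()

# Equivalent of "docker pull": fetch an image from the registry (Docker Hub).
client.images.pull("alpine", tag="latest")

# Equivalent of "docker run": create and start a container from the image.
output = client.containers.run("alpine", "echo hello from a container")
print(output)  # b'hello from a container\n'
```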
What are the 5 components of SQL?
There are 5 components of the SQL language:
- DDL: data definition language, such as CREATE, ALTER, DROP
- DQL: data query language, such as SELECT
- DML: data manipulation language, such as INSERT, UPDATE, DELETE
- DCL: data control language, such as GRANT, REVOKE
- TCL: transaction control language, such as COMMIT, ROLLBACK, SAVEPOINT
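A minimal sketch exercising four of the five components with Python’s built-in sqlite3 module (DCL statements like GRANT require a server database such as PostgreSQL or MySQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")  # DDL
cur.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))         # DML
con.commit()                                                           # TCL
print(cur.execute("SELECT id, name FROM users").fetchall())            # DQL
```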