Additional System Design Flashcards
How can caches go wrong?
- Thundering herd problem - a large number of keys in the cache expire at the same time, so the query requests hit the database directly and overload it.
Mitigations - 1. Avoid setting the same expiry time for the keys by adding a random jitter to each TTL. 2. Allow only core business data to hit the database, and prevent non-core data from accessing the database until the cache is back up.
- Cache penetration - the requested key exists in neither the cache nor the database, so the app can never retrieve data from the database to update the cache. This puts a lot of pressure on both the cache and the database.
Solutions - 1. Cache a null value for non-existing keys to avoid hitting the database. 2. Use a Bloom filter to check key existence first; if the key doesn't exist, avoid hitting the database.
- Cache breakdown - similar to the thundering herd problem, but triggered by a single hot key expiring, after which a large number of requests hit the database. Since hot keys can take up 80% of the queries, we don't set an expiry time for them.
- Cache crash - the cache is down and all the requests go to the database.
Solutions - 1. Set up a circuit breaker so that, when the cache is down, the application services stop hitting both the cache and the database. 2. Set up a cluster for the cache to improve cache availability.
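A minimal sketch of the first two mitigations (TTL jitter against the thundering herd, and caching null values against cache penetration), using a plain Python dictionary in place of a real cache:

    import random, time

    BASE_TTL = 3600                       # seconds
    cache = {}                            # key -> (value, expires_at); stands in for Redis

    def ttl_with_jitter():
        # Spread expirations over an extra 0-300 s so keys don't all expire together.
        return BASE_TTL + random.randint(0, 300)

    def get_product(product_id, db):
        entry = cache.get(product_id)
        if entry and entry[1] > time.time():
            return entry[0]                               # cache hit (may be a cached None)
        row = db.get(product_id)                          # fall through to the database
        ttl = ttl_with_jitter() if row is not None else 60
        cache[product_id] = (row, time.time() + ttl)      # cache misses briefly as None too
        return row

A Bloom filter placed in front of db.get would avoid even the first database lookup for keys that are known not to exist.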
4 most popular use cases for UDP (User Datagram Protocol)
UDP is used in various software architectures for its simplicity, speed and low overhead compared to other protocols like TCP.
- Live video streaming - many VoIP & video conferencing apps leverage UDP due to its lower overhead & ability to tolerate packet loss. Real-time communication benefits from UDP’s reduced latency compared to TCP.
- DNS (Domain Name System) - DNS queries typically use UDP because it is fast and lightweight. Although DNS can also use TCP for large responses or zone transfers, most queries are handled via UDP.
- Market data multicast - in low latency trading, UDP is utilised for efficient market data delivery to multiple recipients simultaneously.
- IoT - UDP is often used in IoT devices for communications, sending small packets of data between devices.
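A minimal sketch of UDP's connectionless, fire-and-forget style using Python's standard socket module (the loopback address and port are illustrative):

    import socket

    # Receiver: bind to a port and read whatever datagrams happen to arrive.
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 9999))

    # Sender: no handshake and no delivery guarantee - just fire the datagram.
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(b"sensor-reading:23.5", ("127.0.0.1", 9999))

    data, addr = receiver.recvfrom(1024)
    print(data, addr)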
How does a typical push notification system work?
The architecture of a notification system that covers major notification channels:
- In app notifications
- Email notifications
- SMS & OTP notifications
- Social media pushes
Steps
1. The business services send notifications to the notification gateway. The gateway supports two modes: one receives a single notification at a time, and the other receives notifications in batches.
2. The notification gateway forwards the notifications to the distribution service, where the messages are validated, formatted, and scheduled based on the settings. The notification template repository allows users to pre-define the message format, and the channel preference repository allows users to pre-define the preferred delivery channels.
3. The notifications are then sent to the routers, normally message queues.
4. The channel services communicate with various internal and external delivery channels, including in-app notifications, email delivery, SMS delivery, and social media apps.
5. The delivery metrics are captured by the notification tracking and analytics service, where the operations team can view the analytical reports and improve user experience.
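A minimal sketch of the validate-format-route hand-off in steps 1-3, with in-memory dictionaries and queues standing in for the template repository, preference repository, and message queues:

    from collections import defaultdict
    import queue

    channel_queues = defaultdict(queue.Queue)            # stands in for per-channel message queues
    templates = {"otp": "Your one-time code is {code}"}  # hypothetical template repository
    preferences = {"user-42": ["sms", "in_app"]}         # hypothetical channel preference repository

    def distribute(user_id, template_id, payload):
        message = templates[template_id].format(**payload)      # validate and format
        for channel in preferences.get(user_id, ["in_app"]):    # route by user preference
            channel_queues[channel].put({"user": user_id, "body": message})

    distribute("user-42", "otp", {"code": "491823"})
    print(channel_queues["sms"].get())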
Have you heard of the 12-Factor App?
The “12 Factor App” offers a set of best practices for building modern software applications.
Following these 12 principles can help developers and teams in building reliable, scalable, and manageable applications.
Here’s a brief overview of each principle:
1. Codebase:
Have one place to keep all your code, and manage it using version control like Git.
2. Dependencies:
List all the things your app needs to work properly, and make sure they're easy to install.
3. Config:
Keep important settings like database credentials separate from your code, so you can change them without rewriting code (see the sketch after this list).
4. Backing Services:
Use other services (like databases or payment processors) as separate components that your app connects to.
5. Build, Release, Run:
Make a clear distinction between preparing your app, releasing it, and running it in production.
6. Processes:
Design your app so that each part doesn't rely on a specific computer or memory. It's like making LEGO blocks that fit together.
7. Port Binding:
Let your app be accessible through a network port, and make sure it doesn't store critical information on a single computer.
8. Concurrency:
Make your app able to handle more work by adding more copies of the same thing, like hiring more workers for a busy restaurant.
9. Disposability:
Your app should start quickly and shut down gracefully, like turning off a light switch instead of yanking out the power cord.
10. Dev/Prod Parity:
Ensure that what you use for developing your app is very similar to what you use in production, to avoid surprises.
11. Logs:
Keep a record of what happens in your app so you can understand and fix issues, like a diary for your software.
12. Admin Processes:
Run special tasks separately from your app, like doing maintenance work in a workshop instead of on the factory floor.
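As a concrete illustration of factor 3 (Config), settings are commonly read from environment variables rather than hard-coded; the variable names below are illustrative:

    import os

    # Deployment-specific settings come from the environment, not the codebase.
    DATABASE_URL = os.environ.get("DATABASE_URL", "postgres://localhost:5432/dev")
    SMTP_HOST = os.environ.get("SMTP_HOST", "localhost")

    print(DATABASE_URL)   # the same code runs unchanged in dev, staging, and production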
Visualizing a SQL query
SQL statements are executed by the database system in several steps, including:
- Parsing the SQL statement and checking its validity
- Transforming the SQL into an internal representation, such as relational algebra
- Optimizing the internal representation and creating an execution plan that utilizes index
information
- Executing the plan and returning the results
The clauses of a typical query appear in this written order: SELECT, FROM, JOIN ... ON, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT.
How does Redis architecture evolve?
Redis is a popular in-memory cache. How did it evolve to the architecture it is today?
🔹 2010 - Standalone Redis
When Redis 1.0 was released in 2010, the architecture was quite simple. It was typically used as a cache in front of the business application.
However, Redis stores data in memory. When Redis restarts, all the data is lost and the traffic hits the database directly.
🔹 2013 - Persistence
When Redis 2.8 was released in 2013, it addressed the previous restrictions. Redis introduced RDB in-memory snapshots to persist data. It also supports AOF (Append-Only-File), where each write command is written to an AOF file.
🔹 2013 - Replication
Redis 2.8 also added replication to increase availability. The primary instance handles real-time read and write requests, while replicas synchronize the primary's data.
🔹 2013 - Sentinel
Redis 2.8 introduced Sentinel to monitor Redis instances in real time. Sentinel is a system designed to help manage Redis instances. It performs four tasks: monitoring, notification, automatic failover, and acting as a configuration provider.
🔹 2015 - Cluster
In 2015, Redis 3.0 was released. It added Redis Cluster. A Redis cluster is a distributed database solution that manages data through sharding. The data is divided into 16384 slots, and each node is responsible for a portion of the slots.
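A rough sketch of how cluster sharding maps a key to a slot and a node; Python's built-in hash stands in for the CRC16 function Redis actually uses, and the node list is hypothetical:

    SLOTS = 16384
    nodes = ["node-a", "node-b", "node-c"]        # hypothetical cluster members

    def slot_for_key(key):
        # Redis actually uses CRC16(key) % 16384; the built-in hash stands in here.
        return hash(key) % SLOTS

    def node_for_key(key):
        slot = slot_for_key(key)
        return nodes[slot * len(nodes) // SLOTS]  # each node owns a contiguous slot range

    print(slot_for_key("user:1001"), node_for_key("user:1001"))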
🔹 Looking Ahead
Redis is popular because of its high performance and rich data structures that dramatically reduce
the complexity of developing a business application.
In 2017, Redis 5.0 was released, adding the stream data type.
In 2020, Redis 6.0 was released, introducing multi-threaded I/O in the network module. The Redis model is divided into the network module and the main processing module. The Redis developers found that the network module tends to become a bottleneck in the system.
Over to you - have you used Redis before? If so, for what use case?
How does “scan to pay” work?
How do you pay from your digital wallet, such as PayPal, Venmo, or Paytm, by scanning a QR code?
To understand the process involved, we need to divide the “scan to pay” process into two
sub-processes:
- Merchant generates a QR code and displays it on the screen
- Consumer scans the QR code and pays
Here are the steps for generating the QR code:
1. When you want to pay for your shopping, the cashier tallies up all the goods and calculates the total amount due, for example, $123.45. The checkout has an order ID of SN129803. The cashier clicks the “checkout” button.
2. The cashier's computer sends the order ID and the amount to the payment service provider (PSP).
3. The PSP saves this information to the database and generates a QR code URL.
4. PSP’s Payment Gateway service reads the QR code URL.
5. The payment gateway returns the QR code URL to the merchant’s computer.
6. The merchant’s computer sends the QR code URL (or image) to the checkout counter.
7. The checkout counter displays the QR code.
These 7 steps complete in less than a second.
Now it’s the consumer’s turn to pay from their digital wallet by scanning the QR code:
1. The consumer opens their digital wallet app to scan the QR code.
2. After confirming the amount is correct, the consumer clicks the "pay" button.
3. The digital wallet App notifies the PSP that the consumer has paid the given QR code.
4. The PSP payment gateway marks this QR code as paid and returns a success message to the consumer’s digital wallet App.
5. The PSP payment gateway notifies the merchant that the consumer has paid the given QR code.
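A rough sketch of the two sub-processes as code, with PspClient standing in for a hypothetical PSP API rather than any real one (in reality the PSP would also push a notification to the merchant, as in step 5 above):

    class PspClient:
        """Hypothetical PSP client, used only to illustrate the message flow above."""

        def __init__(self):
            self.orders = {}

        def create_qr_code(self, order):
            url = f"https://psp.example.com/qr/{order['order_id']}"
            self.orders[url] = {**order, "paid": False}          # PSP stores the order
            return url                                           # returned to the merchant

        def pay(self, qr_code_url, wallet_account):
            self.orders[qr_code_url]["paid"] = True              # mark the QR code as paid
            return "success"                                     # sent back to the wallet app

    psp = PspClient()
    qr_url = psp.create_qr_code({"order_id": "SN129803", "amount": 123.45})  # merchant side
    print(psp.pay(qr_url, wallet_account="consumer-123"))                    # consumer side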
How do Search Engines Work?
● Step 1 - Crawling
Web Crawlers scan the internet for web pages. They follow the URL links from one page to
another and store URLs in the URL store. The crawlers discover new content, including web
pages, images, videos, and files.
● Step 2 - Indexing
Once a web page is crawled, the search engine parses the page and indexes the content
found on the page in a database. The content is analyzed and categorized. For example,
keywords, site quality, content freshness, and many other factors are assessed to
understand what the page is about.
● Step 3 - Ranking
Search engines use complex algorithms to determine the order of search results. These
algorithms consider various factors, including keywords, pages’ relevance, content quality,
user engagement, page load speed, and many others. Some search engines also personalize results based on the user’s past search history, location, device, and other personal factors.
● Step 4 - Querying
When a user performs a search, the search engine sifts through its index to provide the most relevant results.
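A toy sketch of the indexing and querying steps: an inverted index maps each keyword to the pages that contain it, and a query intersects those sets before ranking (ranking omitted here):

    from collections import defaultdict

    index = defaultdict(set)          # keyword -> set of page URLs (a toy inverted index)

    def index_page(url, text):
        for word in text.lower().split():
            index[word].add(url)

    def search(query):
        # Return pages containing every query term; a real engine would then rank them.
        terms = query.lower().split()
        results = set(index[terms[0]]) if terms else set()
        for term in terms[1:]:
            results &= index[term]
        return results

    index_page("https://example.com/a", "system design flashcards")
    index_page("https://example.com/b", "search engine design")
    print(search("design"))           # both pages match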
The Payments Ecosystem
How do fintech startups find new
opportunities among so many payment companies? What do PayPal, Stripe, and Square do exactly?
Steps 0-1: The cardholder opens an account in the issuing bank and gets the debit/credit card. The merchant registers with ISO (Independent Sales Organization) or MSP (Member Service Provider) for in-store sales. ISO/MSP partners with payment processors to open merchant accounts.
Steps 2-5: The acquiring process.
The payment gateway accepts the purchase transaction and collects payment information. It is then sent to a payment processor, which uses customer information to collect payments. The acquiring processor sends the transaction to the card network. It also owns and operates the merchant’s account during settlement, which doesn’t happen in real-time.
Steps 6-8: The issuing process. The issuing processor talks to the card network on the issuing bank’s behalf. It validates and operates the customer’s account.
I’ve listed some companies in different verticals in the diagram. Notice payment companies usually start from one vertical, but later expand to multiple verticals.
Cloud Cost Reduction Techniques
Uncontrolled cloud cost is one of the biggest challenges many organizations are battling as they navigate the complexities of cloud computing.
Efficiently managing these costs is crucial for optimizing cloud usage and maintaining financial health. The following techniques can help businesses effectively control and minimize their cloud expenses.
- Reduce Usage:
Fine-tune the volume and scale of resources to ensure efficiency without compromising on the performance of applications (e.g., downsizing instances, minimizing storage space, consolidating services).
- Terminate Idle Resources:
Locate and eliminate resources that are not in active use, such as dormant instances, databases, or storage units.
- Right Sizing:
Adjust instance sizes to adequately meet the demands of your applications, ensuring neither underuse nor overuse.
- Shutdown Resources During Off-Peak Times:
Set up automatic mechanisms or schedules for turning off non-essential resources when they are not in use, especially during low-activity periods (see the sketch after this list).
- Reserve to Reduce Rate:
Adopt cost-effective pricing models like Reserved Instances or Savings Plans that align with your specific workload needs.
Bonus Tip: Consider using Spot Instances and lower-tier storage options for additional cost savings.
- Optimize Data Transfers:
Utilize methods such as data compression and Content Delivery Networks (CDNs) to cut down on bandwidth expenses, and strategically position resources to reduce data transfer costs, focusing on intra-region transfers.
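One possible implementation of the off-peak shutdown idea, assuming AWS with boto3 and a hypothetical "env" tag that marks non-production instances:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")        # hypothetical region

    # Find running non-production instances (identified here by a hypothetical "env" tag).
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:env", "Values": ["dev", "staging"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)           # run nightly; start again each morning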
Over to you: Which technique fits in well with your current cloud infra setup?
How do live streaming platforms like YouTube Live, TikTok Live, or Twitch work?
Live streaming is challenging because the video content is sent over the internet in near real-time. Video processing is compute-intensive. Sending a large volume of video content over the internet
takes time. These factors make live streaming challenging.
The diagram below explains what happens behind the scenes to make this possible.
Step 1: The streamer starts their stream. The source could be any video and audio source wired up to an encoder.
Step 2: To provide the best upload condition for the streamer, most live streaming platforms provide point-of-presence servers worldwide. The streamer connects to a point-of-presence server closest to them.
Step 3: The incoming video stream is transcoded to different resolutions, and divided into smaller video segments a few seconds in length.
Step 4: The video segments are packaged into different live streaming formats that video players can understand. The most common live-streaming format is HLS, or HTTP Live Streaming (a sample manifest is sketched after these steps).
Step 5: The resulting HLS manifest and video chunks from the packaging step are cached by the CDN.
Step 6: Finally, the video starts to arrive at the viewer’s video player.
Step 7-8: To support replay, videos can be optionally stored in storage such as Amazon S3.
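To make steps 3-4 concrete, here is a sketch that writes an HLS master manifest for a few hypothetical renditions produced by the transcoding step:

    # Hypothetical renditions from the transcoding step: (resolution, bandwidth in bits/s).
    renditions = [("1920x1080", 5_000_000), ("1280x720", 2_500_000), ("854x480", 1_000_000)]

    lines = ["#EXTM3U"]
    for resolution, bandwidth in renditions:
        height = resolution.split("x")[1]
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        lines.append(f"{height}p/playlist.m3u8")   # each rendition has its own segment playlist

    print("\n".join(lines))                        # the master manifest the player fetches first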
9 Best Practices for Building Microservices
Creating a system using microservices is extremely difficult unless you follow some strong principles.
1 - Design For Failure
A distributed system with microservices is going to fail. You must design the system to tolerate failure at multiple levels such as infrastructure, database, and individual services. Use circuit breakers, bulkheads, or graceful degradation methods to deal
with failures.
2 - Build Small Services
A microservice should not do multiple things at once. A good microservice is designed to do one thing well.
3 - Use Lightweight Protocols for Communication
Communication is the core of a distributed system. Microservices must talk to each other using lightweight protocols. Options include REST, gRPC, or message brokers.
4 - Implement service discovery
To communicate with each other, microservices need to discover each other over the network. Implement service discovery using tools such as Consul, Eureka, or Kubernetes Services.
5 - Data Ownership
In microservices, data should be owned and managed by the individual services. The goal should be to reduce coupling between services so that they can evolve independently.
6 - Use resiliency patterns
Implement specific resiliency patterns to improve the availability of the services.
Examples: retry policies, caching, and rate limiting (see the retry sketch after this list).
7 - Security at all levels
In a microservices-based system, the attack surface is quite large. You must implement security at every level of the service communication path.
8 - Centralized logging
Logs are important for finding issues in a system. With multiple services, they become critical.
9 - Use containerization techniques
To deploy microservices in an isolated manner, use containerization techniques.
Tools like Docker and Kubernetes can help with this as they are meant to simplify the scaling and deployment of a microservice.
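A minimal sketch of one resiliency pattern from practice 6, retries with exponential backoff and jitter; the wrapped call can be any function that raises on failure, and the payment client in the usage comment is hypothetical:

    import random
    import time

    def call_with_retries(operation, max_attempts=4, base_delay=0.2):
        """Retry a flaky downstream call with exponential backoff and jitter."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts:
                    raise                                        # give up after the last attempt
                delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
                time.sleep(delay)                                # back off before retrying

    # Usage: wrap a call to another microservice, e.g.
    # call_with_retries(lambda: payment_client.charge(order_id))   # hypothetical client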
Over to you: what other best practice would you recommend?
Linux Boot Process Illustrated
Step 1 - When we turn on the power, BIOS (Basic Input/Output System) or UEFI (Unified Extensible Firmware Interface) firmware is loaded from non-volatile memory, and executes POST (Power On Self Test).
Step 2 - BIOS/UEFI detects the devices connected to the system, including CPU, RAM, and storage.
Step 3 - Choose a booting device to boot the OS from. This can be the hard drive, the network server, or CD ROM.
Step 4 - BIOS/UEFI runs the boot loader (GRUB), which provides a menu to choose the OS or the kernel functions.
Step 5 - After the kernel is ready, we now switch to the user space. The kernel starts up systemd as the first user-space process, which manages the processes and services, probes all remaining hardware, mounts filesystems, and runs a desktop environment.
Step 6 - systemd activates the default.target unit when the system boots. Other units are executed as well.
Step 7 - The system runs a set of startup scripts and configures the environment.
Step 8 - The users are presented with a login window. The system is now ready.
How does Visa make money?
Why is the credit card called “𝐭𝐡𝐞 𝐦𝐨𝐬𝐭 𝐩𝐫𝐨𝐟𝐢𝐭𝐚𝐛𝐥𝐞 product in banks”? How does VISA/Mastercard make money?
1. The cardholder pays a merchant $100 to buy a product.
2. The merchant benefits from the use of the credit card with higher sales volume, and needs to compensate the issuer and the card network for providing the payment service. The acquiring bank sets a fee with the merchant, called the "𝐦𝐞𝐫𝐜𝐡𝐚𝐧𝐭 𝐝𝐢𝐬𝐜𝐨𝐮𝐧𝐭 𝐟𝐞𝐞."
3 - 4. The acquiring bank keeps $0.25 as the 𝐚𝐜𝐪𝐮𝐢𝐫𝐢𝐧𝐠 𝐦𝐚𝐫𝐤𝐮𝐩, and $1.75 is paid to the issuing bank as the 𝐢𝐧𝐭𝐞𝐫𝐜𝐡𝐚𝐧𝐠𝐞 𝐟𝐞𝐞. The merchant discount fee should cover the interchange fee. The interchange fee is set by the card network because it is less efficient for each issuing bank to negotiate fees with each merchant (a worked example follows this list).
5. The card network sets up the 𝐧𝐞𝐭𝐰𝐨𝐫𝐤 𝐚𝐬𝐬𝐞𝐬𝐬𝐦𝐞𝐧𝐭𝐬 𝐚𝐧𝐝 𝐟𝐞𝐞𝐬 with each bank, which pays the card network for its services every month. For example, VISA charges a 0.11% assessment, plus a $0.0195 usage fee, for every swipe.
6. The cardholder pays the issuing bank for its services. Why should the issuing bank be compensated?
● The issuer pays the merchant even if the cardholder fails to pay the issuer.
● The issuer pays the merchant before the cardholder pays the issuer.
● The issuer has other operating costs, including managing customer accounts, providing statements, fraud detection, risk management, clearing & settlement, etc.
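Putting the figures above together for the $100 purchase (the $2.00 merchant discount fee is simply the sum of the acquiring markup and the interchange fee):

    purchase_amount = 100.00

    interchange_fee = 1.75                                      # goes to the issuing bank
    acquiring_markup = 0.25                                     # kept by the acquiring bank
    merchant_discount_fee = interchange_fee + acquiring_markup  # 2.00, charged to the merchant

    # VISA's example network assessment: 0.11% of the amount plus a $0.0195 usage fee per swipe.
    network_assessment = purchase_amount * 0.0011 + 0.0195

    print(purchase_amount - merchant_discount_fee)   # 98.0   -> what the merchant receives
    print(round(network_assessment, 4))              # 0.1295 -> what the card network collects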
How do we manage configurations in a system?
A comparison between traditional configuration management and IaC
(Infrastructure as Code).
● Configuration Management
The practice is designed to manage and provision IT infrastructure through systematic and repeatable processes. This is critical for ensuring that the system performs as intended. Traditional configuration management focuses on maintaining the desired state of the system’s configuration items, such as servers, network devices, and applications, after they have been provisioned.
It usually involves initial manual setup by DevOps engineers, and changes are managed by step-by-step commands.
● What is IaC?
IaC, on the other hand, represents a shift in how infrastructure is provisioned and managed,
treating infrastructure setup and changes as software development practices.
IaC automates the provisioning of infrastructure, starting and managing the system through
code. It often uses a declarative approach, where the desired state of the infrastructure is described.
Tools like Terraform, AWS CloudFormation, Chef, and Puppet are used to define
infrastructure in code files that are source controlled.
IaC represents an evolution towards automation, repeatability, and consistency.
What is CSS (Cascading Style Sheets)?
Front-end development requires not only presenting content but also making it look good. CSS is a style sheet language used to describe how elements on a web page should be rendered.
▶️ What does CSS do?
CSS separates the content and presentation of a document. In the early days of web development, HTML acted as both content and style.
CSS divides structure (HTML) and style (CSS). This has many benefits, for example, when we
change the color scheme of a web page, all we need to do is to tweak the CSS file.
▶️ How does CSS work?
CSS consists of a selector and a set of properties, which can be thought of as individual rules.
Selectors are used to locate the HTML elements whose style we want to change, and properties are the specific style descriptions for those elements, such as color, size, position, etc. For example, if we want to make all the text in a paragraph blue, we write CSS code like this: p { color: blue; }
Here "p" is the selector and "color: blue" is the declaration that sets the paragraph text color to blue.
▶️ Cascading in CSS
The concept of cascading is crucial to understanding CSS. When multiple style rules conflict, the browser needs to decide which rule to use based on a specific prioritization rule. The one with the highest weight wins. The weight can be determined by a variety of factors, including selector type and the order of the source.
▶️ Powerful Layout Capabilities of CSS
In the past, CSS was only used for simple visual effects such as text colors, font styles, or backgrounds. Today, CSS has evolved into a powerful layout tool capable of handling complex design layouts. The “Flexbox” and “Grid” layout modules are two popular CSS layout modules that make it easy to create responsive designs and precise placement of web elements, so web developers no longer have to rely on complex tables or floating layouts.
▶️ CSS Animation
Animation and interactive elements can greatly enhance the user experience.
CSS3 introduces animation features that allow us to transform and animate elements without using JavaScript. For example, the "@keyframes" rule defines animation sequences, and the transition
property can be used to set animated transitions from one state to another.
▶️ Responsive Design
CSS allows the layout and style of a website to be adapted to different screen sizes and resolutions, so that we can provide an optimized browsing experience for different devices such as cell phones, tablets and computers.
Roadmap for Learning Cyber Security
Cybersecurity is crucial for protecting information and systems from theft, damage, and unauthorized access. Whether you’re a beginner or looking to advance your technical skills, there are numerous resources and paths you can take to learn more about cybersecurity. Here are some structured
suggestions to help you get started or deepen your knowledge:
🔹 Security Architecture
🔹 Frameworks & Standards
🔹 Application Security
🔹 Risk Assessment
🔹 Enterprise Risk Management
🔹 Threat Intelligence
🔹 Security Operations
How will you design the Stack Overflow website?
If your answer is on-premise servers and monolith (on the right), you would likely fail the interview, but that’s how it is built in reality!
𝐖𝐡𝐚𝐭 𝐩𝐞𝐨𝐩𝐥𝐞 𝐭𝐡𝐢𝐧𝐤 𝐢𝐭 𝐬𝐡𝐨𝐮𝐥𝐝 𝐥𝐨𝐨𝐤 𝐥𝐢𝐤𝐞
The interviewer is probably expecting something on the left side.
1. Microservices are used to decompose the system into small components.
2. Each service has its own database. Caches are used heavily.
3. The services are sharded.
4. The services talk to each other asynchronously through message queues.
5. The service is implemented using Event Sourcing with CQRS.
6. Showing off knowledge in distributed systems such as eventual consistency, CAP theorem, etc.
𝐖𝐡𝐚𝐭 𝐢𝐭 𝐚𝐜𝐭𝐮𝐚𝐥𝐥𝐲 𝐢𝐬
Stack Overflow serves all the traffic with only 9 on-premise web servers, and it runs as a monolith! It has its own servers and does not run on the cloud.
This is contrary to all our popular beliefs these days.
Over to you: what is good architecture, the one that looks fancy during the interview or the one that works in reality?
The one-line change that reduced clone times by a whopping 99%, says Pinterest
While it may sound cliché, small changes can definitely create a big impact.
The Engineering Productivity team at Pinterest witnessed this first-hand.
They made a small change in the Jenkins build pipeline of their monorepo codebase called
Pinboard.
And it brought down clone times from 40 minutes to just 30 seconds.
For reference, Pinboard is the oldest and largest monorepo at Pinterest. Some facts about it:
- 350K commits
- 20 GB in size when cloned fully
- 60K git pulls on every business day
Cloning monorepos with a lot of code and history is time-consuming. This was exactly what was happening with Pinboard.
The build pipeline (written in Groovy) started with a “Checkout” stage where the repository was cloned for the build and test steps.
The clone options were set to shallow clone, no fetching of tags and only fetching the last 50 commits.
But it missed a vital piece of optimization.
The Checkout step didn’t use the Git refspec option.
This meant that Git was effectively fetching all refspecs for every build. For the Pinboard monorepo, it meant fetching more than 2500 branches.
𝐒𝐨 - 𝐰𝐡𝐚𝐭 𝐰𝐚𝐬 𝐭𝐡𝐞 𝐟𝐢𝐱?
The team simply added the refspec option and specified which ref they cared about. It was the “master” branch in this case.
This single change allowed Git clone to deal with only one branch and significantly reduced the overall build time of the monorepo.
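An illustrative before-and-after in plain git commands (not the actual Pinterest pipeline code, which lives in the Jenkins Groovy configuration):

    # Before: no refspec, so refs for every branch (2,500+) are fetched on each build
    git fetch origin

    # After: a narrow refspec fetches only the master branch, shallow and without tags
    git fetch --depth 50 --no-tags origin +refs/heads/master:refs/remotes/origin/master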
How does JavaScript Work?
The cheat sheet below shows the most important characteristics of JavaScript.
🔹 Interpreted Language
JavaScript code is executed by the browser or JavaScript engine rather than being compiled into machine language beforehand. This makes it highly portable across different platforms. Modern engines such as V8 utilize Just-In-Time (JIT) technology to compile code into directly executable machine code.
🔹 Function is First-Class Citizen
In JavaScript, functions are treated as first-class citizens, meaning they can be stored in variables, passed as arguments to other functions, and returned from functions.
🔹 Dynamic Typing
JavaScript is a loosely typed or dynamic language, meaning we don’t have to declare a variable’s type ahead of time, and the type can change at runtime.
🔹 Asynchronous Programming
JavaScript supports asynchronous programming, allowing operations like reading files, making HTTP requests, or querying databases to run in the background and trigger callbacks or promises when complete. This is particularly useful in web development for improving performance and user experience.
🔹 Prototype-Based OOP
Unlike class-based object-oriented languages, JavaScript uses prototypes for inheritance. This means that objects can inherit properties and methods from other objects.
🔹 Automatic Garbage Collection
Garbage collection in JavaScript is a form of automatic memory management. The primary goal of garbage collection is to reclaim memory occupied by objects that are no longer in use by the program, which helps prevent memory leaks and optimizes the performance of the application.
🔹 Compared with Other Languages
JavaScript is special compared to programming languages like Python or Java because of its position as a major language for web development. While Python is known to provide good code readability and versatility, and Java is known for its structure and robustness, JavaScript is an interpreted language that runs directly on the browser without compilation, emphasizing flexibility and dynamism.
🔹 Relationship with Typescript
TypeScript is a superset of JavaScript, which means that it extends JavaScript by adding features to the language, most notably type annotations. This relationship allows any valid JavaScript code to also be considered valid TypeScript code.
🔹 Popular JavaScript Frameworks
React is known for its flexibility and large number of community-driven plugins, while Vue is clean and intuitive with highly integrated and responsive features. Angular, on the other hand, offers a strict set of development specifications for enterprise-level JS development.
How does gRPC work?
RPC (Remote Procedure Call) is called “𝐫𝐞𝐦𝐨𝐭𝐞” because it enables communications between
remote services when services are deployed to different servers under microservice architecture.
From the user’s point of view, it acts like a local function call.
The diagram below illustrates the overall data flow for 𝐠𝐑𝐏𝐂.
Step 1: A REST call is made from the client. The request body is usually in JSON format.
Steps 2 - 4: The order service (gRPC client) receives the REST call, transforms it, and makes an RPC call to the payment service. gRPC encodes the 𝐜𝐥𝐢𝐞𝐧𝐭 𝐬𝐭𝐮𝐛 into a binary format and sends it to the low-level transport layer.
Step 5: gRPC sends the packets over the network via HTTP2. Because of binary encoding and network optimizations, gRPC is said to be 5X faster than JSON.
Steps 6 - 8: The payment service (gRPC server) receives the packets from the network, decodes them, and invokes the server application.
Steps 9 - 11: The result is returned from the server application, and gets encoded and sent to the transport layer.
Steps 12 - 14: The order service receives the packets, decodes them, and sends the result to the client application.
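A sketch of what the order service's call might look like with Python's grpcio package; the payment.proto service, its generated modules, and the field names are hypothetical, so this illustrates the client-stub idea rather than a runnable program:

    import grpc
    # payment_pb2 and payment_pb2_grpc would be generated from a hypothetical payment.proto.
    import payment_pb2, payment_pb2_grpc

    channel = grpc.insecure_channel("payment-service:50051")      # HTTP/2 under the hood
    stub = payment_pb2_grpc.PaymentServiceStub(channel)           # the client stub

    # Looks like a local function call, but the request is binary-encoded (Protocol Buffers)
    # and sent over the network to the payment service.
    response = stub.Charge(payment_pb2.ChargeRequest(order_id="SN129803", amount_cents=12345))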
How Netflix Really Uses Java
Netflix is predominantly a Java shop.
Every backend application (including internal apps, streaming, and movie production apps) at Netflix is a Java application.
However, the Java stack is not static and has gone through multiple iterations over the years.
Here are the details of those iterations:
1 - API Gateway
Netflix follows a microservices architecture. Every piece of functionality and data is owned by a microservice built using Java (initially version 8)
This means that rendering one screen (such as the List of List of Movies or LOLOMO) involved
fetching data from 10s of microservices. But making all these calls from the client created a
performance problem.
Netflix initially used the API Gateway pattern using Zuul to handle the orchestration.
2 - BFFs with Groovy & RxJava
Using a single gateway for multiple clients was a problem for Netflix because each client (such as TV, mobile apps, or web browser) had subtle differences. To handle this, Netflix used the Backend-for-Frontend (BFF) pattern. Zuul was moved to the role of a proxy. In this pattern, every frontend or UI gets its own mini backend that performs the request fanout and orchestration for multiple services. The BFFs were built using Groovy scripts and the service fanout was done using RxJava for thread management.
3 - GraphQL Federation
The Groovy and RxJava approach required more work from the UI developers in creating the Groovy scripts. Also, reactive programming is generally hard. Recently, Netflix moved to GraphQL Federation. With GraphQL, a client can specify exactly what set of fields it needs, thereby solving the problem of overfetching and underfetching with REST APIs.
The GraphQL Federation takes care of calling the necessary microservices to fetch the data.
These microservices are called Domain Graph Service (DGS) and are built using Java 17, Spring Boot 3, and Spring Boot Netflix OSS packages. The move from Java 8 to Java 17 resulted in 20% CPU gains.
More recently, Netflix has started to migrate to Java 21 to take advantage of features like virtual threads.
OSI Model
How is data sent over the network? Why do we need so many layers in the OSI model?
The diagram below shows how data is encapsulated and de-encapsulated when transmitting over the network.
Step 1: When Device A sends data to Device B over the network via the HTTP protocol, an HTTP header is first added at the application layer.
Step 2: Then a TCP or a UDP header is added to the data. It is encapsulated into TCP segments at the transport layer. The header contains the source port, destination port, and sequence number.
Step 3: The segments are then encapsulated with an IP header at the network layer. The IP header contains the source/destination IP addresses.
Step 4: A MAC header is added to the IP datagram at the data link layer, with source/destination MAC addresses.
Step 5: The encapsulated frames are sent to the physical layer and transmitted over the network as binary bits.
Steps 6-10: When Device B receives the bits from the network, it performs the de-encapsulation process, which is the reverse of the encapsulation process. The headers are removed layer by layer, and eventually, Device B can read the data.
We need layers in the network model because each layer focuses on its own responsibilities. Each layer can rely on the headers for processing instructions and does not need to know the meaning of the data from the last layer.
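A toy sketch of encapsulation and de-encapsulation, with byte-string prefixes standing in for the real TCP, IP, and MAC headers:

    # Toy illustration: each layer prepends its own header; the receiver strips them in reverse.
    LAYER_HEADERS = [b"TCP|", b"IP|", b"MAC|"]        # transport, network, data link

    def encapsulate(data):
        for header in LAYER_HEADERS:                  # wrap the payload layer by layer
            data = header + data
        return data

    def de_encapsulate(frame):
        for header in reversed(LAYER_HEADERS):
            assert frame.startswith(header)           # each layer only inspects its own header
            frame = frame[len(header):]
        return frame

    frame = encapsulate(b"HTTP|GET /index.html")      # what Device A puts on the wire
    print(de_encapsulate(frame))                      # Device B recovers the HTTP message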
Over to you: Do you know which layer is responsible for resending lost data?
8 Key Data Structures That Power Modern Databases
🔹Skiplist: a common in-memory index type. Used in Redis
🔹Hash index: a very common implementation of the “Map” data structure (or “Collection”)
🔹SSTable: immutable on-disk “Map” implementation
🔹LSM tree: Skiplist + SSTable. High write throughput
🔹B-tree: disk-based solution. Consistent read/write performance
🔹Inverted index: used for document indexing. Used in Lucene
🔹Suffix tree: for string pattern search
🔹R-tree: multi-dimension search, such as finding the nearest neighbor