System Design Flashcards
Sending a request as many times as you want but the effect is as if it only happens once.
Idempotency
Type of index where write first goes to an in-memory balanced binary search tree (memtable), and eventually written to disk when tree becomes too large. Write the contents of it (sorted by key name) to a table file
SSTables and LSM trees index
A type of DB commonly optimized for query aggregating system
OLAP (Online Analytical Processing) DB
- Adhering to robot.txt to not overload website servers inappropriately
- Websites can have robot.txt on their servers which defines how often they can be crawled.
Politeness
- Guarantees that every request (read or write) receives a response, regardless of the success or failure of the request.
- During Black Friday Customers can browse products, add items to their cart, and complete purchases without facing downtime, even if some parts of the system are experiencing network issues.
Availability
A distributed algorithm used to ensure all-or-nothing outcomes (atomicity) in a distributed transaction system.
* Coordinates global transaction across multiple nodes or database ensuring all participants either commit or roll back transactions maintaining consistency across distributed system.
Two-Phase Commit
- A network communication protocol
- establishes connection between sender and receiver before data is sent.
- ensures all data packets arrive at the destination in correct order. Lost packets are retransmitted)
- error checking mechanism
- web browsing (HTTPS), EMAIL (SMTP,IMAP,POP3) FileStransfer)
Advantages
* Reliable and ensures data integrity
* Suitable for applications where data accuracy and order are critical.
Disadvantages:
* Higher overhead due to connection management and error checking.
* Slower than UDP due to the additional overhead.
TCP (Transmission Control Protocol)
Supports Node query caching: cache on each instances of elastic search and caches the top 10k queries via LRU cache
AWS OpenSearch
- Better Sharding Keys: Choose sharding keys that ensure a more even distribution of data and traffic.
- Hash-Based Sharding: Use hash functions to distribute data more uniformly across shards.
- Dynamic Sharding: Adjust the number of shards dynamically based on the load and traffic patterns.
- Load Balancing: Implement load balancing strategies to distribute traffic more evenly across shards.
- Partitioning Within Shards: Further partition data within shards to distribute load within the shard more evenly.
- Caching: Implement caching strategies to reduce the load on hot shards by serving frequently accessed data from a cache.
Hot Shard Mitigations
A method of processing data where a large volume of data is collected, processed, and output at once, rather than in real-time. This approach is suitable for scenarios where immediate processing is not required, allowing for efficient handling of extensive datasets and complex computations.
Batch processing
Way(s) Kafka can help with fault tolerance / data integrity?
Kafka: Retention/Replayability
- Used in machine learning where in order to find similarities between two entities, we need to represent them as numbers or in most cases, vectors
Embedding
System architecture simplifies the data processing model by eliminating the batch layer entirely. It was introduced by Jay Kreps to address the complexity of the Lambda Architecture.
Kappa Architecture
A distribute event streaming platform used for
* building real-time data pipelines and streaming applications.
* designed to handle high throughput and low latency for data ingestion and processing, enabling the real-time processing of data streams.
Kafka
System architecture designed to handle massive quantities of data by using both batch and real-time processing methods. It was proposed to address the challenges of latency, throughput, and fault-tolerance.
Lambda Architecture
- A computing paradigm that focuses on continuously processing and analyzing data in real-time as it arrives.
- Deals with live data streams, allowing for immediate insights and actions based on current data.
Stream processing
Dividing a single database or dataset into smaller segments
Partition
a replication strategy where multiple nodes (leaders) can accept write operations. Each leader replicates its changes to other leaders, allowing writes to be processed on multiple nodes. This provides higher availability and fault tolerance, but introduces challenges in maintaining data consistency and conflict resolution.
Multi Leader Replication
A concept used in stream processing and system design to group a continuous stream of data into fixed, non-overlapping chunks of time. Each chunk captures events that occur within a specific time period. This is useful for analyzing data streams in discrete intervals, allowing for time-based aggregations and computations.
- A company wants to monitor a number of transactions on their website in a 10 min intervals. THis way they can easily compute metrics like total transactions, average trasaction value and other aggregations each 10 min window
Tumbling Window
a web communication technique used to achieve near-real-time interaction between a client and a server. It is a method where the client requests information from the server, and the server holds the request open until new information is available or a timeout occurs.
Long polling
- refers to intentional or unintentional mechanisms that can cause web robots (such as search engine bots) to get stuck in a loop or spend excessive amounts of time on a particular site.
- This can lead to inefficient crawling, wasting resources, and potentially causing the crawler to miss other valuable content on the web.
Crawler Traps
- refers to the ability of a system to continue functioning correctly even when some of its components fail
Fault Tolerance
- A strategy used in network communication protocols and other distributed systems to manage retries after encountering transient failures.
- The key idea is to increase the wait time between successive retries exponentially, thereby reducing the likelihood of overwhelming the system or causing further congestion.
- This approach is particularly useful in scenarios where multiple clients may be attempting to access the same resource, and it helps to prevent a “thundering herd” problem, where many clients retry simultaneously.
Exponential Backoff
- Involves adding extra components that can take over if one component fails.
Redundancy
- Refers to the goal of minimizing the delay between user’s action and the system’s response.
Low Latency
A distributive event stream processing framework providing robust and scalable solutions for real-time data processing.
* keeps state in memory
* supports stream processing, enabling real-time analysis of data as it arrives. I
* supports event time processing, allowing applications to reason about data based on the timestamps embedded in the events themselves
* maintain and manage state within their streaming applications.
Flink
- a probabilistic data structure used to test whether an element is a member of a set.
- It is highly space-efficient but allows for a certain probability of false positives.
- particularly useful in scenarios where memory space is a concern and where it is acceptable to have some false positives but no false negatives.
Bloom Filter
- It’s built with an inverted index to make searching for documents by term fast.
- good for low latency search
Elasticseach
- A technique used to distribute data across a distributed system in a way that minimizes the number of changes required when nodes are added or remove.
- Useful for distributed caches, load-balancing, partitioning database
- hash ring: nodes in the system are arranged in circular structure.
Consistent Hashing
A replication strategy used in distributed databases where there is no single leader. Instead, all nodes in the system are equal and can accept read and write requests. This approach enhances fault tolerance and availability by avoiding single points of failure and distributing the load more evenly across the nodes.
Leaderless Replication
A type of load balancer algorithm that rotates requests evenly across servers
Round Robin
- A network communication protocol
- no connection is establish between sender and receiver
- unreliable (does not guarantee order or error
- Fast! lower overhead and latency compared to TCP
- live video, and audio streaming (voIP, online gaming)
- broadcasting
- low latency and fast transmission
UDP (User Datagram Protocol)
A data replication strategy where one node handles all write operations and propagates changes to one or more follower nodes (replicas). The followers handle read operations, allowing the system to scale read traffic and improve fault tolerance.
Single Leader Replication
Ways to Achieve Fault Tolerance
- Redundancy
- Replication
- Graceful Degradation
- Checkpointing and Rollback
- Error Detection and Correction
- Failover and Recovery
a type of concurrency bug that occurs when multiple threads or processes access shared resources simultaneously and the outcome depends on the specific timing of their execution. This can lead to unpredictable and incorrect behavior in a program.
Race Condition
Web Crawler High Level Data Flow
- Take seed urls from a frontier (set of urls yet to be crawled) and the IP from DNS
- Fetch HTML
- Extract text from HTML
- Store text in database
- Extract the urls in the text and add to frontier
- Repeat steps (1-5) until the frontier set is empty.
- A Redis feature that allows for storing an unordered collection of unique strings. They
- useful for efficiently performing operations such as testing membership, computing intersections, unions, and differences between sets.
*Redis provides several commands to work with sets, allowing you to add, remove, and query elements with high performance.
Redis Set
- Ensures that all nodes see the same data at the same time. Any read operation after a write operation should return the latest written value.
- In banking system, it’s important that all transactions are recorded accurately and that all nodes have the same data, maintaining financial integrity.
Consistency
- Can be used to optimize network resources and ensures high-quality experience for users
- Can help with bandwidth management (allows only forwarding essential data (i.e like audio or video streams) over less critical data (high resolutio video in non essential views)
- Quality of service: in a video conferencing application: we may only forward streams of the client that is considered the active speaker’s video or receive higher priority and better quality over less critical streams are forwarded at lower quality or dropped if needed
Selective Forwarding
a big data processing and analytics. It provides an efficient, general-purpose, and fault-tolerant data processing engine that supports both batch and stream processing. It is known for its speed, ease of use, and ability to handle large-scale data processing tasks across a distributed cluster of machines.
Spark
- an effective way to manage high traffic situations for web applications or services,
- allowing users to wait in a queue rather than being denied access due to server overload.
- it can help maintain a positive user experience during peak times.
Virtual Waiting Queue (+redis)
- a fully managed message queuing service provided by Amazon Web Services (AWS). It enables decoupling and scalability of microservices, distributed systems, and serverless applications by allowing components to communicate asynchronously.
- Supports exponential back / retries out of the box
Amazon SQ
states that in a distributed data store, it is impossible to simultaneously achieve all three of the following guarantees:
1. Consistency (C): Every read receives the most recent write or an error.
2. Availability (A): Guarantees that every request (read or write) receives a response, regardless of the success or failure of the request.
3. Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.
CAP Theorem
When a shard handles a significantly higher number of read and/or write operations compared to other shards.
Hot Shard
- specialized type of database optimized for storing and querying time-stamped or time series data.
- Logs, metrics, sensor readings etc
- Popular implementation include: TimeScale DB, Influx DB, Druid
Time Series Database
- A design pattern used in database management and data integration to track and collect changes (inserts, updates, and deletes) in a source database, and then apply those changes to a target system.
- This pattern is essential for keeping different systems in sync, enabling real-time analytics, and supporting various data integration and ETL (Extract, Transform, Load) processe
Change Data Capture
A practice of splitting a large dataset into smaller, more manageable pieces. each of this subset of data is stored on separate db or node
Sharding
- Latency: Not suitable for real-time data processing requirements.
- Complexity: Can become complex to manage, especially with large and diverse datasets.
- Error Handling: Identifying and resolving errors can be challenging as they are often detected after batch completion.
Batch Processing - Disadvantages
A type of load balancer algorithm that sends request to server that has fewest connection
Least Connection (LB)
A type of load balancer algorithm based on IP address ensuring same ip routes to the same server each request
IP Hash (LB)
Global Secondary Index
Amazon DynamoDB’s powerful feature that allows you to query data using alternate keys other than the primary key. This is useful when you need to perform complex queries that your main table’s primary key structure doesn’t support.
refers to the ability of a system to process a large amount of data or transactions in a given period of time. It is a critical performance metric for applications that need to handle a high volume of requests, such as web servers, databases, and data processing frameworks.
Throughput
- In stream processing and real-time analytics, a type of windowing operation used to group events or data points into overlapping time-based windows.
- Allows for continuous and overlapping aggregation or processing of data over specified time intervals, providing more frequent and granular insights compared to another type
Hopping / Sliding Window
- Helps manage concurrent access to shared resources and ensure data consistency by preventing race conditions
- it does this via configurable locking mechanism and expiration time
Redis Lock (Distributed Lock)
A distributed coordination service that helps manage large sets of hosts. It is used for centralized configuration management, synchronization, and providing group services.
* Leader Election: When a server in distributed system goes down, it can help by facilitating a new leader.
* Service Discover: When server goes down, it updates service registry accordingly, allowing clients to discover available services dynamically
* Config management: stores config data
Zookeeper
- Elasticsearch service that has built in caching mechanism
- Node Query Cache which caches frequently executed queries (LRU)
- Filed Data Cache which caches field values in memory to speed up sorting and aggregation
- Caches results of queries expected to be frequently repeated
- Supports for geoIndex
AWS Open search
A type of database optimized for read-heavy, analytical workloads where operations often involve scanning and aggregating large amount of data by storing data in a column oriented way rather than row (i.e aggregating age column in a traditional row orientation db requires extracting the value row by row before u can perform aggregation vs. directly taking and agregating the entire column values
Column-Oriented database