System Design Flashcards
Property of an operation where sending the same request any number of times has the same effect as sending it once.
Idempotency
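One common way to get this property is an idempotency key supplied by the client. The sketch below is illustrative only: `PaymentService`, `deposit`, and the key `"req-123"` are hypothetical names, and the in-memory dict stands in for whatever durable store a real service would use.

```python
class PaymentService:
    """Toy service that deduplicates requests by an idempotency key."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # idempotency key -> response already returned

    def deposit(self, idempotency_key, amount):
        # A retried request returns the stored response and changes nothing.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.balance += amount
        response = {"status": "ok", "balance": self.balance}
        self._seen[idempotency_key] = response
        return response


service = PaymentService()
first = service.deposit("req-123", 50)
retry = service.deposit("req-123", 50)  # same key: no second deposit
```

After the retry, the balance is still 50 and both calls return the same response.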
Type of index where writes first go to an in-memory balanced binary search tree (the memtable). When the tree becomes too large, its contents are written to disk, sorted by key, as an SSTable file.
SSTables and LSM trees index
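The write path above can be sketched roughly as follows. This is a simplification under stated assumptions: `TinyLSM` is a hypothetical name, a plain dict stands in for the balanced tree, and "disk" is just a list of sorted in-memory tables with no compaction.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}  # stand-in for the balanced tree (assumption)
        self.memtable_limit = memtable_limit
        self.sstables = []  # flushed tables, oldest first, each sorted by key

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as one table sorted by key, then reset it.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then tables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)  # binary search in sorted keys
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("b", 1)
db.put("a", 2)  # limit reached: memtable is flushed as a sorted table
db.put("a", 3)  # newer value lives in the memtable
```

Reads consult the memtable first, so `db.get("a")` sees the newer value even though an older one sits in a flushed table.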
A type of DB optimized for aggregation-heavy analytical queries over large datasets
OLAP (Online Analytical Processing) DB
- Adhering to robots.txt so the crawler does not overload website servers.
- Websites can publish a robots.txt file on their servers that defines what can be crawled and how often.
Politeness
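A crawler can check these rules with Python's standard `urllib.robotparser` module. In this minimal sketch the robots.txt content, the user agent `my-crawler`, and the example.com URLs are made up for illustration; the rules are parsed from a string so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally instead of fetched.
robots_txt = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("my-crawler", "https://example.com/products")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
delay = parser.crawl_delay("my-crawler")  # seconds to wait between requests
```

A polite crawler would skip the blocked URL and sleep for `delay` seconds between requests to the same host.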
- Guarantees that every request (read or write) receives a response, regardless of the success or failure of the request.
- Example: during Black Friday, customers can browse products, add items to their cart, and complete purchases without facing downtime, even if some parts of the system are experiencing network issues.
Availability
A distributed algorithm used to ensure all-or-nothing outcomes (atomicity) in a distributed transaction system.
* Coordinates a global transaction across multiple nodes or databases, ensuring that all participants either commit or roll back, which keeps the distributed system consistent.
Two-Phase Commit
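The two phases (prepare/vote, then commit or roll back) can be sketched as below. `Participant` and `two_phase_commit` are hypothetical names, and the sketch ignores timeouts, crashes, and durable logging, which a real coordinator has to handle.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if this node can guarantee a later commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every vote was yes, otherwise roll back all.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "aborted"


ok_nodes = [Participant("orders"), Participant("payments")]
ok_result = two_phase_commit(ok_nodes)

bad_nodes = [Participant("orders"), Participant("payments", can_commit=False)]
bad_result = two_phase_commit(bad_nodes)
```

A single "no" vote is enough to abort the transaction on every node.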
- A network communication protocol
- Establishes a connection between sender and receiver before data is sent.
- Ensures all data packets arrive at the destination in the correct order. Lost packets are retransmitted.
- Includes an error-checking mechanism
- Used for web browsing (HTTP/HTTPS), email (SMTP, IMAP, POP3), and file transfer (FTP)
Advantages:
* Reliable and ensures data integrity
* Suitable for applications where data accuracy and order are critical.
Disadvantages:
* Higher overhead due to connection management and error checking.
* Slower than UDP due to the additional overhead.
TCP (Transmission Control Protocol)
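As a rough illustration of the connection-oriented model, the sketch below uses Python's standard `socket` module to run a one-shot echo server and client on the loopback interface. The helper `echo_once` and the message are illustrative; the OS picks a free port.

```python
import socket
import threading


def echo_once(server_sock):
    # Accept a single connection and send back whatever arrives first.
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)


# SOCK_STREAM selects TCP; port 0 asks the OS for any free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

worker = threading.Thread(target=echo_once, args=(server,))
worker.start()

# create_connection performs the TCP handshake before any data is sent.
with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    client.sendall(b"hello")
    reply = client.recv(1024)

worker.join()
server.close()
```

The handshake, ordering, and retransmission all happen inside the OS network stack; the application only sees a reliable byte stream.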
Supports node query caching: each node (instance) keeps an LRU cache that holds up to 10k queries.
AWS OpenSearch
- Better Sharding Keys: Choose sharding keys that ensure a more even distribution of data and traffic.
- Hash-Based Sharding: Use hash functions to distribute data more uniformly across shards.
- Dynamic Sharding: Adjust the number of shards dynamically based on the load and traffic patterns.
- Load Balancing: Implement load balancing strategies to distribute traffic more evenly across shards.
- Partitioning Within Shards: Further partition data within shards to distribute load within the shard more evenly.
- Caching: Implement caching strategies to reduce the load on hot shards by serving frequently accessed data from a cache.
Hot Shard Mitigations
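Hash-based sharding, one of the mitigations above, might look like the following sketch. The function name `shard_for`, the choice of SHA-256, and the `user-N` keys are assumptions for illustration. Note that plain modulo hashing reshuffles most keys when the shard count changes, and it does not help when one single key is hot.

```python
import hashlib
from collections import Counter


def shard_for(key, num_shards):
    # A stable hash (unlike Python's built-in hash()) gives the same shard
    # for the same key across processes and restarts.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


keys = [f"user-{i}" for i in range(10_000)]
counts = Counter(shard_for(k, 4) for k in keys)  # keys per shard
```

Even though the keys are sequential, `counts` comes out roughly even across the four shards.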
A method of processing data where a large volume of data is collected, processed, and output at once, rather than in real-time. This approach is suitable for scenarios where immediate processing is not required, allowing for efficient handling of extensive datasets and complex computations.
Batch processing
Way(s) Kafka can help with fault tolerance / data integrity?
Kafka: Retention/Replayability
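The idea behind replay can be sketched with a toy append-only log. This is not the Kafka client API: `RetainedLog`, `append`, and `read_from` are hypothetical names that only mimic how a consumer resumes from its last committed offset because the broker retains records after they are read.

```python
class RetainedLog:
    """Toy append-only log; records stay available after being read."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        # Reading does not delete anything, so any offset can be re-read.
        return self._records[offset:]


log = RetainedLog()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

# Suppose a consumer crashed after committing only offset 0 as processed.
committed_offset = 1
replayed = log.read_from(committed_offset)
```

On restart the consumer picks up `order-2` and `order-3`, so no event is lost to the crash.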
- Used in machine learning: to measure similarity between two entities, we first represent them as numbers, in most cases vectors
Embedding
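Once entities are vectors, similarity is often measured with cosine similarity. In the sketch below the three-dimensional vectors are made up for illustration; real embeddings come from a trained model and usually have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, hand-written for this example.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

cat_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
```

Here `cat_kitten` is much closer to 1 than `cat_car`, matching the intuition that the first pair is more similar.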
A system architecture that simplifies the data processing model by eliminating the batch layer entirely. It was introduced by Jay Kreps to address the complexity of the Lambda Architecture.
Kappa Architecture
A distributed event streaming platform
* used for building real-time data pipelines and streaming applications.
* designed to handle high throughput and low latency for data ingestion and processing, enabling the real-time processing of data streams.
Kafka
A system architecture designed to handle massive quantities of data by using both batch and real-time processing methods. It was proposed to address the challenges of latency, throughput, and fault tolerance.
Lambda Architecture
- A computing paradigm that focuses on continuously processing and analyzing data in real-time as it arrives.
- Deals with live data streams, allowing for immediate insights and actions based on current data.
Stream processing
Dividing a single database or dataset into smaller segments
Partition
A replication strategy where multiple nodes (leaders) can accept write operations. Each leader replicates its changes to the other leaders, allowing writes to be processed on multiple nodes. This provides higher availability and fault tolerance, but introduces challenges in maintaining data consistency and resolving conflicts.
Multi Leader Replication
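One simple (and lossy) way to resolve conflicting writes is last-write-wins, sketched below. The function name and the `(value, timestamp)` layout are assumptions for illustration; other strategies such as version vectors or application-level merging avoid silently dropping a concurrent write.

```python
def merge_last_write_wins(replica_a, replica_b):
    """Merge two leaders' states; each value is a (value, timestamp) pair."""
    merged = dict(replica_a)
    for key, (value, ts) in replica_b.items():
        # Keep whichever write carries the later timestamp.
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged


leader_1 = {"title": ("Draft", 100), "owner": ("alice", 90)}
leader_2 = {"title": ("Final", 120), "tags": ("urgent", 95)}
merged = merge_last_write_wins(leader_1, leader_2)
```

Both leaders accepted a write to `title`; after merging, only the later one ("Final") survives.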
A concept used in stream processing and system design to group a continuous stream of data into fixed, non-overlapping chunks of time. Each chunk captures events that occur within a specific time period. This is useful for analyzing data streams in discrete intervals, allowing for time-based aggregations and computations.
- Example: a company wants to monitor the number of transactions on its website in 10-minute intervals. This way it can easily compute metrics like total transactions, average transaction value, and other aggregations for each 10-minute window.
Tumbling Window
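The 10-minute example can be sketched by rounding each event timestamp down to the start of its window. The function name, the `(timestamp_seconds, amount)` event layout, and the sample events are assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # 10 minutes


def tumbling_window_totals(events, window_seconds=WINDOW_SECONDS):
    """Group (timestamp_seconds, amount) events into fixed windows."""
    windows = defaultdict(lambda: {"count": 0, "total": 0.0})
    for timestamp, amount in events:
        # Each event belongs to exactly one window: round down to its start.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start]["count"] += 1
        windows[window_start]["total"] += amount
    return dict(windows)


events = [(0, 10.0), (599, 20.0), (600, 5.0), (1250, 7.5)]
result = tumbling_window_totals(events)
```

The events at 599 and 600 seconds land in different windows, which shows that the windows do not overlap.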
A web communication technique used to achieve near-real-time interaction between a client and a server. The client requests information from the server, and the server holds the request open until new information is available or a timeout occurs.
Long polling
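The server-side "hold until data or timeout" behavior can be simulated without a web framework. In this sketch a `queue.Queue` stands in for the source of new information, and the function name `long_poll` and the 200/204 status values are illustrative choices rather than part of any specific API.

```python
import queue
import threading


def long_poll(mailbox, timeout):
    """Hold the request until a message arrives or the timeout expires."""
    try:
        return {"status": 200, "data": mailbox.get(timeout=timeout)}
    except queue.Empty:
        # Nothing new: the client would normally re-issue the request.
        return {"status": 204, "data": None}


mailbox = queue.Queue()
# Simulate new information becoming available 0.1 s after the request.
threading.Timer(0.1, mailbox.put, args=["new message"]).start()

response = long_poll(mailbox, timeout=2.0)  # returns as soon as data arrives
empty = long_poll(mailbox, timeout=0.1)     # times out with no data
```

The first call returns well before its 2-second timeout because the message arrives early; the second call waits the full 0.1 seconds and comes back empty.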
- Intentional or unintentional mechanisms that cause web robots (such as search engine bots) to get stuck in a loop or spend excessive amounts of time on a particular site.
- This can lead to inefficient crawling, wasting resources, and potentially causing the crawler to miss other valuable content on the web.
Crawler Traps
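A crawler can defend itself with a few cheap guards before fetching a URL. The limits and the function name `should_crawl` below are arbitrary assumptions for illustration; real crawlers tune such thresholds and often add per-host page budgets.

```python
from urllib.parse import urlparse

MAX_DEPTH = 5          # assumed limit on path segments
MAX_URL_LENGTH = 200   # assumed limit on total URL length


def should_crawl(url, visited):
    # Skip pages already seen and suspiciously long URLs.
    if url in visited or len(url) > MAX_URL_LENGTH:
        return False
    # Very deep paths often come from endlessly generated links.
    segments = [s for s in urlparse(url).path.split("/") if s]
    return len(segments) <= MAX_DEPTH


visited = {"https://example.com/seen"}
normal = should_crawl("https://example.com/a/b", visited)
repeat = should_crawl("https://example.com/seen", visited)
deep = should_crawl("https://example.com/" + "/".join(["cal"] * 10), visited)
```

Only the first URL passes; the already-visited page and the ten-level-deep path are rejected.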
- The ability of a system to continue functioning correctly even when some of its components fail
Fault Tolerance