System Design Flashcards Flashcards by Michael DLC

Sending a request as many times as you want but the effect is as if it only happens once.

Idempotency

How well did you know this?

Not at all

Perfectly

Type of index where write first goes to an in-memory balanced binary search tree (memtable), and eventually written to disk when tree becomes too large. Write the contents of it (sorted by key name) to a table file

SSTables and LSM trees index

How well did you know this?

Not at all

Perfectly

A type of DB commonly optimized for query aggregating system

OLAP (Online Analytical Processing) DB

How well did you know this?

Not at all

Perfectly

Adhering to robot.txt to not overload website servers inappropriately
Websites can have robot.txt on their servers which defines how often they can be crawled.

Politeness

How well did you know this?

Not at all

Perfectly

Guarantees that every request (read or write) receives a response, regardless of the success or failure of the request.
During Black Friday Customers can browse products, add items to their cart, and complete purchases without facing downtime, even if some parts of the system are experiencing network issues.

Availability

How well did you know this?

Not at all

Perfectly

A distributed algorithm used to ensure all-or-nothing outcomes (atomicity) in a distributed transaction system.
* Coordinates global transaction across multiple nodes or database ensuring all participants either commit or roll back transactions maintaining consistency across distributed system.

Two-Phase Commit

How well did you know this?

Not at all

Perfectly

A network communication protocol
establishes connection between sender and receiver before data is sent.
ensures all data packets arrive at the destination in correct order. Lost packets are retransmitted)
error checking mechanism
web browsing (HTTPS), EMAIL (SMTP,IMAP,POP3) FileStransfer)

Advantages
* Reliable and ensures data integrity
* Suitable for applications where data accuracy and order are critical.

Disadvantages:
* Higher overhead due to connection management and error checking.
* Slower than UDP due to the additional overhead.

TCP (Transmission Control Protocol)

How well did you know this?

Not at all

Perfectly

Supports Node query caching: cache on each instances of elastic search and caches the top 10k queries via LRU cache

AWS OpenSearch

How well did you know this?

Not at all

Perfectly

Better Sharding Keys: Choose sharding keys that ensure a more even distribution of data and traffic.
Hash-Based Sharding: Use hash functions to distribute data more uniformly across shards.
Dynamic Sharding: Adjust the number of shards dynamically based on the load and traffic patterns.
Load Balancing: Implement load balancing strategies to distribute traffic more evenly across shards.
Partitioning Within Shards: Further partition data within shards to distribute load within the shard more evenly.
Caching: Implement caching strategies to reduce the load on hot shards by serving frequently accessed data from a cache.

Hot Shard Mitigations

How well did you know this?

Not at all

Perfectly

A method of processing data where a large volume of data is collected, processed, and output at once, rather than in real-time. This approach is suitable for scenarios where immediate processing is not required, allowing for efficient handling of extensive datasets and complex computations.

Batch processing

How well did you know this?

Not at all

Perfectly

Way(s) Kafka can help with fault tolerance / data integrity?

Kafka: Retention/Replayability

How well did you know this?

Not at all

Perfectly

Used in machine learning where in order to find similarities between two entities, we need to represent them as numbers or in most cases, vectors

Embedding

How well did you know this?

Not at all

Perfectly

System architecture simplifies the data processing model by eliminating the batch layer entirely. It was introduced by Jay Kreps to address the complexity of the Lambda Architecture.

Kappa Architecture

How well did you know this?

Not at all

Perfectly

A distribute event streaming platform used for
* building real-time data pipelines and streaming applications.
* designed to handle high throughput and low latency for data ingestion and processing, enabling the real-time processing of data streams.

Kafka

How well did you know this?

Not at all

Perfectly

System architecture designed to handle massive quantities of data by using both batch and real-time processing methods. It was proposed to address the challenges of latency, throughput, and fault-tolerance.

Lambda Architecture

How well did you know this?

Not at all

Perfectly

A computing paradigm that focuses on continuously processing and analyzing data in real-time as it arrives.
Deals with live data streams, allowing for immediate insights and actions based on current data.

Stream processing

How well did you know this?

Not at all

Perfectly

Dividing a single database or dataset into smaller segments

Partition

How well did you know this?

Not at all

Perfectly

a replication strategy where multiple nodes (leaders) can accept write operations. Each leader replicates its changes to other leaders, allowing writes to be processed on multiple nodes. This provides higher availability and fault tolerance, but introduces challenges in maintaining data consistency and conflict resolution.

Multi Leader Replication

How well did you know this?

Not at all

Perfectly

A concept used in stream processing and system design to group a continuous stream of data into fixed, non-overlapping chunks of time. Each chunk captures events that occur within a specific time period. This is useful for analyzing data streams in discrete intervals, allowing for time-based aggregations and computations.

A company wants to monitor a number of transactions on their website in a 10 min intervals. THis way they can easily compute metrics like total transactions, average trasaction value and other aggregations each 10 min window

Tumbling Window

How well did you know this?

Not at all

Perfectly

a web communication technique used to achieve near-real-time interaction between a client and a server. It is a method where the client requests information from the server, and the server holds the request open until new information is available or a timeout occurs.

Long polling

How well did you know this?

Not at all

Perfectly

refers to intentional or unintentional mechanisms that can cause web robots (such as search engine bots) to get stuck in a loop or spend excessive amounts of time on a particular site.
This can lead to inefficient crawling, wasting resources, and potentially causing the crawler to miss other valuable content on the web.

Crawler Traps

How well did you know this?

Not at all

Perfectly

refers to the ability of a system to continue functioning correctly even when some of its components fail

Fault Tolerance

How well did you know this?

Not at all

Perfectly

A strategy used in network communication protocols and other distributed systems to manage retries after encountering transient failures.
The key idea is to increase the wait time between successive retries exponentially, thereby reducing the likelihood of overwhelming the system or causing further congestion.
This approach is particularly useful in scenarios where multiple clients may be attempting to access the same resource, and it helps to prevent a “thundering herd” problem, where many clients retry simultaneously.

Study These Flashcards

Exponential Backoff

Involves adding extra components that can take over if one component fails.

Study These Flashcards

Redundancy

* Refers to the goal of minimizing the delay between user’s action and the system’s response.

Low Latency

A distributive event stream processing framework providing robust and scalable solutions for real-time data processing. * keeps state in memory * supports stream processing, enabling real-time analysis of data as it arrives. I * supports event time processing, allowing applications to reason about data based on the timestamps embedded in the events themselves * maintain and manage state within their streaming applications.

Flink

* a probabilistic data structure used to test whether an element is a member of a set. * It is highly space-efficient but allows for a certain probability of false positives. * particularly useful in scenarios where memory space is a concern and where it is acceptable to have some false positives but no false negatives.

Bloom Filter

* It’s built with an inverted index to make searching for documents by term fast. * good for low latency search

Elasticseach

* A technique used to distribute data across a distributed system in a way that minimizes the number of changes required when nodes are added or remove. * Useful for distributed caches, load-balancing, partitioning database * hash ring: nodes in the system are arranged in circular structure.

Consistent Hashing

A replication strategy used in distributed databases where there is no single leader. Instead, all nodes in the system are equal and can accept read and write requests. This approach enhances fault tolerance and availability by avoiding single points of failure and distributing the load more evenly across the nodes.

Leaderless Replication

A type of load balancer algorithm that rotates requests evenly across servers

Round Robin

* A network communication protocol * no connection is establish between sender and receiver * unreliable (does not guarantee order or error * Fast! lower overhead and latency compared to TCP * live video, and audio streaming (voIP, online gaming) * broadcasting * low latency and fast transmission

UDP (User Datagram Protocol)

A data replication strategy where one node handles all write operations and propagates changes to one or more follower nodes (replicas). The followers handle read operations, allowing the system to scale read traffic and improve fault tolerance.

Single Leader Replication

Ways to Achieve Fault Tolerance

1. Redundancy 2. Replication 3. Graceful Degradation 4. Checkpointing and Rollback 5. Error Detection and Correction 6. Failover and Recovery

a type of concurrency bug that occurs when multiple threads or processes access shared resources simultaneously and the outcome depends on the specific timing of their execution. This can lead to unpredictable and incorrect behavior in a program.

Race Condition

Web Crawler High Level Data Flow

1. Take seed urls from a frontier (set of urls yet to be crawled) and the IP from DNS 2. Fetch HTML 3. Extract text from HTML 4. Store text in database 5. Extract the urls in the text and add to frontier 6. Repeat steps (1-5) until the frontier set is empty.

* A Redis feature that allows for storing an unordered collection of unique strings. They * useful for efficiently performing operations such as testing membership, computing intersections, unions, and differences between sets. *Redis provides several commands to work with sets, allowing you to add, remove, and query elements with high performance.

Redis Set

* Ensures that all nodes see the same data at the same time. Any read operation after a write operation should return the latest written value. * In banking system, it’s important that all transactions are recorded accurately and that all nodes have the same data, maintaining financial integrity.

Consistency

* Can be used to optimize network resources and ensures high-quality experience for users * Can help with bandwidth management (allows only forwarding essential data (i.e like audio or video streams) over less critical data (high resolutio video in non essential views) * Quality of service: in a video conferencing application: we may only forward streams of the client that is considered the active speaker’s video or receive higher priority and better quality over less critical streams are forwarded at lower quality or dropped if needed

Selective Forwarding

a big data processing and analytics. It provides an efficient, general-purpose, and fault-tolerant data processing engine that supports both batch and stream processing. It is known for its speed, ease of use, and ability to handle large-scale data processing tasks across a distributed cluster of machines.

Spark

* an effective way to manage high traffic situations for web applications or services, * allowing users to wait in a queue rather than being denied access due to server overload. * it can help maintain a positive user experience during peak times.

Virtual Waiting Queue (+redis)

* a fully managed message queuing service provided by Amazon Web Services (AWS). It enables decoupling and scalability of microservices, distributed systems, and serverless applications by allowing components to communicate asynchronously. * Supports exponential back / retries out of the box

Amazon SQ

states that in a distributed data store, it is impossible to simultaneously achieve all three of the following guarantees: 1. Consistency (C): Every read receives the most recent write or an error. 2. Availability (A): Guarantees that every request (read or write) receives a response, regardless of the success or failure of the request. 3. Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

CAP Theorem

When a shard handles a significantly higher number of read and/or write operations compared to other shards.

Hot Shard

* specialized type of database optimized for storing and querying time-stamped or time series data. * Logs, metrics, sensor readings etc * Popular implementation include: TimeScale DB, Influx DB, Druid

Time Series Database

* A design pattern used in database management and data integration to track and collect changes (inserts, updates, and deletes) in a source database, and then apply those changes to a target system. * This pattern is essential for keeping different systems in sync, enabling real-time analytics, and supporting various data integration and ETL (Extract, Transform, Load) processe

Change Data Capture

A practice of splitting a large dataset into smaller, more manageable pieces. each of this subset of data is stored on separate db or node

Sharding

* Latency: Not suitable for real-time data processing requirements. * Complexity: Can become complex to manage, especially with large and diverse datasets. * Error Handling: Identifying and resolving errors can be challenging as they are often detected after batch completion.

Batch Processing - Disadvantages

A type of load balancer algorithm that sends request to server that has fewest connection

Least Connection (LB)

A type of load balancer algorithm based on IP address ensuring same ip routes to the same server each request

IP Hash (LB)

Global Secondary Index

Amazon DynamoDB’s powerful feature that allows you to query data using alternate keys other than the primary key. This is useful when you need to perform complex queries that your main table's primary key structure doesn't support.

refers to the ability of a system to process a large amount of data or transactions in a given period of time. It is a critical performance metric for applications that need to handle a high volume of requests, such as web servers, databases, and data processing frameworks.

Throughput

* In stream processing and real-time analytics, a type of windowing operation used to group events or data points into overlapping time-based windows. * Allows for continuous and overlapping aggregation or processing of data over specified time intervals, providing more frequent and granular insights compared to another type

Hopping / Sliding Window

* Helps manage concurrent access to shared resources and ensure data consistency by preventing race conditions * it does this via configurable locking mechanism and expiration time

Redis Lock (Distributed Lock)

A distributed coordination service that helps manage large sets of hosts. It is used for centralized configuration management, synchronization, and providing group services. * Leader Election: When a server in distributed system goes down, it can help by facilitating a new leader. * Service Discover: When server goes down, it updates service registry accordingly, allowing clients to discover available services dynamically * Config management: stores config data

Zookeeper

* Elasticsearch service that has built in caching mechanism * Node Query Cache which caches frequently executed queries (LRU) * Filed Data Cache which caches field values in memory to speed up sorting and aggregation * Caches results of queries expected to be frequently repeated * Supports for geoIndex

AWS Open search

A type of database optimized for read-heavy, analytical workloads where operations often involve scanning and aggregating large amount of data by storing data in a column oriented way rather than row (i.e aggregating age column in a traditional row orientation db requires extracting the value row by row before u can perform aggregation vs. directly taking and agregating the entire column values

Column-Oriented database

System Design Flashcards

(57 cards)