System Design Flashcards

1
Q

A property of an operation: sending the same request any number of times has the same effect as sending it once.

A

Idempotency
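
A minimal sketch of one common way to get this property, assuming the client attaches a unique key to each logical request. The `PaymentService` class, the `deposit` method, and the key format are hypothetical names for illustration only.

```python
class PaymentService:
    """Toy service that deduplicates requests by a client-supplied key."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # idempotency key -> stored response

    def deposit(self, idempotency_key, amount):
        # A repeated key returns the stored response without re-applying the effect.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.balance += amount
        response = {"status": "ok", "balance": self.balance}
        self._seen[idempotency_key] = response
        return response


service = PaymentService()
first = service.deposit("req-123", 50)
retry = service.deposit("req-123", 50)  # retried request, same key
```

The retry leaves the balance unchanged and returns the same response as the first call.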

2
Q

Type of index where writes first go to an in-memory balanced binary search tree (memtable). When the tree grows too large, its contents are written to disk, sorted by key, as a table file.

A

SSTables and LSM trees index
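
A rough sketch of the write path described above, under simplifying assumptions: a plain dict stands in for the balanced tree, and sorted in-memory lists stand in for on-disk table files. `TinyLSM` and `memtable_limit` are hypothetical names.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}  # stand-in for the in-memory balanced tree
        self.memtable_limit = memtable_limit
        self.sstables = []  # oldest first; each is a list of (key, value) sorted by key

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable contents out sorted by key, then start a fresh one.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Check the newest table first so recent writes win.
        for table in reversed(self.sstables):
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None
```

Because each table is sorted, a lookup inside it can use binary search instead of a full scan.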

3
Q

A type of DB commonly optimized for systems that run aggregation queries

A

OLAP (Online Analytical Processing) DB

4
Q
  • Adhering to robots.txt so that crawling does not overload website servers
  • Websites can host a robots.txt file on their servers that defines how often they may be crawled.
A

Politeness

5
Q
  • Guarantees that every request (read or write) receives a response, regardless of the success or failure of the request.
  • During Black Friday, customers can browse products, add items to their cart, and complete purchases without facing downtime, even if some parts of the system are experiencing network issues.
A

Availability

6
Q

A distributed algorithm used to ensure all-or-nothing outcomes (atomicity) in a distributed transaction system.
* Coordinates a global transaction across multiple nodes or databases, ensuring that all participants either commit or roll back, which maintains consistency across the distributed system.

A

Two-Phase Commit
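
A simplified sketch of the two phases, assuming in-process participants and no failures of the coordinator itself. `Participant`, `can_commit`, and `two_phase_commit` are hypothetical names, not a real library API.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work could be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous yes, otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

A single "no" vote is enough to roll back every participant, which is what gives the all-or-nothing outcome.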

7
Q
  • A network communication protocol
  • Establishes a connection between sender and receiver before data is sent.
  • Ensures all data packets arrive at the destination in the correct order (lost packets are retransmitted).
  • Error-checking mechanism
  • Used for web browsing (HTTPS), email (SMTP, IMAP, POP3), and file transfer

Advantages
* Reliable and ensures data integrity
* Suitable for applications where data accuracy and order are critical.

Disadvantages:
* Higher overhead due to connection management and error checking.
* Slower than UDP due to the additional overhead.

A

TCP (Transmission Control Protocol)

8
Q

Supports node query caching: each Elasticsearch instance (node) keeps its own cache, holding the top 10k queries in an LRU cache

A

AWS OpenSearch

9
Q
  1. Better Sharding Keys: Choose sharding keys that ensure a more even distribution of data and traffic.
  2. Hash-Based Sharding: Use hash functions to distribute data more uniformly across shards.
  3. Dynamic Sharding: Adjust the number of shards dynamically based on the load and traffic patterns.
  4. Load Balancing: Implement load balancing strategies to distribute traffic more evenly across shards.
  5. Partitioning Within Shards: Further partition data within shards to distribute load within the shard more evenly.
  6. Caching: Implement caching strategies to reduce the load on hot shards by serving frequently accessed data from a cache.
A

Hot Shard Mitigations

10
Q

A method of processing data where a large volume of data is collected, processed, and output at once, rather than in real-time. This approach is suitable for scenarios where immediate processing is not required, allowing for efficient handling of extensive datasets and complex computations.

A

Batch processing

11
Q

Way(s) Kafka can help with fault tolerance / data integrity?

A

Kafka: Retention/Replayability

12
Q
  • Used in machine learning where in order to find similarities between two entities, we need to represent them as numbers or in most cases, vectors
A

Embedding

13
Q

A system architecture that simplifies the data processing model by eliminating the batch layer entirely. It was introduced by Jay Kreps to address the complexity of the Lambda Architecture.

A

Kappa Architecture

14
Q

A distributed event streaming platform used for
* building real-time data pipelines and streaming applications.
* designed to handle high throughput and low latency for data ingestion and processing, enabling the real-time processing of data streams.

A

Kafka

15
Q

System architecture designed to handle massive quantities of data by using both batch and real-time processing methods. It was proposed to address the challenges of latency, throughput, and fault-tolerance.

A

Lambda Architecture

16
Q
  • A computing paradigm that focuses on continuously processing and analyzing data in real-time as it arrives.
  • Deals with live data streams, allowing for immediate insights and actions based on current data.
A

Stream processing

17
Q

Dividing a single database or dataset into smaller segments

A

Partition

18
Q

a replication strategy where multiple nodes (leaders) can accept write operations. Each leader replicates its changes to other leaders, allowing writes to be processed on multiple nodes. This provides higher availability and fault tolerance, but introduces challenges in maintaining data consistency and conflict resolution.

A

Multi Leader Replication

19
Q

A concept used in stream processing and system design to group a continuous stream of data into fixed, non-overlapping chunks of time. Each chunk captures events that occur within a specific time period. This is useful for analyzing data streams in discrete intervals, allowing for time-based aggregations and computations.

  • A company wants to monitor the number of transactions on its website in 10-minute intervals. This way it can easily compute metrics such as total transactions, average transaction value, and other aggregations for each 10-minute window.
A

Tumbling Window
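
The 10-minute example can be sketched as follows, assuming events arrive as `(timestamp_seconds, amount)` pairs with integer timestamps. The function name and the output shape are illustrative choices.

```python
from collections import defaultdict


def tumbling_window_totals(events, window_seconds=600):
    """Group (timestamp_seconds, amount) events into fixed, non-overlapping windows."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for ts, amount in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = (ts // window_seconds) * window_seconds
        totals[window_start]["count"] += 1
        totals[window_start]["total"] += amount
    return dict(totals)


events = [(0, 10.0), (599, 20.0), (600, 5.0)]
result = tumbling_window_totals(events)
```

Here the first two events fall into the window starting at 0 and the third into the window starting at 600.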

20
Q

a web communication technique used to achieve near-real-time interaction between a client and a server. It is a method where the client requests information from the server, and the server holds the request open until new information is available or a timeout occurs.

A

Long polling

21
Q
  • refers to intentional or unintentional mechanisms that can cause web robots (such as search engine bots) to get stuck in a loop or spend excessive amounts of time on a particular site.
  • This can lead to inefficient crawling, wasting resources, and potentially causing the crawler to miss other valuable content on the web.
A

Crawler Traps

22
Q
  • refers to the ability of a system to continue functioning correctly even when some of its components fail
A

Fault Tolerance

23
Q
  • A strategy used in network communication protocols and other distributed systems to manage retries after encountering transient failures.
  • The key idea is to increase the wait time between successive retries exponentially, thereby reducing the likelihood of overwhelming the system or causing further congestion.
  • This approach is particularly useful in scenarios where multiple clients may be attempting to access the same resource, and it helps to prevent a “thundering herd” problem, where many clients retry simultaneously.
A

Exponential Backoff
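
A minimal retry helper along these lines. The random jitter is an added assumption (a common companion to backoff that spreads out simultaneous retries), and `retry_with_backoff`, `base`, and `cap` are hypothetical names.

```python
import random
import time


def retry_with_backoff(operation, max_retries=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation(); on ConnectionError wait, then retry with a doubling ceiling."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise
            # The ceiling doubles each attempt (base, 2*base, 4*base, ...) up to cap.
            ceiling = min(cap, base * 2 ** attempt)
            # Jitter: wait a random time below the ceiling so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

Passing a custom `sleep` makes the helper easy to exercise without actually waiting.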

24
Q
  • Involves adding extra components that can take over if one component fails.
A

Redundancy

25
Q
  • Refers to the goal of minimizing the delay between user’s action and the system’s response.
A

Low Latency

26
Q

A distributed event stream processing framework providing robust and scalable solutions for real-time data processing.
* keeps state in memory
* supports stream processing, enabling real-time analysis of data as it arrives
* supports event time processing, allowing applications to reason about data based on the timestamps embedded in the events themselves
* lets streaming applications maintain and manage their own state

A

Flink

27
Q
  • a probabilistic data structure used to test whether an element is a member of a set.
  • It is highly space-efficient but allows for a certain probability of false positives.
  • particularly useful in scenarios where memory space is a concern and where it is acceptable to have some false positives but no false negatives.
A

Bloom Filter
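
A small illustrative implementation, assuming salted SHA-256 digests as the hash functions and a plain list of booleans as the bit array. The sizes `num_bits` and `num_hashes` are arbitrary here, not tuned values.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))
```

An added item always reports `True`, which is why false negatives cannot occur; unrelated items may occasionally report `True` when their bits happen to collide.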

28
Q
  • It’s built with an inverted index to make searching for documents by term fast.
  • good for low latency search
A

Elasticsearch
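
To illustrate the inverted index idea in isolation, here is a toy version that maps each term to the set of documents containing it. It is not how the real engine is implemented; whitespace tokenization and the function names are simplifying assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {1: "fast search engine", 2: "search for documents", 3: "fast cars"}
index = build_inverted_index(docs)
```

A query only touches the posting sets of its own terms, so it never has to scan every document.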

29
Q
  • A technique used to distribute data across a distributed system in a way that minimizes the number of changes required when nodes are added or removed.
  • Useful for distributed caches, load balancing, and database partitioning
  • hash ring: nodes in the system are arranged in circular structure.
A

Consistent Hashing
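
A compact hash ring sketch. Virtual nodes (`vnodes`) are an added assumption commonly used to even out the distribution; `HashRing` and its method names are hypothetical.

```python
import bisect
import hashlib


def _hash(key):
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the circle
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Place several points per node around the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key):
        if not self._ring:
            raise ValueError("ring is empty")
        # Walk clockwise to the first point at or after the key's hash, wrapping around.
        i = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[i % len(self._ring)][1]
```

When a node is removed, only the keys that were mapped to it move; every other key keeps its previous owner.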

30
Q

A replication strategy used in distributed databases where there is no single leader. Instead, all nodes in the system are equal and can accept read and write requests. This approach enhances fault tolerance and availability by avoiding single points of failure and distributing the load more evenly across the nodes.

A

Leaderless Replication

31
Q

A type of load balancer algorithm that rotates requests evenly across servers

A

Round Robin
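
A minimal sketch of the rotation, assuming a fixed server list. `RoundRobinBalancer` and the server names are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    def __init__(self, servers):
        # cycle() repeats the servers in order, forever.
        self._cycle = itertools.cycle(list(servers))

    def next_server(self):
        return next(self._cycle)


lb = RoundRobinBalancer(["s1", "s2", "s3"])
picks = [lb.next_server() for _ in range(5)]
```

Five requests are routed to s1, s2, s3, s1, s2 in that order.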

32
Q
  • A network communication protocol
  • No connection is established between sender and receiver
  • Unreliable (does not guarantee ordering or error recovery)
  • Fast! lower overhead and latency compared to TCP
  • Live video and audio streaming, VoIP, online gaming
  • broadcasting
  • low latency and fast transmission
A

UDP (User Datagram Protocol)

33
Q

A data replication strategy where one node handles all write operations and propagates changes to one or more follower nodes (replicas). The followers handle read operations, allowing the system to scale read traffic and improve fault tolerance.

A

Single Leader Replication

34
Q

Ways to Achieve Fault Tolerance

A
  1. Redundancy
  2. Replication
  3. Graceful Degradation
  4. Checkpointing and Rollback
  5. Error Detection and Correction
  6. Failover and Recovery
35
Q

a type of concurrency bug that occurs when multiple threads or processes access shared resources simultaneously and the outcome depends on the specific timing of their execution. This can lead to unpredictable and incorrect behavior in a program.

A

Race Condition
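
One common fix is to guard the shared state with a lock. The sketch below assumes Python threads; `SafeCounter` and `run_workers` are hypothetical names.

```python
import threading


class SafeCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, the read-modify-write below could interleave
        # across threads and lose updates.
        with self._lock:
            self.value += 1


def run_workers(counter, num_threads=4, increments_each=1000):
    def work():
        for _ in range(increments_each):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place the final count always equals `num_threads * increments_each`, regardless of how the threads are scheduled.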

36
Q

Web Crawler High Level Data Flow

A
  1. Take seed URLs from the frontier (the set of URLs yet to be crawled) and get their IPs from DNS
  2. Fetch HTML
  3. Extract text from HTML
  4. Store text in database
  5. Extract the urls in the text and add to frontier
  6. Repeat steps (1-5) until the frontier set is empty.
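
The loop above can be sketched as follows. Several simplifications are assumptions: fetching (including DNS resolution) is delegated to a caller-supplied `fetch_html` function, a dict stands in for the database, links are pulled from `href` attributes with a regex, and a `seen` set keeps already-queued URLs out of the frontier.

```python
import re
from collections import deque


def crawl(seed_urls, fetch_html, max_pages=100):
    frontier = deque(seed_urls)  # URLs yet to be crawled
    seen = set(seed_urls)
    store = {}                   # stand-in for the database: url -> text
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch_html(url)                       # step 2: fetch HTML
        text = re.sub(r"<[^>]+>", " ", html)         # step 3: strip tags
        store[url] = " ".join(text.split())          # step 4: store text
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:                     # step 5: add new URLs
                seen.add(link)
                frontier.append(link)
    return store


# Hypothetical two-page site that links to itself in a loop.
pages = {
    "a": '<a href="b">B</a> hello',
    "b": '<a href="a">A</a> world',
}
store = crawl(["a"], lambda url: pages.get(url, ""))
```

The `seen` set is what lets the loop terminate even though the two pages link to each other.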
37
Q
  • A Redis feature that allows for storing an unordered collection of unique strings.
  • Useful for efficiently performing operations such as testing membership, computing intersections, unions, and differences between sets.
  • Redis provides several commands to work with sets, allowing you to add, remove, and query elements with high performance.
A

Redis Set

38
Q
  • Ensures that all nodes see the same data at the same time. Any read operation after a write operation should return the latest written value.
  • In a banking system, it’s important that all transactions are recorded accurately and that all nodes have the same data, maintaining financial integrity.
A

Consistency

39
Q
  • Can be used to optimize network resources and ensures high-quality experience for users
  • Can help with bandwidth management by forwarding only essential data (e.g. audio or active video streams) ahead of less critical data (e.g. high-resolution video in non-essential views)
  • Quality of service: in a video conferencing application, the active speaker’s video stream can receive higher priority and better quality, while less critical streams are forwarded at lower quality or dropped if needed
A

Selective Forwarding

40
Q

A big data processing and analytics framework. It provides an efficient, general-purpose, and fault-tolerant data processing engine that supports both batch and stream processing. It is known for its speed, ease of use, and ability to handle large-scale data processing tasks across a distributed cluster of machines.

A

Spark

41
Q
  • an effective way to manage high traffic situations for web applications or services,
  • allowing users to wait in a queue rather than being denied access due to server overload.
  • it can help maintain a positive user experience during peak times.
A

Virtual Waiting Queue (+redis)

42
Q
  • a fully managed message queuing service provided by Amazon Web Services (AWS). It enables decoupling and scalability of microservices, distributed systems, and serverless applications by allowing components to communicate asynchronously.
  • Supports exponential backoff / retries out of the box
A

Amazon SQS

43
Q

states that in a distributed data store, it is impossible to simultaneously achieve all three of the following guarantees:
1. Consistency (C): Every read receives the most recent write or an error.
2. Availability (A): Guarantees that every request (read or write) receives a response, regardless of the success or failure of the request.
3. Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

A

CAP Theorem

44
Q

When a shard handles a significantly higher number of read and/or write operations compared to other shards.

A

Hot Shard

45
Q
  • specialized type of database optimized for storing and querying time-stamped or time series data.
  • Logs, metrics, sensor readings etc
  • Popular implementations include TimescaleDB, InfluxDB, and Druid
A

Time Series Database

46
Q
  • A design pattern used in database management and data integration to track and collect changes (inserts, updates, and deletes) in a source database, and then apply those changes to a target system.
  • This pattern is essential for keeping different systems in sync, enabling real-time analytics, and supporting various data integration and ETL (Extract, Transform, Load) processes.
A

Change Data Capture

47
Q

A practice of splitting a large dataset into smaller, more manageable pieces. Each subset of the data is stored on a separate DB or node.

A

Sharding

48
Q
  • Latency: Not suitable for real-time data processing requirements.
  • Complexity: Can become complex to manage, especially with large and diverse datasets.
  • Error Handling: Identifying and resolving errors can be challenging as they are often detected after batch completion.
A

Batch Processing - Disadvantages

49
Q

A type of load balancer algorithm that sends each request to the server with the fewest active connections

A

Least Connection (LB)
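
A toy version of the selection rule, assuming the balancer itself tracks active connection counts. `LeastConnectionsBalancer`, `acquire`, and `release` are hypothetical names.

```python
class LeastConnectionsBalancer:
    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Pick the server with the fewest active connections (first one wins ties).
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


lb = LeastConnectionsBalancer(["s1", "s2"])
a = lb.acquire()
b = lb.acquire()
lb.release(a)       # s1's connection finishes
c = lb.acquire()    # s1 is the least loaded again
```

Unlike round robin, the choice depends on current load, so a server that finishes work quickly receives the next request sooner.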

50
Q

A type of load balancer algorithm that hashes the client IP address, ensuring that requests from the same IP are always routed to the same server

A

IP Hash (LB)
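
A minimal sketch, assuming a fixed server list and a SHA-256 digest as the hash. `pick_server` and the addresses are hypothetical.

```python
import hashlib


def pick_server(client_ip, servers):
    # Hash the client IP and map it onto the server list.
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["s1", "s2", "s3"]
first = pick_server("10.0.0.7", servers)
second = pick_server("10.0.0.7", servers)
```

Both calls return the same server. With plain modulo mapping, changing the number of servers remaps most clients.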

51
Q

Global Secondary Index

A

Amazon DynamoDB’s powerful feature that allows you to query data using alternate keys other than the primary key. This is useful when you need to perform complex queries that your main table’s primary key structure doesn’t support.

52
Q

refers to the ability of a system to process a large amount of data or transactions in a given period of time. It is a critical performance metric for applications that need to handle a high volume of requests, such as web servers, databases, and data processing frameworks.

A

Throughput

53
Q
  • In stream processing and real-time analytics, a type of windowing operation used to group events or data points into overlapping time-based windows.
  • Allows for continuous and overlapping aggregation or processing of data over specified time intervals, providing more frequent and granular insights compared to another type
A

Hopping / Sliding Window
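
To make the overlap concrete, the sketch below lists every window that a single event falls into. It assumes integer timestamps, windows of the form `[start, start + size)`, and window starts aligned to multiples of `hop` beginning at 0. The function name is hypothetical.

```python
def hopping_windows(ts, size, hop):
    """Return the start times of every window [start, start + size) containing ts."""
    starts = []
    start = (ts // hop) * hop  # latest window start at or before ts
    while start > ts - size and start >= 0:
        starts.append(start)
        start -= hop
    return sorted(starts)
```

With `size=10` and `hop=5`, an event at time 12 belongs to the windows starting at 5 and 10. When `hop` equals `size`, each event belongs to exactly one window, which is the non-overlapping (tumbling) case.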

54
Q
  • Helps manage concurrent access to shared resources and ensure data consistency by preventing race conditions
  • it does this via configurable locking mechanism and expiration time
A

Redis Lock (Distributed Lock)

55
Q

A distributed coordination service that helps manage large sets of hosts. It is used for centralized configuration management, synchronization, and providing group services.
* Leader Election: When a server in a distributed system goes down, it can help by facilitating the election of a new leader.
* Service Discovery: When a server goes down, it updates the service registry accordingly, allowing clients to discover available services dynamically
* Config management: stores config data

A

Zookeeper

56
Q
  • Elasticsearch service that has built-in caching mechanisms
  • Node Query Cache, which caches frequently executed queries (LRU)
  • Field Data Cache, which caches field values in memory to speed up sorting and aggregation
  • Caches results of queries expected to be frequently repeated
  • Supports geo indexing
A

AWS OpenSearch

57
Q

A type of database optimized for read-heavy, analytical workloads where operations often involve scanning and aggregating large amounts of data. It stores data in a column-oriented way rather than by row. For example, aggregating an age column in a traditional row-oriented DB requires extracting the value row by row before you can aggregate, whereas a column-oriented DB reads and aggregates the column's values directly.

A

Column-Oriented database
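
The age example can be sketched with two in-memory layouts of the same data. The sample records and function names are made up for illustration; real column stores add compression and on-disk layouts that are not modeled here.

```python
# Row-oriented layout: one record per row.
rows = [
    {"name": "Ann", "age": 30},
    {"name": "Bob", "age": 40},
    {"name": "Cy", "age": 50},
]

# Column-oriented layout: one list per column.
columns = {
    "name": ["Ann", "Bob", "Cy"],
    "age": [30, 40, 50],
}


def average_age_row_store(rows):
    # Every full row is visited just to pull out one field.
    return sum(row["age"] for row in rows) / len(rows)


def average_age_column_store(columns):
    # Only the "age" column is read.
    ages = columns["age"]
    return sum(ages) / len(ages)
```

Both functions return the same average; the difference is how much unrelated data each one has to touch.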