System Design Terms Flashcards

1
Q

ACID

A

Atomicity: Atomicity ensures that a transaction is treated as a single unit of work. Either all of the operations within the transaction succeed and are committed, or if any operation fails, the entire transaction is rolled back and the database is left unchanged. This property helps maintain data integrity by preventing partial updates that could leave the database in an inconsistent state.

Consistency: Consistency ensures that the database remains in a valid state before and after the transaction. In other words, any transaction must preserve all integrity constraints, such as foreign key constraints, uniqueness constraints, etc. This property guarantees that the database remains consistent even in the event of system failures or concurrent transactions.

Isolation: Isolation ensures that the execution of transactions concurrently produces results that are equivalent to those obtained if the transactions were executed sequentially. This property prevents interference between transactions, thereby avoiding issues such as dirty reads, non-repeatable reads, and phantom reads.

Durability: Durability ensures that once a transaction is committed, its effects are permanently stored in the database and will not be lost, even in the event of system failures. This is typically achieved by writing transaction changes to non-volatile storage, such as disk, so that they can be recovered in case of a crash.
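
A minimal sketch of atomicity and consistency working together, using Python's built-in sqlite3 module. The accounts table, the transfer helper, and the CHECK constraint are illustrative assumptions, not part of the card. The credit is applied first on purpose, so that a failing debit has something to roll back.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one unit of work (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit; the credit was rolled back too.
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so nothing changes
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

After both calls, only the first transfer is reflected in the balances: the partial update from the second one never becomes visible.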

2
Q

BASE

A

Basically Available: The system keeps responding to requests even when some nodes fail or are unreachable, though a response may contain stale data. BASE relaxes the consistency guarantee provided by ACID in favor of availability and partition tolerance. In distributed systems or NoSQL databases, maintaining strict consistency (as in ACID) across all nodes can be challenging and may reduce availability. BASE accepts that data may be temporarily inconsistent, or that different users may briefly see different versions of the data.

Soft state: Soft state means that the state of the system can change over time, even without input. In BASE systems, data might be eventually consistent rather than immediately consistent. This means that updates to the database may take some time to propagate across all nodes in a distributed system.

Eventually consistent: Eventually consistent systems guarantee that if no new updates are made to a given data item, eventually all accesses to that item will return the same value. This is in contrast to immediately consistent systems, which provide strong consistency guarantees at all times.
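
A toy, single-process sketch of eventual consistency. The class name, the single global version counter, and the explicit propagate step are all assumptions made for illustration; real systems replicate over a network and use more careful versioning.

```python
class EventuallyConsistentStore:
    """Toy cluster: a write lands on one replica and reaches the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []   # queued (target_replica, key, version, value)
        self.clock = 0      # assumption: one global version counter

    def write(self, key, value, replica=0):
        self.clock += 1
        self.replicas[replica][key] = (self.clock, value)
        for i in range(len(self.replicas)):
            if i != replica:
                self.pending.append((i, key, self.clock, value))

    def read(self, key, replica):
        entry = self.replicas[replica].get(key)
        return entry[1] if entry else None

    def propagate(self):
        """Deliver queued updates; the newest version wins on each replica."""
        for target, key, version, value in self.pending:
            current = self.replicas[target].get(key)
            if current is None or current[0] < version:
                self.replicas[target][key] = (version, value)
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("cart", ["book"])
stale = store.read("cart", replica=2)   # replica 2 has not seen the write yet
store.propagate()                       # no new updates arrive, replicas converge
converged = [store.read("cart", replica=i) for i in range(3)]
```

Between the write and the propagate call, different replicas return different answers; once updates stop and propagation finishes, every replica returns the same value.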

3
Q

Examples of ACID DBs

A

Traditional Relational Database Management Systems (RDBMS) such as Oracle, MySQL, PostgreSQL, and SQL Server often adhere to ACID principles.

4
Q

Examples of BASE DBs

A

NoSQL databases like Cassandra, MongoDB, and Couchbase often follow BASE principles.

5
Q

CockroachDB

A

Pros:

ACID

Atomicity: CockroachDB ensures that transactions are atomic, meaning they are treated as a single unit of work. Either all operations within a transaction succeed and are committed, or if any operation fails, the entire transaction is rolled back, maintaining data integrity.

Consistency: It maintains consistency by enforcing integrity constraints and ensuring that transactions preserve the validity of the database schema. CockroachDB guarantees that transactions leave the database in a consistent state, adhering to the defined constraints.

Isolation: CockroachDB provides isolation between transactions to prevent interference and maintain data integrity. It ensures that transactions execute concurrently without impacting each other’s outcomes, avoiding issues like dirty reads and non-repeatable reads.

Durability: CockroachDB ensures durability by persistently storing transaction changes on disk. Once a transaction is committed, its effects are durable and will not be lost even in the event of system failures, ensuring data durability and reliability.

BASE (Partial)

Basically Available: CockroachDB emphasizes high availability by replicating data across multiple nodes in a cluster. A range of data stays available for reads and writes through node failures or network partitions as long as a majority of its replicas can still reach each other; replicas cut off from the majority cannot serve that range. It does not give up consistency to achieve this: it maintains strong consistency within each partition (range) of data.

Soft state and Eventually consistent: CockroachDB does not really follow these two properties. It uses a consensus protocol (Raft) to keep the replicas of each range consistent, and its transactions are serializable by default, so committed writes are not left to converge eventually. The trade-off it makes for partition tolerance is availability and latency rather than consistency: during a partition, the side without a majority becomes unavailable instead of serving divergent data.

Cons:

While CockroachDB offers many benefits such as scalability, fault tolerance, and strong consistency, there are also some potential drawbacks or considerations to be aware of:

Complexity: Setting up and managing a distributed database system like CockroachDB can be more complex compared to traditional single-node databases. It requires expertise in distributed systems, network configurations, and cluster management.

Performance Overhead: Due to its distributed nature and strong consistency guarantees, CockroachDB may introduce some performance overhead compared to single-node databases, especially for highly concurrent workloads or transactions that span multiple nodes.

Storage Overhead: Distributed databases often require redundant copies of data to ensure fault tolerance and data durability. This can result in higher storage requirements compared to non-distributed databases.

Learning Curve: Developers and administrators may need to invest time in learning CockroachDB’s architecture, SQL dialect, and operational best practices, especially if they are transitioning from traditional SQL databases.

Cost: While CockroachDB is available in an open-source edition, the enterprise features and support may come with a cost. Organizations should consider the total cost of ownership, including hardware, maintenance, and support, when evaluating CockroachDB for production use.

Consistency vs. Latency Trade-offs: CockroachDB provides strong consistency guarantees, but achieving strong consistency in a distributed system may lead to increased latency for certain operations, especially in scenarios where data needs to be replicated across multiple nodes.

Data Modeling Considerations: Distributed databases like CockroachDB may have different data modeling considerations compared to single-node databases. Developers need to carefully design schemas and queries to optimize performance and leverage the distributed architecture effectively.

6
Q

A valid avg document file size

A

100 KB

7
Q

1 trillion bytes

A

1 TB

8
Q

1 billion bytes

A

1 GB

9
Q

1 million bytes

A

1 MB

10
Q

1000 bytes

A

1 KB

11
Q

One thousand trillion bytes

A

1 PB
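
The byte-size cards use decimal (power-of-ten) units. A small sketch of how they combine in a back-of-the-envelope estimate; the humanize helper and the figure of one billion documents are assumptions chosen only for illustration.

```python
# Decimal units, as used on the cards (1 KB = 1000 bytes, not 1024).
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}

def humanize(num_bytes):
    """Express a byte count in the largest unit that fits (hypothetical helper)."""
    for name in ("PB", "TB", "GB", "MB", "KB"):
        if num_bytes >= UNITS[name]:
            return f"{num_bytes / UNITS[name]:g} {name}"
    return f"{num_bytes} B"

# Assumed workload: 1 billion documents at an average of 100 KB each.
total_storage = humanize(1_000_000_000 * 100 * UNITS["KB"])
```

Under those assumptions the estimate comes to 100 TB, before accounting for replication or metadata.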

12
Q

seconds in a month

A

2,628,288 seconds
~ 2.6 million seconds

13
Q

minutes in a month

A

43,829 minutes
~ 44,000 minutes
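
A sketch of where the month figures can come from and how they are typically used. The month lengths below (30.42 days, and one twelfth of a 365.2425-day year) are my reading of how the two cards were derived, and the request volume is a made-up example.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
MINUTES_PER_DAY = 24 * 60        # 1,440

# Assumption: the seconds card uses a 30.42-day month.
seconds_per_month = round(30.42 * SECONDS_PER_DAY)

# Assumption: the minutes card uses one twelfth of a 365.2425-day year.
minutes_per_month = round(365.2425 / 12 * MINUTES_PER_DAY)

def average_qps(requests_per_month, month_seconds=2_600_000):
    """Average queries per second, using the rounded ~2.6 million seconds."""
    return requests_per_month / month_seconds

# Hypothetical traffic: 100 million requests per month.
qps = average_qps(100_000_000)
```

With the rounded constant, 100 million requests per month works out to roughly 38 requests per second on average; peak load is usually estimated as a multiple of that.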

14
Q

Solution for polling (i.e. for distributed client syncing)

A

One solution would be for the client to poll the server periodically. However, changes are reflected locally only after a delay, because polling happens on an interval. It also wastes bandwidth, since the server returns empty responses most of the time, and it keeps the server busy. Polling for information in this manner is not scalable.

A better solution would be to use HTTP long polling. With long polling, the client requests information from the server with the expectation that the server may not respond immediately. If the server has no new data for the client when the poll is received, instead of sending an empty response, the server holds the request open and waits for response information to become available. Once it does have new information, the server immediately sends an HTTP/S response to the client, completing the open HTTP/S Request. Upon receipt of the server response, the client can immediately issue another server request for future updates.
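
An in-process sketch of the long-polling idea, using a thread condition in place of a real HTTP connection. The LongPollChannel class, its version counter, and the timeout values are assumptions for illustration; a real server would hold an HTTP request open in the same way.

```python
import threading

class LongPollChannel:
    """Holds a 'request' open until newer data exists or the timeout expires."""

    def __init__(self):
        self._cond = threading.Condition()
        self._version = 0
        self._payload = None

    def publish(self, payload):
        """Server side: new information became available."""
        with self._cond:
            self._version += 1
            self._payload = payload
            self._cond.notify_all()   # wake every waiting poll

    def poll(self, last_seen, timeout=30.0):
        """Client side: block instead of receiving an empty response."""
        with self._cond:
            self._cond.wait_for(lambda: self._version > last_seen, timeout=timeout)
            if self._version > last_seen:
                return self._version, self._payload
            return last_seen, None    # timed out; the client simply polls again

channel = LongPollChannel()
# Simulate a change arriving 0.1 s after the client started waiting.
threading.Timer(0.1, channel.publish, args=("file changed",)).start()
version, payload = channel.poll(last_seen=0, timeout=2.0)
```

The client passes the last version it saw, so it can immediately issue the next poll with the new version and never miss an update published in between.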

15
Q

How should clients handle slow servers?

A

Clients should back off exponentially if the server is busy or not responding. That is, if a server is too slow to respond, clients should delay their retries, and the delay should grow exponentially with each failed attempt.
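
A minimal sketch of exponential backoff. The base delay, cap, attempt count, and the use of random jitter are assumptions, not values from the card; jitter is a common addition so that many clients do not retry at the same instant.

```python
import random
import time

def backoff_delays(attempts=5, base=0.5, factor=2.0, cap=30.0):
    """Delay before each retry: base, base*2, base*4, ... limited by cap."""
    return [min(cap, base * factor ** i) for i in range(attempts)]

def call_with_backoff(request, attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep,
                      jitter=lambda d: random.uniform(0, d)):
    """Call request(); on ConnectionError wait longer each time, then retry."""
    delays = backoff_delays(attempts, base=base, cap=cap)
    for attempt, delay in enumerate(delays):
        try:
            return request()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                 # out of attempts, surface the error
            sleep(jitter(delay))      # randomized wait in [0, delay]
```

The sleep and jitter parameters are injectable so the retry schedule can be tested without actually waiting.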

16
Q

Should mobile clients sync remote changes immediately?

A

Usually not. Unlike desktop or web clients, mobile clients typically sync on demand to save the user’s bandwidth and storage space.

17
Q

Benefits of DB Sharding

A

1) Response time - Improves query performance by reducing the number of rows each query has to scan. It can also handle more concurrent calls, because requests that touch different shards do not have to wait on each other.
2) Increased availability - Avoids a total service outage, since a failed shard affects only its own portion of the data (replicating each shard improves this further).
3) Scalability - Allows you to scale horizontally instead of vertically, since a single DB can only grow so big (there are only so many SSDs you can attach) before storage runs out.
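
A sketch of one common way to route keys to shards: hash the key and take it modulo the shard count. The function names and the user-ID key format are assumptions. Note that plain modulo routing moves most keys when the shard count changes, which is why consistent hashing is often used instead.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard index with a stable hash (same on every machine)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def group_by_shard(keys, num_shards):
    """Show how a set of keys spreads across the shards."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards

# Hypothetical data set: 1000 user IDs spread over 4 shards.
placement = group_by_shard([f"user:{i}" for i in range(1000)], 4)
```

Each query for a given user only touches one shard, so it scans roughly a quarter of the rows, and queries for users on different shards can run in parallel.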

18
Q

Seconds in a day

A

86,400 seconds
~ 10^5 (100,000)

19
Q

Vertical Scaling

A

Definition: Vertical scaling (aka scaling up) involves adding more power to your existing machine or server. This can mean upgrading the CPU, RAM, storage, or other resources.

Pros:

Simplicity: It’s often easier to scale a system vertically because it doesn’t require significant architectural changes. You just add more resources to the existing setup.

Consistency: Since there’s only one system, you don’t have to worry about data consistency across multiple nodes.

Cons:

Physical Limits: There is a physical limit to how much you can upgrade a single machine or server.

Downtime: Upgrading hardware may require downtime if hot-swappable components are not available.

Cost: Beyond a certain point, it becomes very expensive to add high-end hardware to a single machine.

20
Q

Horizontal Scaling

A
21
Q

Geospatial Indexing

A

Ways for geospatial databases to quickly access data within a specific geographic area.

In practice, companies would probably use existing geospatial databases such as Geohash in Redis, or Postgres with PostGIS extension.

Option 1: Two-dimensional search
(naive, and can’t find data efficiently because it needs to compute the intersection of two datasets)

Option 2: Even Grid - Evenly divide world into small grids
(still bad, because the distribution of businesses is not even)

Option 3: Geohash - Better than Even Grid. It works by reducing 2D longitude and latitude data into a one-dimensional string of letters and digits, built by interleaving longitude and latitude bits. Geohash has 12 precision levels, but we normally only need lengths between 4 and 6 to determine the grid size: length 6 suits a search radius of about 0.5 km (0.3 miles) and length 4 a radius of about 5 km (3.1 miles).
Issues: Boundary issues. Two locations can be very close to each other and still share no common prefix when they sit on opposite sides of a major cell boundary. Also, two adjacent points may belong to different geohash cells.
Implementation: To increase the grid-size, keep removing a digit from the end of the geohash.
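
A sketch of the standard geohash encoding, to make the prefix behavior concrete. The function name is an assumption; the base-32 alphabet and the longitude-first bit interleaving follow the public geohash scheme.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True                      # bits alternate: lon, lat, lon, lat, ...
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid   # upper half
            else:
                bits, lon_hi = bits << 1, mid         # lower half
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:              # every 5 bits become one character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

fine = geohash_encode(42.6, -5.6, precision=6)
coarse = geohash_encode(42.6, -5.6, precision=4)
```

Because each extra character only subdivides the current cell, the 4-character hash is a prefix of the 6-character one, which is why dropping trailing characters enlarges the grid.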

Option 4: Quadtree - Built in memory at server start-up and kept on each LBS (Location Based Service) server. It partitions a 2D space by recursively subdividing it into four quadrants until the contents of each grid meet a chosen criterion, e.g. the number of businesses per quadrant is under a certain limit.

Data size per leaf node of the quadtree:
coordinate value = 8 bytes
business ID = 8 bytes
4 coordinate values (top-left and bottom-right corners) + 100 business IDs = 32 + 800 = 832 bytes

Data size per internal node:
pointer to a child node = 8 bytes
4 coordinate values (top-left and bottom-right corners) + 4 pointers to child nodes = 32 + 32 = 64 bytes
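
A sketch of the quadtree build and the node-size arithmetic. The Node layout, the depth limit, and the random sample data are assumptions for illustration; the leaf capacity of 100 and the 8-byte field sizes come from the card.

```python
import random
from dataclasses import dataclass, field

MAX_PER_LEAF = 100   # leaf capacity from the card
MAX_DEPTH = 16       # assumption: stops endless splitting on duplicate points

LEAF_NODE_BYTES = 4 * 8 + MAX_PER_LEAF * 8   # bounding box + business IDs
INTERNAL_NODE_BYTES = 4 * 8 + 4 * 8          # bounding box + child pointers

@dataclass
class Node:
    x0: float
    y0: float
    x1: float
    y1: float
    depth: int = 0
    businesses: list = field(default_factory=list)  # (id, x, y); leaves only
    children: list = field(default_factory=list)    # four nodes once split

def _child_for(node, x, y):
    mx, my = (node.x0 + node.x1) / 2, (node.y0 + node.y1) / 2
    return node.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]

def _split(node):
    mx, my = (node.x0 + node.x1) / 2, (node.y0 + node.y1) / 2
    d = node.depth + 1
    node.children = [
        Node(node.x0, node.y0, mx, my, d),   # index 0: low x, low y
        Node(mx, node.y0, node.x1, my, d),   # index 1: high x, low y
        Node(node.x0, my, mx, node.y1, d),   # index 2: low x, high y
        Node(mx, my, node.x1, node.y1, d),   # index 3: high x, high y
    ]
    pending, node.businesses = node.businesses, []
    for business in pending:                 # push existing entries down
        insert(node, *business)

def insert(node, business_id, x, y):
    while node.children:                     # descend to the matching leaf
        node = _child_for(node, x, y)
    node.businesses.append((business_id, x, y))
    if len(node.businesses) > MAX_PER_LEAF and node.depth < MAX_DEPTH:
        _split(node)

def leaves(node):
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

# Hypothetical data: 1000 businesses at random coordinates.
rng = random.Random(0)
root = Node(-180.0, -90.0, 180.0, 90.0)
for business_id in range(1000):
    insert(root, business_id, rng.uniform(-180, 180), rng.uniform(-90, 90))
leaf_sizes = [len(leaf.businesses) for leaf in leaves(root)]
```

Dense areas end up with many small quadrants and sparse areas with a few large ones, which is the advantage over an even grid.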

22
Q

What questions would you ask before starting your design?

A
  • Are we focusing on the backend only or are we developing the front-end too?
  • What are we storing (images, videos, text)?
  • Do we need to search?
  • What scale is expected from the system?
  • How much storage?
  • What network bandwidth is needed?
  • What are the expected APIs (examples, inputs/outputs)?
  • What kind of database will be used?
23
Q

What are some bottlenecks to consider when designing your architecture

A
  • Are there single points of failure, and how can they be mitigated?
  • Is there enough data replication?
  • Are there enough copies of each service?
  • How will service performance be monitored? Are there alerts?
24
Q

What are the key characteristics of Distributed Systems

A

SEARS
Scalability
Efficiency
Availability
Reliability
Serviceability or manageability

25
Q

What are the benefits of load balancing

A
  • Faster uninterrupted service
  • Less downtime and higher throughput
  • Easier to handle incoming requests
  • Fewer failed or stressed components
  • Predictive analytics
26
Q

How does the load balancer choose the backend server?

A
  • First, it ensures that any server it may choose is actually responding appropriately to requests (health checks)
  • Then it uses a pre-configured algorithm to select one from the set of healthy servers
27
Q

What are load balancing methods?

A
  • Least Connection Method
  • Least Response Time Method
  • Least Bandwidth Method
  • Round Robin Method
  • Weighted Round Robin Method
  • IP Hash: a hash of the IP address of the client is calculated to redirect the request to a server
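
A sketch of three of the methods above, applied only to servers that pass a health check. The LoadBalancer class, its method names, and the server names are assumptions for illustration, not any particular product's API.

```python
import hashlib

class LoadBalancer:
    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self.active = {s: 0 for s in self.servers}  # open connections per server
        self._next = 0

    def mark(self, server, is_healthy):
        """Record the result of a health check."""
        if is_healthy:
            self.healthy.add(server)
        else:
            self.healthy.discard(server)

    def _pool(self):
        pool = [s for s in self.servers if s in self.healthy]
        if not pool:
            raise RuntimeError("no healthy backend servers")
        return pool

    def round_robin(self):
        pool = self._pool()
        server = pool[self._next % len(pool)]
        self._next += 1
        return server

    def least_connections(self):
        return min(self._pool(), key=lambda s: self.active[s])

    def ip_hash(self, client_ip):
        pool = self._pool()
        digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
        return pool[int.from_bytes(digest[:8], "big") % len(pool)]

lb = LoadBalancer(["app-1", "app-2", "app-3"])
first_four = [lb.round_robin() for _ in range(4)]
lb.mark("app-2", False)              # app-2 failed its health check
lb.active.update({"app-1": 7, "app-3": 2})
least_loaded = lb.least_connections()
```

With IP hash, the same client address keeps landing on the same server for as long as the healthy pool does not change.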
28
Q

List the types of caches in system architecture

A
  • Application server: Placing a cache directly on a request layer node enables the local storage of response data
  • Content Distribution Network (CDN): a kind of cache that comes into play for sites serving large amounts of static media
29
Q

What are the cache invalidation schemes

A

If the data is modified in the database, it should be invalidated in the cache; if not, this can cause inconsistent application behavior.

  • Write-through cache: data is written into the cache and the corresponding database at the same time
  • Write-around cache: data is written directly to permanent storage, bypassing the cache
  • Write-back cache: data is written to cache alone and completion is immediately confirmed to the client.
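
A sketch contrasting the three write policies, with plain dictionaries standing in for the cache and the database. The CachedStore class and its method names are assumptions for illustration.

```python
class CachedStore:
    def __init__(self):
        self.cache = {}
        self.db = {}
        self.dirty = set()   # keys written to the cache but not yet to the db

    def write_through(self, key, value):
        """Write to the cache and the database together."""
        self.cache[key] = value
        self.db[key] = value

    def write_around(self, key, value):
        """Write to the database only; drop any stale cached copy."""
        self.db[key] = value
        self.cache.pop(key, None)

    def write_back(self, key, value):
        """Write to the cache only and confirm; the database catches up later."""
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        """Persist deferred write-back entries."""
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db[key]          # cache miss: load and populate
        self.cache[key] = value
        return value

store = CachedStore()
store.write_through("a", 1)
store.write_around("b", 2)
store.write_back("c", 3)
in_db_before_flush = "c" in store.db
store.flush()
```

Write-around makes the first read of a freshly written key a cache miss, and write-back leaves a window in which data exists only in the cache and would be lost if the cache failed before the flush.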
30
Q
A