Systems Design Foundations Flashcards

1
Q

What are the 6 steps for solving a Systems Design problem?

A

HelloInterview
Start with what is interesting about this problem when possible.

Requirements
NFRs
Entities
API
Step Through design
Deep Dives/NFRs

IK
1. Define Requirements
2. Define Services
3. Define Architecture
4. Define Microservice
____4a. Define Service Anatomy
____4b. Define Scaling
____4c. Define Distributed System

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the 6 concepts that should be considered for scaling in a Systems Design problem?

A
  • Storage: Size of data is so big that it cannot fit on a single machine.
  • Throughput: If the number of requests per second is too huge, need to scale for throughput.
  • Latency: If the response time is too high because of bulky processing, need to parallelize the computation.
  • Availability: Availability / Reliability in the face of faults
  • Geolocation: Minimize network latency by using multiple servers at different locations.
  • Hotspot: Disproportionately high load on a section of the data.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are the types of systems?

A
  • Online
  • Batch
  • Stream
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What data structures and concepts can you use to help solve algorithm problems?

A
  • Decrease and Conquer
    ____* BST (sorted Tree)
    ____* Two Pointer
  • Divide and Conquer
    ____* Merge Sort
  • Transformation and Conquer
    ____* Balanced Tree
    ____* Map
    ____* Presorted list
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How many requests per second is 1 billion requests a month?

A

385 clicks per second. Just round this to 400 to make the math easier.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How many requests per second is 1 billion requests a year?

A

30 clicks per second.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are the powers of 2^n from 0 to 15?

A

2^1 = 2
2^2 = 4
2^3 = 8
2^4 = 16
2^5 = 32
2^6 = 64
2^7 = 128
2^8 = 256
2^9 = 512
2^10 = 1,024
2^11 = 2,048
2^12 = 4,096
2^13 = 8,192
2^14 = 16,384
2^15 = 32,768
2^16 = 65,536
2^17 = 131,072
2^18 = 262,144
2^19 = 524,288
2^20 = 1,048,576

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What are the powers of 2^n for 10, 20, 30, 40, 50?

A
  • 2^10 = 1024
  • 2^20 is approx 1 million
    ____* So we need 20 bits to represent 1 million
  • 2^30 is approx 1 billion
    ____* So we need 30 bits to represent 1 billion
  • 2^40 is approx 1 trillion
    ____* So we need 40 bits to represent 1 trillion
  • 2^50 is approx 1 quadrillion
    ____* So we need 50 bits to represent 1 quadrillion
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What are the stats of a standard commodity machine?

A

Storage RAM
* 128 GB RAM

Storage Disk Space
* 1 to 2 TB of disk space
* Design for 3 years of data

Throughput
* 8 to 12 cores
____* 8 threads per core
____* 30% utilization
____* throughput = 30,000 / x where x is the time to finish the average request.
____* Num Servers = (req / sec * avg req time) / 30,000

Latency
* gigabit connection is 1 billion bits per second or ~ 125 million bytes per second (1 billion / 8)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

When comparing throughput vs storage, when can you argue that you don’t need a cache?

A

If the storage layer needs more servers to handle the storage requirements than the app or storage layer needs for throughput. i.e. If you have enough servers to handle throughput, you might not need a cache. This isn’t the only reason why you would need a cache though.

If throughput is governing scalability, you can potentially use the cache to reduce the cost of the storage layer and the overall system.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

O(?) of what constant is approximately 100 ms.

A

100 ms: O(10,000,000)
1 ms: O(100,000)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What are the popular distributed file storage tools?

A

HDFS, AWS S3, DropBox

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Which databases allow for high writes?

A

Cassandra, BigTable, Dynamo were all created based on the BigTable white paper from Google.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What architecture do high-write databases (e.g., Cassandra, BigTable, Dynamo) employ?

A

Log-Structured Merge (LSM) trees.
* Key idea is that it is an append only datastore which means reads are more complicated because it needs to merge across multiple SSTables for read.
* First writes to a Write Ahead Log (WAL) to ensure durability of data in case of crash.
* Then rights to mutable memtable (sorted tree)
* When tree is full it is flushed into an Sorted String Table (SSTable) which is immutable and stored in sequence on disk.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is the difference between Hash-Based partitioning and Range Based Partitioning?

A

Hash-Based Partitioning
Definition: Each key and all its associated data is fully stored on one instance.
Use Case: This method is commonly used in distributed databases like Cassandra or MongoDB.
Pros: Data locality, Consistency, Scalability
Cons: Hotspots, Consistent Hashing is needed for rebalancing.

Range-Based Partitioning
Definition: The data for a single key is split across multiple instances, where each instance holds part of the data for that key. Done based on a value in the data such as a time range or another column.
Use Case: This approach is used in distributed systems that need to handle very large objects or where partitioning based on certain ranges of data is essential.
Pros: Can handle very large data sets and parallelize access to a single key.
Cons: Expensive loss of throughput, complicated merging of a single doucment.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is the rule of thumb on cache size based on Zipf’s Law?

A

For a 90% cache hit rate, you need 10% of the most frequently used data in the cache.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Define SLI, SLO, and SLA

A

Service Level Indicator (SLI): is a specific metric that helps companies measure some aspect of the level of services to their customers. e.g. Request Latency, Error Rate, System Throughput, Availability, Yield, Durability
Service Level Objective (SLO): Target value/range for an SLI
Service Level Agreement (SLA): Explicit contract with users on what the SLOs are and what the consequences are for missing.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Different Caching Strategies?

A

Pass-Through/Write-Through: App code sends to cache, the cache owns writing to disk.
Cache-Aside: App code handles writing to data store/disk and to cache.
Write-back: App code writes to cache, cache immediately responds back to app code and asych writes to data store/disk.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What is an ACID transaction?

A

Atomic: All or nothing.
Consistent: One valid state at a time.
Isolation: Transaction is isolated from others.
Durable: Changes are permanent

20
Q

What is CAP Theorem?

A

Consistency (C):
Definition: Consistency means that all nodes in a distributed system see the same data at the same time. When a write is performed on one node, it is immediately reflected on all other nodes.
Implication: Every read request receives the most recent write. If you query two different nodes, you will get the same result.

Availability (A):
Definition: Availability ensures that every request (read or write) receives a response, regardless of whether the response contains the most recent write.
Implication: The system is operational and accessible at all times, even if some nodes are down. Every request is answered, but the data may not be the latest version if a failure has occurred.

Partition Tolerance (P):
Definition: Partition tolerance means that the system continues to operate despite network partitions that cause some nodes to be unable to communicate with others.
Implication: The system can tolerate network failures or delays between nodes and still provide service.

21
Q

What are the 5 common Non-Functional Requirements to consider?

A
  1. The 6 Scalability concerns
  2. Data: CAP
  3. Data: Read vs Write
  4. Value, Usability, Viability (Cost vs Expected Revenue), Feasibility
  5. Good, Fast, Cheap: G + F = NOT C, G + C = NOT F, F + C = NOT G
22
Q

What are common components of a Microservice to discuss?

A

Service, Storage (Database/Cache)
1. Service
____a. Protocols
____b. Validation
____c. Controller (Handles protocols and execution patterns: (HTTP, TCP, UDP, Lambda Invoke, Module invoke)
____d. Services (A series of steps or modules that need to be executed).
____e. Modules (A series of steps that need to be executed).
____e. Proxy layer
____f. Repository (Data Access Layer)
____g. Error Handling
2. Storage
____a. Replication
____b. Peer-to-Peer or Leader/Follower

23
Q

What are the two Planes of a Microservice?

A

Control Plane
Components: Cluster Manager.
Job: Managing the operations and behavior of the system (e.g. tells the load balancer how to route to app servers and owns the number of app servers running successfully).

Data Plane
Components: Load Balancers & App Server
Job: Where the work happens

24
Q

What are some approaches for handling deadlocks?

A

There are quite a few. For the purpose of the interview this should be good enough.
1. Enforce common sense timeout for getting the locks and for holding the locks (lease based)
2. Enforce a lock ordering to ensure all components acquire locks in the same order.

25
Q

For mobile, what are the two different notification tools?

A

Apple: Apple Push Notification
Android: Firebase Cloud Messaging

26
Q

What is the concept used for proximity searching?

A

GeoHashing

Redis is an in-memory data store that supports geospatial data types and commands. It uses geohashing to encode latitude and longitude coordinates into a single string key, which is then indexed using a sorted set. This allows for efficient storage and querying of geospatial data.

27
Q

What are the 1000^x up to 1000^5?

A

Power of 1000 (1000^x) Number Prefix
0 Unit
1 Thousand Kilo
2 Million Mega
3 Billion Giga
4 Trillion Tera
5 Quadrillion Peta
______________________________________________
1,000^1 = 1 kilobytes
1,000^2 = 1,000,000 = 1 megabyte
1,000^3 = 1,000,000,000 = 1 gigabyte
1,000^4 = 1,000,000,000,000 = 1 terabyte
1,000^5 = 1,000,000,000,000,000 = 1 petabyte
_______________________________________________
1 Million bytes is 1 megabyte.

28
Q

What are 4 common latency metrics (ie memory/disk/network.)

A

Action Time Comparison
Reading 1mb sequentially from memory 0.25ms
Reading 1mb sequentially from SSD 1ms 4x memory
Reading 1mb sequentially from spinning disk 20ms 20x SSD
Round trip network latency CA to Netherlands 150ms

29
Q

What are common storage sizes to know?

A

Item Size
1 hour video 2gb HD –> 1gb –> 500mb Low Res
A small book of plain text 1mb
A high-resolution photo 1mb
A medium-resolution image (or a site layout graphic) 100kb

30
Q

What are common usage levels for common services?

A

Metric Order of Magnitude
Daily active users of major social networks O(1b)
Hours of video streamed on Netflix per day O(100m)
Google searches per second O(100k)
Instagram Feed Requests per second O(50k)
Size of Wikipedia O(100gb)

31
Q

What 4 topics will help me identify NFRs?

A
  1. CAP theorem: Does this system prioritize availability or consistency? Note, that in some cases, the answer is different depending on the part of the system – as you’ll see is the case here.
  2. Read vs write ratio: is this a read heavy system or write heavy? Are either notably heavy for any reason?
  3. Fault Tolerance: Are there any interesting error scenarios?
  4. Usage Patterns or Query access pattern: Is the access pattern of the system regular or are their patterns or bursts that require particular attention. For example, the holidays for shopping sites or popular events for ticket booking.
32
Q

What is a good estimate for seconds in a day?

A

1 day = 24 hours/day × 60 minutes/hour × 60 seconds/minute = 86400 seconds/day

100,000 seconds in a day

33
Q

What is multi-part upload of a file?

A

Problem: If a file takes a long time to upload, we don’t want a user to have to restart an upload on failure.
Solution: Multi Part Upload

  • Break the file into chunks
  • Use fingerprinting to identify each chunk. This is just a hash of the chunk bytes Hash(bytes).
  • Send each chunk and keep track of success using the fingerprint.
34
Q

For distributed queue, what are some approaches for handling a hot partition?

A
  1. Remove the key that is causing the issue.
  2. Create a compound key partitionId:1-10 or use AdId:Userid
  3. Backpressure. Force the producer to slow down.
35
Q

What is the formula for how long it takes to send a file in over a network?

A

(File size in MB * 8) / connection speed

1000 mb * 8 / 1000 mbps = 8 seconds
100 mb * 8 / 1000 mbps = .8 seconds
1000 mb * 8 / 100 mbps = 80 seconds

36
Q

How many SSE / WebSocket connections can single server support?

A

~65k because that is the number of ports on the server.

37
Q

Layer 4 vs Layer 7 Load Balancer?

A

Layer 4
* Transport Layer
* Routes using IP and Port
* No inspection of data
* Just forward the data
* Use Case: The majority should just use this.

Layer 7
* Application Layer.
* Routes using data.
* Use Case: if you need to use the data in the request to decide how to route.
* It is more expensive because it requires more computing power and takes more time to process requests.

38
Q

What is a drawback with using SSE?

A

Separation of Concerns
Chat with websockets
* Websocket connection is setup by server and we can have the same logic for partitioning one place.

Chat with SSE
* You now need to have knowledge of partitioning in two places, backend for message processing and in your Laqyer 7 LB.

39
Q

What do you call it when we write data to disk and read from disk.

A

Serialization/Deserialization

Serialization has a different meaning in transactions so we can also use

Encoding and decoding.

40
Q

How many people in the world and USA?

A

USA: 300 million
World: 8 billion

41
Q

What are two different approaches for handling real time changes in a Google Docs?

A

CRDTs
Operational Transformation

42
Q

When using a leaderless replication strategy, what are two strategies taken to help a node catch up if it goes offline.

A
  1. Read Repairs. When clients are reading from others, they might see that another replica is missing a write. In this case, they can send the write.
  2. Use an anti-entropy process which runs in the background attempting to add missing writes to all instances.
43
Q

When using leaderless replication, what are quorum reads/writes?

A

If R + W >= N then we have a Quorum.

R is the number of nodes we read from
W is the number of nodes with consistent writes.
N is the number nodes.

  • If we have 2 + 2 >= 3 then we can support 1 node failure.
  • If we have 3 + 3 >= 5 then we can support 2 node failures.
44
Q

When using leaderless replication, how do we define concurrency?

A

Two writes are concurrent if neither happens before the other. Ie if the writes don’t know about each other.

3 possibilities
1. A comes before B
2. B comes before A
3. A and B are concurrent

45
Q

Compare leaderless vs leader/follower replication.

A
  • Leader/Follower: Easy to reason about, no collisions to worry about, if writer goes down failover will not be immediate, can lose data during failover.
  • Leaderless: Complicated to understand, many ways to get to quorum, always on, collisions are a problem.
46
Q

Compare GeoHashing vs QuadTrees?

A

Geohash:
- MY ANSWER: I am always going to start with GeoHash because it makes my system simpler and easier to reason about.
- Fixed precision. The world is geohashed already.
- Can handle very high writes (Redis 1 million TPS)
- Predefined so it is easier to understand, human readable.

Quadtree:
- You need to define what the quad tree represents on a map.
- Adaptive/Dynamic precision. We can create a quadtree starting from anywhere.
- Cannot handle high writes because need to keep tree balanced.

47
Q

What would the bit rate be for a 1 GB file?

A

For video, the bitrate can be estimated by dropping 1 unit in size and doubling the number. So 1 GB file would be around the 2 megabits per second.