System Design Flashcards

1
Q

Delivery Framework

A
  1. Requirements
  2. Core Entities
  3. API or System Interface
  4. Data Flow
  5. High Level Design
  6. Deep Dive
2
Q

Requirements

A
  1. Functional Requirements - “Users/clients should be able to…” (top 3)
  2. Non-functional Requirements - “System should be / should be able to…” (top 3)
  3. Capacity Estimations
3
Q

Nonfunctional Requirements Checklist (8)

A
  1. CAP theorem - in a distributed system partition tolerance is a given, so the real trade-off is consistency vs. availability
  2. Environment constraints, e.g. battery life or limited memory
  3. Scalability - unique requirements such as bursty traffic or read/write ratio
  4. Latency - especially for anything with meaningful computation
  5. Durability - how important it is that data is not lost
  6. Security, e.g. data protection, access control
  7. Fault tolerance, e.g. redundancy, failover, recovery mechanisms
  8. Compliance, e.g. legal or regulatory requirements or standards
4
Q

Bytes to store data

A

ASCII - 1 byte per character
Unicode - 2 bytes per character (common planning estimate; UTF-8 actually uses 1-4 bytes per character)

5
Q

Split seconds

A

Millisecond (ms) 1/1,000
Microsecond (µs) 1/1,000,000
Nanosecond (ns) 1/1,000,000,000

6
Q

Read latency

A

Memory
1 MB in 0.25 ms (~4 GB/s)

SSD (4x slower than memory)
1 MB in 1 ms (~1 GB/s)

Disk (20x slower than SSD)
1 MB in 20 ms

Worldwide network round trip
~150 ms each, so ~6 per second

7
Q

Request Calculations by second

A

~2.5 million seconds per month

1 million per month = 0.4/s
2.5 million per month = 1/s
10 million per month = 4/s
100 million per month = 40/s
1 billion per month = 400/s
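
A quick sanity check of these conversions (a minimal sketch; 2.5M is the rounded second count of a 30-day month, 30 × 86,400 ≈ 2.6M):

# Back-of-envelope: round a month to 2.5M seconds so that
# 2.5M requests/month works out to exactly 1/s.
SECONDS_PER_MONTH = 2_500_000

for monthly in (1_000_000, 2_500_000, 10_000_000, 100_000_000, 1_000_000_000):
    print(f"{monthly:>13,}/month = {monthly / SECONDS_PER_MONTH:.1f}/s")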

8
Q

Storage estimates:
2 hr movie
Small plain text book
High res photo
Med res image

A

Movie ~1 GB
Book ~1 MB
Photo ~1 MB
Med-res image ~100 KB

9
Q

DB Writes vs Reads

A

Rule of thumb: a write is ~40x more expensive than a read

10
Q

Core Entities

A

Spend ~2 min.

The entities the API will exchange and the data model will persist, e.g. User, Tweet, Follow for Twitter.

Present as a bullet list.

11
Q

API or System Interface

A

RESTful or GraphQL

Endpoints with paths and parameters

12
Q

Data Flow

A

Actions or processes that the system performs on the input to produce the desired outputs

13
Q

Core Concepts

A

Scaling - work distribution and data distribution

Consistency

Locking

Indexing

Communication Protocols

Security - authentication and authorization, encryption, data protection

Monitoring - infrastructure, system level, application level

14
Q

Key Technologies

A

Core DB
Blob storage
Search optimized DB
API gateway
Load balancer
Queue
Streams / event sourcing
Distributed lock
Distributed cache
CDN

15
Q

Patterns

A

DB backed CRUD with caching

Async job worker pool

2 stage architecture

Event driven architecture

Durable job processing

Proximity based services

16
Q

Core API - high level overview

A

“Our Core API uses a layered .NET architecture, deployed in EKS. Controllers
handle HTTP routing, Services handle business logic, and a Data layer interacts
with Aurora and Redis. This lets us scale the service horizontally while keeping
the codebase maintainable.”

17
Q

Core API - layered architecture justification

A

“We wanted to separate concerns—controllers focus on HTTP requests, services
encapsulate domain rules, and our data layer deals with Aurora and caching. This
approach cuts down on coupling and makes it easier to adapt or extract
microservices down the road.”

18
Q

Core processor - explanation

A

“We have a central ETL pipeline—the Core Processor—which ingests data from
multiple providers, stores raw payloads in S3, and then transforms/loads it into Aurora.

Tasks run on a cron-based scheduler and retry on failure with exponential backoff, ensuring resilience even if a provider is temporarily down.”
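
A minimal sketch of that retry-with-exponential-backoff behavior (illustrative only; fetch_from_provider and the limits are hypothetical, not the actual Core Processor code):

import random
import time

def fetch_with_backoff(fetch_from_provider, max_attempts=5, base_delay=1.0):
    """Retry a flaky provider call, roughly doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch_from_provider()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error
            # Exponential backoff with jitter: ~1s, 2s, 4s, 8s between attempts
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))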

19
Q

Core API - why K8s?

A

“Kubernetes gave us automated scaling and rolling updates out of the box. We
can spin up more pods during major sporting events and scale back when traffic is
low, all while ensuring near-zero downtime.”

20
Q

Core API - EKS rolling updates

A

“We use a rolling update strategy so that when deploying a new version of the
API, only one old pod goes down at a time—our system stays online, and if
something fails, we can roll back quickly.”

21
Q

Core API - stateless pods

A

“Even though our application manages a lot of data, we designed each pod to be stateless. Any persistent data—sessions, user info, or stats—resides in Aurora, Redis, or S3.

That means losing a pod doesn’t risk losing data.”

22
Q

Core API - Ingress and Helm templating

A

“We have an internal ALB that terminates TLS and checks liveness via /health.
The ALB is configured via Ingress annotations in our Helm chart, ensuring only
healthy pods receive requests. We define everything in Helm charts, from replicas
and resource limits to Ingress rules. Environment-specific overrides like
values-stage.yaml and values-prod.yaml let us run the same code in staging vs.
production with minimal overhead.”

23
Q

Core API - CI/CD pipeline

A

“We use CircleCI to build Docker images, run tests, push the image to ECR, then
automatically update our Helm chart. If linting or validation fails, the deployment
never proceeds—meaning we catch issues before they hit production.”

24
Q

Core API - automatic rollbacks

A

“Our pipeline can roll back a Helm release if we detect a spike in 500 errors or
failing health checks. That safety net lets us move fast and confidently ship
updates.”

25
Q

Core API - environment specific builds

A

“For each commit on the ‘master’ branch, CircleCI sets DOTNETCORE_ENVIRONMENT=production and deploys to our production
cluster. For ‘stable’, it uses stage—we keep these pipelines consistent, ensuring
minimal drift.”

26
Q

Core API - Redis caching

A

“We cache frequently requested data in Redis—like top odds or event stats—for short TTLs.
This offloads read traffic from Aurora and drastically reduces latency on hot endpoints.”
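
A minimal cache-aside sketch of that pattern (assumes the redis Python client; the key name, TTL, and query_aurora_for_odds are illustrative, not the actual Core API code):

import json
import redis

r = redis.Redis()  # assumed local Redis, for illustration

def get_top_odds(event_id: str, ttl_seconds: int = 30):
    """Serve from Redis when possible; fall back to the DB and repopulate."""
    key = f"odds:top:{event_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)               # cache hit: no DB round trip
    data = query_aurora_for_odds(event_id)      # hypothetical DB call
    r.set(key, json.dumps(data), ex=ttl_seconds)  # short TTL bounds staleness
    return data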

27
Q

Core API - in memory caching

A

“Each pod has an in-memory cache for micro-optimizations, but it’s not critical if a pod restarts—it’s purely ephemeral. That’s a classic stateless approach, as all permanent state lives in external data stores.”
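
A process-local sketch of that ephemeral cache using the standard library (the real service is .NET; lookup_team_name is hypothetical):

from functools import lru_cache

@lru_cache(maxsize=1024)
def team_display_name(team_id: int) -> str:
    """Cached per process; lost on restart, which is fine for ephemeral data."""
    return lookup_team_name(team_id)  # hypothetical slow lookup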

28
Q

Core API - Metrics

A

“We use Prometheus and Grafana for real-time visibility into common and custom metrics. That data helps us spot anomalies or performance regressions fast, and Grafana alerts trigger Slack notifications to the proper team.”

29
Q

Core API - Rollbar

A

“Any exception in the Core API automatically logs to Rollbar, and critical errors trigger Slack notifications to the proper teams. During a major sporting event, if we see a surge of 500 errors, we can quickly pinpoint which endpoint or DB call is failing.”

30
Q

Core API - latency tracking

A

“We keep a histogram of HTTP request durations. By tracking P95 and P99 latencies, we ensure that even our worst-case requests stay within acceptable bounds, especially during heavy game traffic.”
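
A minimal sketch of that histogram using the prometheus_client library (metric, label, and handler names are illustrative):

from prometheus_client import Histogram

# Buckets chosen to resolve tail latencies; P95/P99 are computed server-side
# in Prometheus via histogram_quantile() over these buckets.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

def handle_request(endpoint: str):
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        return process(endpoint)  # hypothetical request handler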

31
Q

Core API - CAP theorem

A

“We operate in a distributed AWS environment, so partition tolerance is mandatory.

We typically favor high availability over strict consistency by reading from Aurora replicas—though the primary itself is strongly consistent.

That means brief eventual consistency for read workloads, which is acceptable for this domain.”

32
Q

Core API - consistency

A

“We do strongly consistent writes to Aurora’s primary. But for reads—especially from replicas or caches—we accept short-lived eventual consistency.
The lag is usually small, and it’s worth it to maintain high throughput under load.”

33
Q

Core API - security

A

“All local developers must assume an MFA-secured AWS role. Secrets are stored in Parameter Store or K8s secrets, meaning we never expose plain-text creds in code or logs.”

34
Q

Core API - internal ALB

A

“We use an internal ALB for traffic, so it’s not publicly accessible. On top of that, Kubernetes role-based access control (RBAC) restricts who can modify deployments or read secrets, ensuring a tight security posture.”

35
Q

Core API - estimating capacity

A

“We measure requests-per-second during major sporting events and compare it to CPU/memory usage. If we see pods hitting 80% CPU or if DB queries approach saturation, we scale out.
Aurora read replicas handle the read spikes, and Redis further reduces direct DB hits.”

36
Q

Core API - main bottleneck

A

“Ultimately, Aurora can become the bottleneck for heavy writes. We mitigate that with indexing, short caches, and read replicas. If needed, we could further partition data, but so far Aurora’s performance has met our needs.

Nevertheless, I recently built an archiving task that runs nightly and archives all market lines more than 18 months old, which moved a few hundred million records out of a terabyte-scale table.”

37
Q

PSO - centralized data for all properties

A

Problem: Multiple newly acquired properties each ingested sports data differently, creating inconsistencies.

Solution: We built a Core API on .NET, containerized on EKS, and standardized data ingestion via the Core Processor.

Outcome: We reduced duplication, established a single source of truth, and scaled seamlessly for peak sports seasons.

38
Q

PSO - zero downtime deployments

A

Problem: Rolling updates were risky with older infrastructure, often causing partial outages.

Solution: By using Helm with rolling updates and readiness probes, we can gradually shift
traffic to new pods while old pods are drained.

Outcome: Near-zero downtime deploys and the ability to roll back quickly if metrics or logs show a spike in errors.

39
Q

PSO - real-time observability

A

Problem: We lacked insight into production performance; debugging took hours.

Solution: We integrated Telegraf for metrics and Rollbar for error logs.

Outcome: The moment error rates spike, we get Slack alerts and can see exactly which
endpoints or queries are failing, cutting response times in half.

40
Q

Single Responsibility Principle

A

Classes should have a single responsibility and only one reason to change. Everything a class does should be closely related, so the class doesn’t become bloated.
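
A tiny sketch of the idea (hypothetical classes): formatting and persistence change for different reasons, so they live apart.

# Violates SRP: report building AND persistence in one class.
class Report:
    def build(self) -> str: ...
    def save_to_disk(self, path: str): ...

# Follows SRP: each class has one reason to change.
class ReportBuilder:
    def build(self) -> str: ...

class ReportStore:
    def save(self, report: str, path: str): ...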

41
Q

Open-Closed Principle

A

Code should be open to extension but closed to modification. Instead of modifying existing code, we can create a subclass that inherits from the base, or use extension methods.
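
A minimal sketch (hypothetical discount example): new behavior arrives as a new subclass, with no edits to existing code.

from abc import ABC, abstractmethod

class Discount(ABC):
    @abstractmethod
    def apply(self, price: float) -> float: ...

class PercentOff(Discount):
    def __init__(self, pct: float): self.pct = pct
    def apply(self, price: float) -> float: return price * (1 - self.pct)

# Extension, not modification: checkout() is untouched by the new subclass.
class FlatOff(Discount):
    def __init__(self, amount: float): self.amount = amount
    def apply(self, price: float) -> float: return max(0.0, price - self.amount)

def checkout(price: float, discount: Discount) -> float:
    return discount.apply(price)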

42
Q

Liskov Substitution Principle

A

A child class should be substitutable wherever the parent is expected: it should be able to do everything the parent class can, without changing expected behavior.
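
The classic rectangle/square sketch of a substitution that breaks (hypothetical code):

class Rectangle:
    def __init__(self, w, h): self.w, self.h = w, h
    def set_width(self, w): self.w = w
    def area(self): return self.w * self.h

class Square(Rectangle):
    # Violates LSP: setting the width silently changes the height too.
    def set_width(self, w): self.w = self.h = w

def stretch(rect: Rectangle):
    original_height = rect.h
    rect.set_width(10)
    # Holds for any true Rectangle; fails for a Square whose side wasn't 10.
    assert rect.area() == 10 * original_height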

43
Q

Interface Segregation Principle

A

Clients should never be forced to implement an interface they don’t use, or to depend on methods they don’t use.
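
A minimal sketch (hypothetical printer example): narrow interfaces mean a simple client implements only what it uses.

from abc import ABC, abstractmethod

# A single fat interface would force SimplePrinter to stub out scan().
class Printer(ABC):
    @abstractmethod
    def print_doc(self, doc: str): ...

class Scanner(ABC):
    @abstractmethod
    def scan(self) -> str: ...

class SimplePrinter(Printer):  # depends only on what it actually needs
    def print_doc(self, doc: str): print(doc)

class MultiFunctionDevice(Printer, Scanner):
    def print_doc(self, doc: str): print(doc)
    def scan(self) -> str: return "scanned page"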

44
Q

Dependency Inversion

A

High-level modules shouldn’t depend on low-level modules. Both should depend on abstractions.
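
A minimal sketch (hypothetical names): the high-level service depends on an abstraction, and concrete stores plug in underneath.

from abc import ABC, abstractmethod

class UserStore(ABC):  # the abstraction both sides depend on
    @abstractmethod
    def get(self, user_id: int) -> dict: ...

class AuroraUserStore(UserStore):  # low-level detail
    def get(self, user_id: int) -> dict:
        return {"id": user_id, "source": "aurora"}

class UserService:  # high-level module
    def __init__(self, store: UserStore):  # depends on the abstraction only
        self.store = store

    def profile(self, user_id: int) -> dict:
        return self.store.get(user_id)

service = UserService(AuroraUserStore())  # concrete wiring happens at the edge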

45
Q

Pattern: CRUD service

A

The most common and simplest pattern. Backed by a DB and cache, fronted by an API gateway and load balancer.

Client -> API Gateway -> Load Balancer -> Service -> Cache -> Database

46
Q

Pattern: async job worker pool

A

For systems that do a lot of processing and can tolerate some delay. Good for processing images and videos.

Queue -> Workers -> Database
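
A minimal sketch with the standard library (process_image and the job contents are hypothetical):

import queue
import threading

jobs: queue.Queue = queue.Queue()

def worker():
    while True:
        job = jobs.get()
        if job is None:        # sentinel: shut this worker down
            break
        process_image(job)     # hypothetical slow, delay-tolerant work
        jobs.task_done()

# A small pool drains the queue concurrently; producers just enqueue and move on.
threads = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
for t in threads:
    t.start()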

47
Q

Pattern: 2 stage architecture

A

Good for recommendations, search, and route planning. A fast but inaccurate stage narrows the candidates, then a slow but precise stage finishes the job.

Vector DB (fast but inaccurate, fed from blob storage) -> Ranking service (slow but precise)
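
A toy sketch of the two stages (cheap_score and precise_score are hypothetical): the cheap pass narrows millions of items to a small candidate set, and the expensive pass ranks only those.

def two_stage_search(query, items, k_candidates=100, k_final=10):
    # Stage 1: fast but inaccurate - a cheap score over everything,
    # like an approximate nearest-neighbor lookup in a vector DB.
    candidates = sorted(items, key=lambda x: cheap_score(query, x),
                        reverse=True)[:k_candidates]
    # Stage 2: slow but precise - an expensive model over the survivors only.
    return sorted(candidates, key=lambda x: precise_score(query, x),
                  reverse=True)[:k_final]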

48
Q

Pattern: Event driven architecture

A

Centered around events; good for systems that need to react to changes in real time, e.g. e-commerce when an order is placed.

Event producers -> event routers/brokers (Kafka or EventBridge) -> event consumers that process the events and take the necessary actions
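
A minimal in-process sketch of the producer -> broker -> consumer flow (a stand-in for Kafka/EventBridge; the handlers are hypothetical):

from collections import defaultdict

subscribers = defaultdict(list)  # topic -> list of consumer callbacks

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, event):  # plays the router/broker role
    for handler in subscribers[topic]:
        handler(event)

# E-commerce example: several consumers react to one order event.
subscribe("order.placed", lambda e: print("charge payment for", e["id"]))
subscribe("order.placed", lambda e: print("reserve inventory for", e["id"]))
publish("order.placed", {"id": 42})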

49
Q

Pattern: Durable job processing

A

For long-running jobs. Store jobs in something like Kafka, then a pool of workers processes them. Workers periodically checkpoint progress to a durable log, and if one crashes another can pick up where it left off.

Phase 1 workers -> distributed, durable log -> Phase 2 workers -> durable log -> Phase 3 workers
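
A toy sketch of the checkpointing idea (a dict stands in for the durable log; a real system would use Kafka or a database):

checkpoints = {}  # stand-in for a durable, replicated log

def run_job(job_id: str, steps: list):
    start = checkpoints.get(job_id, 0)  # resume where the last worker died
    for i in range(start, len(steps)):
        steps[i]()                      # do one unit of work
        checkpoints[job_id] = i + 1     # record progress after each step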

50
Q

Pattern: Proximity based services

A

E.g. Uber.

Divide the geographical area into regions and index entities within each region. This lets the system exclude vast areas that don’t contain relevant entities.
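
A toy grid-index sketch (real systems use geohashes, H3, or PostGIS): bucket entities by cell and search only the 3x3 neighborhood around the query point.

import math
from collections import defaultdict

CELL = 0.01  # grid cell size in degrees (roughly 1 km); tune per use case

grid = defaultdict(list)  # (cell_x, cell_y) -> entities located in that cell

def cell_of(lat, lon):
    return (math.floor(lat / CELL), math.floor(lon / CELL))

def add_driver(driver_id, lat, lon):
    grid[cell_of(lat, lon)].append((driver_id, lat, lon))

def nearby_drivers(lat, lon):
    cx, cy = cell_of(lat, lon)
    # Only the 3x3 neighborhood is scanned; all other regions are excluded.
    return [d for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            for d in grid[(cx + dx, cy + dy)]]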