System Design Flashcards

1
Q

When should you generally choose a NoSQL database?

A
  • the app requires very low latency
  • the data is unstructured or has no fixed schema
  • you need to store a massive amount of data (NoSQL stores are generally easier to scale horizontally)
2
Q

When should you use a relational database?

A
3
Q

What are advantages of database replication?

A
  • better performance (writes go to the master, reads are spread across the slaves)
  • reliability (if one DB is destroyed, the data is preserved in the replicas)
  • high availability (if one DB goes down, we can still serve requests from the replicas)
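A minimal sketch of the read/write split described above, using in-memory dictionaries as stand-ins for database nodes. The class name `ReplicatedDatabase` and its methods are hypothetical, and replication here is an immediate copy, whereas real systems usually replicate asynchronously.

```python
import itertools


class ReplicatedDatabase:
    """Toy router: writes go to the master, reads rotate across the replicas."""

    def __init__(self, replica_count):
        self.master = {}
        self.replicas = [{} for _ in range(replica_count)]
        # Round-robin iterator over replica indexes.
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        self.master[key] = value
        # Assumption: copy to every replica immediately (no replication lag).
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Reads never touch the master, which keeps it free for writes.
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)


db = ReplicatedDatabase(replica_count=2)
db.write("user:1", "alice")
first_read = db.read("user:1")   # served by replica 0
second_read = db.read("user:1")  # served by replica 1
```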
4
Q

What are some ways we can reduce the load/response time of a request?

A
  • implement a cache layer for common data requests
  • implement a CDN for static content
5
Q

When should we consider using a cache?

A
  • when there are far more reads than writes
  • when we want to reduce the response time of a request
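One common way to put a cache in front of a read-heavy data store is the cache-aside pattern. The sketch below is illustrative only: `CacheAside`, `load_from_store`, and the injectable `clock` are hypothetical names, and the data store is simulated by a plain function.

```python
import time


class CacheAside:
    """Check the cache first; on a miss, load from the store and remember it."""

    def __init__(self, load_from_store, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_store
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: no trip to the data store
        value = self._load(key)  # cache miss or expired entry
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        # Call this after a write so the next read sees fresh data.
        self._entries.pop(key, None)


store_calls = []


def slow_lookup(key):
    store_calls.append(key)  # record each trip to the "database"
    return key.upper()


cache = CacheAside(slow_lookup, ttl_seconds=10.0, clock=lambda: 0.0)
cache.get("a")
cache.get("a")  # second read is served from the cache
```

With many reads per write, most calls return from the dictionary and the store is hit only on misses, expiries, or invalidations.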
6
Q

What are some considerations to keep in mind when implementing a cache?

A
  • is it appropriate? (reads should heavily outnumber writes)
  • expiration policy
  • consistency (keeping the data store and cache in sync)
  • mitigating failures (multiple caches in different data centres; overprovision the required memory by a certain percentage)
  • eviction policy (i.e., what to do when the cache is full): least recently used (LRU) is the most common; least frequently used (LFU) and FIFO are alternatives
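To make the LRU eviction policy concrete, here is a small sketch built on `collections.OrderedDict`. The class name `LRUCache` and its `get`/`put` methods are illustrative choices, not a reference to any particular cache product.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used key once capacity is exceeded."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


lru = LRUCache(capacity=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")     # "a" is now more recent than "b"
lru.put("c", 3)  # cache is full, so "b" is evicted
```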
7
Q

What is a CDN?

A
  • content delivery network
  • third-party cache for static files (e.g., HTML/CSS/JavaScript files, images, videos, etc.)
8
Q

What are some considerations to think about when implementing a CDN?

A
  • cost: CDNs are third-party services, so caching infrequently used files costs money and provides little benefit
  • TTL (time to live/expiry time)
  • CDN failure: clients should be able to fall back to requesting from the origin if the CDN is unavailable
  • invalidating files: when a file changes, the copy in the CDN has to be invalidated. This can be done either through an API offered by the CDN provider, or by versioning the file and selecting the version through a query string
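The query-string versioning approach can be sketched as follows. This assumes the version is derived from a hash of the file contents; the function name `versioned_url`, the `v` parameter, and the example domain are all hypothetical.

```python
import hashlib


def versioned_url(base_url, file_bytes):
    """Append a short content hash so a changed file gets a new URL."""
    digest = hashlib.sha256(file_bytes).hexdigest()[:12]
    return f"{base_url}?v={digest}"


old_url = versioned_url("https://cdn.example.com/app.js", b"console.log('v1');")
new_url = versioned_url("https://cdn.example.com/app.js", b"console.log('v2');")
```

Because the URL changes whenever the content changes, the CDN treats the new version as a different object and the stale copy simply ages out under its TTL.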
9
Q

How do you keep your web tier stateless?

A
  • move user session data out of the web servers and into shared persistent storage (e.g., a database); NoSQL is a good choice because it’s easier to scale
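A rough sketch of the idea, with a dictionary-backed `SessionStore` standing in for the shared database. Both class names and the `login`/`whoami` methods are hypothetical; the point is only that no server keeps session data locally.

```python
import uuid


class SessionStore:
    """Stand-in for shared storage that every web server can reach."""

    def __init__(self):
        self._data = {}

    def save(self, session_id, data):
        self._data[session_id] = dict(data)

    def load(self, session_id):
        return self._data.get(session_id)


class WebServer:
    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared; the server itself holds no session state

    def login(self, user):
        session_id = uuid.uuid4().hex
        self.store.save(session_id, {"user": user})
        return session_id

    def whoami(self, session_id):
        session = self.store.load(session_id)
        return session["user"] if session else None


store = SessionStore()
server_a = WebServer("a", store)
server_b = WebServer("b", store)
session_id = server_a.login("alice")
user_seen_by_b = server_b.whoami(session_id)  # a different server handles the next request
```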
10
Q

Why is stateless architecture important?

A
  • handling failures (e.g., if a user’s session lives on one server, it is lost when that server fails)
  • adding and removing servers
  • load balancing
11
Q

What is a geo-DNS?

A
  • a DNS service that resolves the domain name to the IP address of the data centre closest to the user’s location (only useful when there are multiple data centres in different parts of the world)
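As a simplified illustration, the resolution step can be modelled as "pick the data centre with the smallest great-circle distance to the user". The data centre names, coordinates, and IP addresses below are made up (the IPs come from documentation-only ranges), and real geo-DNS services typically infer location from the resolver's or client's IP address rather than from coordinates.

```python
import math

# Hypothetical data centres: name -> IP and approximate coordinates.
DATA_CENTERS = {
    "us-east": {"ip": "192.0.2.10", "lat": 39.0, "lon": -77.5},
    "eu-west": {"ip": "198.51.100.10", "lat": 53.3, "lon": -6.3},
    "ap-southeast": {"ip": "203.0.113.10", "lat": 1.35, "lon": 103.8},
}


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def resolve(user_lat, user_lon):
    """Return the IP of the data centre nearest to the given coordinates."""
    nearest = min(
        DATA_CENTERS.values(),
        key=lambda dc: distance_km(user_lat, user_lon, dc["lat"], dc["lon"]),
    )
    return nearest["ip"]


paris_ip = resolve(48.9, 2.3)
new_york_ip = resolve(40.7, -74.0)
```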
12
Q

What technical challenges are involved with a multiple data centre setup?

A
  • traffic redirection (geo-DNS)
  • data synchronization (generally want to replicate data across multiple data centres)
  • test and deployment (want to make sure it’s all working the same between each data centre; automatic deployment tools are crucial)
13
Q

What is a message queue and what’s its purpose?

A
  • a message queue is a service that sits between producers and consumers and buffers the requests passed between them
  • a producer sends a task to the message queue; consumers then pick up tasks from the queue and perform them
  • an example is an app that supports photo processing. Processing takes time and resources, so you don’t want to do it on the same servers your clients are connected to. Instead, the web servers put each processing request (e.g., blur a photo) on a message queue, and the consumers (other servers responsible for the blurring) pick up the messages when they are available, perform the task, and return the processed image
  • the purpose is to decouple the web servers from the processing tasks so the web servers are not overworked, and so you can scale the producers and the consumers independently
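The photo example can be sketched in a single process with Python's thread-safe `queue.Queue`. This is only a toy: a production setup would use a separate queueing service, and `blur`, `worker`, and `process_photos` are hypothetical names.

```python
import queue
import threading


def blur(photo_name):
    # Placeholder for the expensive image processing step.
    return f"blurred:{photo_name}"


def worker(tasks, results):
    """Consumer: pull tasks until a None sentinel arrives."""
    while True:
        photo = tasks.get()
        if photo is None:
            tasks.task_done()
            break
        results.put(blur(photo))
        tasks.task_done()


def process_photos(photos, worker_count=2):
    photos = list(photos)
    tasks, results = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(worker_count)
    ]
    for w in workers:
        w.start()
    for photo in photos:  # producer side: enqueue and move on
        tasks.put(photo)
    for _ in workers:     # one stop signal per consumer
        tasks.put(None)
    tasks.join()
    for w in workers:
        w.join()
    return sorted(results.get() for _ in photos)


processed = process_photos(["a.png", "b.png", "c.png"], worker_count=2)
```

Changing `worker_count` scales the consumers without touching the producer code, which is the decoupling the card describes.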
14
Q

What are the 4 different categories of NoSQL databases?

A
  • key-value
  • wide column
  • documents
  • graph
15
Q

What is a key-value database?

A
  • the simplest model: look up a value by its key
  • many key-value stores (e.g., Redis, Memcached) keep their data in the machine’s memory rather than on disk
  • pros: very fast
  • cons: limited in what it can do; you can only look values up by key, not run queries against the values
  • often used as a cache
  • examples are Redis and Memcached
16
Q

What is a wide column database?

A
  • similar to a key-value store, except each key maps to multiple columns instead of a single value
  • not every key has the same columns (otherwise this would just be a relational DB)
  • pros: flexible schema; easy to scale horizontally; easy to replicate data across multiple nodes; can group related data and write queries in a SQL-like language (e.g., CQL in Cassandra)
  • cons: no joins
  • common use case: time series data, where each column is a new data point in time
  • often used for workloads with frequent writes but infrequent updates/reads
  • not generally used as the primary database
  • examples are Cassandra and HBase
17
Q

What is a document database?

A
  • organized into documents, which are organized into collections
  • documents contain JSON-like data that is semi-structured and schema-less
  • collections can be organized into logical hierarchy
  • best practice is to denormalize data so reads are fast. The trade-off is that writes can be complex and slow
  • use cases are very wide-ranging, and this is the most common type of NoSQL database
  • pros: can write queries that are somewhat relational; schema-less
  • cons: joins are unsupported or limited
18
Q

What does ACID compliant mean?

A
  • ACID stands for: Atomicity, Consistency, Isolation, and Durability
  • it describes database transactions: each transaction either completes fully or has no effect, and the data stays valid even when errors or failures occur
  • essential for organizations where correctness is critical (think of a bank and a customer’s balance)
  • it makes these databases harder to scale, though
  • most relational databases are ACID compliant
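The bank example can be illustrated with SQLite from Python's standard library. This is a sketch of atomicity only; the `accounts` table, the `transfer` helper, and the `CHECK (balance >= 0)` rule are assumptions made for the demonstration.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Fails the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)        # succeeds
rejected = transfer(conn, "alice", "bob", 500)  # overdraft: whole transfer is undone
```

In the rejected transfer the credit to `bob` runs first and is then rolled back, so no money is created when the debit fails.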
19
Q

What is a graph database?

A
  • NoSQL database that represents data as nodes and relationships as edges
  • tends to perform better than join-heavy relational queries on large, highly connected datasets
  • best for: detecting fraud in finance, building knowledge graphs in companies, and recommendation engines
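To show why the node-and-edge model suits recommendation engines, here is a tiny in-memory sketch that suggests "friends of friends". It uses a plain adjacency dictionary rather than a real graph database, and `FOLLOWS` and `recommend` are hypothetical names.

```python
from collections import Counter

# Nodes are users; each edge means "follows".
FOLLOWS = {
    "alice": {"bob", "carol"},
    "bob": {"dave"},
    "carol": {"dave", "erin"},
    "dave": set(),
    "erin": set(),
}


def recommend(graph, user):
    """Rank accounts two hops away by how many of the user's follows point to them."""
    direct = graph.get(user, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            if candidate != user and candidate not in direct:
                counts[candidate] += 1
    # Most shared connections first; ties broken alphabetically.
    return [name for name, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


suggestions = recommend(FOLLOWS, "alice")
```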
20
Q

What is sharding?

A
  • horizontal scaling of a database
  • shards are the smaller pieces that a larger database is split into
  • each shard shares the same schema but holds different data
  • e.g., user-id % (# of shards) splits data with the same schema across multiple shards. This is the sharding function, and choosing it well is important so that data ends up evenly distributed
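The modulo sharding function from the last bullet, written out as a quick sketch. The name `shard_for` and the choice of four shards are assumptions for the example.

```python
from collections import Counter


def shard_for(user_id, shard_count):
    """Map a numeric user id to a shard index with a simple modulo."""
    return user_id % shard_count


# Sequential ids spread evenly: each of the 4 shards receives 250 of 1000 users.
shard_sizes = Counter(shard_for(user_id, 4) for user_id in range(1000))
```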
21
Q

What are some problems with sharding?

A
  • unequal data: one shard ends up with far more data than the others
  • resharding: data has to be redistributed when a shard can no longer hold its share or when shards are added. This can be complicated. Consistent hashing reduces how much data has to move
  • celebrity problem: one shard receives far more traffic than the others (e.g., imagine several celebrities all ending up on the same shard)
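To illustrate the resharding point, the sketch below compares how many keys change shards when a fifth shard is added, under modulo hashing versus a simple consistent-hash ring with virtual nodes. `HashRing`, `stable_hash`, the shard names, and the choice of 100 virtual nodes per shard are all assumptions for the demonstration, not a production design.

```python
import bisect
import hashlib


def stable_hash(text):
    """Deterministic hash (Python's built-in hash() varies between runs)."""
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owner = {}      # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node is placed on the ring at many pseudo-random positions.
        for i in range(self._vnodes):
            position = stable_hash(f"{node}#{i}")
            bisect.insort(self._positions, position)
            self._owner[position] = node

    def node_for(self, key):
        # Walk clockwise to the first position after the key, wrapping around.
        index = bisect.bisect(self._positions, stable_hash(key)) % len(self._positions)
        return self._owner[self._positions[index]]


keys = [f"user-{i}" for i in range(1000)]
ring = HashRing(["shard-0", "shard-1", "shard-2", "shard-3"])
before = {key: ring.node_for(key) for key in keys}
ring.add_node("shard-4")
after = {key: ring.node_for(key) for key in keys}

moved_ring = [key for key in keys if before[key] != after[key]]
moved_modulo = [key for key in keys if stable_hash(key) % 4 != stable_hash(key) % 5]
```

With the ring, the only keys that move are the ones the new shard takes over; with modulo, most keys land on a different shard after the count changes.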
22
Q

Name some general principles to keep in mind when we’re scaling to millions of users

A
  • keep web tier stateless
  • redundancy at every tier
  • cache as much as possible
  • scale the database tier with sharding
  • use multiple data centers
  • use a CDN
  • split tiers into individual services (e.g., web tier split into hosts and workers using a message queue)
  • keep error logs and metrics, and use automation for build/test/deploy
23
Q

What is consistent hashing?

A
24
Q

What are the general principles to keep in mind when we’re thinking about the speed of our system?

A
  • memory is fast but disk is slow
  • avoid disk seeks when possible
  • compression is fast, so try to compress data before sending it over the internet
  • it takes time to send data between data centers (they’re usually in different regions)
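A quick sketch of the compression point using `zlib` from the standard library. The payload is made-up, highly repetitive CSV text, so the size reduction here is far better than typical real data would see.

```python
import zlib

# Hypothetical payload: a header plus 1000 identical event rows.
payload = ("user_id,event,timestamp\n" + "42,click,1700000000\n" * 1000).encode("utf-8")

compressed = zlib.compress(payload)     # what would be sent over the network
restored = zlib.decompress(compressed)  # what the receiver reconstructs

ratio = len(compressed) / len(payload)
```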