System Design Flashcards

Question 1

Q

When should you generally choose a NoSQL database?

Answer

A

the app requires very low latency
the data is unstructured or has no fixed schema
you need to store a massive amount of data and scale horizontally across many servers

Question 2

Q

When should you use a relational database?

Question 3

Q

What are advantages of database replication?

Answer

A

better performance (writes go to the master; reads are spread across the slaves, so more queries can run in parallel)
reliability (if one DB server is destroyed, the data is preserved on the replicas)
high availability (if one DB server goes down, we can still serve requests from the replicas)

Question 4

Q

What are some ways we can reduce the load/response time of a request?

Answer

A

implement a cache layer for common data requests
implement a CDN for static content

Question 5

Q

When should we consider using a cache?

Answer

A

when reads far outnumber writes
when we want to reduce the response time of a request
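
A minimal sketch of the read path this card describes, assuming a simple in-process cache with a time-based expiry. The names `TTLCache` and `load_from_store` are hypothetical and stand in for whatever cache and data store the system actually uses.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Check the cache first; fall back to the data store on a miss."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    def get(self, key: str, load_from_store: Callable[[str], str]) -> str:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no trip to the data store
        value = load_from_store(key)  # miss or expired entry: read from the store
        self._entries[key] = (now + self.ttl_seconds, value)
        return value
```

Repeated reads of the same key only reach the data store once per expiry window, which is why the benefit grows as reads outnumber writes.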

Question 6

Q

What are some considerations to keep in mind when implementing a cache?

Answer

A

is it appropriate? (reads need to far outnumber writes for a cache to pay off)
expiration policy (how long cached data stays valid; too short means frequent reloads from the data store, too long means stale data)
consistency (keeping the data store and cache in sync)
mitigating failures (run multiple caches in different data centres; overprovision the required memory by a certain percentage)
eviction policy (i.e., what to do when the cache is full). Least recently used (LRU) is the most common; alternatives include least frequently used (LFU) and first in, first out (FIFO)
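
The LRU eviction policy mentioned above can be sketched as follows. This is an illustrative in-memory version built on Python's `collections.OrderedDict`; the class name `LRUCache` is an assumption, not a reference to any particular cache product.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry when full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
```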

Question 7

Q

What is a CDN?

Answer

A

content delivery network
third-party cache for static files (e.g., HTML/CSS/JavaScript files, images, videos, etc.)

Question 8

Q

What are some considerations to think about when implementing a CDN?

Answer

A

cost; CDNs are run by third-party providers that charge for data transfer, so caching infrequently used files costs money and provides little benefit
TTL (time to live/expiry time)
CDN failure: clients should be able to fall back to requesting from the origin if the CDN is unavailable
invalidating files; when a file changes you have to invalidate the copy in the CDN. This can be done either through an API provided by the CDN provider, or by versioning the file so that each version is served under a different URL (e.g., a version number in the query string)
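
One way to picture the versioning approach: derive the version from the file's contents, so a changed file automatically gets a new URL and the stale CDN copy is simply never requested again. The function name `versioned_url`, the `v` parameter, and the example host are all hypothetical, and the sketch assumes the base URL has no query string of its own.

```python
import hashlib


def versioned_url(base_url: str, file_bytes: bytes) -> str:
    """Append a content-derived version tag to a static file URL."""
    digest = hashlib.sha256(file_bytes).hexdigest()[:12]  # short content fingerprint
    return f"{base_url}?v={digest}"
```

For example, `versioned_url("https://cdn.example.com/app.js", contents)` yields a different URL whenever `contents` changes, and the same URL while it stays the same.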

Question 9

Q

How do you keep your web tier stateless?

Answer

A

move user session data out of the web servers and into shared persistent storage (e.g., a database); NoSQL is a good choice because it’s easier to scale

Question 10

Q

Why is stateless architecture important?

Answer

A

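handling failures (e.g., if a user’s session lives on one server, it is lost when that server fails)
adding and removing servers (any server can handle any request, so servers can be added or removed freely)
load balancing (requests don’t have to be routed to one specific server)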

Question 11

Q

What is a geo-DNS?

Answer

A

a DNS service that resolves a domain name to the IP address of the data centre closest to the user’s location (only useful when there are multiple data centres in different parts of the world)
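
A rough sketch of the "closest data centre" decision, assuming the resolver already knows the user's approximate coordinates (real geo-DNS services typically infer location from the requester's IP address). The data centre names, coordinates, and IP addresses below are made up for illustration; the addresses come from a range reserved for documentation.

```python
import math

# Hypothetical data centres: name -> (IP address, latitude, longitude)
DATA_CENTRES = {
    "us-east": ("203.0.113.10", 39.0, -77.5),
    "eu-west": ("203.0.113.20", 53.3, -6.3),
    "ap-southeast": ("203.0.113.30", 1.35, 103.8),
}


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points on Earth, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def resolve(user_lat: float, user_lon: float) -> str:
    """Return the IP address of the data centre nearest to the user."""
    nearest = min(
        DATA_CENTRES,
        key=lambda name: haversine_km(user_lat, user_lon, *DATA_CENTRES[name][1:]),
    )
    return DATA_CENTRES[nearest][0]
```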

Question 12

Q

What technical challenges are involved with a multiple data centre setup?

Answer

A

traffic redirection (geo-DNS)
data synchronization (generally want to replicate data across multiple data centres)
test and deployment (want to make sure it’s all working the same between each data centre; automatic deployment tools are crucial)

Question 13

Q

What is a message queue and what’s its purpose?

Answer

A

a message queue is a service that sits between producers and consumers and holds requests (messages) until they can be processed
a producer sends a task to the message queue; consumers then pick up tasks from the queue and perform them
an example is an app that supports photo processing. Processing takes time and resources, so you don’t want to do it on the same servers your clients are connected to and tie up their resources. Instead, the web servers publish a job to the message queue when a user requests processing (e.g., blur a photo); the consumers (other servers responsible for the blurring) pick up the messages when they are free, perform the task, and return the processed image
the purpose is to decouple the web servers from the processing tasks so the web servers are not overworked, and so you can scale the producers and the consumers independently
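
The photo example can be sketched in a single process using Python's standard `queue` and `threading` modules. This is only an illustration of the producer/consumer pattern: in a real system the queue would be a separate service and the workers would be separate servers. The `blur` function is a placeholder, not real image processing.

```python
import queue
import threading
from typing import Dict, List


def blur(photo_name: str) -> str:
    """Placeholder for the expensive processing step."""
    return f"blurred:{photo_name}"


def worker(tasks: "queue.Queue", results: Dict[str, str], lock: threading.Lock) -> None:
    """Consumer: pull tasks off the queue until a None sentinel arrives."""
    while True:
        photo = tasks.get()
        if photo is None:  # sentinel: no more work for this worker
            tasks.task_done()
            break
        processed = blur(photo)
        with lock:
            results[photo] = processed
        tasks.task_done()


def process_photos(photos: List[str], worker_count: int = 2) -> Dict[str, str]:
    tasks: "queue.Queue" = queue.Queue()
    results: Dict[str, str] = {}
    lock = threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(worker_count)
    ]
    for w in workers:
        w.start()
    for photo in photos:  # producer side: enqueue and move on
        tasks.put(photo)
    for _ in workers:  # one sentinel per worker
        tasks.put(None)
    tasks.join()  # wait until every queued item has been handled
    for w in workers:
        w.join()
    return results
```

Changing `worker_count` scales the consumers without touching the producer code, which is the independence the card refers to.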

Question 14

Q

What are the 4 different categories of NoSQL databases?

Answer

A

key-value
wide column
documents
graph

Question 15

Q

What is a key-value database?

Answer

A

looks up a value by its key, like a hash table
many key-value stores (e.g., Redis, Memcached) keep their data in the machine’s memory instead of on disk
pros: very fast
cons: limited in what it can do; you can generally only look values up by key, not run rich queries
often used as a cache
examples are Redis and Memcached

Question 16

Q

What is a wide column database?

Answer

A

similar to a key-value store, except each key maps to a row of multiple columns instead of a single value
not all rows have the same columns (otherwise this would just be a relational DB)
pros: handles unstructured data; easy to scale horizontally; easy to replicate data across multiple nodes; related data can be grouped together and queried (e.g., with CQL in Cassandra)
cons: no joins
common use case: time series data, where each column is a new data point in time
often used for workloads with frequent writes but infrequent updates/reads
not generally used as the primary database
examples are Cassandra and HBase

Question 17

Q

What is a document database?

Answer

A

data is organized into documents, which are grouped into collections
documents contain JSON-like data that is schema-less (documents in the same collection can have different fields)
collections can be organized into a logical hierarchy
best practice is to denormalize data so reads are fast. The trade-off is that writes can be complex and slow
use cases are very wide ranging, and this is the most commonly used type of NoSQL database
pros: supports queries on document fields, so it can feel somewhat relational; schema-less
cons: joins are limited or not supported

Question 18

Q

What does it mean for a database to be ACID compliant?

Answer

A

ACID stands for: Atomicity, Consistency, Isolation, and Durability
it refers to database transactions and essentially means that each transaction either completes fully or not at all, leaves the data in a valid state, does not interfere with concurrent transactions, and persists once committed
super important for organizations that can’t tolerate invalid data (think of a bank and a customer’s balance)
it makes these databases harder to scale, though
most relational databases are ACID compliant
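
To make the bank example concrete, here is a small sketch of atomicity using Python's built-in `sqlite3` module. The `accounts` table, the account names, and the helper functions are invented for illustration. The credit is deliberately applied before the debit so that a failed debit shows the earlier credit being rolled back.

```python
import sqlite3
from typing import Dict


def make_bank() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:  # the CHECK constraint rejected an overdraft
        return False


def balances(conn: sqlite3.Connection) -> Dict[str, int]:
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer that would overdraw the source account fails as a whole, so the destination never keeps money that was not actually debited.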

Question 19

Q

What is a graph database?

Answer

A

a NoSQL database that represents data as nodes and relationships as edges
performs better than join-heavy relational queries when traversing relationships in large, highly connected datasets
best for: detecting fraud in finance, building knowledge graphs in companies, and recommendation engines

Question 20

Q

What is sharding?

Answer

A

a way of scaling a database horizontally by splitting its data across multiple servers
shards are the broken-up pieces of the larger database
each shard shares the same schema but holds a different subset of the data
e.g., user-id % (# of shards) picks the shard that stores a given user’s data. This is the sharding function, and choosing it well is important to make sure your data ends up evenly distributed across the shards
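
The modulo sharding function from this card, written out as a short sketch. The function names are hypothetical, and the example assumes integer user IDs.

```python
from typing import Dict, Iterable, List


def shard_for(user_id: int, shard_count: int) -> int:
    """Sharding function: user-id % (# of shards)."""
    return user_id % shard_count


def distribute(user_ids: Iterable[int], shard_count: int) -> Dict[int, List[int]]:
    """Group user IDs by the shard that would store them."""
    shards: Dict[int, List[int]] = {i: [] for i in range(shard_count)}
    for user_id in user_ids:
        shards[shard_for(user_id, shard_count)].append(user_id)
    return shards
```

With sequential IDs the shards fill evenly, but changing `shard_count` changes the shard for most existing IDs, which is what makes resharding painful with a plain modulo.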

Question 21

Q

What are some problems with sharding?

Answer

A

unequal data: one shard gets way more data than the others
resharding: data has to be reshuffled when a single shard can no longer hold its data. This can be complicated. Consistent hashing helps by limiting how much data has to move when shards are added or removed
celebrity problem (hotspot key): one shard gets far more traffic than the others (e.g., imagine celebrities all end up on the same shard)
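
For the resharding point above, here is a minimal sketch of a consistent hash ring, assuming MD5 as the hash function and 100 virtual nodes per shard (both arbitrary choices for illustration). The class name `HashRing` and the shard names used with it are hypothetical.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List


def _hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Keys and shards hash onto the same ring; a key belongs to the
    first shard position found clockwise from the key's position."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._positions: List[int] = []  # sorted ring positions
        self._owner: Dict[int, str] = {}  # ring position -> shard name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Several virtual positions per shard smooth out the distribution.
        for i in range(self._vnodes):
            pos = _hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def node_for(self, key: str) -> str:
        # Wrap around to the first position when the key hashes past the last one.
        idx = bisect.bisect_right(self._positions, _hash(key)) % len(self._positions)
        return self._owner[self._positions[idx]]
```

When a shard is added, the only keys that change owner are the ones that now fall to the new shard; everything else stays where it was.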

Question 22

Q

Name some general principles to keep in mind when we’re scaling to millions of users

Answer

A

keep web tier stateless
redundancy at every tier
cache as much as possible
scale the database tier with sharding
use multiple data centers
use a CDN
split tiers into individual services (e.g., web tier split into hosts and workers using a message queue)
keep error logs and metrics, and use automation for build/test/deploy

Question 23

Q

What is consistent hashing?

Answer

A

Question 24

Q

What are the general principles to keep in mind when we’re thinking about speed of our system?

Answer

A

memory is fast but disk is slow
avoid disk seeks when possible
simple compression algorithms are fast, so compress data before sending it over the internet when possible
it takes time to send data between data centers (they’re usually in different regions)
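
As a small illustration of the compression point, the sketch below uses Python's standard `zlib` module to shrink a payload before it would be sent. The `pack` and `unpack` helper names are hypothetical, and how much a payload shrinks depends heavily on how repetitive the data is.

```python
import zlib


def pack(payload: str) -> bytes:
    """Compress a text payload before sending it over the network."""
    return zlib.compress(payload.encode("utf-8"))


def unpack(blob: bytes) -> str:
    """Reverse of pack: decompress and decode on the receiving side."""
    return zlib.decompress(blob).decode("utf-8")
```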

(24 cards)