Designing Data-Intensive Applications Flashcards
What are the 3 biggest “concerns” when designing data systems?
Reliability (work correctly in the face of adversity)
Scalability (a system's ability to deal with increased load)
Maintainability (Over time, many people can contribute to the system productively)
What is the big idea behind reliability? Tell me everything you know about faults and failures.
Reliability is ensuring that things work correctly in the face of adversity
A fault is when a component deviates from its spec
A failure is when the system as a whole stops delivering the expected service to the user
So the goal is to prevent faults from turning into failures
It can make sense to deliberately trigger faults in your system (Netflix Chaos Monkey)
What are the 3 types of faults?
Hardware faults (random and largely uncorrelated, but happening constantly once you have many machines)
Software faults (systematic and correlated across nodes; rarer than hardware faults, but they tend to cause many more failures when they strike)
Human error
What’s the big idea behind scalability?
What are the types of questions worth considering?
Scalability is a system's ability to cope with increased load
“If the system grows in a particular way, what are our options for coping with the growth?”
Give some examples of load parameters for different systems
Web server - requests per second
Database - ratio of reads to writes
Chat room - # of simultaneously active users
Cache - hit rate
Design a simple Twitter
It needs to do 2 things:
- Post Tweet
- Get home timeline
Approach 1 for the home timeline is to do the work at read time: look up everyone the user follows, fetch their tweets, and merge them sorted by time
However, each read fans out to every followed user, and the system struggles to keep up with the load of home-timeline queries
Approach 2 is to pre-compute everyone's timeline at write time and then simply fetch it (kind of like a mailbox): each new tweet is inserted into every follower's timeline
However, this approach breaks down for celebrities with 30 million followers, because one tweet becomes 30 million timeline writes
So the best solution is a hybrid: fan out on write for most users, but merge celebrities' tweets in at read time
This also illustrates the read/write dichotomy: home timelines are read about two orders of magnitude more often than tweets are posted, so it pays to shift work from read time to write time (see the sketch below)
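A minimal sketch of both approaches, using in-memory Python dicts as stand-ins for real datastores (all names and structures here are illustrative, not Twitter's actual design):

```python
import heapq
from collections import defaultdict

tweets = defaultdict(list)     # user_id -> [(timestamp, text), ...] in post order
follows = defaultdict(set)     # user_id -> users they follow
followers = defaultdict(set)   # user_id -> users who follow them
timelines = defaultdict(list)  # user_id -> pre-computed home timeline

# Approach 1: cheap writes, expensive reads (merge at read time).
def post_tweet_v1(user_id, timestamp, text):
    tweets[user_id].append((timestamp, text))

def home_timeline_v1(user_id):
    # Merge each followee's (time-ordered) tweets, newest first.
    per_user = (reversed(tweets[f]) for f in follows[user_id])
    return list(heapq.merge(*per_user, reverse=True))

# Approach 2: cheap reads, expensive writes (fan out at write time).
def post_tweet_v2(user_id, timestamp, text):
    tweets[user_id].append((timestamp, text))
    for follower in followers[user_id]:  # 30M iterations for a celebrity!
        timelines[follower].insert(0, (timestamp, text))

def home_timeline_v2(user_id):
    return timelines[user_id]            # a single cheap lookup
```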
What are the 2 ways of describing performance?
When you increase a load parameter and keep the system resources the same, how is performance affected?
When you increase a load parameter, how much do you need to increase the resources to keep performance unchanged?
What are the 2 key performance characteristics and their definitions?
Throughput - the number of records that can be processed in a second
Latency - the time between a client sending a request and getting a response (strictly, the book calls this the response time; latency is the time the request spends waiting to be handled)
What is head of line blocking?
When a server has a bunch of requests queued up and a few slow ones hold up everything waiting behind them, even though the later requests would be quick on their own
When doing load testing, keep sending requests on schedule instead of waiting for each response before sending the next one; otherwise the queues stay artificially short and this kind of issue won't be caught
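A toy open-loop load generator along these lines, assuming a hypothetical async `send_request` function:

```python
import asyncio
import time

async def open_loop_load(send_request, rate_per_sec, duration_sec):
    """Fire requests on a fixed schedule, NOT waiting for responses.

    A closed-loop generator (await each response, then send the next)
    would be throttled by slow responses, so queueing delays would
    never build up and head-of-line blocking would go unmeasured.
    """
    interval = 1.0 / rate_per_sec
    deadline = time.monotonic() + duration_sec
    pending = []
    while time.monotonic() < deadline:
        pending.append(asyncio.create_task(send_request()))
        await asyncio.sleep(interval)  # pace by the clock, not by responses
    return await asyncio.gather(*pending)

# Usage (send_request is whatever issues one request and times it):
# asyncio.run(open_loop_load(send_request, rate_per_sec=100, duration_sec=30))
```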
What is tail latency amplification?
When serving one user request requires several parallel backend calls, the slowest call holds back the entire response, so even if only a small fraction of backend calls are slow, a much larger fraction of end-user requests end up slow
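The effect is easy to quantify. Assuming (for illustration) that calls are independent and each has a 1% chance of being slow, the fraction of end-user requests containing at least one slow call grows quickly with fan-out:

```python
# Probability that a request touching n backends hits >= 1 slow call,
# if each call is independently slow 1% of the time (just past its p99).
p_slow = 0.01
for n in (1, 5, 10, 100):
    affected = 1 - (1 - p_slow) ** n
    print(f"{n:>3} backend calls -> {affected:.1%} of user requests slow")
# Prints: 1.0%, 4.9%, 9.6%, 63.4%
```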
What are the 3 design principles for software maintainability?
Operability - Make it easy for the operations team
Simplicity - Make it easy for new engineers to understand the system
Evolvability - Make it easy for engineers to change the system in the future
What is accidental complexity and how do we combat it?
Accidental complexity is complexity that is not inherent in the problem the software solves, but arises only from the implementation
The best way to combat it is with good abstractions
What are the 2 main types of relationships in data modeling, and what are the strengths of NoSQL vs SQL in this regard?
One to many (1 person has multiple hobbies)
Many to one (many people work in 1 industry)
NoSQL (JSON documents) is better at one-to-many because a document keeps related data together, giving better locality (fewer queries to get all the info about 1 user). It is weak at many-to-one because document databases usually lack native join support, so joins get emulated in application code.
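A toy illustration of both cases, with plain Python dicts as stand-in documents (field names are made up):

```python
# One-to-many nests naturally inside a single document: one fetch
# returns the user and all their hobbies together (locality).
user_doc = {
    "user_id": 251,
    "name": "Alice",
    "hobbies": ["climbing", "chess", "gardening"],  # 1 user, many hobbies
    # Many-to-one is awkward: many users share one industry, so the
    # document stores an ID and the "join" happens in application code.
    "industry_id": 131,
}

industries = {131: "Software"}  # shared, normalized lookup table

# The join the document database won't do for us:
industry_name = industries[user_doc["industry_id"]]
```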
What is the network model?
It's a generalization of the tree-like hierarchical model (think JSON) in which a record can have more than one parent, so the data forms a graph rather than a tree
Application code had to traverse this graph manually, following access paths from record to record like traversing a linked list, and keeping track of those paths became difficult to manage
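A loose sketch of what following access paths looked like, approximated with hand-rolled Python records (the structure is invented for illustration):

```python
# Network-model style: no declarative query, just pointers between
# records that application code must walk by hand.
class Record:
    def __init__(self, data, first_child=None, next_sibling=None):
        self.data = data
        self.first_child = first_child    # head of a linked list of children
        self.next_sibling = next_sibling  # next record under the same parent

bob = Record({"name": "Bob"})
alice = Record({"name": "Alice"}, next_sibling=bob)
software = Record({"industry": "Software"}, first_child=alice)

# "Who works in Software?" -- walk the path link by link:
cursor = software.first_child
while cursor is not None:
    print(cursor.data["name"])
    cursor = cursor.next_sibling
# Re-link the records differently and every traversal like this breaks.
```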
How can you reduce the need for joins, and what is the downside of doing that?
You can reduce the need for joins by denormalizing the data (duplicating it into every record that needs it), but then every update must touch all the copies, and keeping the denormalized data consistent gets very hard
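A small illustration of the trade-off, with made-up records:

```python
# Normalized: one copy of each fact; reads need a join/lookup.
industries = {131: "Software"}
users = [
    {"name": "Alice", "industry_id": 131},
    {"name": "Bob",   "industry_id": 131},
]

# Denormalized: the industry name is copied into every user record,
# so reads are one fetch with no join...
users_denorm = [
    {"name": "Alice", "industry": "Software"},
    {"name": "Bob",   "industry": "Software"},
]

# ...but now a rename must find and rewrite every copy, or the
# copies silently drift out of sync.
def rename_industry(old_name, new_name):
    for user in users_denorm:
        if user["industry"] == old_name:
            user["industry"] = new_name
```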