Designing Data Intensive Applications Flashcards

1
Q

What are the 3 biggest “concerns” when designing data systems?

A

Reliability (the system keeps working correctly in the face of adversity)

Scalability (the system’s ability to deal with increased load)

Maintainability (over time, many people can work on the system productively)

2
Q

What is the big idea behind reliability? Tell me everything you know about faults and failures.

A

Reliability is ensuring that things work correctly in the face of adversity

A fault is when a component deviates from its spec

A failure is when the system as a whole stops delivering the expected service to the user

So the goal is to prevent faults from turning into failures

It can make sense to deliberately trigger faults in your system (Netflix Chaos Monkey)

3
Q

What are the 3 types of faults?

A

Hardware faults (mostly random and independent of each other, but happening all the time at scale)

Software faults (systematic bugs that are hard to predict and can hit many nodes at once, but are rarer)

Human error

4
Q

What’s the big idea behind scalability?

What types of questions are worth considering?

A

Scalability is a system’s ability to cope with increased load

“If the system grows in a particular way, what are our options for coping with the growth?”

5
Q

Give some examples of load parameters for different systems

A

Web server - requests per second
Database - ratio of reads to writes
Chat room - # of simultaneously active users
Cache - hit rate

6
Q

Design a simple Twitter

A

Support 2 operations:

  1. Post tweet
  2. Get home timeline

Approach 1 for the home timeline is to look up everyone the user follows, fetch their tweets, and merge them sorted by time

However, this does all the work at read time, so timeline queries get expensive under heavy read load

Approach 2 is to precompute each user’s home timeline at write time and simply fetch it on read (kind of like a mailbox)

However, this fans out each tweet to every follower, which breaks down for celebrities that have 30 million followers

So the best solution is a hybrid of both

This also illustrates the trade-off between doing work at read time and doing it at write time
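
The two approaches can be sketched side by side. This is a minimal in-memory illustration, not how Twitter is actually built; the dictionaries and function names (`follow`, `post_tweet`, `home_timeline_on_read`, `home_timeline_from_cache`) are made up for this card.

```python
from collections import defaultdict

# Hypothetical in-memory stores; a real system would use a database and a cache.
tweets_by_user = defaultdict(list)   # author -> [(timestamp, text)]
followers = defaultdict(set)         # author -> users who follow them
following = defaultdict(set)         # user -> authors they follow
timeline_cache = defaultdict(list)   # user -> [(timestamp, author, text)]

def follow(user, author):
    following[user].add(author)
    followers[author].add(user)

def post_tweet(author, timestamp, text):
    tweets_by_user[author].append((timestamp, text))
    # Approach 2: fan out on write, one cache insert per follower.
    for follower in followers[author]:
        timeline_cache[follower].append((timestamp, author, text))

def home_timeline_on_read(user):
    # Approach 1: gather tweets from every followed author, then sort by time.
    merged = [(ts, author, text)
              for author in following[user]
              for ts, text in tweets_by_user[author]]
    return sorted(merged, reverse=True)

def home_timeline_from_cache(user):
    # Approach 2: the expensive work already happened at write time.
    return sorted(timeline_cache[user], reverse=True)

follow("alice", "bob")
follow("alice", "carol")
post_tweet("bob", 1, "hello")
post_tweet("carol", 2, "hi")
print(home_timeline_on_read("alice"))
print(home_timeline_from_cache("alice"))
```

Both calls return the same timeline. The difference is where the cost lands: the loop in `post_tweet` runs once per follower, which is the part that hurts for a celebrity account.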

7
Q

What are the 2 ways of describing performance?

A

When you increase a load parameter and keep the system resources the same, how is performance affected?

When you increase a load parameter, how much do you need to increase the resources to keep performance unchanged?

8
Q

What are the 2 key performance characteristics and their definitions?

A

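Throughput - the number of records that can be processed per second

Latency - the time between a client sending a request and getting a response (strictly speaking this is the response time; latency proper is the time a request spends waiting to be handled)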
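
A rough way to see both numbers at once is to time a loop of requests. This is only a sketch: `handle_request` is a hypothetical stand-in for real work, and everything runs in one process, so the measured time is the client-side response time with no network involved.

```python
import time

def handle_request(n):
    # Hypothetical stand-in for real work done per request.
    return sum(range(n))

def measure(num_requests=1000):
    latencies = []
    start = time.perf_counter()
    for _ in range(num_requests):
        t0 = time.perf_counter()
        handle_request(1000)
        latencies.append(time.perf_counter() - t0)  # seconds for this request
    elapsed = time.perf_counter() - start
    throughput = num_requests / elapsed             # requests per second
    return throughput, latencies

throughput, latencies = measure(200)
print(f"throughput: {throughput:.0f} req/s, slowest request: {max(latencies) * 1000:.3f} ms")
```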

9
Q

What is head-of-line blocking?

A

When a server has a bunch of requests queued up and can only process a few at a time, a handful of slow requests hold up everything queued behind them

When load testing, send requests independently of response times. If the client waits for each request to complete before sending the next one, the queues stay artificially short and these issues won’t be caught
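
A tiny model makes the effect visible. The sketch below assumes a single server with a FIFO queue and all requests arriving at time 0; the service times are made-up numbers in milliseconds.

```python
def response_times(service_times):
    """Single server, FIFO queue, all requests arrive at time 0.

    Each request's response time is its wait in the queue plus its own service time.
    """
    finished_at = 0
    results = []
    for service in service_times:
        finished_at += service
        results.append(finished_at)
    return results

# Hypothetical service times: the same four requests, in two different orders.
slow_first = response_times([500, 5, 5, 5])
slow_last = response_times([5, 5, 5, 500])
print(slow_first)
print(slow_last)
```

With the slow request at the head of the line, every fast request behind it also looks slow to its client, even though the server spends only 5 ms on each.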

10
Q

What is tail latency amplification?

A

When a request sent by a client needs to be fulfilled by 3-4 other backend calls, the slowest one holds back the entire response. The more backend calls a request needs, the more likely it is that at least one of them is slow.
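
The arithmetic behind this is short. Assuming each backend call is independently slow with probability p (a simplification; real calls are often correlated), the chance that a request touching n backends is slow is 1 - (1 - p)^n. The function name below is made up for illustration.

```python
def chance_of_slow_response(p_slow_call, num_backend_calls):
    """Probability that at least one of several independent backend calls is slow."""
    return 1 - (1 - p_slow_call) ** num_backend_calls

# Assume 1% of individual backend calls are slow.
one_call = chance_of_slow_response(0.01, 1)
four_calls = chance_of_slow_response(0.01, 4)
hundred_calls = chance_of_slow_response(0.01, 100)
print(f"{one_call:.3f} {four_calls:.3f} {hundred_calls:.3f}")
```

So a 1-in-100 slow call becomes roughly a 4-in-100 slow response with four backend calls, and well over half of all responses with a hundred.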

11
Q

What are the 3 design principles for software maintainability?

A

Operability - Make it easy for the operations team

Simplicity - Make it easy for new engineers to understand the system

Evolvability - Make it easy for engineers to change the system in the future

12
Q

What is accidental complexity and how do we combat it?

A

Accidental complexity is any complexity that is not inherent in the problem being solved, but arises only from the implementation

The best way to combat it is through good abstractions that hide implementation detail behind a clean interface

13
Q

What are the 2 main types of relationships in data modeling, and what are the strengths of NoSQL vs SQL in this regard?

A

One-to-many (1 person has multiple hobbies)

Many-to-one (many people work in 1 industry)

NoSQL document stores (JSON documents) are better at one-to-many because they provide better locality (fewer queries needed to get all info about 1 user). They are weak at many-to-one because they usually don’t have native support for joins, so the application has to do the extra lookups itself.
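
A small sketch of both relationship types in document form. The field names and the `industries` lookup table are hypothetical; the point is that hobbies come back with the person in one read, while the industry needs a second lookup done in application code.

```python
# One-to-many: the hobbies live inside the person's document,
# so a single lookup returns everything about this user.
person_doc = {
    "user_id": 1,
    "name": "Ada",
    "hobbies": ["chess", "rowing"],
    "industry_id": 10,  # many-to-one: a reference to shared data, not a copy
}

# Many-to-one: many people point at the same industry record.
industries = {10: {"name": "Software"}}

def industry_name(person):
    # Without native joins, the application performs the second lookup itself.
    return industries[person["industry_id"]]["name"]

print(person_doc["hobbies"], industry_name(person_doc))
```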

14
Q

What is the network model?

A

It’s like a JSON-style tree of records, except that a record can have more than 1 parent (so it forms a graph)

Application code had to manually traverse this graph by following access paths from record to record, and that became difficult to manage

15
Q

How can you reduce the need for joins, and what is the downside of doing that?

A

You can reduce the need for joins by denormalizing the data (copying it into every record that needs it). The downside is that keeping all the denormalized copies up to date and consistent gets very hard.
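
The update cost is easy to see in a toy example. The records and the `rename_industry` helper below are hypothetical; with normalized data the rename would be a single write to one industry record.

```python
# Denormalized: each record carries its own copy of the industry name.
people = [
    {"name": "Ada", "industry": "Software"},
    {"name": "Linus", "industry": "Software"},
    {"name": "Marie", "industry": "Physics"},
]

def rename_industry(records, old, new):
    # Every copy has to be found and rewritten; missing one leaves the data inconsistent.
    updated = 0
    for record in records:
        if record["industry"] == old:
            record["industry"] = new
            updated += 1
    return updated

writes = rename_industry(people, "Software", "Software Engineering")
print(writes)
```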

16
Q

What is schema-on-read vs schema-on-write and what are the pros and cons?

A

Schema-on-read means the structure of the data is implicit, and the application interprets (and checks) it when the data is read

Schema-on-write is the opposite: the schema is explicit, and the database enforces it when the data is written

Schema-on-read gives you more flexibility. Schema-on-write gives you more safety
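
The difference shows up in where the checking code lives. In this sketch the field names and both functions are made up: `full_name` copes with old and new document shapes at read time, while `validate_on_write` plays the role a database schema would play at write time.

```python
def full_name(doc):
    # Schema-on-read: the reader handles both the old and the new document shape.
    if "first_name" in doc:
        return f'{doc["first_name"]} {doc["last_name"]}'
    return doc["name"]

def validate_on_write(doc):
    # Schema-on-write: reject anything that does not match before storing it.
    required = {"first_name": str, "last_name": str}
    for field, kind in required.items():
        if not isinstance(doc.get(field), kind):
            raise ValueError(f"missing or invalid field: {field}")
    return doc

old_doc = {"name": "Ada Lovelace"}
new_doc = {"first_name": "Ada", "last_name": "Lovelace"}
print(full_name(old_doc), "|", full_name(new_doc))
```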

17
Q

How does MapReduce work?

A

There’s a query that filters the records to process

There’s a map function that takes a document and emits a key/value pair

Then the emitted pairs are grouped by key and sorted

Then a reduce function takes a key and the list of values for that key and combines them into a result
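
The four steps fit in a few lines when everything runs in one process. This is only a sketch of the data flow; real MapReduce implementations distribute the map and reduce calls across machines. The `map_reduce` helper and the observation records are made up for illustration.

```python
from collections import defaultdict

def map_reduce(documents, query, map_fn, reduce_fn):
    # 1. Filter: keep only the documents matching the query.
    selected = [doc for doc in documents if query(doc)]
    # 2. Map: each document emits (key, value) pairs.
    # 3. Group: collect the values emitted for each key.
    groups = defaultdict(list)
    for doc in selected:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # 4. Reduce: one call per key, visiting keys in sorted order.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}

# Hypothetical animal-sighting records.
observations = [
    {"family": "Sharks", "month": "1995-12", "num_animals": 3},
    {"family": "Sharks", "month": "1995-12", "num_animals": 4},
    {"family": "Sharks", "month": "1996-01", "num_animals": 1},
    {"family": "Whales", "month": "1995-12", "num_animals": 2},
]

sharks_per_month = map_reduce(
    observations,
    query=lambda doc: doc["family"] == "Sharks",
    map_fn=lambda doc: [(doc["month"], doc["num_animals"])],
    reduce_fn=lambda key, values: sum(values),
)
print(sharks_per_month)
```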

18
Q

What is the property graph model and how can you model it using SQL?

A

Vertex:
Unique identifier
A collection of properties (JSON)

Edge:
Unique identifier
Tail vertex (where the edge starts)
Head vertex (where the edge ends)
Label to describe the relationship
A collection of properties

In SQL, use one table for vertices and one for edges, with indexes on the edges’ tail vertex and head vertex

You can traverse the graph by starting at a vertex and querying the edges table for every edge that has that vertex as either its tail or its head
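
Here is one way the two tables and a traversal step could look, using SQLite from Python so the example runs anywhere. The table names, column names, sample data, and the `neighbors` helper are assumptions for illustration; JSON properties are stored as plain text to keep it simple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (
        vertex_id  INTEGER PRIMARY KEY,
        properties TEXT               -- JSON stored as text
    );
    CREATE TABLE edges (
        edge_id     INTEGER PRIMARY KEY,
        tail_vertex INTEGER REFERENCES vertices (vertex_id),
        head_vertex INTEGER REFERENCES vertices (vertex_id),
        label       TEXT,
        properties  TEXT
    );
    CREATE INDEX edges_tails ON edges (tail_vertex);
    CREATE INDEX edges_heads ON edges (head_vertex);
""")
conn.executemany(
    "INSERT INTO vertices VALUES (?, ?)",
    [(1, '{"name": "Lucy"}'), (2, '{"name": "Idaho"}'), (3, '{"name": "Alain"}')],
)
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 2, "born_in", "{}"), (2, 3, 1, "married_to", "{}")],
)

def neighbors(vertex_id):
    """Return (label, other_vertex) for every edge touching vertex_id, in either direction."""
    rows = conn.execute(
        """SELECT label, head_vertex FROM edges WHERE tail_vertex = ?
           UNION ALL
           SELECT label, tail_vertex FROM edges WHERE head_vertex = ?""",
        (vertex_id, vertex_id),
    )
    return sorted(rows.fetchall())

print(neighbors(1))
```

The two indexes are what keep both halves of the `UNION ALL` cheap: one serves outgoing edges, the other incoming edges.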
