Designing Data-Intensive Applications Flashcards
What are the 3 biggest “concerns” when designing data systems?
Reliability (work correctly in the face of adversity)
Scalability (a system's ability to deal with increased load)
Maintainability (Over time, many people can contribute to the system productively)
What is the big idea behind reliability? Tell me everything you know about faults and failures.
Reliability is ensuring that things work correctly in the face of adversity
A fault is when a component deviates from its spec
A failure is when the system as a whole stops delivering the expected service to the user
So the goal is to prevent faults from turning into failures
It can make sense to deliberately trigger faults in your system (Netflix Chaos Monkey)
What are the 3 types of faults?
Hardware faults (random and largely uncorrelated, but happening constantly once you have many machines)
Software faults (systematic and correlated across nodes; rarer than hardware faults, but they tend to cause many more failures when they strike)
Human error
What’s the big idea behind scalability?
What are the types of questions worth considering?
Scalability is a system's ability to cope with increased load
“If the system grows in a particular way, what are our options for coping with the growth?”
Give some examples of load parameters for different systems
Web server - requests per second
Database - ratio of reads to writes
Chat room - # of simultaneously active users
Cache - hit rate
Design a simple Twitter
It needs to do 2 things:
- Post Tweet
- Get home timeline
Approach 1 for the home timeline is to do the work at read time: look up everyone the user follows, fetch their tweets, and merge them sorted by time
However, each read fans out to every followed user, and the system struggles to keep up with the load of home-timeline queries
Approach 2 is to pre-compute everyone's timeline at write time and then simply fetch it (kind of like a mailbox): each new tweet is inserted into every follower's timeline
However, this approach breaks down for celebrities with 30 million followers, because one tweet becomes 30 million timeline writes
So the best solution is a hybrid: fan out on write for most users, but merge celebrities' tweets in at read time
This also illustrates the read/write dichotomy: home timelines are read about two orders of magnitude more often than tweets are posted, so it pays to shift work from read time to write time (see the sketch below)
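A minimal sketch of both approaches, using in-memory Python dicts as stand-ins for real datastores (all names and structures here are illustrative, not Twitter's actual design):

```python
import heapq
from collections import defaultdict

tweets = defaultdict(list)     # user_id -> [(timestamp, text), ...] in post order
follows = defaultdict(set)     # user_id -> users they follow
followers = defaultdict(set)   # user_id -> users who follow them
timelines = defaultdict(list)  # user_id -> pre-computed home timeline

# Approach 1: cheap writes, expensive reads (merge at read time).
def post_tweet_v1(user_id, timestamp, text):
    tweets[user_id].append((timestamp, text))

def home_timeline_v1(user_id):
    # Merge each followee's (time-ordered) tweets, newest first.
    per_user = (reversed(tweets[f]) for f in follows[user_id])
    return list(heapq.merge(*per_user, reverse=True))

# Approach 2: cheap reads, expensive writes (fan out at write time).
def post_tweet_v2(user_id, timestamp, text):
    tweets[user_id].append((timestamp, text))
    for follower in followers[user_id]:  # 30M iterations for a celebrity!
        timelines[follower].insert(0, (timestamp, text))

def home_timeline_v2(user_id):
    return timelines[user_id]            # a single cheap lookup
```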
What are the 2 ways of describing performance?
When you increase a load parameter and keep the system resources the same, how is performance affected?
When you increase a load parameter, how much do you need to increase the resources to keep performance unchanged?
What are the 2 key performance characteristics and their definitions?
Throughput - the number of records that can be processed in a second
Latency - the time between a client sending a request and getting a response (strictly, the book calls this the response time; latency is the time the request spends waiting to be handled)
What is head of line blocking?
When a server has a bunch of requests queued up and a few slow ones hold up everything waiting behind them, even though the later requests would be quick on their own
When doing load testing, keep sending requests on schedule instead of waiting for each response before sending the next one; otherwise the queues stay artificially short and this kind of issue won't be caught
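A toy open-loop load generator along these lines, assuming a hypothetical async `send_request` function:

```python
import asyncio
import time

async def open_loop_load(send_request, rate_per_sec, duration_sec):
    """Fire requests on a fixed schedule, NOT waiting for responses.

    A closed-loop generator (await each response, then send the next)
    would be throttled by slow responses, so queueing delays would
    never build up and head-of-line blocking would go unmeasured.
    """
    interval = 1.0 / rate_per_sec
    deadline = time.monotonic() + duration_sec
    pending = []
    while time.monotonic() < deadline:
        pending.append(asyncio.create_task(send_request()))
        await asyncio.sleep(interval)  # pace by the clock, not by responses
    return await asyncio.gather(*pending)

# Usage (send_request is whatever issues one request and times it):
# asyncio.run(open_loop_load(send_request, rate_per_sec=100, duration_sec=30))
```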
What is tail latency amplification?
When serving one user request requires several parallel backend calls, the slowest call holds back the entire response, so even if only a small fraction of backend calls are slow, a much larger fraction of end-user requests end up slow
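The effect is easy to quantify. Assuming (for illustration) that calls are independent and each has a 1% chance of being slow, the fraction of end-user requests containing at least one slow call grows quickly with fan-out:

```python
# Probability that a request touching n backends hits >= 1 slow call,
# if each call is independently slow 1% of the time (just past its p99).
p_slow = 0.01
for n in (1, 5, 10, 100):
    affected = 1 - (1 - p_slow) ** n
    print(f"{n:>3} backend calls -> {affected:.1%} of user requests slow")
# Prints: 1.0%, 4.9%, 9.6%, 63.4%
```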
What are the 3 design principles for software maintainability?
Operability - Make it easy for the operations team
Simplicity - Make it easy for new engineers to understand the system
Evolvability - Make it easy for engineers to change the system in the future
What is accidental complexity and how do we combat it?
Accidental complexity is complexity that is not inherent in the problem the software solves, but arises only from the implementation
The best way to combat it is with good abstractions
What are the 2 main types of relationships in data modeling, and what are the strengths of NoSQL vs SQL in this regard?
One to many (1 person has multiple hobbies)
Many to one (many people work in 1 industry)
NoSQL (JSON documents) is better at one-to-many because a document keeps related data together, giving better locality (fewer queries to get all the info about 1 user). It is weak at many-to-one because document databases usually lack native join support, so joins get emulated in application code.
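A toy illustration of both cases, with plain Python dicts as stand-in documents (field names are made up):

```python
# One-to-many nests naturally inside a single document: one fetch
# returns the user and all their hobbies together (locality).
user_doc = {
    "user_id": 251,
    "name": "Alice",
    "hobbies": ["climbing", "chess", "gardening"],  # 1 user, many hobbies
    # Many-to-one is awkward: many users share one industry, so the
    # document stores an ID and the "join" happens in application code.
    "industry_id": 131,
}

industries = {131: "Software"}  # shared, normalized lookup table

# The join the document database won't do for us:
industry_name = industries[user_doc["industry_id"]]
```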
What is the network model?
It's a generalization of the tree-like hierarchical model (think JSON) in which a record can have more than one parent, so the data forms a graph rather than a tree
Application code had to traverse this graph manually, following access paths from record to record like traversing a linked list, and keeping track of those paths became difficult to manage
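A loose sketch of what following access paths looked like, approximated with hand-rolled Python records (the structure is invented for illustration):

```python
# Network-model style: no declarative query, just pointers between
# records that application code must walk by hand.
class Record:
    def __init__(self, data, first_child=None, next_sibling=None):
        self.data = data
        self.first_child = first_child    # head of a linked list of children
        self.next_sibling = next_sibling  # next record under the same parent

bob = Record({"name": "Bob"})
alice = Record({"name": "Alice"}, next_sibling=bob)
software = Record({"industry": "Software"}, first_child=alice)

# "Who works in Software?" -- walk the path link by link:
cursor = software.first_child
while cursor is not None:
    print(cursor.data["name"])
    cursor = cursor.next_sibling
# Re-link the records differently and every traversal like this breaks.
```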
How can you reduce the need for joins, and what is the downside of doing that?
You can reduce the need for joins by denormalizing the data (duplicating it into every record that needs it), but then every update must touch all the copies, and keeping the denormalized data consistent gets very hard
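A small illustration of the trade-off, with made-up records:

```python
# Normalized: one copy of each fact; reads need a join/lookup.
industries = {131: "Software"}
users = [
    {"name": "Alice", "industry_id": 131},
    {"name": "Bob",   "industry_id": 131},
]

# Denormalized: the industry name is copied into every user record,
# so reads are one fetch with no join...
users_denorm = [
    {"name": "Alice", "industry": "Software"},
    {"name": "Bob",   "industry": "Software"},
]

# ...but now a rename must find and rewrite every copy, or the
# copies silently drift out of sync.
def rename_industry(old_name, new_name):
    for user in users_denorm:
        if user["industry"] == old_name:
            user["industry"] = new_name
```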