Chapter 1 - Reliable, Maintainable and Scalable Applications Flashcards
(33 cards)
What would we typically use to store data so that the same application, or another one, can find it again later?
A database
What would we typically use to remember the result of an expensive operation so we could speed up reads?
A cache
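As a minimal sketch of the idea, the snippet below uses Python's standard `functools.lru_cache` to remember the result of a function. The function `expensive_square` and the counter `call_count` are hypothetical stand-ins for a genuinely expensive operation; a real cache would also need a policy for invalidating stale results.

```python
from functools import lru_cache

call_count = 0  # counts how often the function body actually runs


@lru_cache(maxsize=128)
def expensive_square(n: int) -> int:
    """Hypothetical stand-in for an expensive operation."""
    global call_count
    call_count += 1
    return n * n


first = expensive_square(12)   # computed, then stored in the cache
second = expensive_square(12)  # served from the cache; the body is not re-run
```

After both calls, `call_count` is still 1: the second read was answered from the cache.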
What would we build to allow efficient searching by keyword or filtering?
A search index
What would we use (one example) to send a message to another process so that the message can be handled asynchronously?
Stream processing
What kind of processing would typically describe periodically crunching large amounts of accumulated data?
Batch processing
Does a one-size-fits-all approach typically work well today when designing applications?
It depends, but in general no. There are many different types of databases, search indexes and caching tools, and knowing which ones to choose and how to combine them is critical when designing most modern applications.
When you combine several tools in a service and hide the details behind a common Application Programming Interface, what have you done?
You’ve essentially created your own special-purpose data system from smaller, more general-purpose components. You need to think hard about the guarantees it can make and the trade-offs in how you combine these tools.
Sum up the meaning of ‘Reliability’
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).
Sum up the meaning of ‘Scalability’
As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
Sum up the meaning of ‘Maintainability’
Over time, many different people will work on the system (engineering and operations, both maintaining current behaviour and adapting the system to new use cases), and they should all be able to work on it productively.
Give 4 reasonable expectations of Reliability
The application performs the function that the user expected.
It can tolerate the user making mistakes or using the software in unexpected ways.
Its performance is good enough for the required use case, under the expected load and data volume.
The system prevents any unauthorised access and abuse.
Is it reasonable to expect a system to tolerate every possible fault?
No, there are some faults which a system could never be expected to recover from reasonably. If the world ended, you would have to host the system in space. We have to think about the types of faults we want our system to tolerate.
What’s the difference between a fault and a failure?
A fault is one component of the system deviating from its expected behaviour, whereas a failure is the system as a whole not performing as expected. We can’t prevent all faults, but we can design fault-tolerance mechanisms that stop faults from becoming failures.
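The distinction can be illustrated with a small sketch, assuming a read that can be served by any of several redundant replicas. The names `read_with_redundancy`, `broken_replica` and `healthy_replica` are hypothetical: a single replica raising an error is a fault, and the read only becomes a failure if every replica faults.

```python
from typing import Callable, List, Sequence


def read_with_redundancy(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in turn; one faulty replica does not fail the read."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a fault in a single component
            errors.append(exc)
    # Only when every replica has faulted does the system as a whole fail.
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def broken_replica() -> str:
    """Hypothetical replica that always faults."""
    raise ConnectionError("replica unreachable")


def healthy_replica() -> str:
    """Hypothetical replica that answers normally."""
    return "value"


# The fault in the first replica is masked by the second one.
result = read_with_redundancy([broken_replica, healthy_replica])
```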
What are some ways we could deal with hardware faults?
Redundancy of components and infrastructure
Using software systems that can tolerate the loss of an entire machine (Kubernetes for example)
What are some examples of Software faults?
A bug in the software
A runaway process that uses up shared resources (CPU time, memory, disk space, network bandwidth).
Cascading failures (a fault in one component triggers a fault in another component which triggers a fault in another component).
What are some methods of preventing software faults?
Thinking carefully through assumptions and interactions in the system
Testing
Process isolation
Allowing processes to crash and restart
Measuring, monitoring and analysing system behaviour in production
How do we minimise the possibility for human error?
Design our systems in a way that minimises the opportunity for humans to make mistakes and makes it easy for them to do the right thing. If we make the system too restrictive, though, people will work around it.
Decouple the places where people make the most mistakes from the places that can cause failures. Sandbox environments that mirror production allow people to play and make mistakes without a huge impact.
Test thoroughly at all levels: unit, integration and end-to-end tests.
Make it easy and quick to recover from human errors (for example, by rolling back configuration changes), roll out new code gradually, and provide tools to recompute data easily if there were errors in the original computation.
Set up detailed monitoring/instrumentation.
Implement good management practices, training and a good culture.
How important is reliability?
It depends. Sometimes we might choose to sacrifice reliability to reduce cost, but it is generally very important.
For scalability, how would you measure load on the system?
With pre-defined parameters called load parameters.
The best choice depends on your system, its architecture and its requirements.
Might be web requests/second, ratio of reads/writes in a database, simultaneous active users in a chat room, hit rate on a cache.
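As a rough sketch of how such load parameters might be derived, the snippet below turns raw counters collected over a time window into three of the numbers mentioned above. The `Counters` class, its field names and the sample figures are all assumptions made for illustration, not measurements from a real system.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Counters:
    """Hypothetical raw counters collected over one measurement window."""
    reads: int
    writes: int
    cache_hits: int
    cache_misses: int
    window_seconds: float


def load_parameters(c: Counters) -> Dict[str, float]:
    """Derive a few load parameters from the raw counters."""
    requests = c.reads + c.writes
    lookups = c.cache_hits + c.cache_misses
    return {
        "requests_per_second": requests / c.window_seconds,
        "read_write_ratio": c.reads / c.writes if c.writes else float("inf"),
        "cache_hit_rate": c.cache_hits / lookups if lookups else 0.0,
    }


# Made-up numbers for a 10-second window.
params = load_parameters(
    Counters(reads=9000, writes=1000, cache_hits=800,
             cache_misses=200, window_seconds=10.0)
)
```

Which of these numbers matters most depends on the system: a read-heavy service may care about the cache hit rate, a write-heavy one about the read/write ratio.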
What is the biggest factor to consider when looking at Twitter’s load?
Fan-out. One tweet has to make its way into the timelines of all of a user’s followers, in order with all the other tweets from the people they follow.
Each person follows many people and is followed by many people.
Is it preferable for Twitter to do more work at write or read time?
Write time, because the rate of tweets posted is orders of magnitude lower than the rate of home timeline requests.
This is why Twitter builds a cache for each user’s home timeline and updates each cache as needed with new tweets.
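The approach above, doing the work at write time, can be sketched with in-memory data structures. This is an illustration only, not Twitter's actual implementation: `followers`, `timelines`, `TIMELINE_LIMIT` and the three functions are hypothetical names, and a real system would keep the caches in a distributed store.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Set

TIMELINE_LIMIT = 800  # hypothetical cap on cached tweets per user

# followee -> set of users who follow them
followers: Dict[str, Set[str]] = defaultdict(set)
# user -> cached home timeline, newest tweet first
timelines: Dict[str, Deque[str]] = defaultdict(
    lambda: deque(maxlen=TIMELINE_LIMIT)
)


def follow(follower: str, followee: str) -> None:
    followers[followee].add(follower)


def post_tweet(author: str, text: str) -> int:
    """Fan out on write: push the tweet into every follower's cache."""
    writes = 0
    for user in followers[author]:
        timelines[user].appendleft(f"{author}: {text}")
        writes += 1
    return writes  # one cache write per follower


def home_timeline(user: str) -> List[str]:
    """Reading is cheap: the timeline has already been assembled."""
    return list(timelines[user])


follow("bob", "alice")
follow("carol", "alice")
follow("bob", "dave")
alice_writes = post_tweet("alice", "hello")
dave_writes = post_tweet("dave", "hi")
```

Each call to `post_tweet` performs one cache write per follower, so the cost of posting grows with the author's follower count, while `home_timeline` is a single lookup.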
For a person with 30 million followers, what does Twitter’s caching approach mean in terms of writes when that person posts a tweet?
30 million writes occur!
Is distribution of followers per user a good load parameter for Twitter?
Yes, it determines how many writes each tweet triggers, and therefore the fan-out load.
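To see why the distribution matters and not just the total number of followers, here is a small back-of-the-envelope sketch. The function `fanout_writes_per_second` and all the numbers are made up for illustration; it assumes each tweet causes exactly one cache write per follower.

```python
from typing import Dict


def fanout_writes_per_second(
    tweets_per_second: Dict[str, float],
    follower_counts: Dict[str, int],
) -> float:
    """Estimate timeline-cache writes per second under fan-out on write."""
    return sum(
        rate * follower_counts.get(user, 0)
        for user, rate in tweets_per_second.items()
    )


# Same tweet rates and the same total of 30 followers in both scenarios.
rates = {"u1": 1.0, "u2": 1.0, "u3": 2.0}
even_followers = {"u1": 10, "u2": 10, "u3": 10}
skewed_followers = {"u1": 1, "u2": 1, "u3": 28}

even_load = fanout_writes_per_second(rates, even_followers)
skewed_load = fanout_writes_per_second(rates, skewed_followers)
```

With these made-up numbers the even distribution produces 40 writes per second and the skewed one 58, even though the total follower count is identical, because the most active user also has the most followers.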
What would be a good way to think about load and its effect on your system?
- When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
- When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?