Reliable, Scalable, Maintainable Systems Flashcards
What are five common data processing options for data-intensive applications, and what, broadly, are they used for?
- Databases: Persistent data storage
- Caches: Speed up reads to frequently-read data
- Search Indexes: Allow users to efficiently filter and search data
- Stream Processing: Asynchronous messaging between processes
- Batch Processing: Periodically processing large volumes of accumulated data
What two factors changed over the past decade in terms of how data processing systems are categorised?
- The distinction between various types of data processing options has become blurred, and modern solutions tend to fit in multiple categories, e.g. Redis can function both as a data store and as a message queue
- Rather than having a single tool that is used for all purposes, systems tend to be composed of multiple disparate systems that each meet a specific need, and these are orchestrated by application code
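As an illustration of application code orchestrating several specialised components, here is a minimal sketch of a read-through cache in front of a persistent store. The `UserService` class and the use of plain dictionaries as stand-ins for a database and a cache are assumptions made for this example, not a description of any particular product.

```python
class UserService:
    """Application code that coordinates a cache and a primary data store."""

    def __init__(self, database, cache):
        self.database = database  # hypothetical stand-in for a persistent store
        self.cache = cache        # hypothetical stand-in for an in-memory cache

    def get_user(self, user_id):
        # Try the cache first; fall back to the database on a miss.
        if user_id in self.cache:
            return self.cache[user_id]
        user = self.database[user_id]
        self.cache[user_id] = user  # populate the cache for later reads
        return user

    def update_user(self, user_id, user):
        # Write to the source of truth, then invalidate the cached copy
        # so that the next read does not return stale data.
        self.database[user_id] = user
        self.cache.pop(user_id, None)


database = {1: {"name": "Ada"}}
cache = {}
service = UserService(database, cache)

first = service.get_user(1)   # cache miss: read from the database
second = service.get_user(1)  # cache hit: served from the cache
service.update_user(1, {"name": "Grace"})
third = service.get_user(1)   # miss again after invalidation
```

Neither component knows about the other; keeping them consistent is the responsibility of the application code.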
What is meant by the reliability of a system?
A reliable system continues to work correctly (that is, returns accurate data within a certain performance constraint) in the event of hardware faults, software faults and human error.
What is meant by the scalability of a system?
Scalability describes how well a system copes as it, or its data requirements, grow in inherent complexity: how well it continues to meet those requirements, as judged by some measurement of its performance, without also growing in incidental complexity.
What is meant by the maintainability of a system?
Maintainability refers to how well a system continues to perform its existing functions, and how well it can adapt to new use cases that may arise, as the people working on it grow in number or change over time.
What are some things that can broadly be expected of a reliable system?
- The system performs its expected function
- It can tolerate user error
- It is sufficiently performant for its use case and continues to be as volume grows
- It prevents any unauthorised access from malicious users
What is the distinction between a fault and a failure?
A fault is some unexpected, adverse event that can affect a system, e.g. a software or hardware component within the system malfunctioning. A failure, on the other hand, is where the system as a whole (as the result of a fault) no longer serves its intended function.
How do you ensure a system is fault tolerant?
By inducing faults in the system (e.g. simulating hardware failures, attacks, service outages) and monitoring system behaviour, you can develop a profile of what types of faults a system can tolerate. No system is impervious to all types of faults, so it is critical that you identify which faults the system tolerates and where the limits of that tolerance lie.
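Fault injection can be sketched in a few lines. The example below is a toy simulation under stated assumptions: `FlakyService` is a hypothetical dependency that fails on a configurable fraction of calls, and `fetch_with_retries` is a hypothetical client whose tolerance we want to profile. Real fault injection targets running infrastructure rather than an in-process object.

```python
import random


class FlakyService:
    """Hypothetical dependency that fails on a configurable fraction of calls."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate
        self.rng = rng

    def fetch(self):
        # Inject a fault with probability `failure_rate`.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "ok"


def fetch_with_retries(service, max_attempts):
    """Return the result, or None if every attempt hit a fault."""
    for _ in range(max_attempts):
        try:
            return service.fetch()
        except ConnectionError:
            continue
    return None


def measure_failure_rate(failure_rate, max_attempts, trials=10_000, seed=0):
    """Fraction of client calls that fail despite retries."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    service = FlakyService(failure_rate, rng)
    failures = sum(
        fetch_with_retries(service, max_attempts) is None for _ in range(trials)
    )
    return failures / trials


without_retries = measure_failure_rate(failure_rate=0.3, max_attempts=1)
with_retries = measure_failure_rate(failure_rate=0.3, max_attempts=3)
total_outage = measure_failure_rate(failure_rate=1.0, max_attempts=3)
```

Comparing the three measurements shows both what the retry logic tolerates (intermittent faults) and what it does not (a dependency that is down entirely).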
What are two reasons that the trend moved away from ensuring individual hardware components are resilient and toward systems becoming more resilient to individual hardware failures?
- As data processing requirements grow, systems need more hardware to operate in a robust and performant manner, so the likelihood that some individual component fails increases; a single hardware failure should therefore not cause a system failure
- On shared infrastructure like AWS, virtualised machines can become unavailable without warning, because such platforms are designed to serve use cases that require elasticity rather than to ensure individual machines are preserved
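The first point can be made concrete with a short calculation. This sketch assumes machines fail independently and with the same probability, which is a simplification, and the 0.1% daily failure probability is a made-up figure chosen only for illustration.

```python
def probability_any_failure(per_machine_probability, machine_count):
    """Probability that at least one of `machine_count` machines fails.

    Assumes failures are independent and equally likely on every machine.
    """
    probability_none_fail = (1 - per_machine_probability) ** machine_count
    return 1 - probability_none_fail


# Hypothetical figure: each machine has a 0.1% chance of failing on a given day.
daily_failure_probability = 0.001

for machine_count in (1, 10, 100, 1000):
    risk = probability_any_failure(daily_failure_probability, machine_count)
    print(f"{machine_count:>5} machines: {risk:.1%} chance of at least one failure")
```

Under these assumptions a failure somewhere in a large fleet becomes a routine event, which is why tolerance has to be built into the system rather than into each component.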
Why are software faults more likely to cause system failures than hardware faults?
- Hardware faults tend not to be correlated except under exceptional circumstances, while software faults tend to be correlated across the redundant replicas of a service. A fault that impacts one node serving a specific function is therefore likely to impact all of its replicas, causing an outage in a critical part of the system and hence a failure of the whole system
- Software faults tend to be less predictable, as some are inherent to the design
What are some ways that a software fault can cause an outage in a distributed, redundant system?
- Bad inputs result in an unhandled exception on all replicas of a back-end service
- A memory leak or other bug causes one service to consume all of a shared hardware resource, causing other services sharing the same hardware to fail
- An operating system service fails, and services dependent on that system service can fail or hang as a result
- Cascading failures, i.e. one fault causing another fault and so on until the system fails overall
What is a load parameter in the context of measuring the data load of a system?
Any measurable, independent variable quantity that describes operations that occur within a system. Some examples include:
- Requests per second to a page
- Read/write ratio for a database
- Cache hit rate
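A minimal sketch of deriving these three load parameters from a request log follows. The log format (timestamp in seconds, operation, whether the read was served from cache) and the `load_parameters` helper are assumptions made for this example.

```python
from collections import Counter

# Hypothetical log entries: (timestamp in seconds, operation, served from cache?)
request_log = [
    (0.0, "read", True),
    (0.5, "read", False),
    (1.0, "write", False),
    (1.5, "read", True),
    (2.0, "read", True),
    (2.5, "write", False),
    (3.0, "read", False),
    (4.0, "read", False),
]


def load_parameters(log):
    """Summarise a log as load parameters.

    Assumes the log is sorted by time, spans a non-zero duration,
    and contains at least one read and one write.
    """
    operations = Counter(operation for _, operation, _ in log)
    read_hits = [hit for _, operation, hit in log if operation == "read"]
    duration = log[-1][0] - log[0][0]
    return {
        # Rough rate: request count divided by the time span of the log.
        "requests_per_second": len(log) / duration,
        "read_write_ratio": operations["read"] / operations["write"],
        "cache_hit_rate": sum(read_hits) / len(read_hits),
    }


parameters = load_parameters(request_log)
```

Which parameters matter depends on the architecture of the system; these three are simply the ones listed above.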
How do you describe performance in relation to the load parameters of a system?
- When a load parameter increases and the CPU/memory resources are kept the same, how is the performance of the system affected?
- When you increase a load parameter, how much do you need to increase the resources available to the system in order for it to have the same performance?
What is one major difference between how performance is measured for batch processing systems when compared with real-time processing systems?
- In batch processing systems, the number of records that can be processed within a given timeframe (the throughput) is a measure of its performance
- In real-time systems, the more important metric is the response time, i.e. the time it takes for a user interaction to have a corresponding response
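The two metrics can be computed as follows. This is a sketch with made-up numbers: the record count, elapsed time, and response times are hypothetical, and summarising response times by their median and worst case is one choice among several.

```python
import statistics


def throughput(records_processed, elapsed_seconds):
    """Batch metric: records processed per second."""
    return records_processed / elapsed_seconds


def summarise_response_times(response_times_ms):
    """Real-time metric: a summary of what individual users experienced."""
    return {
        "median_ms": statistics.median(response_times_ms),
        "worst_ms": max(response_times_ms),
    }


# Hypothetical batch job: one million records in 250 seconds.
batch_throughput = throughput(1_000_000, 250)

# Hypothetical response times for ten user requests, in milliseconds.
summary = summarise_response_times([12, 15, 11, 240, 14, 13, 16, 12, 18, 950])
```

Throughput is a single aggregate figure, whereas response time is a distribution: most of the sample requests are fast, but two are far slower than the median suggests.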
What is the difference between latency and response time?
The response time is the total time the client observes between sending a request and receiving the response. The latency is the time the request spends waiting to be handled, i.e. the time between when the request was initiated and when the back-end system begins processing it.
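The distinction is easier to see with timestamps. In this sketch, `RequestTrace` and its four fields are hypothetical names, the timings are made up, and latency is taken to mean the wait from when the request is initiated until the back-end starts handling it.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical timestamps for one request, in milliseconds."""

    sent_at: float                 # client initiates the request
    processing_started_at: float   # back-end begins handling it
    processing_finished_at: float  # back-end finishes handling it
    response_received_at: float    # client receives the response

    @property
    def latency(self):
        # Time the request spent waiting before being handled.
        return self.processing_started_at - self.sent_at

    @property
    def service_time(self):
        # Time the back-end spent actually processing the request.
        return self.processing_finished_at - self.processing_started_at

    @property
    def response_time(self):
        # Everything the client observes, from request to response.
        return self.response_received_at - self.sent_at


trace = RequestTrace(
    sent_at=0.0,
    processing_started_at=40.0,
    processing_finished_at=65.0,
    response_received_at=80.0,
)
```

For this trace the latency is 40 ms, the service time is 25 ms, and the response time is 80 ms; the remaining 15 ms is the time the response takes to travel back to the client.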