INTRO Flashcards
Why are caches necessary in data-intensive applications?
They are usually used to speed up reads or to remember the result of an expensive operation.
What is stream processing?
Stream processing involves sending a message to another process, to be handled asynchronously.
What does RELIABILITY in the context of building data intensive systems mean?
In the context of DIS, reliability simply means that even in the face of human, hardware or software errors, a system should still be able to function “correctly” at a desired level of performance.
Simply put: the system should continue to work correctly even when things go wrong.
What does SCALABILITY in the context of building data intensive systems mean?
In the context of DIS, scalability simply means that as the software system grows (in data volume, traffic volume, etc.), the system should be able to accommodate that growth, or there should be reasonable ways of dealing with it.
What does MAINTAINABILITY in the context of building data intensive systems mean?
In the context of DIS, maintainability simply means that over time, as more people work on a system (improving its existing functionality or implementing new functionality), they should be able to work on it productively.
What is a FAULT?
We say a fault occurs in a system when one component of the system deviates from its specification (i.e., that component stops working as specified).
What is a FAILURE?
A failure occurs in a system when the entire system stops working and hence doesn’t provide the required service to the user
How to mitigate faults
Trigger faults deliberately (e.g., shutting down a server out of the blue). Doing this exposes cases where error handling is poor.
In general we want to tolerate faults rather than prevent them, because most faults are not preventable. (Security is the notable exception: for security faults, prevention is better than cure.)
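The "trigger faults deliberately" idea (in the spirit of Netflix's Chaos Monkey) can be sketched as a wrapper that randomly injects failures so error-handling paths actually get exercised. The class name, failure rate, and retry loop below are all invented for illustration:

```python
import random

class FaultInjectingClient:
    """Wraps a real call and randomly fails it, to test error handling."""
    def __init__(self, real_call, failure_rate=0.2, seed=None):
        self.real_call = real_call
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded for reproducible test runs

    def call(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault: simulated server crash")
        return self.real_call(*args, **kwargs)

# The caller must *tolerate* the injected fault, e.g. by retrying:
client = FaultInjectingClient(lambda x: x + 1, failure_rate=0.5, seed=42)

def resilient_call(x, attempts=5):
    for _ in range(attempts):
        try:
            return client.call(x)
        except ConnectionError:
            continue  # code with poor error handling would crash here instead
    raise RuntimeError("all attempts failed")

print(resilient_call(41))  # → 42, despite injected faults
```

Running a system with such a wrapper enabled surfaces code paths that silently assumed the call could never fail.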
What three kinds of errors can occur in a system?
Hardware errors: They have weak correlations, it is unlikely that one hard disk crashing will affect another hard disk
Software errors: They have strong correlations and can pull down an entire system (cause failure)
Human errors: Humans design, build, and operate these systems, and humans are known to be unreliable.
How to reduce the occurrence of human errors
- Design systems in a manner that reduces the likelihood of making an error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do "the right thing" and discourage "the wrong thing." However, if the interfaces are too restrictive, people will work around them, negating their benefit, so this is a tricky balance to get right.
- Create a separate environment where people can make mistakes, decoupled from the environment where those mistakes can cause actual failures. A good example of this is a sandbox.
- Carry out tests at all levels: unit, integration, and end-to-end.
- Set up detailed and clear monitoring and logging.
- Make recovery from human errors easy and quick, so as to reduce the impact in the event of a failure.
What is one common cause of degradation of systems?
Increased load. For instance, a system that was handling 10,000 requests per second could find itself handling 100,000. The question then becomes how to handle this increase in load.
What are some scalability questions that can be asked?
If my system has grown in X kind of way, how do I handle or cope with such growth?
What computing resources can I add to cope with the additional load?
What is response time?
How long it takes to get a response to a request (usually sent by a user or client).
What is latency?
Latency is the amount of time that a request is waiting to be handled during which it is latent.
How should response time be thought of?
Not as a single value, but as a distribution of values that can be measured. Why? Because in practice, in a system handling a variety of requests, the response time can vary a lot. Even if you send the same request over and over again, you will notice that the response time differs slightly each time.
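Treating response time as a distribution usually means reporting percentiles (p50, p95, p99) rather than an average, since a mean hides the slow outliers that users actually notice. A minimal sketch using the nearest-rank method; the sample data is invented:

```python
def percentile(samples, p):
    """Return the value below which roughly p% of the samples fall
    (nearest-rank method)."""
    ordered = sorted(samples)
    # nearest-rank: index of the p-th percentile in the sorted list
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical response times in milliseconds for the *same* request:
response_times_ms = [32, 30, 35, 31, 33, 30, 29, 250, 34, 31]

median = percentile(response_times_ms, 50)  # typical user experience
p99 = percentile(response_times_ms, 99)     # tail latency: the worst 1%

print(f"p50={median}ms p99={p99}ms")  # → p50=31ms p99=250ms
```

Note how the single 250 ms outlier dominates p99 while leaving the median untouched: that is exactly the information a single average would hide.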