Data Intensive Ch1 - Reliable, Scalable, and Maintainable Applications Flashcards
Pillars of reliability
System should continue to work CORRECTLY (correct function at desired performance) even in the face of ADVERSITY
Tolerating:
Hardware faults
Software faults
Human error
Pillars of scalability
As the system GROWS in data volume, traffic volume, or complexity, there should be reasonable ways of dealing with that growth
Measuring load
Measuring performance
Latency percentiles
Throughput
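A minimal sketch of computing latency percentiles from a batch of measured response times, using the nearest-rank method; the timings below are made-up sample data, not from the book:

```python
import math

# Sketch: latency percentiles over a batch of response times.
# The timings are made-up sample data for illustration.
response_times_ms = sorted([12, 15, 18, 21, 30, 35, 45, 80, 120, 900])

def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value such that at
    least p% of the samples are <= it."""
    k = math.ceil(p / 100 * len(sorted_values)) - 1
    return sorted_values[max(k, 0)]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(response_times_ms, p)} ms")
# p50 (the median) describes the typical request; p95/p99 capture
# tail latency, which a plain average would hide.
```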
Maintainability
Over time, many people will work on the system (engineering and operations), and they should be able to do so PRODUCTIVELY
Operability
Simplicity
Evolvability
What is the difference between data-intensive and compute-intensive apps?
Data-intensive apps are rarely limited by CPU power. The challenges are:
Amount of data
Complexity of data
Speed at which data is changing
How are data-intensive apps typically built?
From standard building blocks providing commonly needed functionality like:
Database for storing data
Caches for storing results of expensive operations or to speed up reads
Search indexes to allow looking up data by keyword or filtering it in various ways
Stream processing to send an async message to another process
Batch processing to periodically crunch large amounts of accumulated data
These blocks are such obvious abstractions that nobody thinks about writing them from scratch
BUT each block comes in many variants with different characteristics, and different apps have different requirements
Combining tools can be difficult when the requirement is to do something no single tool can do alone
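For example, even combining just two of the blocks above, a cache in front of a database, makes the application code responsible for glue logic. A minimal cache-aside sketch; the plain dict and db_lookup() are hypothetical stand-ins, not a real cache or database API:

```python
# Sketch: cache-aside read path combining a cache and a database.
# `cache` (a plain dict) and db_lookup() are hypothetical stand-ins.
cache = {}

def db_lookup(key):
    # Placeholder for an expensive database query.
    return f"value-for-{key}"

def get(key):
    if key in cache:            # hit: serve the cheap, fast copy
        return cache[key]
    value = db_lookup(key)      # miss: pay for the expensive read
    cache[key] = value          # remember it for next time
    return value
```

The glue is now the app's problem: on writes it must invalidate or update the cached copy, or readers will see stale data.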
A database and a message queue have some superficial similarity - both store data for some time. So what is different?
Access patterns to data -> different performance characteristics -> different underlying implementation.
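A toy contrast of the two access patterns, using plain Python stand-ins rather than a real database or message broker:

```python
from collections import deque

# Database-style access: random reads by key; data persists
# until explicitly changed or deleted.
store = {"user:1": "alice", "user:2": "bob"}
print(store["user:2"])    # can be read again and again
print(store["user:2"])

# Queue-style access: FIFO delivery; each message is handed to
# a consumer once and is then gone.
queue = deque(["msg-1", "msg-2"])
print(queue.popleft())    # "msg-1" is consumed
print(queue.popleft())    # "msg-2" is consumed; queue is now empty
```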
Context map
Fig 1-1, p. 5 (one possible architecture for a data system that combines several components)
Factors that influence the design of data systems
Skill & Exp of people involved Legacy system dependencies Time scale for delivery Org's tolerance of diff kinds of risk Regulatory constraints
What does working correctly mean in the context of reliability?
App performs the function user expects
It tolerates user making mistakes or using software in unexpected ways
Performance is good enough for the use case, under expected load
System prevents any unauthorized access and abuse
Things that can go wrong are called…
A system that anticipates faults and copes with them is called…
Faults
fault-tolerant or resilient
Fault-tolerant does not mean the system can tolerate EVERY possible fault
Fault vs failure
Fault - one component of the system deviates from its spec
Failure - the system as a whole stops providing the required service to the user
Hardware errors
Usually thought of as random and independent from each other
Failure of a disk on one machine usually does not imply failure on another machine (though failures can be correlated, e.g. if the server rack's temperature goes up)
Redundancy of disks (hardware components) was enough until recently
Single machine failure was rare so multi-machine redundancy was not needed
As data volume grows, apps began using more machines which increases probability of hardware faults
Cloud platforms commonly do not guarantee single-machine reliability
Hence the movement towards systems that tolerate the loss of entire machines, in addition to hardware redundancy
Examples:
Hard disk crashes
Faulty RAM
Power grid blackout
Cable unplugged
Software errors
Bug which causes app instance to crash on given input
Runaway process that eats up a shared resource like RAM or network bandwidth
External service dependency slows down, crashes, or returns corrupted responses (SAM JWKS hello hello!)
Cascading failures - a small fault in one component triggers a fault in another component, which triggers further faults, and so on
Usually lie dormant until triggered by unusual circumstances
Usually reveal some assumption about the app's environment that is USUALLY true (until one day it isn't)
Remedies: analysis, testing, process isolation, monitoring and alerts
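One remedy worth sketching: bounding every call to an external dependency with a timeout, so a slow or hung service cannot tie up the caller's resources indefinitely and turn into a cascading failure. A minimal sketch using Python's standard library; the URL in the usage note is a made-up placeholder:

```python
# Sketch: guarding a call to an external dependency with a timeout
# so a slow or hung service fails fast instead of stalling the caller.
import urllib.request
import urllib.error

def fetch_with_timeout(url, timeout_seconds=2.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            return resp.read()
    except (urllib.error.URLError, TimeoutError) as exc:
        # Fail fast so the caller can degrade gracefully (fallback,
        # cached value, error response) instead of hanging.
        raise RuntimeError(f"dependency call failed: {exc}") from exc

# Usage (hypothetical endpoint):
# data = fetch_with_timeout("https://example.com/api/status")
```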