Reliable, Scalable, Maintainable Systems Flashcards

1
Q

What are five common data processing options for data-intensive applications, and what, broadly, are they used for?

A
  1. Databases: Persistent data storage
  2. Caches: Speed up reads to frequently-read data
  3. Search Indexes: Allow users to efficiently filter and search data
  4. Stream Processing: Asynchronous messaging between processes
  5. Batch Processing: Periodically processing large volumes of accumulated data
2
Q

What two factors changed over the past decade in terms of how data processing systems are categorised?

A
  1. The distinction between various types of data processing options has become blurred, and modern solutions tend to fit in multiple categories, e.g. Redis can function both as a data store and as a message queue
  2. Rather than having a single tool that is used for all purposes, systems tend to be composed of multiple disparate systems that each meet a specific need, and these are orchestrated by application code
3
Q

What is meant by the reliability of a system?

A

A reliable system continues to work correctly (that is, returns accurate data within a certain performance constraint) in the event of hardware faults, software faults and human error.

4
Q

What is meant by the scalability of a system?

A

As the system's data or load requirements grow in inherent complexity, scalability describes how well the system can grow to meet those requirements without a corresponding growth in incidental complexity, judged through some measurement of its performance under the increased load.

5
Q

What is meant by the maintainability of a system?

A

As the people working on a system grow in number or change over time, maintainability refers to how easily they can keep the system performing its existing functions and adapt it to new use cases that may arise.

6
Q

What are some things that can broadly be expected of a reliable system?

A
  1. The system performs its expected function
  2. It can tolerate user error
  3. It is sufficiently performant for its use case and continues to be as volume grows
  4. It prevents any unauthorised access from malicious users
7
Q

What is the distinction between a fault and a failure?

A

A fault is some unexpected, adverse event that can affect a system, e.g. a software or hardware failure within one of its components. A failure, on the other hand, is where the system as a whole (as a result of one or more faults) stops providing its intended service to the user.

8
Q

How do you ensure a system is fault tolerant?

A

By inducing faults in the system (e.g. simulating hardware failures, attacks, service outages) and monitoring system behaviour, you can develop a profile of what types of faults a system can tolerate. No system can tolerate every type of fault, so it is critical to identify which types of faults the system is designed to tolerate and within what constraints.

9
Q

What are two reasons that the trend moved away from ensuring individual hardware components are resilient and toward systems becoming more resilient to individual hardware failures?

A
  1. As systems' data processing requirements grow in complexity, the amount of hardware they need to operate robustly and performantly also grows; with more machines, the likelihood that some individual component fails increases, so a single hardware failure should not be allowed to cause a system failure
  2. On shared cloud infrastructure like AWS, virtual machine instances can become unavailable without warning, since these platforms are designed for flexibility and elasticity rather than for preserving any individual machine
10
Q

Why are software faults more likely to cause system failures than hardware faults?

A
  1. Hardware faults tend to be uncorrelated (except in extreme circumstances), while software faults tend to be correlated across the redundant replicas of a service: a fault that affects a node serving a specific function is likely to affect all of its replicas at once, causing an outage in a critical part of the system and hence a failure of the whole system
  2. Software faults tend to be less predictable, as some are inherent to the design and only surface under unusual circumstances
11
Q

What are some ways that a software fault can cause an outage in a distributed, redundant system?

A
  1. Bad inputs result in an unhandled exception on all replicas of a back-end service
  2. A memory leak/bug causes one service to consume all of a shared hardware resource, causing other services sharing the same hardware to fail
  3. An operating system service fails, and services dependent on that system service can fail or hang as a result
  4. Cascading failures, i.e. one fault causing another fault and so on until the system fails overall
12
Q

What is a load parameter in the context of measuring the data load of a system?

A

Any measurable quantity, treated as an independent variable, that describes the operations occurring within a system (see the sketch after this list). Some examples include:

  1. Requests per second to a web server
  2. Read/write ratio for a database
  3. Cache hit rate
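
As a minimal illustration (not from the source; the log format and field names are made up), two of these load parameters can be derived from a request log, as in this Python sketch:

  # Minimal sketch: deriving load parameters from a hypothetical request log.
  from collections import Counter

  request_log = [
      {"timestamp": 0.1, "cache": "hit"},
      {"timestamp": 0.4, "cache": "miss"},
      {"timestamp": 0.9, "cache": "hit"},
      {"timestamp": 1.2, "cache": "hit"},
  ]

  # Requests per second over the observed window.
  window_seconds = request_log[-1]["timestamp"] - request_log[0]["timestamp"]
  requests_per_second = len(request_log) / window_seconds

  # Cache hit rate: fraction of requests served from the cache.
  outcomes = Counter(entry["cache"] for entry in request_log)
  cache_hit_rate = outcomes["hit"] / len(request_log)

  print(f"req/s: {requests_per_second:.1f}, cache hit rate: {cache_hit_rate:.0%}")
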
13
Q

How do you describe performance in relation to the load parameters of a system?

A
  1. When a load parameter increases and the system's resources (CPU, memory, network, etc.) are kept the same, how is the performance of the system affected?
  2. When you increase a load parameter, how much do you need to increase the resources available to the system in order for it to have the same performance?
14
Q

What is one major difference between how performance is measured for batch processing systems compared with real-time processing systems?

A
  1. In batch processing systems, the number of records that can be processed within a given timeframe (the throughput) is a measure of its performance
  2. In real-time systems, the more important metric is the response time, i.e. the time it takes for a user interaction to have a corresponding response
15
Q

What is the difference between latency and response time?

A

Response time is what the client observes: the time to actually process the request (the service time) plus network and queueing delays. Latency is the duration a request spends waiting to be handled, during which it is latent, awaiting service.

16
Q

What is a common method of measuring the performance of a real-time system mathematically?

A

By measuring response times over time and collecting samples, statistical information can be determined from these samples, e.g. ordering them from fastest to slowest response time to determine the median, 95th percentile, 99th percentile response times, etc.
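
For example, a minimal Python sketch (the sample values are invented) that computes the median, 95th and 99th percentile response times from collected samples using the nearest-rank method:

  # Minimal sketch: percentile response times from collected samples.
  import math

  def percentile(samples, p):
      """Return the p-th percentile (0-100) using the nearest-rank method."""
      ordered = sorted(samples)
      rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
      return ordered[rank - 1]

  response_times_ms = [12, 15, 14, 230, 16, 13, 18, 900, 17, 15]

  for p in (50, 95, 99):
      print(f"p{p}: {percentile(response_times_ms, p)} ms")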

17
Q

What are some examples of things that can cause random additional latency in back-end systems?

A
  1. Packet loss
  2. Garbage collection delays
  3. Page faults
  4. Physical effects on hardware (vibration, heat)
18
Q

What are tail latencies?

A

Tail latencies are the response times that sit in the very high percentile end of the response time distribution, e.g. 99.9th percentile response times.

19
Q

What is one major cause of long response times?

A

Queueing delays often account for a large part of long response times, because a server can only process a limited number of requests in parallel.

20
Q

What is head-of-line blocking?

A

Head-of-line blocking is the phenomenon where a small number of slow requests hold up the queue, increasing the response times of subsequent requests that would otherwise have executed quickly.

21
Q

If a given operation in a distributed system involves calls to multiple back-end components in a system, what does this mean for the overall response time of the original operation?

A

Assuming the calls to each backend service are made in parallel, the slowest of those backend calls determines the overall response time of the operation, so even a small proportion of slow backend calls can dominate (as sketched below).
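
A rough illustration (the latency distribution below is invented): because the operation is only as fast as its slowest parallel call, the chance of hitting at least one slow backend call grows quickly with the number of calls:

  # Rough illustration: with parallel fan-out the operation is as slow as its
  # slowest backend call, so slow outliers dominate more often as fan-out grows.
  import random

  def backend_call_ms():
      # Invented distribution: 1% of calls take 1 second, the rest ~10 ms.
      return 1000 if random.random() < 0.01 else 10

  def operation_ms(fan_out):
      return max(backend_call_ms() for _ in range(fan_out))

  random.seed(42)
  trials = 10_000
  for fan_out in (1, 10, 100):
      slow = sum(operation_ms(fan_out) >= 1000 for _ in range(trials)) / trials
      print(f"fan-out {fan_out:>3}: {slow:.1%} of operations hit a slow call")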

22
Q

In terms of vertical and horizontal scaling, what is the most practical approach when dealing with the majority of systems?

A

A combination of both vertical and horizontal scaling is typically most practical.

23
Q

What is one benefit and one drawback of an elastic system?

A

Elastic systems cope well with unpredictable loads by scaling dynamically, whereas manually-scaled systems are simpler and more predictable from an operations standpoint.

24
Q

What is typical of a maintainable system?

A
  1. It is simple for an operations team to run
  2. The code has low incidental complexity relative to inherent complexity
  3. The system can evolve to meet new use cases and changing constraints
25
Q

What are the main driving forces behind using a NoSQL database?

A
  1. A need for query operations that the relational model does not support well
  2. A desire for a more dynamic and expressive data model, less restrictive than relational schemas

26
Q

What is a key problem with the relational data model for applications that are commonly written today, and what is a way to mitigate this problem?

A

Most business applications are written in object-oriented languages, which require an awkward translation layer between the in-memory representation of an object and how it is represented in relational tables, rows and columns (the object-relational impedance mismatch). ORM frameworks partially abstract away this mismatch between the two models; a sketch of the translation they automate follows.
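
As a sketch of that translation layer (the classes and field names here are illustrative, not from the source), one nested object must be flattened into rows for several tables, which is the work an ORM automates:

  # Illustrative sketch of the impedance mismatch: one in-memory object must be
  # translated into rows spread across multiple tables (done by hand here; an
  # ORM automates most of this mapping).
  from dataclasses import dataclass

  @dataclass
  class Position:
      job_title: str
      organization: str

  @dataclass
  class UserProfile:
      user_id: int
      name: str
      positions: list[Position]

  profile = UserProfile(251, "Bill", [Position("Co-founder", "Microsoft")])

  # One object -> a row in a users table plus rows in a positions table that
  # reference it by foreign key.
  user_row = (profile.user_id, profile.name)
  position_rows = [(profile.user_id, p.job_title, p.organization)
                   for p in profile.positions]
  print(user_row, position_rows)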

27
Q

What are some of the inherent advantages of a JSON data model, and when are these advantages realised?

A
  1. Documents better match objects in structure
  2. Better data locality, so related data can be read together without joins (see the sketch after this list)
  3. Schema-on-read, so evolvable
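
For instance (an illustrative, made-up profile document), a self-contained document keeps related items nested together, so a single read returns the whole structure without joins:

  # Illustrative only: a self-contained document groups related data together,
  # so one read returns the whole profile and no joins are needed.
  import json

  profile = {
      "user_id": 251,
      "first_name": "Bill",
      "positions": [
          {"job_title": "Co-founder", "organization": "Microsoft"},
          {"job_title": "Co-chair", "organization": "Gates Foundation"},
      ],
      "education": [
          {"school_name": "Harvard University", "start": 1973, "end": 1975},
      ],
  }

  print(json.dumps(profile, indent=2))
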
28
Q

What is the primary idea behind normalisation in databases?

A

Removing duplication of meaningful values that are shared across multiple records in the database: instead of repeating the value itself, each record stores an ID (a foreign key) that references a single row in a table of standardised values, as in the sketch below.
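
A minimal sketch of this idea (using an in-memory SQLite database; the table and column names are illustrative): free-text region names are replaced by an ID referencing one shared row of standardised values:

  # Minimal sketch (illustrative schema): instead of repeating a free-text
  # region name on every user row, store a region_id that references one
  # shared, standardised row.
  import sqlite3

  db = sqlite3.connect(":memory:")
  db.executescript("""
      CREATE TABLE regions (
          id   INTEGER PRIMARY KEY,
          name TEXT NOT NULL                        -- standardised value, stored once
      );
      CREATE TABLE users (
          id        INTEGER PRIMARY KEY,
          name      TEXT NOT NULL,
          region_id INTEGER REFERENCES regions(id)  -- foreign key, not duplicated text
      );
  """)
  db.execute("INSERT INTO regions (id, name) VALUES (1, 'Greater Seattle Area')")
  db.executemany("INSERT INTO users (id, name, region_id) VALUES (?, ?, ?)",
                 [(1, "Alice", 1), (2, "Bob", 1)])

  # A single point of update: renaming the region changes it for every user
  # that references it.
  db.execute("UPDATE regions SET name = 'Seattle Metropolitan Area' WHERE id = 1")
  for row in db.execute("""SELECT users.name, regions.name
                           FROM users JOIN regions ON users.region_id = regions.id"""):
      print(row)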

29
Q

What are the main benefits of normalising a database?

A
  1. Single point of update
  2. Better data consistency
  3. Better semantics than arbitrary text
  4. Easier localisation support
30
Q

What are the major drawbacks of a document-based model?

A
  1. In many-to-one relationships, where many records reference one common record, the reference must be stored as an ID in each document to avoid duplication, and resolving that ID may require multiple queries or an application-side join (see the sketch after this list)
  2. Even if a data model did not originally require many-to-one or many-to-many relationships or joins, it may evolve over time into a more interconnected structure that does, and document databases have limited support for joins
  3. Denormalising data or replicating joins in application code can lead to worse maintainability, reliability and performance for the data model
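
To make drawback 1 concrete, a small sketch (made-up keys, with in-memory dicts standing in for a document store) of the application-side join that becomes necessary when a reference is stored as an ID:

  # Illustrative sketch: the application resolves the many-to-one reference
  # itself, issuing an extra lookup per referenced record (a "join" in code).
  user_documents = {
      "user:1": {"name": "Alice", "region_id": "region:51"},
      "user:2": {"name": "Bob", "region_id": "region:51"},
  }
  region_documents = {
      "region:51": {"name": "Greater Seattle Area"},
  }

  def fetch(store, key):
      # Stand-in for a query against the document store.
      return store[key]

  for key in ("user:1", "user:2"):
      user = fetch(user_documents, key)
      region = fetch(region_documents, user["region_id"])  # second query per user
      print(user["name"], "-", region["name"])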