Chapter 1 - Reliable, Maintainable and Scalable Applications Flashcards

1
Q

What would we use typically to store data so that one application or another can find it again later?

A

A database

2
Q

What would we typically use to remember the result of an expensive operation so we could speed up reads?

A

A cache
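
To make the idea concrete, here is a minimal sketch of an in-process cache using Python's `functools.lru_cache`. The function `expensive_square` and the counter `call_count` are hypothetical stand-ins for a slow computation; a real system would more likely use a dedicated caching tool.

```python
from functools import lru_cache

call_count = 0  # counts how often the expensive function really runs


@lru_cache(maxsize=128)
def expensive_square(n: int) -> int:
    """Hypothetical stand-in for a slow computation or remote query."""
    global call_count
    call_count += 1
    return n * n


first = expensive_square(12)   # computed, then stored in the cache
second = expensive_square(12)  # served from the cache, no recomputation
```

After both calls, `call_count` is still 1: the second read was answered from the cache.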

3
Q

What would we build to allow efficient searching by keyword or filtering?

A

A search index

4
Q

What would we use (one example) to send a message to another process in a way that allows that message to be processed asynchronously?

A

Stream processing
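
As a rough illustration of asynchronous message handling, the sketch below uses Python's standard `queue.Queue` with a worker thread. This is an in-process stand-in, assumed here only for illustration: a real system would send messages between separate processes through a message broker. The names `messages`, `processed` and `worker` are hypothetical.

```python
import queue
import threading

messages = queue.Queue()
processed = []


def worker():
    """Consume messages until the None sentinel arrives."""
    while True:
        item = messages.get()
        if item is None:
            break
        processed.append(item.upper())


consumer = threading.Thread(target=worker)
consumer.start()

for text in ("order created", "order paid"):
    messages.put(text)  # returns immediately; the sender does not wait for processing

messages.put(None)  # sentinel: tell the worker to stop
consumer.join()
```

The sender only enqueues messages; the worker handles them at its own pace, which is the essence of asynchronous processing.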

5
Q

What kind of processing would typically describe periodically crunching large amounts of accumulated data?

A

Batch processing

6
Q

Does a one-size-fits-all approach typically work well today when designing applications?

A

It depends, but in general, no. There are many different types of databases, search indexes and caching tools, and knowing which ones to choose and how to combine them is critical when designing most modern applications.

7
Q

When you combine several tools in a service and hide the details behind a common Application Programming Interface, what have you done?

A

You’ve essentially created your own special-purpose data system from smaller, more general-purpose components. You need to think hard about the guarantees it can make and the trade-offs in how you combine these tools.

8
Q

Sum up the meaning of ‘Reliability’

A

The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).

9
Q

Sum up the meaning of ‘Scalability’

A

As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.

10
Q

Sum up the meaning of ‘Maintainability’

A

Over time, many different people will work on the system (engineering and operations, both maintaining current behaviour and adapting the system to new use cases), and they should all be able to work on it productively.

11
Q

Give 4 reasonable expectations of Reliability

A

The application performs the function that the user expected.

It can tolerate the user making mistakes or using the software in unexpected ways.

Its performance is good enough for the required use case, under the expected load and data volume.

The system prevents any unauthorised access and abuse.

12
Q

Is it a reasonable expectation to make a system tolerant of every possible fault?

A

No, there are some faults which a system could never reasonably be expected to recover from. If the world ended, you would have to host the system in space. We have to decide which types of faults we want our system to tolerate.

13
Q

What’s the difference between a fault and a failure?

A

A fault is one component of the system deviating from expected behaviour, whereas a failure is the entire system not performing as expected. We can’t prevent all faults, but we can design fault-tolerance mechanisms that stop faults from becoming failures.
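
One way to picture the distinction is a read that fails over between redundant replicas. The sketch below is only an illustration under assumed names (`ReplicaDown`, `read_with_failover`, `broken_replica`, `healthy_replica` are all hypothetical): a single faulty replica is tolerated, and the caller sees a failure only if every replica is faulty.

```python
from typing import Callable, Sequence


class ReplicaDown(Exception):
    """Raised by a replica that has suffered a fault."""


def read_with_failover(replicas: Sequence[Callable[[str], str]], key: str) -> str:
    """Try each replica in turn; fail only if every replica is faulty."""
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaDown:
            continue  # a fault in one component: move on to the next
    raise RuntimeError("all replicas failed")  # the faults became a failure


def broken_replica(key: str) -> str:
    raise ReplicaDown("disk died")


def healthy_replica(key: str) -> str:
    return f"value-for-{key}"


result = read_with_failover([broken_replica, healthy_replica], "user:1")
```

Here the first replica faults, yet the read as a whole still succeeds, so the fault never turns into a failure.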

14
Q

What are some ways we could deal with hardware faults?

A

Redundancy of components and infrastructure

Using software systems that can tolerate the loss of an entire machine (Kubernetes for example)

15
Q

What are some examples of Software faults?

A

A bug in the software

A runaway process that uses up shared resources (CPU time, memory, disk space, network bandwidth).

Cascading failures (a fault in one component triggers a fault in another component which triggers a fault in another component).

16
Q

What are some methods of preventing software faults?

A

Thinking carefully through assumptions and interactions in the system

Testing

Process isolation

Allowing processes to crash and restart

Measuring, monitoring and analysing system behaviour in production

17
Q

How do we minimise the possibility for human error?

A

Design our systems in a way that minimises the opportunity for humans to make mistakes and makes it easy for them to do the right thing. If we make the system too restrictive, though, people will work around it.

Decouple the places where people make the most mistakes from the places that can cause failures. Sandbox environments that mirror production allow people to play and make mistakes without a huge impact.

Test thoroughly at all levels: unit tests, integration tests and end-to-end tests.

Make it quick and easy to recover from human errors: allow fast rollback of configuration changes, roll out new code gradually, and provide tools to recompute data if the original computation was wrong.

Set up detailed monitoring/instrumentation.

Implement good management practices, training and a good culture.

18
Q

How important is reliability?

A

It depends. Sometimes we might choose to sacrifice reliability to reduce cost, but it is generally very important.

19
Q

For scalability, how would you measure load on the system?

A

With pre-defined parameters called load parameters.

The best choice depends on your system and its architecture and requirements.

Examples include requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, or the hit rate on a cache.

20
Q

What is the key factor to consider, for example, when looking at Twitter’s load?

A

Fan-out. One tweet has to make its way into the timelines of all of a user’s followers, in order with all the other tweets from other people that they follow.

Each person follows many people and is followed by many people.

21
Q

Is it preferable for Twitter to do more work at write or read time?

A

Write time, because the rate of tweets posted is almost two orders of magnitude lower than the rate of home timeline requests.

This is why Twitter builds a cache for each user’s home timeline and updates each cache as needed with new tweets.

22
Q

For a person with 30 million followers, what does Twitter’s caching approach mean in terms of writes when that person posts a tweet?

A

30 million writes occur!
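
The write amplification can be sketched with a toy fan-out-on-write model. This is only an illustration, not Twitter's actual implementation: the `followers` and `timelines` structures and the user names are hypothetical, and everything lives in memory.

```python
from collections import defaultdict, deque

followers = defaultdict(set)    # author -> set of follower ids
timelines = defaultdict(deque)  # user -> cached home timeline, newest first


def follow(follower: str, author: str) -> None:
    followers[author].add(follower)


def post_tweet(author: str, text: str) -> int:
    """Fan out on write: push the tweet into every follower's cache."""
    writes = 0
    for follower in followers[author]:
        timelines[follower].appendleft((author, text))
        writes += 1
    return writes


def home_timeline(user: str) -> list:
    """Reading is cheap: the timeline was already assembled at write time."""
    return list(timelines[user])


for fan in ("ann", "bob", "cat"):
    follow(fan, "dana")
follow("ann", "eve")

writes_dana = post_tweet("dana", "hello")  # one write per follower of dana
writes_eve = post_tweet("eve", "hi")       # one write per follower of eve
```

The number of cache writes per tweet equals the author's follower count, which is why an account with 30 million followers triggers 30 million writes.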

23
Q

Is distribution of followers per user a good load parameter for Twitter?

A

Yes, it determines how many writes are going to happen and the fanout load.

24
Q

What would be a good way to think about load and its effect on your system?

A
  • When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
  • When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
25
Q

Describe some metrics for measuring performance.

A
  • Throughput (number of records we can process per second)
  • Response Time (time between a client sending a request and receiving a response) - This would be measured as a distribution of response times. The mean (average) is not very good here; percentiles are better.
  • Latency (how long a request is waiting to be handled)
26
Q

If the 95th percentile response time is 1.5 seconds, what does that mean?

A

95 out of 100 requests take less than 1.5 seconds to complete, and 5 out of 100 take 1.5 seconds or more.
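
A small sketch of how such a percentile could be computed, using the nearest-rank method (one of several common definitions; this one treats the percentile as the smallest sample with at least p% of samples at or below it). The sample data is made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in seconds: 94 fast requests and 6 slower ones.
response_times = [0.2] * 94 + [1.5, 2.0, 3.0, 4.0, 5.0, 9.0]

median = percentile(response_times, 50)
p95 = percentile(response_times, 95)
p99 = percentile(response_times, 99)
mean = sum(response_times) / len(response_times)
```

With this data the median is 0.2 seconds and the 95th percentile is 1.5 seconds, while the mean lands in between and describes neither the typical request nor the slow tail.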

27
Q

What agreements define when a service is and is not fulfilling its obligations, and what its objectives are?

A

Service Level Agreements and Service Level Objectives

28
Q

Why should you measure response times on the client and not on the server?

A

Queueing delays often account for a large part of the response time at high percentiles. As a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests—an effect sometimes known as head-of-line blocking. Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time waiting for the prior request to complete.

Due to this effect, it is important to measure response times on the client side.
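
Head-of-line blocking can be illustrated with a tiny simulation of a single worker serving a FIFO queue. This is a simplified model under assumed numbers: the arrival and service times are made up, and `simulate_fifo` is a hypothetical helper.

```python
def simulate_fifo(requests):
    """Single worker, FIFO queue.

    requests: list of (arrival_time, service_time) sorted by arrival_time.
    Returns the client-observed response time (queueing + service) per request.
    """
    worker_free_at = 0.0
    observed = []
    for arrival, service in requests:
        start = max(arrival, worker_free_at)  # wait if the worker is busy
        finish = start + service
        observed.append(finish - arrival)
        worker_free_at = finish
    return observed


# One slow request (2 s) arrives first, followed by three fast ones (0.25 s each).
requests = [(0.0, 2.0), (0.5, 0.25), (1.0, 0.25), (1.5, 0.25)]
service_times = [service for _, service in requests]
response_times = simulate_fifo(requests)
```

The fast requests need only 0.25 seconds of processing each, but the client observes 1.75, 1.5 and 1.25 seconds because they queue behind the slow request. Timing only the processing on the server would miss that delay.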

29
Q

When generating artificial load, should the client generate requests based on the response time or independent of the response time?

A

Independent. If the client waits for each response before sending the next request, the queues in the test stay artificially shorter than they would be in reality, which skews the measurements.
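
The difference between the two styles of load generation can be sketched with a simple simulation of one FIFO worker. The numbers and helper names (`response_times`, `closed_loop_schedule`, `open_loop_schedule`) are assumptions chosen for illustration.

```python
def response_times(send_times, service_time):
    """Single FIFO worker: return the response time seen for each request."""
    worker_free_at = 0.0
    observed = []
    for sent in send_times:
        start = max(sent, worker_free_at)
        worker_free_at = start + service_time
        observed.append(worker_free_at - sent)
    return observed


def closed_loop_schedule(n, service_time):
    """Send the next request only after the previous response has arrived."""
    schedule, now = [], 0.0
    for _ in range(n):
        schedule.append(now)
        now += service_time  # waiting for the response delays the next send
    return schedule


def open_loop_schedule(n, interval):
    """Send on a fixed timetable, independent of how fast responses come back."""
    return [i * interval for i in range(n)]


SERVICE_TIME = 1.0     # hypothetical: the server needs 1 s per request
TARGET_INTERVAL = 0.5  # hypothetical: we want to test 2 requests per second

closed = response_times(closed_loop_schedule(4, SERVICE_TIME), SERVICE_TIME)
opened = response_times(open_loop_schedule(4, TARGET_INTERVAL), SERVICE_TIME)
```

The closed-loop client never sees a response slower than 1 second because it throttles itself, whereas the open-loop client reveals the growing queue: its response times climb from 1.0 to 2.5 seconds.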

30
Q

What are the two dimensions of scaling?

A

Up (add more resources to a single machine) or Out (add more machines/processes)

31
Q

If your system is stateful, is it easier or harder to scale out?

A

Much, much harder: you have the complexity of these different processes all being stateful and having to coordinate. I think this is why Kafka partitions topics, so that stateful systems can be scaled horizontally and reliably.

32
Q

What are the main assumptions to consider for an application that would scale well for a particular use case?

A

What kinds of operations are common and which are rare.

33
Q

Name three principles that we should consider when trying to make a system maintainable.

A

Operability - Make it easy for operations teams to keep the system running smoothly.

Simplicity - Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)

Evolvability - Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.