Intro Flashcards

1
Q

Approaches for coping with load?

A
  1. Scaling up or vertical scaling
  2. Scaling out or horizontal scaling
  3. Using elastic systems when load is unpredictable.
2
Q

Scaling up or vertical scaling?

A

Moving to a more powerful machine

3
Q

Scaling out or horizontal scaling?

A

Distributing the load across multiple smaller machines

4
Q

What are Elastic systems? And when are they useful?

A

Elastic systems automatically add computing resources when they detect a load increase.
They are most useful when load is unpredictable.
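
A minimal sketch of the idea, assuming a simple capacity rule. The function name `desired_instances`, the per-instance capacity, and the instance bounds are hypothetical and not from the source; real autoscalers use richer signals.

```python
import math


def desired_instances(load_rps: float, capacity_rps: float,
                      min_instances: int = 1, max_instances: int = 10) -> int:
    """Return how many instances are needed to serve load_rps.

    Hypothetical rule: each instance can handle capacity_rps requests/second.
    """
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    needed = math.ceil(load_rps / capacity_rps)
    # Clamp so the system never scales to zero or past its budget.
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    # Load rises from 80 to 250 requests/second: one instance becomes three.
    print(desired_instances(80, 100), desired_instances(250, 100))
```

Running this check periodically against measured load is what makes the system elastic: capacity follows demand without an operator having to predict it.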

5
Q

What are the three design principles for software systems in terms of maintainability?

A
  1. Operability
  2. Simplicity
  3. Evolvability
6
Q

What is Operability?

A

Make it easy for operation teams to keep the system running.

Data systems can make routine tasks easy by, for example:

  1. Providing visibility into the runtime behavior and internals of the system, with good monitoring.
  2. Providing good support for automation and integration with standard tools.
  3. Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”).
  4. Self-healing where appropriate, but also giving administrators manual control over the system state when needed.
7
Q

What is Simplicity?

A

Make it easy for new engineers to understand the system by removing as much complexity as possible.

8
Q

What is Evolvability?

A

Make it easy for engineers to make changes to the system in the future.

9
Q

What are functional and nonfunctional requirements?

A
  1. Functional requirements: what the application should do
  2. Nonfunctional requirements: general properties like security, reliability, compliance, scalability, compatibility and maintainability.
10
Q

What is the difference between Latency and response time?

A

Response time is what the client sees; it is always measured on the client side.
Latency is the duration that a request spends waiting to be handled.
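
A small sketch of the distinction, assuming response time can be broken down into network delay, queueing delay, and service time. The class `RequestTiming` and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    network_delay_ms: float   # time spent on the wire, both directions
    queueing_delay_ms: float  # time spent waiting to be handled
    service_time_ms: float    # time spent actually processing the request

    @property
    def latency_ms(self) -> float:
        # Latency: only the time the request was waiting for service.
        return self.queueing_delay_ms

    @property
    def response_time_ms(self) -> float:
        # Response time: everything the client waits for, end to end.
        return (self.network_delay_ms
                + self.queueing_delay_ms
                + self.service_time_ms)


if __name__ == "__main__":
    t = RequestTiming(network_delay_ms=20, queueing_delay_ms=30, service_time_ms=50)
    print(t.latency_ms, t.response_time_ms)
```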

11
Q

When we measure response time, what is a better metric than average response time? And why?

A

Percentiles are a better metric than the average response time because they tell you how many users actually experienced a given delay.

  1. Median (50th percentile or p50): half of user requests are served in less than the median response time, and the other half take longer.
  2. The 95th, 99th and 99.9th percentiles (p95, p99 and p999) show how bad your outliers are.
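
The sketch below compares the mean with p50, p95, p99 and p999 on a list of response times. It uses the nearest-rank definition of a percentile, which is one common convention among several; the function names and the sample data are made up for illustration.

```python
import statistics


def nearest_rank(sorted_times, permille):
    """Nearest-rank percentile; permille=990 means the 99th percentile."""
    if not sorted_times:
        raise ValueError("need at least one measurement")
    n = len(sorted_times)
    # Integer ceiling of permille * n / 1000, avoiding float rounding.
    rank = max(1, (permille * n + 999) // 1000)
    return sorted_times[rank - 1]


def summarize(times_ms):
    """Return the mean and the usual percentiles of response times."""
    ordered = sorted(times_ms)
    return {
        "mean": statistics.mean(ordered),
        "p50": nearest_rank(ordered, 500),
        "p95": nearest_rank(ordered, 950),
        "p99": nearest_rank(ordered, 990),
        "p999": nearest_rank(ordered, 999),
    }


if __name__ == "__main__":
    # 98 fast requests and 2 very slow ones (hypothetical data).
    print(summarize([10] * 98 + [1000, 2000]))
```

With this sample the mean is pulled far above what a typical user saw, while p50 stays at the typical value and the high percentiles expose the outliers.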
12
Q

What are the common percentile measures used for response time?

A
  1. Median (50th percentile or p50): half of user requests are served in less than the median response time.
  2. The 95th, 99th and 99.9th percentiles (p95, p99 and p999) show how bad your outliers are.
13
Q

Give an example of using the 99.9th percentile as a measure of response time. And why is it not common practice?

A

Amazon uses the 99.9th percentile for response time requirements for internal services because the customers with the slowest requests often have the most data.
It is not common practice because optimizing for such high percentiles is expensive.

14
Q

What accounts for a large part of response times at high percentiles? And why are high percentiles not common practice?

A

Queueing delays often account for a large part of the response time at high percentiles.
High percentiles are not commonly targeted because optimizing for them is expensive.
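
One way queueing delays arise is a single slow request holding up the fast ones behind it. The sketch below simulates a single-server first-in-first-out queue; the function `simulate_fifo` and the timings are hypothetical and only illustrate the effect.

```python
def simulate_fifo(arrivals_ms, service_ms):
    """Simulate one server handling requests in arrival order.

    Returns a list of (queueing_delay, response_time) per request.
    Network delay is ignored in this simplified model.
    """
    results = []
    server_free_at = 0
    for arrival, service in zip(arrivals_ms, service_ms):
        start = max(arrival, server_free_at)  # wait if the server is busy
        finish = start + service
        results.append((start - arrival, finish - arrival))
        server_free_at = finish
    return results


if __name__ == "__main__":
    # One slow request (100 ms) followed by four fast ones (5 ms each).
    for queued, response in simulate_fifo([0, 10, 20, 30, 40],
                                          [100, 5, 5, 5, 5]):
        print(f"queued {queued} ms, response {response} ms")
```

The four fast requests need only 5 ms of processing each, yet almost all of their response time is spent queued behind the slow one.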

15
Q

What are SLOs and SLAs?

A

Service level objectives (SLOs) and service level agreements (SLAs) are contracts that define the expected performance and availability of a service. These metrics set expectations for clients of the service and allow customers to demand a refund if the SLA is not met.

An SLA may state the median response time to be less than 200ms and a 99th percentile under 1s.
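
The example thresholds above can be checked against measured response times roughly as follows. This is a sketch: the function `meets_sla` is hypothetical, and the 99th percentile uses the nearest-rank convention, which a real SLA would need to spell out.

```python
import statistics


def meets_sla(times_ms, median_limit_ms=200, p99_limit_ms=1000):
    """Return True if median < median_limit_ms and p99 < p99_limit_ms."""
    if not times_ms:
        raise ValueError("need at least one measurement")
    ordered = sorted(times_ms)
    n = len(ordered)
    # Nearest-rank 99th percentile: integer ceiling of 0.99 * n.
    rank = max(1, (99 * n + 99) // 100)
    p99 = ordered[rank - 1]
    return statistics.median(ordered) < median_limit_ms and p99 < p99_limit_ms


if __name__ == "__main__":
    print(meets_sla([100] * 99 + [900]))         # fast median, mild tail
    print(meets_sla([100] * 98 + [1500, 1500]))  # tail breaks the p99 limit
```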

16
Q

What is reliability?

A

Reliability: The system should work correctly (performing the correct function at the desired level of performance) even in the face of adversity.

17
Q

What is Scalability?

A

Scalability: As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.

18
Q

What is maintainability?

A

Maintainability: People should be able to work on the system productively in the future.

19
Q

How can human errors be reduced?

A
  1. Designing systems in a way that minimizes opportunities for error through well-designed abstractions, APIs, and admin interfaces.
  2. Decoupling the places where people make the most mistakes from the places where they can cause failures, e.g. by providing a fully featured non-production sandbox environment where people can explore and experiment safely, using real data, without affecting real users.
  3. Testing thoroughly at all levels, from unit tests and integration tests to manual and automated tests.
  4. Allowing quick and easy recovery from human errors to minimize the impact of failure, e.g. by making it easy to roll back configuration changes and roll out new code gradually (so bugs do not affect all users).
  5. Setting up detailed and clear monitoring, such as performance metrics and error rates.
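
Rolling out new code gradually can be sketched as a percentage gate. This is one possible approach, not something the source prescribes; the function `in_rollout` and the user ids are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user should get the new code.

    Each user is hashed into a stable bucket from 0 to 99, so the same
    user always gets the same answer for a given percent.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    exposed = sum(in_rollout(u, 10) for u in users)
    print(f"{exposed} of {len(users)} users see the new code at 10%")
```

Because buckets are stable, raising the percentage only adds users, and setting it back to 0 acts as a quick rollback.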