L8 - Cluster Management Flashcards

1
Q

What is resource allocation?

A

how much CPU/DRAM/disk/net to allocate to each app

2
Q

What is resource assignment?

A

What should run on which physical nodes?

3
Q

What is private resource allocation? What is its other name?

A

each app receives a private, static set of resources

static partitioning

4
Q

Advantages of static partitioning?

A
  1. simplicity
  2. performance isolation
  3. allows specialised HW (e.g. not everyone needs a GPU)
5
Q

Disadvantages of static partitioning?

A
  1. low utilisation
  2. hard to solve failures
  3. hard to maintain

Regarding 2 and 3: it is not clear how to migrate an app off a machine.

6
Q

What 3 properties do we want the scheduler to fulfil in case of shared resource assignment?

A
  1. Fairness
  2. Efficient resource usage
  3. Isolation
7
Q

List the algorithm from the lecture for shared resource assignment.

A
  1. Fair queueing (for a single resource)
  2. Weighted max-min fair queueing (extends 1)
  3. Dominant resource fairness
  4. Token bucket
8
Q

What does work conserving mean? Which property implies that?

A

Resources should not remain idle while there are users whose demand is not fully satisfied.

This is implied by “Efficient resource usage”.

9
Q

Why do we want work conserving schedulers?

A

It keeps resources well-utilised.

It maximises overall throughput across different users.

10
Q

Name a strategy that is not work conserving.

A

time division multiplexing
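
Sketch (not from the lecture): a toy comparison of static time division multiplexing with a work-conserving variant. The function names and the slot model are assumptions made for illustration. Under TDM a slot whose owner has nothing to run stays idle; the work-conserving variant hands it to the next user with pending work.

```python
def tdm_schedule(demands, slots):
    """Static TDM: slot i belongs to user i % n, whether or not it is used."""
    users = list(demands)
    remaining = dict(demands)
    served = {u: 0 for u in users}
    idle = 0
    for slot in range(slots):
        owner = users[slot % len(users)]
        if remaining[owner] > 0:
            remaining[owner] -= 1
            served[owner] += 1
        else:
            idle += 1  # the slot is wasted even if other users are waiting
    return served, idle


def work_conserving_schedule(demands, slots):
    """Same rotation, but an unused slot goes to the next user with pending work."""
    users = list(demands)
    remaining = dict(demands)
    served = {u: 0 for u in users}
    idle = 0
    for slot in range(slots):
        for offset in range(len(users)):
            candidate = users[(slot + offset) % len(users)]
            if remaining[candidate] > 0:
                remaining[candidate] -= 1
                served[candidate] += 1
                break
        else:
            idle += 1  # only idle when nobody has work left
    return served, idle


# User "a" wants 6 slots, user "b" only 1, and 6 slots are available.
print(tdm_schedule({"a": 6, "b": 1}, 6))              # a is served 3 times, 2 slots idle
print(work_conserving_schedule({"a": 6, "b": 1}, 6))  # a is served 5 times, 0 slots idle
```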

11
Q

What are the different notions of fairness?

A
  1. Max-min fairness
  2. Dominant resource fairness
12
Q

What are the properties of max-min fairness?

A

share guarantee: each user gets at least 1/n of the resource, unless their demand is less

strategy-proof: users are not better off by asking for more than they need
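
Sketch (not from the lecture): one common way to compute a max-min fair split of a single resource is "water-filling". The function name `max_min_fair` and the dictionary interface are assumptions made for illustration.

```python
def max_min_fair(capacity, demands):
    """Split `capacity` of one resource max-min fairly among users.

    `demands` maps each user to the amount they ask for.
    """
    allocation = {user: 0.0 for user in demands}
    unsatisfied = set(demands)
    remaining = float(capacity)
    while unsatisfied and remaining > 1e-12:
        share = remaining / len(unsatisfied)  # equal share of what is left
        # Users who ask for no more than the equal share get their full demand.
        satisfied = {u for u in unsatisfied if demands[u] <= share}
        if not satisfied:
            # Everyone left wants more than the equal share: split evenly, done.
            for u in unsatisfied:
                allocation[u] = share
            break
        for u in satisfied:
            allocation[u] = float(demands[u])
            remaining -= demands[u]
        unsatisfied -= satisfied
    return allocation


# 10 units shared by four users; the two small demands are met in full and
# the leftover capacity is split evenly between the two large ones.
print(max_min_fair(10, {"a": 2, "b": 2.6, "c": 4, "d": 5}))
```

Asking for more than you need does not help here: a user who is capped at the equal share gets the same amount however large the stated demand is.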

13
Q

What does DRF try to achieve?

A

identify the dominant resource share of each user and maximise the minimum dominant share across all users
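
Sketch (not from the lecture): a greedy, task-granular version of DRF that repeatedly gives one task to the user with the smallest dominant share. The function name, the dictionary layout, and the choice to skip a user once their next task no longer fits are assumptions made for illustration.

```python
def drf_allocate(capacity, task_demands):
    """Greedy DRF over whole tasks.

    capacity:     {"cpu": 9, "mem": 18}
    task_demands: {"A": {"cpu": 1, "mem": 4}, ...}  (per-task demand of each user)
    Returns (tasks per user, dominant share per user).
    """
    used = {r: 0.0 for r in capacity}
    tasks = {u: 0 for u in task_demands}
    dominant = {u: 0.0 for u in task_demands}
    active = set(task_demands)
    while active:
        # Pick the user with the lowest dominant share (ties broken by name).
        user = min(sorted(active), key=lambda u: dominant[u])
        demand = task_demands[user]
        if any(used[r] + demand[r] > capacity[r] for r in capacity):
            active.remove(user)  # this user's next task no longer fits
            continue
        for r in capacity:
            used[r] += demand[r]
        tasks[user] += 1
        # Dominant share = largest fraction of any one resource held by the user.
        dominant[user] = max(tasks[user] * demand[r] / capacity[r] for r in capacity)
    return tasks, dominant


tasks, dominant = drf_allocate(
    {"cpu": 9, "mem": 18},
    {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}},
)
print(tasks)  # A gets 3 tasks, B gets 2; both end with a dominant share of 2/3
```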

14
Q

What is the drawback of DRF?

A

not work conserving

15
Q

What is the issue with max-min fairness?

A

With max-min fairness, a user’s allocation depends on the demands of the other users sharing the resource, so there is no performance predictability.

16
Q

What is the goal of token buckets?

A

guarantee a baseline bandwidth, but also allow bounded bursts

17
Q

How does the token bucket idea work?

A

Control traffic by delaying requests until they accumulate sufficient tokens.
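
Sketch (not from the lecture): a minimal token bucket. The class name, the explicit `now` timestamps (used instead of a real clock to keep the example deterministic), and the trick of letting the token count go negative to represent queued requests are assumptions made for illustration.

```python
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = float(rate)     # tokens added per second (baseline bandwidth)
        self.burst = float(burst)   # bucket capacity (bounds the burst size)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the last refill

    def _refill(self, now):
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now

    def delay_for(self, cost, now):
        """Return how long a request needing `cost` tokens must wait at time `now`."""
        if cost > self.burst:
            raise ValueError("request is larger than the maximum burst")
        self._refill(now)
        self.tokens -= cost  # a negative balance means the request is queued
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate


bucket = TokenBucket(rate=10, burst=5)
print(bucket.delay_for(5, now=0.0))  # 0.0: the burst is served immediately
print(bucket.delay_for(5, now=0.0))  # 0.5: must wait until 5 tokens accumulate
print(bucket.delay_for(1, now=1.0))  # 0.0: the bucket has refilled by then
```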

18
Q

What does resource assignment try to optimise?

A
  1. performance
  2. resource utilisation
19
Q

Explain the first step of resource assignment.

A

Filter machines that satisfy hard constraints

e.g., VM may need a machine with a GPU

20
Q

Explain the second step of resource assignment.

A

Rank candidate nodes to find the machine that best satisfies soft constraints

e.g., best-fit to avoid resource fragmentation
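
Sketch (not from the lecture): the two assignment steps, filter then rank, as a small function. The machine and task fields and the best-fit score (normalised leftover CPU plus memory) are assumptions made for illustration.

```python
def assign(task, machines):
    """Return the name of the chosen machine, or None if no machine qualifies."""
    # Step 1: filter on hard constraints (enough free resources, GPU if needed).
    candidates = [
        m for m in machines
        if m["free_cpu"] >= task["cpu"]
        and m["free_mem"] >= task["mem"]
        and (m["gpu"] or not task["needs_gpu"])
    ]
    if not candidates:
        return None

    # Step 2: rank on a soft constraint. Best fit: prefer the machine that is
    # left with the smallest fraction of free resources after placement.
    def leftover(m):
        return ((m["free_cpu"] - task["cpu"]) / m["total_cpu"]
                + (m["free_mem"] - task["mem"]) / m["total_mem"])

    return min(candidates, key=leftover)["name"]


machines = [
    {"name": "n1", "total_cpu": 16, "free_cpu": 8, "total_mem": 64, "free_mem": 32, "gpu": False},
    {"name": "n2", "total_cpu": 8, "free_cpu": 4, "total_mem": 32, "free_mem": 8, "gpu": False},
    {"name": "n3", "total_cpu": 16, "free_cpu": 16, "total_mem": 64, "free_mem": 64, "gpu": True},
]
print(assign({"cpu": 4, "mem": 8, "needs_gpu": False}, machines))  # n2: exact fit
print(assign({"cpu": 4, "mem": 8, "needs_gpu": True}, machines))   # n3: only GPU machine
```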

21
Q

List different methods for cluster management system architecture.

A
  1. centralised
  2. distributed
  3. hierarchical e.g. two-level
22
Q

Next questions are about Borg. First, what is Borg?

A

Google’s centralised cluster manager

23
Q

What does Borgmaster do?

A

It is the main scheduler.
It polls Borglets every few seconds

extra: 5 replicas

24
Q

What does Borglet do?

A

Local agent that manages and monitors the tasks and resources on the machine it runs on.

extra: a cell has about 10k heterogeneous machines, each running a Borglet

25
What strategies does Borg deploy to achieve high utilisation?
  1. admission control
  2. efficient task-packing
  3. over-commitment
  4. machine sharing
26
What is Kubernetes?
Cluster management for containerised applications.
  - manages the complexity of the container lifecycle and of allocating/setting up hardware resources for the containers
  - like an OS for your cloud cluster
27
List container orchestration primitives!
  1. Resource scaling
  2. Resource allocation
  3. Load balancing
  4. Lifecycle and health
  5. Naming and discovery
  6. Storage volumes
  7. Logging and monitoring
  8. Debugging and introspection
  9. Identity and authorization
28
Resource scaling
make sets of containers bigger or smaller
29
Resource allocation
decide where my containers should run
30
Load balancing
distribute traffic across a set of containers
31
Lifecycle and health
keep my containers running despite failures
32
Naming and discovery
find where my containers are now
33
Storage volumes
provide data to containers
34
Logging and monitoring
track what’s happening with my containers
35
Debugging and introspection
enter or attach to containers
36
Identity and authorization
control who can do things to my containers
37
What do the Kubernetes containers do?
Handle package dependencies
38
What is a pod?
A pod is the unit of scheduling and migration in Kubernetes: a group of containers that share the same properties.
39
List those properties!
  1. Lifecycle: live together, die together
  2. Network: same IP address, same routes, iptables
  3. Storage volumes: can share data
  4. Intended to run a common task
40
Kubernetes service?
A group of pods that work together

extra: provides load balancing among pod replicas
41
How do you control pod placement in Kubernetes?
use labels and selectors
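
Sketch (not Kubernetes code): equality-based selector matching, the idea behind steering a pod to nodes that carry particular labels. The function names and the node records are assumptions made for illustration.

```python
def matches(selector, labels):
    """True if every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def eligible_nodes(selector, nodes):
    """Names of the nodes whose labels satisfy the selector."""
    return [node["name"] for node in nodes if matches(selector, node["labels"])]


nodes = [
    {"name": "node-a", "labels": {"disk": "ssd", "zone": "eu-1"}},
    {"name": "node-b", "labels": {"disk": "hdd", "zone": "eu-1"}},
    {"name": "node-c", "labels": {"disk": "ssd", "zone": "eu-2"}},
]
print(eligible_nodes({"disk": "ssd"}, nodes))                  # node-a and node-c
print(eligible_nodes({"disk": "ssd", "zone": "eu-1"}, nodes))  # node-a only
```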
42
How do you keep N pods running?
Use ReplicaSets: a layer on top of the Pod API that ensures N copies of a pod are running.
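
Sketch (not Kubernetes code): the reconcile idea behind keeping N copies running, i.e. compare the desired count with what is actually running and create or delete the difference. The function name and the choice of which surplus pods to delete are assumptions made for illustration.

```python
def reconcile(desired, running):
    """Return (number of pods to create, list of pods to delete)."""
    surplus = len(running) - desired
    if surplus < 0:
        return -surplus, []       # too few copies: create the missing ones
    return 0, running[:surplus]   # too many (or exactly enough): delete the surplus


print(reconcile(3, ["pod-1"]))                             # create 2, delete nothing
print(reconcile(2, ["pod-1", "pod-2", "pod-3", "pod-4"]))  # create 0, delete two pods
```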
43
What does the Horizontal Pod Autoscaler do?
Automatically scales pods as needed.
  - based on CPU utilisation (or custom metrics)
  - can set user-defined min/max bounds
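
Sketch: the core scaling rule documented for the Horizontal Pod Autoscaler is desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped to the user-defined bounds. The function and parameter names below are assumptions made for illustration, and details of the real controller such as its tolerance band and stabilisation window are left out.

```python
import math


def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas, max_replicas):
    """Replica count suggested by the ratio of observed to target utilisation."""
    raw = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))   # 6: scale out
print(desired_replicas(4, 30, 60, min_replicas=2, max_replicas=10))   # 2: scale in
print(desired_replicas(4, 200, 50, min_replicas=2, max_replicas=10))  # 10: capped at max
```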
44
What is a potential problem with relying only on CPU utilization as a scaling metric?
good for compute bound apps but maybe I/O is the bottleneck
45
What other metrics would you consider for auto-scaling besides CPU utilization?
  1. memory capacity
  2. memory BW
  3. network BW
46
What properties does resource isolation try to achieve?
  1. Applications must not be able to affect each other’s performance
  2. Repeated runs of the same application should see similar behaviour
47
What are the resource allocation mechanisms in Kubernetes?
Request: How much of a resource (CPU, RAM) the container is asking to use, with a strong guarantee of availability
Limit: Max amount of a resource the container can access
48
Does the scheduler overcommit to requests?
No.
49
List 3 Kubernetes Quality of Service classes.
  1. Guaranteed: highest protection
  2. Burstable: medium protection
  3. Best effort: lowest protection
50
Relation of request and limit for Guaranteed class?
request > 0 && limit == request
51
Relation of request and limit for Burstable class?
request > 0 && limit > request
52
Relation of request and limit for Best effort class?
request == 0 (no request or limit set)
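
Sketch: the three request/limit rules from the cards above as one function for a single resource. This mirrors the simplified card rules only; real Kubernetes derives the class per pod from the CPU and memory settings of all its containers. The function name and the use of `None` for "no limit set" are assumptions made for illustration.

```python
def qos_class(request, limit=None):
    """Classify one resource setting following the simplified card rules."""
    if request < 0 or (limit is not None and limit < request):
        raise ValueError("need 0 <= request <= limit")
    if request == 0:
        return "BestEffort"   # request == 0
    if limit == request:
        return "Guaranteed"   # request > 0 and limit == request
    return "Burstable"        # request > 0 and limit above it (or unset)


print(qos_class(0))          # BestEffort
print(qos_class(500, 500))   # Guaranteed
print(qos_class(500, 1000))  # Burstable
```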
53
What are the advantages of centralised design?
can make globally optimal decisions
54
What are the drawbacks of centralised design?
scalability: hard to enforce consistency
55
Name 2 two-level cluster managers
Mesos and YARN
56
How does Mesos work?
Lecture on 05.04 Min: 3.5
57
List two distributed cluster management algorithms.
Omega and Sparrow
58
List two new challenges serverless brings to the cluster management besides resource allocation and assignment.
  3. resource scaling: How many containers (“slots”) to keep warm for a function?
  4. request routing: To which node and “slot” do we send a particular invocation?
59
What does Quasar try to solve?
Over-provisioning
60
How does Quasar solve over-provisioning?
Don't ask users for allocation request/resource demand. They don't really know it anyway.
61
What do the users specify in this case? (Quasar)
performance goals
62
What does the cluster manager do in this case? (Quasar)
profiles applications and dynamically adjusts resource allocations
63
How does the cluster manager understand resource/performance tradeoffs? (Quasar)
It combines the following:
  1. Small signal from a short run of a new application
  2. Large signal from previously run applications
64
What does the cluster manager do at the end? (still Quasar)
For each new application, it needs to recommend a resource allocation and assignment.
65
How does one build a recommender system?
collaborative filtering
66
What is collaborative filtering?
Predict the preferences of new users from the preferences of other users, using SVD and PQ reconstruction.
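
Sketch (not Quasar's implementation): filling in missing entries of a sparse matrix by learning two low-rank factors P and Q with stochastic gradient descent, then predicting an entry as the dot product of a row of P and a row of Q. This swaps an exact SVD for a simple SGD factorisation; the function names, hyperparameters, and the toy data (rows as applications, columns as resource configurations) are assumptions made for illustration.

```python
import math
import random


def factorize(known, n_rows, n_cols, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Learn P (n_rows x k) and Q (n_cols x k) from the known entries only."""
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(k)] for _ in range(n_rows)]
    Q = [[rng.random() for _ in range(k)] for _ in range(n_cols)]
    for _ in range(steps):
        for (i, j), value in known.items():
            err = value - predict(P, Q, i, j)
            for f in range(k):
                p, q = P[i][f], Q[j][f]
                # Gradient step on the squared error with L2 regularisation.
                P[i][f] += lr * (err * q - reg * p)
                Q[j][f] += lr * (err * p - reg * q)
    return P, Q


def predict(P, Q, i, j):
    """Reconstructed entry (i, j): dot product of row i of P and row j of Q."""
    return sum(p * q for p, q in zip(P[i], Q[j]))


def rmse(known, P, Q):
    """Root-mean-square error of the reconstruction on the known entries."""
    errors = [(v - predict(P, Q, i, j)) ** 2 for (i, j), v in known.items()]
    return math.sqrt(sum(errors) / len(errors))


# Hypothetical measurements: (application, configuration) -> performance score.
known = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1, (2, 1): 1, (2, 2): 5}
P, Q = factorize(known, n_rows=3, n_cols=3)
print(rmse(known, P, Q))    # error on the measured entries after training
print(predict(P, Q, 0, 2))  # estimate for an entry that was never measured
```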
67
What needs to be considered to recommend resource allocations to applications? (4)
  1. scale-out
  2. scale-up
  3. HW heterogeneity
  4. Interference
68
What does scale out mean?
Use 4 nodes or a single node?
69
What does scale up mean?
Use an 8-core VM or a single-core VM?
70
List 3 steps of Quasar's functionality.
Step 1: Short profiling runs produce initial performance data.
Step 2: Collaborative filtering techniques fill in the missing data.
Step 3: A greedy scheduler uses the output to find the number and type of resources that maximise utilisation and performance.
71
To summarise, what are the challenges of using shared clusters?
  1. Resource allocation: how many resources should an app get?
  2. Resource assignment: which specific resources does an app get?
  3. Variability: within an app (different phases), within datasets, and load