Lecture 5: 29th October 2019 Flashcards

Datacentres

1
Q

What is a workflow?

A

An orchestrated and repeatable pattern of activity, enabled by the systematic organization of resources into processes that transform materials, provide services, or process information. It can be depicted as a sequence of operations, the work of a person or group, the work of an organization of staff, or one or more simple or complex mechanisms.

From a more abstract or higher-level perspective, workflow may be considered a view or representation of real work. The flow being described may refer to a document, service, or product that is being transferred from one step to another.

Workflows may be viewed as one fundamental building block to be combined with other parts of an organization’s structure such as information technology, teams, projects, and hierarchies.

The coordinated execution of multiple tasks or activities.

2
Q

What is a workload?

A

The amount of work that a computer or computer system has been given to do at a given time.

3
Q

How can workloads be measured?

A
  • log everything all the time? : generally expensive and infeasible
  • log with sampling? : samples may miss events of interest - outliers are important in networking
  • replay with logging turned on? : completely overlooks “Heisenbugs”
4
Q

Where could we measure workloads?

A

Nowhere really, at the scale of large companies’ datacentres. At 10 Gb/s with 84-byte packets, you have ~70 ns to process each packet. CPU operations take on the order of 10s of ns, and having to include packet I/O, context switching, packet classification, etc. makes it infeasible.
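
The per-packet budget above is the time one packet occupies the wire. A minimal sketch of that arithmetic, assuming the link rate is 10 gigabits per second and that the 84 bytes are a 64-byte minimum Ethernet frame plus 20 bytes of preamble, start-of-frame delimiter and inter-frame gap (the function names are illustrative):

```python
def per_packet_budget_ns(link_rate_bps: float, wire_bytes: int) -> float:
    """Time between back-to-back packets on a saturated link, in nanoseconds."""
    bits_on_wire = wire_bytes * 8
    return bits_on_wire / link_rate_bps * 1e9


def packets_per_second(link_rate_bps: float, wire_bytes: int) -> float:
    """How many such packets a saturated link delivers each second."""
    return link_rate_bps / (wire_bytes * 8)


# Assumed figures: 10 Gb/s link, 84 bytes on the wire per packet.
budget = per_packet_budget_ns(10e9, 84)
rate = packets_per_second(10e9, 84)
print(f"{budget:.1f} ns per packet")          # 67.2 ns per packet
print(f"{rate / 1e6:.2f} million packets/s")  # 14.88 million packets/s
```

At roughly 67 ns per packet, a handful of CPU operations of 10s of ns each already uses up the whole budget.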

5
Q

What is the scale of traffic inside datacentres?

A

Big. Google had a 50x growth in traffic between 2008 and 2014. In 2015, Facebook web servers had 100s to 1000s of simultaneous connections, but their traffic within datacentres is several times larger than that which goes out into the Internet.

6
Q

Is there any relationship between the rate of packet drops and utilisation in datacentres?

A

No (because of incast)

7
Q

What is a top-of-rack?

A

A switch on top of a rack of servers which connects them to each other and to other servers.

8
Q

What is a server rack?

A

A framework cage that contains a number of specialised servers which slide into bays like shelves. The servers are commodity hardware.

9
Q

What are containers?

A

The normal atomic unit at which servers are bought for datacentres. They are groups of server racks, usually to the size of a shipping container - and sometimes in a shipping container.

10
Q

When do containers get replaced?

A

When ~ 10 of their machines fail

11
Q

What are containers also known as?

A

blocks

12
Q

What is locality?

A

The degree to which network traffic does not travel far topologically, i.e. staying within the same server rack vs container or datacentre vs the wider Internet.

13
Q

What locality properties do servers in datacentres hold?

A

Most traffic from a block stays within its cluster, and a fifth of that stays within the same rack. Outside the cluster, 12% stays within the block’s own DC and 18% leaves it.

14
Q

What do the properties of datacentre traffic depend on?

A

the function of application; scale; network topology; protocols used

15
Q

What are the implications of datacentre networking?

A

large internal traffic; tight deadlines for (network) I/O; congestion and TCP incast must be prevented; networks are complex and shared by different applications; centralised control at the per-flow level is hard

16
Q

What are the pros and cons of using few larger data centres versus using more smaller data centres?

A

Fewer, larger DCs mean less management complexity and lower cost overall, but more per site; latency is higher with fewer large centres; application complexity is greater with few large DCs; a hierarchical cache structure may be needed for progressively authoritative DCs; there is more multiplexing with fewer large DCs, as everything connects to one place (?)

17
Q

What is the biggest design choice within datacentres? Why is it important?

A

How to connect racks together; need to allow rack and machine-wise addressing and routing and maximise performance.

18
Q

What is bisection bandwidth?

A

The minimum capacity that must be cut, in links, to bisect a network (partition it into halves). Make the minimum number of cuts needed to separate the two partitions, then sum the bandwidth of the links cut. It represents the bandwidth available between the two partitions and, thereby, the true bandwidth available in the entire system.
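
For a very small network the definition can be checked by brute force. This is a sketch only: it tries every equal split and keeps the smallest total capacity crossing the cut, which is exponential in the number of nodes. The function name and the 4-node ring are made up for illustration.

```python
from itertools import combinations


def bisection_bandwidth(nodes, links):
    """Minimum total capacity crossing any split of the nodes into two halves.

    nodes: iterable of node names (an even number of nodes is assumed)
    links: dict mapping frozenset({a, b}) -> link capacity
    """
    nodes = list(nodes)
    half = len(nodes) // 2
    best = float("inf")
    for left in combinations(nodes, half):
        left = set(left)
        # A link crosses the cut when exactly one endpoint is on the left.
        crossing = sum(cap for pair, cap in links.items()
                       if len(pair & left) == 1)
        best = min(best, crossing)
    return best


# Hypothetical 4-node ring a-b-c-d-a with 10 Gb/s links.
ring = {
    frozenset({"a", "b"}): 10,
    frozenset({"b", "c"}): 10,
    frozenset({"c", "d"}): 10,
    frozenset({"d", "a"}): 10,
}
print(bisection_bandwidth("abcd", ring))  # 20
```

Any way of halving the ring cuts at least two links, so its bisection bandwidth is 20 Gb/s even though the links total 40 Gb/s.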

19
Q

What is the big switch approach?

A

The idea of connecting racks (their ToR switches) with “bigger” switches.

20
Q

What are some problems with the big switch approach?

A

It presents single points of failure (mitigated by duplication, which in turn complicates cable management) as well as scaling issues.

21
Q

What is a tree network DS?

A

A hierarchy of increasingly high-capacity switches that connects the different racks in a DS.

22
Q

What are the cons of a tree network?

A

It is very expensive and doesn’t scale. Higher capacity switches present even greater points of vulnerability to attack and congestion. There is also a limit on the maximum capacity of a network switch: can only go so far up/large.

23
Q

How can you surpass cost and design difficulties?

A

Using commodity hardware

24
Q

What are the pros of using commodity hardware?

A

the same hardware for machines and switches (and the same protocols and topologies), making maintenance easier; ensures vertical and horizontal scalability; greater capacity; lower latency

25
Q

What are fat-tree networks?

A

A network topology in which one type of switch is used to implement a k-ary tree with a chosen k.

26
Q

How are fat-trees arranged?

A

There will be (k/2)² switches at the core (top of tree), which equals k only when k = 4, then k pods of (k/2) aggregation switches and (k/2) edge switches. Each edge switch connects (k/2) nodes to the aggregation switches in its pod. Each aggregation switch connects to all the edge switches in its pod and to (k/2) core switches.
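
The layout can be sized by counting ports. A minimal sketch, assuming the standard three-tier fat-tree built from identical k-port switches, where port counting gives (k/2)² core switches (equal to k when k = 4); the function name is illustrative:

```python
def fat_tree_sizes(k: int) -> dict:
    """Component counts for a three-tier fat-tree of k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even number >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,          # k/2 per pod
        "aggregation_switches": k * half,   # k/2 per pod
        "core_switches": half * half,       # each has one link to every pod
        "hosts": k * half * half,           # k/2 hosts per edge switch
    }


print(fat_tree_sizes(4))
print(fat_tree_sizes(48)["hosts"])  # 27648
```

With k = 4 this gives 4 core switches, 8 aggregation switches, 8 edge switches and 16 hosts.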

27
Q

What are k-ary trees?

A

A k-ary tree is one in which no node has more than k child nodes.

28
Q

What are the pros and cons of fat-tree networks?

A

low cost; low complexity; high throughput; high redundancy; any pair of hosts gets the full bisection bandwidth; if the network is split, capacity is maintained within each partition

29
Q

How are intra and inter datacentre routing different?

A

Inter-DC routing may differ from intra-DC routing in where connections are made in the network topology, as well as in the protocols used (IP and TCP vs OSPF and QUIC).

30
Q

Why may we want to make paths between nodes shorter?

A

lower latency and also higher capacity per flow: a packet that travels on a short path consumes a small amount of network capacity

31
Q

How are throughput per flow, number of flows, total capacity, and mean path length in a datacentre naively related?

A

throughput per flow ≤ total capacity / [number of flows * mean path length]
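
A quick numeric check of the bound, using made-up figures (1000 Gb/s of total link capacity, 100 flows) to show how the mean path length enters; the function name is illustrative:

```python
def throughput_per_flow_bound(total_capacity: float,
                              num_flows: int,
                              mean_path_length: float) -> float:
    """Upper bound on throughput per flow.

    Each flow consumes capacity on every link along its path, so the total
    capacity is shared among num_flows * mean_path_length link traversals.
    """
    return total_capacity / (num_flows * mean_path_length)


# Hypothetical network: 1000 Gb/s summed over all links, 100 flows.
print(throughput_per_flow_bound(1000, 100, 4))  # 2.5 (Gb/s) with 4-hop paths
print(throughput_per_flow_bound(1000, 100, 2))  # 5.0 (Gb/s) with 2-hop paths
```

Halving the mean path length doubles the bound without adding any link capacity.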

32
Q

How can you increase the throughput per flow? What will an increased throughput per flow mean for a datacentre network?

A

Lower the mean path length through the design of the network topology. This will increase the total network capacity of the datacentre network.

33
Q

Why is TCP unsuitable for use in datacentres?

A

It uses indirect, loss-based congestion measures, with control implemented at the sender only; slow start and halving on timeout, versus linear increase, are not optimal for bandwidth probing; incast leads to timeouts and congestion.

34
Q

What are some requirements for transport protocols in datacentres?

A

high burst tolerance; high throughput, low latency

35
Q

What is incast in TCP?

A

When the number of servers making requests in a datacentre rises to a certain level, throughput collapses.

36
Q

Why is TCP incast particularly a problem in datacentres?

A

Datacentres meet its preconditions: a large number of small requests; a high-bandwidth, low-latency network; small switch buffers.

37
Q

What is the process by which incast occurs in TCP?

A

When a high number of (MapReduce) queries are made to servers, some time out, causing retransmissions that congest the network, so fewer of the next queries get through. This repeats until congestion is very high, leading to very low throughput as the number of requests rises.

38
Q

How would TCP incast be changed if you removed the minimum bound on RTO times from TCP?

A

Throughput improves as the number of servers increases, with no incast-like collapse, but it still starts to drop from ~40 servers. This is due to TCP’s clock: the unit of time in TCP, which effectively replaces the min RTO. Its default is 10 ms, much more than RTTs within datacentres.

39
Q

How would TCP incast be changed if you removed the minimum bound on RTO times from TCP and changed its clock time to microsecond magnitude?

A

Throughput improves throughout as the number of servers increases. There is no incast-like collapse, nor the small drop-off seen when only the min RTO is removed.

40
Q

What are mice and elephant flows? What are they sensitive to?

A

In computer networking, an elephant flow is an extremely large (in total bytes) continuous flow set up by a TCP (or other protocol) flow measured over a network link. Elephant flows, though not numerous, can occupy a disproportionate share of the total bandwidth over a period of time.

In computer networking, a mouse flow is a short (in total bytes) flow set up by a TCP (or other protocol) flow measured over a network link.

Elephant flows are sensitive to throughput. Mice flows are sensitive to delay.

41
Q

How are the total sizes of flows and the makeup of total data from flows below and above 1MB distributed?

A

65% of flows are < 1MB but 95% of transmitted bytes are from flows > 1MB: most flows are mice but the vast majority of data is sent in elephants.
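
A small sketch of how such a split could be computed from a list of flow sizes. The sizes below are invented to illustrate the shape of the distribution, not measured data, and the 1MB threshold is assumed to mean 1,000,000 bytes:

```python
MB = 1_000_000  # assumption: decimal megabytes


def flow_mix(flow_sizes, threshold=MB):
    """Return (share of flows that are mice, share of bytes sent by elephants).

    Flows strictly below the threshold count as mice; the rest as elephants.
    """
    mice_bytes = sum(s for s in flow_sizes if s < threshold)
    mice_count = sum(1 for s in flow_sizes if s < threshold)
    total_bytes = sum(flow_sizes)
    mice_share_of_flows = mice_count / len(flow_sizes)
    elephant_share_of_bytes = (total_bytes - mice_bytes) / total_bytes
    return mice_share_of_flows, elephant_share_of_bytes


# Hypothetical trace: 13 flows of 10 kB and 7 flows of 50 MB.
sizes = [10_000] * 13 + [50_000_000] * 7
mice_share, elephant_bytes = flow_mix(sizes)
print(f"{mice_share:.0%} of flows are mice")
print(f"{elephant_bytes:.2%} of bytes are in elephants")
```

Even with mice making up most of the flows, nearly all the bytes come from the few elephants.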

42
Q

What is ECN?

A

Explicit Congestion Notification = an extension to IP and TCP that allows end-to-end notification of network congestion without dropping packets.

43
Q

How does ECN work?

A

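When a switch or router detects congestion, it marks the packet’s ECN field in the IP header instead of dropping it. The receiver echoes the mark to the sender by setting the ECN-Echo bit in the TCP headers of its acks, and the sender halves its bandwidth for each window in which an ECN bit is set to 1.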

44
Q

What are some conditions needed for TCP incast to occur?

A

a large number of small requests; a high-bandwidth, low-latency network; small switch buffers.

45
Q

What is the single root cause of TCP incast?

A

An imbalance between low link latency (µs) and RTO (ms)
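
The size of the imbalance is easy to put in numbers. A rough sketch, assuming a 100 µs datacentre RTT and a 200 ms minimum RTO (both assumed figures; the 10 ms clock granularity is the one quoted in these cards), with an illustrative function name:

```python
def timeout_cost_in_rtts(timeout_us: int, rtt_us: int) -> float:
    """How many round trips a sender sits idle while waiting for one timeout."""
    return timeout_us / rtt_us


RTT_US = 100           # assumption: typical intra-datacentre round trip
MIN_RTO_US = 200_000   # assumption: a common 200 ms minimum RTO
CLOCK_US = 10_000      # 10 ms TCP clock granularity, as in the cards

print(timeout_cost_in_rtts(MIN_RTO_US, RTT_US))  # 2000.0
print(timeout_cost_in_rtts(CLOCK_US, RTT_US))    # 100.0
```

Under these assumptions a single timeout wastes thousands of round trips, and even with the minimum RTO removed the clock granularity still costs around a hundred.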

46
Q

What is the TCP clock?

A

TCP’s representation of time, effectively its minimum unit of time to increment.

47
Q

How are RTTs distributed in a Bing cluster?

A

90% of packets had an RTT < 1 ms; the remaining 10% ranged up to a maximum of 15 ms.

48
Q

What is DCTCP?

A

Datacentre TCP = TCP modified for datacentres

49
Q

How does DCTCP work?

A

DCTCP extends ECN processing to estimate the fraction of bytes that encounter congestion rather than simply detecting that some congestion has occurred. With standard ECN, TCP halves its window if any ECN bits arrive in the acks; DCTCP tries to be proportional, reducing by 5% per ECN bit received. The changes are smaller but the response is faster.
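
A minimal sketch of proportional reduction, assuming the update rule published in the DCTCP specification (RFC 8257) rather than anything stated in these cards: a running estimate alpha of the marked fraction is updated once per window with gain g, and the window is cut by alpha/2. The function name and the example numbers are illustrative.

```python
def dctcp_update(cwnd: float, alpha: float, marked: int, total: int,
                 g: float = 1 / 16):
    """One per-window DCTCP step: update alpha, then shrink cwnd if marked.

    marked / total is the fraction of this window's packets that carried an
    ECN mark; g is the estimation gain (1/16 is the commonly cited default).
    """
    fraction = marked / total
    alpha = (1 - g) * alpha + g * fraction
    if marked:
        cwnd = cwnd * (1 - alpha / 2)
    return cwnd, alpha


# Light congestion: 1 of 10 packets marked, alpha already settled at 0.1.
cwnd, alpha = dctcp_update(cwnd=100.0, alpha=0.1, marked=1, total=10)
print(round(cwnd, 2), round(alpha, 3))   # 95.0 0.1  -> a 5% cut

# Heavy congestion: every packet marked, alpha at 1.
cwnd, alpha = dctcp_update(cwnd=100.0, alpha=1.0, marked=10, total=10)
print(round(cwnd, 2), round(alpha, 3))   # 50.0 1.0  -> same halving as TCP
```

With one mark in ten the window shrinks by about 5%, while persistent marking of every packet degenerates to TCP’s halving.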

50
Q

How do the performance levels of normal TCP and DCTCP compare in general and with respect to queues?

A

In DCTCP, queues are much shorter and more stable in length; TCP has larger queues with more variance; both achieve full line rate.

51
Q

What are some types of workloads?

A

queries; lookups.