20. Massive Distribution for Performance Flashcards
The Map-Reduce Approach: one way distributed systems (DS) can tackle big problems.
What are the aims of a topologically-regular, closed interconnect?
What does this make more realistic?
Minimise distance
Maximise Bandwidth
Maximise homogeneity, so we can distribute tasks more easily
Maximise security
This makes the following more realistic:
Minimal Latency
Maximal throughput
low operational/maintenance cost
What are the properties of a topologically-regular closed interconnect?
The interconnect is more reliable
The force of the axioms is greatly reduced.
The transparency goals are easier to achieve.
What style of problems can computational clusters/data centres with topologically-regular closed interconnects be used to solve?
Can be used for both process-intensive and data-intensive problems
How do we allow two nodes to communicate?
By wrapping layers of protocols around them.
Then DME, MME, …
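A minimal Python sketch of the lowest layer of that stack: two nodes exchanging one raw message over a TCP socket. The host, port, and function names are placeholders, not part of the course material.

import socket

# Receiver ("node B"): listen on a port and accept one raw message.
def receive_one(host="127.0.0.1", port=9000):
    with socket.create_server((host, port)) as srv:
        conn, _ = srv.accept()
        with conn:
            return conn.recv(1024).decode()

# Sender ("node A"): open a connection and push the bytes across.
def send_one(msg, host="127.0.0.1", port=9000):
    with socket.create_connection((host, port)) as conn:
        conn.sendall(msg.encode())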
How do we generalize message exchange?
RPC (remote procedure call).
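A hedged sketch of the idea using Python's standard xmlrpc module; the word_count procedure, host, and port are made-up placeholders.

from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: register an ordinary function as a remotely callable procedure.
def word_count(line):
    return len(line.split())

server = SimpleXMLRPCServer(("127.0.0.1", 8000), allow_none=True)
server.register_function(word_count)
# server.serve_forever()  # would run on the remote node

# Client side: the call looks local, but the arguments and result
# travel over the network as messages behind the scenes.
proxy = ServerProxy("http://127.0.0.1:8000")
# print(proxy.word_count("remote procedure calls generalise message exchange"))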
Why do we use data centres for DS using big data?
We cannot use the internet; we need a more controlled environment. Data centres have their own in-house architecture, which reduces the force of the axioms.
Why do data centres use shared-nothing?
It is massively distributed and complicated, so we want to keep some simplicity: keep tasks independent so that processes/partitions share nothing.
What is important during M-W split?
A clean split, so that each partition serves as an individual unit of parallel processing. We must decide how many partitions to generate.
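A minimal Python sketch of the split step, assuming the input is just a list of lines; the partition count of 4 stands in for the "how many partitions?" decision.

# Cut the input into roughly equal, non-overlapping partitions.
def split(data, n_partitions=4):
    size = (len(data) + n_partitions - 1) // n_partitions  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

lines = ["to be or not to be", "that is the question"] * 10
partitions = split(lines, 4)
assert sum(len(p) for p in partitions) == len(lines)  # a clean, disjoint split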
What is important during M-W spawn?
Spawn parallel processes to work on each partition. We need to decide how many parallel processes to generate.
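A sketch of the spawn step with Python's multiprocessing.Pool; the pool size of 2 stands in for the "how many parallel processes?" decision, and the worker function is a placeholder.

from multiprocessing import Pool

def work(partition):
    # Placeholder worker: count the words in one partition.
    return sum(len(line.split()) for line in partition)

if __name__ == "__main__":
    partitions = [["to be or not to be"], ["that is the question"]]
    with Pool(processes=2) as pool:          # how many parallel processes?
        partial_results = pool.map(work, partitions)
    print(partial_results)                   # one result per partition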
What is important during M-W process and merge?
The final result must be the same as if the work had been done without parallelisation. How do we pick up the pieces and stitch them together in the correct order?
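A sketch of the merge step, assuming each worker returned a Counter for its partition; the assert states the card's requirement that the merged result equals the sequential one.

from collections import Counter

partial_counts = [Counter({"to": 2, "be": 2}),
                  Counter({"be": 1, "question": 1})]

# Merge the per-partition results into one final result.
merged = Counter()
for c in partial_counts:
    merged += c

# Must equal what one sequential pass over the whole input would produce.
sequential = Counter({"to": 2, "be": 3, "question": 1})
assert merged == sequential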
What is a race condition?
When multiple processes read from/write to the same memory location, so the result depends on the execution order.
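A small Python illustration: several threads do an unsynchronised read-modify-write on a shared counter, so updates can be lost depending on how the threads interleave.

import threading

counter = 0  # shared state

def bump(n):
    global counter
    for _ in range(n):
        counter += 1  # read, add, write back: not atomic

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Because the increments can interleave, the total may fall short of 400000.
print(counter)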
Describe task independence.
Two tasks are independent when they can be executed in parallel: they have no shared state and can be run in any order.
The problem then becomes the merge.
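A contrasting sketch: two tasks with no shared state produce the same final answer whichever order they run in, leaving only the merge to worry about.

from collections import Counter

# Each task works on its own data and returns a fresh result object.
def count_words(lines):
    return Counter(word for line in lines for word in line.split())

task_a = ["to be or not to be"]
task_b = ["that is the question"]

# No shared state, so either execution order gives the same merged result.
forward  = sum([count_words(task_a), count_words(task_b)], Counter())
backward = sum([count_words(task_b), count_words(task_a)], Counter())
assert forward == backward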
Describe the map-reduce model
Make the individual tasks as simple as possible by treating them as functions.
Make them side-effect free, with no shared state (or copy shared data structures).
Simply put, map takes a unary function and a collection and returns a collection: it applies the function to each element of the collection independently, so map is parallelisable.
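Python's built-in map shows the shape of the idea: a unary function applied to each element of a collection independently (the lengths below are just for this made-up input).

# map: unary function + collection -> new collection,
# applying the function to each element independently.
lengths = list(map(len, ["to be or not to be", "that is the question"]))
assert lengths == [18, 20]
# Each application is side-effect free and independent of the others,
# which is what makes it safe to run the applications in parallel.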
What is a barrier?
It is used to synchronise the map phase with the reduce phase.
It knows how many processes it should wait for and holds everything back until they have all arrived.
It presupposes balanced workloads.
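A sketch using Python's threading.Barrier, assuming four map workers; every worker blocks at the barrier, and the reduce phase only starts once all four have arrived.

import threading

barrier = threading.Barrier(4)  # knows how many workers to wait for

def map_worker(i):
    print(f"map task {i} done")
    if barrier.wait() == 0:     # all block here; exactly one thread gets 0
        print("all map tasks finished, reduce phase may start")

threads = [threading.Thread(target=map_worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()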
What do the map and reduce functions take and produce? E.g. word counting
Both take and produce key-value pairs.
Map: (in key, in value), e.g. a list of strings -> (out key, intermediate value) list, e.g. each word and its number of instances in the line
Reduce: (out key, intermediate value list), i.e. the output from map -> out value list, e.g. every word and its number of instances in the whole input
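A self-contained word-count sketch in plain Python, assuming the in key is a line number and the in value is the line's text; the shuffle/group step between map and reduce is written out explicitly.

from collections import defaultdict

# Map: (in key, in value) -> list of (out key, intermediate value) pairs.
def map_fn(line_no, line):
    return [(word, 1) for word in line.split()]

# Reduce: (out key, intermediate value list) -> out value list.
def reduce_fn(word, counts):
    return [sum(counts)]

lines = {0: "to be or not to be", 1: "that is the question"}

groups = defaultdict(list)               # group intermediate pairs by out key
for line_no, line in lines.items():
    for word, count in map_fn(line_no, line):
        groups[word].append(count)

result = {word: reduce_fn(word, counts)[0] for word, counts in groups.items()}
print(result)  # e.g. {'to': 2, 'be': 3, 'or': 1, 'not': 1, ...}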