20. Massive Distribution for Performance Flashcards
The Map-Reduce Approach: one way distributed systems (DS) can tackle big problems.
What are the aims of a topologically-regular, closed interconnect?
What does this make more realistic?
Minimise distance
Maximise Bandwidth
Maximise homogeneity, so we can distribute tasks more easily
Maximise security
This makes the following more realistic:
Minimal Latency
Maximal throughput
low operational/maintenance cost
What are the properties of a topologically-regular closed interconnect?
The interconnect is more reliable
The force of the axioms is greatly reduced.
The transparency goals are easier to achieve.
What style of problems can computational clusters/data centres with topologically-regular closed interconnects be used to solve?
Can be used for both process-intensive and data-intensive problems
How do we allow two nodes to communicate?
By wrapping layers of protocols around them.
Then DME, MME, …
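A minimal Python sketch of the lowest layer of that stack: two nodes exchanging one raw message over a TCP socket. The host, port, and function names are placeholders, not part of the course material.

import socket

# Receiver ("node B"): listen on a port and accept one raw message.
def receive_one(host="127.0.0.1", port=9000):
    with socket.create_server((host, port)) as srv:
        conn, _ = srv.accept()
        with conn:
            return conn.recv(1024).decode()

# Sender ("node A"): open a connection and push the bytes across.
def send_one(msg, host="127.0.0.1", port=9000):
    with socket.create_connection((host, port)) as conn:
        conn.sendall(msg.encode())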
How do we generalize message exchange?
RPC (remote procedure call).
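A hedged sketch of the idea using Python's standard xmlrpc module; the word_count procedure, host, and port are made-up placeholders.

from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: register an ordinary function as a remotely callable procedure.
def word_count(line):
    return len(line.split())

server = SimpleXMLRPCServer(("127.0.0.1", 8000), allow_none=True)
server.register_function(word_count)
# server.serve_forever()  # would run on the remote node

# Client side: the call looks local, but the arguments and result
# travel over the network as messages behind the scenes.
proxy = ServerProxy("http://127.0.0.1:8000")
# print(proxy.word_count("remote procedure calls generalise message exchange"))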
Why do we use data centres for DS using big data?
We cannot use the internet; we need a more controlled environment. Data centres have their own in-house architecture, which reduces the force of the axioms.
Why do data centres use shared-nothing?
It is massively distributed and complicated, so we want to keep some simplicity: keep tasks independent so that processes/partitions share nothing.
What is important during M-W split?
A clean split, so that each partition serves as an individual unit of parallel processing. We must decide how many partitions to generate.
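A minimal Python sketch of the split step, assuming the input is just a list of lines; the partition count of 4 stands in for the "how many partitions?" decision.

# Cut the input into roughly equal, non-overlapping partitions.
def split(data, n_partitions=4):
    size = (len(data) + n_partitions - 1) // n_partitions  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

lines = ["to be or not to be", "that is the question"] * 10
partitions = split(lines, 4)
assert sum(len(p) for p in partitions) == len(lines)  # a clean, disjoint split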
What is important during M-W spawn?
Spawn parallel processes to work on each partition. We need to decide how many parallel processes to generate.
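A sketch of the spawn step with Python's multiprocessing.Pool; the pool size of 2 stands in for the "how many parallel processes?" decision, and the worker function is a placeholder.

from multiprocessing import Pool

def work(partition):
    # Placeholder worker: count the words in one partition.
    return sum(len(line.split()) for line in partition)

if __name__ == "__main__":
    partitions = [["to be or not to be"], ["that is the question"]]
    with Pool(processes=2) as pool:          # how many parallel processes?
        partial_results = pool.map(work, partitions)
    print(partial_results)                   # one result per partition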
What is important during M-W process and merge?
The final result must be the same as if the work had been done without parallelisation. How do we pick up the pieces and stitch them together in the correct order?
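A sketch of the merge step, assuming each worker returned a Counter for its partition; the assert states the card's requirement that the merged result equals the sequential one.

from collections import Counter

partial_counts = [Counter({"to": 2, "be": 2}),
                  Counter({"be": 1, "question": 1})]

# Merge the per-partition results into one final result.
merged = Counter()
for c in partial_counts:
    merged += c

# Must equal what one sequential pass over the whole input would produce.
sequential = Counter({"to": 2, "be": 3, "question": 1})
assert merged == sequential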
What is a race condition?
When multiple processes read from/write to the same memory location, so the result depends on the execution order.
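A small Python illustration: several threads do an unsynchronised read-modify-write on a shared counter, so updates can be lost depending on how the threads interleave.

import threading

counter = 0  # shared state

def bump(n):
    global counter
    for _ in range(n):
        counter += 1  # read, add, write back: not atomic

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Because the increments can interleave, the total may fall short of 400000.
print(counter)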
Describe task independence.
Two tasks are independent when they can be executed in parallel: they have no shared state and can be run in any order.
The problem then becomes the merge.
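A contrasting sketch: two tasks with no shared state produce the same final answer whichever order they run in, leaving only the merge to worry about.

from collections import Counter

# Each task works on its own data and returns a fresh result object.
def count_words(lines):
    return Counter(word for line in lines for word in line.split())

task_a = ["to be or not to be"]
task_b = ["that is the question"]

# No shared state, so either execution order gives the same merged result.
forward  = sum([count_words(task_a), count_words(task_b)], Counter())
backward = sum([count_words(task_b), count_words(task_a)], Counter())
assert forward == backward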
Describe the map-reduce model
Make the individual tasks as simple as possible by treating them as functions.
Make them side-effect free, with no shared state (or copy shared data structures).
Simply put, map takes a unary function and a collection and returns a collection: it applies the function to each element of the collection independently, so map is parallelisable.
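Python's built-in map shows the shape of the idea: a unary function applied to each element of a collection independently (the lengths below are just for this made-up input).

# map: unary function + collection -> new collection,
# applying the function to each element independently.
lengths = list(map(len, ["to be or not to be", "that is the question"]))
assert lengths == [18, 20]
# Each application is side-effect free and independent of the others,
# which is what makes it safe to run the applications in parallel.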
What is a barrier?
It is used to synchronise the map phase with the reduce phase.
It knows how many processes it should wait for and holds everything back until they have all arrived.
It presupposes balanced workloads.
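A sketch using Python's threading.Barrier, assuming four map workers; every worker blocks at the barrier, and the reduce phase only starts once all four have arrived.

import threading

barrier = threading.Barrier(4)  # knows how many workers to wait for

def map_worker(i):
    print(f"map task {i} done")
    if barrier.wait() == 0:     # all block here; exactly one thread gets 0
        print("all map tasks finished, reduce phase may start")

threads = [threading.Thread(target=map_worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()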
What do the map and reduce functions take and produce? E.g. word counting
Both take and produce key-value pairs.
Map: (in key, in value), e.g. a list of strings -> (out key, intermediate value) list, e.g. each word and its number of instances in the line
Reduce: (out key, intermediate value list), i.e. the output from map -> out value list, e.g. every word and its number of instances in the whole input
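A self-contained word-count sketch in plain Python, assuming the in key is a line number and the in value is the line's text; the shuffle/group step between map and reduce is written out explicitly.

from collections import defaultdict

# Map: (in key, in value) -> list of (out key, intermediate value) pairs.
def map_fn(line_no, line):
    return [(word, 1) for word in line.split()]

# Reduce: (out key, intermediate value list) -> out value list.
def reduce_fn(word, counts):
    return [sum(counts)]

lines = {0: "to be or not to be", 1: "that is the question"}

groups = defaultdict(list)               # group intermediate pairs by out key
for line_no, line in lines.items():
    for word, count in map_fn(line_no, line):
        groups[word].append(count)

result = {word: reduce_fn(word, counts)[0] for word, counts in groups.items()}
print(result)  # e.g. {'to': 2, 'be': 3, 'or': 1, 'not': 1, ...}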