Homework Questions Flashcards

1
Q

Modern CPUs use multiple cores, but the speeds (clock frequencies) of individual cores hardly increase anymore. Someone claims that this shows that Moore’s law no longer holds. Explain whether this claim is true or false.

A

It is false. Moore’s law is not about speed increase but about number of transistors. In the multicore era, the number of transistors kept growing, but they were used to build multiple cores instead of making single cores faster.

2
Q

What are the advantages and disadvantages of the following interconnection topologies: a 2D mesh, a binary tree, and a hypercube?

A

The hypercube has an excellent (logarithmic) diameter and the best bisection width (P/2, i.e. exponential in the dimension d), but the degree of each node grows with the dimension, making large hypercubes difficult to build in practice.

A binary tree also has an excellent (logarithmic) diameter but a very poor bisection width of 1.

A 2D mesh has a reasonable diameter and a reasonable bisection width (both on the order of sqrt(P)).
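These trade-offs can be made concrete with the standard formulas for a P-node network. A minimal sketch (Python, not part of the original card; the function name `metrics` is my own), assuming P is a power of two for the hypercube and a perfect square for the 2D mesh:

```python
import math

def metrics(p):
    """Standard diameter / bisection-width formulas for a P-node network.
    Assumes p is a power of two (hypercube) and a perfect square (2D mesh)."""
    side = math.isqrt(p)          # 2D mesh: side x side grid
    dim = p.bit_length() - 1      # hypercube dimension d, with p = 2^d
    return {
        "2D mesh":     {"diameter": 2 * (side - 1),       "bisection": side},
        # ~2*log2(p) for a complete binary tree with about p nodes
        "binary tree": {"diameter": 2 * int(math.log2(p)), "bisection": 1},
        "hypercube":   {"diameter": dim,                   "bisection": p // 2},
    }

for p in (16, 64, 256):
    print(p, metrics(p))
```

Running this shows the pattern in the answer: the hypercube's bisection width grows linearly with P (exponentially in d), the tree's stays at 1, and the mesh sits in between.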

3
Q

Some parallel machines contain multiple networks with different topologies. For example, the Blue Gene has a network with a 3D mesh topology and another network with a tree topology. Explain why it is useful to have different networks in one machine.

A

The 3D mesh has a much better bisection width (on the order of P^(2/3), versus 1 for the tree), while the tree has a better diameter (O(log P) versus O(P^(1/3)) for a 3D mesh), so which network performs best depends on the communication pattern of the application.

4
Q

Describe the similarities and one key difference between NUMA multiprocessors and Distributed Shared Memory.

A

In both DSM and Non-Uniform Memory Access (NUMA) multiprocessors, memory is distributed and each machine can access all of it, but local memory is faster to access than remote memory. The key difference is that DSM is implemented in software, whereas NUMA is implemented in hardware; as a result, the gap between local and remote access times is much larger for DSM than for NUMA.

5
Q

The parallel Barnes-Hut algorithm for hierarchical N-body problems can use several different load balancing mechanisms. One approach is to partition the physical space into equal parts and assign all bodies in this part of the space to one processor (so, different processors work on different parts of the physical space).

Explain how good or bad both the load balancing and the communication overhead of this approach are. For each of them, explain whether they are better or worse than for the costzone approach.

A

Space partitioning has extremely bad load balance, because it divides the physical space into equal-sized regions, and some regions may be (almost) empty while others are densely populated, so some machines get hardly any work while others get a lot. The costzone approach avoids this imbalance by assigning a "cost" to each body (the amount of work it required in the previous time step) and giving each machine the same total cost.

On the other hand, space partitioning has excellent locality, because the bodies (e.g. stars) that each machine gets are close together in physical space; good locality results in less communication. The costzone approach also tries to keep nearby bodies on the same machine, but its solution is an approximation and not perfect. So the communication overhead of space partitioning is somewhat lower than that of costzones.
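The costzone idea described above can be sketched in a few lines. This is a simplified illustration (Python, my own code, not the actual Barnes-Hut implementation): bodies are assumed to already be in a locality-preserving order (e.g. along a space-filling curve), and cuts are made whenever a processor has accumulated roughly 1/nprocs of the total cost:

```python
def costzone_partition(costs, nprocs):
    """Costzone-style partitioning sketch: walk the bodies in
    locality-preserving order and start a new processor's zone once the
    current one has ~1/nprocs of the total cost. `costs` are the per-body
    work estimates from the previous time step."""
    total = sum(costs)
    target = total / nprocs
    assignment, proc, acc = [], 0, 0.0
    for c in costs:
        assignment.append(proc)
        acc += c
        # move to the next processor once its cost share is filled
        if acc >= target and proc < nprocs - 1:
            proc += 1
            acc -= target
    return assignment

# one dense region (cost 5 per body) next to a sparse one (cost 1 per body)
print(costzone_partition([5, 5, 5, 1, 1, 1, 5, 5, 5], 3))
```

Note how the cheap bodies in the middle all land on one processor together with an extra expensive body, while an equal-space split would have given that processor far less work than its neighbours.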

6
Q

Both branch-and-bound (as used, for example, for the Traveling Salesman Problem) and Barnes-Hut (used for N-body problems) use techniques to cut-off (prune) part of the computations. Parallel branch-and-bound algorithms can sometimes obtain superlinear speedups, because the parallel version may (for certain input problems) perform less work than a sequential algorithm. Can the parallel Barnes-Hut algorithm also obtain superlinear speedups in this way? Explain your answer.

A

No, superlinear speedup cannot happen in this way with Barnes-Hut, because the parallel and sequential algorithms perform exactly the same optimizations (the same approximations), and therefore the same total amount of work.

7
Q

Consider an N-body system with 9 bodies, which will be simulated on 3 processors using the Barnes-Hut algorithm. The figure below shows two different distributions of the 9 bodies over the 3 processors; the number in each body indicates the CPU to which the body is assigned.
Explain why the distribution may affect the performance of the parallel program.

Which of the two distributions will give the best speedups? Why?

A

The distribution on the left has good locality: bodies that are close in physical space usually reside on the same machine. The distribution on the right has bad locality: nearby bodies end up on different machines. The left distribution therefore causes much less communication. To update a given body, its nearby bodies are always needed (they cannot be cut off by the Barnes-Hut approximation), and with the left distribution many of them are local; more distant bodies are on other machines, but they are not always needed. With the right distribution, the nearby bodies are always on remote machines, so every update requires communication.

8
Q

Parallel search algorithms like IDA* typically use shared transposition tables. One approach to implement such tables on a distributed-memory system is to replicate the tables on all processors, so lookups can be done locally, without any communication. Explain why the Transposition Driven Scheduling (TDS) algorithm still is much more efficient than replicated tables, especially for large numbers of processors. What are the main advantages of TDS over replicated tables?

A

Replicated tables can store fewer distinct entries (the aggregate table is limited to what fits in a single machine's memory), so fewer positions are cached and more redundant search is done. Moreover, replicated tables have to broadcast every update to all processors, which becomes very costly as the number of processors grows. TDS instead partitions the table, so the aggregate capacity grows with the number of processors, updates stay local to the owning machine, and TDS hides the remaining communication latency by sending work asynchronously and continuing with other local work.

9
Q

Explain how the load balancing mechanism of TDS (Transposition-Driven Search) works and how it differs from the “work stealing’’ load balancing mechanism of traditional parallel search methods (like IDA*).

A

With TDS, exploring a board configuration (or game state) produces a number of new board configurations, one per possible move. For each new configuration, a hash function determines which machine holds the transposition-table entry for it, and that machine is assigned the task of handling the configuration (either by retrieving it from the table or by exploring it further). Since the hash function distributes states essentially randomly, and the tasks are all fine-grained, each machine receives a similar (large) number of small tasks, resulting in good load balance. Essentially, task scheduling and the transposition table are integrated into one mechanism, called transposition-driven scheduling.

With traditional work stealing, scheduling and transposition tables are two completely separate mechanisms. Scheduling is done by letting idle machines ask other machines for work. (NB there are many ways to do that, e.g. random stealing, broadcasting a request, or keeping track of all available work somewhere; this was not discussed in class and need not be in your answer.) Work stealing is done dynamically, so it also incurs some communication overhead.
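The hash-based owner assignment described above can be sketched as follows. This is an illustrative toy (Python, my own code; the `owner` and `expand` functions and the string-based "states" are hypothetical, not the real TDS implementation): each successor state is sent to the work queue of the machine that owns its table entry, rather than being expanded locally.

```python
import hashlib

def owner(state, nprocs):
    """TDS-style mapping: hash a board state to the machine that owns its
    transposition-table entry; that machine also gets the task of handling it."""
    h = int(hashlib.sha256(state.encode()).hexdigest(), 16)
    return h % nprocs

def expand(state):
    # hypothetical move generator: successor states of `state`
    return [state + move for move in "abc"]

nprocs = 4
queues = {p: [] for p in range(nprocs)}
for succ in expand("root"):
    # send the new state asynchronously to its owner's work queue;
    # the sender does not wait for a reply (latency hiding)
    queues[owner(succ, nprocs)].append(succ)
print(queues)
```

Because a given state always hashes to the same owner, a transposition reached via different move orders lands on the same machine and hits the same table entry, without any broadcast.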
