Exam-like questions: flashcards for the lectures given by Francesc
Consider the following Julia code snippet. Recall that rand(1:10,n) creates a vector of n random integers with values from 1 to 10.
using Distributed
addprocs(4)
a = 2
b = 0
f = (i) -> mod(i,a) == b
n = 10000
c = rand(1:10,n)
x = @fetchfrom 3 count(f,c)
a) How many integers are sent and received between process 1 and process 3 in the last line? Give a value and motivate your answer.
b) What is the type of x: an Int or a Future? Explain your answer.
a) We send the function f, which captures 2 integers (a and b), and also the vector c (10000 integers). We also get back the result (1 integer). Thus, the total number of integers transferred is 10003.
b) x is an Int, since the macro @fetchfrom returns the result of the code executed remotely rather than a Future object. This is in contrast to @spawnat, which returns immediately with a Future object.
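A minimal sketch (reusing the variables from the snippet above) that contrasts the two macros:

x = @fetchfrom 3 count(f,c)   # blocks until process 3 is done and returns the value: x isa Int
y = @spawnat 3 count(f,c)     # returns immediately with a handle: y isa Future
z = fetch(y)                  # fetching the Future yields the Int computed remotely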
Consider the following Julia code snippet. Remember that a=[0] creates a vector with a single entry equal to zero.
using Distributed
addprocs(4)
f = () -> Channel{Int}(1)
chnl = RemoteChannel(f)
a = [0]
@sync for w in workers()
    @spawnat w put!(chnl,10)
    a[1] = a[1] + take!(chnl)
end
x = a[1]
a) What is the value of x? Give a value and motivate your answer.
b) If we remove @sync before the for-loop: i) the value of x cannot be predicted; ii) the value of x is unchanged. Choose an option and explain why.
a) The value of x will be 40. We have 4 worker processes and each puts the number 10 into the channel. The main process takes the value 10 four times and adds the values up, resulting in 40.
b) Removing @sync has no effect in this example, since each call to take! in the loop blocks until the corresponding remote process has called put! and the value has been transferred via the remote channel.
What would change if Floyd’s All-Pairs Shortest Paths (ASP) algorithm were parallelized by distributing the rows cyclically instead of blockwise over the different machines? Would the communication pattern and/or performance change?
During each iteration k, the processor owning row k has to broadcast it. With cyclic distribution, the processors will then take turns in owning/broadcasting the current row. The number of messages and volume of data sent will remain the same, just the order of the senders changes. This is likely to have very little impact on communication overhead.
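A small sketch of the only difference, assuming n rows, P machines with ranks 0:P-1, blocks of ceil(n/P) rows, and illustrative helper names:

# Which rank owns (and therefore broadcasts) row k in iteration k?
block_owner(k, n, P)  = div(k-1, cld(n,P))   # blockwise: the sender changes only every n/P iterations
cyclic_owner(k, n, P) = mod(k-1, P)          # cyclic: the senders take turns every iteration
# Either way exactly one row is broadcast per iteration, so the number of messages
# and the data volume stay the same; only the order of the senders changes.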
The 1D Jacobi algorithm is usually parallelized by partitioning the array (grid) data structure in horizontal blocks over the different machines. What would change if we distribute the rows cyclically over the machines (just as is done for the Linear Equations algorithm)? How would the communication pattern change? What would the impact on performance be?
To update an entry i, entries i-1 and i+1 are needed as input. With a cyclic distribution, these elements would be located at two other machines, so 2 messages would be needed for each entry i. With block distribution, only the boundary rows of a machine need data from remote machines. Hence, the cyclic distribution would have vastly more communication overhead and would have very bad performance.
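A sequential sketch of the 1D Jacobi update that makes the dependence explicit (illustrative update rule):

# Updating entry i needs only its two direct neighbours, i-1 and i+1.
function jacobi_step!(unew, u)
    n = length(u)
    for i in 2:(n-1)
        unew[i] = (u[i-1] + u[i+1]) / 2
    end
    unew
end
# Block distribution: only the first and last entry of each block need remote data.
# Cyclic distribution: both neighbours of every entry live on other machines.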
The Traveling Salesperson Problem (TSP) can be parallelized using Replicated Workers style parallelism: a master processor searches the TSP tree up to a certain depth (called MAXHOPS) and then dynamically distributes the remaining subtrees to a number of worker processors. The value of MAXHOPS affects the communication overhead, the load imbalance overhead, and the expected search overhead of the parallel TSP program.
(a) Explain in general what these three types of overhead mean.
(b) Explain for each type of overhead whether it increases or decreases with larger values of MAXHOPS. Justify your answers. You should assume that the TSP tree is sorted using the nearest-city-first heuristic.
a)
Communication: the overhead of message passing, in this case sending jobs to worker processors and updating the global minimum.
Load imbalance: some machines may have more work to do than others, resulting in idle time for machines with little work but longer runtimes for the machines with more work. To finish the overall program, we have to wait for the last machine to finish, so load imbalance results in lower speedups.
Search overhead: the extra work that the parallel algorithm does compared to the sequential algorithm.
b)
Communication overhead increases with higher values of MAXHOPS, because the algorithm will then generate more jobs (each taking less time).
Load imbalance decreases, because many small jobs can be distributed more evenly among all processors than a few large jobs, especially as different parts of the tree require different amounts of computation. However, if MAXHOPS is set too high, the master machine becomes the bottleneck, which would cause a large load imbalance.
Search overhead decreases, because information from leaf nodes (complete routes) in the left part of the tree becomes available to prune nodes in the other parts of the tree.
As an extreme example, consider MAXHOPS=1 with, say, 20 subtrees (21 cities) and 20 machines. Communication overhead will be extremely low: just one job per machine. Load imbalance will be high, because the 20 subtrees take very different amounts of time; in particular, the subtrees on the right can do much more pruning (because of the nearest-city-first heuristic) and thus cost less time. Search overhead will be high, because the optimal TSP solution is likely to be in the left part of the tree (again because of nearest-city-first), but some machines start working on the right-most subtrees without having this information from the left part. In the sequential algorithm, the tree is always searched from left to right.
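A schematic sketch of the replicated-workers structure, not a full TSP solver: the master expands the tree to depth MAXHOPS and pushes the resulting partial routes into a job queue that the workers consume. The subtree search itself is stubbed out and all names are illustrative.

using Distributed
addprocs(4)

@everywhere begin
    # Stand-in for "search the subtree below this partial route and return its best length".
    search_subtree(route) = (sleep(0.01); length(route))

    function tsp_worker(jobs, results)
        while true
            route = try take!(jobs) catch; break end   # jobs channel closed => no more work
            put!(results, search_subtree(route))
        end
    end
end

MAXHOPS, ncities = 2, 6
jobs    = RemoteChannel(() -> Channel{Vector{Int}}(128))
results = RemoteChannel(() -> Channel{Int}(128))

# Master: expand the tree up to depth MAXHOPS; each partial route becomes one job.
njobs = 0
function expand!(route)
    global njobs
    if length(route) == MAXHOPS
        put!(jobs, route); njobs += 1
    else
        for city in setdiff(2:ncities, route)
            expand!(vcat(route, city))
        end
    end
end
expand!([1])

tasks = [@spawnat w tsp_worker(jobs, results) for w in workers()]
best  = minimum(take!(results) for _ in 1:njobs)   # combine the per-subtree results
close(jobs)                                        # signal the workers that no more jobs will come
foreach(wait, tasks)

A larger MAXHOPS makes the jobs smaller and more numerous, which is exactly the trade-off discussed above.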
The goal of the function below is to compute the ASP problem using a distributed implementation of Floyd’s algorithm. Each MPI rank calls the function to compute a portion of the solution. Cw is a matrix containing the block of rows processed by the current MPI rank, and rows_w is a vector containing the global ids of the rows stored in Cw. The send/receive primitives used are FIFO-ordered, so messages between two nodes always arrive in the order they were sent. Explain why this algorithm is incorrect and may sometimes compute wrong results, depending on the timing behavior of the program.
function floyd_worker!(Cw,rows_w)
    comm = MPI.Comm_dup(MPI.COMM_WORLD)
    rank = MPI.Comm_rank(comm)
    nranks = MPI.Comm_size(comm)
    m,n = size(Cw)
    C_k = similar(Cw,n)
    for k in 1:n
        if k in rows_w
            myk = (k-first(rows_w))+1
            C_k .= view(Cw,myk,:)
            for proc in 0:(nranks-1)
                if rank == proc
                    continue
                end
                MPI.Send(C_k,comm;dest=proc,tag=0)
            end
        else
            MPI.Recv!(C_k,comm,source=MPI.ANY_SOURCE,tag=0)
        end
        for j in 1:n
            for i in 1:m
                @inbounds Cw[i,j] = min(Cw[i,j],Cw[i,k]+C_k[j])
            end
        end
    end
    Cw
end
Even if messages between two nodes are FIFO-ordered, the algorithm is still incorrect, because at different iterations different nodes may be the sender. For example, with the setup given on the slide (12 cities, 3 machines), at iteration 4 node-1 (on the left) sends row-4 to all other nodes; after receiving row-4, node-2 (at the bottom) sends row-5 to all other nodes. There is no guarantee that node-3 (on the right) receives row-4 before row-5, because FIFO ordering does not apply between multiple different senders. Hence, node-3 may first receive row-5 and think it is the row for iteration 4, giving wrong results.
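One possible fix, sketched under the assumption of a blockwise distribution in which every rank holds m rows (so the owner of global row k is div(k-1,m)), is to replace the point-to-point sends by a broadcast rooted at the owner of row k; since the broadcast is a collective, all ranks agree on which row belongs to which iteration. This uses the keyword form of MPI.Bcast! from recent MPI.jl versions:

for k in 1:n
    owner = div(k-1,m)
    if rank == owner
        myk = (k-first(rows_w))+1
        C_k .= view(Cw,myk,:)
    end
    MPI.Bcast!(C_k,comm;root=owner)   # replaces the Send/Recv! pair
    # ... update Cw as before ...
end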
Consider the function below. It implements a sequential algorithm that takes an N-by-M matrix G and an integer nsteps. Matrix G is updated within this function.
function foo(G,nsteps)
    N,M = size(G)
    Gnew = copy(G)
    for step in 1:nsteps
        for j in 3:(M-2)
            for i in 3:(N-2)
                Gnew[i,j] = (G[i,j]+G[i-1,j]+G[i-2,j]+G[i+1,j]+G[i+2,j]+
                             G[i,j-1]+G[i,j-2]+G[i,j+1]+G[i,j+2])/9
            end
        end
        G .= Gnew
    end
    G
end
What is the communication scheme if this algorithm is parallelized by partitioning the array row-wise over P processors (i.e., giving each processor N/P consecutive rows)? Make clear which data the processors will exchange. (You may assume that N is a multiple of P.)
The algorithm is very similar to a 2D Jacobi, except that each grid cell now needs the values from 8 other cells (2 in each direction). With a row-wise distribution, this means that every processor will now need two rows (instead of one) from each of its two neighbors, except for the first and last processor.
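A small illustration (hypothetical helper, not from the lectures) of which rows rank r exchanges per step, assuming ranks 0:P-1 and N a multiple of P:

function halo_rows(r, N, P)
    b = div(N, P)
    lo, hi = r*b + 1, (r+1)*b                      # rows owned by rank r
    send_up    = r > 0   ? (lo:lo+1)   : nothing   # sent to rank r-1
    send_down  = r < P-1 ? (hi-1:hi)   : nothing   # sent to rank r+1
    recv_above = r > 0   ? (lo-2:lo-1) : nothing   # ghost rows received from rank r-1
    recv_below = r < P-1 ? (hi+1:hi+2) : nothing   # ghost rows received from rank r+1
    (; send_up, send_down, recv_above, recv_below)
end
halo_rows(1, 12, 3)   # sends rows 5:6 up and 7:8 down, receives ghost rows 3:4 and 9:10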
When parallelising matrix multiplication C = A * B, which partition leads to the best computation/communication ratio?
a) Each process computes one element of C
b) Each process computes one row of C
c) Each process computes one column of C
d) Each process computes a set of rows of C
Answer: D
Explanation: By distributing a set of rows we optimise the communication/computation ratio: it becomes O(P/N), where P is the number of processors and N is the matrix size. As N gets larger, the relative communication overhead diminishes.
When parallelising matrix multiplication, assume each process computes a set of rows of C=A*B. What are the data dependencies for each process?
a) The whole matrix A and B
b) A[rows,:] and the whole matrix B
c) A[rows,:] and B[:,columns]
d) A[rows,:] and B[rows,:]
Answer: B
Explanation: Each process needs its assigned rows of matrix A and the entire matrix B to compute its respective portion of C.
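A serial sketch of the per-worker computation under this partitioning (the loop over p stands in for the P workers):

N = 8; P = 4
A, B = rand(N,N), rand(N,N)
C = zeros(N,N)
for p in 0:P-1
    rows = (p*div(N,P)+1):((p+1)*div(N,P))
    C[rows,:] = A[rows,:] * B     # data needed per worker: A[rows,:] plus all of B
end
C ≈ A*B                           # true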
Each process computes a set of rows of C = A * B. What are the computation and communication complexities in each worker, assuming the matrices are N×N and P is the number of processes?
a) O(N²/P), O(N²)
b) O(N³/P), O(N²)
c) O(N²/P), O(N)
d) O(N³/P), O(N³)
Answer: B
Explanation: For matrix multiplication, each process has a computation complexity proportional to its share of rows (O(N³/P)) and a communication complexity dominated by transferring the entire matrix B (O(N²)).
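A short derivation behind these bounds, assuming each worker holds N/P rows of A and receives the full matrix B:

\[
T_{\mathrm{comp}} = \frac{N}{P}\cdot N\cdot N = O\!\left(\frac{N^3}{P}\right), \qquad
T_{\mathrm{comm}} = \frac{N^2}{P} + N^2 = O(N^2), \qquad
\frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} = O\!\left(\frac{P}{N}\right).
\]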
Assume the communication/computation ratio is O(P/N). Does this parallel algorithm yield speedups?
a) Yes, if we have enough processes
b) Yes, if the problem size is large (N ≫ P)
c) No, because the communication overhead is too big
d) No, because the ratio depends on the problem size
Answer: B
Explanation: The algorithm scales well when the problem size (N) is significantly larger than the number of processes (P). If N ≫ P, the communication is small compared to the local computation.
Can we use latency hiding in the Jacobi method?
a) No, because all computations depend on the previous iteration
b) No, because all values depend on neighbouring values
c) Yes, because we can use values from the last iteration
d) Yes, because the computation of interior values can overlap with the communication of ghost cells
Answer: D
Explanation: The interior computations can proceed while communication of boundary (ghost) cells happens, hiding the latency of communication.
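A minimal sketch of how the overlap can be structured with nonblocking MPI calls (1D Jacobi with one ghost cell per side; assumes MPI.jl's keyword interface for Isend/Irecv!; run under mpiexec; all names are illustrative):

using MPI
MPI.Init()
comm   = MPI.Comm_dup(MPI.COMM_WORLD)
rank   = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)
up, down = rank-1, rank+1

nlocal = 8
u    = rand(nlocal+2)     # local cells 2:nlocal+1 plus one ghost cell at each end
unew = copy(u)

for step in 1:10
    reqs = MPI.Request[]
    if up >= 0
        push!(reqs, MPI.Irecv!(view(u,1:1), comm; source=up, tag=0))
        push!(reqs, MPI.Isend(view(u,2:2), comm; dest=up, tag=0))
    end
    if down <= nranks-1
        push!(reqs, MPI.Irecv!(view(u,nlocal+2:nlocal+2), comm; source=down, tag=0))
        push!(reqs, MPI.Isend(view(u,nlocal+1:nlocal+1), comm; dest=down, tag=0))
    end
    for i in 3:nlocal                       # interior update overlaps with the communication
        unew[i] = (u[i-1] + u[i+1]) / 2
    end
    MPI.Waitall(reqs)                       # ghost cells are now up to date
    unew[2]        = (u[1] + u[3]) / 2      # boundary cells need the ghost values
    unew[nlocal+1] = (u[nlocal] + u[nlocal+2]) / 2
    # (for the first/last rank the untouched ghost cell acts as a fixed boundary value)
    u .= unew
end
MPI.Finalize()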
Which of the following suggestions does NOT solve the incorrect behaviour of the parallel Floyd algorithm?
a) Communicate row k using synchronous sends
b) Communicate row k using MPI.Bcast!
c) Communicate row k using standard sends
d) Use MPI.Barrier in each iteration
Answer: C
Explanation: Standard sends only guarantee FIFO ordering between a single pair of processes, not between messages coming from multiple different senders, so the original problem remains.
What are the data dependencies for each worker in Gaussian elimination at iteration k>1?
a) Workers with rows > k need all rows <= k
b) Workers with rows < k need all rows >= k
c) All workers need row k
d) Workers with rows > k need only part of row k
Answer: D
Explanation: In Gaussian elimination at iteration k, workers with rows greater than k only need the part of row k after column k for the elimination step.
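A serial sketch of the elimination step at iteration k, showing that only the part of row k from column k onwards is read (no pivoting, illustrative names):

function eliminate_column!(A, k)
    n = size(A,1)
    for i in (k+1):n                 # rows below k are the ones that get updated
        factor = A[i,k] / A[k,k]
        for j in k:n                 # only columns k..n of row k are needed
            A[i,j] -= factor * A[k,j]
        end
    end
    A
end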
What can be done to solve the load imbalance in block partitioning in Gaussian elimination?
a) Use a cyclic distribution of rows
b) Schedule jobs in a queue
c) Use a 2D block partition
d) Use increasing block sizes
Answer: A
Explanation: Cyclic distribution interleaves the rows over the processes, so every process still owns rows below the current pivot until late in the elimination. This spreads the shrinking amount of work evenly, instead of letting the processes with the first blocks run out of work early.
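A small illustration with hypothetical helpers, contrasting the two row assignments for N rows on P ranks (0-based):

block_rows(r, N, P)  = (r*div(N,P)+1):((r+1)*div(N,P))   # with blocks, rank 0 has no rows left to update once k passes row N/P
cyclic_rows(r, N, P) = (r+1):P:N                          # cyclically, every rank keeps rows below the pivot until near the end
block_rows(0, 12, 3), cyclic_rows(0, 12, 3)               # (1:4, rows 1,4,7,10)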