Exam-like questions: flashcards for the lectures given by Francesc
Consider the following Julia code snippet. Recall that rand(1:10,n) creates a vector of n random integers with values from 1 to 10.
using Distributed
addprocs(4)
a = 2
b = 0
f = (i) -> mod(i,a) == b
n = 10000
c = rand(1:10,n)
x = @fetchfrom 3 count(f,c)
a) How many integers are sent and received between process 1 and process 3 in the last line? Give a value and motivate your answer.
b) What is the type of x: an Int or a Future? Explain your answer.
a) We send the function f, which captures 2 integers (a and b), and also the vector c (10000 integers). We also get back the result (1 integer). Thus, the total number of integers transferred is 10003.
b) x is an Int, since the macro @fetchfrom returns the result of the code executed remotely rather than a Future object. This is in contrast to @spawnat, which returns immediately with a Future object.
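A minimal sketch (reusing the variables from the snippet above) that contrasts the two macros:

x = @fetchfrom 3 count(f,c)   # blocks until process 3 is done and returns the value: x isa Int
y = @spawnat 3 count(f,c)     # returns immediately with a handle: y isa Future
z = fetch(y)                  # fetching the Future yields the Int computed remotely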
Consider the following Julia code snippet. Remember that a=[0] creates a vector with a single entry equal to zero.
using Distributed
addprocs(4)
f = () -> Channel{Int}(1)
chnl = RemoteChannel(f)
a = [0]
@sync for w in workers()
    @spawnat w put!(chnl,10)
    a[1] = a[1] + take!(chnl)
end
x = a[1]
a) What is the value of x? Give a value and motivate your answer.
b) If we remove @sync before the for-loop: i) the value of x cannot be predicted; ii) the value of x is unchanged. Choose an option and explain why.
a) The value of x will be 40. We have 4 worker processes and each puts the number 10 into the channel. The main process takes the value 10 four times and adds the values up, resulting in 40.
b) Removing @sync has no effect in this example, since each call to take! in the loop blocks until the corresponding remote process has called put! and the value has been transferred via the remote channel.
What would change if Floyd’s All-Pairs Shortest Paths (ASP) algorithm were parallelized by distributing the rows cyclically instead of blockwise over the different machines? Would the communication pattern and/or performance change?
During each iteration k, the processor owning row k has to broadcast it. With cyclic distribution, the processors will then take turns in owning/broadcasting the current row. The number of messages and volume of data sent will remain the same, just the order of the senders changes. This is likely to have very little impact on communication overhead.
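A small sketch of the only difference, assuming n rows, P machines with ranks 0:P-1, blocks of ceil(n/P) rows, and illustrative helper names:

# Which rank owns (and therefore broadcasts) row k in iteration k?
block_owner(k, n, P)  = div(k-1, cld(n,P))   # blockwise: the sender changes only every n/P iterations
cyclic_owner(k, n, P) = mod(k-1, P)          # cyclic: the senders take turns every iteration
# Either way exactly one row is broadcast per iteration, so the number of messages
# and the data volume stay the same; only the order of the senders changes.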
The 1D Jacobi algorithm is usually parallelized by partitioning the array (grid) data structure in horizontal blocks over the different machines. What would change if we distribute the rows cyclically over the machines (just as is done for the Linear Equations algorithm)? How would the communication pattern change? What would the impact on performance be?
To update an entry i, entries i-1 and i+1 are needed as input. With a cyclic distribution, these elements would be located at two other machines, so 2 messages would be needed for each entry i. With block distribution, only the boundary rows of a machine need data from remote machines. Hence, the cyclic distribution would have vastly more communication overhead and would have very bad performance.
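A sequential sketch of the 1D Jacobi update that makes the dependence explicit (illustrative update rule):

# Updating entry i needs only its two direct neighbours, i-1 and i+1.
function jacobi_step!(unew, u)
    n = length(u)
    for i in 2:(n-1)
        unew[i] = (u[i-1] + u[i+1]) / 2
    end
    unew
end
# Block distribution: only the first and last entry of each block need remote data.
# Cyclic distribution: both neighbours of every entry live on other machines.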
The Traveling Salesperson Problem (TSP) can be parallelized using Replicated Workers style parallelism: a master processor searches the TSP tree up to a certain depth (called MAXHOPS) and then dynamically distributes the remaining subtrees to a number of worker processors. The value of MAXHOPS affects the communication overhead, the load imbalance overhead, and the expected search overhead of the parallel TSP program.
(a) Explain in general what these three types of overhead mean.
(b) Explain for each type of overhead whether it increases or decreases with larger values of MAXHOPS. Justify your answers. You should assume that the TSP tree is sorted using the nearest-city-first heuristic.
a)
Communication: the overhead of message passing, in this case sending jobs to worker processors and updating the global minimum.
Load imbalance: some machines may have more work to do than others, resulting in idle time for machines with little work but longer runtimes for the machines with more work. To finish the overall program, we have to wait for the last machine to finish, so load imbalance results in lower speedups.
Search overhead: the extra work that the parallel algorithm does compared to the sequential algorithm.
b)
Communication overhead increases with higher values of MAXHOPS, because the algorithm will then generate more jobs (each taking less time).
Load imbalance decreases, because many small jobs can be distributed more evenly among all processors than a few large jobs, especially as different parts of the tree require different amounts of computation. However, if MAXHOPS is set too high, the master machine becomes the bottleneck, which would cause a large load imbalance.
Search overhead decreases, because information from leaf nodes (complete routes) in the left part of the tree becomes available to prune nodes in the other parts of the tree.
As an extreme example, consider MAXHOPS=1 with, say, 20 subtrees (21 cities) and 20 machines. Communication overhead will be extremely low: just one job per machine. Load imbalance will be high, because the 20 subtrees take very different amounts of time; in particular, the subtrees on the right can do much more pruning (because of the nearest-city-first heuristic) and thus cost less time. Search overhead will be high, because the optimal TSP solution is likely to be in the left part of the tree (again because of nearest-city-first), but some machines start working on the right-most subtrees without having this information from the left part. In the sequential algorithm, the tree is always searched from left to right.
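A schematic sketch of the replicated-workers structure, not a full TSP solver: the master expands the tree to depth MAXHOPS and pushes the resulting partial routes into a job queue that the workers consume. The subtree search itself is stubbed out and all names are illustrative.

using Distributed
addprocs(4)

@everywhere begin
    # Stand-in for "search the subtree below this partial route and return its best length".
    search_subtree(route) = (sleep(0.01); length(route))

    function tsp_worker(jobs, results)
        while true
            route = try take!(jobs) catch; break end   # jobs channel closed => no more work
            put!(results, search_subtree(route))
        end
    end
end

MAXHOPS, ncities = 2, 6
jobs    = RemoteChannel(() -> Channel{Vector{Int}}(128))
results = RemoteChannel(() -> Channel{Int}(128))

# Master: expand the tree up to depth MAXHOPS; each partial route becomes one job.
njobs = 0
function expand!(route)
    global njobs
    if length(route) == MAXHOPS
        put!(jobs, route); njobs += 1
    else
        for city in setdiff(2:ncities, route)
            expand!(vcat(route, city))
        end
    end
end
expand!([1])

tasks = [@spawnat w tsp_worker(jobs, results) for w in workers()]
best  = minimum(take!(results) for _ in 1:njobs)   # combine the per-subtree results
close(jobs)                                        # signal the workers that no more jobs will come
foreach(wait, tasks)

A larger MAXHOPS makes the jobs smaller and more numerous, which is exactly the trade-off discussed above.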
The goal of the function below is to compute the ASP problem using a distributed implementation of Floyd’s algorithm. Each MPI rank calls the function to compute a portion of the solution. Cw is a matrix containing the block of rows processed by the current MPI rank, and rows_w is a vector containing the global ids of the rows stored in Cw. The send/receive primitives used are FIFO-ordered, so messages between two nodes always arrive in the order they were sent. Explain why this algorithm is incorrect and may sometimes compute wrong results, depending on the timing behavior of the program.
function floyd_worker!(Cw,rows_w)
    comm = MPI.Comm_dup(MPI.COMM_WORLD)
    rank = MPI.Comm_rank(comm)
    nranks = MPI.Comm_size(comm)
    m,n = size(Cw)
    C_k = similar(Cw,n)
    for k in 1:n
        if k in rows_w
            myk = (k-first(rows_w))+1
            C_k .= view(Cw,myk,:)
            for proc in 0:(nranks-1)
                if rank == proc
                    continue
                end
                MPI.Send(C_k,comm;dest=proc,tag=0)
            end
        else
            MPI.Recv!(C_k,comm,source=MPI.ANY_SOURCE,tag=0)
        end
        for j in 1:n
            for i in 1:m
                @inbounds Cw[i,j] = min(Cw[i,j],Cw[i,k]+C_k[j])
            end
        end
    end
    Cw
end
Even if messages between two nodes are FIFO-ordered, the algorithm is still incorrect, because at different iterations different nodes may be the sender. For example, with the setup given on the slide (12 cities, 3 machines), at iteration 4 node-1 (on the left) sends row-4 to all other nodes; after receiving row-4, node-2 (at the bottom) sends row-5 to all other nodes. There is no guarantee that node-3 (on the right) receives row-4 before row-5, because FIFO ordering does not apply between multiple different senders. Hence, node-3 may first receive row-5 and think it is the row for iteration 4, giving wrong results.
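One possible fix, sketched under the assumption of a blockwise distribution in which every rank holds m rows (so the owner of global row k is div(k-1,m)), is to replace the point-to-point sends by a broadcast rooted at the owner of row k; since the broadcast is a collective, all ranks agree on which row belongs to which iteration. This uses the keyword form of MPI.Bcast! from recent MPI.jl versions:

for k in 1:n
    owner = div(k-1,m)
    if rank == owner
        myk = (k-first(rows_w))+1
        C_k .= view(Cw,myk,:)
    end
    MPI.Bcast!(C_k,comm;root=owner)   # replaces the Send/Recv! pair
    # ... update Cw as before ...
end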
Consider the function below. It implements a sequential algorithm that takes an N-by-M matrix G and an integer nsteps. Matrix G is updated within this function.
function foo(G,nsteps)
    N,M = size(G)
    Gnew = copy(G)
    for step in 1:nsteps
        for j in 3:(M-2)
            for i in 3:(N-2)
                Gnew[i,j] = (G[i,j]+G[i-1,j]+G[i-2,j]+G[i+1,j]+G[i+2,j]+
                             G[i,j-1]+G[i,j-2]+G[i,j+1]+G[i,j+2])/9
            end
        end
        G .= Gnew
    end
    G
end
What is the communication scheme if this algorithm is parallelized by partitioning the array row-wise over P processors (i.e., giving each processor N/P consecutive rows)? Make clear which data the processors will exchange. (You may assume that N is a multiple of P.)
The algorithm is very similar to a 2D Jacobi, except that each grid cell now needs the values from 8 other cells (2 in each direction). With a row-wise distribution, this means that every processor will now need two rows (instead of one) from each of its two neighbors, except for the first and last processor.
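A small illustration (hypothetical helper, not from the lectures) of which rows rank r exchanges per step, assuming ranks 0:P-1 and N a multiple of P:

function halo_rows(r, N, P)
    b = div(N, P)
    lo, hi = r*b + 1, (r+1)*b                      # rows owned by rank r
    send_up    = r > 0   ? (lo:lo+1)   : nothing   # sent to rank r-1
    send_down  = r < P-1 ? (hi-1:hi)   : nothing   # sent to rank r+1
    recv_above = r > 0   ? (lo-2:lo-1) : nothing   # ghost rows received from rank r-1
    recv_below = r < P-1 ? (hi+1:hi+2) : nothing   # ghost rows received from rank r+1
    (; send_up, send_down, recv_above, recv_below)
end
halo_rows(1, 12, 3)   # sends rows 5:6 up and 7:8 down, receives ghost rows 3:4 and 9:10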
When parallelising matrix multiplication C = A * B, which partition leads to the best computation/communication ratio?
a) Each process computes one element of C
b) Each process computes one row of C
c) Each process computes one column of C
d) Each process computes a set of rows of C
Answer: D
Explanation: By distributing a set of rows we optimise the communication/computation ratio: it becomes O(P/N), where P is the number of processors and N is the matrix size. As N gets larger, the relative communication overhead diminishes.
When parallelising matrix multiplication, assume each process computes a set of rows of C=A*B. What are the data dependencies for each process?
a) The whole matrix A and B
b) A[rows,:] and the whole matrix B
c) A[rows,:] and B[:,columns]
d) A[rows,:] and B[rows,:]
Answer: B
Explanation: Each process needs its assigned rows of matrix A and the entire matrix B to compute its respective portion of C.
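A serial sketch of the per-worker computation under this partitioning (the loop over p stands in for the P workers):

N = 8; P = 4
A, B = rand(N,N), rand(N,N)
C = zeros(N,N)
for p in 0:P-1
    rows = (p*div(N,P)+1):((p+1)*div(N,P))
    C[rows,:] = A[rows,:] * B     # data needed per worker: A[rows,:] plus all of B
end
C ≈ A*B                           # true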
Each process computes a set of rows of C = A * B. What are the computation and communication complexities in each worker, assuming the matrices are N×N and P is the number of processes?
a) O(N²/P), O(N²)
b) O(N³/P), O(N²)
c) O(N²/P), O(N)
d) O(N³/P), O(N³)
Answer: B
Explanation: For matrix multiplication, each process has a computation complexity proportional to its share of rows (O(N³/P)) and a communication complexity dominated by transferring the entire matrix B (O(N²)).
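A short derivation behind these bounds, assuming each worker holds N/P rows of A and receives the full matrix B:

\[
T_{\mathrm{comp}} = \frac{N}{P}\cdot N\cdot N = O\!\left(\frac{N^3}{P}\right), \qquad
T_{\mathrm{comm}} = \frac{N^2}{P} + N^2 = O(N^2), \qquad
\frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} = O\!\left(\frac{P}{N}\right).
\]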
Assume the communication/computation ratio is O(P/N). Does this parallel algorithm yield speedups?
a) Yes, if we have enough processes
b) Yes, if the problem size is large (N ≫ P)
c) No, because the communication overhead is too big
d) No, because the ratio depends on the problem size
Answer: B
Explanation: The algorithm scales well when the problem size (N) is significantly larger than the number of processes (P). If N ≫ P, the communication is small compared to the local computation.
Can we use latency hiding in the Jacobi method?
a) No, because all computations depend on the previous iteration
b) No, because all values depend on neighbouring values
c) Yes, because we can use values from the last iteration
d) Yes, because the computation of interior values can overlap with the communication of ghost cells
Answer: D
Explanation: The interior computations can proceed while communication of boundary (ghost) cells happens, hiding the latency of communication.
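A minimal sketch of how the overlap can be structured with nonblocking MPI calls (1D Jacobi with one ghost cell per side; assumes MPI.jl's keyword interface for Isend/Irecv!; run under mpiexec; all names are illustrative):

using MPI
MPI.Init()
comm   = MPI.Comm_dup(MPI.COMM_WORLD)
rank   = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)
up, down = rank-1, rank+1

nlocal = 8
u    = rand(nlocal+2)     # local cells 2:nlocal+1 plus one ghost cell at each end
unew = copy(u)

for step in 1:10
    reqs = MPI.Request[]
    if up >= 0
        push!(reqs, MPI.Irecv!(view(u,1:1), comm; source=up, tag=0))
        push!(reqs, MPI.Isend(view(u,2:2), comm; dest=up, tag=0))
    end
    if down <= nranks-1
        push!(reqs, MPI.Irecv!(view(u,nlocal+2:nlocal+2), comm; source=down, tag=0))
        push!(reqs, MPI.Isend(view(u,nlocal+1:nlocal+1), comm; dest=down, tag=0))
    end
    for i in 3:nlocal                       # interior update overlaps with the communication
        unew[i] = (u[i-1] + u[i+1]) / 2
    end
    MPI.Waitall(reqs)                       # ghost cells are now up to date
    unew[2]        = (u[1] + u[3]) / 2      # boundary cells need the ghost values
    unew[nlocal+1] = (u[nlocal] + u[nlocal+2]) / 2
    # (for the first/last rank the untouched ghost cell acts as a fixed boundary value)
    u .= unew
end
MPI.Finalize()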
Which of the following suggestions does NOT solve the incorrect behaviour of the parallel Floyd algorithm?
a) Communicate row k using synchronous sends
b) Communicate row k using MPI.Bcast!
c) Communicate row k using standard sends
d) Use MPI.Barrier in each iteration
Answer: C
Explanation: Standard sends only guarantee FIFO ordering between a single pair of processes, not between messages coming from multiple different senders, so the original problem remains.
What are the data dependencies for each worker in Gaussian elimination at iteration k>1?
a) Workers with rows > k need all rows <= k
b) Workers with rows < k need all rows >= k
c) All workers need row k
d) Workers with rows > k need only part of row k
Answer: D
Explanation: In Gaussian elimination at iteration k, workers with rows greater than k only need the part of row k after column k for the elimination step.
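A serial sketch of the elimination step at iteration k, showing that only the part of row k from column k onwards is read (no pivoting, illustrative names):

function eliminate_column!(A, k)
    n = size(A,1)
    for i in (k+1):n                 # rows below k are the ones that get updated
        factor = A[i,k] / A[k,k]
        for j in k:n                 # only columns k..n of row k are needed
            A[i,j] -= factor * A[k,j]
        end
    end
    A
end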
What can be done to solve the load imbalance in block partitioning in Gaussian elimination?
a) Use a cyclic distribution of rows
b) Schedule jobs in a queue
c) Use a 2D block partition
d) Use increasing block sizes
Answer: A
Explanation: Cyclic distribution interleaves the rows over the processes, so every process still owns rows below the current pivot until late in the elimination. This spreads the shrinking amount of work evenly, instead of letting the processes with the first blocks run out of work early.
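A small illustration with hypothetical helpers, contrasting the two row assignments for N rows on P ranks (0-based):

block_rows(r, N, P)  = (r*div(N,P)+1):((r+1)*div(N,P))   # with blocks, rank 0 has no rows left to update once k passes row N/P
cyclic_rows(r, N, P) = (r+1):P:N                          # cyclically, every rank keeps rows below the pivot until near the end
block_rows(0, 12, 3), cyclic_rows(0, 12, 3)               # (1:4, rows 1,4,7,10)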