Notebook Questions Flashcards

from here: https://www.francescverdugo.com/XM_40017/dev/

1
Q

What will be the value of x in the last line? (Think about your answer before executing the next cell to find out the result.)

x = 1
y = x
y = 2
x

A

In the first line, we assign the value 1 to the variable x. In the second line, we assign the same value to another variable, y. Thus, we have two variables associated with the same value. In line 3, we re-assign y to a new value, so we have two variables associated with two different values. Variable x is still associated with its original value. Thus, in the final line x = 1.

2
Q

What will be the value of x in the last line?

function q(x)
    x = 2
    x
end

x = 1
y = q(x)
x

A

It will be 1, for reasons very similar to the previous question: inside the function we re-assign the local variable x (the function argument), not the global variable x defined outside the function.

3
Q

What will be the value of x below?

function hofun(x)
    y -> x*y
end

f2 = hofun(2)

x = f2(3)
x

A

It will be 6. The returned function f2 has captured x = 2, so calling f2(3) computes 2*3.

4
Q

How long will the compute time of the next cell be?

n = 140_000_000
@time for i in 1:10
    @show compute_π(n)
end

a) 10t
b) t
c) 0.1t
d) O(1), i.e. time independent of n

A

a) 10t. Evaluating compute_π(100_000_000) takes about 0.25 seconds on the teacher's laptop. Since the loop calls the function 10 times, one call after another, the loop takes about 10t, i.e. roughly 2.5 seconds.

5
Q

How long will the compute time of the next cell be?

n = 140_000_000
@time for i in 1:10
    @async @show compute_π(n)
end

a) 10t
b) t
c) 0.1t
d) O(1)

A

d) O(1). The loop just schedules 10 tasks with @async and does not wait for them, which takes a (small) constant time independent of n.

6
Q

How long will the compute time of the next cell be?

n = 140_000_000
@time @sync for i in 1:10
    @async @show compute_π(n)
end

a) 10t
b) t
c) 0.1t
d) O(1)

A

a) 10t, i.e. about 2.5 seconds, as in question 4. The @sync macro forces waiting for all the tasks generated with @async inside the block. Since the 10 tasks run on a single thread and each takes about 0.25 seconds, the total time is about 2.5 seconds.

7
Q

How long will the compute time of the 2nd cell be?

buffer_size = 4
chnl = Channel{Int}(buffer_size)

@time begin
    put!(chnl,3)
    i = take!(chnl)
    sleep(i)
end

a) infinity
b) 1 second
c) less than 1 second
d) 3 seconds

A

d) 3 seconds. The channel has buffer size 4, so the call to put! will not block. The call to take! will not block either, since there is a value stored in the channel. The taken value is 3, so the cell sleeps for about 3 seconds.

8
Q

How long will the compute time of the 2nd cell be?

chnl = Channel{Int}()

@time begin
    put!(chnl,3)
    i = take!(chnl)
    sleep(i)
end

a) infinity
b) 1 second
c) less than 1 second
d) 3 seconds

A

a) infinity. The channel is unbuffered, so the call to put! blocks. The cell will run forever, since there is no other task that calls take! on this channel (see the sketch below for a variant that works).
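
A minimal sketch (an illustration, not from the original notebook) of how the unbuffered version can be made to work: put! must run in a separate task, so that the main task can reach the matching take!.

chnl = Channel{Int}()
@async put!(chnl,3)   # the producer runs in its own task and blocks until the value is taken
i = take!(chnl)       # receives 3 and unblocks the producer task
sleep(i)              # sleeps for about 3 seconds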

9
Q

How many integers are transferred between the master and the worker (counting both directions)?

a = rand(Int,4,4)
proc = 4
@fetchfrom proc sum(a^2)

a) 17
b) 32
c) 16^2
d) 65

A

We send the matrix (16 entries) and then we receive back the result (1 extra integer). Thus, the total number of transferred integers is 17 (answer a).

10
Q

How many integers are transferred between the master and the worker (counting both directions)?

a = rand(Int,4,4)
proc = 4
@fetchfrom proc sum(a[2,2]^2)

a) 2
b) 17
c) 5
d) 32

A

Even though we only use a single entry of the matrix on the remote worker, the entire matrix is captured and sent to the worker. Thus, we transfer 17 integers, just like in the previous question (answer b).

11
Q

What will be the value of x?

a = zeros(Int,3)
proc = 3
@sync @spawnat proc a[2] = 2
x = a[2]
x

A

The value of x will still be 0, since the worker receives a copy of the array and modifies that copy, not the original one.

12
Q

What will be the value of x?

a = zeros(Int,3)
proc = myid()
@sync @spawnat proc a[2] = 2
x = a[2]
x

A

In this case, the code a[2] = 2 is executed in the main process. Since the array already lives in the main process, there is no need to create and send a copy of it. Thus, the code modifies the original array and the value of x will be 2.

13
Q

What is the complexity (number of operations) of the serial algorithm? Assume that all matrices are N-by-N.

for j in 1:N
    for i in 1:N
        Cij = z
        for k in 1:N
            @inbounds Cij += A[i,k]*B[k,j]
        end
        C[i,j] = Cij
    end
end

a) O(1)
b) O(N)
c) O(N²)
d) O(N³)

A

d) O(N³). There are three nested loops of N iterations each, with O(1) work in the innermost body.

14
Q

Which are the data dependencies of the computations done by the worker in charge of computing entry C[i,j]?
a) column A[:,i] and row B[j,:]
b) row A[i,:] and column B[:,j]
c) the whole matrices A and B
d) row A[i,:] and the whole matrix B

A

b) row A[i,:] and column B[:,j]. The entry C[i,j] is the dot product of row i of A and column j of B.

15
Q

How many scalars are communicated from and to a worker? Assume that A, B, and C are N-by-N matrices.

a) O(1)
b) O(N)
c) O(N²)
d) O(N³)

A

b) O(N). The worker receives row A[i,:] and column B[:,j] (2N scalars) and sends back the single entry C[i,j].

16
Q

How many operations are done in a worker?

a) O(1)
b) O(N)
c) O(N²)
d) O(N³)

A

b) O(N). Computing one entry of C is a dot product of length N, i.e. about 2N operations.

17
Q

Which are the data dependencies of the computations done by the worker in charge of computing row C[i,:]?

a) column A[:,i] and row B[j,:]
b) row A[i,:] and column B[:,j]
c) the whole matrices A and B
d) row A[i,:] and the whole matrix B

A

d) row A[i,:] and the whole matrix B. To compute all entries C[i,j] of the row, we need row i of A and every column of B (compare with the analogous question for a range of rows below).

18
Q

What is the complexity of the communication and computations done by a worker in algorithm 2?

a) O(N) communication and O(N^2) computation
b) O(N^2) communication and O(N^2) computation
c) O(N^3) communication and O(N^3) computation
d) O(N) communication and O(N) computation

A

b) O(N²) communication and O(N²) computation. The worker needs row A[i,:] and the whole matrix B, which is O(N²) data, and computing the row requires N dot products of length N, i.e. O(N²) operations.

19
Q

Which are the data dependencies of the computations done by the worker in charge of computing the range of rows C[rows,:]?
a) A[rows,:] and B[:,rows]
b) the whole matrix A and B[:,rows]
c) A[rows,:] and the whole matrix B
d) the whole matrices A and B

A

c) A[rows,:] and the whole matrix B

20
Q

What is the complexity of the communication and computations done by a worker in algorithm 3?

a) O(N²) communication and O(N³) computation
b) O(N²) communication and O(N³/P) computation
c) O(N²) communication and O(N²) computation
d) O(N²/P) communication and O(N³/P) computation

A

b) O(N²) communication and O(N³/P) computation

21
Q

What does MPI_Barrier do?

A

It synchronizes all processes: no process continues past the barrier until every process in the communicator has reached it.

22
Q

What does MPI_Bcast do?

A

It sends the same data from one process (the root) to all processes.
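
A minimal MPI.jl sketch (not from the course notebooks; run with, e.g., mpiexec -n 4 julia script.jl): the buffer of the root rank is copied to every other rank.

using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
buf = rank == 0 ? [1,2,3] : zeros(Int,3)
MPI.Bcast!(buf,comm;root=0)   # afterwards buf == [1,2,3] on all ranks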

23
Q

What does MPI_Gather do?

A

It gathers data from all processes onto one process (the root).

24
Q

What does MPI_Scatter do?

A

It scatters different pieces of data from one process (the root) to all processes.

25
Q

What does MPI_Reduce do?

A

It combines (reduces) data from all processes, e.g. with a sum, and delivers the result to a single process (the root).

26
Q

What does MPI_Scan do?

A

It performs a scan (prefix) reduction: process i receives the reduction of the values contributed by processes 0 through i.

27
Q

What does MPI_Allgather do?

A

Like MPI_Gather, but all processes receive the gathered result.

28
Q

What does MPI_Allreduce do?

A

Like MPI_Reduce, but all processes receive the reduced result.
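
A minimal MPI.jl sketch (not from the course notebooks) contrasting the two: with MPI_Reduce only the root obtains the sum, with MPI_Allreduce every rank obtains it.

using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
total_root = MPI.Reduce(rank,+,comm;root=0)   # sum of all ranks, available only on rank 0
total_all  = MPI.Allreduce(rank,+,comm)       # the same sum, available on every rank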

29
Q

What does MPI_Alltoall do?

A

It exchanges data between all pairs of processes: each process sends a distinct piece of data to every other process.

30
Q

What two key features do MPI communicators provide? How are they useful?

A

1) isolated communication context
2) creating groups of processes

Useful to:
1) combine different libraries that use MPI within the same application
2) use collective operations on a subset of the processes (see the sketch below)
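
A minimal MPI.jl sketch of point 2 (not from the course notebooks): MPI.COMM_WORLD is split into two new communicators, one with the even ranks and one with the odd ranks, and a collective is then used only within each group.

using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
color = mod(rank,2)                         # 0 for even ranks, 1 for odd ranks
newcomm = MPI.Comm_split(comm,color,rank)   # one new communicator per color
group_sum = MPI.Allreduce(rank,+,newcomm)   # reduction restricted to the group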

31
Q

Compute the complexity of the communication over computation ratio for this data partition. (1D block partition)

A
  • We update N^2/P items per iteration
  • We need data from 2 neighbors (2 messages per iteration)
  • We communicate N items per message
  • Communication/computation ratio: 2N/(N^2/P) = 2P/N = O(P/N)
31
Q

Compute the complexity of the communication over computation ratio for this data partition. (2D Block partition)

A
  • We update N^2/P items per iteration
  • We need data from 4 neighbors (4 messages per iteration)
  • We communicate N/sqrt(P) items per message
  • Communication/computation ratio: (4N/sqrt(P))/(N^2/P) = 4sqrt(P)/N = O(sqrt(P)/N)
31
Q

Compute the complexity of the communication over computation ratio for this data partition. (2D cyclic partition)

A
  • We update N^2/P items
  • We need data from 4 neighbors (4 messages per iteration)
  • We communicate N^2/P items per message (the full data owned by the neighbor)
  • Communication/computation ratio: (4N^2/P)/(N^2/P) = 4 = O(1)
31
Q

Which of the two loops in the Gauss-Seidel method are trivially parallelizable?

for t in 1:niters
    for i in 2:(n+1)
        u[i] = 0.5*(u[i-1]+u[i+1])
    end
end

a) Both of them
b) The outer, but not the inner
c) None of them
d) The inner, but not the outer

A

c) None of them. The inner loop has a loop-carried dependence: u[i] uses u[i-1], which has just been updated in the same sweep. The outer loop is sequential as well, since each iteration over t starts from the result of the previous one.

32
Q

Which of the loops in the red-black Gauss-Seidel method are trivially parallelizable?

for t in 1:niters
    for color in (0,1)
        for i in (n+1):-1:2
            if color == mod(i,2)
                u[i] = 0.5*(u[i-1]+u[i+1])
            end
        end
    end
end

a) All loops
b) Loop over t only
c) Loop over color only
d) Loop over i only

A

d) Loop over i only

All “red” cells can be updated in parallel as they only depend on the values of “black” cells.

In other words, we can update the “red” cells in any order without changing the result. They only depend on values in the “black” cells, which will not change during the loop over “red” cells.

Similarly, all “black” cells can be updated in parallel as they only depend on “red” cells.

33
Q

Is this other implementation based on MPI.Send and MPI.Recv! correct?

function ghost_exchange!(u,comm)
    load = length(u)-2
    rank = MPI.Comm_rank(comm)
    nranks = MPI.Comm_size(comm)
    if rank != 0
        neig_rank = rank-1
        u_snd = view(u,2:2)
        u_rcv = view(u,1:1)
        dest = neig_rank
        source = neig_rank
        MPI.Send(u_snd,comm;dest)
        MPI.Recv!(u_rcv,comm;source)
    end
    if rank != (nranks-1)
        neig_rank = rank+1
        u_snd = view(u,(load+1):(load+1))
        u_rcv = view(u,(load+2):(load+2))
        dest = neig_rank
        source = neig_rank
        MPI.Send(u_snd,comm;dest)
        MPI.Recv!(u_rcv,comm;source)
    end
end

a) It is correct.
b) It is incorrect, but it might provide the right result depending on the MPI implementation.
c) It is incorrect, and it is guaranteed that it will result in a dead lock.
d) This implementation does not work when distributing over just a single MPI rank.

A

b) It is incorrect, but it might provide the right result depending on the MPI implementation. Each rank calls the blocking MPI.Send before posting any receive, so there is a cyclic dependency; whether this actually deadlocks depends on whether the MPI implementation buffers these small messages internally.

35
Q

How would you fix the implementation while still using MPI.Send and MPI.Recv! instead of MPI.Sendrecv!?

function ghost_exchange!(u,comm)
    load = length(u)-2
    rank = MPI.Comm_rank(comm)
    nranks = MPI.Comm_size(comm)
    if rank != 0
        neig_rank = rank-1
        u_snd = view(u,2:2)
        u_rcv = view(u,1:1)
        dest = neig_rank
        source = neig_rank
        MPI.Send(u_snd,comm;dest)
        MPI.Recv!(u_rcv,comm;source)
    end
    if rank != (nranks-1)
        neig_rank = rank+1
        u_snd = view(u,(load+1):(load+1))
        u_rcv = view(u,(load+2):(load+2))
        dest = neig_rank
        source = neig_rank
        MPI.Send(u_snd,comm;dest)
        MPI.Recv!(u_rcv,comm;source)
    end
end

A

One needs to carefully order the sends and the receives to avoid cyclic dependencies that might result in deadlocks. The actual implementation is left as an exercise; one possible ordering is sketched below.
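
One possible ordering (a sketch, not the course's reference solution): ranks with an even id send first and then receive, ranks with an odd id receive first and then send, so every blocking call finds a matching partner and no cycle of waits can form.

function ghost_exchange_fixed!(u,comm)
    load = length(u)-2
    rank = MPI.Comm_rank(comm)
    nranks = MPI.Comm_size(comm)
    if iseven(rank)
        if rank != 0                  # exchange with the left neighbor
            MPI.Send(view(u,2:2),comm;dest=rank-1)
            MPI.Recv!(view(u,1:1),comm;source=rank-1)
        end
        if rank != (nranks-1)         # exchange with the right neighbor
            MPI.Send(view(u,(load+1):(load+1)),comm;dest=rank+1)
            MPI.Recv!(view(u,(load+2):(load+2)),comm;source=rank+1)
        end
    else
        if rank != (nranks-1)         # mirror order: receive first, then send
            MPI.Recv!(view(u,(load+2):(load+2)),comm;source=rank+1)
            MPI.Send(view(u,(load+1):(load+1)),comm;dest=rank+1)
        end
        if rank != 0
            MPI.Recv!(view(u,1:1),comm;source=rank-1)
            MPI.Send(view(u,2:2),comm;dest=rank-1)
        end
    end
end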

36
Q

Which MPI directives would you use to implement latency hiding in the communication of the ghost values?

a) MPI_Send and MPI_Recv
b) MPI_Bsend and MPI_Recv
c) MPI_Isend and MPI_Irecv
d) MPI_Sendrecv

A

c) MPI_Isend and MPI_Irecv
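
A rough sketch of the idea (the helpers update_interior! and update_boundary! are placeholders, not functions from the course): start the ghost exchange with nonblocking calls, update the entries that do not need ghost values while the messages are in flight, then wait and update the remaining entries.

load = length(u)-2
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)
reqs = MPI.Request[]
if rank != 0
    push!(reqs, MPI.Isend(view(u,2:2),comm;dest=rank-1))
    push!(reqs, MPI.Irecv!(view(u,1:1),comm;source=rank-1))
end
if rank != (nranks-1)
    push!(reqs, MPI.Isend(view(u,(load+1):(load+1)),comm;dest=rank+1))
    push!(reqs, MPI.Irecv!(view(u,(load+2):(load+2)),comm;source=rank+1))
end
update_interior!(u)   # placeholder: work that does not need the ghost values
MPI.Waitall(reqs)     # the communication overlaps with the interior update
update_boundary!(u)   # placeholder: the ghost values are now available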

37
Q

Can we really parallelize the loops over i and j? To compute C[i,j] at iteration k, we first need C[i,k] and C[k,j]. In other words, it seems that the order of the loops over i and j cannot be arbitrary. However, this is not really true. Why?

n = size(C,1)
for k in 1:n
    for j in 1:n
        for i in 1:n
            C[i,j] = min(C[i,j],C[i,k]+C[k,j])
        end
    end
end

A

The key observation is that row k and column k of C do not change during iteration k of the outer loop, so we can run the loops over i and j in any order without changing the result. Remember:

C[i,j] = min(C[i,j],C[i,k]+C[k,j])

If we substitute j=k, we get

C[i,k] = min(C[i,k],C[i,k]+C[k,k])

Since C[k,k]=0, this reduces to C[i,k] = min(C[i,k],C[i,k]) = C[i,k].

In other words, the value of C[i,k] is not updated at iteration k. The same holds for i=k: C[k,j] is not updated either. Since the values read from row k and column k stay fixed during iteration k, the loops over i and j can be executed in any order, and in parallel, as illustrated in the sketch below.
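
For example, a minimal shared-memory sketch (an illustration with threads; the course may parallelize this differently): the outer loop over k stays sequential, while the loops over j and i at a fixed k run in parallel.

using Base.Threads

function floyd_warshall!(C)
    n = size(C,1)
    for k in 1:n                # sequential: iteration k needs the result of iteration k-1
        @threads for j in 1:n   # independent for a fixed k: row k and column k do not change
            for i in 1:n
                @inbounds C[i,j] = min(C[i,j],C[i,k]+C[k,j])
            end
        end
    end
    C
end
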
38
Q

How much data is sent from the owner of row k in each iteration in this parallel algorithm?

a) O(N²/P)
b) O(N)
c) O(NP)
d) O(P)

A

c) O(NP). At iteration k, every other worker needs row k (N values), so its owner sends about N(P-1) = O(NP) values in total.

39
Q

Which of the loops can be parallelized?

n,m = size(B)
for k in 1:n
    for t in (k+1):m
        B[k,t] = B[k,t]/B[k,k]
    end
    B[k,k] = 1
    for i in (k+1):n
        for j in (k+1):m
            B[i,j] = B[i,j] - B[i,k]*B[k,j]
        end
        B[i,k] = 0
    end
end

a) the inner loops, but not the outer loop
b) the outer loop, but not the inner loops
c) all loops
d) only the first inner loop

A

a) the inner loops, but not the outer loop

The outer loop of the algorithm is not parallelizable, since each iteration depends on the results of the previous ones. However, we can extract parallelism from the inner loops (see the sketch below).
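
For example, a minimal shared-memory sketch (an illustration with threads, not the distributed implementation discussed in the course): both inner loops are parallelized, while the outer loop over k remains sequential.

using Base.Threads

function gaussian_elimination!(B)
    n,m = size(B)
    for k in 1:n                   # sequential: depends on the previous iterations
        @threads for t in (k+1):m  # independent updates of the entries of row k
            B[k,t] = B[k,t]/B[k,k]
        end
        B[k,k] = 1
        @threads for i in (k+1):n  # independent updates of the remaining rows
            for j in (k+1):m
                B[i,j] = B[i,j] - B[i,k]*B[k,j]
            end
            B[i,k] = 0
        end
    end
    B
end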

40
Q

Is using a cyclic partition for Gaussian elimination a form of static or dynamic load balancing?

A

It is a form of static load balancing.

We know the load distribution in advance, and the partition strategy does not depend on the actual values of the input matrix.

41
Q

How many routes are fully traversed in total when we assign two branches to each worker?

Assume that each worker does pruning locally and independently of the other workers.

A

4

42
Q

For some values of n and max_hops the parallel efficiency can be above 100% (super-linear speedup). For example with n=18 and max_hops=2, I get super-linear speedup on my laptop for some runs.

Explain a possible cause for super-linear speedup in this algorithm.

A

Negative search overhead can explain the superlinear speedup in this algorithm.

The optimal speedup (speedup equal to the number of processors) assumes that the work done in the sequential and parallel algorithms is the same. If the parallel code does less work, it is possible to go beyond the optimal speedup.

Cache effects are not likely to have a positive impact here. Even large search spaces can be represented with rather small distance matrices.

Moreover, we are not partitioning the distance matrix.