Notebook Questions Flashcards

from here: https://www.francescverdugo.com/XM_40017/dev/

1
Q

What will be the value of x in the last line? (Think about your answer before executing the next cell to find out the result.)

x = 1
y = x
y = 2
x

A

In the first line, we assign the value 1 to the variable x. In the second line, we assign the same value to another variable, y. Thus, we have two variables associated with the same value. In line 3, we re-assign y to a new value, so we have two variables associated with two different values. Variable x is still associated with its original value. Thus, in the final line x = 1.

2
Q

What will be the value of x in the last line?

function q(x)
    x = 2
    x
end

x = 1
y = q(x)
x

A

It will be 1, for reasons very similar to the previous question: inside the function we re-assign the local variable x (the function argument), not the global variable x defined outside the function.

3
Q

What will be the value of x below?

function hofun(x)
    y -> x*y
end

f2 = hofun(2)

x = f2(3)
x

A

It will be 6. The returned function f2 has captured x = 2, so calling f2(3) computes 2*3.

4
Q

How long will the compute time of the next cell be?

n = 140_000_000
@time for i in 1:10
    @show compute_π(n)
end

a) 10t
b) t
c) 0.1t
d) O(1), i.e. time independent of n

A

a) 10t. Evaluating compute_π(100_000_000) takes about 0.25 seconds on the teacher's laptop. Since the loop calls the function 10 times, one call after another, the loop takes about 10t, i.e. roughly 2.5 seconds.

5
Q

How long will the compute time of the next cell be?

n = 140_000_000
@time for i in 1:10
    @async @show compute_π(n)
end

a) 10t
b) t
c) 0.1t
d) O(1)

A

d) O(1). The loop just schedules 10 tasks with @async and does not wait for them, which takes a (small) constant time independent of n.

6
Q

How long will the compute time of the next cell be?

n = 140_000_000
@time @sync for i in 1:10
    @async @show compute_π(n)
end

a) 10t
b) t
c) 0.1t
d) O(1)

A

a) 10t, i.e. about 2.5 seconds, as in question 4. The @sync macro forces waiting for all the tasks generated with @async inside the block. Since the 10 tasks run on a single thread and each takes about 0.25 seconds, the total time is about 2.5 seconds.

7
Q

How long will the compute time of the 2nd cell be?

buffer_size = 4
chnl = Channel{Int}(buffer_size)

@time begin
    put!(chnl,3)
    i = take!(chnl)
    sleep(i)
end

a) infinity
b) 1 second
c) less than 1 second
d) 3 seconds

A

d) 3 seconds. The channel has buffer size 4, so the call to put! will not block. The call to take! will not block either, since there is a value stored in the channel. The taken value is 3, so the cell sleeps for about 3 seconds.

8
Q

How long will the compute time of the 2nd cell be?

chnl = Channel{Int}()

@time begin
    put!(chnl,3)
    i = take!(chnl)
    sleep(i)
end

a) infinity
b) 1 second
c) less than 1 second
d) 3 seconds

A

a) infinity. The channel is unbuffered, so the call to put! blocks. The cell will run forever, since there is no other task that calls take! on this channel (see the sketch below for a variant that works).
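
A minimal sketch (an illustration, not from the original notebook) of how the unbuffered version can be made to work: put! must run in a separate task, so that the main task can reach the matching take!.

chnl = Channel{Int}()
@async put!(chnl,3)   # the producer runs in its own task and blocks until the value is taken
i = take!(chnl)       # receives 3 and unblocks the producer task
sleep(i)              # sleeps for about 3 seconds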

9
Q

How many integers are transferred between the master and the worker (counting both directions)?

a = rand(Int,4,4)
proc = 4
@fetchfrom proc sum(a^2)

a) 17
b) 32
c) 16^2
d) 65

A

We send the matrix (16 entries) and then we receive back the result (1 extra integer). Thus, the total number of transferred integers is 17 (answer a).

10
Q

How many integers are transferred between the master and the worker (counting both directions)?

a = rand(Int,4,4)
proc = 4
@fetchfrom proc sum(a[2,2]^2)

a) 2
b) 17
c) 5
d) 32

A

Even though we only use a single entry of the matrix on the remote worker, the entire matrix is captured and sent to the worker. Thus, we transfer 17 integers, just like in the previous question (answer b).

11
Q

What will be the value of x?

a = zeros(Int,3)
proc = 3
@sync @spawnat proc a[2] = 2
x = a[2]
x

A

The value of x will still be 0, since the worker receives a copy of the array and modifies that copy, not the original one.

12
Q

What will be the value of x?

a = zeros(Int,3)
proc = myid()
@sync @spawnat proc a[2] = 2
x = a[2]
x

A

In this case, the code a[2] = 2 is executed in the main process. Since the array already lives in the main process, there is no need to create and send a copy of it. Thus, the code modifies the original array and the value of x will be 2.

13
Q

What is the complexity (number of operations) of the serial algorithm? Assume that all matrices are N-by-N.

for j in 1:N
    for i in 1:N
        Cij = z
        for k in 1:N
            @inbounds Cij += A[i,k]*B[k,j]
        end
        C[i,j] = Cij
    end
end

a) O(1)
b) O(N)
c) O(N²)
d) O(N³)

A

d) O(N³). There are three nested loops of N iterations each, with O(1) work in the innermost body.

14
Q

Which are the data dependencies of the computations done by the worker in charge of computing entry C[i,j]?
a) column A[:,i] and row B[j,:]
b) row A[i,:] and column B[:,j]
c) the whole matrices A and B
d) row A[i,:] and the whole matrix B

A

b) row A[i,:] and column B[:,j]. The entry C[i,j] is the dot product of row i of A and column j of B.

15
Q

How many scalars are communicated from and to a worker? Assume that A, B, and C are N-by-N matrices.

a) O(1)
b) O(N)
c) O(N²)
d) O(N³)

A

b) O(N). The worker receives row A[i,:] and column B[:,j] (2N scalars) and sends back the single entry C[i,j].

16
Q

How many operations are done in a worker?

a) O(1)
b) O(N)
c) O(N²)
d) O(N³)

A

b) O(N). Computing one entry of C is a dot product of length N, i.e. about 2N operations.

17
Q

Which are the data dependencies of the computations done by the worker in charge of computing row C[i,:]?

a) column A[:,i] and row B[j,:]
b) row A[i,:] and column B[:,j]
c) the whole matrices A and B
d) row A[i,:] and the whole matrix B

A

d) row A[i,:] and the whole matrix B. To compute all entries C[i,j] of the row, we need row i of A and every column of B (compare with the analogous question for a range of rows below).

18
Q

What is the complexity of the communication and computations done by a worker in algorithm 2?

a) O(N) communication and O(N^2) computation
b) O(N^2) communication and O(N^2) computation
c) O(N^3) communication and O(N^3) computation
d) O(N) communication and O(N) computation

A

b) O(N²) communication and O(N²) computation. The worker needs row A[i,:] and the whole matrix B, which is O(N²) data, and computing the row requires N dot products of length N, i.e. O(N²) operations.

19
Q

Which are the data dependencies of the computations done by the worker in charge of computing the range of rows C[rows,:]?
a) A[rows,:] and B[:,rows]
b) the whole matrix A and B[:,rows]
c) A[rows,:] and the whole matrix B
d) the whole matrices A and B

A

c) A[rows,:] and the whole matrix B

20
Q

What is the complexity of the communication and computations done by a worker in algorithm 3?

a) O(N²) communication and O(N³) computation
b) O(N²) communication and O(N³/P) computation
c) O(N²) communication and O(N²) computation
d) O(N²/P) communication and O(N³/P) computation

A

b) O(N²) communication and O(N³/P) computation

21
Q

What does MPI_Barrier do?

A

It synchronizes all processes: no process continues past the barrier until every process in the communicator has reached it.

22
Q

What does MPI_Bcast do?

A

It sends the same data from one process (the root) to all processes.
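
A minimal MPI.jl sketch (not from the course notebooks; run with, e.g., mpiexec -n 4 julia script.jl): the buffer of the root rank is copied to every other rank.

using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
buf = rank == 0 ? [1,2,3] : zeros(Int,3)
MPI.Bcast!(buf,comm;root=0)   # afterwards buf == [1,2,3] on all ranks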

23
Q

What does MPI_Gather do?

A

It gathers data from all processes onto one process (the root).

24
Q

What does MPI_Scatter do?

A

It scatters different pieces of data from one process (the root) to all processes.

25
Q

What does MPI_Reduce do?

A

It combines (reduces) data from all processes, e.g. with a sum, and delivers the result to a single process (the root).

26
Q

What does MPI_Scan do?

A

It performs a scan (prefix) reduction: process i receives the reduction of the values contributed by processes 0 through i.

27
Q

What does MPI_Allgather do?

A

Like MPI_Gather, but all processes receive the gathered result.

28
Q

What does MPI_Allreduce do?

A

Like MPI_Reduce, but all processes receive the reduced result.
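
A minimal MPI.jl sketch (not from the course notebooks) contrasting the two: with MPI_Reduce only the root obtains the sum, with MPI_Allreduce every rank obtains it.

using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
total_root = MPI.Reduce(rank,+,comm;root=0)   # sum of all ranks, available only on rank 0
total_all  = MPI.Allreduce(rank,+,comm)       # the same sum, available on every rank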

29
Q

What does MPI_Alltoall do?

A

It exchanges data between all pairs of processes: each process sends a distinct piece of data to every other process.

30
Q

What two key features do MPI communicators provide? How are they useful?

A

1) isolated communication context
2) creating groups of processes

Useful to:
1) combine different libraries that use MPI within the same application
2) use collective operations on a subset of the processes (see the sketch below)
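
A minimal MPI.jl sketch of point 2 (not from the course notebooks): MPI.COMM_WORLD is split into two new communicators, one with the even ranks and one with the odd ranks, and a collective is then used only within each group.

using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
color = mod(rank,2)                         # 0 for even ranks, 1 for odd ranks
newcomm = MPI.Comm_split(comm,color,rank)   # one new communicator per color
group_sum = MPI.Allreduce(rank,+,newcomm)   # reduction restricted to the group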

31
Q

Compute the complexity of the communication over computation ratio for this data partition. (1D block partition)

A
  • We update N^2/P items per iteration
  • We need data from 2 neighbors (2 messages per iteration)
  • We communicate N items per message
  • Communication/computation ratio: 2N/(N^2/P) = 2P/N = O(P/N)
31
Q

Compute the complexity of the communication over computation ratio for this data partition. (2D Block partition)

A
  • We update N^2/P items per iteration
  • We need data from 4 neighbors (4 messages per iteration)
  • We communicate N/sqrt(P) items per message
  • Communication/computation ratio: (4N/sqrt(P))/(N^2/P) = 4sqrt(P)/N = O(sqrt(P)/N)
31
Q

Compute the complexity of the communication over computation ratio for this data partition. (2D cyclic partition)

A
  • We update N^2/P items
  • We need data from 4 neighbors (4 messages per iteration)
  • We communicate N^2/P items per message (the full data owned by the neighbor)
  • Communication/computation ratio: (4N^2/P)/(N^2/P) = 4 = O(1)
31
Q

Which of the two loops in the Gauss-Seidel method are trivially parallelizable?

for t in 1:niters
    for i in 2:(n+1)
        u[i] = 0.5*(u[i-1]+u[i+1])
    end
end

a) Both of them
b) The outer, but not the inner
c) None of them
d) The inner, but not the outer

A

c) None of them. The inner loop has a loop-carried dependence: u[i] uses u[i-1], which has just been updated in the same sweep. The outer loop is sequential as well, since each iteration over t starts from the result of the previous one.

32
Q

Which of the loops in the red-black Gauss-Seidel method are trivially parallelizable?

for t in 1:niters
    for color in (0,1)
        for i in (n+1):-1:2
            if color == mod(i,2)
                u[i] = 0.5*(u[i-1]+u[i+1])
            end
        end
    end
end

a) All loops
b) Loop over t only
c) Loop over color only
d) Loop over i only

A

d) Loop over i only

All “red” cells can be updated in parallel as they only depend on the values of “black” cells.

In other words, we can update the “red” cells in any order without changing the result. They only depend on values in the “black” cells, which will not change during the loop over “red” cells.

Similarly, all “black” cells can be updated in parallel as they only depend on “red” cells.

33
Q

Is this other implementation based on MPI.Send and MPI.Recv! correct?

function ghost_exchange!(u,comm)
    load = length(u)-2
    rank = MPI.Comm_rank(comm)
    nranks = MPI.Comm_size(comm)
    if rank != 0
        neig_rank = rank-1
        u_snd = view(u,2:2)
        u_rcv = view(u,1:1)
        dest = neig_rank
        source = neig_rank
        MPI.Send(u_snd,comm;dest)
        MPI.Recv!(u_rcv,comm;source)
    end
    if rank != (nranks-1)
        neig_rank = rank+1
        u_snd = view(u,(load+1):(load+1))
        u_rcv = view(u,(load+2):(load+2))
        dest = neig_rank
        source = neig_rank
        MPI.Send(u_snd,comm;dest)
        MPI.Recv!(u_rcv,comm;source)
    end
end

a) It is correct.
b) It is incorrect, but it might provide the right result depending on the MPI implementation.
c) It is incorrect, and it is guaranteed that it will result in a dead lock.
d) This implementation does not work when distributing over just a single MPI rank.

A

b) It is incorrect, but it might provide the right result depending on the MPI implementation. Each rank calls the blocking MPI.Send before posting any receive, so there is a cyclic dependency; whether this actually deadlocks depends on whether the MPI implementation buffers these small messages internally.

35
Q

How would you fix the implementation while still using MPI.Send and MPI.Recv! instead of MPI.Sendrecv!?

function ghost_exchange!(u,comm)
    load = length(u)-2
    rank = MPI.Comm_rank(comm)
    nranks = MPI.Comm_size(comm)
    if rank != 0
        neig_rank = rank-1
        u_snd = view(u,2:2)
        u_rcv = view(u,1:1)
        dest = neig_rank
        source = neig_rank
        MPI.Send(u_snd,comm;dest)
        MPI.Recv!(u_rcv,comm;source)
    end
    if rank != (nranks-1)
        neig_rank = rank+1
        u_snd = view(u,(load+1):(load+1))
        u_rcv = view(u,(load+2):(load+2))
        dest = neig_rank
        source = neig_rank
        MPI.Send(u_snd,comm;dest)
        MPI.Recv!(u_rcv,comm;source)
    end
end

A

One needs to carefully order the sends and the receives to avoid cyclic dependencies that might result in deadlocks. The actual implementation is left as an exercise; one possible ordering is sketched below.
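
One possible ordering (a sketch, not the course's reference solution): ranks with an even id send first and then receive, ranks with an odd id receive first and then send, so every blocking call finds a matching partner and no cycle of waits can form.

function ghost_exchange_fixed!(u,comm)
    load = length(u)-2
    rank = MPI.Comm_rank(comm)
    nranks = MPI.Comm_size(comm)
    if iseven(rank)
        if rank != 0                  # exchange with the left neighbor
            MPI.Send(view(u,2:2),comm;dest=rank-1)
            MPI.Recv!(view(u,1:1),comm;source=rank-1)
        end
        if rank != (nranks-1)         # exchange with the right neighbor
            MPI.Send(view(u,(load+1):(load+1)),comm;dest=rank+1)
            MPI.Recv!(view(u,(load+2):(load+2)),comm;source=rank+1)
        end
    else
        if rank != (nranks-1)         # mirror order: receive first, then send
            MPI.Recv!(view(u,(load+2):(load+2)),comm;source=rank+1)
            MPI.Send(view(u,(load+1):(load+1)),comm;dest=rank+1)
        end
        if rank != 0
            MPI.Recv!(view(u,1:1),comm;source=rank-1)
            MPI.Send(view(u,2:2),comm;dest=rank-1)
        end
    end
end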

36
Q

Which MPI directives would you use to implement latency hiding in the communication of the ghost values?

a) MPI_Send and MPI_Recv
b) MPI_Bsend and MPI_Recv
c) MPI_Isend and MPI_Irecv
d) MPI_Sendrecv

A

c) MPI_Isend and MPI_Irecv
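
A rough sketch of the idea (the helpers update_interior! and update_boundary! are placeholders, not functions from the course): start the ghost exchange with nonblocking calls, update the entries that do not need ghost values while the messages are in flight, then wait and update the remaining entries.

load = length(u)-2
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)
reqs = MPI.Request[]
if rank != 0
    push!(reqs, MPI.Isend(view(u,2:2),comm;dest=rank-1))
    push!(reqs, MPI.Irecv!(view(u,1:1),comm;source=rank-1))
end
if rank != (nranks-1)
    push!(reqs, MPI.Isend(view(u,(load+1):(load+1)),comm;dest=rank+1))
    push!(reqs, MPI.Irecv!(view(u,(load+2):(load+2)),comm;source=rank+1))
end
update_interior!(u)   # placeholder: work that does not need the ghost values
MPI.Waitall(reqs)     # the communication overlaps with the interior update
update_boundary!(u)   # placeholder: the ghost values are now available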

37
Q

Can we really parallelize the loops over i and j? To compute C[i,j] at iteration k, we first need C[i,k] and C[k,j]. In other words, it seems that the order of the loops over i and j cannot be arbitrary. However, this is not really true. Why?

n = size(C,1)
for k in 1:n
    for j in 1:n
        for i in 1:n
            C[i,j] = min(C[i,j],C[i,k]+C[k,j])
        end
    end
end

A

The key observation is that row k and column k of C do not change during iteration k of the outer loop, so we can run the loops over i and j in any order without changing the result. Remember:

C[i,j] = min(C[i,j],C[i,k]+C[k,j])

If we substitute j=k, we get

C[i,k] = min(C[i,k],C[i,k]+C[k,k])

Since C[k,k]=0, this reduces to C[i,k] = min(C[i,k],C[i,k]) = C[i,k].

In other words, the value of C[i,k] is not updated at iteration k. The same holds for i=k: C[k,j] is not updated either. Since the values read from row k and column k stay fixed during iteration k, the loops over i and j can be executed in any order, and in parallel, as illustrated in the sketch below.
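
For example, a minimal shared-memory sketch (an illustration with threads; the course may parallelize this differently): the outer loop over k stays sequential, while the loops over j and i at a fixed k run in parallel.

using Base.Threads

function floyd_warshall!(C)
    n = size(C,1)
    for k in 1:n                # sequential: iteration k needs the result of iteration k-1
        @threads for j in 1:n   # independent for a fixed k: row k and column k do not change
            for i in 1:n
                @inbounds C[i,j] = min(C[i,j],C[i,k]+C[k,j])
            end
        end
    end
    C
end
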
38
Q

How much data is sent from the owner of row k in each iteration in this parallel algorithm?

a) O(N²/P)
b) O(N)
c) O(NP)
d) O(P)

A

c) O(NP). At iteration k, every other worker needs row k (N values), so its owner sends about N(P-1) = O(NP) values in total.

39
Q

Which of the loops can be parallelized?

n,m = size(B)
for k in 1:n
    for t in (k+1):m
        B[k,t] = B[k,t]/B[k,k]
    end
    B[k,k] = 1
    for i in (k+1):n
        for j in (k+1):m
            B[i,j] = B[i,j] - B[i,k]*B[k,j]
        end
        B[i,k] = 0
    end
end

a) the inner loops, but not the outer loop
b) the outer loop, but not the inner loops
c) all loops
d) only the first inner loop

A

a) the inner loops, but not the outer loop

The outer loop of the algorithm is not parallelizable, since each iteration depends on the results of the previous ones. However, we can extract parallelism from the inner loops (see the sketch below).
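
For example, a minimal shared-memory sketch (an illustration with threads, not the distributed implementation discussed in the course): both inner loops are parallelized, while the outer loop over k remains sequential.

using Base.Threads

function gaussian_elimination!(B)
    n,m = size(B)
    for k in 1:n                   # sequential: depends on the previous iterations
        @threads for t in (k+1):m  # independent updates of the entries of row k
            B[k,t] = B[k,t]/B[k,k]
        end
        B[k,k] = 1
        @threads for i in (k+1):n  # independent updates of the remaining rows
            for j in (k+1):m
                B[i,j] = B[i,j] - B[i,k]*B[k,j]
            end
            B[i,k] = 0
        end
    end
    B
end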

40
Q

Is using a cyclic partition for Gaussian elimination a form of static or dynamic load balancing?

A

It is a form of static load balancing.

We know the load distribution in advance, and the partition strategy does not depend on the actual values of the input matrix.

41
Q

How many routes are fully traversed in total when we assign two branches to each worker?

Assume that each worker does pruning locally and independently of the other workers.

A

4

42
Q

For some values of n and max_hops the parallel efficiency can be above 100% (super-linear speedup). For example with n=18 and max_hops=2, I get super-linear speedup on my laptop for some runs.

Explain a possible cause for super-linear speedup in this algorithm.

A

Negative search overhead can explain the superlinear speedup in this algorithm.

The optimal speedup (speedup equal to the number of processors) assumes that the work done in the sequential and parallel algorithms is the same. If the parallel code does less work, it is possible to go beyond the optimal speedup.

Cache effects are not likely to have a positive impact here. Even large search spaces can be represented with rather small distance matrices.

Moreover, we are not partitioning the distance matrix.