Misc Flashcards
what is the diameter of a p processor hypercube
log2p
what is the bisection width of a 3d mesh with equal number of processors along all three dimensions. express the answer as a function of the number of processors given by p
p^(2/3)
What two performance bounds are represented on a roofline plot?
Computational Peak Performance
Memory Bandwidth Peak (or Peak Bandwidth)
According to the standard, under what conditions can the MPI_send call return to the calling procedure?
The MPI_Send call can return after the send operation is complete, which means either the message has been sent to the system buffer and is ready for transmission, or the send has been matched with a corresponding receive operation, depending on the implementation.
For amdahl’s law, describe the function to derive the serial fraction
In Amdahl’s Law, the serial fraction (s) is derived from the part of the task that cannot be parallelized, divided by the total execution time.
For ahmdahl’s law, give the equation to derive the speedup and efficiency given n processors
speedup (S) = 1 / (s + (1-s) / n)
efficiency (E) = S / n
the lowercase s represents the serial fraction
For ahmdal’s law, give the equation to find the upper bound on performance
S_max = 1 / s
Briefly state Little’s law in the context of memory system performance
Little’s law states that the number of elements in a queuing system is equal to the average wait time multiplied by the arrival wait. It can be used to describe bandwidth.
What cache type has the lowest latency, direct mapped, k-way, or fully associative?
Direct mapped
Which cache type has the highest hit ratio, direct mapped, k-way, or fully associative?
Fully associative
What type of operation might you use MPI_Scan to perform?
MPI_Scan is often used for performing a prefix operation, such as calculating running totals or cumulative sums across an array distributed among multiple processors.
In CUDA programs, what is a thread block?
In CUDA, a thread block (also called a block) is a group of concurrent threads that can cooperate among themselves through shared memory and synchronization.
In CUDA programs what does the shared keyword do?
In CUDA, the __shared__ keyword is used to declare variables that are shared among all threads within the same block. This shared memory allows for faster inter-thread communication within a block.
What is the difference between a compare split and a compare
exchange operation?
A compare-split operation is used in distributed computing, specifically in sorting algorithms. Given two sorted lists, it creates two new sorted lists such that all elements in the first list are less than or equal to all elements in the second list.
A compare-exchange operation, on the other hand, is a fundamental operation in parallel and distributed computing used in various algorithms. Given two values, it compares them and swaps them if they are out of order. It’s a basic building block for many sorting algorithms such as bitonic sort and bubble sort.
What is the serial cost of the bitonic sorting algorithm?
The serial complexity (cost) of the bitonic sort algorithm is O(n log^2 n), where n is the number of elements to be sorted.