Misc Flashcards
What is the diameter of a p-processor hypercube?
log2(p)
What is the bisection width of a 3D mesh with an equal number of processors along all three dimensions? Express the answer as a function of the number of processors, p.
p^(2/3). (The mesh is p^(1/3) processors on a side, so a bisecting plane cuts p^(1/3) * p^(1/3) = p^(2/3) links.)
What two performance bounds are represented on a roofline plot?
Peak computational performance (the compute ceiling, in FLOP/s)
Peak memory bandwidth (the bandwidth ceiling)
According to the standard, under what conditions can the MPI_Send call return to the calling procedure?
MPI_Send may return as soon as the send buffer is safe to reuse. That happens either when the implementation has copied the message into an internal system buffer, or when the send has been matched with a receive and the data transferred to the destination; which of the two occurs depends on the implementation (and typically on the message size).
For Amdahl's Law, describe how to derive the serial fraction
In Amdahl's Law, the serial fraction s is the time spent in the part of the task that cannot be parallelized divided by the total execution time on one processor.
For Amdahl's Law, give the equations for the speedup and efficiency given n processors
speedup: S = 1 / (s + (1 - s)/n)
efficiency: E = S / n
where s is the serial fraction
For Amdahl's Law, give the equation for the upper bound on speedup
S_max = 1 / s (the limit of the speedup as n goes to infinity)
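A worked example with illustrative numbers: for s = 0.1 and n = 16,
S = 1 / (0.1 + 0.9/16) = 1 / 0.15625 = 6.4
E = 6.4 / 16 = 0.4 (40% efficiency)
S_max = 1 / 0.1 = 10, no matter how many processors are added.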
Briefly state Little's Law in the context of memory system performance
Little's Law states that the average number of items in a queuing system equals the average arrival rate multiplied by the average time each item spends in the system (L = lambda * W). Applied to memory systems, it relates the concurrency the system must sustain to its latency and bandwidth: concurrency = latency * bandwidth.
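A worked example with illustrative numbers: with 100 ns memory latency and a target of 10 GB/s using 64-byte cache lines, the request rate is 10e9 / 64 ≈ 1.56e8 requests/s, so the number of requests that must be in flight is 1.56e8 * 100e-9 ≈ 16 outstanding misses.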
What cache type has the lowest latency, direct mapped, k-way, or fully associative?
Direct mapped
Which cache type has the highest hit ratio, direct mapped, k-way, or fully associative?
Fully associative
What type of operation might you use MPI_Scan to perform?
MPI_Scan performs a prefix (scan) operation, such as computing running totals (cumulative sums) across values held by the processes in a communicator: process r receives the reduction of the values from processes 0 through r.
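A minimal sketch of an inclusive prefix sum with MPI_Scan (variable names are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int local = rank + 1;   /* each rank contributes one value */
    int prefix = 0;
    /* inclusive scan: rank r receives 1 + 2 + ... + (r + 1) */
    MPI_Scan(&local, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: prefix sum = %d\n", rank, prefix);
    MPI_Finalize();
    return 0;
}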
In CUDA programs, what is a thread block?
In CUDA, a thread block (also called a block) is a group of concurrent threads that can cooperate among themselves through shared memory and barrier synchronization (__syncthreads()).
In CUDA programs, what does the __shared__ keyword do?
In CUDA, the __shared__ keyword declares variables that live in on-chip shared memory and are visible to all threads within the same block. Shared memory allows for fast inter-thread communication within a block.
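A minimal sketch of a kernel that uses __shared__ memory for a per-block sum (assumes a launch with 256 threads per block and an input of blockDim.x * gridDim.x elements; names are illustrative):

__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];                /* one slot per thread in the block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                           /* make all loads visible */
    /* tree reduction within the block */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];             /* one partial sum per block */
}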
What is the difference between a compare-split and a compare-exchange operation?
A compare-split operation is the block version used in parallel sorting when each process holds many elements. Given two sorted blocks, it produces two new sorted blocks such that every element of the first is less than or equal to every element of the second (typically by merging the two blocks and splitting the result in half).
A compare-exchange operation works on single values: it compares two values and swaps them if they are out of order. It is a basic building block for many sorting algorithms and sorting networks, such as bitonic sort and bubble sort.
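Minimal serial sketches of both operations (names are illustrative):

#include <stdlib.h>

/* compare-exchange: order a single pair */
void compareExchange(int *a, int *b) {
    if (*a > *b) { int t = *a; *a = *b; *b = t; }
}

/* compare-split: merge two sorted blocks of length k, keep the
   smaller half in lo[] and the larger half in hi[] */
void compareSplit(int *lo, int *hi, int k) {
    int *merged = (int *)malloc(2 * k * sizeof(int));
    int i = 0, j = 0;
    for (int m = 0; m < 2 * k; m++)
        merged[m] = (j >= k || (i < k && lo[i] <= hi[j])) ? lo[i++] : hi[j++];
    for (int m = 0; m < k; m++) { lo[m] = merged[m]; hi[m] = merged[k + m]; }
    free(merged);
}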
What is the serial cost of the bitonic sorting algorithm?
The serial complexity (cost) of the bitonic sort algorithm is O(n log^2 n), where n is the number of elements to be sorted.
Parallel quicksort: pivot selection
In parallel quicksort, a common approach to pivot selection is to take a random sample of elements from the array, sort the sample, and use its median as the pivot. This tends to produce a balanced partition and good load balance across processors.
Parallel quicksort: local rearrangement
Each processor rearranges its local data so that all elements less than or equal to the pivot come before all elements greater than the pivot. This local rearrangement takes O(n/p) time, where n is the total number of elements and p is the number of processors.
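A minimal serial sketch of the local rearrangement (names are illustrative):

/* partition local data around the pivot; returns the number of
   elements <= pivot, which end up at the front of a[] */
int localPartition(int *a, int n, int pivot) {
    int lo = 0;
    for (int i = 0; i < n; i++)
        if (a[i] <= pivot) { int t = a[i]; a[i] = a[lo]; a[lo] = t; lo++; }
    return lo;
}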
Parallel quicksort: global rearrangement
After local rearrangement, a global rearrangement is performed: elements less than or equal to the pivot are moved to the lower half of the (distributed) array, while elements greater than the pivot are moved to the upper half. This rearrangement can be implemented with all-to-all communication and takes O(n) time.
Parallel quicksort: how many global rearrangement steps are required?
The number of global rearrangement steps in the parallel algorithm is log(p), where p is the number of processors, because each step doubles the number of partitions.
Parallel quicksort: how much time does local rearrangement take?
The time for local rearrangement is O(n/p) since each processor is operating on its local data independently.
Parallel quicksort: how much time does global rearrangement take?
The time for global rearrangement is O(n), as it involves communication between all processors.
Parallel quicksort: what is an estimate of the running time for parallel quicksort assuming ideal pivots are selected?
Assuming ideal pivots are selected (so that the data remains evenly distributed among the processors after each partitioning), the running time of the parallel quicksort algorithm is O((n log n)/p + log^2 p). The first term accounts for local work and the second term accounts for the synchronization and communication between processors.
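A quick plug-in with illustrative numbers: for n = 2^20 and p = 64, the local-work term is (2^20 * 20)/64 = 327,680 element operations, while log^2 p = 6^2 = 36, so the local work dominates at this scale.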
Parallel sample sort: what is the first step of the algorithm?
The first step in the parallel sample sort algorithm is data distribution. Each processor is given a roughly equal portion of the data to be sorted.
Parallel sample sort: how are samples selected?
Samples are selected by having each processor pick a subset of values from its local data, either at random or at regular intervals (e.g., from its locally sorted block).
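A minimal sketch of regular sampling (assumes local[] is already sorted and n >= p; names are illustrative):

/* pick p - 1 evenly spaced samples from a sorted local block */
void pickSamples(const int *local, int n, int p, int *samples) {
    for (int s = 1; s < p; s++)
        samples[s - 1] = local[(s * n) / p];
}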