Misc Flashcards
What is the diameter of a p-processor hypercube?
log2(p)
What is the bisection width of a 3D mesh with an equal number of processors along all three dimensions? Express the answer as a function of the number of processors, p.
p^(2/3). (The mesh is p^(1/3) processors on a side, so a bisecting plane cuts p^(1/3) * p^(1/3) = p^(2/3) links.)
What two performance bounds are represented on a roofline plot?
Peak computational performance (the compute ceiling, in FLOP/s)
Peak memory bandwidth (the bandwidth ceiling)
According to the standard, under what conditions can the MPI_Send call return to the calling procedure?
MPI_Send may return as soon as the send buffer is safe to reuse. That happens either when the implementation has copied the message into an internal system buffer, or when the send has been matched with a receive and the data transferred to the destination; which of the two occurs depends on the implementation (and typically on the message size).
For Amdahl's Law, describe how to derive the serial fraction
In Amdahl's Law, the serial fraction s is the time spent in the part of the task that cannot be parallelized divided by the total execution time on one processor.
For Amdahl's Law, give the equations for the speedup and efficiency given n processors
speedup: S = 1 / (s + (1 - s)/n)
efficiency: E = S / n
where s is the serial fraction
For Amdahl's Law, give the equation for the upper bound on speedup
S_max = 1 / s (the limit of the speedup as n goes to infinity)
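A worked example with illustrative numbers: for s = 0.1 and n = 16,
S = 1 / (0.1 + 0.9/16) = 1 / 0.15625 = 6.4
E = 6.4 / 16 = 0.4 (40% efficiency)
S_max = 1 / 0.1 = 10, no matter how many processors are added.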
Briefly state Little's Law in the context of memory system performance
Little's Law states that the average number of items in a queuing system equals the average arrival rate multiplied by the average time each item spends in the system (L = lambda * W). Applied to memory systems, it relates the concurrency the system must sustain to its latency and bandwidth: concurrency = latency * bandwidth.
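A worked example with illustrative numbers: with 100 ns memory latency and a target of 10 GB/s using 64-byte cache lines, the request rate is 10e9 / 64 ≈ 1.56e8 requests/s, so the number of requests that must be in flight is 1.56e8 * 100e-9 ≈ 16 outstanding misses.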
What cache type has the lowest latency, direct mapped, k-way, or fully associative?
Direct mapped
Which cache type has the highest hit ratio, direct mapped, k-way, or fully associative?
Fully associative
What type of operation might you use MPI_Scan to perform?
MPI_Scan performs a prefix (scan) operation, such as computing running totals (cumulative sums) across values held by the processes in a communicator: process r receives the reduction of the values from processes 0 through r.
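A minimal sketch of an inclusive prefix sum with MPI_Scan (variable names are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int local = rank + 1;   /* each rank contributes one value */
    int prefix = 0;
    /* inclusive scan: rank r receives 1 + 2 + ... + (r + 1) */
    MPI_Scan(&local, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: prefix sum = %d\n", rank, prefix);
    MPI_Finalize();
    return 0;
}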
In CUDA programs, what is a thread block?
In CUDA, a thread block (also called a block) is a group of concurrent threads that can cooperate among themselves through shared memory and barrier synchronization (__syncthreads()).
In CUDA programs, what does the __shared__ keyword do?
In CUDA, the __shared__ keyword declares variables that live in on-chip shared memory and are visible to all threads within the same block. Shared memory allows for fast inter-thread communication within a block.
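A minimal sketch of a kernel that uses __shared__ memory for a per-block sum (assumes a launch with 256 threads per block and an input of blockDim.x * gridDim.x elements; names are illustrative):

__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];                /* one slot per thread in the block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                           /* make all loads visible */
    /* tree reduction within the block */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];             /* one partial sum per block */
}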
What is the difference between a compare-split and a compare-exchange operation?
A compare-split operation is the block version used in parallel sorting when each process holds many elements. Given two sorted blocks, it produces two new sorted blocks such that every element of the first is less than or equal to every element of the second (typically by merging the two blocks and splitting the result in half).
A compare-exchange operation works on single values: it compares two values and swaps them if they are out of order. It is a basic building block for many sorting algorithms and sorting networks, such as bitonic sort and bubble sort.
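Minimal serial sketches of both operations (names are illustrative):

#include <stdlib.h>

/* compare-exchange: order a single pair */
void compareExchange(int *a, int *b) {
    if (*a > *b) { int t = *a; *a = *b; *b = t; }
}

/* compare-split: merge two sorted blocks of length k, keep the
   smaller half in lo[] and the larger half in hi[] */
void compareSplit(int *lo, int *hi, int k) {
    int *merged = (int *)malloc(2 * k * sizeof(int));
    int i = 0, j = 0;
    for (int m = 0; m < 2 * k; m++)
        merged[m] = (j >= k || (i < k && lo[i] <= hi[j])) ? lo[i++] : hi[j++];
    for (int m = 0; m < k; m++) { lo[m] = merged[m]; hi[m] = merged[k + m]; }
    free(merged);
}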
What is the serial cost of the bitonic sorting algorithm?
The serial complexity (cost) of the bitonic sort algorithm is O(n log^2 n), where n is the number of elements to be sorted.
Parallel quicksort: pivot selection
In parallel quicksort, a common approach to pivot selection is to take a random sample of elements from the array, sort the sample, and use its median as the pivot. This tends to produce a balanced partition and good load balance across processors.
Parallel quicksort: local rearrangement
Each processor rearranges its local data so that all elements less than or equal to the pivot come before all elements greater than the pivot. This local rearrangement takes O(n/p) time, where n is the total number of elements and p is the number of processors.
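A minimal serial sketch of the local rearrangement (names are illustrative):

/* partition local data around the pivot; returns the number of
   elements <= pivot, which end up at the front of a[] */
int localPartition(int *a, int n, int pivot) {
    int lo = 0;
    for (int i = 0; i < n; i++)
        if (a[i] <= pivot) { int t = a[i]; a[i] = a[lo]; a[lo] = t; lo++; }
    return lo;
}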
Parallel quicksort: global rearrangement
After local rearrangement, a global rearrangement is performed: elements less than or equal to the pivot are moved to the lower half of the (distributed) array, while elements greater than the pivot are moved to the upper half. This rearrangement can be implemented with all-to-all communication and takes O(n) time.
Parallel quicksort: how many global rearrangement steps are required?
The number of global rearrangement steps in the parallel algorithm is log(p), where p is the number of processors, because each step doubles the number of partitions.
Parallel quicksort: how much time does local rearrangement take?
The time for local rearrangement is O(n/p) since each processor is operating on its local data independently.
Parallel quicksort: how much time does global rearrangement take?
The time for global rearrangement is O(n), as it involves communication between all processors.
Parallel quicksort: what is an estimate of the running time for parallel quicksort assuming ideal pivots are selected?
Assuming ideal pivots are selected (so that the data remains evenly distributed among the processors after each partitioning), the running time of the parallel quicksort algorithm is O((n log n)/p + log^2 p). The first term accounts for local work and the second term accounts for the synchronization and communication between processors.
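A quick plug-in with illustrative numbers: for n = 2^20 and p = 64, the local-work term is (2^20 * 20)/64 = 327,680 element operations, while log^2 p = 6^2 = 36, so the local work dominates at this scale.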
Parallel sample sort: what is the first step of the algorithm?
The first step in the parallel sample sort algorithm is data distribution. Each processor is given a roughly equal portion of the data to be sorted.
Parallel sample sort: how are samples selected?
Samples are selected by having each processor pick a subset of values from its local data, either at random or at regular intervals (e.g., from its locally sorted block).
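A minimal sketch of regular sampling (assumes local[] is already sorted and n >= p; names are illustrative):

/* pick p - 1 evenly spaced samples from a sorted local block */
void pickSamples(const int *local, int n, int p, int *samples) {
    for (int s = 1; s < p; s++)
        samples[s - 1] = local[(s * n) / p];
}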