Misc Flashcards

1
Q

What is the diameter of a p-processor hypercube?

A

log2(p)
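As a quick sanity check (not part of the card), the diameter of a hypercube equals the maximum Hamming distance between node labels, which can be brute-forced for small p:

```python
def hypercube_diameter(dim):
    """Brute-force the diameter of a 2^dim-node hypercube.
    Nodes are bit strings; neighbors differ in exactly one bit,
    so the shortest-path length equals the Hamming distance."""
    nodes = range(2 ** dim)
    return max(bin(a ^ b).count("1") for a in nodes for b in nodes)

# For p = 2^dim processors, the diameter is log2(p) = dim.
print(hypercube_diameter(3))  # 8-node hypercube -> 3
```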

2
Q

What is the bisection width of a 3D mesh with an equal number of processors along all three dimensions? Express the answer as a function of the number of processors, p.

A

p^(2/3)

(A cut through the middle of the mesh severs a p^(1/3) x p^(1/3) face of links.)

3
Q

What two performance bounds are represented on a roofline plot?

A

Peak computational performance (the horizontal compute roof)
Peak memory bandwidth (the sloped bandwidth roof)

4
Q

According to the standard, under what conditions can the MPI_Send call return to the calling procedure?

A

MPI_Send may return as soon as the send buffer can be safely reused by the caller. Depending on the implementation, this means either the message has been copied into a system buffer (to be transmitted later), or the send has been matched with a corresponding receive and the data delivered. The standard does not require a matching receive to have been posted before a buffered send returns.

5
Q

For Amdahl’s Law, describe how to derive the serial fraction.

A

In Amdahl’s Law, the serial fraction s is the execution time of the part of the task that cannot be parallelized, divided by the total serial execution time.

6
Q

For Amdahl’s Law, give the equations for the speedup and efficiency given n processors.

A

speedup (S) = 1 / (s + (1-s) / n)
efficiency (E) = S / n

the lowercase s represents the serial fraction
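The two formulas can be checked numerically; the serial fraction and processor count below are made-up example values:

```python
def amdahl(s, n):
    """Speedup and efficiency under Amdahl's Law.
    s: serial fraction, n: number of processors."""
    speedup = 1.0 / (s + (1.0 - s) / n)
    efficiency = speedup / n
    return speedup, efficiency

# Assumed example: 10% serial code on 10 processors.
S, E = amdahl(s=0.1, n=10)
print(round(S, 3), round(E, 3))  # -> 5.263 0.526
```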

7
Q

For Amdahl’s Law, give the equation for the upper bound on speedup.

A

S_max = 1 / s

(the limit of the speedup as n approaches infinity)

8
Q

Briefly state Little’s law in the context of memory system performance

A

Little’s law states that the average number of items in a queuing system equals the average arrival rate multiplied by the average time each item spends in the system. Applied to a memory system: outstanding requests = bandwidth (requests per unit time) x latency, so sustaining peak bandwidth requires keeping enough requests in flight to cover the latency.
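The memory-system form of Little's law (concurrency = bandwidth x latency) can be illustrated with a small calculation; the latency and bandwidth figures below are assumed for the example:

```python
def required_concurrency(latency_s, bandwidth_bytes_per_s, line_bytes=64):
    """Little's law: in-flight items = arrival rate * time in system.
    Returns how many cache-line requests must be outstanding at once
    to sustain the given bandwidth at the given latency."""
    requests_per_s = bandwidth_bytes_per_s / line_bytes
    return requests_per_s * latency_s

# Assumed example: 100 ns memory latency, 200 GB/s target bandwidth.
print(required_concurrency(100e-9, 200e9))  # -> 312.5 outstanding lines
```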

9
Q

What cache type has the lowest latency, direct mapped, k-way, or fully associative?

A

Direct mapped

10
Q

Which cache type has the highest hit ratio, direct mapped, k-way, or fully associative?

A

Fully associative

11
Q

What type of operation might you use MPI_Scan to perform?

A

MPI_Scan is often used for performing a prefix operation, such as calculating running totals or cumulative sums across an array distributed among multiple processors.
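The inclusive prefix operation that MPI_Scan performs can be mimicked serially with itertools.accumulate; each list element below stands in for one rank's contribution (a sketch, not actual MPI code):

```python
from itertools import accumulate

# One value per MPI rank; MPI_Scan with MPI_SUM would leave
# rank i holding the sum of the values from ranks 0..i.
values = [3, 1, 4, 1, 5]
print(list(accumulate(values)))  # -> [3, 4, 8, 9, 14]
```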

12
Q

In CUDA programs, what is a thread block?

A

In CUDA, a thread block (also called a block) is a group of concurrent threads that can cooperate among themselves through shared memory and synchronization.

13
Q

In CUDA programs, what does the __shared__ keyword do?

A

In CUDA, the __shared__ keyword is used to declare variables that are shared among all threads within the same block. This shared memory allows for faster inter-thread communication within a block.

14
Q

What is the difference between a compare split and a compare
exchange operation?

A

A compare-split operation is used in distributed sorting, where each process holds a block of elements rather than a single value. Given two sorted blocks, it produces two new sorted blocks of the same sizes such that all elements in the first block are less than or equal to all elements in the second block.

A compare-exchange operation, on the other hand, is a fundamental operation in parallel and distributed computing used in various algorithms. Given two values, it compares them and swaps them if they are out of order. It’s a basic building block for many sorting algorithms such as bitonic sort and bubble sort.
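The two operations can be contrasted in a few lines of Python (a serial sketch; in a real distributed sort each block would live on a different process and a merge would replace the full sort):

```python
def compare_exchange(a, b):
    """Single-element step: return the pair in sorted order."""
    return (a, b) if a <= b else (b, a)

def compare_split(block_a, block_b):
    """Block-level step used by distributed sorts: given two sorted
    blocks, return two sorted blocks where everything in the first is
    <= everything in the second (each side keeps its original size)."""
    merged = sorted(block_a + block_b)  # a real implementation would merge
    return merged[:len(block_a)], merged[len(block_a):]

print(compare_exchange(7, 2))               # -> (2, 7)
print(compare_split([1, 4, 9], [2, 3, 8]))  # -> ([1, 2, 3], [4, 8, 9])
```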

15
Q

What is the serial cost of the bitonic sorting algorithm?

A

The serial complexity (cost) of the bitonic sort algorithm is O(n log^2 n), where n is the number of elements to be sorted.
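A serial bitonic sort is short enough to sketch; this recursive version (an illustration, assuming the input length is a power of two) performs the Theta(n log^2 n) compare-exchanges the card refers to:

```python
def bitonic_sort(a, ascending=True):
    """Serial bitonic sort; len(a) must be a power of two."""
    n = len(a)
    if n <= 1:
        return a
    half = n // 2
    first = bitonic_sort(a[:half], True)    # ascending half
    second = bitonic_sort(a[half:], False)  # descending half -> bitonic
    return bitonic_merge(first + second, ascending)

def bitonic_merge(a, ascending):
    """Sort a bitonic sequence via log(n) rounds of compare-exchanges."""
    n = len(a)
    if n <= 1:
        return a
    half = n // 2
    a = a[:]
    for i in range(half):  # compare-exchange across the two halves
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return bitonic_merge(a[:half], ascending) + bitonic_merge(a[half:], ascending)

print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # -> [1, 2, 3, 4, 5, 6, 7, 8]
```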

16
Q

Parallel quicksort: pivot selection

A

In parallel quicksort, a common approach to pivot selection is to select a random sample of elements from the array, sort this sample, and then use the median as the pivot. This is done to ensure a balanced partitioning and optimal load balancing across processors.

17
Q

Parallel quicksort: local rearrangement

A

Each processor then rearranges its local data such that all elements less than the pivot come before all elements greater than the pivot. This local rearrangement can be done in O(n/p) time, where n is the total number of elements and p is the number of processors.
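One processor's local rearrangement is just a partition around the broadcast pivot; a serial sketch (the data and pivot below are assumed example values):

```python
def local_rearrange(local_data, pivot):
    """One processor's local step in parallel quicksort: partition
    its block around the pivot in O(n/p) time."""
    low = [x for x in local_data if x <= pivot]
    high = [x for x in local_data if x > pivot]
    return low, high

print(local_rearrange([9, 2, 7, 5, 1, 8], pivot=5))  # -> ([2, 5, 1], [9, 7, 8])
```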

18
Q

Parallel quicksort: global rearrangement

A

After local rearrangement, a global rearrangement is performed: elements less than the pivot move to processors in the lower half of the partition, while elements greater than the pivot move to the upper half. This exchange can be achieved through all-to-all communication and takes O(n/p) time per processor, since each processor sends and receives O(n/p) elements.

19
Q

Parallel quicksort: how many global rearrangement steps are required?

A

The number of global rearrangement steps in the parallel algorithm is log(p), where p is the number of processors. This is because with each step, the number of partitions doubles.

20
Q

Parallel quicksort: how much time does local rearrangement take?

A

The time for local rearrangement is O(n/p) since each processor is operating on its local data independently.

21
Q

Parallel quicksort: how much time does global rearrangement take?

A

The time for each global rearrangement step is O(n/p), as each processor sends and receives O(n/p) elements during the exchange.

22
Q

Parallel quicksort: What is an estimate of the running time
for parallel quicksort assuming ideal pivots are selected?

A

Assuming ideal pivots are selected (which would mean that the data is evenly distributed among the processors after each partitioning), the running time of the parallel quicksort algorithm is O(n log n / p + log^2 p). The first term accounts for local work and the second term accounts for the necessary synchronization and communication between processors.

23
Q

Parallel sample sort: what is the first step of the algorithm?

A

The first step in the parallel sample sort algorithm is data distribution. Each processor is given a roughly equal portion of the data to be sorted.

24
Q

Parallel sample sort: how are samples selected?

A

Samples are selected by having each processor pick a subset of values from its local data, either at random or at regular intervals.

25
Q

Parallel sample sort: How are splitters created
(and what are splitters?)

A

The selected samples are then gathered together and sorted to create “splitters”. Splitters are selected values that divide the data into buckets. Each processor uses these splitters to partition its local data into buckets such that all values in a bucket are less than the next splitter and greater than or equal to the previous one.
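The splitter-selection and bucketing steps can be sketched serially; the sample values and processor count below are made up for illustration:

```python
from bisect import bisect_right

def choose_splitters(samples_per_proc, p):
    """Gather every processor's samples, sort them, and pick p-1
    splitters at regular intervals (serial sketch of the step)."""
    all_samples = sorted(s for proc in samples_per_proc for s in proc)
    step = len(all_samples) // p
    return [all_samples[i * step] for i in range(1, p)]

def bucket_index(splitters, x):
    """Bucket a value belongs to: the count of splitters <= x."""
    return bisect_right(splitters, x)

# Assumed samples from p = 3 processors.
samples = [[4, 12, 20], [7, 15, 23], [2, 10, 18]]
splitters = choose_splitters(samples, p=3)
print(splitters)                    # -> [10, 18]
print(bucket_index(splitters, 14))  # -> 1 (the middle bucket)
```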

26
Q

Parallel sample sort: Provide an estimate of the running time
for the parallel sample sort.

A

The running time of the parallel sample sort algorithm is O((n/p) log(n/p) + n log p), where n is the total number of elements and p is the number of processors.

27
Q

Parallel sample sort: what happens after local rearrangement?

A

After local rearrangement, an all-to-all exchange gathers all values belonging to the same bucket onto a single processor; each processor then sorts the bucket it receives.