Intro To The Work-Span Model Flashcards
Dynamic Multithreading Model
Two parts:
- A computation can be represented by a directed acyclic graph, in which each node is an operation and each edge a dependence.
- Pseudocode notation, or programming model, for writing down the algorithm. Designed so that “when you execute one of these algorithms, it generates a DAG, at least conceptually.”
From the point of view of exploiting parallelism, what is the key property of a “good” DAG?
It has few dependencies relative to its work, so that many operations can run concurrently; equivalently, its span is small compared to its work.
What separation does the pseudocode notation provide?
It separates how to produce work (algorithm) from how to schedule and execute it (hardware).
What is a scheduling problem?
The problem of how to take free units of work and assign them to processors.
Given a complete binary tree with n leaves, how many levels does it have?
log2(n) (counting depth; including the leaf level itself, log2(n) + 1 levels)
What is work?
The total number of nodes/vertices in the multithreaded DAG model.
Notated W(n)
What is span?
The length of the longest path through the multithreaded DAG model (from start to end). Also called “depth”.
Notated D(n)
What is another name for the longest path through a DAG?
Critical path
How much time does it take to execute all operations in a DAG with one processor, assuming all ops have unit cost?
W(n), i.e., total work
How much time does it take to execute all operations in a DAG with infinitely many processors, assuming all ops have unit cost?
D(n)
Average available parallelism
W(n)/D(n)
Average work per critical path vertex. This is the amount of parallelism we can take advantage of at each critical path vertex, and is a good starting point for choosing an appropriate number of processors.
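As a concrete example, consider a binary reduction tree over n inputs. A minimal Python sketch (function name mine), assuming each internal node is one unit-cost operation and the leaves are free inputs:

```python
import math

def reduce_tree_stats(n):
    """Work, span, and average available parallelism for a binary
    reduction tree over n leaves (n a power of two). Assumes each
    internal node is one unit-cost op and leaves cost nothing."""
    work = n - 1              # W(n): number of internal nodes
    span = int(math.log2(n))  # D(n): longest chain of ops, leaf to root
    return work, span, work / span

W, D, avg_par = reduce_tree_stats(1024)  # (1023, 10, 102.3)
```

So for a 1024-element reduction, roughly 100 processors is a sensible starting point.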
Span Law
T_p(n) >= D(n)
i.e., the span is a lower bound on the execution time given p processors.
Work Law
T_p(n) >= ceiling(W(n)/p)
Another lower bound on the execution time given p processors. This time it’s total work divided by number of processors.
Work-Span Law
Both the Work Law and Span Law must be true simultaneously, so we can combine them.
T_p(n) >= max{D(n), ceiling(W(n)/p)}
This gives a lower bound on execution time.
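The combined lower bound is easy to compute; a minimal Python sketch (function name and example values mine):

```python
import math

def time_lower_bound(work, span, p):
    """Work-Span Law: T_p(n) >= max(D(n), ceil(W(n)/p))."""
    return max(span, math.ceil(work / p))

# With W = 1000 and D = 20:
time_lower_bound(1000, 20, 4)    # work-limited: ceil(1000/4) = 250
time_lower_bound(1000, 20, 100)  # span-limited: D = 20 > ceil(1000/100) = 10
```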
How do you derive Brent’s Theorem (which provides an upper bound on execution time of a DAG)?
1) Break up the DAG into k phases, where each phase meets three conditions:
- Each phase has EXACTLY one critical path vertex.
- All vertices within a phase that are NOT on the critical path must be independent, i.e., have no dependencies between them.
- Each vertex must belong to some phase.
2) Each phase k has a number of vertices in it, W_k (including critical path vertex)
3) By the third phase condition, the sum of W_k over all phases equals W, the total work of the DAG.
4) How long does it take to execute phase k? t_k = ceiling(W_k/p)
5) Step 4 implies that the total time T_p = sum from k=1 to D of ceiling(W_k/p). (There are exactly D phases, since each phase contains exactly one critical path vertex.)
6) Using the identity ceiling(a/b) = floor((a-1)/b) + 1 (for positive integers a, b), we can rewrite the summand in 5: total time T_p = sum from k=1 to D of [floor((W_k-1)/p) + 1]
7) floor(x) is always <= x, so T_p <= sum from k=1 to D of [(W_k-1)/p + 1]
8) Evaluating the sum in 7 gives T_p <= (W - D)/p + D. This is Brent's Theorem!
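The bound can be sanity-checked by simulating a greedy scheduler (each step runs up to p ready vertices) on a small hand-built DAG; greedy schedules also satisfy Brent's bound. This is an illustrative sketch of mine, not part of the original derivation:

```python
import math

def greedy_schedule_time(succs, indeg, p):
    """Number of time steps a greedy scheduler takes: each step
    executes up to p ready (all predecessors done) unit-cost vertices."""
    indeg = dict(indeg)  # copy so the caller's dict is untouched
    ready = [v for v, d in indeg.items() if d == 0]
    steps = 0
    while ready:
        batch, ready = ready[:p], ready[p:]
        steps += 1
        for v in batch:
            for w in succs.get(v, []):
                indeg[w] -= 1
                if indeg[w] == 0:
                    ready.append(w)
    return steps

# Small DAG: a fans out to b, c, d, which all feed e.
succs = {"a": ["b", "c", "d"], "b": ["e"], "c": ["e"], "d": ["e"]}
indeg = {"a": 0, "b": 1, "c": 1, "d": 1, "e": 3}
W, D, p = 5, 3, 2
T = greedy_schedule_time(succs, indeg, p)  # 4 steps
assert max(D, math.ceil(W / p)) <= T <= (W - D) / p + D  # 3 <= 4 <= 4
```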
What is Brent’s Theorem?
T_p <= (W - D)/p + D
T_p := total execution time of DAG given p processors
W := work of DAG, i.e., number of vertices or operations
D := span of DAG, i.e., depth
p := number of processors
What is the intuitive interpretation of Brent’s Theorem?
Total time to execute the DAG is less than or equal to the time needed to execute the critical path (D) plus the time needed to execute all other nodes using p processors.
Note that the time to execute the critical path does not depend on p because you can’t execute nodes of the critical path in parallel.
What’s the equation for speedup of parallel time over best sequential time?
S_p(n) := T_*(n)/T_p(n)
What does T_p(n) depend on?
T_p(n) = f(W, D; n, P)
work, span, problem size, and number of processors
What does T_*(n) depend on?
T_*(n) = W_*(n)
Work done by best sequential algorithm
What is the ideal speedup of a parallel algorithm over the best sequential algorithm?
Ideal speedup is linear in P, also called linear speed up, linear scaling, or ideal scaling.
S_p(n) = Theta(P)
How does big-Theta notation work?
big-Theta depends on both big-O and big-Omega.
big-O provides an upper bound on the performance of an algorithm. f(n) is O(g(n)) if there exist some constants C_1 and n_0 s.t. 0 <= f(n) <= C_1*g(n) for all n >= n_0.
big-Omega provides a lower bound on the performance of an algorithm. f(n) is Omega(g(n)) if there exist some constants C_2 and n_0 s.t. 0 <= C_2*g(n) <= f(n) for all n >= n_0.
big-Theta provides the tightest bound on execution time. f(n) is Theta(g(n)) if there exist some constants C_1, C_2, and n_0 such that 0 <= C_2g(n) <= f(n) <= C_1g(n) for all n >= n_0.
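A concrete (made-up) instance: f(n) = 3n^2 + 5n is Theta(n^2). One valid choice of witness constants is C_1 = 4, C_2 = 3, n_0 = 5; the snippet below spot-checks the inequalities over a finite range (a sanity check, not a proof):

```python
f = lambda n: 3 * n**2 + 5 * n
g = lambda n: n**2
C1, C2, n0 = 4, 3, 5  # witness constants; many other choices work

# 0 <= C2*g(n) <= f(n) <= C1*g(n) for all n >= n0 (checked up to 10^4)
assert all(0 <= C2 * g(n) <= f(n) <= C1 * g(n) for n in range(n0, 10**4))
```

Note that n_0 matters: at n = 4, f(4) = 68 > C_1*g(4) = 64, so the upper inequality only kicks in once n >= 5.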
How do you use Brent’s Theorem to get a lower bound on speedup (and then make it easier to analyze)? Using the resulting equation, what has to be true to get ideal scaling?
S_p(n) = T_*(n)/T_p(n) = W_*(n)/T_p(n), because the execution time of the best sequential algorithm is just the work it performs.
Brent's Theorem gives us an upper bound on execution time. Plugging it into this equation gives a lower bound on speedup: S_p(n) = W_*(n)/T_p(n) >= W_*(n)/[(W-D)/P + D] = (via a little algebra) P/[(W/W_*) + (P-1)/(W_*/D)]
To get ideal scaling, the denominator must be constant (since P is the numerator and ideal scaling means linear scaling in P). First, W(n) must be proportional to W_*(n), so that W/W_* = O(1). This is called "work-optimality".
Second, we need P to be proportional to W_*/D, i.e., P = O(W_*/D). This can be rewritten as W_*/P = Omega(D), which says the work per processor must grow with the span, which in turn grows with n. This is called "weak scalability".
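A sketch of this bound in Python (function name and example values mine): plugging in a work-optimal case with P = W_*/D recovers the classic "at least half of ideal" guarantee.

```python
def speedup_lower_bound(W, W_star, D, P):
    """Lower bound on speedup obtained by substituting Brent's
    Theorem into S_P = W_*(n)/T_P(n):
        S_P >= P / (W/W_* + (P-1)/(W_*/D))
    """
    return P / (W / W_star + (P - 1) / (W_star / D))

# Work-optimal (W == W_*) with P chosen as W_*/D = 100:
bound = speedup_lower_bound(1000, 1000, 10, 100)  # ~50.3, i.e., at least P/2
```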
What is work-optimality?
W(n) = O(W_*(n)), i.e., the work of our parallel algorithm must be proportional to the work of the best sequential algorithm. We use this to get ideal scaling.