Exam cards Flashcards

Study for PMPH course fall 2018

1
Q

State Moore’s Law

A

The number of transistors in a dense integrated circuit doubles about every two years. OR: Computer power doubles every 18-24 months while cost-effectiveness keeps pace.

2
Q

Explain cost-effectiveness of hardware

A

Cost-effectiveness is the ratio between the performance and the cost of hardware.

3
Q

Explain the killer-micro effect

A

Additional hardware resources generated by the rapid increase in transistor density were used to increase the speed/frequency of single-CPU systems, resulting in complex designs of "muscled" single processors relying on the out-of-order (dataflow) execution model.

4
Q

What are the two main capabilities of the out-of-order (data-flow) execution model?

A
  1. Storing thousands of instructions in their pipelines. 2. Executing hundreds of instructions per cycle.
5
Q

Explain the divergence between academic interest in parallel architectures and industry's focus on muscled single-processor systems

A

Muscling single-CPU systems was seen as the path of least resistance: it required no rewriting of code, while converting to parallel architectures requires significant rewriting, which is tedious and non-trivial.

6
Q

What was the tectonic shift towards multiprocessor design?

A

In 2004, Intel cancelled the design of the 4 GHz Pentium 4 and switched to multicore designs.

7
Q

What is the power wall?

A

Dynamic power is roughly proportional to the cube of the frequency, and single-chip processors have reached the limit of the power they can dissipate, so the clock rate can no longer be raised.
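A one-line sketch of why (using the standard CMOS dynamic-power model, which the card leaves implicit): dynamic power is

    P_dyn = C · V² · f

and since the supply voltage V must scale roughly linearly with the frequency f, we get P_dyn ∝ f³.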

8
Q

What is the memory wall?

A

The exponentially increasing performance gap between processor and memory speeds.

9
Q

Commodity parallel architectures such as GPUs provide thousands of cores and tens of thousands of threads. What is the main barrier to utilizing this power?

A

The lack of high-level programming models and compiler optimizations that would allow commodity software to harness the power.

10
Q

What are a program and a process/thread?

A

A program is a set of statements for performing computational tasks, while a process/thread embeds the execution of the computation. A program is to a process/thread what a recipe is to cooking.

11
Q

What is a processor/core and what is a multithreaded core?

A

The hardware entity capable of sequencing and executing a process's/thread's instructions. Multithreaded cores support multiple hardware threads, each running in its own hardware context.

12
Q

What is a multiprocessor?

A

A multiprocessor is a set of processors connected to execute a workload. Each multiprocessor consists of several cores, potentially with hardware multithreading support, and several levels of cache.

13
Q

Name the three main improvement types in processor design

A
  1. Deeper pipelines (10-20 stages): more stages means less complex individual stages, i.e., fewer gates per stage, so each stage is quicker, hence a higher clock rate.
  2. Aggressively exploiting instruction-level parallelism (ILP), for example by out-of-order and speculative execution, which combine techniques such as register renaming, reorder buffers, branch prediction, lockup-free caches, and speculative memory disambiguation.
  3. Improvements in circuit design.
14
Q

Why is it unsustainable to carry on improving the clock rate?

A

• First, it is infeasible to build deeper pipelines, because it is difficult to imagine useful stages that can be built from fewer than 10 gates (we have already reached that point).
• Second, the impact of technology scaling will be blunted in the future by wire delays, which do not scale: the speed of wire transmission grows much more slowly than the switching speed.
• Finally, and perhaps most importantly, circuits clocked at higher rates consume more power, and we have already reached the limits of power consumption in single-chip microprocessors.

15
Q

Which hardware optimizations can we do other than increasing clock rate?

A

Reduction of feature size and increasing the number of transistors.

16
Q

How can we utilize an increased number of transistors?

A

We can:
1. Enhance the parallelism of the memory system.
2. Fetch and decode multiple instructions per clock.
3. Run multiple hardware threads concurrently per core, for example to hide the high latency of the memory system.
4. Support many cores that run threads in parallel.

17
Q

How much does DRAM density increase every third year (historically)?

A

4×

18
Q

How much does DRAM speed increase each year (historically)?

A

7%

19
Q

The memory wall has been replaced by what?

A

A bandwidth wall. Nowadays the memory subsystem needs to efficiently feed many cores that execute threads in parallel, which means the memory system must be capable of delivering multiple data items at the same time, a capability measured by bandwidth.

20
Q

Name the four technological limits in architectural design

A

Power, wire delays, reliability, and complexity of design.

21
Q

Explain what the power constraint is and why it favours parallel processing over an increase in the clock rate.

A

The power constraint is that the power consumed by a processor, given by the sum of dynamic and static power, is not limitless. Dynamic power is consumed by gate switching and is roughly proportional to the cube of the frequency. Hence increasing the frequency requires much more power than replicating a uniprocessor does.

22
Q

What is leakage power and how is it related to dynamic power?

A

Leakage power is the same as static power and is dissipated in all circuits at all times, independent of frequency and switching. It is dominated by cache leakage. Leakage power increases exponentially as the voltage at which a transistor switches off is reduced. Smaller feature sizes and lower threshold voltages mean that leakage power has overtaken dynamic power as the major source of dissipation.

23
Q

What are the three classifications of hardware failures?

A
Transient failures (soft errors): due to voltage reduction in every process generation, bits are more and more prone to flip. The device remains operational, but data may be partly corrupted.
Intermittent/temporary failures: the result of environmental variations on the chip, such as hot spots. For the device to return to correct behavior, the issue must be fixed (e.g., turn it off and let it cool down).
Permanent failures: cannot be fixed.
24
Q

Why do multiprocessors promote more reliability than uniprocessor systems?

A

Threads can be used to redundantly perform the same computation and compare results. Faulty cores can be detected and disabled while the system remains functional.

25
Q

The propagation delay on a wire is proportional to what?

A

The product of its resistance and capacitance.

26
Q

The resistance of a wire is proportional to what?

A

The ratio between the length of the wire and its cross-sectional area (R = ρ · L/A).

27
Q

What happens to the length of wires at each process generation? Is this good or bad for resistance?

A

It shrinks. Good!

28
Q

What happens to the cross-section area of wires at each process generation? Is this good or bad for resistance?

A

It shrinks. Bad!

29
Q

Why is the impact of wire delay lower for multiprocessors?

A

Communication traffic is hierarchical: most communication is local, while inter-core communication happens only occasionally.

30
Q

Why does the cost of design verification favour multiprocessors?

A

It is much easier to replicate the same structure multiple times than it is to design a large and complex system.

31
Q

How is feature size defined?

A

Half the distance between two metal wires.

32
Q

What is the definition of a monoid?

A

Definition 1 (Monoid).
Assume a set S and a binary operator ⊙ : S × S → S.
(S, ⊙) is called a monoid if it satisfies the following two axioms:
(1) Associativity: ∀x, y, z ∈ S we have (x ⊙ y) ⊙ z ≡ x ⊙ (y ⊙ z), and
(2) Identity element: ∃e ∈ S such that ∀a ∈ S, e ⊙ a ≡ a ⊙ e ≡ a.

33
Q

What is the definition of a group?

A

Definition 2 (Group).
(S, ⊙) is called a group if it is a monoid satisfying the additional property that every element is invertible:
∀a ∈ S, ∃a⁻¹ ∈ S such that a ⊙ a⁻¹ ≡ a⁻¹ ⊙ a ≡ e.

34
Q

State whether (Z, +), (N, +), (Z, ×) are monoids and groups

A

(Z, +): monoid and group; (N, +): monoid but not a group; (Z, ×): monoid but not a group.

35
Q

Give the definition of a monoid homomorphism

A

A monoid homomorphism from monoid (S, ⊕) to monoid (T, ⊙) is a function h : S → T such that
∀u, v ∈ S, h(u ⊕ v) ≡ h(u) ⊙ h(v).

36
Q

Give the definition of a list homomorphic function

A

A list homomorphic function (LHF) h is a (mathematical) function that can be implemented as:
h( [ ] ) = e
h( [x] ) = f(x)
h( x ++ y ) = h(x) ⊙ h(y)

37
Q

A list homomorphic function is a mathematical homomorphism from ____ to ____? What does this mean?

A

From (L_T, ++) to (Img(h), ⊙).
In particular, (Img(h), ⊙) must be a monoid with neutral element e, which also means that ⊙ must be associative.
The reverse also (trivially) holds: if h is a homomorphism between (L_T, ++) and some monoid (M, ⊙), then it admits a list homomorphic implementation (LHI), as above.

38
Q

Give examples of list homomorphic programs

A
  1. Computing the length of a list, i.e., len : [α] → Int.
    len [ ] = 0
    len [x] = one x -- where one x = 1
    len (x ++ y) = (len x) + (len y)
  2. allp, which checks whether some predicate p : α → Bool holds on all elements of an input list:
    allp [ ] = true
    allp [x] = p x
    allp (x ++ y) = (allp x) && (allp y)
  3. Summing up the (numerical) elements of a list. This can be accomplished with the following LHI:
    sum [ ] = 0
    sum [x] = id x -- where id x = x
    sum (x ++ y) = (sum x) + (sum y)
  4. The function that adds two to every (numeric) element of a list and then multiplies the results:
    fld [ ] = 1
    fld [x] = plus2 x -- where plus2 x = x + 2
    fld (x ++ y) = (fld x) * (fld y)
39
Q

What is the first basic block of data parallel programming?

A

The map second-order operator, which takes as arguments a unary function and a list of elements, and produces a list of the same length as the original one by applying the function argument to each element of the input list. The type and semantics of map are:
map : (α → β) → [α] → [β]
map f [x1, ..., xn] = [f x1, ..., f xn]

40
Q

Explain why map has implicitly parallel semantics.

A

The computation corresponding to some element x_i of the array is completely independent of all other elements of the array.

41
Q

What is the second basic block of data parallel programming?

A

The reduce second-order operator, which takes as arguments an associative binary operator, the neutral element of the monoid defined by that operator, and a list of elements. The result of reduce is obtained by successively applying the operator to all the elements of the array. The type and semantics of reduce are:
reduce : (α → α → α) → α → [α] → α
reduce (⊙) e [x1, ..., xn] = e ⊙ x1 ⊙ ... ⊙ xn

42
Q

What is important to remember about the operator in reduce so that it has parallel semantics?

A

It must be associative.

43
Q

What is the first list homomorphism theorem?

A

A list-homomorphic implementation defined as:
h( [ ] ) = e
h( [x] ) = f(x)
h( x ++ y ) = h(x) ⊙ h(y)
is semantically equivalent to (reduce (⊙) e) ◦ (map f), or in complete code:
h z ≡ reduce (⊙) e (map f z)

44
Q

Write the map-reduce implementations of the list homomorphic programs for finding the length of a list (len), whether all items of a list satisfy some property (allp), and the sum of the elements in a list (sum).

A
  • len z ≡ reduce (+) 0 (map one z)
  • allp z ≡ reduce (&&) true (map p z)
  • sum z ≡ reduce (+) 0 (map id z) ≡ reduce (+) 0 z
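As runnable Haskell, these become (a minimal sketch: reduce is modelled with foldl, which is safe here because each operator is associative with the given neutral element; p stands for an arbitrary predicate):

    -- reduce (⊙) e, modelled sequentially; associativity licenses a parallel version
    reduce :: (a -> a -> a) -> a -> [a] -> a
    reduce op e = foldl op e

    len :: [a] -> Int
    len = reduce (+) 0 . map (const 1)

    allp :: (a -> Bool) -> [a] -> Bool
    allp p = reduce (&&) True . map p

    sum' :: Num a => [a] -> a
    sum' = reduce (+) 0        -- map id is the identity, so it is dropped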
45
Q

State the list homomorphism lemmas

A
  1. (map f) ◦ (map g) ≡ map (f ◦ g)
  2. (map f) ◦ (reduce (++) []) ≡ (reduce (++) []) ◦ (map (map f))
  3. (reduce (⊙) e⊙) ◦ (reduce (++) []) ≡ (reduce (⊙) e⊙) ◦ (map (reduce (⊙) e⊙))
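The first lemma in runnable form (a small sketch; the concrete functions (*2) and (+1) are just for illustration):

    -- map fusion: one traversal instead of two
    unfused, fused :: [Int] -> [Int]
    unfused = map (+1) . map (*2)
    fused   = map ((+1) . (*2))

    -- unfused [1,2,3] == fused [1,2,3] == [3,5,7]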
46
Q

The first identity in the list homomorphism lemma is known as ___ and is useful for what?

A

It is known as the map fusion/fission rule. Fusion corresponds to applying the transformation in the forward direction and is useful for reducing the number of accesses to global memory, which is slow compared to registers. Fission corresponds to applying the transformation in the backward direction and is useful for enhancing the degree of parallelism that is statically mapped to the hardware in the context of a nested parallel program.

47
Q

How can the second and third identities of the list homomorphism lemmas be used?

A

In the forward direction they can be used to efficiently sequentialize the parallelism in excess of what the hardware can support, and in the backward direction to enhance load balancing and the program's degree of parallelism that can be statically mapped to the hardware.

48
Q

What is an often used optimization for the scheduling of map-reduce computations?

A

A map-reduce composition can be rewritten into a semantically equivalent program that:
• splits the input list into p sublists of roughly equal length,
• applies the original computation to each sublist, such that the computation of a sublist is performed sequentially on a core, while different sublists are processed in parallel on different cores,
• applies the original reduction to the per-core results.
The benefit of such an execution is not only that it spawns a number of threads equal to the number of cores, reducing scheduling and context-switching overheads, but also that it optimizes the reduction depth. Originally, the reduction was applied to a list of n elements and would require log(n) sequential steps (see the reduction tree in fig. 4). In the transformed program, the final (parallel) reduction is performed on a list of p elements, requiring only log(p) sequential steps.

49
Q

What is the optimized map-reduce lemma?

A
Theorem 4 (Optimized Map-Reduce Lemma). Assume split_p : [α] → [[α]] distributes a list into p sublists, each containing about the same number of elements. Also assume ⊙ is a binary associative operator with neutral element e⊙ and f is a unary function. The following identity always holds:
redomap (⊙) f e⊙ ≡ (reduce (⊙) e⊙) ◦ (map (redomap (⊙) f e⊙)) ◦ split_p
where redomap is defined as redomap (⊙) f e⊙ ≡ (reduce (⊙) e⊙) ◦ (map f).
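A Haskell sketch of the lemma (splitP and the sequential fold per chunk are modelling choices of mine; a real implementation would run one chunk per core):

    -- split a list into p sublists of roughly equal length
    splitP :: Int -> [a] -> [[a]]
    splitP p xs = go xs
      where
        n     = max 1 ((length xs + p - 1) `div` p)
        go [] = []
        go ys = let (c, rest) = splitAt n ys in c : go rest

    -- redomap (⊙) f e ≡ (reduce (⊙) e) ◦ (map f), run sequentially per chunk
    redomap :: (b -> b -> b) -> (a -> b) -> b -> [a] -> b
    redomap op f e = foldl (\acc x -> acc `op` f x) e

    -- the optimized schedule: sequential redomap per chunk, then a small reduction
    optRedomap :: Int -> (b -> b -> b) -> (a -> b) -> b -> [a] -> b
    optRedomap p op f e = foldl op e . map (redomap op f e) . splitP p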
50
Q

Give an example of a near-homomorphic program and the transformation into a homomorphic program.

A

The maximum segment sum (MSS) problem is an example. It can be transformed by computing extra "baggage" of information, the quadruple (maximum segment sum, maximum initial sum, maximum concluding sum, total sum), in three main steps: each input element is lifted to the aforementioned quadruple; the result is reduced using an associative operator that combines the quadruples; then we project, i.e., select the element of the quadruple that is the result (the mss).
Longest satisfying segment is another, similar example.
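A Haskell sketch of the MSS construction (the quadruple is (mss, mis, mcs, ts) as above; the neutral element (0,0,0,0) assumes the empty segment is allowed):

    -- lift each element to (max segment sum, max initial sum, max concluding sum, total)
    lift :: Int -> (Int, Int, Int, Int)
    lift x = (max x 0, max x 0, max x 0, x)

    -- associative combination of two quadruples
    combine :: (Int, Int, Int, Int) -> (Int, Int, Int, Int) -> (Int, Int, Int, Int)
    combine (m1, i1, c1, t1) (m2, i2, c2, t2) =
      ( maximum [m1, m2, c1 + i2]   -- best segment: left, right, or spanning the seam
      , max i1 (t1 + i2)            -- best prefix
      , max c2 (c1 + t2)            -- best suffix
      , t1 + t2 )                   -- total sum

    mss :: [Int] -> Int
    mss = (\(m, _, _, _) -> m) . foldl combine (0, 0, 0, 0) . map lift

    -- mss [1,-2,3,4,-1,5,-6,1] == 11  (the segment [3,4,-1,5])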

51
Q

Write the three formulas that provide the basis for Amdahl's law.

A

Let E be an enhancement that speeds up by a factor S the fraction F of the program that benefits from it; 1−F is the fraction that does not. Then:
Speedup(E) = T(without E) / T(with E) = 1 / ((1−F) + F/S)
Speedup(E) ≤ 1 / (1−F)
lim (S → ∞) Speedup(E) = 1 / (1−F)

52
Q

Provide an interpretation of Amdahl's law in terms of diminishing returns

A

In essence, Amdahl's law shows that no matter how big the improvement is, the overall application speedup is limited by the 1−F fraction that does not benefit from the improvement.

53
Q

Write the formula for Amdahl's law in the particular case of parallelism

A

Let T1 be the execution time on one processor and TP the time on P processors, with F the fraction run in parallel. Then Speedup(P) = T1/TP = P/(F + P(1−F)) < 1/(1−F). Ideally, P cores would yield a P× speedup.
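A worked instance (numbers chosen for illustration): with F = 0.9 and P = 16, Speedup(16) = 16/(0.9 + 16 · 0.1) = 16/2.5 = 6.4, and even as P → ∞ the speedup is capped at 1/(1 − 0.9) = 10×.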

54
Q

What is the moral of Amdahls law for parallel parallelizing code?

A

Never leave sequential any part of the program that can possibly be parallelized. In other words, when developing parallel code, we must reason as if the hardware had an unlimited/infinite number of cores.

55
Q

What does PRAM stand for?

A

Parallel random access machine.

56
Q

What are the 5 assumptions on a PRAM?

A
  1. There are P processors connected to a shared memory.
  2. Each processor has a unique identifier/index 0 ≤ i < P.
  3. Execution happens in single-instruction multiple-data (SIMD) fashion, i.e., all cores execute in lock step.
  4. Each parallel instruction takes unit time.
  5. Each processor has a flag that controls whether it is active in the execution of an instruction.
57
Q

What does SIMD stand for and what does it mean?

A

Single instruction, multiple data, which means all cores execute in lock step, i.e., a core cannot start the next instruction until all cores have finished the current instruction.

58
Q

What effect does executing in lock step have when branching?

A

In the case of an if-then-else, the processors that did not take the then branch must wait until all the other processors have finished executing the then branch before starting to execute the else branch, and similarly for the else branch.

59
Q

What is the work complexity?

A

The total number of operations performed to execute a program.

60
Q

What is the depth complexity?

A

The number of sequential steps needed to execute a program.

61
Q

When is a parallel implementation work efficient?

A

If the work complexity is asymptotically equal to that of the best sequential implementation of the same program.

62
Q

State Brent's theorem

A

A parallel implementation that has depth D(n) and work W(n) can be simulated on a P-processor PRAM in time complexity T such that:
W(n)/P ≤ T ≤ W(n)/P + D(n)
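A worked instance (numbers chosen for illustration): a reduction has W(n) = n and D(n) = lg n, so on a P-processor PRAM with n = 2²⁰ and P = 2¹⁰ it takes between n/P = 1024 and n/P + lg n = 1044 steps.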

63
Q

What does SOAC stand for?

A

Second-order array combinator.

64
Q

Explain the implementation of scan

A

The implementation is organized in two parallel steps:
(1) The first step is called the “Up-Sweep” and is similar to a reduction, except that the accumulation of all elements is computed in the last element of the array rather than the first.
• After the up-sweep pass, the neutral element (0 for +) is placed in the position of the last element.
(2) The second step is called the “Down-Sweep”, and it propagates updates to the array's elements in the reverse order of the up-sweep pass (i.e., reverse the arrows and the traversal of the up-sweep). Each propagation requires two substeps:
2.1. The left child sends its value to its parent and updates its value to that of the parent.
2.2. The right-child value is obtained by applying the binary operator of the scan to the left-child value and to the (old) value of the parent. Notice that the right child is in fact the parent, making this an in-place algorithm.
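A minimal Haskell sketch of a work-efficient exclusive scan (a recursive formulation of the same up-sweep/down-sweep idea, not the in-place version described above; it assumes the input length is a power of two):

    -- exclusive scan: pair up, recurse on the pair sums, then expand
    exscan :: (a -> a -> a) -> a -> [a] -> [a]
    exscan _  e [_] = [e]
    exscan op e xs  = concat [ [s, s `op` x] | (s, x) <- zip sums evens ]
      where
        evens = everyOther xs                                   -- even-position elements
        sums  = exscan op e [ a `op` b | (a, b) <- pairUp xs ]  -- scan of pair sums

    pairUp :: [a] -> [(a, a)]
    pairUp (a : b : rest) = (a, b) : pairUp rest
    pairUp _              = []

    everyOther :: [a] -> [a]
    everyOther (a : _ : rest) = a : everyOther rest
    everyOther ys             = ys

    -- exscan (+) 0 [1,2,3,4] == [0,1,3,6]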

65
Q

What's the depth and work of scan and reduce?

A

Depth O(lg n) and work O(n).

66
Q

What is a Warp?

A

The unit of parallel execution in CUDA, which consists of 32 parallel threads.

67
Q

How do the threads in a warp execute?

A

In SIMD fashion

68
Q

Make a data-parallel representation of the array [[1, 3, 5], [7, 8], [9, 11, 14, 15]]

A
shapeArray = [ 3, 2, 4 ]
flagArray = [ 1, 0, 0, 1, 0, 1, 0, 0, 0 ]
valueArray = [ 1, 3, 5, 7, 8, 9, 11, 14, 15 ]
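A Haskell sketch of how the flag array is derived from the shape array (the helper name mkFlags is mine):

    -- each segment contributes a leading 1 followed by (length - 1) zeros
    mkFlags :: [Int] -> [Int]
    mkFlags shape = concat [ 1 : replicate (n - 1) 0 | n <- shape ]

    -- mkFlags [3,2,4] == [1,0,0,1,0,1,0,0,0]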
69
Q

What is the work and depth of segmented scan?

A

Work O(n) and depth O(lg n), since a segmented scan can essentially be implemented in terms of an ordinary scan, whose work and depth are O(n) and O(lg n).
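A minimal Haskell sketch of the operator-lifting construction behind this (an inclusive segmented scan over (flag, value) pairs; the lifted operator is associative whenever op is):

    -- inclusive segmented scan via an ordinary scan of (flag, value) pairs
    segscan :: (a -> a -> a) -> [Bool] -> [a] -> [a]
    segscan op flags xs = map snd (scanl1 lifted (zip flags xs))
      where
        lifted (f1, v1) (f2, v2) =
          (f1 || f2, if f2 then v2 else v1 `op` v2)

    -- segscan (+) [True,False,False,True,False] [1,2,3,4,5] == [1,3,6,4,9]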