Theory Flashcards
What is easier to improve, bandwidth or latency?
Bandwidth
Because of this, when accessing memory, it is better to fetch as much data as we can per access, to make up for the memory latency
In caches, what are the size restrictions?
Cache block length: power of 2
Ways of associativity: Number of sets is a power of 2
n_blocks in a set: any integer
What are Bernstein’s conditions for out-of-order execution?
Statements S1 and S2 will produce the same result when run in any order if:
- S1 does not read what S2 writes
- S1 does not write what S2 reads
- S1 and S2 do not write to the same place
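As a sketch in C (the statements and variable names are illustrative, not from any particular program), the first pair of statements satisfies all three conditions and may be reordered, while the second pair violates the first condition:

```c
/* Independent pair: S1 reads {x, y}, writes {a};
 *                   S2 reads {x},    writes {b}.
 * No condition is violated, so the order does not matter. */
int independent(int x, int y, int *a, int *b) {
    *a = x + y;   /* S1 */
    *b = x * 2;   /* S2 */
    return *a + *b;
}

/* Dependent pair: S2 reads what S1 writes (a), violating the first
 * condition, so swapping S1 and S2 would change the result. */
int dependent(int x, int y) {
    int a = x + y;   /* S1 */
    int b = a * 2;   /* S2 */
    return b;
}
```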
What is a superscalar processor?
Launch multiple instructions at once (done in the control path)
Execute multiple instructions at once by replicating the ALU, so that there are multiple execution units
What are vector registers?
Extra-wide registers that can store several consecutive values at once, e.g.:
Vector load: a[0], a[1], a[2], a[3]
Vector registers allow for vector operations, where operations are done on all vector elements in parallel
What is Flynn’s taxonomy?
Ways to sort parallel programs
SISD: Von Neumann
SIMD: Vector machine
MISD: Theoretical
MIMD: Processes and threads. Fully independent processors, but the program makes them work towards the same goal; they are not synchronized
What are the ways to sort parallel programs?
Flynn’s taxonomy and shared-distributed memory
What is shared- and distributed memory?
Shared:
Several threads work within the memory image of a single process
Distributed:
Works with several processes, each with their own memory
What components are allocated to threads?
Stack,
Instruction pointer
What is invisible communication between threads?
When one thread accesses memory outside its own stack, the change is instantly visible to the rest of the threads
Pros and cons of shared memory?
Pro:
- only 1 copy of shared data
- everyone can modify this (must be coordinated)
Con:
- Cannot be on separate computers (thread lives within a process’ address space - so their memory must be managed by one single OS)
- Only get as many threads as fit into one machine
Pros and cons of distributed memory?
Pros:
- Their memory cannot be involuntarily corrupted
- only receive what they ask for, and store this in a designated place
- Can run on same or different computers - if we run out of processors, can just add a machine
Cons:
- message passing means shared data exists in multiple copies, which can use twice the memory
What is multi-computer parallelism?
When processes in a distributed memory program run on different computers.
Communicate over network
What are some types of interconnects used in shared memory systems?
Crossbar: Looks like a grid
(Fat) trees: Give non-uniform memory access (NUMA) effects: remote memory is slower. Processors are associated with the memory that is their responsibility
Mesh: Constant number of links per unit; messages are routed through the network over several hops. Communication latency grows linearly with distance
Torus: Mesh that wraps around the edges
What are interconnect fabrics?
The combination of interconnect types (graph shapes) found in a computer
What are inherently sequential computations?
Computations that cannot be parallelized, because instructions are dependent on each other and can’t run in any order.
What is the formula for parallel execution time?
Tp = f*T + ((1 - f) * T) / p
T: Sum of the time costs of all operations in the program
p: number of processors, where the parallel operations are divided
(1-f): fraction of operations that can be parallelized
f: fraction of sequential operations the program requires
What is the formula for speedup?
Speedup is how much faster a fast solution runs compared to a slow one
S = Tslow / Tfast
If S = 4, the faster solution is 4 times quicker
What is the formula for speedup as a function of number of processors?
S(p) = Ts/Tp = (f*T + (1-f)*T) / (f*T + ((1 - f) * T) / p)
What is the formula for sequential time?
Ts = fT + (1-f)T
What is linear speedup?
When every operation in a program can be parallelized
S(p) = p
What is Amdahl’s law?
When a program contains a sequential part, the fastest theoretical execution time it could run at, given infinite processors, would be the time spent executing these operations in sequence.
Assumes constant serial time. Everything is calculated in proportion to an amount of work taking T time sequentially.
If we have an infinite number of available processors, the speedup would reach the limit
lim p->infinity S(p) = 1/f
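Dividing Ts by Tp and cancelling T gives S(p) = 1 / (f + (1-f)/p), which is easy to sanity-check in code. A minimal sketch (the function name is my own):

```c
/* Speedup under Amdahl's law: S(p) = 1 / (f + (1 - f)/p),
 * where f is the sequential fraction and p the processor count.
 * This is Ts/Tp with the total work T cancelled out. */
double amdahl_speedup(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}
```

With f = 0.1, the speedup never exceeds 1/f = 10, no matter how large p grows.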
What is scaled speedup?
If we assume constant parallel time:
Tp = fT + (1-f)*T
Meaning, every time a processor is added, (1-f)*T units of work are added to occupy the additional processor
Bigger problems take longer to run sequentially:
Ts = fT + p(1-f)*T
Formula for scaled speedup:
Ss(p) = Ts/Tp = f + p(1-f), using new formula for Tp and Ts
What is Gustafson’s law?
Observes that scaled speedup does not approach any limit as p grows, as long as the problem is sized up in proportion to the machine:
Ss(p) = f + p(1-f)
What is the difference between scaled speedup and speedup?
Scaled: x times the work can be done in the same amount of time
Speedup: The same amount of work can be done in 1/x time
What are scaling curves?
The graph of S(p) given p
What is strong scaling?
The scaling curves of Amdahl’s law.
Best scaling is linear scaling:
S(p) = p
As p goes towards infinity, the curves approach the limit 1/f; the larger the fraction (f) of sequential operations in the program, the lower that limit
Studies of how speedup changes with processor count are said to be carried out in the strong mode
What is weak scaling?
The scaling curves of Gustafson's law.
They are all linear, but their gradient is not quite 1
Ss(p) = f + (1-f)p, a line in p with intercept f
The gradient is (1-f), not quite 1
Studies of how scaled speedup changes with processor count are carried out in weak mode
What is efficiency?
E(p) = S(p) / p
Amount of speedup divided equally amongst the processors
Efficiency gives a sense of how much one processor contributed to the speedup
What is horizontal scaling?
Upgrade system with more of the same components as available before
Improves the parallel part of execution
What is vertical scaling?
Upgrade system with new, more powerful components
Improves sequential part of execution
What is an assumption that Gustafson makes?
That the parallel workload can grow without needing to increase the sequential part
What is the Hockney model?
When we know the message size (n), we can estimate transmission time as the sum of the latency and n times the inverse bandwidth
Latency (a): Time it takes to get between communication points; decided by the distance between them, not the message size
Inverse bandwidth (B⁻1): seconds/byte
T_comm(n) = a + n*B⁻1
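The formula transcribes directly into code; a minimal sketch (parameter names follow the card, the values used in any example are made up):

```c
/* Hockney model: estimated transmission time for an n-byte message,
 * given latency a (seconds) and inverse bandwidth inv_B (seconds/byte). */
double t_comm(double a, double inv_B, double n) {
    return a + n * inv_B;
}
```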
What is the ping-pong test of communication speed?
Start clock
Send message from A to B
Send message from B to A
Repeat this a lot of times
Stop clock
Divide the time by 2 (for both directions) and by the number of repetitions
How can we measure latency?
Do ping-pong test with empty- or 1-byte messages
The time taken will in this case be dominated by latency
How can we measure inverse bandwidth?
Ping-pong test with a smaller number of huge messages
In this case, bandwidth requirements will dominate time taken
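Given two ping-pong measurements, one tiny and one huge message, both Hockney parameters can be recovered by solving the two linear equations T1 = a + n1*B⁻¹ and T2 = a + n2*B⁻¹. A sketch (the function name and the measurement values in the example are hypothetical):

```c
/* Fit the Hockney parameters from two (message size, time) pairs:
 *   inv_B = (T2 - T1) / (n2 - n1)
 *   a     = T1 - n1 * inv_B
 */
void fit_hockney(double n1, double T1, double n2, double T2,
                 double *a, double *inv_B) {
    *inv_B = (T2 - T1) / (n2 - n1);
    *a = T1 - n1 * (*inv_B);
}
```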
How can bandwidth be expanded?
Adding extra lanes to interconnect fabric
How can latency be improved?
Very difficult, essentially dependent on the speed of light.
Can however be masked, e.g. using MPI_Isend
What is a Symmetric MultiProcessor (SMP, or dancehall architecture)?
Earlier, 1 processor core per chip, clock rates for cpu and memory were comparable, no cache
The structure was memory banks in a grid in the middle, e.g. dimensions 3x3, and 3 CPUs on each side of memory bank.
All memory shared by every CPU
Any CPU can read/write to any memory bank at the same speed (Uniform Memory Access, UMA)
Because of this, any CPU can contact any other at the same cost
This design introduced race conditions
How were race conditions handled in SMPs (Symmetric MultiProcessors)?
Interconnect fabric + cpu supported atomic operations
test-and-set: atomically set a value to 1 and return what it was before (great for spin-locking)
fetch-and-increment: increase a value in memory, return what it was before (great for obtaining ticket numbers in a queue)
fetch-and-add: fetch-and-increment with an arbitrary value
compare-and-swap: check if a value is equal to a supplied value; if it is, exchange it for a number from the CPU; return whether it succeeded
What are the fetch-and-phi operations?
The atomic operations that were implemented in SMPs to handle race conditions
What caused memory access to no longer be uniform in SMPs?
Access to closer memory banks grew faster than access to remote banks
NUMA - non-uniform memory access
What is cache coherent NUMA (ccNUMA)?
Caches are introduced to reduce memory latency gap between close and further away memory.
Works well for the CPU whose cache it is, but makes it worse for the rest
What does SMP now stand for?
Shared memory processor.
No longer symmetric memory access
What is Load Linked/Store Conditional?
An approach to implementing atomic operations
LL fetches a value into a register, and temporarily tags the memory bank it came from
While the value is in the register, the CPU's entire instruction set can manipulate it
SC tries to write the result back to the tagged memory bank. Returns whether it succeeded.
If it fails, the value is not stored, because someone altered it in the meantime.
The program gets to know about the failure (and can retry)
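The load, compute, store-conditionally, retry pattern of LL/SC maps onto the compare-and-swap loops used on machines without LL/SC. A sketch with C11 atomics (the operation, an atomic multiply, is made up to show why a retry loop is needed at all):

```c
#include <stdatomic.h>

/* Atomic "multiply by k": no single instruction does this, so we
 * load, compute, and conditionally store, retrying on interference,
 * just like the LL/SC pattern described above. */
int atomic_scale(atomic_int *v, int k) {
    int old = atomic_load(v);
    /* On failure, the CAS refreshes `old` with the current value
     * and the product is recomputed on the next iteration. */
    while (!atomic_compare_exchange_weak(v, &old, old * k))
        ;
    return old * k;  /* the value we successfully stored */
}
```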
What is atomic reservoir memory?
Separate memory banks are wired directly to the processor, bypassing all caching mechanisms.
Slow, but all read/write are atomic
How can x86_64 ops be made atomic?
Instructions that include a whole read-mod-write cycle in one instruction can be made atomic by placing the keyword ‘lock’ in front of the instruction in the assembly code
Gives exclusive access to either cache line with the value, if no other core has this value, or the entire memory bus.
What is cache coherence?
Problem of keeping multiple CPU caches updated with each others modified values
Solution: Snooping, or directory
What is snooping?
Solution to cache coherence.
Allowing CPUs to listen in on each other's memory traffic via shared branches of the interconnect
Can be combined with write-through and write-back
What is directory?
Solution to cache coherence
Maintaining a centralized registry of cache lines and the various states they’re in
Describe snooping with write-through
Done on machines with a single, shared memory bus as interconnect.
One CPU writes the change through onto the interconnect; the other CPUs listen and copy the change into their own caches. The change is also written back to memory
Describe snooping with write-back
The CPU alerts the memory bus that the cache line is dirty. The other CPUs do not use this cache line until it is clean. Only 1 bit (the dirty bit) is broadcast to the other CPUs
When the changes are written back by the first CPU, memory is updated and the second CPU fetches the fresh copy
What is an issue with snooping?
The cost of broadcasting changes
Only fast for CPU counts small enough that broadcasting stays efficient
Describe the Directory solution to cache coherence
Have a table that records which CPUs have copies of which memory
Needs quite a bit of memory as the table can grow quite large
Entry:
- Entry for each memory block we want to track
- Bits to record state (exclusive-or-shared, modified, uncached)
- bit vector, one bit for each CPU
Give an example of directory entries with full bit vector format
Memory: 1, 2, 3, 4, 5, 6, 7, 8
CPU 0: Cache 3, 4
CPU 1: Cache 3, 4
CPU 2: Cache 5, 6
CPU 3: no cache
Directory (Block - cpus - state):
[1, 2] - 0000 - uncached
[3, 4] - 1100 - shared
[5, 6] - 0010 - exclusive
[7, 8] - 0000 - uncached
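The entry layout above can be sketched as a C struct (the names and the 32-CPU limit are my own choices for illustration):

```c
#include <stdint.h>

enum dir_state { UNCACHED, SHARED, EXCLUSIVE };

/* One full-bit-vector directory entry per tracked memory block. */
struct dir_entry {
    enum dir_state state;
    uint32_t sharers;  /* bit i set => CPU i holds a copy */
};

int cpu_has_copy(const struct dir_entry *e, int cpu) {
    return (e->sharers >> cpu) & 1u;
}
```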
What is a directory with coarse bit vector format?
Each bit in the vector stands for a group of CPUs rather than one CPU, and indicates whether one or more CPUs in that group has a copy
The CPUs that share this memory sort out coherence between themselves
How is memory allocation done for directory systems?
Some systems allocate fixed part of general system RAM. This is fine for smaller systems, but bigger systems will notice that some of their RAM cannot be used for other purposes.
Some systems install additional memory banks for the directory.
What is tiling?
A way to improve cache performance: restructure loops so they work on small blocks (tiles) of the data at a time, small enough that each tile fits in cache and loaded cache lines are reused before they are evicted.
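A common illustration is a tiled matrix transpose, where the inner loop pair stays inside one small block so the cache lines of both arrays get reused (N and TILE are made-up sizes; TILE would be tuned to the cache):

```c
#define N 64
#define TILE 8

/* Transpose in TILE x TILE blocks: each block of `in` and `out`
 * stays cache-resident while all its elements are touched. */
void transpose_tiled(double in[N][N], double out[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    out[j][i] = in[i][j];
}
```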
What can improve cache performance?
Vector registers
Tiling
What are intrinsics?
Constructs that behave like function calls, but that the compiler recognizes as shorthand notation for assembly instructions.
Useful when programming vector registers by hand, in situations where the compiler does not recognize a structure as a vector structure.
__m128d var:
Stands for a 128-bit SIMD register value, can hold 2 doubles
This is a blob of bytes which can be put into a vector register in an instant and fits there.
What is _mm_malloc(size, alignment)?
The alignment argument gives a number that evenly divides the starting address of the allocation.
Vector registers load faster when the addresses they load from are clean multiples of the register size
For a 128-bit register value, the alignment can be 16
How do you transfer data from memory to a SIMD register?
__m128d my_two_doubles = _mm_load_pd(two_doubles)
two_doubles is a pointer to doubles, and must be 16-byte aligned
Load two copies of one double:
__m128d two_copies = _mm_load_pd1(&a)
How are SIMD data moved from registers to memory?
_mm_store_pd(two_doubles, my_two_doubles)
(first argument: the 16-byte-aligned destination address; second: the register value to store)
How to you do pairwise addition/multiplication of two SIMD registers?
__m128d sum = _mm_add_pd(ab, cd)
gives a+c and b+d
__m128d ac_and_bd = _mm_mul_pd(ab, cd)
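Putting the load/add/store cards together (the intrinsic names are the real SSE2 ones; the helper function itself is just a sketch, and both pointers are assumed 16-byte aligned):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Load two pairs of doubles, add them element-wise, store the result. */
void pairwise_add(const double *ab, const double *cd, double *out) {
    __m128d v1 = _mm_load_pd(ab);     /* {a, b} */
    __m128d v2 = _mm_load_pd(cd);     /* {c, d} */
    __m128d s  = _mm_add_pd(v1, v2);  /* {a+c, b+d} */
    _mm_store_pd(out, s);
}
```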
What models are used in processing?
Problem model: What we want to calculate, simplified representation of a real thing
Programming model: Operations used to express calculation. Simplified representation of machine instructions
Processing model: Expectations about what the machine will do, simplified representation of hardware
Actual computer: Actual hardware
What useful thing do Amdahl's and Gustafson's laws model?
Both are performance models
They give realistic estimates of whether or not program performance will improve if we invest in more hardware to run it on
What useful thing does the Hockney model model?
Whether program performance is constrained by the size of its messages or by the number of messages
What are MIPS and FLOPS?
Millions of Instructions Per Second
Floating-point Operations Per Second
What is the issue with benchmarking?
As it is so specific, it does not give a good overview/statistics of actual performance
What is memory bandwidth?
Maximum amount of bytes the interconnect can transport between CPU and memory each second
bytes/second
What is the operational intensity/arithmetic intensity?
Instructions use operands.
Divide the number of operations by the number of bytes they are applied to.
operations / byte
[FLOPS / byte]
At what rate can a program run, given the rate at which it can read and write?
Operational intensity * memory bandwidth
If data transport is not fast enough to supply the processor with data for all the instructions in the program, the program has to wait.
byte/s * Flop/byte = Flop / s
What is a roofline model?
x-axis: Arithmetic intensity (FLOP/byte)
y-axis: Flop / s
The peak performance: a horizontal line at a given y-value. This is the roofline of the graph.
Up to that point, the graph is a straight line whose gradient a is the memory bandwidth:
performance [FLOP/s] = a [bytes/s] * intensity [FLOP/byte]
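The whole graph can be written as one min(): attainable performance is capped either by the memory-bound line or by the flat roof. A sketch (the names are mine; the numbers in the example are made up):

```c
/* Attainable FLOP/s at a given arithmetic intensity:
 * the lower of the memory-bound line and the compute roof. */
double roofline(double peak_flops, double bw_bytes_per_s,
                double intensity_flop_per_byte) {
    double memory_bound = bw_bytes_per_s * intensity_flop_per_byte;
    return memory_bound < peak_flops ? memory_bound : peak_flops;
}
```

The ridge point sits where the two caps meet, at intensity = peak / bandwidth.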
What is the memory bound region in a rooftop graph?
Programs whose intensity makes them run at a speed capped by the memory bandwidth
When the intensity (FLOP/byte) is lower than the ridge point
What is the ridge point in rooftop analysis?
When the FLOP/byte intensity goes from memory-bounding a program to compute bounding it
What is the “compute bound” region in a rooftop analysis?
When the program's intensity is such that it runs at a speed capped by the processor
Where do we find the rooftop of machines?
Theoretical numbers from hardware
Run benchmarks that stress computing capability or memory
What are some properties of dense linear algebra?
Long arrays of consecutive values tightly packed together in memory
Operations are complicated calculations that require many instructions per element
What is sparse linear algebra?
Similar to dense algebra, but many/most of the elements are zero
This makes it meaningless to read them
The data structures we use are instead lists of the indices of non-zero values, and the values themselves
values[], row[], col[]
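With that values[]/row[]/col[] (coordinate) layout, a sparse matrix-vector product only ever touches the stored non-zeros. A sketch (the function name is my own; y must be zeroed by the caller):

```c
/* y += A*x for a sparse A stored as nnz (value, row, col) triples. */
void spmv_coo(int nnz, const double *values,
              const int *row, const int *col,
              const double *x, double *y) {
    for (int k = 0; k < nnz; k++)
        y[row[k]] += values[k] * x[col[k]];
}
```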
What are unstructured grids?
Similar to grids, but without assumption that neighbours are evenly spaced
What are n-body problems?
List of coordinates for things that push each other around (stars, billiard balls)
Bottleneck: finding neighbours
Does everybody affect each other, or do only neighbours affect each other?
What are Monte Carlo methods?
Calculations that approach their solution by accumulating random numbers
Additional numbers should contribute to the solution no matter what their values are
Performance challenges:
- pseudo-random number generators often have sequential dependencies between the numbers
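A classic sketch is estimating pi by sampling random points in the unit square: every extra sample refines the estimate regardless of its value. This toy version deliberately uses the sequential `rand` from the C library, which is exactly the kind of dependent generator that makes parallelization tricky:

```c
#include <stdlib.h>

/* Count how many random points in the unit square fall inside the
 * quarter circle; the inside fraction approaches pi/4. */
double estimate_pi(int samples, unsigned seed) {
    srand(seed);
    int inside = 0;
    for (int i = 0; i < samples; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            inside++;
    }
    return 4.0 * inside / samples;
}
```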
What are the 7 Berkeley kernels?
Dense linear algebra
Sparse linear algebra
Structured grids
Unstructured grids
N-body problems
Monte Carlo methods
Spectral methods
What is multiple issue?
Replicate ALU, extend decoder to dispatch several independent instructions simultaneously
What is Simultaneous MultiThreading (SMT)?
Useful to utilize the different parts of the ALU with instructions that require different calculation logic. Instead of leaving the ALU parts idle, run multiple at the same time.
Replicate instruction pointer and decoding unit
Receive 2 instruction streams.
Merge these together when their needs don't conflict
This allows a 4 core processor to be able to run 8 threads, for example
When will SMT improve performance?
It doesn't happen often that two instruction streams depend only on different ALU units.
Threads often need the same units, making the second one wait.
SMT improves performance in the cases where the threads do not need the same units; statistically, this only happens every so often.
What is superlinear speedup?
Times when we can measure S(p) > p, which goes against Amdahl’s law.
This can happen if a problem is split up into smaller parts, that suddenly fits in a faster part of memory (e.g. a higher level cache)
What is load balancing?
Ways of mitigating unbalanced workloads.
Parallel programs often have synchronization points. If the work is unevenly distributed between processes, one process will lag behind, and the execution time is held back by the slowest process.
What are the 3 kinds of load balancing?
Static:
Embed the partitioning of the program directly into the source code
Semi-static:
Examine the workload on program start, divide it then, and run with this initial partitioning
Dynamic:
Shift work around between participants while program is running
What is the Master/Worker pattern?
A dynamic workload balancing technique.
One rank is the master and maintains a queue of similar-sized tasks.
The rest of the ranks are workers: assigned tasks by the master, or taking tasks from the queue and informing the master
Limited scalability due to centralized control: if the number of workers is too big, they can overwhelm the master.