Theory Flashcards

1
Q

What is easier to improve, bandwidth or latency?

A

Bandwidth

Because of this, when accessing memory, it is better to fetch as much data as we can per access, to make up for the memory latency

2
Q

In caches, what are the size restrictions?

A

Cache block length: must be a power of 2

Number of sets: must be a power of 2

Number of blocks in a set (ways of associativity): any integer

3
Q

What are Bernstein’s conditions for out-of-order execution?

A

Statements S1 and S2 will produce the same result when run in any order if:

  • S1 does not read what S2 writes
  • S1 does not write what S2 reads
  • S1 and S2 do not write to the same place

(Both cases are sketched below.)
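A minimal C sketch of the two cases; all variable names are illustrative:

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, x, y;

    /* Independent: S1 and S2 touch disjoint locations,
       so they may run in either order (or in parallel). */
    x = a + 1;   /* S1: writes x, reads a */
    y = b + 1;   /* S2: writes y, reads b */

    /* Dependent: S2 reads what S1 writes, violating the
       first condition, so the order is fixed. */
    x = a + 1;   /* S1: writes x */
    y = x + 1;   /* S2: reads x  */

    printf("%d %d\n", x, y);
    return 0;
}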
4
Q

What is a superscalar processor?

A

Launch multiple instructions at once (done in the control path)

Execute multiple instructions at once by replicating the ALU, so that there are multiple execution units

5
Q

What are vector registers?

A

Extra-wide registers that can store several consecutive values at once, e.g.:

Vector load: a[0], a[1], a[2], a[3]

Vector registers allow for vector operations, where an operation is applied to all vector elements in parallel

6
Q

What is Flynn’s taxonomy?

A

A way to sort parallel programs into classes

SISD: Von Neumann machine

SIMD: Vector machine

MISD: Theoretical

MIMD: Processes and threads. Fully independent processors, but the program makes them work towards the same goal; they are not synchronized

7
Q

What are the ways to sort parallel programs?

A

Flynn’s taxonomy, and the shared- vs. distributed-memory distinction

8
Q

What are shared and distributed memory?

A

Shared:
Partitions the memory image of a single process among its threads

Distributed:
Works with several processes, each with its own memory

9
Q

What components are allocated to threads?

A

Stack,
Instruction pointer

10
Q

What is invisible communication between threads?

A

When one thread accesses memory outside its own stack, that memory is instantly available to the rest of the threads

11
Q

Pros and cons of shared memory?

A

Pro:
- only 1 copy of shared data
- everyone can modify it (must be coordinated)

Con:
- Cannot run on separate computers (threads live within a process’s address space, so their memory must be managed by one single OS)
- Only get as many threads as fit into one machine

12
Q

Pros and cons of distributed memory?

A

Pros:
- A process’s memory cannot be involuntarily corrupted by others
- Processes only receive what they ask for, and store it in a designated place
- Can run on the same or different computers - if we run out of processors, we can just add a machine

Cons:
- Message passing means copies of data, which can use twice the memory

13
Q

What is multi-computer parallelism?

A

When processes in a distributed memory program run on different computers.

The processes communicate over a network

14
Q

What are some types of interconnects used in shared memory systems?

A

Crossbar: Looks like a grid

(Fat) trees: Give non-uniform memory access (NUMA) effects - remote memory is slower. Processors are associated with the memory that is their responsibility

Mesh: Constant number of links per unit; messages are routed through the network in several hops. Communication latency grows linearly with distance

Torus: Mesh that wraps around the edges

15
Q

What are interconnect fabrics?

A

The combination of interconnect types (graph shapes) found in a computer

16
Q

What are inherently sequential computations?

A

Computations that cannot be parallelized, because instructions are dependent on each other and can’t run in any order.

17
Q

What is the formula for parallel execution time?

A

Tp = f*T + ((1 - f) * T) / p

T: sum of the time costs of all operations in the program

p: number of processors, across which the parallel operations are divided

f: fraction of sequential operations the program requires

(1-f): fraction of operations that can be parallelized

18
Q

What is the formula for speedup?

A

Speedup is how much faster a fast solution runs compared to a slow one

S = Tslow / Tfast

If S = 4, the fast solution is 4 times quicker

19
Q

What is the formula for speedup as a function of number of processors?

A

S(p) = Ts / Tp = (f*T + (1 - f)*T) / (f*T + ((1 - f) * T) / p)

20
Q

What is the formula for sequential time?

A

Ts = fT + (1-f)T = T

21
Q

What is linear speedup?

A

When every operation in a program can be parallelized

S(p) = p

22
Q

What is Amdahl’s law?

A

When a program contains a sequential part, the fastest theoretical execution time it could run at, given infinite processors, would be the time spent executing these operations in sequence.

Assumes constant serial time. Everything is calculated in proportion to an amount of work taking T time sequentially.

If we have an infinite number of available processors, the speedup would reach the limit

lim p->infinity S(p) = 1/f
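As a small sketch, plugging an illustrative sequential fraction f = 0.05 into S(p) = 1/(f + (1-f)/p) (the T's cancel out of Ts/Tp) shows the curve flattening towards 1/f = 20:

#include <stdio.h>

/* Amdahl speedup for sequential fraction f and p processors. */
static double amdahl(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    double f = 0.05;                 /* illustrative value */
    for (int p = 1; p <= 1024; p *= 4)
        printf("S(%4d) = %6.2f\n", p, amdahl(f, p));
    printf("limit 1/f = %.2f\n", 1.0 / f);
    return 0;
}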

23
Q

What is scaled speedup?

A

If we assume constant parallel time:

Tp = fT + (1-f)T

Meaning, every time a processor is added, (1-f)T units of work are added to occupy the additional processor

Bigger problems take longer to run sequentially:

Ts = fT + p(1-f)T

Formula for scaled speedup:
Ss(p) = Ts/Tp = f + p(1-f), using the new formulas for Tp and Ts

24
Q

What is Gustafson’s law?

A

Observes that scaled speedup does not approach any limit as p grows, as long as the problem is sized up in proportion to the machine:

Ss(p) = f + p(1-f)

25
Q

What is the difference between scaled speedup and speedup?

A

Scaled: x times the work can be done in the same amount of time

Speedup: The same amount of work can be done in 1/x time

26
Q

What are scaling curves?

A

The graph of S(p) as a function of p

27
Q

What is strong scaling?

A

The scaling curves of Amdahl’s law.

Best scaling is linear scaling:
S(p) = p

As p grows towards infinity, the curves flatten out towards 1/f: the larger the fraction (f) of sequential operations in the program, the lower the ceiling

Studies of how speedup changes with processor count are said to be carried out in strong scaling mode

28
Q

What is weak scaling?

A

The scaling curves of Gustafson’s law.

They are all linear, but their gradient is not quite 1:

Ss(p) = f + p(1-f) = (1-f)*p + f ; OBS: scaled speedup

The gradient is (1-f), not quite 1, and the intercept is f

Studies of how scaled speedup changes with processor count are carried out in weak scaling mode

29
Q

What is efficiency?

A

E(p) = S(p) / p

Amount of speedup divided equally amongst the processors

Efficiency gives a sense of how much each processor contributed to the speedup

30
Q

What is horizontal scaling?

A

Upgrade the system with more of the same components as were available before

Improves the parallel part of execution

31
Q

What is vertical scaling?

A

Upgrade system with new, more powerful components

Improves sequential part of execution

32
Q

What is an assumption that Gustafson makes?

A

That the parallel workload can grow without needing to increase the sequential part

33
Q

What is the Hockney model?

A

When we know the message size (n), we can estimate the transmission time as the sum of the latency and n times the inverse bandwidth

Latency (a): the time it takes to get between communication points, determined by the distance between them, not the message size

Inverse bandwidth (B⁻¹): seconds/byte

T_comm(n) = a + n*B⁻¹

34
Q

What is the ping-pong test of communication speed?

A

Start clock

Send message from A to B
Send message from B to A
Repeat this a lot of times

Stop clock

Divide the total time by 2 (for the two directions) and by n (the number of repetitions). (Sketched below.)
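A minimal MPI sketch of the test, assuming it is run with at least 2 ranks; the message size and repetition count are arbitrary choices:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    char buf[1] = {0};
    const int reps = 10000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)   /* divide by 2 and by the repetition count */
        printf("one-way message time: %g s\n", elapsed / (2.0 * reps));

    MPI_Finalize();
    return 0;
}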

35
Q

How can we measure latency?

A

Do the ping-pong test with empty or 1-byte messages

The time taken will in this case be dominated by latency

36
Q

How can we measure inverse bandwidth?

A

Ping-pong test with a small number of huge messages

In this case, bandwidth requirements will dominate time taken

37
Q

How can bandwidth be expanded?

A

By adding extra lanes to the interconnect fabric

38
Q

How can latency be improved?

A

Very difficult, essentially dependent on the speed of light.

It can, however, be masked, e.g. using MPI_Isend (sketched below)
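A minimal sketch of latency masking with a nonblocking send; compute_something is a hypothetical placeholder for work that does not depend on the message:

#include <mpi.h>

static void compute_something(void) { /* placeholder for independent work */ }

void exchange(double *buf, int n, int peer) {
    MPI_Request req;
    /* start the send, but do not wait for it... */
    MPI_Isend(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    /* ...do useful work while the message is in flight... */
    compute_something();
    /* ...and only block when the transfer must be complete */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}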

39
Q

What is a Symmetric MultiProcessor (SMP, or dancehall architecture)?

A

Earlier, there was 1 processor core per chip, clock rates for CPU and memory were comparable, and there were no caches

The structure was memory banks in a grid in the middle, e.g. dimensions 3x3, with 3 CPUs on each side of the grid

All memory was shared by every CPU

Any CPU could read/write to any memory bank at the same speed (Uniform Memory Access, UMA)

Because of this, any CPU can contact any other at the same cost

This design introduced race conditions

40
Q

How were race conditions handled in SMPs (Symmetric MultiProcessors)?

A

Interconnect fabric + CPU supported atomic operations (sketched below):

test-and-set: atomically set the value to 1 and return what it was before (great for spin-locking)

fetch-and-increment: increase a value in memory, return what it was before (great for obtaining ticket numbers in a queue)

fetch-and-add: fetch-and-increment with an arbitrary value

compare-and-swap: check if the value is equal to a supplied value; if it is, exchange it for a number from the CPU; return whether it succeeded
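These primitives map naturally onto C11’s <stdatomic.h>; a minimal single-threaded sketch of each (in real use several threads would contend on these variables):

#include <stdatomic.h>
#include <stdio.h>

int main(void) {
    atomic_int lock = 0, ticket = 0, counter = 0;

    /* test-and-set: atomically write 1, get the old value back;
       spin until we saw 0, i.e. until we acquired the lock */
    while (atomic_exchange(&lock, 1) != 0)
        ;                        /* spin */
    atomic_store(&lock, 0);      /* release */

    /* fetch-and-increment: returns the value before the add,
       e.g. a ticket number in a queue */
    int my_ticket = atomic_fetch_add(&ticket, 1);

    /* fetch-and-add with an arbitrary value */
    atomic_fetch_add(&counter, 5);

    /* compare-and-swap: install 42 only if counter is still 5;
       reports whether it succeeded */
    int expected = 5;
    int ok = atomic_compare_exchange_strong(&counter, &expected, 42);

    printf("ticket=%d ok=%d counter=%d\n",
           my_ticket, ok, (int)atomic_load(&counter));
    return 0;
}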

41
Q

What are the fetch-and-phi operations?

A

The atomic operations that were implemented in SMPs to handle race conditions

42
Q

What caused memory access to no longer be uniform in SMPs?

A

Access to closer memory banks grew faster than access to remote banks

NUMA - non-uniform memory access

43
Q

What is cache coherent NUMA (ccNUMA)?

A

Caches are introduced to reduce the memory latency gap between close and remote memory.

This works well for 1 CPU, but makes things worse for the rest: their view of memory must now be kept coherent

44
Q

What does SMP now stand for?

A

Shared memory processor.

No longer symmetric memory access

45
Q

What is Load Linked/Store Conditional?

A

An approach to atomic operations.

LL fetches a value into a register, and temporarily tags the memory bank it came from

While the value is in a register, the CPU’s entire instruction set can manipulate it

SC tries to write the result back to the tagged memory bank, and returns whether it succeeded.
If it fails, the value is not stored, because someone altered it in the meantime.

The program gets to know about the failure, and can retry (sketched below)
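C does not expose LL/SC directly, but a sketch of the usual retry loop shows the idea; on ARM or RISC-V the compiler typically lowers this compare-and-swap loop to LL/SC instructions:

#include <stdatomic.h>

/* Atomically double a value, retrying until the store succeeds. */
int atomic_double(atomic_int *v) {
    int old, new;
    do {
        old = atomic_load(v);    /* "load linked": fetch and remember */
        new = old * 2;           /* manipulate freely in registers    */
        /* "store conditional": fails if someone altered *v meanwhile */
    } while (!atomic_compare_exchange_weak(v, &old, new));
    return new;
}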

46
Q

What is atomic reservoir memory?

A

Separate memory banks are wired directly to the processor, bypassing all caching mechanisms.

Slow, but all reads/writes are atomic

47
Q

How can x86_64 ops be made atomic?

A

Instructions that contain a whole read-modify-write cycle in one instruction can be made atomic by placing the prefix ‘lock’ in front of the instruction in the assembly code (sketched below).

This gives exclusive access to either the cache line with the value (if no other core has it cached) or the entire memory bus.
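A minimal GNU C sketch, assuming x86_64 and GCC/Clang inline assembly, of the same increment with and without the prefix:

static void incr(long *p) {
    /* plain read-modify-write: two cores can interleave and lose updates */
    __asm__ volatile("incq %0" : "+m"(*p));
}

static void atomic_incr(long *p) {
    /* the lock prefix makes the whole cycle atomic */
    __asm__ volatile("lock incq %0" : "+m"(*p));
}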

48
Q

What is cache coherence?

A

The problem of keeping multiple CPU caches updated with each other’s modified values

Solutions: snooping, or a directory

49
Q

What is snooping?

A

Solution to cache coherence.

Allowing CPUs to listen in on each other’s memory traffic via shared branches of the interconnect

Can be combined with write-through and write-back

50
Q

What is a directory?

A

Solution to cache coherence

Maintaining a centralized registry of cache lines and the various states they’re in

51
Q

Describe snooping with write-through

A

Done on machines with a single, shared memory bus as the interconnect.

One CPU writes the change through onto the interconnect; when this happens, the other CPUs listen in and copy the change into their own caches. The change is also written back to memory

52
Q

Describe snooping with write-back

A

The CPU alerts the memory bus that the cache line is dirty. The other CPUs do not use this cache line until it is clean. Only 1 bit is broadcast to the other CPUs (the dirty bit)

When the changes are written back by the first CPU, the memory is updated and the other CPUs fetch the fresh copy

53
Q

What is an issue with snooping?

A

The cost of broadcasting changes

It is only fast for CPU counts small enough that broadcasting stays efficient

54
Q

Describe the Directory solution to cache coherence

A

Have a table that records which CPUs have copies of which memory blocks

Needs quite a bit of memory, as the table can grow quite large

Entry:
- One entry for each memory block we want to track
- Bits to record state (exclusive-or-shared, modified, uncached)
- A bit vector, one bit for each CPU

55
Q

Give an example of directory entries in full bit vector format

A

Memory: 1, 2, 3, 4, 5, 6, 7, 8

CPU 0: Cache 3, 4
CPU 1: Cache 3, 4
CPU 2: Cache 5, 6
CPU 3: no cache

Directory (Block - cpus - state):
[1, 2] - 0000 - uncached
[3, 4] - 1100 - shared
[5, 6] - 0010 - exclusive
[7, 8] - 0000 - uncached

56
Q

What is a directory with coarse bit vector format?

A

Each bit in the vector covers a group of CPUs, and indicates whether one or more CPUs in that group have a copy

The CPUs that share this memory sort out coherence between themselves

57
Q

How is memory allocation done for directory systems?

A

Some systems allocate a fixed part of general system RAM. This is fine for smaller systems, but bigger systems will notice that some of their RAM cannot be used for other purposes.

Some systems install additional memory banks for the directory.

58
Q

What is tiling?

A

A way to improve cache performance: restructure loops to work on cache-sized blocks of the data, so that values are reused while they are still in cache (sketched below).
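A sketch of tiled matrix multiplication; N and TILE are illustrative, and TILE is assumed to divide N evenly:

#define N 512
#define TILE 64

/* Process one TILE x TILE block at a time, so the blocks of A and B
   stay in cache while they are being reused. */
void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++)
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += A[i][k] * B[k][j];
}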

59
Q

What can improve cache performance?

A

Vector registers

Tiling

60
Q

What are intrinsics?

A

Constructs that behave like function calls, but that the compiler recognizes as shorthand notation for assembly instructions.

Useful when programming vector registers by hand, in situations where the compiler does not recognize a structure as a vector structure.

__m128d var:

Stands for a 128-bit SIMD register value; it can hold 2 doubles.
This is a blob of bytes which can be put into a vector register in an instant and fits there exactly.

61
Q

What is _mm_malloc(size, alignment)?

A

The alignment argument gives a number that must evenly divide the starting address of the allocation.

Vector registers load faster when the addresses they load from are clean multiples of the register size.

For 128-bit register values, the alignment can be 16.

62
Q

How do you transfer data from memory to a SIMD register?

A

__m128d my_two_doubles = _mm_load_pd(two_doubles);

two_doubles is an array of doubles whose address must be 16-byte aligned

Load two copies of one double:
__m128d two_copies = _mm_load_pd1(&a);

63
Q

How are SIMD data moved from registers to memory?

A

_mm_store_pd(two_doubles, my_two_doubles);

(first argument: the 16-byte-aligned destination address; second argument: the register value)

64
Q

How do you do element-wise addition/multiplication of two SIMD registers?

A

__m128d sum = _mm_add_pd(ab, cd);

With ab = {a, b} and cd = {c, d}, this gives {a+c, b+d}

__m128d ac_and_bd = _mm_mul_pd(ab, cd);  /* {a*c, b*d} */
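Putting the last few cards together, a minimal runnable sketch (assuming an x86 compiler with the SSE2 headers):

#include <stdio.h>
#include <emmintrin.h>   /* SSE2: __m128d and the _mm_*_pd intrinsics */

int main(void) {
    /* 16-byte-aligned allocation, so the aligned loads below are legal */
    double *v = _mm_malloc(4 * sizeof(double), 16);
    v[0] = 1.0; v[1] = 2.0; v[2] = 3.0; v[3] = 4.0;

    __m128d ab  = _mm_load_pd(&v[0]);    /* {1.0, 2.0} */
    __m128d cd  = _mm_load_pd(&v[2]);    /* {3.0, 4.0} */
    __m128d sum = _mm_add_pd(ab, cd);    /* {1+3, 2+4} = {4, 6} */

    _mm_store_pd(&v[0], sum);            /* write the register back */
    printf("%g %g\n", v[0], v[1]);       /* prints: 4 6 */

    _mm_free(v);
    return 0;
}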

65
Q

What models are used in processing?

A

Problem model: What we want to calculate, simplified representation of a real thing

Programming model: Operations used to express calculation. Simplified representation of machine instructions

Processing model: Expectations about what the machine will do, simplified representation of hardware

Actual computer: Actual hardware

66
Q

What useful thing do Amdahl’s and Gustafson’s laws model?

A

Both are performance models

They give realistic estimates of whether or not program performance will improve if we invest in more hardware to run it on

67
Q

What useful thing does the Hockney model model?

A

Whether program performance is constrained by the size of its messages or by the number of messages

68
Q

What are MIPS and FLOPS?

A

Millions of Instructions Per Second

Floating-point operations per second

69
Q

What is the issue with benchmarking?

A

As it is so specific, it does not give a good overview/statistics for actual performance

70
Q

What is memory bandwidth?

A

Maximum amount of bytes the interconnect can transport between CPU and memory each second

bytes/second

71
Q

What is the operational intensity/arithmetic intensity?

A

Instructions use operands.
Divide the number of operations by the number of bytes they are applied to:

operations / byte
[FLOP / byte]

72
Q

What is the rate at which a program can run, given the rate at which it can read and write?

A

Operational intensity * memory bandwidth

If the data transport is not fast enough to supply the processor with data for all the instructions in the program, the program has to wait.

FLOP/byte * byte/s = FLOP/s

73
Q

What is a roofline model?

A

x-axis: Arithmetic intensity (FLOP/byte)

y-axis: FLOP/s

Peak performance: a horizontal line at a given y-value. This is the roofline of the graph.

Below that, the graph is a sloped line, because the attainable performance is:

FLOP/s = intensity (FLOP/byte) * memory bandwidth (bytes/s)

so the gradient of the sloped part is the memory bandwidth
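A minimal sketch of the model: attainable performance is the smaller of the compute cap and the memory cap (the peak and bandwidth numbers are made up for illustration):

#include <stdio.h>

/* attainable FLOP/s = min(peak, intensity * bandwidth) */
static double attainable(double intensity, double peak, double bw) {
    double mem_cap = intensity * bw;
    return mem_cap < peak ? mem_cap : peak;
}

int main(void) {
    double peak = 100e9;   /* 100 GFLOP/s, illustrative */
    double bw   = 25e9;    /* 25 GB/s, illustrative     */
    printf("ridge point at %.1f FLOP/byte\n", peak / bw);
    for (double i = 0.5; i <= 16.0; i *= 2.0)
        printf("I = %5.1f -> %6.1f GFLOP/s\n",
               i, attainable(i, peak, bw) / 1e9);
    return 0;
}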

74
Q

What is the memory-bound region in a roofline graph?

A

Programs with an intensity that makes them run at a speed capped by the memory bandwidth

This is the region where the FLOP/byte intensity is below the ridge point

75
Q

What is the ridge point in roofline analysis?

A

The FLOP/byte intensity at which a program goes from being memory-bound to being compute-bound

76
Q

What is the “compute-bound” region in a roofline analysis?

A

When the program’s intensity is such that the program runs at a speed capped by the processor

77
Q

Where do we find the roofline of a machine?

A

Theoretical numbers from hardware

Run benchmarks that stress computing capability or memory

78
Q

What are some properties of dense linear algebra?

A

Long arrays of consecutive values tightly packed together in memory

Operations are complicated calculations that require many instructions per element

79
Q

What is sparse linear algebra?

A

Similar to dense linear algebra, but many/most of the elements are zero

This makes it meaningless to read them

The data structures we use are instead lists of the indices of the non-zero values, plus the values themselves (sketched below):

values[], row[], col[]
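A sketch of a matrix-vector product y = A*x over this coordinate format; the array names follow the card above:

/* y = A*x, where A is stored as nnz (row, col, value) triples */
void spmv_coo(int nnz, const double values[], const int row[],
              const int col[], const double x[], double y[], int nrows) {
    for (int i = 0; i < nrows; i++)
        y[i] = 0.0;
    /* only the non-zero elements are ever read */
    for (int k = 0; k < nnz; k++)
        y[row[k]] += values[k] * x[col[k]];
}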

80
Q

What are unstructured grids?

A

Similar to structured grids, but without the assumption that neighbours are evenly spaced

81
Q

What are n-body problems?

A

Lists of coordinates for things that push each other around (stars, billiard balls)

Bottleneck: finding neighbours.
Does everybody affect everybody else, or do only neighbours affect each other?

82
Q

What are Monte Carlo methods?

A

Calculations that approach their solution by accumulating random numbers

Additional numbers should contribute to the solution no matter what their values are (sketched below)

Performance challenges:
- pseudo-random number generators often have sequential dependencies between the numbers they produce
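A minimal sketch: estimating pi from random points in the unit square; every extra sample refines the estimate, whatever its value. rand() also illustrates the dependence problem, since each call depends on hidden state left by the previous one:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long n = 1000000;
    long hits = 0;
    srand(42);
    for (long i = 0; i < n; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0)   /* inside the quarter circle? */
            hits++;
    }
    printf("pi ~ %f\n", 4.0 * hits / n);
    return 0;
}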

83
Q

What are the 7 Berkeley kernels?

A

Dense linear algebra
Sparse linear algebra
Structured grids
Unstructured grids
N-body problems
Monte Carlo methods
Spectral methods

84
Q

What is multiple issue?

A

Replicate the ALU, and extend the decoder to dispatch several independent instructions simultaneously

85
Q

What is Simultaneous MultiThreading (SMT)?

A

Useful to utilize the different parts of the ALU with instructions that require different calculation logic. Instead of leaving ALU parts idle, run multiple instruction streams at the same time.

Replicate the instruction pointer and the decoding unit.

Receive 2 instruction streams.

Merge these together when their needs don’t conflict.

This allows a 4-core processor to run 8 threads, for example

86
Q

When will SMT improve performance?

A

It doesn’t happen often that two instruction streams only depend on different ALU units.

Threads often need the same units, making the second one wait.

SMT improves performance in the rare cases where the threads do not need the same units. Statistically, this only happens every so often.

87
Q

What is superlinear speedup?

A

Times when we can measure S(p) > p, which goes against Amdahl’s law.

This can happen if a problem is split up into smaller parts that suddenly fit in a faster part of memory (e.g. a higher-level cache)

88
Q

What is load balancing?

A

Ways of mitigating an unbalanced workload.

Parallel programs often have synchronization points. If the work is unevenly distributed between processes, one process will lag behind, and the execution time is held back by the slowest process.

89
Q

What are the 3 kinds of load balancing?

A

Static:
Embed the partitioning of the program directly into the source code

Semi-static:
Examine the workload at program start, divide it then, and run with this initial partitioning

Dynamic:
Shift work around between participants while program is running

90
Q

What is the Master/Worker pattern?

A

A dynamic workload balancing technique (sketched below).

One rank is the master and maintains a queue of similar-sized tasks.

The rest of the ranks are workers. They are assigned tasks by the master, or take tasks from the queue and inform the master.

Limited scalability due to centralized control: if the number of workers is too big, they can overwhelm the master.
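A minimal MPI sketch of the pattern, with task IDs handed out one at a time and -1 as a hypothetical stop signal; do_task is a placeholder:

#include <mpi.h>

static void do_task(int task) { (void)task; /* placeholder for real work */ }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int ntasks = 100;
    if (rank == 0) {                      /* master */
        int next = 0, stopped = 0;
        while (stopped < size - 1) {
            int dummy;
            MPI_Status st;
            /* wait for any worker to ask for work */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &st);
            int task = (next < ntasks) ? next++ : -1;
            if (task < 0) stopped++;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
        }
    } else {                              /* worker */
        for (;;) {
            int ask = 0, task;
            MPI_Send(&ask, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (task < 0) break;          /* stop signal */
            do_task(task);
        }
    }
    MPI_Finalize();
    return 0;
}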
