Theory Flashcards
What is easier to improve, bandwidth or latency?
Bandwidth
Because of this, when accessing memory, it is better to fetch as much data as we can per access, to make up for the memory latency
In caches, what are the size restrictions?
Cache block length: power of 2
Ways of associativity: Number of sets is a power of 2
n_blocks in a set: any integer
What are Bernstein’s conditions for out-of-order execution?
Statements S1 and S2 will produce the same result when run in any order if:
- S1 does not read what S2 writes
- S1 does not write what S2 reads
- S1 and S2 do not write to the same place
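As a sketch in C (the statements and variable names are illustrative, not from any particular program), the first pair of statements satisfies all three conditions and may be reordered, while the second pair violates the first condition:

```c
/* Independent pair: S1 reads {x, y}, writes {a};
 *                   S2 reads {x},    writes {b}.
 * No condition is violated, so the order does not matter. */
int independent(int x, int y, int *a, int *b) {
    *a = x + y;   /* S1 */
    *b = x * 2;   /* S2 */
    return *a + *b;
}

/* Dependent pair: S2 reads what S1 writes (a), violating the first
 * condition, so swapping S1 and S2 would change the result. */
int dependent(int x, int y) {
    int a = x + y;   /* S1 */
    int b = a * 2;   /* S2 */
    return b;
}
```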
What is a superscalar processor?
Launch multiple instructions at once (done in the control path)
Execute multiple instructions at once by replicating the ALU, so that there are multiple execution units
What are vector registers?
Extra-wide registers that can store several consecutive values at once, e.g.:
Vector load: a[0], a[1], a[2], a[3]
Vector registers allow for vector operations, where operations are done on all vector elements in parallel
What is Flynn’s taxonomy?
Ways to sort parallel programs
SISD: Von Neumann
SIMD: Vector machine
MISD: Theoretical
MIMD: Processes and threads. Fully independent processors, but the program makes them work towards the same goal; they are not synchronized
What are the ways to sort parallel programs?
Flynn’s taxonomy and shared-distributed memory
What is shared- and distributed memory?
Shared:
Several threads work within the memory image of a single process
Distributed:
Works with several processes, each with their own memory
What components are allocated to threads?
Stack,
Instruction pointer
What is invisible communication between threads?
When one thread accesses memory outside its own stack, the change is instantly visible to the rest of the threads
Pros and cons of shared memory?
Pro:
- only 1 copy of shared data
- everyone can modify this (must be coordinated)
Con:
- Cannot be on separate computers (thread lives within a process’ address space - so their memory must be managed by one single OS)
- Only get as many threads as fit into one machine
Pros and cons of distributed memory?
Pros:
- Their memory cannot be involuntarily corrupted
- only receive what they ask for, and store this in a designated place
- Can run on same or different computers - if we run out of processors, can just add a machine
Cons:
- message passing means shared data exists in multiple copies, which can use twice the memory
What is multi-computer parallelism?
When processes in a distributed memory program run on different computers.
Communicate over network
What are some types of interconnects used in shared memory systems?
Crossbar: Looks like a grid
(Fat) trees: Give non-uniform memory access (NUMA) effects: remote memory is slower. Processors are associated with the memory that is their responsibility
Mesh: Constant number of links per unit; messages are routed through the network over several hops. Communication latency grows linearly with distance
Torus: Mesh that wraps around the edges
What are interconnect fabrics?
The combination of interconnect types (graph shapes) found in a computer
What are inherently sequential computations?
Computations that cannot be parallelized, because instructions are dependent on each other and can’t run in any order.
What is the formula for parallel execution time?
Tp = f*T + ((1 - f) * T) / p
T: Sum of the time costs of all operations in the program
p: number of processors, where the parallel operations are divided
(1-f): fraction of operations that can be parallelized
f: fraction of sequential operations the program requires
What is the formula for speedup?
Speedup is how much faster a fast solution runs compared to a slow one
S = Tslow / Tfast
If S = 4, the faster solution is 4 times quicker
What is the formula for speedup as a function of number of processors?
S(p) = Ts/Tp = (f*T + (1-f)*T) / (f*T + ((1 - f) * T) / p)
What is the formula for sequential time?
Ts = fT + (1-f)T
What is linear speedup?
When every operation in a program can be parallelized
S(p) = p
What is Amdahl’s law?
When a program contains a sequential part, the fastest theoretical execution time it could run at, given infinite processors, would be the time spent executing these operations in sequence.
Assumes constant serial time. Everything is calculated in proportion to an amount of work taking T time sequentially.
If we have an infinite number of available processors, the speedup would reach the limit
lim p->infinity S(p) = 1/f
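Dividing Ts by Tp and cancelling T gives S(p) = 1 / (f + (1-f)/p), which is easy to sanity-check in code. A minimal sketch (the function name is my own):

```c
/* Speedup under Amdahl's law: S(p) = 1 / (f + (1 - f)/p),
 * where f is the sequential fraction and p the processor count.
 * This is Ts/Tp with the total work T cancelled out. */
double amdahl_speedup(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}
```

With f = 0.1, the speedup never exceeds 1/f = 10, no matter how large p grows.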
What is scaled speedup?
If we assume constant parallel time:
Tp = fT + (1-f)*T
Meaning, every time a processor is added, (1-f)*T units of work are added to occupy the additional processor
Bigger problems take longer to run sequentially:
Ts = fT + p(1-f)*T
Formula for scaled speedup:
Ss(p) = Ts/Tp = f + p(1-f), using new formula for Tp and Ts
What is Gustafson’s law?
Observes that scaled speedup does not approach any limit as p grows, as long as the problem is sized up in proportion to the machine:
Ss(p) = f + p(1-f)
What is the difference between scaled speedup and speedup?
Scaled: x times the work can be done in the same amount of time
Speedup: The same amount of work can be done in 1/x time
What are scaling curves?
The graph of S(p) given p
What is strong scaling?
The scaling curves of Amdahl’s law.
Best scaling is linear scaling:
S(p) = p
As p goes towards infinity, the curves approach the limit 1/f; the larger the fraction (f) of sequential operations in the program, the lower that limit
Studies of how speedup changes with processor count are said to be carried out in the strong mode
What is weak scaling?
The scaling curves of Gustafson's law.
They are all linear, but their gradient is not quite 1
Ss(p) = f + (1-f)p, a line in p with intercept f
The gradient is (1-f), not quite 1
Studies of how scaled speedup changes with processor count are carried out in weak mode
What is efficiency?
E(p) = S(p) / p
Amount of speedup divided equally amongst the processors
Efficiency gives a sense of how much one processor contributed to the speedup
What is horizontal scaling?
Upgrade system with more of the same components as available before
Improves the parallel part of execution
What is vertical scaling?
Upgrade system with new, more powerful components
Improves sequential part of execution
What is an assumption that Gustafson makes?
That the parallel workload can grow without needing to increase the sequential part
What is the Hockney model?
When we know the message size (n), we can estimate transmission time as the sum of the latency and n times the inverse bandwidth
Latency (a): Time it takes to get between communication points; decided by the distance between them, not the message size
Inverse bandwidth (B⁻1): seconds/byte
T_comm(n) = a + n*B⁻1
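The formula transcribes directly into code; a minimal sketch (parameter names follow the card, the values used in any example are made up):

```c
/* Hockney model: estimated transmission time for an n-byte message,
 * given latency a (seconds) and inverse bandwidth inv_B (seconds/byte). */
double t_comm(double a, double inv_B, double n) {
    return a + n * inv_B;
}
```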
What is the ping-pong test of communication speed?
Start clock
Send message from A to B
Send message from B to A
Repeat this a lot of times
Stop clock
Divide the time by 2 (for both directions) and by the number of repetitions
How can we measure latency?
Do ping-pong test with empty- or 1-byte messages
The time taken will in this case be dominated by latency
How can we measure inverse bandwidth?
Ping-pong test with a smaller number of huge messages
In this case, bandwidth requirements will dominate time taken
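Given two ping-pong measurements, one tiny and one huge message, both Hockney parameters can be recovered by solving the two linear equations T1 = a + n1*B⁻¹ and T2 = a + n2*B⁻¹. A sketch (the function name and the measurement values in the example are hypothetical):

```c
/* Fit the Hockney parameters from two (message size, time) pairs:
 *   inv_B = (T2 - T1) / (n2 - n1)
 *   a     = T1 - n1 * inv_B
 */
void fit_hockney(double n1, double T1, double n2, double T2,
                 double *a, double *inv_B) {
    *inv_B = (T2 - T1) / (n2 - n1);
    *a = T1 - n1 * (*inv_B);
}
```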
How can bandwidth be expanded?
Adding extra lanes to interconnect fabric
How can latency be improved?
Very difficult, essentially dependent on the speed of light.
Can however be masked, e.g. using MPI_Isend
What is a Symmetric MultiProcessor (SMP, or dancehall architecture)?
Earlier, 1 processor core per chip, clock rates for cpu and memory were comparable, no cache
The structure was memory banks in a grid in the middle, e.g. dimensions 3x3, and 3 CPUs on each side of memory bank.
All memory shared by every CPU
Any CPU can read/write to any memory bank at the same speed (Uniform Memory Access, UMA)
Because of this, any CPU can contact any other at the same cost
This design introduced race conditions
How were race conditions handled in SMPs (Symmetric MultiProcessors)?
Interconnect fabric + cpu supported atomic operations
test-and-set: atomically set a value to 1 and return what it was before (great for spin-locking)
fetch-and-increment: increase a value in memory, return what it was before (great for obtaining ticket numbers in a queue)
fetch-and-add: fetch-and-increment with an arbitrary value
compare-and-swap: check if a value is equal to a supplied value; if it is, exchange it for a number from the CPU; return whether it succeeded
What are the fetch-and-phi operations?
The atomic operations that were implemented in SMPs to handle race conditions
What caused memory access to no longer be uniform in SMPs?
Access to closer memory banks grew faster than access to remote banks
NUMA - non-uniform memory access
What is cache coherent NUMA (ccNUMA)?
Caches are introduced to reduce memory latency gap between close and further away memory.
Works well for the CPU whose cache it is, but makes it worse for the rest
What does SMP now stand for?
Shared memory processor.
No longer symmetric memory access
What is Load Linked/Store Conditional?
An approach to implementing atomic operations
LL fetches a value into a register, and temporarily tags the memory bank it came from
While the value is in the register, the CPU's entire instruction set can manipulate it
SC tries to write the result back to the tagged memory bank. Returns whether it succeeded.
If it fails, the value is not stored, because someone altered it in the meantime.
The program gets to know about the failure (and can retry)
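The load, compute, store-conditionally, retry pattern of LL/SC maps onto the compare-and-swap loops used on machines without LL/SC. A sketch with C11 atomics (the operation, an atomic multiply, is made up to show why a retry loop is needed at all):

```c
#include <stdatomic.h>

/* Atomic "multiply by k": no single instruction does this, so we
 * load, compute, and conditionally store, retrying on interference,
 * just like the LL/SC pattern described above. */
int atomic_scale(atomic_int *v, int k) {
    int old = atomic_load(v);
    /* On failure, the CAS refreshes `old` with the current value
     * and the product is recomputed on the next iteration. */
    while (!atomic_compare_exchange_weak(v, &old, old * k))
        ;
    return old * k;  /* the value we successfully stored */
}
```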
What is atomic reservoir memory?
Separate memory banks are wired directly to the processor, bypassing all caching mechanisms.
Slow, but all read/write are atomic
How can x86_64 ops be made atomic?
Instructions that include a whole read-mod-write cycle in one instruction can be made atomic by placing the keyword ‘lock’ in front of the instruction in the assembly code
Gives exclusive access to either cache line with the value, if no other core has this value, or the entire memory bus.
What is cache coherence?
Problem of keeping multiple CPU caches updated with each others modified values
Solution: Snooping, or directory
What is snooping?
Solution to cache coherence.
Allowing CPUs to listen in on each other's memory traffic via shared branches of the interconnect
Can be combined with write-through and write-back
What is directory?
Solution to cache coherence
Maintaining a centralized registry of cache lines and the various states they’re in
Describe snooping with write-through
Done on machines with a single, shared memory bus as interconnect.
One CPU writes the change through onto the interconnect; the other CPUs listen and copy the change into their own caches. The change is also written back to memory
Describe snooping with write-back
The CPU alerts the memory bus that the cache line is dirty. The other CPUs do not use this cache line until it is clean. Only 1 bit (the dirty bit) is broadcast to the other CPUs
When the changes are written back by the first CPU, memory is updated and the second CPU fetches the fresh copy
What is an issue with snooping?
The cost of broadcasting changes
Only fast for CPU counts small enough that broadcasting stays efficient
Describe the Directory solution to cache coherence
Have a table that records which CPUs have copies of which memory
Needs quite a bit of memory as the table can grow quite large
Entry:
- Entry for each memory block we want to track
- Bits to record state (exclusive-or-shared, modified, uncached)
- bit vector, one bit for each CPU
Give an example of directory entries with full bit vector format
Memory: 1, 2, 3, 4, 5, 6, 7, 8
CPU 0: Cache 3, 4
CPU 1: Cache 3, 4
CPU 2: Cache 5, 6
CPU 3: no cache
Directory (Block - cpus - state):
[1, 2] - 0000 - uncached
[3, 4] - 1100 - shared
[5, 6] - 0010 - exclusive
[7, 8] - 0000 - uncached
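The entry layout above can be sketched as a C struct (the names and the 32-CPU limit are my own choices for illustration):

```c
#include <stdint.h>

enum dir_state { UNCACHED, SHARED, EXCLUSIVE };

/* One full-bit-vector directory entry per tracked memory block. */
struct dir_entry {
    enum dir_state state;
    uint32_t sharers;  /* bit i set => CPU i holds a copy */
};

int cpu_has_copy(const struct dir_entry *e, int cpu) {
    return (e->sharers >> cpu) & 1u;
}
```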
What is a directory with coarse bit vector format?
Each bit in the vector stands for a group of CPUs rather than one CPU, and indicates whether one or more CPUs in that group has a copy
The CPUs that share this memory sort out coherence between themselves
How is memory allocation done for directory systems?
Some systems allocate fixed part of general system RAM. This is fine for smaller systems, but bigger systems will notice that some of their RAM cannot be used for other purposes.
Some systems install additional memory banks for the directory.
What is tiling?
A way to improve cache performance: restructure loops so they work on small blocks (tiles) of the data at a time, small enough that each tile fits in cache and loaded cache lines are reused before they are evicted.
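A common illustration is a tiled matrix transpose, where the inner loop pair stays inside one small block so the cache lines of both arrays get reused (N and TILE are made-up sizes; TILE would be tuned to the cache):

```c
#define N 64
#define TILE 8

/* Transpose in TILE x TILE blocks: each block of `in` and `out`
 * stays cache-resident while all its elements are touched. */
void transpose_tiled(double in[N][N], double out[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    out[j][i] = in[i][j];
}
```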
What can improve cache performance?
Vector registers
Tiling
What are intrinsics?
Constructs that behave like function calls, but that the compiler recognizes as shorthand notation for assembly instructions.
Useful when programming vector registers by hand, in situations where the compiler does not recognize a structure as a vector structure.
__m128d var:
Stands for a 128-bit SIMD register value, can hold 2 doubles
This is a blob of bytes which can be put into a vector register in an instant and fits there.
What is _mm_malloc(size, alignment)?
The alignment argument gives a number that evenly divides the starting address of the allocation.
Vector registers load faster when the addresses they load from are clean multiples of the register size
For a 128-bit register value, the alignment can be 16
How do you transfer data from memory to a SIMD register?
__m128d my_two_doubles = _mm_load_pd(two_doubles)
two_doubles is a pointer to doubles, and must be 16-byte aligned
Load two copies of one double:
__m128d two_copies = _mm_load_pd1(&a)
How are SIMD data moved from registers to memory?
_mm_store_pd(two_doubles, my_two_doubles)
(first argument: the 16-byte-aligned destination address; second: the register value to store)
How to you do pairwise addition/multiplication of two SIMD registers?
__m128d sum = _mm_add_pd(ab, cd)
gives a+c and b+d
__m128d ac_and_bd = _mm_mul_pd(ab, cd)
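Putting the load/add/store cards together (the intrinsic names are the real SSE2 ones; the helper function itself is just a sketch, and both pointers are assumed 16-byte aligned):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Load two pairs of doubles, add them element-wise, store the result. */
void pairwise_add(const double *ab, const double *cd, double *out) {
    __m128d v1 = _mm_load_pd(ab);     /* {a, b} */
    __m128d v2 = _mm_load_pd(cd);     /* {c, d} */
    __m128d s  = _mm_add_pd(v1, v2);  /* {a+c, b+d} */
    _mm_store_pd(out, s);
}
```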
What models are used in processing?
Problem model: What we want to calculate, simplified representation of a real thing
Programming model: Operations used to express calculation. Simplified representation of machine instructions
Processing model: Expectations about what the machine will do, simplified representation of hardware
Actual computer: Actual hardware
What useful thing do Amdahl's and Gustafson's laws model?
Both are performance models
They give realistic estimates of whether or not program performance will improve if we invest in more hardware to run it on
What useful thing does the Hockney model model?
Whether program performance is constrained by the size of its messages or by the number of messages
What are MIPS and FLOPS?
Millions of Instructions Per Second
Floating-point Operations Per Second
What is the issue with benchmarking?
As it is so specific, it does not give a good overview/statistics of actual performance
What is memory bandwidth?
Maximum amount of bytes the interconnect can transport between CPU and memory each second
bytes/second
What is the operational intensity/arithmetic intensity?
Instructions use operands.
Divide the number of operations by the number of bytes they are applied to.
operations / byte
[FLOPS / byte]
At what rate can a program run, given the rate at which it can read and write?
Operational intensity * memory bandwidth
If data transport is not fast enough to supply the processor with data for all the instructions in the program, the program has to wait.
byte/s * Flop/byte = Flop / s
What is a roofline model?
x-axis: Arithmetic intensity (FLOP/byte)
y-axis: Flop / s
The peak performance: a horizontal line at a given y-value. This is the roofline of the graph.
Up to that point, the graph is a straight line whose gradient a is the memory bandwidth:
performance [FLOP/s] = a [bytes/s] * intensity [FLOP/byte]
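The whole graph can be written as one min(): attainable performance is capped either by the memory-bound line or by the flat roof. A sketch (the names are mine; the numbers in the example are made up):

```c
/* Attainable FLOP/s at a given arithmetic intensity:
 * the lower of the memory-bound line and the compute roof. */
double roofline(double peak_flops, double bw_bytes_per_s,
                double intensity_flop_per_byte) {
    double memory_bound = bw_bytes_per_s * intensity_flop_per_byte;
    return memory_bound < peak_flops ? memory_bound : peak_flops;
}
```

The ridge point sits where the two caps meet, at intensity = peak / bandwidth.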
What is the memory bound region in a rooftop graph?
Programs whose intensity makes them run at a speed capped by the memory bandwidth
When the intensity (FLOP/byte) is lower than the ridge point
What is the ridge point in rooftop analysis?
When the FLOP/byte intensity goes from memory-bounding a program to compute bounding it
What is the “compute bound” region in a rooftop analysis?
When the program's intensity is such that it runs at a speed capped by the processor
Where do we find the rooftop of machines?
Theoretical numbers from hardware
Run benchmarks that stress computing capability or memory
What are some properties of dense linear algebra?
Long arrays of consecutive values tightly packed together in memory
Operations are complicated calculations that require many instructions per element
What is sparse linear algebra?
Similar to dense algebra, but many/most of the elements are zero
This makes it meaningless to read them
The data structures we use are instead lists of the indices of non-zero values, and the values themselves
values[], row[], col[]
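With that values[]/row[]/col[] (coordinate) layout, a sparse matrix-vector product only ever touches the stored non-zeros. A sketch (the function name is my own; y must be zeroed by the caller):

```c
/* y += A*x for a sparse A stored as nnz (value, row, col) triples. */
void spmv_coo(int nnz, const double *values,
              const int *row, const int *col,
              const double *x, double *y) {
    for (int k = 0; k < nnz; k++)
        y[row[k]] += values[k] * x[col[k]];
}
```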
What are unstructured grids?
Similar to grids, but without assumption that neighbours are evenly spaced
What are n-body problems?
List of coordinates for things that push each other around (stars, billiard balls)
Bottleneck: finding neighbours
Does everybody affect each other, or do only neighbours affect each other?
What are Monte Carlo methods?
Calculations that approach their solution by accumulating random numbers
Additional numbers should contribute to the solution no matter what their values are
Performance challenges:
- pseudo-random number generators often have sequential dependencies between the numbers
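A classic sketch is estimating pi by sampling random points in the unit square: every extra sample refines the estimate regardless of its value. This toy version deliberately uses the sequential `rand` from the C library, which is exactly the kind of dependent generator that makes parallelization tricky:

```c
#include <stdlib.h>

/* Count how many random points in the unit square fall inside the
 * quarter circle; the inside fraction approaches pi/4. */
double estimate_pi(int samples, unsigned seed) {
    srand(seed);
    int inside = 0;
    for (int i = 0; i < samples; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            inside++;
    }
    return 4.0 * inside / samples;
}
```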
What are the 7 Berkeley kernels?
Dense linear algebra
Sparse linear algebra
Structured grids
Unstructured grids
N-body problems
Monte Carlo methods
Spectral methods
What is multiple issue?
Replicate ALU, extend decoder to dispatch several independent instructions simultaneously
What is Simultaneous MultiThreading (SMT)?
Useful to utilize the different parts of the ALU with instructions that require different calculation logic. Instead of leaving the ALU parts idle, run multiple at the same time.
Replicate instruction pointer and decoding unit
Receive 2 instruction streams.
Merge these together when their needs don't conflict
This allows a 4 core processor to be able to run 8 threads, for example
When will SMT improve performance?
It doesn't happen often that two instruction streams depend only on different ALU units.
Threads often need the same units, making the second one wait.
SMT improves performance in the cases where the threads do not need the same units; statistically, this only happens every so often.
What is superlinear speedup?
Times when we can measure S(p) > p, which goes against Amdahl’s law.
This can happen if a problem is split up into smaller parts, that suddenly fits in a faster part of memory (e.g. a higher level cache)
What is load balancing?
Ways of mitigating unbalanced workloads.
Parallel programs often have synchronization points. If the work is unevenly distributed between processes, one process will lag behind, and the execution time is held back by the slowest process.
What are the 3 kinds of load balancing?
Static:
Embed the partitioning of the program directly into the source code
Semi-static:
Examine the workload on program start, divide it then, and run with this initial partitioning
Dynamic:
Shift work around between participants while program is running
What is the Master/Worker pattern?
A dynamic workload balancing technique.
One rank is the master and maintains a queue of similar-sized tasks.
The rest of the ranks are workers: assigned tasks by the master, or taking tasks from the queue and informing the master
Limited scalability due to centralized control: if the number of workers is too big, they can overwhelm the master.