Theory Flashcards
What is easier to improve, bandwidth or latency?
Bandwidth
Because of this, when accessing memory, it is better to fetch as much data as possible per access, to amortize the memory latency
In caches, what are the size restrictions?
Cache block length: power of 2
Ways of associativity: Number of sets is a power of 2
n_blocks in a set: any integer
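Why the powers of 2 matter can be sketched in a few lines: with a power-of-2 block length and set count, the set index is just cheap integer arithmetic on the address. The concrete sizes below are assumed example values, not from the source.

```python
# Sketch with assumed parameters: locating the set for an address in a
# set-associative cache. Block length and number of sets are powers of 2,
# so offset and index fall out of simple divisions/moduli (bit operations
# in hardware).
BLOCK_SIZE = 64   # bytes per cache block (power of 2)
NUM_SETS = 256    # number of sets (power of 2)

def cache_set(address: int) -> int:
    block_number = address // BLOCK_SIZE  # drop the in-block offset bits
    return block_number % NUM_SETS        # power-of-2 modulo = bit mask

print(cache_set(64))  # the block right after block 0 maps to set 1
```

The number of blocks per set (the associativity) never enters this index calculation, which is why it can be any integer.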
What are Bernstein’s conditions for out-of-order execution?
Statements S1 and S2 will produce the same result when run in any order if:
- S1 does not read what S2 writes
- S1 does not write what S2 reads
- S1 and S2 do not write to the same place
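The three conditions can be sketched as set intersections over each statement's read and write sets (a hypothetical helper, not from the source):

```python
# Bernstein's conditions: S1 and S2 may run in any order iff the write
# set of each statement is disjoint from everything the other touches.
def can_reorder(reads1, writes1, reads2, writes2):
    return (not (reads1 & writes2)        # S1 does not read what S2 writes
            and not (writes1 & reads2)    # S1 does not write what S2 reads
            and not (writes1 & writes2))  # no writes to the same place

# a = b + c  vs  d = e + f  -> fully independent, safe to reorder
print(can_reorder({"b", "c"}, {"a"}, {"e", "f"}, {"d"}))  # True
# a = b + c  vs  d = a + f  -> S2 reads what S1 writes
print(can_reorder({"b", "c"}, {"a"}, {"a", "f"}, {"d"}))  # False
```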
What is a superscalar processor?
Launch multiple instructions at once (done in the control path)
Execute multiple instructions at once by replicating the ALU, so that there are multiple execution units
What are vector registers?
Extra-wide registers that can store several consecutive values at once, e.g.:
Vector load: a[0], a[1], a[2], a[3]
Vector registers allow for vector operations, where operations are done on all vector elements in parallel
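The semantics of a 4-wide vector operation can be modeled like this (a conceptual sketch only; Python models the element-wise behavior, not the parallel hardware that does it in one instruction):

```python
# Conceptual model of a 4-lane vector add: on real SIMD hardware all
# four element-wise additions happen in a single instruction.
def vector_add(a, b):
    assert len(a) == len(b) == 4  # e.g. one 4-lane vector register each
    return [x + y for x, y in zip(a, b)]

print(vector_add([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```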
What is Flynn’s taxonomy?
Ways to sort parallel programs
SISD: Von Neumann
SIMD: Vector machine
MISD: Theoretical
MIMD: Processes and threads. Fully independent processors, but the program makes them work towards the same goal; they are not synchronized in hardware
What are the ways to sort parallel programs?
Flynn’s taxonomy and shared-distributed memory
What is shared- and distributed memory?
Shared:
Several workers (threads) share the memory image of a single process
Distributed:
Works with several processes, all with their own memory
What components are allocated to threads?
Stack,
Instruction pointer
What is invisible communication between threads?
When one thread writes to memory outside its own stack, the update is instantly visible to all the other threads
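A minimal sketch of this: the list below lives outside any thread's stack, so every append is immediately visible to the other threads (the lock is the coordination that shared writes require):

```python
# Shared-memory "invisible communication": all threads live in one
# process, so writes to the shared list need no explicit message.
import threading

shared = []               # lives outside any thread's stack
lock = threading.Lock()   # coordinates the concurrent writes

def worker(tag):
    with lock:
        shared.append(tag)  # visible to every other thread at once

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(shared))  # [0, 1, 2, 3]
```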
Pros and cons of shared memory?
Pro:
- only 1 copy of shared data
- everyone can modify this (must be coordinated)
Con:
- Cannot be on separate computers (threads live within a process' address space, so their memory must be managed by one single OS)
- Only get as many threads as fit into one machine
Pros and cons of distributed memory?
Pros:
- Their memory cannot be involuntarily corrupted
- only receive what they ask for, and store this in a designated place
- Can run on same or different computers - if we run out of processors, can just add a machine
Cons:
- copies of data: message passing duplicates data between sender and receiver, which can use twice the memory
What is multi-computer parallelism?
When processes in a distributed memory program run on different computers.
Communicate over network
What are some types of interconnects used in shared memory systems?
Crossbar: Looks like a grid
(Fat) trees: Gives non-uniform memory access (NUMA) effects - remote memory is slower. Associate processors with memory that are their responsibility
Mesh: Constant number of links per unit; messages are routed through the network in several hops. Communication latency grows linearly with distance
Torus: Mesh that wraps around the edges
What are interconnect fabrics?
The combination of interconnect types (graph shapes) found in a computer
What are inherently sequential computations?
Computations that cannot be parallelized, because instructions are dependent on each other and can’t run in any order.
What is the formula for parallel execution time?
Tp = f*T + ((1 - f) * T) / p
T: Sum of the time costs of all operations in the program
p: number of processors, where the parallel operations are divided
(1-f): fraction of operations that can be parallelized
f: fraction of sequential operations the program requires
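The formula above transcribes directly to a few lines (the concrete numbers in the example are assumed, for illustration):

```python
# Parallel execution time: the sequential fraction f runs in full,
# the parallel fraction (1 - f) is divided across p processors.
def parallel_time(T, f, p):
    return f * T + ((1 - f) * T) / p

# e.g. T = 100 time units, 10% sequential, 10 processors:
# 10 units sequential + 90/10 units parallel
print(parallel_time(T=100.0, f=0.1, p=10))
```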
What is the formula for speedup?
Speedup is how much faster a fast solution runs compared to a slow one
S = Tslow / Tfast
If S = 4, the fast solution is 4 times quicker
What is the formula for speedup as a function of number of processors?
S(p) = Ts/Tp = (f*T + (1-f)*T) / (f*T + ((1 - f) * T) / p)
What is the formula for sequential time?
Ts = fT + (1-f)T
What is linear speedup?
When every operation in a program can be parallelized
S(p) = p
What is Amdahl’s law?
When a program contains a sequential part, the fastest theoretical execution time it could run at, given infinite processors, would be the time spent executing these operations in sequence.
Assumes constant serial time. Everything is calculated in proportion to an amount of work taking T time sequentially.
If we have an infinite number of available processors, the speedup would reach the limit
lim p->infinity S(p) = 1/f
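A quick numerical check of this limit, with an assumed sequential fraction f = 0.1 (so the bound is 1/f = 10):

```python
# Amdahl's law: speedup with a fixed sequential fraction f creeps
# toward, but never reaches, the limit 1/f as p grows.
def speedup(f, p):
    return 1.0 / (f + (1 - f) / p)

for p in (1, 10, 100, 10_000):
    print(p, round(speedup(0.1, p), 2))  # approaches 10 from below
```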
What is scaled speedup?
If we assume constant parallel time:
Tp = fT + (1-f)*T
Meaning, every time a processor is added, (1-f)*T units of work are added to occupy the additional processor
Bigger problems take longer to run sequentially:
Ts = fT + p(1-f)*T
Formula for scaled speedup:
Ss(p) = Ts/Tp = f + p(1-f), using new formula for Tp and Ts
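The derivation above can be checked numerically (f = 0.1 is an assumed example value):

```python
# Scaled speedup: the problem grows with p, so Ts = fT + p(1-f)T
# while the parallel time Tp stays fixed at fT + (1-f)T.
def scaled_speedup(f, p):
    T = 1.0                       # one unit of original work; cancels out
    Ts = f * T + p * (1 - f) * T  # scaled-up problem, run sequentially
    Tp = f * T + (1 - f) * T      # constant parallel time
    return Ts / Tp                # simplifies to f + p*(1 - f)

print(scaled_speedup(0.1, 100))  # 0.1 + 100*0.9 = 90.1
```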
What is Gustafson’s law?
Observes that scaled speedup does not approach any limit as p grows, as long as the problem is sized up in proportion to the machine:
Ss(p) = f + p(1-f)
What is the difference between scaled speedup and speedup?
Scaled: x times the work can be done in the same amount of time
Speedup: The same amount of work can be done in 1/x time
What are scaling curves?
The graph of S(p) given p
What is strong scaling?
The scaling curves of Amdahl’s law.
Best scaling is linear scaling:
S(p) = p
As p goes to infinity, each curve flattens out toward the limit 1/f; the larger the fraction (f) of sequential operations in the program, the lower this limit
Studies of how speedup changes with processor count are said to be carried out in the strong mode
What is weak scaling?
The scaling curves of Gustafson's law.
They are all linear, but their gradient is not quite 1
Ss(p) = a*p + b ; OBS: scaled speedup
a: the gradient (1-f), not quite 1; b: the intercept f
Studies of how scaled speedup changes with processor count are carried out in weak mode
What is efficiency?
E(p) = S(p) / p
Amount of speedup divided equally amongst the processors
Efficiency gives a sense of how much one processor contributed to the speedup
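Under Amdahl's model this is a one-liner (f values below are assumed examples):

```python
# Efficiency: the speedup shared out equally over the p processors.
def efficiency(f, p):
    S = 1.0 / (f + (1 - f) / p)  # Amdahl speedup
    return S / p

print(round(efficiency(0.0, 8), 2))  # perfectly parallel program: 1.0
print(round(efficiency(0.1, 8), 2))  # sequential part drags it below 1
```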
What is horizontal scaling?
Upgrade the system with more of the same components it already has
Improves the parallel part of execution
What is vertical scaling?
Upgrade system with new, more powerful components
Improves sequential part of execution
What is an assumption that Gustafson makes?
That the parallel workload can grow without needing to increase the sequential part
What is the hockney model?
When we know the message size (n), we can estimate transmission time as the sum of latency and n times the inverse bandwidth
Latency (a): Time it takes to get between communication points, decided by the distance between these not message size
Inverse bandwidth (B⁻¹): seconds/byte
T_comm(n) = a + n*B⁻¹
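The model plugs straight into code; the latency and bandwidth figures below are assumed example numbers (1 µs latency, 10 GB/s bandwidth), not measurements from the source:

```python
# Hockney model: transmission time = latency + bytes * inverse bandwidth.
def t_comm(n_bytes, alpha=1e-6, beta_inv=1e-10):
    return alpha + n_bytes * beta_inv

print(t_comm(0))       # tiny message: pure latency
print(t_comm(10**9))   # 1 GB message: the bandwidth term dominates
```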
What is the ping-pong test of communication speed?
Start clock
Send message from A to B
Send message from B to A
Repeat this many times
Stop clock
Divide the elapsed time by 2 (for both directions) and by n (the number of round trips)
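The steps above can be sketched locally, using two queues and an echo thread as an assumed stand-in for two machines and a network (real measurements would send messages between nodes, e.g. with MPI):

```python
# Local ping-pong sketch: time n round trips, then divide by 2*n to
# get the cost of a single one-way message.
import queue
import threading
import time

def echo(inbox, outbox, rounds):
    for _ in range(rounds):
        outbox.put(inbox.get())  # B bounces each message straight back

def ping_pong(msg, rounds=1000):
    to_b, to_a = queue.Queue(), queue.Queue()
    t = threading.Thread(target=echo, args=(to_b, to_a, rounds))
    t.start()
    start = time.perf_counter()          # start clock
    for _ in range(rounds):
        to_b.put(msg)                    # send message from A to B
        to_a.get()                       # send message from B to A
    elapsed = time.perf_counter() - start  # stop clock
    t.join()
    return elapsed / (2 * rounds)        # time per one-way message

print(ping_pong(b"x" * 1024))
```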
How can we measure latency?
Do a ping-pong test with empty or 1-byte messages
The time taken will in this case be dominated by latency
How can we measure inverse bandwidth?
Ping-pong test with a small number of huge messages
In this case, bandwidth requirements will dominate time taken
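Combining the two measurements recovers both Hockney parameters: two points (n, T) on the line T = a + n*B⁻¹ determine a and B⁻¹. The timing numbers below are assumed examples:

```python
# Fit the Hockney model from one tiny and one huge ping-pong timing:
# the slope is the inverse bandwidth, the intercept is the latency.
def fit_hockney(n_small, t_small, n_large, t_large):
    beta_inv = (t_large - t_small) / (n_large - n_small)  # s/byte
    alpha = t_small - n_small * beta_inv                  # latency (s)
    return alpha, beta_inv

# e.g. empty message takes 1 us, a 1 GB message takes ~0.1 s
alpha, beta_inv = fit_hockney(0, 1e-6, 10**9, 0.100001)
print(alpha, beta_inv)
```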