Thread and data level parallelism Flashcards
What is the difference between instruction-level-, thread-level- and data-level parallelism?
TLP increases overall throughput
ILP and DLP focus on increasing the IPC
What is ILP?
Exploits parallelism within a single program/thread, e.g. through out-of-order (OoO) execution, to execute multiple independent instructions simultaneously (across loop iterations, etc.).
What is TLP?
Exploits parallelism between independent threads. Running more applications in parallel increases overall system performance.
However, it does not necessarily make the individual applications perform better/faster.
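A minimal sketch of software-level TLP, assuming a POSIX system with pthreads (the worker function and its arguments are illustrative, not part of the card):
    #include <pthread.h>
    #include <stdio.h>

    // Each thread runs an independent piece of work.
    static void *worker(void *arg) {
        long id = (long)arg;
        printf("thread %ld doing independent work\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        // Two independent threads can run on different cores (or share one
        // core via hardware multithreading), raising total throughput.
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }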
What is DLP?
Exploits parallelism by operating on multiple data elements simultaneously. E.g. if a loop updates all elements of an array, the program can instead state that one operation should be applied to every element; that operation is then performed on all array elements at the same time (see the sketch below).
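A minimal sketch of how this can be expressed in software, assuming a C compiler that understands OpenMP's #pragma omp simd (otherwise the pragma is ignored and the loop runs scalar); the function and array names are illustrative:
    #include <stddef.h>

    // Scale every element of an array by the same factor.
    // The pragma hints that iterations are independent, so the compiler
    // may map them onto SIMD/vector hardware lanes.
    void scale(float *a, size_t n, float factor) {
        #pragma omp simd
        for (size_t i = 0; i < n; i++) {
            a[i] = a[i] * factor;
        }
    }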
What is throughput?
Total number of instructions completed across all threads (per unit of time).
Higher throughput allows for more programs running simultaneously
What is IPC?
Average number of instructions completed per cycle for a given thread.
Higher IPC gives more responsive programs
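A quick worked example with made-up numbers: if a thread retires 3,000 instructions over 1,500 cycles, its IPC is 3000 / 1500 = 2.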
What is a thread?
An independent stream of execution of a program within a process.
What is multithreading?
Scheduling multiple threads on a single core.
Duplicate independent state for each thread (register file, PC, page table)
Memory sharing is done by using virtual memory
HW must support thread switching. The latency of this must be much lower than a full (software) context switch.
What are two types of switch strategies?
Coarse and fine grained
What are coarse-grained switch strategies?
Switch threads on long stalls (e.g. L2 miss, TLB miss)
Advantages:
- Low HW cost, since the thread switch does not have to be fast
- Fast forward progress: a thread is only switched out when it would be delayed anyway
Disadvantages:
- The CPU only issues from one thread. On a stall, the pipeline must be flushed before the new thread can issue
- The new thread must then refill the pipeline from the start - restart penalty
- Needs added HW to detect a costly stall and to trigger the switch
What are fine-grained switch strategies?
Switches between threads every cycle, interleaving the different threads.
Usually round-robin - skipping stalled threads
Advantages:
- Pipeline does not need to be flushed and refilled on a stall
- Hides both short and long stalls
Disadvantages:
- Slows down execution of individual threads
- Extra HW support
What is Simultaneous Multithreading (SMT)?
Instructions from different threads can be issued within the same cycle, filling otherwise open issue slots. An advantage is that we are much more likely to be able to fill up all available slots because of this.
The motivation for SMT is that dynamically scheduled processors already have HW mechanisms to support multithreading.
- Lots of physical registers because of register renaming
- Dynamic execution scheduling
Required hardware:
- Per-thread renaming table
- Separate PC
- Separate ROBs with commit capabilities
What are some design challenges with SMTs? (3)
Need a large register file:
- Need a lot of physical registers to be able to map the architectural ones.
Must also avoid worsening the critical path:
- Don’t introduce additional bottlenecks
- From the issue to the execution stage, we need to make sure each thread can make roughly as much progress as if it were running on its own processor.
Make sure that threads don’t worsen the cache behaviour of other threads. This can happen if the working set of one thread only barely fits within the cache; another thread will then need to evict it to bring in its own data.
- Threads should be “good neighbours”
- Avoid evicting each other’s working sets
- Possibly share resources, or enforce fairness in how resources are divided
How does renaming work with SMTs?
Each thread has its own mapping table. So if threads use the same architectural registers, these can be mapped to different physical registers.
Why don’t we need to add more physical registers to allow SMTs?
Because a thread only uses the whole physical register file when it runs at peak performance. With SMT this won’t be the case, as the threads won’t all be running at peak performance.
What does an OoO pipeline look like with SMTs?
Separate PCs for each thread, used to fetch instructions from instruction cache. Fetch stage needs to support providing instructions to multiple threads at the same time.
Separate renaming units for each thread, with no crossing wires.
Separate ROBs for each thread. (in-order commit, precise exceptions)
How do dynamic processors find parallelism?
Through OoO execution of independent instructions
What are the basic elements of DLP?
Designing systems around operating on data elements directly, instead of having to expose and exploit parallelism between individual instructions.
Do this through vectorization
What is vectorization?
Organizing data into larger sets of associated elements, where one operation is applied to all data points.
What is SIMD?
Extend existing hardware features to enable a simplified form of vector execution
Splits registers into multiple data-elements. Execution units work on parallel lanes for each data element.
The programmer must convert their vectors to have formats (e.g. length) that are supported in hardware.
What are vector architectures?
Explicitly introduce vector registers and -processing.
Vectors up to a maximum size are supported. This size is generally larger than in SIMD systems.
HW can be designed to speed up vector fetch and addressing.
What types of applications benefit from SIMD?
Ones using narrow data types (8-32 bit data): multiple such elements can fit in a single register.
Can reuse existing register structures, and make minor modifications to the execution units, so that multiple data elements fit into a single register and can be operated on together.
How is SIMD designed?
Use the full width of registers and split operations
Uses specific instructions to identify how to split up registers. (1x64, 2x32, etc.)
Give an example of how registers can be modified in a SIMD design
Have a 64-bit register
Treat it as 4 separate 16-bit registers (lanes); see the sketch below
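A minimal sketch of this idea in plain C (“SIMD within a register”, SWAR); the masking trick keeps carries from leaking between the four 16-bit lanes, and the function name is just illustrative:
    #include <stdint.h>

    // Treat one 64-bit word as four independent 16-bit lanes and add lane-wise.
    uint64_t add4x16(uint64_t a, uint64_t b) {
        const uint64_t low15 = 0x7FFF7FFF7FFF7FFFULL;  // low 15 bits of each lane
        // Add the low 15 bits of every lane, then fix up the top bit of each
        // lane with XOR so no carry can cross a lane boundary.
        return ((a & low15) + (b & low15)) ^ ((a ^ b) & ~low15);
    }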
What are some benefits of SIMD? (2 types)
Performance:
- Multiplicative factor of processing speed
- Latency stays the same
HW implementation has low cost:
- Logical operations already work
- ALUs only need minor adjustments
What are some limitations of SIMD? (4)
Instruction operand encoding sometimes gives bad alignment, reducing the performance benefits.
Challenging to program. The programmer needs to make sure data is aligned, as SIMD does not provide sophisticated addressing modes; the program must explicitly tile the cases where the data does not exactly match the SIMD width or alignment (see the sketch after this list).
No mask registers
Challenging to compile
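A minimal sketch of this programming burden, assuming an x86 target with SSE intrinsics (the function and array names are illustrative):
    #include <xmmintrin.h>  // SSE intrinsics
    #include <stddef.h>

    // c[i] = a[i] + b[i] for n floats; SSE processes 4 floats per instruction.
    void add_floats(const float *a, const float *b, float *c, size_t n) {
        size_t i = 0;
        // Main SIMD loop: only full groups of 4 elements.
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
        }
        // The programmer must handle the leftover tail scalar-wise.
        for (; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }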
How does vector addition benefit from vector architectures?
for(i<-0 to 100) a[i] = b[i] + c[i]
With a vector architecture, the whole loop can be expressed as a single instruction:
A = B + C
What are some benefits of vector architectures (6)
Easier to program
Have a very explicit mode for parallelism, which makes for more efficient hardware.
Vector elements are inherently independent, which gives a lot less control overhead
Gives a lot of data processing at any given time
Defining vector properties gives more freedom to how it will be implemented in HW. Can scale to different designs.
Vector processing can also map well to SW by exposing vector-length (vector_len) properties; see the strip-mining sketch below.
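A minimal strip-mining sketch in C, assuming a hypothetical maximum hardware vector length MAX_VLEN (the names and the per-chunk operation are illustrative, not from the card):
    #include <stddef.h>

    #define MAX_VLEN 64  // hypothetical maximum number of elements per vector register

    // Stand-in for one vector operation on vlen elements; on real vector HW
    // this chunk would be a single vector instruction (e.g. a vector add).
    static void process_chunk(double *a, const double *b, size_t vlen) {
        for (size_t i = 0; i < vlen; i++)
            a[i] += b[i];
    }

    // Strip-mining: walk the data in chunks no larger than the HW vector length,
    // so SW written against vector_len runs on any implementation size.
    void process_all(double *a, const double *b, size_t n) {
        for (size_t i = 0; i < n; i += MAX_VLEN) {
            size_t vlen = (n - i < MAX_VLEN) ? (n - i) : MAX_VLEN;
            process_chunk(a + i, b + i, vlen);
        }
    }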
How are vector architectures designed?
Use vector registers; these are very large and each store (Vector_len / Data_len) elements.
FU execution time depends on the type of operation and the number of elements in a vector.
FUs support reading and writing individual vector elements.
How are vector data loaded from memory?
Base address is calculated.
Vector data is then brought into vector registers.
What are some limitations of vectors? (3)
Reduced performance for control flow dominated workloads
Data must be vectorizable
Mem system must be capable of servicing large amounts of data (with low latency), and must stay responsive even while large units of data are being processed.
How have SIMD and Vector converged over time?
SIMD extensions slowly morphed to be more vector-like
SIMD is a constrained vector model
Vector is in general a powerful tool when doing data processing.
However, SIMD does not have the flexible vector addressing modes that vector architectures have.
What is start-up time?
The time until the first word is read/returned to the system after it has been requested.
After the initial start-up time, how much data should the memory system provide per cycle?
LANES * Data_len bits per cycle
This way, as much data as the lanes can consume is provided each cycle
Lanes: Number of parallel executions we can do per cycle
Data_len(DLEN): Amount of bits each data element has
This can be achieved using memory banks (interleaving)
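A quick worked example with made-up numbers: with 4 lanes and 64-bit elements (DLEN = 64), the memory system must deliver 4 * 64 = 256 bits every cycle after start-up, e.g. by interleaving the accesses across several memory banks.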
What must Vector architectures support to deal with load-stores with strides?
For example, matrix multiplication: a row from the first matrix is multiplied with a column from the other matrix. This means that the vector unit needs to be able to deal with both adjacent and strided accesses (see the sketch below).
Specify stride in instruction:
ld v1, number_elements, [base_address], stride
The same load instruction can be used to implement all different strides.
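A minimal sketch of why strides arise, assuming a row-major matrix stored in a flat C array (the names are illustrative):
    #include <stddef.h>

    // Dot product of row r of A with column c of B, both N x N, row-major.
    double row_times_col(const double *A, const double *B, size_t N,
                         size_t r, size_t c) {
        double sum = 0.0;
        for (size_t k = 0; k < N; k++) {
            // A's row is accessed with stride 1 (adjacent elements),
            // B's column with stride N (one element per row).
            sum += A[r * N + k] * B[k * N + c];
        }
        return sum;
    }
    // A strided vector load (stride = N) lets the column of B be fetched
    // into a vector register with one instruction.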
What are sparse matrices, and what are some of their properties?
Mostly zero elements. The zero elements need no processing, but we can’t know whether an element is zero until we have already loaded it in.
Can use gather-scatter to take care of this issue
What is gather-scatter in regards to sparse matrices?
Use meta-data (an index vector) to collect all the non-zero elements (gather)
Process these
Write the results back to those element positions in the matrix (scatter)
LVI: Load Vector Indexed (gather)
SVI: Store Vector Indexed
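A minimal scalar-C sketch of gather-scatter, assuming an index array that lists the non-zero positions (on vector HW each of these loops would map to an indexed vector load/store such as LVI/SVI; the names are illustrative):
    #include <stddef.h>

    // idx[0..m) holds the positions of the non-zero elements of `sparse`.
    void scale_nonzeros(double *sparse, const size_t *idx, size_t m, double f) {
        double dense[m > 0 ? m : 1];    // scratch buffer for the gathered elements
        for (size_t i = 0; i < m; i++)  // gather: collect non-zeros into a dense vector
            dense[i] = sparse[idx[i]];
        for (size_t i = 0; i < m; i++)  // process the dense vector
            dense[i] *= f;
        for (size_t i = 0; i < m; i++)  // scatter: write results back to their positions
            sparse[idx[i]] = dense[i];
    }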
What is DLP acceleration?
Techniques to accelerate the performance of DLP
What are 3 ways of accelerating DLP?
Speeding up dependency handling
Reducing impact of divergence
Parallelizing data element execution
What is chaining of instructions?
Elements within a vector are independent, but there can still be dependencies between vectors.
In that case, each element of vector 2 has a true data dependence on the element in the same position of vector 1.
Since individual elements of the source vector complete before the whole vector operation has completed, we can begin executing the dependent vector operation before the previous one finishes.
In effect, the vectors are decomposed into their individual elements, and each dependent operation is issued as soon as its source element has been computed in the source vector.
What are multiple lanes?
Split vector execution across multiple execution units (lanes), instead of having whole vectors execute on one unit. Afterwards, each element's result is stored in its correct position in the result vector.
What are predicates (vector-mask control)?
Take the outcome of the if-else evaluation and create a mask vector where the elements that evaluate to true are 1 and the others are 0.
The operation is applied to a vector element only if its mask entry is enabled; the remaining elements become zero.
This ensures that an entire vector can still move as one.
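A minimal scalar-C sketch of vector-mask control, emulating the mask with a byte array (on real vector HW each loop would be a single masked vector instruction; the names and the condition are illustrative):
    #include <stddef.h>

    // Emulate: for all i, if (a[i] > 0) r[i] = a[i] * b[i]; else r[i] = 0;
    void masked_mul(const double *a, const double *b, double *r, size_t n) {
        unsigned char mask[n > 0 ? n : 1];  // one mask entry per vector element
        for (size_t i = 0; i < n; i++)      // build the mask from the if-condition
            mask[i] = (a[i] > 0.0);
        for (size_t i = 0; i < n; i++)      // apply the op only where the mask is set
            r[i] = mask[i] ? a[i] * b[i] : 0.0;
    }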