Thread and data level parallelism Flashcards
What is the difference between instruction-level, thread-level and data-level parallelism?
TLP increases overall throughput.
ILP and DLP focus on increasing the IPC of individual threads.
What is ILP?
Exploits parallelism within a single program, e.g. out-of-order (OoO) execution, to execute multiple independent instructions simultaneously (within and across loop iterations, etc.).
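A minimal sketch (function and variable names are my own) of the kind of independence OoO hardware exploits:

```cpp
// The two multiplications are independent, so an OoO core can execute
// them in parallel even though the program lists them sequentially.
double ilp_demo(double a, double b, double c, double d) {
    double x = a * b;   // independent of y
    double y = c * d;   // independent of x
    return x + y;       // depends on both, so it must wait
}
```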
What is TLP?
Exploits parallelism between independent threads. Running more applications in parallel increases overall system performance.
However, it does not necessarily make the individual applications perform better/faster.
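A minimal TLP sketch (task names hypothetical): two independent tasks run as separate threads, so total throughput rises, but neither task alone finishes faster.

```cpp
#include <thread>

void task_a() { /* independent work */ }
void task_b() { /* independent work */ }

int main() {
    std::thread t1(task_a);   // both tasks make progress in parallel
    std::thread t2(task_b);
    t1.join();
    t2.join();
}
```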
What is DLP?
Exploits parallelism by operating on multiple data elements simultaneously. E.g. if a loop updates all elements of an array, the program can instead specify that a certain operation is to be applied to all elements; the hardware then performs that operation on many elements at once.
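A sketch of a DLP-friendly loop (names my own): every iteration applies the same operation with no cross-iteration dependence, so a compiler can emit SIMD instructions that update several elements per instruction.

```cpp
#include <cstddef>

void scale(float* a, float s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] *= s;   // same operation on every element: vectorizable
}
```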
What is throughput?
The total number of instructions completed per unit time, across all threads.
Higher throughput allows more programs to run simultaneously.
What is IPC?
The average number of instructions completed per cycle for a given thread.
Higher IPC gives more responsive programs
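In symbols (notation my own), if thread t commits I_t instructions over the same interval of C cycles on a core clocked at frequency f:

```latex
\[
\mathrm{IPC}_t = \frac{I_t}{C}, \qquad
\text{throughput} = f \cdot \sum_{t} \mathrm{IPC}_t
\quad \text{(instructions per second)}
\]
```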
What is a thread?
A stream of execution within a process. Threads in the same process share its address space.
What is multithreading?
Scheduling multiple threads on a single core.
Independent state is duplicated for each thread (register file, PC, page table).
Memory sharing between threads is done through virtual memory.
HW must support thread switching; its latency must be much lower than that of a software context switch.
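A hypothetical sketch of the per-thread state a multithreaded core duplicates (field names and sizes are illustrative, not from any real design):

```cpp
#include <array>
#include <cstdint>

struct HwThreadContext {
    std::array<std::uint64_t, 32> regs;  // architectural register file
    std::uint64_t pc;                    // program counter
    std::uint64_t page_table_base;       // per-thread page table pointer
};
```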
What are two types of switch strategies?
Coarse-grained and fine-grained.
What are coarse-grained switch strategies?
Switch threads only on a long stall (L2 miss, TLB miss); a minimal sketch follows the lists below.
Advantages:
- Low HW cost: switches are infrequent, so the switching logic can be slow and simple
- Fast forward progress: a thread is only switched out when it would be delayed anyway
Disadvantages:
- The CPU issues from only one thread; on a stall, the pipeline must be flushed before the new thread can issue
- The new thread must then refill the pipeline from the start: a restart penalty
- Needs added HW to detect costly stalls and to trigger the switch
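A toy sketch of the coarse-grained policy (all names hypothetical): only long-latency events trigger a switch, and the pipeline would be flushed first, paying the restart penalty.

```cpp
#include <cstdio>

enum class Event { None, L1Miss, L2Miss, TlbMiss };

// Only costly stalls justify paying the flush/restart penalty.
bool is_long_stall(Event e) {
    return e == Event::L2Miss || e == Event::TlbMiss;
}

int main() {
    int current = 0;
    if (is_long_stall(Event::L2Miss)) {
        // flush the pipeline here (restart penalty), then switch
        current = (current + 1) % 2;
    }
    std::printf("running thread %d\n", current);
}
```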
What are fine-grained switch strategies?
Switches between threads every cycle, interleaving the different threads.
Usually round-robin, skipping stalled threads (sketched below, after the pros and cons).
Advantages:
- The pipeline need not be flushed and refilled on a stall
- Hides both short and long stalls
Disadvantages:
- Slows down the execution of individual threads
- Requires extra HW support
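A minimal sketch of the round-robin pick (names and thread count hypothetical):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kThreads = 4;

// Each cycle, pick the next thread in round-robin order, skipping
// threads that are currently stalled.
std::size_t pick_next(const std::array<bool, kThreads>& stalled,
                      std::size_t last) {
    for (std::size_t i = 1; i <= kThreads; ++i) {
        std::size_t t = (last + i) % kThreads;
        if (!stalled[t]) return t;   // issue from this thread this cycle
    }
    return last;                     // all threads stalled: nothing to issue
}
```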
What is Simultaneous Multithreading (SMT)?
Instructions from multiple threads are issued within the same cycle, filling otherwise empty issue slots. An advantage is that this makes it much more likely that all available slots are filled (sketched after the hardware list below).
The motivation for SMT is that dynamically scheduled processors already have HW mechanisms to support multithreading:
- Lots of physical registers because of register renaming
- Dynamic execution scheduling
Required hardware:
- Per-thread renaming table
- Separate PC
- Separate ROBs with commit capabilities
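A toy per-cycle slot-filling sketch (widths, counts and names hypothetical): up to kIssueWidth slots are filled with ready instructions drawn from any thread, round-robin.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kIssueWidth = 4;
constexpr std::size_t kThreads = 2;

// ready[t] = number of ready-to-issue instructions in thread t this cycle.
// Returns the thread id chosen for each filled issue slot.
std::vector<std::size_t> fill_slots(std::array<std::size_t, kThreads> ready) {
    std::vector<std::size_t> slots;
    std::size_t t = 0, empty_in_a_row = 0;
    while (slots.size() < kIssueWidth && empty_in_a_row < kThreads) {
        if (ready[t] > 0) {
            --ready[t];
            slots.push_back(t);      // this slot issues from thread t
            empty_in_a_row = 0;
        } else {
            ++empty_in_a_row;        // thread t has nothing ready
        }
        t = (t + 1) % kThreads;
    }
    return slots;                    // unfilled slots stay empty this cycle
}
```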
What are some design challenges with SMTs? (3)
Need a large register file:
- Need a lot of physical registers to be able to map the architectural ones.
Must also avoid worsening the critical path:
- Don’t introduce additional bottlenecks
- From the issue to the execution stage, each thread should be able to make as much progress as if it were running on its own processor.
Make sure that threads don’t worsen the cache behaviour of other threads. This can happen if the working set of one thread only barely fits within the cache: the next thread will then evict parts of it to make room for its own data.
- Threads should be “good neighbours”
- Avoid evicting each other’s working sets
- Possibly share resources, or enforce fairness in resource use
How does renaming work with SMTs?
Each thread has its own mapping table. So if two threads use the same architectural registers, these can be mapped to different physical registers.
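A sketch of per-thread rename tables (sizes and names hypothetical): the same architectural register in different threads resolves to different physical registers, so the threads never interfere.

```cpp
#include <array>
#include <cstdint>

constexpr int kArchRegs = 32;
constexpr int kHwThreads = 2;

struct RenameTables {
    // map[thread][arch_reg] holds the current physical register id
    std::array<std::array<std::uint16_t, kArchRegs>, kHwThreads> map{};

    std::uint16_t lookup(int thread, int arch_reg) const {
        return map[thread][arch_reg];
    }
};
```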
Why don’t we need to add more physical registers to allow SMTs?
Because a thread only uses the whole physical register file when it runs at peak performance. With SMT this won’t be the case, as the co-running threads won’t all be running at peak performance.
What does an OoO pipeline look like with SMTs?
Separate PCs for each thread, used to fetch instructions from the instruction cache. The fetch stage needs to support providing instructions to multiple threads at the same time.
Separate renaming units for each thread, with no crossing wires between them.
Separate ROBs for each thread (in-order commit, precise exceptions).