Lectures 5, 6 & 7 - Superscalar Techniques & VLIW Flashcards
Superscalar Processor
Multiple instructions are processed in each pipeline stage
Superpipelined Processor
The S stages of a pipelined processor are each further divided into M sub-stages.
How much Instruction Level Parallelism in each basic block?
1.72
How much ILP is there if we can bypass conditional branches?
2.72 * sqrt(k), up to an ILP of ~50
What is the data flow limit?
True data dependencies appear to enforce an upper limit on instruction-level parallelism: we cannot execute two instructions in parallel where one relies on the result of the other.
How can we exceed the data flow limit?
Data value prediction - perfect data value prediction would provide infinite ILP, but it is very difficult to implement at a reasonable cost in practice.
Cheat - execute dependent instructions within a single clock cycle using multi-input or fast functional units (the Intel P4 has a high-speed ALU clocked at double the processor rate), and cache & recall common results (at instruction or function level).
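The "cache & recall common results" idea at function level is essentially memoisation; a minimal software analogue of the hardware result cache (`slow_fn` is a hypothetical stand-in for an expensive computation):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def slow_fn(x):
    # Stands in for an expensive, purely data-dependent computation.
    return x * x

slow_fn(12)   # computed the first time
slow_fn(12)   # recalled from the cache, not recomputed
print(slow_fn.cache_info().hits)  # 1
```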
The effect of alignment on instruction fetch rate
Cache conflicts when fetching unaligned instructions mean only some of the instructions in the cache lines accessed are valid. This limits the effectiveness of a superscalar processor, as valid instructions cannot be supplied fast enough.
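A toy sketch of the alignment effect, assuming (hypothetically) 4-instruction cache lines and a fetch unit that can only read to the end of one line per cycle:

```python
LINE = 4  # instructions per cache line (an assumption for illustration)

def fetched_this_cycle(pc, width=4):
    """Number of valid instructions a width-wide fetch unit gets when it
    can only read up to the end of the current cache line."""
    offset = pc % LINE
    return min(width, LINE - offset)

print(fetched_this_cycle(0))  # 4: aligned fetch fills the whole fetch group
print(fetched_this_cycle(3))  # 1: misaligned fetch gets only the line's tail
```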
What is required to fetch multiple basic blocks per cycle?
- A branch predictor that can predict the direction of multiple branches in one clock cycle.
- A stored tree of possible branch targets for each branch in a Branch Address Cache
Trace Cache
Caches dynamic instruction sequences; contains both instructions and a flag recording whether each branch in the trace was taken. A trace hit occurs when the fetch address matches the tag and the current branch predictions match the branch flags stored with the trace. Matching is complex and requires a lot of computation and storage, so it was only used in Intel's Pentium 4.
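The trace-cache hit test can be sketched in a few lines; the dictionary layout here is a hypothetical simplification, not Intel's actual structure:

```python
def trace_hit(fetch_addr, predictions, trace):
    """A trace hits only if the fetch address matches the tag AND the
    current branch predictions match the branch flags stored with it."""
    flags = trace["branch_flags"]
    return trace["tag"] == fetch_addr and predictions[:len(flags)] == flags

trace = {"tag": 0x400, "branch_flags": [True, False]}  # taken, not-taken
print(trace_hit(0x400, [True, False, True], trace))    # True: tag and flags match
print(trace_hit(0x400, [False, False, True], trace))   # False: predictions differ
```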
Register Renaming
Name dependencies unnecessarily limit our ability to exploit ILP. The number of architectural registers (as defined by the ISA) may be too small to remove them at compile time. Register renaming maps architectural registers to physical registers at run time, removing these false dependencies.
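A minimal renaming sketch (a toy model, not any real ISA): every write allocates a fresh physical register via a register alias table, so WAR and WAW dependencies disappear while true dependencies are preserved:

```python
def rename(instructions, num_arch_regs=4):
    """Each instruction is (dest, src1, src2) over architectural registers.
    Returns the same instructions rewritten over physical registers."""
    rat = {r: r for r in range(num_arch_regs)}  # register alias table
    next_phys = num_arch_regs                   # next free physical register
    renamed = []
    for dest, src1, src2 in instructions:
        s1, s2 = rat[src1], rat[src2]           # read sources via current map
        rat[dest] = next_phys                   # fresh register for each write
        renamed.append((next_phys, s1, s2))
        next_phys += 1
    return renamed

# r1 = r2 + r3 ; r2 = r1 + r3 : the WAR hazard on r2 disappears after
# renaming, while the true dependency (r1 -> p4) is kept.
print(rename([(1, 2, 3), (2, 1, 3)]))  # [(4, 2, 3), (5, 4, 3)]
```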
Dynamic Scheduling
Instructions from a large window spanning multiple basic blocks are executed out of order. The compiler cannot produce the best schedule because it is limited by unpredictable conditional branches, a limited number of registers, the inability to disambiguate some memory addresses statically, and unpredictable cache behaviour. Dynamic scheduling tolerates these unpredictable events by executing independent instructions in the meantime.
Memory Aliasing
Two memory references involving the same memory location
Memory Disambiguation
The process of determining if two memory references will alias or not.
Out-of-order loads
Loads often start chains of dependent instructions, so we want to execute them as soon as possible:
- Load-bypassing - execute a LOAD before preceding STOREs (if they access different locations)
- Store-to-load forwarding - if a LOAD is directly dependent on a STORE, it receives the data directly from the STORE without accessing memory
- Load-to-load forwarding - forward data read by one LOAD to a later one
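Store-to-load forwarding can be sketched as a search of the in-flight store queue; this is a simplified model assuming word-granularity addresses and no partial overlaps:

```python
def execute_load(addr, store_queue, memory):
    """Search older, not-yet-committed stores youngest-first; forward the
    data if one matches the load's address, otherwise read memory."""
    for st_addr, st_data in reversed(store_queue):  # youngest matching store wins
        if st_addr == addr:
            return st_data      # store-to-load forwarding: no memory access
    return memory[addr]         # no aliasing store in flight: go to memory

memory = {0x100: 7, 0x104: 9}
sq = [(0x100, 42)]              # an older store still waiting to commit
print(execute_load(0x100, sq, memory))  # 42 (forwarded from the store queue)
print(execute_load(0x104, sq, memory))  # 9  (load-bypassing: read memory)
```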
Restrictions on executing memory instructions out-of-order
- Store operations cannot be undone, so we must ensure any exception caused by an earlier instruction is handled before a store changes memory. A store must not be speculative when it accesses memory.
- Must respect memory-carried dependencies: be careful not to read a memory location before it has been updated by an older store instruction.
- Shared memory consistency models - multiprocessor systems often require that loads accessing the same memory address execute in program order.