Lectures 5, 6 & 7 - Superscalar Techniques & VLIW Flashcards
Superscalar Processor
Multiple instructions are processed in each pipeline stage
Superpipelined Processor
The S stages of a pipelined processor are each further divided into M sub-stages.
How much Instruction Level Parallelism (ILP) is there in each basic block?
1.72
How much ILP is there if we can bypass conditional branches?
2.72 * sqrt(k), where k is the number of branches bypassed, up to an ILP of ~50
What is the data flow limit?
True data dependencies appear to enforce an upper limit on instruction level parallelism: we cannot execute two instructions in parallel when one relies on the result of the other.
How can we exceed the data flow limit?
Data value prediction - perfect data value prediction would provide unbounded ILP, but it is very difficult to implement at a reasonable cost in practice (a simple sketch follows below).
Cheat - execute dependent instructions within a single clock cycle using multi-input or fast functional units (the Intel Pentium 4 has a high-speed ALU clocked at double the processor rate), and cache & recall common results (at instruction or function level).
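A minimal sketch of one simple value prediction scheme, a last-value predictor, assuming a table indexed by instruction address; the class and method names are illustrative, not from the lectures:

```python
# Minimal sketch of a last-value data predictor (illustrative, not from the
# lectures). A table indexed by instruction address (PC) remembers the last
# result each instruction produced and predicts it will recur, letting
# dependent instructions start speculatively before the real result is known.

class LastValuePredictor:
    def __init__(self):
        self.table = {}  # PC -> last observed result

    def predict(self, pc):
        # Return (hit, predicted_value); miss on first encounter.
        if pc in self.table:
            return True, self.table[pc]
        return False, None

    def update(self, pc, actual_value):
        # Called when the instruction actually completes; a wrong prediction
        # would force the dependent instructions to re-execute.
        self.table[pc] = actual_value

predictor = LastValuePredictor()
predictor.update(0x400, 42)       # instruction at 0x400 produced 42
print(predictor.predict(0x400))   # (True, 42) - speculate dependents with 42
```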
The effect of alignment on instruction fetch rate
When fetching instruction blocks that are not aligned to cache lines, only some of the instructions in the accessed cache line are valid (e.g. a 4-wide fetch starting two instructions before the end of a line yields only two useful instructions that cycle). This limits the effectiveness of a superscalar processor, as valid instructions can't be supplied fast enough.
What is required to fetch multiple basic blocks per cycle?
- A branch predictor that can predict the direction of multiple branches in one clock cycle.
- A stored tree of possible branch targets for each branch in a Branch Address Cache
Trace Cache
Caches dynamic instruction sequences; each trace contains both the instructions and an indicator of whether each branch was taken or not taken. A trace hit occurs when the fetch address matches the tag and the current branch predictions match the branch flags stored with the trace (see the sketch below). It is very complex and requires a lot of computation and storage to do the matching, so it was only used in Intel's Pentium 4.
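A minimal sketch of the trace-hit check described above; the field names (`tag`, `branch_flags`) are illustrative assumptions:

```python
# Minimal sketch of a trace-cache hit check (field names are illustrative).
# A trace is tagged with its starting fetch address and stores a flag per
# branch recording whether that branch was taken when the trace was built.

class TraceCacheLine:
    def __init__(self, tag, instructions, branch_flags):
        self.tag = tag                    # starting fetch address of the trace
        self.instructions = instructions  # the cached dynamic sequence
        self.branch_flags = branch_flags  # taken/not-taken per branch in trace

def trace_hit(line, fetch_addr, predictions):
    # Hit only if the fetch address matches the tag AND the multiple-branch
    # predictor's predictions agree with the flags stored with the trace.
    return line.tag == fetch_addr and predictions == line.branch_flags

line = TraceCacheLine(0x1000, ["i0", "i1", "beq", "i2", "bne", "i3"],
                      [True, False])
print(trace_hit(line, 0x1000, [True, False]))  # True  - supply whole trace
print(trace_hit(line, 0x1000, [True, True]))   # False - predictions diverge
```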
Register Renaming
Name dependencies unnecessarily limit our ability to exploit ILP, and the number of architectural registers (as defined by the ISA) may be too small for this to be solved at compile time. Register renaming maps architectural registers to physical registers at run time, removing these false dependencies (a sketch follows below).
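A minimal sketch of renaming with a map table and a free list of physical registers; the register counts and names are illustrative:

```python
# Minimal sketch of register renaming (structure sizes are illustrative).
# A map table translates each architectural register to a physical one; every
# instruction that writes a register gets a fresh physical register from a
# free list, which removes WAR/WAW (false) dependencies.

map_table = {f"r{i}": f"p{i}" for i in range(4)}  # architectural -> physical
free_list = [f"p{i}" for i in range(4, 8)]        # unused physical registers

def rename(dest, srcs):
    # Sources are renamed with the *current* mapping, so true (RAW)
    # dependencies are preserved; only then does dest get a new register.
    phys_srcs = [map_table[s] for s in srcs]
    phys_dest = free_list.pop(0)
    map_table[dest] = phys_dest
    return phys_dest, phys_srcs

# "r1 = r2 + r3" then "r1 = r2 + r0": the WAW hazard on r1 disappears because
# the two writes target different physical registers.
print(rename("r1", ["r2", "r3"]))   # ('p4', ['p2', 'p3'])
print(rename("r1", ["r2", "r0"]))   # ('p5', ['p2', 'p0'])
```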
Dynamic Scheduling
Instructions from a large window spanning multiple basic blocks are executed out-of-order. The compiler can't produce the best schedule because it is limited by unpredictable conditional branches, a limited number of registers, the inability to disambiguate some memory addresses statically, and unpredictable cache behaviour. These unpredictable latencies can be tolerated with dynamic scheduling by executing independent instructions in the meantime.
Memory Aliasing
Two memory references involving the same memory location
Memory Disambiguation
The process of determining if two memory references will alias or not.
Out-of-order loads
Loads often start chains of dependent instructions, so we want to execute them as soon as possible (see the sketch after this list):
- Load bypassing - execute a LOAD before preceding STOREs (if they access different locations)
- Store-to-load forwarding - if a LOAD is directly dependent on a STORE, receive the data directly from the STORE without accessing memory
- Load-to-load forwarding - forward data read by one LOAD to a later one
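A minimal sketch of load bypassing and store-to-load forwarding, assuming a store queue of (age, address, data) entries ordered by program age; all names are illustrative:

```python
# Minimal sketch of load bypassing / store-to-load forwarding (illustrative).
# Older, already-executed stores sit in a store queue; an executing load
# searches it and, if the youngest older store matches its address, takes the
# data directly from the queue instead of from memory.

store_queue = [
    # (age, address, data) - lower age = older in program order
    (1, 0x100, 7),
    (2, 0x200, 9),
    (3, 0x100, 8),
]

def execute_load(load_age, load_addr, memory):
    # Consider only stores older than the load, youngest first.
    older = [s for s in store_queue if s[0] < load_age]
    for age, addr, data in sorted(older, reverse=True):
        if addr == load_addr:
            return data               # store-to-load forwarding
    return memory.get(load_addr, 0)   # no alias: bypass the stores entirely

memory = {0x100: 1, 0x300: 5}
print(execute_load(4, 0x100, memory))  # 8 - forwarded from youngest match
print(execute_load(4, 0x300, memory))  # 5 - load bypasses all three stores
```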
Restrictions on executing memory instructions out-of-order
- STORE operations cannot be undone, so we must ensure that any exception caused by an earlier instruction is handled before a store changes memory, and that a store is no longer speculative when it accesses memory.
- We must respect memory-carried dependencies and be careful not to read a memory location before it has been updated by an older store instruction.
- Shared-memory consistency models - multiprocessor systems often require that loads accessing the same memory address execute in program order.
How are stores executed with dynamic scheduling?
In program order, once there is no possibility of an outstanding exception or mispredicted branch invalidating them.
How can we enforce memory carried dependencies with dynamic scheduling?
All loads search the store queue for older instructions accessing the same memory location; if a match is found, the load receives its data from the youngest matching store in the queue. The address calculation of some older stores may still be outstanding when a load executes, so the result from the cache must be treated as speculative; if it is found to be invalid when the store's address is checked, execution restarts from the problem load (see the sketch below).
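A minimal sketch of the violation check performed when a store's address finally resolves, assuming a load queue of already-executed loads; names are illustrative:

```python
# Minimal sketch of detecting a memory-ordering violation when a store's
# address resolves *after* a younger load has already executed speculatively
# (all structures illustrative). The load queue records executed loads; the
# resolving store searches it, and any younger load to the same address
# forces a restart from that load.

load_queue = [
    # (age, address) of loads that have already executed
    (5, 0x100),
    (6, 0x200),
]

def resolve_store_address(store_age, store_addr):
    # Find the oldest younger load that speculatively read this address.
    violators = [age for age, addr in load_queue
                 if age > store_age and addr == store_addr]
    if violators:
        return min(violators)   # restart execution from this load
    return None                 # the speculation was safe

print(resolve_store_address(4, 0x200))  # 6 - load 6 read stale data, replay
print(resolve_store_address(4, 0x300))  # None - no conflict
```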
Load Wait Table
A record of load instructions that accessed memory speculatively but then failed due to a dependency on an earlier store. If a load is in this table, it is forced to wait for prior stores before it attempts to access the data cache (a sketch follows below).
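A minimal sketch of a load wait table, assuming it is simply a set of load PCs; the names are illustrative:

```python
# Minimal sketch of a load wait table (illustrative): a set of load PCs that
# have previously caused ordering violations. Such loads are held back until
# all prior stores have resolved; other loads may still issue speculatively.

load_wait_table = set()

def on_violation(load_pc):
    load_wait_table.add(load_pc)   # remember that this load misbehaved
    # (real designs typically clear the table periodically so entries retrain)

def may_issue_early(load_pc, prior_stores_resolved):
    if load_pc in load_wait_table:
        return prior_stores_resolved   # must wait for earlier stores
    return True                        # free to speculate

on_violation(0x4F0)
print(may_issue_early(0x4F0, prior_stores_resolved=False))  # False - must wait
print(may_issue_early(0x500, prior_stores_resolved=False))  # True  - speculate
```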
Load-Load ordering
Search the load queue to check for load-load ordering violations and ensure that loads to the same memory address execute in program order.
Eliminating load-store queue
- disable store-to-load forwarding
- flush the pipeline when a LOAD and a STORE which access the same address have executed out of order
- use a simple hash table indexed by memory address to detect problems
Each entry in the hash table has a small counter, which is incremented by a LOAD when it issues and decremented when it commits. If the count is non-zero when a STORE executes, an in-flight load may have read a stale value, so the pipeline is flushed and restarted (see the sketch below).
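A minimal sketch of this hash-table scheme, assuming a direct hash on the address and a small counter per entry; the table size is an arbitrary illustration:

```python
# Minimal sketch of the hash-table scheme that replaces the load-store queue
# (size and names illustrative). Each entry holds a small counter: a LOAD
# increments its entry at issue and decrements it at commit, so a non-zero
# count means some in-flight load touched an address hashing to that entry.

NUM_ENTRIES = 64
counters = [0] * NUM_ENTRIES

def index(addr):
    return addr % NUM_ENTRIES    # simple hash of the memory address

def load_issued(addr):
    counters[index(addr)] += 1

def load_committed(addr):
    counters[index(addr)] -= 1

def store_executes(addr):
    # Non-zero count: an in-flight load may have read a stale value for this
    # address (or a colliding one), so conservatively flush and restart.
    return counters[index(addr)] != 0   # True -> flush pipeline

load_issued(0x140)
print(store_executes(0x140))   # True  - possible stale load, flush
load_committed(0x140)
print(store_executes(0x140))   # False - safe
```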
Precise Interrupts - Reorder buffer
Buffer results generated out-of-order and commit them to the register file when all earlier instructions have completed. The commit process checks for mispredicted branches and exceptions; if a problem is detected, the reorder buffer is cleared and fetching restarts from the new path (see the sketch below). The reorder buffer also provides a form of register renaming, as an operand could be in either the register file or the buffer: we can either read both and use the most recent value, or maintain a mapping table.
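A minimal sketch of in-order commit from a reorder buffer; the entry layout is an illustrative assumption:

```python
# Minimal sketch of in-order commit from a reorder buffer (illustrative).
# Results are written out-of-order into ROB entries but drain to the register
# file strictly in program order; a flagged exception or mispredicted branch
# stops commit and clears everything younger.

from collections import deque

regfile = {"r1": 0, "r2": 0}
rob = deque([
    # (dest, value, done, problem) in program order
    ("r1", 10, True, None),
    ("r2", 20, True, None),
    ("r1", 30, True, "mispredict"),
    ("r2", 99, True, None),          # wrong-path result, must be discarded
])

def commit():
    while rob and rob[0][2]:             # head entry must have completed
        dest, value, done, problem = rob.popleft()
        if problem:
            rob.clear()                  # squash all younger instructions
            return problem               # refetch from the correct path
        regfile[dest] = value            # architectural state updated in order
    return None

print(commit())    # 'mispredict' - commit stops at the problem entry
print(regfile)     # {'r1': 10, 'r2': 20} - the wrong-path 99 never commits
```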
Precise Interrupts - Unified register file
Store all registers in a single large register file and use two mapping tables:
- one holding the current (future) mapping
- one which is updated as instructions commit
On an exception, simply copy the committed table back to the current one (see the sketch below).
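A minimal sketch of recovery with the two mapping tables; the names are illustrative:

```python
# Minimal sketch of exception recovery with a unified physical register file
# and two mapping tables (names illustrative): the current (future) map is
# updated at rename, the committed map only as instructions commit.

current_map = {"r1": "p1", "r2": "p2"}    # speculative, updated at rename
committed_map = dict(current_map)          # architectural, updated at commit

def rename_write(arch_reg, new_phys):
    current_map[arch_reg] = new_phys       # future mapping moves ahead

def commit_write(arch_reg, phys):
    committed_map[arch_reg] = phys         # architectural state catches up

def recover_from_exception():
    # Recovery is just a table copy: discard every speculative mapping.
    global current_map
    current_map = dict(committed_map)

rename_write("r1", "p7")                   # speculative instruction renames r1
recover_from_exception()                   # exception before it commits
print(current_map["r1"])                   # 'p1' - speculative mapping undone
```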
Clustered superscalar processors
Partition the functional units into two groups, allowing the design to be decentralised. This, however, requires an extra clock cycle to communicate between clusters.
Classic VLIW
Pack multiple independent operations into one very long instruction. Each operation slot has a fixed function associated with it, and execution is statically scheduled by the compiler. This gives the potential for simpler hardware (a sketch of slot packing follows below).
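A minimal sketch of packing independent operations into fixed-function slots, with NOPs for empty slots; the slot layout is an illustrative assumption:

```python
# Minimal sketch of packing independent operations into a classic VLIW word
# (slot layout illustrative). Each slot has a fixed function; slots with no
# independent operation available are filled with NOPs by the compiler.

SLOTS = ["alu0", "alu1", "mem", "branch"]

def pack_bundle(ready_ops):
    # ready_ops: {slot_kind: operation} chosen by the static scheduler from
    # operations proven independent at compile time.
    return [ready_ops.get(slot, "nop") for slot in SLOTS]

# Three independent operations this cycle; no branch, so that slot is a NOP.
bundle = pack_bundle({"alu0": "add r1,r2,r3",
                      "alu1": "mul r4,r5,r6",
                      "mem":  "ld  r7,0(r8)"})
print(bundle)   # ['add r1,r2,r3', 'mul r4,r5,r6', 'ld  r7,0(r8)', 'nop']
```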