Lectures 5, 6 & 7 - Superscalar Techniques & VLIW Flashcards
Superscalar Processor
Multiple instructions are processed in each pipeline stage
Superpipelined Processor
The S stages of a pipelined processor are each further divided into M sub-stages.
How much Instruction Level Parallelism (ILP) is there in each basic block?
1.72
How much ILP is there if we can bypass conditional branches?
2.72 * sqrt(k), where k is the number of branches bypassed, up to an ILP of ~50
What is the data flow limit?
True data dependencies appear to enforce an upper limit on instruction level parallelism: we cannot execute two instructions in parallel when one relies on the result of the other.
How can we exceed the data flow limit?
Data value prediction - perfect data value prediction would provide unbounded ILP, but it is very difficult to implement at a reasonable cost in practice (a simple sketch follows below).
Cheat - execute dependent instructions within a single clock cycle using multi-input or fast functional units (the Intel Pentium 4 has a high-speed ALU clocked at double the processor rate), and cache & recall common results (at instruction or function level).
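A minimal sketch of one simple value prediction scheme, a last-value predictor, assuming a table indexed by instruction address; the class and method names are illustrative, not from the lectures:

```python
# Minimal sketch of a last-value data predictor (illustrative, not from the
# lectures). A table indexed by instruction address (PC) remembers the last
# result each instruction produced and predicts it will recur, letting
# dependent instructions start speculatively before the real result is known.

class LastValuePredictor:
    def __init__(self):
        self.table = {}  # PC -> last observed result

    def predict(self, pc):
        # Return (hit, predicted_value); miss on first encounter.
        if pc in self.table:
            return True, self.table[pc]
        return False, None

    def update(self, pc, actual_value):
        # Called when the instruction actually completes; a wrong prediction
        # would force the dependent instructions to re-execute.
        self.table[pc] = actual_value

predictor = LastValuePredictor()
predictor.update(0x400, 42)       # instruction at 0x400 produced 42
print(predictor.predict(0x400))   # (True, 42) - speculate dependents with 42
```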
The effect of alignment on instruction fetch rate
When fetching instruction blocks that are not aligned to cache lines, only some of the instructions in the accessed cache line are valid (e.g. a 4-wide fetch starting two instructions before the end of a line yields only two useful instructions that cycle). This limits the effectiveness of a superscalar processor, as valid instructions can't be supplied fast enough.
What is required to fetch multiple basic blocks per cycle?
- A branch predictor that can predict the direction of multiple branches in one clock cycle.
- A stored tree of possible branch targets for each branch in a Branch Address Cache
Trace Cache
Caches dynamic instruction sequences; each trace contains both the instructions and an indicator of whether each branch was taken or not taken. A trace hit occurs when the fetch address matches the tag and the current branch predictions match the branch flags stored with the trace (see the sketch below). It is very complex and requires a lot of computation and storage to do the matching, so it was only used in Intel's Pentium 4.
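A minimal sketch of the trace-hit check described above; the field names (`tag`, `branch_flags`) are illustrative assumptions:

```python
# Minimal sketch of a trace-cache hit check (field names are illustrative).
# A trace is tagged with its starting fetch address and stores a flag per
# branch recording whether that branch was taken when the trace was built.

class TraceCacheLine:
    def __init__(self, tag, instructions, branch_flags):
        self.tag = tag                    # starting fetch address of the trace
        self.instructions = instructions  # the cached dynamic sequence
        self.branch_flags = branch_flags  # taken/not-taken per branch in trace

def trace_hit(line, fetch_addr, predictions):
    # Hit only if the fetch address matches the tag AND the multiple-branch
    # predictor's predictions agree with the flags stored with the trace.
    return line.tag == fetch_addr and predictions == line.branch_flags

line = TraceCacheLine(0x1000, ["i0", "i1", "beq", "i2", "bne", "i3"],
                      [True, False])
print(trace_hit(line, 0x1000, [True, False]))  # True  - supply whole trace
print(trace_hit(line, 0x1000, [True, True]))   # False - predictions diverge
```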
Register Renaming
Name dependencies unnecessarily limit our ability to exploit ILP, and the number of architectural registers (as defined by the ISA) may be too small for this to be solved at compile time. Register renaming maps architectural registers to physical registers at run time, removing these false dependencies (a sketch follows below).
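A minimal sketch of renaming with a map table and a free list of physical registers; the register counts and names are illustrative:

```python
# Minimal sketch of register renaming (structure sizes are illustrative).
# A map table translates each architectural register to a physical one; every
# instruction that writes a register gets a fresh physical register from a
# free list, which removes WAR/WAW (false) dependencies.

map_table = {f"r{i}": f"p{i}" for i in range(4)}  # architectural -> physical
free_list = [f"p{i}" for i in range(4, 8)]        # unused physical registers

def rename(dest, srcs):
    # Sources are renamed with the *current* mapping, so true (RAW)
    # dependencies are preserved; only then does dest get a new register.
    phys_srcs = [map_table[s] for s in srcs]
    phys_dest = free_list.pop(0)
    map_table[dest] = phys_dest
    return phys_dest, phys_srcs

# "r1 = r2 + r3" then "r1 = r2 + r0": the WAW hazard on r1 disappears because
# the two writes target different physical registers.
print(rename("r1", ["r2", "r3"]))   # ('p4', ['p2', 'p3'])
print(rename("r1", ["r2", "r0"]))   # ('p5', ['p2', 'p0'])
```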
Dynamic Scheduling
Instructions from a large window spanning multiple basic blocks are executed out-of-order. The compiler can't produce the best schedule because it is limited by unpredictable conditional branches, a limited number of registers, the inability to disambiguate some memory addresses statically, and unpredictable cache behaviour. These unpredictable latencies can be tolerated with dynamic scheduling by executing independent instructions in the meantime.
Memory Aliasing
Two memory references involving the same memory location
Memory Disambiguation
The process of determining if two memory references will alias or not.
Out-of-order loads
Loads often start chains of dependent instructions, so we want to execute them as soon as possible (see the sketch after this list):
- Load bypassing - execute a LOAD before preceding STOREs (if they access different locations)
- Store-to-load forwarding - if a LOAD is directly dependent on a STORE, receive the data directly from the STORE without accessing memory
- Load-to-load forwarding - forward data read by one LOAD to a later one
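A minimal sketch of load bypassing and store-to-load forwarding, assuming a store queue of (age, address, data) entries ordered by program age; all names are illustrative:

```python
# Minimal sketch of load bypassing / store-to-load forwarding (illustrative).
# Older, already-executed stores sit in a store queue; an executing load
# searches it and, if the youngest older store matches its address, takes the
# data directly from the queue instead of from memory.

store_queue = [
    # (age, address, data) - lower age = older in program order
    (1, 0x100, 7),
    (2, 0x200, 9),
    (3, 0x100, 8),
]

def execute_load(load_age, load_addr, memory):
    # Consider only stores older than the load, youngest first.
    older = [s for s in store_queue if s[0] < load_age]
    for age, addr, data in sorted(older, reverse=True):
        if addr == load_addr:
            return data               # store-to-load forwarding
    return memory.get(load_addr, 0)   # no alias: bypass the stores entirely

memory = {0x100: 1, 0x300: 5}
print(execute_load(4, 0x100, memory))  # 8 - forwarded from youngest match
print(execute_load(4, 0x300, memory))  # 5 - load bypasses all three stores
```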
Restrictions on executing memory instructions out-of-order
- STORE operations cannot be undone, so we must ensure that any exception caused by an earlier instruction is handled before a store changes memory, and that a store is no longer speculative when it accesses memory.
- We must respect memory-carried dependencies and be careful not to read a memory location before it has been updated by an older store instruction.
- Shared-memory consistency models - multiprocessor systems often require that loads accessing the same memory address execute in program order.
How are stores executed with dynamic scheduling?
In program order, once there is no possibility of an outstanding exception or mispredicted branch invalidating them.
How can we enforce memory carried dependencies with dynamic scheduling?
All loads search the store queue for older instructions accessing the same memory location; if a match is found, the load receives its data from the youngest matching store in the queue. The address calculation of some older stores may still be outstanding when a load executes, so the result from the cache must be treated as speculative; if it is found to be invalid when the store's address is checked, execution restarts from the problem load (see the sketch below).
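A minimal sketch of the violation check performed when a store's address finally resolves, assuming a load queue of already-executed loads; names are illustrative:

```python
# Minimal sketch of detecting a memory-ordering violation when a store's
# address resolves *after* a younger load has already executed speculatively
# (all structures illustrative). The load queue records executed loads; the
# resolving store searches it, and any younger load to the same address
# forces a restart from that load.

load_queue = [
    # (age, address) of loads that have already executed
    (5, 0x100),
    (6, 0x200),
]

def resolve_store_address(store_age, store_addr):
    # Find the oldest younger load that speculatively read this address.
    violators = [age for age, addr in load_queue
                 if age > store_age and addr == store_addr]
    if violators:
        return min(violators)   # restart execution from this load
    return None                 # the speculation was safe

print(resolve_store_address(4, 0x200))  # 6 - load 6 read stale data, replay
print(resolve_store_address(4, 0x300))  # None - no conflict
```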
Load Wait Table
A record of load instructions that accessed memory speculatively but then failed due to a dependency on an earlier store. If a load is in this table, it is forced to wait for prior stores before it attempts to access the data cache (a sketch follows below).
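A minimal sketch of a load wait table, assuming it is simply a set of load PCs; the names are illustrative:

```python
# Minimal sketch of a load wait table (illustrative): a set of load PCs that
# have previously caused ordering violations. Such loads are held back until
# all prior stores have resolved; other loads may still issue speculatively.

load_wait_table = set()

def on_violation(load_pc):
    load_wait_table.add(load_pc)   # remember that this load misbehaved
    # (real designs typically clear the table periodically so entries retrain)

def may_issue_early(load_pc, prior_stores_resolved):
    if load_pc in load_wait_table:
        return prior_stores_resolved   # must wait for earlier stores
    return True                        # free to speculate

on_violation(0x4F0)
print(may_issue_early(0x4F0, prior_stores_resolved=False))  # False - must wait
print(may_issue_early(0x500, prior_stores_resolved=False))  # True  - speculate
```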
Load-Load ordering
Search the load queue to check for load-load ordering violations and ensure that loads to the same memory address execute in program order.
Eliminating load-store queue
- disable store-to-load forwarding
- flush the pipeline when a LOAD and a STORE which access the same address have executed out of order
- use a simple hash table indexed by memory address to detect problems
Each entry in the hash table has a small counter, which is incremented by a LOAD when it issues and decremented when it commits. If the count is non-zero when a STORE executes, an in-flight load may have read a stale value, so the pipeline is flushed and restarted (see the sketch below).
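A minimal sketch of this hash-table scheme, assuming a direct hash on the address and a small counter per entry; the table size is an arbitrary illustration:

```python
# Minimal sketch of the hash-table scheme that replaces the load-store queue
# (size and names illustrative). Each entry holds a small counter: a LOAD
# increments its entry at issue and decrements it at commit, so a non-zero
# count means some in-flight load touched an address hashing to that entry.

NUM_ENTRIES = 64
counters = [0] * NUM_ENTRIES

def index(addr):
    return addr % NUM_ENTRIES    # simple hash of the memory address

def load_issued(addr):
    counters[index(addr)] += 1

def load_committed(addr):
    counters[index(addr)] -= 1

def store_executes(addr):
    # Non-zero count: an in-flight load may have read a stale value for this
    # address (or a colliding one), so conservatively flush and restart.
    return counters[index(addr)] != 0   # True -> flush pipeline

load_issued(0x140)
print(store_executes(0x140))   # True  - possible stale load, flush
load_committed(0x140)
print(store_executes(0x140))   # False - safe
```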
Precise Interrupts - Reorder buffer
Buffer results generated out-of-order and commit them to the register file when all earlier instructions have completed. The commit process checks for mispredicted branches and exceptions; if a problem is detected, the reorder buffer is cleared and fetching restarts from the new path (see the sketch below). The reorder buffer also provides a form of register renaming, as an operand could be in either the register file or the buffer: we can either read both and use the most recent value, or maintain a mapping table.
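A minimal sketch of in-order commit from a reorder buffer; the entry layout is an illustrative assumption:

```python
# Minimal sketch of in-order commit from a reorder buffer (illustrative).
# Results are written out-of-order into ROB entries but drain to the register
# file strictly in program order; a flagged exception or mispredicted branch
# stops commit and clears everything younger.

from collections import deque

regfile = {"r1": 0, "r2": 0}
rob = deque([
    # (dest, value, done, problem) in program order
    ("r1", 10, True, None),
    ("r2", 20, True, None),
    ("r1", 30, True, "mispredict"),
    ("r2", 99, True, None),          # wrong-path result, must be discarded
])

def commit():
    while rob and rob[0][2]:             # head entry must have completed
        dest, value, done, problem = rob.popleft()
        if problem:
            rob.clear()                  # squash all younger instructions
            return problem               # refetch from the correct path
        regfile[dest] = value            # architectural state updated in order
    return None

print(commit())    # 'mispredict' - commit stops at the problem entry
print(regfile)     # {'r1': 10, 'r2': 20} - the wrong-path 99 never commits
```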
Precise Interrupts - Unified register file
Store all registers in a single large register file and use two mapping tables:
- one holding the current (future) mapping
- one which is updated as instructions commit
On an exception, simply copy the committed table back to the current one (see the sketch below).
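A minimal sketch of recovery with the two mapping tables; the names are illustrative:

```python
# Minimal sketch of exception recovery with a unified physical register file
# and two mapping tables (names illustrative): the current (future) map is
# updated at rename, the committed map only as instructions commit.

current_map = {"r1": "p1", "r2": "p2"}    # speculative, updated at rename
committed_map = dict(current_map)          # architectural, updated at commit

def rename_write(arch_reg, new_phys):
    current_map[arch_reg] = new_phys       # future mapping moves ahead

def commit_write(arch_reg, phys):
    committed_map[arch_reg] = phys         # architectural state catches up

def recover_from_exception():
    # Recovery is just a table copy: discard every speculative mapping.
    global current_map
    current_map = dict(committed_map)

rename_write("r1", "p7")                   # speculative instruction renames r1
recover_from_exception()                   # exception before it commits
print(current_map["r1"])                   # 'p1' - speculative mapping undone
```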
Clustered superscalar processors
Partition the functional units into two groups, allowing the design to be decentralised. This, however, requires an extra clock cycle to communicate between clusters.
Classic VLIW
Pack multiple independent operations into one very long instruction. Each operation slot has a fixed function associated with it, and execution is statically scheduled by the compiler. This gives the potential for simpler hardware (a sketch of slot packing follows below).
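A minimal sketch of packing independent operations into fixed-function slots, with NOPs for empty slots; the slot layout is an illustrative assumption:

```python
# Minimal sketch of packing independent operations into a classic VLIW word
# (slot layout illustrative). Each slot has a fixed function; slots with no
# independent operation available are filled with NOPs by the compiler.

SLOTS = ["alu0", "alu1", "mem", "branch"]

def pack_bundle(ready_ops):
    # ready_ops: {slot_kind: operation} chosen by the static scheduler from
    # operations proven independent at compile time.
    return [ready_ops.get(slot, "nop") for slot in SLOTS]

# Three independent operations this cycle; no branch, so that slot is a NOP.
bundle = pack_bundle({"alu0": "add r1,r2,r3",
                      "alu1": "mul r4,r5,r6",
                      "mem":  "ld  r7,0(r8)"})
print(bundle)   # ['add r1,r2,r3', 'mul r4,r5,r6', 'ld  r7,0(r8)', 'nop']
```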