Lectures 5, 6 & 7 - Superscalar Techniques & VLIW Flashcards

1
Q

Superscalar Processor

A

Multiple instructions are processed in each pipeline stage

2
Q

Superpipelined Processor

A

The S stages of a pipelined processor are each further divided into M sub-stages.

3
Q

How much Instruction Level Parallelism is there in each basic block?

A

1.72

4
Q

How much ILP is there if we can bypass conditional branches?

A

Approximately 2.72 * sqrt(k), where k is the number of conditional branches bypassed, up to an ILP of around 50.

5
Q

What is the data flow limit?

A

True data dependencies appear to enforce an upper limit on instruction-level parallelism: we cannot execute two instructions in parallel where one relies on the result of the other.

6
Q

How can we exceed the data flow limit?

A

Data value prediction: perfect data value prediction would provide infinite ILP, but it is very difficult to implement at a reasonable cost in practice.

Cheat: execute dependent instructions within a single clock cycle using multi-input or fast functional units (the Intel Pentium 4 has a high-speed ALU clocked at double the processor's rate), or cache and recall common results (at the instruction or function level).

7
Q

The effect of alignment on instruction fetch rate

A

Cache conflicts when fetching unaligned instruction groups mean that only some of the instructions in the cache lines accessed are valid. This limits the effectiveness of a superscalar processor, as valid instructions cannot be supplied fast enough.

8
Q

What is required to fetch multiple basic blocks per cycle?

A
  • A branch predictor that can predict the direction of multiple branches in one clock cycle.
  • A stored tree of possible branch targets for each branch in a Branch Address Cache.
9
Q

Trace Cache

A

Caches dynamic instruction sequences; each trace contains both the instructions and an indicator of whether each branch was taken or not taken. A trace hit occurs when the fetch address matches the tag and the branch predictions match the branch flags stored with the trace. It is very complex and requires a lot of logic for matching, so it has only been used in Intel's Pentium 4.

10
Q

Register Renaming

A

Name dependencies unnecessarily limit our ability to exploit ILP, and the number of architectural registers (as defined by the ISA) may be too small for the compiler to remove them at compile time. Register renaming maps architectural registers to physical registers at run time, removing these false dependencies.
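
A minimal sketch of the mechanism, assuming a simple mapping table and an in-order free list (register counts and names here are illustrative, not from the lectures):

```c
#include <stdio.h>

#define NUM_ARCH 8    /* architectural registers r0..r7 (illustrative) */
#define NUM_PHYS 32   /* physical registers p0..p31 (illustrative)     */

static int map[NUM_ARCH];        /* current arch -> phys mapping */
static int next_free = NUM_ARCH; /* toy free list: hand registers out in order */

/* Rename "rd = rs1 op rs2": source operands read the current mapping,
   the destination gets a fresh physical register. Because no physical
   register is ever written twice here, WAR and WAW (name) dependencies
   between instructions disappear. */
static void rename_insn(int rd, int rs1, int rs2)
{
    int p1 = map[rs1], p2 = map[rs2];
    int pd = next_free++;  /* stalls when next_free hits NUM_PHYS are omitted */
    map[rd] = pd;
    printf("p%d = p%d op p%d\n", pd, p1, p2);
}

int main(void)
{
    for (int i = 0; i < NUM_ARCH; i++) map[i] = i;
    rename_insn(1, 2, 3);  /* r1 = r2 op r3                         */
    rename_insn(2, 4, 5);  /* r2 = r4 op r5: the WAR hazard on r2 is
                              gone, so both can execute in parallel */
    return 0;
}
```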

11
Q

Dynamic Scheduling

A

Instructions from a large window spanning multiple basic blocks are executed out of order. The compiler cannot produce the best schedule statically, as it is limited by unpredictable conditional branches, a limited number of registers, the inability to disambiguate some memory addresses statically, and unpredictable cache behaviour. Dynamic scheduling tolerates these unpredictable latencies by executing independent instructions in the meantime.

12
Q

Memory Aliasing

A

Two memory references involving the same memory location

13
Q

Memory Disambiguation

A

The process of determining if two memory references will alias or not.

14
Q

Out-of-order loads

A

Loads often start chains of dependent instructions, so we want to execute them as soon as possible:

  • Load bypassing - execute a LOAD before preceding STOREs (if they access different locations)
  • Store-to-load forwarding - if a load is directly dependent on a store, it receives its data directly from the STORE without accessing memory
  • Load-to-load forwarding - forward data read by one load to a later load
15
Q

Restrictions on executing memory instructions out-of-order

A
  1. Store operations cannot be undone, so we must ensure that any exception caused by an earlier instruction is handled before a store changes memory, and that a store is no longer speculative when it accesses memory.
  2. We must respect memory-carried dependencies: be careful not to read a memory location before it has been updated by an older store instruction.
  3. Shared-memory consistency models - multiprocessor systems often require that loads accessing the same memory address execute in program order.
16
Q

How are stores executed with dynamic scheduling?

A

In program order, so there is no possibility of an outstanding exception or mispredicted branch when the store updates memory.

17
Q

How can we enforce memory carried dependencies with dynamic scheduling?

A

All loads search the store queue for older instructions accessing the same memory location; if a match is found, the load receives its data from the youngest matching store in the queue. The calculation of some store addresses may still be outstanding when the load executes, so the result from the cache must be treated as speculative; if it is found to be invalid when checked, execution restarts from the problem load.
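
A hedged sketch of that search, with an assumed entry layout and age-based ordering chosen for illustration rather than taken from the lectures:

```c
#include <stdbool.h>

struct store_entry {
    unsigned age;        /* program-order sequence number        */
    unsigned addr;       /* target address, once it is computed  */
    unsigned data;
    bool     addr_ready; /* has the address been calculated yet? */
};

/* Returns true and forwards data if an older store to the same address
   is found, picking the youngest such store. If any older store still
   has an unknown address, the value the load gets from the cache must
   be treated as speculative and checked later. */
static bool search_store_queue(const struct store_entry *sq, int n,
                               unsigned load_age, unsigned load_addr,
                               unsigned *data_out, bool *speculative)
{
    int best = -1;
    *speculative = false;
    for (int i = 0; i < n; i++) {
        if (sq[i].age >= load_age)
            continue;                  /* only older stores matter */
        if (!sq[i].addr_ready) {
            *speculative = true;       /* unresolved older address */
            continue;
        }
        if (sq[i].addr == load_addr && (best < 0 || sq[i].age > sq[best].age))
            best = i;                  /* keep the youngest match  */
    }
    if (best >= 0) {
        *data_out = sq[best].data;
        return true;
    }
    return false;
}
```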

18
Q

Load Wait Table

A

A record of load instructions that accessed memory speculatively but then failed due to a dependency on an earlier store. If a load is in this table, it is forced to wait for prior stores before it attempts to access the data cache.
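
A minimal sketch, assuming the table is a simple array of flags indexed by a hash of the load's PC (size and indexing are illustrative assumptions):

```c
#include <stdbool.h>

#define LWT_SIZE 1024
static bool lwt[LWT_SIZE];

static unsigned lwt_index(unsigned pc) { return (pc >> 2) % LWT_SIZE; }

/* Called when a speculatively executed load is found to have violated
   a dependency on an earlier store: remember the offending load. */
void record_violation(unsigned pc) { lwt[lwt_index(pc)] = true; }

/* Future instances of that load wait for prior stores to complete
   before attempting to access the data cache. */
bool must_wait(unsigned pc) { return lwt[lwt_index(pc)]; }
```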

19
Q

Load-Load ordering

A

Search the load queue to check for load-load ordering violations and ensure that loads to the same memory address execute in program order.

20
Q

Eliminating load-store queue

A
  • Disable store-to-load forwarding.
  • Flush the pipeline when a LOAD and a STORE which access the same address have executed out of order.
  • Use a simple hash table indexed by memory address to detect problems.

Each entry in the hash table has a small counter. A LOAD increments the counter when it issues and decrements it when it commits. If the count is non-zero when a STORE executes, a load may have read a stale value, so the pipeline is flushed and restarted.
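
A sketch of the counter scheme; the table size and hash function are illustrative assumptions:

```c
#include <stdbool.h>

#define TABLE_SIZE 256
static unsigned char counter[TABLE_SIZE];  /* one small counter per entry */

static unsigned hash_addr(unsigned addr) { return (addr >> 2) % TABLE_SIZE; }

/* A LOAD increments its entry when it issues... */
void load_issue(unsigned addr)  { counter[hash_addr(addr)]++; }

/* ...and decrements it when it commits. */
void load_commit(unsigned addr) { counter[hash_addr(addr)]--; }

/* A STORE that finds a non-zero count may have been bypassed by a
   load that read a stale value: flush and restart the pipeline.
   Hash collisions only cause extra flushes, never missed ones. */
bool store_must_flush(unsigned addr) { return counter[hash_addr(addr)] != 0; }
```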

21
Q

Precise Interrupts - Reorder buffer

A

Buffer results generated out of order and commit them to the register file once all earlier instructions have completed. The commit process checks for mispredicted branches and exceptions; if a problem is detected, the reorder buffer is cleared and fetch restarts from the correct path. The reorder buffer also provides a form of register renaming, as operands may be in either the register file or the buffer: we can either read both and use the most recent value, or maintain a mapping table.
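
A sketch of in-order commit from a reorder buffer (entry fields and size are illustrative assumptions; allocation and the renaming side are omitted):

```c
#include <stdbool.h>

#define ROB_SIZE 8

struct rob_entry {
    bool done;       /* result has been produced             */
    bool exception;  /* instruction raised an exception      */
    bool mispredict; /* instruction is a mispredicted branch */
    int  dest;       /* architectural destination register   */
    int  value;      /* buffered, not-yet-committed result   */
};

static struct rob_entry rob[ROB_SIZE];
static int head, tail;   /* commit from head, allocate at tail */

/* Commit completed instructions in program order only. On a problem,
   clear the buffer so the architectural state stays precise; fetch
   would then restart from the handler or the correct path. */
void commit(int regfile[])
{
    while (head != tail && rob[head].done) {
        if (rob[head].exception || rob[head].mispredict) {
            head = tail = 0;          /* squash all in-flight results */
            return;
        }
        regfile[rob[head].dest] = rob[head].value;
        head = (head + 1) % ROB_SIZE;
    }
}
```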

22
Q

Precise Interrupts - Unified register file

A

Store all registers in a single large register file and use two mapping tables:

  • one holding the current (future) mapping
  • one which is updated as instructions commit

On an exception, simply copy the committed table back to the current one.
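
With this scheme, exception recovery reduces to a table copy, as in this small sketch (the register count is an illustrative assumption):

```c
#define NUM_ARCH 8

static int current_map[NUM_ARCH];    /* speculative (future) mapping   */
static int committed_map[NUM_ARCH];  /* updated as instructions commit */

/* On an exception, discard the speculative mapping by copying the
   committed table back over the current one. */
void recover(void)
{
    for (int i = 0; i < NUM_ARCH; i++)
        current_map[i] = committed_map[i];
}
```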

23
Q

Clustered superscalar processors

A

Partition the functional units into two groups, allowing the design to be decentralised. This, however, requires an extra clock cycle to communicate between clusters.

24
Q

Classic VLIW

A

Pack multiple independent operations into one very long instruction. Each operation slot has a fixed function associated with it. Execution is statically scheduled by the compiler, giving the potential for simpler hardware.

25
Q

Loop Unrolling

A

Replicate the loop body n times and apply an acyclic scheduler. Multiply the loop-variable increment/decrement by n and adjust the offsets of memory accesses according to the block to which they belong.
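
A concrete illustration in C; the loop itself is invented for this example, and the unrolled version assumes n is a multiple of 4:

```c
/* Rolled: one increment and one compare-and-branch per element. */
void scale_rolled(int *a, const int *b, int k, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] * k;
}

/* Unrolled by 4: the loop-variable increment is multiplied by 4,
   memory offsets are adjusted for the block each copy belongs to,
   one branch now covers four elements, and the four independent
   multiplies give the scheduler more ILP to exploit. */
void scale_unrolled(int *a, const int *b, int k, int n)
{
    for (int i = 0; i < n; i += 4) {
        a[i]     = b[i]     * k;
        a[i + 1] = b[i + 1] * k;
        a[i + 2] = b[i + 2] * k;
        a[i + 3] = b[i + 3] * k;
    }
}
```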

26
Q

Benefits of loop unrolling

A

Permits the compiler to generate a better schedule by increasing the size of the loop's basic block.

Eliminates branches.

Eliminates several loop-counter increment/decrement instructions.

27
Q

Disadvantages of loop unrolling

A

Code size will increase

Additional registers may be required to schedule the larger loop body.

ILP is still limited at the start and end of the unrolled loop body, a cost which is incurred on every iteration.