Superscalar, VLIW, OoO Flashcards
What are the 3 types of multiple-issue processors?
Statically scheduled superscalar
Dynamically scheduled superscalar
VLIW (very long instruction word)
What is the goal of multiple-issue processors?
Achieve a CPI < 1
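A small worked example of what CPI below 1 means. The cycle and instruction counts are made-up numbers for illustration, not measurements, and the function name `cpi` is only a helper for this sketch.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction: lower is better."""
    return cycles / instructions

# Hypothetical run of 1000 instructions.
scalar = cpi(cycles=1000, instructions=1000)     # ideal single-issue pipeline
dual_issue = cpi(cycles=500, instructions=1000)  # ideal 2-issue processor

print(scalar)          # 1.0
print(dual_issue)      # 0.5
print(1 / dual_issue)  # 2.0 instructions per cycle (IPC = 1 / CPI)
```

A single-issue pipeline can at best reach CPI = 1, so CPI < 1 is only possible when more than one instruction completes per cycle.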
What is a statically-scheduled superscalar processor?
In-order execution
Multiple instructions issued per cycle
What is a dynamically-scheduled superscalar processor?
Out-of-order (OoO) execution
Multiple instructions issued per cycle, scheduled by the hardware
What is a very long instruction word (VLIW) processor?
In-order execution
Fixed number of instructions issued.
What is a superscalar architecture?
Several instructions issued simultaneously and executed independently.
Pipelining enables parallel execution across different stages.
Superscalar enables parallel execution within the same stage.
What components do we need to add to enable multiple issuing of instructions?
One way is to duplicate the pipeline; in essence, this gives a multi-core system.
Both pipelines must fetch from the same instruction cache, and write to the same data cache.
PC must be synchronised across cores (PC, PC + 1).
Shared register file.
Implement forwarding across cores.
More complex hazard detection across pipelines.
What is the downside of duplicating the pipeline to achieve multi-issuing of instructions?
Much added complexity in the implementation
What does a 3-issue superscalar processor consist of?
Multiple fetch and decode units.
Instructions share the functional units (EX stage). There are multiple FUs with different purposes (INT, FP, BRANCH, etc.)
Multiple MEM and WB stages. The MEM stage might not be duplicated for every issue slot, to avoid needing many ports on the caches.
The CPU decides which instructions can be executed in parallel. This requires extra logic to check whether an instruction can actually be issued alongside the others.
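The parallel-issue check can be sketched roughly as follows. This is a simplified software model, not how any particular CPU implements it: the `Instr` type and `can_issue_together` are hypothetical names, only RAW and WAW dependences inside one issue group are checked, and structural hazards (e.g. two instructions needing the same functional unit) are ignored.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

class Instr(NamedTuple):
    op: str
    dest: Optional[str]     # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]   # registers read

def can_issue_together(group: Sequence[Instr]) -> bool:
    """True if no instruction depends on an earlier one in the same group."""
    for i, later in enumerate(group):
        for earlier in group[:i]:
            if earlier.dest is None:
                continue
            raw = earlier.dest in later.srcs   # later reads what earlier writes
            waw = earlier.dest == later.dest   # both write the same register
            if raw or waw:
                return False
    return True

a = Instr("add", "r1", ("r2", "r3"))
b = Instr("sub", "r4", ("r1", "r5"))   # reads r1, which `a` writes
c = Instr("mul", "r6", ("r7", "r8"))   # independent of `a`

print(can_issue_together([a, c]))  # True
print(can_issue_together([a, b]))  # False
```

The number of comparisons grows quadratically with the issue width, which is one reason this logic becomes expensive in hardware.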
What does a VLIW consist of?
Only one instruction is fetched at a time; however, this instruction has an explicit encoding that allows it to hold multiple instructions.
A VLIW instruction format defines the different types of instructions that can be executed in parallel.
The instruction format consists of one field each for:
FP ADD instruction, FP MUL instruction, INT ALU, Branch and ld/sw.
Compiler takes care of detecting independent instructions.
The goal is to fill as many fields of each VLIW instruction as possible. The trade-off is that the compiler must be able to find enough independent instructions to fill them.
If no instruction is found for a field, e.g. the FP MUL and Branch fields, that field must be filled with a NOP. This wastes space on instructions that do nothing.
Trade-off: instruction format width
- A wider format allows more instructions to be executed in parallel
- However, it is harder for the compiler to find instructions to fill all the fields
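A minimal sketch of how a bundle with the five fields above might be filled, with NOPs in the unused slots. The slot names, `pack_bundle`, `utilisation`, and the assembly strings are all hypothetical and only serve the illustration; a real VLIW encoding is a fixed-width bit format, not a tuple of strings.

```python
SLOTS = ("FP_ADD", "FP_MUL", "INT_ALU", "BRANCH", "MEM")

def pack_bundle(ops):
    """Build one VLIW bundle; slots without an operation get a NOP."""
    unknown = set(ops) - set(SLOTS)
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    return tuple(ops.get(slot, "NOP") for slot in SLOTS)

def utilisation(bundles):
    """Fraction of slots that hold a real operation."""
    used = sum(op != "NOP" for bundle in bundles for op in bundle)
    return used / (len(bundles) * len(SLOTS))

# The compiler found only three independent operations for this bundle.
bundle = pack_bundle({
    "FP_ADD": "fadd f1, f2, f3",
    "INT_ALU": "addi r1, r1, 8",
    "MEM": "ld f4, 0(r2)",
})

print(bundle[1], bundle[3])   # NOP NOP  (FP_MUL and BRANCH slots are empty)
print(utilisation([bundle]))  # 0.6
```

Widening `SLOTS` raises the peak parallelism, but every slot the compiler cannot fill lowers the utilisation and still takes up space in the binary.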
What is an advantage of VLIWs?
No additional hardware logic is needed to detect whether instructions are independent.
This is done by the compiler.
How can we avoid empty instruction slots in a VLIW?
Loop unrolling: Gives more loop bodies, more likely to find independent instructions
Local scheduling: Re-organise instructions within one body of execution
Global scheduling: Scheduling across branches
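Loop unrolling can be illustrated with a small sketch. It is written in Python only to show the shape of the transformation a VLIW compiler would apply to machine code; the Python interpreter itself gains no parallelism from it, and the function names are hypothetical.

```python
def sum_rolled(xs):
    total = 0
    for x in xs:
        total += x          # every iteration depends on the previous total
    return total

def sum_unrolled4(xs):
    """Unrolled by 4 with separate accumulators."""
    s0 = s1 = s2 = s3 = 0
    n = len(xs) - len(xs) % 4
    for i in range(0, n, 4):
        # These four additions are independent of each other,
        # so a scheduler could place them in the same bundle.
        s0 += xs[i]
        s1 += xs[i + 1]
        s2 += xs[i + 2]
        s3 += xs[i + 3]
    for x in xs[n:]:        # leftover elements when len(xs) is not a multiple of 4
        s0 += x
    return s0 + s1 + s2 + s3

data = list(range(10))
print(sum_rolled(data), sum_unrolled4(data))  # 45 45
```

The unrolled body is larger, which is the code-size cost that comes with this technique.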
What are some issues with VLIW? (3)
Increased code size: loop unrolling, potential empty slots
VLIWs operate in lock-step: There are no hazards between instructions, since the compiler takes care of them. However, if one instruction is delayed (e.g. by a cache miss), the whole pipeline must stall.
Reduced binary code compatibility: The VLIW instruction format varies across architectures. More performant CPUs might want more instructions in parallel, while smaller CPUs might only need a few. These would require different instruction encodings.
What is static scheduling?
Processor executes instructions in program order.
Any prior reordering is done by the compiler. No re-ordering in hardware.
What is dynamic scheduling?
Out-of-order (OoO) execution
The CPU itself reorders instructions by identifying which instructions can execute in parallel and which are dependent on each other.
These CPUs must preserve the original order of operations as defined by the program: results must look as if the instructions had executed in program order.
Reordering and renaming are handled by Tomasulo’s algorithm.
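The difference between in-order and out-of-order execution can be sketched with a toy timing model. This is not Tomasulo's algorithm: it assumes unlimited functional units and issue width, tracks only RAW dependences (as if renaming had already removed WAR and WAW), and the program, register names, and latencies are made up.

```python
def start_cycles(program, in_order):
    """Return the cycle in which each instruction starts executing.

    program: list of (name, dest, srcs, latency) tuples in program order.
    """
    ready_at = {}    # register -> cycle at which its value is available
    starts = []
    prev_start = 0
    for name, dest, srcs, latency in program:
        # An instruction can start once all of its source values are ready.
        start = max((ready_at.get(r, 0) for r in srcs), default=0)
        if in_order:
            # In-order: may not start before the previous instruction did.
            start = max(start, prev_start)
        starts.append(start)
        prev_start = start
        ready_at[dest] = start + latency
    return starts

program = [
    ("ld",  "r1", ("r9",), 3),   # slow load
    ("add", "r2", ("r1",), 1),   # needs the loaded value
    ("mul", "r3", ("r8",), 1),   # independent of both
]

print(start_cycles(program, in_order=True))   # [0, 3, 3]
print(start_cycles(program, in_order=False))  # [0, 3, 0]
```

In the out-of-order case the independent `mul` no longer waits behind the stalled `add`, which is the benefit dynamic scheduling is after.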
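How are anti-dependences (WAR hazards) taken care of in OoO-processors?
Register renaming: each write gets a fresh physical register, so a later write cannot overwrite a value that an earlier instruction still needs to read.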
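A minimal sketch of register renaming, assuming an unlimited supply of physical registers that are never freed. The function `rename`, the `p0, p1, ...` naming, and the two-instruction program are hypothetical; real hardware uses a rename table and a free list of bounded size.

```python
from itertools import count

def rename(program, arch_regs):
    """Rename architectural registers to fresh physical registers.

    program: list of (dest, srcs) pairs in program order.
    """
    fresh = count()
    mapping = {r: f"p{next(fresh)}" for r in arch_regs}
    renamed = []
    for dest, srcs in program:
        new_srcs = tuple(mapping[s] for s in srcs)  # read current mappings first
        mapping[dest] = f"p{next(fresh)}"           # every write gets a new register
        renamed.append((mapping[dest], new_srcs))
    return renamed

program = [
    ("r1", ("r2", "r3")),   # reads r2
    ("r2", ("r4", "r5")),   # writes r2: anti-dependence (WAR) on the line above
]

print(rename(program, ["r1", "r2", "r3", "r4", "r5"]))
# [('p5', ('p1', 'p2')), ('p6', ('p3', 'p4'))]
```

After renaming, the second instruction writes `p6` while the first still reads `p1`, so the two can execute in either order.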