Superscalar, VLIW, OoO Flashcards

1
Q

What are the 3 types of multiple-issue processors?

A

Statically-scheduled
dynamically-scheduled
VLIW

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is the goal of Multipe-issue processors?

A

Achieve a CPI < 1

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is a statically-scheduled superscalar processor?

A

In-order execution
Multiple issues of instructions

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is a dynamically-scheduled superscalar processor?

A

OoO-execution
Multiple number of instructions scheduled

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is a Very long instruction word processor?

A

In-order execution
Fixed number of instructions issued.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is a superscalar architecture

A

Several instructions issued simultaneously and executed independently.

Pipelining enables parallel execution across different stages.
Superscalar enables parallel execution within the same stage.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What components do we need to add to enable multiple issuing of instructions?

A

One way is to duplicate the pipeline, in essens this gives a multi-core system.

Both pipelines must fetch from the same instruction cache, and write to the same data-caches.

PC must be synchronised across cores (PC, PC + 1).

Share register file.

Implement forwarding across cores.

More complex hazard detection across pipelines.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the downside of duplicating the pipeline to achieve multi-issuing of instructions?

A

Much added complexity in the implementation

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What does a 3-issue superscalar processor consist of?

A

Multiple fetch-, and decode units.

Instructions share the functional units (EX stage). There are multiple FUs with different purposes (INT, FP, BRANCH, etc.)

Multiple MEM- and WB-stage. Might not want to duplicate MEM stage to all instructions, to avoid many ports to the caches.

CPU decides what instructions can be executed in parallel. Because of this, needs extra logic to see if an instruction actually can be executed in parallel.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What does a VLIW consist of?

A

Only one instruction is fetched at a time, however this instruction has an an explicit encoding for allowing multiple instructions.

A VLIW instruction format defines the different types of instructions that can be executed in parallel.
The instruction format consist of a field for:
FP ADD instruction, FP MUL instruction, INT ALU, Branch and ld/sw.

Compiler takes care of detecting independent instructions.

The goal is to have as many instructions within the format of a VLIW. However, the trade off here is that the compiler must be able to identify the independent instructions that are able to fill it.

If no instructions are found for e.g. the FP MUL and Branch field, these must be inserted with NOP instructions. This results in space being taken up by instructions that are not being executed.

Trade offs: Instruction format width
- Wider allows for more instructions being executed in parallel
- However it is harder for the compiler to find instructions to fill all the fields

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is an advantage of VLIWs?

A

Avoids needing to add additional logic in hardware to detect if instructions are independent or not.
This is done by the compiler.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

How can we avoid empty instruction slots in a VLIW?

A

Loop unrolling: Gives more loop bodies, more likely to find independent instructions

Local scheduling: Re-organise instructions within one body of execution

Global scheduling: Scheduling across branches

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are some issues with VLIW? (3)

A

Increased code size: loop enrolling, potential empty slots

VLIWs operate in lock-step: There are no hazards between instruction - this is taken care of by the compiler. However, if one instruction has a delay, the whole pipeline must stall.

Reduced binary code compatibility: The VLIWs instruction format varies across architectures. More performant CPUs might want more instructions in parallel. Smaller CPUs might only need a few instructions in parallel. These would require different formats of instruction-encoding.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is static scheduling?

A

Processor executes instructions in program order.

Any prior reordering is done by the compiler. No re-ordering in hardware.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is dynamic scheduling?

A

OoO-execution

CPU itself reorders instructions by looking for instructions that can execute in parallel, and instructions that are dependent of each other.

These CPUs must preserve the original order of operations, as defined by the program to achieve correct execution.
Reordering and renaming is defined in tomasulo’s algorithm.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

How are anti-dependences taken care of in OoO-processors?

A

Register renaming

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What are the stages of an OoO pipeline?

A

IF, ID, renaming and dispatch is done in order

On dispatch, the instructions are inserted in the ROB, and into the issue buffers of the relevant execution unit.

In the issue buffers, the instructions access the register files to get their operands.

When all operands are ready, the instructions are issued on the different FUs.

18
Q

What are 3 components necessary for OoO execution?

A

Tomasulo’s algorithm

Register renaming

Reorder buffer

19
Q

How do we decide what reservation station to place an instruction?

A

Look at what type of instruction it is.

Load instruction -> load unit

20
Q

How is the ROB structured?

A

Newest on top of buffer, oldest on bottom

21
Q

What does ROB entries consist of? (4)

A

Instruction itself
Flag saying if instruction has been computed or not
Destination register
Value - when execution has completed.

22
Q

What does the reservation station entries consist of?

A

Operation to execute

Operands: the operands can be identified using the ROB-entry identifier. If this ROB entry has not finished execution, the current operand is not yet available.
This is a way of register renaming. Instead of writing the source registers, write the ROB-entry which writes to this source register.

What ROB-entry is this instruction -> what ROB entry must be updated when computation finishes.

23
Q

How is true dependencies detected in Tomasulo algorithm (for renaming)?

A

Looks at all the instructions in the ROB, and if these have written to any of the registers that a current instruction is using. This way, true dependencies are identified.

24
Q

What happens when an instruction finishes execution?

A

ROB entry updates its done-flag and insert the value computed.
Instructions dependent on this ROB-entry will then be able to use this value.

25
Q

What is branch prediction (BTP - branch target prediction?)

A

Predict if a branch is taken.
Hardware structure
Uses previous history

26
Q

What are some types of branch predictors?

A

2-bit counter
2-level adaptive predictor

27
Q

What is a 2-bit saturating counter?

A

Local branch predictor

Only considers current branch, for each branch have a 2-bit saturated counter.

2 bits - 4 states (strongly not taken - strongly taken)

28
Q

What are a two-level adaptive predictor

A

Looks at global history to find correlation between branches.

Keeps a n-bit branch history. For each branch that passes, register as either a 1 or a 0.
The length of the history is dependent on how many bits are used in the branch history register.

The value in the branch history register is used to index into a pattern history table.

Each entry in the pattern history table keeps a 2-bit saturated counter

29
Q

What is a Gshare predictor?

A

Combines local and two-level predictors.

Have an n-bit branch history.

Take XOR of branch history and n-bit hash of current branch address

The result is used to index into a table keeping the saturated counter.

Local predictor: Used by using the branch address
Global predictor: Using the branch history

30
Q

What is a disadvantage of the Gshare predictor?

A

If we want to increase the number of bits in the branch history register, we need to increase the number of bits in the hash and therefor increasing the size of the pattern history table.

It also takes time to fill the bits of the branch history register.

31
Q

What is a tournament predictor?

A

Use more than one type of predictor, e.g. have one global- and one local-predictor.

Have a selector that over time learns if the local- or the global-predictor gives best predictions.

Index into the selector using the branch address. The selector provides a control signal to a mux that decides if the local- or the global prediction should be used.

32
Q

What is a problem with needing a predictor for each branch and history?

A

Implies huge tables

History takes a long time to learn

33
Q

What is the Tagged Hybrid Predictor?

A

Tries to solve the problem of having large tables.
Instead, uses a hash the tables, and use the history that is longest.

Have a local 3-bit saturated predictor.
Then we have additional history based predictors. For each of the history based tables, the history size is increasing.
Does not store all of the bits in these tables, or else these would be very large.

First, we take the prediction from the local predictor. Then we see if we have a hit in the smallest table. If so, we choose this prediction. Then we check in all the other tables, and chooses the prediction where we get a hit in the largest history table.

34
Q

What is a BTB?

A

In parallell to branch prediction, we need to predict the target of the branch.

The BTB contains addresses of predicted targets. The PC of the current branch is used to index into the BTB. Meaning, we check all the entries in the BTB to see if the first field in the BTB entry is the current target. The second entry will in case of a hit contain the predicted target address.

Functions as a cache for target addresses. The first time a branch address index into the table, there will be a miss. The target address will in this case be calculated in the pipeline, and then stored in the BTB. This way, next time the branch address is used as index, we can retrieve the target address without needing to do a calculation.

35
Q

What is a return address predictor, and why is it used (RAS)?

A

A function(branch target) can be called from multiple addresses in a program. This makes it difficult to predict where the program should return.

The RAS is a small cache that pushes the next instruction on the stack (PC + 4), and pop from the stack on return.

36
Q

What happens if we mispredict a branch, in the tomasulo pipeline?

A

Need to flush instructions on the wrong execution path - flush from ROB.

37
Q

What is a memory hazard?

A

If the same memory location is read/written to by different instructions.

This instructions can use different registers to calculate the target memory address, and still end up with the same memory address.

38
Q

How can we avoid memory hazards?

A

WAW and WAR are eliminated because tomasulo ensure memory update in order.

For loads, not allow load to happen, if any older ROB entry occupied by a store, has a destination field that matches the value of the address field in the load.

Maintain the program order for the computation of the effective address of a load, with respect to all earlier stores.

This ensures that loads cannot access a memory location written to by an earlier store, unless this writing has finished.

39
Q

In the tomasulo algorithm, what happens during the write result stage?

A

Execution has finished.
Data is written on the Common Data Bus (watched by instructions in the issue buffers to see if their operands are ready).
Mark reservation station as available

40
Q

In the tomasulo algorithm, what happens during the commit stage?

A

Update registers with reorder results.
This happenbs when the instruction is at the head (last) in the ROB.
Result can also be written back to memory.
Instruction is removed from the ROB.