Lectures 3 and 4 - Pipelining Flashcards
Hazard Classes
Structural Hazards - arise from resource conflicts
Data Hazards - Arise from inter-instruction dependencies
Control Hazards - Cause by instructions that change the program counter
Pipeline hazards
Situations that arise that force the pipeline to stall.
Data dependencies
Real - If j uses the result of i then it is data dependent on i. They are also dependent if there is a chain of dependencies.
Name - simply a result of register reuse with no data actually communicated.
Data Hazard
Created whenever the parallel execution of instructions makes it possible for a dependency to be violated.
Methods for Avoiding Data Hazards
Add hardware to avoid hazard such as data forwarding or bypass paths
Transform or schedule the code in order to prevent a data dependence becoming a data hazard, this could be done by inserting independent instructions to separate dependent instructions that would cause a stall. It can be performed by the compiler or by hardware.
Hazards arising from name dependencies can be avoided by renaming either in compiler or hardware.
Load-use Data Hazard
ALU operation must wait for result from load, it cannot execute in same lock cycle even with forwarding.
Cukd insert independent instructions or do nothing instructions in load-delay slot.
Alternatively hardware could detect hazard and stall pipeline
Control Hazards
Branch instructions may or may not change the value of the PC so how do we know what to do after we fetch a branch. Simplest approach is to just stall until outcome of branch is known.
Precise Exceptions
A processor is said to support precise exceptions if it can guarantee all instructions prior to the faulting instruction have executed and all those prior to it have not begun to execute.
Method of handling exceptions
Tag instruction if it causes an exception.
Deal with exception when instruction reaches the write back stage.
Faulting instructions are prevented from updating state. They’re handled at WB stage to ensure they’re handle in order. We save the program counter of the faulting instruction and use it to restart the program after the exception handler had completed.
Flush remaining instructions from pipeline.
Alternative approaches to exception handling
Settle for imprecise exceptions.
Use special hardware for handling exceptions which doesn’t require flushing but just stalling while dedicated hardware deals with exception.
Limits of pipelining
Difficult to achieve linear performance gains with deeper pipelines because:
- pipelining registers introduce overheads and it is difficult to provide low-skew global clock within a reasonable power budget
- adding stages may increase complexity
- must balance logic in each stage
- number of stalls due to pipeline hazards grow
Optimal Pipeline Depth
Throughout = 1/(1+(s-1)b) x 1/(t/s+c)
Then can differentiate to get ideal pipeline length ends up being approximately 23 stages.
Condition codes or flags
Branches test special flags set by ALU operations. These usually include zero, carry, negative and overflow.
CCs may constrain instruction ordering.
Condition Registers
Perform comparison and place condition codes in general purpose registers. This permits a simple implementation but costs the use of a register.
Compare and branch
Performs compare and branch in single instruction. This however potentially complicates pipelining.
Predict-not-taken scheme
Assume branch won’t be taken and continue fetching instructions. This means the penalty for intakes branches is now zero.
Evaluate branch earlier in pipeline.
Move branch test to decode stage.
Add an additional adder to calculate new pc in decode stage.
The branch penalty is now reduced to 1 cycle.
This requires a simple to evaluate branch condition. The tested condition must be ready during instruction decode. A data hazard is introduced if an instruction that sets the condition is followed by a branch.
Delayed Branch
Assume instruction after branch is always executed regardless of the branch condition. This slot is then filled with independent instruction by the compiler if one exists otherwise with no-op. The compiler can fill a single branch delay slot around 60-70% of the time.
Static branch prediction
Exploits observation that a given branch instruction is likely to be highly biased in one direction. Displacement-based prediction - predict branch taken if target address < program counter. Alternatively have compiler set prediction bit in instruction. If prediction incorrect then cancel branch. It can be improved using information collected during a profiling run.
Bi-modal prediction
1-bit:
Store branch outcome
Use state as prediction next time is taken.
2-bit:
Use 2-bit saturating counter which prevents prediction being flipped in case of single misprediction.
Local History Predictor
If branch behaviour has a repetitive pattern in its local history this can be used to improve prediction accuracy.
Global correlating predictor
Behaviour of branch is often correlated with behaviour of other recent branches so can access a different counter dispensing on behaviour of recent branches. It’s is because an individual branch can be reached along different paths so maintain unique prediction for particular path. History of recent branches is stored in k-bit register.
Tournament predictor
For particular branches global or local predictors will be superior so implement both and see which performs best on a per branch basis. This requires local predictor, global predictor and 2-bit saturating counter per branch to decide between them.
Limits to ranch prediction
Some branches are unpredictable and will appear random.
Time is needed to train predictor before correct predictions can be made
Limited Size - history and number of predictors
Aliasing/interference -
Negative aliasing - branches map to same entry in table and move counter in opposite directions. There’s also positive/neutral aliasing.
Determining branch target
To avoid stalling on branch we need to known the next address to be fetched by end of IF stage. We use a small cache indexed by the PC of recent branches and their most recent target addresses.
Return Address Predictors
Procedures and functions may be called from multiple places in th program. Accuracy of branch target stored in BTB may be very low. This can be solve by using a small hardware stack to store these addresses.
Avoiding branches
Loop count register - decrement and branch instruction
Predicated execution - execution of all instructions is conditional this transforms control dependency into data dependency but costs opcode bits and nullified instructions in pipeline.
Conditional moves - move if flag/nominated reg is set.