Appendix C Flashcards
Pipelining
- process of overlapping multiple instructions
- takes advantage of parallelism
- most processors use this to implement parallelism
- breaks instruction execution into multiple stages
- usually Instruction Fetch, Instruction decode, Instruction execute, Memory write, Write back
- also called pipe stages or segments
- usually Instruction Fetch, Instruction decode, Instruction execute, Memory write, Write back
- goal is to increase throuput - number of instructions per time unit
- one step inside of a pipeline is called processor cycle
Speed up on ideal machine:
time per instruction on unpipelined machine / number of pipe stages
Instruction cycle in Risc-V
5 phases:
IF - instruction fetch:
- send PC to memory and load instruction
- increment PC by 4 (instructions are 4 bytes)
ID - instruction decode:
- decode instruction
- read registers
- resolve branches and possible branch targets
- sign extension (register and immidiate values)
EX - execution/effective address:
- ALU operates on three functions:
- memory reference - ALU adds base register and offset to form effective address
- Register-register ALU instruction - ALU performs operation specified by the ALU opcode on the values in registers
- Register-immidiate ALU instruction - ALU performs operation specified by ALU opcode on value read from register and the sign-extended immidiate value
- Conditional branch - determine whether branch condition is true
- load-store architectures may perform effective address and execution in one cycle, because instructions don`t need to perform operation and calculating data address at the same time
MEM - memory access
- LOAD instruction reads using effective address
- STORE instruction performs store of the element from one register based on the address specified in second register
WB - write-back cycle:
- Register-register instruction
- load instruction
- writes result of the instruction into register
- With this model branch instructions require 3 cycles, store instructions 4 cycles, all other instructions require 5 cycles
- each stage contains pipeline registers
Fixed-field decoding
- all registers are in fixed locations
- speed-ups decoding of the instructions
- used in RISC
Hazards
- structural hazards - hardware cannot support overlapping multiple instructions
- in modern systems occurs only in slow special purpose units
- can easily be solved by compiler constructors
- data hazards - result of one instruction depends on the result of another
- control hazards - results from branches and other instructions that change the contents of PC
- solving these hazards causes stalls
Pipeline performance measurement
CPI pipelined = Ideal CPI + pipeline stalled cycles = 1 + pipelined stalled cycles
Speedup = (CPI unpipelined x clock cycle unpipelined)/(CPI pipelined x Clock cycle pipelined)
Data hazards - types
- occurs when pipeline changes order of the sequential instructions in unpiplined design
Types:
1. Read after write hazard - instruction j reads x as result from instruction i, which was not written yet. j thus has old value
2. Write after read hazard - instruction i reads value x after instruction j written its result. i thus has wrong value of x
Cannot occur in simple five stage pipeline, but can be produced if instructions are out-of-order
3. Write after write hazard - write of value x by instruction i occurs after write of x by instruction j. x will thus have a wrong value
Cannot occur in simple five stage pipeline, but can be produced if instructions are out-of-order
Forwarding(bypassing)
- technique used to reduce hazards
- result from ALU both from EX/MEM and MEM/WB is fed back to the ALU
- if forwarding hardware detects that result from the ALU corresponds to the register as source to the ALU it uses this value instead of value from the register
- doesn`t solve all hazards
Data hazards with stalls
- when data dependency cannot be forwarded (e.g. instruction needs data earlier than it`s actually produced, such as ld instruction dependency)
- pipeline interlock - detects hazards and stalls the pipeline
- pipeline is stalled until the hazard is solved
Branch hazards
- control hazard
- cause greater performance loss than data hazards
- if branch is taken, PC register will change
- if branch is untaken, value of PC is not taken
- this can cause repetition of IF stage or other different problems (e.g. incorrect instruction is taken)
- branch stalls can reduce the performance by 10% to 30%!
Reducing branch penalties
- simplest solution - freeze or flush instructions after the branch until branch is resolved - hardware simplicity
- causes fixed branch penalty
- taking every branch as not taken and performing instructions after the branch and not changing the state of the processor until the branch is solved
- determining when the state might have changed and “back out” from this state
- alternative is to take every branch as taken
- delayed branch - fetching next instruction after branch and then fetching instruction as if branch was taken
- effective without branch prediction
- problematic if the branch delay instruction is another branch - many times prohibited!
Performance of branch schemes
Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)
Branch prediction
- static - branch predictions are known in compile time
- dynamic - based on program behaviour
Static branch prediction
- based on the fact that branch prediction is bimodally distributed either in favour of branch taken or branch untaken
Dynamic branch predictions
simplest version - branch-prediction buffer, or branch history table
- simple buffer of lower bits of addresses with bit that indicates if the branch there was taken - could be modified also by different branch with the same lower bits of address - if branch was not taken, then bit is reversed to 0 - improvement is using 2 bit prediction schemes - prediction must miss twice before it changes - 11 taken, 10 taken, 00 untaken, 01 untaken - also more bits can be used, but practice shows that 2 bits are enough