Branches Flashcards
What are the requirements for branch prediction?
Must work using only the PC of the current instruction (before it is even decoded).
Must correctly guess whether the instruction is a taken branch.
If taken, must also correctly guess the target PC.
Equation for branch prediction accuracy
CPI = 1 (or ideal CPI) + (# mispreds/# total insts) * (penalty per mispred)
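A quick worked example with hypothetical numbers: with an ideal CPI of 1, 20% of instructions being branches, 90% prediction accuracy (so 2% of all instructions are mispredicted branches), and a 3-cycle penalty, CPI = 1 + 0.02 * 3 = 1.06.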
What percent of instructions are branches in the average program?
20%
Does it ever make sense to NOT use branch prediction?
No! If you don’t use branch prediction, you always have to wait until the instruction is decoded to move forward, so in the canonical 5-stage pipeline you lose at least two cycles on every instruction. If it’s a branch, you lose 3 cycles waiting to find out which way the branch goes.
With the simple not-taken predictor, you only waste cycles when there’s a mispredicted branch.
Why do we need predictors that are better than predict-not-taken?
Because the difference becomes large for deep pipelines or when we can execute multiple instructions per cycle, i.e., better prediction lets us maximize performance on more sophisticated hardware.
Branch Target Buffer (BTB)
The simplest history-based predictor. Each instruction's PC indexes into a table where we store the PC of the actual next instruction from the last time this instruction ran. Keep the table small so it's fast; we only need entries for the instructions that are coming up soon anyway.
How do you map instruction PCs to their entries in the BTB?
Using the least significant bits!
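A minimal sketch of BTB indexing and lookup, assuming a 1024-entry table, 32-bit PCs, and 4-byte-aligned instructions (all hypothetical choices):

    #include <stdint.h>

    #define BTB_ENTRIES 1024                /* hypothetical table size (a power of two) */

    static uint32_t btb[BTB_ENTRIES];       /* each entry: predicted next PC */

    /* Index with the least significant PC bits (skipping the low bits that are
       always zero if instructions are 4-byte aligned, an assumption here). */
    static inline uint32_t btb_index(uint32_t pc) {
        return (pc >> 2) & (BTB_ENTRIES - 1);
    }

    uint32_t btb_predict(uint32_t pc) {
        return btb[btb_index(pc)];          /* PC that followed this instruction last time */
    }

    void btb_update(uint32_t pc, uint32_t actual_next_pc) {
        btb[btb_index(pc)] = actual_next_pc;
    }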
Direction predictor
We index into a Branch History Table (BHT) using the least significant bits of the PC, just like with the BTB. The BHT entry holds a 0 if the branch was not taken last time and a 1 if it was. If the BHT predicts taken, we go to the BTB to get the target; otherwise we fetch the next sequential instruction. Once the branch resolves, we update the BHT entry. If taken, we also update the BTB.
The BHT can be large because each entry is 1 bit!
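A sketch of this 1-bit direction predictor; the table size is a hypothetical choice, and the BTB update is left out for brevity:

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 4096                      /* hypothetical; can be big since entries are 1 bit */

    static bool bht[BHT_ENTRIES];                 /* 1 = taken last time, 0 = not taken */

    static inline uint32_t bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_ENTRIES - 1);     /* least significant PC bits, like the BTB */
    }

    /* Predict: if taken, fetch from the BTB target; if not, fetch the next sequential PC. */
    bool predict_taken_1bit(uint32_t pc) {
        return bht[bht_index(pc)];
    }

    /* Once the branch resolves, remember its actual direction
       (and if it was taken, the BTB would be updated with the target too). */
    void update_1bit(uint32_t pc, bool taken) {
        bht[bht_index(pc)] = taken;
    }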
When do 1-bit predictors work/not work?
They work when the number of not taken branches vastly outweighs the number taken or vice versa.
They don’t work when the number taken/not taken are closer to each other, or in short loops.
Note that each anomaly causes 2 mispredictions, not one.
Explain the 2-bit predictor
It’s a counter that counts up when a branch is taken and counts down when it’s not. The advantage is that single anomalies only cost one misprediction.
It has 2 bits: a prediction bit and a conviction (hysteresis) bit. From Strong Not Taken to Strong Taken the states look like
00 (Strong Not Taken)
01 (Weak Not Taken)
10 (Weak Taken)
11 (Strong Taken)
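A sketch of the 2-bit saturating counter (00 = Strong Not Taken up through 11 = Strong Taken), again with a hypothetical table size:

    #include <stdbool.h>
    #include <stdint.h>

    #define CTR_ENTRIES 4096                      /* hypothetical table size */

    static uint8_t ctr[CTR_ENTRIES];              /* 0..3: 00/01 predict not taken, 10/11 predict taken */

    bool predict_taken_2bit(uint32_t pc) {
        return ctr[(pc >> 2) & (CTR_ENTRIES - 1)] >= 2;   /* the top (prediction) bit decides */
    }

    void update_2bit(uint32_t pc, bool taken) {
        uint32_t i = (pc >> 2) & (CTR_ENTRIES - 1);
        if (taken  && ctr[i] < 3) ctr[i]++;       /* count up toward Strong Taken, saturating at 11 */
        if (!taken && ctr[i] > 0) ctr[i]--;       /* count down toward Strong Not Taken, saturating at 00 */
    }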
Why not use more bits for a predictor/counter (3-bit, 4-bit, etc)?
First, cost. Second, these predictors are only really helpful when anomalies come in streaks, but empirically, they don’t!
1-bit history with two 2-bit counters
This allows us to learn a simple pattern like (T, N, T, N). The 1-bit history just tells us whether the last outcome was taken or not, and it selects which of the two 2-bit counters to use (one for each possible history value).
It looks like this: (0, 01, 11). That would mean the last branch was not taken and we predict the current branch is not taken because of the 01 (weak not taken) value.
If the 1-bit history had a value of one, we would index into the 11 (strong taken).
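A sketch of a single entry of this predictor, matching the (history, counter-for-history-0, counter-for-history-1) layout above:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint8_t history;       /* 1 bit: was this branch taken last time? */
        uint8_t counter[2];    /* one 2-bit saturating counter per history value */
    } hist1_entry;

    /* Predict with the counter selected by the current 1-bit history. */
    bool predict_hist1(const hist1_entry *e) {
        return e->counter[e->history] >= 2;
    }

    /* Update the selected counter, then record the new outcome in the history. */
    void update_hist1(hist1_entry *e, bool taken) {
        uint8_t *c = &e->counter[e->history];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        e->history = taken ? 1 : 0;
    }

In the (0, 01, 11) example above, history = 0 selects counter[0] = 01, so the prediction is not taken.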
2-bit history predictor
It looks like the 1-bit history with two 2-bit counters, just bigger! We have something like (2-bit history, one counter for a history of 00, one for 01, one for 10, one for 11). Five 2-bit fields in total, so 10 bits per branch.
This allows us to learn a pattern of three like (N, N, T).
What patterns can an n-bit history predict correctly?
All patterns of length <= n+1
What is the cost/size of an n-bit history predictor?
n + 2*2^n bits per entry (n history bits plus 2^n two-bit counters). For n = 10, that's 10 + 2*1024 = 2058 bits per branch.
What is the waste of an n-bit history predictor?
Most 2-bit counters are wasted! To learn a pattern of length 11, you need a 10-bit history.
That has 2^10 2-bit counters, or 1024. But there are only 11 histories that need to be tracked to predict the pattern (one pattern starts on the first instruction, another on the second, etc, always looping back around). So we only ever use 11 of the 2-bit counters. The rest are wasted.
Put another way, the 1024 2-bit counters enable you to track every possible combination of 10 bits, but you only need to track 11 of them to detect a pattern of length 11.
History with shared counters
You store n-bit histories in a Pattern History Table (PHT) and a POOL of 2-bit counters (instead of having 2^n of them) in a Branch History Table (BHT). The pool is shared by different PCs with different histories, allowing you to have fewer counters.
It works by indexing into the PHT using least significant bits of the instruction PC, like with other methods. The PHT entry stores the pattern. We index into the BHT using a combination of PC and the pattern (maybe with XOR), where we increment/decrement our 2-bit counters.
With this method you can mostly avoid overlap (though it’s still possible), and the predictor only needs kilobytes of storage instead of megabytes or worse.
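A minimal PShare-style sketch of history with shared counters, assuming an 8-bit history and hypothetical table sizes; the PC and the history are combined with XOR to pick a counter from the shared pool:

    #include <stdbool.h>
    #include <stdint.h>

    #define HIST_BITS   8                             /* hypothetical history length n */
    #define PHT_ENTRIES 1024                          /* per-branch histories (hypothetical size) */
    #define POOL_ENTRIES (1 << HIST_BITS)             /* shared pool of 2-bit counters */

    static uint16_t pht[PHT_ENTRIES];                 /* n-bit history per PC */
    static uint8_t  bht_pool[POOL_ENTRIES];           /* shared 2-bit counters */

    static inline uint32_t pht_index(uint32_t pc) {
        return (pc >> 2) & (PHT_ENTRIES - 1);         /* least significant PC bits */
    }

    static inline uint32_t pool_index(uint32_t pc, uint16_t hist) {
        return ((pc >> 2) ^ hist) & (POOL_ENTRIES - 1);   /* combine PC and pattern (XOR) */
    }

    bool pshare_predict(uint32_t pc) {
        uint16_t hist = pht[pht_index(pc)];
        return bht_pool[pool_index(pc, hist)] >= 2;
    }

    void pshare_update(uint32_t pc, bool taken) {
        uint32_t p = pht_index(pc);
        uint32_t c = pool_index(pc, pht[p]);
        if (taken  && bht_pool[c] < 3) bht_pool[c]++;
        if (!taken && bht_pool[c] > 0) bht_pool[c]--;
        /* shift the actual outcome into this branch's private history */
        pht[p] = (uint16_t)(((pht[p] << 1) | (taken ? 1 : 0)) & ((1 << HIST_BITS) - 1));
    }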
How well does history with shared counters work for different patterns?
For simple patterns that only have one possible history, they will only require one entry in the PHT and one counter in the BHT. For more complex patterns, they will require one PHT entry and one BHT counter for every possible history. This allows history with shared counters to learn complex patterns without allocating unneeded space for simple patterns.
What is Pshare and what is it good for?
Private history with shared counters. Good for alternating (even-odd) patterns or short loops; any time the past behavior of THAT BRANCH is predictive of its future behavior.
What is Gshare and what is it good for?
Global history with shared counters. We have only one history, which is combined with the instruction PC to index into the BHT. Good for correlated branches, such as…
if (x == y) {do stuff}
if (x != y) {do different stuff}
Actually very common
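For comparison, a GShare sketch under the same hypothetical sizes as the PShare sketch above; the only structural change is that a single global history register replaces the per-branch PHT:

    #include <stdbool.h>
    #include <stdint.h>

    #define GHIST_BITS    8                           /* hypothetical global history length */
    #define GPOOL_ENTRIES (1 << GHIST_BITS)           /* shared 2-bit counters */

    static uint16_t ghist;                            /* one global history of recent branch outcomes */
    static uint8_t  gpool[GPOOL_ENTRIES];

    static inline uint32_t gshare_index(uint32_t pc) {
        return ((pc >> 2) ^ ghist) & (GPOOL_ENTRIES - 1);   /* PC combined with the global history */
    }

    bool gshare_predict(uint32_t pc) {
        return gpool[gshare_index(pc)] >= 2;
    }

    void gshare_update(uint32_t pc, bool taken) {
        uint32_t i = gshare_index(pc);
        if (taken  && gpool[i] < 3) gpool[i]++;
        if (!taken && gpool[i] > 0) gpool[i]--;
        /* shift the actual outcome into the shared global history */
        ghist = (uint16_t)(((ghist << 1) | (taken ? 1 : 0)) & (GPOOL_ENTRIES - 1));
    }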
What’s better, using Gshare or Pshare?
Using both! They work well in different cases, so combine them!
Tournament predictor
You have two predictors, one good for some types of branches, another good for others (like Gshare and Pshare).
You index into both Gshare and Pshare with the PC, but also into a metapredictor. The metapredictor has a counter that, instead of predicting branch direction, predicts which predictor works best for this branch. Increment it when the first predictor is right and the second is wrong, decrement it in the opposite case, and leave it unchanged when both are right or both are wrong.
Update both predictors every time.
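A sketch of the metapredictor logic, assuming component predictors like the gshare/pshare sketches above (the function names here are hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    #define META_ENTRIES 1024                         /* hypothetical table size */

    static uint8_t meta[META_ENTRIES];                /* 2-bit counter: >= 2 means "trust gshare" */

    /* Component predictors, e.g. the gshare/pshare sketches above. */
    bool gshare_predict(uint32_t pc);
    bool pshare_predict(uint32_t pc);
    void gshare_update(uint32_t pc, bool taken);
    void pshare_update(uint32_t pc, bool taken);

    bool tournament_predict(uint32_t pc) {
        uint32_t m = (pc >> 2) & (META_ENTRIES - 1);
        return (meta[m] >= 2) ? gshare_predict(pc) : pshare_predict(pc);
    }

    void tournament_update(uint32_t pc, bool taken) {
        uint32_t m = (pc >> 2) & (META_ENTRIES - 1);
        bool g_right = (gshare_predict(pc) == taken);
        bool p_right = (pshare_predict(pc) == taken);
        if (g_right && !p_right && meta[m] < 3) meta[m]++;   /* only gshare was right */
        if (p_right && !g_right && meta[m] > 0) meta[m]--;   /* only pshare was right */
        gshare_update(pc, taken);                            /* update both predictors every time */
        pshare_update(pc, taken);
    }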
Hierarchical predictor
Like a tournament predictor, but instead of choosing between two good predictors, you choose between one good and one just OK predictor.
This allows you to use a just OK (but also very cheap!) predictor for the branches that don’t need anything more sophisticated.
You save even more by only updating the good predictor when the OK predictor fails.
When using a hierarchical predictor, how do you know which predictor to use?
The good predictor(s) have a tag array that indicates whether a particular instruction needs that predictor. If not, you move on to the less sophisticated predictor(s).