Branches Flashcards
What are the requirements for branch prediction?
Must work using only the PC of the current instruction (before it is even decoded).
Must correctly guess whether the instruction is a taken branch.
If taken, must also correctly guess the target PC.
Equation for branch prediction accuracy
CPI = 1 (or ideal CPI) + (# mispreds/# total insts) * (penalty per mispred)
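A quick worked example with hypothetical numbers: with an ideal CPI of 1, 20% of instructions being branches, 90% prediction accuracy (so 2% of all instructions are mispredicted branches), and a 3-cycle penalty, CPI = 1 + 0.02 * 3 = 1.06.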
What percent of instructions are branches in the average program?
20%
Does it ever make sense to NOT use branch prediction?
No! If you don’t use branch prediction, you always have to wait until the instruction is decoded to move forward, so in the canonical 5-stage pipeline you lose at least two cycles on every instruction. If it’s a branch, you lose 3 cycles waiting to find out which way the branch goes.
With the simple not-taken predictor, you only waste cycles when there’s a mispredicted branch.
Why do we need predictors that are better than predict-not-taken?
Because the difference becomes large for deep pipelines or when we can execute multiple instructions per cycle, i.e., better prediction lets us maximize performance on more sophisticated hardware.
Branch Target Buffer (BTB)
The simplest history-based predictor. Each instruction's PC indexes into a table where we store the PC of the actual next instruction from the last time this instruction ran. Keep the table small so it's fast; we only need entries for the instructions that are coming up soon anyway.
How do you map instruction PCs to their entries in the BTB?
Using the least significant bits!
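A minimal sketch of BTB indexing and lookup, assuming a 1024-entry table, 32-bit PCs, and 4-byte-aligned instructions (all hypothetical choices):

    #include <stdint.h>

    #define BTB_ENTRIES 1024                /* hypothetical table size (a power of two) */

    static uint32_t btb[BTB_ENTRIES];       /* each entry: predicted next PC */

    /* Index with the least significant PC bits (skipping the low bits that are
       always zero if instructions are 4-byte aligned, an assumption here). */
    static inline uint32_t btb_index(uint32_t pc) {
        return (pc >> 2) & (BTB_ENTRIES - 1);
    }

    uint32_t btb_predict(uint32_t pc) {
        return btb[btb_index(pc)];          /* PC that followed this instruction last time */
    }

    void btb_update(uint32_t pc, uint32_t actual_next_pc) {
        btb[btb_index(pc)] = actual_next_pc;
    }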
Direction predictor
We index into a Branch History Table (BHT) using the least significant bits of the PC, just like with the BTB. The BHT entry holds a 0 if the branch was not taken last time and a 1 if it was. If the BHT predicts taken, we go to the BTB to get the target; otherwise we fetch the next sequential instruction. Once the branch resolves, we update the BHT entry. If taken, we also update the BTB.
The BHT can be large because each entry is 1 bit!
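A sketch of this 1-bit direction predictor; the table size is a hypothetical choice, and the BTB update is left out for brevity:

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 4096                      /* hypothetical; can be big since entries are 1 bit */

    static bool bht[BHT_ENTRIES];                 /* 1 = taken last time, 0 = not taken */

    static inline uint32_t bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_ENTRIES - 1);     /* least significant PC bits, like the BTB */
    }

    /* Predict: if taken, fetch from the BTB target; if not, fetch the next sequential PC. */
    bool predict_taken_1bit(uint32_t pc) {
        return bht[bht_index(pc)];
    }

    /* Once the branch resolves, remember its actual direction
       (and if it was taken, the BTB would be updated with the target too). */
    void update_1bit(uint32_t pc, bool taken) {
        bht[bht_index(pc)] = taken;
    }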
When do 1-bit predictors work/not work?
They work when the number of not taken branches vastly outweighs the number taken or vice versa.
They don’t work when the number taken/not taken are closer to each other, or in short loops.
Note that each anomaly causes 2 mispredictions, not one.
Explain the 2-bit predictor
It’s a counter that counts up when a branch is taken and counts down when it’s not. The advantage is that single anomalies only cost one misprediction.
It has 2 bits: a prediction bit and a conviction (hysteresis) bit. From Strong Not Taken to Strong Taken the states look like
00 (Strong Not Taken)
01 (Weak Not Taken)
10 (Weak Taken)
11 (Strong Taken)
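A sketch of the 2-bit saturating counter (00 = Strong Not Taken up through 11 = Strong Taken), again with a hypothetical table size:

    #include <stdbool.h>
    #include <stdint.h>

    #define CTR_ENTRIES 4096                      /* hypothetical table size */

    static uint8_t ctr[CTR_ENTRIES];              /* 0..3: 00/01 predict not taken, 10/11 predict taken */

    bool predict_taken_2bit(uint32_t pc) {
        return ctr[(pc >> 2) & (CTR_ENTRIES - 1)] >= 2;   /* the top (prediction) bit decides */
    }

    void update_2bit(uint32_t pc, bool taken) {
        uint32_t i = (pc >> 2) & (CTR_ENTRIES - 1);
        if (taken  && ctr[i] < 3) ctr[i]++;       /* count up toward Strong Taken, saturating at 11 */
        if (!taken && ctr[i] > 0) ctr[i]--;       /* count down toward Strong Not Taken, saturating at 00 */
    }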
Why not use more bits for a predictor/counter (3-bit, 4-bit, etc)?
First, cost. Second, these predictors are only really helpful when anomalies come in streaks, but empirically, they don’t!
1-bit history with two 2-bit counters
This allows us to learn a simple pattern like (T, N, T, N). The 1-bit history just tells us whether the last outcome was taken or not, and it selects which of the two 2-bit counters to use (one for each possible history value).
It looks like this: (0, 01, 11). That would mean the last branch was not taken and we predict the current branch is not taken because of the 01 (weak not taken) value.
If the 1-bit history had a value of one, we would index into the 11 (strong taken).
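A sketch of a single entry of this predictor, matching the (history, counter-for-history-0, counter-for-history-1) layout above:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint8_t history;       /* 1 bit: was this branch taken last time? */
        uint8_t counter[2];    /* one 2-bit saturating counter per history value */
    } hist1_entry;

    /* Predict with the counter selected by the current 1-bit history. */
    bool predict_hist1(const hist1_entry *e) {
        return e->counter[e->history] >= 2;
    }

    /* Update the selected counter, then record the new outcome in the history. */
    void update_hist1(hist1_entry *e, bool taken) {
        uint8_t *c = &e->counter[e->history];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        e->history = taken ? 1 : 0;
    }

In the (0, 01, 11) example above, history = 0 selects counter[0] = 01, so the prediction is not taken.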
2-bit history predictor
It looks like the 1-bit history with two 2-bit counters, just bigger! We have something like (2-bit history, one counter for a history of 00, one for 01, one for 10, one for 11). Five 2-bit fields in total, so 10 bits per branch.
This allows us to learn a pattern of three like (N, N, T).
What patterns can an n-bit history predict correctly?
All patterns of length <= n+1
What is the cost/size of an n-bit history predictor?
n + 2*2^n bits per entry (n history bits plus 2^n two-bit counters). For n = 10, that's 10 + 2*1024 = 2058 bits per branch.
What is the waste of an n-bit history predictor?
Most 2-bit counters are wasted! To learn a pattern of length 11, you need a 10-bit history.
That has 2^10 2-bit counters, or 1024. But there are only 11 histories that need to be tracked to predict the pattern (one pattern starts on the first instruction, another on the second, etc, always looping back around). So we only ever use 11 of the 2-bit counters. The rest are wasted.
Put another way, the 1024 2-bit counters enable you to track every possible combination of 10 bits, but you only need to track 11 of them to detect a pattern of length 11.
History with shared counters
You store n-bit histories in a Pattern History Table (PHT) and a POOL of 2-bit counters (instead of having 2^n of them) in a Branch History Table (BHT). The pool is shared by different PCs with different histories, allowing you to have fewer counters.
It works by indexing into the PHT using least significant bits of the instruction PC, like with other methods. The PHT entry stores the pattern. We index into the BHT using a combination of PC and the pattern (maybe with XOR), where we increment/decrement our 2-bit counters.
With this method you can mostly avoid overlap (though it’s still possible), and the predictor only needs kilobytes of storage instead of megabytes or worse.
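A minimal PShare-style sketch of history with shared counters, assuming an 8-bit history and hypothetical table sizes; the PC and the history are combined with XOR to pick a counter from the shared pool:

    #include <stdbool.h>
    #include <stdint.h>

    #define HIST_BITS   8                             /* hypothetical history length n */
    #define PHT_ENTRIES 1024                          /* per-branch histories (hypothetical size) */
    #define POOL_ENTRIES (1 << HIST_BITS)             /* shared pool of 2-bit counters */

    static uint16_t pht[PHT_ENTRIES];                 /* n-bit history per PC */
    static uint8_t  bht_pool[POOL_ENTRIES];           /* shared 2-bit counters */

    static inline uint32_t pht_index(uint32_t pc) {
        return (pc >> 2) & (PHT_ENTRIES - 1);         /* least significant PC bits */
    }

    static inline uint32_t pool_index(uint32_t pc, uint16_t hist) {
        return ((pc >> 2) ^ hist) & (POOL_ENTRIES - 1);   /* combine PC and pattern (XOR) */
    }

    bool pshare_predict(uint32_t pc) {
        uint16_t hist = pht[pht_index(pc)];
        return bht_pool[pool_index(pc, hist)] >= 2;
    }

    void pshare_update(uint32_t pc, bool taken) {
        uint32_t p = pht_index(pc);
        uint32_t c = pool_index(pc, pht[p]);
        if (taken  && bht_pool[c] < 3) bht_pool[c]++;
        if (!taken && bht_pool[c] > 0) bht_pool[c]--;
        /* shift the actual outcome into this branch's private history */
        pht[p] = (uint16_t)(((pht[p] << 1) | (taken ? 1 : 0)) & ((1 << HIST_BITS) - 1));
    }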
How well does history with shared counters work for different patterns?
For simple patterns that only have one possible history, they will only require one entry in the PHT and one counter in the BHT. For more complex patterns, they will require one PHT entry and one BHT counter for every possible history. This allows history with shared counters to learn complex patterns without allocating unneeded space for simple patterns.
What is Pshare and what is it good for?
Private history with shared counters. Good for alternating (even-odd) patterns or short loops; any time the past behavior of THAT BRANCH is predictive of its future behavior.
What is Gshare and what is it good for?
Global history with shared counters. We have only one history, which is combined with the instruction PC to index into the BHT. Good for correlated branches, such as…
if (x == y) {do stuff}
if (x != y) {do different stuff}
Actually very common
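For comparison, a GShare sketch under the same hypothetical sizes as the PShare sketch above; the only structural change is that a single global history register replaces the per-branch PHT:

    #include <stdbool.h>
    #include <stdint.h>

    #define GHIST_BITS    8                           /* hypothetical global history length */
    #define GPOOL_ENTRIES (1 << GHIST_BITS)           /* shared 2-bit counters */

    static uint16_t ghist;                            /* one global history of recent branch outcomes */
    static uint8_t  gpool[GPOOL_ENTRIES];

    static inline uint32_t gshare_index(uint32_t pc) {
        return ((pc >> 2) ^ ghist) & (GPOOL_ENTRIES - 1);   /* PC combined with the global history */
    }

    bool gshare_predict(uint32_t pc) {
        return gpool[gshare_index(pc)] >= 2;
    }

    void gshare_update(uint32_t pc, bool taken) {
        uint32_t i = gshare_index(pc);
        if (taken  && gpool[i] < 3) gpool[i]++;
        if (!taken && gpool[i] > 0) gpool[i]--;
        /* shift the actual outcome into the shared global history */
        ghist = (uint16_t)(((ghist << 1) | (taken ? 1 : 0)) & (GPOOL_ENTRIES - 1));
    }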
What’s better, using Gshare or Pshare?
Using both! They work well in different cases, so combine them!
Tournament predictor
You have two predictors, one good for some types of branches, another good for others (like Gshare and Pshare).
You index into both Gshare and Pshare with the PC, but also into a metapredictor. The metapredictor has a counter that, instead of predicting branch direction, predicts which predictor works best for this branch. Increment it when the first predictor is right and the second is wrong, decrement it in the opposite case, and leave it unchanged when both are right or both are wrong.
Update both predictors every time.
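A sketch of the metapredictor logic, assuming component predictors like the gshare/pshare sketches above (the function names here are hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    #define META_ENTRIES 1024                         /* hypothetical table size */

    static uint8_t meta[META_ENTRIES];                /* 2-bit counter: >= 2 means "trust gshare" */

    /* Component predictors, e.g. the gshare/pshare sketches above. */
    bool gshare_predict(uint32_t pc);
    bool pshare_predict(uint32_t pc);
    void gshare_update(uint32_t pc, bool taken);
    void pshare_update(uint32_t pc, bool taken);

    bool tournament_predict(uint32_t pc) {
        uint32_t m = (pc >> 2) & (META_ENTRIES - 1);
        return (meta[m] >= 2) ? gshare_predict(pc) : pshare_predict(pc);
    }

    void tournament_update(uint32_t pc, bool taken) {
        uint32_t m = (pc >> 2) & (META_ENTRIES - 1);
        bool g_right = (gshare_predict(pc) == taken);
        bool p_right = (pshare_predict(pc) == taken);
        if (g_right && !p_right && meta[m] < 3) meta[m]++;   /* only gshare was right */
        if (p_right && !g_right && meta[m] > 0) meta[m]--;   /* only pshare was right */
        gshare_update(pc, taken);                            /* update both predictors every time */
        pshare_update(pc, taken);
    }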
Hierarchical predictor
Like a tournament predictor, but instead of choosing between two good predictors, you choose between one good and one just OK predictor.
This allows you to use a just OK (but also very cheap!) predictor for the branches that don’t need anything more sophisticated.
You save even more by only updating the good predictor when the OK predictor fails.
When using a hierarchical predictor, how do you know which predictor to use?
The good predictor(s) have a tag array that indicates whether a particular instruction needs that predictor. If not, you move on to the less sophisticated predictor(s).