Compiler techniques and prefetching Flashcards
What are some good and bad features of compilers?
Good: Has plenty of time to decide what to do, since the analysis happens offline at compile time
Bad: Has less knowledge about the program's runtime behaviour than the hardware does
When illustrating a pipeline, how can you visualize that some functional units take more than one cycle (e.g. FP)?
Draw a loop from the end of the FU back to its beginning, showing that an instruction stays in that unit for several cycles
What does compiler optimization aim to fix?
Reduce the effect of stalls (data hazards, long-latency FP operations, etc.)
What can the compiler do to reduce the amount of stalls?
Static scheduling
Loop unrolling
What is static scheduling?
- Reorder independent instructions into the places where the program would otherwise stall
- If an otherwise independent instruction updates a register that a nearby load or store uses as its base (address = register + offset), the instruction can still be moved past that memory access as long as the offset is adjusted to match the new register value, so the effective address is preserved (see the sketch below)
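A minimal sketch in C of the same idea (a hypothetical add-a-constant loop, not from the card; a real compiler performs this at the instruction level). The pointer update is independent of the add, so it is moved between the load and its use to hide the load latency, and the store offset is adjusted so the store still writes the element that was loaded:

/* Sketch: the pointer update fills the slot between the load and its use;
   the store offset (p[-1]) is adjusted so the effective address is unchanged. */
void scale(double *x, int n, double c) {
    double *p = x;
    double *end = x + n;
    while (p != end) {
        double v = p[0];    /* load                                     */
        p = p + 1;          /* independent update, fills the load delay */
        double r = v + c;   /* use of v, one slot later                 */
        p[-1] = r;          /* offset adjusted: same address as before  */
    }
}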
What is loop unrolling?
Lay out several copies of the loop body after each other.
If an operation is common to all iterations, e.g. ADD R1, R1, #-8, and 4 copies of the body are laid out, that ADD can be removed from the individual copies and placed once at the bottom with an updated immediate value:
ADD R1, R1, #-32
Instead of branching back to the loop label after every iteration, place the branch instruction once, after the copies that have been laid out.
With 4 copies of the body we only branch once every 4 iterations instead of after every iteration (see the sketch below).
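A minimal sketch in C (hypothetical add-a-constant loop, assuming the iteration count n is a multiple of 4). The single index update and the single branch are shared by the four copies of the body, the C-level counterpart of replacing four ADD #-8 instructions with one ADD #-32:

/* Sketch: four copies of the body, one update and one branch per 4 iterations. */
void scale4(double *x, int n, double c) {
    for (int i = 0; i < n; i += 4) {  /* one index update + one branch per 4 iterations */
        x[i]     += c;
        x[i + 1] += c;
        x[i + 2] += c;
        x[i + 3] += c;
    }
}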
How can static scheduling be used with loop unrolling?
When the copies of the body have been laid out, there are more instructions available to reorder.
For example, if every iteration contains a LOAD and an ADD that depends on it, the loads from all copies can be placed directly after each other first, followed by the dependent ADDs. Because each ADD is now far enough from its load, the pipeline does not need to stall (see the sketch below).
(Might need to use register renaming to avoid hazards)
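A minimal sketch in C (same hypothetical loop as above, n a multiple of 4). The separate temporaries v0..v3 play the role of register renaming:

/* Sketch: after unrolling, all four independent loads are issued first and the
   dependent adds afterwards, so every add is far enough from its load. */
void scale4_sched(double *x, int n, double c) {
    for (int i = 0; i < n; i += 4) {
        double v0 = x[i];        /* group the loads...         */
        double v1 = x[i + 1];
        double v2 = x[i + 2];
        double v3 = x[i + 3];
        x[i]     = v0 + c;       /* ...then the dependent adds */
        x[i + 1] = v1 + c;
        x[i + 2] = v2 + c;
        x[i + 3] = v3 + c;
    }
}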
How can loop unrolling be done for dynamic loops?
Dynamic loops are loops whose iteration count is not known at compile time; it can for example depend on user input.
Do the unrolling in 2 stages (n: number of iterations, k: number of copies of the body):
- First execute the body of the original loop (n mod k) times (the leftover iterations that do not divide evenly by k and therefore do not fit the unrolled loop)
- Then execute the unrolled body (n / k) times
In most cases n is large, so most of the execution time is spent in the unrolled loop (see the sketch below).
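A minimal sketch in C with k = 4 (same hypothetical loop as above, n only known at run time):

/* Sketch: the first loop runs the original body (n mod 4) times;
   the second, unrolled loop then runs n/4 times. */
void scale_dynamic(double *x, int n, double c) {
    int i = 0;
    for (; i < n % 4; i++)       /* leftover iterations: original body */
        x[i] += c;
    for (; i < n; i += 4) {      /* unrolled body, executed n/4 times  */
        x[i]     += c;
        x[i + 1] += c;
        x[i + 2] += c;
        x[i + 3] += c;
    }
}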
What are some cons of loop unrolling? (3)
Growth in code size: needs more instruction memory (e.g. flash)
For large loops: multiple copies of the same instructions take up space in the L1 instruction cache, which can increase the instruction cache miss rate
Register pressure: requires many more registers; if there are not enough registers to hold all live values, we may lose the advantage of loop unrolling
What are the 3 types of cache misses?
Compulsory: first-reference miss; occurs even with an infinite cache
Capacity: the cache cannot hold all the blocks the program needs; would not happen in an infinite cache
Conflict: caused by restrictions on where a block can be placed; does not occur in a fully associative cache
What is prefetching?
Speculating on which memory accesses will happen in the future.
Bringing the speculated data from lower levels of the memory hierarchy into the caches before it is needed.
What types of cache misses does prefetching aim to fix?
Compulsory misses
What are some issues that prefetching needs to deal with?
Prediction: need good predictions of future accesses
Timeliness: data must be available at the right time, neither too late nor so early that it is evicted before use
Resources: avoid cache and bandwidth pollution; too much prefetching takes resources away from demand accesses
What are some questions we need to consider when doing prefetching?
What addresses to fetch?
When to initiate a prefetch request?
Where in the memory hierarchy to place the prefetched data?
How should prefetching be done? (SW, HW, execution-based, cooperative)
Why is it important to be careful when deciding what addresses to prefetch?
Prefetching uses resources; if the prefetched data turns out to be useless, those resources are wasted.
(Bandwidth, cache space, prefetch buffer space, energy)
How is the metric Accuracy calculated in prefetching?
Accuracy = Useful_prefetches / total_number_of_prefetches
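For example, if 1000 prefetches are issued and 800 of the prefetched blocks are actually used before being evicted, accuracy = 800 / 1000 = 0.8.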
How do we know what to prefetch?
Predict based on previous access patterns.
Use the compiler's knowledge of the program's data structures and access patterns (useful when doing software prefetching; see the sketch below).
Use a dedicated prefetching algorithm.
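A minimal sketch of software prefetching in C (assumes GCC/Clang, which provide the __builtin_prefetch hint; the loop, the function name and the prefetch distance of 16 elements are made up here and would have to be tuned to the memory latency and the work per iteration):

/* Sketch: while summing x[i], ask for x[i + 16] so that block is
   (hopefully) already in the cache when the loop reaches it. */
void sum_with_prefetch(const double *x, int n, double *out) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&x[i + 16], 0, 3);  /* read hint, high temporal locality */
        s += x[i];
    }
    *out = s;
}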