Compiler techniques and prefetching Flashcards
What are some good and bad features of compilers?
Good: Has ample time to decide what to do (analysis happens at compile time, not at runtime)
Bad: Has less knowledge about the program than the hardware has at runtime (e.g. actual branch outcomes and addresses)
When illustrating a pipeline, how can you visualize that some functional units take more than one cycle (e.g. FP)?
Draw a loop from the end of the FU back to its beginning.
What does compiler optimization aim to fix?
Reduce the effect of stalls (hazards, FP operation latency, etc.)
What can the compiler do to reduce the amount of stalls?
Static scheduling
Loop unrolling
What is static scheduling?
- Reorder ‘independent’ instructions into slots where the program would otherwise stall
- If a moved instruction updates a register that a nearby instruction still uses (e.g. a load that adds an offset to that register), the compiler can adjust the offset to match the new register value, so the load address is preserved
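A minimal C-level sketch of the first point (the variables are hypothetical; real compilers do this on machine instructions):

// Before: the use of x directly follows the load -> load-use stall
x = a[i];
sum = sum + x;   // must wait for the load result
i = i + 1;       // independent of the load

// After: the independent update fills the load delay
x = a[i];
i = i + 1;
sum = sum + x;   // load result is ready by now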
What is loop unrolling?
Place a number of copies of the loop body after each other.
If an operation is common to all iterations, e.g. ADD R1, R1, #-8, and 4 copies are placed after each other, this ADD can be removed from the individual copies and placed once at the bottom with an updated immediate value:
ADD R1, R1, #-32
Instead of branching back to the loop label after every iteration, place the branch after the laid-out copies: with 4 copies we only branch once every 4 iterations instead of after every single one.
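A minimal C sketch of 4x unrolling, assuming n is a multiple of 4 (a, s, c and n are hypothetical):

// Original loop: one index update and one branch per element
for (i = 0; i < n; i++)
    s[i] = a[i] + c;

// Unrolled 4x: the index update and the branch are shared by 4 iterations
for (i = 0; i < n; i += 4) {
    s[i]     = a[i]     + c;
    s[i + 1] = a[i + 1] + c;
    s[i + 2] = a[i + 2] + c;
    s[i + 3] = a[i + 3] + c;
}

This mirrors the assembly above: with 8-byte elements, four merged ADD R1, R1, #-8 updates become one ADD R1, R1, #-32.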
How can static scheduling be used with loop unrolling?
Once the iterations have been laid out, there are more instructions available to reorder.
For example, if every iteration contains a LOAD and an ADD that depends on it, all the loads can be placed directly after each other, followed by the dependent ADDs. The loads and their ADDs are then far enough apart that the pipeline does not need to stall.
(Might need to use register renaming to avoid hazards)
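A sketch of that reordering on the 4x-unrolled body (hypothetical code; x0..x3 play the role of the renamed registers):

// Unscheduled: each load would be immediately followed by its use -> stalls
// Scheduled: all loads first, then all dependent adds
x0 = a[i];
x1 = a[i + 1];
x2 = a[i + 2];
x3 = a[i + 3];
s[i]     = x0 + c;   // by now the loads have completed
s[i + 1] = x1 + c;
s[i + 2] = x2 + c;
s[i + 3] = x3 + c;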
How can loop unrolling be done for dynamic loops?
Dynamic loops are loops for which the number of iterations is not known at compile time; it can for example depend on user input.
Do the unrolling in 2 stages (see the sketch below):
- Let n be the number of iterations and k the number of copies of the body (the unroll factor)
- Execute the original loop body (n mod k) times (the leftover iterations that do not fit the unrolling)
- Execute the unrolled body (n/k) times
In most cases n is large, so most of the time is spent in the unrolled loop.
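A C sketch with unroll factor k = 4 and n unknown at compile time (the loop body is hypothetical):

// Stage 1: run the original body (n mod k) times to handle the leftovers
for (i = 0; i < n % 4; i++)
    s[i] = a[i] + c;

// Stage 2: run the 4x-unrolled body for the remaining n - (n mod 4) iterations
for (; i < n; i += 4) {
    s[i]     = a[i]     + c;
    s[i + 1] = a[i + 1] + c;
    s[i + 2] = a[i + 2] + c;
    s[i + 3] = a[i + 3] + c;
}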
What are some cons of loop unrolling? (3)
Growth in code size: needs more instruction memory (e.g. flash)
For large loops: multiple copies of instructions take up space in the L1 instruction cache, which can increase the instruction cache miss rate
Register pressure: requires many more registers; if there are not enough registers to allocate all live values, the advantage of loop unrolling may be lost
What are the 3 types of cache misses?
Compulsory: first-reference miss; occurs even with an infinite cache
Capacity: would not happen in an infinite cache; caused by the cache's limited size
Conflict: caused by restrictions on where a block can be placed; does not occur in a fully associative cache
What is prefetching?
Speculating on what memory accesses will happen in the future.
Bringing the speculated data from lower levels of the memory hierarchy into the caches.
What types of cache misses does prefetching aim to fix?
Compulsory misses
What are some issues that prefetching needs to deal with?
Prediction: needs good predictions
Timeliness: data needs to be available at the right time
Resources: avoid cache and bandwidth pollution; too aggressive prefetching takes resources away from demand accesses
What are some questions we need to consider when doing prefetching?
What addresses to fetch?
When to initiate a prefetch request?
Where in the memory hierarchy to place the prefetched data?
How should prefetching be done? (SW, HW, execution-based, cooperative)
Why is it important to be careful when deciding what addresses to prefetch?
Prefetching uses resources; if the prefetched data is useless, those resources are wasted.
(Bandwidth, cache space, prefetch buffer space, energy)
How is the metric Accuracy calculated in prefetching?
Accuracy = Useful_prefetches / total_number_of_prefetches
How do we know what to prefetch?
Predict based on previous access patterns.
Use the compiler's knowledge of data structures (useful for software prefetching).
Use a dedicated prefetching algorithm (e.g. stride detection).
What are some consequences of timing in prefetching?
Prefetch too early: data might be evicted before it is used, and it can also evict other data that is needed later
Prefetch too late: data is not ready when requested, so the prefetch does not hide the full memory latency
What can happen when prefetching is initiated too early?
The data might be evicted from the cache before it is used.
The prefetched data might evict other data that can be requested in the future.
What can happen when prefetching is initiated too late?
Prefetching might not be able to hide the whole memory latency.
How can a prefetcher be made timelier?
By making it more or less aggressive:
- Run further ahead of the processor's access stream: done in HW by fetching more blocks at a time and speculating further into the future
- Move the prefetch instructions earlier or later in the code (done in SW; see the sketch below)
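In SW this aggressiveness is the prefetch distance: how many iterations ahead of use the prefetch runs. A sketch, assuming a generic prefetch() intrinsic and a hypothetical distance of 8 iterations:

#define PF_DIST 8   // larger = more aggressive (earlier) prefetching
for (i = 0; i < n; i++) {
    prefetch(&a[i + PF_DIST]);   // request data 8 iterations ahead of use
    sum = sum + a[i];
}

(On most ISAs prefetch instructions do not fault, so prefetching a little past the end of a[] is harmless.)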
How can prefetching lead to cache pollution?
If prefetched data is stored in cache, it might evict useful data that would have been used in the future.
Where can prefetched data be stored?
Cache:
- Simple to implement
- Might cause cache pollution
Separate prefetch buffer:
- More complex design
How do prefetch buffers add complexity? (5)
Need to decide where to place the buffer
When to access the buffer (after a cache miss, or in parallel with the cache)
When to move data from the buffer to the cache
Buffer size
Coherency
What are some things we need to decide when we store prefetch data in cache?
Are we treating prefetched data the same as demand-fetched cache blocks?
Prefetched blocks are only predictions; we do not actually know whether they will be used, so we might want to skew the replacement policy to favor demand-fetched blocks.
How much should we skew the replacement policy in their favor? (See the sketch below.)
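One possible skew (a sketch, not any specific processor's design): insert prefetched blocks near the LRU end of the set instead of at MRU, so an unused prefetch is evicted before demand-fetched data. WAYS and the insertion position are assumptions:

#include <string.h>

#define WAYS 8

// One cache set modeled as an LRU stack: index 0 = MRU, WAYS-1 = LRU.
typedef struct { long tag[WAYS]; } set_t;

// Fill a block: demand fills go to MRU, prefetch fills go near LRU.
void insert_block(set_t *s, long tag, int is_prefetch) {
    int pos = is_prefetch ? WAYS - 2 : 0;
    // Shift entries from pos downwards; the old LRU block is evicted.
    memmove(&s->tag[pos + 1], &s->tag[pos], (WAYS - 1 - pos) * sizeof(long));
    s->tag[pos] = tag;
}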
What are some pros and cons of fetching data into L1 vs. L2?
Cons L1:
- More likely to evict demand-fetched blocks that will be used
Pros L1:
- If the predictor is good, access to the prefetched data is quicker
Pros L2:
- Much larger cache, so less chance of evicting demand-fetched blocks
How can placement of the prefetcher itself affect performance?
Where the prefetcher is placed in the memory hierarchy affects what access pattern it sees:
L1: sees all memory accesses, both hits and misses
L2: only sees misses from L1
L3: only sees misses from L2
A more complete access pattern (L1) can give better accuracy and coverage, but because more requests are examined, it is also more bandwidth-intensive.
How is prefetching performed in hardware?
Have hardware that monitors processor accesses.
It memorizes these accesses and finds patterns/strides within them.
It then generates prefetch addresses automatically.
How is prefetching performed in software?
The ISA provides prefetch instructions that can be inserted into programs by the programmer or the compiler.
This works well only for regular access patterns.
What is execution-based prefetching?
Have a thread that runs ahead of the program and, for example, only performs its load instructions.
The thread can be generated by SW/the programmer or by HW.
How does an adjacent prefetcher work?
HW prefetcher.
Prefetches the block adjacent to the block being requested.
Installs the data in the cache.
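A minimal software model of the idea, assuming 64-byte blocks (issue_prefetch is a hypothetical memory-system hook):

#define BLOCK_SIZE 64UL

void issue_prefetch(unsigned long addr);  // hypothetical hook

// On a demand access to addr, also request the adjacent (next) block.
void on_access(unsigned long addr) {
    unsigned long block = addr / BLOCK_SIZE;
    issue_prefetch((block + 1) * BLOCK_SIZE);
}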
How does a strided prefetcher work?
HW prefetcher.
Searches for strided access patterns (current block + n).
Triggered on successive cache misses.
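A simplified single-stream model (real hardware typically keeps a table of streams, often indexed by PC; issue_prefetch is the same hypothetical hook as above):

void issue_prefetch(unsigned long addr);  // hypothetical hook

static unsigned long last_addr;   // address of the previous miss
static long last_stride;          // stride between the two previous misses

// Called on each miss: once the same stride is seen twice in a row,
// speculatively request the current address + stride.
void on_miss(unsigned long addr) {
    long stride = (long)(addr - last_addr);
    if (stride != 0 && stride == last_stride)
        issue_prefetch(addr + stride);
    last_stride = stride;
    last_addr = addr;
}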
What are some issues with SW prefetching?
Instruction overhead: the prefetch instructions themselves consume fetch and execute slots
Timeliness: difficult to ensure that prefetched data arrives in the cache when needed
Give an example of how prefetching can be implemented in SW
for (…) {
    prefetch(&a[i + 1]);    // request the next iteration's elements ahead of use
    prefetch(&b[i + 1]);
    sum = sum + a[i] * b[i];
}
How does the Intel L2 hardware prefetcher work?
Detects strides and prefetches along them.
How does the Intel L2 adjacent prefetcher work?
Fetches blocks in 128-byte pairs: a request for one 64-byte block also fetches its pair.
How does the Intel DCU prefetcher work?
Fetches the adjacent block into the L1-D cache.
How does the Intel DCU IP prefetcher work?
Per-instruction-pointer stride detection: strides are tracked separately for each load instruction.
What are some software prefetch instructions implemented by Intel?
PREFETCHT0: into all cache levels
PREFETCHT1: into all levels except L1
PREFETCHT2: into all levels except L1 and L2
PREFETCHNTA: non-temporal data with respect to all cache levels
T0, T1, T2: indicate which levels to prefetch into
Excluding the smaller, higher levels keeps prefetched data out of them and avoids evicting important demand data there.
Non-temporal data: data that won't be used again. This makes it easy for the replacement policy to know that the data should be evicted.
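In C these are exposed through the _mm_prefetch intrinsic (x86, <xmmintrin.h>) with the hints _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2 and _MM_HINT_NTA. A sketch streaming through an array; the distance of 16 elements is a hypothetical tuning choice:

#include <xmmintrin.h>

void scale(const float *a, float *out, int n) {
    for (int i = 0; i < n; i++) {
        // a[] is streamed once, so hint NTA to minimize cache pollution;
        // prefetching slightly past the end is harmless (no faults).
        _mm_prefetch((const char *)&a[i + 16], _MM_HINT_NTA);
        out[i] = 2.0f * a[i];
    }
}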
What are some metrics used to evaluate prefetching?
Coverage
Accuracy
Timeliness
What is the formula for Coverage?
Coverage = Eliminated_misses / total_misses_without_prefetching
Eliminated misses: Total misses without prefetching, minus the total misses with prefetching
What is the formula for accuracy?
Accuracy = Eliminated_misses / total_number_of_prefetches
Eliminated misses: Total misses without prefetching, minus the total misses with prefetching
What is the formula for timeliness?
Timeliness = On-time_prefetches / Useful_prefetches
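A hypothetical worked example: 1000 misses without prefetching, 400 misses with prefetching, 800 prefetches issued, 500 of the useful ones arriving on time:
Eliminated misses = 1000 - 400 = 600
Coverage = 600 / 1000 = 0.60
Accuracy = 600 / 800 = 0.75
Timeliness = 500 / 600 ≈ 0.83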