Compiler techniques and prefetching Flashcards
What are some good and bad features of compilers?
Good: Has plenty of time to decide what to do, since the analysis happens offline at compile time
Bad: Has less knowledge about the program's runtime behaviour than the hardware does
When illustrating a pipeline, how can you visualize that some functional units take more than one cycle (e.g. FP)?
Draw a loop from the end of the FU back to its beginning, showing that an instruction stays in that unit for several cycles
What does compiler optimization aim to fix?
Reduce the effect of stalls (data hazards, long-latency FP operations, etc.)
What can the compiler do to reduce the amount of stalls?
Static scheduling
Loop unrolling
What is static scheduling?
- Reorder independent instructions into the places where the program would otherwise stall
- If an otherwise independent instruction updates a register that a nearby load or store uses as its base (address = register + offset), the instruction can still be moved past that memory access as long as the offset is adjusted to match the new register value, so the effective address is preserved (see the sketch below)
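A minimal sketch in C of the same idea (a hypothetical add-a-constant loop, not from the card; a real compiler performs this at the instruction level). The pointer update is independent of the add, so it is moved between the load and its use to hide the load latency, and the store offset is adjusted so the store still writes the element that was loaded:

/* Sketch: the pointer update fills the slot between the load and its use;
   the store offset (p[-1]) is adjusted so the effective address is unchanged. */
void scale(double *x, int n, double c) {
    double *p = x;
    double *end = x + n;
    while (p != end) {
        double v = p[0];    /* load                                     */
        p = p + 1;          /* independent update, fills the load delay */
        double r = v + c;   /* use of v, one slot later                 */
        p[-1] = r;          /* offset adjusted: same address as before  */
    }
}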
What is loop unrolling?
Lay out several copies of the loop body after each other.
If an operation is common to all iterations, e.g. ADD R1, R1, #-8, and 4 copies of the body are laid out, that ADD can be removed from the individual copies and placed once at the bottom with an updated immediate value:
ADD R1, R1, #-32
Instead of branching back to the loop label after every iteration, place the branch instruction once, after the copies that have been laid out.
With 4 copies of the body we only branch once every 4 iterations instead of after every iteration (see the sketch below).
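A minimal sketch in C (hypothetical add-a-constant loop, assuming the iteration count n is a multiple of 4). The single index update and the single branch are shared by the four copies of the body, the C-level counterpart of replacing four ADD #-8 instructions with one ADD #-32:

/* Sketch: four copies of the body, one update and one branch per 4 iterations. */
void scale4(double *x, int n, double c) {
    for (int i = 0; i < n; i += 4) {  /* one index update + one branch per 4 iterations */
        x[i]     += c;
        x[i + 1] += c;
        x[i + 2] += c;
        x[i + 3] += c;
    }
}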
How can static scheduling be used with loop unrolling?
When the copies of the body have been laid out, there are more instructions available to reorder.
For example, if every iteration contains a LOAD and an ADD that depends on it, the loads from all copies can be placed directly after each other first, followed by the dependent ADDs. Because each ADD is now far enough from its load, the pipeline does not need to stall (see the sketch below).
(Might need to use register renaming to avoid hazards)
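A minimal sketch in C (same hypothetical loop as above, n a multiple of 4). The separate temporaries v0..v3 play the role of register renaming:

/* Sketch: after unrolling, all four independent loads are issued first and the
   dependent adds afterwards, so every add is far enough from its load. */
void scale4_sched(double *x, int n, double c) {
    for (int i = 0; i < n; i += 4) {
        double v0 = x[i];        /* group the loads...         */
        double v1 = x[i + 1];
        double v2 = x[i + 2];
        double v3 = x[i + 3];
        x[i]     = v0 + c;       /* ...then the dependent adds */
        x[i + 1] = v1 + c;
        x[i + 2] = v2 + c;
        x[i + 3] = v3 + c;
    }
}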
How can loop unrolling be done for dynamic loops?
Dynamic loops are loops whose iteration count is not known at compile time; it can for example depend on user input.
Do the unrolling in 2 stages (n: number of iterations, k: number of copies of the body):
- First execute the body of the original loop (n mod k) times (the leftover iterations that do not divide evenly by k and therefore do not fit the unrolled loop)
- Then execute the unrolled body (n / k) times
In most cases n is large, so most of the execution time is spent in the unrolled loop (see the sketch below).
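A minimal sketch in C with k = 4 (same hypothetical loop as above, n only known at run time):

/* Sketch: the first loop runs the original body (n mod 4) times;
   the second, unrolled loop then runs n/4 times. */
void scale_dynamic(double *x, int n, double c) {
    int i = 0;
    for (; i < n % 4; i++)       /* leftover iterations: original body */
        x[i] += c;
    for (; i < n; i += 4) {      /* unrolled body, executed n/4 times  */
        x[i]     += c;
        x[i + 1] += c;
        x[i + 2] += c;
        x[i + 3] += c;
    }
}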
What are some cons of loop unrolling? (3)
Growth in code size: needs more instruction memory (e.g. flash)
For large loops: multiple copies of the same instructions take up space in the L1 instruction cache, which can increase the instruction cache miss rate
Register pressure: requires many more registers; if there are not enough registers to hold all live values, we may lose the advantage of loop unrolling
What are the 3 types of cache misses?
Compulsory: first-reference miss; occurs even with an infinite cache
Capacity: the cache cannot hold all the blocks the program needs; would not happen in an infinite cache
Conflict: caused by restrictions on where a block can be placed; does not occur in a fully associative cache
What is prefetching?
Speculating on which memory accesses will happen in the future.
Bringing the speculated data from lower levels of the memory hierarchy into the caches before it is needed.
What types of cache misses does prefetching aim to fix?
Compulsory misses
What are some issues that prefetching needs to deal with?
Prediction: need good predictions of future accesses
Timeliness: data must be available at the right time, neither too late nor so early that it is evicted before use
Resources: avoid cache and bandwidth pollution; too much prefetching takes resources away from demand accesses
What are some questions we need to consider when doing prefetching?
What addresses to fetch?
When to initiate a prefetch request?
Where in the memory hierarchy to place the prefetched data?
How should prefetching be done? (SW, HW, execution-based, cooperative)
Why is it important to be careful when deciding what addresses to prefetch?
Prefetching uses resources; if the prefetched data turns out to be useless, those resources are wasted.
(Bandwidth, cache space, prefetch buffer space, energy)
How is the metric Accuracy calculated in prefetching?
Accuracy = Useful_prefetches / total_number_of_prefetches
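For example, if 1000 prefetches are issued and 800 of the prefetched blocks are actually used before being evicted, accuracy = 800 / 1000 = 0.8.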
How do we know what to prefetch?
Predict based on previous access patterns.
Use the compiler's knowledge of the program's data structures and access patterns (useful when doing software prefetching; see the sketch below).
Use a dedicated prefetching algorithm.
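A minimal sketch of software prefetching in C (assumes GCC/Clang, which provide the __builtin_prefetch hint; the loop, the function name and the prefetch distance of 16 elements are made up here and would have to be tuned to the memory latency and the work per iteration):

/* Sketch: while summing x[i], ask for x[i + 16] so that block is
   (hopefully) already in the cache when the loop reaches it. */
void sum_with_prefetch(const double *x, int n, double *out) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&x[i + 16], 0, 3);  /* read hint, high temporal locality */
        s += x[i];
    }
    *out = s;
}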