Lectures 9 & 10 - The Memory Hierarchy Flashcards
Which caches are separate?
L1 data and L1 instruction caches. This allows access to data memory and instruction memory in the same cycle. Characteristics can also then be optimised for contents.
Which caches are unified?
L2/L3 caches are typically unified. Allows the proportion of memory dedicated to instructions and data to be dynamically adjusted depending on program requirements.
Blocks or Cache Lines
Blocks help us exploit spatial locality of data. On a cache miss we must load a minimum of one block of data from main memory into the cache.
Block Size - Optimal
Larger block sizes allow us to better exploit spatial locality, which reduces the miss rate.
However, increasing the block size also decreases the number of different blocks that can be stored, so a block size that is too large increases the miss rate again (and raises the miss penalty).
Direct Mapped Cache
Each block has only one place it can be stored in the cache. Low access time, but may suffer from many collisions, meaning cache lines may be repeatedly evicted despite free entries elsewhere in the cache.
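A minimal sketch of how a direct-mapped cache splits an address into offset, index and tag (assuming a hypothetical 32 KiB cache with 64-byte blocks):

#include <stdint.h>

/* Hypothetical parameters: 32 KiB direct-mapped cache, 64-byte blocks. */
#define BLOCK_SIZE 64                          /* bytes per block */
#define NUM_BLOCKS (32 * 1024 / BLOCK_SIZE)    /* 512 blocks      */

static inline uint32_t block_offset(uint32_t addr) { return addr % BLOCK_SIZE; }
static inline uint32_t cache_index(uint32_t addr)  { return (addr / BLOCK_SIZE) % NUM_BLOCKS; }
static inline uint32_t cache_tag(uint32_t addr)    { return addr / (BLOCK_SIZE * NUM_BLOCKS); }

Because the index comes straight from the address bits, two frequently used blocks that share the same index bits will keep evicting each other.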
Set Associative Cache
Each block can be stored in any one of n different places (ways) within a set of the cache.
Fully associative cache (using CAMs; also known as a CAM-tag or CAM-RAM cache)
A block can be placed anywhere in the cache. Highly associative caches are often seen in embedded systems where energy consumption is a major concern: by reducing miss rates they avoid the high cost of accessing main memory.
Block Replacement Policies
Use an invalid line in the set if one exists. Otherwise use an LRU scheme, or approximate LRU with FIFO or a not-last-used policy.
Write-Allocate
Allocate block in cache and then write
No-write allocate
Just write to lower-level memory
Write through
Write to both cache and lower level of memory when we perform a write.
Write Back
Only write dirty cache blocks to lower level of memory when they are replaced.
Average Memory Access Time
Hit Time + Miss Rate x Miss Penalty
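For example, with illustrative numbers of a 1-cycle hit time, a 5% miss rate and a 20-cycle miss penalty: AMAT = 1 + 0.05 x 20 = 2 cycles.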
Compulsory Cache Miss
Generated when a block is brought into the cache for the first time.
Capacity
Capacity misses are produced when the number of blocks required by a program exceeds the capacity of the cache.
Conflict
Misses caused by many blocks mapping to the same set.
Write Buffer
Hides the lower throughput of the underlying memory system; particularly useful for a write-through cache. It allows loads to bypass buffered stores.
L1 Cache
Optimised for fast access times as cache hits will be very common. They tend to be small (fast) with lower associativity.
L2/L3 Cache
Optimised for a low miss rate to minimise off-chip/DRAM accesses. They are larger with higher associativity.
Multi-Level Inclusion
Include the contents of L1 in L2, etc. Desirable as consistency between I/O and the caches can be determined by just checking the L2.
Drawbacks:
The levels may want different block sizes - inclusion is still possible.
If L2 is not much larger than L1, exclusion may be the better policy.
Not required for instruction caches.
Critical Word First
The processor often only requires a single word from a cache block, so read the required word first and allow the processor to continue execution while the rest of the block is fetched.
Sub-block placement
Divides each cache block into sub-blocks, each with its own valid bit, so only the required sub-block need be transferred on a miss. Allows the size of the tag RAM to be kept small without incurring the large miss penalty that bigger blocks would cause.
Victim Cache
Small fully associative cache used in conjunction with a direct-mapped cache. Holds lines displaced from the L1 cache. The processor checks both L1 and the victim cache; if the victim cache hits, the data is swapped with the corresponding L1 line.
Aims to reduce conflict misses (20-40% of direct-mapped cache misses).
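A rough software model of the lookup-and-swap behaviour (a sketch only; lookup(), the block addressing and the sizes are illustrative, and in hardware the L1 and victim cache are probed in parallel):

#include <stdbool.h>
#include <stdint.h>

#define L1_LINES 256   /* hypothetical direct-mapped L1        */
#define VC_LINES 4     /* small fully associative victim cache */

typedef struct { bool valid; uint32_t blk; } Line;   /* blk = block address */

Line l1[L1_LINES];
Line vc[VC_LINES];

bool lookup(uint32_t blk) {
    uint32_t index = blk % L1_LINES;
    if (l1[index].valid && l1[index].blk == blk)
        return true;                          /* L1 hit */
    for (int i = 0; i < VC_LINES; i++) {      /* search every victim-cache entry */
        if (vc[i].valid && vc[i].blk == blk) {
            Line tmp = l1[index];             /* swap: the hit line is promoted to L1, */
            l1[index] = vc[i];                /* the displaced L1 line takes its place  */
            vc[i] = tmp;                      /* in the victim cache                    */
            return true;                      /* victim-cache hit */
        }
    }
    return false;                             /* miss in both: fetch from the next level */
}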
Assist Cache
Small fully-associative cache used to assist direct-mapped caches.
Data enters the assist cache when L1 misses; lines are moved out of the assist cache in FIFO order.
Helps eliminate the thrashing behaviour associated with direct-mapped caches.
Non-blocking caches
If out-of-order execution is permitted then we could continue to execute even while waiting for a cache miss to be serviced. This can effectively reduce the load miss penalty. To implement it we need to keep track of outstanding memory requests by adding special registers to store this information: miss status holding registers (MSHRs).
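One way to picture what each miss status holding register might record (a hypothetical sketch, not any particular processor's design):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical MSHR entry: one per outstanding cache miss. */
typedef struct {
    bool     valid;        /* entry in use                                */
    uint32_t block_addr;   /* which block is being fetched                */
    uint8_t  dest_reg;     /* register waiting for the loaded value       */
    uint8_t  word_offset;  /* which word within the block the load wanted */
} MSHR;

#define NUM_MSHRS 8        /* number of misses the cache can track at once */
MSHR mshr[NUM_MSHRS];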
Programming for caches - simple optimisations
Ensure the working set is small enough to fit in the cache.
Reposition program in memory to optimise instruction cache performance
Loop Interchange
Improve spatial locality by favouring sequential (unit-stride) accesses over non-unit strides.
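A typical C illustration, assuming row-major arrays (the array name and size are made up):

#define N 1024
double a[N][N];

void zero_bad(void) {
    /* Before: the inner loop strides N doubles at a time (poor spatial locality). */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = 0.0;
}

void zero_good(void) {
    /* After interchange: the inner loop is unit-stride, so whole cache lines are used. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 0.0;
}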
Loop fusion
Fuse together loops that access the same data to improve temporal locality.
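A small illustrative example (array names and sizes are made up):

#define N 100000
double a[N], b[N], c[N];

void unfused(void) {
    for (int i = 0; i < N; i++) b[i] = a[i] * 2.0;  /* a[] streamed through the cache once... */
    for (int i = 0; i < N; i++) c[i] = a[i] + 1.0;  /* ...and then fetched again from memory  */
}

void fused(void) {
    /* Fused: each a[i] is reused while it is still in the cache. */
    for (int i = 0; i < N; i++) {
        b[i] = a[i] * 2.0;
        c[i] = a[i] + 1.0;
    }
}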
Array Merging
Guarantee spatial locality by reorganising separate arrays into a single array of structures. Works well if the data items are related and accessed together.
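For instance (hypothetical names), two parallel arrays can be merged into one array of structures:

#define N 10000

/* Separate arrays: key[i] and value[i] may live in different cache lines. */
int key[N];
int value[N];

/* Merged: each key is adjacent to its value, so accessing both touches one line. */
struct record { int key; int value; };
struct record records[N];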
Array Padding
Shifts the position of arrays in memory to reduce problematic conflict misses.
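A sketch of intra-array padding (the pad amount is illustrative; the right value depends on the cache geometry):

#define N   1024   /* power-of-two size: rows of a and b can map to the same sets */
#define PAD 8      /* hypothetical padding to break the power-of-two stride       */

double a[N][N + PAD];   /* the padding shifts each row's starting address, */
double b[N][N + PAD];   /* so a[i][j] and b[i][j] no longer conflict       */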
Cache Blocking
Organise computation so we access sub-matrices that fit into the cache.
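A common illustration is blocked (tiled) matrix multiply; the tile size B below is a placeholder that should be tuned so the working tiles fit in the cache:

#define N 512
#define B 32   /* hypothetical tile size: three BxB tiles should fit in the cache */

double a[N][N], b[N][N], c[N][N];

void matmul_blocked(void) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                /* Work on one BxB sub-matrix of c at a time. */
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double sum = c[i][j];
                        for (int k = kk; k < kk + B; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
}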
Sequential prefetcher
One Block Lookahead
On a cache miss: fetch the required block and prefetch the next sequential block into a prefetch buffer; on a subsequent miss, check this buffer first.
Tagged sequential prefetcher
Adds a tag bit per cache line which is set when a block is prefetched into the cache. If the bit is set on a cache hit then the next cache line is prefetched.
Strided prefetch
Detect regular access patterns and prefetch data accordingly.
Reference Prediction Table
Indexed by the address of the load instruction. It stores the last address referenced by the load, which is used to calculate and store the delta between subsequent references. If the stride between 3 such misses is constant then a strided access pattern is assumed and the delta is used to prefetch cache blocks.
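A rough software model of one table entry and its update rule (a sketch; prefetch_block() is a hypothetical hook into the cache, and the confidence threshold is illustrative):

#include <stdint.h>

typedef struct {
    uint32_t pc;          /* address of the load instruction (table index/tag) */
    uint32_t last_addr;   /* last data address this load referenced            */
    int32_t  stride;      /* delta between the last two references             */
    int      confidence;  /* how many times the same stride has repeated       */
} RPTEntry;

void prefetch_block(uint32_t addr);   /* hypothetical hook into the cache */

void rpt_update(RPTEntry *e, uint32_t addr) {
    int32_t new_stride = (int32_t)(addr - e->last_addr);
    if (new_stride == e->stride)
        e->confidence++;              /* same delta again: pattern looks strided */
    else {
        e->stride = new_stride;       /* new delta: start over                   */
        e->confidence = 0;
    }
    e->last_addr = addr;
    if (e->confidence >= 2)           /* stride constant across ~3 references    */
        prefetch_block(addr + e->stride);
}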
Helper Thread
A microthread which runs in parallel with the main thread to initiate memory requests ahead of time. It can implement a complex prefetch algorithm based on the addresses of loads that miss.