Lectures 9 & 10 - The Memory Hierarchy Flashcards
Which caches are separate?
L1 data and L1 instruction caches. This allows access to data memory and instruction memory in the same cycle. Characteristics can also then be optimised for contents.
Which caches are unified?
L2/L3 caches are typically unified. Allows the proportion of memory dedicated to instructions and data to be dynamically adjusted depending on program requirements.
Blocks or Cache Lines
Blocks help us exploit spatial locality of data. On a cache miss we must load a minimum of one block of data from main memory into the cache.
Block Size - Optimal
Larger block sizes allow us to better exploit spatial locality, which reduces the miss rate.
However, increasing the block size also decreases the number of different blocks that can be stored, so a block size that is too large increases the miss rate again (and raises the miss penalty).
Direct Mapped Cache
Each block has only one place it can be stored in the cache. Low access time, but may suffer from many collisions, meaning cache lines may be repeatedly evicted despite free entries elsewhere in the cache.
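A minimal sketch of how a direct-mapped cache splits an address into offset, index and tag (assuming a hypothetical 32 KiB cache with 64-byte blocks):

#include <stdint.h>

/* Hypothetical parameters: 32 KiB direct-mapped cache, 64-byte blocks. */
#define BLOCK_SIZE 64                          /* bytes per block */
#define NUM_BLOCKS (32 * 1024 / BLOCK_SIZE)    /* 512 blocks      */

static inline uint32_t block_offset(uint32_t addr) { return addr % BLOCK_SIZE; }
static inline uint32_t cache_index(uint32_t addr)  { return (addr / BLOCK_SIZE) % NUM_BLOCKS; }
static inline uint32_t cache_tag(uint32_t addr)    { return addr / (BLOCK_SIZE * NUM_BLOCKS); }

Because the index comes straight from the address bits, two frequently used blocks that share the same index bits will keep evicting each other.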
Set Associative Cache
Each block can be stored in any one of n different places (ways) within a set of the cache.
Fully associative cache (using CAMs; also known as a CAM-tag or CAM-RAM cache)
A block can be placed anywhere in the cache. Highly associative caches are often seen in embedded systems where energy consumption is a major concern: by reducing miss rates they avoid the high cost of accessing main memory.
Block Replacement Policies
Use an invalid line in the set if one exists. Otherwise use an LRU scheme, or approximate LRU with FIFO or a not-last-used policy.
Write-Allocate
Allocate block in cache and then write
No-write allocate
Just write to lower-level memory
Write through
Write to both cache and lower level of memory when we perform a write.
Write Back
Only write dirty cache blocks to lower level of memory when they are replaced.
Average Memory Access Time
Hit Time + Miss Rate x Miss Penalty
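For example, with illustrative numbers of a 1-cycle hit time, a 5% miss rate and a 20-cycle miss penalty: AMAT = 1 + 0.05 x 20 = 2 cycles.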
Compulsory Cache Miss
Generated when a block is brought into the cache for the first time.
Capacity
Capacity misses are produced when the number of blocks required by a program exceeds the capacity of the cache.
Conflict
Misses caused by many blocks mapping to the same set.
Write Buffer
Hides the lower throughput of the underlying memory system; particularly useful for a write-through cache. It allows loads to bypass buffered stores.
L1 Cache
Optimised for fast access times as cache hits will be very common. They tend to be small (fast) with lower associativity.
L2/L3 Cache
Optimised for a low miss rate to minimise off-chip/DRAM accesses. They are larger with higher associativity.
Multi-Level Inclusion
Include the contents of L1 in L2, etc. Desirable as consistency between I/O and the caches can be determined by just checking the L2.
Drawbacks:
The levels may want different block sizes - inclusion is still possible.
If L2 is not much larger than L1, exclusion may be the better policy.
Not required for instruction caches.
Critical Word First
The processor often only requires a single word from a cache block, so read the required word first and allow the processor to continue execution while the rest of the block is fetched.
Sub-block placement
Divides each cache block into sub-blocks, each with its own valid bit, so only the required sub-block need be transferred on a miss. Allows the size of the tag RAM to be kept small without incurring the large miss penalty that bigger blocks would cause.
Victim Cache
Small fully associative cache used in conjunction with a direct-mapped cache. Holds lines displaced from the L1 cache. The processor checks both L1 and the victim cache; if the victim cache hits, the data is swapped with the corresponding L1 line.
Aims to reduce conflict misses (20-40% of direct-mapped cache misses).
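A rough software model of the lookup-and-swap behaviour (a sketch only; lookup(), the block addressing and the sizes are illustrative, and in hardware the L1 and victim cache are probed in parallel):

#include <stdbool.h>
#include <stdint.h>

#define L1_LINES 256   /* hypothetical direct-mapped L1        */
#define VC_LINES 4     /* small fully associative victim cache */

typedef struct { bool valid; uint32_t blk; } Line;   /* blk = block address */

Line l1[L1_LINES];
Line vc[VC_LINES];

bool lookup(uint32_t blk) {
    uint32_t index = blk % L1_LINES;
    if (l1[index].valid && l1[index].blk == blk)
        return true;                          /* L1 hit */
    for (int i = 0; i < VC_LINES; i++) {      /* search every victim-cache entry */
        if (vc[i].valid && vc[i].blk == blk) {
            Line tmp = l1[index];             /* swap: the hit line is promoted to L1, */
            l1[index] = vc[i];                /* the displaced L1 line takes its place  */
            vc[i] = tmp;                      /* in the victim cache                    */
            return true;                      /* victim-cache hit */
        }
    }
    return false;                             /* miss in both: fetch from the next level */
}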
Assist Cache
Small fully-associative cache used to assist direct-mapped caches.
Data enters the assist cache when L1 misses; lines are moved out of the assist cache in FIFO order.
Helps eliminate the thrashing behaviour associated with direct-mapped caches.
Non-blocking caches
If out-of-order execution is permitted then we could continue to execute even while waiting for a cache miss to be serviced. This can effectively reduce the load miss penalty. To implement it we need to keep track of outstanding memory requests by adding special registers to store this information: miss status holding registers (MSHRs).
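One way to picture what each miss status holding register might record (a hypothetical sketch, not any particular processor's design):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical MSHR entry: one per outstanding cache miss. */
typedef struct {
    bool     valid;        /* entry in use                                */
    uint32_t block_addr;   /* which block is being fetched                */
    uint8_t  dest_reg;     /* register waiting for the loaded value       */
    uint8_t  word_offset;  /* which word within the block the load wanted */
} MSHR;

#define NUM_MSHRS 8        /* number of misses the cache can track at once */
MSHR mshr[NUM_MSHRS];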
Programming for caches - simple optimisations
Ensure the working set is small enough to fit in the cache.
Reposition program in memory to optimise instruction cache performance
Loop Interchange
Improve spatial locality by favouring sequential (unit-stride) accesses over non-unit strides.
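A typical C illustration, assuming row-major arrays (the array name and size are made up):

#define N 1024
double a[N][N];

void zero_bad(void) {
    /* Before: the inner loop strides N doubles at a time (poor spatial locality). */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = 0.0;
}

void zero_good(void) {
    /* After interchange: the inner loop is unit-stride, so whole cache lines are used. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 0.0;
}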
Loop fusion
Fuse together loops that access the same data to improve temporal locality.
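A small illustrative example (array names and sizes are made up):

#define N 100000
double a[N], b[N], c[N];

void unfused(void) {
    for (int i = 0; i < N; i++) b[i] = a[i] * 2.0;  /* a[] streamed through the cache once... */
    for (int i = 0; i < N; i++) c[i] = a[i] + 1.0;  /* ...and then fetched again from memory  */
}

void fused(void) {
    /* Fused: each a[i] is reused while it is still in the cache. */
    for (int i = 0; i < N; i++) {
        b[i] = a[i] * 2.0;
        c[i] = a[i] + 1.0;
    }
}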
Array Merging
Guarantee spatial locality by reorganising separate arrays into a single array of structures. Works well if the data items are related and accessed together.
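For instance (hypothetical names), two parallel arrays can be merged into one array of structures:

#define N 10000

/* Separate arrays: key[i] and value[i] may live in different cache lines. */
int key[N];
int value[N];

/* Merged: each key is adjacent to its value, so accessing both touches one line. */
struct record { int key; int value; };
struct record records[N];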
Array Padding
Shifts the position of arrays in memory to reduce problematic conflict misses.
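A sketch of intra-array padding (the pad amount is illustrative; the right value depends on the cache geometry):

#define N   1024   /* power-of-two size: rows of a and b can map to the same sets */
#define PAD 8      /* hypothetical padding to break the power-of-two stride       */

double a[N][N + PAD];   /* the padding shifts each row's starting address, */
double b[N][N + PAD];   /* so a[i][j] and b[i][j] no longer conflict       */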
Cache Blocking
Organise computation so we access sub-matrices that fit into the cache.
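A common illustration is blocked (tiled) matrix multiply; the tile size B below is a placeholder that should be tuned so the working tiles fit in the cache:

#define N 512
#define B 32   /* hypothetical tile size: three BxB tiles should fit in the cache */

double a[N][N], b[N][N], c[N][N];

void matmul_blocked(void) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                /* Work on one BxB sub-matrix of c at a time. */
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double sum = c[i][j];
                        for (int k = kk; k < kk + B; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
}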
Sequential prefetcher
One Block Lookahead
On a cache miss: fetch the required block and prefetch the next sequential block into a prefetch buffer; on a subsequent miss, check this buffer first.
Tagged sequential prefetcher
Adds a tag bit per cache line which is set when a block is prefetched into the cache. If the bit is set on a cache hit then the next cache line is prefetched.
Strided prefetch
Detect regular access patterns and prefetch data accordingly.
Reference Prediction Table
Indexed by the address of the load instruction. It stores the last address referenced by the load, which is used to calculate and store the delta between subsequent references. If the stride between 3 such misses is constant then a strided access pattern is assumed and the delta is used to prefetch cache blocks.
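A rough software model of one table entry and its update rule (a sketch; prefetch_block() is a hypothetical hook into the cache, and the confidence threshold is illustrative):

#include <stdint.h>

typedef struct {
    uint32_t pc;          /* address of the load instruction (table index/tag) */
    uint32_t last_addr;   /* last data address this load referenced            */
    int32_t  stride;      /* delta between the last two references             */
    int      confidence;  /* how many times the same stride has repeated       */
} RPTEntry;

void prefetch_block(uint32_t addr);   /* hypothetical hook into the cache */

void rpt_update(RPTEntry *e, uint32_t addr) {
    int32_t new_stride = (int32_t)(addr - e->last_addr);
    if (new_stride == e->stride)
        e->confidence++;              /* same delta again: pattern looks strided */
    else {
        e->stride = new_stride;       /* new delta: start over                   */
        e->confidence = 0;
    }
    e->last_addr = addr;
    if (e->confidence >= 2)           /* stride constant across ~3 references    */
        prefetch_block(addr + e->stride);
}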
Helper Thread
A microthread which runs in parallel with the main thread to initiate memory requests ahead of time. It can implement a complex prefetch algorithm based on the addresses of loads that miss.