Caches Flashcards
The memory wall
processor speed is improving at a faster rate than memory speed, so memory access is an increasingly dominant cost
temporal locality
recently referenced data is likely to be referenced again soon. Reactive
spatial locality
more likely to reference data near recently referenced data. Proactive
Are temporal and spatial locality used for both data and instructions?
Yes
How to find average memory access time
latency_avg = latency_hit + %_miss*latency_miss
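Worked example (numbers assumed for illustration): with latency_hit = 1ns, %_miss = 5%, and latency_miss = 20ns, latency_avg = 1 + 0.05 × 20 = 2ns.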
Primary caches
split into instruction (I$) and data (D$) caches; on chip (with CPU), made of SRAM (same circuit type as CPU)
2nd level caches
on chip (with CPU), made of SRAM; unified (holds both instructions and data)
How large are primary caches?
8KB to 64KB
How large are second level caches?
typically 512KB to 16MB
4th level cache = main memory
Made of DRAM
How large is fourth level cache?
1GB to 4GB for desktops; servers can have much more
5th level cache
disk/SSD (swap and files)
How much of a processor's area is cache?
30-70%
Static RAM (SRAM)
6 transistors per bit
optimized for speed first, density second
fast (sub-nanosecond latency for small SRAM)
speed proportional to area
integrates well with standard processor logic
Dynamic RAM (DRAM)
1 transistor + 1 capacitor per bit
optimized for density
slow (>40ns internal access, ~100ns pin-to-pin)
requires different fabrication steps (doesn't integrate as well with processor logic)
Nonvolatile storage
magnetic disk, flash, STT, Re-RAM, PCM
Cache Lookup Algorithm
Read frame indicated by index bits
“Hit” if tag matches and valid bit is set, otherwise miss
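A minimal C sketch of this lookup for a direct-mapped cache (the frame count, block size, and struct layout are assumptions for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_FRAMES 256                 /* assumed: 256 frames     */
    #define BLOCK_SIZE 64                  /* assumed: 64-byte blocks */

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    } frame_t;

    frame_t cache[NUM_FRAMES];

    /* Index bits pick one frame; hit iff the tag matches and valid is set. */
    bool lookup(uint32_t addr) {
        uint32_t index = (addr / BLOCK_SIZE) % NUM_FRAMES;  /* index bits */
        uint32_t tag   = addr / (BLOCK_SIZE * NUM_FRAMES);  /* tag bits   */
        frame_t *f = &cache[index];
        return f->valid && f->tag == tag;
    }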
Fill path also called what?
backside
Cache controller
finite state machine - remembers miss address, accesses next level, waits for response, writes data and tag in proper locations
%miss (miss rate)
# misses / # accesses
t_hit (hit time)
time to read data from (write data to) cache
t_miss (miss penalty)
time to read data into cache
Average access time: t_avg
t_hit + %miss * t_miss
what roughly determines t_hit
cache capacity and circuits
what roughly determines t_miss
lower level memory structures
How to measure %_miss?
hardware performance counters, simulation, paper simulation
how to find offset
log_2(block size)
how to find index
log_2(number of sets)
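A small C sketch of splitting an address into offset, index, and tag bits (the 64-byte block / 128-set geometry is an assumption):

    #include <stdint.h>

    #define OFFSET_BITS 6   /* log_2(block size)     = log_2(64)  */
    #define INDEX_BITS  7   /* log_2(number of sets) = log_2(128) */

    uint32_t addr_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
    uint32_t addr_index(uint32_t addr)  { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    uint32_t addr_tag(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }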
How to reduce %miss?
increase capacity
increase block size
What happens if you increase cache capacity?
reduce % miss, but t_hit increases
What is t_hit latency proportional to?
sqrt(capacity)
What are the advantages of increasing block size?
reduce %miss
reduce tag overhead
What are the disadvantages of increasing block size?
potentially useless data transfer
premature replacement of useful data
For same size cache, will increasing the block size increase or reduce the tag overhead?
Increasing the block size will reduce the tag overhead
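Worked example (geometry assumed): a 32KB direct-mapped cache with 32-bit addresses. With 32B blocks there are 1024 frames, so tag = 32 − 10 (index) − 5 (offset) = 17 bits, for 1024 × 17 = 17,408 tag bits total. With 64B blocks there are 512 frames, tag = 32 − 9 − 6 = 17 bits, for 512 × 17 = 8,704 tag bits: half the overhead for the same capacity.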
Effects of block size on miss rate
spatial prefetching
interference
Spatial prefetching
good: for blocks with adjacent addresses, turns miss/miss into miss/hit
Interference
bad: for blocks with non-adjacent addresses but adjacent frames; turns hits into misses by disallowing simultaneous residence
What offsets the time to read/transfer/fill a larger block?
critical word first/early restart
Can critical word first/early restart help with a cluster of misses?
No. Reads/transfers/fills of two misses can’t happen simultaneously
Name for a frame group
set
Each frame in set
way
Pros and cons of increasing set associativity
pro: reduces conflicts
con: increases t_hit (additional tag match and muxing)
Lookup algorithm for multi-way set associative cache
Use index bits to find the set
read data/tags in all frames in that set in parallel
if match and valid bit, hit
NMRU and miss handling
Add MRU bits to each set, hit will update MRU, miss will replace any way but MRU
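A C sketch of set-associative lookup with NMRU replacement (the way count and the rand()-based victim choice are assumptions; hardware checks all ways in parallel rather than looping):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define NUM_WAYS 4                     /* assumed 4-way associativity */

    typedef struct {
        bool     valid;
        uint32_t tag;
    } way_t;

    typedef struct {
        way_t ways[NUM_WAYS];
        int   mru;                         /* most recently used way */
    } set_t;

    /* Hit iff some way's tag matches and its valid bit is set; a hit
       updates the MRU bits. */
    bool lookup(set_t *set, uint32_t tag) {
        for (int w = 0; w < NUM_WAYS; w++) {
            if (set->ways[w].valid && set->ways[w].tag == tag) {
                set->mru = w;
                return true;
            }
        }
        return false;
    }

    /* On a miss, NMRU replaces any way except the MRU one. */
    int nmru_victim(const set_t *set) {
        int victim;
        do { victim = rand() % NUM_WAYS; } while (victim == set->mru);
        return victim;
    }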
Can split data and tags into two different arrays, so can access in parallel. Why are multi-way associative caches still slower than direct mapped caches?
Still more logic in the critical path than a direct-mapped cache (an additional multiplexor), so t_hit is slower
Pros and cons of higher associative caches
Pro: have better (lower) % miss
Con: T_hit increases - the more associative, the slower
Why are instruction caches smaller/simpler?
they never have to handle writes/stores (instructions are read-only)
Why are writes slower than reads?
For reads, can read tag and data in parallel
Stages of write pipeline
1) match tag
2) write to matching way
bypass to avoid load stalls, may introduce structural hazards
Two options for when to propagate new value to lower level memory
1) write through
2) write back
Write Through
on hit, update cache
immediately send the write to the next level
Write Back
write to lower level when block is replaced
requires an extra dirty bit per block
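A C sketch contrasting the two policies on a store hit (the frame layout and the write_next_level stand-in are assumptions):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 64

    typedef struct {
        bool     valid, dirty;   /* dirty bit only needed for write-back */
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    } frame_t;

    /* Illustrative stand-in for the next level of the hierarchy. */
    static void write_next_level(uint32_t addr, uint8_t v) {
        printf("write 0x%02x to next level at 0x%08x\n", v, addr);
    }

    /* Write-through: update the cache and immediately send the write down. */
    void store_write_through(frame_t *f, uint32_t addr, uint8_t v) {
        f->data[addr % BLOCK_SIZE] = v;
        write_next_level(addr, v);
    }

    /* Write-back: update the cache and set the dirty bit; the write reaches
       the next level only when the block is replaced. */
    void store_write_back(frame_t *f, uint32_t addr, uint8_t v) {
        f->data[addr % BLOCK_SIZE] = v;
        f->dirty = true;
    }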
Writeback buffer (WBB)
keeps writes off the critical path
1) send fill request to next level
2) while waiting, write dirty block to buffer
3) when new block arrives, put it into cache
4) write buffer sends contents to next level
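The same four steps as a C sketch (all helper names are illustrative stand-ins, not a real interface):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static void request_fill(uint32_t addr)  { printf("fill request 0x%x\n", addr); }
    static void wbb_insert(uint32_t addr)    { printf("dirty block 0x%x -> WBB\n", addr); }
    static void install_block(uint32_t addr) { printf("install 0x%x in cache\n", addr); }
    static void wbb_drain(void)              { printf("WBB contents -> next level\n"); }

    /* Miss handling with a WBB: the dirty victim is parked in the buffer
       so the writeback stays off the fill's critical path. */
    void handle_miss(uint32_t miss_addr, uint32_t victim_addr, bool victim_dirty) {
        request_fill(miss_addr);        /* 1) send fill request to next level    */
        if (victim_dirty)
            wbb_insert(victim_addr);    /* 2) while waiting, dirty block -> WBB  */
        install_block(miss_addr);       /* 3) when new block arrives, fill cache */
        wbb_drain();                    /* 4) WBB sends contents to next level   */
    }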
Disadvantages of write through
requires additional bus bandwidth
without a write buffer, must wait for writes to complete to memory
Advantages of write through
Easier to implement, no need for dirty bits in cache
Don’t have to deal with coherence traffic at this cache level
Simplifies miss handling (no write back buffer step)
Advantage of Write back
Uses less bandwidth since some writes don’t go to memory (also saves power)
Read vs Write Miss
Read miss: load can’t go on w/o data, must stall
Write miss: no instruction waiting for data, so don’t need to stall
Store buffer
writes to D$ in background
eliminates stalls on write misses
loads must search store buffer in addition to D$
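A C sketch of the load-side check (the buffer depth, entry layout, and newest-first search order are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define SB_ENTRIES 8                   /* assumed store buffer depth */

    typedef struct {
        bool     valid;
        uint32_t addr;
        uint32_t value;
    } sb_entry_t;

    sb_entry_t store_buffer[SB_ENTRIES];   /* highest index = newest entry */

    /* A load must search the store buffer in addition to the D$, since an
       older store to the same address may not have reached the D$ yet. */
    bool load_check_store_buffer(uint32_t addr, uint32_t *value) {
        for (int i = SB_ENTRIES - 1; i >= 0; i--) {
            if (store_buffer[i].valid && store_buffer[i].addr == addr) {
                *value = store_buffer[i].value;   /* forward store data  */
                return true;
            }
        }
        return false;                             /* fall back to the D$ */
    }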
Store vs. writeback buffer
store buffer: in front of D$, hides store misses
writeback buffer: behind D$, hides write backs
Write-allocate is used with what type of write (write back or write through)?
write back
Write-allocate
when a write miss occurs, allocate a frame in the cache for the miss data
Advantage of write allocate
decreases read misses
No write allocate
when a write miss occurs, just write to next level, no need to allocate a cache frame for the miss data
Pros/cons of no-write-allocate
potentially more read misses, but doesn’t use a frame in the cache
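A C sketch contrasting the two write-miss policies (all helper names are illustrative stand-ins):

    #include <stdint.h>
    #include <stdio.h>

    static void allocate_and_fill(uint32_t addr)          { printf("fill frame for 0x%x\n", addr); }
    static void write_into_cache(uint32_t addr, uint8_t v) { printf("D$[0x%x] = %u\n", addr, v); }
    static void write_next_level(uint32_t addr, uint8_t v) { printf("next level 0x%x = %u\n", addr, v); }

    /* Write-allocate: a write miss first brings the block into the cache. */
    void write_miss_allocate(uint32_t addr, uint8_t v) {
        allocate_and_fill(addr);      /* allocate a frame, fetch the block */
        write_into_cache(addr, v);    /* then perform the write locally    */
    }

    /* No-write-allocate: a write miss just writes around the cache. */
    void write_miss_no_allocate(uint32_t addr, uint8_t v) {
        write_next_level(addr, v);    /* no frame is allocated             */
    }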
4 types of cache miss
compulsory, capacity, conflict, coherence
compulsory cache miss
never seen this address before, would miss in infinite cache
capacity cache miss
miss caused because cache is too small (would miss in fully associative cache)
conflict cache miss
miss caused because cache associativity is too low
coherence cache miss
miss due to external invalidations in shared memory multiprocessors and multicores
How does larger block size affect the 3 C's and hit rate?
decreases compulsory misses (spatial locality)
increases conflict misses (fewer frames)
can increase t_miss - reading more bytes from next level
no significant effect on t_hit
How does larger cache affect the 3 C's and hit rate?
decreases capacity miss
increases t_hit
How does higher associativity affect the 3 C's and hit rate?
decreases conflict misses
increases t_hit
local hit/miss rate
percent of references to this cache that hit: # hits / total accesses to this cache
local miss rate = # misses / total accesses to this cache = 100% − local hit rate
global hit/miss rate
# misses / total # of memory references
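Worked example (numbers assumed): if a program makes 1000 memory references, 100 miss in the L1, and 20 of those also miss in the L2, the L2's local miss rate is 20/100 = 20%, while its global miss rate is 20/1000 = 2%.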
inclusive caches
a block in the L1 is always in the L2
works well with a write-through L1
coherence traffic only needs to check L2
exclusive caches
block is either in L1 or L2 (never both)
holds more data
coherence traffic must check both L1 and L2
Give reads priority over writes
read must check contents of the WBB since it could hold the read value
reduces write costs in a writeback cache: if a read miss will replace a dirty block, write the dirty block to the WBB, read memory, then write the WBB to memory
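A C sketch of that read-miss sequence (the WBB interface here is an illustrative stand-in):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static bool wbb_lookup(uint32_t addr, uint32_t *v) { (void)addr; (void)v; return false; }
    static void wbb_insert(uint32_t addr)       { printf("dirty 0x%x -> WBB\n", addr); }
    static void wbb_drain(void)                 { printf("WBB -> memory\n"); }
    static uint32_t read_next_level(uint32_t a) { printf("read 0x%x\n", a); return 0; }

    /* Read miss replacing a dirty block, with the read given priority:
       search the WBB first (it may still hold the value), park the dirty
       victim in the WBB, read memory, and only then let the WBB write. */
    uint32_t read_miss(uint32_t addr, uint32_t victim_addr, bool victim_dirty) {
        uint32_t v;
        if (wbb_lookup(addr, &v))
            return v;                   /* value found in the WBB itself     */
        if (victim_dirty)
            wbb_insert(victim_addr);    /* 1) dirty block -> WBB, not memory */
        v = read_next_level(addr);      /* 2) service the read first         */
        wbb_drain();                    /* 3) then the WBB writes to memory  */
        return v;
    }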