Memory and Caches Flashcards
DEFINE: SRAM.
Static Random Access Memory. Memory made out of transistors that retains data as long as power is applied, and runs at (or near) the speed of the processor (i.e. clock rate).
DEFINE: DRAM.
Dynamic Random Access Memory. Memory made out of capacitors (i.e. a lot of electrical and chemical engineering) whose charge leaks and must be periodically refreshed. Synonymous with “main memory.”
TRUE/FALSE: DRAM is faster than SRAM (as the name implies) because it runs at processor speed.
FALSE, the opposite is true.
TRUE/FALSE: Reading from DRAM requires reading whole rows (also called pages), and is therefore quite slow.
TRUE.
What are the 5 (common) levels of the memory hierarchy, starting with 1) the processor core?
1) Processor core (registers)
2) L1 Cache
3) L2 Cache
4) L3 Cache
5) DRAM/Main Memory
TRUE/FALSE: As caches get larger, they also get slower.
TRUE.
What is the L1 cache’s goal?
To prevent pipeline bubbles by having “memory” immediately available.
TRUE/FALSE: L2 and L3 caches are good, but pipeline bubbles are inevitable when the L1 cache misses.
TRUE.
DEFINE: Cache hit.
When the L1 cache contains the requested memory, thus preventing pipeline bubbles.
DEFINE: Cache miss.
When the memory has to be pulled from L2, L3, or DRAM, thus creating pipeline bubbles.
DEFINE: Cache block.
The unit of storage and transfer within a cache (also called a cache line); data moves between levels one block at a time.
DEFINE: Temporal Locality.
“If we notice a memory object has been used recently, we think it’ll be used again soon.”
DEFINE: Spatial Locality.
“If a given memory address has been used recently, we think nearby addresses will be used soon.”
What are the two main trends that caching exploits?
Temporal Locality and Spatial Locality.
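Both patterns are easy to see in code. A hypothetical Python sketch (not from the cards; names and sizes are made up for illustration):

```python
data = list(range(1_000_000))

# Spatial locality: a sequential pass touches consecutive addresses,
# so each cache block fetched services many of the following reads.
total = 0
for x in data:
    total += x        # temporal locality: 'total' is reused every iteration

# Poor spatial locality: a large stride touches a new cache block on
# (nearly) every access, wasting the rest of each fetched block.
strided = sum(data[i] for i in range(0, len(data), 4096))
```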
DEFINE: Miss penalty.
The time (or cycle) cost incurred when a cache miss occurs. Tends to be outsized compared to the hit time.
DEFINE: Direct Mapped Cache.
A cache that maps each address to exactly one slot (by taking the address modulo the number of blocks) and stores one cache block in that slot.
DEFINE: Fully Associative Cache.
A cache in which any block may occupy any slot, regardless of address; it simply stores the most recently used blocks. To find a cached item, the entire cache must be searched.
Why aren’t Direct Mapped caches ideal?
Conflict misses become a common issue: whenever two hot addresses modulo to the same slot, they repeatedly evict each other.
Why aren’t Fully Associative caches ideal?
The circuit overhead is immense, and searching through the entire cache is slow. The upside is that eviction occurs only when the cache is at full capacity.
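The conflict-miss problem is easy to demonstrate with a toy simulator (a hypothetical Python sketch, not how real hardware is built):

```python
# Toy direct-mapped cache: address % num_blocks picks the one legal slot.
class DirectMapped:
    def __init__(self, num_blocks):
        self.slots = [None] * num_blocks

    def access(self, addr):
        i = addr % len(self.slots)
        hit = self.slots[i] == addr
        self.slots[i] = addr          # on a miss, evict whatever was there
        return hit

dm = DirectMapped(8)
# 0 and 8 modulo to the same slot, so they evict each other forever:
print([dm.access(a) for a in [0, 8, 0, 8]])  # [False, False, False, False]
```

A fully associative cache avoids this by letting any block sit anywhere, but then every access must compare against every slot, which is exactly the circuit overhead described above.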
What are the three (common) types of cache misses we have to consider?
Compulsory misses: The first access to a block has no choice but to come from DRAM. Unavoidable, not worth profiling.
Capacity misses: When the data was in the cache, but got evicted by another block due to capacity limits.
Conflict/Collision misses: When the data was in the cache, but got evicted by another block occupying the same slot.
In a 2-way set-associative cache capable of holding 256 cache blocks, how many caches are in parallel? How many blocks do those caches modulo over?
2 and 128: two direct-mapped caches (the “ways”) run in parallel, and each covers 256 / 2 = 128 sets, so each modulos over 128.
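As a quick sanity check (a hypothetical snippet using the card’s numbers):

```python
total_blocks = 256   # whole-cache capacity, in blocks
ways = 2             # 2-way set associative: 2 direct-mapped caches in parallel
sets = total_blocks // ways
print(ways, sets)    # 2 ways, each indexed modulo 128 sets
```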
You are designing a CPU with a 37-bit address bus, a 29-stage pipeline, and a Harvard architecture with separate 13-bit instruction and 19-bit data buses. The CPU uses a three-level cache hierarchy where L1 cache has a block size of 37 bytes, L2 cache is fully associative with 23-way set associativity, and L3 cache uses a victim cache policy that dynamically reconfigures its 5-level prefetcher based on the Fibonacci sequence.
If the CPU’s branch predictor mispredicts on prime-numbered cycles and the instruction decoder can only decode instructions with Hamming weights greater than 5, what is the expected CPI (Cycles Per Instruction) when executing a Fibonacci sequence generator written in Brainfuck, assuming each ‘+’ operation results in a TLB miss?
It depends.
DEFINE: Replacement policy.
The strategy a cache uses to decide which of the candidate blocks (the parallel ways) to evict on a miss.
What is the optimal (if impossible) replacement policy?
Evict whichever block will next be used furthest in the future (Bélády’s optimal policy; impossible because it requires knowing the future).
What is a replacement policy that isn’t perfect but works well, thanks to temporal locality?
Least Recently Used (LRU): evict the block whose most recent use is furthest in the past.
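A minimal LRU sketch using Python’s OrderedDict (hypothetical; real hardware approximates LRU with a few status bits per set):

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache with least-recently-used eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # oldest (least recently used) first

    def access(self, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)    # mark as most recently used
            return True                      # hit
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[addr] = True
        return False                         # miss
```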
Caches often employ large blocks for storage. Why is this?
DRAM already delivers a whole row (page) per burst, so fetching a large block costs little extra, and large blocks exploit spatial locality. We can also subdivide blocks as we please.
What is the equation for AMAT?
Average Memory Access Time (AMAT) = Hit Time + Miss Rate × Miss Penalty.
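In code form (a hypothetical helper; units are whatever you feed in, typically ns or cycles):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# e.g. a 1-cycle hit, 5% miss rate, 20-cycle miss penalty (assumed numbers):
print(amat(1, 0.05, 20))   # 2.0 cycles on average
```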
Given a CPU with a CPI of 1 and a clock rate of 4 GHz, if main memory takes 100ns to access, L3 cache takes 25ns to access, L2 cache takes 5ns to access, and L1 cache takes 0ns to access…
If 5% of instructions are L1 misses, 2% are L2 misses, and 0% are L3 misses, what is the effective CPI of this processor’s memory instructions, or AMAT?
4 GHz converts to 0.25ns clock period.
The miss penalties are:
L1 miss: 5ns/0.25ns = 20c
L2 miss: 25ns/0.25ns = 100c
L3 miss: 100ns/0.25ns = 400c
CPI = 1 + 0.05 * 20 + 0.02 * 100 + 0 * 400 = 4 CPI
Given a CPU with a CPI of 1 and a clock rate of 4 GHz, if main memory takes 100ns to access, L3 cache takes 25ns to access, L2 cache takes 5ns to access, and L1 cache takes 0ns to access…
If I got rid of all three caches, what is the effective CPI of this processor’s memory instructions, or AMAT?
4 GHz converts to 0.25ns clock period.
The miss penalties are:
No matter what: 100ns/0.25ns = 400c
CPI = 1 + 1.00 * 400 = 401 CPI
Given a CPU with a CPI of 1 and a clock rate of 4 GHz, if main memory takes 100ns to access, L3 cache takes 25ns to access, L2 cache takes 5ns to access, and L1 cache takes 0ns to access…
If 10% of instructions are L1 misses, but I don’t have an L2 or L3 cache, what is the effective CPI of this processor’s memory instructions, or AMAT?
4 GHz converts to 0.25ns clock period.
The miss penalties are:
L1 miss: 100ns/0.25ns = 400c
CPI = 1 + 0.10 * 400c = 41 CPI
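The three cards above all follow the same recipe; a short script to check them (latencies and miss rates taken straight from the cards, with miss rates as fractions of all instructions):

```python
CLOCK_NS = 1 / 4           # 4 GHz -> 0.25 ns per cycle

def cycles(ns):
    return ns / CLOCK_NS   # convert an access latency into clock cycles

# Card 1: three-level hierarchy.
cpi1 = 1 + 0.05 * cycles(5) + 0.02 * cycles(25) + 0.00 * cycles(100)
# Card 2: no caches, so every access pays the full DRAM latency.
cpi2 = 1 + 1.00 * cycles(100)
# Card 3: L1 only, with a 10% miss rate straight to DRAM.
cpi3 = 1 + 0.10 * cycles(100)
print(cpi1, cpi2, cpi3)    # 4.0 401.0 41.0
```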
DEFINE: Cache coherency.
When reading from an address, we must always get the most recently written value, even if another core performed the write. This is important for multicore systems, where each core has its own caches.
To maintain cache coherency, when a cache line is written by a core, what needs to happen?
All other instances of that cache line need to be invalidated.
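A toy write-invalidate sketch (hypothetical Python; real hardware does this with a coherence protocol such as MESI over a snooping bus or directory):

```python
# Each core keeps a private set of cached addresses.
caches = {core: set() for core in ("core0", "core1", "core2")}

def read(core, addr):
    caches[core].add(addr)       # pull the line into this core's cache

def write(core, addr):
    # Invalidate every other core's copy before writing, so a stale
    # value can never be read by another core afterward.
    for other, lines in caches.items():
        if other != core:
            lines.discard(addr)
    caches[core].add(addr)

read("core0", 0x40)
read("core1", 0x40)
write("core0", 0x40)
print(0x40 in caches["core1"])   # False -- core1's copy was invalidated
```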