6.0 The memory system Flashcards
What is the memory wall?
The processor is much faster than memory, so fetching from memory limits performance
How is memory latency hidden?
Using a memory hierarchy
What does a memory hierarchy provide?
Hide latency
illusion of infinite capacity
relatively low cost
What are the components in the memory hierarchy?
Register file (SRAM)
L1 cache (on-chip) (SRAM)
L2 cache (on-chip) (SRAM)
Main memory (DRAM)
Hard drive (Magnetic)
Why do memory hierarchies work?
Programs have locality
Spatial, temporal
Move frequently used data and instructions close to the processor (in registers - done by compiler, in cache - done by hardware).
What is temporal locality?
Accesses to the same location are likely to occur in near time
What is spatial locality?
Accesses to nearby locations are likely to happen in near time
What is a cache?
Consists of a number of equally sized locations (blocks/lines)
What are the three types of caches?
Direct mapped
Fully associative
Set associative
What is a direct mapped cache?
Each block address is mapped to exactly one cache location
index = block_addr % block_count
Tag: the remaining part of the address, stored with the block and compared on a lookup to detect a hit
More efficient lookup, but more conflict misses
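A minimal C sketch of the direct-mapped mapping above (block size and block count are assumed values for illustration):

    #include <stdint.h>

    #define BLOCK_SIZE  64     /* bytes per cache block (assumed)         */
    #define BLOCK_COUNT 256    /* number of blocks in the cache (assumed) */

    /* Split an address into the fields a direct-mapped cache uses. */
    void direct_mapped_lookup(uint64_t addr, uint64_t *index, uint64_t *tag)
    {
        uint64_t block_addr = addr / BLOCK_SIZE;   /* drop the block offset     */
        *index = block_addr % BLOCK_COUNT;         /* exactly one possible slot */
        *tag   = block_addr / BLOCK_COUNT;         /* stored with the block and
                                                      compared on every access  */
    }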
What is a fully associative cache?
All blocks can be stored anywhere in the cache.
No index
Need to search the whole cache
What is a set associative cache?
Each block can map to n locations (ways) in the cache. The index selects a set, and the tags of all n ways in that set must be compared to know whether the address is stored in the cache
sets = block_count / n
index: block_address % set_count
less prone to location conflicts compared to direct mapped caches
What does the address used to access the cache consist of?
Block address (tag + index)
Block offset: addresses individual bytes within a block
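A small C sketch of splitting an address into block offset, index and tag (this also covers the set-associative case; the geometry values are assumptions for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: 64 B blocks, 128 sets (n-way set associative). */
    #define BLOCK_SIZE 64
    #define SET_COUNT  128

    int main(void)
    {
        uint64_t addr = 0x12345678;

        uint64_t block_offset = addr % BLOCK_SIZE;       /* byte within the block */
        uint64_t block_addr   = addr / BLOCK_SIZE;       /* tag + index           */
        uint64_t index        = block_addr % SET_COUNT;  /* selects the set       */
        uint64_t tag          = block_addr / SET_COUNT;  /* compared in every way */

        printf("offset=%llu index=%llu tag=%llu\n",
               (unsigned long long)block_offset,
               (unsigned long long)index,
               (unsigned long long)tag);
        return 0;
    }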
What are the trade offs when choosing block size?
Large blocks:
- Greater opportunity for spatial locality
- More likely that large portions of the block are unreferenced before the block is evicted
- Take longer time to read into a cache - increased miss penalty
What is the 3C miss classification?
Compulsory misses
Capacity misses
Conflict misses
What are compulsory misses?
Misses that still would occur in an infinite sized cache.
I.e. the first access to a block
What are capacity misses?
Misses that would occur even if the cache was fully associative.
The amount of data accessed within a time frame is larger than the cache capacity
What are conflict misses?
Misses that occur because the cache is not fully associative.
What are separate caches?
Used when we know the caches have specific functions (i.e. data and instruction caches)
Instructions have the same format, and are only read
Often used for L1 caches. Instructions have very good locality, so a split L1 captures most of the potential, meaning we don't need split caches further down the memory hierarchy.
Avoids structural hazards for simple pipelines
What is a write buffer?
When a cache needs to write, it writes the data into the write buffer.
This write buffer can continue writing into memory in the background, so that the pipeline does not need to stall.
What do we need to think about when designing caches?
- How to deal with writes
- How to allocate space in cache for writes
- How to allocate space in cache for reads
- What block to evict on a miss
- Should we restrict what data can be placed in higher level caches?
- How many concurrent misses should we support?
What are the different ways of handling writes?
Write-back
Write-through
Write-allocate
Write-no-allocate
What is write-back?
Update the cached value, if it is there, and write the cache value back to memory at a later time (when the block is evicted).
Betting that there will be multiple writes to the same cache line, so it is better to do one big write instead of multiple small ones.
Tradeoff: Saves bandwidth, but adds complexity
What is write-through?
Write updated data to lower-level memory on all writes, in addition to updating cache.
What is write-allocate?
Insert written value into cache on a write miss
What is write-no-allocate?
Bypass cache on a write miss.
If the value is not in cache, write straight to the lower level memory.
What are typical combinations of write-handling?
Write-back + write-allocate
Write-through + write no-allocate
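A rough C sketch contrasting the two common combinations on a write (the line structure and the helpers mentioned in the comments are assumptions, not a full cache model):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid;
        bool     dirty;        /* only meaningful for write-back */
        uint64_t tag;
        uint8_t  data[64];
    } cache_line_t;

    /* Write-back + write-allocate: on a miss, fetch the block into the cache,
       then update only the cache and mark the line dirty; memory is updated
       when the line is eventually evicted. */
    void write_back_allocate(cache_line_t *line, uint64_t tag, int off, uint8_t byte)
    {
        if (!line->valid || line->tag != tag) {
            /* miss: allocate - write back the old line if dirty, then fetch
               the requested block from the next level (helpers not shown) */
            line->valid = true;
            line->tag   = tag;
        }
        line->data[off] = byte;
        line->dirty = true;            /* memory is now stale until eviction */
    }

    /* Write-through + write-no-allocate: every write goes to the next level;
       the cache is only updated if the block already happens to be there. */
    void write_through_no_allocate(cache_line_t *line, uint64_t tag, int off, uint8_t byte)
    {
        if (line->valid && line->tag == tag)
            line->data[off] = byte;    /* keep the cached copy consistent */
        /* the write to the next level / write buffer happens here on every
           write (helper not shown) */
    }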
How are read allocations handled?
Allocate-on-fill
Allocate-on-miss
What is allocate on fill?
Insert the requested data into the cache, when the miss returns from lower levels.
You can still hit in all ways while the miss is being serviced (may improve performance)
Higher complexity
What is allocate on miss?
On a miss, immediately evict a block and reserve its location for the data the load will return.
What is a replacement policy, and name three examples?
Decides what to evict on a miss.
In a fully- or set-associative cache, there is more than one cache line that can be evicted.
Random, FIFO, LRU
What is the Random replacement policy?
Choose a random block
Less effective, very simple
What is the FIFO replacement policy?
Evict block inserted longest ago
More complexity: need to remember insertion order
Medium complexity, but more effective
What is the LRU replacement policy?
Evict least-recently-used
Need to remember the reuse order, which is updated on every access
When you access a block, it is moved to the top of the stack/queue. On a miss, evict the tail of the queue within the set
Most complex, most effective.
Can be expensive in high-associative caches
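A minimal C sketch of true LRU for one set, keeping the ways ordered by recency (the associativity is an assumed value):

    #include <stdint.h>

    #define WAYS 4    /* assumed associativity */

    /* Recency stack for one set: lru_order[0] is the most recently used way,
       lru_order[WAYS-1] is the least recently used (the eviction victim). */
    typedef struct {
        uint8_t lru_order[WAYS];
    } lru_state_t;

    /* On every access (hit or fill), move the touched way to the front.
       Assumes 'way' is present in the stack. */
    void lru_touch(lru_state_t *s, uint8_t way)
    {
        int i = 0;
        while (s->lru_order[i] != way)
            i++;
        for (; i > 0; i--)
            s->lru_order[i] = s->lru_order[i - 1];
        s->lru_order[0] = way;
    }

    /* On a miss, the victim is the way at the tail of the recency stack. */
    uint8_t lru_victim(const lru_state_t *s)
    {
        return s->lru_order[WAYS - 1];
    }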
What are inclusive caches?
Any block cached at level n is also cached at the level n+1 cache
Pro:
This simplifies coherence: if a block is not in level n, it is not in level n-1 either
Con:
- Reduces cache utilization, as we are in practice storing redundant copies.
- Adds some overhead
What are exclusive caches?
Any block in level n cache, is not in level n+1 cache.
Simplifies coherence - we know a block is in at most 1 level
Does not reduce cache utilization, as blocks are stored in at most 1 cache
Are all caches either inclusive or exclusive?
No, an option is to not enforce inclusivity or exclusivity.
Let the cache coherence deal with the fact that any block can be cached anywhere in the hierarchy
What are non-blocking caches?
The key component is the MSHR
MSHR:
- Miss Status/Information Holding Register
- On a miss, go into the miss handling architecture (MHA)
- Ask if there is an MSHR outstanding for this block address
- If there is, a miss for this block is already in progress, and you allocate an entry in a space called target information, recording that this request also wants a word from this cache block.
- When the miss returns, the cache goes through the target information and responds to all the recorded requests
The cache can service as many misses as there are MSHRs - selecting the number of MSHRs is critical
If the cache runs out of MSHRs it needs to block and cannot handle either hits or misses, because there is nowhere to record the new miss
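A simplified C sketch of MSHR lookup and merging on a miss (the MSHR count, target count and struct fields are assumptions; real MSHRs track more state):

    #include <stdbool.h>
    #include <stdint.h>

    #define MSHR_COUNT  8    /* assumed; choosing this number is the critical design point */
    #define MAX_TARGETS 4    /* assumed number of merged requests per MSHR */

    typedef struct {
        bool     valid;
        uint64_t block_addr;                  /* address of the outstanding miss        */
        int      n_targets;
        uint64_t target_offsets[MAX_TARGETS]; /* which words in the block to return     */
    } mshr_t;

    /* Returns true if the miss was accepted (new or merged), false if the cache
       must block because all MSHRs (or target slots) are in use. */
    bool handle_miss(mshr_t mshrs[MSHR_COUNT], uint64_t block_addr, uint64_t offset)
    {
        /* 1. Is there already an outstanding miss for this block? Then merge. */
        for (int i = 0; i < MSHR_COUNT; i++) {
            if (mshrs[i].valid && mshrs[i].block_addr == block_addr) {
                if (mshrs[i].n_targets == MAX_TARGETS)
                    return false;                       /* no room to record the target */
                mshrs[i].target_offsets[mshrs[i].n_targets++] = offset;
                return true;
            }
        }
        /* 2. Otherwise allocate a free MSHR and send the miss to the next level. */
        for (int i = 0; i < MSHR_COUNT; i++) {
            if (!mshrs[i].valid) {
                mshrs[i].valid = true;
                mshrs[i].block_addr = block_addr;
                mshrs[i].n_targets = 1;
                mshrs[i].target_offsets[0] = offset;
                return true;
            }
        }
        return false;   /* out of MSHRs: the cache has to block */
    }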
What are the benefits of non-blocking caches?
Enables servicing cache hits while a miss is being resolved (hit-under-miss policy)
Enables hiding memory latency by servicing multiple misses concurrently (memory-level parallelism, MLP)
Enables merging concurrent requests to the same cache block
What is virtual memory?
Every process sees a 32/64-bit address space. The physical address space is much smaller.
Virtual addresses need to be mapped to physical ones
Because every process has the illusion that it owns the complete address space, time sharing and demand paging are much easier.
What is time sharing?
Being able to switch between processes at runtime
What is demand paging?
Hardware figures out if a page is available in main memory or not.
If it is not, it might not have been allocated yet. In this case, the software needs to load it from disk.
This is very expensive, so instead of blocking the processor, we generate an exception (page fault).
When a page fault occurs, the OS schedules the process out, finds the page to be replaced, and loads the new page from disk. Meanwhile, as this is non-blocking I/O, other processes run.
When the page is ready the process is set as runnable again, and can be scheduled.
How does address translation work?
Take the virtual address and do the address translation in hardware.
From the address translation we get a physical address that is used to access main memory.
Memory is allocated in fixed-size chunks called pages (4KB or 8KB)
What are pages?
Memory is allocated in fixed-size chunks called pages (4KB or 8KB)
What is the page table?
Key data structure to support virtual memory.
Maintained by the OS
Two types:
- Forward page tables
- Inverted page tables
What are forward page tables?
Have a pointer from every virtual page to a physical page.
This is inefficient, as the virtual address space is much bigger than the physical one
What are inverted page tables?
One entry for each physical page in memory
What is the TLB (translation lookaside buffer)?
Hardware implementation of inverted page tables.
Small hardware structure that is highly associative, with typically 64 entries. Essentially a small cache for the page table in memory.
Often have separate I-TLBs and D-TLBs
What happens on a TLB miss?
Need to grab the address translation from the page table that is stored in memory (this can be done both in HW and SW)
Note that sometimes the page table might be too large to be kept in memory - then the OS might need to be involved to get it resolved.
What is one important thing when using virtual memory?
Important that a process cannot access physical memory that is not allocated to it.
Because of this, we must perform the address translation before we send the request to physical memory.
How does memory access work with virtual addresses to physical ones?
The tag and index of the virtual address are used to access the TLB
The TLB gives a physical page number.
Combine this physical page number with the page offset from the virtual address; this gives the physical address
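A small C sketch of the translation step (4 KB pages assumed; tlb_lookup is a hypothetical stand-in for the real TLB lookup):

    #include <stdint.h>

    #define PAGE_SIZE 4096u    /* assumed 4 KB pages */

    /* Stand-in for the real TLB: a single hard-wired mapping so the sketch is
       self-contained (real hardware does an associative lookup, and misses
       fall back to the page table). */
    static uint64_t tlb_lookup(uint64_t vpn)
    {
        return vpn + 100;      /* pretend virtual page n maps to physical page n+100 */
    }

    uint64_t translate(uint64_t vaddr)
    {
        uint64_t page_offset = vaddr % PAGE_SIZE;   /* unchanged by translation */
        uint64_t vpn         = vaddr / PAGE_SIZE;   /* virtual page number      */
        uint64_t ppn         = tlb_lookup(vpn);     /* physical page number     */
        return ppn * PAGE_SIZE + page_offset;       /* physical address         */
    }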
What does a virtual address entry contain?
Tag: Used to access TLB
Index: Used to access TLB
Page offset: Address within page
What does a physical address entry contain?
Cache Tag
Cache Index
Block offset
All used to access cache
What is memory protection?
When running multiple processes, we have pages from multiple processes in memory. We don't want processes to modify each other's memory.
The different processes might use the same virtual addresses, but the OS makes sure they map to different parts of physical memory.
Processes might share some memory (e.g. a shared library). In this case we don't want a copy in each process's memory; instead, the virtual addresses will map to the same part of memory.
What are virtually indexed, physically tagged caches?
Make sure the index bits and block offset of the physical address fall within the page offset of the virtual address.
The page offset is the same in the virtual and the physical address.
In this case we can look up the block in the cache (using the index and block offset taken from the page offset of the virtual/physical address) at the same time as the TLB lookup happens using the index and tag from the virtual address.
The physical page number from the TLB provides the physical tag. Compare this tag to the tags read from the cache and determine if we had a hit.
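A tiny C sketch of the constraint that makes virtually indexed, physically tagged caches work (the geometry values are assumptions):

    #include <assert.h>

    /* Assumed geometry. */
    #define PAGE_SIZE  4096u
    #define BLOCK_SIZE 64u
    #define SET_COUNT  64u

    int main(void)
    {
        /* VIPT requires that index + block offset fit inside the page offset,
           i.e. set_count * block_size <= page_size. Then those bits are equal
           in the virtual and the physical address, so the cache can be indexed
           with the virtual address while the TLB translates the page number in
           parallel; the TLB's physical page number is compared against the tags. */
        assert((unsigned long)SET_COUNT * BLOCK_SIZE <= PAGE_SIZE);
        return 0;
    }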
What are the three steps of memory operations?
Address calculation
Translation
Mem access
How are loads done in out-of-order processors?
Loads need to wait for the register with the address info
How are stores done in out-of-order processors?
Need to wait for 2 registers (address + value)
What is the execution flow of loads/stores in out-of-order processors?
Address calculations are done in the FUs
Second stage: Then access TLB to get translation.
Third stage:
- Load: Read value from cache/mem and write to reg
- Store: Write value to store queue. Later into store buffer on completion. Yet later written to cache/mem on retirement
If we were to use virtually indexed, physically tagged caches, we would do these checks on virtual addresses.
How does speculative execution affect loads/stores?
Loads: Do not handle page fault unless we are on correct path
Stores: Because of in-order completion, memory state is not updated speculatively
What hazards can occur on out-of-order memory operations?
RAW: loads may read data before a prior store has written its value
WAW and WAR cannot occur because writes happen in program order
How can we accelerate loads?
Load bypassing
Load forwarding
What is load bypassing?
Move load before prior stores
What is load forwarding?
Take the value being stored and forward it to the dependent load without accessing memory.
How can out-of-order stores be implemented in hardware?
Stores are first in reservation station
Stores are then issued to a store-unit
When a store has both the address and the data, the store is finished.
When the store becomes the oldest instruction in the ROB it becomes completed
Whenever the cache is ready to receive the store it gets retired, and the data and address get written to the cache.
How can out-of-order loads be implemented in hardware?
Loads are first in reservation station
Loads are then issued to a load-unit.
The load sits in the load buffer while the address is being looked up in cache.
The cache then returns with the data that is written to a register.
When the load becomes the oldest instruction in the ROB and is ready to commit, the register will change into an architecturally visible register.
Loads are often given priority above stores, as they often are on the critical path.
Why do bypassing and forwarding cause complexity?
The most recent value may not be in cache/memory, but in the store queue/buffer
Need to search the store queue/buffer for the most recent value of that memory location
If we hit in the store queue, we forward the data to the load
Needs comparators to search the store queue
Possibly multiple stores with the same address; need to pick the most recent one
This is a Content Addressable Memory (CAM)
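A simplified C sketch of the store queue search a load-forwarding implementation needs (sizes and the entry layout are assumptions; hardware does this associatively in a CAM rather than with a loop):

    #include <stdbool.h>
    #include <stdint.h>

    #define SQ_ENTRIES 16      /* assumed store queue size */

    typedef struct {
        bool     valid;
        bool     addr_known;   /* address has been computed   */
        bool     data_known;   /* value to store is available */
        uint64_t addr;
        uint64_t data;
    } sq_entry_t;

    typedef enum { SQ_NO_MATCH, SQ_FORWARD, SQ_MUST_WAIT } sq_result_t;

    /* Entries are assumed to be held in program order, index 0 oldest. Search
       from the youngest store older than the load down to the oldest; the
       first address match holds the most recent value for that location. */
    sq_result_t forward_from_store_queue(const sq_entry_t sq[SQ_ENTRIES],
                                         int youngest_older_than_load,
                                         uint64_t load_addr, uint64_t *value)
    {
        for (int i = youngest_older_than_load; i >= 0; i--) {
            if (sq[i].valid && sq[i].addr_known && sq[i].addr == load_addr) {
                if (!sq[i].data_known)
                    return SQ_MUST_WAIT;   /* match, but the data is not ready yet */
                *value = sq[i].data;       /* load forwarding: skip cache/memory   */
                return SQ_FORWARD;
            }
        }
        return SQ_NO_MATCH;                /* read from cache/memory instead       */
    }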
What is a potential problem when doing loads/stores out-of-order?
Store prior to a dependent load might not have executed
Store might be in the reservation station or in flight in execution
In these cases, the value may not yet be in queue/buffer
Even worse, the address may not even be known yet
How do we implement out-of-order loads in hardware to solve the potential problems when it is dependent on a prior store?
Add a structure called the finished load buffer
When a load has gotten the data from cache/memory, the address and data are put in the finished load buffer.
When the load completes, i.e. is the oldest instruction, we can remove it from the finished load buffer.
In the meanwhile, when stores complete we search through the finished load buffer for a load with the same address as the store.
If we find that the load was dependent on the store and has loaded the wrong data, then the load and all instructions that were dependent on it, i.e. used the loaded data, will be rolled back.
These are then re-executed with the correct data. This is expensive, so we can implement some prediction to avoid this
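A simplified C sketch of the check a completing store performs against the finished load buffer (sizes and fields are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define FLB_ENTRIES 16     /* assumed finished load buffer size */

    typedef struct {
        bool     valid;
        uint64_t addr;         /* address the finished load read from     */
        uint64_t seq;          /* age tag giving the load's program order */
    } flb_entry_t;

    /* Called when a store completes: if a younger load to the same address has
       already finished, it read stale data, so it and its dependents must be
       rolled back and re-executed. */
    bool store_violates_finished_load(const flb_entry_t flb[FLB_ENTRIES],
                                      uint64_t store_addr, uint64_t store_seq)
    {
        for (int i = 0; i < FLB_ENTRIES; i++) {
            if (flb[i].valid && flb[i].addr == store_addr && flb[i].seq > store_seq)
                return true;   /* ordering violation: trigger rollback/replay */
        }
        return false;
    }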