Appendix B Flashcards
Cache performance
CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time
CPU clock cycles include the time to handle cache hits; the processor is stalled during cache misses
Miss penalty
Cost per miss (the number of stall cycles caused by the miss)
Memory stall cycles
Memory stall cycles = Number of misses x Miss penalty = IC x (Misses/Instruction) x Miss penalty = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty
IC - instruction count
Memory stall cycles for reads and writes
Memory stall cycles = IC x Reads per instruction x Read miss rate x Read miss penalty + IC x Writes per instruction x Write miss rate x Write miss penalty
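A minimal Python sketch of this formula; every number in the call below is an illustrative assumption, not a figure from the text:

```python
def memory_stall_cycles(ic, reads_per_instr, read_miss_rate, read_miss_penalty,
                        writes_per_instr, write_miss_rate, write_miss_penalty):
    # Stall cycles = read stalls + write stalls, per the formula above.
    read_stalls = ic * reads_per_instr * read_miss_rate * read_miss_penalty
    write_stalls = ic * writes_per_instr * write_miss_rate * write_miss_penalty
    return read_stalls + write_stalls

# Illustrative values only (not taken from the text).
print(memory_stall_cycles(1_000_000, 1.3, 0.02, 50, 0.3, 0.05, 50))
```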
Example 1 - Assume we have a computer where the cycles per instruction (CPI) is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 50 clock cycles and the miss rate is 1%, how much faster would the computer be if all instructions were cache hits?
CPU execution time all hits = (CPU clock cycles + Memory stall cycles) x Clock cycle time = (IC x CPI + 0) x Clock cycle time = IC x 1.0 x Clock cycle time
Memory stall cycles = IC x (1 + 0.5) x 0.01 x 50 = IC x 0.75 (each instruction makes 1 instruction fetch plus 0.5 data accesses)
CPU execution time misses = (CPU clock cycles + Memory stall cycles) x Clock cycle time = (IC x 1.0 + IC x 0.75) x Clock cycle time = 1.75 x IC x Clock cycle time
Speed up = CPU execution time misses / CPU execution time all hits = (1.75 x IC x Clock cycle time) / (1.0 x IC x Clock cycle time) = 1.75
The computer with no cache misses is 75% faster.
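A quick Python check of Example 1 (1 instruction fetch plus 0.5 data accesses per instruction, as above):

```python
accesses_per_instr = 1.5   # 1 instruction fetch + 0.5 loads/stores per instruction
miss_rate = 0.01
miss_penalty = 50          # clock cycles

stall_cycles_per_instr = accesses_per_instr * miss_rate * miss_penalty  # 0.75
cpi_all_hits = 1.0
cpi_with_misses = cpi_all_hits + stall_cycles_per_instr                 # 1.75
print(cpi_with_misses / cpi_all_hits)                                   # speedup = 1.75
```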
Misses per instruction
Misses/Instruction = Miss rate x (Memory accesses/Instruction)
Example 2 - Redo Example 1 using misses per instruction instead of miss rate: assume 30 misses per 1000 instructions and a miss penalty of 25 clock cycles.
What is memory stall time in terms of instruction count?
Memory stall cycles = Number of misses x Miss penalty = IC x (Misses/Instruction) x Miss penalty = (IC/1000) x (Misses per 1000 instructions) x Miss penalty = (IC/1000) x 30 x 25 = IC x 0.75
Four memory hierarchy questions
Q1 Where can a block be placed in upper level? (block placement)
Q2 How is a block found if it is in the upper level? (block identification)
Q3 Which block should be replaced on a miss? (block replacement)
Q4 What happens on a write? (write strategy)
Cache - Q1 Where can a block be placed in upper level? (block placement)
Fully associative - block can be placed anywhere in cache
Direct mapped - each block can go in exactly one place in the cache. Mapping is usually (Block address) mod (Number of blocks in cache)
Set associative - a block can be placed anywhere within one set of blocks in the cache. Mapping is usually (Block address) mod (Number of sets in cache). With n blocks per set, the cache is n-way set associative
Cache - Q2 How is a block found if it is in the upper level? (block identification)
- valid bit - indicates whether the entry contains a valid block
- block frame address - divided into the tag field and the index field
- index field - selects the set
- tag field - compared against the processor address to detect a hit; the index bits need not be stored in the tag (they are redundant, since they selected the set), which saves hardware
- block offset - selects the desired data within the block; it is not compared
For speed, all tags in the selected set are searched in parallel.
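A sketch of how the fields are carved out of an address; the 64-byte block and 256-set geometry is a made-up example:

```python
BLOCK_OFFSET_BITS = 6    # 64-byte blocks (example value)
INDEX_BITS = 8           # 256 sets (example value)

def split_address(addr):
    # Split an address into block offset, index (set selector) and tag.
    offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    index = (addr >> BLOCK_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# The index is equivalent to (block address) mod (number of sets);
# only the tag is stored and compared in the cache.
tag, index, offset = split_address(0x1234ABCD)
print(hex(tag), index, offset)
```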
Cache - Q3 Which block should be replaced on a miss? (block replacement)
Direct mapped - only one block is checked, so there is no choice to make
Associative and set associative - multiple blocks are candidates, so one must be chosen.
Three methods:
Random - candidate blocks are selected randomly; pseudorandom number generation keeps the behavior reproducible for hardware debugging
LRU - least recently used - replaces the block that has gone unused for the longest time
FIFO - first in, first out - approximates LRU by replacing the oldest block instead of the least recently used one
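A minimal sketch of LRU replacement for a single set, using Python's OrderedDict to track recency; tags only, no data:

```python
from collections import OrderedDict

class LRUSet:
    # One n-way set: keeps the most recently used tags, evicts the LRU one.
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()        # insertion order = recency order

    def access(self, tag):
        if tag in self.tags:             # hit: mark as most recently used
            self.tags.move_to_end(tag)
            return "hit"
        if len(self.tags) >= self.ways:  # miss in a full set: evict the LRU block
            self.tags.popitem(last=False)
        self.tags[tag] = True
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])  # ['miss', 'miss', 'hit', 'miss', 'miss']
```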
Cache - Q4 What happens on a write? (write strategy)
Reads are the most common cache operation in RISC processors
Reads are simpler to implement - the block can be read at the same time the tag is compared, and on a miss the data read is simply ignored and the correct block is fetched. This does not work for writes: the block must not be modified until the tag check confirms a hit, and writes also specify the size of the write, which makes them more complex.
Techniques for writes with Cache:
write through - information is written both to the cache and lower-level memory
write back - information is written only to the cache; the modified block is written to main memory only when it is replaced - tracked with a dirty bit
With write back, writes occur at the speed of the cache and use less memory bandwidth and energy - attractive for embedded systems, servers and multiprocessor systems
Write through is easier to implement - the cache is always clean, so a replaced block never needs to be written back to the lower level, which also simplifies data coherency. With multilevel caches, write through becomes more viable for the upper-level caches
Write stall - the processor waits for a write to complete (common with write through). Solution -> write buffer
Write misses resolution:
Write allocate - the block is allocated on a write miss, followed by the write hit actions described above. With this natural option, write misses act like read misses.
No-write allocate - write misses do not affect the cache; the block is modified only in lower-level memory
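A toy Python sketch contrasting the two write strategies; the block/dictionary model is a simplification and not how real hardware is organized:

```python
class WriteBackBlock:
    # Write back: writes go only to the cache; a dirty bit remembers that
    # lower-level memory must be updated when the block is evicted.
    def __init__(self, data):
        self.data = data
        self.dirty = False

    def write(self, value):
        self.data = value
        self.dirty = True

    def evict(self, memory, addr):
        if self.dirty:                 # write back only if the block was modified
            memory[addr] = self.data

def write_through(cache, memory, addr, value):
    # Write through: update the cache and lower-level memory together.
    cache[addr] = value
    memory[addr] = value

mem, cache = {0x40: 7}, {0x40: 7}
write_through(cache, mem, 0x40, 9)     # both copies updated immediately
```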
Average memory access time
Average memory access time = Hit time + Miss rate x Miss penalty
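A one-line Python helper for the formula; the hit time, miss rate and penalty below are illustrative values only:

```python
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(hit_time=1, miss_rate=0.02, miss_penalty=100))   # 3.0 cycles
```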
Relationship between CPI and fast clock
- The lower the CPI(execution), the higher the relative impact of a fixed number of cache miss clock cycles
- The cache miss penalty is measured in processor clock cycles per miss. Even if the memory hierarchies of two computers are identical, the processor with the higher clock rate spends more clock cycles per miss and therefore has a higher memory portion of CPI
This emphasizes the importance of caches for processors with high clock rates and low CPI
Cache optimization categories
Reducing the miss rate - larger block size, larger cache size and higher associativity
Reducing the miss penalty - multilevel caches and giving reads priority over writes
Reducing the time to hit in the cache - avoiding address translation when indexing the cache
Cache miss categories
Compulsory - the very first access to a block cannot be a hit, since the block has never been in the cache
Capacity - the cache cannot contain all the blocks a program needs, so blocks are discarded and later retrieved
Conflict - with set associative or direct-mapped placement, too many blocks can map to the same set (or block frame), causing misses even when the rest of the cache has room
Cache optimization 1: Larger block size to reduce miss rate
- takes advantage of spatial locality
- increases miss penalty
- may increase conflict misses
- the increase in miss penalty may outweigh the decrease in miss rate
Cache optimization 2: Larger Caches to reduce miss rate
- higher capacity reduces the misses
- longer hit time, higher cost and power consumption
Cache optimization 3: Higher associativity to reduce miss rate
Two rules of thumb:
- an 8-way set associative cache is, in practice, as effective at reducing misses as a fully associative cache
- 2:1 cache rule of thumb - a direct-mapped cache of size N has about the same miss rate as a two-way set associative cache of size N/2
- improves the miss rate, but increases hit time
Cache optimization 4: Multilevel Caches to reduce miss penalty
- keep the first-level cache small and fast enough to match the processor's clock cycle time
- make the second-level cache large enough to capture many accesses that would otherwise go to main memory
Performance analysis:
Average memory access time = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2)
Average memory stalls per instruction = Misses per instruction_L1 x Hit time_L2 + Misses per instruction_L2 x Miss penalty_L2 (a worked sketch follows this card)
Miss rates:
- Local miss rate - the number of misses in a cache divided by the total number of accesses to that cache (Miss rate_L1, Miss rate_L2)
- Global miss rate - the number of misses in the cache divided by the total number of memory accesses generated by the processor (Miss rate_L1 for L1, and Miss rate_L1 x Miss rate_L2 for L2)
Practice shows that L2 caches must be huge compared to L1 caches
- Multilevel inclusion - data in L1 cache are also in L2 cache
- Same vs. different block sizes - a smaller block size in L1 can give faster accesses but a higher L1 miss penalty; different block sizes also complicate multilevel inclusion, which can be avoided by keeping the L1 and L2 block sizes the same
Multilevel exclusion - used when the L2 cache is only slightly bigger than L1; a block resides in L1 or L2 but not both, and an L1 miss causes a swap of blocks between L1 and L2
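A minimal Python sketch of the two-level formulas above; all parameter values are assumed for illustration:

```python
# Assumed example parameters (not from the text).
hit_time_l1, miss_rate_l1 = 1, 0.05        # local L1 miss rate
hit_time_l2, miss_rate_l2 = 10, 0.40       # local L2 miss rate
miss_penalty_l2 = 200                      # cycles to main memory

amat = hit_time_l1 + miss_rate_l1 * (hit_time_l2 + miss_rate_l2 * miss_penalty_l2)
global_miss_rate_l2 = miss_rate_l1 * miss_rate_l2   # fraction of all accesses that reach memory
print(round(amat, 2), round(global_miss_rate_l2, 3))   # 5.5 cycles, 0.02
```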
Cache optimization 5: Giving priority to read misses over writes to reduce miss penalty
- reads are more performance-critical: the processor must wait for a read to finish, while writes can be buffered
- with write through, a write buffer is used, but it may still hold the updated value of a location needed by a read miss
- the simplest solution is to stall the read miss until the write buffer is empty
- a faster alternative is to check the write buffer contents on a read miss; if there is no conflict, the read can proceed before the buffered writes complete
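A sketch of the faster alternative, with the write buffer modeled as a simple list of (address, value) pairs; the structure and values are illustrative only:

```python
write_buffer = []      # pending (address, value) writes waiting to reach lower-level memory

def read_miss(addr, memory):
    # Check the write buffer first: the newest pending write to this address wins.
    for buf_addr, value in reversed(write_buffer):
        if buf_addr == addr:
            return value          # conflict: forward the buffered value
    return memory[addr]           # no conflict: the read proceeds ahead of the buffered writes

write_buffer.append((0x100, 42))
print(read_miss(0x100, {0x100: 0}))   # 42: the pending write is honored
```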
Cache optimization 6: Avoiding address translation during indexing of the cache to reduce hit time
- using virtual vs. physical addresses
- protection - page-level protection must still be enforced
- solution - copy the protection information from the TLB into the cache and check it on every access
- a process switch forces a flush of a virtually addressed cache - the same virtual addresses now map to different physical addresses
- solution - use a process identifier (PID) as part of the cache tag, so the cache need not be flushed
- synonyms or aliases - two different virtual addresses may map to the same physical address, so a virtual cache could hold two copies of the same data
- antialiasing hardware guarantees that every cache block has a unique physical address
- I/O typically uses physical addresses, which also complicates virtually addressed caches
- in practice, caches are usually physically addressed; virtually addressed caches are harder to implement correctly
- compromise - virtual indexing with physical tagging
- the cache is indexed with the page offset, which is identical in the virtual and physical address, so indexing can proceed in parallel with translation
- the tags are physical addresses, compared once the TLB delivers the translation
- drawback -> a direct-mapped cache can be no bigger than the page size
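A small sketch of the resulting size constraint, assuming 4 KiB pages and 64-byte blocks (made-up values): the index and block offset must fit within the page offset, so a direct-mapped, virtually indexed, physically tagged cache cannot exceed one page.

```python
page_size = 4096        # assumed 4 KiB pages
block_size = 64         # assumed 64-byte blocks
associativity = 1       # direct mapped

# Index + block-offset bits must fit in the page offset, so
# cache capacity <= page size x associativity.
max_cache_size = page_size * associativity
num_sets = max_cache_size // (block_size * associativity)
print(max_cache_size, num_sets)   # 4096 bytes, 64 sets
```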
Virtual memory - why?
- in the past, programmers had to divide programs into pieces (overlays) and ensure that a program never needed more than the computer's physical main memory -> too great a burden
- solution - virtual memory, which manages the two levels of memory automatically
Virtual memory - principles
- processor generates virtual addresses
- virtual addresses are mapped to physical addresses
- translating virtual addresses to physical addresses is called memory mapping or address translation
Cache vs. virtual memory
Cache - controlled by hardware
Virtual memory - controlled by the OS, has much higher miss penalties and can therefore afford better replacement algorithms
Cache - its size is independent of the processor's address space
Virtual memory - its size is determined by the processor's address space
Cache - not used as secondary storage
Virtual memory - the lower level (disk) also holds the file system and serves as secondary storage
Virtual memory - categories
Fixed-sized blocks - pages
Variable-sized blocks - segments
Hybrid - both segments and pages - segments consist of pages
Four questions for virtual memory
Q1 Where can a block be placed in main memory?
Q2 How is a block found if it is in main memory?
Q3 Which block should be replaced on a virtual memory miss?
Q4 What happens on a write?
Virtual memory - Q1 Where can a block be placed in main memory?
- virtual memory misses are serviced by slowly rotating magnetic disks
- misses are therefore extremely expensive
- the main goal is to achieve the lowest possible miss rate
- so placement is almost always fully associative - the OS may place a page anywhere in main memory
Virtual memory - Q2 How is a block found if it is in main memory?
- depending on the technique used:
- paging - the virtual page number indexes the page table to obtain the physical page (frame) address, and the page offset is concatenated to it to form the physical address
- some computers apply hashing to the virtual address to shrink this structure - an inverted page table, whose length equals the number of physical pages
- another speed-up is the use of a TLB (translation lookaside buffer)
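A minimal sketch of paged translation with a tiny TLB in front of the page table; the page size, mappings and structure are a made-up illustration, not a real design:

```python
PAGE_SIZE = 4096                     # assumed 4 KiB pages
page_table = {0: 5, 1: 9, 2: 3}      # virtual page number -> physical frame number (made up)
tlb = {}                             # small cache of recent translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                   # TLB hit: no page-table access needed
        frame = tlb[vpn]
    else:                            # TLB miss: consult the page table, then cache the entry
        frame = page_table[vpn]
        tlb[vpn] = frame
    return frame * PAGE_SIZE + offset    # concatenate frame number and page offset

print(hex(translate(0x1234)))        # virtual page 1 -> frame 9 -> 0x9234
```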
Virtual memory - Q3 Which block should be replaced on a virtual memory miss?
- the most common approximation is LRU
- a use (reference) bit is set whenever the page is accessed
- the OS periodically clears these bits and later checks which ones were set again, approximating which pages were least recently used
Virtual memory - Q4 What happens on a write?
- always write back (writing through to a rotating disk would be far too slow)
- a dirty bit is used so a page is written back to disk only when necessary
Techniques for fast address translation
- page tables are so large that they are stored in main memory and are sometimes paged themselves (two-level paging) -> a data access logically requires at least two memory accesses: one for the page table entry and one for the data
- Address translation cache - TLB
Translation lookaside buffer
- similar to cache entry
- contains tags
- implements protection mechanisms
- the OS can modify the underlying page tables (protection, mapping, etc.)
- stale TLB entries must then be removed (flushed)
- the OS uses the dirty bits to decide whether a replaced page must be written back
Typical entry (virtual address of n bits, page offset of k bits, physical address of m bits):
- virtual page number - the upper n - k bits of the virtual address, looked up in the TLB
- TLB entry fields: V (valid bit), R/W (read/write permissions), U/S (user/supervisor), D (dirty bit), A (accessed bit, used by LRU), Tag (n - k bits, the virtual page number), physical page number (m - k bits)
- output: the physical page number concatenated with the k-bit page offset gives the m-bit physical address
Page size
- bigger pages save resources
- bigger pages allow larger caches with fast hit times
- bigger pages are more effective during transfers
- bigger pages allow for more efficient mapping of virtual memory and reduce TLB misses
- bigger pages cause higher internal fragmentation (unused space inside the page)
- modern microprocessors support multiple page sizes (sometimes referred to as page segmentation)
- for small processes, a large page size wastes storage, wastes I/O bandwidth and lengthens process start-up time
Processes
- multiprogramming led to the concept of a process
- process - an abstraction of a running program
- context switch / process switch - switching the processor from one process to another
- the process state must be saved on a switch, and processes must not interfere with one another
- multiple processes keep data in the same physical memory
- the data of each process must be protected so that other processes cannot access it
- sharing of data and code between processes
Protection model
- built on the principle of protection rings
- each process runs at a privilege level (the x86 model)
- a process cannot raise its own privilege level
- x86 has four levels - 0 is the kernel, 3 is user programs
- to run at a more privileged level, user code must enter the kernel through controlled calls (compare sudo in UNIX for running privileged programs)
Descriptor table - fields
Present field - valid bit - indicates if the page is a valid translation
Base field - page frame address - contains physical addresses
Access bit - similar to reference bit used by replacing algorithms
Attributes field - valid operations, protection levels…
Limit field - upper bounds of the segment
An entry in this table is called a segment descriptor
Descriptor tables - protection model
- global address space - shared by all processes
- local address space - address space of each process
- protection field - 2 bits
- violation - a process tries to access a segment that is more privileged than the process itself
- call gates - controlled entry points for calling more privileged code
- a call gate contains the full address of the routine being called, so the caller cannot supply or change it
- this prevents a process from jumping to an arbitrary point in a more privileged address space