Appendix B Flashcards
Cache performance
CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time
CPU clock cycles include the time to handle cache hits; the processor is stalled during cache misses
Miss penalty
Cost per miss (the number of cycles the processor is stalled)
Memory stall cycles
Memory stall cycles = Number of misses x Miss penalty = IC x (Misses/Instruction) x Miss penalty = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty
IC - instruction count
Memory stall cycles for reads and writes
Memory stall cycles = IC x Reads per instruction x Read miss rate x Read miss penalty + IC x Writes per instruction x Write miss rate x Write miss penalty
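The read/write split above can be sketched numerically. Every parameter value below is an assumed, illustrative number, not taken from the text.

```python
# Memory stall cycles split into read and write components.
# All parameter values here are assumed for illustration.
def memory_stall_cycles(ic, reads_per_instr, read_miss_rate, read_penalty,
                        writes_per_instr, write_miss_rate, write_penalty):
    read_stalls = ic * reads_per_instr * read_miss_rate * read_penalty
    write_stalls = ic * writes_per_instr * write_miss_rate * write_penalty
    return read_stalls + write_stalls

# Assumed workload: 1.3 reads/instruction (including instruction fetches),
# 0.3 writes/instruction, 2% / 5% miss rates, 50-cycle penalties.
stalls = memory_stall_cycles(1_000_000, 1.3, 0.02, 50, 0.3, 0.05, 50)
```

Read and write stalls are computed separately because their miss rates and penalties can differ, then summed.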
Example 1 - Assume we have a computer where the cycles per instruction (CPI) is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 50 clock cycles and the miss rate is 1%, how much faster would the computer be if all instructions were cache hits?
CPU execution time all hits = (CPU clock cycles + Memory stall cycles) x Clock cycle time = (IC x CPI + 0) x Clock cycle time = IC x 1.0 x Clock cycle time
Memory stall cycles = IC x (1 + 0.5) x 0.01 x 50 = 0.75 x IC (each instruction makes 1 instruction-fetch access plus 0.5 data accesses)
CPU execution time with misses = (CPU clock cycles + Memory stall cycles) x Clock cycle time = (IC x 1.0 + IC x 0.75) x Clock cycle time = 1.75 x IC x Clock cycle time
Speedup = CPU execution time with misses / CPU execution time all hits = (1.75 x IC x Clock cycle time) / (1.0 x IC x Clock cycle time) = 1.75
The CPU with no cache misses is 75% faster.
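A quick check of Example 1's arithmetic, using the numbers from the example (IC and clock cycle time cancel, so the CPI ratio is the speedup):

```python
# Example 1 numbers: CPI = 1.0 on all hits, 1.5 memory accesses per
# instruction (1 instruction fetch + 0.5 loads/stores), 1% miss rate,
# 50-cycle miss penalty.
cpi_all_hits = 1.0
accesses_per_instr = 1.0 + 0.5
miss_rate = 0.01
miss_penalty = 50

stalls_per_instr = accesses_per_instr * miss_rate * miss_penalty  # 0.75
cpi_with_misses = cpi_all_hits + stalls_per_instr                 # 1.75
speedup = cpi_with_misses / cpi_all_hits
```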
Misses per instruction
Misses/Instruction = Miss rate x Memory accesses / Instruction
Example 2 - Redo Example 1 (CPI of 1.0 when all memory accesses hit, loads and stores are 50% of instructions), but measure misses per instruction instead of a miss rate: assume 30 misses per 1000 instructions and a miss penalty of 25 clock cycles.
What is memory stall time in terms of instruction count?
Memory stall cycles = Number of misses x Miss penalty = IC x (Misses/Instruction) x Miss penalty = (IC/1000) x (Misses/(1000 Instructions)) x Miss penalty = (IC/1000) x 30 x 25 = IC x 0.75
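The per-1000-instructions form can be checked the same way, using the 30 misses per 1000 instructions and 25-cycle penalty from the answer:

```python
# Example 2 numbers: 30 misses per 1000 instructions, 25-cycle miss penalty.
ic = 1_000_000  # any instruction count works; stalls scale linearly with IC
misses_per_1000 = 30
miss_penalty = 25

stall_cycles = (ic / 1000) * misses_per_1000 * miss_penalty
stalls_per_instr = stall_cycles / ic  # 0.75 stall cycles per instruction
```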
Four memory hierarchy questions
Q1 Where can a block be placed in upper level? (block placement)
Q2 How is a block found if it is in the upper level? (block identification)
Q3 Which block should be replaced on a miss? (block replacement)
Q4 What happens on a write? (write strategy)
Cache - Q1 Where can a block be placed in upper level? (block placement)
Fully associative - block can be placed anywhere in cache
Direct mapped - each block has exactly one place in the cache. Mapping is usually done using (Block address) mod (Number of blocks in cache)
Set associative - block can be placed anywhere within one set of blocks in the cache. Mapping is usually done using (Block address) mod (Number of sets in cache). With n blocks per set, the cache is n-way set associative
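The two mod mappings can be sketched for an assumed small cache: 8 blocks total, 2-way set associative, so 4 sets. The geometry is an illustrative assumption.

```python
# Assumed cache geometry: 8 blocks, 2-way set associative -> 4 sets.
NUM_BLOCKS = 8
NUM_SETS = 4  # 8 blocks / 2 ways

def direct_mapped_slot(block_address):
    # Direct mapped: exactly one legal slot per block address.
    return block_address % NUM_BLOCKS

def set_associative_set(block_address):
    # Set associative: the block may go in any way of this one set.
    return block_address % NUM_SETS

# Fully associative: any of the NUM_BLOCKS slots is legal; no mapping needed.
```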
Cache - Q2 How is a block found if it is in the upper level? (block identification)
- valid bit - indicates whether the block frame contains valid data
- block frame address - split into a tag field and an index field
- index field - selects the set
- tag field - compared against the address to check for a hit. The index is omitted from the tag - it would be redundant, since the index already selected the set (a block in set 0 must have index 0, and so on). Omitting it saves hardware
- block offset - selects the desired data within the block; not compared
For speed, the tags in the selected set are searched in parallel.
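Extracting the three fields from an address can be sketched for an assumed geometry: 32-byte blocks (5 offset bits) and 128 sets (7 index bits). Both sizes are illustrative assumptions.

```python
# Assumed geometry: 32-byte blocks -> 5 offset bits, 128 sets -> 7 index bits.
OFFSET_BITS = 5
INDEX_BITS = 7

def split_address(addr):
    # block offset: low bits, select data within the block (not compared)
    offset = addr & ((1 << OFFSET_BITS) - 1)
    # index: middle bits, select the set
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    # tag: remaining high bits, compared against stored tags for a hit
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```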
Cache - Q3 Which block should be replaced on a miss? (block replacement)
Direct mapped - only one block is checked
Associative and set associative - multiple blocks are checked.
Three methods
Random - candidate blocks are selected randomly. Some designs use pseudorandom numbers so the behavior is reproducible, which helps debugging
LRU - least recently used - Replaces blocks that were not used for the longest time first.
FIFO - first in, first out - approximates LRU by replacing the oldest block rather than the least recently used one
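Minimal sketches of LRU and FIFO for a single cache set. The class names and structure are mine, not from the text; each `access` returns the evicted block on a miss in a full set, else None.

```python
from collections import OrderedDict, deque

class LRUSet:
    """One set with `ways` blocks, LRU replacement."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # least recently used block first

    def access(self, tag):
        if tag in self.blocks:             # hit: mark most recently used
            self.blocks.move_to_end(tag)
            return None
        victim = None
        if len(self.blocks) == self.ways:  # miss in a full set: evict LRU
            victim, _ = self.blocks.popitem(last=False)
        self.blocks[tag] = True
        return victim

class FIFOSet:
    """One set with `ways` blocks, FIFO replacement."""
    def __init__(self, ways):
        self.ways = ways
        self.queue = deque()  # oldest-loaded block first

    def access(self, tag):
        if tag in self.queue:              # hit: load order is unchanged
            return None
        victim = None
        if len(self.queue) == self.ways:   # miss: evict the oldest block
            victim = self.queue.popleft()
        self.queue.append(tag)
        return victim
```

The difference shows on a re-referenced block: a hit refreshes its position under LRU but not under FIFO, so FIFO may evict a block that was just used.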
Cache - Q4 What happens on a write? (write strategy)
Reads are the most common operation in RISC processors
Reads are also simpler to implement - on a miss the read data is simply ignored and reread; no such shortcut applies to writes. Writes must also specify the size of the write, which makes them more complex.
Techniques for writes with Cache:
write through - information is written both to the cache and to the lower-level memory
write back - information is written only to the cache and is written to the lower-level memory when the block is replaced - uses a dirty bit
With write back, writes occur at the speed of the cache and use less memory bandwidth and energy - attractive for embedded systems, servers, and multiprocessor systems
Write through is easier to implement - the cache is always clean, so a replaced block never has to be written back, and the lower level always holds the current copy, which simplifies data coherency. With multilevel caches, write through becomes more viable for the upper-level caches
Write miss - can result in write stall. Solution -> write buffer
Write misses resolution:
Write allocate - the block is allocated on a write miss, followed by the write-hit actions above. In this natural option, write misses act like read misses.
No-write allocate - write misses do not affect the cache; the block is modified only in the lower-level memory
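The effect of the two policies can be counted on a short, assumed address stream. The sketch ignores capacity (a cache with room for every block), so only the write-miss policy matters.

```python
# Count misses for a reference stream under the two write-miss policies.
# The cache is modeled as an unbounded set of resident blocks (assumption).
def count_misses(stream, write_allocate):
    resident = set()
    misses = 0
    for op, addr in stream:
        if addr in resident:
            continue                # hit
        misses += 1
        if op == 'read' or write_allocate:
            resident.add(addr)      # block brought into the cache
    return misses

# Assumed stream of (operation, block address) pairs.
stream = [('write', 100), ('write', 100), ('read', 200),
          ('write', 200), ('write', 100)]
```

For this stream, no-write allocate leaves block 100 out of the cache, so every write to it misses; write allocate brings it in on the first miss.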
Average memory access time
Average memory access time = Hit time + Miss rate x Miss penalty
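A one-line check of the formula with assumed parameters: 1-cycle hit time, 2% miss rate, 50-cycle miss penalty.

```python
# Average memory access time; all parameter values are assumed.
hit_time = 1        # cycles
miss_rate = 0.02
miss_penalty = 50   # cycles

amat = hit_time + miss_rate * miss_penalty  # cycles per access
```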
Relationship between CPI and fast clock
- The lower the CPI (execution), the higher the relative impact of a fixed number of cache-miss clock cycles
- The cache miss penalty is measured in processor clock cycles. Even if two computers have identical memory hierarchies, the processor with the higher clock rate spends more clock cycles per miss and therefore has a larger memory portion of CPI
This emphasizes the importance of caches for processors with high clock rates and low CPI.
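A numeric sketch of the second point, with assumed values: the same hierarchy (100 ns miss penalty, 2% miss rate, 1.5 accesses per instruction) on a 1 GHz versus a 4 GHz processor.

```python
# Memory-stall contribution to CPI for a given clock rate.
# The hierarchy parameters are assumed, illustrative values.
def memory_cpi(clock_ghz, miss_penalty_ns=100, miss_rate=0.02,
               accesses_per_instr=1.5):
    penalty_cycles = miss_penalty_ns * clock_ghz  # ns * cycles/ns
    return accesses_per_instr * miss_rate * penalty_cycles

slow = memory_cpi(1.0)  # same 100 ns miss costs 100 cycles here
fast = memory_cpi(4.0)  # ...but 400 cycles here: 4x the memory CPI
```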
Cache optimization categories
Reducing the miss rate - larger block size, larger cache size and higher associativity
Reducing the miss penalty - multilevel caches and giving reads priority over writes
Reducing the time to hit in the cache - avoiding address translation when indexing the cache