Appendix B Flashcards
Cache performance
CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time
CPU clock cycles include the time to handle cache hits; the processor is stalled during cache misses
Miss penalty
Cost per miss (the number of stall cycles caused by the miss)
Memory stall cycles
Memory stall cycles = Number of misses x Miss penalty = IC x (Misses/Instruction) x Miss penalty = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty
IC - instruction count
Memory stall cycles for reads and writes
Memory stall cycles = IC x Reads per instruction x Read miss rate x Read miss penalty + IC x Writes per instruction x Write miss rate x Write miss penalty
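A minimal Python sketch of this formula; every number in the call below is an illustrative assumption, not a figure from the text:

```python
def memory_stall_cycles(ic, reads_per_instr, read_miss_rate, read_miss_penalty,
                        writes_per_instr, write_miss_rate, write_miss_penalty):
    # Stall cycles = read stalls + write stalls, per the formula above.
    read_stalls = ic * reads_per_instr * read_miss_rate * read_miss_penalty
    write_stalls = ic * writes_per_instr * write_miss_rate * write_miss_penalty
    return read_stalls + write_stalls

# Illustrative values only (not taken from the text).
print(memory_stall_cycles(1_000_000, 1.3, 0.02, 50, 0.3, 0.05, 50))
```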
Example 1 - Assume we have a computer where the cycles per instruction (CPI) is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 50 clock cycles and the miss rate is 1%, how much faster would the computer be if all instructions were cache hits?
CPU execution time all hits = (CPU clock cycles + Memory stall cycles) x Clock cycle time = (IC x CPI + 0) x Clock cycle time = IC x 1.0 x Clock cycle time
Memory stall cycles = IC x (1 + 0.5) x 0.01 x 50 = IC x 0.75 (each instruction makes 1 instruction fetch plus 0.5 data accesses)
CPU execution time misses = (CPU clock cycles + Memory stall cycles) x Clock cycle time = (IC x 1.0 + IC x 0.75) x Clock cycle time = 1.75 x IC x Clock cycle time
Speed up = CPU execution time misses / CPU execution time all hits = (1.75 x IC x Clock cycle time) / (1.0 x IC x Clock cycle time) = 1.75
The computer with no cache misses is 75% faster.
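A quick Python check of Example 1 (1 instruction fetch plus 0.5 data accesses per instruction, as above):

```python
accesses_per_instr = 1.5   # 1 instruction fetch + 0.5 loads/stores per instruction
miss_rate = 0.01
miss_penalty = 50          # clock cycles

stall_cycles_per_instr = accesses_per_instr * miss_rate * miss_penalty  # 0.75
cpi_all_hits = 1.0
cpi_with_misses = cpi_all_hits + stall_cycles_per_instr                 # 1.75
print(cpi_with_misses / cpi_all_hits)                                   # speedup = 1.75
```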
Misses per instruction
Misses/Instruction = Miss rate x (Memory accesses/Instruction)
Example 2 - Redo Example 1 using misses per instruction instead of miss rate: assume 30 misses per 1000 instructions and a miss penalty of 25 clock cycles.
What is memory stall time in terms of instruction count?
Memory stall cycles = Number of misses x Miss penalty = IC x (Misses/Instruction) x Miss penalty = (IC/1000) x (Misses per 1000 instructions) x Miss penalty = (IC/1000) x 30 x 25 = IC x 0.75
Four memory hierarchy questions
Q1 Where can a block be placed in upper level? (block placement)
Q2 How is a block found if it is in the upper level? (block identification)
Q3 Which block should be replaced on a miss? (block replacement)
Q4 What happens on a write? (write strategy)
Cache - Q1 Where can a block be placed in upper level? (block placement)
Fully associative - block can be placed anywhere in cache
Direct mapped - each block can go in exactly one place in the cache. Mapping is usually (Block address) mod (Number of blocks in cache)
Set associative - a block can be placed anywhere within one set of blocks in the cache. Mapping is usually (Block address) mod (Number of sets in cache). With n blocks per set, the cache is n-way set associative
Cache - Q2 How is a block found if it is in the upper level? (block identification)
- valid bit - indicates whether the entry contains a valid block
- block frame address - divided into the tag field and the index field
- index field - selects the set
- tag field - compared against the processor address to detect a hit; the index bits need not be stored in the tag (they are redundant, since they selected the set), which saves hardware
- block offset - selects the desired data within the block; it is not compared
For speed, all tags in the selected set are searched in parallel.
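A sketch of how the fields are carved out of an address; the 64-byte block and 256-set geometry is a made-up example:

```python
BLOCK_OFFSET_BITS = 6    # 64-byte blocks (example value)
INDEX_BITS = 8           # 256 sets (example value)

def split_address(addr):
    # Split an address into block offset, index (set selector) and tag.
    offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    index = (addr >> BLOCK_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# The index is equivalent to (block address) mod (number of sets);
# only the tag is stored and compared in the cache.
tag, index, offset = split_address(0x1234ABCD)
print(hex(tag), index, offset)
```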
Cache - Q3 Which block should be replaced on a miss? (block replacement)
Direct mapped - only one block is checked, so there is no choice to make
Associative and set associative - multiple blocks are candidates, so one must be chosen.
Three methods:
Random - candidate blocks are selected randomly; pseudorandom number generation keeps the behavior reproducible for hardware debugging
LRU - least recently used - replaces the block that has gone unused for the longest time
FIFO - first in, first out - approximates LRU by replacing the oldest block instead of the least recently used one
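A minimal sketch of LRU replacement for a single set, using Python's OrderedDict to track recency; tags only, no data:

```python
from collections import OrderedDict

class LRUSet:
    # One n-way set: keeps the most recently used tags, evicts the LRU one.
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()        # insertion order = recency order

    def access(self, tag):
        if tag in self.tags:             # hit: mark as most recently used
            self.tags.move_to_end(tag)
            return "hit"
        if len(self.tags) >= self.ways:  # miss in a full set: evict the LRU block
            self.tags.popitem(last=False)
        self.tags[tag] = True
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])  # ['miss', 'miss', 'hit', 'miss', 'miss']
```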
Cache - Q4 What happens on a write? (write strategy)
Reads are the most common cache operation in RISC processors
Reads are simpler to implement - the block can be read at the same time the tag is compared, and on a miss the data read is simply ignored and the correct block is fetched. This does not work for writes: the block must not be modified until the tag check confirms a hit, and writes also specify the size of the write, which makes them more complex.
Techniques for writes with Cache:
write through - information is written both to the cache and lower-level memory
write back - information is written only to the cache; the modified block is written to main memory only when it is replaced - tracked with a dirty bit
With write back, writes occur at the speed of the cache and use less memory bandwidth and energy - attractive for embedded systems, servers and multiprocessor systems
Write through is easier to implement - the cache is always clean, so a replaced block never needs to be written back to the lower level, which also simplifies data coherency. With multilevel caches, write through becomes more viable for the upper-level caches
Write stall - the processor waits for a write to complete (common with write through). Solution -> write buffer
Write misses resolution:
Write allocate - the block is allocated on a write miss, followed by the write hit actions described above. With this natural option, write misses act like read misses.
No-write allocate - write misses do not affect the cache; the block is modified only in lower-level memory
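A toy Python sketch contrasting the two write strategies; the block/dictionary model is a simplification and not how real hardware is organized:

```python
class WriteBackBlock:
    # Write back: writes go only to the cache; a dirty bit remembers that
    # lower-level memory must be updated when the block is evicted.
    def __init__(self, data):
        self.data = data
        self.dirty = False

    def write(self, value):
        self.data = value
        self.dirty = True

    def evict(self, memory, addr):
        if self.dirty:                 # write back only if the block was modified
            memory[addr] = self.data

def write_through(cache, memory, addr, value):
    # Write through: update the cache and lower-level memory together.
    cache[addr] = value
    memory[addr] = value

mem, cache = {0x40: 7}, {0x40: 7}
write_through(cache, mem, 0x40, 9)     # both copies updated immediately
```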
Average memory access time
Average memory access time = Hit time + Miss rate x Miss penalty
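A one-line Python helper for the formula; the hit time, miss rate and penalty below are illustrative values only:

```python
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(hit_time=1, miss_rate=0.02, miss_penalty=100))   # 3.0 cycles
```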
Relationship between CPI and fast clock
- The lower the CPI(execution), the higher the relative impact of a fixed number of cache miss clock cycles
- The cache miss penalty is measured in processor clock cycles per miss. Even if the memory hierarchies of two computers are identical, the processor with the higher clock rate spends more clock cycles per miss and therefore has a higher memory portion of CPI
This emphasizes the importance of caches for processors with high clock rates and low CPI
Cache optimization categories
Reducing the miss rate - larger block size, larger cache size and higher associativity
Reducing the miss penalty - multilevel caches and giving reads priority over writes
Reducing the time to hit in the cache - avoiding address translation when indexing the cache
Cache miss categories
Compulsory - the very first access to a block cannot be a hit, since the block has never been in the cache
Capacity - the cache cannot contain all the blocks a program needs, so blocks are discarded and later retrieved
Conflict - with set associative or direct-mapped placement, too many blocks can map to the same set (or block frame), causing misses even when the rest of the cache has room
Cache optimization 1: Larger block size to reduce miss rate
- takes advantage of spatial locality
- increases miss penalty
- may increase conflict misses
- the increase in miss penalty may outweigh the decrease in miss rate
Cache optimization 2: Larger Caches to reduce miss rate
- higher capacity reduces the misses
- longer hit time, higher cost and power consumption
Cache optimization 3: Higher associativity to reduce miss rate
Two rules of thumb:
- an 8-way set associative cache is, in practice, as effective at reducing misses as a fully associative cache
- 2:1 cache rule of thumb - a direct-mapped cache of size N has about the same miss rate as a two-way set associative cache of size N/2
- improves the miss rate, but increases hit time
Cache optimization 4: Multilevel Caches to reduce miss penalty
- keep the first-level cache small and fast enough to match the processor's clock cycle time
- make the second-level cache large enough to capture many accesses that would otherwise go to main memory
Performance analysis:
Average memory access time = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2)
Average memory stalls per instruction = Misses per instruction_L1 x Hit time_L2 + Misses per instruction_L2 x Miss penalty_L2 (a worked sketch follows this card)
Miss rates:
- Local miss rate - the number of misses in a cache divided by the total number of accesses to that cache (Miss rate_L1, Miss rate_L2)
- Global miss rate - the number of misses in the cache divided by the total number of memory accesses generated by the processor (Miss rate_L1 for L1, and Miss rate_L1 x Miss rate_L2 for L2)
Practice shows that L2 caches must be huge compared to L1 caches
- Multilevel inclusion - data in L1 cache are also in L2 cache
- Same vs. different block sizes - a smaller block size in L1 can give faster accesses but a higher L1 miss penalty; different block sizes also complicate multilevel inclusion, which can be avoided by keeping the L1 and L2 block sizes the same
Multilevel exclusion - used when the L2 cache is only slightly bigger than L1; a block resides in L1 or L2 but not both, and an L1 miss causes a swap of blocks between L1 and L2
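A minimal Python sketch of the two-level formulas above; all parameter values are assumed for illustration:

```python
# Assumed example parameters (not from the text).
hit_time_l1, miss_rate_l1 = 1, 0.05        # local L1 miss rate
hit_time_l2, miss_rate_l2 = 10, 0.40       # local L2 miss rate
miss_penalty_l2 = 200                      # cycles to main memory

amat = hit_time_l1 + miss_rate_l1 * (hit_time_l2 + miss_rate_l2 * miss_penalty_l2)
global_miss_rate_l2 = miss_rate_l1 * miss_rate_l2   # fraction of all accesses that reach memory
print(round(amat, 2), round(global_miss_rate_l2, 3))   # 5.5 cycles, 0.02
```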
Cache optimization 5: Giving priority to read misses over writes to reduce miss penalty
- reads are more performance-critical: the processor must wait for a read to finish, while writes can be buffered
- with write through, a write buffer is used, but it may still hold the updated value of a location needed by a read miss
- the simplest solution is to stall the read miss until the write buffer is empty
- a faster alternative is to check the write buffer contents on a read miss; if there is no conflict, the read can proceed before the buffered writes complete
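A sketch of the faster alternative, with the write buffer modeled as a simple list of (address, value) pairs; the structure and values are illustrative only:

```python
write_buffer = []      # pending (address, value) writes waiting to reach lower-level memory

def read_miss(addr, memory):
    # Check the write buffer first: the newest pending write to this address wins.
    for buf_addr, value in reversed(write_buffer):
        if buf_addr == addr:
            return value          # conflict: forward the buffered value
    return memory[addr]           # no conflict: the read proceeds ahead of the buffered writes

write_buffer.append((0x100, 42))
print(read_miss(0x100, {0x100: 0}))   # 42: the pending write is honored
```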
Cache optimization 6: Avoiding address translation during indexing of the cache to reduce hit time
- using virtual vs. physical addresses
- protection - page-level protection must still be enforced
- solution - copy the protection information from the TLB into the cache and check it on every access
- a process switch forces a flush of a virtually addressed cache - the same virtual addresses now map to different physical addresses
- solution - use a process identifier (PID) as part of the cache tag, so the cache need not be flushed
- synonyms or aliases - two different virtual addresses may map to the same physical address, so a virtual cache could hold two copies of the same data
- antialiasing hardware guarantees that every cache block has a unique physical address
- I/O typically uses physical addresses, which also complicates virtually addressed caches
- in practice, caches are usually physically addressed; virtually addressed caches are harder to implement correctly
- compromise - virtual indexing with physical tagging
- the cache is indexed with the page offset, which is identical in the virtual and physical address, so indexing can proceed in parallel with translation
- the tags are physical addresses, compared once the TLB delivers the translation
- drawback -> a direct-mapped cache can be no bigger than the page size
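A small sketch of the resulting size constraint, assuming 4 KiB pages and 64-byte blocks (made-up values): the index and block offset must fit within the page offset, so a direct-mapped, virtually indexed, physically tagged cache cannot exceed one page.

```python
page_size = 4096        # assumed 4 KiB pages
block_size = 64         # assumed 64-byte blocks
associativity = 1       # direct mapped

# Index + block-offset bits must fit in the page offset, so
# cache capacity <= page size x associativity.
max_cache_size = page_size * associativity
num_sets = max_cache_size // (block_size * associativity)
print(max_cache_size, num_sets)   # 4096 bytes, 64 sets
```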
Virtual memory - why?
- in the past, programmers had to divide programs into pieces (overlays) and ensure that a program never needed more than the computer's physical main memory -> too great a burden
- solution - virtual memory, which manages the two levels of memory automatically
Virtual memory - principles
- processor generates virtual addresses
- virtual addresses are mapped to physical addresses
- translating virtual addresses to physical addresses is called memory mapping or address translation
Cache vs. virtual memory
Cache - controlled by hardware
Virtual memory - controlled by the OS, has much higher miss penalties and can therefore afford better replacement algorithms
Cache - its size is independent of the processor's address space
Virtual memory - its size is determined by the processor's address space
Cache - not used as secondary storage
Virtual memory - the lower level (disk) also holds the file system and serves as secondary storage
Virtual memory - categories
Fixed-sized blocks - pages
Variable-sized blocks - segments
Hybrid - both segments and pages - segments consist of pages
Four questions for virtual memory
Q1 Where can a block be placed in main memory?
Q2 How is a block found if it is in main memory?
Q3 Which block should be replaced on a virtual memory miss?
Q4 What happens on a write?
Virtual memory - Q1 Where can a block be placed in main memory?
- virtual memory misses are serviced by slowly rotating magnetic disks
- misses are therefore extremely expensive
- the main goal is to achieve the lowest possible miss rate
- so placement is almost always fully associative - the OS may place a page anywhere in main memory
Virtual memory - Q2 How is a block found if it is in main memory?
- depending on the technique used:
- paging - the virtual page number indexes the page table to obtain the physical page (frame) address, and the page offset is concatenated to it to form the physical address
- some computers apply hashing to the virtual address to shrink this structure - an inverted page table, whose length equals the number of physical pages
- another speed-up is the use of a TLB (translation lookaside buffer)
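A minimal sketch of paged translation with a tiny TLB in front of the page table; the page size, mappings and structure are a made-up illustration, not a real design:

```python
PAGE_SIZE = 4096                     # assumed 4 KiB pages
page_table = {0: 5, 1: 9, 2: 3}      # virtual page number -> physical frame number (made up)
tlb = {}                             # small cache of recent translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                   # TLB hit: no page-table access needed
        frame = tlb[vpn]
    else:                            # TLB miss: consult the page table, then cache the entry
        frame = page_table[vpn]
        tlb[vpn] = frame
    return frame * PAGE_SIZE + offset    # concatenate frame number and page offset

print(hex(translate(0x1234)))        # virtual page 1 -> frame 9 -> 0x9234
```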
Virtual memory - Q3 Which block should be replaced on a virtual memory miss?
- the most common approximation is LRU
- a use (reference) bit is set whenever the page is accessed
- the OS periodically clears these bits and later checks which ones were set again, approximating which pages were least recently used
Virtual memory - Q4 What happens on a write?
- always write back (writing through to a rotating disk would be far too slow)
- a dirty bit is used so a page is written back to disk only when necessary
Techniques for fast address translation
- page tables are so large that they are stored in main memory and are sometimes paged themselves (two-level paging) -> a data access logically requires at least two memory accesses: one for the page table entry and one for the data
- Address translation cache - TLB
Translation lookaside buffer
- similar to cache entry
- contains tags
- implements protection mechanisms
- the OS can modify the underlying page tables (protection, mapping, etc.)
- stale TLB entries must then be removed (flushed)
- the OS uses the dirty bits to decide whether a replaced page must be written back
Typical entry (virtual address of n bits, page offset of k bits, physical address of m bits):
- virtual page number - the upper n - k bits of the virtual address, looked up in the TLB
- TLB entry fields: V (valid bit), R/W (read/write permissions), U/S (user/supervisor), D (dirty bit), A (accessed bit, used by LRU), Tag (n - k bits, the virtual page number), physical page number (m - k bits)
- output: the physical page number concatenated with the k-bit page offset gives the m-bit physical address
Page size
- bigger pages save resources
- bigger pages allow larger caches with fast hit times
- bigger pages are more effective during transfers
- bigger pages allow for more efficient mapping of virtual memory and reduce TLB misses
- bigger pages cause higher internal fragmentation (unused space inside the page)
- modern microprocessors support multiple page sizes (sometimes referred to as page segmentation)
- for small processes, a large page size wastes storage, wastes I/O bandwidth and lengthens process start-up time
Processes
- multiprogramming led to the concept of a process
- process - an abstraction of a running program
- context switch / process switch - switching the processor from one process to another
- the process state must be saved on a switch, and processes must not interfere with one another
- multiple processes keep data in the same physical memory
- the data of each process must be protected so that other processes cannot access it
- sharing of data and code between processes
Protection model
- built on the principle of protection rings
- each process runs at a privilege level (the x86 model)
- a process cannot raise its own privilege level
- x86 has four levels - 0 is the kernel, 3 is user programs
- to run at a more privileged level, user code must enter the kernel through controlled calls (compare sudo in UNIX for running privileged programs)
Descriptor table - fields
Present field - valid bit - indicates if the page is a valid translation
Base field - page frame address - contains physical addresses
Access bit - similar to reference bit used by replacing algorithms
Attributes field - valid operations, protection levels…
Limit field - upper bounds of the segment
An entry in this table is called a segment descriptor
Descriptor tables - protection model
- global address space - shared by all processes
- local address space - address space of each process
- protection field - 2 bits
- violation - a process tries to access a segment that is more privileged than the process itself
- call gates - controlled entry points for calling more privileged code
- a call gate contains the full address of the routine being called, so the caller cannot supply or change it
- this prevents a process from jumping to an arbitrary point in a more privileged address space