Appendix B Flashcards

1
Q

Cache performance

A

CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time

CPU clock cycles include the time to handle cache hits; the processor is assumed to be stalled during cache misses

2
Q

Miss penalty

A

Cost per miss (the number of clock cycles the processor is stalled per miss)

3
Q

Memory stall cycles

A

Memory stall cycles = Number of misses x Miss penalty = IC x (Misses/Instruction) x Miss penalty = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty

IC - instruction count
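
A minimal Python sketch of this relation; the function name and the sample numbers (1.5 memory accesses per instruction, 1% miss rate, 50-cycle penalty, taken from Example 1 below) are only illustrative:

def memory_stall_cycles(ic, accesses_per_instr, miss_rate, miss_penalty):
    # Memory stall cycles = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty
    return ic * accesses_per_instr * miss_rate * miss_penalty

print(memory_stall_cycles(1_000_000, 1.5, 0.01, 50))  # 750000.0 stall cycles for 1M instructions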

4
Q

Memory stall cycles for reads and writes

A

Memory stall cycles = IC x Reads per instruction x Read miss rate x Read miss penalty + IC x Writes per instruction x Write miss rate x Write miss penalty

5
Q

Example 1 - Assume we have a computer where the cycles per instruction (CPI) is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 50 clock cycles and the miss rate is 1%, how much faster would the computer be if all instructions were cache hits?

A

CPU execution time all hits = (CPU clock cycles + Memory stall cycles) x Clock cycle time = (IC x CPI + 0) x Clock cycle time = IC x 1.0 x Clock cycle time

Memory stall cycles = IC x (1 + 0.5) x 0.01 x 50 = 0.75 x IC

CPU execution time with misses = (CPU clock cycles + Memory stall cycles) x Clock cycle time = (IC x 1.0 + IC x 0.75) x Clock cycle time = 1.75 x IC x Clock cycle time

Speedup = CPU execution time with misses / CPU execution time all hits = (1.75 x IC x Clock cycle time) / (1.0 x IC x Clock cycle time) = 1.75

The CPU with no cache misses is 75% faster.
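
The same arithmetic checked with a short Python sketch (IC and clock cycle time cancel in the ratio, so they are left out):

cpi = 1.0
stalls_per_instr = (1 + 0.5) * 0.01 * 50   # 0.75 memory stall cycles per instruction
speedup = (cpi + stalls_per_instr) / cpi   # IC and clock cycle time cancel out
print(speedup)                             # 1.75 -> the all-hit CPU is 75% faster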

6
Q

Misses per instruction

A

Misses/Instruction = Miss rate x Memory accesses / Instruction

7
Q

Example 2 - the same computer as in Example 1 (CPI is 1.0 when all memory accesses hit the cache; loads and stores total 50% of the instructions), but cache performance is now given as 30 misses per 1000 instructions and the miss penalty is 25 clock cycles.

What is the memory stall time in terms of instruction count?

A

Memory stall cycles = Number of misses x Miss penalty = IC x (Misses/Instruction) x Miss penalty = IC/1000 x (Misses per 1000 instructions) x Miss penalty = IC/1000 x 30 x 25 = IC x 0.75
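
A one-line Python check of the arithmetic:

misses_per_1000_instr = 30
miss_penalty = 25
print(misses_per_1000_instr / 1000 * miss_penalty)  # 0.75 stall cycles per instruction, i.e. IC x 0.75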

8
Q

Four memory hierarchy questions

A

Q1 Where can a block be placed in upper level? (block placement)
Q2 How is a block found if it is in the upper level? (block identification)
Q3 Which block should be replaced on a miss? (block replacement)
Q4 What happens on a write? (write strategy)

9
Q

Cache - Q1 Where can a block be placed in upper level? (block placement)

A

Fully associative - the block can be placed anywhere in the cache
Direct mapped - the block has exactly one place in the cache; the mapping is usually (Block address) mod (Number of blocks in cache)
Set associative - the block can be placed anywhere within one set; the set is usually chosen by (Block address) mod (Number of sets in cache). With n blocks per set, the cache is n-way set associative
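
A small Python sketch of the two mapping rules; the cache geometry and block address are made-up examples:

def direct_mapped_block(block_address, num_blocks):
    # Direct mapped: exactly one possible cache block
    return block_address % num_blocks

def set_associative_set(block_address, num_sets):
    # Set associative: any way within one set
    return block_address % num_sets

# Example: an 8-block cache, either direct mapped or 2-way set associative (4 sets)
print(direct_mapped_block(12, 8))   # block 12 -> cache block 4
print(set_associative_set(12, 4))   # block 12 -> set 0 (either of the 2 ways in that set)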

10
Q

Cache - Q2 How is a block found if it is in the upper level? (block identification)

A
  • valid bit - indicates whether the entry contains a valid address
  • block frame address - divided into a tag field and an index field
    • index field - selects the set
    • tag field - compared against the address on every access to check for a hit; the index itself need not be compared, since it was used to select the set (a block in set 0 must have index 0), which saves hardware
  • block offset - selects the desired data within the block; it is not compared

For speed, all tags in the selected set are searched in parallel.

11
Q

Cache - Q3 Which block should be replaced on a miss? (block replacement)

A

Direct mapped - only one block is checked
Fully associative and set associative - multiple blocks must be considered

Three methods:
Random - candidate blocks are selected randomly; pseudorandom numbers can be used to make the behavior reproducible for hardware debugging
LRU - least recently used - replaces the block that has gone unused for the longest time
FIFO - first in, first out - approximates LRU by replacing the oldest block instead of the least recently used one

12
Q

Cache - Q4 What happens on a write? (write strategy)

A

Reads are the most common cache operation in RISC processors
Reads are easier to make fast - the block can be read at the same time the tag is checked, and on a miss the read data is simply ignored. Writes cannot do this: the block must not be modified until the tag check confirms a hit, and a write must also specify its size, which makes writes more complex.

Techniques for writes with a cache:
write through - information is written both to the cache and to the lower-level memory
write back - information is written only to the cache and copied to main memory when the block is replaced - uses a dirty bit

Write-back writes happen at the speed of the cache and use less bandwidth and less energy - attractive for embedded systems, servers, and multiprocessor systems
Write through is easier to implement - memory below the cache is always clean, so read misses never require writes to the lower level, which simplifies data coherency. Multilevel caches -> write through more viable for the upper-level caches

Write stalls - with write through the processor may have to wait for a write to complete. Solution -> write buffer

Write-miss policies:
Write allocate - the block is allocated on a write miss, followed by the normal write-hit actions. In this natural option, write misses act like read misses.
No-write allocate - write misses do not affect the cache; the block is modified only in the lower-level memory

13
Q

Average memory access time

A

Average memory access time = Hit time + Miss rate x Miss penalty
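
A tiny Python illustration; the hit time, miss rate, and miss penalty below are made-up values, not from the text:

hit_time = 1          # clock cycles (illustrative)
miss_rate = 0.02
miss_penalty = 100    # clock cycles (illustrative)
print(hit_time + miss_rate * miss_penalty)   # 3.0 cycles average memory access time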

14
Q

Relationship between CPI and fast clock

A
  1. The lower the CPI(execution), the higher the relative impact of a fixed number of cache-miss clock cycles
  2. The cache miss penalty is measured in processor clock cycles per miss. Even if the memory hierarchies of two computers are identical, the processor with the higher clock rate has more clock cycles per miss and therefore a higher memory portion of CPI

This emphasizes the importance of caches for processors with high clock rates and low CPI

15
Q

Cache optimization categories

A

Reducing the miss rate - larger block size, larger cache size and higher associativity

Reducing the miss penalty - multilevel caches and giving reads priority over writes

Reducing the time to hit in the cache - avoiding address translation when indexing the cache

16
Q

Cache miss categories

A

Compulsory - the very first access to a block cannot be a hit, because the block has never been in the cache

Capacity - the cache cannot contain all the blocks a program needs, so blocks are discarded and later retrieved

Conflict - with set-associative or direct-mapped placement, multiple blocks map to the same set (or block frame), so a block may be evicted and re-fetched even when the cache is not full

17
Q

Cache optimization 1: Larger block size to reduce miss rate

A
  • takes advantage of spatial locality
  • increases the miss penalty
  • may increase conflict misses (fewer blocks fit in the cache)
  • the increase in miss penalty may outweigh the decrease in miss rate
18
Q

Cache optimization 2: Larger Caches to reduce miss rate

A
  • higher capacity reduces capacity misses
  • drawbacks: longer hit time, higher cost and power consumption

19
Q

Cache optimization 3: Higher associativity to reduce miss rate

A

Two rules of thumb:

  1. In practice, an 8-way set associative cache is about as effective at reducing misses as a fully associative cache
  2. 2:1 cache rule of thumb - a direct-mapped cache of size N has about the same miss rate as a two-way set associative cache of size N/2
  • improves the miss rate, but increases the hit time
20
Q

Cache optimization 4: Multilevel Caches to reduce miss penalty

A
  • keep the first-level cache small enough to match the processor's fast clock cycle
  • make the second-level cache large enough to capture most accesses that would otherwise go to main memory, reducing the effective miss penalty

Performance analysis:
Average memory access time = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2)

Average memory stalls per instruction = Misses per instruction_L1 x Hit time_L2 + Misses per instruction_L2 x Miss penalty_L2
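
A short Python sketch of the two-level formula above; all the rates and latencies are invented for illustration:

hit_time_l1, hit_time_l2 = 1, 10          # clock cycles (illustrative)
miss_rate_l1, miss_rate_l2 = 0.05, 0.2    # local miss rates (illustrative)
miss_penalty_l2 = 100                     # clock cycles to main memory (illustrative)
amat = hit_time_l1 + miss_rate_l1 * (hit_time_l2 + miss_rate_l2 * miss_penalty_l2)
print(amat)                               # 1 + 0.05 * (10 + 20) = 2.5 cycles
print(miss_rate_l1 * miss_rate_l2)        # 0.01 -> the global miss rate of L2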

Miss rates:

  • Local miss rate - the number of misses in a cache divided by the total number of accesses to that cache (Miss rate_L1, Miss rate_L2)
  • Global miss rate - the number of misses in the cache divided by the total number of memory accesses generated by the processor (Miss rate_L1 for L1 and Miss rate_L1 x Miss rate_L2 for L2)

In practice, L2 caches must be much larger than L1 caches

  • Multilevel inclusion - data in the L1 cache is also present in the L2 cache
  • Same vs. different block sizes - a smaller block size in L1 can lower the L1 miss penalty, but it complicates maintaining inclusion; the problem is avoided by keeping the L1 and L2 block sizes the same

Multilevel exclusion - used when the L2 cache is only slightly bigger than L1: data in L1 is never duplicated in L2, and an L1 miss that hits in L2 swaps the blocks between the two caches

21
Q

Cache optimization 5: Giving priority to read misses over writes to reduce miss penalty

A
  • serving the read miss before pending writes complete reduces the miss penalty
  • with write through, a write buffer holds pending writes, so a read miss may need data that is still sitting in the buffer
  • the simplest solution is to stall the read miss until the write buffer is empty
  • a faster solution is to check the write buffer contents on a read miss and let the read proceed if there is no conflict
22
Q

Cache optimization 6: Avoiding address translation during indexing of the cache to reduce hit time

A
  • the choice is between using virtual and physical addresses for the cache
  • protection - page-level protection must still be obeyed
    • solution - copy the protection information from the TLB into the cache and check it on every access
  • switching processes forces a virtual cache to be flushed - the same virtual addresses now map to different physical addresses
    • solution - use the PID as part of the tag
  • duplicate copies -> different virtual addresses may refer to the same physical address
    - synonyms and aliases cause this
    - antialiasing - hardware guarantees that each block in the cache has a unique physical address
  • I/O addressing
    • I/O usually uses physical addresses, which makes virtual caches harder to work with
    • solution - virtually indexed, physically tagged caches
    • the index is taken from the page offset (identical in the physical and virtual address)
    • the tags use physical addresses (translated by the TLB in parallel with the index lookup)
    • drawback -> a direct-mapped cache can then be no larger than the page size
23
Q

Virtual memory - why?

A
  • in the past, programmers had to divide programs into smaller pieces (overlays) and ensure the program never needed more memory than the computer's physical main memory -> too much of a burden
  • solution - virtual memory, which is managed automatically
24
Q

Virtual memory - principles

A
  • processor generates virtual addresses
  • virtual addresses are mapped to physical addresses
  • translating virtual addresses to physical addresses is called memory mapping or address translation
25
Q

Cache vs. virtual memory

A

Cache - controlled by hardware
Virtual memory - controlled by the OS; miss penalties are much higher, so better replacement algorithms can be used

Cache - its size is independent of the processor's address space
Virtual memory - its size is determined by the processor's address space

Cache - not used as secondary storage
Virtual memory - backed by secondary storage, which also holds the file system

26
Q

Virtual memory - categories

A

Fixed-sized blocks - pages
Variable-sized blocks - segments
Hybrid - both segments and pages - segments consist of pages

27
Q

Four questions for virtual memory

A

Q1 Where can a block be placed in main memory?
Q2 How is a block found if it is in main memory?
Q3 Which block should be replaced on a virtual memory miss?
Q4 What happens on a write?

28
Q

Virtual memory - Q1 Where can a block be placed in main memory?

A
  • virtual memory misses are serviced by slowly rotating magnetic disks
  • misses (page faults) are extremely expensive
  • the main goal is therefore the lowest possible miss rate
  • fully associative placement is used almost everywhere - a block can go anywhere in main memory
29
Q

Virtual memory - Q2 How is a block found if it is in main memory?

A
  • depending on the technique used:
    • paging - the virtual page number indexes a page table that holds the physical page address; the page offset is concatenated to it to form the physical address
      - some computers apply hashing to speed this up - an inverted page table, whose length equals the number of physical pages
  • a further speed-up is the TLB (translation lookaside buffer)
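
A minimal Python sketch of the paging translation described above; the page size and page-table contents are invented for illustration:

PAGE_SIZE = 4096                       # assumed 4 KiB pages -> 12-bit page offset
page_table = {0: 7, 1: 3, 2: 12}       # invented mapping: virtual page number -> physical page number

def translate(virtual_address):
    vpn = virtual_address // PAGE_SIZE           # virtual page number indexes the page table
    offset = virtual_address % PAGE_SIZE         # page offset is copied through unchanged
    physical_page = page_table[vpn]              # a missing entry would be a page fault
    return physical_page * PAGE_SIZE + offset    # concatenate physical page number and offset

print(hex(translate(0x1ABC)))                    # VPN 1 -> physical page 3 -> 0x3abc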
30
Q

Virtual memory - Q3 Which block should be replaced on a virtual memory miss?

A
  • the most common technique is (approximate) LRU
    - a use (reference) bit is set whenever the page is accessed
    - the OS periodically clears these bits so it can tell which pages were used since the last clearing
31
Q

Virtual memory - Q4 What happens on a write?

A
  • always a write-back strategy (rotating disks are far too slow for write through)
  • dirty bits are used so a page is written back to disk only when necessary
32
Q

Techniques for fast address translation

A
  • page tables are so big that they are stored in main memory and are sometimes paged themselves (two-level paging) -> this logically requires two memory accesses: 1. fetching the address translation from the page table, 2. fetching the data
  • Address translation cache - TLB
33
Q

Translation lookaside buffer

A
  • similar to cache entry
  • contains tags
  • implements protection mechanisms
  • OS can modify these tables (protection etc.)
  • OS handles old entries (must be removed!)
  • OS handles dirty bits
Typical entry (n-bit virtual address, m-bit physical address, k-bit page offset):
Virtual page number - the upper n - k bits of the virtual address; looked up in the TLB
Page offset - the lower k bits; identical in the virtual and physical address
TLB entry fields:
  V - valid bit
  R/W - read/write permissions
  U/S - user/supervisor bit
  D - dirty bit
  A - accessed bit (used by LRU)
  Tag - n - k bits (the virtual page number)
  Physical page address - m - k bits

Output: physical page address concatenated with the page offset -> m-bit physical address
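
A small Python sketch of this address split; the 12-bit page offset and the example addresses are assumptions for illustration only:

K = 12                                  # assumed page offset width (4 KiB pages)

def split_virtual_address(va):
    vpn = va >> K                       # upper n - k bits: virtual page number (TLB tag)
    offset = va & ((1 << K) - 1)        # lower k bits: page offset, not translated
    return vpn, offset

def physical_address(physical_page, offset):
    # concatenate the (m - k)-bit physical page address with the k-bit page offset
    return (physical_page << K) | offset

vpn, off = split_virtual_address(0xDEADBEEF)
print(hex(vpn), hex(off))                    # 0xdeadb 0xeef
print(hex(physical_address(0x1A2B, off)))    # 0x1a2beef (pretend the TLB returned page 0x1A2B)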

34
Q

Page size

A
  • bigger pages mean a smaller page table, saving memory
  • bigger pages allow larger caches with fast hit times
  • bigger pages are more efficient to transfer
  • bigger pages let each TLB entry map more memory, reducing TLB misses
  • bigger pages cause more internal fragmentation (unused space inside the page)
  • modern microprocessors use variable page sizes (sometimes referred to as page segmentation)
  • a large page size can waste storage and I/O bandwidth and lengthen start-up time for small processes
35
Q

Processes

A
  • multiprogramming led to the concept of a process
  • process - the abstraction of a running program
  • context switch (process switch) - the processor is transferred from one process to another
  • the process state must be saved during a process switch, and processes must not interfere with one another
  • multiple processes keep their data in one physical memory
  • data in memory therefore needs protection, so that processes cannot access each other's data
  • at the same time, processes may share data and code
36
Q

Protection model

A
  • built on the principle of protection rings
  • each process has its privilege level (x86 model)
  • processes cannot modify their own protection level
  • x86 has four levels - 0 is the kernel (most privileged), 3 is user programs
  • to reach a more privileged level, a user program must call kernel functions (e.g. sudo in UNIX)
37
Q

Descriptor table - fields

A

Present field - valid bit - indicates whether the entry holds a valid translation
Base field - page frame address - contains the physical address
Access bit - like a reference bit, useful for replacement algorithms
Attributes field - valid operations, protection levels…
Limit field - the upper bound (size) of the segment

An entry in this table is called a segment descriptor

38
Q

Descriptor tables - protection model

A
  • global address space - shared by all processes
  • local address space - private to each process
  • protection field - 2 bits
  • violation - a process tries to use a segment that requires more privilege than the process has
  • call gates - prevent processes from entering each other's (or the kernel's) code at arbitrary points
    - a call gate contains the full address of the routine to be called
    - this prevents a process from jumping to arbitrary places across the address space