Memory hierarchy Flashcards

1
Q

What is the objective of the memory hierarchy?

A

Reduce the impact of the memory wall.

Illusion of having a fast and large memory.

2
Q

What are the memory tradeoffs?

A

Speed, size and cost

CPU registers
Cache
RAM
Hard disk
Off-line storage (drives)

The faster, smaller levels are closest to the CPU; these are also more costly.

3
Q

Why is the memory hierarchy efficient?

A

Locality: spatial and temporal

Spatial: addresses near each other are likely to be referenced close together in time, e.g. loops over arrays.

Temporal: an address accessed recently is likely to be accessed again, e.g. loops.

Because of locality, keep recently used data near the CPU.
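
For illustration (my own sketch, not from the lecture), both kinds of locality show up in a simple C array sum; the function name is made up:

    /* Sum an int array: a[i] and a[i+1] are adjacent in memory
       (spatial locality), while sum, i and the loop instructions
       are reused every iteration (temporal locality). */
    int sum_array(const int *a, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }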

4
Q

What are some causes of locality?

A

Loops
functions
stack
Data structures (arrays, matrices, structs)

5
Q

What 4 things need to be taken into account when designing a memory hierarchy?

A

Block placement: Where can a block be placed?

Block identification: How is a block found?

Block replacement: Which block should be replaced on a miss?

Write strategy: What happens on a write?

6
Q

What are the three ways a block can be placed in a cache?

A

Direct mapped
Set-associative
Fully associative

7
Q

What is a direct mapped cache?

A

A block can only be placed in one place in the cache.

Only one tag needs to be compared with the address tag to see if the block is in the cache.

8
Q

What is an n-way set-associative cache?

A

A trade-off between direct mapped and fully associative.

Group entries in sets of n blocks.
An address can be placed in n possible blocks in the cache, but only in one possible set.

9
Q

What is a fully associative cache?

A

A block can be placed anywhere in the cache.
The address tag is compared with the tags of all cache entries to find the correct cache block.

10
Q

What function (optimization) do the index bits of an address have?

A

They limit the number of places we have to search, and let us search all of those places in parallel.

11
Q

What is the structure of an address?

A

Block address + Block offset
Tag + Index + Block offset

12
Q

What does the address of a fully associative cache contain?

A

Tag + offset

Does not need an index to limit the number of blocks to search, as the block can be stored in any of the cache entries.

13
Q

What does the index do in a direct-mapped and in a set-associative cache?

A

Direct mapped: tells which block the address is stored in.

Set associative: tells which set the address is stored in.

14
Q

What do the offset bits do in caches?

A

Used to index into the data and select the correct bytes from the cache block.

15
Q

How are sets organized in hardware?

A

Each way is kept in its own table. A way entry contains a valid bit and a tag field.

The index bits point to a row in each way. This allows all ways within a set to be accessed in parallel.

The rows pointed to by the index in the way tables together form the complete set.

16
Q

What are the steps of identification?

A
  1. Split address into fields (offset, index, tag)
  2. Read tag(s) from the set indicated by the index
  3. Compare the tag from the address with the read tag(s)
  4. If hit, then select the correct data and send it to the CPU.
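
A minimal sketch (mine, not from the deck) of step 1 in C; the block size (64 bytes) and number of sets (128) are illustrative assumptions. With these sizes, address 0x12345 gives offset 0x05, index 0x0D and tag 0x9.

    #include <stdint.h>

    #define BLOCK_OFFSET_BITS 6   /* 64-byte blocks (assumed for illustration) */
    #define INDEX_BITS        7   /* 128 sets        (assumed for illustration) */

    /* Split an address into block offset, set index and tag. */
    static void split_address(uint64_t addr,
                              uint64_t *offset, uint64_t *index, uint64_t *tag)
    {
        *offset = addr & ((1u << BLOCK_OFFSET_BITS) - 1);
        *index  = (addr >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        *tag    = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS);
    }
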
17
Q

What are 3 examples of replacement policies?

A

Random

LRU: Least Recently Used - keep an ordered list of the ways in the cache; the way least recently used is evicted.

FIFO: First in first out
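
As an illustration (my own sketch, not from the deck), LRU for one set can be kept with simple age counters; the 4-way associativity is an assumption:

    #define WAYS 4   /* assumed associativity, for illustration */

    /* age[w] = number of accesses since way w was last used;
       on an access the used way is reset and the others age. */
    static unsigned age[WAYS];

    static void lru_touch(int used_way)
    {
        for (int w = 0; w < WAYS; w++)
            age[w]++;
        age[used_way] = 0;
    }

    /* Victim = the way with the largest age (least recently used). */
    static int lru_victim(void)
    {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (age[w] > age[victim])
                victim = w;
        return victim;
    }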

18
Q

What are 4 examples of write strategies?

A

How to handle writes:
Write-through
Write-back

If we miss on a write, how should the access be handled:
Write allocate
No-write allocate

19
Q

What is write through?

A

All data written to the cache is also written to the lower levels of the cache hierarchy.

Cache is always clean.

Easy to implement.

20
Q

What is write back?

A

Writes only go to the cache level that is being accessed.

When a dirty block is evicted, it is written back to the lower levels of the cache.

The cache can be dirty - this complicates cache coherency.

Saves traffic/bandwidth and energy.

21
Q

What is cache coherency?

A

When a system has multiple caches, they need to be kept consistent. This is done by sending messages between the caches, which itself increases traffic and energy.

22
Q

What is write allocate?

A

Store the block in the cache on a write miss.
Most common with write-back - if we need to evict a line, we just write over it.

23
Q

What is no-write allocate?

A

Do not store the block in this cache on a write miss.
Instead, the write goes to the next level in the memory hierarchy and is stored there.

24
Q

What steps are required when accessing a cache?

A
  1. Read tags to see if we have a hit

  2a. On hit, read the data
  2b. On miss, access the next level in the memory hierarchy

  3. Provide the requested data
25
Q

When accessing a cache, how can load performance be increased?

A

Read the tags and the data in parallel, as is done in the L1 cache. On a hit, the data can be sent back immediately. This is fast but power hungry.

26
Q

How can energy be saved when accessing the cache?

A

Cache levels below L1 do the tag comparison and the data read sequentially.

L2 is not as performance critical, so the data read does not need to happen in parallel. Because the L2 cache is bigger, doing the read in parallel would also be more expensive.

27
Q

What is the parallel access time?

A

Hit time + (Miss rate * miss penalty)
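
For example (numbers chosen for illustration, not from the deck): with a hit time of 1 cycle, a miss rate of 5 % and a miss penalty of 100 cycles, the access time is 1 + 0.05 * 100 = 6 cycles.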

28
Q

What can affect cache performance?

A

Miss rate
Cache design

29
Q

How can cache design affect performance?

A

Extra logic in the cache may increase the clock cycle time.

Trade-off: additional cache logic vs. a faster clock frequency.

30
Q

What are the three types of cache misses?

A

Compulsory: first access to a block; a miss that would happen even in an infinitely large cache.

Capacity: would not happen in an infinitely large cache. Caused by blocks being evicted for lack of space and later being read again. These misses also happen in a fully associative cache.

Conflict: caused by the cache not being fully associative, i.e. by restrictions on where a block can be placed. These do not occur in a fully associative cache.

31
Q

Name 9 cache optimizations

A

L1 configuration
Way prediction
Pipelining
Non-blocking
Multi-banked
Critical word first
Merging write buffer
Compiler optimizations
Prefetching

32
Q

What is L1 configuration?

A

L1 configuration deals with the design of the L1 caches and the trade-offs involved.

L1 requirements:
- Fast, running at core speed since the CPU accesses it directly
- High hit rate
- Energy efficient

These are conflicting requirements, but can for example be achieved using set-associative caches.

33
Q

What is way prediction?

A

Increased associativity increases hit time and power consumption.
Power increases because of the parallel SRAM accesses.

Way prediction aims to reduce this added overhead, by only accessing the predicted ways.

34
Q

What is pipelining?

A

Simplifies timing of memory operations.

With increased pipeline depth come a longer load-use hazard and a larger penalty for mispredicted branches.

35
Q

What is non-blocking?

A

Blocking cache: on a miss, the CPU stalls while the lower-level caches are accessed. When the data returns, the CPU continues.

Non-blocking/lockup-free: the CPU continues executing independent instructions as long as the requested data is not used. It can take multiple misses and keep running until there are no more independent instructions. Multiple misses can be handled in parallel.

36
Q

What is multi-banked?

A

Instead of having one monolithic L1 cache, split the cache into smaller banks. Each of these banks has its own MSHRs.

This way we have multiple ports to the L1 cache, so multiple requests can go to the cache in parallel, as long as the addresses do not map to the same bank.

The smaller size of each bank also reduces the power needed to access it.

We also reduce the time needed to search through the MSHRs, as each L1 bank now has a smaller set of MSHRs.
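
A small sketch (mine, not from the deck) of how a bank could be selected from the block address; the 8 banks and 64-byte blocks are assumptions:

    #include <stdint.h>

    #define BLOCK_OFFSET_BITS 6   /* 64-byte blocks (assumed) */
    #define NUM_BANKS         8   /* assumed bank count        */

    /* Interleave consecutive cache blocks across the banks, so
       accesses to different blocks can often proceed in parallel. */
    static unsigned bank_of(uint64_t addr)
    {
        return (unsigned)((addr >> BLOCK_OFFSET_BITS) % NUM_BANKS);
    }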

37
Q

What is early restart and critical word first?

A

Early restart:
Instead of reading the complete cache line into the cache and only then notifying the CPU that the data is ready, forward data to the CPU as soon as it arrives in the cache.

Critical word first:
When reading a cache line, start at the word requested by the address and read to the end of the cache block, then wrap back to the beginning of the block and read until all data before the critical word has been fetched.
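
A small sketch (mine, not from the deck) of the read order produced by critical word first, assuming a block of 8 words: with critical word 5 the order becomes 5, 6, 7, 0, 1, 2, 3, 4.

    #define WORDS_PER_BLOCK 8   /* assumed block size in words */

    /* Fill the block starting at the requested (critical) word,
       then wrap around to the words before it. */
    static void fill_block_order(unsigned critical_word,
                                 unsigned order[WORDS_PER_BLOCK])
    {
        for (unsigned i = 0; i < WORDS_PER_BLOCK; i++)
            order[i] = (critical_word + i) % WORDS_PER_BLOCK;
    }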

38
Q

What is merging write buffer?

A

When writing back from the L1 to the L2 cache, instead of waiting for the write to complete in the L2 cache, store it in a write buffer.

In a write back cache, this buffer contains the evicted dirty line.

In write through caches, this buffer contains all writes.

A merging write buffer uses the buffer entries more efficiently by taking advantage of spatial locality: writes to the same block are merged into a single entry.

(Figure in Memory Hierarchy - Cache optimizations video)

39
Q

What are compiler optimizations?

A

Loop interchange:
- Order the loops such that data is read in the order it is stored in memory.
- If data is stored by row, loop through the rows.
- If data is stored by column, loop through the columns.

Loop fusion: combine loops

Blocking
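
For illustration (my own sketch, not from the lecture), loop interchange on a row-major C matrix; N and the function names are made up:

    #define N 1024   /* assumed matrix size */

    /* Column-major traversal: consecutive accesses are N ints
       apart, so almost every access touches a new cache block. */
    long sum_column_major(int m[N][N])
    {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];
        return sum;
    }

    /* After loop interchange: row-major traversal matches how C
       stores the matrix, so consecutive accesses share cache blocks. */
    long sum_row_major(int m[N][N])
    {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];
        return sum;
    }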

40
Q

What is prefetching?

A

Fetch data into the cache before it is requested, based on the expected access pattern (e.g. the next sequential blocks). Can be done in hardware or with compiler-inserted prefetch instructions.

41
Q

What is Virtual indexed physically tagged (VIPT)?

A

Use virtual address for indexing into the cache (tag array and data array of the cache).
By doing this, we don’t need to translate the address to get the index, and can access the TLB in parallel.

Once we have the stored tag and the physical page number, we can compare them to see if we have a hit.

This is done in parallel for all the ways.

42
Q

How are addresses constructed in VIPT caches?

A

The requested address is used in parallel both for address translation and for the tag lookup.

Translation:
- Requested address is split into VPN (virtual page number) and a page offset
- VPN is used to index into the TLB
- TLB provides a PPN (physical page number)

The page offset of the requested address is not translated. This means that the index and block offset used for the cache access must fit within the page offset used in translation (page offset = index + block offset). This limits the size of the data array.

Each way in the cache cannot be larger than the page size used by the page table.
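
For example (numbers mine, not from the deck): with 4 KB pages the page offset is 12 bits. With 64-byte blocks (6 offset bits) that leaves at most 6 index bits, i.e. 64 sets, so each way can hold at most 64 * 64 B = 4 KB. A 32 KB VIPT L1 cache then needs at least 8 ways.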

43
Q

Because of the limitations of VIPT caches, how can the cache size be increased? What are the trade-offs of increasing the size?

A

The parallel TLB access limits the size of each cache way.

The associativity must be increased to get larger cache sizes. This also reduces conflict misses.

Larger (more associative) caches are, however, slower to access because of the larger way-select logic, and they consume more power.

44
Q

How can way prediction be implemented?

A

Take the tags that are being compared, create a 6-bit hash of each, and store these in a hash buffer.

Take the full address and create a hash of that as well.

Compare the 6-bit tag hashes with the 6-bit address hash (this requires a smaller comparator, and is therefore faster).

Use the result from the hash comparison to predict the way to access in the cache.

The full 19-bit tag comparison is still done in addition to the hash comparison, but the two can be placed in the pipeline such that the prediction can be used efficiently.
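
A rough sketch (my own simplification, not the lecture's exact scheme) of using the stored hashes to pick a way to access first; the 4-way associativity and all names are assumptions:

    #define WAYS 4   /* assumed associativity */

    /* way_hash[w] holds the 6-bit hash of the tag stored in way w;
       addr_hash is the 6-bit hash of the incoming address. The way
       whose hash matches is accessed first; the full tag compare
       later confirms or corrects the prediction. */
    static int predict_way(unsigned addr_hash, const unsigned way_hash[WAYS])
    {
        for (int w = 0; w < WAYS; w++)
            if (way_hash[w] == (addr_hash & 0x3f))
                return w;     /* predicted way               */
        return -1;            /* no prediction: access all ways */
    }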

45
Q

What is a MSHR?

A

Miss Status Holding Register.
Needed when using non-blocking caches.
Holds the status of an outstanding miss:
- Address and size
- forwarding information (what CPU register is being written, which cache block is being written back to)
- data buffer with metadata

When a new miss happens, we first check the MSHRs to see if the cache block of this miss is already being handled. If it is, no additional memory request is made.

The number of MSHRs determines the maximum number of outstanding misses.

46
Q

What is loop fusion?

A

If we have multiple loops over the same data with different computations, combine the computations into a single loop.
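
For illustration (my own sketch, not from the deck), loop fusion in C; the array names and computations are made up:

    /* Before fusion: two separate passes over a[]. */
    void unfused(const int *a, int *b, int *c, int n)
    {
        for (int i = 0; i < n; i++)
            b[i] = a[i] * 2;
        for (int i = 0; i < n; i++)
            c[i] = a[i] + 1;
    }

    /* After fusion: one pass, so a[i] is still in the cache
       when the second computation uses it. */
    void fused(const int *a, int *b, int *c, int n)
    {
        for (int i = 0; i < n; i++) {
            b[i] = a[i] * 2;
            c[i] = a[i] + 1;
        }
    }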