Memory and Cache Flashcards
Why is memory access a performance bottleneck in modern computers?
Because accessing main memory can take ~100 clock cycles (much slower than CPU operations).
What is the purpose of the memory hierarchy?
To balance capacity, cost, and access time, ensuring frequently accessed data is available in faster memory levels.
What is cache?
A small, fast memory that stores frequently accessed data to reduce memory latency.
How is cache structured in modern CPUs?
CPUs have multiple levels of cache (L1, L2, L3), each progressively larger but slower.
What are the cache sizes and latencies on Isca’s 16-core nodes?
- L1: 32kB (4-cycle latency)
- L2: 256kB (11-cycle latency)
- L3: 20MB (~34-cycle latency)
How can you check L3 cache size on a Linux system?
By using the command: cat /proc/cpuinfo
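On glibc-based Linux systems, cache sizes can also be queried programmatically. A minimal sketch using sysconf (note: the _SC_LEVEL* names are a glibc extension, not part of POSIX):

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc extension: returns the cache size in bytes, or -1 if unknown */
    long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);

    printf("L1d: %ld bytes, L2: %ld bytes, L3: %ld bytes\n", l1, l2, l3);
    return 0;
}
```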
What are the two main types of locality that caches exploit?
- Spatial Locality – Data near recently accessed memory is likely to be used.
- Temporal Locality – Recently accessed data is likely to be reused soon.
How does spatial locality improve performance?
Data is fetched in cache lines (e.g., 64 bytes on Intel x86_64), so sequential memory access increases cache hits.
What happens when the required data is not in cache?
A cache miss occurs, and the data must be fetched from slower main memory.
Why is loop order important in C for efficient memory access?
- Accessing memory sequentially improves cache efficiency.
- Because C stores 2D arrays in row-major order, looping over rows in the outer loop and columns in the inner loop walks memory sequentially, resulting in more cache hits (see the sketch below).
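A minimal sketch contrasting the two loop orders for a row-major C array (the array name and size are illustrative):

```c
#include <stdio.h>

#define N 1024

static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Cache-friendly: the inner loop walks along a row, so consecutive
       iterations touch consecutive addresses (row-major order in C). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Cache-unfriendly: the inner loop jumps N * sizeof(double) bytes
       between accesses, so most accesses miss in cache. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```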
What is cache blocking?
A technique where computations are split into blocks that fit in cache to maximise data reuse.
Why is cache blocking effective?
It exploits temporal locality, keeping frequently used data in cache for longer.
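A minimal sketch of cache blocking applied to a matrix transpose (the names and sizes are illustrative; in practice BLOCK is tuned so the working set of one tile fits in cache):

```c
#include <stdio.h>

#define N 1024
#define BLOCK 64   /* chosen so one tile of each array fits in cache */

static double in[N][N], out[N][N];

int main(void) {
    /* Transpose in BLOCK x BLOCK tiles: each tile of 'in' and 'out' is
       reused while it is still resident in cache (temporal locality). */
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    out[j][i] = in[i][j];

    printf("%f\n", out[N - 1][0]);
    return 0;
}
```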
How can reducing problem size affect performance testing?
If a test case is too small, key memory behaviour (e.g., cache misses) may not be representative of real workloads.
What is a good strategy for performance testing?
Use a representative memory footprint by keeping the same domain size but reducing time steps.
What can compilers do to optimise memory access?
Techniques like loop interchange and cache blocking can improve cache efficiency.
How can you enable compiler optimisations in gcc?
Use the -O3 flag when compiling, e.g. gcc -O3 program.c.
What is arithmetic intensity?
The ratio of floating-point operations to memory accesses, measured in FLOPs/byte.
Why does arithmetic intensity vary between algorithms?
Some algorithms perform many calculations per byte of memory accessed, while others require frequent data movement.
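For example, the DAXPY update y[i] = a*x[i] + y[i] on double-precision data performs 2 FLOPs per element while moving roughly 24 bytes (load x[i], load y[i], store y[i]); this is a rough count assuming no cache reuse:

$$\mathrm{AI} = \frac{2\ \text{FLOPs}}{3 \times 8\ \text{bytes}} \approx 0.08\ \text{FLOPs/byte}$$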
What does the roofline model describe?
The maximum floating-point performance of an application based on:
1. Peak performance (FLOPs/sec)
2. Memory bandwidth (bytes/sec)
3. Arithmetic intensity (FLOPs/byte)
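In its simplest form, the attainable performance P of a kernel with arithmetic intensity I is

$$P = \min\left(P_{\text{peak}},\; B \times I\right)$$

where P_peak is the peak floating-point rate (FLOPs/sec) and B is the memory bandwidth (bytes/sec).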
When is an application compute-bound vs. memory-bound?
- Compute-bound: High arithmetic intensity, limited by CPU performance.
- Memory-bound: Low arithmetic intensity, limited by memory bandwidth.
What does NUMA stand for? What is it?
Non-Uniform Memory Access
NUMA is the phenomenon whereby memory at different points in a processor's address space has different performance characteristics.
Why does NUMA exist?
Multi-socket CPUs have separate memory controllers, making remote memory accesses slower than local ones.
How does memory allocation work in a NUMA system?
Physical pages are allocated on the NUMA node of the CPU that first touches (writes to) them, known as the first-touch policy.
Why is first-touch allocation important?
If only one thread initialises an array, all data may be allocated on one NUMA node, slowing access for other processors.
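A minimal sketch of parallel first-touch initialisation with OpenMP (the array name and size are illustrative). Initialising with the same static loop partitioning as the compute loop places each page on the NUMA node that will later use it:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 24)

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* First touch: each thread writes its own chunk, so those pages land
       on that thread's local NUMA node under the first-touch policy. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* A later compute loop with the same static schedule then accesses
       mostly node-local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + 1.0;

    printf("%f\n", a[0]);
    free(a);
    return 0;
}
```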