Memory and Cache Flashcards

1
Q

Why is memory access a performance bottleneck in modern computers?

A

Because accessing main memory can take ~100 clock cycles (much slower than CPU operations).

2
Q

What is the purpose of the memory hierarchy?

A

To balance capacity, cost, and access time, ensuring frequently accessed data is available in faster memory levels.

3
Q

What is cache?

A

A small, fast memory that stores frequently accessed data to reduce memory latency.

4
Q

How is cache structured in modern CPUs?

A

CPUs have multiple levels of cache (L1, L2, L3), each progressively larger but slower.

5
Q

What are the cache sizes and latencies on Isca’s 16-core nodes?

A
  • L1: 32kB (4-cycle latency)
  • L2: 256kB (11-cycle latency)
  • L3: 20MB (~34-cycle latency)

6
Q

How can you check L3 cache size on a Linux system?

A

Run `cat /proc/cpuinfo`; the "cache size" field typically reports the last-level (L3) cache. `lscpu` also lists the L1–L3 sizes.
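
The same information can be queried programmatically; a minimal sketch, assuming Linux with glibc (the `_SC_LEVEL*_CACHE_SIZE` names are a glibc extension, and `sysconf()` returns -1 or 0 where a value is not reported):

```c
/* Query cache sizes via sysconf(); glibc-specific.
 * Compile with: gcc -o cachesizes cachesizes.c */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("L1d: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2:  %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3:  %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```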

7
Q

What are the two main types of locality that caches exploit?

A
  1. Spatial Locality – Data near recently accessed memory is likely to be used.
  2. Temporal Locality – Recently accessed data is likely to be reused soon.

8
Q

How does spatial locality improve performance?

A

Data is fetched in cache lines (e.g., 64 bytes on Intel x86_64), so sequential memory access increases cache hits.
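
A minimal sketch of this effect (function names are illustrative; assumes 64-byte lines and 8-byte doubles): both functions compute the same sum, but the sequential loop uses all 8 elements of every line fetched, while the strided loop touches a new line on almost every iteration.

```c
#include <stddef.h>

double sum_sequential(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)        /* stride 1: ~1 miss per 8 loads */
        s += a[i];
    return s;
}

double sum_strided(const double *a, size_t n) {
    double s = 0.0;
    for (size_t j = 0; j < 8; j++)        /* stride 8: ~1 miss per load */
        for (size_t i = j; i < n; i += 8)
            s += a[i];
    return s;
}
```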

9
Q

What happens when the required data is not in cache?

A

A cache miss occurs, and the data must be fetched from slower main memory.

10
Q

Why is loop order important in C for efficient memory access?

A
  • Accessing memory sequentially improves cache efficiency.
  • C stores 2D arrays in row-major order, so the inner loop should run over the column (last) index; see the sketch below.
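
A minimal sketch contrasting the two orderings (names and sizes illustrative); in C, `a[i][j]` and `a[i][j+1]` are adjacent in memory:

```c
#define N 1024
double a[N][N];

void good_order(void) {          /* inner loop over columns: stride 1 */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] *= 2.0;
}

void bad_order(void) {           /* inner loop over rows: stride of N doubles */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] *= 2.0;
}
```
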
11
Q

What is cache blocking?

A

A technique where computations are split into blocks that fit in cache to maximise data reuse.
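
A minimal sketch of cache blocking applied to a matrix transpose (block size and names are illustrative assumptions): with B = 32, two 32×32 tiles of doubles occupy about 16 kB, which fits in a 32 kB L1 cache, so every fetched line is fully reused before eviction.

```c
#define N 4096                   /* must be a multiple of B in this sketch */
#define B 32

void transpose_blocked(double dst[N][N], const double src[N][N]) {
    for (int ii = 0; ii < N; ii += B)        /* walk over tiles */
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++) /* transpose one tile */
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}
```

Larger blocks can target L2 instead; the choice trades tile footprint against loop overhead.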

12
Q

Why is cache blocking effective?

A

It exploits temporal locality, keeping frequently used data in cache for longer.

13
Q

How can reducing problem size affect performance testing?

A

If a test case is too small, key memory behaviour (e.g., cache misses) may not be representative of real workloads.

14
Q

What is a good strategy for performance testing?

A

Use a representative memory footprint: keep the same domain size but reduce the number of time steps.

15
Q

What can compilers do to optimise memory access?

A

Apply transformations such as loop interchange and cache blocking to improve cache efficiency.

16
Q

How can you enable compiler optimisations in gcc?

A

Use the `-O3` flag when compiling, e.g. `gcc -O3 myprog.c -o myprog`.

17
Q

What is arithmetic intensity?

A

The ratio of floating-point operations to memory accesses, measured in FLOPs/byte.
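
A worked example on a STREAM-style triad loop (counting is illustrative and ignores write-allocate traffic):

```c
/* Per iteration: 1 multiply + 1 add = 2 FLOPs.
 * Memory: load b[i], load c[i], store a[i], 8 bytes each = 24 bytes.
 * AI = 2 / 24 ≈ 0.083 FLOPs/byte, i.e. strongly memory-bound. */
void triad(double *a, const double *b, const double *c, double s, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```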

18
Q

Why does arithmetic intensity vary between algorithms?

A

Some algorithms perform many calculations per byte of memory accessed, while others require frequent data movement.

19
Q

What does the roofline model describe?

A

An upper bound on an application's floating-point performance, determined by:
1. Peak performance (FLOPs/sec)
2. Memory bandwidth (bytes/sec)
3. Arithmetic intensity (FLOPs/byte)
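
As a worked example with hypothetical figures: attainable performance is min(peak, bandwidth × arithmetic intensity). On a machine with a 100 GFLOP/s peak and 20 GB/s of memory bandwidth, a kernel with AI = 0.1 FLOPs/byte attains at most min(100, 20 × 0.1) = 2 GFLOP/s (memory-bound), while a kernel with AI = 10 attains min(100, 20 × 10) = 100 GFLOP/s (compute-bound).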

20
Q

When is an application compute-bound vs. memory-bound?

A
  • Compute-bound: High arithmetic intensity, limited by CPU performance.
  • Memory-bound: Low arithmetic intensity, limited by memory bandwidth.

21
Q

What does NUMA stand for? What is it?

A

Non-Uniform Memory Access

NUMA refers to systems in which memory at different points in a processor's address space has different performance characteristics (latency and bandwidth).

22
Q

Why does NUMA exist?

A

Multi-socket CPUs have separate memory controllers, making remote memory accesses slower than local ones.

23
Q

How does memory allocation work in a NUMA system?

A

Memory pages are physically allocated on the NUMA node of the processor that first touches (writes to) them (the first-touch policy).

24
Q

Why is first-touch allocation important?

A

If only one thread initialises an array, all data may be allocated on one NUMA node, slowing access for other processors.
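
A minimal sketch assuming OpenMP (`numa_friendly_alloc` is an illustrative name): initialising in parallel with the same static schedule the compute loops will use means each page is first touched, and therefore physically placed, on the NUMA node of the thread that will work on it.

```c
#include <stdlib.h>

double *numa_friendly_alloc(size_t n) {
    double *a = malloc(n * sizeof *a);   /* virtual allocation only */
    if (a == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;                      /* first touch places each page locally */
    return a;
}
```

Later compute loops should use the same `schedule(static)` so threads work on the pages they placed.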

25
Q

How does NUMA affect OpenMP programs?

A

If one thread initialises memory and other threads access it later, remote memory accesses slow performance.

26
Q

What other NUMA-related performance effects exist?

A

  • Cache sharing between threads
  • Contention for memory bandwidth

27
Q

What is hybrid parallelism?

A

Using both MPI (for distributed memory) and OpenMP (for shared memory) together.

28
Q

Why can hybrid parallelism help with NUMA effects?

A

By assigning one MPI process per socket and using OpenMP within each socket, memory accesses stay local.

29
Q

How do you compile and run an MPI+OpenMP program (if the source file is called `lissajous.c`)?

A

`mpicc -std=c99 -o lissajous -fopenmp lissajous.c -lm`
`export OMP_NUM_THREADS=2`
`mpirun -np 2 ./lissajous`
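
For reference, a minimal hybrid program that the commands above could build (illustrative only, not the actual `lissajous.c`): each MPI rank runs an OpenMP team, e.g. one rank per socket with `OMP_NUM_THREADS` threads inside it.

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* FUNNELED: only the master thread of each rank makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```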