Memory and Cache Flashcards
Why is memory access a performance bottleneck in modern computers?
Because accessing main memory can take ~100 clock cycles (much slower than CPU operations).
What is the purpose of the memory hierarchy?
To balance capacity, cost, and access time, ensuring frequently accessed data is available in faster memory levels.
What is cache?
A small, fast memory that stores frequently accessed data to reduce memory latency.
How is cache structured in modern CPUs?
CPUs have multiple levels of cache (L1, L2, L3), each progressively larger but slower.
What are the cache sizes and latencies on Isca’s 16-core nodes?
- L1: 32kB (4-cycle latency)
- L2: 256kB (11-cycle latency)
- L3: 20MB (~34-cycle latency)
How can you check L3 cache size on a Linux system?
By using the command: cat /proc/cpuinfo
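On glibc-based Linux systems, cache sizes can also be queried programmatically. A minimal sketch using sysconf (note: the _SC_LEVEL* names are a glibc extension, not part of POSIX):

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc extension: returns the cache size in bytes, or -1 if unknown */
    long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);

    printf("L1d: %ld bytes, L2: %ld bytes, L3: %ld bytes\n", l1, l2, l3);
    return 0;
}
```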
What are the two main types of locality that caches exploit?
- Spatial Locality – Data near recently accessed memory is likely to be used.
- Temporal Locality – Recently accessed data is likely to be reused soon.
How does spatial locality improve performance?
Data is fetched in cache lines (e.g., 64 bytes on Intel x86_64), so sequential memory access increases cache hits.
What happens when the required data is not in cache?
A cache miss occurs, and the data must be fetched from slower main memory.
Why is loop order important in C for efficient memory access?
- Accessing memory sequentially improves cache efficiency.
- Because C stores 2D arrays in row-major order, looping over rows in the outer loop and columns in the inner loop walks memory sequentially, resulting in more cache hits (see the sketch below).
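A minimal sketch contrasting the two loop orders for a row-major C array (the array name and size are illustrative):

```c
#include <stdio.h>

#define N 1024

static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Cache-friendly: the inner loop walks along a row, so consecutive
       iterations touch consecutive addresses (row-major order in C). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Cache-unfriendly: the inner loop jumps N * sizeof(double) bytes
       between accesses, so most accesses miss in cache. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```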
What is cache blocking?
A technique where computations are split into blocks that fit in cache to maximise data reuse.
Why is cache blocking effective?
It exploits temporal locality, keeping frequently used data in cache for longer.
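A minimal sketch of cache blocking applied to a matrix transpose (the names and sizes are illustrative; in practice BLOCK is tuned so the working set of one tile fits in cache):

```c
#include <stdio.h>

#define N 1024
#define BLOCK 64   /* chosen so one tile of each array fits in cache */

static double in[N][N], out[N][N];

int main(void) {
    /* Transpose in BLOCK x BLOCK tiles: each tile of 'in' and 'out' is
       reused while it is still resident in cache (temporal locality). */
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    out[j][i] = in[i][j];

    printf("%f\n", out[N - 1][0]);
    return 0;
}
```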
How can reducing problem size affect performance testing?
If a test case is too small, key memory behaviour (e.g., cache misses) may not be representative of real workloads.
What is a good strategy for performance testing?
Use a representative memory footprint by keeping the same domain size but reducing time steps.
What can compilers do to optimise memory access?
Techniques like loop interchange and cache blocking can improve cache efficiency.
How can you enable compiler optimisations in gcc?
Use the -O3 flag when compiling, e.g. gcc -O3 program.c.
What is arithmetic intensity?
The ratio of floating-point operations to memory accesses, measured in FLOPs/byte.
Why does arithmetic intensity vary between algorithms?
Some algorithms perform many calculations per byte of memory accessed, while others require frequent data movement.
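For example, the DAXPY update y[i] = a*x[i] + y[i] on double-precision data performs 2 FLOPs per element while moving roughly 24 bytes (load x[i], load y[i], store y[i]); this is a rough count assuming no cache reuse:

$$\mathrm{AI} = \frac{2\ \text{FLOPs}}{3 \times 8\ \text{bytes}} \approx 0.08\ \text{FLOPs/byte}$$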
What does the roofline model describe?
The maximum floating-point performance of an application based on:
1. Peak performance (FLOPs/sec)
2. Memory bandwidth (bytes/sec)
3. Arithmetic intensity (FLOPs/byte)
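In its simplest form, the attainable performance P of a kernel with arithmetic intensity I is

$$P = \min\left(P_{\text{peak}},\; B \times I\right)$$

where P_peak is the peak floating-point rate (FLOPs/sec) and B is the memory bandwidth (bytes/sec).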
When is an application compute-bound vs. memory-bound?
- Compute-bound: High arithmetic intensity, limited by CPU performance.
- Memory-bound: Low arithmetic intensity, limited by memory bandwidth.
What does NUMA stand for? What is it?
Non-Uniform Memory Access
NUMA is the phenomenon whereby memory at different points in a processor's address space has different performance characteristics.
Why does NUMA exist?
Multi-socket CPUs have separate memory controllers, making remote memory accesses slower than local ones.
How does memory allocation work in a NUMA system?
Physical pages are allocated on the NUMA node of the CPU that first touches (writes to) them, known as the first-touch policy.
Why is first-touch allocation important?
If only one thread initialises an array, all data may be allocated on one NUMA node, slowing access for other processors.
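A minimal sketch of parallel first-touch initialisation with OpenMP (the array name and size are illustrative). Initialising with the same static loop partitioning as the compute loop places each page on the NUMA node that will later use it:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 24)

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* First touch: each thread writes its own chunk, so those pages land
       on that thread's local NUMA node under the first-touch policy. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* A later compute loop with the same static schedule then accesses
       mostly node-local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + 1.0;

    printf("%f\n", a[0]);
    free(a);
    return 0;
}
```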