CUDA C API - Unified Memory Flashcards
How does unified memory in CUDA work?
- When UM is allocated, the memory is not resident yet on either the host or the device.
- At any point when the CPU, or any GPU in the accelerated system, attempts to access memory not yet resident on it, page faults will occur and trigger its migration.
- This is true for empty (i.e. uninitialized) buffers as well: until you access them, they don't reside anywhere!
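The access pattern described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the kernel name `doubleElements`, the buffer size, and the launch configuration are made up for the example, and the on-demand page migration described in the comments assumes a GPU and platform that support it (for example, Pascal or newer on Linux).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element in place.
__global__ void doubleElements(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

int main() {
    const int n = 1 << 20;
    const size_t size = n * sizeof(int);
    int *a = nullptr;

    // Per the notes above: after allocation, the pages are not resident anywhere yet.
    if (cudaMallocManaged(&a, size) != cudaSuccess) return 1;

    // First touch on the CPU: CPU page faults make the pages resident on the host.
    for (int i = 0; i < n; ++i) a[i] = i;

    // GPU access: GPU page faults trigger Host to Device migration (assumed platform support).
    doubleElements<<<(n + 255) / 256, 256>>>(a, n);
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    // CPU access again: pages fault back to the host (Device to Host migration).
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (a[i] != 2 * i) { ok = false; break; }
    }
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(a);
    return ok ? 0 : 1;
}
```

Profiling a program like this should show the migrations as many small transfers rather than one large copy.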
What are the formal names of data transfers occurring in a CUDA program?
- Host to Device (`HtoD`)
- Device to Host (`DtoH`)
How to identify page faults
- Nsight Systems: find the UM access timeline, where you can see (red) occurrences of page faults that caused a transfer.
- nsys profile: If you see many small memory migration operations (`cuda_gpu_mem_time_sum` and `cuda_gpu_mem_size_sum`), this is a sign that on-demand page faulting is occurring, with a small memory migration each time there is a page fault in the requested location.
How to prefetch memory (to GPU/CPU)
```c
int deviceId;
cudaGetDevice(&deviceId);
// Prefetch to GPU device (with given ID)
cudaMemPrefetchAsync(pointerToSomeUMData, size, deviceId);
// Prefetch to host.
// cudaCpuDeviceId is a constant provided by the CUDA runtime.
cudaMemPrefetchAsync(pointerToSomeUMData, size, cudaCpuDeviceId);
```
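To show where the prefetch calls fit in a whole program, here is a minimal sketch. The kernel `addOne`, the buffer size, and the launch configuration are hypothetical. It also assumes the `cudaMemPrefetchAsync` signature used above, which takes an integer device ID with the stream defaulting to 0; newer CUDA toolkits may expect a different argument list.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t size = n * sizeof(float);
    float *data = nullptr;
    if (cudaMallocManaged(&data, size) != cudaSuccess) return 1;

    // Initialize on the CPU, so the pages start out resident on the host.
    for (int i = 0; i < n; ++i) data[i] = 0.0f;

    int deviceId;
    cudaGetDevice(&deviceId);

    // Move the whole buffer to the GPU in bulk, instead of page by page on fault.
    cudaMemPrefetchAsync(data, size, deviceId);
    addOne<<<(n + 255) / 256, 256>>>(data, n);

    // Issued on the same (default) stream, so it runs after the kernel finishes.
    cudaMemPrefetchAsync(data, size, cudaCpuDeviceId);
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

Prefetching does not change the results. It only replaces many small fault-driven migrations with a few large transfers.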
Allocate and free a unified memory buffer
```c
int *a;
size_t size = N * sizeof(int); // size_t avoids overflow for large N
cudaMallocManaged(&a, size);
// Logic …
cudaFree(a);
```
- How to manually allocate memory on GPU or CPU
- How to manually transfer data between them.
- How to free this manually allocated memory.
```c
int *host_a, *device_a;
cudaMalloc(&device_a, size);   // device_a is immediately available on the GPU.
cudaMallocHost(&host_a, size); // host_a is immediately available on the CPU, and is page-locked, or pinned.
// cudaMemcpy takes the destination, source, size, and a CUDA-provided value for the direction of the copy.
cudaMemcpy(device_a, host_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(host_a, device_a, size, cudaMemcpyDeviceToHost);
cudaFree(device_a);
cudaFreeHost(host_a);
```
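The calls above can be combined into a complete round trip, sketched below. The kernel `square`, the element count, and the launch configuration are hypothetical names and values chosen for the example. Error handling is reduced to the bare minimum.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: squares every element in place.
__global__ void square(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

int main() {
    const int n = 1024;
    const size_t size = n * sizeof(int);
    int *host_a = nullptr, *device_a = nullptr;

    // Pinned host buffer and device buffer, allocated explicitly.
    if (cudaMallocHost(&host_a, size) != cudaSuccess) return 1;
    if (cudaMalloc(&device_a, size) != cudaSuccess) return 1;

    for (int i = 0; i < n; ++i) host_a[i] = i;

    // Explicit Host to Device copy, kernel launch, then Device to Host copy.
    cudaMemcpy(device_a, host_a, size, cudaMemcpyHostToDevice);
    square<<<(n + 255) / 256, 256>>>(device_a, n);
    // cudaMemcpy on the default stream waits for the kernel before copying.
    cudaMemcpy(host_a, device_a, size, cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (host_a[i] != i * i) { ok = false; break; }
    }
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(device_a);
    cudaFreeHost(host_a);
    return ok ? 0 : 1;
}
```

With explicit allocation there are no page faults to profile: each transfer appears as a single `HtoD` or `DtoH` copy of the full buffer.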