CUDA C API - Unified Memory Flashcards

1
Q

How does unified memory in CUDA work?

A
  • When unified memory (UM) is allocated, it is not yet resident on either the host or the device.
  • When the CPU, or any GPU in the accelerated system, attempts to access memory not yet resident on it, page faults occur and trigger its migration (see the sketch below).
    • This is true for empty (i.e., uninitialized) buffers as well: until you access them, they don't reside anywhere!
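
A minimal sketch of this access-triggers-migration behavior; the `doubleElements` kernel and the size `N` are illustrative assumptions, not part of the CUDA API:

```c
#include <cuda_runtime.h>

__global__ void doubleElements(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= 2;           // first GPU touch: these pages fault and migrate HtoD
}

int main()
{
    const int N = 1 << 20;   // buffer size (assumed for the example)
    int *a;
    cudaMallocManaged(&a, N * sizeof(int));  // allocated, but no pages resident anywhere yet

    for (int i = 0; i < N; ++i)
        a[i] = i;            // first CPU touch: pages fault in on the host

    doubleElements<<<(N + 255) / 256, 256>>>(a, N);
    cudaDeviceSynchronize();

    int check = a[0];        // CPU access after the kernel: pages fault back DtoH
    (void)check;

    cudaFree(a);
    return 0;
}
```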
2
Q

What are the formal names of the data transfers occurring in a CUDA program?

A
  • Host to Device (HtoD)
  • Device to Host (DtoH)
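
These abbreviations also appear in profiler output: Nsight Systems labels copy operations as `[CUDA memcpy HtoD]` and `[CUDA memcpy DtoH]`, matching the `cudaMemcpyHostToDevice` and `cudaMemcpyDeviceToHost` direction flags used with `cudaMemcpy` (see the last card).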
3
Q

How to identify page faults?

A
  • Nsight Systems GUI: look at the unified memory rows on the timeline; occurrences of page faults that caused a transfer are marked (in red).
  • nsys profile: if the memory statistics (cuda_gpu_mem_time_sum and cuda_gpu_mem_size_sum) show many small memory-migration operations, this is a sign that on-demand page faulting is occurring, with a small migration happening each time there is a page fault at the requested location.
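
For example, `nsys profile --stats=true ./app` prints these summary tables to the console; in a typical report, on-demand migrations show up as many small `[CUDA Unified Memory memcpy HtoD]` / `[CUDA Unified Memory memcpy DtoH]` entries (exact labels may vary by nsys version).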
4
Q

How to prefetch memory (to the GPU/CPU)?

A

```c
int deviceId;
cudaGetDevice(&deviceId);

// Prefetch to GPU device (with given ID)
cudaMemPrefetchAsync(pointerToSomeUMData, size, deviceId);
// Prefetch to host.
// cudaCpuDeviceId is a built-in CUDA variable.
cudaMemPrefetchAsync(pointerToSomeUMData, size, cudaCpuDeviceId);
```
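
Prefetching before a kernel launch (or before CPU access) replaces many small on-demand page-fault migrations with one large bulk transfer, which is typically far more efficient.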

5
Q

Allocate and free a unified memory buffer

A

```c
int *a;
size_t size = N * sizeof(int);  // N (element count) assumed defined elsewhere
cudaMallocManaged(&a, size);    // a is now accessible from both host and device code

// Logic …

cudaFree(a);
```

6
Q
  • How to manually allocate memory on the GPU or CPU?
  • How to manually transfer data between them?
  • How to free this manually allocated memory?
A

```c
int *host_a, *device_a;
size_t size = N * sizeof(int); // N (element count) assumed defined elsewhere

cudaMalloc(&device_a, size);   // device_a is immediately available on the GPU.
cudaMallocHost(&host_a, size); // host_a is immediately available on the CPU, and is page-locked, or pinned.

// cudaMemcpy takes the destination, source, size, and a CUDA-provided variable for the direction of the copy.
cudaMemcpy(device_a, host_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(host_a, device_a, size, cudaMemcpyDeviceToHost);

cudaFree(device_a);
cudaFreeHost(host_a);
```
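
Pinned (page-locked) host memory cannot be swapped out by the OS, which lets the GPU DMA directly from it; this makes HtoD/DtoH copies faster and is required for truly asynchronous `cudaMemcpyAsync` transfers.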
