CUDA C API - Basics Flashcards
What is a CUDA kernel? How do you define one?
A kernel is a function that runs on the GPU. It is defined with the __global__ qualifier and must return void:
__global__ void myKernel() { /* Code here */ }
How do you launch a CUDA kernel?
Kernels are launched with an execution configuration:
myKernel<<<numBlocks, threadsPerBlock>>>();
What are CUDA threads?
CUDA divides execution into threads, which are small execution units that run the same kernel function in parallel.
What is a CUDA Block?
- A group of threads that execute a kernel function together and share memory.
- Every block in the grid contains the same number of threads.
What is a CUDA Grid?
A collection of blocks that run a kernel function together.
How do you wait for a GPU kernel to finish?
cudaDeviceSynchronize(): a CUDA function that causes the CPU to wait until the GPU has finished working.
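A minimal sketch of the launch-then-wait pattern (the kernel name and configuration are placeholders):
__global__ void myKernel() { /* GPU work here */ }

int main() {
  myKernel<<<1, 256>>>();   // launch is asynchronous: control returns to the CPU immediately
  cudaDeviceSynchronize();  // block the CPU until all queued GPU work has finished
  return 0;
}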
What is the limit to the number of threads that can exist in a thread block?
1024
Kernel execution configuration in 3D (when processing matrices, for instance)
dim3 threads_per_block(16, 16, 1);
dim3 number_of_blocks(16, 16, 1);
someKernel<<<number_of_blocks, threads_per_block>>>();
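A sketch of how a kernel might map this 2D configuration onto matrix coordinates (the kernel name and the doubling operation are illustrative assumptions):
__global__ void matrixKernel(float *m, int width, int height) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;  // x dimension indexes columns
  int row = blockIdx.y * blockDim.y + threadIdx.y;  // y dimension indexes rows
  if (row < height && col < width) {                // guard against threads outside the matrix
    m[row * width + col] *= 2.0f;                   // one element per thread
  }
}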
Get the total number of (in each 3D direction):
- blocks in a grid
- threads in a block
- gridDim.x|y|z is the number of blocks in the grid.
- blockDim.x|y|z is the number of threads in a block.
Get:
- index of the current block
- index of the thread within a block
- blockIdx.x|y|z is the index of the current block within the grid.
- threadIdx.x|y|z is the index of the thread within a block.
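For illustration, a sketch of the common pattern that combines these built-ins into one global thread index (the kernel name is hypothetical):
#include <cstdio>

__global__ void whoAmI() {
  // unique 1D index of this thread across the whole grid
  int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
  printf("block %d, thread %d -> global index %d\n", blockIdx.x, threadIdx.x, globalIdx);
}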
What are Streaming Multiprocessors?
- Streaming Multiprocessors (SMs) are the GPU’s computational units that execute CUDA threads in parallel. Each SM contains multiple CUDA cores, registers, shared memory, and scheduling units.
- Depending on the number of SMs on a GPU, and the requirements of a block, more than one block can be scheduled on an SM.
How to determine optimal grid size?
A grid size whose number of blocks is a multiple of the number of SMs.
- That way, in each “round” of scheduling, all SMs can be occupied (full utilization).
- The number of SMs should not be hard-coded into a codebase (it differs between GPUs); query it at runtime instead, as sketched below.
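A sketch using cudaDeviceGetAttribute to query the SM count (the factor of 32 blocks per SM is an arbitrary example, not a rule):
int deviceId, numberOfSMs;
cudaGetDevice(&deviceId);
cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);
int numberOfBlocks = numberOfSMs * 32;  // any multiple of the SM count keeps all SMs busy
someKernel<<<numberOfBlocks, 256>>>();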
How to determine optimal thread size?
- SMs create, manage, schedule, and execute groupings of 32 threads from within a block, called warps.
- Performance gains can be had by choosing a block size whose number of threads is a multiple of 32; see the sketch below.
How do you coordinate thread work when the input is smaller than the thread count?
threadIdx.x + blockIdx.x * blockDim.x
- Example: with a blockDim.x of 4, thread 3 in block 1 will process the element at index 7 (3 + 1 * 4).
- Code must check that the data index calculated by threadIdx.x + blockIdx.x * blockDim.x is less than N, the number of data elements (attempting to access non-existent elements can result in a runtime error); see the sketch below.
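A minimal sketch of this bounds check inside a kernel (the kernel name and doubling operation are assumptions):
__global__ void doubleElements(float *a, int N) {
  int dataIndex = threadIdx.x + blockIdx.x * blockDim.x;
  if (dataIndex < N) {  // skip threads that fall past the end of the data
    a[dataIndex] *= 2.0f;
  }
}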
What is a grid-stride loop?
- Used when there are more data elements than threads in the grid.
- In a grid-stride loop, the thread’s first element is calculated as usual, with threadIdx.x + blockIdx.x * blockDim.x.
- The thread then strides forward by the number of threads in the grid (blockDim.x * gridDim.x).
- It continues in this way until its data index is greater than or equal to the number of data elements.
- With all threads working in this way, all elements are covered
- CUDA runs as many blocks in parallel at once as the GPU hardware supports, for massive parallelization.
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { c[i] = 2 * a[i] + b[i]; }
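A sketch of this loop wrapped into a complete kernel (the kernel name and launch configuration are illustrative):
__global__ void addVectorsInto(float *c, float *a, float *b, int n) {
  // each thread starts at its global index, then hops by the grid’s total thread count
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
    c[i] = 2 * a[i] + b[i];
  }
}
// launch: any grid shape covers all n elements, e.g. addVectorsInto<<<32, 256>>>(c, a, b, n);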
What is a common way to handle CUDA API errors?
cudaError_t mallocErr = cudaMallocManaged(&a, size);
if (mallocErr != cudaSuccess) {
  printf("Error: %s\n", cudaGetErrorString(mallocErr));
  return 1;
}
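Many codebases wrap this pattern in a helper macro; a hedged sketch (the name CUDA_CHECK is a common convention, not part of the CUDA API):
#include <cstdio>

#define CUDA_CHECK(call)                                   \
  do {                                                     \
    cudaError_t err = (call);                              \
    if (err != cudaSuccess) {                              \
      printf("CUDA error at %s:%d: %s\n",                  \
             __FILE__, __LINE__, cudaGetErrorString(err)); \
    }                                                      \
  } while (0)

// usage: CUDA_CHECK(cudaMallocManaged(&a, size));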
How to check CUDA kernel launch errors?
someKernel<<<1, -1>>>();  // -1 threads per block is invalid, forcing a launch error
cudaError_t kernelLaunchErr = cudaGetLastError();
if (kernelLaunchErr != cudaSuccess) {
  printf("Kernel launch error: %s\n", cudaGetErrorString(kernelLaunchErr));
  return 2;
}
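cudaGetLastError only catches errors detectable at launch time; errors that occur while the kernel runs are returned by a later synchronization. A sketch checking both (grid and block variables are assumed to be defined):
someKernel<<<numberOfBlocks, threadsPerBlock>>>();
cudaError_t launchErr = cudaGetLastError();      // synchronous errors, e.g. a bad launch configuration
cudaError_t asyncErr = cudaDeviceSynchronize();  // errors raised during kernel execution
if (launchErr != cudaSuccess) printf("Launch error: %s\n", cudaGetErrorString(launchErr));
if (asyncErr != cudaSuccess) printf("Async error: %s\n", cudaGetErrorString(asyncErr));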
How to compile a CUDA program and run it immediately
nvcc -o hello-gpu 01-hello/01-hello-gpu.cu -run
Generate a profiling report with a summary printed to the screen
nsys profile --stats=true ./single-thread-vector-add
What are the profiling report sections?
- Operating System Runtime Summary (osrt_sum)
- CUDA API Summary (cuda_api_sum)
- CUDA Kernel Summary (cuda_gpu_kern_sum)
- CUDA Memory Time Operation Summary (cuda_gpu_mem_time_sum)
- CUDA Memory Size Operation Summary (cuda_gpu_mem_size_sum)