What is a CUDA kernel, and how do you define one?
A kernel is a function that runs on the GPU. It is declared with the __global__ qualifier and must return void:
__global__ void myKernel() { /* Code here */ }
How do you launch a CUDA kernel?
Kernels are launched with an execution configuration:
myKernel<<<numBlocks, threadsPerBlock>>>();
What are CUDA threads?
CUDA divides execution into threads, which are small execution units that run the same kernel function in parallel.
What is a CUDA Block?
A collection of threads that execute together; every block in the grid contains the same number of threads.
What is a CUDA Grid?
A collection of blocks that run a kernel function together.
How to wait for a GPU kernel to finish
cudaDeviceSynchronize(): CUDA function that will cause the CPU to wait until the GPU is finished working.
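A minimal sketch of launch-then-wait (the kernel name is illustrative):

```cuda
#include <cstdio>

__global__ void helloKernel()
{
    printf("Hello from the GPU\n");
}

int main()
{
    helloKernel<<<1, 1>>>();  // kernel launches are asynchronous: control returns to the CPU immediately
    cudaDeviceSynchronize();  // block the CPU until all queued GPU work has completed
    return 0;
}
```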
What is the limit to the number of threads that can exist in a thread block?
1024
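Rather than memorizing the limit, it can be queried at runtime; a sketch:

```cuda
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);  // typically 1024
    return 0;
}
```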
Kernel execution configuration in 3D (when processing matrices, for instance)
dim3 threads_per_block(16, 16, 1);
dim3 number_of_blocks(16, 16, 1);
someKernel<<<number_of_blocks, threads_per_block>>>();
Get the total number of (in each of the three dimensions):
- blocks in a grid
- threads in a block
gridDim.x|y|z is the number of blocks in the grid.
blockDim.x|y|z is the number of threads in a block.
Get:
- index of the current block
- index of the thread within a block
blockIdx.x|y|z is the index of the current block within the grid.
threadIdx.x|y|z is the index of the thread within its block.
What are Streaming Multiprocessors?
Streaming Multiprocessors (SMs) are the hardware units on a GPU that execute blocks of threads. Depending on the number of SMs on a GPU and the resource requirements of a block, more than one block can be scheduled on an SM.
How to determine optimal grid size?
A grid size that has a number of blocks that is a multiple of the number of SMs.
- That way in each “round” all SMs can be occupied (full utilization).
- The number of SMs should not be hard-coded into a code base (it differs between GPUs).
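A sketch of querying the SM count at runtime and sizing the grid from it (the kernel and the factor of 32 are illustrative):

```cuda
__global__ void someKernel() { /* work here */ }

int main()
{
    int deviceId;
    cudaGetDevice(&deviceId);

    int numberOfSMs;
    cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);

    // Grid size as a multiple of the SM count, never hard-coded.
    size_t threadsPerBlock = 256;
    size_t numberOfBlocks = 32 * numberOfSMs;
    someKernel<<<numberOfBlocks, threadsPerBlock>>>();
    cudaDeviceSynchronize();
    return 0;
}
```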
How to determine optimal thread size?
SMs create, manage, schedule, and execute groupings of 32 threads from within a block, called warps; a thread count that is a multiple of 32 avoids partially filled warps.
How to coordinate thread work when the input is smaller than the thread count?
Each thread computes a unique global index:
threadIdx.x + blockIdx.x * blockDim.x
and only does work if that index is within the input bounds (e.g. thread 3 of block 1 with blockDim.x = 4 has global index 3 + 1 * 4 = 7).
What is a grid stride?
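A minimal bounds-checked kernel (name and signature are illustrative):

```cuda
__global__ void doubleElements(float *a, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // unique global index for this thread
    if (i < n)  // threads whose index falls outside the input do nothing
    {
        a[i] *= 2;
    }
}
```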
Each thread starts at its global index (threadIdx.x + blockIdx.x * blockDim.x) and steps through the data by the total number of threads in the grid (blockDim.x * gridDim.x):
for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += blockDim.x * gridDim.x)
{
    c[i] = 2 * a[i] + b[i];
}
Common way of handling CUDA API errors?
cudaError_t mallocErr = cudaMallocManaged(&a, size);
if (mallocErr != cudaSuccess) {
printf("Error: %s\n", cudaGetErrorString(mallocErr));
return 1;
}
How to check for CUDA kernel launch errors?
someKernel<<<1, -1>>>(); // deliberately invalid configuration: -1 threads per block
cudaError_t kernelLaunchErr = cudaGetLastError();
if (kernelLaunchErr != cudaSuccess) {
printf("Kernel launch error: %s\n", cudaGetErrorString(kernelLaunchErr));
return 2;
}
How to compile a CUDA program and run it immediately
nvcc -o hello-gpu 01-hello/01-hello-gpu.cu -run
Generate a profiling report with a summary printed to the screen
nsys profile --stats=true ./single-thread-vector-add
What are the profiling report sections?
- osrt_sum (OS runtime call summary)
- cuda_api_sum (CUDA API call summary)
- cuda_gpu_kern_sum (GPU kernel summary)
- cuda_gpu_mem_time_sum (GPU memory operations by time)
- cuda_gpu_mem_size_sum (GPU memory operations by size)