What is a CUDA kernel, and how do you define one?
A kernel is a function that runs on the GPU. It is declared with the __global__ qualifier and must return void:
__global__ void myKernel() { /* Code here */ }
How do you launch a CUDA kernel?
Kernels are launched with an execution configuration:
myKernel<<<numBlocks, threadsPerBlock>>>();
What are CUDA threads?
CUDA divides execution into threads, which are small execution units that run the same kernel function in parallel.
What is a CUDA Block?
A collection of threads that execute together; every block in the grid contains the same number of threads.
What is a CUDA Grid?
A collection of blocks that run a kernel function together.
How to wait for a GPU kernel to finish
cudaDeviceSynchronize(): CUDA function that will cause the CPU to wait until the GPU is finished working.
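A minimal sketch of launch-then-wait (the kernel name is illustrative):

```cuda
#include <cstdio>

__global__ void helloKernel()
{
    printf("Hello from the GPU\n");
}

int main()
{
    helloKernel<<<1, 1>>>();  // kernel launches are asynchronous: control returns to the CPU immediately
    cudaDeviceSynchronize();  // block the CPU until all queued GPU work has completed
    return 0;
}
```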
What is the limit to the number of threads that can exist in a thread block?
1024
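Rather than memorizing the limit, it can be queried at runtime; a sketch:

```cuda
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);  // typically 1024
    return 0;
}
```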
Kernel execution configuration in 3D (when processing matrices, for instance)
dim3 threads_per_block(16, 16, 1);
dim3 number_of_blocks(16, 16, 1);
someKernel<<<number_of_blocks, threads_per_block>>>();
Get the total number of (in each of the three dimensions):
- blocks in a grid
- threads in a block
gridDim.x|y|z is the number of blocks in the grid.
blockDim.x|y|z is the number of threads in a block.
Get:
- index of the current block
- index of the thread within a block
blockIdx.x|y|z is the index of the current block within the grid.
threadIdx.x|y|z is the index of the thread within its block.
What are Streaming Multiprocessors?
Streaming Multiprocessors (SMs) are the hardware units on a GPU that execute blocks of threads. Depending on the number of SMs on a GPU and the resource requirements of a block, more than one block can be scheduled on an SM.
How to determine optimal grid size?
A grid size that has a number of blocks that is a multiple of the number of SMs.
- That way in each “round” all SMs can be occupied (full utilization).
- The number of SMs should not be hard-coded into a code base (it differs between GPUs).
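A sketch of querying the SM count at runtime and sizing the grid from it (the kernel and the factor of 32 are illustrative):

```cuda
__global__ void someKernel() { /* work here */ }

int main()
{
    int deviceId;
    cudaGetDevice(&deviceId);

    int numberOfSMs;
    cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);

    // Grid size as a multiple of the SM count, never hard-coded.
    size_t threadsPerBlock = 256;
    size_t numberOfBlocks = 32 * numberOfSMs;
    someKernel<<<numberOfBlocks, threadsPerBlock>>>();
    cudaDeviceSynchronize();
    return 0;
}
```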
How to determine optimal thread size?
SMs create, manage, schedule, and execute groupings of 32 threads from within a block, called warps; a thread count that is a multiple of 32 avoids partially filled warps.
How to coordinate thread work when the input is smaller than the thread count?
Each thread computes a unique global index:
threadIdx.x + blockIdx.x * blockDim.x
and only does work if that index is within the input bounds (e.g. thread 3 of block 1 with blockDim.x = 4 has global index 3 + 1 * 4 = 7).
What is a grid stride?
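A minimal bounds-checked kernel (name and signature are illustrative):

```cuda
__global__ void doubleElements(float *a, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // unique global index for this thread
    if (i < n)  // threads whose index falls outside the input do nothing
    {
        a[i] *= 2;
    }
}
```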
Each thread starts at its global index (threadIdx.x + blockIdx.x * blockDim.x) and steps through the data by the total number of threads in the grid (blockDim.x * gridDim.x):
for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += blockDim.x * gridDim.x)
{
    c[i] = 2 * a[i] + b[i];
}
Common way of handling CUDA API errors?
cudaError_t mallocErr = cudaMallocManaged(&a, size);
if (mallocErr != cudaSuccess) {
printf("Error: %s\n", cudaGetErrorString(mallocErr));
return 1;
}
How to check for CUDA kernel launch errors?
someKernel<<<1, -1>>>(); // deliberately invalid configuration: -1 threads per block
cudaError_t kernelLaunchErr = cudaGetLastError();
if (kernelLaunchErr != cudaSuccess) {
printf("Kernel launch error: %s\n", cudaGetErrorString(kernelLaunchErr));
return 2;
}
How to compile a CUDA program and run it immediately
nvcc -o hello-gpu 01-hello/01-hello-gpu.cu -run
Generate a profiling report with a summary printed to the screen
nsys profile --stats=true ./single-thread-vector-add
What are the profiling report sections?
- osrt_sum (OS runtime call summary)
- cuda_api_sum (CUDA API call summary)
- cuda_gpu_kern_sum (GPU kernel summary)
- cuda_gpu_mem_time_sum (GPU memory operations by time)
- cuda_gpu_mem_size_sum (GPU memory operations by size)