CUDA C API - Basics Flashcards

1
Q

What is a CUDA kernel? How do you define one?

A

A kernel is a function that runs on the GPU. It is declared with the __global__ specifier and must return void:

__global__ void myKernel() { /* Code here */ }

2
Q

How do you launch a CUDA kernel?

A

Kernels are launched with an execution configuration:

myKernel<<<numBlocks, threadsPerBlock>>>();

3
Q

What are CUDA threads?

A

CUDA divides execution into threads, which are small execution units that run the same kernel function in parallel.
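For illustration, a minimal sketch (the kernel name and launch configuration are made up for this example): every launched thread executes the same kernel body, distinguished only by its built-in index.

#include <cstdio>

__global__ void whoAmI() {
    // Every thread runs this same body; only threadIdx.x differs.
    printf("Hello from thread %d\n", threadIdx.x);
}

int main() {
    whoAmI<<<1, 4>>>();       // 4 threads run the kernel in parallel
    cudaDeviceSynchronize();  // wait so the prints appear before exit
    return 0;
}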

4
Q

What is a CUDA Block?

A
  • A group of threads that execute a kernel function together and share memory.
  • Every block in the grid contains the same number of threads.
5
Q

What is a CUDA Grid?

A

A collection of blocks that run a kernel function together.

6
Q

How to wait for a GPU kernel to finish?

A

cudaDeviceSynchronize(): a CUDA function that blocks the CPU until all previously launched GPU work has finished.
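Kernel launches are asynchronous with respect to the host, which is why this pattern is common (the kernel name is a placeholder):

myKernel<<<1, 256>>>();   // returns immediately; work is queued on the GPU
cudaDeviceSynchronize();  // CPU blocks here until the kernel has finished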

7
Q

What is the limit to the number of threads that can exist in a thread block?

A

1024
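This limit holds on current CUDA-capable GPUs, and it can be queried at runtime rather than assumed; a minimal sketch:

cudaDeviceProp props;
cudaGetDeviceProperties(&props, 0);  // properties of device 0
printf("Max threads per block: %d\n", props.maxThreadsPerBlock);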

8
Q

Kernel execution configuration in 3D (when processing matrices, for instance)

A
dim3 threads_per_block(16, 16, 1);
dim3 number_of_blocks(16, 16, 1);
someKernel<<<number_of_blocks, threads_per_block>>>();
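A hedged sketch of how a kernel body typically maps such a 2D configuration onto matrix coordinates (the kernel name, matrix pointer, and dimensions are illustrative assumptions):

__global__ void matrixKernel(float *m, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x spans columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y spans rows
    if (row < height && col < width) {
        m[row * width + col] *= 2.0f;  // e.g. double each element
    }
}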
9
Q

Get the total number (in each of the three dimensions) of:
- blocks in a grid
- threads in a block

A
  • gridDim.x|y|z is the number of blocks in the grid.
  • blockDim.x|y|z is the number of threads in a block.
10
Q

Get:
- index of the current block
- index of the thread within a block

A
  • blockIdx.x|y|z is the index of the current block within the grid.
  • threadIdx.x|y|z describes the index of the thread within a block.
11
Q

What are Streaming Multiprocessors?

A
  • Streaming Multiprocessors (SMs) are the GPU’s computational units that execute CUDA threads in parallel. Each SM contains multiple CUDA cores, registers, shared memory, and scheduling units.
  • Depending on the number of SMs on a GPU, and the requirements of a block, more than one block can be scheduled on an SM.
12
Q

How to determine optimal grid size?

A

A grid size whose number of blocks is a multiple of the number of SMs.
- That way, in each "round" of scheduling, all SMs can be occupied (full utilization).
- The number of SMs should not be hard-coded into a code base (it differs between GPUs).
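A minimal sketch of querying the SM count at runtime and deriving the grid size from it:

int deviceId;
cudaGetDevice(&deviceId);                   // the currently active GPU

cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceId);

int numberOfSMs = props.multiProcessorCount;
int numberOfBlocks = 32 * numberOfSMs;      // some multiple of the SM count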

13
Q

How to determine an optimal block size (number of threads per block)?

A
  • SMs create, manage, schedule, and execute groupings of 32 threads from within a block called warps.
  • Performance gains can be had by choosing a block size that has a number of threads that is a multiple of 32.
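A common sketch that follows this rule of thumb: pick a warp-multiple block size, then round the block count up so all elements are covered (N and someKernel are assumptions for the example):

int threadsPerBlock = 256;  // a multiple of 32
int numberOfBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
someKernel<<<numberOfBlocks, threadsPerBlock>>>();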
14
Q

Coordinate thread work when the input is smaller than the thread count

A

threadIdx.x + blockIdx.x * blockDim.x

  • Example: with blockDim.x == 4, thread 3 in block 1 computes data index 7 (3 + 1 * 4).
  • Code must check that the data index calculated by threadIdx.x + blockIdx.x * blockDim.x is less than N, the number of data elements (attempting to access non-existent elements can cause a runtime error).
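A minimal sketch of this pattern (the kernel name, array, and N are illustrative assumptions):

__global__ void doubleElements(float *a, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) {  // threads mapped past the end of the data do nothing
        a[i] *= 2.0f;
    }
}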
15
Q

What is a grid-stride loop?

A
  • Used when there are more data elements than threads in the grid.
  • In a grid-stride loop, a thread's first element is calculated as usual, with threadIdx.x + blockIdx.x * blockDim.x.
  • The thread then strides forward by the number of threads in the grid (blockDim.x * gridDim.x).
  • It continues in this way until its data index is beyond the number of data elements.
  • With all threads working in this way, all elements are covered.
  • CUDA runs as many blocks in parallel at once as the GPU hardware supports, for massive parallelization.
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += blockDim.x * gridDim.x)
    {
        c[i] = 2 * a[i] + b[i];
    }
16
Q

Common way to handle CUDA API errors?

A
Most CUDA runtime functions return a cudaError_t, which can be checked against cudaSuccess:

  cudaError_t mallocErr = cudaMallocManaged(&a, size);
  if (mallocErr != cudaSuccess) {
    printf("Error: %s\n", cudaGetErrorString(mallocErr));
    return 1;
  }
17
Q

How to check CUDA kernel launch errors?

A
Kernel launches return void, so launch errors are retrieved with cudaGetLastError():

someKernel<<<1, -1>>>();  // -1 threads per block is invalid, so the launch fails
cudaError_t kernelLaunchErr = cudaGetLastError();
if (kernelLaunchErr != cudaSuccess) {
  printf("Kernel launch error: %s\n", cudaGetErrorString(kernelLaunchErr));
  return 2;
}
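Launch errors are distinct from errors that occur while the kernel actually executes; those surface asynchronously and can be caught through the return value of cudaDeviceSynchronize(), as in this sketch:

cudaError_t asyncErr = cudaDeviceSynchronize();
if (asyncErr != cudaSuccess) {
  printf("Kernel execution error: %s\n", cudaGetErrorString(asyncErr));
  return 3;
}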
18
Q

How to compile a CUDA program and run it immediately?

A

nvcc -o hello-gpu 01-hello/01-hello-gpu.cu -run

19
Q

Generate a profiling report with a summary printed to the screen

A

nsys profile --stats=true ./single-thread-vector-add

20
Q

What are the sections of an nsys profiling report?

A
  • Operating System Runtime Summary (osrt_sum)
  • CUDA API Summary (cuda_api_sum)
  • CUDA Kernel Summary (cuda_gpu_kern_sum)
  • CUDA Memory Time Operation Summary (cuda_gpu_mem_time_sum)
  • CUDA Memory Size Operation Summary (cuda_gpu_mem_size_sum)