Lecture 16 Flashcards
What are CPUs optimised for? What are GPUs optimised for?
CPUs are optimised for:
* Low latency
* Programmability
GPUs are optimised for:
* High throughput
* Workloads with little data reuse
What are some properties of a GPU chip?
For the same silicon area as a CPU:
* More ALUs and registers
* Less control logic and fewer caches
How do GPUs achieve simple control?
No branch prediction/prefetching
No out-of-order execution
Groups of threads in lockstep (SIMT)
How do GPUs achieve high throughput?
Large number of execution units, registers, and threads
Thread switching on long latency operations
Fast + wide link to memory
How can GPUs be programmed?
* As graphics cards (OpenGL)
* As General-Purpose GPUs (CUDA, OpenMP)
What is CUDA?
A general-purpose GPU programming platform.
CUDA: a heterogeneous programming model
* Host (CPU) vs Device (GPU)
* Only for NVIDIA GPUs
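The host/device split above can be sketched with a minimal vector-add program (the names `vecAdd`, `d_a`, `d_b`, `d_c`, and the sizes are illustrative, not from the cards):

```cuda
#include <cuda_runtime.h>

// Device side: the kernel, executed by many GPU threads.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) c[i] = a[i] + b[i];
}

// Host side: the CPU allocates device memory and launches the kernel.
int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    // ... fill d_a and d_b with cudaMemcpy(..., cudaMemcpyHostToDevice) ...
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // host launches device code
    cudaDeviceSynchronize();
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```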
True or False.
A CUDA function cannot be both __host__ and __device__.
False. A function declared with both qualifiers is compiled for both host and device.
What are the three function declaration prefixes in CUDA?
__global__: executed on device, called from host
__device__: executed on device, called from device
__host__: executed on host, called from host
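The three qualifiers (and the __host__ __device__ combination from the previous card) can be sketched as follows; the function names are illustrative:

```cuda
// Callable only from device code.
__device__ float square(float x) { return x * x; }

// Compiled twice: callable from both host and device code.
__host__ __device__ float twice(float x) { return 2.0f * x; }

// A kernel: runs on the device, launched from the host with <<<...>>>.
__global__ void compute(float *out) {
    out[threadIdx.x] = square(twice((float)threadIdx.x));
}
```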
What are the 4 levels of the thread hierarchy in GPUs?
Thread:
Executes the kernel once
Warp:
Group of 32 threads executing in lockstep
One thread assigned to each SP (streaming processor) of the streaming multiprocessor (SM)
Zero-cost context switching
Block:
Group of threads assigned to one SM
Runs as a group to completion
Grid:
All the blocks for the kernel organized in a Cartesian Grid
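Inside a kernel, the hierarchy is visible through the built-in index variables; a small sketch (the kernel name and output array are illustrative):

```cuda
__global__ void whereAmI(int *out) {
    // Thread: one execution of this kernel body.
    // Block/Grid: blockIdx locates this block within the grid,
    // threadIdx locates this thread within its block.
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;  // unique across the grid

    // Warp: 32 consecutive threads of a block execute in lockstep.
    int lane = threadIdx.x % 32;  // position within the warp

    out[globalId] = lane;
}
```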
Consider the following CUDA code:
MMKernel<<<dim3(width/32, width/32), dim3(32, 32)>>>(Md, Nd, Pd, width);
a. What does “dim3(width/32, width/32)” specify?
b. What does “dim3(32, 32)” specify?
a. Blocks per grid
b. Threads per block
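Note that width/32 only covers the whole matrix when width is a multiple of 32; a common pattern (a sketch, with an illustrative width) is to round the grid size up and bounds-check in the kernel:

```cuda
int width = 1000;  // illustrative size, not necessarily a multiple of 32
dim3 threadsPerBlock(32, 32);
// Ceiling division so the grid covers all elements.
dim3 blocksPerGrid((width + 31) / 32, (width + 31) / 32);
// MMKernel<<<blocksPerGrid, threadsPerBlock>>>(Md, Nd, Pd, width);
// The kernel would then check its row/column indices against width before writing.
```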
In GPUs, what happens if a block uses more resources than are available on an SM?
The kernel fails to launch.
True or False.
CUDA has more performance portability than OpenCL.
False.
OpenCL has better performance portability but is a bit more awkward to use.
What is an alternative to CUDA?
OpenCL:
Standard for programming accelerators
Similar to CUDA’s program structure
Similar performance considerations
Can run on multiple platforms using different compilers
SYCL:
High level
Single source
Uses buffers for memory
Is there a way to get portable performance from GPUs to CPUs?
Tuned libraries, such as MKL, can be used, but code might still be suboptimal.
How is high throughput achieved in GPUs?
Many ALUs and registers