Lecture 16 Flashcards
What are CPUs optimised for? What are GPUs optimised for?
CPUs are optimised for:
* Low latency
* Programmability
GPUs are optimised for:
* High throughput
* Workloads with little data reuse
What are some properties of a GPU chip?
For the same silicon area as a CPU:
* More ALUs and registers
* Less control logic and fewer caches
How do GPUs achieve simple control?
No branch prediction/prefetching
No out-of-order execution
Groups of threads in lockstep (SIMT)
How do GPUs achieve high throughput?
Large number of execution units, registers, and threads
Thread switching on long latency operations
Fast + wide link to memory
How can GPUs be programmed?
* As graphics cards (OpenGL)
* As General-Purpose GPUs (CUDA, OpenMP)
What is CUDA?
A general-purpose GPU programming platform.
CUDA: a heterogeneous programming model
* Host (CPU) vs Device (GPU)
* Only for NVIDIA GPUs
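The host/device split above can be sketched with a minimal vector-add program (the names `vecAdd`, `d_a`, `d_b`, `d_c`, and the sizes are illustrative, not from the cards):

```cuda
#include <cuda_runtime.h>

// Device side: the kernel, executed by many GPU threads.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) c[i] = a[i] + b[i];
}

// Host side: the CPU allocates device memory and launches the kernel.
int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    // ... fill d_a and d_b with cudaMemcpy(..., cudaMemcpyHostToDevice) ...
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // host launches device code
    cudaDeviceSynchronize();
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```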
True or False.
A CUDA function cannot be both __host__ and __device__.
False. A function declared with both qualifiers is compiled for both host and device.
What are the three function declaration prefixes in CUDA?
__global__: executed on device, called from host
__device__: executed on device, called from device
__host__: executed on host, called from host
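The three qualifiers (and the __host__ __device__ combination from the previous card) can be sketched as follows; the function names are illustrative:

```cuda
// Callable only from device code.
__device__ float square(float x) { return x * x; }

// Compiled twice: callable from both host and device code.
__host__ __device__ float twice(float x) { return 2.0f * x; }

// A kernel: runs on the device, launched from the host with <<<...>>>.
__global__ void compute(float *out) {
    out[threadIdx.x] = square(twice((float)threadIdx.x));
}
```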
What are the 4 levels of the thread hierarchy in GPUs?
Thread:
Executes the kernel once
Warp:
Group of 32 threads executing in lockstep
One thread assigned to each SP (streaming processor) of the streaming multiprocessor (SM)
Zero-cost context switching
Block:
Group of threads assigned to one SM
Runs as a group to completion
Grid:
All the blocks for the kernel organized in a Cartesian Grid
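Inside a kernel, the hierarchy is visible through the built-in index variables; a small sketch (the kernel name and output array are illustrative):

```cuda
__global__ void whereAmI(int *out) {
    // Thread: one execution of this kernel body.
    // Block/Grid: blockIdx locates this block within the grid,
    // threadIdx locates this thread within its block.
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;  // unique across the grid

    // Warp: 32 consecutive threads of a block execute in lockstep.
    int lane = threadIdx.x % 32;  // position within the warp

    out[globalId] = lane;
}
```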
Consider the following CUDA code:
MMKernel<<<dim3(width/32, width/32), dim3(32, 32)>>>(Md, Nd, Pd, width);
a. What does “dim3(width/32, width/32)” specify?
b. What does “dim3(32, 32)” specify?
a. Blocks per grid
b. Threads per block
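Note that width/32 only covers the whole matrix when width is a multiple of 32; a common pattern (a sketch, with an illustrative width) is to round the grid size up and bounds-check in the kernel:

```cuda
int width = 1000;  // illustrative size, not necessarily a multiple of 32
dim3 threadsPerBlock(32, 32);
// Ceiling division so the grid covers all elements.
dim3 blocksPerGrid((width + 31) / 32, (width + 31) / 32);
// MMKernel<<<blocksPerGrid, threadsPerBlock>>>(Md, Nd, Pd, width);
// The kernel would then check its row/column indices against width before writing.
```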
In GPUs, what happens if a block uses more resources than are available on an SM?
The kernel fails to launch.
True or False.
CUDA has more performance portability than OpenCL.
False.
OpenCL has better performance portability but is a bit more awkward to use.
What is an alternative to CUDA?
OpenCL:
Standard for programming accelerators
Similar to CUDA’s program structure
Similar performance considerations
Can run on multiple platforms using different compilers
SYCL:
High level
Single source
Uses buffers for memory
Is there a way to get portable performance from GPUs to CPUs?
Tuned libraries, such as MKL, can be used, but code might still be suboptimal.
How is high throughput achieved in GPUs?
Many ALUs and registers