GPUs (2) Flashcards
Why do GPUs use caches? Is it for the same reason as CPUs?
No. In a CPU, caches hide DRAM latency; in a GPU, multithreading hides DRAM latency, so caches are mainly there to reduce DRAM bandwidth requirements
What are the 3 main types of GPU memory?
- Private local memory
- Shared memory
- Global GPU memory
Describe private local memory
Each thread is allocated private local memory
Used for stack frame, spilling registers
Describe shared memory
Located on each multithreaded processor, not shared between processors
Dynamically allocated to a thread block when the block is created; used for communication between the threads of that block
Describe global GPU memory
Available across GPU and also to host/system processor
Where is private local memory located?
External DRAM so it can be large, can be cached
Where is shared memory located?
Within each multithreaded processor (core) so high bandwidth
Where is global memory located?
In external DRAM, can be cached
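The three memory spaces above map directly onto CUDA source. A minimal sketch (kernel and variable names are illustrative, not from these cards):
__device__ double globalValue;            // global GPU memory: statically allocated, visible to every thread and to the host
__global__ void memorySpaces(double *in)  // 'in' points to global memory the host allocated with cudaMalloc
{
    __shared__ double tile[128];           // shared memory: one copy per thread block, allocated when the block is created
    double v = in[threadIdx.x];            // per-thread automatic variable: kept in a register, or spilled
                                           // to private local memory (external DRAM, cacheable)
    tile[threadIdx.x] = v;                 // threads in the same block communicate through 'tile'
    __syncthreads();
    in[threadIdx.x] = tile[threadIdx.x];   // result written back to global memory
}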
Give 2 examples of GPUs
Fermi, Kepler
What are 2 main languages used to program a GPU?
- CUDA
- OpenCL
What does CUDA stand for?
Compute Unified Device Architecture
How does the programmer split up CUDA code?
Identify which code runs on the CPU (host) and which runs on the GPU (device), split the program accordingly, and annotate the functions
What is a kernel?
A program or function designed to be executed in parallel on the GPU
What is a CUDA thread?
Single stream of instructions from a computation kernel
What is a thread block?
A set of threads that execute the same kernel and can cooperate
What is a grid?
A set of thread blocks that execute the same kernel
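Putting the hierarchy together, each CUDA thread can work out its own position from the built-in index variables. A small sketch (kernel name assumed):
__global__ void whereAmI(int *out)
{
    // threadIdx.x : this thread's index within its thread block
    // blockIdx.x  : this thread block's index within the grid
    // blockDim.x  : threads per block;  gridDim.x : blocks per grid
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    out[globalId] = globalId;
}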
What is strip-mining?
Compiler optimisation where a loop is split into smaller nested loops
Can be done to illustrate the thread hierarchy: the inner loop corresponds to a thread block
Write out the code for the original DAXPY loop strip-mined into 2 parts. In this case how many warps are in a thread block, each of which corresponds to the inner loop?
for (int i = 0; i < 1024; i += 128) {       // each outer iteration -> one thread block
    for (int j = i; j < i + 128; ++j) {     // each inner iteration -> one CUDA thread
        Z[j] = a * X[j] + Y[j];
    }
}
4 warps per thread block (128 threads per block / 32 threads per warp = 4)
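As a sketch of the correspondence (CUDA kernel name and signature assumed): each inner-loop iteration becomes one CUDA thread and each outer-loop iteration becomes one 128-thread block, so the kernel body needs no loops at all:
__global__ void daxpy(double a, double *X, double *Y, double *Z)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // blockDim.x is 128 here
    Z[j] = a * X[j] + Y[j];
}
// Launched as a grid of 1024/128 = 8 thread blocks of 128 threads each
// (the launch syntax appears in a later card).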
Draw a diagram of the grouping of CUDA threads in a GPU executing this code
[Diagram] A grid of 8 thread blocks (1024 elements / 128 per block); each thread block contains 128 CUDA threads, grouped into 4 warps of 32 threads each
How does a CUDA programmer annotate GPU functions?
__device__ or __global__
How does a CUDA programmer annotate CPU functions?
__host__
GPU functions must be called with code dimensions. Write out the structure. What do these specify?
func<<<dimGrid, dimBlock>>>(params)
dimGrid = number of blocks in a grid
dimBlock = number of threads in a block
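For example, launching the strip-mined DAXPY kernel above for 1024 elements (values taken from the earlier card) could look like:
int dimGrid  = 1024 / 128;                 // 8 thread blocks in the grid
int dimBlock = 128;                        // 128 threads per block (4 warps)
daxpy<<<dimGrid, dimBlock>>>(a, X, Y, Z);  // a, X, Y, Z assumed to be set up by the host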
Write DAXPY in C
void daxpy(int n, double a, double *X, double *Y) {
    for (int i = 0; i < n; ++i) {
        Y[i] = a * X[i] + Y[i];
    }
}
How could DAXPY in C be called?
daxpy(n, 2.0, X, Y);
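For comparison, a hedged sketch of the host-side calls a CUDA version might need, assuming a kernel daxpy with the same parameter list as the C function (plus an i < n guard inside):
double *d_X, *d_Y;                                   // device (global GPU memory) copies of X and Y
cudaMalloc((void **)&d_X, n * sizeof(double));
cudaMalloc((void **)&d_Y, n * sizeof(double));
cudaMemcpy(d_X, X, n * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_Y, Y, n * sizeof(double), cudaMemcpyHostToDevice);
daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_X, d_Y);   // 256 threads per block is an assumed choice
cudaMemcpy(Y, d_Y, n * sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_X);
cudaFree(d_Y);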