GPUs Flashcards
What is a GPU?
Graphics Processing Unit
Specialised hardware for graphics computation, highly parallel multiprocessors
List the 6 stages in the graphics pipeline
- Input assembler
- Vertex shader
- Geometry shader
- Setup & rasterise
- Pixel shader
- Output merger
Which 3 stages in the graphics pipeline are programmable?
- Vertex shader
- Geometry shader
- Pixel shader
(all the shaders!!)
Draw the graphics pipeline, showing the unified processor array
.
Tasks pass through the unified processor array ?
several times
GPUs take advantage of ? level parallelism
data
Which can support more threads: CPU or GPU?
GPU - designed to run thousands of threads concurrently, whereas the CPU is optimised for single-threaded performance
What is the design goal of a CPU?
Designed for single-threaded performance
What is the design goal of a GPU?
Designed for throughput
- individual thread performance is not critical
- each core is relatively simple
Write out the code for a DAXPY (Double A X Plus Y) loop
for (int i = 0; i < 1024; ++i) {
    Z[i] = a*X[i] + Y[i];
}
Multiply vector of doubles X by constant a, then add vector of doubles Y
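For reference, the card's loop as a complete, compilable C function (the function name and signature are illustrative):

```c
#include <stddef.h>

/* DAXPY: Z[i] = a*X[i] + Y[i] for every element. */
void daxpy(double a, const double *X, const double *Y, double *Z, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        Z[i] = a * X[i] + Y[i];
    }
}
```

Calling `daxpy(2.0, X, Y, Z, 1024)` reproduces the loop on the card.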
Why can we parallelise a DAXPY loop?
All iterations are independent because there are no data dependencies between them
e.g. Z[0] = a*X[0] + Y[0] is independent of Z[1] = a*X[1] + Y[1]
How do GPUs parallelise code?
Using SIMT
Describe SIMT
Single Instruction, Multiple Threads
Each processor runs many threads concurrently, each with its own registers and memory
All threads execute the same instructions but on different data
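A plain-C sketch of the idea (a software simulation, not real GPU code): four simulated threads, each with its own private index register, are issued the same DAXPY instruction in lock-step but apply it to different data.

```c
#define NUM_THREADS 4

/* Each simulated thread's private state: its own index "register". */
struct thread_state {
    int idx;
};

/* SIMT-style DAXPY: in each round, every thread executes the SAME
 * instruction (Z[idx] = a*X[idx] + Y[idx]) on a DIFFERENT element,
 * selected by its private idx register. n must be a multiple of
 * NUM_THREADS. */
void simt_daxpy(double a, const double *X, const double *Y, double *Z, int n) {
    struct thread_state t[NUM_THREADS];
    for (int base = 0; base < n; base += NUM_THREADS) {
        for (int tid = 0; tid < NUM_THREADS; ++tid)
            t[tid].idx = base + tid;  /* each thread picks its element */
        for (int tid = 0; tid < NUM_THREADS; ++tid)  /* lock-step step */
            Z[t[tid].idx] = a * X[t[tid].idx] + Y[t[tid].idx];
    }
}
```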
What is 1 similarity between SIMD and SIMT?
Both take advantage of data-level parallelism
What are the differences between SIMD and SIMT?
SIMD is a form of vector processing - vector registers each hold several items, vector operations work on vector registers
In SIMT, each instruction operates on threads - same instruction issued to each thread, each with separate state (registers and memory)
i.e. one register storing many data items vs many registers each storing one data item
Draw a diagram showing SIMT execution of the DAXPY loop with 4 threads in lock-step
.
In the case of 4 threads, how many times is the DAXPY kernel called?
Called 256 times (1024 iterations / 4 threads per call), with 4 threads each time
Each thread executes the DAXPY ?
kernel
What is a warp?
Group of threads that are executed simultaneously
All threads in a warp execute the same instruction
What support is there in a GPU for conditional execution?
Use a predicate register to hold conditions, then only execute something if the condition holds - predication
What is a downside of predication?
The thread sits idle if the predicate is false, leading to inefficiency
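A small C sketch of predication on a 4-lane warp (illustrative only): every lane evaluates the condition into a predicate register, the instruction is then issued to all lanes, but lanes whose predicate is false stay idle, which is exactly the wasted work described above.

```c
#define WARP_SIZE 4

/* Predicated execution of: if (X[i] < 0) X[i] = -X[i]; */
void warp_abs(double *X) {
    int pred[WARP_SIZE];

    /* Step 1: all lanes evaluate the condition into a predicate register. */
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        pred[lane] = (X[lane] < 0.0);

    /* Step 2: the negate instruction is issued to every lane, but
     * lanes with pred == 0 sit idle for this cycle. */
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        if (pred[lane])
            X[lane] = -X[lane];
}
```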
Draw a diagram to show how conditional branches in GPUs are treated
.
When might we need more complex branching?
Procedure calls and returns, nested if-then-else, etc.
How is more complex branching handled?
Handled by hardware with special instructions
Each thread has its own stack that predicates are pushed to and popped from
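The per-thread stack can be sketched in software like this (a simplified model; the real hardware instructions and stack format are GPU-specific): entering a nested branch pushes the current predicate and ANDs in the new condition; leaving the branch pops the stack to restore the enclosing predicate.

```c
#define STACK_DEPTH 8

/* A simplified per-thread predicate stack for nested conditionals. */
struct pred_stack {
    int bits[STACK_DEPTH];
    int top;     /* index of next free slot */
    int active;  /* current predicate: is this thread active? */
};

/* Entering a nested branch: push the old predicate, AND in the new one. */
void push_pred(struct pred_stack *s, int cond) {
    s->bits[s->top++] = s->active;
    s->active = s->active && cond;
}

/* Leaving the branch: pop to restore the enclosing predicate. */
void pop_pred(struct pred_stack *s) {
    s->active = s->bits[--s->top];
}
```

With this model, a thread inside `if (a) { if (b) { ... } }` is active only while both predicates hold, and becomes active again as each branch is popped.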