GPUs Flashcards
How do the requirements of CPU and GPU programs differ?
CPU: Requires low latency
GPU: Requires high throughput
What is a key property of graphics processing stages?
The stages depend on each other, but the data elements within a stage are independent - perfect for parallelism
What execution model do GPUs use?
SIMT - Single Instruction Multiple Threads
Each thread computes on a different data element
Threads within a warp execute in lockstep (same instruction at the same time)
Enables parallel execution without per-thread control overhead
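A minimal CUDA sketch of SIMT (kernel name and launch configuration are illustrative): every thread runs the same kernel code but indexes its own element.

    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index per thread
        if (i < n)
            x[i] = a * x[i];  // same instruction, different data element
    }
    // launch one thread per element: scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);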
Compare MIMD, SIMD, SIMT
MIMD: independent threads across multiple cores; handles control-flow-heavy programs well; each thread finishes as soon as possible. Exploits DLP only indirectly
SIMD: one thread, multiple data elements; cannot scale to huge amounts of data; increased DLP; decent latency; control and vector overhead
SIMT: many threads in lockstep - all executing at the same point in the program. Massive throughput and parallelism; scales to very large amounts of data; control flow is difficult; high latency
GPUs: do not support arbitrarily large vectors - work must map onto warps
What is the GPU architecture (NVIDIA)?
Graphics Processing Clusters (GPCs)
-> Texture Processing Clusters (TPCs)
-> Streaming Multiprocessors (SMs)
-> cores divided into processing blocks
-> 1 thread per core
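Only part of this hierarchy is visible to software; a hedged sketch of querying the SM count and warp size through the CUDA runtime API:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // properties of device 0
        printf("SMs: %d, warp size: %d\n",
               prop.multiProcessorCount, prop.warpSize);
        return 0;
    }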
What is a Streaming Multiprocessor?
Coordinates the execution of warps across its processing blocks. Each processing block has its own register file, execution units, and several resident warps.
L1 cache and shared memory are shared across the cores within an SM.
Each processing block has multiple warps to choose from, executing up to 32 threads at a time. The warp scheduler picks among the available warps in a round-robin fashion, skipping inactive/blocked ones.
Describe the flow in an SM
L1 instruction cache -> warp scheduler (-> picks a warp) -> register file -> operands -> functional units (FUs)
Operands can also come from the constant cache, the L1 cache, and shared memory. Data placed in shared memory is available to all threads that share it.
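A minimal CUDA sketch of shared-memory sharing (names are illustrative): one thread stores a value, and after a barrier every thread in the block can read it.

    __global__ void broadcast(float *out, const float *in) {
        __shared__ float s;      // lives in the SM's shared memory, one copy per block
        if (threadIdx.x == 0)
            s = in[blockIdx.x];  // a single thread loads the value
        __syncthreads();         // barrier: the store becomes visible to the whole block
        out[blockIdx.x * blockDim.x + threadIdx.x] = s;  // every thread reads it
    }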
What do GPUs do on a stall?
Switch to another ready warp - hides the latency while limiting control overhead
What is a warp?
A group of threads that all execute the same instruction, but on different data elements. All threads within a warp must be at the same point in the program.
All threads in a GPU execute as part of a warp.
Warps are scheduled by a warp scheduler.
Each processing block has several warp schedulers, each with several warps to choose from.
An SM contains multiple such processing blocks.
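With the usual warp size of 32, a thread's warp and lane within a 1-D block can be derived from its thread index (a sketch of device-side code):

    int warpId = threadIdx.x / 32;  // which warp of the block this thread belongs to
    int laneId = threadIdx.x % 32;  // the thread's position within its warp (0..31)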
What happens if one thread in a warp is stalled?
The entire warp is stalled
What is divergence?
When a warp executes a control statement, the threads, each operating on different data, may evaluate it differently.
The processing block must execute every path taken by at least one thread in the warp. The different paths cannot be executed in parallel, so execution time is wasted.
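A CUDA sketch of divergence (kernel name is illustrative): if the condition differs across the lanes of a warp, the two paths are serialized with threads masked off.

    __global__ void diverge(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (x[i] < 0.0f)
            x[i] = -x[i];        // path A: runs first, other lanes masked off
        else
            x[i] = 2.0f * x[i];  // path B: runs afterwards, path-A lanes masked off
    }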
What is a predicate operation?
Control statements are converted into predicate operations, where one of two values is selected based on the predicate value.
Instead of using predication to execute both sides of a branch, it can directly evaluate the control statement and select a specific value based on the result.
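A sketch of the source-level pattern (illustrative kernel): a conditional assignment the compiler can turn into a predicated select instead of a branch; whether it does so is up to the compiler.

    __global__ void clampNeg(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = (x[i] < 0.0f) ? 0.0f : x[i];  // select one of two values by predicate
    }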
How are branches supported on GPUs today?
Supported as part of the programming model.
The GPU still cannot execute divergent branch paths simultaneously; an execution mask decides which threads execute which path.
Paths that are completely dead (mask = 0x0) can be skipped.
Dedicated HW structures keep track of branches - cheaper than encoding a predication scheme in software.
Programmers can move edge-case conditions outside the main execution block to exploit greater parallelism in the normal case (see the sketch below).
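A sketch of that restructuring (illustrative kernel): the boundary check is hoisted to an early exit, so only the last warp can diverge and the main computation stays uniform.

    __global__ void process(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;         // edge case handled up front
        x[i] = x[i] * x[i] + 1.0f;  // common case: all active lanes take the same path
    }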
What is selp (select with predicate)?
A predicated select, implemented in CUDA/PTX.
Allows single-cycle execution of what would previously have been a control statement.
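A hedged sketch of where it shows up: the ternary below can compile to predicated PTX rather than a branch (the exact instructions depend on the compiler).

    float v = x[i];
    x[i] = (v > 0.0f) ? v : 0.0f;
    // may compile to, roughly:
    //   setp.gt.f32 %p1, %f1, 0f00000000;       // %p1 = (v > 0.0f)
    //   selp.f32    %f2, %f1, 0f00000000, %p1;  // %f2 = %p1 ? v : 0.0f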
What are the memory requirements of GPUs?
High bandwidth
Low latency is not needed, because other warps can be scheduled while one waits
Memory requests are coalesced (grouped together)
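A sketch of coalescing-friendly access (illustrative kernel): consecutive lanes touch consecutive addresses, so a warp's 32 loads combine into few memory transactions.

    __global__ void copy(float *dst, const float *src, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            dst[i] = src[i];  // lane k reads src[base + k]: coalesced into one transaction
        // by contrast, src[i * 32] would scatter a warp's reads
        // across 32 separate transactions
    }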