GPUs Flashcards
What are the different requirements of CPU and GPU programs?
CPU programs: require low latency
GPU programs: require high throughput
What is a key property of the graphics processing stages?
The stages are dependent, but the data elements within each stage are independent - perfect for parallelism
What execution model do GPUs use?
SIMT - Single Instruction, Multiple Threads
Each thread computes on different data elements.
Threads within a warp must be executed in lockstep (same instruction at the same time).
Enables parallel execution without per-thread control overhead.
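A minimal CUDA sketch of the SIMT idea (kernel and array names are hypothetical): one instruction stream, many threads, and each thread uses its own index to touch a different data element.

// SIMT sketch: every thread runs the same code, but on its own element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    if (i < n)                                      // guard the tail
        data[i] *= factor;
}

// Host side: launch enough 256-thread blocks to cover n elements.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);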
Compare MIMD, SIMD, SIMT
MIMD: independent threads on multiple cores; handles control-flow-heavy programs well; each thread finishes as soon as possible. Exploits DLP only indirectly.
SIMD: one thread operating on multiple data elements; increased DLP and decent latency, but cannot scale to huge amounts of data and carries control and vector overhead.
SIMT: multiple threads in lockstep, all at the same point in the program. Massive throughput and parallelism, scales to very large data volumes, but control flow is difficult and latency is high.
GPUs do not support arbitrarily large vectors; the work must map onto warps.
What is the GPU architecture (NVIDIA)?
Graphics Processing Clusters (GPCs)
-> Texture Processing Clusters (TPCs)
-> Streaming Multiprocessors (SMs)
-> cores divided into processing blocks
-> 1 thread per core
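Parts of this hierarchy can be queried through the CUDA runtime; a small sketch using cudaGetDeviceProperties (the printed labels are just illustrative):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Warp size: %d threads\n", prop.warpSize);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}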
What is a Streaming Multiprocessor?
Coordinates the execution of warps across its processing blocks. Each processing block has a shared register file, execution units and several resident warps.
The L1 cache and shared memory are shared across the cores within the SM.
Each processing block has multiple warps to choose from and executes up to 32 threads at a time. The warp scheduler chooses among the available warps in a round-robin fashion, skipping inactive/blocked ones.
Describe the flow in an SM
L1 instruction cache -> warp scheduler (selects a warp) -> register file -> operands -> functional units (FUs)
Operands can also come from the constant cache, the L1/shared cache and shared memory. Data placed in shared memory is available to all threads sharing it.
What do GPUs do on a stall?
Switch to another ready warp - hides the stall with little control overhead
What is a warp?
A grouping of threads that all execute the same instruction, but on different data elements. All threads within a warp must be at the same point in the program.
All threads on a GPU are executed as part of a warp.
A warp is scheduled by a warp scheduler.
Each processing block has its own warp scheduler with several warps to choose from.
An SM contains multiple such processing blocks.
What happens if one thread in a warp is stalled?
The entire warp is stalled
What is divergence?
When a warp executes a control statement: since each thread operates on different data, the statement may evaluate differently across the threads.
The processing block must then execute every path taken by at least one thread in the warp. The different paths cannot be executed in parallel, so execution time is wasted.
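A minimal sketch of divergence (kernel and array names are hypothetical): threads in the same warp take different sides of the if, so the warp runs both paths one after the other with the inactive threads masked off each time.

__global__ void divergent(int *out, const int *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Even and odd input values send threads down different paths.
    // Within one warp, both paths are executed serially under a mask.
    if (in[i] % 2 == 0)
        out[i] = in[i] * 2;
    else
        out[i] = in[i] + 1;
}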
What is a predicate operation?
Control statements are converted into predicate operations, where one of two values is selected based on the predicate value.
Instead of using predication to execute both sides of a branch, use it to evaluate the control statement directly and select a specific value based on the result.
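A small sketch of the idea (names are illustrative): writing the branch as a value selection lets the compiler emit a predicated select rather than a jump.

__global__ void clamp_nonneg(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        // Both "sides" are just values; the predicate (x < 0.0f) selects one.
        // This can compile to a single predicated select instead of a branch.
        data[i] = (x < 0.0f) ? 0.0f : x;
    }
}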
How are branches supported in the GPU today?
Supported as part of the programming model.
The GPU still cannot execute divergent branch paths simultaneously; an execution mask decides which threads execute which branch path.
Paths that are completely dead (mask = 0x0) can be skipped.
Dedicated HW structures keep track of branches - cheaper than coding a predication scheme by hand.
Programmers can move edge-case conditions outside the main execution block to exploit greater parallelism in the normal case (see the sketch below).
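One possible illustration of that reordering (a hypothetical pattern, not a prescribed API): handle the rare boundary case up front so the common case runs branch-free and full warps stay converged.

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;            // rare edge case handled outside the main work

    // Common case: straight-line code, no divergence for full warps.
    data[i] = data[i] * 0.5f + 1.0f;
}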
What is self predicate?
Implemented in CUDA.
Allows single-cycle execution of what would previously have been a control statement.
What are the memory requirements of GPUs?
High bandwidth
Low latency is not needed, because other warps can be scheduled while waiting.
Memory requests can be grouped (coalesced) together.
What memory does a normal GPU have?
Shared memory: scratchpad-style (directly accessible by the programmer)
L1 cache per SM
Constant or scalar cache
Shared L2 cache
Device memory - needs high bandwidth
Might have: vector/scalar cache, Infinity Cache (AMD)
What is GDDR?
A memory type used for GPU device memory.
Offers higher bandwidth than standard DDR, at the cost of higher latency.
What is the difference between integrated and discrete GPUs?
Discrete GPUs can use GDDR, while integrated GPUs share memory resources with the rest of the system.
Discrete GPUs are connected through PCIe slots - very high speed.
What are some unique features of GPU memory?
Memory coalescing
Shared memory
What is shared memory in GPUs?
A scratchpad memory that is directly programmer accessible.
Instead of going through memory addresses, the programmer states directly where the data is placed; no address translation is needed.
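A minimal sketch of programmer-managed shared memory (the kernel is hypothetical): each block stages its tile in __shared__ storage, synchronizes, and then every thread in the block can read values written by the others.

// Assumes a launch with 256 threads per block and a data size that is a
// multiple of 256.
__global__ void reverse_in_block(float *data) {
    __shared__ float tile[256];          // scratchpad, one tile per block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = data[i];                   // stage the element in shared memory
    __syncthreads();                     // make it visible to the whole block

    data[i] = tile[blockDim.x - 1 - t];  // read a value another thread wrote
}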
What is memory coalescing?
Group together memory requests to the same cache line into one request
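A hedged illustration (kernel names are made up): when consecutive threads touch consecutive addresses, a warp's loads merge into a few cache-line requests; a strided pattern does not coalesce.

__global__ void coalesced(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];           // neighbouring threads read neighbouring addresses
}

__global__ void strided(float *out, const float *in, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];  // each thread may hit a different cache line
}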
How have GPUs developed in recent years?
Increased cache size
More heterogeneous, accelerating specific tasks due to specific demands
What are GPU kernels?
Programs that can execute on the GPU
How does GPU programming work?
Use compilers that can create kernels.
Program with GPU directives and primitives that are compiled into a kernel.
The rest of the program is compiled into a CPU program that invokes the right drivers to run the kernel.
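A minimal host-side sketch of that flow in CUDA (kernel and buffer names are hypothetical): the CPU part allocates device memory, copies data over, launches the kernel through the runtime/driver, and copies the result back.

#include <cuda_runtime.h>

__global__ void add_one(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n]();                        // host buffer
    float *d;
    cudaMalloc(&d, n * sizeof(float));                // device buffer
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    add_one<<<(n + 255) / 256, 256>>>(d, n);          // kernel launch
    cudaDeviceSynchronize();

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    delete[] h;
    return 0;
}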
What is SYCL?
Open standard for writing efficient code for heterogeneous applications (using GPUs/FPGAs)
A unified language in which you can target different APIs for different accelerators.
Prevents the need for separate CPU and GPU development tracks.
What is Vulkan?
Offers more direct access to GPU function calls.
Requires the programmer to control core parts of the graphics pipeline directly, but enables a lot for expert users.
Why are GPUs important for AI?
AI training is inherently parallel.
For deep neural networks, GPUs allow efficient training.
GPU evolution brought AI evolution.
What is TensorFlow?
Open-source library for machine learning with GPU support.
Focuses on ease of use.
Were GPUs designed for AI?
No, originally designed for graphics and retro-fitted for AI.
This is apparent when looking at features of GPUs that provide little benefit to AI.
What is the clock speed of GPUs compared to CPUs?
Much lower for GPUs