Chapter 6 - CUDA Flashcards
What are GPGPUs?
General purpose GPUs
What is CUDA?
API for general purpose programming on GPUs.
What is the architecture of GPUs?
SIMD (single control unit, multiple datapaths). An instruction is fetched from memory and broadcast to the datapaths. Each datapath executes the instruction on its own data, or remains idle.
GPUs consist of one or more SIMD processors.
A Streaming Multiprocessor (SM) consists of one or more SIMD processors.
On GPUs, the datapaths are called cores or streaming processors (SPs). There can be multiple SPs within an SM.
What does a SM consist of?
Can have several control units, and many datapaths
Operates asynchronously.
No penalty if one SM executes one branch of a conditional (if) while another SM executes the other (else)
Small block of memory shared amongst all SPs
What is the Host?
CPU and its associated memory
What is the device?
GPU and its associated memory
Draw a diagram of a CPU and a GPU
What is heterogeneous computing?
Writing programs that will run on devices with different architectures.
What is a kernel?
A function launched by the host but executed on the device.
Declared with the __global__ qualifier.
Must return void.
How can threads running in CUDA be synchronized?
cudaDeviceSynchronize();
Main waits until the device threads have finished executing the kernel.
How are kernels run?
kernel_name<<<grid_dims, block_dims>>>(args);
The total number of threads started on the GPU is the number of blocks (grid_dims) times the number of threads per block (block_dims)
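A minimal sketch of defining and launching a kernel, assuming a made-up kernel name hello_kernel with 2 blocks of 4 threads each:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello_kernel() {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello_kernel<<<2, 4>>>();   // grid of 2 blocks, 4 threads per block = 8 threads
    cudaDeviceSynchronize();    // wait until the kernel has finished
    return 0;
}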
What variables are defined by CUDA when a kernel is started?
Four structs, each with an x-, y-, and z-field
threadIdx: defines rank of thread within block
blockDim: dimensions of the blocks
blockIdx: rank of current block in grid
gridDim: Dimensions of grid
What does it mean that a call to a kernel is asynchronous?
The call returns immediately.
To avoid the host terminating before kernels have completed execution, cudaDeviceSynchronize can be used.
How are threads organized in a kernel?
Each individual thread will execute on one SP.
<<<grid_dims, block_dims>>>
grid_dims: how many blocks in each dimension of the grid (roughly, how many SMs to utilize)
block_dims: how many threads in each dimension of a block (roughly, how many SPs to utilize on a single SM)
if grid_dims or block_dims has the type integer, the x-value of the grid- and block-dimension will be that integer. The y- and z-dimensions will be set to 1.
To specify the y- and z- dimension in addition to x, the grid_dims variable must be of type dim3
How is a dimension variable specified using dim3?
dim3 grid_dims;
grid_dims.x = 2;
grid_dims.y = 3;
grid_dims.z = 4;
What are some conditions of thread blocks?
All blocks have the same dimensions
All blocks can execute independently of each other
All blocks can execute in any order
Thread blocks must be able to complete their execution, regardless of the state of the other blocks.
What is the compute capability?
There are limits on the number of threads in each block, and on the number of blocks.
These limits are determined by the compute capability, which has the form a.b
a: 1, 2, 3, 5, 6, 7, 8
b: 0-7 (depends on the value of a)
The compute capability specifies how many threads each block can have, and how big the dimensions of blocks and grids can be.
How can you calculate the global index of a thread?
x = blockDim.x * blockIdx.x + threadIdx.x
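A minimal sketch of a kernel that uses this global index, assuming hypothetical names vec_add and n:
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;  // global index of this thread
    if (i < n) {   // the last block may contain more threads than there are elements
        c[i] = a[i] + b[i];
    }
}
Launched e.g. as vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n); so that there is at least one thread per element.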
How is memory allocated to devices?
cudaMalloc(void** ptr, size_t size)
Memory allocated on device must be explicitly freed
What does the qualifier __device__ do?
Put in front of functions.
Indicates that a function can only be called from the device.
How is device memory freed?
cudaFree(data); where data is the device pointer that was allocated with cudaMalloc
How is data copied to and from devices?
cudaMemcpy(
void* dest,
const void* src,
size_t count,
cudaMemcpyKind kind
)
kind: where src and dest are located
The function is synchronous: the host waits until previously launched kernels and the copy itself have finished, so no explicit synchronization is needed.
cudaMemcpy(device_x, host_x, size, cudaMemcpyHostToDevice);
cudaMemcpy(host_x, device_x, size, cudaMemcpyDeviceToHost);
host_x and device_x are pointers
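A minimal host-side sketch of the full allocate / copy in / launch / copy back / free cycle (the kernel launch line is only a placeholder; vec_add and n are hypothetical):
int n = 1024;
size_t size = n * sizeof(float);
float* host_x = (float*) malloc(size);
float* device_x;

cudaMalloc((void**) &device_x, size);                        // allocate device memory
cudaMemcpy(device_x, host_x, size, cudaMemcpyHostToDevice);  // host -> device
// vec_add<<<n / 256, 256>>>(device_x, ...);                 // run some kernel on device_x
cudaMemcpy(host_x, device_x, size, cudaMemcpyDeviceToHost);  // device -> host
cudaFree(device_x);                                          // device memory must be freed explicitly
free(host_x);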
What is the return type of a kernel
void
What does kernel-declaration look like?
__global__ void kernelName();
What is a similarity between CUDA, Pthreads, and OpenMP?
CUDA threads are also allocated a stack and local variables
What are warp shuffles?
Functions that allow threads within a warp to read variables stored by other threads in the warp, directly from the registers those threads use.
How is memory laid out in CUDA?
Each SM has its own "shared" memory, accessible to all of the SPs within that SM.
All of the SPs and all of the threads have access to "global" memory.
Shared memory is small but fast.
Global memory is large but slow.
The fastest memory is the registers.
How are local variables stored in cuda?
Depends on the total available storage and the program's memory usage.
If there is enough storage - store in registers.
If not - the local variables are stored in a region of global memory that’s thread-private.
What is a warp?
A set of threads with consecutive ranks belonging to a block.
The number of threads is currently 32 - this could change in future CUDA implementations.
The system-initialized variable warpSize stores the size.
How do threads in a warp behave?
They execute in SIMD fashion.
Threads in different warps can execute different statements; threads within the same warp must execute the same statement.
What does it mean when threads are said to be diverged?
Threads within a warp attempt to execute different statements.
E.g. take different branches in an if-else statement
When diverged threads have finished executing the different statements, they resume executing the same statements; at that point the threads have converged.
What is a thread’s lane?
How can it be calculated?
Its rank within a warp.
lane = threadIdx.x % warpSize
How does the warp shuffle function __shfl_down_sync work?
__shfl_down_sync(
unsigned mask,
float var,
unsigned diff,
int width=warpSize
)
mask: indicates which threads are participating in the call. A bit representing a thread's lane must be set for each participating thread; this ensures all threads in the call have converged (arrived at the call) before any of them executes the function.
var: when the thread with lane x calls the function, the value stored in var on the thread with lane (x + diff) is returned to thread x
diff: unsigned, so it is >= 0; the value returned therefore comes from a higher lane (hence the name shfl_down)
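A minimal sketch of a warp sum reduction built on __shfl_down_sync, assuming all 32 threads of the warp participate (the function name warp_sum is made up):
__device__ float warp_sum(float my_val) {
    // halve the shuffle distance each step; lane 0 accumulates the total
    for (int diff = warpSize / 2; diff > 0; diff /= 2) {
        my_val += __shfl_down_sync(0xffffffff, my_val, diff);
    }
    return my_val;   // only lane 0 is guaranteed to hold the full sum
}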
What are some possible issues with warp shuffles?
If not all threads in a warp call the function, or the warp does not contain warpSize threads (e.g. the block size is not a multiple of warpSize), the function may return undefined results.
What happens if thread x calls a warp shuffle but the lane it reads from does not?
The value returned to x is undefined.
How can you create shared variables?
__shared__ float shared_vals[32];
This definition must be inside the kernel.
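A minimal sketch of using such a shared array so that threads in a block can read each other's values (assumes a one-dimensional block of exactly 32 threads; the kernel name is made up):
__global__ void reverse_block(float* data) {
    __shared__ float shared_vals[32];   // visible to all threads in the block
    int i = threadIdx.x;
    shared_vals[i] = data[i];           // each thread writes its own element
    __syncthreads();                    // wait until all threads have written
    data[i] = shared_vals[31 - i];      // read an element written by another thread
}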
What is an SM?
Streaming Multiprocessor (SM)
Closest thing the GPU has to a CPU core
128 32-bit floating point units (128 ‘cores’)
What are similarities between CPUs and GPUs?
Both are general purpose processor cores
Both use superscalar architectures
How are threads executed in GPUs?
Executed in groups of 32 (warps)
All threads in a warp execute the same instruction (GPUs are designed to do the same thing over and over)
What is a main reason for using warps?
Only one decode needed for 32 threads
What is thread divergence?
When a branch occurs, all threads in the warp must participate, even if the branch is not on their execution path.
if: If one thread chooses one of the paths, the other threads must participate in executing that path
loop: all threads must participate until the final thread has finished iterating
Thread 1 (condition true): executes the if-branch, a = 3; idles through the else-branch.
Thread 2 (condition false): idles through the if-branch; executes the else-branch, a = 10.
Both threads step through both paths, but only do useful work on their own path (see the sketch below).
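A minimal sketch of such a divergent branch (the condition and values are made up):
__global__ void diverge(int* out) {
    int a;
    if (threadIdx.x % 2 == 0) {
        a = 3;    // even lanes do useful work here, odd lanes idle
    } else {
        a = 10;   // odd lanes do useful work here, even lanes idle
    }
    out[threadIdx.x] = a;   // the warp has converged again at this point
}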
When does thread divergence become a problem?
If threads are following different execution paths.
Long if/else where threads are likely to choose different paths
Loops with highly variable number of iterations
Fixes:
- rewrite the branch using math
- rewrite multiple nested loops as one single loop
What class of parallelism does warp-based execution use in Flynn’s taxonomy?
SIMT
What is the difference between SIMD and SIMT?
SIMD: Single instruction executes multiple data operations in a single thread
SIMT: A single instruction executes a single data operation in multiple threads
How does the CPU do a context switch?
Store PC
Store all register to stack
Load registers from new thread from stack
Jump to PC of new thread
Continue execution
How is context switching done on GPUs?
Context switch every single clock cycle
The register file stores the registers of all resident threads, so no registers need to be swapped
Each cycle, 4 warps are chosen that are able to execute an instruction
What is the memory system of GPUs?
Shared L2 cache
Each SM has its own L1 cache plus a separate instruction cache. Part of the L1 cache can be used as temporary storage known as shared memory
What is the measure of GPU utilisation?
Occupancy = cycles the SM is busy / total cycles
What is a GPU grid?
Contains all threads we want to run.
dim3 gridDim = {x, y, z}
What is a GPU block?
Collection of threads, all running within a single SM
dim3 blockDim = {x, y, z}
What happens when a kernel is launched?
A queue of blocks is created
A block runs until all of its threads have completed
Blocks run in any order
Once a block has completed, the next block is allocated to that SM
Blocks should be dimensioned such that they contain a multiple of 32 threads
What is the desired dimensions of a block?
A multiple of 32 threads
What does __syncthreads() do?
Enables interaction between threads within a block.
Acts as a barrier for all threads in the block
What mechanism does GPUs have to avoid race conditions?
Atomic operations
int atomicAdd(int* address, int val)
int atomicSub(int* address, int val)
int atomicMax(int* address, int val)
What is shared memory?
Part of the SMs L1 cache used for temporary storage.
Useful if the same data will be used many times.
Can be used to communicate between warps in a block
Allocated on a per-block basis
Stores:
- intermediate results
- data to be reused
- exchange values between warps in a block
What limits do SMs have?
Number of simultaneously executing threads
Misconfigurations can decrease performance
In the Ada Lovelace architecture:
- Max threads per block: 1024
- Blocks per SM: 24
- Warps per SM: 48 (1536 threads)
- Registers per thread: 255
- Shared memory: 100 KB
Register requirements limit the number of warps that can be executed simultaneously in an SM
Blocks have a fixed number of warps and cannot be partially allocated to an SM, even if the SM's register file has space for some of the warps in the block
Shared memory required by each block limits the number of warps
How are kernels launched?
dim3 gridDims = {1, 2, 3};
dim3 blockDims = {2, 2, 3};
kernelName<<<gridDims, blockDims>>>(param1, param2);
No dimension can be 0
How are registers used?
Each thread has a fixed number of registers
A warp uses 32x that number of registers
register values are kept in register files within the SM
register requirements limit the number of warps that can be executed simultaneously in an SM
What are some block size trade-offs?
Large:
- More ability to cooperate between threads
- better reuse of shared memory
Small:
- Less waiting for all warps to complete
- May improve occupancy by allowing more warps to execute simultaneously
What are cooperative groups?
Threads in blocks and warps partitioned into groups that work together.
Collaboration stems from communication between threads within the same warp being cheap
How do you implement cooperative groups?
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
What are some types of cooperative groups?
Coalesced group
Block group
Grid group
Cluster group
What is a Coalesced group?
Threads in the current warp, but only the ones that are executing at that point in time
Example:
__global__ void kernel(){
    if (threadIdx.x < 12) return;
    cg::coalesced_group warp = cg::coalesced_threads();
}
20 threads would be active in the warp
What is a block group?
A group with all threads in the current block
cg::thread_block block = cg::this_thread_block();
What is a grid group?
Group of all threads in the entire grid
cg::grid_group grid = cg::this_grid();
Cannot be launched with <<<...>>>; must use:
cudaLaunchCooperativeKernel(kernel, gridDims, blockDims, args);
What is a cluster group?
A union of multiple thread blocks
What is group partitioning?
Creating smaller subgroups from larger ones
How are barriers used on groups?
group.sync();
Works on any group or partition
How are instructions executed in terms of GPU structure?
All threads are executed grouped as a warp
What are the 8 GPU limits?
1. Max 1024 threads per block
2. Max 24 blocks per SM
3. Max 48 warps per SM
4. Max 255 registers per thread
5. Shared memory: 100 KB
6. Register requirements limit the number of warps that can be executed simultaneously in an SM
7. Blocks have a constant number of warps and cannot be partially allocated to an SM
8. Shared memory required by each block limits the number of warps
How do register requirements limit the number of warps that can be executed simultaneously in an SM?
Each warp needs 32x the per-thread register count, and all resident warps share the SM's fixed-size register file; once the register file is full, no further warps can be scheduled on that SM.
How does the shared memory required by each block limit the number of warps?
Each resident block gets its own allocation out of the SM's fixed shared memory (e.g. 100 KB); once that is used up, no further blocks - and therefore no further warps - can be placed on the SM.
What are two common issues with memory in GPUs?
Non-coalesced memory reads
Atomic write contention
What are coalesced memory reads?
GPU cache lines: often 128 bytes (32 threads * 4 bytes)
Memory allocated with cudaMalloc is always aligned, meaning byte 0 is the start of a cache line.
Even if only one byte is read, the whole cache line must be loaded.
Coalesced reads are when 100% of the bandwidth is utilised: the warp reads one cache line, and each thread reads 4 bytes of it.
4 bytes * 32 threads = 128 bytes (the whole cache line)
Code an example where each thread uses only one value from a cache line (a non-coalesced read)
__global__ void kernel(int* array, int n){
    int value = array[32 * threadIdx.x];
}
As each int is 4 bytes, consecutive threads read addresses 32 * 4 = 128 bytes apart, meaning each thread reads from a different cache line.
One cache line must be read in per thread.
Give a coding example of a coalesced read
__global__ void kernel(int* array, int n){
    int value = array[threadIdx.x];
}
What is atomic write contention?
When multiple threads try to do an atomic operation on the same address, their effects are serialised because the threads need to wait for each other.
__global__ void kernel(int* array, int* oddCount){
    int value = array[threadIdx.x];
    if (value % 2 == 1) {
        atomicAdd(oddCount, 1);
    }
}
What is a solution to atomic write contention?
Do a reduction within the warp, then have a single thread do the atomic operation
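A minimal sketch of that fix for the odd-count example above, assuming the whole warp participates (mask 0xffffffff) and using __shfl_down_sync for the in-warp reduction:
__global__ void kernel(int* array, int* oddCount) {
    int value = array[threadIdx.x];
    int isOdd = (value % 2 == 1) ? 1 : 0;
    // reduce the 32 per-thread flags to one count inside the warp
    for (int diff = warpSize / 2; diff > 0; diff /= 2) {
        isOdd += __shfl_down_sync(0xffffffff, isOdd, diff);
    }
    if (threadIdx.x % warpSize == 0) {
        atomicAdd(oddCount, isOdd);   // one atomic per warp instead of up to 32
    }
}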
What is a common memory problem when using arrays of structs, and how can it be solved?
Misaligned, strided reads
struct Int3 {
    int x;
    int y;
    int z;
};
Int3 var;
var.x = varArray[threadIdx.x].x;
Memory layout: x y z | x y z | x y z | x y z
index:           0       1       2       3
Each thread uses only the x field of the struct it reads, and consecutive threads read addresses 12 bytes apart, so much of every cache line that is loaded goes unused.
Solve by using a struct of arrays: this stores all x values together, then all y values, then all z values.
struct arrayOfInt3 {
    int* x;
    int* y;
    int* z;
};
arrayOfInt3 soa;
int x_value = soa.x[index];
Memory layout:
x x x x
y y y y
z z z z
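A minimal sketch contrasting the two layouts in kernels (reusing the hypothetical Int3 and arrayOfInt3 definitions above; kernel names are made up):
// Array of structs: consecutive threads read addresses 12 bytes apart (strided)
__global__ void aos_kernel(const Int3* particles, int* out) {
    out[threadIdx.x] = particles[threadIdx.x].x;
}

// Struct of arrays: consecutive threads read consecutive ints (coalesced)
__global__ void soa_kernel(arrayOfInt3 particles, int* out) {
    out[threadIdx.x] = particles.x[threadIdx.x];
}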
What is vectorized loads?
If the data type is exactly 4, 8 or 16 bytes, the memory system guarantees the data is loaded in one operation.
A struct of arrays is not needed in this case.
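A minimal sketch of a vectorized load using CUDA's built-in 16-byte int4 vector type (the kernel name is made up):
__global__ void vectorized_kernel(const int4* array, int* out) {
    int4 v = array[threadIdx.x];              // one 16-byte load per thread
    out[threadIdx.x] = v.x + v.y + v.z + v.w;
}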
What is register spilling?
Temporarily write some register values to memory
Upside: can run more threads simultaneously
Downside: more memory transactions
The compiler spills registers when it thinks it will improve performance
How can register spilling be avoided?
Identify parts of kernel where many variables must be kept simultaneously
- nested function calls
- loops
Consider recalculating values if it is not too expensive
Cut fields from structs that are unrelated to the kernel
What is a shuffle instruction?
Exchange values within a warp
Each thread provides a value
Each thread reads a value provided by another thread
__shfl_sync(unsigned mask, T var, int srcLane)
mask: defines which lanes are included in the call, often __activemask() - all currently active threads in the warp
Thread 3 executes this - it provides the value 7 and reads from lane 8:
int out = __shfl_sync(__activemask(), 7, 8);
Thread 10 executes this - it provides the value 11 and receives the value from lane 3, which is 7:
int out = __shfl_sync(__activemask(), 11, 3);
What threads can communicate using shuffle instructions?
Threads must be in the same block
Threads must be in the same warp, or in a cooperative group of up to 32 threads in size
What happens if a thread using shuffle instructions reads from a lane that is not participating?
Undefined result
What are the four shuffle instructions
Read from any thread:
__shfl_sync(mask, var, srcLane)
Read from thread with laneid = (current lane - delta)
__shfl_up_sync(mask, var, delta)
Read from thread with laneid = (current lane + delta)
__shfl_down_sync(mask, var, delta)
Read from thread with laneId XORed with laneMask
__shfl_xor_sync(mask, var, laneMask)
How can shuffle instructions be used to compute a sum?
int sum = threadValue; // threadValue = 12 for this thread
sum += __shfl_xor_sync(__activemask(), sum, 16);
sum += __shfl_xor_sync(__activemask(), sum, 8);
sum += __shfl_xor_sync(__activemask(), sum, 4);
sum += __shfl_xor_sync(__activemask(), sum, 2);
sum += __shfl_xor_sync(__activemask(), sum, 1);
return sum;
This creates a butterfly reduction (cross pattern): first values are exchanged across a distance of 16 lanes (one big cross), then 8 (two crosses), then 4, and so on; at the end every lane holds the full sum.
How can shuffle instructions be used to broadcast values?
Example: lane 0 reserves a block of 32 entries in a buffer and the other lanes need the start index.
int out = __shfl_sync(__activemask(), value, 0);
All threads read from lane 0.
Lane 0 calculates value; afterwards all threads can use value in further computation.
What is warp voting?
Each thread in the warp sets one bit in a 32-bit integer
The bit index corresponds to the lane index
Only active threads vote
What are the warp voting instructions?
Create a 32-bit integer where each lane sets one bit:
unsigned int __ballot_sync(mask, bool predicate)
Returns true if all threads vote true
bool __all_sync(mask, bool predicate)
Returns true if any thread votes true
bool __any_sync(mask, bool predicate)
Reverses the bit order of a 32-bit integer
unsigned int __brev(unsigned int mask)
What are some usecases of warp voting?
Identify elements that should be removed
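A minimal sketch of that use case: each warp compacts its "kept" elements into an output buffer using __ballot_sync, __popc, an atomic reservation, and a broadcast (assumes a single fully active warp; all names are made up):
__global__ void keep_even(const int* in, int* out, int* outCount) {
    int value = in[threadIdx.x];
    bool keep = (value % 2 == 0);
    unsigned vote = __ballot_sync(0xffffffff, keep);           // one bit per lane that keeps
    int lane = threadIdx.x % warpSize;
    int offset = __popc(vote & ((1u << lane) - 1));            // keepers in lower lanes
    int base = 0;
    if (lane == 0) base = atomicAdd(outCount, __popc(vote));   // lane 0 reserves the space
    base = __shfl_sync(0xffffffff, base, 0);                   // broadcast the start index
    if (keep) out[base + offset] = value;
}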
What are warp reductions?
Functions that reduce the values held by the threads of a warp into one result
General form: __reduce_<op>_sync(unsigned mask, unsigned/int value), e.g. __reduce_add_sync, __reduce_min_sync, __reduce_max_sync
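A minimal sketch using one of these functions (__reduce_add_sync needs compute capability 8.0 or newer; the kernel name is made up):
__global__ void warp_totals(const int* in, int* out) {
    int value = in[threadIdx.x];
    int total = __reduce_add_sync(0xffffffff, value);   // every lane gets the warp's sum
    if (threadIdx.x % warpSize == 0) {
        out[threadIdx.x / warpSize] = total;             // store one result per warp
    }
}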