Chapter 6 - CUDA Flashcards
What are GPGPUs?
General purpose GPUs
What is CUDA?
API for general purpose programming on GPUs.
What is the architecture of GPUs?
SIMD (single control unit, multiple datapaths). An instruction is fetched from memory and broadcast to the datapaths. Each datapath executes the instruction on its own data, or remains idle.
GPUs consist of one or more SIMD processors.
A Streaming Multiprocessor (SM) consists of one or more SIMD processors.
On GPUs, the datapaths are called cores or streaming processors (SPs). There can be multiple SPs within an SM.
What does a SM consist of?
Can have several control units, and many datapaths
Operates asynchronously.
No penalty if one SM executes one branch of a conditional (if) while another SM executes the other (else)
Small block of memory shared amongst all SPs
What is the Host?
CPU and its associated memory
What is the device?
GPU and its associated memory
Draw a diagram of a CPU and a GPU
What is heterogeneous computing?
Writing programs that will run on devices with different architectures.
What is a kernel?
A function launched by the host but executed on the device.
Declared with the __global__ qualifier.
Must return void.
How can threads running in CUDA be synchronized?
cudaDeviceSynchronize();
Main waits until the device threads have finished executing the kernel.
How are kernels run?
kernel_name<<<grid_dims, block_dims>>>(args);
The total number of threads started on the GPU is the number of blocks (grid_dims) times the number of threads per block (block_dims)
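A minimal sketch of defining and launching a kernel, assuming a made-up kernel name hello_kernel with 2 blocks of 4 threads each:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello_kernel() {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello_kernel<<<2, 4>>>();   // grid of 2 blocks, 4 threads per block = 8 threads
    cudaDeviceSynchronize();    // wait until the kernel has finished
    return 0;
}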
What variables are defined by CUDA when a kernel is started?
Four structs, each with an x-, y-, and z-field
threadIdx: defines rank of thread within block
blockDim: dimensions of the blocks
blockIdx: rank of current block in grid
gridDim: Dimensions of grid
What does it mean that a call to a kernel is asynchronous?
The call returns immediately.
To avoid the host terminating before kernels have completed execution, cudaDeviceSynchronize can be used.
How are threads organized in a kernel?
Each individual thread will execute on one SP.
<<<grid_dims, block_dims>>>
grid_dims: how many blocks in each dimension of the grid (roughly, how many SMs to utilize)
block_dims: how many threads in each dimension of a block (roughly, how many SPs to utilize on a single SM)
if grid_dims or block_dims has the type integer, the x-value of the grid- and block-dimension will be that integer. The y- and z-dimensions will be set to 1.
To specify the y- and z- dimension in addition to x, the grid_dims variable must be of type dim3
How is a dimension variable specified using dim3?
dim3 grid_dims;
grid_dims.x = 2;
grid_dims.y = 3;
grid_dims.z = 4;
What are some conditions of thread blocks?
All blocks have the same dimensions
All blocks can execute independently of each other
All blocks can execute in any order
Thread blocks must be able to complete their execution, regardless of the state of the other blocks.
What is the compute capability?
There are limits on the number of threads in each block, and on the number of blocks.
These limits are determined by the compute capability, which has the form a.b
a: 1, 2, 3, 5, 6, 7, 8
b: 0-7 (depends on the value of a)
The compute capability specifies how many threads each block can have, and how big the dimensions of blocks and grids can be.
How can you calculate the global index of a thread?
x = blockDim.x * blockIdx.x + threadIdx.x
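A minimal sketch of a kernel that uses this global index, assuming hypothetical names vec_add and n:
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;  // global index of this thread
    if (i < n) {   // the last block may contain more threads than there are elements
        c[i] = a[i] + b[i];
    }
}
Launched e.g. as vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n); so that there is at least one thread per element.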
How is memory allocated to devices?
cudaMalloc(void** ptr, size_t size)
Memory allocated on device must be explicitly freed
What does the qualifier __device__ do?
Put in front of functions.
Indicates that a function can only be called from the device.
How is device memory freed?
cudaFree(data); where data is the device pointer that was allocated with cudaMalloc
How is data copied to and from devices?
cudaMemcpy(
void* dest,
const void* src,
size_t count,
cudaMemcpyKind kind
)
kind: where src and dest are located
The function is synchronous: the host waits until previously launched kernels and the copy itself have finished, so no explicit synchronization is needed.
cudaMemcpy(device_x, host_x, size, cudaMemcpyHostToDevice);
cudaMemcpy(host_x, device_x, size, cudaMemcpyDeviceToHost);
host_x and device_x are pointers
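A minimal host-side sketch of the full allocate / copy in / launch / copy back / free cycle (the kernel launch line is only a placeholder; vec_add and n are hypothetical):
int n = 1024;
size_t size = n * sizeof(float);
float* host_x = (float*) malloc(size);
float* device_x;

cudaMalloc((void**) &device_x, size);                        // allocate device memory
cudaMemcpy(device_x, host_x, size, cudaMemcpyHostToDevice);  // host -> device
// vec_add<<<n / 256, 256>>>(device_x, ...);                 // run some kernel on device_x
cudaMemcpy(host_x, device_x, size, cudaMemcpyDeviceToHost);  // device -> host
cudaFree(device_x);                                          // device memory must be freed explicitly
free(host_x);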
What is the return type of a kernel
void
What does kernel-declaration look like?
__global__ void kernelName();
What is a similarity between CUDA, Pthreads, and OpenMP?
CUDA threads are also allocated a stack and local variables
What are warp shuffles?
Functions that allow threads within a warp to read variables stored by other threads in the warp, directly from the registers those threads use.
How is memory laid out in CUDA?
Each SM has its own "shared" memory, accessible to all of the SPs within that SM.
All of the SPs and all of the threads have access to "global" memory.
Shared memory is small but fast.
Global memory is large but slow.
The fastest memory is the registers.
How are local variables stored in cuda?
Depends on the total available storage and the program's memory usage.
If there is enough storage - store in registers.
If not - the local variables are stored in a region of global memory that’s thread-private.
What is a warp?
A set of threads with consecutive ranks belonging to a block.
The number of threads is currently 32 - this could change in future CUDA implementations.
The system-initialized variable warpSize stores the size.
How do threads in a warp behave?
They execute in SIMD fashion.
Threads in different warps can execute different statements; threads within the same warp must execute the same statement.
What does it mean when threads are said to be diverged?
Threads within a warp attempt to execute different statements.
E.g. take different branches in an if-else statement
When diverged threads have finished executing the different statements, they resume executing the same statements; at that point the threads have converged.
What is a thread’s lane?
How can it be calculated?
Its rank within a warp.
lane = threadIdx.x % warpSize
How does the warp shuffle function __shfl_down_sync work?
__shfl_down_sync(
unsigned mask,
float var,
unsigned diff,
int width=warpSize
)
mask: indicates which threads are participating in the call. A bit representing a thread's lane must be set for each participating thread; this ensures all threads in the call have converged (arrived at the call) before any of them executes the function.
var: when the thread with lane x calls the function, the value stored in var on the thread with lane (x + diff) is returned to thread x
diff: unsigned, so it is >= 0; the value returned therefore comes from a higher lane (hence the name shfl_down)
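A minimal sketch of a warp sum reduction built on __shfl_down_sync, assuming all 32 threads of the warp participate (the function name warp_sum is made up):
__device__ float warp_sum(float my_val) {
    // halve the shuffle distance each step; lane 0 accumulates the total
    for (int diff = warpSize / 2; diff > 0; diff /= 2) {
        my_val += __shfl_down_sync(0xffffffff, my_val, diff);
    }
    return my_val;   // only lane 0 is guaranteed to hold the full sum
}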
What are some possible issues with warp shuffles?
If not all threads in a warp call the function, or the warp does not contain warpSize threads (e.g. the block size is not a multiple of warpSize), the function may return undefined results.
What happens if thread x calls a warp shuffle but the lane it reads from does not?
The value returned to x is undefined.
How can you create shared variables?
__shared__ float shared_vals[32];
This definition must be inside the kernel.
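A minimal sketch of using such a shared array so that threads in a block can read each other's values (assumes a one-dimensional block of exactly 32 threads; the kernel name is made up):
__global__ void reverse_block(float* data) {
    __shared__ float shared_vals[32];   // visible to all threads in the block
    int i = threadIdx.x;
    shared_vals[i] = data[i];           // each thread writes its own element
    __syncthreads();                    // wait until all threads have written
    data[i] = shared_vals[31 - i];      // read an element written by another thread
}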
What is an SM?
Streaming Multiprocessor (SM)
Closest thing the GPU has to a CPU core
128 32-bit floating point units (128 ‘cores’)
What are similarities between CPUs and GPUs?
Both are general purpose processor cores
Both use superscalar architectures
How are threads executed in GPUs?
Executed in groups of 32 (warps)
All threads in a warp execute the same instruction (GPUs are designed to do the same thing over and over)
What is a main reason for using warps?
Only one decode needed for 32 threads
What is thread divergence?
When a branch occurs, all threads in the warp must participate, even if the branch is not on their execution path.
if: If one thread chooses one of the paths, the other threads must participate in executing that path
loop: all threads must participate until the final thread has finished iterating
Thread 1 (condition true): executes the if-branch, a = 3; idles through the else-branch.
Thread 2 (condition false): idles through the if-branch; executes the else-branch, a = 10.
Both threads step through both paths, but only do useful work on their own path (see the sketch below).
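A minimal sketch of such a divergent branch (the condition and values are made up):
__global__ void diverge(int* out) {
    int a;
    if (threadIdx.x % 2 == 0) {
        a = 3;    // even lanes do useful work here, odd lanes idle
    } else {
        a = 10;   // odd lanes do useful work here, even lanes idle
    }
    out[threadIdx.x] = a;   // the warp has converged again at this point
}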
When does thread divergence become a problem?
If threads are following different execution paths.
Long if/else where threads are likely to choose different paths
Loops with highly variable number of iterations
Fixes:
- rewrite the branch using math
- rewrite multiple nested loops as one single loop
What class of parallelism does warp-based execution use in Flynn’s taxonomy?
SIMT
What is the difference between SIMD and SIMT?
SIMD: Single instruction executes multiple data operations in a single thread
SIMT: A single instruction executes a single data operation in multiple threads
How does the CPU do a context switch?
Store PC
Store all register to stack
Load registers from new thread from stack
Jump to PC of new thread
Continue execution
How is context switching done on GPUs?
Context switch every single clock cycle
The register file stores the registers of all resident threads, so no registers need to be swapped
Each cycle, 4 warps are chosen that are able to execute an instruction
What is the memory system of GPUs?
Shared L2 cache
Each SM has its own L1 cache plus a separate instruction cache. Part of the L1 cache can be used as temporary storage known as shared memory
What is the measure of GPU utilisation?
Occupancy = cycles the SM is busy / total cycles
What is a GPU grid?
Contains all threads we want to run.
dim3 gridDim = {x, y, z}
What is a GPU block?
Collection of threads, all running within a single SM
dim3 blockDim = {x, y, z}
What happens when a kernel is launched?
A queue of blocks is created
A block runs until all of its threads have completed
Blocks run in any order
Once a block has completed, the next block is allocated to that SM
Blocks should be dimensioned such that they contain a multiple of 32 threads
What is the desired dimensions of a block?
A multiple of 32 threads
What does __syncthreads() do?
Enables interaction between threads within a block.
Acts as a barrier for all threads in the block
What mechanism does GPUs have to avoid race conditions?
Atomic operations
int atomicAdd(int* address, int val)
int atomicSub(int* address, int val)
int atomicMax(int* address, int val)
What is shared memory?
Part of the SMs L1 cache used for temporary storage.
Useful if the same data will be used many times.
Can be used to communicate between warps in a block
Allocated on a per-block basis
Stores:
- intermediate results
- data to be reused
- exchange values between warps in a block
What limits do SMs have?
Number of simultaneously executing threads
Misconfigurations can decrease performance
In the Ada Lovelace architecture:
- Max threads per block: 1024
- Blocks per SM: 24
- Warps per SM: 48 (1536 threads)
- Registers per thread: 255
- Shared memory: 100 KB
Register requirements limit the number of warps that can be executed simultaneously in an SM
Blocks have a fixed number of warps and cannot be partially allocated to an SM, even if the SM's register file has space for some of the warps in the block
Shared memory required by each block limits the number of warps
How are kernels launched?
dim3 gridDims = {1, 2, 3};
dim3 blockDims = {2, 2, 3};
kernelName<<<gridDims, blockDims>>>(param1, param2);
No dimension can be 0
How are registers used?
Each thread has a fixed number of registers
A warp uses 32x that number of registers
register values are kept in register files within the SM
register requirements limit the number of warps that can be executed simultaneously in an SM
What are some block size trade-offs?
Large:
- More ability to cooperate between threads
- better reuse of shared memory
Small:
- Less waiting for all warps to complete
- May improve occupancy by allowing more warps to execute simultaneously
What are cooperative groups?
Threads in blocks and warps partitioned into groups that work together.
Collaboration stems from communication between threads within the same warp being cheap
How do you implement cooperative groups?
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
What are some types of cooperative groups?
Coalesced group
Block group
Grid group
Cluster group
What is a Coalesced group?
Threads in the current warp, but only the ones that are executing at that point in time
Example:
__global__ void kernel(){
    if (threadIdx.x < 12) return;
    cg::coalesced_group warp = cg::coalesced_threads();
}
20 threads would be active in the warp
What is a block group?
A group with all threads in the current block
cg::thread_block block = cg::this_thread_block();
What is a grid group?
Group of all threads in the entire grid
cg::grid_group grid = cg::this_grid();
Cannot be launched with <<<...>>>; must use:
cudaLaunchCooperativeKernel(kernel, gridDims, blockDims, args);
What is a cluster group?
A union of multiple thread blocks
What is group partitioning?
Creating smaller subgroups from larger ones
How are barriers used on groups?
group.sync();
Works on any group or partition
How are instructions executed in terms of GPU structure?
All threads are executed grouped as a warp
What are the 8 GPU limits?
1. Max 1024 threads per block
2. Max 24 blocks per SM
3. Max 48 warps per SM
4. Max 255 registers per thread
5. Shared memory: 100 KB
6. Register requirements limit the number of warps that can be executed simultaneously in an SM
7. Blocks have a constant number of warps and cannot be partially allocated to an SM
8. Shared memory required by each block limits the number of warps
How do register requirements limit the number of warps that can be executed simultaneously in an SM?
Each warp needs 32x the per-thread register count, and all resident warps share the SM's fixed-size register file; once the register file is full, no further warps can be scheduled on that SM.
How does the shared memory required by each block limit the number of warps?
Each resident block gets its own allocation out of the SM's fixed shared memory (e.g. 100 KB); once that is used up, no further blocks - and therefore no further warps - can be placed on the SM.
What are two common issues with memory in GPUs?
Non-coalesced memory reads
Atomic write contention
What are coalesced memory reads?
GPU cache lines: often 128 bytes (32 threads * 4 bytes)
Memory allocated with cudaMalloc is always aligned, meaning byte 0 is the start of a cache line.
Even if only one byte is read, the whole cache line must be loaded.
Coalesced reads are when 100% of the bandwidth is utilised: the warp reads one cache line, and each thread reads 4 bytes of it.
4 bytes * 32 threads = 128 bytes (the whole cache line)
Code an example where each thread uses only one value from a cache line (a non-coalesced read)
__global__ void kernel(int* array, int n){
    int value = array[32 * threadIdx.x];
}
As each int is 4 bytes, consecutive threads read addresses 32 * 4 = 128 bytes apart, meaning each thread reads from a different cache line.
One cache line must be read in per thread.
Give a coding example of a coalesced read
__global__ void kernel(int* array, int n){
    int value = array[threadIdx.x];
}
What is atomic write contention?
When multiple threads try to do an atomic operation on the same address, their effects are serialised because the threads need to wait for each other.
__global__ void kernel(int* array, int* oddCount){
    int value = array[threadIdx.x];
    if (value % 2 == 1) {
        atomicAdd(oddCount, 1);
    }
}
What is a solution to atomic write contention?
Do a reduction within the warp, then have a single thread do the atomic operation
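A minimal sketch of that fix for the odd-count example above, assuming the whole warp participates (mask 0xffffffff) and using __shfl_down_sync for the in-warp reduction:
__global__ void kernel(int* array, int* oddCount) {
    int value = array[threadIdx.x];
    int isOdd = (value % 2 == 1) ? 1 : 0;
    // reduce the 32 per-thread flags to one count inside the warp
    for (int diff = warpSize / 2; diff > 0; diff /= 2) {
        isOdd += __shfl_down_sync(0xffffffff, isOdd, diff);
    }
    if (threadIdx.x % warpSize == 0) {
        atomicAdd(oddCount, isOdd);   // one atomic per warp instead of up to 32
    }
}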
What is a common memory problem when using arrays of structs, and how can it be solved?
Misaligned, strided reads
struct Int3 {
    int x;
    int y;
    int z;
};
Int3 var;
var.x = varArray[threadIdx.x].x;
Memory layout: x y z | x y z | x y z | x y z
index:           0       1       2       3
Each thread uses only the x field of the struct it reads, and consecutive threads read addresses 12 bytes apart, so much of every cache line that is loaded goes unused.
Solve by using a struct of arrays: this stores all x values together, then all y values, then all z values.
struct arrayOfInt3 {
    int* x;
    int* y;
    int* z;
};
arrayOfInt3 soa;
int x_value = soa.x[index];
Memory layout:
x x x x
y y y y
z z z z
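A minimal sketch contrasting the two layouts in kernels (reusing the hypothetical Int3 and arrayOfInt3 definitions above; kernel names are made up):
// Array of structs: consecutive threads read addresses 12 bytes apart (strided)
__global__ void aos_kernel(const Int3* particles, int* out) {
    out[threadIdx.x] = particles[threadIdx.x].x;
}

// Struct of arrays: consecutive threads read consecutive ints (coalesced)
__global__ void soa_kernel(arrayOfInt3 particles, int* out) {
    out[threadIdx.x] = particles.x[threadIdx.x];
}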
What is vectorized loads?
If the data type is exactly 4, 8 or 16 bytes, the memory system guarantees the data is loaded in one operation.
A struct of arrays is not needed in this case.
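A minimal sketch of a vectorized load using CUDA's built-in 16-byte int4 vector type (the kernel name is made up):
__global__ void vectorized_kernel(const int4* array, int* out) {
    int4 v = array[threadIdx.x];              // one 16-byte load per thread
    out[threadIdx.x] = v.x + v.y + v.z + v.w;
}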
What is register spilling?
Temporarily write some register values to memory
Upside: can run more threads simultaneously
Downside: more memory transactions
The compiler spills registers when it thinks it will improve performance
How can register spilling be avoided?
Identify parts of kernel where many variables must be kept simultaneously
- nested function calls
- loops
Consider recalculating values if it is not too expensive
Cut fields from structs that are unrelated to the kernel
What is a shuffle instruction?
Exchange values within a warp
Each thread provides a value
Each thread reads a value provided by another thread
__shfl_sync(unsigned mask, T var, int srcLane)
mask: defines which lanes are included in the call, often __activemask() - all currently active threads in the warp
Thread 3 executes this - it provides the value 7 and reads from lane 8:
int out = __shfl_sync(__activemask(), 7, 8);
Thread 10 executes this - it provides the value 11 and receives the value from lane 3, which is 7:
int out = __shfl_sync(__activemask(), 11, 3);
What threads can communicate using shuffle instructions?
Threads must be in the same block
Threads must be in the same warp, or in a cooperative group of up to 32 threads in size
What happens if a thread using shuffle instructions reads from a lane that is not participating?
Undefined result
What are the four shuffle instructions
Read from any thread:
__shfl_sync(mask, var, srcLane)
Read from thread with laneid = (current lane - delta)
__shfl_up_sync(mask, var, delta)
Read from thread with laneid = (current lane + delta)
__shfl_down_sync(mask, var, delta)
Read from thread with laneId XORed with laneMask
__shfl_xor_sync(mask, var, laneMask)
How can shuffle instructions be used to compute a sum?
int sum = threadValue; // threadValue = 12 for this thread
sum += __shfl_xor_sync(__activemask(), sum, 16);
sum += __shfl_xor_sync(__activemask(), sum, 8);
sum += __shfl_xor_sync(__activemask(), sum, 4);
sum += __shfl_xor_sync(__activemask(), sum, 2);
sum += __shfl_xor_sync(__activemask(), sum, 1);
return sum;
This creates a butterfly reduction (cross pattern): first values are exchanged across a distance of 16 lanes (one big cross), then 8 (two crosses), then 4, and so on; at the end every lane holds the full sum.
How can shuffle instructions be used to broadcast values?
Example: lane 0 reserves a block of 32 entries in a buffer and the other lanes need the start index.
int out = __shfl_sync(__activemask(), value, 0);
All threads read from lane 0.
Lane 0 calculates value; afterwards all threads can use value in further computation.
What is warp voting?
Each thread in the warp sets one bit in a 32-bit integer
The bit index corresponds to the lane index
Only active threads vote
What are the warp voting instructions?
Create a 32-bit integer where each lane sets one bit:
unsigned int __ballot_sync(mask, bool predicate)
Returns true if all threads vote true
bool __all_sync(mask, bool predicate)
Returns true if any thread votes true
bool __any_sync(mask, bool predicate)
Reverses the bit order of a 32-bit integer
unsigned int __brev(unsigned int mask)
What are some usecases of warp voting?
Identify elements that should be removed
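A minimal sketch of that use case: each warp compacts its "kept" elements into an output buffer using __ballot_sync, __popc, an atomic reservation, and a broadcast (assumes a single fully active warp; all names are made up):
__global__ void keep_even(const int* in, int* out, int* outCount) {
    int value = in[threadIdx.x];
    bool keep = (value % 2 == 0);
    unsigned vote = __ballot_sync(0xffffffff, keep);           // one bit per lane that keeps
    int lane = threadIdx.x % warpSize;
    int offset = __popc(vote & ((1u << lane) - 1));            // keepers in lower lanes
    int base = 0;
    if (lane == 0) base = atomicAdd(outCount, __popc(vote));   // lane 0 reserves the space
    base = __shfl_sync(0xffffffff, base, 0);                   // broadcast the start index
    if (keep) out[base + offset] = value;
}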
What are warp reductions?
Functions that reduce the values held by the threads of a warp into one result
General form: __reduce_<op>_sync(unsigned mask, unsigned/int value), e.g. __reduce_add_sync, __reduce_min_sync, __reduce_max_sync
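A minimal sketch using one of these functions (__reduce_add_sync needs compute capability 8.0 or newer; the kernel name is made up):
__global__ void warp_totals(const int* in, int* out) {
    int value = in[threadIdx.x];
    int total = __reduce_add_sync(0xffffffff, value);   // every lane gets the warp's sum
    if (threadIdx.x % warpSize == 0) {
        out[threadIdx.x / warpSize] = total;             // store one result per warp
    }
}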