Chapter 6 - CUDA Flashcards
What are GPGPUs?
General purpose GPUs
What is CUDA?
CUDA (Compute Unified Device Architecture) is NVIDIA's API for general-purpose programming on GPUs.
What is the architecture of GPUs?
SIMD (single control unit, multiple datapaths). An instruction is fetched from memory and broadcast to the multiple datapaths. Each datapath executes the instruction on its own data, or remains idle.
GPUs consist of one or more SIMD processors.
A Streaming Multiprocessor (SM) consists of one or more SIMD processors.
On GPUs, the datapaths are called cores or Streaming Processors (SPs). There can be multiple SPs within an SM.
What does an SM consist of?
It can have several control units and many datapaths.
It operates asynchronously.
There is no penalty if one SM executes one branch of a conditional (the if) while another SM executes the other (the else).
A small block of memory shared amongst all of its SPs.
What is the Host?
CPU and its associated memory
What is the device?
GPU and its associated memory
Draw a diagram of a CPU and a GPU
What is heterogeneous computing?
Writing programs that will run on devices with different architectures.
What is a kernel?
A function started by the host but run on the device.
Declared with the __global__ qualifier.
Must return void.
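A minimal sketch (the kernel name Add_one and its argument are hypothetical):

__global__ void Add_one(int *a) {
    /* Runs on the device: each thread increments one element */
    a[threadIdx.x] += 1;
}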
How can threads running in CUDA be synchronized with the host?
cudaDeviceSynchronize();
Main waits until the device threads have finished executing the kernel.
How are kernels run?
Kernel_name<<<grid_dims, block_dims>>>(args);
The total number of threads started on the GPU is the number of blocks (grid_dims) times the number of threads per block (block_dims).
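For example, launching the hypothetical Add_one kernel above with 1 block of 16 threads (device_a is assumed to already point to device memory):

Add_one<<<1, 16>>>(device_a);   /* 1 block, 16 threads per block */
cudaDeviceSynchronize();        /* host waits until the kernel has finished */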
What variables are defined by CUDA when a kernel is started?
Four structs, each with an x-, y-, and z-field:
threadIdx: the rank of the thread within its block
blockDim: the dimensions of the blocks
blockIdx: the rank of the current block within the grid
gridDim: the dimensions of the grid
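A sketch of a kernel that prints all four structs (the name Show_ids is hypothetical; printf works in device code when stdio.h is included):

__global__ void Show_ids(void) {
    printf("block (%d,%d,%d) of grid (%d,%d,%d), thread (%d,%d,%d) of block (%d,%d,%d)\n",
           blockIdx.x, blockIdx.y, blockIdx.z,
           gridDim.x, gridDim.y, gridDim.z,
           threadIdx.x, threadIdx.y, threadIdx.z,
           blockDim.x, blockDim.y, blockDim.z);
}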
What does it mean that a call to a kernel is asynchronous?
The call returns immediately.
To avoid the host terminating before the kernels have completed execution, cudaDeviceSynchronize can be used.
How are threads organized in a kernel?
Each individual thread will execute on one SP.
<<<grid_dims, block_dims>>>
grid_dims: how many SMs to utilize - how many blocks in each dimension of the grid
block_dims: how many SPs to utilize - how many threads to run in each dimension of a block (on a single SM)
If grid_dims or block_dims has integer type, the x-value of the grid or block dimensions will be that integer, and the y- and z-dimensions will be set to 1.
To specify the y- and z-dimensions in addition to x, the variable must be of type dim3.
How is a dimension variable specified using dim3?
dim3 grid_dims;
grid_dims.x = 2;
grid_dims.y = 3;
grid_dims.z = 4;
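dim3 also has a constructor form; the variables are then passed in the kernel launch (a sketch, reusing the hypothetical Kernel_name and args from above):

dim3 grid_dims(2, 3, 4);     /* 2 x 3 x 4 = 24 blocks in the grid */
dim3 block_dims(32, 1, 1);   /* 32 threads per block */
Kernel_name<<<grid_dims, block_dims>>>(args);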
What are some conditions of thread blocks?
All blocks have the same dimensions
All blocks can execute independently of each other.
All blocks can execute in any order.
A thread block must be able to complete its execution regardless of the state of the other blocks.
What is the compute capability?
There are limits on the number of threads in each block and on the number of blocks; these limits are determined by the compute capability.
The compute capability has the form a.b
a: 1, 2, 3, 5, 6, 7, 8
b: 0-7 (depends on the value of a)
The compute capability specifies how many threads each block can have, and how big the dimensions of blocks and grids can be.
How can you calculate the global index of a thread?
x = blockDim.x * blockIdx.x + threadIdx.x
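A typical use is a vector-addition kernel, where each thread computes its global index to pick one element (a sketch; Vec_add and the argument names are hypothetical):

__global__ void Vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;  /* global thread index */
    if (i < n)   /* guard: the total thread count may exceed n */
        c[i] = a[i] + b[i];
}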
How is memory allocated to devices?
cudaMalloc(void** ptr, size_t size)
Memory allocated on device must be explicitly freed
What does the qualifier __device__ do?
Put in front of functions.
Indicates that a function can only be called from the device.
How is device memory freed?
cudaFree(data);
data is the device pointer returned by cudaMalloc; note that the pointer itself is passed, not its address.
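A sketch of allocating, using, and freeing device storage for n floats (device_x is a hypothetical name):

float *device_x;
cudaMalloc((void**) &device_x, n * sizeof(float));
/* ... launch kernels that use device_x ... */
cudaFree(device_x);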
How is data copied to and from devices?
cudaMemcpy(
void* dest,
void* src,
size_t size,
cudaMemcpyKind kind
)
kind: specifies where src and dest are located (host or device)
The function is synchronous: the host waits until the transfer (and any previously launched kernels in the default stream) has finished, so the program does not need to synchronize explicitly.
cudaMemcpy(device_x, host_x, size, cudaMemcpyHostToDevice);
cudaMemcpy(host_x, device_x, size, cudaMemcpyDeviceToHost);
host_x and device_x are pointers
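Putting the pieces together, a host-side sketch (hypothetical names, error checking omitted):

float *host_x = (float*) malloc(size);           /* host copy */
float *device_x;
cudaMalloc((void**) &device_x, size);            /* device copy */
/* ... initialize host_x ... */
cudaMemcpy(device_x, host_x, size, cudaMemcpyHostToDevice);
Kernel_name<<<grid_dims, block_dims>>>(device_x);
cudaMemcpy(host_x, device_x, size, cudaMemcpyDeviceToHost);
cudaFree(device_x);
free(host_x);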
What is the return type of a kernel?
void
What does kernel-declaration look like?
__global__ void kernelName();
What is a similarity between CUDA, Pthreads, and OpenMP?
CUDA threads are also allocated stacks and local variables.
What are warp shuffles?
Functions that allow a collection of threads within a warp to read variables stored by other threads in the warp; in effect, a thread can read from registers used by other threads in the warp.
How is memory laid out in CUDA?
Each SM has access to "shared" memory, which is accessible to all of the SPs (and hence all of the threads) within that SM.
All of the SPs and all of the threads also have access to "global" memory.
Shared memory locations are few but fast; global memory locations are many but slow.
The fastest memory is registers.
How are local variables stored in CUDA?
It depends on the total available memory and the program's memory usage.
If there is enough storage, they are stored in registers.
If not, they are stored in a thread-private region of global memory.
What is a warp?
A set of threads with consecutive ranks belonging to a block.
The number of threads in a warp is currently 32; this could change in future CUDA implementations.
The system-initialized variable warpSize stores the size.
How do threads in a warp behave?
They execute in SIMD fashion.
Threads in different warps can execute different statements; threads within the same warp must execute the same statement.
What does it mean when threads are said to have diverged?
Threads within a warp attempt to execute different statements.
E.g. take different branches of an if-else statement.
When the diverged threads have finished executing the different statements, they resume executing the same statements; at that point the threads have converged.
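For example, threads in the same warp diverge at this conditional and converge after it (a sketch; val is a hypothetical local variable):

if (threadIdx.x % 2 == 0)
    val += 1.0f;   /* odd lanes are idle while even lanes execute */
else
    val -= 1.0f;   /* even lanes are idle while odd lanes execute */
/* all lanes have converged again here */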
What is a thread’s lane?
How can it be calculated?
Its rank within a warp.
lane = threadIdx.x % warpSize
How does the warp shuffle function __shfl_down_sync work?
__shfl_down_sync(
unsigned mask,
float var,
unsigned diff,
int width=warpSize
)
mask: indicates which threads participate in the call. For each participating thread, the bit representing that thread's lane must be set; this ensures all threads in the call have converged (arrived at the call) before any of them begin executing the function.
var: when the thread with lane x calls the function, the value stored in var by the thread with lane (x + diff) is returned to thread x.
diff: since it is unsigned, it is >= 0, so the returned value comes from a thread with a higher lane (hence the name shfl_down).
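A classic use is a warp-level sum reduction (a sketch; Warp_sum is a hypothetical device function, and the mask 0xffffffff assumes all 32 threads in the warp participate):

__device__ float Warp_sum(float my_val) {
    /* Each iteration halves the number of lanes holding partial sums */
    for (unsigned diff = warpSize / 2; diff > 0; diff = diff / 2)
        my_val += __shfl_down_sync(0xffffffff, my_val, diff);
    return my_val;   /* lane 0 now holds the sum over the warp */
}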
What are some possible issues with warp shuffles?
If not all threads in a warp call the function, or the warp does not contain a full warpSize threads (e.g., the block size is not a multiple of warpSize), the function may return undefined results.
What happens if thread x calls a warp shuffle but the thread at lane x + diff does not?
The value returned to thread x is undefined.
How can you create shared variables?
__shared__ float shared_vals[32];
The definition must appear inside the kernel.
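A sketch of a kernel that uses a shared array so the threads of a block can combine their values (the name Block_sum is hypothetical; assumes blockDim.x <= 32):

__global__ void Block_sum(const float *x, float *block_sums) {
    __shared__ float shared_vals[32];    /* visible to all threads in this block */
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    shared_vals[threadIdx.x] = x[i];
    __syncthreads();                     /* wait until every thread has stored its value */
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int j = 0; j < blockDim.x; j++)
            sum += shared_vals[j];
        block_sums[blockIdx.x] = sum;    /* thread 0 writes the block's total */
    }
}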