GPUs (2) Flashcards

1
Q

Why do GPUs use caches? Is it for the same reason as CPUs?

A

No - multithreading in GPUs hides DRAM latency. Caches instead reduce DRAM bandwidth requirements

2
Q

What are the 3 main types of GPU memory?

A
  1. Private local memory
  2. Shared memory
  3. Global GPU memory
3
Q

Describe private local memory

A

Each thread is allocated private local memory
Used for the stack frame and for spilling registers

4
Q

Describe shared memory

A

On each multithreaded processor (core), not shared between them
Dynamically allocated to thread blocks on creation, used for communication between threads in the block

5
Q

Describe global GPU memory

A

Available across GPU and also to host/system processor

6
Q

Where is private local memory located?

A

In external DRAM, so it can be large; can be cached

7
Q

Where is shared memory located?

A

Within each multithreaded processor (core) so high bandwidth

8
Q

Where is global memory located?

A

In external DRAM, can be cached

9
Q

Give 2 examples of GPUs

A

Fermi, Kepler

10
Q

What are 2 main languages used to program a GPU?

A
  1. CUDA
  2. OpenCL
11
Q

What does CUDA stand for?

A

Compute Unified Device Architecture

12
Q

How does the programmer split up CUDA code?

A

Identify which code should run on the CPU and which on the GPU, then split it up and annotate each part

13
Q

What is a kernel?

A

Program or function, designed to be executed in parallel

14
Q

What is a CUDA thread?

A

Single stream of instructions from a computation kernel

15
Q

What is a thread block?

A

A set of threads that execute the same kernel and can cooperate

16
Q

What is a grid?

A

A set of thread blocks that execute the same kernel

17
Q

What is strip-mining?

A

Compiler optimisation where a loop is split into smaller nested loops
Can be done to expose the thread hierarchy - correspondence of the inner loop to a thread block

18
Q

Write out the code for the original DAXPY loop strip-mined into 2 parts. In this case how many warps are in a thread block, each of which corresponds to the inner loop?

A

for (i = 0; i < 1024; i += 128) {
    for (j = i; j < i + 128; ++j) {
        Z[j] = a*X[j] + Y[j];
    }
}

4 warps per thread block (128 threads per block / 32 threads per warp)

19
Q

Draw a diagram of the grouping of CUDA threads in a GPU executing this code

A

(diagram omitted)

20
Q

How does a CUDA programmer annotate GPU functions?

A

__device__ or __global__
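
As a minimal sketch (function names and logic here are illustrative, not from the lecture):

__device__ double scale(double a, double x) {            // GPU-only helper, callable from GPU code
    return a * x;
}

__global__ void scale_all(int n, double a, double *X) {  // kernel: runs on GPU, launched from host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) X[i] = scale(a, X[i]);                    // bounds check, as in the DAXPY card
}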

21
Q

How does a CUDA programmer annotate CPU functions?

A

__host__

22
Q

GPU functions must be called with code dimensions. Write out the structure. What do these specify?

A

func<<<dimGrid, dimBlock>>>(params)
dimGrid = number of blocks in a grid
dimBlock = number of threads in a block
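
For example (values illustrative), the DAXPY kernel from a later card could be launched as:

daxpy<<<nblocks, 256>>>(n, 2.0, X, Y);   // nblocks blocks of 256 threads each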

23
Q

Write DAXPY in C

A

void daxpy(int n, double a, double *X, double *Y) {
    for (int i = 0; i < n; ++i) {
        Y[i] = a*X[i] + Y[i];
    }
}

24
Q

How could DAXPY in C be called?

A

daxpy(n, 2.0, X, Y);

25
Q

Write out the DAXPY equivalent in CUDA

A

__global__ void daxpy(int n, double a, double *X, double *Y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // checks that the thread is within bounds
        Y[i] = a*X[i] + Y[i];
}

__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, X, Y);

26
Q

What is OpenCL designed to support?

A

Heterogeneous computing

27
Q

Explain what blockIdx, blockDim and threadIdx are

A

Coordinates provided to each thread, e.g. blockIdx tells us which thread block the thread is in, blockDim the number of threads per block, and threadIdx the thread's position within its block

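A minimal sketch of the resulting index arithmetic (whoami is a hypothetical kernel name):

__global__ void whoami(int *out, int n) {
    int i = blockIdx.x * blockDim.x   // threads in all preceding blocks
          + threadIdx.x;              // this thread's position within its block
    if (i < n) out[i] = i;            // each thread records its unique global index
}
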
28
Q

What will CUDA do when the DAXPY code is executed?

A

It will create 1024 different threads, all with the right dimensions

29
Q

Which devices can OpenCL be used to program?

A

Any heterogeneous device, e.g. CPUs, GPUs, FPGAs
Whereas CUDA is just used for GPUs

30
Q

List the 4 OpenCL models

A
  1. Platform model
  2. Execution model
  3. Kernel programming model
  4. Memory model

31
Q

What is an OpenCL platform?

A

A host with 1+ devices

32
Q

How is a device divided up?

A

Devices are divided into compute units, and then into processing elements

33
Q

Draw a diagram of an OpenCL platform

A

(diagram omitted)

34
Q

Does a system have exactly 1 OpenCL platform?

A

No, it could have multiple platforms with different characteristics

35
Q

What property does a platform provide applications?

A

Portability - abstracts away from the vendor-specific runtime

36
Q

In OpenCL, what is a GPU?

A

A device

37
Q

In OpenCL, what is a streaming multiprocessor?

A

A compute unit

38
Q

In OpenCL, what is a thread?

A

A processing element

39
Q

With OpenCL, apps use an API to choose the platform/device to run on. How do they discover the set of available platforms?

A

Call once to get the number, then allocate memory, then call again to populate the array:
clGetPlatformIDs(entries, *platforms, *num)

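A minimal C sketch of the two-call pattern (error checking omitted; the helper name is illustrative):

#include <stdlib.h>
#include <CL/cl.h>

cl_platform_id *discover_platforms(cl_uint *num_out) {
    cl_uint num = 0;
    clGetPlatformIDs(0, NULL, &num);           // first call: how many platforms?
    cl_platform_id *ps = malloc(num * sizeof(cl_platform_id));
    clGetPlatformIDs(num, ps, NULL);           // second call: populate the array
    *num_out = num;
    return ps;
}
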
40
Q

How do apps then query the devices available on a platform?

A

Similarly, call twice:
clGetDeviceIDs(platform, device_type, ...)

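A sketch of the same two-call pattern for devices (error checking again omitted):

cl_uint ndev = 0;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &ndev);     // count the GPUs
cl_device_id *devs = malloc(ndev * sizeof(cl_device_id));
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, ndev, devs, NULL);   // fill the array
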
41
Q

What is a context?

A

An abstract environment for execution that:
  1. Manages memory objects
  2. Manages interaction between host and device
  3. Keeps track of programs and kernels on each device

42
Q

Give an example of how there is more fine-grained control in OpenCL compared to CUDA

A

Contexts can be created manually in OpenCL, whereas this is done behind the scenes in CUDA

43
Q

Which 2 OpenCL commands can be used to create a context?

A

clCreateContext(*properties, num, *devices, ...)
clCreateContextFromType(*properties, *dev_type, ...)

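For instance, a context over the devices queried above (no special properties):

cl_int status;
cl_context ctx = clCreateContext(NULL, ndev, devs, NULL, NULL, &status);
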
44
Q

How can the host communicate with a device?

A

Command queues
One queue per device, commands sent from host

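A sketch using the classic (pre-OpenCL-2.0) API, reusing the context and devices from above:

cl_command_queue q = clCreateCommandQueue(ctx, devs[0], 0, &status);
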
45
Q

Barriers can be used to synchronise queues. Which two OpenCL commands are used?

A

clFlush(queue)
clFinish(queue)

46
Q

Which two orderings can queues have?

A
  1. In-order (FIFO queue)
  2. Out-of-order (commands can be rearranged for efficiency, but cannot break dependencies)

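An out-of-order queue is requested with a property flag, e.g. with the pre-2.0 API:

cl_command_queue ooq = clCreateCommandQueue(ctx, devs[0],
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &status);
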
47
Q

Event objects specify ? ?

A

command dependencies

48
Q

Each command has a wait list. What does this contain?

A

Events that this command depends on

49
Q

Each command has its own event. What is this for?

A

To link dependent commands with this command
And to contain the state of the command, e.g. queued, running, ready

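A sketch of chaining two commands through an event (the buffer, kernel and size variables are assumed from the surrounding cards):

cl_event write_done;
clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, size, host_ptr,
                     0, NULL, &write_done);        // this command's own event
clEnqueueNDRangeKernel(q, k, 1, NULL, gsz, lsz,
                       1, &write_done, NULL);      // wait list: depends on the write
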
50
Q

Draw a diagram showing the command queue between host and device

A

(diagram omitted)

51
Q

Kernels are what actually run on the device. Each kernel contains ? ? ? ? ?

A

the body of a loop

52
Q

What is a work-item?

A

Basic unit of concurrency
Each executes one iteration of a loop
Multiple work-items are created to execute each kernel
Corresponds to a CUDA thread

53
Q

What is NDRange?

A

The number of kernel instances expressed as an n-dimensional range (index space)

54
Q

What is a work-group?

A

Work-items within an NDRange are grouped into smaller units called work-groups
A work-group has the same number of dimensions as the NDRange

55
Q

What would we call a work-group in CUDA?

A

A thread block

56
Q

Draw a diagram of the structure of work-items executing the DAXPY loop with i+=32 in the external loop

A

(diagram omitted)

57
Q

Draw a diagram showing the NDRange and a work-group

A

(diagram omitted)

58
Q

When are kernels compiled? Why?

A

At run-time, to allow optimisation for a specific device

59
Q

What are the 3 types of memory objects in OpenCL?

A
  1. Buffers - equivalent to C arrays (contiguous elements)
  2. Images - abstract, cannot be directly referenced
  3. Pipes - ordered sequences of data, accessed as a FIFO

60
Q

What OpenCL commands can be used to create memory objects?

A

clCreateBuffer(context, flags, size, ...)
clCreateImage(context, flags, format, desc, ...)
clCreatePipe(context, flags, pkt_size, max_pkts)

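For example, a buffer of n doubles (flags and size illustrative):

cl_mem X_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                              n * sizeof(double), NULL, &status);
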
61
Q

What are the two types of memory in OpenCL?

A

Host and device

62
Q

What is host memory?

A

Defined outside OpenCL

63
Q

What is device memory?

A

Accessible to kernels

64
Q

What are the 4 types of device memory?

A
  1. Global - visible to all work-items
  2. Constant - visible to all work-items for simultaneous access
  3. Local - shared between work-items within a work-group
  4. Private - only visible to each work-item

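A hedged OpenCL C sketch showing all four in one kernel (kernel name and logic are illustrative):

__kernel void stage(__global double *g,      // global: visible to all work-items
                    __constant double *c,    // constant: read-only, all work-items
                    __local double *l) {     // local: shared within a work-group
    double p = c[0] * g[get_global_id(0)];   // p is private: one copy per work-item
    l[get_local_id(0)] = p;                  // stage into work-group local memory
    barrier(CLK_LOCAL_MEM_FENCE);            // synchronise the work-group
    g[get_global_id(0)] = l[get_local_id(0)];
}
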
65
Q

Draw a diagram showing the OpenCL memory within a kernel

A

(diagram omitted)

66
Q

OpenCL execution is made up of many stages. Which of these is the only explicit stage in CUDA, with the rest being handled by the compiler?

A

Execute kernel

67
Q

Write out the DAXPY equivalent in OpenCL

A

GPU:

__kernel void daxpy(double a, __global double *X, __global double *Y) {
    int tid = get_global_id(0);
    Y[tid] = a*X[tid] + Y[tid];
}

CPU (called using 256 work-items per work-group):

cl_kernel k = clCreateKernel(p, "daxpy", &status);
s = clSetKernelArg(k, 0, sizeof(double), &a);
i[0] = 4096;
w[0] = 256;
s = clEnqueueNDRangeKernel(q, k, 1, NULL, i, w, ...);

68
Q

Which of CUDA and OpenCL is more verbose? Why?

A

OpenCL
Because platform discovery and compilation are performed at runtime

69
Q

ADD CUDA OPENCL COMPARISON FROM SUPOS

70
Q

Which of CUDA and OpenCL supports more devices?

A

OpenCL
It targets CPUs, GPUs, FPGAs etc., whereas CUDA only targets NVIDIA's GPUs