GPUs (2) Flashcards

1
Q

Why do GPUs use caches? Is it for the same reason as CPUs?

A

No - multithreading in GPUs hides DRAM latency. Caches instead reduce DRAM bandwidth requirements

2
Q

What are the 3 main types of GPU memory?

A
  1. Private local memory
  2. Shared memory
  3. Global GPU memory
3
Q

Describe private local memory

A

Each thread is allocated private local memory
Used for the stack frame and for spilling registers

4
Q

Describe shared memory

A

On each multithreaded processor (core), not shared between them
Dynamically allocated to thread blocks on creation, used for communication between threads in the block

5
Q

Describe global GPU memory

A

Available across GPU and also to host/system processor

6
Q

Where is private local memory located?

A

In external DRAM, so it can be large; can be cached

7
Q

Where is shared memory located?

A

Within each multithreaded processor (core) so high bandwidth

8
Q

Where is global memory located?

A

In external DRAM, can be cached

9
Q

Give 2 examples of GPUs

A

Fermi, Kepler

10
Q

What are 2 main languages used to program a GPU?

A
  1. CUDA
  2. OpenCL
11
Q

What does CUDA stand for?

A

Compute Unified Device Architecture

12
Q

How does the programmer split up CUDA code?

A

Identify which code should run on the CPU and which on the GPU, then split it up and annotate each part

13
Q

What is a kernel?

A

Program or function, designed to be executed in parallel

14
Q

What is a CUDA thread?

A

Single stream of instructions from a computation kernel

15
Q

What is a thread block?

A

A set of threads that execute the same kernel and can cooperate

16
Q

What is a grid?

A

A set of thread blocks that execute the same kernel

17
Q

What is strip-mining?

A

Compiler optimisation where a loop is split into smaller nested loops
Can be done to expose the thread hierarchy - correspondence of the inner loop to a thread block

18
Q

Write out the code for the original DAXPY loop strip-mined into 2 parts. In this case how many warps are in a thread block, each of which corresponds to the inner loop?

A

for (i = 0; i < 1024; i += 128) {
    for (j = i; j < i + 128; ++j) {
        Z[j] = a*X[j] + Y[j];
    }
}

4 warps per thread block (128 threads per block / 32 threads per warp)

19
Q

Draw a diagram of the grouping of CUDA threads in a GPU executing this code

A

(diagram omitted)

20
Q

How does a CUDA programmer annotate GPU functions?

A

__device__ or __global__
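
As a minimal sketch (function names and logic here are illustrative, not from the lecture):

__device__ double scale(double a, double x) {            // GPU-only helper, callable from GPU code
    return a * x;
}

__global__ void scale_all(int n, double a, double *X) {  // kernel: runs on GPU, launched from host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) X[i] = scale(a, X[i]);                    // bounds check, as in the DAXPY card
}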

21
Q

How does a CUDA programmer annotate CPU functions?

A

__host__

22
Q

GPU functions must be called with code dimensions. Write out the structure. What do these specify?

A

func<<<dimGrid, dimBlock>>>(params)
dimGrid = number of blocks in a grid
dimBlock = number of threads in a block
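
For example (values illustrative), the DAXPY kernel from a later card could be launched as:

daxpy<<<nblocks, 256>>>(n, 2.0, X, Y);   // nblocks blocks of 256 threads each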

23
Q

Write DAXPY in C

A

void daxpy(int n, double a, double *X, double *Y) {
    for (int i = 0; i < n; ++i) {
        Y[i] = a*X[i] + Y[i];
    }
}

24
Q

How could DAXPY in C be called?

A

daxpy(n, 2.0, X, Y);

25
Q

Write out the DAXPY equivalent in CUDA

A

__global__ void daxpy(int n, double a, double *X, double *Y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // checks that the thread is within bounds
        Y[i] = a*X[i] + Y[i];
}

__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, X, Y);

26
Q

What is OpenCL designed to support?

A

Heterogeneous computing

27
Q

Explain what blockIdx, blockDim and threadIdx are

A

Coordinates provided to each thread, e.g. blockIdx tells us which thread block the thread is in, blockDim the number of threads per block, and threadIdx the thread's position within its block

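A minimal sketch of the resulting index arithmetic (whoami is a hypothetical kernel name):

__global__ void whoami(int *out, int n) {
    int i = blockIdx.x * blockDim.x   // threads in all preceding blocks
          + threadIdx.x;              // this thread's position within its block
    if (i < n) out[i] = i;            // each thread records its unique global index
}
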
28
Q

What will CUDA do when the DAXPY code is executed?

A

It will create 1024 different threads, all with the right dimensions

29
Q

Which devices can OpenCL be used to program?

A

Any heterogeneous device, e.g. CPUs, GPUs, FPGAs
Whereas CUDA is just used for GPUs

30
Q

List the 4 OpenCL models

A
  1. Platform model
  2. Execution model
  3. Kernel programming model
  4. Memory model

31
Q

What is an OpenCL platform?

A

A host with 1+ devices

32
Q

How is a device divided up?

A

Devices are divided into compute units, and then into processing elements

33
Q

Draw a diagram of an OpenCL platform

A

(diagram omitted)

34
Q

Does a system have exactly 1 OpenCL platform?

A

No, it could have multiple platforms with different characteristics

35
Q

What property does a platform provide applications?

A

Portability - abstracts away from the vendor-specific runtime

36
Q

In OpenCL, what is a GPU?

A

A device

37
Q

In OpenCL, what is a streaming multiprocessor?

A

A compute unit

38
Q

In OpenCL, what is a thread?

A

A processing element

39
Q

With OpenCL, apps use an API to choose the platform/device to run on. How do they discover the set of available platforms?

A

Call once to get the number, then allocate memory, then call again to populate the array:
clGetPlatformIDs(entries, *platforms, *num)

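A minimal C sketch of the two-call pattern (error checking omitted; the helper name is illustrative):

#include <stdlib.h>
#include <CL/cl.h>

cl_platform_id *discover_platforms(cl_uint *num_out) {
    cl_uint num = 0;
    clGetPlatformIDs(0, NULL, &num);           // first call: how many platforms?
    cl_platform_id *ps = malloc(num * sizeof(cl_platform_id));
    clGetPlatformIDs(num, ps, NULL);           // second call: populate the array
    *num_out = num;
    return ps;
}
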
40
Q

How do apps then query the devices available on a platform?

A

Similarly, call twice:
clGetDeviceIDs(platform, device_type, ...)

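A sketch of the same two-call pattern for devices (error checking again omitted):

cl_uint ndev = 0;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &ndev);     // count the GPUs
cl_device_id *devs = malloc(ndev * sizeof(cl_device_id));
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, ndev, devs, NULL);   // fill the array
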
41
Q

What is a context?

A

An abstract environment for execution that:
  1. Manages memory objects
  2. Manages interaction between host and device
  3. Keeps track of programs and kernels on each device

42
Q

Give an example of how there is more fine-grained control in OpenCL compared to CUDA

A

Contexts can be created manually in OpenCL, whereas this is done behind the scenes in CUDA

43
Q

Which 2 OpenCL commands can be used to create a context?

A

clCreateContext(*properties, num, *devices, ...)
clCreateContextFromType(*properties, *dev_type, ...)

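For instance, a context over the devices queried above (no special properties):

cl_int status;
cl_context ctx = clCreateContext(NULL, ndev, devs, NULL, NULL, &status);
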
44
Q

How can the host communicate with a device?

A

Command queues
One queue per device, commands sent from host

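A sketch using the classic (pre-OpenCL-2.0) API, reusing the context and devices from above:

cl_command_queue q = clCreateCommandQueue(ctx, devs[0], 0, &status);
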
45
Q

Barriers can be used to synchronise queues. Which two OpenCL commands are used?

A

clFlush(queue)
clFinish(queue)

46
Q

Which two orderings can queues have?

A
  1. In-order (FIFO queue)
  2. Out-of-order (commands can be rearranged for efficiency, but cannot break dependencies)

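An out-of-order queue is requested with a property flag, e.g. with the pre-2.0 API:

cl_command_queue ooq = clCreateCommandQueue(ctx, devs[0],
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &status);
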
47
Q

Event objects specify ? ?

A

command dependencies

48
Q

Each command has a wait list. What does this contain?

A

Events that this command depends on

49
Q

Each command has its own event. What is this for?

A

To link dependent commands with this command
And to contain the state of the command, e.g. queued, running, ready

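A sketch of chaining two commands through an event (the buffer, kernel and size variables are assumed from the surrounding cards):

cl_event write_done;
clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, size, host_ptr,
                     0, NULL, &write_done);        // this command's own event
clEnqueueNDRangeKernel(q, k, 1, NULL, gsz, lsz,
                       1, &write_done, NULL);      // wait list: depends on the write
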
50
Q

Draw a diagram showing the command queue between host and device

A

(diagram omitted)

51
Q

Kernels are what actually run on the device. Each kernel contains ? ? ? ? ?

A

the body of a loop

52
Q

What is a work-item?

A

Basic unit of concurrency
Each executes one iteration of a loop
Multiple work-items are created to execute each kernel
Corresponds to a CUDA thread

53
Q

What is NDRange?

A

The number of kernel instances expressed as an n-dimensional range (index space)

54
Q

What is a work-group?

A

Work-items within an NDRange are grouped into smaller units called work-groups
A work-group has the same number of dimensions as the NDRange

55
Q

What would we call a work-group in CUDA?

A

A thread block

56
Q

Draw a diagram of the structure of work-items executing the DAXPY loop with i+=32 in the external loop

A

(diagram omitted)

57
Q

Draw a diagram showing the NDRange and a work-group

A

(diagram omitted)

58
Q

When are kernels compiled? Why?

A

At run-time, to allow optimisation for a specific device

59
Q

What are the 3 types of memory objects in OpenCL?

A
  1. Buffers - equivalent to C arrays (contiguous elements)
  2. Images - abstract, cannot be directly referenced
  3. Pipes - ordered sequences of data, accessed as a FIFO

60
Q

What OpenCL commands can be used to create memory objects?

A

clCreateBuffer(context, flags, size, ...)
clCreateImage(context, flags, format, desc, ...)
clCreatePipe(context, flags, pkt_size, max_pkts)

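For example, a buffer of n doubles (flags and size illustrative):

cl_mem X_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                              n * sizeof(double), NULL, &status);
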
61
Q

What are the two types of memory in OpenCL?

A

Host and device

62
Q

What is host memory?

A

Defined outside OpenCL

63
Q

What is device memory?

A

Accessible to kernels

64
Q

What are the 4 types of device memory?

A
  1. Global - visible to all work-items
  2. Constant - visible to all work-items for simultaneous access
  3. Local - shared between work-items within a work-group
  4. Private - only visible to each work-item

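A hedged OpenCL C sketch showing all four in one kernel (kernel name and logic are illustrative):

__kernel void stage(__global double *g,      // global: visible to all work-items
                    __constant double *c,    // constant: read-only, all work-items
                    __local double *l) {     // local: shared within a work-group
    double p = c[0] * g[get_global_id(0)];   // p is private: one copy per work-item
    l[get_local_id(0)] = p;                  // stage into work-group local memory
    barrier(CLK_LOCAL_MEM_FENCE);            // synchronise the work-group
    g[get_global_id(0)] = l[get_local_id(0)];
}
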
65
Q

Draw a diagram showing the OpenCL memory within a kernel

A

(diagram omitted)

66
Q

OpenCL execution is made up of many stages. Which of these is the only explicit stage in CUDA, with the rest being handled by the compiler?

A

Execute kernel

67
Q

Write out the DAXPY equivalent in OpenCL

A

GPU:

__kernel void daxpy(double a, __global double *X, __global double *Y) {
    int tid = get_global_id(0);
    Y[tid] = a*X[tid] + Y[tid];
}

CPU (called using 256 work-items per work-group):

cl_kernel k = clCreateKernel(p, "daxpy", &status);
s = clSetKernelArg(k, 0, sizeof(double), &a);
i[0] = 4096;
w[0] = 256;
s = clEnqueueNDRangeKernel(q, k, 1, NULL, i, w, ...);

68
Q

Which of CUDA and OpenCL is more verbose? Why?

A

OpenCL
Because platform discovery and compilation are performed at runtime

69
Q

ADD CUDA OPENCL COMPARISON FROM SUPOS

70
Q

Which of CUDA and OpenCL supports more devices?

A

OpenCL
It targets CPUs, GPUs, FPGAs etc., whereas CUDA only targets NVIDIA's GPUs