L16 Flashcards

1
Q

What are CPUs optimised for? What are GPUs optimised for?

A

CPUs are optimised for:
* Low latency
* Programmability

GPUs are optimised for:
* High throughput
* Streaming data with little reuse

2
Q

What are some properties of a GPU chip?

A

For the same silicon area:
* More ALUs and registers
* Less control logic and smaller caches

3
Q

How do GPUs achieve simple control?

A

* No branch prediction or prefetching
* No out-of-order execution
* Groups of threads run in lockstep (SIMT)

4
Q

How do GPUs achieve high throughput?

A

* Large number of execution units, registers, and threads
* Thread switching on long-latency operations
* Fast, wide link to memory

5
Q

How can GPUs be programmed?

A

* As graphics cards (OpenGL)
* As general-purpose GPUs (CUDA, OpenMP)

6
Q

What is CUDA?

A

CUDA is a heterogeneous programming model for general-purpose GPUs:
* Host (CPU) vs Device (GPU)
* Only for NVIDIA GPUs

7
Q

True or False.

A CUDA function cannot be both __host__ and __device__.

A

False. A function can carry both qualifiers, so the compiler builds a CPU and a GPU version of it.
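
A minimal sketch (the function name and body are illustrative, not from the lecture): the same function compiles for, and is callable from, both sides.

// Compiled for both host and device; callable from CPU code and from GPU kernels.
__host__ __device__ float square(float x) {
    return x * x;
}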

8
Q

What are the three function declaration prefixes in CUDA?

A

* __global__: executed on the device, called from the host
* __device__: executed on the device, called from the device
* __host__: executed on the host, called from the host
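
A short illustrative sketch (function names and launch size are made up) showing all three prefixes together:

// Device helper: runs on the GPU, callable only from GPU code.
__device__ float scale(float x) {
    return 2.0f * x;
}

// Kernel: runs on the GPU, launched from the host.
__global__ void scaleAll(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i]);
}

// Host function: runs on the CPU, called from CPU code; launches the kernel.
__host__ void runScale(float* d_data, int n) {
    scaleAll<<<(n + 255) / 256, 256>>>(d_data, n);
}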

9
Q

What are the 4 levels of the thread hierarchy in GPUs?

A

Thread:
* Executes the kernel once

Warp:
* Group of 32 threads executing in lockstep
* One thread assigned to each SP (streaming processor) of the streaming multiprocessor (SM)
* Zero-cost context switching

Block:
* Group of threads assigned to one SM
* Runs as a group to completion

Grid:
* All the blocks for the kernel, organised in a Cartesian grid
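
A minimal kernel sketch (illustrative name and layout) showing how the hierarchy appears in code: threadIdx locates a thread within its block, blockIdx locates the block within the grid, and warps are formed implicitly from consecutive threads of a block.

__global__ void addOne(float* data, int width) {
    // blockIdx: position of this block in the grid; threadIdx: position within the block.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < width && col < width)
        data[row * width + col] += 1.0f;   // each thread executes the kernel once
}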

10
Q

Consider the following CUDA code:

MMKernel<<<dim3(width/32, width/32), dim3(32,32)>>>(Md, Nd, Pd, width);

a. What does “dim3(width/32, width/32)” specify?
b. What does “dim3(32,32)” specify?

A

a. Blocks per grid
b. Threads per block
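
A hedged expansion of that launch line (assuming width is a multiple of 32 and Md, Nd, Pd are device pointers), as it would appear in host code:

dim3 blocksPerGrid(width / 32, width / 32);   // grid dimensions: blocks per grid
dim3 threadsPerBlock(32, 32);                 // block dimensions: 32 x 32 = 1024 threads per block
MMKernel<<<blocksPerGrid, threadsPerBlock>>>(Md, Nd, Pd, width);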

11
Q

In GPUs, what happens if a block uses more resources than are available in an SM?

A

The kernel fails to launch.
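
A small sketch (kernel name and sizes are made up) of how the host observes this: requesting more per-block resources than an SM offers, here dynamic shared memory, makes the launch fail, which cudaGetLastError reports.

#include <cstdio>

__global__ void bigSharedKernel(float* data) { /* ... */ }

int main() {
    float* d_data = nullptr;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    // Ask for far more dynamic shared memory per block than an SM provides
    // (1 MB here, vs. a typical limit of tens of KB): the kernel never runs.
    size_t sharedBytes = 1024 * 1024;
    bigSharedKernel<<<1, 256, sharedBytes>>>(d_data);

    cudaError_t err = cudaGetLastError();   // reports the launch failure
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}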

12
Q

True or False.

CUDA has more performance portability than OpenCL.

A

False.

OpenCL has better performance portability but is a bit more awkward to use.

13
Q

What is an alternative to CUDA?

A

OpenCL:
* Standard for programming accelerators
* Similar to CUDA’s program structure
* Similar performance considerations
* Can run on multiple platforms using different compilers

SYCL:
* High level
* Single source
* Uses buffers for memory

14
Q

Is there a way to get portable performance from GPUs to CPUs?

A

Tuned libraries, such as MKL, can be used, but code might still be suboptimal.

15
Q

How is high throughput achieved in GPUs?

A

Many ALUs and registers

16
Q

What is the role of host and device in GPU programming?

A

Host: orchestrates execution by describing how the problem maps to threads

Device: does the computation in restricted C/C++
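
A minimal, hedged sketch of this split (an illustrative vector add, not taken from the lecture): the host allocates device memory, copies data, chooses the thread layout, and launches; the device kernel contains the computation in restricted C/C++.

#include <cstdio>

// Device side: the computation itself, in restricted C/C++.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host side: orchestrates execution and maps the problem onto threads.
int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // map the problem onto threads
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);              // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}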