GPUs Flashcards

1
Q

What is a GPU?

A

Graphics Processing Unit
Specialised hardware for graphics computation, highly parallel multiprocessors

2
Q

List the 6 stages in the graphics pipeline

A
  1. Input assembler
  2. Vertex shader
  3. Geometry shader
  4. Setup & rasterise
  5. Pixel shader
  6. Output merger
3
Q

Which 3 stages in the graphics pipeline are programmable?

A
  1. Vertex shader
  2. Geometry shader
  3. Pixel shader
    (all the shaders!!)
4
Q

Draw the graphics pipeline, showing the unified processor array

A

(diagram not reproduced)

5
Q

Tasks pass through the unified processor array ? ?

A

several times

6
Q

GPUs take advantage of ? level parallelism

A

data

7
Q

Which can support more threads: CPU or GPU?

A

GPU - it is designed to run many threads concurrently, whereas a CPU is optimised for single-threaded performance

8
Q

What is the design goal of a CPU?

A

Designed for single-threaded performance

9
Q

What is the design goal of a GPU?

A

Designed for throughput
- individual thread performance is not critical
- each core is relatively simple

10
Q

Write out the code for a DAXPY (Double A X Plus Y) loop

A

for (int i = 0; i < 1024; ++i) {
    Z[i] = a * X[i] + Y[i];
}
Multiply the vector of doubles X by the scalar a, then add the vector of doubles Y, storing the result in Z

11
Q

Why can we parallelise a DAXPY loop?

A

All iterations are independent because there are no data dependencies between them
e.g. Z[0] = a*X[0] + Y[0] is independent of Z[1] = a*X[1] + Y[1]

12
Q

How do GPUs parallelise code?

A

Using SIMT

13
Q

Describe SIMT

A

Single Instruction, Multiple Threads
Each processor runs many threads concurrently, each with its own registers and memory
All threads execute the same instructions but on different data
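
A minimal CUDA sketch of the DAXPY loop in SIMT form (not from the notes; kernel and variable names are illustrative): every thread executes the same kernel code, and threadIdx/blockIdx give each thread a different element to work on.

// One thread computes one element of Z; all threads run the same
// instructions on different data (SIMT).
__global__ void daxpy(int n, double a, const double *X, const double *Y, double *Z)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element index
    if (i < n)                                      // guard for the final block
        Z[i] = a * X[i] + Y[i];
}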

14
Q

What is 1 similarity between SIMD and SIMT?

A

Both take advantage of data-level parallelism

15
Q

What are the differences between SIMD and SIMT?

A

SIMD is a form of vector processing - vector registers each hold several items, vector operations work on vector registers

In SIMT, each instruction operates on threads - same instruction issued to each thread, each with separate state (registers and memory)

i.e. one register storing many data items vs many registers each storing one data item

16
Q

Draw a diagram showing SIMT execution of the DAXPY loop with 4 threads in lock-step

A

(diagram not reproduced)

17
Q

In the case of 4 threads, how many times is the DAXPY kernel called?

A

Called 256 times, with 4 threads each time (1024 iterations / 4 threads per call = 256 calls)
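
As a hedged illustration only (assuming a CUDA kernel like the daxpy sketch above), the same arithmetic appears in the launch configuration: 256 groups of 4 threads cover all 1024 elements.

// Illustrative host-side launch: 256 blocks x 4 threads = 1024 threads,
// one per element; the kernel body runs once per thread.
int n = 1024;
int threadsPerBlock = 4;                                   // mirrors the 4-thread example
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // = 256
daxpy<<<blocks, threadsPerBlock>>>(n, a, X, Y, Z);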

18
Q

Each thread executes the DAXPY ?

A

kernel

19
Q

What is a warp?

A

Group of threads that are executed simultaneously
All threads in a warp execute the same instruction

20
Q

What support is there in a GPU for conditional execution?

A

Use a predicate register to hold conditions, then only execute something if the condition holds - predication
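
A minimal CUDA sketch (illustrative names, not from the notes) of the kind of per-thread conditional that predication covers: every thread evaluates the condition into a predicate, and the guarded write only takes effect in the threads where it is true.

// The guarded statement is a natural candidate for predication: threads where
// the condition is false still step through the instruction, but its effect is
// suppressed rather than branched around.
__global__ void clamp_negatives(int n, double *X)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && X[i] < 0.0)   // condition evaluated per thread
        X[i] = 0.0;            // takes effect only where the predicate is true
}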

21
Q

What is a downside of predication?

A

The thread sits idle if the predicate is false, leading to inefficiency

22
Q

Draw a diagram to show how conditional branches in GPUs are treated

23
Q

When might we need more complex branching?

A

Procedure call and return, nested if-then-else etc.

24
Q

How is more complex branching handled?

A

Handled by hardware with special instructions
Each thread has its own stack that predicates are pushed to and popped from

25
Q

What is branch divergence?

A

When some but not all threads in a warp take a branch

26
Q

When is complex branching inefficient?

A

When branch divergence occurs

27
Q

What is the relationship between branch nesting and efficiency?

A

Efficiency decreases as branch nesting increases

28
Q

Write a simple block of code and work out its efficiency

A
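
The original worked answer is not reproduced here, so the following is a hedged sketch (CUDA-style, with made-up numbers) of the kind of calculation intended: count useful thread-instruction slots against issued slots for one 32-thread warp that hits a divergent branch.

// Suppose 8 of the 32 threads in a warp take the if-side and 24 take the else-side.
__global__ void example(float *x)
{
    int i = threadIdx.x;
    if (x[i] < 0.0f)           // compare: all 32 threads active
        x[i] = -x[i];          // ~2 instructions, only 8/32 threads active
    else
        x[i] = x[i] * 2.0f;    // ~2 instructions, only 24/32 threads active
}
// The two sides of the branch are serialised across the warp, so:
//   useful slots = 32 + 8*2 + 24*2  = 96
//   issued slots = 32 + 32*2 + 32*2 = 160
//   efficiency   = 96 / 160 = 60%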

29
Q

What are the disadvantages of data parallelism?

A

- High memory bandwidth requirements
- Memory accesses can be high latency

30
Q

When do we get stalls?

A

From long-latency operations such as memory accesses (when a thread's data is not in the cache)

31
Q

How can synchronisation between threads cause delays?

A

All threads may need to finish the instructions before their memory barrier before any of them can continue
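
A minimal CUDA sketch of such a barrier (illustrative; assumes a block size of 256 and that n is a multiple of it): no thread may read a neighbour's element until every thread in the block has written its own.

// Each thread stages one element in shared memory; all threads must reach the
// barrier before any of them reads a value written by another thread.
__global__ void rotate_within_block(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();           // every thread in the block waits here
    out[i] = tile[(threadIdx.x + 1) % blockDim.x];
}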

32
Q

What happens if there is a stall?

A

The whole warp has to wait, even if only 1 thread has stalled, since all threads need to proceed at once

33
Q

How can we overcome a stall / hide latency by doing useful work?

A

By organising execution better - execute other warps of threads whilst waiting

34
Q

Draw a diagram showing one way of executing a DAXPY loop that will cause a stall

A

(diagram not reproduced)

35
Q

Draw a diagram showing a better way of organising DAXPY execution that will avoid a stall

A

(diagram not reproduced)

36
Q

What is the multithreading support in a GPU?

A

Much more thread state than processing resources (i.e. more registers than ALUs), so threads can be swapped in to hide latency

37
Q

What are the trade-offs with multithreading?

A

+ Supports many threads, which enables a lot of latency hiding
- Overhead in area from supporting lots of threads, because places are needed to keep their thread state

38
Q

What is a warp scheduler?

A

Responsible for choosing which warps to execute
Selects warps whose operands are ready, for latency hiding

39
Q

How are warps scheduled? Draw a diagram to clarify

A

One instruction from each warp is executed, then the next instruction from each warp is executed - this can be in a different order
This keeps all threads busy because we are executing instructions across all groups of threads (warps)

40
Q

Multithreading hides ?

A

latency