GPUs Flashcards

1
Q

What is a GPU?

A

Graphics Processing Unit
Specialised hardware for graphics computation, highly parallel multiprocessors

2
Q

List the 6 stages in the graphics pipeline

A
  1. Input assembler
  2. Vertex shader
  3. Geometry shader
  4. Setup & rasterise
  5. Pixel shader
  6. Output merger
3
Q

Which 3 stages in the graphics pipeline are programmable?

A
  1. Vertex shader
  2. Geometry shader
  3. Pixel shader
    (all the shaders!!)
4
Q

Draw the graphics pipeline, showing the unified processor array

A

(diagram not reproduced)

5
Q

Tasks pass through the unified processor array ? ?

A

several times

6
Q

GPUs take advantage of ? level parallelism

A

data

7
Q

Which can support more threads: CPU or GPU?

A

GPU - it is designed to run many threads concurrently, whereas a CPU is optimised for single-threaded performance

8
Q

What is the design goal of a CPU?

A

Designed for single-threaded performance

9
Q

What is the design goal of a GPU?

A

Designed for throughput
- individual thread performance is not critical
- each core is relatively simple

10
Q

Write out the code for a DAXPY (Double A X Plus Y) loop

A

for (int i = 0; i < 1024; ++i) {
    Z[i] = a * X[i] + Y[i];
}
Multiply the vector of doubles X by the scalar a, then add the vector of doubles Y, storing the result in Z

11
Q

Why can we parallelise a DAXPY loop?

A

All iterations are independent because there are no data dependencies between them
e.g. Z[0] = a*X[0] + Y[0] is independent of Z[1] = a*X[1] + Y[1]

12
Q

How do GPUs parallelise code?

A

Using SIMT

13
Q

Describe SIMT

A

Single Instruction, Multiple Threads
Each processor runs many threads concurrently, each with its own registers and memory
All threads execute the same instructions but on different data
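
A minimal CUDA sketch of the DAXPY loop in SIMT form (not from the notes; kernel and variable names are illustrative): every thread executes the same kernel code, and threadIdx/blockIdx give each thread a different element to work on.

// One thread computes one element of Z; all threads run the same
// instructions on different data (SIMT).
__global__ void daxpy(int n, double a, const double *X, const double *Y, double *Z)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element index
    if (i < n)                                      // guard for the final block
        Z[i] = a * X[i] + Y[i];
}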

14
Q

What is 1 similarity between SIMD and SIMT?

A

Both take advantage of data-level parallelism

15
Q

What are the differences between SIMD and SIMT?

A

SIMD is a form of vector processing - vector registers each hold several items, vector operations work on vector registers

In SIMT, each instruction operates on threads - same instruction issued to each thread, each with separate state (registers and memory)

i.e. one register storing many data items vs many registers each storing one data item

16
Q

Draw a diagram showing SIMT execution of the DAXPY loop with 4 threads in lock-step

A

(diagram not reproduced)

17
Q

In the case of 4 threads, how many times is the DAXPY kernel called?

A

Called 256 times, with 4 threads each time (1024 iterations / 4 threads per call = 256 calls)
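
As a hedged illustration only (assuming a CUDA kernel like the daxpy sketch above), the same arithmetic appears in the launch configuration: 256 groups of 4 threads cover all 1024 elements.

// Illustrative host-side launch: 256 blocks x 4 threads = 1024 threads,
// one per element; the kernel body runs once per thread.
int n = 1024;
int threadsPerBlock = 4;                                   // mirrors the 4-thread example
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // = 256
daxpy<<<blocks, threadsPerBlock>>>(n, a, X, Y, Z);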

18
Q

Each thread executes the DAXPY ?

A

kernel

19
Q

What is a warp?

A

Group of threads that are executed simultaneously
All threads in a warp execute the same instruction

20
Q

What support is there in a GPU for conditional execution?

A

Use a predicate register to hold conditions, then only execute something if the condition holds - predication
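
A minimal CUDA sketch (illustrative names, not from the notes) of the kind of per-thread conditional that predication covers: every thread evaluates the condition into a predicate, and the guarded write only takes effect in the threads where it is true.

// The guarded statement is a natural candidate for predication: threads where
// the condition is false still step through the instruction, but its effect is
// suppressed rather than branched around.
__global__ void clamp_negatives(int n, double *X)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && X[i] < 0.0)   // condition evaluated per thread
        X[i] = 0.0;            // takes effect only where the predicate is true
}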

21
Q

What is a downside of predication?

A

The thread sits idle if the predicate is false, leading to inefficiency

22
Q

Draw a diagram to show how conditional branches in GPUs are treated

23
Q

When might we need more complex branching?

A

Procedure call and return, nested if-then-else etc.

24
Q

How is more complex branching handled?

A

Handled by hardware with special instructions
Each thread has its own stack that predicates are pushed to and popped from

25
Q

What is branch divergence?

A

When some but not all threads in a warp take a branch

26
Q

When is complex branching inefficient?

A

When branch divergence occurs

27
Q

What is the relationship between branch nesting and efficiency?

A

Efficiency decreases as branch nesting increases

28
Q

Write a simple block of code and work out its efficiency

A
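
The original worked answer is not reproduced here, so the following is a hedged sketch (CUDA-style, with made-up numbers) of the kind of calculation intended: count useful thread-instruction slots against issued slots for one 32-thread warp that hits a divergent branch.

// Suppose 8 of the 32 threads in a warp take the if-side and 24 take the else-side.
__global__ void example(float *x)
{
    int i = threadIdx.x;
    if (x[i] < 0.0f)           // compare: all 32 threads active
        x[i] = -x[i];          // ~2 instructions, only 8/32 threads active
    else
        x[i] = x[i] * 2.0f;    // ~2 instructions, only 24/32 threads active
}
// The two sides of the branch are serialised across the warp, so:
//   useful slots = 32 + 8*2 + 24*2  = 96
//   issued slots = 32 + 32*2 + 32*2 = 160
//   efficiency   = 96 / 160 = 60%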

29
Q

What are the disadvantages of data parallelism?

A

- High memory bandwidth requirements
- Memory accesses can be high latency

30
Q

When do we get stalls?

A

From long-latency operations such as memory accesses (when a thread's data is not in the cache)

31
Q

How can synchronisation between threads cause delays?

A

All threads may need to finish the instructions before their memory barrier before any of them can continue
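
A minimal CUDA sketch of such a barrier (illustrative; assumes a block size of 256 and that n is a multiple of it): no thread may read a neighbour's element until every thread in the block has written its own.

// Each thread stages one element in shared memory; all threads must reach the
// barrier before any of them reads a value written by another thread.
__global__ void rotate_within_block(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();           // every thread in the block waits here
    out[i] = tile[(threadIdx.x + 1) % blockDim.x];
}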

32
Q

What happens if there is a stall?

A

The whole warp has to wait, even if only 1 thread has stalled, since all threads need to proceed at once

33
Q

How can we overcome a stall / hide latency by doing useful work?

A

By organising execution better - execute other warps of threads whilst waiting

34
Q

Draw a diagram showing one way of executing a DAXPY loop that will cause a stall

A

(diagram not reproduced)

35
Q

Draw a diagram showing a better way of organising DAXPY execution that will avoid a stall

A

(diagram not reproduced)

36
Q

What is the multithreading support in a GPU?

A

Much more thread state than processing resources (i.e. more registers than ALUs), so threads can be swapped in to hide latency

37
Q

What are the trade-offs with multithreading?

A

+ Supports many threads, which enables a lot of latency hiding
- Overhead in area from supporting lots of threads, because places are needed to keep their thread state

38
Q

What is a warp scheduler?

A

Responsible for choosing which warps to execute
Selects warps whose operands are ready, for latency hiding

39
Q

How are warps scheduled? Draw a diagram to clarify

A

One instruction from each warp is executed, then the next instruction from each warp is executed - this can be in a different order
This keeps all threads busy because we are executing instructions across all groups of threads (warps)

40
Q

Multithreading hides ?

A

latency