7 - Communication & Synchronisation Flashcards
How do work items/threads communicate?
Through memory
What is the ideal case for memory?
One type that is large, cheap and fast
What are the trade-offs between large, cheap and fast memory?
Large = slow/expensive
Cheap = small/slow
Fast = small/expensive
What are the 4 types of GPU memory?
Private memory, local memory, global memory, constant memory
What are the attributes of private memory?
Very fast, only accessible by a single work item, held in registers, tens to hundreds of bytes
What are the attributes of local memory?
Fast, accessible by all work items within a single work group, user-managed cache, KB to MB
What are the attributes of global memory?
Slow, accessible by work items from all work groups, DRAM, GB
What are the attributes of constant memory?
Fast, accessible by all work items, part of global memory but cached, read-only, relatively small, KB
What should you minimise time spent on?
Memory operations
How do you minimise time spent on memory operations?
Move frequently accessed data to a faster memory
What is the order of memory types from slowest to fastest?
Host, global, local, private
What doesn’t benefit from moving frequently accessed data to a faster memory?
Single or sporadic accesses
When does data end up in global memory?
When it is transferred from host to device
What is local memory used for?
Holding a local copy of the input so that repeated accesses are faster
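Worked example (a CPU-side counting model in Python, not GPU code): the sketch below counts global-memory reads for a 3-point stencil, with and without a local copy per work group. The function name `count_global_reads`, the stencil and the group size are illustrative assumptions, not part of the course material.

```python
# CPU-side model only: it counts reads, it does not run on a GPU.
def count_global_reads(n, group_size, use_local_copy):
    """Count global-memory reads for a 3-point stencil over n elements."""
    reads = 0
    for group_start in range(0, n, group_size):
        group_end = min(group_start + group_size, n)
        if use_local_copy:
            # The group copies its tile plus one halo element on each side
            # (clamped at the array edges) into local memory exactly once.
            lo = max(group_start - 1, 0)
            hi = min(group_end + 1, n)
            reads += hi - lo
        else:
            # Every work item reads its left, centre and right neighbours
            # directly from global memory (edge items have one fewer).
            for i in range(group_start, group_end):
                reads += len({max(i - 1, 0), i, min(i + 1, n - 1)})
    return reads


print(count_global_reads(1024, 64, use_local_copy=False))  # 3070
print(count_global_reads(1024, 64, use_local_copy=True))   # 1054
```

In this model the local copy cuts global reads by roughly a factor of three, because each input value is fetched from global memory once per group instead of once per neighbouring work item.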
Why do you need synchronisation?
Accesses to shared locations need to be correctly synchronised/coordinated to avoid race conditions
What are 3 types of synchronisation mechanisms?
Barriers/memory fences
Atomic operations
Separate kernel launches
What do barriers do?
Ensure that all work items within the same work group reach the same point before any of them continues
Which has lower overhead, global or local memory barriers?
Local
Where should you avoid putting barriers?
In conditional statements. A barrier must be reached by all work items in the group, otherwise the kernel can deadlock
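Worked example (a CPU analogy using Python threads, assuming one thread per work item): each "work item" writes its own slot in a shared list standing in for local memory, waits at a barrier, then reads a neighbour's slot. The names `run_work_group` and `work_item` are hypothetical.

```python
import threading


def run_work_group(group_size):
    local_mem = [None] * group_size   # stands in for local memory
    result = [None] * group_size
    barrier = threading.Barrier(group_size)

    def work_item(local_id):
        # Phase 1: each work item writes its own slot.
        local_mem[local_id] = local_id * 10
        # Every work item must reach this call; if one skipped it
        # (for example inside an if-branch), the others would wait forever.
        barrier.wait()
        # Phase 2: safe to read a slot written by another work item.
        neighbour = (local_id + 1) % group_size
        result[local_id] = local_mem[neighbour]

    threads = [threading.Thread(target=work_item, args=(i,))
               for i in range(group_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


print(run_work_group(4))  # [10, 20, 30, 0]
```

Without the barrier, a work item could read its neighbour's slot before it had been written and see `None`.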
What is impossible on modern GPU/CPU hardware?
Synchronising different work groups with a barrier inside a single kernel
How do you synchronise different work groups?
By writing and launching separate kernels
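Worked example (a sequential Python model, assuming a sum reduction as the task): the first "kernel" lets each work group reduce its own slice, and the second "kernel" combines the per-group results. The function names are hypothetical; the point is that the boundary between the two launches acts as the global synchronisation point.

```python
def partial_sums_kernel(data, group_size):
    """'Kernel 1': each work group reduces its own slice to one value."""
    return [sum(data[start:start + group_size])
            for start in range(0, len(data), group_size)]


def final_sum_kernel(partials):
    """'Kernel 2': only launched once kernel 1 has finished for every group."""
    return sum(partials)


def reduce_in_two_launches(data, group_size):
    partials = partial_sums_kernel(data, group_size)  # first launch
    # All groups are guaranteed to be done here, so their results can be combined.
    return final_sum_kernel(partials)                 # second launch


print(partial_sums_kernel(list(range(8)), 4))         # [6, 22]
print(reduce_in_two_launches(list(range(100)), 16))   # 4950
```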
What do atomic functions do?
Provide a mechanism for atomic (without interruption) memory operations
What do atomic functions guarantee?
Race-free updates to the memory location they operate on
How are atomic updates performed?
Serially, so there is a performance penalty
What is the order of atomic updates?
The order is unspecified, so they should only be used with associative and commutative operators
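Worked example (a small Python check, not an atomics API): the sketch applies the same set of updates in every possible order and collects the distinct final values. With addition or max there is only one outcome, while the order-sensitive update `double_then_add` (a made-up operator for illustration) gives several.

```python
from itertools import permutations
import operator


def results_over_all_orders(op, initial, updates):
    """Apply the updates in every possible order; return the set of final values."""
    outcomes = set()
    for order in permutations(updates):
        value = initial
        for u in order:
            value = op(value, u)
        outcomes.add(value)
    return outcomes


def double_then_add(value, u):
    # Hypothetical update whose result depends on the order of application.
    return value * 2 + u


print(results_over_all_orders(operator.add, 0, [1, 2, 3, 4]))  # {10}
print(results_over_all_orders(max, 0, [1, 2, 3, 4]))           # {4}
print(len(results_over_all_orders(double_then_add, 0, [1, 2, 3])) > 1)  # True
```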
What are the limitations of atomic functions?
Atomics are slower than normal accesses
Performance degrades with many simultaneous attempts to perform atomic operations on the same data
When should atomic functions be used?
For infrequent, sparse and/or unpredictable global communication
Where possible, use local (shared) memory and structure algorithms to avoid synchronisation
What do global memory reads by the GPU involve?
Reading entire blocks of data
What is memory coalescing?
Accessing data sequentially for better performance. When another requested value lies in the same block, no additional memory access is required
What is the effect of the stride in strided memory access?
The stride changes the access pattern. If the stride is larger than the block size, every access falls in a different block and the benefits of block reads are lost
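Worked example (a Python counting model, assuming work item i reads element i * stride and that block size is measured in elements): the function `blocks_touched` is a hypothetical helper that counts how many distinct memory blocks a group of work items needs.

```python
def blocks_touched(num_work_items, stride, block_size):
    """Number of distinct blocks read when work item i reads element i * stride."""
    return len({(i * stride) // block_size for i in range(num_work_items)})


# 32 work items, blocks of 32 elements (both values are illustrative).
print(blocks_touched(32, stride=1, block_size=32))   # 1  (fully coalesced)
print(blocks_touched(32, stride=2, block_size=32))   # 2
print(blocks_touched(32, stride=4, block_size=32))   # 4
print(blocks_touched(32, stride=32, block_size=32))  # 32 (one block per work item)
print(blocks_touched(32, stride=64, block_size=32))  # 32
```

Once the stride reaches the block size, each work item needs its own block, so a larger stride cannot make things worse in this model but no block is shared any more.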