P3L4 Paper: Performance of Spin Locks Flashcards
Why is spinlock performance under contention important
- A heavily used lock can slow down otherwise highly performant applications (long queues of processors waiting to acquire the lock)
- A spinlock that guards a bottleneck critical section => can affect all processes on the system!!!
Describe attributes of an efficient spinlock implementation
A good spinlock algorithm needs to be efficient both under high contention (minimal cost of spin-waiting) and under low contention (quick to acquire a free lock)
Name Important parameters for efficient spin lock implementation under high lock contention
- Minimize communication bandwidth on the system bus (do not continuously poll the lock value with atomic reads) => such polling prevents other processors from doing actual work
- Minimize the time between the lock being released and acquired by another processor (no one is executing the critical section during that gap)
Tradeoff: the more we poll, the higher the bandwidth overhead, but the quicker we can acquire the lock once it is freed
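One side of this tradeoff can be sketched as a test-and-set lock with exponential backoff (a minimal sketch in C11 atomics; all names are hypothetical, not from the paper). Longer delays cut polling bandwidth, but a freed lock may sit idle until the next poll:

```c
#include <stdatomic.h>

/* Sketch: spin lock that backs off exponentially between polls.
   Doubling the delay reduces bus traffic from repolling, at the cost
   of a longer gap between release and the next acquisition. */
typedef struct { atomic_int locked; } backoff_lock;

void backoff_acquire(backoff_lock *l) {
    int delay = 1;
    while (atomic_exchange(&l->locked, 1)) {   /* returned 1: lock busy */
        for (volatile int i = 0; i < delay; i++)
            ;                                  /* wait before repolling */
        if (delay < 1024)
            delay *= 2;                        /* back off exponentially */
    }
}

void backoff_release(backoff_lock *l) {
    atomic_store(&l->locked, 0);               /* free the lock */
}
```

The cap on `delay` is one knob for the tradeoff: a higher cap means less polling traffic but a potentially longer idle critical section.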
Describe the novel approach to queue spin-lock implementation. When is it good, when is it bad?
Maintain a queue of spinning processors. Whenever the lock is freed, the next processor in line can acquire the lock.
How it works: no explicit queue with enqueue and dequeue operations is maintained.
Instead, each arriving processor does an atomic read-and-increment
to obtain a unique, ordered serial number. When the lock is released, the releasing processor taps the processor holding serial number + 1.
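The serial-number scheme can be sketched roughly as a ticket lock in C11 atomics (a simplified sketch, not the paper's exact data structure; the paper's version additionally gives each spinner its own cache line to spin on):

```c
#include <stdatomic.h>

/* Sketch: each arriving processor takes a serial number via an atomic
   read-and-increment, then spins until "now_serving" reaches it.
   Release increments now_serving, tapping serial number + 1. */
typedef struct {
    atomic_uint next_ticket;  /* next serial number to hand out */
    atomic_uint now_serving;  /* serial number allowed into the CS */
} ticket_lock;

void ticket_acquire(ticket_lock *l) {
    /* one atomic operation per arrival, not per spin iteration */
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != me)
        ;  /* spin on an ordinary read, not an atomic RMW */
}

void ticket_release(ticket_lock *l) {
    /* hand the lock directly to the next serial number */
    atomic_fetch_add(&l->now_serving, 1);
}
```

Note that the atomic read-modify-write happens only once per acquisition; the spinning itself uses plain reads, which is where the reduced bus traffic comes from.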
Describe advantages of the novel approach to queue spin-lock
Advantage
- time until the next processor acquires the lock and executes the critical section is reduced compared to static delays (both methods) or backoffs
- reduced cache invalidations
- Overhead decreases with more processors
Describe disadvantages of the novel approach to queue spin-lock
- increased lock latency (time until the lock is acquired)
When should queueing be used and when another method (static delay, backoff after lock release or after each memory reference)?
- With contention: queuing is best
- Without contention: queuing is worst due to its high lock latency.
Name CPU cache update methods
- **No-write**
  - Writes are not allowed to the cache - only to memory
  - Each write invalidates the corresponding cache entry
- **Write-through**
  - A processor ALWAYS writes to BOTH cache AND memory
  - Avoids cache incoherence
- **Write-back**
  - The write goes to the cache first; the write to the memory location happens later
Name cache coherence mechanisms. What is their advantage?
- Write invalidate
- Write-update
- **Tradeoff**:
  - Write-invalidate: lower bandwidth on the shared interconnect!
    - Reason: only the memory address (not the whole value!) is sent over the bus or I/C to invalidate the other caches
    - If the address is already invalidated in other caches, subsequent updates cause no bus or I/C traffic (already invalidated)
    - This can amortize the cost of the extra memory request (via bus or I/C) on a cache miss caused by invalidation
  - Write-update: the updated value is available instantaneously! No cache miss - no need for an extra memory reference.
    - Higher traffic on the bus or I/C, as the updated value has to be written into every CPU cache holding that entry
Can atomic instruction be executed against the cache of one CPU only?
No. An atomic operation always bypasses the CPU cache and goes directly to the memory location. This avoids cache incoherence.
What are the downsides/ costs of atomic instructions?
- Slow: they always go directly to the memory controller instead of the cache, so that they can be ordered & synchronized and cache incoherence is avoided
- Create traffic on the bus / I/C due to coherence traffic (update or invalidation of the CPU caches that contain that entry)
How do you evaluate atomic operations on an SMP (shared-memory multiprocessor system) vs. atomic instructions on a single-CPU system?
Atomic operations are more expensive than on single-CPU systems because they
- always contend for memory (bypassing the cache) via the bus or I/C
- cause even more bus or I/C contention due to coherence traffic (regardless of whether the memory location was actually updated)
- (Me: single-CPU systems could theoretically operate on the cache and let the underlying write-back or write-through mechanism update the RAM)
What kind of hardware support do spinlocks require?
Atomic instructions. Otherwise, threads could interleave between checking whether the lock is busy and setting it.
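The interleaving problem can be illustrated with a non-atomic "check then set" sketch (hypothetical code, not from the paper). Between the read and the write, another thread can observe the same free lock, so both may "acquire" it:

```c
/* BROKEN sketch: the read of busy and the write to busy are two
   separate instructions. A second thread scheduled between them also
   sees busy == 0, and both threads enter the critical section.
   An atomic test-and-set fuses the two steps into one indivisible op. */
int busy = 0;

int broken_try_acquire(void) {
    if (busy == 0) {   /* step 1: read  - another thread may run here */
        busy = 1;      /* step 2: write - both threads may reach this */
        return 1;      /* "acquired" (possibly by both threads!) */
    }
    return 0;          /* lock observed busy */
}
```

Single-threaded the code behaves as expected; the bug only manifests under concurrent interleaving, which is exactly why hardware atomicity is required.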
Name Spinlock Performance Metrics
Lock Latency, Delay / Lock handover waiting time, Contention
Name advantages and disadvantages of the test-and-set spinlock implementation
Advantage:
- portable
- minimal latency (just one atomic operation) - this metric does not consider contention
- potentially minimal delay (spin continuously on the atomic operation) - a freed lock is acquired quickly
Disadvantage:
- Contention: spinning on the atomic operation means processors go to memory on every spin iteration, saturating the bus.
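The test-and-set lock above can be sketched in a few lines of C11 atomics (a minimal sketch; names are hypothetical). Every spin iteration is an atomic read-modify-write, which is precisely the contention disadvantage listed:

```c
#include <stdatomic.h>

/* Minimal test-and-set spin lock. Acquiring spins on the atomic
   operation itself, so each failed iteration generates bus/memory
   and coherence traffic -- cheap when uncontended, costly otherwise. */
typedef struct { atomic_flag held; } tas_lock;

void tas_acquire(tas_lock *l) {
    while (atomic_flag_test_and_set(&l->held))
        ;  /* spin: each poll is an atomic RMW that goes to memory */
}

void tas_release(tas_lock *l) {
    atomic_flag_clear(&l->held);  /* free the lock */
}
```

The portability advantage shows here too: `atomic_flag` is the one lock-free type C11 guarantees on every conforming implementation.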