Parallel Systems: Cache Coherence, Synchronization, Communication Flashcards
What is the Shared Memory Machine Model?
Also called Dance Hall Architecture
Each CPU (with its own cache) sits on one side of an interconnection network, with the shared memory on the other side
What is a Symmetric Memory Processor Model?
Each CPU has its own cache, but memory is shared by all CPUs
What is the Distributed Shared Memory Model?
Each CPU has its own cache and local memory, but can access other CPUs' memory via the network
How long does it take the SMP model to access main memory and cache?
- 100+ cycles to get to main memory
- ~2 cycles to access cache
What happens on update to a shared cache entry?
- Caches snoop on the bus for writes to addresses they hold
- On a write, the hardware can either invalidate all other cached copies OR update them with the new value
What is the memory consistency model?
- The guarantees the programmer gets about the order in which memory accesses become visible
- e.g., sequential consistency: program order within each thread + arbitrary interleaving across threads
What is memory consistency?
The model of memory ordering presented to the programmer
What is cache coherence?
How the system implements the consistency model in the presence of private caches
Shared address space + cache coherence together give the programmer a shared memory machine
What is NCC?
- Non-cache-coherent
- Shared address space, but the hardware does not keep private caches consistent
What are the operations associated with hardware cache coherence?
- Write invalidate: invalidates cache entry if present
- Write update: send updated value to update cache entries
What is the issue with scalability?
- More processors doesn’t necessarily increase performance
- Increased overhead
- Exploiting the program's parallelism can still make scaling worthwhile
What are synchronization primitives for shared memory programming?
- Locks (exclusive and shared)
- Barriers
What are atomic operations?
- Guarantees the instruction's read-modify-write executes indivisibly, with no intervening accesses
What are examples of atomic operations?
- Test-and-set
- Fetch-and-increment
- Fetch_and_Φ (Φ can be anything)
All are read-modify-write (RMW) operations
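A minimal sketch of these RMW semantics, using C11 atomics as a stand-in for the hardware instructions (the function names here are illustrative, not a real API):

```c
#include <stdatomic.h>

/* test_and_set: atomically read the old value and set the location to 1 */
int test_and_set(atomic_int *lock) {
    return atomic_exchange(lock, 1);   /* old value 0 means we "won" */
}

/* fetch_and_increment: atomically read the old value and add 1 */
int fetch_and_increment(atomic_int *counter) {
    return atomic_fetch_add(counter, 1);
}
```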
Are atomic operations sufficient to ensure synchronization?
- No. Atomicity must be guaranteed system-wide
- Atomic operations bypass the cache and go straight to memory
What are the issues with scalability with synchronization?
- Latency
- Waiting Time (application problem)
- Contention
What are the goals of spin locks?
- Exploit the cache
- Avoid disrupting the lock holder doing useful work
- Avoid contention
What are spin locks?
- The lock variable is cached by each waiter
- Waiting processes spin on their locally cached copy of the lock
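A sketch of the idea (test-and-test-and-set), assuming C11 atomics; waiters read an ordinary cached copy and only attempt the expensive RMW when the lock looks free:

```c
#include <stdatomic.h>

void spin_lock(atomic_int *lock) {
    for (;;) {
        while (atomic_load(lock) == 1)        /* spin on cached copy: */
            ;                                 /* no bus traffic while held */
        if (atomic_exchange(lock, 1) == 0)    /* test-and-set when it looks free */
            return;                           /* old value 0: lock acquired */
    }
}

void spin_unlock(atomic_int *lock) {
    atomic_store(lock, 0);                    /* invalidates waiters' copies */
}
```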
What are the issues with spin locks?
- Contention on lock release
- Takes O(N^2) bus operations to invalidate and update all N waiters' caches
What are spin locks with delay?
- Delay after lock release
- Delays should be staggered
- Delay with exponential backoff
- Don’t need cache coherence for this to work
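A sketch of test-and-set with exponential backoff (the delay constants are assumed tuning values):

```c
#include <stdatomic.h>

#define MIN_DELAY 16      /* assumed tuning constants */
#define MAX_DELAY 4096

void spin_lock_with_backoff(atomic_int *lock) {
    unsigned delay = MIN_DELAY;    /* ideally staggered per processor */
    while (atomic_exchange(lock, 1) == 1) {    /* test-and-set failed */
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                  /* delay before retrying */
        if (delay < MAX_DELAY)
            delay *= 2;                        /* exponential backoff */
    }
}
```

Because each retry re-executes the atomic operation against memory rather than spinning on a cached copy, this works even without cache coherence.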
What is ticket lock?
- Works like a deli counter: each arriving process takes a ticket via fetch-and-increment
- The lock keeps a now-serving number matching the process it is serving
- Other processes spin until their ticket number is called
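A sketch of a ticket lock in C11 atomics:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* next number to hand out */
    atomic_uint now_serving;   /* number currently being served */
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *l) {
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);  /* take a ticket */
    while (atomic_load(&l->now_serving) != my_ticket)
        ;                                  /* spin until our number is called */
}

void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);  /* call the next number */
}
```

The fetch-and-increment makes the lock fair (FIFO), unlike a plain spin lock.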
What are the issues with ticket lock?
- Contention on release, since every waiter must re-read the now-serving number from memory
- An invalidate-based coherence protocol invalidates every waiter's copy, causing a burst of re-fetches
What are array based queueing locks?
- Circular queue
- Each slot holds one of two values: has-lock or must-wait
- Array size = N number of processors
- queue_last pointer = pointer to last position in queue
- Processes point to where they are placed in the queue
- Processes spin on their spot to get the lock
- On unlock, set self to must_wait and set neighbor to has_lock
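A sketch of the array-based (Anderson) queueing lock, with N as an assumed processor count:

```c
#include <stdatomic.h>

#define N 64   /* number of processors = array size */

typedef struct {
    atomic_int  flags[N];     /* 1 = has-lock, 0 = must-wait; init flags[0] = 1 */
    atomic_uint queue_last;   /* next free position in the circular queue */
} array_lock_t;

unsigned array_acquire(array_lock_t *l) {
    unsigned my_slot = atomic_fetch_add(&l->queue_last, 1) % N;  /* take a slot */
    while (atomic_load(&l->flags[my_slot]) == 0)
        ;                                    /* spin only on my own slot */
    return my_slot;                          /* caller passes this to release */
}

void array_release(array_lock_t *l, unsigned my_slot) {
    atomic_store(&l->flags[my_slot], 0);            /* I must wait next time */
    atomic_store(&l->flags[(my_slot + 1) % N], 1);  /* neighbor gets the lock */
}
```

An unlock touches exactly one waiter's slot, so there is no stampede on release.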
What is false sharing?
- A cache block may hold multiple independent variables
- Variables appear shared because of the cache layout, even though different processors touch different variables
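A small illustration: two counters updated by different CPUs that happen to share a cache block will ping-pong between caches; padding (a 64-byte block is assumed) separates them:

```c
/* False sharing: a and b sit in the same cache block, so writes by
 * different CPUs invalidate each other even though no data is shared. */
struct counters_false_sharing {
    long a;   /* written by CPU 0 */
    long b;   /* written by CPU 1, same cache block as a */
};

/* Fix: pad so each variable gets its own cache block. */
struct counters_padded {
    long a;
    char pad[64 - sizeof(long)];   /* assumed 64-byte cache block */
    long b;
};
```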
What is gang scheduling?
OS schedules multiple threads of the same process concurrently
How do we handle spin locks over blocking threads?
- Critical section should be small
- Avoid context switching overheads by allowing waiting threads to spin rather than block
What is a linked-list based queueing lock?
- The lock variable L points to the last_requestor in the queue
- Each requester brings its own queue node and links in behind the previous last requester
- The next process in line spins on a flag in its own node until the lock holder signals it
What is the issue with linked list based queueing locks?
Linked-list maintenance must itself be atomic, so the lock needs RMW support (fetch-and-store to join the queue, compare-and-swap on unlock)
What does a linked-list based queueing lock do on unlock?
- Remove the current process's node from the queue and signal the successor
- Must check whether a new request is still forming by checking who is next; compare-and-swap resolves this race
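A sketch of the whole scheme (an MCS-style lock) in C11 atomics, showing the fetch-and-store on acquire and the compare-and-swap that closes the unlock race:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    struct qnode *_Atomic next;   /* successor in the queue, if any */
    atomic_bool locked;           /* true while this node must wait */
} qnode_t;

typedef struct {
    qnode_t *_Atomic tail;        /* last requester in the queue */
} list_lock_t;

void list_acquire(list_lock_t *l, qnode_t *me) {
    atomic_store(&me->next, NULL);
    qnode_t *prev = atomic_exchange(&l->tail, me);   /* fetch-and-store */
    if (prev != NULL) {                      /* someone ahead of us */
        atomic_store(&me->locked, true);
        atomic_store(&prev->next, me);       /* link in behind them */
        while (atomic_load(&me->locked))
            ;                                /* spin on private flag */
    }
}

void list_release(list_lock_t *l, qnode_t *me) {
    if (atomic_load(&me->next) == NULL) {
        /* No visible successor, but a new request may be forming:
         * the CAS only succeeds if we are still the last requester. */
        qnode_t *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;                          /* queue is truly empty */
        while (atomic_load(&me->next) == NULL)
            ;                                /* wait for the link to appear */
    }
    qnode_t *succ = atomic_load(&me->next);
    atomic_store(&succ->locked, false);      /* signal only the successor */
}
```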
What are the benefits of a linked-list based queueing lock?
- Fair
- Spins on private variable
- Only one processor is signaled when the lock is released
- Only one RMW primitive per CS execution
- Space complexity proportional to the number of sharers
What are the drawbacks of a linked-list based queueing lock?
- Latency to get a lock could be higher due to linked list maintenance overhead
- Needs support of RMW on architecture
What are barrier algorithms?
Threads will do work until they reach a meeting point, where they will wait until everyone arrives
What is a counting barrier?
- As each process reaches the barrier, it decrements the counter
- If you are not last, spin on counter
- If you are last process, reset the counter and release all processes to continue
What are issues with counting barriers?
- The counter may not be reset before a fast process reaches the next barrier and decrements it again
- Fix: add a second spin, first waiting for count == 0, then waiting for count to be reset to N
What is a sense reversing barrier?
Two shared variables:
- Sense (boolean that flips when all processes reach barrier)
- Count
All except the last process: decrement the counter, then spin on the sense variable until it flips
Last processor: reset count to N, flip sense variable
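A sketch of a sense-reversing barrier; each thread keeps a private sense flag that flips every barrier episode:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define N 8   /* assumed number of processes */

atomic_int  count = N;
atomic_bool sense = false;   /* shared sense flag */

void barrier(bool *my_sense) {
    *my_sense = !*my_sense;                  /* my sense for this episode */
    if (atomic_fetch_sub(&count, 1) == 1) {  /* I am the last to arrive */
        atomic_store(&count, N);             /* reset count for next episode */
        atomic_store(&sense, *my_sense);     /* flip sense: release everyone */
    } else {
        while (atomic_load(&sense) != *my_sense)
            ;                                /* spin on the sense flag only */
    }
}
```

Each thread initializes a local `bool my_sense = false;` and passes its address on every call; because count is reset before sense flips, consecutive barriers cannot interfere.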
What is a tree barrier?
- Processors are leaves in the tree hierarchy
- Two phases: arrival phase and wake up phase
Arrival Phase:
- At each branch point, when a node's count reaches 0 (all children have arrived), decrement the parent's count
- Keep going until you reach the root
Wake Up Phase:
- Reset the senses and counters back down the tree
What are the problems with tree barriers?
- Spin memory location cannot be statically determined
- Arity of the tree determines how many processors will read from the same variable (contention)
- A process may end up spinning on remote memory on an NCC NUMA machine
- May cause contention on the interconnection network on an NCC NUMA machine
How does the MCS Tree Barrier/4-ary tree work?
- Each parent has a child vector that marks how many children it has and who they are
- The child vector slots indicate each child's status (arrived or not)
- Each processor spins on one statically determined location
- In the wake-up phase, parents notify their children to wake up
- The wake-up phase starts once all children have been marked as arrived in the arrival tree
What are the benefits of the MCS Tree Barrier?
- Don’t need RMW instruction (only one processor reads/writes to the lock)
- Don’t need mutual exclusion
- Takes O(n) space
- Takes O(log n) network transactions
- Works on NCC and CC NUMA
How does MCS Tree Barrier work on a NUMA machine?
- Memory is accessible to everyone
- Remote memory is accessible via network
- Every CPU has associated memory
How does MCS Tree Barrier work on a NCC machine?
The cache only holds data from local memory
How does Tournament Barrier work?
- N players play log2(N) rounds
- Works even with message passing (no shared memory required)
- Rounds are rigged: the winner of each match is pre-selected
- On a shared memory multiprocessor, the winner spins locally on a cached variable
- Spin locations are kept static
What are the virtues of Tournament Barrier?
- Static determination of spin location
- No need for fetch-and-Φ
- Takes O(n) space
- Can exploit parallelism if the interconnection network (ICN) supports it
- Also works on cluster machine
What are the drawbacks of Tournament Barrier?
Does not exploit spatial locality
How does Dissemination Barrier work?
- Gossip protocol run over ceil(log2(N)) rounds
- In round k, process i sends a message to process (i + 2^k) mod N
- O(N) communication events per round
- A round ends once a process has sent its message and received the one it expects
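A single-episode sketch of the round structure (a reusable version would add per-round sense reversal, as in the MCS paper):

```c
#include <stdatomic.h>

#define N      8   /* assumed number of processes */
#define ROUNDS 3   /* ceil(log2(N)) */

atomic_int arrived[ROUNDS][N];   /* arrived[k][i]: i got its round-k message */

void dissemination_barrier(int i) {
    for (int k = 0, dist = 1; k < ROUNDS; k++, dist *= 2) {
        atomic_store(&arrived[k][(i + dist) % N], 1);  /* "send" to partner */
        while (atomic_load(&arrived[k][i]) == 0)
            ;                                /* "receive" my round-k message */
    }
}
```

After ceil(log2(N)) rounds every process has transitively heard from every other process, so all must have arrived.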
What are the virtues of Dissemination Barrier?
- No hierarchy (can communicate in parallel)
- No pairwise synchronization
- Each node is independent
- Total communication: O(N log2 N)
- Total amount of communication in each round is constant and equal to N
- Works for NCC, MP and SM machines
What is the trade-off when it comes to cross domain communication?
Safety vs performance
When is the procedure to call determined in RPC?
At run time
Where does the work of an RPC get done?
At run time, in the kernel
What work is involved in RPCs?
- 2 context switches (client and server address space switch in kernel)
- 2 traps
- 4 copies being made from client to server and back
How do RPC calls work in parallel systems?
- Kernel has no clue about the semantics of the RPC call
- Client and server work to facilitate via client stub and server stub
What is the flow for client-to-server process communication?
- Client writes to client stub in RPC message
- RPC message copied into kernel buffer
- Kernel copies message into server stub in server domain
- Server unpacks message and delivers to server
Total of 4 possible copies
How can the client-to-server RPC be made more efficient? (How can we reduce from 4 total copies?)
- Bind client to server through name server (one time cost)
- Name server is above the kernel and is accessible by all processes
- Kernel can import entry point for server code for ease of access into client
When is the kernel needed in the more efficient implementation of client-to-server RPC?
- On import to client
- Entry point is set up
How does the client-to-server RPC call bind the client to the server?
- Using the A-stack (argument stack), located in a communication area shared by client and server
- The client stub does most of the work
What is the cost of changing protection domains?
- Implicit cost (losing locality)
- Context switching
How does the binding object work in an RPC call?
- Client calls server code, which sends a call trap to the kernel
- The kernel validates the binding object and dispatches the call to the server through the procedure descriptor (PD)
- Arguments are passed into the A-stack by the client
- Server extracts arguments from A-stack and puts them in the execution stack
- Result from server is passed back through the A-stack to the client
Number of copies reduced to 2
What are the processes of copying data to and from the A-stack called?
Marshaling and unmarshaling
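A hypothetical sketch of marshaling for an `add(x, y)` call (the message layout and names are made up for illustration):

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t proc_id;   /* which remote procedure to invoke */
    int32_t  x, y;      /* the call's arguments, flattened */
} rpc_msg_t;

/* Client stub: serialize typed arguments into one contiguous buffer
 * so the kernel can copy a flat message between address spaces. */
size_t marshal_add(char *buf, int32_t x, int32_t y) {
    rpc_msg_t msg = { .proc_id = 1 /* hypothetical id */, .x = x, .y = y };
    memcpy(buf, &msg, sizeof msg);
    return sizeof msg;
}

/* Server stub: unpack the flat message back into typed arguments. */
void unmarshal_add(const char *buf, int32_t *x, int32_t *y) {
    rpc_msg_t msg;
    memcpy(&msg, buf, sizeof msg);
    *x = msg.x;
    *y = msg.y;
}
```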
Is it possible to reduce the number of copies in the RPC call to one?
Yes. Using Modula-3, a pointer into the A-stack can be passed as the argument
What are areas that can be optimized further in an RPC call?
- Thread switching from client to server (can run client thread in server domain to save context switching)
- Can use Liedtke’s domain packing trick to avoid implicit cost
What is unavoidable when it comes to RPC calls?
- Client/server traps
- Switching domains
- Loss of locality
How does RPC work on SMP machines?
Exploits multiple processors by preloading server domain and keeping caches warm
What is cache affinity scheduling?
- A thread prefers to run on a processor it has run on before, to take advantage of its cache entries
- Other threads can run on that processor in between, but they pollute the cache
What are different scheduling policies?
- First come, first serve (emphasizes fairness, doesn’t care about cache affinity)
- Fixed processor: thread will always run on set processor
- Last processor: thread will run on processor it last ran on
- Minimum intervening: thread runs on the processor with the fewest intervening threads
- Minimum intervening plus queue: thread runs on processor with minimum wait and intervening threads
What scheduling policies are good for light loads?
Minimum intervening, minimum intervening plus queue
What scheduling policy is good for heavy loads?
Fixed processor
What is the drawback of first come, first serve scheduling policy?
High variance in response time
What are implementation issues when it comes to scheduling?
- Global queue doesn’t scale well
- Affinity based local queues: occupancy is policy specific, and processors can steal work from other queues
- Priority of thread plays a role in determining position in queue
How do we calculate thread priority?
priority_i = base_priority_i + age_i + affinity_i (base priority, plus how long thread i has waited, plus its affinity for the processor)
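A trivial sketch of the computation (the names are illustrative):

```c
/* Priority used when positioning a thread in a processor's local queue:
 * base priority, plus how long the thread has waited (age), plus its
 * affinity for this processor. */
int thread_priority(int base_priority, int age, int affinity) {
    return base_priority + age + affinity;
}
```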
What are performance metrics for measuring scheduling?
- Throughput (System centric)
- Response time (User centric)
- Variance (User centric)
- Cache reload time vs memory footprint
How do you choose what scheduler to use?
Will depend on architecture and workload
How can schedulers be optimized?
Inserting an idle loop into the processor's scheduler can increase affinity (analogous to mutual exclusion locks with delay)