Parallel Systems: Cache Coherence, Synchronization, Communication Flashcards
What is the Shared Memory Machine Model?
Also called Dance Hall Architecture
Each CPU (with its own cache) sits on one side of an interconnection network, with the shared memory on the other side
What is a Symmetric Memory Processor Model?
Each CPU has its own cache, but memory is shared by all CPUs
What is the Distributed Shared Memory Model?
Each CPU has its own cache and local memory, but can access other CPUs' memory via the network
How long does it take the SMP model to access main memory and cache?
- 100+ cycles to get to main memory
- ~2 cycles to access cache
What happens on update to a shared cache entry?
- Caches snoop on the bus for writes to addresses they hold
- On a write, the hardware can either invalidate all other cached copies OR update them with the new value
What is the memory consistency model?
- The guarantees the programmer gets about the order in which memory accesses become visible
- e.g., sequential consistency: program order within each thread + arbitrary interleaving across threads
What is memory consistency?
The model of memory ordering presented to the programmer
What is cache coherence?
How the system implements the consistency model in the presence of private caches
Shared address space + cache coherence together give the programmer a shared memory machine
What is NCC?
- Non-cache-coherent
- Shared address space, but the hardware does not keep private caches consistent
What are the operations associated with hardware cache coherence?
- Write invalidate: invalidates cache entry if present
- Write update: send updated value to update cache entries
What is the issue with scalability?
- More processors doesn’t necessarily increase performance
- Increased overhead
- Exploiting the program's parallelism can still make scaling worthwhile
What are synchronization primitives for shared memory programming?
- Locks (exclusive and shared)
- Barriers
What are atomic operations?
- Guarantees the instruction's read-modify-write executes indivisibly, with no intervening accesses
What are examples of atomic operations?
- Test-and-set
- Fetch-and-increment
- Fetch_and_Φ (Φ can be anything)
All are read-modify-write (RMW) operations
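A minimal sketch of these RMW semantics, using C11 atomics as a stand-in for the hardware instructions (the function names here are illustrative, not a real API):

```c
#include <stdatomic.h>

/* test_and_set: atomically read the old value and set the location to 1 */
int test_and_set(atomic_int *lock) {
    return atomic_exchange(lock, 1);   /* old value 0 means we "won" */
}

/* fetch_and_increment: atomically read the old value and add 1 */
int fetch_and_increment(atomic_int *counter) {
    return atomic_fetch_add(counter, 1);
}
```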
Are atomic operations sufficient to ensure synchronization?
- No. Atomicity must be guaranteed system-wide
- Atomic operations bypass the cache and go straight to memory
What are the issues with scalability with synchronization?
- Latency
- Waiting Time (application problem)
- Contention
What are the goals of spin locks?
- Exploit the cache
- Avoid disrupting the lock holder doing useful work
- Avoid contention
What are spin locks?
- The lock variable is cached by each waiter
- Waiting processes spin on their locally cached copy of the lock
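A sketch of the idea (test-and-test-and-set), assuming C11 atomics; waiters read an ordinary cached copy and only attempt the expensive RMW when the lock looks free:

```c
#include <stdatomic.h>

void spin_lock(atomic_int *lock) {
    for (;;) {
        while (atomic_load(lock) == 1)        /* spin on cached copy: */
            ;                                 /* no bus traffic while held */
        if (atomic_exchange(lock, 1) == 0)    /* test-and-set when it looks free */
            return;                           /* old value 0: lock acquired */
    }
}

void spin_unlock(atomic_int *lock) {
    atomic_store(lock, 0);                    /* invalidates waiters' copies */
}
```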
What are the issues with spin locks?
- Contention on lock release
- Takes O(N^2) bus operations to invalidate and update all N waiters' caches
What are spin locks with delay?
- Delay after lock release
- Delays should be staggered
- Delay with exponential backoff
- Don’t need cache coherence for this to work
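A sketch of test-and-set with exponential backoff (the delay constants are assumed tuning values):

```c
#include <stdatomic.h>

#define MIN_DELAY 16      /* assumed tuning constants */
#define MAX_DELAY 4096

void spin_lock_with_backoff(atomic_int *lock) {
    unsigned delay = MIN_DELAY;    /* ideally staggered per processor */
    while (atomic_exchange(lock, 1) == 1) {    /* test-and-set failed */
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                  /* delay before retrying */
        if (delay < MAX_DELAY)
            delay *= 2;                        /* exponential backoff */
    }
}
```

Because each retry re-executes the atomic operation against memory rather than spinning on a cached copy, this works even without cache coherence.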
What is ticket lock?
- Works like a deli counter: each arriving process takes a ticket via fetch-and-increment
- The lock keeps a now-serving number matching the process it is serving
- Other processes spin until their ticket number is called
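A sketch of a ticket lock in C11 atomics:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* next number to hand out */
    atomic_uint now_serving;   /* number currently being served */
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *l) {
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);  /* take a ticket */
    while (atomic_load(&l->now_serving) != my_ticket)
        ;                                  /* spin until our number is called */
}

void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);  /* call the next number */
}
```

The fetch-and-increment makes the lock fair (FIFO), unlike a plain spin lock.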
What are the issues with ticket lock?
- Contention on release, since every waiter must re-read the now-serving number from memory
- An invalidate-based coherence protocol invalidates every waiter's copy, causing a burst of re-fetches
What are array based queueing locks?
- Circular queue
- Each slot holds one of two values: has-lock or must-wait
- Array size = N number of processors
- queue_last pointer = pointer to last position in queue
- Processes point to where they are placed in the queue
- Processes spin on their spot to get the lock
- On unlock, set self to must_wait and set neighbor to has_lock
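A sketch of the array-based (Anderson) queueing lock, with N as an assumed processor count:

```c
#include <stdatomic.h>

#define N 64   /* number of processors = array size */

typedef struct {
    atomic_int  flags[N];     /* 1 = has-lock, 0 = must-wait; init flags[0] = 1 */
    atomic_uint queue_last;   /* next free position in the circular queue */
} array_lock_t;

unsigned array_acquire(array_lock_t *l) {
    unsigned my_slot = atomic_fetch_add(&l->queue_last, 1) % N;  /* take a slot */
    while (atomic_load(&l->flags[my_slot]) == 0)
        ;                                    /* spin only on my own slot */
    return my_slot;                          /* caller passes this to release */
}

void array_release(array_lock_t *l, unsigned my_slot) {
    atomic_store(&l->flags[my_slot], 0);            /* I must wait next time */
    atomic_store(&l->flags[(my_slot + 1) % N], 1);  /* neighbor gets the lock */
}
```

An unlock touches exactly one waiter's slot, so there is no stampede on release.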
What is false sharing?
- A cache block may hold multiple independent variables
- Variables appear shared because of the cache layout, even though different processors touch different variables
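A small illustration: two counters updated by different CPUs that happen to share a cache block will ping-pong between caches; padding (a 64-byte block is assumed) separates them:

```c
/* False sharing: a and b sit in the same cache block, so writes by
 * different CPUs invalidate each other even though no data is shared. */
struct counters_false_sharing {
    long a;   /* written by CPU 0 */
    long b;   /* written by CPU 1, same cache block as a */
};

/* Fix: pad so each variable gets its own cache block. */
struct counters_padded {
    long a;
    char pad[64 - sizeof(long)];   /* assumed 64-byte cache block */
    long b;
};
```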
What is gang scheduling?
OS schedules multiple threads of the same process concurrently
How do we handle spin locks over blocking threads?
- Critical section should be small
- Avoid context switching overheads by allowing waiting threads to spin rather than block
What is a linked-list based queueing lock?
- The lock variable L points to the last_requestor in the queue
- Each requester brings its own queue node and links in behind the previous last requester
- The next process in line spins on a flag in its own node until the lock holder signals it
What is the issue with linked list based queueing locks?
Linked-list maintenance must itself be atomic, so the lock needs RMW support (fetch-and-store to join the queue, compare-and-swap on unlock)
What does a linked-list based queueing lock do on unlock?
- Remove the current process's node from the queue and signal the successor
- Must check whether a new request is still forming by checking who is next; compare-and-swap resolves this race
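A sketch of the whole scheme (an MCS-style lock) in C11 atomics, showing the fetch-and-store on acquire and the compare-and-swap that closes the unlock race:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    struct qnode *_Atomic next;   /* successor in the queue, if any */
    atomic_bool locked;           /* true while this node must wait */
} qnode_t;

typedef struct {
    qnode_t *_Atomic tail;        /* last requester in the queue */
} list_lock_t;

void list_acquire(list_lock_t *l, qnode_t *me) {
    atomic_store(&me->next, NULL);
    qnode_t *prev = atomic_exchange(&l->tail, me);   /* fetch-and-store */
    if (prev != NULL) {                      /* someone ahead of us */
        atomic_store(&me->locked, true);
        atomic_store(&prev->next, me);       /* link in behind them */
        while (atomic_load(&me->locked))
            ;                                /* spin on private flag */
    }
}

void list_release(list_lock_t *l, qnode_t *me) {
    if (atomic_load(&me->next) == NULL) {
        /* No visible successor, but a new request may be forming:
         * the CAS only succeeds if we are still the last requester. */
        qnode_t *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;                          /* queue is truly empty */
        while (atomic_load(&me->next) == NULL)
            ;                                /* wait for the link to appear */
    }
    qnode_t *succ = atomic_load(&me->next);
    atomic_store(&succ->locked, false);      /* signal only the successor */
}
```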
What are the benefits of a linked-list based queueing lock?
- Fair
- Spins on private variable
- Only one processor is signaled when the lock is released
- Only one RMW primitive per CS execution
- Space complexity proportional to the number of sharers
What are the drawbacks of a linked-list based queueing lock?
- Latency to get a lock could be higher due to linked list maintenance overhead
- Needs support of RMW on architecture
What are barrier algorithms?
Threads will do work until they reach a meeting point, where they will wait until everyone arrives
What is a counting barrier?
- As each process reaches the barrier, it decrements the counter
- If you are not last, spin on counter
- If you are last process, reset the counter and release all processes to continue
What are issues with counting barriers?
- The counter may not be reset before a fast process reaches the next barrier and decrements it again
- Fix: add a second spin, first waiting for count == 0, then waiting for count to be reset to N
What is a sense reversing barrier?
Two shared variables:
- Sense (boolean that flips when all processes reach barrier)
- Count
All except the last process: decrement the counter, then spin on the sense variable until it flips
Last processor: reset count to N, flip sense variable
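A sketch of a sense-reversing barrier; each thread keeps a private sense flag that flips every barrier episode:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define N 8   /* assumed number of processes */

atomic_int  count = N;
atomic_bool sense = false;   /* shared sense flag */

void barrier(bool *my_sense) {
    *my_sense = !*my_sense;                  /* my sense for this episode */
    if (atomic_fetch_sub(&count, 1) == 1) {  /* I am the last to arrive */
        atomic_store(&count, N);             /* reset count for next episode */
        atomic_store(&sense, *my_sense);     /* flip sense: release everyone */
    } else {
        while (atomic_load(&sense) != *my_sense)
            ;                                /* spin on the sense flag only */
    }
}
```

Each thread initializes a local `bool my_sense = false;` and passes its address on every call; because count is reset before sense flips, consecutive barriers cannot interfere.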
What is a tree barrier?
- Processors are leaves in the tree hierarchy
- Two phases: arrival phase and wake up phase
Arrival Phase:
- At each branch point, when a node's count reaches 0 (all children have arrived), decrement the parent's count
- Keep going until you reach the root
Wake Up Phase:
- Reset the senses and counters back down the tree
What are the problems with tree barriers?
- Spin memory location cannot be statically determined
- Arity of the tree determines how many processors will read from the same variable (contention)
- A process may end up spinning on remote memory on an NCC NUMA machine
- May cause contention on the interconnection network on an NCC NUMA machine
How does the MCS Tree Barrier/4-ary tree work?
- Each parent has a child vector that marks how many children it has and who they are
- The child vector slots indicate each child's status (arrived or not)
- Each processor spins on one statically determined location
- In the wake-up phase, parents notify their children to wake up
- The wake-up phase starts once all children have been marked as arrived in the arrival tree
What are the benefits of the MCS Tree Barrier?
- Don’t need RMW instruction (only one processor reads/writes to the lock)
- Don’t need mutual exclusion
- Takes O(n) space
- Takes O(log n) network transactions
- Works on NCC and CC NUMA
How does MCS Tree Barrier work on a NUMA machine?
- Memory is accessible to everyone
- Remote memory is accessible via network
- Every CPU has associated memory
How does MCS Tree Barrier work on a NCC machine?
The cache only holds data from local memory
How does Tournament Barrier work?
- N players play log2(N) rounds
- Works even with message passing (no shared memory required)
- Rounds are rigged: the winner of each match is pre-selected
- On a shared memory multiprocessor, the winner spins locally on a cached variable
- Spin locations are kept static
What are the virtues of Tournament Barrier?
- Static determination of spin location
- No need for fetch-and-Φ
- Takes O(n) space
- Can exploit parallelism if the interconnection network (ICN) supports it
- Also works on cluster machine
What are the drawbacks of Tournament Barrier?
Does not exploit spatial locality
How does Dissemination Barrier work?
- Gossip protocol run over ceil(log2(N)) rounds
- In round k, process i sends a message to process (i + 2^k) mod N
- O(N) communication events per round
- A round ends once a process has sent its message and received the one it expects
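A single-episode sketch of the round structure (a reusable version would add per-round sense reversal, as in the MCS paper):

```c
#include <stdatomic.h>

#define N      8   /* assumed number of processes */
#define ROUNDS 3   /* ceil(log2(N)) */

atomic_int arrived[ROUNDS][N];   /* arrived[k][i]: i got its round-k message */

void dissemination_barrier(int i) {
    for (int k = 0, dist = 1; k < ROUNDS; k++, dist *= 2) {
        atomic_store(&arrived[k][(i + dist) % N], 1);  /* "send" to partner */
        while (atomic_load(&arrived[k][i]) == 0)
            ;                                /* "receive" my round-k message */
    }
}
```

After ceil(log2(N)) rounds every process has transitively heard from every other process, so all must have arrived.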
What are the virtues of Dissemination Barrier?
- No hierarchy (can communicate in parallel)
- No pairwise synchronization
- Each node is independent
- Total communication: O(N log2 N)
- Total amount of communication in each round is constant and equal to N
- Works for NCC, MP and SM machines
What is the trade-off when it comes to cross domain communication?
Safety vs performance
When is the procedure to call determined in RPC?
At run time
Where does the work of an RPC get done?
At run time, in the kernel
What work is involved in RPCs?
- 2 context switches (client and server address space switch in kernel)
- 2 traps
- 4 copies being made from client to server and back
How do RPC calls work in parallel systems?
- Kernel has no clue about the semantics of the RPC call
- Client and server work to facilitate via client stub and server stub
What is the flow for client-to-server process communication?
- Client writes to client stub in RPC message
- RPC message copied into kernel buffer
- Kernel copies message into server stub in server domain
- Server unpacks message and delivers to server
Total of 4 possible copies
How can the client-to-server RPC be made more efficient? (How can we reduce from 4 total copies?)
- Bind client to server through name server (one time cost)
- Name server is above the kernel and is accessible by all processes
- Kernel can import entry point for server code for ease of access into client
When is the kernel needed in the more efficient implementation of client-to-server RPC?
- On import to client
- Entry point is set up
How does the client-to-server RPC call bind the client to the server?
- Using the A-stack (argument stack), located in a communication area shared by client and server
- The client stub does most of the work
What is the cost of changing protection domains?
- Implicit cost (losing locality)
- Context switching
How does the binding object work in an RPC call?
- Client calls server code, which sends a call trap to the kernel
- The kernel validates the binding object and dispatches the call to the server through the procedure descriptor (PD)
- Arguments are passed into the A-stack by the client
- Server extracts arguments from A-stack and puts them in the execution stack
- Result from server is passed back through the A-stack to the client
Number of copies reduced to 2
What are the processes of copying data to and from the A-stack called?
Marshaling and unmarshaling
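A hypothetical sketch of marshaling for an `add(x, y)` call (the message layout and names are made up for illustration):

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t proc_id;   /* which remote procedure to invoke */
    int32_t  x, y;      /* the call's arguments, flattened */
} rpc_msg_t;

/* Client stub: serialize typed arguments into one contiguous buffer
 * so the kernel can copy a flat message between address spaces. */
size_t marshal_add(char *buf, int32_t x, int32_t y) {
    rpc_msg_t msg = { .proc_id = 1 /* hypothetical id */, .x = x, .y = y };
    memcpy(buf, &msg, sizeof msg);
    return sizeof msg;
}

/* Server stub: unpack the flat message back into typed arguments. */
void unmarshal_add(const char *buf, int32_t *x, int32_t *y) {
    rpc_msg_t msg;
    memcpy(&msg, buf, sizeof msg);
    *x = msg.x;
    *y = msg.y;
}
```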
Is it possible to reduce the number of copies in the RPC call to one?
Yes. Using Modula-3, a pointer into the A-stack can be passed as the argument
What are areas that can be optimized further in an RPC call?
- Thread switching from client to server (can run client thread in server domain to save context switching)
- Can use Liedtke’s domain packing trick to avoid implicit cost
What is unavoidable when it comes to RPC calls?
- Client/server traps
- Switching domains
- Loss of locality
How does RPC work on SMP machines?
Exploits multiple processors by preloading server domain and keeping caches warm
What is cache affinity scheduling?
- A thread prefers to run on a processor it has run on before, to take advantage of its cache entries
- Other threads can run on that processor in between, but they pollute the cache
What are different scheduling policies?
- First come, first serve (emphasizes fairness, doesn’t care about cache affinity)
- Fixed processor: thread will always run on set processor
- Last processor: thread will run on processor it last ran on
- Minimum intervening: thread runs on the processor with the fewest intervening threads
- Minimum intervening plus queue: thread runs on processor with minimum wait and intervening threads
What scheduling policies are good for light loads?
Minimum intervening, minimum intervening plus queue
What scheduling policy is good for heavy loads?
Fixed processor
What is the drawback of first come, first serve scheduling policy?
High variance in response time
What are implementation issues when it comes to scheduling?
- Global queue doesn’t scale well
- Affinity based local queues: occupancy is policy specific, and processors can steal work from other queues
- Priority of thread plays a role in determining position in queue
How do we calculate thread priority?
priority_i = base_priority_i + age_i + affinity_i (base priority, plus how long thread i has waited, plus its affinity for the processor)
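A trivial sketch of the computation (the names are illustrative):

```c
/* Priority used when positioning a thread in a processor's local queue:
 * base priority, plus how long the thread has waited (age), plus its
 * affinity for this processor. */
int thread_priority(int base_priority, int age, int affinity) {
    return base_priority + age + affinity;
}
```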
What are performance metrics for measuring scheduling?
- Throughput (System centric)
- Response time (User centric)
- Variance (User centric)
- Cache reload time vs memory footprint
How do you choose what scheduler to use?
Will depend on architecture and workload
How can schedulers be optimized?
Inserting an idle loop into the processor's scheduler can increase affinity (analogous to mutual exclusion locks with delay)