Parallel Systems: Cache Coherence, Synchronization, Communication Flashcards

1
Q

What is the Shared Memory Machine Model?

A

Also called Dance Hall Architecture

Each CPU has its own cache; the CPUs sit on one side of an interconnection network, with the shared memory on the other side

2
Q

What is a Symmetric Memory Processor Model?

A

Each CPU has its own cache, but main memory is shared by all CPUs (symmetric: uniform access time from every CPU)

3
Q

What is the Distributed Shared Memory Model?

A

Each CPU has its own cache and local memory, but can access other CPUs' memory via the network

4
Q

How long does it take the SMP model to access main memory and cache?

A
  • 100+ cycles to get to main memory
  • ~2 cycles to access the cache

5
Q

What happens on update to a shared cache entry?

A
  • Caches become snoopy (they watch the interconnect for updates to shared entries)
  • On a write, either invalidate all other cached copies OR update them all

6
Q

What is the memory consistency model?

A
  • Guarantees the order of execution the programmer can expect
  • Program order within a thread + arbitrary interleaving across threads
  • This is sequential consistency (see the sketch below)
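
The classic litmus test makes this concrete. A minimal C sketch (mine, not from the cards), assuming pthreads: under sequential consistency the outcome r1 == 0 && r2 == 0 is impossible, while weaker hardware models can produce it.

```c
#include <stdio.h>
#include <pthread.h>

int x = 0, y = 0;   /* shared flags */
int r1, r2;         /* per-thread results */

void *t1(void *arg) { x = 1; r1 = y; return NULL; }
void *t2(void *arg) { y = 1; r2 = x; return NULL; }

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Sequential consistency allows (0,1), (1,0), (1,1), never (0,0). */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```
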
7
Q

What is memory consistency?

A

The model of memory ordering that is presented to the programmer

8
Q

What is cache coherence?

A

How the system implements the consistency model in the presence of private caches

Shared memory machine = shared address space + cache coherence

9
Q

What is NCC?

A

Non-cache coherence: the hardware provides a shared address space but no cache coherence

10
Q

What are the operations associated with hardware cache coherence?

A
  • Write invalidate: invalidates the cache entry in other caches if present
  • Write update: sends the updated value so other caches can update their entries

11
Q

What is the issue with scalability?

A
  • Adding more processors doesn't necessarily increase performance
  • Increased overhead

Can still exploit parallelism, though

12
Q

What are synchronization primitives for shared memory programming?

A
  • Locks (exclusive and shared)
  • Barriers

13
Q

What are atomic operations?

A
  • Operations that execute read-modify-write as one indivisible unit; the instruction cannot be interrupted partway through
14
Q

What are examples of atomic operations?

A
  • Test-and-set
  • Fetch-and-increment
  • Fetch_and_Φ (Φ can be any operation)

Basically Read-Modify-Write (RMW); see the sketch below
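
A minimal sketch of two of these primitives using C11 atomics; mapping them onto atomic_exchange and atomic_fetch_add is my choice, not something the cards specify.

```c
#include <stdatomic.h>

/* Test-and-set: atomically write 1 and return the old value
   (a returned 0 means the caller won). */
int test_and_set(atomic_int *flag) {
    return atomic_exchange(flag, 1);
}

/* Fetch-and-increment: atomically add 1 and return the old value. */
int fetch_and_inc(atomic_int *counter) {
    return atomic_fetch_add(counter, 1);
}
```
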

15
Q

Are atomic operations sufficient to ensure synchronization?

A
  • No. Atomicity must be guaranteed system-wide
  • Don't look at the cache; go straight to memory

16
Q

What are the issues with scalability with synchronization?

A
  • Latency
  • Waiting Time (application problem)
  • Contention
17
Q

What are the goals of spin locks?

A
  • Exploit the cache
  • Avoid disrupting actual workers
  • Avoid contention
18
Q

What are spin locks?

A
  • The spin lock lives in a shared memory location that waiters cache
  • Waiting processes spin on their cached copy of the lock instead of going across the bus (see the sketch below)
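
A sketch of this spin-on-cached-copy idea (test-and-test-and-set) using C11 atomics; the details are my rendering, not the course's code.

```c
#include <stdatomic.h>

atomic_int lock_word = 0;   /* 0 = free, 1 = held */

void spin_lock(void) {
    for (;;) {
        /* Spin with plain reads on the cached copy: no bus traffic
           while the lock stays held. */
        while (atomic_load_explicit(&lock_word, memory_order_relaxed) != 0)
            ;
        /* Lock looks free: now try the expensive atomic RMW. */
        if (atomic_exchange(&lock_word, 1) == 0)
            return;
    }
}

void spin_unlock(void) {
    atomic_store(&lock_word, 0);
}
```
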

19
Q

What are the issues with spin locks?

A
  • Contention on lock release
  • Takes O(N²) operations to invalidate and update all waiting caches

20
Q

What are spin locks with delay?

A
  • Delay before retrying after a lock release
  • Delays should be staggered across waiters
  • Delay with exponential backoff (see the sketch below)
  • Doesn't need cache coherence for this to work
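
A sketch of the exponential-backoff variant, assuming C11 atomics; the starting delay and the 1024-iteration cap are arbitrary choices of mine.

```c
#include <stdatomic.h>

void spin_lock_backoff(atomic_int *lock_word) {
    unsigned delay = 1;
    while (atomic_exchange(lock_word, 1) != 0) {     /* acquisition failed */
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                        /* stay off the bus */
        if (delay < 1024)
            delay *= 2;                              /* exponential backoff */
    }
}
```
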
21
Q

What is ticket lock?

A
  • Works like a deli counter
  • The lock holds a now-serving ticket number that matches the process it is serving
  • Other processes have to wait until their number is called (see the sketch below)
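
A sketch of a ticket lock in C11 atomics; the field names are mine.

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* the ticket dispenser */
    atomic_uint now_serving;   /* whose turn it is */
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *l) {
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1); /* take a number */
    while (atomic_load(&l->now_serving) != my_ticket)
        ;                                                      /* wait to be called */
}

void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);   /* call the next number */
}
```
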
22
Q

What are the issues with ticket lock?

A
  • Contention can happen if waiters read from memory
  • An invalidate-based protocol will cause contention

23
Q

What are array based queueing locks?

A
  • Circular queue
  • Each slot holds one of two values: has_lock or must_wait
  • Array size = N, the number of processors
  • queue_last pointer = pointer to the last position in the queue
  • Each process takes the slot at which it is placed in the queue
  • Each process spins on its own slot until it gets the lock
  • On unlock, set your own slot to must_wait and your neighbor's slot to has_lock (see the sketch below)
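
A sketch of such an array-based queueing lock in C11 atomics, assuming N = 8 processors; the slot and pointer names follow the card, the rest is my rendering.

```c
#include <stdatomic.h>

#define N 8   /* number of processors (assumed) */

typedef struct {
    atomic_int must_wait[N];   /* one slot per processor; 0 = has_lock */
    atomic_uint queue_last;    /* index of the next free queue slot */
} array_lock_t;

void array_init(array_lock_t *l) {
    atomic_store(&l->must_wait[0], 0);        /* slot 0 starts with the lock */
    for (int i = 1; i < N; i++)
        atomic_store(&l->must_wait[i], 1);    /* everyone else must wait */
    atomic_store(&l->queue_last, 0);
}

unsigned array_acquire(array_lock_t *l) {
    unsigned my = atomic_fetch_add(&l->queue_last, 1) % N;  /* grab a slot */
    while (atomic_load(&l->must_wait[my]))
        ;                                     /* spin only on our own slot */
    return my;                                /* caller keeps this for release */
}

void array_release(array_lock_t *l, unsigned my) {
    atomic_store(&l->must_wait[my], 1);              /* set self to must_wait */
    atomic_store(&l->must_wait[(my + 1) % N], 0);    /* neighbor gets has_lock */
}
```
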
24
Q

What is false sharing?

A
  • A cache block may hold multiple variables
  • A variable appears shared purely because of cache layout: a write to one variable invalidates the whole block, including its neighbors (see the padding sketch below)
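
A sketch of the standard padding fix; the 64-byte cache block size is an assumption.

```c
/* Two per-thread counters. Without padding they can land in the same
   cache block, so writes by one thread invalidate the other thread's
   copy even though no data is logically shared. */
#define CACHE_LINE 64   /* assumed cache block size in bytes */

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];   /* fill out the rest of the block */
};

struct padded_counter counters[2];   /* one per thread; no false sharing */
```
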

25
Q

What is gang scheduling?

A

OS schedules multiple threads of the same process concurrently

26
Q

When do we prefer spin locks over blocking threads?

A
  • When the critical section is small
  • Spinning avoids the context-switching overhead that blocking a waiting thread would incur

27
Q

What are linked list based queueing locks?

A
  • The lock variable L stores the last_requestor in the queue
  • Each waiting process gets its own queue node and spins on a flag in that node
  • On release, the holder signals the next process in line (see the sketch below)
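
A sketch of such a lock (the MCS scheme) in C11 atomics; this is my rendering, and it relies on the RMW primitives card 31 says the architecture must support (fetch-and-store via atomic_exchange, plus compare-and-swap on release).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    struct qnode *_Atomic next;   /* who is queued behind us */
    atomic_bool locked;           /* our private spin flag */
} qnode_t;

typedef struct {
    qnode_t *_Atomic last_requestor;   /* L: tail of the queue */
} mcs_lock_t;

void mcs_acquire(mcs_lock_t *l, qnode_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* Fetch-and-store: append ourselves, learn our predecessor. */
    qnode_t *pred = atomic_exchange(&l->last_requestor, me);
    if (pred != NULL) {
        atomic_store(&pred->next, me);       /* link in behind predecessor */
        while (atomic_load(&me->locked))
            ;                                /* spin on our own node */
    }
}

void mcs_release(mcs_lock_t *l, qnode_t *me) {
    qnode_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No visible successor: if we are still the tail, queue is empty. */
        qnode_t *expected = me;
        if (atomic_compare_exchange_strong(&l->last_requestor, &expected, NULL))
            return;
        /* A new request is being formed: wait for it to link itself in. */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);      /* hand the lock to the next in line */
}
```
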
28
Q

What is the issue with linked list based queueing locks?

A

Maintaining the queue itself needs mutual exclusion (atomic RMW support) in order to work

29
Q

What does a linked list based queueing lock do on unlock?

A
  • Remove the current process from the queue
  • Must check whether a new request is in the middle of being formed, by checking who is next

30
Q

What are the benefits of a linked list based queueing lock?

A
  • Fair
  • Spins on a private variable
  • Only one processor is signaled when the lock is released
  • Only one RMW primitive per critical-section execution
  • Space complexity proportional to the number of sharers
31
Q

What are the drawbacks of a linked list based queueing lock?

A
  • Latency to get the lock can be higher due to linked-list maintenance overhead
  • Needs architectural support for RMW instructions
32
Q

What are barrier algorithms?

A

Threads will do work until they reach a meeting point, where they will wait until everyone arrives

33
Q

What is a counting barrier?

A
  • As each process reaches the barrier, it decrements the counter
  • If you are not last, spin on the counter
  • If you are the last process, reset the counter and release all processes to continue
34
Q

What are issues with counting barriers?

A
  • The counter may not be reset before fast processes reach the next barrier
  • Need to spin on a second event to combat this: first count == 0, then count reset to N
35
Q

What is a sense reversing barrier?

A

Two shared variables:

  • Sense (a boolean that flips when all processes have reached the barrier)
  • Count

All except the last process: decrement the count and spin until the sense flips

Last process: reset the count to N and flip the sense variable (see the sketch below)
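
A sketch of a sense-reversing barrier in C11 atomics; N = 8 and the handling of the thread-local my_sense are my assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define N 8                       /* number of threads (assumed) */

atomic_int  count = N;            /* shared: how many have yet to arrive */
atomic_bool sense = false;        /* shared: flips once per barrier episode */

/* Each thread keeps its own my_sense, initialized to false. */
void barrier_wait(bool *my_sense) {
    *my_sense = !*my_sense;                    /* the value we will wait for */
    if (atomic_fetch_sub(&count, 1) == 1) {    /* we are the last to arrive */
        atomic_store(&count, N);               /* reset for the next barrier */
        atomic_store(&sense, *my_sense);       /* flip sense: release everyone */
    } else {
        while (atomic_load(&sense) != *my_sense)
            ;                                  /* spin until the sense flips */
    }
}
```
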

36
Q

What is a tree variable barrier?

A
  • Processors are leaves in the tree hierarchy
  • Two phases: arrival phase and wake-up phase

Arrival Phase:

  • At each branch point, when the count reaches 0 (all children have arrived), decrement the parent's count
  • Keep going until you reach the root

Wake-Up Phase:

  • Reset the senses and counters back down the tree

37
Q

What are the problems with tree variable barriers?

A
  • The spin (memory) location cannot be statically determined
  • The arity of the tree determines how many processors read from the same variable (contention)
  • Spinning on remote memory if on an NCC/MP machine
  • May cause contention on the network if on an NCC/NUMA machine
38
Q

How does the MCS Tree Barrier/4-ary tree work?

A
  • Each parent has a child vector that marks how many children it has and who they are
  • The child vector indicates the arrival status of each child
  • One fixed spin location per processor
  • Parents notify children when it is time to wake up
  • The wake-up phase is activated once all children have been marked as arrived in the arrival tree
39
Q

What are the benefits of the MCS Tree Barrier?

A
  • Don't need an RMW instruction (only one processor reads/writes each spin location)
  • Don't need mutual exclusion
  • Takes O(n) space
  • Takes O(log n) network transactions
  • Works on NCC and CC NUMA machines
40
Q

How does MCS Tree Barrier work on a NUMA machine?

A
  • Memory is accessible to everyone
  • Remote memory is accessible via network
  • Every CPU has associated memory
41
Q

How does MCS Tree Barrier work on a NCC machine?

A

The cache only caches data from local memory (remote memory is not cached)

42
Q

How does Tournament Barrier work?

A
  • N players play log2(N) rounds
  • Works via message passing
  • Rounds are rigged: the winning processor of each match is pre-selected
  • On a shared memory multiprocessor, the winning processor can spin locally on a cached copy
  • The spin location is kept static
43
Q

What are the virtues of Tournament Barrier?

A
  • Static determination of spin location
  • No need for fetch-and-Φ
  • Takes O(n) space
  • Can exploit parallelism if ICN is available
  • Also works on cluster machine
44
Q

What are the drawbacks of Tournament Barrier?

A

Does not exploit spatial locality

45
Q

How does Dissemination Barrier work?

A
  • Gossip protocol carried out over several rounds
  • In round k, each process Pi sends a message to process P(i + 2^k) mod N and receives one in turn
  • O(N) communication events per round
  • A round ends when you have sent your message and received the expected message (see the sketch below)
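
A sketch of a dissemination barrier for a shared memory machine, assuming N = 8 processes; using counters rather than booleans for the flags (so back-to-back barriers do not race) is my choice.

```c
#include <stdatomic.h>

#define N      8                      /* number of processes (assumed) */
#define ROUNDS 3                      /* ceil(log2(N)) rounds          */

atomic_int arrived[ROUNDS][N];        /* arrived[k][i]: signals to Pi in round k */

void dissemination_barrier(int me) {
    for (int k = 0; k < ROUNDS; k++) {
        int partner = (me + (1 << k)) % N;
        atomic_fetch_add(&arrived[k][partner], 1);   /* send: signal partner */
        while (atomic_load(&arrived[k][me]) == 0)
            ;                                        /* receive: wait for our signal */
        atomic_fetch_sub(&arrived[k][me], 1);        /* consume so it can be reused */
    }
}
```
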
46
Q

What are the virtues of Dissemination Barrier?

A
  • No hierarchy (can communicate in parallel)
  • No pairwise synchronization
  • Each node is independent
  • Total communication: O(N log2 N)
  • Total amount of communication in each round is constant and equal to N
  • Works for NCC, MP and SM machines
47
Q

What is the trade-off when it comes to cross domain communication?

A

Safety vs performance

48
Q

When are call procedures determined?

A

At run time

49
Q

When do RPCs do work?

A

At run time in the kernel

50
Q

What work is involved in RPCs?

A
  • 2 context switches (client and server address space switch in kernel)
  • 2 traps
  • 4 copies being made from client to server and back
51
Q

How do RPC calls work in parallel systems?

A
  • The kernel has no clue about the semantics of the RPC call
  • The client and server facilitate the call via a client stub and a server stub

52
Q

What is the flow for client-to-server process communication?

A
  • The client stub packs the arguments into an RPC message
  • The RPC message is copied into a kernel buffer
  • The kernel copies the message into the server stub in the server domain
  • The server stub unpacks the message and delivers it to the server

Total of 4 possible copies

53
Q

How can the client-to-server RPC be made more efficient? (How can we reduce from 4 total copies?)

A
  • Bind client to server through name server (one time cost)
  • Name server is above the kernel and is accessible by all processes
  • Kernel can import entry point for server code for ease of access into client
54
Q

When is the kernel needed in the more efficient implementation of client-to-server RPC?

A
  • On import to the client
  • The entry point is set up

55
Q

How does the client-to-server RPC call bind the client to the server?

A
  • Using the A-stack in a common communication area
  • The client stub does most of the work

56
Q

What is the cost of changing protection domains?

A
  • Implicit cost (losing locality)
  • Context switching

57
Q

How does the binding object work in an RPC call?

A
  • The client calls the server code, which sends a call trap to the kernel
  • The binding object is forwarded through the PD (procedure descriptor) to the server
  • Arguments are passed into the A-stack by the client
  • The server extracts the arguments from the A-stack and puts them on its execution stack
  • The result from the server is passed back through the A-stack to the client

Number of copies reduced to 2

58
Q

What are the processes of copying data to and from the A-stack called?

A

Marshaling and unmarshaling

59
Q

Is it possible to reduce the number of copies in the RPC call to one?

A

Yes. Using Modula-3, a pointer can be passed as the argument for the A-stack

60
Q

What are areas that can be optimized further in an RPC call?

A
  • Thread switching from client to server (can run client thread in server domain to save context switching)
  • Can use Liedtke’s domain packing trick to avoid implicit cost
61
Q

What is unavoidable when it comes to RPC calls?

A
  • Client/server traps
  • Switching domains
  • Loss of locality
62
Q

How does RPC work on SMP machines?

A

Exploits multiple processors by preloading server domain and keeping caches warm

63
Q

What is cache affinity scheduling?

A
  • Threads will prefer to run on processors that have run them before, to take advantage of existing cache entries
  • Other threads can run on that processor in the meantime, but will pollute the cache
64
Q

What are different scheduling policies?

A
  • First come, first serve (emphasizes fairness, doesn't care about cache affinity)
  • Fixed processor: a thread always runs on a set processor
  • Last processor: a thread runs on the processor it last ran on
  • Minimum intervening: a thread runs on the processor with the fewest intervening threads since it last ran there
  • Minimum intervening plus queue: a thread runs on the processor with the fewest waiting plus intervening threads
65
Q

What scheduling policies are good for light loads?

A

Minimum intervening, minimum intervening plus queue

66
Q

What scheduling policy is good for heavy loads?

A

Fixed processor

67
Q

What is the drawback of first come, first serve scheduling policy?

A

High variance in response time

68
Q

What are implementation issues when it comes to scheduling?

A
  • Global queue doesn’t scale well
  • Affinity based local queues: occupancy is policy specific, and processors can steal work from other queues
  • Priority of thread plays a role in determining position in queue
69
Q

How do we calculate thread priority?

A

Priority_i = BasePriority_i + Age_i + Affinity_i

(a thread's base priority, plus how long it has been waiting, plus its cache affinity for the processor)

70
Q

What are performance metrics for measuring scheduling?

A
  • Throughput (System centric)
  • Response time (User centric)
  • Variance (User centric)
  • Cache reload time vs memory footprint
71
Q

How do you choose what scheduler to use?

A

Will depend on architecture and workload

72
Q

How can schedulers be optimized?

A

Inserting an idle loop in processor scheduling can help increase affinity (like mutual exclusion locks with delay)