Multiprocessors Flashcards
What are multiprocessors?
Multiprocessors are tightly coupled processors whose coordination and usage are controlled by a single operating system and that usually share memory through a shared address space.
Multicores: when all cores are on the same chip (called manycores when there are more than 32 cores).
Multiple Instruction Multiple Data (MIMD)
How to connect processors?
Single bus vs. interconnection network:
The single-bus approach imposes constraints on the number of processors connected to it (up to now, 36 is the largest number of processors connected in a commercial single-bus system) => bus saturation.
To connect many processors with high bandwidth, the system needs to use more than a single bus => introduction of an interconnection network.
What are the cost and performance tradeoffs between the ways to connect processors?
The network-connected machine has a smaller initial cost, but its costs scale up more quickly than those of the bus-connected machine.
Performance for both machines scales linearly until the bus reaches its limit; from then on, the bus-connected machine's performance stays flat no matter how many processors are used.
When these two effects are combined, the network-connected machine has consistent performance per unit cost, while the bus-connected machine has a ‘sweet spot’ plateau (8 to 16 processors).
Network-connected MPs have better cost/performance to the left of the plateau (because they are less expensive) and to the right of the plateau (because they have higher performance).
See picture 30
What are the different network topologies?
Single bus
Ring
Mesh
N-cube
Crossbar Network
What are the single-bus and ring network topologies?
Single bus: not capable of simultaneous transactions.
Ring: capable of many simultaneous transfers (like a segmented bus). Some nodes are not directly connected, so communication between them has to pass through intermediate nodes to reach the final destination (multiple hops).
See picture 31
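A minimal sketch of the multiple-hops idea in C, assuming a ring of P nodes numbered 0..P-1 (the function name is illustrative):

/* Minimum number of hops between nodes i and j in a ring of P nodes
   (a message can travel in either direction around the ring). */
int ring_hops(int i, int j, int P) {
    int cw = (j - i + P) % P;   /* clockwise distance */
    int ccw = P - cw;           /* counter-clockwise distance */
    return cw < ccw ? cw : ccw;
}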
How do we analyze network performance, and what is it for the single-bus and ring topologies?
P = number of nodes
M = number of links
b = bandwidth of a single link
Total Network Bandwidth (best case): M x b, i.e. the number of links multiplied by the bandwidth of each link.
For the single-bus topology, the total network bandwidth is just the bandwidth of the bus: (1 x b)
For the ring topology P = M and the total network bandwidth is P times the bandwidth of one link: (P x b)
Bisection Bandwidth (worst case): calculated by dividing the machine into two parts, each with half the nodes, and summing the bandwidth of the links crossing that imaginary dividing line.
For the ring topology it is two times the link bandwidth: (2 x b).
For the single-bus topology it is just the bus bandwidth: (1 x b).
See picture 32
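As a sketch, these formulas can be written directly in C using the definitions above (P, M, b); the function names are illustrative:

/* Total network bandwidth (best case): number of links times link bandwidth. */
double total_bandwidth(int M, double b) { return M * b; }

/* Single bus: 1 link => total = bisection = 1 x b. */
double bus_total(double b)     { return 1 * b; }
double bus_bisection(double b) { return 1 * b; }

/* Ring of P nodes: M = P links => total = P x b; any cut into two halves
   severs exactly 2 links => bisection = 2 x b. */
double ring_total(int P, double b) { return P * b; }
double ring_bisection(double b)    { return 2 * b; }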
What is the crossbar network and its performance metrics?
Crossbar network (or fully connected network): every processor has a bidirectional dedicated communication link to every other processor => very high cost.
Total Bandwidth: (P x (P - 1) / 2) x b
Bisection Bandwidth: (P/2)^2 x b
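A sketch of the crossbar formulas in the same style (function names are illustrative):

/* Fully connected network of P nodes: one dedicated link per pair of
   processors => P x (P - 1) / 2 links in total. */
double crossbar_total(int P, double b) { return (P * (P - 1) / 2.0) * b; }

/* Splitting the machine into two halves leaves (P/2) x (P/2) links
   crossing the cut => bisection = (P/2)^2 x b. */
double crossbar_bisection(int P, double b) { return (P / 2.0) * (P / 2.0) * b; }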
What is the bidimensional mesh and its performance metrics?
See picture 33
What is the hypercube and its performance metrics?
See picture 34
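The two cards above defer to the pictures for the formulas; as a sketch, the standard values from the literature, assuming a square mesh of P = n x n nodes and a hypercube of P = 2^d nodes, are:

/* 2D square mesh with P = n x n nodes: 2 x n x (n - 1) links in total;
   a cut through the middle severs n = sqrt(P) links. */
double mesh_total(int n, double b)     { return 2.0 * n * (n - 1) * b; }
double mesh_bisection(int n, double b) { return n * b; }

/* Hypercube with P = 2^d nodes: each node has d links => (d x P) / 2 links;
   a cut through the middle severs P/2 links. */
double hypercube_total(int d, double b)     { int P = 1 << d; return (d * P / 2.0) * b; }
double hypercube_bisection(int d, double b) { int P = 1 << d; return (P / 2.0) * b; }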
What are the possible memory space models?
1) Single logically shared address space: a memory reference can be made by any processor to any memory location through loads/stores => Shared Memory Architectures.
The address space is shared among processors: the same physical address on 2 processors refers to the same location in memory.
2) Multiple and private address spaces: The processors communicate among them through send/receive primitives => Message Passing Architectures.
The address space is logically disjoint and cannot be addressed by different processors: the same physical address on 2 processors refers to 2 different locations in 2 different memories.
How is the communication managed in a shared address space?
Implicit management of the communication through load/store operations that can access any memory location.
Shared memory model imposes the cache coherence problem among processors.
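A minimal sketch of implicit communication through loads/stores, assuming POSIX threads (the variable names are illustrative):

#include <pthread.h>
#include <stdio.h>

int shared_data = 0;   /* lives in the single shared address space */

void *producer(void *arg) {
    shared_data = 42;  /* a plain store: no explicit "send" is needed */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);         /* wait for the producer to finish */
    printf("%d\n", shared_data);   /* a plain load observes the store */
    return 0;
}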
How is the communication managed in multiple private address spaces?
The processors communicate among themselves by sending/receiving messages: message passing protocol.
The memory of one processor cannot be accessed by another processor without the assistance of software protocols.
No cache coherence problem among processors
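A minimal message-passing sketch using MPI send/receive primitives, assuming two ranks (the values are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        /* explicit send: rank 1 cannot read rank 0's memory directly */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}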
What are the possibilities for physical memory organization?
Centralized Memory:
UMA (Uniform Memory Access): The access time to a memory location is uniform for all the processors: no matter which processor requests it and no matter which word is asked.
Distributed Memory:
The physical memory is divided into memory modules distributed among the processors (one local module per processor).
NUMA (Non Uniform Memory Access): The access time to a memory location is non uniform for all the processors: it depends on the location of the data word in memory and the processor location.
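A sketch of explicit data placement on a NUMA machine, assuming Linux with libnuma installed (compile with -lnuma; the node number is illustrative):

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    size_t sz = 1 << 20;
    /* Allocate 1 MiB on memory node 0: accesses are fast (local) for a
       processor on node 0 and slower (remote) for processors on other nodes. */
    char *buf = numa_alloc_onnode(sz, 0);
    if (buf) {
        buf[0] = 1;              /* touch the page so it is actually placed */
        numa_free(buf, sz);
    }
    return 0;
}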
What is the relation between the address space and the physical memory organization?
See picture 35
What is the problem of cache coherence?
When shared data are cached, the shared values may be replicated in multiple caches.
The use of multiple copies of same data introduces a new problem: cache coherence.
Multiple copies are not a problem when reading, but a processor must have exclusive access to write a word.
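A toy simulation of the problem (not real hardware behavior): two private caches each take a copy of the same word; without a coherence protocol, a write by one processor leaves a stale copy in the other. All names are illustrative:

#include <stdio.h>

typedef struct { int value; int valid; } line_t;

int memory_X = 5;        /* the shared location X */
line_t cacheA, cacheB;   /* private caches of CPU A and CPU B */

int main(void) {
    cacheA = (line_t){ memory_X, 1 };   /* CPU A reads X */
    cacheB = (line_t){ memory_X, 1 };   /* CPU B reads X */

    cacheA.value = 7;   /* CPU A writes X only in its own cache */

    /* CPU B still sees the old value 5: the two caches are incoherent. */
    printf("A sees %d, B sees %d\n", cacheA.value, cacheB.value);
    return 0;
}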
What is the solution to the cache coherence problem?
HW-based solutions to maintain coherence: cache coherence protocols.
The key issue in implementing a cache coherence protocol in multiprocessors is tracking the state of any sharing of a data block.
Two classes of protocols:
- Snooping Protocols
- Directory-Based Protocols
What is the snooping protocol?
All cache controllers monitor (snoop) on the bus to determine whether or not they have a copy of the block requested on the bus and respond accordingly.
Every cache that has a copy of the shared block, also has a copy of the sharing state of the block, and no centralized state is kept.
Send all requests for shared data to all processors.
What are the types of snooping protocols?
- Write-Invalidate Protocol
- Write-Update (or Write-Broadcast) Protocol
What is the write-invalidate protocol?
The writing processor issues an invalidation signal over the bus to cause all copies in other caches to be invalidated before changing its local copy.
This scheme uses the bus only on the first write to invalidate the other copies.
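A sketch of the invalidation step over a toy model with N caches attached to a bus (the data structures are illustrative):

#define N 4
typedef struct { int value; int valid; } line_t;
line_t cache[N];   /* the copy of the block held by each processor's cache */

/* Processor p writes: broadcast an invalidate on the bus, then write locally. */
void write_invalidate(int p, int new_value) {
    for (int i = 0; i < N; i++)
        if (i != p)
            cache[i].valid = 0;    /* every other copy is invalidated */
    cache[p].value = new_value;    /* only the writer keeps a valid copy */
    cache[p].valid = 1;
}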
What is the write-update protocol?
The writing processor broadcasts the new data over the bus; all caches check if they have a copy of the data and, if so, all copies are updated with the new value.
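By contrast, a sketch of the update step, reusing the cache[] array and line_t from the write-invalidate sketch above:

/* Processor p writes: broadcast the new value; every cache holding a valid
   copy updates it, so all copies stay consistent. */
void write_update(int p, int new_value) {
    cache[p].valid = 1;
    for (int i = 0; i < N; i++)
        if (cache[i].valid)
            cache[i].value = new_value;
}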
What combination of horizontal and vertical cache coherence policies is generally used?
Write-invalidate + write-back: the bus is used only on the first write to invalidate the other copies, and later writes stay local in the cache.
Write-update + write-through: every write is broadcast to the other caches and propagated to memory, so all copies and memory stay up to date.
What is MSI and what are the states it has?
It is a write-invalidate snooping protocol for write-back caches.
Each cache block can be in one of three states:
- Modified (or Dirty): the cache has the only copy, it is writeable and dirty (the block cannot be shared anymore)
- Shared (or Clean, read-only): the block is clean (not modified) and can be read
- Invalid: the block contains no valid data
Each block of memory is in one of three states:
- Shared in all caches and up-to-date in memory (Clean)
- Modified in exactly one cache (Dirty)
- Uncached when not in any cache
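A sketch of the per-block cache state as a C enum, with the meanings above as comments (the names are illustrative):

/* State kept by each cache for each block in the MSI protocol. */
typedef enum {
    MSI_INVALID,    /* the block contains no valid data */
    MSI_SHARED,     /* clean, read-only copy; other caches may also hold one */
    MSI_MODIFIED    /* the only copy, writeable and dirty; memory is stale */
} msi_state_t;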
What are the possible consequences of a read miss on a cache block?
IF (other caches have the block tagged SHARED) => the blocks stay SHARED;
IF (another cache has the block tagged MODIFIED) =>
WRITE-BACK the block to memory;
the block becomes SHARED in the previous cache;
LOAD the block from memory into the requesting cache as SHARED.
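The read-miss rules above as a sketch over a toy model with N caches (msi_state_t as in the previous sketch; the structures are illustrative):

#define N 4
typedef struct { msi_state_t state; int value; } msi_line_t;
msi_line_t cache[N];   /* each processor's copy of the block */
int memory;            /* the backing memory word for the block */

/* Processor p has a read miss on the block. */
void read_miss(int p) {
    for (int i = 0; i < N; i++) {
        if (i != p && cache[i].state == MSI_MODIFIED) {
            memory = cache[i].value;       /* WRITE-BACK the dirty block */
            cache[i].state = MSI_SHARED;   /* previous owner keeps a SHARED copy */
        }                                  /* copies already SHARED stay SHARED */
    }
    cache[p].value = memory;               /* LOAD the block from memory */
    cache[p].state = MSI_SHARED;           /* ...into the requesting cache as SHARED */
}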
What are the possible consequences of a write miss in a cache block?
IF (other caches have the requested block tagged SHARED) => INVALIDATE the blocks in the other caches;
IF (another cache has the requested block tagged MODIFIED) =>
WRITE-BACK the block to memory;
INVALIDATE the block in the previous cache;
LOAD the block from memory into the requesting cache as MODIFIED;
WRITE in the requesting cache.
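And the write-miss rules over the same toy model:

/* Processor p has a write miss on the block. */
void write_miss(int p, int new_value) {
    for (int i = 0; i < N; i++) {
        if (i == p) continue;
        if (cache[i].state == MSI_MODIFIED)
            memory = cache[i].value;       /* WRITE-BACK the dirty block first */
        cache[i].state = MSI_INVALID;      /* INVALIDATE every other copy */
    }
    cache[p].value = memory;               /* LOAD the block from memory */
    cache[p].state = MSI_MODIFIED;         /* the requester now owns the block */
    cache[p].value = new_value;            /* WRITE in the requesting cache */
}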