Stuff I Missed Flashcards
MPI_Gather()?
Each node sends the contents of the send buffer to the root node.
Root node stores them in rank order.
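The data movement can be sketched in plain Python (this is a simulation of the semantics, not real MPI; the 4-rank buffers are made-up example data):

```python
# Pure-Python sketch of MPI_Gather's data movement (not actual MPI).
def gather(send_buffers, root=0):
    """Collect every rank's send buffer at the root, in rank order."""
    recv = []
    for rank in range(len(send_buffers)):
        recv.extend(send_buffers[rank])  # rank order is preserved
    return recv  # in real MPI, only the root's receive buffer is filled

# Assume 4 ranks, each contributing 2 elements:
send_buffers = [[0, 1], [10, 11], [20, 21], [30, 31]]
print(gather(send_buffers))  # → [0, 1, 10, 11, 20, 21, 30, 31]
```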
MPI_Scatter()?
Root process splits its send buffer into equal chunks and sends one chunk to each process (including itself).
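The chunking can be sketched in plain Python (a simulation of the semantics, not real MPI; the 8-element buffer and 4 processes are assumed for illustration):

```python
# Pure-Python sketch of MPI_Scatter's data movement (not actual MPI).
def scatter(send_buffer, n_procs, root=0):
    """Split the root's buffer into n_procs equal chunks; chunk i goes to rank i."""
    chunk = len(send_buffer) // n_procs  # MPI_Scatter requires an even split
    return [send_buffer[i * chunk:(i + 1) * chunk] for i in range(n_procs)]

print(scatter([0, 1, 2, 3, 4, 5, 6, 7], 4))  # → [[0, 1], [2, 3], [4, 5], [6, 7]]
```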
MPI_Alltoall()?
Each node performs scatter on its own data. Thus every node receives some data from every other node.
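Viewed as data movement, alltoall is a transpose: the chunk rank i sends to rank j ends up in slot i of rank j's receive buffer. A plain-Python sketch (a simulation, not real MPI):

```python
# Pure-Python sketch of MPI_Alltoall's data movement (not actual MPI).
def alltoall(send):
    """send[i][j] is the chunk rank i sends to rank j.
    After the exchange, recv[j][i] holds that chunk: a matrix transpose."""
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]

send = [[(i, j) for j in range(3)] for i in range(3)]  # rank i's chunks
recv = alltoall(send)
print(recv[2][0])  # → (0, 2): the chunk rank 0 sent to rank 2
```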
What is the main issue with cache?
When a CPU writes data to cache, the value in cache may be inconsistent with main memory.
Write-through caches handle this by updating the data in main memory at the time it is written to cache.
Write-back caches mark data in cache as dirty. When the cache line is replaced by a new cache line from memory, the dirty line is written to memory.
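The write-back behaviour can be sketched with a toy single-line cache (hypothetical code for illustration; real caches hold many lines and work in hardware):

```python
# Toy single-line write-back cache (illustration only).
class WriteBackLine:
    def __init__(self, memory):
        self.memory = memory   # backing store: dict of addr -> value
        self.addr = None
        self.value = None
        self.dirty = False

    def load(self, addr):
        if self.addr != addr:             # miss: replace the current line
            if self.dirty:
                self.memory[self.addr] = self.value  # write dirty line back
            self.addr = addr
            self.value = self.memory.get(addr, 0)
            self.dirty = False
        return self.value

    def write(self, addr, value):
        self.load(addr)
        self.value = value
        self.dirty = True                 # memory is stale until eviction

mem = {0: 5, 1: 6}
c = WriteBackLine(mem)
c.write(0, 99)
print(mem[0])   # → 5: main memory is still stale
c.load(1)       # evicting the dirty line writes it back
print(mem[0])   # → 99
```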
Cache mappings?
Fully associative: a new line can be placed at any location in the cache.
Direct mapped: Each cache line has a unique location in the cache to which it will be assigned.
n-way set associative: each cache line can be placed in one of n different locations in the cache. When more than one placement is possible, we also need a policy to decide which line to replace or evict.
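The three policies differ only in how many slots a memory block may occupy. A sketch for a toy 8-line cache (the cache size, associativity, and block number are assumed, not from the cards):

```python
# Which cache slots may hold a given memory block, per mapping policy.
N_LINES = 8  # toy cache size (assumed)

def direct_mapped_slot(block):
    return block % N_LINES                      # exactly one legal slot

def set_associative_slots(block, n_ways=2):
    n_sets = N_LINES // n_ways
    s = block % n_sets                          # the set is fixed...
    return [s * n_ways + w for w in range(n_ways)]  # ...any way within it

def fully_associative_slots(block):
    return list(range(N_LINES))                 # any slot at all

print(direct_mapped_slot(13))        # → 5  (13 mod 8)
print(set_associative_slots(13))     # → [2, 3]  (set 1 of 4, 2 ways)
```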
Virtual memory?
If we run a very large program or a program that accesses very large data sets, all of the instructions and data may not fit into main memory.
Virtual memory functions as a cache for secondary memory.
Exploits temporal and spatial locality by only keeping active parts of running programs in main memory.
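The mechanism behind this is address translation: a virtual address splits into a page number and an offset, and a page table maps pages to physical frames. A sketch with assumed parameters (4 KiB pages, toy page table):

```python
# Sketch of virtual-to-physical address translation (assumed parameters).
PAGE_SIZE = 4096                 # 4 KiB pages

page_table = {0: 7, 1: 3}        # virtual page -> physical frame (toy data)

def translate(vaddr):
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table[vpage]    # a missing entry would be a page fault:
    return frame * PAGE_SIZE + offset  # the page is fetched from disk

print(translate(4100))  # vpage 1, offset 4 -> frame 3 -> 12292
```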
Instruction-level parallelism?
Pipelining: instructions are divided into a sequence of stages, and multiple instructions are overlapped in execution.
Multiple issue: multiple instructions can be simultaneously initiated.
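The payoff of pipelining is easy to quantify: with S stages, N instructions take roughly S + (N - 1) cycles instead of S × N. A back-of-the-envelope calculation (the stage and instruction counts are assumed, not from the cards):

```python
# Idealised pipeline timing: ignores hazards, stalls, and branch penalties.
def cycles(n_instr, n_stages, pipelined):
    if pipelined:
        return n_stages + (n_instr - 1)  # one instruction completes per cycle
    return n_stages * n_instr            # each instruction runs start to finish

n, s = 1000, 5
print(cycles(n, s, False))  # → 5000 cycles unpipelined
print(cycles(n, s, True))   # → 1004 cycles pipelined (~5x speedup)
```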
Fine vs coarse grained multithreading?
Fine-grained:
Processor switches between threads after each instruction, skipping threads that are stalled.
Pros: avoids wasted time from stalls.
Cons: a thread ready to execute a long sequence may have to wait to execute every instruction.
Coarse-grained:
Switches threads only when the current thread stalls waiting for a time-consuming operation to complete.
Pros: switches don’t need to be instantaneous.
Cons: the processor can sit idle on shorter stalls, and thread switching itself causes delays.
Drawbacks of SIMD?
All ALUs are required to execute the same instruction or remain idle.
They must also operate synchronously.
The ALUs have no instruction storage.
Efficient for large data parallel problems but not other types of complex problems.
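The "same instruction or idle" constraint is what makes branches expensive on SIMD hardware. A plain-Python sketch of predicated (masked) execution on an assumed 4-lane unit:

```python
# Sketch of SIMD predication: every lane sees the same instruction;
# lanes where the condition is false idle (4-lane unit assumed).
x = [1, -2, 3, -4]
mask = [v > 0 for v in x]           # one shared condition, evaluated per lane
y = [v * 2 if m else v              # inactive lanes keep their old value,
     for v, m in zip(x, mask)]      # doing no useful work that step
print(y)  # → [2, -2, 6, -4]
```

When the mask is mostly false, most ALUs are wasted, which is why divergent, branchy code maps poorly onto SIMD.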
Vector processors?
Type of computer processor specifically designed to efficiently perform operations on vectors or arrays of data elements.
Vector registers: capable of storing a vector of operands and operating simultaneously on their contents.
Pros:
Fast and easy to use; compilers are good at identifying code that can exploit them, and can also report which code cannot be vectorised. High memory bandwidth: every item in a fetched cache line is used.
Cons:
They don’t handle irregular data structures as well.
Limited scalability.
UMA vs NUMA shared memory systems?
Uniform memory access:
All processors have equal access to a shared memory space.
Provides a uniform view of memory to all processors.
Non-uniform memory access:
Processors are grouped into nodes and each node has a local memory. Accessing remote nodes involves additional latency.