Chapter 3 - MPI Flashcards
What are the two types of systems in the world of MIMD?
Distributed memory: The memory associated with a core is only accessible to that core.
Shared memory: Collection of cores connected to a global shared memory.
What is message passing?
Way of communication for distributed memory systems.
One process running on one core communicates through a send() call; another process on another core calls receive() to get the message.
What is MPI?
Message Passing Interface
Library implementation of message passing communication
What is collective communication?
Functions that allow for communication between multiple (more than 2) processes.
What is a rank?
A non-negative identifier of a process running in MPI.
n number of processes -> 0, 1, …, (n-1) ranks
What library must be included to use MPI?
#include <mpi.h>
How is MPI initialized?
MPI_Init(int* argc_p, char*** argv_p);
argc_p and argv_p: pointers to arguments to main, argc and argv
If the program does not use the command-line arguments, NULL can be passed for both
In main() using arguments:
MPI_Init(&argc, &argv);
How do you get the communicator size in MPI?
MPI_Comm_size(MPI_Comm comm_name, int* size)
What is the name of the global communicator?
MPI_COMM_WORLD
Set up by MPI_Init();
How do you get a process’s rank within a communicator?
MPI_Comm_rank(MPI_Comm comm_name, int* rank)
How is an MPI program finalized?
MPI_Finalize(void);
Any resource allocated to MPI is freed.
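Example (a minimal sketch, not part of the original cards): a complete MPI program that initializes, queries the communicator size and rank, and finalizes.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);                        /* start MPI */

    int comm_sz, my_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);       /* number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);       /* this process's rank */

    printf("Hello from rank %d of %d\n", my_rank, comm_sz);

    MPI_Finalize();                                /* free MPI's resources */
    return 0;
}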
What does
mpiexec -n <n> ./program
do when running an MPI program?
mpiexec tells the system to launch <n> instances of the program
What is a MPI communicator?
A collection of processes that can send messages to each other.
How can MPI produce SPMD programs?
Processes branch out to do different tasks based on their ranks.
Using if-else on the rank:
e.g. rank 0 prints, rank 1 sends, rank 2 receives
What is the syntax of MPI_Send()?
MPI_Send(
void* buffer,
int message_size,
MPI_Datatype type,
int dest,
int tag,
MPI_Comm communicator
)
buffer: holds the content of the message to be sent
size: Number of elements to send from the buffer
type: MPI_CHAR, MPI_DOUBLE, etc.
dest: Destination rank, who is receiving the message
tag: non-negative int, can be used to distinguish messages that are otherwise identical
What is the syntax of MPI_Recv?
int MPI_Recv(
void* msg_buf,
int size,
MPI_Datatype type,
int source,
int tag,
MPI_Comm communicator,
MPI_Status* status
)
msg_buf: Buffer to receive message in
size: Number of elements to receive
type: Types of elements in message
source: Source rank, rank that sent message
tag: Tag should match the tag from the send
communicator: Must match the communicator at the send
status: When not using status MPI_STATUS_IGNORE is passed
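Example (sketch, assuming MPI is initialized and my_rank was obtained with MPI_Comm_rank): rank 1 sends one int to rank 0.
int x;
if (my_rank == 1) {
    x = 42;
    MPI_Send(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);                     /* dest = 0, tag = 0 */
} else if (my_rank == 0) {
    MPI_Recv(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* src = 1, tag = 0 */
}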
What conditions must be met for an MPI message to be successfully sent by process a and received by process b?
dest = b
src = a
comm_a = comm_b
tag_a = tag_b
buffer_a/buffer_b, size_a/size_b and type_a/type_b must be compatible
Most of the time, if type_a=type_b and size_b >= size_a, the message will be successfully received
What is a wildcard argument in MPI communication?
If a receiver will receive multiple messages from multiple source ranks and does not know the order they will arrive in, it can loop over the MPI_Recv() calls and pass the wildcard argument MPI_ANY_SOURCE to accept the messages in any order.
Similarly, if a process will receive multiple messages from another process but with different tags, it can do the same with the wildcard argument MPI_ANY_TAG.
Only receivers can use wildcard arguments.
There is no communicator wildcard argument.
What is the MPI_Status used in MPI_Recv?
A struct with at least the three members:
MPI_SOURCE
MPI_TAG
MPI_ERROR
Before the recv call, create a status struct:
MPI_Status status;
MPI_Recv(…, &status);
These are useful if a process uses wildcards and needs to figure out the source or tag of a received message; the status members can then be examined.
What is MPI_Get_count() used for?
Used to figure out how many elements of the provided type were received in the message
MPI_Get_count(
MPI_Status* status, (in)
MPI_Datatype type, (in)
int* count (out)
)
status: Status struct passed to recv()
type: Type passed in recv
count: Number of elements received in the message
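Example (sketch): receive one message of at most 100 ints from any source with any tag, then inspect the status and the element count.
int buf[100], count;
MPI_Status status;
MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_INT, &count);   /* how many ints actually arrived */
/* sender rank: status.MPI_SOURCE, tag: status.MPI_TAG, elements: count */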
What happens if MPI buffers an MPI_Send message?
MPI puts the complete message into its internal storage.
The MPI_Send() call will return.
The message might not be in transmission yet, but as it is now stored internally, we can use the send_buffer for other purposes if we want to.
What happens if the MPI_Send message blocks?
The process will wait until it can begin transmitting the message.
The MPI_Send() call might not return immediately.
When will MPI_Send block?
It depends on the MPI implementation.
Many implementations have a "cutoff" message size: if the message is smaller than the cutoff it is buffered; if it exceeds the cutoff, MPI_Send will block.
Does MPI_Recv block?
Yes, unlike MPI_Send, when MPI_Recv returns we know the message has been fully received.
What does it mean that MPI messages are non-overtaking?
If one process a sends 2 messages to process b, the first message must be available to b before the second one is.
Messages from different processes are not ordered with respect to each other; it does not matter which was sent first.
What is a pitfall with MPI_Recv / MPI_Send in the context of blocking?
If the MPI_Recv does not have a matching MPI_Send it will block forever and the program will hang.
The same can happen for a blocking send if it has no matching receiver.
If an MPI_Send is buffered and there is no matching receive, the message will be lost
What is non-determinism in parallel programs?
When the output of a program varies depending on the order in which the processes perform their computations.
How can MPI programs implement I/O to avoid non-determinism?
Make processes branch on process rank.
E.g. rank 0 can read input and send it to the remaining ranks.
All ranks can send their output to rank 0 who can print it in rank order.
In MPI, what are collective communications?
Communication functions that include all processes in a communicator.
What is point-to-point communication?
One sender and one receiver
(MPI_Send | MPI_Recv)
What is MPI_Reduce?
Implementation of collective communication.
Generalized function that allows different operations on data that is held by all processes in a communicator
Syntax:
MPI_Reduce(
void* input_data_buf,
void* output_data_buf,
int count,
MPI_Datatype type,
MPI_Op operator,
int dest_process,
MPI_Comm comm
)
input_data_buf: local data for the process, this is used in the operation
output_data_buf: buffer to hold the output computation done by the operator
count: Number of elements to do operation on. This allows for e.g. operations on arrays
type: Type of the elements in the buffers
operator: Specifies what operation is to be done on the data
dest_process: Rank that receives the computed output
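Example (sketch; local_val is a hypothetical per-process value, my_rank assumed already set): sum the local values onto rank 0.
double local_val = 1.0 * my_rank;   /* some per-process value */
double global_sum = 0.0;
MPI_Reduce(&local_val, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
/* only rank 0 holds the valid global_sum afterwards */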
What are some operators available for MPI_Reduce?
MPI_SUM: Optimized global sum of the local values
MPI_MAX: Finds the largest value across the processes
What is important to remember when using collective communication?
All processes must call MPI_Reduce
Arguments passed to a collective must be compatible (e.g. dest rank)
out buffer is only used by dest rank. The other ranks still need to pass the out argument, but this can be NULL for the other processes
Where point-to-point communication matches on tags and communicators, collectives match on the communicator and the order in which they are called.
It is illegal to use the same buffer for input and output
What does it mean to alias arguments?
Two arguments are aliased if they refer to the same block of memory.
This is illegal in MPI if one of them is an output or input/output argument.
What does MPI_Allreduce do?
Optimized collective function that stores the output of reduce in all processes
MPI_Allreduce(
void* input_data_buf,
void* output_data_buf,
int count,
MPI_Datatype type,
MPI_Op operator,
MPI_Comm comm
)
Identical argument list to MPI_Reduce(), but without the dest rank
What is MPI_Bcast (broadcast)?
Collective function that allows a process to send a message to all other processes in a communicator
MPI_Bcast(
void* data_buf,
int count,
MPI_Datatype type,
int source_process,
MPI_Comm comm
)
source_process: The process with rank source_process sends its content of data_buf.
data_buf: Buffer to either send from, or if processes aren’t the source, receive the data in. Acts as both input and output
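Example (sketch): rank 0 obtains a value and broadcasts it; afterwards every rank in the communicator has it.
int n = 0;
if (my_rank == 0)
    n = 100;                                      /* e.g. read on rank 0 */
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);     /* now n == 100 on every rank */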
What does MPI_Scatter do?
If a communicator is doing operations on a large vector and each process only works on certain parts of it, it is expensive for one process to communicate the whole vector to all processes: each process would then have to allocate memory for the whole vector even though it only computes on part of it.
MPI_Scatter reads the complete data on one rank and sends only the needed components to each of the other processes.
MPI_Scatter(
void* send_buf,
int send_count,
MPI_Datatype sent_type,
void* recv_buf,
int recv_count,
MPI_Datatype recv_type,
int src_process,
MPI_Comm comm
)
The function divides the data referenced in send_buf into comm_size pieces. The first piece goes to rank 0, then rank 1, and so on.
send_count needs to be the amount of data going to each process, so not the complete amount of data in send_buf.
Note that the total amount of data must be divisible by the number of ranks in the communicator
What does MPI_Gather do?
Function to collect the data components from all processes into one process to obtain the complete data, e.g. the complete vector.
MPI_Gather(
void* send_buf,
int send_count,
MPI_Datatype sent_type,
void* recv_buf,
int recv_count,
MPI_Datatype recv_type,
int dest_process,
MPI_Comm comm
)
Same as scatter, but with a destination rank to receive all the data.
Data from rank 0 is stored in the first block of recv_buf, the send_buf of rank 1 is stored in the second block of recv_buf, and so on
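Example (sketch, assuming comm_sz == 4 and my_rank already set): scatter a 16-element vector from rank 0, work on the local block, gather the result back.
double x[16];          /* full vector, only meaningful on rank 0 */
double local_x[4];     /* each rank's block */
if (my_rank == 0)
    for (int i = 0; i < 16; i++) x[i] = i;        /* fill the full vector on rank 0 */

MPI_Scatter(x, 4, MPI_DOUBLE, local_x, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);
for (int i = 0; i < 4; i++)
    local_x[i] *= 2.0;                            /* work on the local block only */
MPI_Gather(local_x, 4, MPI_DOUBLE, x, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);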
What is MPI_Allgather?
MPI_Allgather(
void* send_buf,
int send_count,
MPI_Datatype sent_type,
void* recv_buf,
int recv_count,
MPI_Datatype recv_type,
MPI_Comm comm
)
The function concatenates each process's send_buf and stores the result in every process's recv_buf
What are derived datatypes in MPI?
Used to represent any collection of data items in memory.
Stores the type of the items, and their relative locations in memory.
Derived datatypes consist of a sequence of basic MPI_Datatypes and a displacement for each type (from the beginning of the type).
What is the syntax of MPI_Type_create_struct?
MPI_Type_create_struct(
int count,
int array_of_blocklengths[],
MPI_Aint array_of_displacements[],
MPI_Datatype array_of_types[],
MPI_Datatype* new_type
)
count: Number of elements in the type
blocklengths: Allows for the possibility that individual elements are arrays; if one element is an array of 5 elements, its blocklength is 5
displacement: Each element's displacement from the start of the message
What does the function MPI_Get_address do?
MPI_Get_address(
void* location,
MPI_Aint* address
)
Returns the address of the memory referenced by location
MPI_Aint is used because this is the datatype that is big enough to store an address.
How can we get displacements of datatype elements using MPI_Get_address?
int a, b, c;
MPI_Aint addr_a, addr_b, addr_c;
MPI_Get_address(&a, &addr_a);
MPI_Get_address(&b, &addr_b);
MPI_Get_address(&c, &addr_c);
array_of_displacements[0] = 0;
array_of_displacements[1] = addr_b - addr_a;
array_of_displacements[2] = addr_c - addr_a;
How are datatypes created in MPI?
MPI_Datatype new_type;
MPI_Type_create_struct(
3,
block_lengths,
displacements,
types,
&new_type
)
Then the type must be committed:
MPI_Type_commit(&new_type)
Finished using the type:
MPI_Type_free(&new_type)
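Worked example (sketch): build a derived type for two doubles and an int, then broadcast all three in one message.
double a, b;
int n;
MPI_Aint addr_a, addr_b, addr_n;
MPI_Get_address(&a, &addr_a);
MPI_Get_address(&b, &addr_b);
MPI_Get_address(&n, &addr_n);

int block_lengths[3] = {1, 1, 1};
MPI_Aint displacements[3] = {0, addr_b - addr_a, addr_n - addr_a};
MPI_Datatype types[3] = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};

MPI_Datatype input_t;
MPI_Type_create_struct(3, block_lengths, displacements, types, &input_t);
MPI_Type_commit(&input_t);
MPI_Bcast(&a, 1, input_t, 0, MPI_COMM_WORLD);   /* a, b and n travel together */
MPI_Type_free(&input_t);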
What does MPI_Type_commit do?
Commits a MPI datatype that was created using MPI_Type_create_struct
MPI_Type_commit(
MPI_Datatype* new_type
)
What does MPI_Type_free do?
When we are finished using a MPI type that we have created, we can free the storage it has used using:
MPI_Type_free(
MPI_Datatype *used_custom_type
)
What does MPI_Barrier do?
A collective communication function that is used to synchronize processes.
No process will return from calling it until every process in the communicator has started calling it.
Does not guarantee communication has finished, messages can still be in transit.
Written data that is waiting in a buffer will not be flushed by the barrier; it stays where it is.
Pending communication requests will not be cleaned up by a barrier; you must wait on them to complete and finalize them.
MPI_Barrier(MPI_Comm comm)
What is speedup?
Ratio of serial runtime to parallel runtime
S = T_serial / T_parallel
p: number of processes used in the parallel run
What is linear speedup?
When a parallel program running p processes runs p times faster than the serial program.
What is efficiency?
Per process speedup
S = T_serial / T_parallel
E = S / p = T_serial / (p * T_parallel)
How does linear speedup correspond to parallel efficiency?
Linear speedup gives efficiency of p/p = 1
What is MPI_PROC_NULL
An MPI constant used in point-to-point communication as the src/dest rank. When the constant is used, no communication takes place.
What is an unsafe program?
A program that relies on MPI buffering to avoid deadlocks when sends and receives are waiting for each other.
Unsafe programs may hang, crash or deadlock
What is MPI_Ssend?
A synchronous MPI_Send call that is guaranteed to block until the matching receive starts
Same arguments as MPI_Send
How can you check if a program is safe or unsafe?
If the MPI_Send calls are replaced with MPI_Ssend and the program still runs to completion (does not hang), the program is safe.
What can cause a deadlock in MPI programs, and how can this be resolved?
Processes first sending a message and then waiting to receive. This can make them wait for each other in a circle.
A way to solve this is to vary the order in which ranks send and receive.
If half of the ranks send first and then receive, and the other half receive first and then send, there will be no deadlock (see the sketch below).
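Example (sketch, assuming my_rank and comm_sz are set): a ring exchange where every rank sends to the next and receives from the previous one.
int next = (my_rank + 1) % comm_sz;
int prev = (my_rank - 1 + comm_sz) % comm_sz;
int send_val = my_rank, recv_val;

if (my_rank % 2 == 0) {   /* even ranks: send first, then receive */
    MPI_Send(&send_val, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    MPI_Recv(&recv_val, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {                  /* odd ranks: receive first, then send */
    MPI_Recv(&recv_val, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(&send_val, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
}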
What is MPI_Sendrecv?
MPI's combined send-and-receive call for writing safe exchanges.
MPI_Sendrecv(
void* send_buf,
int send_size,
MPI_Datatype send_type,
int dest,
int send_tag,
void* recv_buf,
int recv_size,
MPI_Datatype recv_type,
int src,
int recv_tag,
MPI_Comm comm,
MPI_Status* status
)
Guarantees no deadlock
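Example (sketch): the same ring exchange as in the deadlock card, written with a single call (next, prev, send_val, recv_val as before).
MPI_Sendrecv(&send_val, 1, MPI_INT, next, 0,
             &recv_val, 1, MPI_INT, prev, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);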
What are local and global variables in MPI programs?
Local: Values specific to a process
Global: Values available to all processes
What is parallel overhead, and what causes it in MPI programs?
Overhead due to additional work that is not done in serial programs.
In MPI, this would be the work done in communicating between processes
When is a parallel program scalable?
If you can increase the problem size n so that efficiency doesn't decrease as p is increased.
In Flynn's taxonomy, what type of programs are MPI programs?
SPMD
Can branch by ranks to do different things
What are the 4 types of communication modes?
Standard: Default for MPI_Send
Synchronous: Send function will block until the reception is acknowledged
Buffered: Explicitly manage the memory that’s used for send/recv
Ready: Assume that the receiver has already initiated the receive when the send() call is made
What category of parallel program is MPI?
SPMD
P copies of the same program can do different things because of their identity number
What is a cartesian communicator?
Each rank has a set of coordinates.
Grid structure with 2 neighbours per dimension (up/down, left/right), possibly in 3D
What is non-blocking sending and receive?
The send() and recv() calls return immediately with a request, so execution can continue.
To make sure the communication completed successfully, you must issue a wait-for-completion call on the request
When an MPI program is launched with multiple processes, what is every process allocated?
A full memory space
Stack, heap, data (includes rank), text
How are MPI programs run?
mpirun -np 4 ./my_program
What is a pre-requisite when using collective operations?
All ranks in the communicator MUST participate in the collective operation
What is a memory fence?
An operation that forces all committed work to be completed before continuing
What does MPI_Alltoall do?
Total exchange - from everyone to everyone
Can collective functionality be implemented using point-to-point communication?
Yes, all collective functions can be implemented using normal send/recv
When is internal buffering not faster (when buffering sends)?
When the message exceeds the size at which copying it takes longer than sending the message right away.
Above this message size, MPI_Send will switch to blocking mode
What does MPI_Ssend() do?
Synchronized mode of Send
Does not return until receiver starts receiving
Synchronizes progress between communicating processes
What is MPI_Bsend?
Buffered Send mode
Lets you allocate the buffer yourself, so that it can be one long, contiguous block of memory
Useful when you’re sending a lot of tiny messages at a time. This usually causes tiny buffer allocations and deallocations, which takes time and fragments heap-memory
Buffer must be registered before the send call
MPI_Buffer_attach(buffer, buffer_size)
MPI_Buffer_detach(&buffer, &buffer_size)
What is MPI_Rsend?
Has the liberty to bypass the protocols that establish whether the recipient is ready.
Can be used when the programmer is 100% sure that the receiver has already made the receive call
What is MPI_Isend?
MPI_Isend(
void* buffer,
int count,
MPI_Datatype type,
int dest,
int tag,
MPI_Comm communicator,
MPI_Request *request
)
Returns immediately. The message send is put in the background and carried out later at MPI's own convenience
Program can do something else in the meantime
MPI_Wait(MPI_Request *req, MPI_Status *stat)
is called when you need to make sure the transfer was successful.
MPI_Waitall(n_reqs, array_of_reqs, MPI_STATUSES_IGNORE)
Can be used if multiple messages were sent and you want to wait for all of them at once
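Example (sketch): a non-blocking ring exchange; post the receive and the send, do other work, then wait on both requests (next, prev, send_val, recv_val as in the earlier ring sketch).
MPI_Request reqs[2];
MPI_Irecv(&recv_val, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(&send_val, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &reqs[1]);

/* ... useful computation that does not touch send_val or recv_val ... */

MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* both transfers are complete here */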
Why is MPI_Isend useful for performance?
You can overlap computation and communication.
Communication is expensive, but this allows you to do useful work in the meantime.
What are the modes of non-blocking send?
MPI_Isend
MPI_Issend
MPI_Ibsend
MPI_Irsend
MPI_Irecv (the non-blocking receive)
What is persistent communication, and how can it be implemented?
If the same communication pattern is going to be used over and over, MPI can prepare these in advance and you can activate them later.
The request/wait mechanism is the same as for MPI_Isend.
int MPI_Send_init(<usual MPI_Send arguments>, MPI_Request *req)
int MPI_Recv_init(<usual MPI_Recv arguments>, MPI_Request *req)
Triggered:
MPI_Start(MPI_Request *req)
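Example (sketch): persistent requests for a ring exchange repeated every iteration (next, prev, send_val, recv_val as before; MPI_Startall starts several persistent requests at once).
MPI_Request reqs[2];
MPI_Recv_init(&recv_val, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Send_init(&send_val, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &reqs[1]);

for (int iter = 0; iter < 100; iter++) {
    MPI_Startall(2, reqs);                        /* activate both requests */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);    /* wait for this round to finish */
}
MPI_Request_free(&reqs[0]);
MPI_Request_free(&reqs[1]);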
What does double MPI_Wtime(void) do?
Returns a number of seconds represented as a double-precision floating-point value
How can MPI_Wtime() be used to measure execution time?
MPI_Barrier(MPI_COMM_WORLD);
double start = MPI_Wtime();
/* ... work ... */
double end = MPI_Wtime();
Elapsed time = end - start (on this rank)
What is bandwidth?
Bytes / seconds
What is inverse bandwidth?
seconds / byte
How much transfer time is added for sending additional bytes
What are vector types?
Types with regular layout
Vector types consist of:
- count
- a block length
- a common stride between the blocks
Stride: Distance between neighbours
How can vector types be created?
MPI_Type_vector(n_elements, blocklength, stride, type, &new_type)
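Example (sketch; dest is a hypothetical receiver rank): a vector type describing one column of a 10x10 row-major array of doubles, i.e. 10 blocks of length 1 with stride 10.
double a[10][10];
int dest = 1;                                     /* hypothetical receiver */
MPI_Datatype col_t;
MPI_Type_vector(10, 1, 10, MPI_DOUBLE, &col_t);   /* count, blocklength, stride */
MPI_Type_commit(&col_t);
MPI_Send(&a[0][2], 1, col_t, dest, 0, MPI_COMM_WORLD);   /* sends column 2 */
MPI_Type_free(&col_t);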
How can types of internal regions of arrays be constructed?
MPI_Type_create_subarray(
int ndims,
const int array_of_sizes[],
const int array_of_subsizes[],
const int array_of_starts[],
int order,
MPI_Datatype old,
MPI_Datatype *new
)
ndims: dimensions in array
array_of_sizes: how big is entire array
array_of_subsizes: How big is our slice of the array
array_of_starts: where is the origin of the slice
Example: create a 4x4 subarray type from the middle of a 6x6 array (element type MPI_DOUBLE chosen as an example):
int sizes[2]    = {6, 6};
int subsizes[2] = {4, 4};
int starts[2]   = {1, 1};
MPI_Datatype subarray_t;
MPI_Type_create_subarray(2, sizes, subsizes, starts,
                         MPI_ORDER_C, MPI_DOUBLE, &subarray_t);
How can you create a type from a contiguous part of memory?
MPI_Type_contiguous(
count, oldtype, newtype
)
What does MPI_Type_indexed do?
Like MPI_Type_create_struct, except that all the blocks have the same type
What is MPI_Group?
An arbitrary set of ranks
How can we create a group from all ranks in a communicator?
MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
What does MPI_Group_incl do?
Create a subgroup from a group.
Include n_members of the ranks in the rank_list
MPI_Group_incl(
MPI_Group old,
int n_members,
const int rank_list[],
MPI_Group *new_group
)
What does MPI_Group_excl do?
MPI_Group_excl(
MPI_Group old,
int n_to_remove,
const int rank_list[],
MPI_Group *new_group
)
removes ranks from a group
What set operations can be done on groups?
MPI_Group_union
MPI_Group_intersection
Why are groups useful?
They can be made into communicators
MPI_Comm_create(
MPI_Comm old_comm,
MPI_Group g,
MPI_Comm *new_comm
)
When a rank within the group calls this, the communicator handle is returned
A rank outside the group gets MPI_COMM_NULL back
What does MPI_Graph_create do?
Create a graph communicator out of another comm
MPI_Graph_create(
old_comm,
int n_nodes,
int indexes[],
int edges[],
int reorder,
&new_comm
)
reorder: Can MPI give new ranks in the new comm
indexes: Used to map ranks to nodes; entry i is the cumulative count of neighbours up to and including rank i, i.e. where rank i's neighbour list ends in the edges array
edges: Flattened list of the neighbour ranks of each rank, in rank order
How are cartesian communicators created?
MPI_Cart_create(
old_comm,
n_dims,
dims[], (number of ranks in each dimension)
period[], (wrap the edges?)
reorder, (may MPI assign new ranks in the new comm)
&new_comm
)
How is the dim-array for cartesian communicators created?
In cartesian comms we want the dimensions to be as close to a square (2D), cube (3D), and so on, as possible
MPI_Dims_create(
n_nodes, (rank count)
n_dims,
int dims[] (result array)
)
How does a rank find its position within a cartesian grid?
MPI_Cart_coords(
cart_comm,
rank, (current rank)
dims, (dims in comm)
coords[] (result array)
)
How are coords structured within a cartesian grid?
{y, x}
y: row index, starting at the top and increasing downward
x: column index, starting at the left and increasing rightward
How does a rank in a cartesian grid find its neighbours?
MPI_Cart_shift(
comm,
dir, (axis to shift)
displacement, (how far to shift)
*rank_src,
*rank_dest
)
rank_src: the rank that ends up in my place after the shift, i.e. the rank I would receive from
rank_dest: the rank whose place I move into, i.e. the rank I would send to
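Example (sketch, assuming my_rank and comm_sz are set): build a 2D periodic grid, find this rank's coordinates, and find its left/right neighbours.
int dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2];
int cart_rank, left, right;
MPI_Comm cart_comm;

MPI_Dims_create(comm_sz, 2, dims);                      /* pick a near-square grid */
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart_comm);
MPI_Comm_rank(cart_comm, &cart_rank);                   /* ranks may be reordered */
MPI_Cart_coords(cart_comm, cart_rank, 2, coords);       /* my {y, x} position */
MPI_Cart_shift(cart_comm, 1, 1, &left, &right);         /* neighbours along x */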
What is MPI_PROC_NULL?
On non-periodic comms, if a rank's neighbour falls off the grid, it is returned as MPI_PROC_NULL
When this is passed to a communication call where a rank is expected, the matching operation is simply not carried out
How does MPI_IO work?
All ranks can open file at the same time
Each rank sets a view of the file, this is the window where they can write
All ranks can write within their own views at the same time
How are files opened and closed using MPI_IO?
MPI_File_open(
comm,
*filename, (string)
int access_mode, (MPI flags)
MPI_Info info, (can be NULL)
MPI_File *fh (the open file handle)
)
MPI_File_close(MPI_File *fh)
What are the MPI_IO access flags?
MPI_MODE_CREATE: create if not exist
MPI_MODE_WRONLY
MPI_MODE_RDONLY
MPI_MODE_RDWR
MPI_MODE_APPEND: signals that data will be added at the end
What does MPI_File_write_at do?
Allows you to specify the file position for each chunk of data to write
MPI_File_write_at(
MPI_File fh,
MPI_Offset offset, (where in the file to write; different on each rank)
*buf, (data to write)
count,
type,
*status
)
What does MPI_File_set_view do?
Restricts the region of the file a rank will read/write, shaped like an MPI_Datatype
MPI_File_set_view(
MPI_File fh,
MPI_Offset displacement,
MPI_Datatype etype, (elementary type to read/write)
MPI_Datatype file_layout, (what region of the file to access)
char *representation, (e.g. "native")
MPI_Info info
)