High Performance Computing Flashcards

1
Q

Week 1

A
2
Q

What relatively new scientific field is a major user of high performance computing?

A

Life sciences for applications such as genome processing

3
Q

How can we determine the floating point performance of a high performance computer?

A
  • Linpack is a performance benchmark which measures floating point operations per
    second (flops) using a dense linear algebra workload
  • A widely used performance benchmark for HPC systems is a parallel version of Linpack
    called HPL (High-Performance Linpack)
4
Q

Why is parallelization important in HPC?

A

Demanding floating point calculations can be run in parallel to make use of multiple cores

5
Q

What is Dennard scaling?

A

Dennard scaling was a recipe for keeping power per unit area (power density) constant as transistors were scaled to smaller sizes.

As transistors became smaller they also became faster (reduced delay) and more energy efficient (reduced threshold voltage).

With very small features, limits associated with the physics of the device (e.g. leakage current) are reached.

Dennard scaling has broken down and processor clock speeds are no longer increasing.

6
Q

What is the current most common supercomputer architecture?

A

Current systems are all based on integrating many multi-core processors.

  • The dominant architecture is now the “commodity cluster”
  • Commodity clusters integrate off-the-shelf (OTS) components to make an HPC system (cluster)
7
Q

Give the proper definition of a commodity cluster

A

A commodity cluster is a cluster in which both the network and the compute nodes are commercial products available for procurement and independent application by organisations (end users or separate vendors) other than the original equipment manufacturer.

8
Q

Give four components of a cluster architecture

A

* Compute nodes: provide the processor cores and memory required to run the workload
* Interconnect: cluster internal network enabling compute nodes to communicate and access storage
* Mass storage: disk arrays and storage nodes which provide user filesystems
* Login nodes: provide access (e.g. ssh) for users and administrators via external network

9
Q

Why do high performance computers use compiled languages?

A

Maximizes performance.
Compilers parse code and generate executables with optimizations.
Optimizations at compile-time are less costly than at runtime.

10
Q

What are common languages for high performance computers?

A

C, C++ and Fortran

11
Q

Why must parallelisation be done manually?

A

Parallelization is too complex for compilers to handle automatically.
Programmers add parallel features.

12
Q

Week 2

A
13
Q

What can we use OpenMP for?

A

OpenMP provides extensions to C, C++ and Fortran
* These extensions enable the programmer to specify where parallelism should be added and how to add it
* The extensions provided by OpenMP are:
- Compiler directives
- Environment variables
- Runtime library routines

14
Q

What does it mean to say that OpenMP uses a fork join execution model?

A

Execution starts with a single thread (master thread)
- Worker threads start (fork) on entry to a parallel region
- Worker threads exit (join) at the end of the parallel region
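
A minimal sketch of the fork-join model (assumes a C compiler with OpenMP enabled, e.g. gcc -fopenmp; the printed text is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Master thread before the parallel region\n");

    /* Fork: worker threads start here */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();   /* this thread's number */
        int n  = omp_get_num_threads();  /* threads in the team */
        printf("Hello from thread %d of %d\n", id, n);
    }   /* Join: worker threads exit here */

    printf("Master thread after the parallel region\n");
    return 0;
}

Setting OMP_NUM_THREADS before running (see the next card) controls how many "Hello" lines appear.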

15
Q

What can we use the OMP_NUM_THREADS environment variable for?

A

We can use the OMP_NUM_THREADS environment variable to control the number of threads forked in a parallel region e.g.

  • export OMP_NUM_THREADS=4
  • OMP_NUM_THREADS is one of the environment variables defined in the standard
  • If you don’t specify the number of threads the default value is implementation defined
    (i.e. the standard doesn’t say what it has to be)
16
Q

What does openMP provide that we can call directly from our functions?

A

OpenMP provides compiler directives, environment variables and a runtime library with functions we can call directly from our programs

17
Q

What header file must be included to use the OpenMP runtime library?

A

<omp.h>

18
Q

Why is conditional compilation useful in programs that use OpenMP?

A

It ensures that the program can compile and run as a serial version when OpenMP is not enabled, avoiding errors caused by missing OpenMP compiler flags.

19
Q

What is the role of the C pre-processor in the compilation process?

A

The C pre-processor processes source code before it is passed to the compiler, handling directives such as #include and #ifdef.

20
Q

What does the _OPENMP macro indicate when it is defined?

A

It indicates that OpenMP is enabled and supported by the compiler.

21
Q

What is the syntax of the #ifdef directive used for conditional compilation?

A

#ifdef MACRO
// Code included if MACRO is defined
#else
// Code included if MACRO is not defined
#endif
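
A hedged example of combining this with the _OPENMP macro so the same file builds serially or with OpenMP (the -fopenmp flag is the usual GCC spelling; the messages are illustrative):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>   /* only needed when OpenMP is enabled */
#endif

int main(void)
{
#ifdef _OPENMP
    printf("Compiled with OpenMP, max threads = %d\n", omp_get_max_threads());
#else
    printf("Compiled as a serial program\n");
#endif
    return 0;
}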
22
Q

What is the main benefit of using conditional compilation with OpenMP programs?

A

It allows the same source code to support both serial and parallel execution by enabling or disabling OpenMP-related code.

23
Q

What is one good way to distribute the workload when working in parallel?

A

One way to do this in OpenMP is to parallelise loops

  • Different threads carry out different iterations of the loop
  • We can parallelise a for loop from inside a parallel region:
    #pragma omp for
  • We can start a parallel region and parallelise a for loop:
    #pragma omp parallel for
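
A minimal sketch of a parallelised loop (the arrays and the arithmetic are illustrative):

#include <stdio.h>

#define N 1000

int main(void)
{
    static double x[N], y[N];

    /* Start a parallel region and divide the loop iterations among threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        x[i] = (double)i;
        y[i] = 2.0 * x[i];
    }

    printf("y[N-1] = %f\n", y[N - 1]);   /* results land in order in the arrays */
    return 0;
}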
24
Q

What changes about the order of loop iterations when they are executed in parallel?

A

When the loop is parallelised the iterations will not take place in the order specified by the loop iterator
* We can’t rely on the loop iterations taking place in any particular order

25
Q

How do we solve the issue of loop iterations not running in order when executed in parallel?

A

The results are stored in arrays which do hold the results in the order we require
* If we want to print out the results of our calculation in order we will need a second sequential (not parallelised) loop

26
Q

What is the correct way to define variable scope for OpenMP loops?

A

In the examples in the previous unit different threads accessed the same copies of arrays x and y (shared)
* The loop index i was different for each thread (private)
The correct functioning of an OpenMP loop requires correct variable scoping
* In this case the default scoping rules did the right thing

27
Q

Why might we want to define the variable scope before running parallelised code?

A

Explicitly declaring variable scope makes the code much easier to understand and more likely to be correct

  • We can specify the default scope by adding a clause to the directive which starts the parallel region:
  • default (shared)
  • default (none)
  • The default clause can be followed by a list of private and/or shared variables
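
A sketch using default(none), which forces every variable used in the region to be listed explicitly as shared or private (the variable names are illustrative):

#include <stdio.h>

#define N 1000

int main(void)
{
    static double x[N], y[N];
    double a = 2.0;
    int i;

    /* default(none): the compiler reports an error for any variable whose
       scope has not been declared explicitly */
    #pragma omp parallel for default(none) shared(x, y, a) private(i)
    for (i = 0; i < N; i++)
        y[i] = a * x[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}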
28
Q

What is a data race?

A

Data races are bugs specific to parallel programming and occur when multiple threads try to update the same variable at the same time

29
Q

How can we solve data race issues in parallel programming?

A

We need each thread to have its own copy of the variable and combine them all at the end of execution

30
Q

How is the reduction clause used for combining multithreaded variables?

A

When OpenMP encounters the reduction clause:

Each thread gets a private copy of the variable, initialized based on the operator (e.g., 0 for + or 1 for *).

Threads compute partial results in parallel.
At the end of the parallel region, OpenMP combines all thread-local results into the global variable using the specified operator.
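
A minimal sum-reduction sketch (the array contents are illustrative):

#include <stdio.h>

#define N 1000

int main(void)
{
    static double x[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        x[i] = 1.0;                     /* illustrative data */

    /* Each thread sums into a private copy of 'sum' (initialised to 0 for +);
       the private copies are combined into the shared 'sum' at the end */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %f\n", sum);          /* expect 1000.0 */
    return 0;
}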

31
Q

What are data dependencies?

A

Data dependencies occur when:

One statement reads from or writes to a memory location.

Another statement reads from or writes to the same memory location.

At least one of these operations is a write.

The result depends on the order of execution, leading to potential issues in parallel computing.

32
Q

What are loop-carried dependencies?

A

Loop-carried dependencies occur when iterations of a loop depend on the results of previous iterations. This can cause issues in parallel execution. There are three main types:

Flow Dependency (Read-after-Write)

Anti-Dependency (Write-after-Read)

Output Dependency (Write-after-Write)

33
Q

What is a flow dependency?

A

Flow dependency, also known as Read-after-Write (RAW) dependency, occurs when an iteration requires the result from a previous iteration. For example, calculating a[i] might depend on a[i-1] being updated. This dependency is difficult to eliminate and can restrict parallelism.

34
Q

Why is a flow dependency challenging to remove in parallel loops?

A

Flow dependencies require iterations to be executed in a specific order because the result of one iteration directly affects the next. For example, a[i] = a[i-1] + 1 depends on a[i-1] being calculated first, making it hard to parallelize.

35
Q

What is an anti-dependency?

A

Anti-dependency, or Write-after-Read (WAR) dependency, happens when an iteration requires a value that another iteration might overwrite. For instance, calculating a[i] needs a[i+1] before it is updated by a later iteration. This can often be resolved by writing results to a different array.

36
Q

How can anti-dependencies be resolved?

A

Anti-dependencies can be resolved by:

Writing to a different array, ensuring iterations are independent.

Copying the results back to the original array if necessary after the loop completes.

37
Q

What is an output dependency?

A

Output dependency, or Write-after-Write (WAW) dependency, occurs when multiple iterations write to the same memory location. For example, if every loop iteration writes to x, the final value of x depends on which iteration executes last, leading to unpredictable behavior in parallel.

38
Q

How can output dependencies be addressed?

A

Output dependencies can be addressed by scoping the variable as lastprivate in OpenMP. This ensures that each thread has a private copy, and the value from the sequentially last iteration is retained after the parallel region ends.

39
Q

What is the role of lastprivate variables in parallel loops?

A

Lastprivate variables in OpenMP:

Provide each thread with a private copy of the variable.

Retain the value from the sequentially last iteration when the parallel region ends.

This helps resolve output dependencies and ensures correct behavior in parallelized loops.
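
A small sketch of lastprivate in use (the loop body is illustrative):

#include <stdio.h>

int main(void)
{
    int n = 100;
    double x = 0.0;

    /* Every iteration writes x (a write-after-write pattern); lastprivate
       keeps the value written by the sequentially last iteration (i == n-1) */
    #pragma omp parallel for lastprivate(x)
    for (int i = 0; i < n; i++)
        x = 0.5 * i;

    printf("x = %f\n", x);              /* 49.5, the same as a serial run */
    return 0;
}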

40
Q

What are the key types of loop-carried dependencies and their solutions?

A

The three types of loop-carried dependencies and their typical solutions are:

Flow Dependency (Read-after-Write): May require rethinking the algorithm, as it’s challenging to parallelize.

Anti-Dependency (Write-after-Read): Resolved by writing to a different array.

Output Dependency (Write-after-Write): Resolved by using lastprivate variables.

41
Q

Week 3

A
42
Q

What is a differential equation?

A

A differential equation (DE) is an equation that contains one or more derivatives. Derivatives represent the rate of change of one variable with respect to another. Examples include velocity, which is the rate of change of position with time (dx/dt), and acceleration, which is the rate of change of velocity with time (d²x/dt²).

43
Q

What is a partial differential equation (PDE)?

A

A partial differential equation (PDE) is a differential equation that contains derivatives with respect to more than one variable. For example, if a function u depends on two variables x and t, we write u = u(x, t). The partial derivatives are written using curly symbols (∂), indicating the rate of change with respect to one variable while keeping others constant.

44
Q

Why are partial differential equations important in high-performance computing (HPC)?

A

Many HPC applications require solving PDEs numerically. Examples include:

Weather and climate modeling

Astrophysical simulations

Engineering problems (e.g., structural analysis, fluid dynamics)
These fields rely on PDEs to describe changes over space and time.

45
Q

What is exponential decay and how is it denoted?

A

When a quantity decreases at a rate which is proportional to the quantity itself it undergoes exponential decay.

  • The exponential decay of a quantity N(t) is described by the equation:

dN/dt = −λN

where λ is a positive constant

46
Q

What differential equation describes the number of nuclei N(t) as a function of time?

A

dN/dt = −λN(t)

47
Q

How to numerically estimate a derivative when you can’t solve the differential equation directly?

A

We can approximate the derivative using small but finite differences

48
Q

What is the trade-off when choosing a time step (Δt) in numerical solutions (estimating derivatives)?

A

Larger time steps require less computation but result in less accurate solutions, while smaller time steps increase accuracy but require more computation.
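
A serial sketch of this trade-off applied to the exponential decay equation dN/dt = −λN, stepping forward with a finite difference; the values of λ, Δt and N(0) are illustrative assumptions:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double lambda = 0.5;    /* decay constant (illustrative) */
    double dt     = 0.01;   /* time step: smaller is more accurate but needs more steps */
    double N      = 1000.0; /* initial quantity N(0) (illustrative) */
    double t_end  = 10.0;

    /* Finite difference: N(t + dt) ≈ N(t) + dt * dN/dt = N(t) - dt * lambda * N(t) */
    for (double t = 0.0; t < t_end; t += dt)
        N = N - dt * lambda * N;

    printf("numerical N(%.1f) = %f, exact = %f\n",
           t_end, N, 1000.0 * exp(-lambda * t_end));
    return 0;
}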

49
Q

What is advection?

A

Advection is the transport of a quantity by a velocity field. Examples include silt in a river, dust in the atmosphere, and salt in the ocean.

50
Q

Week 3 is being sacked off for now - Must do Later

A
51
Q

Week 4 notes

A
52
Q

What is an advantage of using MPI over OpenMP in distributed systems?

A

OpenMP will only work where processors share memory but MPI can work in distributed memory systems such as clusters

53
Q

What are the key steps in setting up and shutting down an MPI environment?

A

The key steps are:

Initialisation - Call MPI_Init before any MPI functions to set up the MPI environment.

Execution - Execute MPI functions within the environment.

Finalisation - Call MPI_Finalize after all MPI calls to cleanly shut down the MPI environment.

54
Q

What does MPI_Init do?

A

MPI_Init initializes the MPI execution environment, passing command-line arguments to all MPI processes.

Syntax:

int MPI_Init(int *argc, char ***argv);

argc: Pointer to the number of arguments.

argv: Pointer to the argument array.

Returns an integer error code.

55
Q

What does MPI_Finalize do?

A

MPI_Finalize closes down the MPI execution environment. It takes no arguments and returns an integer error code. Syntax:
int MPI_Finalize();

56
Q

What is the purpose of MPI_Comm_size and how are processes ranked?

A

MPI_Comm_size reports the number of MPI processes in a specified communicator, and MPI_Comm_rank reports the rank of the calling process within it. Syntax:

int MPI_Comm_size(MPI_Comm comm, int *size);
int MPI_Comm_rank(MPI_Comm comm, int *rank);

comm: Communicator (e.g., MPI_COMM_WORLD).

size / rank: Store the number of processes / the rank of the calling process.

Each MPI process has a rank between 0 and size − 1

57
Q

What is a communicator in MPI?

A

A communicator defines a group of processes that can communicate with each other. The predefined MPI_COMM_WORLD includes all processes in an MPI program. Custom communicators can also be created if needed.

58
Q

How do you compile an MPI program?

A

Use the mpicc compiler wrapper:

mpicc -o hello_mpi hello_mpi.c

This calls the backend compiler and handles include paths, library paths, and flags.

59
Q

How do you run an MPI program?

A

Use mpirun to start multiple instances of the executable:

mpirun -np 4 hello_mpi

-np 4 specifies running the program with 4 processes.

60
Q

What is the function of mpicc?

A

mpicc is a wrapper compiler for MPI programs. It calls the backend compiler, managing include paths, library paths, and additional flags to simplify compilation.

61
Q

What is the function of mpirun?

A

mpirun is used to launch an MPI program with multiple processes. It ensures correct execution across multiple processes and nodes.

62
Q

What is MPI_COMM_WORLD?

A

MPI_COMM_WORLD is the default communicator that includes all processes in an MPI program, enabling communication between them.

63
Q

What are the main MPI housekeeping functions?

A

The main MPI housekeeping functions are:

MPI_Init (Initialize environment)

MPI_Finalize (Shut down environment)

MPI_Comm_size (Get number of processes)

MPI_Comm_rank (Get process rank)
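
A minimal sketch tying the four housekeeping calls together (compile with mpicc and launch with mpirun as in the earlier cards; the printed message is illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int size, rank;

    MPI_Init(&argc, &argv);                /* set up the MPI environment */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank: 0..size-1 */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                        /* shut the environment down */
    return 0;
}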

64
Q

What is point-to-point communication in MPI?

A

Point-to-point communication is a type of MPI communication that occurs between a pair of processes in a communicator. It allows direct sending and receiving of messages between specific processes.

65
Q

What are the basic functions for point-to-point communication in MPI?

A

The two basic functions are:

MPI_Send - Used to send a message to another process.

MPI_Recv - Used to receive a message from another process.

66
Q

What is the syntax of MPI_Send?

A

MPI_Send is used to send a message and has the following syntax:

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);

buf: Pointer to the data to be sent.

count: Number of elements to send.

datatype: Type of data being sent.

dest: Rank of the destination process.

tag: User-defined label for communication.

comm: Communicator (e.g., MPI_COMM_WORLD).

67
Q

What is the syntax of MPI_Recv?

A

MPI_Recv is used to receive a message and has the following syntax:

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);

buf: Pointer to the buffer where received data will be stored.

count: Number of elements to receive.

datatype: Type of data being received.

source: Rank of the sending process.

tag: User-defined label for communication.

comm: Communicator (e.g., MPI_COMM_WORLD).

status: Structure containing details about the received message (source rank and tag).
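
A hedged two-process sketch of a send/receive pair (the tag 99 and the payload are illustrative; run with at least two processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double value = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14;   /* illustrative payload */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %f from rank %d\n", value, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}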

68
Q

What is the role of the tag parameter in MPI_Send and MPI_Recv?

A

The tag parameter acts as a user-defined label for communication, helping to distinguish between different messages. The receiver can filter messages based on their tag value.

69
Q

What is the purpose of MPI_Status in MPI_Recv?

A

MPI_Status provides information about the received message, such as:

The rank of the sending process.

The message tag.

Additional error codes if applicable.

70
Q

What are the key takeaways from MPI point-to-point communication?

A

MPI point-to-point communication occurs between two processes.

MPI_Send and MPI_Recv are the fundamental functions.

Messages are identified using rank, tag, and communicator.

Message lengths are given in MPI data types, not bytes.

MPI_Status provides metadata about received messages.

71
Q

What is collective communication in MPI?

A

Collective communication refers to communication operations that involve a group of processes defined by a communicator. It simplifies MPI programming by providing built-in functions for data exchange between multiple processes, making it more efficient than using point-to-point communication.

72
Q

What are the main types of MPI collective communication functions?

A

The main types of MPI collective functions include:

MPI_Bcast (Broadcast)

MPI_Scatter (Scatter)

MPI_Gather/MPI_Allgather (Gather and Allgather)

MPI_Reduce/MPI_Allreduce (Reduce and Allreduce)

73
Q

What does MPI_Bcast do?

A

MPI_Bcast broadcasts data from one process (root) to all other processes within a communicator. Every process receives the same data.

Function signature:

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);

buffer: data to send

count: number of items to send

datatype: type of data

root: rank of the sending process

comm: communicator

74
Q

When would you use MPI_Bcast?

A

MPI_Bcast is useful when distributing the same data to all processes, such as when initializing parameters for computations that must be synchronized across multiple processes.

75
Q

What is MPI_Scatter?

A

MPI_Scatter distributes different portions of data from one root process to all other processes in a communicator. Unlike MPI_Bcast, each process receives a unique subset of the data.

Function signature:

int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm);

sendbuf: buffer containing data to send

sendcount: number of items sent per process

sendtype: type of data

recvbuf: buffer to receive data

recvcount: number of items received

recvtype: type of data

root: rank of sending process

comm: communicator

76
Q

When would you use MPI_Scatter?

A

MPI_Scatter is useful when dividing a large dataset into smaller parts, such as distributing different work portions to parallel processes.

77
Q

What is MPI_Gather?

A

MPI_Gather collects data from multiple processes and sends it to a single root process.

Function signature:

int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm);

sendbuf: buffer containing data to send

sendcount: number of elements sent per process

sendtype: type of data

recvbuf: buffer to collect data (only relevant at root)

recvcount: number of elements received per process

recvtype: type of data

root: rank of receiving process

comm: communicator

78
Q

When would you use MPI_Gather?

A

MPI_Gather is useful when collecting results from multiple processes for centralized processing, such as collecting partial computations for final aggregation.

79
Q

What is MPI_Allgather?

A

MPI_Allgather is similar to MPI_Gather but sends the collected data to all processes instead of just one root process.

Function signature:

int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm);

Same parameters as MPI_Gather but without a root process

80
Q

When would you use MPI_Allgather?

A

MPI_Allgather is useful when all processes need a complete dataset after individual contributions, such as in global synchronization scenarios.

81
Q

What is MPI_Reduce?

A

MPI_Reduce performs a reduction operation (e.g., sum, max, min) on data across multiple processes and sends the result to a designated root process.

Function signature:

int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype type, MPI_Op op, int root, MPI_Comm comm);

sendbuf: data to reduce

recvbuf: buffer for the result (only relevant at root)

count: number of elements to reduce

type: type of data

op: reduction operator

root: rank of process receiving the result

comm: communicator

82
Q

What are some common reduction operators used with MPI_Reduce?

A

Common reduction operators include:

MPI_MAX: Maximum value

MPI_MIN: Minimum value

MPI_SUM: Sum of values

MPI_PROD: Product of values

MPI_LAND: Logical AND

MPI_LOR: Logical OR

MPI_BAND: Bitwise AND

MPI_BOR: Bitwise OR

83
Q

What is MPI_Allreduce?

A

MPI_Allreduce is similar to MPI_Reduce but distributes the final reduced result to all processes instead of a single root.

Function signature:

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype type, MPI_Op op, MPI_Comm comm);

Same parameters as MPI_Reduce but without a root process
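
A hedged sketch of a global sum with MPI_Allreduce (each rank contributes its own rank number, which is purely illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank;   /* each process's contribution (illustrative) */
    int total = 0;

    /* Sum 'local' across all ranks; every rank receives the result */
    MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Rank %d sees global sum %d (expected %d)\n",
           rank, total, size * (size - 1) / 2);

    MPI_Finalize();
    return 0;
}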

84
Q

When would you use MPI_Allreduce?

A

MPI_Allreduce is useful when all processes require the reduced result, such as computing a global sum or average that must be available to every process.

85
Q

What are the communication patterns of MPI collective operations?

A

MPI collective communication patterns include:

One-to-all: MPI_Bcast, MPI_Scatter

All-to-one: MPI_Gather, MPI_Reduce

All-to-all: MPI_Allgather, MPI_Allreduce

86
Q

What is blocking communication in MPI?

A

Blocking communication in MPI means that the sender or receiver function does not return until the operation is complete. For example, MPI_Send blocks until the message is received or it is safe to modify the send buffer, and MPI_Recv blocks until the message has been received.

87
Q

What is the main issue with blocking communication?

A

Blocking communication can lead to deadlocks if the send and receive operations are not correctly matched. If a send operation blocks waiting for a corresponding receive, but the receive operation is also blocked waiting for a send, the program can become stuck indefinitely.

88
Q

How can blocking communication cause deadlocks?

A

If both processes call MPI_Send before either calls MPI_Recv, they will be waiting indefinitely, resulting in a deadlock. Small messages might be buffered, avoiding deadlocks in some cases, but increasing the message size can reintroduce the issue.

89
Q

How does non-blocking communication help avoid deadlocks?

A

Non-blocking communication allows the program to continue executing other instructions while the message is being sent or received. Since MPI_Isend and MPI_Irecv do not require immediate synchronization, they can prevent deadlocks by decoupling the send and receive operations.

90
Q

What is MPI_Isend and how is it used?

A

MPI_Isend is the non-blocking version of MPI_Send. It initiates a send operation but does not wait for completion, allowing computation to continue. Its syntax is:

int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request);

The request handle must later be checked using MPI_Wait or MPI_Waitall to ensure completion.

91
Q

What is MPI_Irecv and how is it used?

A

MPI_Irecv is the non-blocking version of MPI_Recv. It initiates a receive operation but does not wait for completion. The syntax is:

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request);

Like MPI_Isend, it requires MPI_Wait or MPI_Waitall to check completion.

92
Q

Why do we need MPI_Wait in non-blocking communication?

A

Since MPI_Isend and MPI_Irecv do not block, the program must ensure that the communication has completed before accessing the buffers again. MPI_Wait blocks execution until a specified non-blocking operation is finished.

93
Q

What is the syntax and purpose of MPI_Wait?

A

MPI_Wait waits for a specific non-blocking request to complete. The syntax is:

int MPI_Wait(MPI_Request *request, MPI_Status *status);

The request specifies the operation to wait for, and status returns the operation’s status.

94
Q

What is MPI_Waitall, and when is it used?

A

MPI_Waitall waits for multiple non-blocking requests to complete simultaneously. This is useful when handling multiple communications at once. The syntax is:

int MPI_Waitall(int count, MPI_Request array_of_requests[], MPI_Status array_of_statuses[]);

It takes an array of requests and ensures all are completed before proceeding.
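
A hedged sketch of a deadlock-free exchange between two ranks using the non-blocking calls above (tags and buffer contents are illustrative; run with exactly two processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, other;
    double sendval, recvval;
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other   = 1 - rank;          /* the partner rank (assumes exactly 2 processes) */
    sendval = (double)rank;

    /* Post both operations, then wait: neither rank sits blocked in a send */
    MPI_Isend(&sendval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recvval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);

    printf("Rank %d received %f\n", rank, recvval);

    MPI_Finalize();
    return 0;
}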

95
Q

What are collectives in MPI, and how do they behave?

A

Collective operations involve every process in a communicator, and all of those processes must make the same collective call. The standard collectives are blocking: they do not return until it is safe to reuse the buffers, so they can effectively synchronise the participating processes and make early arrivals wait.

96
Q

What are non-blocking collectives, and why were they introduced?

A

Non-blocking collectives, introduced in MPI version 3, allow processes to initiate collective operations without waiting for all participants. This helps avoid synchronization overhead and improves efficiency.

97
Q

What is domain decomposition in the context of MPI?

A

Domain decomposition is a method used in MPI to distribute computational work among multiple MPI processes. It involves dividing the computational domain into smaller subdomains, each assigned to a different process. This technique is commonly used in parallel computing for solving Partial Differential Equations (PDEs), where each process is responsible for a portion of the data and computation.

98
Q

How does domain decomposition distribute work among MPI processes?

A

Domain decomposition divides the entire computational domain into smaller regions, with each MPI process handling a separate region. This distribution allows parallel execution by ensuring that each process works on its assigned portion of the domain. Communication is required between processes to share boundary information where subdomains interact.

99
Q

What role does the rank zero process play in domain decomposition?

A

The rank zero process in an MPI program typically handles the following:

Reading input parameters from a file and broadcasting them to all processes.

Reading initial conditions from a file and distributing them using MPI_Scatter.

Collecting results from all processes using MPI_Gather and writing the final output to a file.

100
Q

Why must all MPI processes use the same time step in a PDE solver?

A

In a parallel PDE solver, all MPI processes must update their solutions at the same time step to maintain consistency. If the time step is constrained by stability conditions (e.g., the Courant condition), different regions of the domain may have different maximum allowable time steps. To ensure stability, the smallest time step across all regions must be used, which is determined using MPI_Allreduce with the MPI_MIN operator.

101
Q

How is the global minimum time step determined in a parallel PDE solver?

A

Each MPI process computes its local minimum allowable time step based on local conditions. The global minimum time step is then determined using the MPI_Allreduce function with the MPI_MIN operator. This ensures that all processes use the smallest time step from across the entire computational domain.

102
Q

Why is communication needed in domain decomposition when using finite difference methods?

A

Finite difference methods require values from neighboring grid points to compute derivatives. When the computational domain is decomposed among multiple MPI processes, some required values may reside in neighboring subdomains. To ensure correct calculations, processes must exchange boundary values with their neighbors through MPI communication.

103
Q

What is a “halo” in the context of domain decomposition?

A

A “halo” is a region of copied data from neighboring MPI processes, stored within each process’s local memory. Halos ensure that each process has access to the necessary boundary values from adjacent subdomains without constantly requesting them during calculations. This reduces communication overhead and improves efficiency.

104
Q

How are halos exchanged between neighboring MPI processes?

A

Halos are exchanged using non-blocking point-to-point MPI communications. Typical steps include:

Sending boundary values to neighboring processes using MPI_Isend.

Receiving boundary values from neighboring processes using MPI_Irecv.

Using MPI_Waitall to ensure that all communications complete before computation continues.
This ensures data consistency across subdomains without introducing unnecessary synchronization delays.

105
Q

What MPI functions are commonly used for halo exchange?

A

The following MPI functions are commonly used for halo exchange:

MPI_Isend: Sends boundary data to neighboring processes in a non-blocking manner.

MPI_Irecv: Receives boundary data from neighboring processes in a non-blocking manner.

MPI_Waitall: Ensures that all non-blocking communications complete before computation resumes.
These functions enable efficient data exchange between processes without unnecessary blocking.

106
Q

Week 5

107
Q

Why can’t all real numbers be represented as integers?

A

Real numbers include fractions and decimals, which cannot be accurately represented using only integer values. Computing often requires handling non-integer values, which necessitates floating-point representation.

108
Q

Why do we need floating point representation?

A

Floating point representation is needed to handle very large and very small numbers efficiently.

109
Q

What is the general structure of a floating point number?

A

A floating point number follows the format

±d.dd…d × β^e

where:

d.dd…d is the significand (mantissa), containing p digits

e is the exponent

β is the base (commonly 2 for binary systems)

p is the precision (number of significant digits stored)

For example, 9.109 × 10^−31 kg follows this representation.

110
Q

What is IEEE 754, and why is it important?

A

IEEE 754 is the most widely used standard for floating point arithmetic. It specifies:

Number representations (single, double precision)
Arithmetic operations (addition, subtraction, multiplication, division, square root)
Handling of special cases like infinity and NaN (Not a Number)

111
Q

What are single and double precision floating point formats in IEEE 754?

A

Single precision: 32-bit (float), commonly used in graphics, machine learning
Double precision: 64-bit (double), used in scientific and engineering computations
Double precision provides higher accuracy but requires more storage and computational power

112
Q

Why is double precision commonly used in scientific computing?

A

Double precision (64-bit) provides higher accuracy, which is crucial for scientific and engineering applications where small errors can accumulate significantly.

113
Q

How is a double precision floating point number structured in IEEE 754?

A

A double precision number (64 bits) consists of:

1 bit for sign (positive/negative)
11 bits for exponent (determining the range)
52 bits for the mantissa (determining precision)
It follows the formula:

x = ±(1.b1b2…b52)_2 × 2^(e − 1023)

where e = (a1a2…a11)_2 is the value of the 11-bit exponent field, i.e. the exponent has a bias of 1023.

114
Q

What are the smallest and largest normalised numbers in double precision?

A

Smallest normalised number:
2^−1022 ≈ 10^−308

Largest normalised number:
(2 − 2^−52) × 2^1023 ≈ 10^308

This defines the range of numbers that can be accurately represented.

115
Q

What is machine epsilon and why is it important?

A

Machine epsilon is the smallest difference between 1 and the next largest representable floating point number. For double precision, it is:
ε = 2^−52 ≈ 10^−16

It determines the precision limit and affects numerical stability.
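
A small sketch that makes machine epsilon visible (DBL_EPSILON is the standard <float.h> constant for double precision):

#include <stdio.h>
#include <float.h>

int main(void)
{
    double eps = DBL_EPSILON;    /* 2^-52 for an IEEE 754 double */

    printf("DBL_EPSILON = %e\n", eps);
    printf("1.0 + eps     == 1.0 ? %d\n", (1.0 + eps)       == 1.0);  /* 0: still distinguishable */
    printf("1.0 + eps/2.0 == 1.0 ? %d\n", (1.0 + eps / 2.0) == 1.0);  /* 1: rounds back to 1.0 */
    return 0;
}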

116
Q

What causes rounding errors in floating point arithmetic?

A

Floating point numbers can only approximate real numbers due to their finite precision. Rounding occurs when a number cannot be exactly represented, leading to small inaccuracies that can accumulate in calculations.

117
Q

What are the three main sources of numerical errors in computations?

A

Truncation error: From approximating continuous functions (e.g., finite difference methods).
Numerical method error: Errors from iterative or approximate solutions.
Round-off error: Due to the finite representation of floating point numbers.
If the numerical method error is smaller than the floating point round-off error, the solution is said to be computed to “machine precision.”

118
Q

How do floating point errors affect scientific computations?

A

Errors accumulate with each floating point operation. In high-precision calculations, rounding errors can propagate and significantly impact results, requiring careful numerical analysis.

119
Q

What are some strategies to reduce floating point errors?

A

Use higher precision formats (e.g., double instead of float)
Rearrange calculations to minimize subtraction of nearly equal numbers
Use numerically stable algorithms that minimize error accumulation

120
Q

What are floating point exceptions?

A

Floating point exceptions occur when floating point arithmetic encounters an issue, such as division by zero, overflow, or operations on undefined values. Examples include:

1.0 / 0.0 → infinity

0.0 / 0.0 → undefined (NaN)

Square root of a negative number → NaN

Assigning a result too large/small to a floating point format → overflow/underflow

121
Q

What are normalised floating point numbers?

A

A normalised floating point number has an exponent field that is neither all zeros nor all ones. It follows the IEEE 754 representation:

x = ±(1.b1b2…b52)_2 × 2^(e − 1023), where e = (a1a2…a11)_2

This ensures efficient use of bits and maintains precision.

122
Q

What happens when the exponent is all zeros or all ones?

A

All zeros → Subnormal number (reduced precision, gradual underflow)
All ones → Exceptional value (Infinity or NaN)

123
Q

What are subnormal floating point numbers?

A

A subnormal number is one whose exponent field is all zeros. It follows:

x = ±(0.b1b2…b52)_2 × 2^−1022

Subnormal numbers allow for gradual underflow instead of sudden zeroing.
However, they have leading zeros in the mantissa, meaning precision is lost.

124
Q

What are infinity (±∞) and NaN (Not a Number)?

A

Infinity (±∞) occurs when all exponent bits are ones and all mantissa bits are zeros.
NaN (Not a Number) occurs when all exponent bits are ones and at least one mantissa bit is nonzero.
Common causes:

1.0 / 0.0 → +∞

-1.0 / 0.0 → -∞

0.0 / 0.0 or sqrt(-1) → NaN

125
Q

What are the five floating point exceptions in IEEE 754?

A

Overflow → Result too large, returns +∞ or −∞.
Underflow → Result too small, returns 0 or a subnormal number.
Divide by zero → Returns ±∞.
Invalid operation → Returns NaN (e.g., 0/0, sqrt(-1))
Inexact result → Rounding occurs due to finite precision.

126
Q

What happens when a floating point exception occurs?

A

IEEE 754 sets a hardware flag to indicate the exception.
Some programming languages allow trapping these exceptions for debugging.
Most modern systems automatically handle them (e.g., returning NaN or infinity).

127
Q

How do floating point exceptions affect calculations?

A

Overflow can cause unrealistic results like infinite values.
Underflow can lead to unexpected zeros, affecting precision.
NaN results can propagate and break computations.
Inexact results can cause small but cumulative errors in iterative calculations.

128
Q

Why do subnormal numbers exist, and what is their downside?

A

Subnormal numbers extend the range of small values, preventing abrupt underflow. However:

They have reduced precision due to leading zeros.
Some hardware handles them more slowly than normal numbers.

129
Q

What is peak performance in computing?

A

Peak performance (Rpeak) is the theoretical maximum performance a system can achieve. It is measured in floating point operations per second (FLOP/s) and is calculated based on hardware specifications.

130
Q

How is peak performance used in ranking supercomputers?

A

While Rpeak is quoted in the Top500 list, Rmax (the maximum observed performance) is used for ranking systems

131
Q

What are the main factors that determine peak performance?

A

The peak performance of a system depends on:

Number of sockets (physical CPU packages).
Number of processor cores per socket (multi-core processors).
Clock frequency (measured in GHz).
Number of floating point operations per cycle (depends on the instruction set and precision).

132
Q

How does a compute node architecture affect performance?

A

A compute node consists of multiple CPU cores and shared memory. Example:

Zen compute node:
2 sockets
6 cores per socket
12 cores in total
24GB RAM (physically split across processors)
The number of cores and memory layout affect the overall performance.

133
Q

How do modern processors perform multiple floating point operations per cycle?

A

Modern CPUs support vector instructions and fused multiply-add (FMA), which allow multiple floating point operations per cycle.

Vector instructions (e.g., x86_64 AVX) operate on multiple values simultaneously.
FMA combines multiplication and addition into a single instruction.
The number of FLOPs per cycle depends on:

The instruction set (e.g., AVX, SVE).
The precision (single or double).

134
Q

What is the formula for Rpeak performance for a compute node?

A

Rpeak = Nsockets × Ncores × fclock × Noperations

Where:
Nsockets is the number of CPU sockets
Ncores is the number of cores per socket
fclock is the clock speed in GHz
Noperations is the number of floating point operations per cycle

135
Q

How is the peak performance of a cluster calculated?

A

RpeakCluster = RpeakNode x Number of compute nodes
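
A tiny worked sketch of the two formulas above, using the Zen node quoted earlier (2 sockets, 6 cores per socket); the 2.4 GHz clock, 4 FLOPs per cycle and 100-node cluster are illustrative assumptions, not values from these cards:

#include <stdio.h>

int main(void)
{
    int    n_sockets = 2;      /* Zen node: 2 sockets */
    int    n_cores   = 6;      /* 6 cores per socket */
    double f_clock   = 2.4e9;  /* clock speed in Hz (illustrative) */
    int    n_ops     = 4;      /* double precision FLOPs per cycle (illustrative) */
    int    n_nodes   = 100;    /* cluster size (illustrative) */

    double rpeak_node = n_sockets * n_cores * f_clock * n_ops;

    printf("Rpeak (node)    = %.1f GFLOP/s\n", rpeak_node / 1e9);
    printf("Rpeak (cluster) = %.1f GFLOP/s\n", n_nodes * rpeak_node / 1e9);
    return 0;
}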

136
Q

What is the difference between peak performance for single vs. double precision?

A

Single precision (32-bit) may achieve a higher Rpeak than double precision (64-bit) because some processors execute more single-precision FLOPs per cycle.
Scientific computing typically uses double precision for higher accuracy.

137
Q

Why is peak performance only a theoretical value?

A

Actual performance is lower due to memory bottlenecks, instruction scheduling, and other inefficiencies.

138
Q

Week 6

139
Q

Why is memory access a significant factor in computing performance?

A

Memory access is significantly slower than CPU speed. It takes approximately 100 clock cycles to access main memory, creating a performance bottleneck. Optimizing memory access through techniques like caching and memory hierarchy can improve overall efficiency.

140
Q

What is the memory hierarchy in modern computers, and why is it important?

A

The memory hierarchy consists of different levels of storage that balance capacity, cost, and access time. It includes registers, cache (L1, L2, L3), RAM, and secondary storage. This structure helps manage data efficiently and optimizes performance by reducing the need for frequent access to slower memory types.

141
Q

What is cache memory, and how does it help improve performance?

A

Cache memory is a small, fast memory that stores frequently accessed data. It reduces latency by keeping data close to the CPU, thereby decreasing the need to fetch data from slower main memory. Efficient use of cache memory significantly enhances processing speed.

142
Q

How is cache memory structured in modern CPUs?

A

Modern CPUs have multiple levels of cache:

L1 Cache: Small (32kB per core), very fast (~4 cycle latency).

L2 Cache: Larger (256kB per core), moderate speed (~11 cycle latency).

L3 Cache: Shared among cores (e.g., 20MB for 16-core processors), slower (~34 cycle latency).
Each level provides a trade-off between speed and storage capacity.

143
Q

What are temporal and spatial locality, and why are they important for caching?

A

Temporal Locality: Data recently accessed is likely to be accessed again soon.

Spatial Locality: Data near recently accessed memory locations is likely to be used soon.
Efficient programs leverage these principles to maximize cache hits and reduce memory latency.

144
Q

What is a cache line, and why is it important for performance?

A

A cache line is a fixed-size block of memory transferred between main memory and cache. On Intel x86_64 processors, cache lines are 64 bytes. Programs that access data sequentially (e.g., arrays with stride-1 access) optimize cache usage, reducing cache misses and improving speed.

145
Q

What is the impact of loop ordering on memory access performance?

A

Loop ordering affects memory access patterns. In C, iterating over arrays with the inner loop using the right-most index (e.g., T[i][j] with j as the inner loop) ensures sequential memory access, improving spatial locality and cache efficiency. Incorrect loop ordering can lead to poor cache performance and slower execution.
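
A sketch contrasting the two orderings for a row-major C array (the array size is illustrative):

#include <stdio.h>

#define NX 1024
#define NY 1024

static double T[NX][NY];

int main(void)
{
    /* Good ordering: j (the right-most index) varies fastest, so memory is
       accessed sequentially (stride-1) and each cache line is fully used */
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            T[i][j] = 1.0;

    /* Poor ordering: i varies fastest, so consecutive accesses are NY
       elements apart and most of each cache line is wasted */
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            T[i][j] *= 2.0;

    printf("T[0][0] = %f\n", T[0][0]);
    return 0;
}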

146
Q

How does cache blocking improve performance?

A

Cache blocking divides data into blocks that fit into cache, ensuring that all required data is accessed from fast memory rather than slower main memory. This technique enhances temporal locality by reusing data within cache before moving to the next block.

147
Q

What is the significance of cache blocking in numerical computations?

A

Numerical computations, such as finite difference methods, can suffer from poor cache performance if data access patterns are inefficient. Cache blocking ensures that computations reuse data stored in cache before fetching new data from main memory, improving execution speed.

148
Q

How do test case sizes impact performance testing?

A

Small test cases may not reflect real-world performance due to cache effects. Performance tests should use problem sizes that reflect realistic memory footprints while keeping execution time manageable.

149
Q

What role do compilers play in optimizing memory access?

A

Compilers can apply optimizations such as loop interchange and cache blocking to improve memory access patterns. Compiler flags like -O3 in GCC enable aggressive optimizations to enhance execution speed.

150
Q

What is algorithmic intensity, and why is it important?

A

Algorithmic intensity (also called arithmetic intensity or operational intensity) is the ratio of floating-point operations to memory accesses, measured in FLOPs per byte. It helps determine whether an algorithm is limited by computation or memory bandwidth.

151
Q

How does arithmetic intensity vary among different algorithms?

A

Different algorithms have different arithmetic intensities. High arithmetic intensity means an algorithm is compute-bound, making better use of floating-point units, while low arithmetic intensity indicates a memory-bound algorithm limited by bandwidth.

152
Q

What is the Roofline model, and how does it help in performance analysis?

A

The Roofline model is a visual representation of floating-point performance based on peak performance, memory bandwidth, and arithmetic intensity. It helps determine whether an application is compute-bound or memory-bound and guides optimization efforts.

153
Q

What does the Roofline model tell us about memory bandwidth and caching?

A

The Roofline model shows that algorithms with higher arithmetic intensity can achieve higher performance. Data stored in cache has significantly higher bandwidth than DRAM, highlighting the importance of optimizing memory access patterns.

154
Q

How does problem size affect arithmetic intensity?

A

Arithmetic intensity can vary with problem size. Some algorithms exhibit higher intensity at larger problem sizes, making them more compute-efficient, while others remain memory-bound regardless of scale.

155
Q

What is NUMA (Non-Uniform Memory Access)?

A

NUMA is a memory architecture where memory access times vary depending on the memory location relative to the processor. Some memory regions are faster to access than others, affecting performance.

156
Q

Why does NUMA exist in modern CPU architectures?

A

Modern CPUs integrate memory controllers and use multi-socket configurations. Accessing memory controlled by another socket is slower than accessing local memory, creating non-uniform memory access times.

157
Q

How does the first-touch memory allocation policy work?

A

Memory pages are allocated on the first memory controller that accesses them. If an application initializes an array sequentially, all memory may be allocated to a single socket, creating performance imbalances in multi-socket systems.

158
Q

Why is NUMA relevant for OpenMP programs?

A

In OpenMP, memory is shared between threads. If a single thread initializes an array, all memory may be assigned to one socket. Other threads from a different socket experience higher access latency, impacting performance.

159
Q

How can OpenMP performance issues caused by NUMA be mitigated?

A

Initialise shared arrays in parallel, so that under the first-touch policy each thread's pages are allocated on the memory controller local to the thread that will use them. Binding threads to cores/sockets (thread affinity) then keeps each thread close to the data it initialised.

160
Q

What is hybrid parallelism, and how does it help in NUMA systems?

A

Hybrid parallelism combines MPI and OpenMP. Each MPI process runs within a single NUMA node, launching OpenMP threads that operate locally. This prevents NUMA-related slowdowns in OpenMP programs.

162
Q

Week 7

163
Q

What factors contribute to performance degradation in parallel computing?

A

Several factors can reduce the idealized peak performance in parallel computing:

Starvation: Not enough parallel work to keep processors busy, leading to inefficiencies.

Latency: The time taken for information to travel across the system (e.g., memory access, message passing).

Overhead: Extra computational work required beyond the main computation (e.g., managing OpenMP parallel regions).

Waiting: Contention for shared resources (e.g., memory or network bandwidth), causing delays.

164
Q

What is parallel speed-up, and how is it calculated?

A

Parallel speed-up measures how much faster a parallel program runs compared to a serial version:

SN = T0/TN

where:

SN is the speed-up on N processors,

T0 is the execution time of the serial program,

TN is the execution time of the parallel program using N processors.
If the parallel execution time is half the serial time, the speed-up is 2.

165
Q

What is parallel efficiency, and how is it calculated?

A

Parallel efficiency measures how effectively computational resources are utilized: EN = SN/N

where:

EN is the efficiency,

SN is the speed-up,

N is the number of processors.
An efficiency of 1 means the computation scales perfectly with the number of processors.

166
Q

What is strong scaling in parallel computing?

A

Strong scaling refers to reducing execution time while keeping the total problem size fixed as the number of processors increases. Example:

A 1024×1024 problem using domain decomposition:

4 processors → 512×512 sub-domains

16 processors → 256×256 sub-domains

64 processors → 128×128 sub-domains
Strong scaling is described by Amdahl’s Law.

167
Q

What is Amdahl’s Law, and how does it limit parallel performance?

A

Amdahl’s Law states that the speed-up of a parallel program is limited by the fraction of the program that cannot be parallelized:

SN = 1/(s + p/N)
where:

s is the serial fraction,

p is the parallel fraction (s + p = 1),

N is the number of processors.
Even with infinitely many processors, the speed-up is limited by the serial fraction: the limit is 1/s.

168
Q

How does parallel fraction affect speed-up?

A

The effectiveness of parallelization depends on the parallel fraction p. For large N the maximum speed-up approaches 1/(1 − p):

If p = 0.5, maximum speed-up is 2.

If p = 0.9, maximum speed-up is 10.

If p = 0.99, maximum speed-up is 100.
A high parallel fraction is necessary for effective parallel computing.

169
Q

What limitation does Amdahl’s Law present for parallel processing?

A

Amdahl’s Law suggests that the serial portion of a program limits the maximum possible speed-up, making parallelization less beneficial when a significant portion of the computation cannot be parallelized.

170
Q

How did Gustafson’s Law challenge Amdahl’s Law?

A

Gustafson’s Law argues that in practical scenarios, problem sizes tend to increase with computational power. The speed-up formula is:
SN = s + pN

where:

s is the serial part,

pN is the parallelized workload.

Unlike Amdahl’s Law, Gustafson’s Law suggests that speed-up scales linearly with N if the problem size grows accordingly.

171
Q

What is weak scaling in parallel computing?

A

Weak scaling refers to maintaining a constant execution time while increasing the problem size proportionally to the number of processors. Example:

4 processors → 256×256 total size

16 processors → 512×512 total size

64 processors → 1024×1024 total size
Weak scaling is described by Gustafson’s Law.

172
Q

How do strong scaling and weak scaling differ?

A

Strong Scaling: Fixed problem size; increases processors to reduce execution time. Governed by Amdahl’s Law.

Weak Scaling: Increases problem size proportionally to processors; execution time remains constant. Governed by Gustafson’s Law.

173
Q

What factors can contribute to performance degradation in parallel computing?

A

Performance degradation in parallel computing can be caused by:

Starvation: Insufficient parallel work or uneven distribution of work among processors.

Latency: Time taken for data to travel within the system, e.g., memory access or message passing delays.

Overhead: Additional work required apart from computation, e.g., starting/stopping OpenMP regions.

Waiting: Threads/processes competing for shared resources, such as memory or network bandwidth.

174
Q

What is a barrier in parallel computing, and why is it important?

A

A barrier is a synchronization mechanism ensuring all threads reach a specific point before proceeding. It prevents race conditions and ensures correctness by enforcing execution order in parallel programs.

175
Q

How do implicit barriers function in OpenMP?

A

OpenMP includes implicit barriers at the end of parallel regions and work-sharing constructs (e.g., #pragma omp parallel for). This ensures all threads complete their assigned work before moving to the next section of code.

176
Q

How does an explicit barrier differ from an implicit barrier in OpenMP?

A

An explicit barrier (#pragma omp barrier) is manually inserted to synchronize threads at a specific point. Implicit barriers occur automatically at the end of work-sharing constructs unless removed using nowait.

177
Q

What is the purpose of the nowait clause in OpenMP?

A

The nowait clause removes an implied barrier at the end of work-sharing constructs, allowing threads to continue execution without waiting. This can improve performance but requires caution to avoid race conditions.

178
Q

What is load balancing in parallel computing?

A

Load balancing ensures computational work is evenly distributed among processors or threads to prevent some from being idle while others are overloaded. It enhances efficiency and minimizes waiting times.

179
Q

Why is loop scheduling important for load balancing?

A

Loop scheduling distributes loop iterations among threads to prevent idle time. When iterations require different amounts of work, proper scheduling helps balance workload and maximize efficiency.

180
Q

What are the different loop scheduling options in OpenMP?

A

OpenMP provides multiple scheduling strategies:

Static, chunk: Iterations are divided into equal-sized chunks and assigned round-robin to threads.

Dynamic, chunk: Iterations are assigned in chunks; threads request more work when they finish.

Guided, chunk: Similar to dynamic but chunk sizes decrease over time, proportional to remaining work.
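
A hedged sketch of the schedule clause with dynamic scheduling (the chunk size 4 and the artificial uneven workload are illustrative):

#include <stdio.h>

/* Stand-in for work whose cost grows with i (illustrative) */
static double work(int i)
{
    double s = 0.0;
    for (int k = 0; k < i * 1000; k++)
        s += k * 1e-9;
    return s;
}

int main(void)
{
    double total = 0.0;

    /* dynamic,4: threads grab 4 iterations at a time as they finish,
       so the later, more expensive iterations don't pile up on one thread */
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < 1000; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}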

181
Q

How does dynamic scheduling improve load balancing?

A

Dynamic scheduling assigns chunks of work to threads as they become available, ensuring no thread remains idle. This is useful when workload per iteration varies.

182
Q

What is the difference between guided and dynamic scheduling?

A

Both methods assign work dynamically, but guided scheduling starts with large chunk sizes and gradually reduces them, reducing overhead while maintaining flexibility.

183
Q

How does MPI handle load balancing?

A

In MPI, load balancing involves distributing work among processes. Blocking communications synchronize processes, while techniques like the manager-worker model help dynamically allocate work.

184
Q

What are blocking communications in MPI, and how do they affect performance?

A

Blocking communications force processes to wait until a message is fully sent or received, which synchronizes processes but may introduce idle time if not managed properly.

185
Q

What is an interconnect in the context of MPI programs?

A

An interconnect is the network that links compute nodes, enabling message passing between MPI processes. The efficiency of an interconnect significantly impacts parallel performance.

186
Q

Why is minimizing communication time important in MPI programs?

A

Reducing communication time enhances parallel efficiency and scalability. Excessive time spent on message passing can bottleneck performance, particularly in large-scale computations.

187
Q

What are the common types of interconnect used in HPC clusters?

A

The two commonly used interconnects are:

Gigabit Ethernet: Affordable but relatively slow.

Infiniband: High-speed, low-latency networking used in high-performance computing (HPC) systems like Isca.

188
Q

What considerations influence the choice of interconnect for an HPC system?

A

Interconnect choice is based on cost and workload characteristics. If communication is a minor factor in performance, a cheaper option (e.g., Ethernet) may suffice. For communication-heavy workloads, high-speed interconnects like Infiniband are preferred.

189
Q

How does network topology impact latency?

A

The number of “hops” (intermediate nodes) between compute nodes affects latency. A fully connected topology minimizes hops but is impractical for large networks. More scalable designs balance connectivity and cost.

190
Q

What are common network topologies used in HPC?

A

Fully Connected: Every node is linked to every other node (ideal but costly for large networks).

Bus/Ring: Simple but lacks sufficient connectivity for HPC.

Fat Tree: Provides greater bandwidth at higher levels to manage more traffic and scale effectively.

191
Q

What is a fat tree topology and why is it used in HPC clusters?

A

A fat tree topology has greater bandwidth at higher levels of the tree, allowing better scaling and handling of network traffic. It balances cost and performance effectively.

192
Q

How does MPI process placement affect communication overhead?

A

MPI processes on the same compute node communicate faster than those on different nodes.

Inter-node communication speed depends on network connectivity.

Optimizing process placement can reduce communication time.

193
Q

How can we model the best-case message transmission time?

A

Transmission time can be approximated as:t = L + M/B

Where:

L = Latency (fixed setup time for communication)

M = Message size

B = Bandwidth (data transfer rate)

194
Q

How does latency affect small messages in MPI communication?

A

For small messages, transmission time is primarily determined by latency. The setup time (L) dominates because the message size (M) is small.

195
Q

How does bandwidth affect large messages in MPI communication?

A

For large messages, bandwidth is the dominant factor. The time taken to transmit data depends on the rate at which data can be transferred.

196
Q

What is the estimated transmission time for a 1 kB message with 1μs latency and 100 Gbit/s bandwidth?

A

Using t = L + M / B:

L = 1 × 10⁻⁶ s
M = 1000 bytes
B = 12.5 × 10⁹ bytes/s

t = 1 × 10⁻⁶ + 1000 / (12.5 × 10⁹) = 1.08 × 10⁻⁶ s
Latency dominates in this case.

197
Q

What is the estimated transmission time for a 1 MB message with 1μs latency and 100 Gbit/s bandwidth?

A

Using t = L + M / B:

L = 1 × 10⁻⁶ s

M = 1 × 10⁶ bytes

B = 12.5 × 10⁹ bytes/s

t = 1 × 10⁻⁶ + (1 × 10⁶) / (12.5 × 10⁹) = 8.1 × 10⁻⁵ s
Bandwidth dominates in this case.

198
Q

When is low latency more important than high bandwidth in MPI communication?

A

Low latency is crucial for workloads with frequent small message exchanges (e.g., point-to-point communication). High bandwidth is more important when transmitting large messages (e.g., bulk data transfers).