High Performance Computing Flashcards

1
Q

Week 1

A
2
Q

What relatively new scientific field is a major user of high performance computing?

A

Life sciences for applications such as genome processing

3
Q

How can we determine the floating point performance of a high performance computer?

A
  • Linpack is a performance benchmark which measures floating point operations per
    second (flops) using a dense linear algebra workload
  • A widely used performance benchmark for HPC systems is a parallel version of Linpack
    called HPL (High-Performance Linpack)
4
Q

Why is parallelization important in HPC?

A

Demanding floating point calculations can be run in parallel to make use of multiple cores

5
Q

What is Dennard scaling?

A

Dennard scaling was a recipe for keeping power per unit area (power density) constant as transistors were scaled to smaller sizes.

As transistors became smaller they also became faster (reduced delay) and more energy efficient (reduced threshold voltage).

With very small features, limits associated with the physics of the device (e.g. leakage current) are reached.

Dennard scaling has broken down and processor clock speeds are no longer increasing.

6
Q

What is the current most common supercomputer architecture?

A

Current systems are all based on integrating many multi-core processors.

  • The dominant architecture is now the “commodity cluster”
  • Commodity clusters integrate off-the-shelf (OTS) components to make an HPC system (cluster)
7
Q

Give the proper definition of a commodity cluster

A

A commodity cluster is a cluster in which both the network and the compute nodes are commercial products available for procurement and independent application by organisations (end users or separate vendors) other than the original equipment manufacturer.

8
Q

Give four components of a cluster architecture

A

* Compute nodes: provide the processor cores and memory required to run the workload
* Interconnect: cluster internal network enabling compute nodes to communicate and access storage
* Mass storage: disk arrays and storage nodes which provide user filesystems
* Login nodes: provide access (e.g. ssh) for users and administrators via external network

9
Q

Why do high performance computers use compiled languages?

A

Maximizes performance.
Compilers parse code and generate executables with optimizations.
Optimizations at compile-time are less costly than at runtime.

10
Q

What are common languages for high performance computers?

A

C, C++ and Fortran

11
Q

Why must parallelisation be done manually?

A

Parallelization is too complex for compilers to handle automatically.
Programmers add parallel features.

12
Q

Week 2

A
13
Q

What can we use OpenMP for?

A

OpenMP provides extensions to C, C++ and Fortran
* These extensions enable the programmer to specify where parallelism should be added and how to add it
* The extensions provided by OpenMP are:
- Compiler directives
- Environment variables
- Runtime library routines

14
Q

What does it mean to say that OpenMP uses a fork join execution model?

A

Execution starts with a single thread (master thread)
- Worker threads start (fork) on entry to a parallel region
- Worker threads exit (join) at the end of the parallel region
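
A minimal sketch of the fork-join model (assumes a C compiler with OpenMP enabled, e.g. gcc -fopenmp; the printed text is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Master thread before the parallel region\n");

    /* Fork: worker threads start here */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();   /* this thread's number */
        int n  = omp_get_num_threads();  /* threads in the team */
        printf("Hello from thread %d of %d\n", id, n);
    }   /* Join: worker threads exit here */

    printf("Master thread after the parallel region\n");
    return 0;
}

Setting OMP_NUM_THREADS before running (see the next card) controls how many "Hello" lines appear.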

15
Q

What can we use the OMP_NUM_THREADS environment variable for?

A

We can use the OMP_NUM_THREADS environment variable to control the number of threads forked in a parallel region e.g.

  • export OMP_NUM_THREADS=4
  • OMP_NUM_THREADS is one of the environment variables defined in the standard
  • If you don’t specify the number of threads the default value is implementation defined
    (i.e. the standard doesn’t say what it has to be)
16
Q

What does openMP provide that we can call directly from our functions?

A

OpenMP provides compiler directives, environment variables and a runtime library with functions we can call directly from our programs

17
Q

What header file must be included to use the OpenMP runtime library?

A

<omp.h>

18
Q

Why is conditional compilation useful in programs that use OpenMP?

A

It ensures that the program can compile and run as a serial version when OpenMP is not enabled, avoiding errors caused by missing OpenMP compiler flags.

19
Q

What is the role of the C pre-processor in the compilation process?

A

The C pre-processor processes source code before it is passed to the compiler, handling directives such as #include and #ifdef.

20
Q

What does the _OPENMP macro indicate when it is defined?

A

It indicates that OpenMP is enabled and supported by the compiler.

21
Q

What is the syntax of the #ifdef directive used for conditional compilation?

A

#ifdef MACRO
// Code included if MACRO is defined
#else
// Code included if MACRO is not defined
#endif
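
A hedged example of combining this with the _OPENMP macro so the same file builds serially or with OpenMP (the -fopenmp flag is the usual GCC spelling; the messages are illustrative):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>   /* only needed when OpenMP is enabled */
#endif

int main(void)
{
#ifdef _OPENMP
    printf("Compiled with OpenMP, max threads = %d\n", omp_get_max_threads());
#else
    printf("Compiled as a serial program\n");
#endif
    return 0;
}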
22
Q

What is the main benefit of using conditional compilation with OpenMP programs?

A

It allows the same source code to support both serial and parallel execution by enabling or disabling OpenMP-related code.

23
Q

What is one good way to distribute the workload when working in parallel?

A

One way to do this in OpenMP is to parallelise loops

  • Different threads carry out different iterations of the loop
  • We can parallelise a for loop from inside a parallel region:
    #pragma omp for
  • We can start a parallel region and parallelise a for loop:
    #pragma omp parallel for
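
A minimal sketch of a parallelised loop (the arrays and the arithmetic are illustrative):

#include <stdio.h>

#define N 1000

int main(void)
{
    static double x[N], y[N];

    /* Start a parallel region and divide the loop iterations among threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        x[i] = (double)i;
        y[i] = 2.0 * x[i];
    }

    printf("y[N-1] = %f\n", y[N - 1]);   /* results land in order in the arrays */
    return 0;
}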
24
Q

What changes about the order of loop iterations when they are executed in parallel?

A

When the loop is parallelised the iterations will not take place in the order specified by the loop iterator
* We can’t rely on the loop iterations taking place in any particular order

25
Q

How do we solve the issue of loop iterations not running in order when executed in parallel?

A

The results are stored in arrays which do hold the results in the order we require
* If we want to print out the results of our calculation in order we will need a second sequential (not parallelised) loop

26
Q

What is the correct way to define variable scope for OpenMP loops?

A

In the examples in the previous unit different threads accessed the same copies of arrays x and y (shared)
* The loop index i was different for each thread (private)
The correct functioning of an OpenMP loop requires correct variable scoping
* In this case the default scoping rules did the right thing

27
Q

Why might we want to define the variable scope before running parallelised code?

A

Explicitly declaring variable scope makes the code much easier to understand and more likely to be correct

  • We can specify the default scope by adding a clause to the directive which starts the parallel region:
  • default (shared)
  • default (none)
  • The default clause can be followed by a list of private and/or shared variables
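
A sketch using default(none), which forces every variable used in the region to be listed explicitly as shared or private (the variable names are illustrative):

#include <stdio.h>

#define N 1000

int main(void)
{
    static double x[N], y[N];
    double a = 2.0;
    int i;

    /* default(none): the compiler reports an error for any variable whose
       scope has not been declared explicitly */
    #pragma omp parallel for default(none) shared(x, y, a) private(i)
    for (i = 0; i < N; i++)
        y[i] = a * x[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}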
28
Q

What is a data race?

A

Data races are bugs specific to parallel programming and occur when multiple threads try to update the same variable at the same time

29
Q

How can we solve data race issues in parallel programming?

A

We need each thread to have its own copy of the variable and combine them all at the end of execution

30
Q

How is the reduction clause used for combining multithreaded variables?

A

When OpenMP encounters the reduction clause:

Each thread gets a private copy of the variable, initialized based on the operator (e.g., 0 for + or 1 for *).

Threads compute partial results in parallel.
At the end of the parallel region, OpenMP combines all thread-local results into the global variable using the specified operator.
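
A minimal sum-reduction sketch (the array contents are illustrative):

#include <stdio.h>

#define N 1000

int main(void)
{
    static double x[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        x[i] = 1.0;                     /* illustrative data */

    /* Each thread sums into a private copy of 'sum' (initialised to 0 for +);
       the private copies are combined into the shared 'sum' at the end */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %f\n", sum);          /* expect 1000.0 */
    return 0;
}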

31
Q

What are data dependencies?

A

Data dependencies occur when:

One statement reads from or writes to a memory location.

Another statement reads from or writes to the same memory location.

At least one of these operations is a write.

The result depends on the order of execution, leading to potential issues in parallel computing.

32
Q

What are loop-carried dependencies?

A

Loop-carried dependencies occur when iterations of a loop depend on the results of previous iterations. This can cause issues in parallel execution. There are three main types:

Flow Dependency (Read-after-Write)

Anti-Dependency (Write-after-Read)

Output Dependency (Write-after-Write)

33
Q

What is a flow dependency?

A

Flow dependency, also known as Read-after-Write (RAW) dependency, occurs when an iteration requires the result from a previous iteration. For example, calculating a[i] might depend on a[i-1] being updated. This dependency is difficult to eliminate and can restrict parallelism.

34
Q

Why is a flow dependency challenging to remove in parallel loops?

A

Flow dependencies require iterations to be executed in a specific order because the result of one iteration directly affects the next. For example, a[i] = a[i-1] + 1 depends on a[i-1] being calculated first, making it hard to parallelize.

35
Q

What is an anti-dependency?

A

Anti-dependency, or Write-after-Read (WAR) dependency, happens when an iteration requires a value that another iteration might overwrite. For instance, calculating a[i] needs a[i+1] before it is updated by a later iteration. This can often be resolved by writing results to a different array.

36
Q

How can anti-dependencies be resolved?

A

Anti-dependencies can be resolved by:

Writing to a different array, ensuring iterations are independent.

Copying the results back to the original array if necessary after the loop completes.

37
Q

What is an output dependency?

A

Output dependency, or Write-after-Write (WAW) dependency, occurs when multiple iterations write to the same memory location. For example, if every loop iteration writes to x, the final value of x depends on which iteration executes last, leading to unpredictable behavior in parallel.

38
Q

How can output dependencies be addressed?

A

Output dependencies can be addressed by scoping the variable as lastprivate in OpenMP. This ensures that each thread has a private copy, and the value from the sequentially last iteration is retained after the parallel region ends.

39
Q

What is the role of lastprivate variables in parallel loops?

A

Lastprivate variables in OpenMP:

Provide each thread with a private copy of the variable.

Retain the value from the sequentially last iteration when the parallel region ends.

This helps resolve output dependencies and ensures correct behavior in parallelized loops.
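
A small sketch of lastprivate in use (the loop body is illustrative):

#include <stdio.h>

int main(void)
{
    int n = 100;
    double x = 0.0;

    /* Every iteration writes x (a write-after-write pattern); lastprivate
       keeps the value written by the sequentially last iteration (i == n-1) */
    #pragma omp parallel for lastprivate(x)
    for (int i = 0; i < n; i++)
        x = 0.5 * i;

    printf("x = %f\n", x);              /* 49.5, the same as a serial run */
    return 0;
}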

40
Q

What are the key types of loop-carried dependencies and their solutions?

A

The three types of loop-carried dependencies and their typical solutions are:

Flow Dependency (Read-after-Write): May require rethinking the algorithm, as it’s challenging to parallelize.

Anti-Dependency (Write-after-Read): Resolved by writing to a different array.

Output Dependency (Write-after-Write): Resolved by using lastprivate variables.

41
Q

Week 3

A
42
Q

What is a differential equation?

A

A differential equation (DE) is an equation that contains one or more derivatives. Derivatives represent the rate of change of one variable with respect to another. Examples include velocity, which is the rate of change of position with time (dx/dt), and acceleration, which is the rate of change of velocity with time (d²x/dt²).

43
Q

What is a partial differential equation (PDE)?

A

A partial differential equation (PDE) is a differential equation that contains derivatives with respect to more than one variable. For example, if a function u depends on two variables x and t, we write u = u(x, t). The partial derivatives are written using curly symbols (∂), indicating the rate of change with respect to one variable while keeping others constant.

44
Q

Why are partial differential equations important in high-performance computing (HPC)?

A

Many HPC applications require solving PDEs numerically. Examples include:

Weather and climate modeling

Astrophysical simulations

Engineering problems (e.g., structural analysis, fluid dynamics)
These fields rely on PDEs to describe changes over space and time.

45
Q

What is exponential decay and how is it denoted?

A

When a quantity decreases at a rate which is proportional to the quantity itself it undergoes exponential decay.

  • The exponential decay of a quantity N(t) is described by the equation:

dN/dt = −λN

where λ is a positive constant

46
Q

What differential equation describes the number of nuclei N(t) as a function of time?

A

dN/dt = −λN(t)

47
Q

How to numerically estimate a derivative when you can’t solve the differential equation directly?

A

We can approximate the derivative using small but finite differences

48
Q

What is the trade-off when choosing a time step (Δt) in numerical solutions (estimating derivatives)?

A

Larger time steps require less computation but result in less accurate solutions, while smaller time steps increase accuracy but require more computation.
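
A serial sketch of this trade-off applied to the exponential decay equation dN/dt = −λN, stepping forward with a finite difference; the values of λ, Δt and N(0) are illustrative assumptions:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double lambda = 0.5;    /* decay constant (illustrative) */
    double dt     = 0.01;   /* time step: smaller is more accurate but needs more steps */
    double N      = 1000.0; /* initial quantity N(0) (illustrative) */
    double t_end  = 10.0;

    /* Finite difference: N(t + dt) ≈ N(t) + dt * dN/dt = N(t) - dt * lambda * N(t) */
    for (double t = 0.0; t < t_end; t += dt)
        N = N - dt * lambda * N;

    printf("numerical N(%.1f) = %f, exact = %f\n",
           t_end, N, 1000.0 * exp(-lambda * t_end));
    return 0;
}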

49
Q

What is advection?

A

Advection is the transport of a quantity by a velocity field. Examples include silt in a river, dust in the atmosphere, and salt in the ocean.

50
Q

Week 3 is being sacked off for now - Must do Later

A
51
Q

Week 4 notes

A
52
Q

What is an advantage of using MPI over OpenMP in distributed systems?

A

OpenMP will only work where processors share memory but MPI can work in distributed memory systems such as clusters

53
Q

What are the key steps in setting up and shutting down an MPI environment?

A

The key steps are:

Initialisation - Call MPI_Init before any MPI functions to set up the MPI environment.

Execution - Execute MPI functions within the environment.

Finalisation - Call MPI_Finalize after all MPI calls to cleanly shut down the MPI environment.

54
Q

What does MPI_Init do?

A

MPI_Init initializes the MPI execution environment, passing command-line arguments to all MPI processes.

Syntax:

int MPI_Init(int *argc, char ***argv);

argc: Pointer to the number of arguments.

argv: Pointer to the argument array.

Returns an integer error code.

55
Q

What does MPI_Finalize do?

A

MPI_Finalize closes down the MPI execution environment. It takes no arguments and returns an integer error code. Syntax:
int MPI_Finalize();

56
Q

What is the purpose of MPI_Comm_size and how are processes ranked?

A

MPI_Comm_size reports the number of MPI processes in a specified communicator, and MPI_Comm_rank reports the rank of the calling process within it. Syntax:

int MPI_Comm_size(MPI_Comm comm, int *size);
int MPI_Comm_rank(MPI_Comm comm, int *rank);

comm: Communicator (e.g., MPI_COMM_WORLD).

size / rank: Store the number of processes / the rank of the calling process.

Each MPI process has a rank between 0 and size − 1

57
Q

What is a communicator in MPI?

A

A communicator defines a group of processes that can communicate with each other. The predefined MPI_COMM_WORLD includes all processes in an MPI program. Custom communicators can also be created if needed.

58
Q

How do you compile an MPI program?

A

Use the mpicc compiler wrapper:

mpicc -o hello_mpi hello_mpi.c

This calls the backend compiler and handles include paths, library paths, and flags.

59
Q

How do you run an MPI program?

A

Use mpirun to start multiple instances of the executable:

mpirun -np 4 hello_mpi

-np 4 specifies running the program with 4 processes.

60
Q

What is the function of mpicc?

A

mpicc is a wrapper compiler for MPI programs. It calls the backend compiler, managing include paths, library paths, and additional flags to simplify compilation.

61
Q

What is the function of mpirun?

A

mpirun is used to launch an MPI program with multiple processes. It ensures correct execution across multiple processes and nodes.

62
Q

What is MPI_COMM_WORLD?

A

MPI_COMM_WORLD is the default communicator that includes all processes in an MPI program, enabling communication between them.

63
Q

What are the main MPI housekeeping functions?

A

The main MPI housekeeping functions are:

MPI_Init (Initialize environment)

MPI_Finalize (Shut down environment)

MPI_Comm_size (Get number of processes)

MPI_Comm_rank (Get process rank)
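
A minimal sketch tying the four housekeeping calls together (compile with mpicc and launch with mpirun as in the earlier cards; the printed message is illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int size, rank;

    MPI_Init(&argc, &argv);                /* set up the MPI environment */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank: 0..size-1 */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                        /* shut the environment down */
    return 0;
}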

64
Q

What is point-to-point communication in MPI?

A

Point-to-point communication is a type of MPI communication that occurs between a pair of processes in a communicator. It allows direct sending and receiving of messages between specific processes.

65
Q

What are the basic functions for point-to-point communication in MPI?

A

The two basic functions are:

MPI_Send - Used to send a message to another process.

MPI_Recv - Used to receive a message from another process.

66
Q

What is the syntax of MPI_Send?

A

MPI_Send is used to send a message and has the following syntax:

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);

buf: Pointer to the data to be sent.

count: Number of elements to send.

datatype: Type of data being sent.

dest: Rank of the destination process.

tag: User-defined label for communication.

comm: Communicator (e.g., MPI_COMM_WORLD).

67
Q

What is the syntax of MPI_Recv?

A

MPI_Recv is used to receive a message and has the following syntax:

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);

buf: Pointer to the buffer where received data will be stored.

count: Number of elements to receive.

datatype: Type of data being received.

source: Rank of the sending process.

tag: User-defined label for communication.

comm: Communicator (e.g., MPI_COMM_WORLD).

status: Structure containing details about the received message (source rank and tag).
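
A hedged two-process sketch of a send/receive pair (the tag 99 and the payload are illustrative; run with at least two processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double value = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14;   /* illustrative payload */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %f from rank %d\n", value, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}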

68
Q

What is the role of the tag parameter in MPI_Send and MPI_Recv?

A

The tag parameter acts as a user-defined label for communication, helping to distinguish between different messages. The receiver can filter messages based on their tag value.

69
Q

What is the purpose of MPI_Status in MPI_Recv?

A

MPI_Status provides information about the received message, such as:

The rank of the sending process.

The message tag.

Additional error codes if applicable.

70
Q

What are the key takeaways from MPI point-to-point communication?

A

MPI point-to-point communication occurs between two processes.

MPI_Send and MPI_Recv are the fundamental functions.

Messages are identified using rank, tag, and communicator.

Message lengths are given in MPI data types, not bytes.

MPI_Status provides metadata about received messages.

71
Q

What is collective communication in MPI?

A

Collective communication refers to communication operations that involve a group of processes defined by a communicator. It simplifies MPI programming by providing built-in functions for data exchange between multiple processes, making it more efficient than using point-to-point communication.

72
Q

What are the main types of MPI collective communication functions?

A

The main types of MPI collective functions include:

MPI_Bcast (Broadcast)

MPI_Scatter (Scatter)

MPI_Gather/MPI_Allgather (Gather and Allgather)

MPI_Reduce/MPI_Allreduce (Reduce and Allreduce)

73
Q

What does MPI_Bcast do?

A

MPI_Bcast broadcasts data from one process (root) to all other processes within a communicator. Every process receives the same data.

Function signature:

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);

buffer: data to send

count: number of items to send

datatype: type of data

root: rank of the sending process

comm: communicator

74
Q

When would you use MPI_Bcast?

A

MPI_Bcast is useful when distributing the same data to all processes, such as when initializing parameters for computations that must be synchronized across multiple processes.

75
Q

What is MPI_Scatter?

A

MPI_Scatter distributes different portions of data from one root process to all other processes in a communicator. Unlike MPI_Bcast, each process receives a unique subset of the data.

Function signature:

int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm);

sendbuf: buffer containing data to send

sendcount: number of items sent per process

sendtype: type of data

recvbuf: buffer to receive data

recvcount: number of items received

recvtype: type of data

root: rank of sending process

comm: communicator

76
Q

When would you use MPI_Scatter?

A

MPI_Scatter is useful when dividing a large dataset into smaller parts, such as distributing different work portions to parallel processes.

77
Q

What is MPI_Gather?

A

MPI_Gather collects data from multiple processes and sends it to a single root process.

Function signature:

int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm);

sendbuf: buffer containing data to send

sendcount: number of elements sent per process

sendtype: type of data

recvbuf: buffer to collect data (only relevant at root)

recvcount: number of elements received per process

recvtype: type of data

root: rank of receiving process

comm: communicator

78
Q

When would you use MPI_Gather?

A

MPI_Gather is useful when collecting results from multiple processes for centralized processing, such as collecting partial computations for final aggregation.

79
Q

What is MPI_Allgather?

A

MPI_Allgather is similar to MPI_Gather but sends the collected data to all processes instead of just one root process.

Function signature:

int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm);

Same parameters as MPI_Gather but without a root process

80
Q

When would you use MPI_Allgather?

A

MPI_Allgather is useful when all processes need a complete dataset after individual contributions, such as in global synchronization scenarios.

81
Q

What is MPI_Reduce?

A

MPI_Reduce performs a reduction operation (e.g., sum, max, min) on data across multiple processes and sends the result to a designated root process.

Function signature:

int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype type, MPI_Op op, int root, MPI_Comm comm);

sendbuf: data to reduce

recvbuf: buffer for the result (only relevant at root)

count: number of elements to reduce

type: type of data

op: reduction operator

root: rank of process receiving the result

comm: communicator

82
Q

What are some common reduction operators used with MPI_Reduce?

A

Common reduction operators include:

MPI_MAX: Maximum value

MPI_MIN: Minimum value

MPI_SUM: Sum of values

MPI_PROD: Product of values

MPI_LAND: Logical AND

MPI_LOR: Logical OR

MPI_BAND: Bitwise AND

MPI_BOR: Bitwise OR

83
Q

What is MPI_Allreduce?

A

MPI_Allreduce is similar to MPI_Reduce but distributes the final reduced result to all processes instead of a single root.

Function signature:

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype type, MPI_Op op, MPI_Comm comm);

Same parameters as MPI_Reduce but without a root process
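
A hedged sketch of a global sum with MPI_Allreduce (each rank contributes its own rank number, which is purely illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank;   /* each process's contribution (illustrative) */
    int total = 0;

    /* Sum 'local' across all ranks; every rank receives the result */
    MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Rank %d sees global sum %d (expected %d)\n",
           rank, total, size * (size - 1) / 2);

    MPI_Finalize();
    return 0;
}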

84
Q

When would you use MPI_Allreduce?

A

MPI_Allreduce is useful when all processes require the reduced result, such as computing a global sum or average that must be available to every process.

85
Q

What are the communication patterns of MPI collective operations?

A

MPI collective communication patterns include:

One-to-all: MPI_Bcast, MPI_Scatter

All-to-one: MPI_Gather, MPI_Reduce

All-to-all: MPI_Allgather, MPI_Allreduce

86
Q

What is blocking communication in MPI?

A

Blocking communication in MPI means that the sender or receiver function does not return until the operation is complete. For example, MPI_Send blocks until the message is received or it is safe to modify the send buffer, and MPI_Recv blocks until the message has been received.

87
Q

What is the main issue with blocking communication?

A

Blocking communication can lead to deadlocks if the send and receive operations are not correctly matched. If a send operation blocks waiting for a corresponding receive, but the receive operation is also blocked waiting for a send, the program can become stuck indefinitely.

88
Q

How can blocking communication cause deadlocks?

A

If both processes call MPI_Send before either calls MPI_Recv, they will be waiting indefinitely, resulting in a deadlock. Small messages might be buffered, avoiding deadlocks in some cases, but increasing the message size can reintroduce the issue.

89
Q

How does non-blocking communication help avoid deadlocks?

A

Non-blocking communication allows the program to continue executing other instructions while the message is being sent or received. Since MPI_Isend and MPI_Irecv do not require immediate synchronization, they can prevent deadlocks by decoupling the send and receive operations.

90
Q

What is MPI_Isend and how is it used?

A

MPI_Isend is the non-blocking version of MPI_Send. It initiates a send operation but does not wait for completion, allowing computation to continue. Its syntax is:

int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request);

The request handle must later be checked using MPI_Wait or MPI_Waitall to ensure completion.

91
Q

What is MPI_Irecv and how is it used?

A

MPI_Irecv is the non-blocking version of MPI_Recv. It initiates a receive operation but does not wait for completion. The syntax is:

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request);

Like MPI_Isend, it requires MPI_Wait or MPI_Waitall to check completion.

92
Q

Why do we need MPI_Wait in non-blocking communication?

A

Since MPI_Isend and MPI_Irecv do not block, the program must ensure that the communication has completed before accessing the buffers again. MPI_Wait blocks execution until a specified non-blocking operation is finished.

93
Q

What is the syntax and purpose of MPI_Wait?

A

MPI_Wait waits for a specific non-blocking request to complete. The syntax is:

int MPI_Wait(MPI_Request *request, MPI_Status *status);

The request specifies the operation to wait for, and status returns the operation’s status.

94
Q

What is MPI_Waitall, and when is it used?

A

MPI_Waitall waits for multiple non-blocking requests to complete simultaneously. This is useful when handling multiple communications at once. The syntax is:

int MPI_Waitall(int count, MPI_Request array_of_requests[], MPI_Status array_of_statuses[]);

It takes an array of requests and ensures all are completed before proceeding.
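
A hedged sketch of a deadlock-free exchange between two ranks using the non-blocking calls above (tags and buffer contents are illustrative; run with exactly two processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, other;
    double sendval, recvval;
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other   = 1 - rank;          /* the partner rank (assumes exactly 2 processes) */
    sendval = (double)rank;

    /* Post both operations, then wait: neither rank sits blocked in a send */
    MPI_Isend(&sendval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recvval, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, stats);

    printf("Rank %d received %f\n", rank, recvval);

    MPI_Finalize();
    return 0;
}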

95
Q

What are collectives in MPI, and how do they behave?

A

Collective operations involve every process in a communicator, and all of those processes must make the same collective call. The standard collectives are blocking: they do not return until it is safe to reuse the buffers, so they can effectively synchronise the participating processes and make early arrivals wait.

96
Q

What are non-blocking collectives, and why were they introduced?

A

Non-blocking collectives, introduced in MPI version 3, allow processes to initiate collective operations without waiting for all participants. This helps avoid synchronization overhead and improves efficiency.

97
Q

What is domain decomposition in the context of MPI?

A

Domain decomposition is a method used in MPI to distribute computational work among multiple MPI processes. It involves dividing the computational domain into smaller subdomains, each assigned to a different process. This technique is commonly used in parallel computing for solving Partial Differential Equations (PDEs), where each process is responsible for a portion of the data and computation.

98
Q

How does domain decomposition distribute work among MPI processes?

A

Domain decomposition divides the entire computational domain into smaller regions, with each MPI process handling a separate region. This distribution allows parallel execution by ensuring that each process works on its assigned portion of the domain. Communication is required between processes to share boundary information where subdomains interact.

99
Q

What role does the rank zero process play in domain decomposition?

A

The rank zero process in an MPI program typically handles the following:

Reading input parameters from a file and broadcasting them to all processes.

Reading initial conditions from a file and distributing them using MPI_Scatter.

Collecting results from all processes using MPI_Gather and writing the final output to a file.

100
Q

Why must all MPI processes use the same time step in a PDE solver?

A

In a parallel PDE solver, all MPI processes must update their solutions at the same time step to maintain consistency. If the time step is constrained by stability conditions (e.g., the Courant condition), different regions of the domain may have different maximum allowable time steps. To ensure stability, the smallest time step across all regions must be used, which is determined using MPI_Allreduce with the MPI_MIN operator.

101
Q

How is the global minimum time step determined in a parallel PDE solver?

A

Each MPI process computes its local minimum allowable time step based on local conditions. The global minimum time step is then determined using the MPI_Allreduce function with the MPI_MIN operator. This ensures that all processes use the smallest time step from across the entire computational domain.

102
Q

Why is communication needed in domain decomposition when using finite difference methods?

A

Finite difference methods require values from neighboring grid points to compute derivatives. When the computational domain is decomposed among multiple MPI processes, some required values may reside in neighboring subdomains. To ensure correct calculations, processes must exchange boundary values with their neighbors through MPI communication.

103
Q

What is a “halo” in the context of domain decomposition?

A

A “halo” is a region of copied data from neighboring MPI processes, stored within each process’s local memory. Halos ensure that each process has access to the necessary boundary values from adjacent subdomains without constantly requesting them during calculations. This reduces communication overhead and improves efficiency.

104
Q

How are halos exchanged between neighboring MPI processes?

A

Halos are exchanged using non-blocking point-to-point MPI communications. Typical steps include:

Sending boundary values to neighboring processes using MPI_Isend.

Receiving boundary values from neighboring processes using MPI_Irecv.

Using MPI_Waitall to ensure that all communications complete before computation continues.
This ensures data consistency across subdomains without introducing unnecessary synchronization delays.

105
Q

What MPI functions are commonly used for halo exchange?

A

The following MPI functions are commonly used for halo exchange:

MPI_Isend: Sends boundary data to neighboring processes in a non-blocking manner.

MPI_Irecv: Receives boundary data from neighboring processes in a non-blocking manner.

MPI_Waitall: Ensures that all non-blocking communications complete before computation resumes.
These functions enable efficient data exchange between processes without unnecessary blocking.

106
Q

Week 5

107
Q

Why can’t all real numbers be represented as integers?

A

Real numbers include fractions and decimals, which cannot be accurately represented using only integer values. Computing often requires handling non-integer values, which necessitates floating-point representation.

108
Q

Why do we need floating point representation?

A

Floating point representation is needed to handle very large and very small numbers efficiently.

109
Q

What is the general structure of a floating point number?

A

A floating point number follows the format

±d.dd…d × β^e

where:

d.dd…d is the significand (mantissa), containing p digits

e is the exponent

β is the base (commonly 2 for binary systems)

p is the precision (number of significant digits stored)

For example, 9.109 × 10^−31 kg follows this representation.

110
Q

What is IEEE 754, and why is it important?

A

IEEE 754 is the most widely used standard for floating point arithmetic. It specifies:

Number representations (single, double precision)
Arithmetic operations (addition, subtraction, multiplication, division, square root)
Handling of special cases like infinity and NaN (Not a Number)

111
Q

What are single and double precision floating point formats in IEEE 754?

A

Single precision: 32-bit (float), commonly used in graphics, machine learning
Double precision: 64-bit (double), used in scientific and engineering computations
Double precision provides higher accuracy but requires more storage and computational power

112
Q

Why is double precision commonly used in scientific computing?

A

Double precision (64-bit) provides higher accuracy, which is crucial for scientific and engineering applications where small errors can accumulate significantly.

113
Q

How is a double precision floating point number structured in IEEE 754?

A

A double precision number (64 bits) consists of:

1 bit for sign (positive/negative)
11 bits for exponent (determining the range)
52 bits for the mantissa (determining precision)
It follows the formula:

x = ±(1.b1b2…b52)_2 × 2^(e − 1023)

where e = (a1a2…a11)_2 is the value of the 11-bit exponent field, i.e. the exponent has a bias of 1023.

114
Q

What are the smallest and largest normalised numbers in double precision?

A

Smallest normalised number:
2^−1022 ≈ 10^−308

Largest normalised number:
(2 − 2^−52) × 2^1023 ≈ 10^308

This defines the range of numbers that can be accurately represented.

115
Q

What is machine epsilon and why is it important?

A

Machine epsilon is the smallest difference between 1 and the next largest representable floating point number. For double precision, it is:
ε = 2^−52 ≈ 10^−16

It determines the precision limit and affects numerical stability.
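
A small sketch that makes machine epsilon visible (DBL_EPSILON is the standard <float.h> constant for double precision):

#include <stdio.h>
#include <float.h>

int main(void)
{
    double eps = DBL_EPSILON;    /* 2^-52 for an IEEE 754 double */

    printf("DBL_EPSILON = %e\n", eps);
    printf("1.0 + eps     == 1.0 ? %d\n", (1.0 + eps)       == 1.0);  /* 0: still distinguishable */
    printf("1.0 + eps/2.0 == 1.0 ? %d\n", (1.0 + eps / 2.0) == 1.0);  /* 1: rounds back to 1.0 */
    return 0;
}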

116
Q

What causes rounding errors in floating point arithmetic?

A

Floating point numbers can only approximate real numbers due to their finite precision. Rounding occurs when a number cannot be exactly represented, leading to small inaccuracies that can accumulate in calculations.

117
Q

What are the three main sources of numerical errors in computations?

A

Truncation error: From approximating continuous functions (e.g., finite difference methods).
Numerical method error: Errors from iterative or approximate solutions.
Round-off error: Due to the finite representation of floating point numbers.
If the numerical method error is smaller than the floating point round-off error, the solution is said to be computed to “machine precision.”

118
Q

How do floating point errors affect scientific computations?

A

Errors accumulate with each floating point operation. In high-precision calculations, rounding errors can propagate and significantly impact results, requiring careful numerical analysis.

119
Q

What are some strategies to reduce floating point errors?

A

Use higher precision formats (e.g., double instead of float)
Rearrange calculations to minimize subtraction of nearly equal numbers
Use numerically stable algorithms that minimize error accumulation

120
Q

What are floating point exceptions?

A

Floating point exceptions occur when floating point arithmetic encounters an issue, such as division by zero, overflow, or operations on undefined values. Examples include:

1.0 / 0.0 → infinity

0.0 / 0.0 → undefined (NaN)

Square root of a negative number → NaN

Assigning a result too large/small to a floating point format → overflow/underflow

121
Q

What are normalised floating point numbers?

A

A normalised floating point number has an exponent field that is neither all zeros nor all ones. It follows the IEEE 754 representation:

x = ±(1.b1b2…b52)_2 × 2^(e − 1023), where e = (a1a2…a11)_2

This ensures efficient use of bits and maintains precision.

122
Q

What happens when the exponent is all zeros or all ones?

A

All zeros → Subnormal number (reduced precision, gradual underflow)
All ones → Exceptional value (Infinity or NaN)

123
Q

What are subnormal floating point numbers?

A

A subnormal number is one whose exponent field is all zeros. It follows:

x = ±(0.b1b2…b52)_2 × 2^−1022

Subnormal numbers allow for gradual underflow instead of sudden zeroing.
However, they have leading zeros in the mantissa, meaning precision is lost.

124
Q

What are infinity (±∞) and NaN (Not a Number)?

A

Infinity (±∞) occurs when all exponent bits are ones and all mantissa bits are zeros.
NaN (Not a Number) occurs when all exponent bits are ones and at least one mantissa bit is nonzero.
Common causes:

1.0 / 0.0 → +∞

-1.0 / 0.0 → -∞

0.0 / 0.0 or sqrt(-1) → NaN

125
Q

What are the five floating point exceptions in IEEE 754?

A

Overflow → Result too large, returns +∞ or −∞.
Underflow → Result too small, returns 0 or a subnormal number.
Divide by zero → Returns ±∞.
Invalid operation → Returns NaN (e.g., 0/0, sqrt(-1))
Inexact result → Rounding occurs due to finite precision.

126
Q

What happens when a floating point exception occurs?

A

IEEE 754 sets a hardware flag to indicate the exception.
Some programming languages allow trapping these exceptions for debugging.
Most modern systems automatically handle them (e.g., returning NaN or infinity).

127
Q

How do floating point exceptions affect calculations?

A

Overflow can cause unrealistic results like infinite values.
Underflow can lead to unexpected zeros, affecting precision.
NaN results can propagate and break computations.
Inexact results can cause small but cumulative errors in iterative calculations.

128
Q

Why do subnormal numbers exist, and what is their downside?

A

Subnormal numbers extend the range of small values, preventing abrupt underflow. However:

They have reduced precision due to leading zeros.
Some hardware handles them more slowly than normal numbers.

129
Q

What is peak performance in computing?

A

Peak performance (Rpeak) is the theoretical maximum performance a system can achieve. It is measured in floating point operations per second (FLOP/s) and is calculated based on hardware specifications.

130
Q

How is peak performance used in ranking supercomputers?

A

While Rpeak is quoted in the Top500 list, Rmax (the maximum observed performance) is used for ranking systems

131
Q

What are the main factors that determine peak performance?

A

The peak performance of a system depends on:

Number of sockets (physical CPU packages).
Number of processor cores per socket (multi-core processors).
Clock frequency (measured in GHz).
Number of floating point operations per cycle (depends on the instruction set and precision).

132
Q

How does a compute node architecture affect performance?

A

A compute node consists of multiple CPU cores and shared memory. Example:

Zen compute node:
2 sockets
6 cores per socket
12 cores in total
24GB RAM (physically split across processors)
The number of cores and memory layout affect the overall performance.

133
Q

How do modern processors perform multiple floating point operations per cycle?

A

Modern CPUs support vector instructions and fused multiply-add (FMA), which allow multiple floating point operations per cycle.

Vector instructions (e.g., x86_64 AVX) operate on multiple values simultaneously.
FMA combines multiplication and addition into a single instruction.
The number of FLOPs per cycle depends on:

The instruction set (e.g., AVX, SVE).
The precision (single or double).

134
Q

What is the formula for Rpeak performance for a compute node?

A

Rpeak = Nsockets × Ncores × fclock × Noperations

Where:
Nsockets is the number of CPU sockets
Ncores is the number of cores per socket
fclock is the clock speed in GHz
Noperations is the number of floating point operations per cycle

135
Q

How is the peak performance of a cluster calculated?

A

RpeakCluster = RpeakNode x Number of compute nodes
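
A tiny worked sketch of the two formulas above, using the Zen node quoted earlier (2 sockets, 6 cores per socket); the 2.4 GHz clock, 4 FLOPs per cycle and 100-node cluster are illustrative assumptions, not values from these cards:

#include <stdio.h>

int main(void)
{
    int    n_sockets = 2;      /* Zen node: 2 sockets */
    int    n_cores   = 6;      /* 6 cores per socket */
    double f_clock   = 2.4e9;  /* clock speed in Hz (illustrative) */
    int    n_ops     = 4;      /* double precision FLOPs per cycle (illustrative) */
    int    n_nodes   = 100;    /* cluster size (illustrative) */

    double rpeak_node = n_sockets * n_cores * f_clock * n_ops;

    printf("Rpeak (node)    = %.1f GFLOP/s\n", rpeak_node / 1e9);
    printf("Rpeak (cluster) = %.1f GFLOP/s\n", n_nodes * rpeak_node / 1e9);
    return 0;
}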

136
Q

What is the difference between peak performance for single vs. double precision?

A

Single precision (32-bit) may achieve a higher Rpeak than double precision (64-bit) because some processors execute more single-precision FLOPs per cycle.
Scientific computing typically uses double precision for higher accuracy.

137
Q

Why is peak performance only a theoretical value?

A

Actual performance is lower due to memory bottlenecks, instruction scheduling, and other inefficiencies.

138
Q

Week 6

139
Q

Why is memory access a significant factor in computing performance?

A

Memory access is significantly slower than CPU speed. It takes approximately 100 clock cycles to access main memory, creating a performance bottleneck. Optimizing memory access through techniques like caching and memory hierarchy can improve overall efficiency.

140
Q

What is the memory hierarchy in modern computers, and why is it important?

A

The memory hierarchy consists of different levels of storage that balance capacity, cost, and access time. It includes registers, cache (L1, L2, L3), RAM, and secondary storage. This structure helps manage data efficiently and optimizes performance by reducing the need for frequent access to slower memory types.

141
Q

What is cache memory, and how does it help improve performance?

A

Cache memory is a small, fast memory that stores frequently accessed data. It reduces latency by keeping data close to the CPU, thereby decreasing the need to fetch data from slower main memory. Efficient use of cache memory significantly enhances processing speed.

142
Q

How is cache memory structured in modern CPUs?

A

Modern CPUs have multiple levels of cache:

L1 Cache: Small (32kB per core), very fast (~4 cycle latency).

L2 Cache: Larger (256kB per core), moderate speed (~11 cycle latency).

L3 Cache: Shared among cores (e.g., 20MB for 16-core processors), slower (~34 cycle latency).
Each level provides a trade-off between speed and storage capacity.

143
Q

What are temporal and spatial locality, and why are they important for caching?

A

Temporal Locality: Data recently accessed is likely to be accessed again soon.

Spatial Locality: Data near recently accessed memory locations is likely to be used soon.
Efficient programs leverage these principles to maximize cache hits and reduce memory latency.

144
Q

What is a cache line, and why is it important for performance?

A

A cache line is a fixed-size block of memory transferred between main memory and cache. On Intel x86_64 processors, cache lines are 64 bytes. Programs that access data sequentially (e.g., arrays with stride-1 access) optimize cache usage, reducing cache misses and improving speed.

145
Q

What is the impact of loop ordering on memory access performance?

A

Loop ordering affects memory access patterns. In C, iterating over arrays with the inner loop using the right-most index (e.g., T[i][j] with j as the inner loop) ensures sequential memory access, improving spatial locality and cache efficiency. Incorrect loop ordering can lead to poor cache performance and slower execution.
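
A sketch contrasting the two orderings for a row-major C array (the array size is illustrative):

#include <stdio.h>

#define NX 1024
#define NY 1024

static double T[NX][NY];

int main(void)
{
    /* Good ordering: j (the right-most index) varies fastest, so memory is
       accessed sequentially (stride-1) and each cache line is fully used */
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            T[i][j] = 1.0;

    /* Poor ordering: i varies fastest, so consecutive accesses are NY
       elements apart and most of each cache line is wasted */
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)
            T[i][j] *= 2.0;

    printf("T[0][0] = %f\n", T[0][0]);
    return 0;
}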

146
Q

How does cache blocking improve performance?

A

Cache blocking divides data into blocks that fit into cache, ensuring that all required data is accessed from fast memory rather than slower main memory. This technique enhances temporal locality by reusing data within cache before moving to the next block.

147
Q

What is the significance of cache blocking in numerical computations?

A

Numerical computations, such as finite difference methods, can suffer from poor cache performance if data access patterns are inefficient. Cache blocking ensures that computations reuse data stored in cache before fetching new data from main memory, improving execution speed.

148
Q

How do test case sizes impact performance testing?

A

Small test cases may not reflect real-world performance due to cache effects. Performance tests should use problem sizes that reflect realistic memory footprints while keeping execution time manageable.

149
Q

What role do compilers play in optimizing memory access?

A

Compilers can apply optimizations such as loop interchange and cache blocking to improve memory access patterns. Compiler flags like -O3 in GCC enable aggressive optimizations to enhance execution speed.

150
Q

What is algorithmic intensity, and why is it important?

A

Algorithmic intensity (also called arithmetic intensity or operational intensity) is the ratio of floating-point operations to memory accesses, measured in FLOPs per byte. It helps determine whether an algorithm is limited by computation or memory bandwidth.

151
Q

How does arithmetic intensity vary among different algorithms?

A

Different algorithms have different arithmetic intensities. High arithmetic intensity means an algorithm is compute-bound, making better use of floating-point units, while low arithmetic intensity indicates a memory-bound algorithm limited by bandwidth.

152
Q

What is the Roofline model, and how does it help in performance analysis?

A

The Roofline model is a visual representation of floating-point performance based on peak performance, memory bandwidth, and arithmetic intensity. It helps determine whether an application is compute-bound or memory-bound and guides optimization efforts.

153
Q

What does the Roofline model tell us about memory bandwidth and caching?

A

The Roofline model shows that algorithms with higher arithmetic intensity can achieve higher performance. Data stored in cache has significantly higher bandwidth than DRAM, highlighting the importance of optimizing memory access patterns.

154
Q

How does problem size affect arithmetic intensity?

A

Arithmetic intensity can vary with problem size. Some algorithms exhibit higher intensity at larger problem sizes, making them more compute-efficient, while others remain memory-bound regardless of scale.

155
Q

What is NUMA (Non-Uniform Memory Access)?

A

NUMA is a memory architecture where memory access times vary depending on the memory location relative to the processor. Some memory regions are faster to access than others, affecting performance.

156
Q

Why does NUMA exist in modern CPU architectures?

A

Modern CPUs integrate memory controllers and use multi-socket configurations. Accessing memory controlled by another socket is slower than accessing local memory, creating non-uniform memory access times.

157
Q

How does the first-touch memory allocation policy work?

A

Memory pages are allocated on the first memory controller that accesses them. If an application initializes an array sequentially, all memory may be allocated to a single socket, creating performance imbalances in multi-socket systems.

158
Q

Why is NUMA relevant for OpenMP programs?

A

In OpenMP, memory is shared between threads. If a single thread initializes an array, all memory may be assigned to one socket. Other threads from a different socket experience higher access latency, impacting performance.

159
Q

How can OpenMP performance issues caused by NUMA be mitigated?

A

Initialise shared arrays in parallel, so that under the first-touch policy each thread's pages are allocated on the memory controller local to the thread that will use them. Binding threads to cores/sockets (thread affinity) then keeps each thread close to the data it initialised.

160
Q

What is hybrid parallelism, and how does it help in NUMA systems?

A

Hybrid parallelism combines MPI and OpenMP. Each MPI process runs within a single NUMA node, launching OpenMP threads that operate locally. This prevents NUMA-related slowdowns in OpenMP programs.

162
Q

Week 7

163
Q

What factors contribute to performance degradation in parallel computing?

A

Several factors can reduce the idealized peak performance in parallel computing:

Starvation: Not enough parallel work to keep processors busy, leading to inefficiencies.

Latency: The time taken for information to travel across the system (e.g., memory access, message passing).

Overhead: Extra computational work required beyond the main computation (e.g., managing OpenMP parallel regions).

Waiting: Contention for shared resources (e.g., memory or network bandwidth), causing delays.

164
Q

What is parallel speed-up, and how is it calculated?

A

Parallel speed-up measures how much faster a parallel program runs compared to a serial version:

SN = T0/TN

where:

SN is the speed-up on N processors,

T0 is the execution time of the serial program,

TN is the execution time of the parallel program using N processors.
If the parallel execution time is half the serial time, the speed-up is 2.

165
Q

What is parallel efficiency, and how is it calculated?

A

Parallel efficiency measures how effectively computational resources are utilized: EN = SN/N

where:

EN is the efficiency,

SN is the speed-up,

N is the number of processors.
An efficiency of 1 means the computation scales perfectly with the number of processors.

166
Q

What is strong scaling in parallel computing?

A

Strong scaling refers to reducing execution time while keeping the total problem size fixed as the number of processors increases. Example:

A 1024×1024 problem using domain decomposition:

4 processors → 512×512 sub-domains

16 processors → 256×256 sub-domains

64 processors → 128×128 sub-domains
Strong scaling is described by Amdahl’s Law.

167
Q

What is Amdahl’s Law, and how does it limit parallel performance?

A

Amdahl’s Law states that the speed-up of a parallel program is limited by the fraction of the program that cannot be parallelized:

SN = 1/(s + p/N)
where:

s is the serial fraction,

p is the parallel fraction (s + p = 1),

N is the number of processors.
Even with infinitely many processors, the speed-up is limited by the serial fraction: the limit is 1/s.

168
Q

How does parallel fraction affect speed-up?

A

The effectiveness of parallelization depends on the parallel fraction p. For large N the maximum speed-up approaches 1/(1 − p):

If p = 0.5, maximum speed-up is 2.

If p = 0.9, maximum speed-up is 10.

If p = 0.99, maximum speed-up is 100.
A high parallel fraction is necessary for effective parallel computing.

169
Q

What limitation does Amdahl’s Law present for parallel processing?

A

Amdahl’s Law suggests that the serial portion of a program limits the maximum possible speed-up, making parallelization less beneficial when a significant portion of the computation cannot be parallelized.

170
Q

How did Gustafson’s Law challenge Amdahl’s Law?

A

Gustafson’s Law argues that in practical scenarios, problem sizes tend to increase with computational power. The speed-up formula is:
SN = s + pN

where:

s is the serial part,

pN is the parallelized workload.

Unlike Amdahl’s Law, Gustafson’s Law suggests that speed-up scales linearly with N if the problem size grows accordingly.

171
Q

What is weak scaling in parallel computing?

A

Weak scaling refers to maintaining a constant execution time while increasing the problem size proportionally to the number of processors. Example:

4 processors → 256×256 total size

16 processors → 512×512 total size

64 processors → 1024×1024 total size
Weak scaling is described by Gustafson’s Law.

172
Q

How do strong scaling and weak scaling differ?

A

Strong Scaling: Fixed problem size; increases processors to reduce execution time. Governed by Amdahl’s Law.

Weak Scaling: Increases problem size proportionally to processors; execution time remains constant. Governed by Gustafson’s Law.

173
Q

What factors can contribute to performance degradation in parallel computing?

A

Performance degradation in parallel computing can be caused by:

Starvation: Insufficient parallel work or uneven distribution of work among processors.

Latency: Time taken for data to travel within the system, e.g., memory access or message passing delays.

Overhead: Additional work required apart from computation, e.g., starting/stopping OpenMP regions.

Waiting: Threads/processes competing for shared resources, such as memory or network bandwidth.

174
Q

What is a barrier in parallel computing, and why is it important?

A

A barrier is a synchronization mechanism ensuring all threads reach a specific point before proceeding. It prevents race conditions and ensures correctness by enforcing execution order in parallel programs.

175
Q

How do implicit barriers function in OpenMP?

A

OpenMP includes implicit barriers at the end of parallel regions and work-sharing constructs (e.g., #pragma omp parallel for). This ensures all threads complete their assigned work before moving to the next section of code.

176
Q

How does an explicit barrier differ from an implicit barrier in OpenMP?

A

An explicit barrier (#pragma omp barrier) is manually inserted to synchronize threads at a specific point. Implicit barriers occur automatically at the end of work-sharing constructs unless removed using nowait.

177
Q

What is the purpose of the nowait clause in OpenMP?

A

The nowait clause removes an implied barrier at the end of work-sharing constructs, allowing threads to continue execution without waiting. This can improve performance but requires caution to avoid race conditions.

178
Q

What is load balancing in parallel computing?

A

Load balancing ensures computational work is evenly distributed among processors or threads to prevent some from being idle while others are overloaded. It enhances efficiency and minimizes waiting times.

179
Q

Why is loop scheduling important for load balancing?

A

Loop scheduling distributes loop iterations among threads to prevent idle time. When iterations require different amounts of work, proper scheduling helps balance workload and maximize efficiency.

180
Q

What are the different loop scheduling options in OpenMP?

A

OpenMP provides multiple scheduling strategies:

Static, chunk: Iterations are divided into equal-sized chunks and assigned round-robin to threads.

Dynamic, chunk: Iterations are assigned in chunks; threads request more work when they finish.

Guided, chunk: Similar to dynamic but chunk sizes decrease over time, proportional to remaining work.
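
A hedged sketch of the schedule clause with dynamic scheduling (the chunk size 4 and the artificial uneven workload are illustrative):

#include <stdio.h>

/* Stand-in for work whose cost grows with i (illustrative) */
static double work(int i)
{
    double s = 0.0;
    for (int k = 0; k < i * 1000; k++)
        s += k * 1e-9;
    return s;
}

int main(void)
{
    double total = 0.0;

    /* dynamic,4: threads grab 4 iterations at a time as they finish,
       so the later, more expensive iterations don't pile up on one thread */
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < 1000; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}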

181
Q

How does dynamic scheduling improve load balancing?

A

Dynamic scheduling assigns chunks of work to threads as they become available, ensuring no thread remains idle. This is useful when workload per iteration varies.

182
Q

What is the difference between guided and dynamic scheduling?

A

Both methods assign work dynamically, but guided scheduling starts with large chunk sizes and gradually reduces them, reducing overhead while maintaining flexibility.

183
Q

How does MPI handle load balancing?

A

In MPI, load balancing involves distributing work among processes. Blocking communications synchronize processes, while techniques like the manager-worker model help dynamically allocate work.

184
Q

What are blocking communications in MPI, and how do they affect performance?

A

Blocking communications force processes to wait until a message is fully sent or received, which synchronizes processes but may introduce idle time if not managed properly.

185
Q

What is an interconnect in the context of MPI programs?

A

An interconnect is the network that links compute nodes, enabling message passing between MPI processes. The efficiency of an interconnect significantly impacts parallel performance.

186
Q

Why is minimizing communication time important in MPI programs?

A

Reducing communication time enhances parallel efficiency and scalability. Excessive time spent on message passing can bottleneck performance, particularly in large-scale computations.

187
Q

What are the common types of interconnect used in HPC clusters?

A

The two commonly used interconnects are:

Gigabit Ethernet: Affordable but relatively slow.

Infiniband: High-speed, low-latency networking used in high-performance computing (HPC) systems like Isca.

188
Q

What considerations influence the choice of interconnect for an HPC system?

A

Interconnect choice is based on cost and workload characteristics. If communication is a minor factor in performance, a cheaper option (e.g., Ethernet) may suffice. For communication-heavy workloads, high-speed interconnects like Infiniband are preferred.

189
Q

How does network topology impact latency?

A

The number of “hops” (intermediate nodes) between compute nodes affects latency. A fully connected topology minimizes hops but is impractical for large networks. More scalable designs balance connectivity and cost.

190
Q

What are common network topologies used in HPC?

A

Fully Connected: Every node is linked to every other node (ideal but costly for large networks).

Bus/Ring: Simple but lacks sufficient connectivity for HPC.

Fat Tree: Provides greater bandwidth at higher levels to manage more traffic and scale effectively.

191
Q

What is a fat tree topology and why is it used in HPC clusters?

A

A fat tree topology has greater bandwidth at higher levels of the tree, allowing better scaling and handling of network traffic. It balances cost and performance effectively.

192
Q

How does MPI process placement affect communication overhead?

A

MPI processes on the same compute node communicate faster than those on different nodes.

Inter-node communication speed depends on network connectivity.

Optimizing process placement can reduce communication time.

193
Q

How can we model the best-case message transmission time?

A

Transmission time can be approximated as:t = L + M/B

Where:

L = Latency (fixed setup time for communication)

M = Message size

B = Bandwidth (data transfer rate)

194
Q

How does latency affect small messages in MPI communication?

A

For small messages, transmission time is primarily determined by latency. The setup time (L) dominates because the message size (M) is small.

195
Q

How does bandwidth affect large messages in MPI communication?

A

For large messages, bandwidth is the dominant factor. The time taken to transmit data depends on the rate at which data can be transferred.

196
Q

What is the estimated transmission time for a 1 kB message with 1μs latency and 100 Gbit/s bandwidth?

A

Using t = L + M / B:

L = 1 × 10⁻⁶ s
M = 1000 bytes
B = 12.5 × 10⁹ bytes/s

t = 1 × 10⁻⁶ + 1000 / (12.5 × 10⁹) = 1.08 × 10⁻⁶ s
Latency dominates in this case.

197
Q

What is the estimated transmission time for a 1 MB message with 1μs latency and 100 Gbit/s bandwidth?

A

Using t = L + M / B:

L = 1 × 10⁻⁶ s

M = 1 × 10⁶ bytes

B = 12.5 × 10⁹ bytes/s

t = 1 × 10⁻⁶ + (1 × 10⁶) / (12.5 × 10⁹) = 8.1 × 10⁻⁵ s
Bandwidth dominates in this case.

198
Q

When is low latency more important than high bandwidth in MPI communication?

A

Low latency is crucial for workloads with frequent small message exchanges (e.g., point-to-point communication). High bandwidth is more important when transmitting large messages (e.g., bulk data transfers).