Chapter 5 - OpenMP Flashcards
What is OpenMP?
API for shared memory MIMD programming
Open Multi-Processing
A system using OpenMP is viewed as a collection of autonomous cores, all having access to the same shared memory.
What are some differences between omp and pthreads?
Both are standard APIs for shared-memory programming.
Pthreads requires the programmer to explicitly define the behaviour of each thread. In omp the programmer can simply mark a specific block of code as one to be executed by multiple threads; the compiler and run-time system work out the specifics of the thread use.
Pthreads is just a library, but OMP needs compiler support in addition to its run-time library.
Pthread pros: lower level, gives opportunity to program virtually any possible thread behaviour
Pthread Cons: Need to specify every detail of thread behaviour - more difficult to implement
Pros OMP: Simpler to implement - runtime and compiler takes care of the details.
Cons OMP: Some lower level thread behaviour may be more difficult to implement
What is a directives-based shared-memory API?
There are special preprocessor instructions known as pragmas.
What are pragmas?
Added to a system to allow behaviour that isn't part of the basic C specification.
Compilers that don't support a given pragma simply ignore it.
The compiler discovers pragmas during its initial scan. If it recognizes the text, the functionality is implemented; if not, the pragma is ignored.
How are omp pragmas defined?
#pragma omp
What is the header file of omp?
#include <omp.h>
List omp directives and what they do
#pragma omp parallel
Specifies that the structured block of code that follows should be executed by multiple threads. The number of threads started is determined by the run-time system (typically one thread per core, but the algorithm for deciding this is quite complicated).
#pragma omp parallel num_threads(n)
The parallel directive plus a num_threads clause to specify the number of threads. The system can't guarantee that n threads will be started, because of system limits, but most of the time it will.
#pragma omp critical
Tells the compiler that the following block of code must be executed mutually exclusively by the threads - only one thread at a time.
#pragma omp parallel for
Forks a team of threads to execute the following structured block, which must be a for-loop.
#pragma omp master
Only the master thread executes the following block.
What is a clause in omp?
Just some text that modifies a directive
What is a team in omp?
The collection of threads executing the directive(parallel) block
original thread + (n-1) new threads
What happens when #pragma omp parallel is used?
From the start, the program is running a single thread.
When the directive is reached, n-1 additional threads are started, and the original thread continues as a member of the new team.
Each thread executes the following block of code in parallel.
When this block is completed, there is an implicit barrier. A thread that has completed the block will wait for all other threads in the team to complete. The children will then terminate and the parent continues executing the following code.
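A minimal fork-join sketch (assuming compilation with an OpenMP-enabled compiler, e.g. gcc -fopenmp; the thread count is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("before: one thread\n");          /* serial part */

    #pragma omp parallel num_threads(4)      /* fork: original thread + 3 new threads */
    {
        printf("hello from the parallel block\n");
    }                                        /* implicit barrier; children terminate */

    printf("after: one thread again\n");     /* only the parent continues */
    return 0;
}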
Define these terms in omp:
master, parent, child
master: first thread of execution, thread 0
parent: thread that encountered parallel directive and started a team of threads. This is often the master thread.
child: Each thread started by parent
What data does a child thread have access to?
ID: rank, omp_get_thread_num();
number of threads in team: omp_get_num_threads();
Its own stack, and therefore its own local variables
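A small sketch of a thread using its rank and the team size to pick its share of the work (the block-partitioning arithmetic is just an illustration):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 100;                               /* illustrative problem size */
    #pragma omp parallel
    {
        int rank     = omp_get_thread_num();   /* this thread's ID in the team */
        int nthreads = omp_get_num_threads();  /* size of the team */
        int first = rank * n / nthreads;       /* block partitioning by rank */
        int last  = (rank + 1) * n / nthreads;
        printf("thread %d of %d handles [%d, %d)\n", rank, nthreads, first, last);
    }
    return 0;
}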
How are critical sections handled in omp to avoid condition variables?
#pragma omp critical
The following code block is executed by only one thread at a time.
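For example, a shared counter updated inside a critical section (a minimal sketch; variable names are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int global_count = 0;
    #pragma omp parallel num_threads(4)
    {
        int my_count = omp_get_thread_num() + 1;   /* some per-thread result */
        #pragma omp critical
        global_count += my_count;                  /* only one thread at a time updates it */
    }
    printf("global_count = %d\n", global_count);
    return 0;
}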
What is variable scope in omp?
Scope of a variable refers to the set of threads that can access the variable in a parallel block.
Shared scope: Accessible by all threads
Private scope: Accessible by a single thread
What is the default scope of variables declared outside a parallel block, and within?
Outside: Shared
Within: Private
What is a reduction variable in openmp?
A reduction operator is a binary operation (e.g. add, mul)
A reduction is a computation that repeatedly applies the same reduction operator to a sequence of data to get a single result.
A reduction variable is where all the intermediate results of the operation are stored.
How can reduction be used in omp?
#pragma omp parallel reduction(+: global_result)
Add reduction clause to parallel directive
This specifies that the global_result is the reduction variable.
OpenMP creates a private variable for each thread, and the run-time stores each thread's result in that private variable. OpenMP then uses something like a critical section to add all of the private variables into the shared reduction variable.
The private values are initialized to the identity value of the operator:
+ : 0
- : 0
* : 1
&& : 1
…and so on
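A sketch of a sum computed with a reduction clause (the cyclic partitioning by rank is just for illustration):

#include <stdio.h>
#include <omp.h>

int main(void) {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double global_result = 0.0;                    /* the reduction variable */

    #pragma omp parallel reduction(+: global_result)
    {
        int rank     = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        for (int i = rank; i < 8; i += nthreads)   /* each thread takes some elements */
            global_result += a[i];                 /* accumulates into its private copy (initialized to 0) */
    }                                              /* private copies are combined into global_result here */

    printf("sum = %f\n", global_result);
    return 0;
}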
When is it beneficial to use reduction?
If each thread's contribution is computed by a function call inside a critical section, those calls are serialized. With a reduction clause, each thread accumulates into its own private copy, so the calls can run in parallel and only the final combination is serialized.
How are for-loops parallelized using the parallel for directive?
The iterations are divided between the threads. The default partitioning of the iterations is done by the system, whereas in a normal parallel directive the work is partitioned by the threads themselves.
For a loop of m iterations, a typical partitioning gives the first m/n_threads iterations to thread 0, the next m/n_threads to thread 1, and so on.
The compiler does not check for dependences between the iterations. These can cause errors during execution; the programmer needs to take care of this.
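For example, the same kind of sum written with a parallel for directive, where the run-time divides the iterations (a sketch):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 1000;
    double sum = 0.0;

    /* iterations are divided among the threads; sum is a reduction variable */
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1);

    printf("sum = %f\n", sum);
    return 0;
}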
What is the default scope of loop variables in a parallel for directive?
private
What types of for loops can be parallelized?
Only loops with canonical form
Not while, do-while
Only for-loops where the number of iterations can be determined from the loop statement itself (e.g. for (i = 0; i < n; i++)) and prior to execution of the loop.
Loops that are infinite, or that have conditional breaks in them, cannot be parallelized.
What is a canonical form of a loop?
Loops where number of iterations can be determined prior to the execution of the loop.
What is a loop-carried dependence?
A dependence between loop iterations where a value is calculated in one iteration, and the result is used in a subsequent iteration.
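For example, a Fibonacci loop has a loop-carried dependence, so adding a parallel for directive to it would give wrong results (a sketch):

#include <stdio.h>

int main(void) {
    int n = 10;
    long fib[10];
    fib[0] = fib[1] = 1;
    /* loop-carried dependence: iteration i reads the results of iterations i-1 and i-2,
       so adding "#pragma omp parallel for" here would be incorrect */
    for (int i = 2; i < n; i++)
        fib[i] = fib[i - 1] + fib[i - 2];
    printf("fib[%d] = %ld\n", n - 1, fib[n - 1]);
    return 0;
}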
What is the private clause?
#pragma omp parallel for private(var_name)
Creates a copy of the outer variable var_name for each thread and gives it private scope
What is the default clause?
#pragma omp parallel for default(none) reduction(+: global_sum) private(x, factor) shared(n)
default(none) requires us to explicitly define the scope of every variable used in the block that was defined outside it.
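A sketch of default(none) in use (the alternating-series loop is just an illustrative workload):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 100;
    double factor, x, global_sum = 0.0;

    /* every variable declared outside the loop must get an explicit scope */
    #pragma omp parallel for default(none) \
            reduction(+: global_sum) private(x, factor) shared(n)
    for (int i = 0; i < n; i++) {
        factor = (i % 2 == 0) ? 1.0 : -1.0;   /* each thread has its own factor */
        x = factor / (2 * i + 1);             /* ...and its own x */
        global_sum += x;
    }

    printf("pi is approximately %f\n", 4.0 * global_sum);
    return 0;
}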
What does the for directive do, and why is it useful?
#pragma omp for
Does not fork any new threads, but uses already-forked threads in the coming for-loop block.
If threads are created earlier in the program, and then used in the for loop, it saves overhead by not creating new threads for the for loop.
#pragma omp parallel
{
… code executed by the team …
#pragma omp for
for (…) { … }
}
What does the schedule clause do?
#pragma omp parallel for schedule(<type> [, <chunksize>])
Clause specifies how iterations are to be assigned to threads in a for or parallel for directive.
chunksize: A number of iterations executed consecutively in a serial loop. Only used in static, dynamic and guided
Types:
static: Iterations assigned before loop is executed
dynamic/guided: Iterations assigned while loop is running. After a thread completes, it can request more iterations
auto: compiler/run-time decides schedule, good guess when elements have more equal workload
runtime: Schedule determined at run-time by environment variable
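A sketch of the schedule clause (the chunksize of 2 is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 12;
    /* hand out chunks of 2 iterations on demand; useful when iteration cost varies */
    #pragma omp parallel for schedule(dynamic, 2)
    for (int i = 0; i < n; i++)
        printf("iteration %d done by thread %d\n", i, omp_get_thread_num());
    return 0;
}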
What is the static schedule type?
Chunks of chunksize iterations to each thread in a round-robin fashion
Example:
12 iterations, 3 threads
(static, 2):
thread 0: [0, 1], [6, 7]
thread 1: [2, 3], [8, 9]
thread 2: [4, 5], [10, 11]
What is the default schedule type often?
schedule(static, iterations/num_threads)
When should you use the different scheduling types?
Static: when each iteration takes roughly the same time to execute
Dynamic: when iterations take different amounts of time to compute
Guided: improves load balance when later iterations are more compute-heavy
What is the dynamic schedule?
Iterations broken into chunksized chunks.
Each thread executes a chunk, and then requests a new one.
First-come, first-served assignment of chunks
A bit more overhead from dynamically assigning chunks. A larger chunk size helps with this.
What is the guided schedule?
Similar to dynamic.
But as chunks are completed, the chunk size decreases.
If no chunksize is specified, the chunk size eventually decreases down to 1. If it is specified, it decreases down to chunksize.
The last chunk may be smaller than chunksize.
What is the runtime schedule?
Uses environment variable OMP_SCHEDULE.
This variable can take any of the static, dynamic, or guided values.
You can modify this variable to try out the performance of different scheduling types.
How does omp use barriers?
#pragma omp barrier
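For example, to make all threads finish phase 1 before any of them starts phase 2 (a sketch):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        printf("thread %d: phase 1\n", omp_get_thread_num());
        #pragma omp barrier    /* no thread continues until all threads have reached this point */
        printf("thread %d: phase 2\n", omp_get_thread_num());
    }
    return 0;
}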
What is the atomic directive?
#pragma omp atomic
OpenMP assumes the computer's architecture has some range of atomic operations.
It can only be used for critical sections consisting of a single C assignment statement of one of the following forms:
var <op>= <expression>;
var++;
++var;
var--;
--var;
<expression> must not reference var.
A thread will complete the expression before any other threads starts executing it.
A critical section only performing a load-modify-store can benefit from this, as a lot of hardware is optimized for atomic load-stores.
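A sketch of an atomic update of a shared counter:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int hits = 0;
    #pragma omp parallel num_threads(8)
    {
        #pragma omp atomic    /* protects just this single load-modify-store */
        hits++;
    }
    printf("hits = %d\n", hits);
    return 0;
}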
What does the critical(name) directive do?
Two blocks protected by critical directives with different names can execute in parallel.
How are locks used in omp?
Pseudo code:
initialize lock (one thread)
// Multiple threads
attempt to lock - block until ready
critical section
unlock
Destroy lock (one thread)
2 types: simple- and nested locks
What are simple locks?
Can only be set once before it is unset
What are nested locks?
Can be set multiple times by same thread before unset
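A sketch of the simple-lock API (omp_init_lock, omp_set_lock, omp_unset_lock, omp_destroy_lock) following the pseudo code above:

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_lock_t lock;
    int shared_total = 0;

    omp_init_lock(&lock);                        /* initialize once */

    #pragma omp parallel num_threads(4)
    {
        omp_set_lock(&lock);                     /* block until the lock is acquired */
        shared_total += omp_get_thread_num();    /* critical section */
        omp_unset_lock(&lock);                   /* release the lock */
    }

    omp_destroy_lock(&lock);                     /* destroy once */
    printf("shared_total = %d\n", shared_total);
    return 0;
}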
What is the omp task directive?
#pragma omp task
The default scope for variables in a task is firstprivate (a private copy initialized from the enclosing context), unless the variable is shared in the enclosing context.
Specifies units of computations.
When the program reaches the directive, the OpenMP run-time generates a new task, which will be scheduled for execution at some point, not necessarily immediately.
Tasks must be launched within a parallel block, but this is generally done by only one thread in the team.
What is a common structure of task-programs?
#pragma omp parallel    // creates a team of threads
#pragma omp single      // only one thread executes the following block (and generates the tasks)
{
    #pragma omp task
    …
}
If single were not used, every thread in the team would execute the task-generating code, so each task would be created multiple times.
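A runnable sketch of this structure, with one thread generating tasks that the whole team executes (the loop of eight tasks is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel            /* create the team of threads */
    {
        #pragma omp single          /* only one thread generates the tasks */
        {
            for (int i = 0; i < 8; i++) {
                #pragma omp task firstprivate(i)
                printf("task %d run by thread %d\n", i, omp_get_thread_num());
            }
        }                           /* the whole team helps execute the queued tasks */
    }
    return 0;
}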
What is the taskwait directive?
#pragma omp taskwait
Operates as a barrier for tasks: makes a task wait for all its sub-tasks (child tasks) to complete.
Example:
#pragma omp task shared(j)
j = 1 + 2;
#pragma omp taskwait
result = j + 1;   // safe: the task that computes j has finished
How can we create conditional tasks?
#pragma omp task shared(i) if (n > 20)
i = func(n);
If the condition is false, the statement is executed immediately by the encountering thread instead of being deferred as a task. If i weren't declared shared, it would have private scope.
Why does omp not use signals (condition variables)?
OpenMP threads aren't supposed to be sleeping, and critical sections should be kept short because waiting threads busy-wait.
What are worksharing directives?
Directives that can split a given workload between threads for you.
Have an implicit barrier at the end
What is functional decomposition?
When work is split by the function of its sub-tasks
e.g. pipelining
What is data decomposition?
Split work by input/output of its sub-tasks
Every thread does the same thing
What are sections?
#pragma omp sections, containing one or more #pragma omp section blocks
Each section is run by exactly one thread
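A sketch of functional decomposition with sections:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("part A done by thread %d\n", omp_get_thread_num());  /* exactly one thread runs this */

        #pragma omp section
        printf("part B done by thread %d\n", omp_get_thread_num());  /* possibly a different thread */
    }
    return 0;
}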
What does the clause nowait do?
Skip the barrier at the end of worksharing directives
What does the clause firstprivate(var) do?
If var was shared and you are privatizing it, each private copy is initialized with the shared variable's value
What does lastprivate(var) do?
When the threaded region finishes, the private copy from the sequentially last iteration (or last section) is stored back into the shared variable
What are the trade offs between small and big units of work when scheduling iterations in omp for?
Big:
- less scheduling to do
- limit on how big blocks can be
- if the entire iteration space is one block, there is no parallelisation
Small:
- more disruptions in memory access pattern, more units to distribute
- greater flexibility to assign work to idle threads
What is the chunksize argument in schedule()?
The minimum number of iterations handed to a thread at a time
What is the master/worker pattern?
Way to implement worksharing in queued systems
Keep active thread pool of available worker threads
Keep a queue of finite work packages
assign the next package in the queue every time a thread becomes available
What is nested parallelism?
One work-package spawns more work-packages that should be distributed amongst the threads
What is task-based programming?
An alternative approach to parallel programming that generates dependency graphs:
take a block of work and dispatch it for background execution
record which blocks depend on which others
assign blocks to the team of threads in an order that matches their dependencies
What does the omp task directive do?
#pragma omp task
Creates arbitrary dependency graphs
Block context queued internally, to be executed at first opportunity
Wait for task:
#pragma omp taskwait
How can you task-ify functions?
void some_func(int arg1, int arg2)
{
    #pragma omp task
    {
        // function body runs as a task
    }
}
Every call to this will create background tasks
What does #pragma omp taskloop do?
Automates making a task out of every iteration in a loop
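A sketch of taskloop (assuming an OpenMP 4.5 or newer compiler; the loop body is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 16;
    #pragma omp parallel
    #pragma omp single
    {
        /* the run-time splits the iterations into tasks executed by the team */
        #pragma omp taskloop
        for (int i = 0; i < n; i++)
            printf("iteration %d run by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}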