OpenMP Flashcards
three primary API components
1 compiler directives
2 runtime library routines
3 environment variables
ICV: number of threads
OMP_NUM_THREADS environment variable
num_threads(n) clause on the pragma!
sections
pragma omp sections, pragma omp section
- each section is run by 1 thread
- implicit barrier at end of sections region
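A minimal sketch of a sections region (the helper name run_sections is illustrative, not part of OpenMP):

```c
/* two independent assignments, one per section */
void run_sections(int *a, int *b) {
    #pragma omp parallel sections
    {
        #pragma omp section
        *a = 1;                /* one thread runs this section */
        #pragma omp section
        *b = 2;                /* possibly a different thread */
    }                          /* implicit barrier at the end */
}
```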
for
- no synchronization at the beginning
- implicit barrier at the end (nowait removes it)
- iterations must have no data dependencies
schedule
- used for the FOR loop
1. static: round robin, fixed chunks of n/t
2. dynamic: given one by one at runtime, chunks 1 (good for load balance)
3. guided: start with bigger chunks, then smaller exponentially, one by one at runtime
4. runtime: taken from OMP_SCHEDULE at runtime
5. auto: compiler/runtime chooses
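A small example of the schedule clause, here dynamic with chunk size 4 (function name is illustrative):

```c
/* sums i*i; dynamic scheduling hands out chunks of 4 iterations at runtime,
   which helps load balance when iteration costs vary */
long sum_squares(int n) {
    long s = 0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:s)
    for (int i = 0; i < n; i++)
        s += (long)i * i;
    return s;
}
```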
sharing attributes
- private
- shared
- default
- firstprivate
- lastprivate
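A sketch combining firstprivate and lastprivate (names are illustrative):

```c
/* factor is firstprivate: each thread's private copy starts at 3.
   last is lastprivate: the value from the sequentially final iteration
   survives the loop */
int sharing_demo(int n, int *last_out) {
    int factor = 3, last = -1, sum = 0;
    #pragma omp parallel for firstprivate(factor) lastprivate(last) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += factor * i;
        last = i;
    }
    *last_out = last;   /* n - 1 */
    return sum;
}
```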
list synchronization pragmas
barrier, masked region, single region, critical section, atomic statement, ordered construct
barrier
pragma omp barrier
synchronizes all threads
can cause load imbalance, use only when needed
masked Construct
pragma omp masked [filter(integer-expression)]
- only the primary thread executes the code, the others skip it
- no implied barrier at either end
single Construct
pragma omp single
- implicit barrier at the end!
- exactly one thread executes it
- clauses like private(list), firstprivate(list)
- typical use: initializing data structures
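A sketch of the typical single use: one thread initializes shared data, and the implicit barrier makes it visible to the whole team (names are illustrative):

```c
/* one thread fills the shared buffer inside single; the implicit barrier
   at the end of single guarantees it is ready before any thread reads it */
int shared_init_sum(void) {
    int data[4];
    int total = 0;
    #pragma omp parallel reduction(+:total)
    {
        #pragma omp single
        {
            for (int i = 0; i < 4; i++)
                data[i] = i + 1;      /* executed by exactly one thread */
        }                             /* implicit barrier */
        #pragma omp for
        for (int i = 0; i < 4; i++)
            total += data[i];
    }
    return total;                     /* 1 + 2 + 3 + 4 */
}
```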
critical construct
pragma omp critical [(name)]
restricts execution of the associated structured block to a single thread at a time
- no implicit barrier
- can cause load imbalance, only use when really needed
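A sketch of a named critical section serializing a read-modify-write (the name maxupdate and the function name are arbitrary):

```c
/* the critical section makes the compare-and-update of 'best' exclusive,
   so no update is lost to a race */
int parallel_max(const int *v, int n) {
    int best = v[0];
    #pragma omp parallel for
    for (int i = 1; i < n; i++) {
        #pragma omp critical(maxupdate)
        if (v[i] > best)
            best = v[i];
    }
    return best;
}
```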
atomic statement
pragma omp atomic
ensures that a specific memory location is updated atomically
- like critical but with less overhead; limited to a single update operation; avoids locking
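A sketch of atomic for a histogram-style update, a single memory update per iteration (names are illustrative):

```c
/* each bin increment is one atomic memory update; cheaper than wrapping
   the increment in a critical section */
void histogram(const int *v, int n, int *bins) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        bins[v[i]]++;
    }
}
```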
ordered construct
pragma omp ordered
block of code that must be executed in sequential order.
- The ordered construct sequentializes and orders the execution of ordered regions while allowing code outside the region to run in parallel.
- clauses: threads (default), simd
- only valid inside a for / parallel for that carries the ordered clause!!
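A sketch: the squaring runs in parallel, only the append is sequenced into iteration order (names are illustrative):

```c
/* ordered regions execute one at a time in sequential iteration order,
   so the shared index k and the output array stay consistent */
void squares_in_order(int n, int *out) {
    int k = 0;
    #pragma omp parallel for ordered
    for (int i = 0; i < n; i++) {
        int sq = i * i;        /* computed in parallel */
        #pragma omp ordered
        out[k++] = sq;         /* appended in loop order */
    }
}
```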
omp locks
kinda like mutex
omp_lock_t lockvar; declares a simple lock variable
omp_init_lock(&lockvar); initializes a simple lock
omp_destroy_lock(&lockvar); uninitializes a simple lock
omp_set_lock(&lockvar); waits until a simple lock is available and then sets it
omp_unset_lock(&lockvar); unsets the lock
omp_test_lock(&lockvar); tries to set the lock without blocking; returns true and sets it if available
nestable locks also possible (omp_nest_lock_t)
reduction
reduction (operator: list)
+ - * & ^ | && || max min
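A sketch combining two reduction operators (max as a reduction operator needs OpenMP 3.1+; names are illustrative):

```c
/* each thread accumulates private partial results; OpenMP combines them
   with + and max when the loop ends */
void reduce_stats(const int *v, int n, long *sum_out, int *max_out) {
    long sum = 0;
    int mx = v[0];
    #pragma omp parallel for reduction(+:sum) reduction(max:mx)
    for (int i = 0; i < n; i++) {
        sum += v[i];
        if (v[i] > mx) mx = v[i];
    }
    *sum_out = sum;
    *max_out = mx;
}
```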
correctness issues
data races: unsynchronized conflicting accesses to shared variables
loop dependencies: prevent parallelization of loops
aliasing: hidden loop dependencies, breaks compiler optimizations
types of data races
- read after write
- write after read
- write after write
types of dependencies
- true (read after write)
- anti (write after read)
- output (write after write)
types of loop dependencies
loop carried
loop independent
list loop transformations
4: interchange, distribution, fusion, alignment
loop interchange
swap inner and outer loop
not legal if it violates a loop-carried dependence
used to get better cache use (row first!)
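After interchange, the loop below scans row-first with unit stride; the original j-outer order would touch a[i][j] with stride N (names and N are illustrative):

```c
enum { N = 64 };

/* interchanged order: i outer, j inner -> contiguous accesses in C,
   since C stores matrices row by row */
long sum_matrix(long a[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```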
loop distribution
splits one loop into multiple loops to eliminate loop-carried dependences
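A sketch: fused, the loop would carry a dependence (c[i] reads a[i-1]); after distribution each loop parallelizes on its own (names are illustrative):

```c
/* loop 1 writes a, loop 2 reads only already-written a values,
   so both loops are independently parallel */
void distribute(const int *b, int *a, int *c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1;
    #pragma omp parallel for
    for (int i = 1; i < n; i++)
        c[i] = a[i - 1] * 2;
}
```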
loop fusion
merging loops may create loop-carried dependences
reduces loop overhead!!
eliminates the need for a barrier between the loops
loop alignment
shift iterations so accesses line up: peel one operation off before the loop and one after
why is work sharing bad?
- load imbalance (both static and dynamic scheduling have drawbacks)
- imbalance because of machine
- limited program flexibility
this means we need tasks!!
tasks
pragma omp task
- created inside a single region
- untied: tasks can move to different threads, may lose cache
- if: when to defer task
- sharing: default is firstprivate
- priority influences execution order
taskwait: waits for completion of immediate child tasks
taskyield: the current task can be suspended (e.g. if it takes too long)
dependencies: in, out, inout
PRO: load balance, simple
task granularity:
1. fine: more overhead, better resource use
2. coarse: less overhead, but risks scheduling fragmentation
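A classic task sketch: naive parallel Fibonacci with no cutoff (function names are illustrative). Note x and y must be declared shared, because the task default of firstprivate would copy them before the children write them:

```c
/* each call spawns two child tasks; taskwait joins the immediate
   children before summing their results */
int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}

int fib_parallel(int n) {
    int result;
    #pragma omp parallel
    #pragma omp single        /* tasks are created from inside a single region */
    result = fib(n);
    return result;
}
```

A real implementation would add a cutoff to serial recursion for small n, since fine-grained tasks cost more than they gain.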
omp runtime routines
omp_set_num_threads(N)
omp_get_max_threads()
omp_get_num_threads() size
omp_get_thread_num() rank
omp ICVs
OMP_SCHEDULE
OMP_DYNAMIC
OMP_NESTED