Introduction, Motivation, Problems, Models Flashcards
Processor performance
Number of (some kind of) instructions that can be carried out per unit of time
Examples:
• FLOPS: FLoating-Point OPerations per Second (IEEE 64-bit)
• IPS: Instructions per Second
Peak performance = Number of cores × Number of instructions per clock × Clock frequency
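Worked example (hypothetical numbers): 8 cores × 4 instructions per clock × 3 GHz clock = 96 × 10^9 instructions per second of theoretical peak performance; sustained performance is usually much lower.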
FLOPS
FLOPS: FLoating-Point OPerations per Second (IEEE 64-bit)
Historically, floating-point operations were expensive, and they account for most
of the work in numerical applications.
IPS
IPS: Instructions per Second
Moore’s law
Moore’s “law” (popular version):
Sequential processor performance doubles every 18 months.
Moore’s “Law” is (only) an empirical observation/extrapolation.
ILP
Instruction-level parallelism (ILP) is a measure of how many of the instructions in a computer program can be executed simultaneously.
ILP must not be confused with concurrency: the former is about parallel execution of instructions belonging to a single, specific thread of execution
[wikipedia]
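A minimal C sketch of the idea (illustrative only; whether these statements really issue in the same cycle depends on the processor): independent statements expose ILP, a dependent chain does not.

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3;

    /* Independent statements: a superscalar core may execute these
       in the same cycle, since none depends on another's result. */
    int x = a + 1;
    int y = b + 2;
    int z = c * 3;

    /* Dependent chain: each statement needs the previous result,
       so the hardware must execute them one after another. */
    int d = a + b;
    int e = d + c;
    int f = e * x;

    printf("%d %d %d %d\n", x, y, z, f);
    return 0;
}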
SMT/Hyperthreading
Hyper-threading (officially called Hyper-Threading Technology or HT Technology and abbreviated as HTT or HT) is Intel’s proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations (doing multiple tasks at once) performed on x86 microprocessors.
[wikipedia]
FPGA, ASIC
A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing – hence the term “field-programmable”. The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC).
An application-specific integrated circuit (ASIC /ˈeɪsɪk/) is an integrated circuit (IC) chip customized for a particular use, rather than intended for general-purpose use.
[wikipedia]
Core
Core: The smallest unit capable of executing instructions according
to a program (used to be called “processor”, “processing unit”)
Processor
Processor: Collection of one or more cores on a single chip (as in
“multi-core processor”)
Multi-core vs Many-core
Multi-core: handful of (standard CPU) cores (2-32)
Many-core: a lot of cores, often in special-purpose processors (GPU)
Bridging models
Model costs translate reasonably into / predict performance on a wide range of
actual hardware
Minimum requirement for a good bridging model:
If Algorithm A performs better than Algorithm B in model, then the implementation of Algorithm A performs better than the implementation of Algorithm B on applicable hardware
A good model:
• abstracts unnecessary detail,
• accounts for the essentials,
• leads to interesting results, algorithms and lower bounds,
• and “bridging” works
MPI
Message Passing Interface (MPI)
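A minimal sketch of the message-passing model MPI provides (separate processes, each identified by a rank, communicating by explicit messages); assumes an MPI installation, compiled with mpicc and launched with mpirun:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, token = 42;
    MPI_Init(&argc, &argv);                  /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* id (rank) of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */

    if (rank == 0 && size > 1) {
        /* explicit message from rank 0 to rank 1 */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", token);
    }

    MPI_Finalize();                          /* shut down the MPI runtime */
    return 0;
}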
OpenMP
OpenMP (Open Multi-Processing)
C/C++ compiler directives like #pragma omp parallel for
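A minimal sketch using that directive (shared-memory loop parallelism; assumes an OpenMP-capable compiler, e.g. gcc -fopenmp):

#include <stdio.h>
#define N 1000000

int main(void) {
    static double a[N];   /* static: avoid a large stack allocation */
    double sum = 0.0;

    #pragma omp parallel for                    /* iterations split among threads */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    #pragma omp parallel for reduction(+:sum)   /* safe parallel summation */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}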
Cilk
Cilk, Cilk++ and Cilk Plus are general-purpose programming languages designed for multithreaded parallel computing. They are based on the C and C++ programming languages, which they extend with constructs to express parallel loops and the fork–join idiom.
Example keywords: spawn, sync, thread
[wikipedia]
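A minimal fork-join sketch in Cilk Plus / OpenCilk syntax, where the keywords above are spelled cilk_spawn and cilk_sync (requires a Cilk-enabled compiler such as OpenCilk’s clang):

#include <stdio.h>
#include <cilk/cilk.h>

long fib(long n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);  /* may run in parallel with the next call */
    long y = fib(n - 2);
    cilk_sync;                       /* wait for the spawned call to finish */
    return x + y;
}

int main(void) {
    printf("fib(30) = %ld\n", fib(30));
    return 0;
}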
MapReduce
[wikipedia]
A MapReduce framework (or system) is usually composed of three operations (or steps):
- Map: each worker node applies the map function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of the redundant input data is processed.
- Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node.
- Reduce: worker nodes now process each group of output data, per key, in parallel.
The canonical MapReduce example counts the appearance of each word in a set of documents:
function map(documentName, String document):
    for each word w in document:
        emit (w, 1)

function reduce(word, partialCounts):
    sum = 0
    for each pc in partialCounts:
        sum += pc
    emit (word, sum)
Parallel computing
Parallel computing
The discipline of efficiently utilizing dedicated parallel resources (processors, memories, …) to solve individual, given computational problems.
Typical concerns:
Solving (difficult) problems faster!
Specifically:
Parallel resources with significant inter-communication capabilities,
for problems with non-trivial communication and computational
demands
GPGPU
General-purpose computing on graphics processing units
HPC
High-Performance Computing
Distributed computing
Distributed computing
The discipline of making independent, non-dedicated resources available and
having them cooperate toward solving specified problem complexes.
Typical concerns:
Correctness, availability, progress, security, integrity, privacy, robustness, fault tolerance, …
Specifically/typically:
No centralized control
Concurrent computing
Concurrent computing:
The discipline of managing and reasoning about interacting processes that may (or may not) progress simultaneously
Typical concerns:
Correctness (often formal), e.g. deadlock-freedom, starvation-
freedom, mutual exclusion, fairness, …
IPC
Interprocess communication (IPC) refers to the mechanisms an operating system provides to allow processes to manage shared data.
[wikipedia]
Process calculi
Process calculi such as:
- Ambient calculus
- Calculus of communicating systems (CCS)
- Communicating sequential processes (CSP)
- Join-calculus
- π-calculus
Parallel vs. Concurrent computing
Concurrent computing: Focus on coordination of access to/usage of shared resources (to solve given, computational problem)
Parallel computing: Focus on dividing given problem (specification,
algorithm, data) into subproblems that can be solved by dedicated
processors (in coordination)
Aspects of parallelization
•Algorithmic: Dividing computation into independent parts that can be executed in parallel. What kinds of shared resources are necessary? Which kinds of coordination? How can overheads be minimized (redundancy, coordination, synchronization)?
•Scheduling/Mapping: Assigning parts of computation to
processors in good order
- Load balancing: (Re)assigning independent parts of computation to processors such that all resources are utilized to the same extent (evenly and efficiently)
- Communication: When must processors communicate? How?
- Synchronization: When must processors agree/wait?
•Linguistic: How are the algorithmic aspects expressed? Concepts (programming model) and concrete expression (programming language, interface, library)
•Pragmatic/practical: How does the actual, parallel machine look?
What is a reasonable, abstract (“bridging”) model?
•Architectural: Which kinds of parallel machines can be realized?
How do they look?
Levels of parallelism
Implicit (processor) parallelism [Hardware & Compiler, “Free lunch”]
- Gates and functional units
- Instruction Level Parallelism (ILP)
- SIMD/Vector parallelism
- Cores/threads/processes
- Communication
- Synchronization
- Multiple applications
- Coupled applications
RAM
Standard sequential model: The RAM (Random-Access Machine)
Processor (ALU, PC, registers) capable of
executing instructions stored in memory on
data in memory
Execution cost model, first assumption: All operations (ALU instructions, memory access) take same (unit) time.
Useful indeed, basis of much of sequential algorithmics
Not realistic: Memory access time is (much) slower than instruction execution
PRAM
Parallel RAM (PRAM)
Processors work in lock-step (all same program, or individual programs), all perform an instruction in each step (clock tick)
Unit-time instruction and memory access (uniform)
Assume as many processors per step as convenient
PRAM read write conflict resolution
PRAM conflict resolution: What happens if several processors in the same time step access the same memory location?
- EREW PRAM: Not allowed, neither read nor write
- CREW PRAM: Concurrent reads allowed, concurrent writes not
- CRCW PRAM: Both concurrent read and write
E… Exclusive
C… Concurrent
Concurrent reads cause no discrepancies, while concurrent writes are further specified as:
- COMMON: Conflicting processors must write same value
- ARBITRARY: One write succeeds
- PRIORITY: A priority scheme determines which write succeeds (e.g., the processor with the smallest index wins)
UMA/NUMA
UMA (Uniform Memory Access):
Access time to memory location independent of location and accessing processor, e.g., O(1), O(log M)
Examples: RAM, PRAM
NUMA (Non-Uniform Memory Access):
Access time depends on processor and location.
Examples: real processors
Cellular automaton
Yet another parallel architecture model (totally non-RAM):
Cellular automaton, systolic array, … : Simple processors without memory (finite state automata, FSA), operate in lock step on (unbounded) grid, local communication only
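A sketch in plain C (the lock-step parallel update of the model is simulated here by an ordinary loop): a one-dimensional cellular automaton in which each cell is updated from its two neighbours only, using rule 90 (XOR of the neighbours) as an example rule.

#include <stdio.h>
#define N 16

int main(void) {
    int cell[N] = {0}, next[N];
    cell[N / 2] = 1;                       /* a single live cell in the middle */

    for (int step = 0; step < 8; step++) {
        for (int i = 0; i < N; i++) {      /* in the model: all cells update simultaneously */
            int left  = cell[(i - 1 + N) % N];
            int right = cell[(i + 1) % N];
            next[i] = left ^ right;        /* rule 90: XOR of the two neighbours */
        }
        for (int i = 0; i < N; i++) {
            cell[i] = next[i];
            printf("%c", cell[i] ? '#' : '.');
        }
        printf("\n");
    }
    return 0;
}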
Flynn’s taxonomy
•SISD: Single processor, single stream of instructions, operates on
single stream of data. Example: Sequential architecture (e.g.
RAM)
•SIMD: Single processor, single stream of operations, operates on
multiple data per instruction. Example: traditional vector
processor, SIMD-extensions, GPU(?) (PRAM, some variants)
•MISD: Multiple instructions operate on single data stream.
Example: Pipelined architectures, streaming architectures(?),
systolic arrays (1970s architectural idea)
Some say: Empty
•MIMD: Multiple instruction streams, multiple data streams
(PRAM, distributed memory architecture)
SPMD
Same Program, Multiple Data
branch/thread divergence
if statements:
Processors do something different depending on the condition: MIMD, but under control of the same program
SIMD restriction: Both branches are executed, for some processors as no-ops, depending on the condition. On GPUs this is called branch/thread divergence, and can cause severe performance loss
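A sketch in plain C (not real GPU code) of what the SIMD restriction means: both branches are evaluated for every lane, and a per-lane mask selects which result is kept, so a divergent if costs roughly the time of both paths.

#include <stdio.h>
#define LANES 8

int main(void) {
    int x[LANES] = {3, -1, 4, -1, 5, -9, 2, -6};
    int y[LANES];

    for (int lane = 0; lane < LANES; lane++) {       /* in hardware: all lanes at once */
        int mask = (x[lane] >= 0);                   /* per-lane condition */
        int then_result = 2 * x[lane];               /* "then" branch, computed for every lane */
        int else_result = -x[lane];                  /* "else" branch, also computed for every lane */
        y[lane] = mask ? then_result : else_result;  /* mask selects; the other result is discarded */
    }

    for (int lane = 0; lane < LANES; lane++)
        printf("%d ", y[lane]);
    printf("\n");
    return 0;
}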