Introduction, Motivation, Problems, Models Flashcards
Processor performance
Number of (some kind of) instructions that can be carried out per unit of time
Examples:
• FLOPS: FLoating-Point OPerations per Second (IEEE 64-bit)
• IPS: Instructions per Second
Peak performance = Number of cores × Number of instructions per clock × Clock frequency
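Worked example (hypothetical numbers): 8 cores × 4 instructions per clock × 3 GHz clock = 96 × 10^9 instructions per second of theoretical peak performance; sustained performance is usually much lower.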
FLOPS
FLOPS: FLoating-Point OPerations per Second (IEEE 64-bit)
Historically, floating-point operations were expensive, and they account for most
of the work in numerical applications.
IPS
IPS: Instructions per Second
Moore’s law
Moore’s “law” (popular version):
Sequential processor performance doubles every 18 months.
Moore’s “Law” is (only) an empirical observation/extrapolation.
ILP
Instruction-level parallelism (ILP) is a measure of how many of the instructions in a computer program can be executed simultaneously.
ILP must not be confused with concurrency: the former is about parallel execution of instructions belonging to a single, specific thread of execution
[wikipedia]
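A minimal C sketch of the idea (illustrative only; whether these statements really issue in the same cycle depends on the processor): independent statements expose ILP, a dependent chain does not.

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3;

    /* Independent statements: a superscalar core may execute these
       in the same cycle, since none depends on another's result. */
    int x = a + 1;
    int y = b + 2;
    int z = c * 3;

    /* Dependent chain: each statement needs the previous result,
       so the hardware must execute them one after another. */
    int d = a + b;
    int e = d + c;
    int f = e * x;

    printf("%d %d %d %d\n", x, y, z, f);
    return 0;
}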
SMT/Hyperthreading
Hyper-threading (officially called Hyper-Threading Technology or HT Technology and abbreviated as HTT or HT) is Intel’s proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations (doing multiple tasks at once) performed on x86 microprocessors.
[wikipedia]
FPGA, ASIC
A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing – hence the term “field-programmable”. The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC).
An application-specific integrated circuit (ASIC /ˈeɪsɪk/) is an integrated circuit (IC) chip customized for a particular use, rather than intended for general-purpose use.
[wikipedia]
Core
Core: The smallest unit capable of executing instructions according
to a program (used to be called “processor”, “processing unit”)
Processor
Processor: Collection of one or more cores on a single chip (as in
“multi-core processor”)
Multi-core vs Many-core
Multi-core: handful of (standard CPU) cores (2-32)
Many-core: a lot of cores, often in special-purpose processors (GPU)
Bridging models
Model costs translate reasonably into / predict performance on a wide range of
actual hardware
Minimum requirement for a good bridging model:
If Algorithm A performs better than Algorithm B in model, then the implementation of Algorithm A performs better than the implementation of Algorithm B on applicable hardware
A good model:
• abstracts unnecessary detail,
• accounts for the essentials,
• leads to interesting results, algorithms and lower bounds,
• and “bridging” works
MPI
Message Passing Interface (MPI)
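A minimal sketch of the message-passing model MPI provides (separate processes, each identified by a rank, communicating by explicit messages); assumes an MPI installation, compiled with mpicc and launched with mpirun:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, token = 42;
    MPI_Init(&argc, &argv);                  /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* id (rank) of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */

    if (rank == 0 && size > 1) {
        /* explicit message from rank 0 to rank 1 */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", token);
    }

    MPI_Finalize();                          /* shut down the MPI runtime */
    return 0;
}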
OpenMP
OpenMP (Open Multi-Processing)
C/C++ compiler directives like #pragma omp parallel for
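A minimal sketch using that directive (shared-memory loop parallelism; assumes an OpenMP-capable compiler, e.g. gcc -fopenmp):

#include <stdio.h>
#define N 1000000

int main(void) {
    static double a[N];   /* static: avoid a large stack allocation */
    double sum = 0.0;

    #pragma omp parallel for                    /* iterations split among threads */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    #pragma omp parallel for reduction(+:sum)   /* safe parallel summation */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}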
Cilk
Cilk, Cilk++ and Cilk Plus are general-purpose programming languages designed for multithreaded parallel computing. They are based on the C and C++ programming languages, which they extend with constructs to express parallel loops and the fork–join idiom.
Example keywords: spawn, sync, thread
[wikipedia]
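A minimal fork-join sketch in Cilk Plus / OpenCilk syntax, where the keywords above are spelled cilk_spawn and cilk_sync (requires a Cilk-enabled compiler such as OpenCilk’s clang):

#include <stdio.h>
#include <cilk/cilk.h>

long fib(long n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);  /* may run in parallel with the next call */
    long y = fib(n - 2);
    cilk_sync;                       /* wait for the spawned call to finish */
    return x + y;
}

int main(void) {
    printf("fib(30) = %ld\n", fib(30));
    return 0;
}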
MapReduce
[wikipedia]
A MapReduce framework (or system) is usually composed of three operations (or steps):
- Map: each worker node applies the map function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of the redundant input data is processed.
- Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node.
- Reduce: worker nodes now process each group of output data, per key, in parallel.
The canonical MapReduce example counts the appearance of each word in a set of documents:
function map(documentName, String document):
    for each word w in document:
        emit (w, 1)

function reduce(word, partialCounts):
    sum = 0
    for each pc in partialCounts:
        sum += pc
    emit (word, sum)
Parallel computing
Parallel computing
The discipline of efficiently utilizing dedicated parallel resources (processors, memories, …) to solve individual, given computational problems.
Typical concerns:
Solving (difficult) problems faster!
Specifically:
Parallel resources with significant inter-communication capabilities,
for problems with non-trivial communication and computational
demands
GPGPU
General-purpose computing on graphics processing units
HPC
High-Performance Computing
Distributed computing
Distributed computing
The discipline of making independent, non-dedicated resources available and
having them cooperate toward solving specified problem complexes.
Typical concerns:
Correctness, availability, progress, security, integrity, privacy, robustness, fault tolerance, …
Specifically/typically:
No centralized control
Concurrent computing
Concurrent computing:
The discipline of managing and reasoning about interacting processes that may (or may not) progress simultaneously
Typical concerns:
Correctness (often formal), e.g. deadlock-freedom, starvation-
freedom, mutual exclusion, fairness, …
IPC
Interprocess communication (IPC) refers to the mechanisms an operating system provides to allow processes to manage shared data.
[wikipedia]
Process calculi
Process calculi such as:
- Ambient calculus
- Calculus of communicating systems (CCS)
- Communicating sequential processes (CSP)
- Join-calculus
- π-calculus
Parallel vs. Concurrent computing
Concurrent computing: Focus on coordination of access to/usage of shared resources (to solve given, computational problem)
Parallel computing: Focus on dividing given problem (specification,
algorithm, data) into subproblems that can be solved by dedicated
processors (in coordination)
Aspects of parallelization
•Algorithmic: Dividing computation into independent parts that can be executed in parallel. What kinds of shared resources are necessary? Which kinds of coordination? How can overheads be minimized (redundancy, coordination, synchronization)?
•Scheduling/Mapping: Assigning parts of computation to
processors in good order
- Load balancing: (Re)assigning independent parts of computation to processors such that all resources are utilized to the same extent (evenly and efficiently)
- Communication: When must processors communicate? How?
- Synchronization: When must processors agree/wait?
•Linguistic: How are the algorithmic aspects expressed? Concepts (programming model) and concrete expression (programming language, interface, library)
•Pragmatic/practical: How does the actual, parallel machine look?
What is a reasonable, abstract (“bridging”) model?
•Architectural: Which kinds of parallel machines can be realized?
How do they look?
Levels of parallelism
Implicit (processor) parallelism [Hardware & Compiler, “Free lunch”]
- Gates and functional units
- Instruction Level Parallelism (ILP)
- SIMD/Vector parallelism
- Cores/threads/processes
- Communication
- Synchronization
- Multiple applications
- Coupled applications
RAM
Standard sequential model: The RAM (Random-Access Machine)
Processor (ALU, PC, registers) capable of
executing instructions stored in memory on
data in memory
Execution cost model, first assumption: All operations (ALU instructions, memory access) take same (unit) time.
Useful indeed, basis of much of sequential algorithmics
Not realistic: Memory access time is (much) slower than instruction execution
PRAM
Parallel RAM (PRAM)
Processors work in lock-step (all same program, or individual programs), all perform an instruction in each step (clock tick)
Unit-time instruction and memory access (uniform)
Assume as many processors per step as convenient
PRAM read write conflict resolution
PRAM conflict resolution: What happens if several processors in the same time step access the same memory location?
- EREW PRAM: Not allowed, neither read nor write
- CREW PRAM: Concurrent reads allowed, concurrent writes not
- CRCW PRAM: Both concurrent read and write
E… Exclusive
C… Concurrent
Concurrent reads cause no discrepancies, while concurrent writes are further specified as:
- COMMON: Conflicting processors must write same value
- ARBITRARY: One write succeeds
- PRIORITY: A priority scheme determines which write succeeds (e.g., the processor with the smallest index wins)
UMA/NUMA
UMA (Uniform Memory Access):
Access time to memory location independent of location and accessing processor, e.g., O(1), O(log M)
Examples: RAM, PRAM
NUMA (Non-Uniform Memory Access):
Access time depends on processor and location.
Examples: real processors
Cellular automaton
Yet another parallel architecture model (totally non-RAM):
Cellular automaton, systolic array, … : Simple processors without memory (finite state automata, FSA), operate in lock step on (unbounded) grid, local communication only
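A sketch in plain C (the lock-step parallel update of the model is simulated here by an ordinary loop): a one-dimensional cellular automaton in which each cell is updated from its two neighbours only, using rule 90 (XOR of the neighbours) as an example rule.

#include <stdio.h>
#define N 16

int main(void) {
    int cell[N] = {0}, next[N];
    cell[N / 2] = 1;                       /* a single live cell in the middle */

    for (int step = 0; step < 8; step++) {
        for (int i = 0; i < N; i++) {      /* in the model: all cells update simultaneously */
            int left  = cell[(i - 1 + N) % N];
            int right = cell[(i + 1) % N];
            next[i] = left ^ right;        /* rule 90: XOR of the two neighbours */
        }
        for (int i = 0; i < N; i++) {
            cell[i] = next[i];
            printf("%c", cell[i] ? '#' : '.');
        }
        printf("\n");
    }
    return 0;
}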
Flynn’s taxonomy
•SISD: Single processor, single stream of instructions, operates on
single stream of data. Example: Sequential architecture (e.g.
RAM)
•SIMD: Single processor, single stream of operations, operates on
multiple data per instruction. Example: traditional vector
processor, SIMD-extensions, GPU(?) (PRAM, some variants)
•MISD: Multiple instructions operate on single data stream.
Example: Pipelined architectures, streaming architectures(?),
systolic arrays (1970s architectural idea)
Some say: Empty
•MIMD: Multiple instruction streams, multiple data streams
(PRAM, distributed memory architecture)
SPMD
Same Program, Multiple Data
branch/thread divergence
if statements:
Processors do something different depending on the condition: MIMD, but under control of the same program
SIMD restriction: Both branches are executed, for some processors as no-ops, depending on the condition. On GPUs this is called branch/thread divergence, and can cause severe performance loss
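A sketch in plain C (not real GPU code) of what the SIMD restriction means: both branches are evaluated for every lane, and a per-lane mask selects which result is kept, so a divergent if costs roughly the time of both paths.

#include <stdio.h>
#define LANES 8

int main(void) {
    int x[LANES] = {3, -1, 4, -1, 5, -9, 2, -6};
    int y[LANES];

    for (int lane = 0; lane < LANES; lane++) {       /* in hardware: all lanes at once */
        int mask = (x[lane] >= 0);                   /* per-lane condition */
        int then_result = 2 * x[lane];               /* "then" branch, computed for every lane */
        int else_result = -x[lane];                  /* "else" branch, also computed for every lane */
        y[lane] = mask ? then_result : else_result;  /* mask selects; the other result is discarded */
    }

    for (int lane = 0; lane < LANES; lane++)
        printf("%d ", y[lane]);
    printf("\n");
    return 0;
}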