intro to parallel computing Flashcards

1
Q

Explain sequential computing

A

Traditionally, software has been written for serial computation:
1. A problem is broken into a discrete series of instructions.
2. Instructions are executed one after another.
3. Only one instruction may execute at any moment in time.

2
Q

Explain parallel computing

A

The simultaneous use of multiple compute resources to solve a computational problem:
1. Run using multiple CPUs
2. The problem is broken into discrete parts that can be solved concurrently
3. Each part is further broken down into a series of instructions
4. Instructions from each part execute simultaneously on different CPUs

3
Q

What are the different parallel computing memory architectures?

A
  1. Shared memory
  2. Distributed memory
  3. Hybrid
4
Q

What are the parallel programming languages?

A

* Shared memory API: OpenMP (Open Multi-Processing)
– C, C++, Fortran
* Distributed memory API: MPI (Message Passing Interface)
– C, C++, Fortran, Java, Python
* Cilk – a customized C language
* CUDA (Compute Unified Device Architecture) – for Nvidia GPUs
* Pthreads (POSIX threads)
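For concreteness, a minimal OpenMP sketch in C (a hypothetical example, assuming a compiler with OpenMP support, e.g. gcc -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* The block below is executed by every thread in the team. */
        #pragma omp parallel
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }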

5
Q

Explain SISD

A
  1. Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
  2. Single data: only one data stream is being used as input during any one clock cycle.
  3. Deterministic execution
6
Q

Explain SIMD

A
  1. A type of parallel computer.
  2. Best suited to specialized problems characterized by a high degree of regularity, such as image processing.
  3. Two varieties: Processor Arrays and Vector Pipelines

Array processor
1. A single computer with multiple parallel processors.
2. The processing units are designed to work together under the supervision of a single control unit.
3. This results in a single instruction stream and multiple data streams.
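A minimal sketch of the kind of regular loop that maps well onto SIMD hardware (a hypothetical example in C; many compilers will auto-vectorize it at -O3):

    /* The same instruction (an add) is applied to many data elements,
       so the loop can be compiled into SIMD/vector instructions. */
    void vector_add(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }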

7
Q

Explain MIMD

A
  1. Multiple Instruction: every processor may be executing a different instruction stream; Multiple Data: every processor may be working with a different data stream.
  2. The most common type of parallel computer.
  3. Execution can be synchronous or asynchronous, deterministic or non-deterministic.
  4. Representatives: most current supercomputers, networked parallel computer “grids” and multi-processor SMP computers, including some types of PCs.
8
Q

Explain MISD

A
  1. Few actual examples of this class of parallel computer have ever existed.
  2. A single data stream is fed into multiple processing units.
  3. Representatives: Systolic Arrays
9
Q

What are the different parallel computing models?

A
  • Shared Memory (without threads)
  • Threads Model
  • Distributed Memory / Message Passing Model
  • Data Parallel Model
  • Hybrid Model
  • Single Program Multiple Data (SPMD)
  • Multiple Program Multiple Data (MPMD)
10
Q

Explain the distributed memory / message passing model

A
  • Distributed Memory / Message Passing Model
    1. A set of tasks that use their own local memory during computation.
    2. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.
    3. Tasks exchange data through communications, by sending and receiving messages.
    4. Data transfer usually requires cooperative operations to be performed by each process (see the sketch below).
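A minimal message passing sketch in C with MPI (a hypothetical example, assuming an MPI implementation such as Open MPI; compile with mpicc and run with mpirun -np 2):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* Cooperative transfer: the send on task 0 ... */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* ... must be matched by a receive on task 1. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("task 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }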
11
Q

Explain Data Parallel Model

A
  • Data Parallel Model
    1. Also known as the Partitioned Global Address Space (PGAS) model.
    2. The address space is treated globally.
    3. Most of the parallel work focuses on performing operations on a data set, typically organized into a common structure such as an array or cube.
    4. A set of tasks works collectively on the same data structure; however, each task works on a different partition of that data structure.
    5. Tasks perform the same operation on their partition of the work.
    6. On shared memory architectures, all tasks may have access to the data structure through global memory.
    7. On distributed memory architectures, the data structure is split up and resides as “chunks” in the local memory of each task.
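A minimal data parallel sketch in C with OpenMP (a hypothetical example): all threads perform the same operation, each on its own partition of one shared array.

    /* The loop iterations (and therefore the array) are partitioned among
       the threads; each thread scales a different chunk of the same array. */
    void scale(double *a, int n, double factor) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            a[i] *= factor;
        }
    }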
12
Q

Explain hybrid model

A
  1. A common example of a hybrid model is the combination of the message passing model (MPI) with the threads model (OpenMP): communication between processes on different nodes occurs over the network, while computationally intensive kernels are performed with threads using local, on-node data (see the sketch below).
  2. Another example is using MPI with CPU-GPU (Graphics Processing Unit) programming.
    – MPI tasks run on CPUs using local memory and communicate with each other over a network.
    – Computationally intensive kernels are off-loaded to GPUs on-node.
    – Data exchange between node-local memory and GPUs uses CUDA (or something equivalent).
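A minimal MPI + OpenMP hybrid sketch in C (a hypothetical example, compiled with something like mpicc -fopenmp): MPI handles communication between nodes, while OpenMP threads do the on-node work.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* typically one MPI task per node */

        /* Computationally intensive, on-node work is spread across threads. */
        #pragma omp parallel
        {
            printf("MPI rank %d, OpenMP thread %d\n", rank, omp_get_thread_num());
        }

        MPI_Finalize();
        return 0;
    }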
13
Q

Explain SPMD

A
  1. Single Program: all tasks execute their copy of the same program simultaneously. This program can be threads, message passing, data parallel or hybrid.
  2. Multiple Data: all tasks may use different data.
  3. SPMD programs usually have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are designed to execute (only a portion of it), as in the sketch below.
  4. Using message passing or hybrid programming, SPMD is the most commonly used parallel programming model for multi-node clusters.
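A minimal SPMD sketch in C with MPI (a hypothetical example): every task runs the same program but branches on its rank, so each executes only the portion intended for it.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Only task 0 executes this branch of the single program. */
            printf("task 0: coordinating %d workers\n", size - 1);
        } else {
            /* All other tasks execute the worker branch, on their own data. */
            printf("task %d: doing worker computation\n", rank);
        }

        MPI_Finalize();
        return 0;
    }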
14
Q

Explain MPMD

A

1. Multiple Program: tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid.
2. Multiple Data: all tasks may use different data.

15
Q

Design issues of Parallel computing

A

1. Partitioning: splitting the problem into smaller problems
2. Mapping: distributing the parts to multiple processors
3. Communication: if required (depends on the topology)
4. Consolidating: assembling the final result (all four steps appear in the sketch below)
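A minimal parallel sum sketch in C with MPI (a hypothetical example) showing the four steps: the index range is partitioned, partitions are mapped to processes, partial results are communicated, and the final result is consolidated on rank 0.

    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000L

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Partitioning + mapping: each process takes one contiguous index range. */
        long begin = rank * N / size;
        long end   = (rank + 1) * N / size;

        double local_sum = 0.0;
        for (long i = begin; i < end; i++) {
            local_sum += (double)i;
        }

        /* Communication + consolidation: combine the partial sums on rank 0. */
        double total = 0.0;
        MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            printf("sum = %.0f\n", total);
        }

        MPI_Finalize();
        return 0;
    }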

16
Q

Designing parallel programs

A
  • Designing parallel programs has characteristically been a very manual process. The programmer is typically responsible for both identifying and actually implementing parallelism.
    * Manual parallelization is a time-consuming, complex, error-prone and iterative process.
  • Various tools have been available to assist the programmer with converting serial programs into parallel programs. The most common type is a parallelizing compiler or pre-processor.
  • It works in two ways:
    1. Fully Automatic
    – The compiler analyzes the source code and identifies opportunities for parallelism.
    – The analysis includes identifying inhibitors to parallelism and possibly a cost weighting on whether or not the parallelism would actually improve performance.
    – Loops (do, for) are the most frequent target.
    2. Programmer Directed
    – Using “compiler directives” or possibly compiler flags, the programmer explicitly tells the compiler how to parallelize the code (see the sketch below).
    – May be used in conjunction with some degree of automatic parallelization.
  • The most common compiler-generated parallelization is done using on-node shared memory and threads (such as OpenMP).
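A minimal programmer-directed sketch in C (a hypothetical example): the OpenMP compiler directive states explicitly that the loop iterations are independent, instead of relying on fully automatic analysis.

    /* Without the directive, an auto-parallelizing compiler would have to prove
       independence of the iterations itself; the directive asserts it. */
    void saxpy(float a, const float *x, float *y, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            y[i] = a * x[i] + y[i];
        }
    }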
17
Q

Explain thread-level parallelism

A
  1. Multi-threading
  2. Hyper-threading
  3. Simultaneous multi-threading
    – Instructions from different threads can run simultaneously on different functional units.
  4. Multi-core
    – Threads can run on separate cores (see the Pthreads sketch below).
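A minimal multi-threading sketch in C with Pthreads (a hypothetical example, assuming a POSIX system, linked with -pthread): the two threads can be scheduled onto separate cores.

    #include <stdio.h>
    #include <pthread.h>

    static void *work(void *arg) {
        /* Each thread runs this function; on a multi-core CPU the two threads
           can execute at the same time on different cores. */
        printf("thread %ld running\n", (long)arg);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, (void *)1L);
        pthread_create(&t2, NULL, work, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }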
18
Q

Explain instruction-level parallelism

A
  1. Pipelining
    – Several instructions are simultaneously at different stages of their execution.
  2. Super-pipelining
    – Several instructions are simultaneously at the same and different stages of their execution.
  3. Superscalar
    – Several instructions are simultaneously at the same stage of their execution.
    – The CPU can execute more than one instruction per clock cycle.
  4. Vector and array processing
  5. VLIW
    – The compiler is used to identify independent instructions and bundle them together.
    – Drawback: if the compiler can't find independent instructions, it needs to insert NOPs, and the code needs to be recompiled.
  6. EPIC
19
Q

What problems cannot be solved by parallel computing?

A

Problems where there is a dependency between the instructions, so that each instruction needs the result of the previous one and the work cannot be split into parts that run concurrently.

20
Q

Formulae for parallel computing

A
  1. T(N) = (T - F) + F/N
    T = total time of serial execution
    T - F = total time of the non-parallelizable part
    F = total time of the parallelizable part (when executed serially, not in parallel)
    N = number of processors
  2. Amdahl's Law
    – Speedup = 1 / ((1 - F) + F/N), where F here is the fraction of execution time that is parallelizable
    – The maximum speedup possible in parallelizing an algorithm is limited by the sequential portion of the code.
    – Limitations: it doesn't consider the size of the problem and it ignores communication cost.
  3. Gustafson's Law
    – Scaled Speedup = N - (N - 1)*S
    S = fraction (percentage) of the execution that is serial
    – Law: the proportion of the computation that is sequential normally decreases as the problem size increases.
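A small worked sketch in C (hypothetical numbers: parallelizable fraction F = 0.9, N = 8 processors, serial fraction S = 0.1):

    #include <stdio.h>

    int main(void) {
        double F = 0.9;   /* parallelizable fraction (assumed) */
        double S = 0.1;   /* serial fraction (assumed) */
        int    N = 8;     /* number of processors (assumed) */

        /* Amdahl:    speedup = 1 / ((1 - F) + F/N)   -> about 4.7x here */
        double amdahl = 1.0 / ((1.0 - F) + F / N);

        /* Gustafson: scaled speedup = N - (N - 1)*S  -> 7.3x here */
        double gustafson = N - (N - 1) * S;

        printf("Amdahl speedup:    %.2f\n", amdahl);
        printf("Gustafson speedup: %.2f\n", gustafson);
        return 0;
    }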
21
Q

Limitations of single core

A
  1. With respect to power
    – There is a limit on the scaling of clock speeds.
    – The ability to handle on-chip heat has reached a physical limit.
  2. With respect to memory
    – Need for bigger cache sizes.
    – Memory access latency is still not in line with processor speeds.
  3. With respect to ILP
    – Identifying implicit parallelism within the threads is limited in many applications.
    – Hardware restrictions, such as instruction window size.
    – Dependencies between instructions.
22
Q

Explain homogeneous multi-core architecture

A
  1. Has multiple cores on a single chip, and all those cores are identical.
  2. The CPUs share a memory space.
  3. Some kind of communication facility between the CPUs is provided; this is normally through shared memory, but accessed through the API of the OS.
23
Q

Explain heterogeneous multi-core architecture

A
  1. Multiple cores on a single chip, but those cores might be of different designs.
  2. Each core will have different capabilities.
  3. Heterogeneous systems improve performance and efficiency by exposing to programmers architectural features such as low-latency software-controlled memories and the inter-processor interconnect.
24
Q

Explain multi-core architecture

A
  1. Multi-core is a design in which a single physical processor contains the core logic of more than one processor.
  2. Each core has its own execution pipeline, and each core has the resources required to run without blocking the resources needed by the other software threads.
  3. Advantage
    – Better performance: each core is able to run at a lower frequency, dividing among the cores the power normally given to a single core.
25
Q

What is false sharing?

A

If two or more processors write data to different portions of the same cache line, a lot of cache and bus traffic can result from invalidating or updating every cached copy of that line on the other processors, even though the processors are not actually sharing any data (see the sketch below).
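A minimal sketch in C with OpenMP (a hypothetical example) of a pattern that typically causes false sharing: two threads repeatedly update adjacent array elements that sit in the same cache line; padding each element out to its own cache line is a common fix.

    #include <omp.h>

    double counters[2];   /* adjacent doubles: very likely in the same cache line */

    void update(long iters) {
        #pragma omp parallel num_threads(2)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < iters; i++) {
                counters[id] += 1.0;   /* different data, same cache line: the line
                                          ping-pongs between the two cores' caches */
            }
        }
    }

    /* Common fix: pad each counter so it occupies its own cache line
       (a 64-byte line size is assumed here). */
    struct padded_counter { double value; char pad[56]; };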

26
Q

Limitations of multi core processors

A
  1. More expensive than a single-core processor.
  2. The performance of a multi-core processor depends on how the user uses the computer.
  3. Consumes more power.
  4. The processor becomes hot when doing more work.
27
Q

CPU vs GPU

A

CPU
1. Optimized for low-latency access to cached data sets
2. Control logic for out-of-order and speculative execution

GPU
1. Optimized for data-parallel, throughput computation
2. Architecture is tolerant of memory latency
3. More of the chip is dedicated to computation

28
Q

Explain the two main components of GPU Architecture

A
  1. Global memory
    * Analogous to RAM
    * Accessible by both CPU and GPU
    * Currently up to 6 GB
  2. Streaming Multiprocessors (SMs)
    * Perform the actual computation
    * Each SM has its own: control units, registers, execution pipeline, caches
29
Q

What are the three ways to accelerate applications?

A
  1. Libraries
    * NVIDIA cuBLAS
    * NVIDIA NPP
    * NVIDIA cuFFT
  2. OpenACC Directives
    * Compiler directives (hints) inserted into existing code so the compiler can offload loops for acceleration
  3. Programming languages (e.g. CUDA)