finals Flashcards

1
Q

the process contains more than
one thread, and the process is accomplishing a number of
things at the same time

A

multithreaded processes

2
Q

is the unit of execution within a process.

A

Thread

3
Q

each ____ has a separate memory address space, which means that a process runs independently and is isolated from other processes.

A

Process

4
Q

There can be multiple instances of a single program, and each instance of
that running program is a _______.

A

Process

5
Q

Speedup Formula

A

Speedup = 1 / [(1 - P) + P/N]
P (parallel fraction) = Parallel Time / Total Sequential Time
Total Sequential Time = Seq + Par + Seq
N = number of processors
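A quick worked example (with made-up timings): if a program spends 2 s in serial setup, 6 s in parallelizable work, and 2 s in serial cleanup, then Total Sequential Time = 10 s and P = 6/10 = 0.6. On N = 4 processors, Speedup = 1 / [(1 - 0.6) + 0.6/4] = 1 / 0.55 ≈ 1.8x, and even with unlimited processors it can never exceed 1 / 0.4 = 2.5x.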

6
Q

It seems that we could infinitely increase the number of processors and thus make the system run as fast as possible, but this law shows that the serial portion of a program limits the achievable speedup.

An example algorithm might look like:
1. Divide the pile of cards into stacks and hand out one stack to each person (Serial)
2. Everyone looks for the “concurrency” card (Parallel)
3. Place the found card in a separate pile (Serial)

A

Amdahl’s law

7
Q

What are the challenges in concurrent/parallel computing?

A

-Complexity of designing parallel algorithms:
* Removing task dependencies
* Can add large overheads
-Limited by memory access speeds
-Execution speed is sensitive to data
-Real world problems are most naturally described with mathematical recurrences

8
Q

What are the pros of parallel computing?

A

Improved Performance
Better Resource Utilization
Handles Large-Scale Problems
Scalability
Energy Efficiency
Fault Tolerance

9
Q

is when tasks actually run in parallel on multiple CPUs

A

Parallelism

10
Q

is when multiple tasks can run in overlapping
periods.

A

Concurrency

11
Q

Cons of Sequential Computing

A

Limited Performance
Poor Scalability
Inefficient for Complex Problems
Resource Underutilization
Longer Execution Time

12
Q

In sequential computing, no communication or synchronization is required between different steps of the program's execution, but there is an indirect ________ from the underutilization of available processing resources.

A

Overhead

13
Q

considered scalable if its performance improves after adding more processing
resources. In the case of sequential computing, the only way to scale the system is to increase the performance of system resources used – CPU, memory, etc.

A

Scalability

14
Q

It is a straightforward approach, with a clear set of step-by-step instructions about what to do and when to do it.

A

Simplicity

15
Q

The serial execution of tasks is a sort of chain, where the first task is followed by the second one, the second is followed by the third, and so on. The important point here is that tasks are physically executed without overlapping time periods

A

Sequential Computing

16
Q

machines (loosely coupled multiprocessor systems) in which all PEs have a local memory.

A

Distributed memory MIMD

17
Q

(tightly coupled multiprocessor
systems), all the PEs are connected to a single global memory and they all
have access to it.

A

shared memory MIMD

18
Q

Multicomputers
Multiprocessors

A

MIMD

19
Q

Vector processors
Fine-grained data parallel

A

SIMD

20
Q

May be pipelined computers

A

MISD

21
Q

Traditional von Neumann
single-CPU computers

A

SISD

22
Q

is a multiprocessor machine which is
capable of executing multiple instructions on multiple data sets.

A

MIMD

23
Q

computing system is a multiprocessor machine capable of executing different instructions on different PEs but all of them operating on the same dataset.

A

MISD

24
Q

is a multiprocessor machine capable of
executing the same instruction on all the CPUs but operating on different data streams.

A

SIMD

25
is a uniprocessor machine which is capable of executing a single instruction, operating on a single data stream. All the instructions and data to be processed have to be stored in primary memory
SISD
26
is a classification of computer architectures, proposed by Michael J. Flynn in 1966 and extended in 1972.
Flynn's Taxonomy
27
Sequential performance improvement slowed down, creating a need for parallel computing to handle larger datasets and modern application requirements. These improvements are slowing down due to the end of ______.
Moore's Law
28
Time Frame: 2005 to 20??
Third Crisis Problem: Sequential performance is left behind by Moore’s law
29
does not require the programmers to know anything about the processors to get good speedups
Moore’s law
30
’80s and ’90s. Problem: Inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers
Object-Oriented Programming (C++, C# and Java); better tools (component libraries); better software engineering methodology (design patterns, specification, testing, code reviews)
31
High-Level Languages for von Neumann Machines: Fortran and C provide a “common machine language” for uniprocessors
’60s and ’70s. Problem: Assembly Language Programming
32
“To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem”-- E. Dijkstra, 1972 Turing Award Lecture
The Software Crisis
33
is an elegant way to manage mutexes and ensure that they are properly locked and unlocked, preventing potential deadlocks and race conditions. It locks the mutex when the guard object is created and unlocks it automatically when the guard is destroyed.
std::lock_guard
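A minimal sketch of the idea (the mutex and counter names here are illustrative, not from the course):

#include <mutex>

std::mutex mtx;
int counter = 0;

void increment() {
    std::lock_guard<std::mutex> guard(mtx);  // locks mtx on construction
    ++counter;                               // critical section
}                                            // guard destroyed here, mtx unlocked automatically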
34
Many atomic operations are lock-free, meaning they don't use mutexes and are faster
Efficiency
35
std::atomic makes sure that multiple threads can safely read and write shared data
Thread Safety
36
These are operations that complete without interruption, preventing race conditions.
Atomic Operations
37
is a C++ feature that ensures safe access to shared data in multi-threaded programs without using locks.
std::atomic
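A minimal sketch (names are illustrative) showing two threads safely incrementing a shared counter without a mutex:

#include <atomic>
#include <thread>

std::atomic<int> hits{0};

void worker() {
    for (int i = 0; i < 1000; ++i)
        ++hits;                 // each increment is an atomic operation
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    // hits is now reliably 2000; with a plain int this would be a data race
}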
38
Attempts to lock the mutex without blocking. Returns true if successful, false otherwise
try_lock()
39
Releases the mutex, allowing other threads to acquire it.
unlock()
40
Blocks the calling thread until it successfully locks the mutex
lock()
41
stands for mutual exclusion. It is a synchronization primitive used to protect shared data from being accessed simultaneously by multiple threads, ensuring that only one thread can access a critical section at any one time.
mutex
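A minimal sketch (names are illustrative) tying together the lock(), try_lock(), and unlock() members from the cards above:

#include <mutex>

std::mutex m;
int shared_value = 0;

void writer() {
    m.lock();            // blocks until the mutex is acquired
    shared_value += 1;   // critical section
    m.unlock();          // releases the mutex so other threads can acquire it
}

void opportunistic_writer() {
    if (m.try_lock()) {  // non-blocking attempt; returns false if already locked
        shared_value += 1;
        m.unlock();
    }                    // else: do other useful work instead of waiting
}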
42
is an undesirable situation that occurs when a device or system attempts to perform two or more operations at the same time, but because of the nature of the device or system, the operations must be done in the proper sequence to be done correctly
race condition
43
Swaps the thread objects t1 and t2.
t1.swap(t2) / std::swap(t1, t2)
44
Suggests the OS reschedule threads (allowing others to run)
std::this_thread::yield()
45
Blocks the execution of the current thread for at least the specified sleep_duration
std::this_thread::sleep_for(relTime)
46
Pauses the current thread until the absolute time point absTime.
std::this_thread::sleep_until(absTime)
47
Returns the number of concurrent threads supported by the system.
std::thread::hardware_concurrency()
48
Returns the unique identifier for the current thread.
std::this_thread::get_id()
49
Returns the unique identifier for thread t.
t.get_id()
50
Returns true if thread t is still joinable (not detached or already joined).
t.joinable()
51
permits the thread to execute independently from the thread handle
detach()
52
waits for the thread to finish its execution
join()
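A minimal sketch (the task function is illustrative) combining several of the thread-management calls from the cards above:

#include <chrono>
#include <iostream>
#include <thread>

void task() {
    std::this_thread::sleep_for(std::chrono::milliseconds(100)); // block briefly
    std::cout << "worker id: " << std::this_thread::get_id() << '\n';
}

int main() {
    std::cout << "hardware threads: " << std::thread::hardware_concurrency() << '\n';
    std::thread t(task);   // t manages a separate thread
    if (t.joinable())      // true: not yet joined or detached
        t.join();          // wait for the worker to finish
}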
53
manages a separate thread
thread (C++11)
54
std::thread with support for auto-joining and cancellation
jthread (C++20)
55
suggests that the implementation reschedule execution of threads
yield
56
returns the thread id of the current thread
get_id()
57
stops the execution of the current thread for a specified time duration
sleep_for
58
stops the execution of the current thread until a specified time point
sleep_until
59
header for threads
<thread>
60
Executors, std::jthread, atomic smart pointers, std::future extensions, latches and barriers, coroutines, transactional memory, task blocks
C++20/23 (2020/2023)
61
C++17 (2017)
Parallel STL
62
C++14 (2014)
Reader-writer locks
63
C++11 (2011)
Memory model, threads, mutexes and locks, thread-local data, condition variables, tasks
64
looks similar to a deadlock in the sense that two threads are blocking each other from making progress, but the difference is that the threads are actively trying to resolve the problem: two or more threads are designed to respond to the actions of each other.
Livelock
65
generally, higher priority threads will be scheduled to execute more often and that can leave low priority threads feeling “hungry”.
Cause of Starvation
66
occurs when a thread is unable to gain access to a necessary resource, and is therefore, unable to make progress.
Starvation
67
If one thread or process acquires a lock, and then terminates because of some unexpected reason, it may not automatically release the lock before it disappears.
Abandoned lock
68
They're both stuck waiting on the other thread to release the other lock to make progress. Each member of a group is waiting for some other member to take action, and as a result, neither member is able to make progress.
Deadlock
69
is a synchronization primitive that allows multiple threads to read from a shared resource simultaneously while ensuring exclusive access for writing.
std::shared_mutex
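A minimal sketch (names are illustrative) of a reader-writer lock: many readers may hold the shared lock at once, while a writer takes it exclusively:

#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

std::shared_mutex rw;
std::map<std::string, int> table;

int read_entry(const std::string& key) {
    std::shared_lock<std::shared_mutex> lock(rw);  // shared (read) mode
    auto it = table.find(key);
    return it == table.end() ? 0 : it->second;
}

void write_entry(const std::string& key, int value) {
    std::unique_lock<std::shared_mutex> lock(rw);  // exclusive (write) mode
    table[key] = value;
}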
70
What should you use when you have a lot more threads that will be reading from the shared data than threads that will be writing to it, such as in certain types of database applications?
shared reader-writer lock
71
can improve a program's performance versus using a standard mutex. But they are more complicated to implement and they typically use more resources to keep track of the number of readers
Read-write Lock
72
Since only one thread can have the write lock at a time, all other threads wanting to read or write will have to ...
wait until the lock becomes available again
73
that limits access to only one thread at a time, allowing that thread to safely write to the shared resource.
exclusive write mode
74
that allows multiple threads that only need to read simultaneously to lock it.
shared read mode
75
to protect a critical section of code to defend against data races, which can occur when multiple threads are concurrently accessing the same location in memory and at least one of those threads is writing to that location.
locker mutex
76
allows you to attempt to acquire a lock without blocking. If the lock is already acquired by another thread, try_lock will immediately return false.
try lock
77
is a non-blocking version of the lock or acquire method
Try lock or try enter
78
is a type of mutex that allows the same thread to acquire the lock multiple times without causing a deadlock. This can be useful in scenarios where the same thread needs to re-enter a critical section
std::recursive_mutex
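A minimal sketch (names are illustrative) where the same thread re-enters the critical section through recursion without deadlocking:

#include <mutex>

std::recursive_mutex rm;
int total = 0;

void add(int n) {
    std::lock_guard<std::recursive_mutex> guard(rm);  // same thread may lock again
    total += n;
    if (n > 1)
        add(n - 1);   // re-enters the locked section safely
}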
79
Many programmers like using reentrant locks because
it can make things easier
80
is a particular type of mutex that can be locked multiple times by the same process or thread.
Reentrant mutex
81
are used for thread synchronization, allowing threads to wait until a particular condition is met. They are part of the <condition_variable> library.
Condition Variables
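A minimal sketch (names are illustrative) of a producer/consumer pair using a condition variable:

#include <condition_variable>
#include <mutex>
#include <queue>

std::mutex m;
std::condition_variable cv;
std::queue<int> q;

void producer() {
    {
        std::lock_guard<std::mutex> lock(m);
        q.push(42);
    }
    cv.notify_one();                          // wake one waiting consumer
}

int consumer() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return !q.empty(); }); // sleeps until the condition holds
    int value = q.front();
    q.pop();
    return value;
}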
82
There is no possibility to ______ or ______ the counter, which makes the latch a single-use barrier.
increase, reset
83
A class that defines an exception object thrown by functions in the thread library dealing with asynchronous execution and shared states (e.g., std::future, std::promise) upon failure.
std::future_error
84
An enumeration that represents the status of a std::future object, as returned by the wait_for and wait_until member functions. Possible values are std::future_status::ready, std::future_status::timeout, and std::future_status::deferred.
std::future_status
85
A class template that wraps a callable object (like a function or lambda) and allows its result to be retrieved asynchronously via a std::future.
std::packaged_task
86
A class template similar to std::future, but allows multiple threads to wait for and retrieve the result of a shared asynchronous operation.
std::shared_future
87
A class template that allows a thread to provide a value or an exception to a std::future object. It is used to set the result of an asynchronous operation.
std::promise
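A minimal sketch (names are illustrative) of a worker thread delivering a result through a std::promise / std::future pair:

#include <future>
#include <thread>

int main() {
    std::promise<int> p;
    std::future<int> f = p.get_future();  // future linked to this promise

    std::thread worker([&p] {
        p.set_value(123);                 // provide the result of the async work
    });

    int result = f.get();                 // blocks until set_value is called
    worker.join();
    return result == 123 ? 0 : 1;
}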
88
is an enumeration that specifies the launch policy for a task executed by std::async.
std::launch
89
is a class template that provides a mechanism to access the result of asynchronous operations.
std::future
90
runs a specified function asynchronously and returns a std::future object that can be used to retrieve the result.
std::async
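A minimal sketch (the function is illustrative) of launching a task with std::async and collecting its result through the returned std::future:

#include <future>
#include <iostream>

int slow_square(int x) { return x * x; }  // stand-in for a long-running task

int main() {
    // std::launch::async forces the task to run on a new thread
    std::future<int> result = std::async(std::launch::async, slow_square, 7);
    // ... the main thread can do other work here ...
    std::cout << result.get() << '\n';    // blocks until the result (49) is ready
}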
91
is a technique that enables your program to start a potentially long-running task and still be responsive to other events while that task runs.
asynchronous programming
92
decrements the counter and blocks until it reaches zero.
arrive_and_wait
93
blocks until the counter reaches zero.
wait
94
tests if the internal counter equals zero.
try_wait
95
decrements the counter in a non-blocking manner.
count_down
96
header is used for latches.
<latch>
97
When may threads block on a latch?
Threads may block on the latch until the counter is decremented to zero.
98
When is the value of the counter initialized in a latch?
The value of the counter is initialized on creation.
99
are a downward counter type that can be used to synchronize threads.
Latches
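A minimal sketch (names are illustrative) of a single-use latch used to wait for several workers to finish their setup:

#include <latch>
#include <thread>
#include <vector>

int main() {
    constexpr int workers = 3;
    std::latch done(workers);          // counter is initialized on creation

    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i)
        pool.emplace_back([&] {
            // ... one-time setup work ...
            done.count_down();         // non-blocking decrement
        });

    done.wait();                       // blocks until the counter reaches zero
    for (auto& t : pool) t.join();
}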
100
Decrements the initial expected count for all subsequent phases by one, and then decrements the expected count for the current phase by one.
arrive_and_drop
101
arrives at barrier and decrements the expected count by one, then blocks until current phase completes
arrive_and_wait
102
blocks at the phase synchronization point until its phase completion step is run
wait
103
arrives at barrier and decrements the expected count
arrive
104
is a function object that is called on every phase completion step.
completion
105
an unspecified object type meeting requirements of MoveConstructible, MoveAssignable, and Destructible.
arrival_token
106
this header is used for barriers in parallel computing.
<barrier>
107
is a group of threads or processes in the source code that means any thread/process must stop at this point and cannot proceed until all other threads/processes reach this barrier.
Barrier
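A minimal sketch (names are illustrative) of a reusable barrier that synchronizes threads between phases:

#include <barrier>
#include <thread>
#include <vector>

int main() {
    constexpr int workers = 4;
    std::barrier sync(workers, []() noexcept {
        // completion step: runs once per phase, after all threads have arrived
    });

    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i)
        pool.emplace_back([&] {
            // ... phase 1 work ...
            sync.arrive_and_wait();    // no thread proceeds until all arrive
            // ... phase 2 work ...
        });
    for (auto& t : pool) t.join();
}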
108
returns the maximum possible value of the internal counter.
max
109
tries to decrement the internal counter, blocking until a point in time.
try_acquire_until
110
tries to decrement the internal counter, blocking for up to a duration time.
try_acquire_for
111
tries to decrement the internal counter without blocking.
try_acquire
112
increments the internal counter and unblocks acquirers.
release
113
destructs the counting_semaphore.
destructor
114
C++20.
semaphores
115
has a value that ranges over an unrestricted domain and is used to control access to a resource that has multiple instances.
Counting Semaphore
116
also known as a mutex lock, can have only two values - 0 and 1. Its value is initialized to 1 and is used to implement the solution of critical section problems with multiple processes.
Binary Semaphore
117
What are the two types of semaphores?
Binary Semaphore and Counting Semaphore.
118
used to solve the critical section problem and to achieve process synchronization in a multiprocessing environment.
Semaphore
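A minimal sketch (names are illustrative) of a counting semaphore limiting concurrent access to a pool of three resources:

#include <semaphore>
#include <thread>
#include <vector>

std::counting_semaphore<3> slots(3);   // at most 3 threads inside at once

void use_resource() {
    slots.acquire();                   // decrements; blocks if the counter is zero
    // ... use one of the 3 pooled resources ...
    slots.release();                   // increments and unblocks a waiting thread
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < 8; ++i) pool.emplace_back(use_resource);
    for (auto& t : pool) t.join();
}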
119
The fourth and final stage of our parallel design process, where we specify where each of the tasks we established will actually execute. It really becomes a factor if you're using a distributed system or specialized hardware with lots of parallel processors for large-scale problems, like in scientific computing applications.
Mapping
120
Splits the program into a small number of large tasks
Coarse-grained Parallelism
121
a benefit where lots of small tasks can be more evenly distributed among processors to maximize their usage
Load Balancing
122
A program is broken down into a large number of small tasks
Fine-grained Parallelism
123
In the first two stages of our parallel design process we partitioned the problem into a set of separate tasks, and then established communication to provide those tasks with the data they needed. In this third stage, we combine those tasks into larger tasks to improve performance and reduce communication overhead.
Agglomeration
124
is the amount of data that can be communicated per unit of time, expressed in some unit of bytes per second
Bandwidth
125
the time it takes for a message to travel from point A to B expressed in units of time like microseconds
Latency
126
Often referred to as nonblocking communications because, after a task sends an asynchronous message, it can begin doing other work immediately, regardless of when the receiving task actually gets that message.
Asynchronous Communications
127
Sometimes called blocking communications because all tasks involved have to wait until the entire communication process is completed to continue doing other work.
Synchronous Communications
128
The second stage of our parallel design process is to establish communication, which involves figuring out how to coordinate execution and share data between the tasks.
Communication
129
Rather than focusing on the data being manipulated, functional decomposition begins by considering all of the computational work that a program needs to do and then divides that into separate tasks that perform different portions of the overall work.
Functional Decomposition
130
Focuses on dividing the data associated with the problem into lots of small and if possible, equally sized partitions. Programmers typically start with this because it forms the foundation for a lot of parallel algorithms
Domain Decomposition
131
is about breaking down the problem into discrete chunks of work that can be distributed to multiple tasks.
partitioning
132
OpenACC is meant to be easy to use and easy to learn. Programmer remains in familiar C, C++, or Fortran. No reason to learn low-level details of the hardware.
Low Learning Curve
133
Rebuild the same code on multiple architectures. Compiler determines how to parallelize for the desired machine. Sequential code is maintained.
Single Source
134
Maintain existing sequential code. Add annotations to expose parallelism. After verifying correctness, annotate more of the code.
Incremental
135
directives tell the compiler or runtime to...
- Generate parallel code for the GPU
- Allocate GPU memory and copy input data
- Execute parallel code on the GPU
- Copy output data back to the CPU and deallocate GPU memory

A simpler way to remember it: Copy → Run → Copy Back → Free (copy data to the GPU, run the parallel code on the GPU, copy the results back, release the GPU memory).
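A minimal sketch (function and variable names are illustrative) of an OpenACC-annotated loop; the pragma asks the compiler to copy the arrays to the GPU, run the loop in parallel there, and copy the result back when the region ends (assumes an OpenACC-capable compiler such as nvc++ or gcc with -fopenacc):

#include <vector>

void saxpy(float a, std::vector<float>& x, const std::vector<float>& y) {
    const int n = static_cast<int>(x.size());
    float* xp = x.data();
    const float* yp = y.data();
    #pragma acc parallel loop copy(xp[0:n]) copyin(yp[0:n])
    for (int i = 0; i < n; ++i)
        xp[i] = a * xp[i] + yp[i];     // each iteration can run on a GPU thread
}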
136
parallel programming standard designed for performance and portability
OpenACC
137
designed to simplify GPU/Multicore programming
OpenACC
138
is a programming standard for parallel computing on accelerators (mostly on NVIDIA GPUs).
OpenACC
139
The CPU and GPU have _______ memories.
Separate
140
between CPU and GPU is required for programming.
Data Transfer
141
A computer program can be _____ and ______ on the GPU.
parallelized and accelerated
142
Nowadays GPUs are used for tasks that were formerly the domain of CPUs, such as scientific computation. This kind of GPU is called general-purpose GPU.
GPGPU
143
Dedicated for manipulating computer graphics and image processing. Traditionally known as a 'video card'.
GPU
144
What are the requirements for pervasive systems?
• Embrace contextual changes. • Encourage ad hoc composition. • Recognize sharing as the default.
145
Once a transaction commits, the changes are permanent.
Durable
146
Concurrent transactions do not interfere with each other.
Isolated
147
The transaction does not violate system invariants.
Consistent
148
To the outside world, the transaction happens indivisibly.
Atomic
149
What are some false assumptions made by first-time developers?
• The network is reliable. • The network is secure. • The network is homogeneous. • The topology does not change. • Latency is zero. • Bandwidth is infinite. • Transport cost is zero. • There is one administrator.
150
Scalability Problems
No machine has complete information about the system state. Machines make decisions based only on local information. Failure of one machine does not ruin the algorithm. There is no implicit assumption that a global clock exists.
151
Doing routing based on complete information.
Centralized Algorithms
152
A single online telephone book.
Centralized Data
153
A single server for all users.
Centralized Services
154
Hides the failure and recovery of a resource.
Failure
155
Hides that a resource may be shared by several competitive users.
Concurrency
156
Hides that a resource is replicated.
Replication
157
Hides that a resource may be moved to another location while in use.
Relocation
158
Hides that a resource may move to another location.
Migration
159
Hides where a resource is located.
Location
160
Hides differences in data representation and how a resource is accessed.
Access
161
A layer that extends over multiple machines and offers each application the same interface.
Middleware
162
A collection of independent computers that appears to its users as a single coherent system.
Distributed System