Lecture 10 Flashcards

1
Q

What is the maximum speedup on P processors?

A

Let S = the fraction of total execution that is inherently sequential.

speedup <= 1 / (S + (1 - S)/P)
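The bound can be sketched as a one-line Python function (the function name is illustrative):

```python
def max_speedup(s, p):
    """Amdahl's law upper bound: s = inherently sequential fraction,
    p = number of processors."""
    return 1.0 / (s + (1.0 - s) / p)

# With a 10% sequential fraction, 10 processors give at most ~5.26x,
# and even infinitely many processors cannot exceed 1/s = 10x.
bound = max_speedup(0.1, 10)
```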

2
Q

What are the key goals related to achieving high performance?

A

Optimising the performance of parallel programs is an iterative process of refining choices for decomposition, assignment, and orchestration.

Key goals:
- Balance workload onto available execution resources.
- Reduce communication (to avoid stalls).
- Reduce extra work (overhead) performed to increase parallelism and manage assignment.

3
Q

After the initial design, what are good questions to ask to check for performance improvements?

A

Always implement the easiest/simplest solution first, then:
Measure.
Decide whether you need performance improvements:
- Will you have large workloads (in the computing resources required and the time taken to complete a task)?
- Will you have large input data?

4
Q

How to balance workload?

A
  1. Identifying enough concurrency in decomposition, and overcoming Amdahl’s Law.
  2. Deciding how to manage the concurrency (statically or dynamically).
  3. Determining the granularity (the volume of instructions that exist in a unit that is allocated for parallel execution) at which to exploit the concurrency.
  4. Reducing serialisation and synchronisation cost.
5
Q

How to identify enough concurrency?

A

Look to see whether the problem has data parallelism and function parallelism.

Data parallelism: same function performed on all the data.

Function parallelism: entirely different calculations are performed concurrently on either the same or different data (e.g. pipelining).

If both data and function parallelism are available in the application, we need to choose which to parallelise.

Function parallelism does not usually grow with the size of the problem being solved because of dependencies that may exist.

Data parallelism grows with the size of the problem.

6
Q

How to manage concurrency?

A

Best case: all processors are computing all the time during program execution. They are computing simultaneously, and they finish their portion of the work at the same time.

Static assignment:
One of the most basic techniques, where work is allocated to the threads by the programmer.
Programmer needs to know the cost (execution time) and the amount of work (how many tasks).

Near-static assignment:
Do an assignment once, then re-adjust if needed after profiling the application.
Use prediction based on the execution time of previous tasks.

Dynamic assignment:
Assignment determined at runtime to ensure a well-balanced load.
Used when the execution time of tasks and the total number of tasks are unpredictable.
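A minimal sketch of dynamic assignment, using Python's standard `queue` and `threading` modules (the function name and task shapes are illustrative):

```python
import queue
import threading

def dynamic_assign(tasks, n_workers):
    """Dynamic assignment: workers pull tasks from a shared queue at
    runtime, so faster workers naturally take on more of the work."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()   # grab the next available task
            except queue.Empty:
                return               # no work left: worker exits
            r = t()                  # execute the task
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# 8 tasks of (potentially) unpredictable cost, balanced across 3 workers.
out = dynamic_assign([lambda i=i: i * i for i in range(8)], n_workers=3)
```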

7
Q

What are some rules for managing concurrency?

A

Useful to have many more tasks than processors (many small tasks enable good workload balance via dynamic assignment). This motivates small-granularity tasks.

But we want as few tasks as possible to minimise the overhead of managing the assignment. This motivates large-granularity tasks.

Ideal granularity depends on many factors:
- you must know your machine and your workload.
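One common way to tune this trade-off is chunking: grouping fine-grained tasks into larger units. A toy sketch (the helper name is hypothetical):

```python
def make_chunks(n_tasks, chunk_size):
    """Group fine-grained task indices into chunks: larger chunks mean
    fewer scheduling events (less assignment overhead), smaller chunks
    mean finer-grained load balance."""
    return [list(range(i, min(i + chunk_size, n_tasks)))
            for i in range(0, n_tasks, chunk_size)]

# 10 tasks in chunks of 3 -> 4 units of assignable work instead of 10.
chunks = make_chunks(10, 3)
```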

8
Q

Distributed work queues?

A

A distributed queue separates the work producer from the work consumer, allowing work to be performed on a different process.

Common case: threads work on tasks they create (producer-consumer locality: the producer generates data or tasks and the consumer processes them).

Costly synchronisation occurs during stealing, but not every time a thread takes on new work. Stealing occurs only when necessary to ensure good load balance.

Leads to increased locality (utilising data that are physically or logically close to the processing units that require them).
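A sequential toy simulation of the idea, assuming one deque per worker where a worker pops its own work from the back and, when idle, steals from the front of another worker's queue (all names are illustrative):

```python
import random
from collections import deque

def run_with_stealing(per_worker_tasks, seed=0):
    """Distributed work queues sketch: each worker pops from its own
    deque (cheap, common case) and steals from a random non-empty
    victim only when its own queue is empty (costly, rare case)."""
    rng = random.Random(seed)
    queues = [deque(ts) for ts in per_worker_tasks]
    done = [0] * len(queues)   # tasks completed per worker
    steals = 0
    while any(queues):
        for w, q in enumerate(queues):
            if q:
                q.pop()        # take own work from the back
            else:
                victims = [v for v in range(len(queues))
                           if v != w and queues[v]]
                if not victims:
                    continue   # nothing to steal right now
                queues[rng.choice(victims)].popleft()  # steal from front
                steals += 1
            done[w] += 1
    return done, steals

# Worker 0 starts with all 6 tasks; stealing balances the load.
done, steals = run_with_stealing([[1, 2, 3, 4, 5, 6], [], []])
```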

9
Q

Distributed schedulers in clusters?

A

Users submit their jobs to a cluster’s scheduler.

Jobs are queued.

Jobs in the queue are considered for allocation whenever the state of a machine changes.

10
Q

FCFS?

A

First come, first served.

If the machine’s free capacity cannot accommodate the first job, it will not attempt to start any subsequent job.

No starvation but poor utilisation.

Processing power is wasted if the first job cannot run.
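The head-of-queue blocking behaviour can be sketched as follows (jobs are hypothetical (name, CPUs) pairs):

```python
def fcfs_schedule(free_cpus, queue_jobs):
    """FCFS: start jobs strictly in arrival order; stop at the first
    job that does not fit, even if later jobs would fit."""
    started = []
    for name, cpus in queue_jobs:
        if cpus > free_cpus:
            break              # head of queue blocks everything behind it
        free_cpus -= cpus
        started.append(name)
    return started, free_cpus

# B (4 CPUs) blocks the queue, so C (2 CPUs) never starts
# even though 2 CPUs sit idle: poor utilisation, but no starvation.
started, free = fcfs_schedule(8, [("A", 6), ("B", 4), ("C", 2)])
```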

11
Q

Backfilling?

A

Allows small jobs from the back of the queue to execute before larger jobs that arrived earlier.

Requires job runtimes to be known in advance, often specified as a runtime upper bound.
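A sketch of the fit test in an EASY-style backfilling scheme, assuming the blocked head job holds a reservation at `head_start_time` and declared runtimes are upper bounds (all names and parameters are illustrative):

```python
def backfill(free_cpus, head_start_time, now, queue_jobs):
    """Backfilling sketch: a job behind the blocked head may jump
    ahead only if it fits in the free CPUs now AND its declared
    runtime bound ends before the head job's reservation."""
    started = []
    for name, cpus, runtime in queue_jobs:
        if cpus <= free_cpus and now + runtime <= head_start_time:
            free_cpus -= cpus
            started.append(name)
    return started

# Head job is reserved to start at t=10. B and D fit in the gap and
# finish in time; C needs more CPUs than are free, so it waits.
started = backfill(free_cpus=4, head_start_time=10, now=0,
                   queue_jobs=[("B", 2, 5), ("C", 4, 20), ("D", 2, 10)])
```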

12
Q

Fragmentation?

A

The division of memory or storage into small, non-contiguous blocks, making it challenging to allocate contiguous blocks of memory for larger objects or processes. It arises when allocated memory blocks are deallocated or freed at arbitrary locations, leaving gaps of unused memory in between.
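A toy first-fit allocator query illustrating why fragmentation hurts: the total free space is sufficient, but no single contiguous gap is (the memory layout is made up):

```python
def first_fit(free_gaps, size):
    """Return the offset of the first free gap that can hold `size`
    units, or None. With fragmented gaps, a large request can fail
    even though the total free space would be enough."""
    for offset, length in free_gaps:
        if length >= size:
            return offset
    return None

# After freeing every other 4-unit block, 12 units are free in total,
# but split into three non-contiguous 4-unit gaps.
gaps = [(0, 4), (8, 4), (16, 4)]
total_free = sum(length for _, length in gaps)
```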

13
Q

How to reduce unnecessary work?

A

Sometimes the sequential program is better than the parallel program.

If needed, have each process compute its own data values rather than have one process compute them and communicate them to the others. This may be a good trade-off when the cost of communication is high.

If the redundant computation can be performed while the processor is otherwise idle due to load imbalance, its cost can be hidden.
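The replicate-vs-communicate decision can be phrased as a toy cost model (all parameters and the function name are hypothetical):

```python
def should_recompute(t_compute, t_comm, idle_time=0.0):
    """Recompute a value locally when that is cheaper than receiving
    it, counting compute that overlaps idle time (e.g. waiting due to
    load imbalance) as free, since its cost is hidden."""
    effective_compute = max(0.0, t_compute - idle_time)
    return effective_compute < t_comm

# Recomputing takes 5 units vs 3 to communicate: communicate...
# ...unless 4 units of the recompute hide behind idle time.
no_idle = should_recompute(5.0, 3.0)
with_idle = should_recompute(5.0, 3.0, idle_time=4.0)
```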

14
Q

Communication model 1?

A

T(n) = T_0 + n/B

T(n) = transfer time (overall latency)
T_0 = start-up latency (time until first bit arrives)
n = bytes transferred in operation
B = transfer rate (bandwidth of the link)

Assumption: the processor does no other work while waiting for the transfer to complete.
Effective bandwidth = n/T(n)
Effective bandwidth depends on transfer size.
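The model and its effective-bandwidth consequence, sketched in Python (the latency and bandwidth numbers in the example are made up):

```python
def transfer_time(n, t0, bandwidth):
    """T(n) = T0 + n/B: start-up latency plus bytes over link bandwidth."""
    return t0 + n / bandwidth

def effective_bandwidth(n, t0, bandwidth):
    """n / T(n): approaches the link bandwidth B only for large
    transfers, since T0 dominates small ones."""
    return n / transfer_time(n, t0, bandwidth)

# Example link: 10 microseconds start-up latency, 1 GB/s bandwidth.
# A 1 KB transfer achieves well under 10% of the link bandwidth;
# a 100 MB transfer gets within 1% of it.
small = effective_bandwidth(1_000, 10e-6, 1e9)
large = effective_bandwidth(100_000_000, 10e-6, 1e9)
```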

15
Q

Communication model 2?

A

Communication time = overhead + occupancy + network delay.

Overhead: time spent on the communication by a processor.
Occupancy: time spent for data to pass through the slowest component of the system.
Network delay: everything else.

More applicable in situations that involve a combination of mechanisms.
