Closed Exercises Flashcards

1
Q

Let’s consider a single-issue processor that can manage up to 4 simultaneous threads.
What are the values of the ideal CPI and the ideal per-thread CPI? (SINGLE ANSWER)
1 point

• Answer 1: Ideal CPI = 1 & Ideal per-thread CPI = 0.25
• Answer 2: Ideal CPI = 0.25 & Ideal per-thread CPI = 0.25
• Answer 3: Ideal CPI = 0.5 & Ideal per-thread CPI = 2
• Answer 4: Ideal CPI = 0.5 & Ideal per-thread CPI = 1
• Answer 5: Ideal CPI = 1 & Ideal per-thread CPI = 4

A

Answer 5: Ideal CPI = 1 & Ideal per-thread CPI = 4

Divide the number of threads by the number of issues: 4 threads / 1 issue = ideal per-thread CPI of 4

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Let’s consider a dual-issue SMT processor that can manage up to 4 simultaneous threads.
What are the values of the ideal CPI and the ideal per-thread CPI? (SINGLE ANSWER)
1 point

• Answer 1: Ideal CPI = 1 & Ideal per-thread CPI = 0.25
• Answer 2: Ideal CPI = 0.25 & Ideal per-thread CPI = 0.25
• Answer 3: Ideal CPI = 0.5 & Ideal per-thread CPI = 2
• Answer 4: Ideal CPI = 0.5 & Ideal per-thread CPI = 1
• Answer 5: Ideal CPI = 0.25 & Ideal per-thread CPI = 4

A

Answer 3: Ideal CPI = 0.5 & Ideal per-thread CPI = 2

Per-thread CPI = threads divided by the number of issues: 4 / 2 = 2

3
Q

Let’s consider a 4-issue SMT processor that can manage up to 4 simultaneous threads.
What are the values of the ideal CPI and the ideal per-thread CPI? (SINGLE ANSWER)
1 point
• Answer 1: Ideal CPI = 1 & Ideal per-thread CPI = 0.25
• Answer 2: Ideal CPI = 0.25 & Ideal per-thread CPI = 0.25
• Answer 3: Ideal CPI = 0.25 & Ideal per-thread CPI = 1
• Answer 4: Ideal CPI = 0.25 & Ideal per-thread CPI = 4
• Answer 5: Ideal CPI = 4 & Ideal per-thread CPI = 0.25

A

Answer 3: Ideal CPI = 0.25 & Ideal per-thread CPI = 1

Per-thread CPI = threads divided by the number of issues: 4 / 4 = 1
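The rule used in the three cards above can be sketched as a small calculation (the function names are invented for illustration):

```python
# Ideal CPI and ideal per-thread CPI for a multithreaded processor.
# Rule from the cards above: ideal CPI = 1 / issue_width, and
# ideal per-thread CPI = num_threads / issue_width.

def ideal_cpi(issue_width: int) -> float:
    """Best-case cycles per instruction for the whole processor."""
    return 1 / issue_width

def ideal_per_thread_cpi(issue_width: int, num_threads: int) -> float:
    """Best-case CPI seen by each thread when issue slots are shared fairly."""
    return num_threads / issue_width

# The three cards above:
print(ideal_cpi(1), ideal_per_thread_cpi(1, 4))  # 1.0 4.0  (single-issue)
print(ideal_cpi(2), ideal_per_thread_cpi(2, 4))  # 0.5 2.0  (dual-issue SMT)
print(ideal_cpi(4), ideal_per_thread_cpi(4, 4))  # 0.25 1.0 (4-issue SMT)
```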

4
Q

What is Flynn’s taxonomy?

A

• SISD: Single instruction stream, single data stream
- Uniprocessors (including scalar processors like MIPS, but also ILP processors such as superscalars)

• SIMD: Single instruction stream, multiple data streams
- Vector architectures
- Multimedia extensions
- Graphics processor units

• MISD: Multiple instruction streams, single data stream
- No practical usage => no commercial implementation

• MIMD: Multiple instruction streams, multiple data streams
- Tightly-coupled MIMD (with thread-level parallelism)
- Loosely-coupled MIMD (with request-level parallelism)

5
Q

What is the idea of SIMD, and what are its advantages?

A

A central controller sends the same instruction to multiple processing elements (PEs), each operating on its own data.

• Synchronized PEs with a single Program Counter

• Each Processing Element (PE) has its own set of data
– Use different sets of register addresses

• Motivations for SIMD:
– Cost of control unit shared by all execution units
– Only one copy of the code in execution is necessary
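The broadcast idea above can be sketched in plain Python (a toy model of the execution model, not real SIMD hardware; all names are invented for this example):

```python
# Toy model of the SIMD idea: one control unit broadcasts a single
# instruction, and every processing element (PE) applies it in lockstep
# to its own private operands.

def broadcast(instruction, pe_data):
    """Apply one instruction to every PE's private operands."""
    return [instruction(a, b) for (a, b) in pe_data]

# Four PEs, each with its own pair of operands (its own "registers").
pe_data = [(1, 10), (2, 20), (3, 30), (4, 40)]

# One ADD instruction fetched once, executed by all PEs.
sums = broadcast(lambda a, b: a + b, pe_data)
print(sums)  # [11, 22, 33, 44]
```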

6
Q

What are the types of SIMD machines?

A

Vector architectures

SIMD extensions

Graphics Processor Units (GPUs)

7
Q

How does a vector architecture work?

A

Basic idea:
– Load sets of data elements into vector registers
– Operate on vector registers
– Write the results back into memory
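The three steps above can be sketched as a toy model (register names follow VMIPS-style conventions; the dictionaries stand in for memory and the vector register file):

```python
# Load-operate-store sketch of a vector architecture (toy model).

memory = {"Rx": [1.0, 2.0, 3.0, 4.0], "Ry": [5.0, 6.0, 7.0, 8.0]}
vregs = {}

# 1) Load sets of data elements into vector registers.
vregs["V1"] = list(memory["Rx"])
vregs["V2"] = list(memory["Ry"])

# 2) Operate on vector registers (element-wise add).
vregs["V3"] = [a + b for a, b in zip(vregs["V1"], vregs["V2"])]

# 3) Write the results back into memory.
memory["Ry"] = list(vregs["V3"])
print(memory["Ry"])  # [6.0, 8.0, 10.0, 12.0]
```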

8
Q

What is the difference between scalar and vector registers?

A

See picture 23

9
Q

What are GPUs specialized in?

A

GPUs are specialized for parallel, computation-intensive workloads

10
Q

How does a GPU parallelize its computations?

A

All operations are performed in parallel by the GPU using a huge number of threads processing all data independently

11
Q

How does the interaction between CPUs and GPUs occur?

A

The GPU (device) serves as a coprocessor for the CPU (host)

CPU and GPU are separate devices, with separate memory space addresses

The GPU has its own high-bandwidth memory

Serial parts of a program run on the CPU (host)

Computation-intensive and data-parallel parts are offloaded to the GPU (device)

12
Q

What is the main bottleneck in the CPU and GPU interaction?

A

Data movement between CPU and GPU is the main bottleneck
- Low bandwidth compared to internal CPU and GPU memory, since transfers go over PCI Express (12-14 GB/s)
- Relatively high latency
- Data transfer can take more time than the actual computation

13
Q

What is operation chaining?

A

Concept of forwarding extended to vector registers:
– A vector operation can start as soon as each element of its vector source operand becomes available
– Even though a pair of operations depend on one another, chaining allows them to proceed in parallel on separate elements of the vector
– This way, we no longer need to wait for the last element of a load before starting the next dependent instruction
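A back-of-the-envelope sketch of the effect, assuming each instruction produces one element per cycle and chaining lets the consumer start one cycle after the first element arrives (a simplified model, not any specific processor):

```python
# Effect of chaining on two dependent vector instructions, each
# consuming one element per clock cycle (toy timing model).

def time_two_dependent_ops(n, chaining):
    """Cycles until the second (dependent) instruction finishes."""
    if chaining:
        # The second op starts as soon as the first element is available.
        return 1 + n
    # Without chaining, it waits for the last element of the first op.
    return n + n

n = 64
print(time_two_dependent_ops(n, chaining=False))  # 128
print(time_two_dependent_ops(n, chaining=True))   # 65
```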

14
Q

What does the execution time of vector architectures depend on?

A

• Execution time depends on three factors:
– Length of operand vectors (number of elements)
– Structural hazards (how many vector functional units)
– Data dependencies (need to introduce operation chaining)

• VMIPS functional units consume one element per clock cycle
– So, the execution time of one vector instruction is approximately given by the vector length

15
Q

What are convoys?

A

• Simplification: introduce the notion of convoy
– A set of vector instructions that could potentially execute together, partially overlapped (no structural hazards)

• Sequences with read-after-write dependency hazards can be in the same convoy via chaining

16
Q

What are chimes?

A

A chime is a timing metric corresponding to the unit of time to execute one convoy
– m convoys execute in m chimes (or time units)
– Simply stated: for a vector length of n and m convoys in a program, n x m clock cycles are required
– The chime approximation ignores some processor-specific overheads
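The n x m rule can be written out directly (a sketch of the approximation, ignoring start-up overheads as the card says):

```python
# Chime approximation: m convoys over vectors of length n take roughly
# n * m clock cycles (processor-specific start-up overheads ignored).

def chime_cycles(n, m):
    """Approximate cycles for m convoys on vectors of n elements."""
    return n * m

print(chime_cycles(64, 3))  # 192 cycles for 3 convoys of 64 elements
print(chime_cycles(64, 1))  # 64 cycles for a single convoy
```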

17
Q

LV V1,Rx
MULVS.D V2,V1,F0
LV V3,Ry
ADDVV.D V4,V2,V3
SV Ry,V4

Let’s assume we have 1 LV/SV unit & 1 vector ALU unit with chaining: how can we divide the above instructions into convoys? How many chimes does it take?

A

3 convoys:
1) LV + MULVS.D (chaining)
2) LV + ADDVV.D (chaining)
3) SV
The final SV cannot join the second convoy, since there is only one LV/SV unit (structural hazard).

3 convoys (3 chimes); 2 FP ops per element => 1.5 cycles per FP op per element
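The arithmetic in this card can be checked directly:

```python
# Card 17: 3 convoys = 3 chimes, with 2 floating-point operations
# (MULVS.D and ADDVV.D) performed per vector element.

convoys = 3                    # = chimes = cycles per element
fp_ops_per_element = 2
cycles_per_fp_op = convoys / fp_ops_per_element
print(cycles_per_fp_op)        # 1.5 cycles per FP op per element
```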

See picture 24

18
Q

How can a vector processor execute a vector faster than one element per clock cycle?

A

See picture 25
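The standard answer here is multiple parallel lanes (presumably what the referenced picture shows); a sketch of the resulting cycle count, assuming L lanes that each consume one element per clock:

```python
# Multiple lanes: element i of a vector is routed to lane i mod L, so
# L elements complete per clock cycle instead of one (toy model).

def cycles_with_lanes(vector_length, num_lanes):
    """Clock cycles to stream one vector instruction through the lanes."""
    return -(-vector_length // num_lanes)  # ceiling division

print(cycles_with_lanes(64, 1))  # 64 cycles with a single lane
print(cycles_with_lanes(64, 4))  # 16 cycles with four lanes
```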

19
Q

How to handle a number of loop iterations not equal to the vector length?

What do you do when the vector length in a program is not exactly 64?
1. Vector length smaller than 64
2. Vector length unknown at compile time and possibly greater than the MVL

A

See picture 26
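The standard technique here is strip mining; since the answer refers to a figure, here is a sketch of the loop structure, assuming MVL = 64 as in the question (the first strip handles the odd-sized remainder, the rest are full MVL-length strips):

```python
# Strip mining: process a vector of arbitrary length n in chunks of at
# most MVL elements, resetting the vector-length register per chunk.

MVL = 64  # maximum vector length, as assumed in the question

def strip_mined_add(x, y):
    n = len(x)
    result = [0.0] * n
    low = 0
    # First strip: VL = n mod MVL (or a full strip if n divides evenly).
    vl = n % MVL or min(n, MVL)
    while low < n:
        for i in range(low, low + vl):   # stands in for one vector instr.
            result[i] = x[i] + y[i]
        low += vl
        vl = MVL                         # all remaining strips are full
    return result

out = strip_mined_add([1.0] * 130, [2.0] * 130)
print(len(out), out[0])  # 130 3.0
```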

20
Q

What to do when there is an IF statement inside the FOR loop code to be vectorized?

A

See picture 27
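The standard technique here is vector-mask (predicated) execution; since the answer refers to a figure, here is a toy sketch: compute a mask vector from the IF condition, then apply the vector operation only where the mask is set:

```python
# Vector-mask execution: the condition becomes a mask vector, and the
# operation updates only the elements whose mask bit is set (toy model).

def masked_divide(a, b):
    mask = [bi != 0 for bi in b]          # mask from the IF condition
    return [ai / bi if m else ai          # masked vector operation
            for ai, bi, m in zip(a, b, mask)]

print(masked_divide([4.0, 9.0, 6.0], [2.0, 0.0, 3.0]))  # [2.0, 9.0, 2.0]
```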

21
Q

How to handle bi-dimensional matrices?

A

See picture 28
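The standard technique here is strided access; since the answer refers to a figure, here is a toy sketch: in a row-major matrix, consecutive column elements sit one row length apart in memory, so loading a column uses a non-unit stride:

```python
# Strided access: loading a column of a row-major 3x4 matrix means
# fetching elements `row_len` positions apart in flat memory (toy model).

row_len = 4
flat = [m * 10 + n for m in range(3) for n in range(row_len)]  # 3x4 matrix

col = 2
stride = row_len
column = [flat[col + i * stride] for i in range(3)]  # strided "vector load"
print(column)  # [2, 12, 22]
```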

22
Q

How to handle sparse matrices in vector processors?

A

See picture 29
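The standard technique here is gather/scatter via index vectors; since the answer refers to a figure, here is a toy sketch: an index vector of nonzero positions gathers the sparse data into a dense vector, which is operated on and scattered back:

```python
# Gather/scatter for sparse data: an index vector compresses the nonzero
# elements into a dense vector, and writes them back afterwards (toy model).

values = [0.0] * 8
values[1], values[4], values[6] = 3.0, 5.0, 7.0
index_vector = [1, 4, 6]                  # positions of nonzero elements

dense = [values[k] for k in index_vector] # gather (indexed load)
dense = [v * 2 for v in dense]            # operate on the dense vector
for k, v in zip(index_vector, dense):     # scatter (indexed store)
    values[k] = v

print(values[1], values[4], values[6])  # 6.0 10.0 14.0
```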