Closed Exercises Flashcards

1
Q

Let’s consider a single-issue processor that can manage up to 4 simultaneous threads.
What are the values of the ideal CPI and the ideal per-thread CPI? (SINGLE ANSWER)
1 point

• Answer 1: Ideal CPI = 1 & Ideal per-thread CPI = 0.25
• Answer 2: Ideal CPI = 0.25 & Ideal per-thread CPI = 0.25
• Answer 3: Ideal CPI = 0.5 & Ideal per-thread CPI = 2
• Answer 4: Ideal CPI = 0.5 & Ideal per-thread CPI = 1
• Answer 5: Ideal CPI = 1 & Ideal per-thread CPI = 4

A

Answer 5: Ideal CPI = 1 & Ideal per-thread CPI = 4

Divide the number of threads by the number of issues: 4 threads / 1 issue = ideal per-thread CPI of 4

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Let’s consider a dual-issue SMT processor that can manage up to 4 simultaneous threads.
What are the values of the ideal CPI and the ideal per-thread CPI? (SINGLE ANSWER)
1 point

• Answer 1: Ideal CPI = 1 & Ideal per-thread CPI = 0.25
• Answer 2: Ideal CPI = 0.25 & Ideal per-thread CPI = 0.25
• Answer 3: Ideal CPI = 0.5 & Ideal per-thread CPI = 2
• Answer 4: Ideal CPI = 0.5 & Ideal per-thread CPI = 1
• Answer 5: Ideal CPI = 0.25 & Ideal per-thread CPI = 4

A

Answer 3: Ideal CPI = 0.5 & Ideal per-thread CPI = 2

Per-thread CPI = threads divided by the number of issues: 4 / 2 = 2

3
Q

Let’s consider a 4-issue SMT processor that can manage up to 4 simultaneous threads.
What are the values of the ideal CPI and the ideal per-thread CPI? (SINGLE ANSWER)
1 point
• Answer 1: Ideal CPI = 1 & Ideal per-thread CPI = 0.25
• Answer 2: Ideal CPI = 0.25 & Ideal per-thread CPI = 0.25
• Answer 3: Ideal CPI = 0.25 & Ideal per-thread CPI = 1
• Answer 4: Ideal CPI = 0.25 & Ideal per-thread CPI = 4
• Answer 5: Ideal CPI = 4 & Ideal per-thread CPI = 0.25

A

Answer 3: Ideal CPI = 0.25 & Ideal per-thread CPI = 1

Per-thread CPI = threads divided by the number of issues: 4 / 4 = 1
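The rule used in the three cards above can be sketched as a small calculation (the function names are invented for illustration):

```python
# Ideal CPI and ideal per-thread CPI for a multithreaded processor.
# Rule from the cards above: ideal CPI = 1 / issue_width, and
# ideal per-thread CPI = num_threads / issue_width.

def ideal_cpi(issue_width: int) -> float:
    """Best-case cycles per instruction for the whole processor."""
    return 1 / issue_width

def ideal_per_thread_cpi(issue_width: int, num_threads: int) -> float:
    """Best-case CPI seen by each thread when issue slots are shared fairly."""
    return num_threads / issue_width

# The three cards above:
print(ideal_cpi(1), ideal_per_thread_cpi(1, 4))  # 1.0 4.0  (single-issue)
print(ideal_cpi(2), ideal_per_thread_cpi(2, 4))  # 0.5 2.0  (dual-issue SMT)
print(ideal_cpi(4), ideal_per_thread_cpi(4, 4))  # 0.25 1.0 (4-issue SMT)
```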

4
Q

What is Flynn’s taxonomy?

A

• SISD: Single instruction stream, single data stream
- Uniprocessors (including scalar processors like MIPS, but also ILP processors such as superscalars)

• SIMD: Single instruction stream, multiple data streams
- Vector architectures
- Multimedia extensions
- Graphics processor units

• MISD: Multiple instruction streams, single data stream
- No practical usage => no commercial implementation

• MIMD: Multiple instruction streams, multiple data streams
- Tightly-coupled MIMD (with thread-level parallelism)
- Loosely-coupled MIMD (with request-level parallelism)

5
Q

What is the idea of SIMD, and what are its advantages?

A

A central controller sends the same instruction to multiple processing elements (PEs), each operating on its own data.

• Synchronized PEs with a single Program Counter

• Each Processing Element (PE) has its own set of data
– Use different sets of register addresses

• Motivations for SIMD:
– Cost of control unit shared by all execution units
– Only one copy of the code in execution is necessary
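The broadcast idea above can be sketched in plain Python (a toy model of the execution model, not real SIMD hardware; all names are invented for this example):

```python
# Toy model of the SIMD idea: one control unit broadcasts a single
# instruction, and every processing element (PE) applies it in lockstep
# to its own private operands.

def broadcast(instruction, pe_data):
    """Apply one instruction to every PE's private operands."""
    return [instruction(a, b) for (a, b) in pe_data]

# Four PEs, each with its own pair of operands (its own "registers").
pe_data = [(1, 10), (2, 20), (3, 30), (4, 40)]

# One ADD instruction fetched once, executed by all PEs.
sums = broadcast(lambda a, b: a + b, pe_data)
print(sums)  # [11, 22, 33, 44]
```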

6
Q

What are the types of SIMD machines?

A

Vector architectures

SIMD extensions

Graphics Processor Units (GPUs)

7
Q

How does a vector architecture work?

A

Basic idea:
– Load sets of data elements into vector registers
– Operate on vector registers
– Write the results back into memory
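The three steps above can be sketched as a toy model (register names follow VMIPS-style conventions; the dictionaries stand in for memory and the vector register file):

```python
# Load-operate-store sketch of a vector architecture (toy model).

memory = {"Rx": [1.0, 2.0, 3.0, 4.0], "Ry": [5.0, 6.0, 7.0, 8.0]}
vregs = {}

# 1) Load sets of data elements into vector registers.
vregs["V1"] = list(memory["Rx"])
vregs["V2"] = list(memory["Ry"])

# 2) Operate on vector registers (element-wise add).
vregs["V3"] = [a + b for a, b in zip(vregs["V1"], vregs["V2"])]

# 3) Write the results back into memory.
memory["Ry"] = list(vregs["V3"])
print(memory["Ry"])  # [6.0, 8.0, 10.0, 12.0]
```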

8
Q

What is the difference between scalar and vector registers?

A

See picture 23

9
Q

What are GPUs specialized in?

A

GPUs are specialized for parallel, computation-intensive workloads

10
Q

How does a GPU parallelize its computations?

A

All operations are performed in parallel by the GPU using a huge number of threads processing all data independently

11
Q

How does the interaction between CPUs and GPUs occur?

A

The GPU (device) serves as a coprocessor for the CPU (host)

CPU and GPU are separate devices, with separate memory space addresses

The GPU has its own high-bandwidth memory

Serial parts of a program run on the CPU (host)

Computation-intensive and data-parallel parts are offloaded to the GPU (device)

12
Q

What is the main bottleneck in the CPU and GPU interaction?

A

Data movement between CPU and GPU is the main bottleneck
- Low bandwidth compared to internal CPU and GPU memory, since transfers go over PCI Express (12-14 GB/s)
- Relatively high latency
- Data transfer can take more time than the actual computation

13
Q

What is operation chaining?

A

Concept of forwarding extended to vector registers:
– A vector operation can start as soon as each element of its vector source operand becomes available
– Even though a pair of operations depend on one another, chaining allows them to proceed in parallel on separate elements of the vector
– This way, we no longer need to wait for the last element of a load before starting the next dependent instruction
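A back-of-the-envelope sketch of the effect, assuming each instruction produces one element per cycle and chaining lets the consumer start one cycle after the first element arrives (a simplified model, not any specific processor):

```python
# Effect of chaining on two dependent vector instructions, each
# consuming one element per clock cycle (toy timing model).

def time_two_dependent_ops(n, chaining):
    """Cycles until the second (dependent) instruction finishes."""
    if chaining:
        # The second op starts as soon as the first element is available.
        return 1 + n
    # Without chaining, it waits for the last element of the first op.
    return n + n

n = 64
print(time_two_dependent_ops(n, chaining=False))  # 128
print(time_two_dependent_ops(n, chaining=True))   # 65
```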

14
Q

What does the execution time of vector architectures depend on?

A

• Execution time depends on three factors:
– Length of operand vectors (number of elements)
– Structural hazards (how many vector functional units)
– Data dependencies (need to introduce operation chaining)

• VMIPS functional units consume one element per clock cycle
– So, the execution time of one vector instruction is approximately given by the vector length

15
Q

What are convoys?

A

• Simplification: introduce the notion of convoy
– A set of vector instructions that could potentially execute together, partially overlapped (no structural hazards)

• Sequences with read-after-write dependency hazards can be in the same convoy via chaining

16
Q

What are chimes?

A

A chime is a timing metric corresponding to the unit of time to execute one convoy
– m convoys execute in m chimes (or time units)
– Simply stated: for a vector length of n and m convoys in a program, n x m clock cycles are required
– The chime approximation ignores some processor-specific overheads
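The n x m rule can be written out directly (a sketch of the approximation, ignoring start-up overheads as the card says):

```python
# Chime approximation: m convoys over vectors of length n take roughly
# n * m clock cycles (processor-specific start-up overheads ignored).

def chime_cycles(n, m):
    """Approximate cycles for m convoys on vectors of n elements."""
    return n * m

print(chime_cycles(64, 3))  # 192 cycles for 3 convoys of 64 elements
print(chime_cycles(64, 1))  # 64 cycles for a single convoy
```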

17
Q

LV V1,Rx
MULVS.D V2,V1,F0
LV V3,Ry
ADDVV.D V4,V2,V3
SV Ry,V4

Let’s assume we have 1 LV/SV unit & 1 vector ALU unit with chaining: how can we divide the above instructions into convoys? How many chimes does it take?

A

3 convoys:
1) LV + MULVS.D (chaining)
2) LV + ADDVV.D (chaining)
3) SV
The final SV cannot join the second convoy, since there is only one LV/SV unit (structural hazard).

3 convoys (3 chimes); 2 FP ops per element => 1.5 cycles per FP op per element
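The arithmetic in this card can be checked directly:

```python
# Card 17: 3 convoys = 3 chimes, with 2 floating-point operations
# (MULVS.D and ADDVV.D) performed per vector element.

convoys = 3                    # = chimes = cycles per element
fp_ops_per_element = 2
cycles_per_fp_op = convoys / fp_ops_per_element
print(cycles_per_fp_op)        # 1.5 cycles per FP op per element
```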

See picture 24

18
Q

How can a vector processor execute a vector faster than one element per clock cycle?

A

See picture 25
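The standard answer here is multiple parallel lanes (presumably what the referenced picture shows); a sketch of the resulting cycle count, assuming L lanes that each consume one element per clock:

```python
# Multiple lanes: element i of a vector is routed to lane i mod L, so
# L elements complete per clock cycle instead of one (toy model).

def cycles_with_lanes(vector_length, num_lanes):
    """Clock cycles to stream one vector instruction through the lanes."""
    return -(-vector_length // num_lanes)  # ceiling division

print(cycles_with_lanes(64, 1))  # 64 cycles with a single lane
print(cycles_with_lanes(64, 4))  # 16 cycles with four lanes
```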

19
Q

How to handle a number of loop iterations not equal to the vector length?

What do you do when the vector length in a program is not exactly 64?
1. Vector length smaller than 64
2. Vector length unknown at compile time and possibly greater than the MVL

A

See picture 26
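The standard technique here is strip mining; since the answer refers to a figure, here is a sketch of the loop structure, assuming MVL = 64 as in the question (the first strip handles the odd-sized remainder, the rest are full MVL-length strips):

```python
# Strip mining: process a vector of arbitrary length n in chunks of at
# most MVL elements, resetting the vector-length register per chunk.

MVL = 64  # maximum vector length, as assumed in the question

def strip_mined_add(x, y):
    n = len(x)
    result = [0.0] * n
    low = 0
    # First strip: VL = n mod MVL (or a full strip if n divides evenly).
    vl = n % MVL or min(n, MVL)
    while low < n:
        for i in range(low, low + vl):   # stands in for one vector instr.
            result[i] = x[i] + y[i]
        low += vl
        vl = MVL                         # all remaining strips are full
    return result

out = strip_mined_add([1.0] * 130, [2.0] * 130)
print(len(out), out[0])  # 130 3.0
```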

20
Q

What to do when there is an IF statement inside the FOR loop code to be vectorized?

A

See picture 27
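The standard technique here is vector-mask (predicated) execution; since the answer refers to a figure, here is a toy sketch: compute a mask vector from the IF condition, then apply the vector operation only where the mask is set:

```python
# Vector-mask execution: the condition becomes a mask vector, and the
# operation updates only the elements whose mask bit is set (toy model).

def masked_divide(a, b):
    mask = [bi != 0 for bi in b]          # mask from the IF condition
    return [ai / bi if m else ai          # masked vector operation
            for ai, bi, m in zip(a, b, mask)]

print(masked_divide([4.0, 9.0, 6.0], [2.0, 0.0, 3.0]))  # [2.0, 9.0, 2.0]
```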

21
Q

How to handle bi-dimensional matrices?

A

See picture 28
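The standard technique here is strided access; since the answer refers to a figure, here is a toy sketch: in a row-major matrix, consecutive column elements sit one row length apart in memory, so loading a column uses a non-unit stride:

```python
# Strided access: loading a column of a row-major 3x4 matrix means
# fetching elements `row_len` positions apart in flat memory (toy model).

row_len = 4
flat = [m * 10 + n for m in range(3) for n in range(row_len)]  # 3x4 matrix

col = 2
stride = row_len
column = [flat[col + i * stride] for i in range(3)]  # strided "vector load"
print(column)  # [2, 12, 22]
```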

22
Q

How to handle sparse matrices in vector processors?

A

See picture 29
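The standard technique here is gather/scatter via index vectors; since the answer refers to a figure, here is a toy sketch: an index vector of nonzero positions gathers the sparse data into a dense vector, which is operated on and scattered back:

```python
# Gather/scatter for sparse data: an index vector compresses the nonzero
# elements into a dense vector, and writes them back afterwards (toy model).

values = [0.0] * 8
values[1], values[4], values[6] = 3.0, 5.0, 7.0
index_vector = [1, 4, 6]                  # positions of nonzero elements

dense = [values[k] for k in index_vector] # gather (indexed load)
dense = [v * 2 for v in dense]            # operate on the dense vector
for k, v in zip(index_vector, dense):     # scatter (indexed store)
    values[k] = v

print(values[1], values[4], values[6])  # 6.0 10.0 14.0
```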