CPU Parallelism Flashcards

1
Q

What is the Roofline model?

A

• A visual aid to understand software performance
• Relates processor performance to off-chip memory traffic (bandwidth often the bottleneck)
• The horizontal bound (peak performance) is the maximum capability of the processor: either R_peak (theoretical peak) or R_max (measured peak)
• The angled bound represents the maximum memory bandwidth available, either theoretical (from the processor spec) or measured (see the STREAM benchmark). To plot it, multiply the arithmetic intensity I (FLOP/byte) by the bandwidth (byte/s), which gives FLOP/s
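
The two bounds combine into one attainable-performance curve: the minimum of the horizontal roof and the bandwidth line. A minimal sketch in Python; the function name and the machine figures (100 GFLOP/s peak, 20 GB/s bandwidth) are illustrative assumptions, not values for any real processor.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(peak performance, intensity * bandwidth).

    intensity     -- arithmetic intensity I in FLOP/byte
    peak_gflops   -- horizontal bound (R_peak or R_max) in GFLOP/s
    bandwidth_gbs -- memory bandwidth in GB/s
    """
    return min(peak_gflops, intensity * bandwidth_gbs)


# Hypothetical machine: 100 GFLOP/s peak, 20 GB/s memory bandwidth.
PEAK, BW = 100.0, 20.0
ridge = PEAK / BW  # intensity at which the two bounds meet

for i in (0.5, 5.0, 50.0):
    kind = "memory-bound" if i < ridge else "compute-bound"
    print(f"I = {i:5.1f} FLOP/byte -> {attainable_gflops(i, PEAK, BW):6.1f} GFLOP/s ({kind})")
```

Kernels to the left of the ridge point sit under the angled bound, so only a higher arithmetic intensity or more bandwidth can speed them up.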

2
Q

If a program has no floating point operations, what arithmetic intensity does it have?

A

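Zero, the lowest possible arithmetic intensity. Arithmetic intensity is the number of floating-point operations per byte of memory traffic, so with no floating-point operations it is 0.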

3
Q

What is the von Neumann architecture?

A

Single, centralised control (CPU) that consists of a control unit and an arithmetic/logic unit.

• Separate storage area where both instructions and data are stored.
• Instructions are executed by the CPU, so they must be brought from the memory into the CPU.
• The data must also be brought from memory into the CPU in order to be acted upon.
• The CPU contains registers, which act as a scratchpad for temporary storage.
• Data and instructions share a common bus, so an instruction fetch and a data operation cannot happen at the same time – the von Neumann bottleneck.

4
Q

What are the steps of the von Neumann (fetch-execute) cycle?

A

• Fetch the instruction corresponding to the Program Counter from memory
• Decode the instruction
• Fetch data from memory
• Execute the instruction
• Write back results

(Inherently serial)

5
Q

How many clock cycles does the fetch-execute cycle take?

A

If we assume each step takes one clock cycle (more in reality), an operation that needs only 1 clock cycle at the execute (EX) stage takes 5 clock cycles in total before the processor can move on to the next instruction.

6
Q

What is out of order execution?

A

If instruction 3 has its data ready before instruction 2, the processor can execute instruction 3 first and later reorder the results so that it appears as if instruction 2 was executed first.

7
Q

What is speculative execution?

A

If a branch instruction is yet to be executed, the processor may guess which of the two sides of the branch will be taken and start executing the instructions on that side speculatively.
It may have to backtrack later if its guess turns out to be wrong.

8
Q

What is fused multiply-add (FMA)?

A

Implements a = a + (b * c) as a single operation
(common in operations such as convolutions and matrix multiplications)
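
A dot product is a typical place where this pattern appears: every loop iteration is one multiply-add into an accumulator. The sketch below only shows the shape of the computation in Python. Whether it ends up as a hardware FMA instruction depends on the language, compiler and processor, and the function name is illustrative.

```python
def dot(xs, ys):
    """Dot product written as a chain of a = a + (b * c) updates."""
    acc = 0.0
    for b, c in zip(xs, ys):
        # The multiply and the add that FMA hardware performs as one operation.
        acc = acc + (b * c)
    return acc


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```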

9
Q

What can we do to mitigate the von Neumann bottleneck?

A

• Parallelism
• Add some fast memory on the chip: cache
• The bigger the cache, the slower it is. Solution: use a hierarchy of caches (L1, L2, L3 and so on). L1 is faster but smaller than L2, and so on.
• Caches exploit spatial locality and temporal locality in a program

10
Q

What is spatial locality?

A

If a memory location is accessed, it is very likely that its neighbours will be accessed soon
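
To see why caches reward this, the toy model below counts cache misses for two traversal orders over the same 1024 elements. It is a sketch under assumed parameters (64-byte cache lines, 8-byte elements, a tiny fully associative LRU cache of 4 lines), not a model of any particular processor, and the names are illustrative.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache-line size
ELEM_BYTES = 8   # assumed element size (e.g. a double)


def count_misses(indices, capacity_lines=4):
    """Count misses in a small fully associative LRU cache."""
    cache = OrderedDict()
    misses = 0
    for i in indices:
        line = (i * ELEM_BYTES) // LINE_BYTES  # cache line holding element i
        if line in cache:
            cache.move_to_end(line)            # hit: mark as most recently used
        else:
            misses += 1                        # miss: the whole line is fetched
            cache[line] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)      # evict the least recently used line
    return misses


n = 1024
sequential = list(range(n))  # 0, 1, 2, ...: neighbours accessed back to back
strided = [start + 8 * k for start in range(8) for k in range(n // 8)]  # 0, 8, 16, ..., then 1, 9, ...

print(count_misses(sequential))  # 128
print(count_misses(strided))     # 1024
```

Both orders touch every element exactly once, but the sequential order reuses each fetched line for 8 neighbouring elements, while the strided order has lost the line by the time it returns to it.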

11
Q

What is temporal locality?

A

If a variable (memory location) is accessed, it is very likely to be needed again soon.

12
Q

What kind of parallelism does pipelining give?

A

Instruction-Level Parallelism (ILP)
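
As a rough illustration, the sketch below compares cycle counts with and without an idealised pipeline. It assumes the five one-cycle steps of the fetch-execute cycle and ignores stalls and hazards; the function names are illustrative.

```python
STAGES = 5  # fetch, decode, fetch data, execute, write back (one cycle each, assumed)


def serial_cycles(n_instructions, stages=STAGES):
    """Each instruction completes every stage before the next one starts."""
    return n_instructions * stages


def pipelined_cycles(n_instructions, stages=STAGES):
    """Ideal pipeline: once full, one instruction completes every cycle."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


n = 1000
print(serial_cycles(n))     # 5000
print(pipelined_cycles(n))  # 1004
```

The individual instruction still takes 5 cycles. The gain comes from having up to 5 instructions in flight at once, each in a different stage.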

13
Q

What is Single Instruction, Multiple Data (SIMD)?

A

• If one addition takes one clock cycle, a loop that performs 1000 additions one at a time runs in 1000 clock cycles.
• Instead of doing one iteration of the loop per clock cycle, we could do four (for example).
• The loop would then run in 250 clock cycles.
• This is called vector processing, also known as Single Instruction, Multiple Data (SIMD).
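
The card's original code listing is not reproduced here, so the loop below is an assumed stand-in: 1000 independent element-wise additions. The sketch counts steps rather than real clock cycles, and models a 4-lane vector unit in plain Python. All names are illustrative.

```python
N = 1000
a = list(range(N))
b = list(range(N, 2 * N))


def scalar_add(a, b):
    """One addition per step: len(a) steps in total."""
    c, steps = [], 0
    for x, y in zip(a, b):
        c.append(x + y)
        steps += 1
    return c, steps


def simd_add(a, b, width=4):
    """Model of a `width`-lane vector add: one step handles `width` elements."""
    c, steps = [], 0
    for i in range(0, len(a), width):
        # On SIMD hardware these `width` additions are a single instruction.
        c.extend(x + y for x, y in zip(a[i:i + width], b[i:i + width]))
        steps += 1
    return c, steps


c_scalar, scalar_steps = scalar_add(a, b)
c_simd, simd_steps = simd_add(a, b)
print(scalar_steps, simd_steps)  # 1000 250
```

Both versions produce the same result. The vector version works because the iterations are independent, so four of them can share one instruction.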

14
Q

What is Simultaneous Multithreading?

A

• Intel brands it as Hyper-Threading
• Allows a core to execute multiple threads at the same time.
• Software sees more virtual cores than there are physical cores
• Works well when the threads have different workloads, less so when two threads do similar operations: they compete for the same resources on the core and may run slower than a single thread on that core would
• In HPC, workloads are usually similar. Hence Hyper-Threading is disabled on Barkla.

15
Q

Why might more cores not make the program faster?

A

• Cache contention
• What if one core changes some data that another core needs soon after?
• Advanced Reading: False sharing
• Memory contention – locks/semaphores
• Interconnects get slower as the number of cores grows, because there is more traffic.
• To effectively utilise multiple cores, programs must be written differently.
• Programmers must think in parallel
• Load balancing
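
The load-balancing point can be made concrete with a small sketch: the run time on several cores is set by the busiest core, not by the average. The task durations are made up, and the greedy "largest task to the least-loaded core" rule is just one simple balancing strategy chosen for illustration.

```python
import heapq


def makespan_static(costs, cores):
    """Split tasks into contiguous blocks of equal length, one block per core."""
    chunk = -(-len(costs) // cores)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))


def makespan_greedy(costs, cores):
    """Assign each task, largest first, to the currently least-loaded core."""
    loads = [0] * cores
    heapq.heapify(loads)
    for cost in sorted(costs, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + cost)
    return max(loads)


costs = [8, 7, 6, 5, 1, 1, 1, 1]  # hypothetical task durations
print(makespan_static(costs, 4))  # 15
print(makespan_greedy(costs, 4))  # 8
```

With the static split, one core receives both of the longest tasks while two others are almost idle, so the four cores finish in 15 time units instead of 8.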
