Accelerators Flashcards
What are the current technology trends for single core performance?
We still get more transistors, now down to 5 nm. There is talk of going to 3 nm, but beyond that it would be very challenging. So although we are currently still getting more transistors, this trend will taper off.
Frequency has flattened out since the early 2000s.
Single-thread performance has also levelled off since the early 2000s. Typical power has also flattened out.
Number of logical cores has increased.
What is the main reason for single-thread performance, typical power and frequency flattening out?
The end of Dennard scaling. We have become power limited.
What are the technology trends for multi-core performance?
Clock frequencies are flat
After Dennard scaling stopped, the industry moved to multicore (early 2000s, around 2005).
For a number of years after that, it was still possible to increase the number of cores, up to 8 cores around 2015.
Describe the development of performance from 1978 to 2019
Early days (78-86):
- CISC
- lots of innovation in the ISAs themselves
- 2x increase of performance every 3.5 years
Mid 80s (86-2003):
- RISC introduced
- easier to optimise this ISA for HW
- 2x increase every 1.5 years
Early 2000s (2003-2011):
- End of Dennard scaling
- Static power (leakage) became so big we weren’t able to continue scaling processors
- 2x increase every 3.5 years
- Begin using multicore
Early 2010s (2011-2015):
- Amdahl’s law limits performance increase
- If we keep increasing the number of cores, at some point the SW reaches a limit on how much it can be parallelised
- 2x increase every 6 years
Currently (2015-2018, maybe now):
- 2x increase every 20 years
As technology nodes decrease in size, what happens to the power per nm^2?
Size of the technology nodes decreases through the years.
Power per square nm increases.
In the early 2000s, the power increased more slowly; from around 2009 it started to increase more rapidly. This is roughly where the (normalised) power per nm^2 curve crosses the feature-size curve.
The power of each transistor no longer scales down as fast as its area does. This means that, relative to its size, each transistor now uses more power.
What does the term Dark Silicon mean?
Dennard scaling said that if a transistor shrinks to 25% of its original area, four such transistors can be powered with the same power as one transistor of the original size (the power density stays constant).
Now, the power of each transistor no longer decreases as much as its area does. So with the power budget that used to supply one full-size transistor, the new 25%-size transistors each need more than 1/4 of the original power. If such transistors were used, the same power budget would only be enough to power, e.g., 40% of the total chip area, i.e. not even enough for two of the new quarter-size transistors. (See the figure in the Accelerators video at 6:11.)
So, now that we have become power limited, we cannot have as many transistors powered on at the same time, on each chip, as there is space for. If we were to power every transistor that fits on the chip, we would draw too much power; the parts that must stay switched off are the "dark silicon".
Because of this, computer architecture has become power limited rather than area limited.
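A rough sketch of the scaling argument (my own back-of-the-envelope, not taken from the course material): dynamic power is roughly \(P \approx \alpha C V^2 f\). Under ideal Dennard scaling with a linear shrink factor \(k\), \(C \to C/k\), \(V \to V/k\) and \(f \to kf\), so the power per transistor scales as
\[ P \to \frac{C}{k}\left(\frac{V}{k}\right)^2 (kf) = \frac{P}{k^2}, \]
while the transistor area also shrinks by \(1/k^2\), so the power density stays constant. Once \(V\) can no longer be lowered (because of leakage), the power per transistor only falls roughly as \(1/k\) while the area still falls as \(1/k^2\), so the power density rises and part of the chip has to stay dark.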
When executing a RISC instruction, how much energy does it take? (Example figure; the exact technology node is not known.)
125 pJ
The overhead of executing the instruction takes most of the energy (fetch, decode, …)
The ALU takes some of the energy
When executing a Load/store instruction, how much energy does it take? (Example figure; the exact technology node is not known.)
150 pJ
Accessing the data cache takes some energy, even more in the lower cache levels
Overhead takes some energy
The ALU takes some energy
When executing a 32-bit addition, how much energy does it take? (Example figure; the exact technology node is not known.)
7 pJ
So the addition by itself takes very little energy compared to a whole instruction
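A quick calculation from the two example figures above (illustrative only, since the exact numbers depend on the technology node):
\[ \frac{125\,\mathrm{pJ} - 7\,\mathrm{pJ}}{125\,\mathrm{pJ}} \approx 94\%, \]
i.e. roughly 94% of the energy of a simple RISC add instruction is overhead (fetch, decode, ...) rather than the arithmetic itself.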
When executing an 8-bit addition, how much energy does it take? (Example figure; the exact technology node is not known.)
0.2-0.5 pJ
When executing an SP (single precision, 32-bit) floating point operation, how much energy does it take? (Example figure; the exact technology node is not known.)
15-20 pJ
What are some advanced microarchitectural techniques used to push performance, and are these energy efficient?
These tend to be very energy inefficient.
OoO:
- Large instruction window (ROB)
- complicated schedulers (reservation station, wake-up logic)
Pipelines - becoming deeper (15+ stages)
- logic and control (hazards)
- pipeline registers
Branch prediction
- becoming larger
- flushing of the pipeline on a misprediction wastes energy
Complex memory hierarchy
- large caches
- multi level
- deep hierarchies are wasteful for e.g. streaming programs, where a cache line is only fetched once and never reused
Prefetchers
- on a misprediction, the prefetched data is wasted energy
Multithreading
Multiprocessing
What can be done to try and improve energy efficiency?
SIMD extension units/vectorisation:
- Instead of fetching one instruction for each data element, apply one instruction to multiple data elements
- this means the fetch/decode etc. only needs to be done once per vector (see the sketch after this answer)
GPUs
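A minimal C sketch of the vectorisation idea (a hypothetical example of mine, assuming an x86 CPU with AVX; compile with e.g. -mavx):

#include <immintrin.h>  /* AVX intrinsics (assumes an AVX-capable x86 CPU) */

/* Scalar version: one add instruction (plus its fetch/decode/load/store
   overhead) is issued per data element. */
void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Vectorised version: one AVX add works on 8 floats at a time, so the
   per-instruction overhead (fetch, decode, ...) is amortised over 8 elements. */
void add_vector(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)            /* scalar tail for any leftover elements */
        c[i] = a[i] + b[i];
}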
Describe some properties of GPUs
Uses massive multithreading and SIMT, i.e. SIMD-like execution: one instruction is fetched and executed on multiple data elements.
This reduces energy overhead per data element.
How does vectorisation affect energy efficiency?
Reducing the number of instructions reduces the amount of per-instruction overhead (fetch, decode, etc.), which makes execution more energy efficient.
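As a rough illustration using the figures from the energy cards above (assuming roughly 118 pJ of overhead per instruction and an 8-wide vector unit; the numbers are only indicative):
\[ \frac{\approx 118\,\mathrm{pJ}}{8} \approx 15\,\mathrm{pJ} \ \text{of overhead per element}, \]
compared to about 118 pJ of overhead per element in the scalar case, on top of the few pJ the arithmetic itself costs.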