Lectures 1 and 2 - Introduction and Fundamentals Flashcards
Implementation or Microarchitecture
Logical organisation of the inner structure of the computer.
Instruction Set Architecture
Defines the functional behaviour of the processor and the hardware/software interface. It enables compatible implementations to be made that make different trade-offs.
Realisation
Physical structure embodying the implementation.
Four major players in computer architecture arena
Applications, architectures, technology, markets
Design goals and constraints are imposed by the ….
Target market, available technology and target applications.
Important metrics
Energy (Joules) Power (W) Performance Cost/Performance Power Efficiency Reliability (Mean Time to Failure) Flexibility
How were historic performance gains achieved?
Technology Scaling
Gates per clock
Instructions Per Cycle Increase
Instruction Count Decrease
Technology Scaling Performance Improvement
Provides 1.4x transistor performance improvement per generation.
7 historic process generations provided a 10.5x performance improvement.
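The arithmetic behind this card can be checked directly; a minimal sketch using the 1.4x-per-generation figure quoted above:

```python
# Compound the assumed 1.4x per-generation transistor performance
# improvement over 7 process generations.
per_generation = 1.4
generations = 7
total = per_generation ** generations
print(f"{total:.1f}x")  # 10.5x
```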
Gates per clock performance improvement
Reduction from 100 to 10 gate delays so 10x performance improvement. 4x of this came from pipelining with an increase from 5 to 20 stages. 2.5x from circuit level advances e.g. new logic families.
IPC and instruction count performance gain
~5-8x improvement in SPECint/MHz
Advances in compiler technology and impact of increased bus widths.
VLSI technology scaling
Fabrication processes are characterised by feature sizes, so improvements allow transistor and wire dimensions to be reduced. A linear reduction in transistor size enables a quadratic increase in transistor count. Smaller transistors are also faster, as their resistance is roughly independent of feature size while their capacitance decreases with it.
Porting a design to a new technology
Performance increased by a factor of √2 (~1.4x)
Area reduced by a factor of 2
Dynamic power consumption is halved: with P = CV^2 f, capacitance and supply voltage both scale down by S (~1.4) while f increases by 1.4x to raise performance, giving a net reduction of 1/S^2 ≈ 0.5.
Interconnect versus transistor scaling
Smaller transistors are faster and lower power, but wires don't scale the same way: resistance increases and capacitance per micron remains roughly the same. Adding fat wires on the upper metal layers can help mitigate this.
Architectural Implications of poor wire scale
Can reach less state in a single clock cycle so decentralised structures work better. A bypass network between functional units may be preferable.
Amdahl’s Law
Performance improvements are limited by the fraction of time the proposed enhancement can be employed.
Speedup = 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)
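Amdahl's Law can be captured in a small helper; the 60%/10x figures below are hypothetical, chosen only to show the formula at work:

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of execution time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Hypothetical example: enhance 60% of execution time by 10x.
# Even a 10x local speedup yields only ~2.17x overall.
print(round(amdahl_speedup(0.6, 10.0), 2))  # 2.17
```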
Law of diminishing returns
Incremental improvements in speedup gained by enhancing just a portion of the computation diminish as improvements are added.
Are all enhancements worthwhile?
No; they consume design and implementation resources and can slow the common case. This may happen indirectly, by leaving less time to optimise the common case, or directly, by causing the cycle time to be extended.
Typical program behaviour
Locality of reference - exploited by memory hierarchy
Predictable flow control - branch prediction exploits
Predictable data values - exploited at higher levels e.g. Memoization
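Memoization, mentioned in the last card as a higher-level exploitation of predictable data values, can be sketched with Python's standard-library cache decorator:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # The cache stores previously computed results, exploiting the
    # fact that the same argument values recur predictably.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040, computed in linear rather than exponential time
```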
Principle of locality
Programs tend to reuse data and instructions that they have used recently.
Temporal locality
Recently accessed data are likely to be accessed again in the near future.
Spatial locality
Accesses to nearby memory locations often occur close together in time.
Locality in instruction reference stream
Temporal - loops and function calls
Spatial - instructions executed sequentially in absence of branch instructions. Many branches are to nearby instructions
Locality in data reference stream
Temporal - Widely used program variables & the call stack.
Spatial - Access arrays sequentially, process streams, function calls and the stack frame
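The spatial-locality point about sequential array access can be illustrated with two traversal orders over the same matrix. This is only a sketch: in CPython the cache effect is muted, but in languages with contiguous row-major arrays (C, or NumPy buffers) the row-major loop is measurably faster.

```python
N = 500
matrix = [[0] * N for _ in range(N)]

def row_major() -> int:
    s = 0
    for i in range(N):       # walks each row's elements sequentially:
        for j in range(N):   # good spatial locality
            s += matrix[i][j]
    return s

def col_major() -> int:
    s = 0
    for j in range(N):       # jumps to a different row on every access:
        for i in range(N):   # poor spatial locality
            s += matrix[i][j]
    return s
```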
CPU performance equation
1/performance = time/program = instructions/program x cycles/instruction x time/cycle
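The equation above multiplied out, with hypothetical numbers (1 billion instructions, CPI of 1.5, a 1 GHz clock):

```python
def cpu_time(instructions: int, cpi: float, cycle_time_s: float) -> float:
    """time/program = instructions x cycles/instruction x time/cycle"""
    return instructions * cpi * cycle_time_s

# Hypothetical program: 1e9 instructions, CPI = 1.5, 1 ns cycle (1 GHz).
print(cpu_time(1_000_000_000, 1.5, 1e-9))  # 1.5 (seconds)
```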
Improving performance
Shorten clock cycle time - circuit design style/Microarchitecture
CPI - Microarchitecture and ISA
Instruction count - ISA and compiler technology
Dynamic power
1/2 x capacitive load x voltage^2 x frequency switched
Static power
Static current x voltage
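Both power equations as code, with hypothetical values (1 nF effective switched capacitance, 1.0 V supply, 2 GHz, 0.2 A leakage) chosen only to exercise the formulas:

```python
def dynamic_power(c_load: float, voltage: float, freq_switched: float) -> float:
    # P_dyn = 1/2 * C * V^2 * f
    return 0.5 * c_load * voltage ** 2 * freq_switched

def static_power(i_static: float, voltage: float) -> float:
    # P_static = I_static * V
    return i_static * voltage

print(dynamic_power(1e-9, 1.0, 2e9))  # 1.0 (W)
print(static_power(0.2, 1.0))         # 0.2 (W)
```

Note that dynamic power grows with V^2, which is why the voltage/frequency scaling in the next card is so effective at saving energy.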
Reducing power consumption
Reduce performance to save energy by scaling voltage/frequency.
Reduce wastage by eliminating superfluous switching (clock gating, operand isolation), by optimising for the common case, and by power gating.
Components of the ISA
Word size, operations, registers, operand types, addressing modes, instruction encodings, memory architecture, trap and interrupt architecture
Changing ISAs in embedded processors
The ISA has a significant impact because area and cost constraints mean the transistor and design budgets are limited. Recompilation is not a big hurdle: many embedded devices run one set of binaries for their whole life, product lifetimes are short, and we want to squeeze as much as possible out of the compiler.
How can we break link between applications and ISA?
JIT VM technologies such as .NET and the JVM
Alternatively support conventional source ISA and translate at run time to target ISA.
Problems with microcode controlling
Efficiently controlling a highly concurrent data path using microcode is complex.
Handling exceptions in a pipelined CISC machine is complicated
Microcode engine represents an unnecessary overhead especially when executing simple instructions
RISC Design Goals
Choose common instructions and addressing modes
Target efficient pipelined implementation by ensuring instructions go through the same pipeline stages and executing them in a single cycle if possible.
Assume use of high-level languages - the compiler optimises register usage and instruction schedule.
Produce simple high-frequency implementation.
Load-store architecture
Memory can only be accessed with load and store instructions. This permits simple fixed length instructions simplifying decoding.
It also simplifies pipelining as instructions take a similar time to execute and memory is accessed at most once in one pipeline stage.
Register calling conventions
A simple convention designates temporary and saved registers. This gives both the caller and callee a chance to reduce unnecessary register spilling. Temporary registers are not preserved by the callee, while saved registers must be saved and restored by the callee before use.
Register Windows
Some processors use register windows to improve the performance of procedure call/return sequences by avoiding the need to explicitly spill registers to memory.
Addressing modes
Register
Immediate
Register Indirect with displacement
Register indirect
Condition registers and flags
Options are condition codes, condition register and compare and branch.
Trade-offs:
Impact on local code scheduling optimisations
Use of general purpose registers
Number of instructions required to implement conditional branch
Cycle time implications
Encoding an instruction set
Balance:
Number of registers supported
Number of addressing modes supported
Against:
Size of instructions and compiled program
Fetch and decoding logic complexity and pipeline complexity in general
Options are:
Variable length
Fixed length
Hybrid format
16-bit instruction set extensions
Address a subset of operations, addressing modes and registers but allow for static code size reductions of 25-40%.
They also incur a 10-20% performance penalty.