final dump Flashcards
UMA
single shared memory equally accessible to all processors
uniform (same) latency for memory accesses
latency grows large in a network with many processors and memory modules
crossbar interconnect used to connect processors and memory
NUMA
local memory for each processor
low latency when accessing local memory, higher latency when accessing remote memory
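A minimal sketch of exploiting NUMA locality on Linux with libnuma (a real library; the node number 0 and the buffer size are purely illustrative):

```c
/* Allocate memory on a specific NUMA node so the CPUs on that node
   get the low local-memory latency (link with -lnuma). */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return EXIT_FAILURE;
    }
    size_t size = 1 << 20;                  /* 1 MiB, illustrative */
    void *buf = numa_alloc_onnode(size, 0); /* pin allocation to node 0 */
    if (!buf) return EXIT_FAILURE;
    /* ... threads running on node 0 access buf at local latency ... */
    numa_free(buf, size);
    return EXIT_SUCCESS;
}
```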
temporal locality
When an information item (instruction or data) is first needed, it should be brought into the cache and kept there, as it is likely to be needed again soon
spatial locality
Most program time is spent looping through the same block of instructions, so it is useful to fetch several items located at adjacent addresses, as they are likely to be used together
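Both kinds of locality show up in the most ordinary C loop; a minimal sketch:

```c
#include <stddef.h>

long sum(const long *a, size_t n) {
    long s = 0;            /* s is reused every iteration: temporal locality */
    for (size_t i = 0; i < n; i++) {
        s += a[i];         /* a[i], a[i+1], ... sit at adjacent addresses:
                              spatial locality; one cache-line fill serves
                              several consecutive iterations */
    }
    return s;
}
```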
fully associative cache
any memory block can be placed in any cache block
+ very flexible
- slow to search (every tag must be compared)
direct mapping
each memory block mapped to a specific cache block
+ fast to search
- inflexible
n-way set-associative
each set holds n cache blocks; a memory block maps to one set but can occupy any block within it
+/- reasonably fast to search
+/- reasonably flexible
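A sketch of how these placements differ, via the address split a direct-mapped cache performs (the block size, block count and address are assumptions, not values from the cards):

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE   64u   /* bytes per block -> low address bits = offset */
#define NUM_BLOCKS  256u   /* cache blocks    -> next bits = index         */

int main(void) {
    uint32_t addr   = 0x12345678u;                      /* example address */
    uint32_t offset = addr % BLOCK_SIZE;                /* byte within block */
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_BLOCKS; /* which cache block */
    uint32_t tag    = addr / BLOCK_SIZE / NUM_BLOCKS;   /* identifies the memory block */
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    /* n-way set-associative: index selects a set instead of a single block,
       and the block may live in any of the set's n ways.
       Fully associative: no index bits at all; every tag is compared. */
    return 0;
}
```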
LRU
the cache keeps timestamps of accesses
on a miss, replace the least recently used block
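A toy sketch of timestamp-based LRU for one n-way set (the struct, names and set size are invented for illustration):

```c
#include <stdint.h>

#define WAYS 4

struct line { uint32_t tag; int valid; uint64_t last_used; };

static uint64_t now;  /* logical clock, ticks on every access */

/* Returns the way that hit, or the victim way chosen on a miss. */
int access(struct line set[WAYS], uint32_t tag) {
    now++;
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {  /* hit */
            set[w].last_used = now;               /* refresh timestamp */
            return w;
        }
        if (set[w].last_used < set[victim].last_used)
            victim = w;                           /* track the oldest line */
    }
    /* miss: evict the least recently used (oldest) line */
    set[victim].tag = tag;
    set[victim].valid = 1;
    set[victim].last_used = now;
    return victim;
}
```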
DMA
process of transferring data between main memory and a hardware subsystem (I/O device) without involving the processor
cycle stealing
a method the DMA engine applies to avoid monopolising the memory bus: it takes (steals) single bus cycles in between the CPU's memory accesses
DMA +/-
+ delivers high bulk data performance
+ frees the CPU from doing bulk data transfers from device -> memory
- can interfere with CPU memory access
- requires extra intelligence (a standalone controller chip) in devices to access memory
DMA stages
- CPU programs the DMA engine of the storage device
- internal processing in the device
- DMA transfer from the device to memory
- interrupt to the CPU signals completion
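The four stages seen from the CPU's side, sketched against a hypothetical memory-mapped DMA engine (all register addresses, names and control bits are made up):

```c
#include <stdint.h>

#define DMA_SRC  (*(volatile uint32_t *)0x40000000u) /* device-side address */
#define DMA_DST  (*(volatile uint32_t *)0x40000004u) /* main-memory address */
#define DMA_LEN  (*(volatile uint32_t *)0x40000008u) /* bytes to move */
#define DMA_CTRL (*(volatile uint32_t *)0x4000000Cu) /* start / irq-enable bits */

void start_dma_read(uint32_t dev_addr, uint32_t mem_addr, uint32_t len) {
    /* Stage 1: the CPU programs the DMA engine */
    DMA_SRC  = dev_addr;
    DMA_DST  = mem_addr;
    DMA_LEN  = len;
    DMA_CTRL = 0x3;   /* start transfer, enable completion interrupt */
    /* Stages 2-3 happen without the CPU: the device does its internal
       processing, then the DMA engine moves the data into memory,
       stealing bus cycles as needed. */
}

void dma_irq_handler(void) {
    /* Stage 4: the interrupt tells the CPU the buffer is ready */
}
```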
memory mapped io
uses the same address space to address both memory and I/O devices
a given address refers either to a portion of RAM or to the memory and registers of an I/O device
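A minimal sketch of memory-mapped I/O in C; the address and register are invented, on real hardware they come from the device's datasheet:

```c
#include <stdint.h>

#define UART_TX (*(volatile uint8_t *)0x10000000u) /* hypothetical UART data register */

void putc_mmio(char c) {
    /* An ordinary store instruction; the address decoder routes it to
       the device instead of RAM. `volatile` stops the compiler from
       optimizing the access away or reordering it. */
    UART_TX = (uint8_t)c;
}
```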
port mapped io
I/O devices are mapped to a separate address space
-> a different set of signal lines indicates a memory access vs a port access
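On x86 that separate address space is reached with dedicated in/out instructions; a sketch using GCC/Clang inline assembly (0x3F8 is the conventional first serial port on PCs):

```c
#include <stdint.h>

static inline void outb(uint16_t port, uint8_t val) {
    /* `outb` writes one byte to the I/O port space, not to memory */
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

void putc_pmio(char c) {
    outb(0x3F8, (uint8_t)c);   /* separate I/O address space */
}
```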
memory mapped io +/-
+ the CPU requires less internal logic (no separate I/O instructions) -> cheaper
+ easier to build, faster, consumes less power and can be physically smaller
- address bus must be fully decoded for every device
port mapped io +/-
+ less logic needed to decode a discrete address, so adding hardware to the machine costs less
- more instructions required to accomplish same task
RISC
reduced instruction set computer
fewer and simpler instructions
emphasis on software
larger code, fewer cycles per instruction
transistors used more for registers rather than for complex instructions
more power efficient
single-word instructions
increases instructions per program, reduces CPI
CISC
complex instruction set computer
lots of complex instructions
memory-to-memory operations
multi-word instructions
instructions may take multiple clock cycles
more energy hungry
increases CPI, reduces instructions per program
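The RISC and CISC cards are two sides of the classic CPU-time equation (the worked numbers below are made up to show the direction of each effect, they are not from the cards):

$$T_{CPU} = IC \times CPI \times T_{clock}$$

RISC lowers CPI at the price of a higher instruction count IC; CISC lowers IC at the price of a higher CPI. At the same clock, a RISC-style run of 1.5M instructions at CPI 1.2 takes 1.8M cycles, while a CISC-style run of 1.0M instructions at CPI 2.5 takes 2.5M cycles; which design wins depends entirely on the actual IC and CPI.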
ISA trade-offs
complexity, performance, energy use, security
bubble
one clock cycle of idle time inserted into the pipeline (a stall)
What recognises data dependencies?
the control unit
superscalar processors
a CPU that implements a form of parallelism called instruction-level parallelism within a single processor
increased throughput
contain multiple execution units
superscalar vs pipelining
multiple instructions executed in parallel using multiple execution units
<->
multiple instructions executed in parallel by the same units, by dividing them into phases (pipeline stages)
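What instruction-level parallelism looks like from the software side, as a sketch (the function names are invented):

```c
/* Independent operations a superscalar CPU can issue in the same
   cycle, vs a dependent chain it cannot. */
long ilp_friendly(const long *a, const long *b) {
    long x = a[0] + a[1];   /* independent of y: can issue in parallel */
    long y = b[0] + b[1];   /* on a second execution unit */
    return x + y;
}

long ilp_hostile(long v) {
    v = v * 3;   /* each step needs the previous result: */
    v = v * 3;   /* a data-dependency chain, one unit at a time */
    return v;
}
```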
crossbar
network that provides a direct link between any pair of units connected to the network
-> used in UMA to connect processors and memory
enables simultaneous transfers as long as no two requests target the same unit