CSCI 223 Virtual Memory, Instruction Pipelining, Parallelism Flashcards
used in “simple” systems like embedded microcontrollers in devices like cars, elevators, and digital picture frames
a system using physical addressing
used in all modern servers, desktops, laptops, mobile processors; one of the great ideas in computer science; address translation process
a system using virtual addressing
why virtual memory? (4 major benefits)
- uses main memory efficiently
- simplifies memory management
- isolates address spaces (memory protection/security)
- makes programming so much easier
virtual memory as a tool for caching
VM is an array of N contiguous bytes as seen by compiled programs; the array is stored on disk, and its contents are cached in physical memory (DRAM); these cache blocks are called pages
unit of data movement between main memory and disk
page
consequences of the large miss penalty for DRAM cache (VM) organization
large page size (typically 4-8KB), fully associative (any VP can be placed in any PP), highly sophisticated and expensive replacement algorithms (too complicated and open-ended to be implemented in hardware), write-back rather than write-through
page fault penalty is
enormous b/c disk is ~10,000x slower than DRAM
page table
an array of page table entries (PTEs) that maps virtual pages to physical pages (a per-process kernel data structure in DRAM)
page hit
reference to VM word that is in physical memory
page fault
reference to VM word that is not in physical memory
handling a page fault
causes an exception handled by the OS: the page fault handler selects a victim page to evict, pages in the referenced page, and the offending instruction is restarted, now yielding a page hit
virtual memory works because of
locality
working set
the set of virtual pages a program is actively accessing at any point in time (programs with better temporal locality have smaller working sets)
if (working set size < main memory size)
good performance for one process after compulsory misses
if (SUM(working set sizes) > main memory size)
thrashing: performance meltdown where pages are swapped in and out continuously
VM as a tool for memory management
key idea: each process has its own virtual address space (it can view memory as a simple linear array; mapping function scatters addresses through physical memory)
memory allocation: each virtual page can be mapped to any physical page; a virtual page can be stored in different physical pages at different times
sharing code and data among processes: map virtual pages to the same physical page
address translation: page hit
1. processor sends virtual address to MMU
2-3. MMU fetches PTE from page table in memory
4. MMU sends physical address to cache/memory
5. cache/memory sends data word to processor
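A minimal C sketch of what the MMU does in hardware on a hit, assuming 4 KB pages and a single flat page table; page_table, pte_t, and PTE_VALID are illustrative names, not a real kernel's layout:

    #include <stdint.h>

    #define PAGE_SIZE  4096u                 /* assumed 4 KB pages */
    #define PAGE_SHIFT 12                    /* log2(PAGE_SIZE)    */
    #define PTE_VALID  0x1u                  /* hypothetical valid bit */

    typedef struct {
        uint64_t ppn;                        /* physical page number        */
        uint64_t flags;                      /* valid/dirty/permission bits */
    } pte_t;

    extern pte_t page_table[];               /* hypothetical flat table, indexed by VPN */

    /* translate a virtual address to a physical address (steps 2-4 above) */
    uint64_t translate(uint64_t vaddr) {
        uint64_t vpn    = vaddr >> PAGE_SHIFT;      /* virtual page number     */
        uint64_t offset = vaddr & (PAGE_SIZE - 1);  /* page offset (unchanged) */
        pte_t pte = page_table[vpn];                /* fetch PTE from memory   */
        if (!(pte.flags & PTE_VALID))
            return (uint64_t)-1;                    /* would trigger a page fault */
        return (pte.ppn << PAGE_SHIFT) | offset;    /* physical address */
    }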
address translation: page fault
1. processor sends virtual address to MMU
2-3. MMU fetches PTE from page table in memory
4. valid bit is zero, so MMU triggers page fault exception
5. handler identifies victim page (and, if dirty, pages it out to disk)
6. handler pages in new page and updates PTE in memory
7. handler returns to original process, restarting the faulting instruction
speeding up translation with a
translation lookaside buffer (TLB)
TLB contains
complete page table entries for a small number of pages
TLB hit
eliminates a memory access (the PTE comes from the TLB instead of memory)
TLB miss
incurs an additional memory access (the PTE)
this is rare b/c of locality
VM as a tool for memory protection
extends PTEs with permission bits; the hardware checks these on every access and, if they are violated, the kernel sends the process SIGSEGV (segmentation fault)
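A tiny illustration of a protection violation, assuming a typical Linux/x86-64 build where string literals are mapped read-only; the store hits a page whose PTE lacks write permission, so the process receives SIGSEGV:

    char *msg = "hello";       /* literal typically lives in a read-only page */

    int main(void) {
        msg[0] = 'H';          /* write to a page without write permission    */
        return 0;              /* usually never reached: SIGSEGV is delivered */
    }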
instruction flow
read the instruction at the address specified by the PC, process it through the stages, update the program counter (hardware executes instructions sequentially)
execution stages
- fetch (read instruction from instruction memory)
- decode (understand instruction and registers)
- execute (compute value or address)
- memory (read or write data)
- write back (write program registers)
stage computation: arithmetic/logical operations
formulate instruction execution as sequence of simple steps, use same general form for all instructions
fetch: read instruction byte, read register byte, compute next PC
decode: read operand A, read operand B
execute: perform ALU operation, set condition code register
memory: read or write data
write back: write back result
instruction execution limitations
too slow to be practical; hardware units only active for fraction of clock cycle
solution to instruction execution limitations
instruction pipelining
idea of instruction pipelining
divide the process into independent stages; move objects through the stages in sequence; at any given time, multiple objects are being processed
limitation of instruction pipelining
nonuniform delays (throughput limited by slowest stage; other stages sit idle for much of the time; challenging to partition system into balanced stages)
pipeline speedup
if all stages are balanced (i.e., take the same time),
ETpipelined ≈ ETunpipelined / # stages + time for filling and exiting the pipeline
max speedup
number of stages
speedup due to
increased throughput
execution time of each instruction (decreases/increases) and why
increases slightly due to stage imbalance and increased hardware complexity
ETunpipelined/ETpipelined =
(ETinst * # inst) / [(ETinst * # inst) / # stages + Tfilling&exiting] ≈ # stages (when # inst is large)
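Worked example with assumed numbers: a 1000 ns nonpipelined instruction split into 5 balanced 200 ns stages. For 1,000,000 instructions, ETunpipelined = 10^6 * 1000 ns = 1 s, while ETpipelined ≈ (10^6 * 1000 ns) / 5 + 4 * 200 ns ≈ 0.2 s, so the speedup ≈ 5 = # stages.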
hazards
situations that prevent starting the next instruction in the next cycle
3 types of hazards
- structural hazard
- data hazard
- control hazard
structural hazard
a required resource does not exist or is busy
data hazard
need to wait for previous instruction to complete its data read/write
control hazard
deciding on control action depends on previous instruction
cause and solution of structural hazard
cause: conflict for use (or lack) of a hardware resource
solution: add more hardware
cause and solution of data hazard
cause: data dependence (an instruction depends on the completion of data access by a previous instruction)
solution: forwarding, code scheduling/reordering
4 possible data relations
RAW (read after write; true data dependence)
WAW (write after write; output, name data dependence)
WAR (write after read; anti, name data dependence)
RAR (read after read; not a dependence)
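A C-level illustration of the four relations (variable names are arbitrary; real hazards arise between the compiled instructions, but the dependences are the same):

    /* each comment names the relation with an earlier statement */
    void deps(int b, int c, int e) {
        int a = b + c;   /* writes a                                               */
        int d = a + e;   /* reads the a written above -> RAW (true dependence)     */
        a = e - 1;       /* rewrites a -> WAW (output), and WAR vs. the read above */
        int f = b + c;   /* re-reads b and c -> RAR (not a dependence)             */
        (void)d; (void)f;
    }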
forwarding (AKA bypassing)
use the result as soon as it is computed; don't wait for it to be stored in a register; requires extra connections in the hardware circuit
load-use data hazard
can’t always avoid stalls by forwarding (if value not computed when needed; can’t forward back in time)
code scheduling/reordering to avoid stalls
reorder code to avoid using a load result in the next instruction; can be done by the compiler or by hardware; lengthens the distance between the load and its use
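A hedged source-level sketch of the idea (the actual reordering happens on the compiled load and use instructions): moving independent work between a load and its use hides the load latency.

    /* before scheduling: the load result is used in the very next statement */
    int before(const int *p, int a, int b, int *z) {
        int x = *p;        /* load                                 */
        int y = x + 1;     /* uses x immediately -> load-use stall */
        *z = a * b;        /* independent work                     */
        return y;
    }

    /* after scheduling: the independent multiply separates load and use */
    int after(const int *p, int a, int b, int *z) {
        int x = *p;        /* load                         */
        *z = a * b;        /* fills the load delay         */
        int y = x + 1;     /* x is ready by now; no stall  */
        return y;
    }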
cause and solution of control hazard
cause: control flow instruction (branch determines flow of control)
solution: predict the branch outcome, fetch along the predicted path, and fetch the correct path if the prediction is wrong > misprediction penalty (use history/past outcomes to predict)
stall on branch
wait until branch outcome determined before fetching next instruction
branch prediction
longer pipelines can’t readily determine branch outcome early; control hazard stall penalty becomes unacceptable
predict outcome of branch; only stall if prediction is wrong
static branch prediction
done by compiler before execution; based on typical branch behavior; ex: loop and if-statement branches
dynamic branch prediction
hardware measures actual branch behavior (e.g., record recent history of each branch); assume future behavior will continue the trend (when wrong, stall while re-fetching and update history)
dynamic branch prediction idea
branch prediction buffer (history table) is introduced to keep track of recent branch behavior
an n-bit branch predictor uses n bits per entry to keep track of each branch's recent outcomes (e.g., 1-bit or 2-bit saturating counters)
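A minimal sketch of a dynamic predictor, assuming a table of 2-bit saturating counters indexed by low PC bits (table size and index hash are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define BHT_ENTRIES 1024u                    /* assumed history-table size  */

    static uint8_t bht[BHT_ENTRIES];             /* 2-bit counters, values 0..3 */

    /* predict taken when the counter is in a "taken" state (2 or 3) */
    bool predict_taken(uint64_t pc) {
        return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
    }

    /* update the counter with the actual outcome after the branch resolves */
    void update_predictor(uint64_t pc, bool taken) {
        uint8_t *ctr = &bht[(pc >> 2) % BHT_ENTRIES];
        if (taken  && *ctr < 3) (*ctr)++;
        if (!taken && *ctr > 0) (*ctr)--;
    }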
pipelining improves performance by
increasing instruction throughput (executes multiple instructions in parallel; each instruction has the same latency)
to achieve better performance than with single-issue processors (IPC <= 1), we need to ?; the solution is
complete multiple instructions per clock cycle; multiple issue processors
two types of multiple issue processors (same goal but different approaches)
superscalar processors and VLIW (very long instruction word) processors
superscalar processor types
statically scheduled - @ compile time
dynamically scheduled - @ run time (more is known at run time, so scheduling can be better)
superscalar processors
issue varying # instructions per clock cycle, execute either in-order or out-of-order (OOO), hardware takes care of all dependences, complex hardware
example of superscalar processors
most desktop/laptop processors
VLIW processors
issue a fixed # of instructions formatted as one large instruction bundle scheduled by the compiler; compiler checks dependences and groups instructions into bundles; complex compiler but simpler hardware
example of VLIW processors
Intel Itanium, some GPUs and DSPs
3 types of parallelism
ILP (Instruction Level Parallelism) - instruction pipelining, branch prediction, multiple issue, etc.
TLP (Thread (or Task) Level Parallelism) - multi-core, etc.
DLP (Data Level Parallelism) - vector processor, GPUs, etc.
Flynn’s Taxonomy
- SISD (single instruction single data): uniprocessor
- SIMD (single instruction multiple data): vector processor, multimedia extension, GPU
- MISD (multiple instruction single data): no commercial processor
- MIMD (multiple instruction multiple data): multicore
TLP motivation
power and silicon cost grew faster than performance in exploiting ILP in 2000-2005
Thread-Level Parallelism
have multiple program counters, uses the MIMD model, targeted at tightly coupled shared-memory multiprocessors; for n processors, need n threads
two types of processors for TLP
symmetric multiprocessors (SMP), distributed shared memory (DSM)
symmetric multiprocessors (SMP)
small number of cores, share a single memory with uniform memory latency, UMA architecture (uniform memory access)
distributed shared memory (DSM)
memory distributed among processors, processors connected via direct (switched) and nondirect (multi-hop) interconnection networks, NUMA architecture (nonuniform memory access)
the term shared memory associated with both symmetric multiprocessors (SMP) and distributed shared memory (DSM) refers to the fact that
address space is shared
in both symmetric multiprocessors (SMP) and distributed shared memory (DSM), communication between threads occurs through
a shared memory space
in contrast to symmetric multiprocessors (SMP) and distributed shared memory (DSM), clusters and warehouse-scale computers use ? to communicate data among processes
message-passing protocols (MPI: message passing interface)
TLP multicore processors
can execute multiple threads simultaneously; threads are created by programmers or software systems; MIMD; shared memory system (virtual/physical memory) through a shared address space
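A minimal shared-memory TLP sketch using POSIX threads; the thread count and the work split are illustrative:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4                 /* e.g., one thread per core     */

    static long partial[NTHREADS];     /* results live in shared memory */

    static void *worker(void *arg) {
        long id = (long)arg, sum = 0;
        for (long i = id; i < 1000000; i += NTHREADS)   /* interleaved share of the work */
            sum += i;
        partial[id] = sum;
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        long total = 0;
        for (long i = 0; i < NTHREADS; i++) {
            pthread_join(tid[i], NULL);
            total += partial[i];
        }
        printf("total = %ld\n", total);
        return 0;
    }

Compile and link with -pthread (e.g., gcc -pthread).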
DLP
comes from simultaneous operations across large sets of data
SIMD architectures can exploit significant data-level parallelism for
matrix-oriented scientific computing, media-oriented image and sound processing
SIMD is (less/more) energy efficient than MIMD because
more; it only needs to fetch one instruction per data operation, which makes SIMD attractive for personal mobile devices
SIMD allows programmer to continue to think
sequentially
3 variations of DLP architecture
- vector architectures
- SIMD extensions
- Graphics Processor Units (GPUs)
SIMD Extensions implementations
Intel MMX (1996), Streaming SIMD Extensions (SSE) (1999), Advanced Vector Extensions (AVX) (2010); operands must be in consecutive and aligned memory locations
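A hedged example using the SSE intrinsics shipped with these extensions (the function name is illustrative; assumes n is a multiple of 4 and the arrays are 16-byte aligned, matching the alignment requirement above):

    #include <xmmintrin.h>                       /* SSE intrinsics */

    /* add two float arrays four elements at a time */
    void add_f32(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);      /* load 4 consecutive, aligned floats */
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb));   /* 4 adds in one instruction */
        }
    }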
basic idea of vector architectures
- read sets of data elements scattered in memory into “large sequential vector registers”
- operate on those registers
- disperse the results back into memory
vector architectures: dozens of register-register operations on independent data elements; registers are controlled by ? to ?
compiler; hide memory latency, leverage memory bandwidth
why SIMD extensions?
media applications operate on data types different from the native word size
limitations of SIMD extensions compared to vector instructions
number of data operands is encoded in the opcode, no sophisticated addressing modes (strided, scatter-gather), no mask registers
GPU
Graphics Processing Unit
GPU main tasks
graphics rendering and GPGPU (General-Purpose computing on GPUs, available on almost all modern GPUs, including mobile)
GPGPU basic idea
also known as GPU computing; heterogeneous execution model (CPU host, GPU device); a C-like programming language for the GPU; programming model is "Single Instruction Multiple Thread" (SIMT)
two dominant programming languages for GPGPU
CUDA (NVIDIA) and OpenCL (industry standard)
benefits to GPU
- accessibility (almost every computer has one)
- interoperability with display API
- small form factor
- scalability (both HW and SW are designed for scalability)
hardware and scalability
an array of independent compute units
software and scalability
an array of thread blocks (work-groups)
ILP on ? core, TLP on ? core, DLP on ? core
single-core, multi-core, many-core
program optimizations
algorithm level (biggest effect), data representation level, coding level, compiler level, hardware level
loop unrolling
- replicate the loop body multiple times (see the sketch below)
- decreases the # of branch-related instructions (fewer control hazards)
- enables better instruction scheduling
- disadvantage: increases code size
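A sketch of unrolling by a factor of 4 (assumes n is a multiple of 4 for brevity):

    /* original loop: one compare-and-branch per element */
    long sum(const int *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* unrolled by 4: one compare-and-branch per 4 elements, more scheduling freedom */
    long sum_unrolled(const int *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i += 4) {
            s += a[i];
            s += a[i + 1];
            s += a[i + 2];
            s += a[i + 3];
        }
        return s;
    }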
gdb
text-based GNU debugging tool
gprof
text-based GNU profiling tool; useful for determining which parts of the program code are time consuming
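Typical usage: compile and link with -pg (e.g., gcc -pg -o prog prog.c), run ./prog once to produce gmon.out, then run gprof prog gmon.out to see the flat profile and call graph.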