CSCI 223 Virtual Memory, Instruction Pipelining, Parallelism Flashcards

1
Q

used in “simple” systems like embedded microcontrollers in devices like cars, elevators, and digital picture frames

A

a system using physical addressing

2
Q

used in all modern servers, desktops, laptops, mobile processors; one of the great ideas in computer science; address translation process

A

a system using virtual addressing

3
Q

why virtual memory? (4 major benefits)

A
  1. uses main memory efficiently
  2. simplifies memory management
  3. isolates address spaces (memory protection/security)
  4. makes programming so much easier
4
Q

virtual memory as a tool for caching

A

VM is an array of N contiguous bytes used by compiled programs; the array is stored on disk, and its contents are cached in physical memory (DRAM); these cache blocks are called pages

5
Q

data movement between main memory and disk

A

page

6
Q

consequences of the enormous page fault penalty (DRAM cache organization)

A

large page size (typically 4-8KB), fully associative (any VP can be placed in any PP), highly sophisticated and expensive replacement algorithms (too complicated and open-ended to be implemented in hardware), write-back rather than write-through

7
Q

page fault penalty is

A

enormous b/c disk is ~10,000x slower than DRAM

8
Q

page table

A

an array of page table entries (PTEs) that maps virtual pages to physical pages (a per-process kernel data structure in DRAM)

9
Q

page hit

A

reference to VM word that is in physical memory

10
Q

page fault

A

reference to VM word that is not in physical memory

11
Q

handling a page fault

A

causes an exception handled by the system: the page fault handler selects a victim page to be evicted, pages in the requested page, and the offending instruction is restarted, now yielding a page hit

12
Q

virtual memory works because of

A

locality

13
Q

working set

A

at any point in time, these are the set of active virtual pages that programs tend to access (programs with better temporal locality will have smaller working sets)

14
Q

if (working set size < main memory size)

A

good performance for one process after compulsory misses

15
Q

if (SUM(working set sizes) > main memory size)

A

thrashing: performance meltdown where pages are swapped in and out continuously

16
Q

VM as a tool for memory management

A

key idea: each process has its own virtual address space (it can view memory as a simple linear array; mapping function scatters addresses through physical memory)
memory allocation: each virtual page can be mapped to any physical page; a virtual page can be stored in different physical pages at different times
sharing code and data among processes: map virtual pages to the same physical page

17
Q

address translation: page hit

A
  1. processor sends virtual address to MMU
  2-3. MMU fetches PTE from page table in memory
  4. MMU sends physical address to cache/memory
  5. cache/memory sends data word to processor
18
Q

address translation: page fault

A
  1. processor sends virtual address to MMU
  2-3. MMU fetches PTE from page table in memory
  4. valid bit is zero, so MMU triggers page fault exception
  5. handler identifies victim (and, if dirty, pages it out to disk)
  6. handler pages in new page and updates PTE in memory
  7. handler returns to original process, restarting faulting instruction
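To make the translation steps concrete, here is a minimal C sketch of a single-level page-table walk (the real walk is done in hardware by the MMU, and real page tables are multi-level; the field names and 4 KB page size are illustrative assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                          /* 4 KB pages -> 12 offset bits */
#define PAGE_MASK  ((1ull << PAGE_SHIFT) - 1)

typedef struct {
    bool     valid;                            /* is the virtual page resident in DRAM? */
    uint64_t ppn;                              /* physical page number */
} pte_t;

/* Translate a virtual address; return all-ones to signal a page fault. */
uint64_t translate(const pte_t *page_table, uint64_t vaddr) {
    uint64_t vpn = vaddr >> PAGE_SHIFT;        /* virtual page number indexes the PTE */
    uint64_t vpo = vaddr & PAGE_MASK;          /* page offset passes through unchanged */
    pte_t pte = page_table[vpn];
    if (!pte.valid)
        return ~0ull;                          /* page fault: handler pages in the data, then the access is retried */
    return (pte.ppn << PAGE_SHIFT) | vpo;      /* physical address = PPN : page offset */
}
```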
19
Q

speeding up translation with a

A

translation lookaside buffer (TLB)

20
Q

TLB contains

A

complete page table entries for a small number of pages

21
Q

TLB hit

A

eliminates a memory access (the PTE is found in the TLB)

22
Q

TLB miss

A

incurs an additional memory access (the PTE)

this is rare b/c locality

23
Q

VM as a tool for memory protection

A

extends PTEs with permission bits; the MMU checks these bits on every access and, if a permission is violated, the kernel sends the process SIGSEGV (segmentation fault)

24
Q

instruction flow

A

read the instruction at the address specified by the PC, process it through the stages, update the program counter (hardware executes instructions sequentially)

25
execution stages
1. fetch (read instruction from instruction memory) 2. decode (understand instruction and registers) 3. execute (compute value or address) 4. memory (read or write data) 5. write back (write program registers)
26
stage computation: arithmetic/logical operations
formulate instruction execution as a sequence of simple steps; use the same general form for all instructions | fetch: read instruction byte, read register byte, compute next PC | decode: read operand A, read operand B | execute: perform ALU operation, set condition code register | memory: no memory access for arithmetic/logical operations | write back: write back result
27
instruction execution limitations
too slow to be practical; hardware units only active for fraction of clock cycle
28
solution to instruction execution limitations
instruction pipelining
29
idea of instruction pipelining
divide the process into independent stages; move objects through the stages in sequence; at any given time, multiple objects are being processed
30
limitation of instruction pipelining
nonuniform delays (throughput limited by slowest stage; other stages sit idle for much of the time; challenging to partition system into balanced stages)
31
pipeline speedup
if all stages are balanced (i.e., take the same time): ET_pipelined = ET_nonpipelined / (# stages) + time for filling and exiting the pipeline
32
max speedup
number of stages
33
speedup due to
increased throughput
34
execution time of each instruction (decreases/increases) and why
increases slightly due to stage imbalance and increased hardware complexity
35
ETunpipelined/ETpipelined =
(ET_inst * # inst) / [(ET_inst * # inst) / # stages + T_filling&exiting] ≈ # stages
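A quick worked example (assuming a 5-stage pipeline with 1 ns per stage and 1,000 instructions):
ET_unpipelined = 5 ns * 1,000 = 5,000 ns
ET_pipelined ≈ (5 ns * 1,000) / 5 + 4 ns (filling/exiting) = 1,004 ns
speedup = 5,000 / 1,004 ≈ 4.98 ≈ 5 = # stages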
36
hazards
situations that prevent starting the next instruction in the next cycle
37
3 types of hazards
1. structural hazard 2. data hazard 3. control hazard
38
structural hazard
a required resource does not exist or is busy
39
data hazard
need to wait for previous instruction to complete its data read/write
40
control hazard
deciding on control action depends on previous instruction
41
cause and solution of structural hazard
cause: conflict for use (or lack) of a hardware resource | solution: add more hardware
42
cause and solution of data hazard
cause: data dependence (an instruction depends on the completion of a data access by a previous instruction) | solution: forwarding, code scheduling/reordering
43
4 possible data relations
RAW (read after write; true data dependence) | WAW (write after write; output, a name dependence) | WAR (write after read; anti, a name dependence) | RAR (read after read; not a dependence)
44
forwarding (AKA bypassing)
use a result as soon as it is computed; don't wait for it to be stored in a register; requires extra connections in the hardware circuit
45
load-use data hazard
can't always avoid stalls by forwarding (if the value is not yet computed when it is needed, it can't be forwarded back in time)
46
code scheduling/reordering to avoid stalls
reorder code to avoid using a load result in the next instruction; can be done by the compiler or by hardware; lengthens the distance between the load and the use of its result
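A minimal C sketch of the reordering idea (function and variable names are illustrative; the actual reordering happens on the generated load/use instructions):

```c
/* before: the loaded value is used by the very next instruction -> load-use stall */
int stalled(const int *a, const int *b, int i) {
    int x = a[i];         /* load */
    int y = x + 1;        /* uses x immediately: pipeline must stall */
    int z = b[i] * 2;     /* independent work */
    return y + z;
}

/* after: independent work fills the slot between the load and its use */
int scheduled(const int *a, const int *b, int i) {
    int x = a[i];         /* load */
    int z = b[i] * 2;     /* does not depend on x */
    int y = x + 1;        /* x is available by now: no stall */
    return y + z;
}
```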
47
cause and solution of control hazard
cause: control flow instructions (a branch determines the flow of control) | solution: predict the branch outcome and fetch along the predicted path; if the prediction is wrong, fetch the correct path and pay a penalty (use history/past outcomes to predict)
48
stall on branch
wait until branch outcome determined before fetching next instruction
49
branch prediction
longer pipelines can't readily determine the branch outcome early, so the control hazard stall penalty becomes unacceptable; instead, predict the outcome of the branch and stall only if the prediction is wrong
50
static branch prediction
done by compiler before execution; based on typical branch behavior; ex: loop and if-statement branches
51
dynamic branch prediction
hardware measures actual branch behavior (e.g., record recent history of each branch); assume future behavior will continue the trend (when wrong, stall while re-fetching and update history)
52
dynamic branch prediction idea
a branch prediction buffer (history table) is introduced to keep track of recent branch behavior; an n-bit branch predictor uses n bits per entry to keep track
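As an illustration of the history-table idea, a minimal C sketch of a 2-bit saturating-counter predictor (the table size and PC indexing are illustrative assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 1024
static uint8_t counters[TABLE_SIZE];          /* 0..3: strongly not-taken .. strongly taken */

bool predict_taken(uint64_t pc) {
    return counters[pc % TABLE_SIZE] >= 2;    /* upper half of the range -> predict taken */
}

void record_outcome(uint64_t pc, bool taken) {
    uint8_t *c = &counters[pc % TABLE_SIZE];
    if (taken  && *c < 3) (*c)++;             /* saturate at 3 */
    if (!taken && *c > 0) (*c)--;             /* saturate at 0 */
}
```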
53
pipelining improves performance by
increasing instruction throughput (executes multiple instructions in parallel; each instruction has the same latency)
54
to achieve better performance than with single-issue processors (IPC <= 1), we need to ?, so the solution is
complete multiple instructions per clock cycle; multiple issue processors
55
two types of multiple issue processors (same goal but different approaches)
superscalar processors and VLIW (very long instruction word) processors
56
superscalar processor types
statically scheduled (at compile time) | dynamically scheduled (at run time; more is known at run time, so scheduling can be better)
57
superscalar processors
issue varying # instructions per clock cycle, execute either in-order or out-of-order (OOO), hardware takes care of all dependences, complex hardware
58
example of superscalar processors
most desktop/laptop processors
59
VLIW processors
issue a fixed # of instructions formatted as one large instruction bundle scheduled by the compiler; compiler checks dependences and groups instructions into bundles; complex compiler but simpler hardware
60
example of VLIW processors
Intel Itanium, some GPUs and DSPs
61
3 types of parallelisms
ILP (Instruction Level Parallelism) - instruction pipelining, branch prediction, multiple issue, etc. | TLP (Thread (or Task) Level Parallelism) - multi-core, etc. | DLP (Data Level Parallelism) - vector processors, GPUs, etc.
62
Flynn's Taxonomy
1. SISD (single instruction single data): uniprocessor 2. SIMD (single instruction multiple data): vector processor, multimedia extension, GPU 3. MISD (multiple instruction single data): no commercial processor 4. MIMD (multiple instruction multiple data): multicore
63
TLP motivation
power and silicon cost grew faster than performance in exploiting ILP in 2000-2005
64
Thread-Level Parallelism
has multiple program counters, uses the MIMD model, targeted at tightly-coupled shared-memory multiprocessors; for n processors, need n threads
65
two types of processors for TLP
symmetric multiprocessors (SMP), distributed shared memory (DSM)
66
symmetric multiprocessors (SMP)
small number of cores, share a single memory with uniform memory latency, UMA architecture (uniform memory access)
67
distributed shared memory (DSM)
memory distributed among processors, processors connected via direct (switched) and nondirect (multi-hop) interconnection networks, NUMA architecture (nonuniform memory access)
68
the term shared memory associated with both symmetric multiprocessors (SMP) and distributed shared memory (DSM) refers to the fact that
address space is shared
69
in both symmetric multiprocessors (SMP) and distributed shared memory (DSM), communication between threads occurs through
a shared memory space
70
in contrast to symmetric multiprocessors (SMP) and distributed shared memory (DSM), clusters and warehouse-scale computers use ? to communicate data among processes
message-passing protocols (MPI: message passing interface)
71
TLP multicore processors
can execute multiple threads simultaneously; threads are created by programmers or software systems; MIMD; shared memory system (virtual/physical memory) through a shared address space
72
DLP
comes from simultaneous operations across large sets of data
73
SIMD architectures can exploit significant data-level parallelism for
matrix-oriented scientific computing, media-oriented image and sound processing
74
SIMD is (less/more) energy efficient than MIMD because
more; it only needs to fetch one instruction per data operation, which makes SIMD attractive for personal mobile devices
75
SIMD allows programmer to continue to think
sequentially
76
3 variations of DLP architecture
1. vector architectures 2. SIMD extensions 3. Graphics Processor Units (GPUs)
77
Intel MMX (1996), Streaming SIMD Extensions (SSE) (1999), Advanced Vector Extensions (AVX) (2010); operands must be in consecutive and aligned memory locations
SIMD Extensions implementations
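A minimal C sketch using SSE intrinsics from <xmmintrin.h> (function and array names are illustrative; n is assumed to be a multiple of 4):

```c
#include <xmmintrin.h>    /* SSE intrinsics */

/* Add two float arrays four elements at a time.
   _mm_loadu_ps/_mm_storeu_ps tolerate unaligned data; the aligned variants
   (_mm_load_ps/_mm_store_ps) require the 16-byte alignment mentioned above. */
void add_floats(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);              /* load 4 consecutive floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));     /* 4 additions at once */
    }
}
```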
78
basic idea of vector architectures
1. read sets of data elements scattered in memory into "large sequential vector registers" 2. operate on those registers 3. disperse the results back into memory
79
vector architectures: dozens of register-register operations on independent data elements; registers are controlled by ? to ?
compiler; hide memory latency, leverage memory bandwidth
80
why SIMD extensions?
media applications operate on data types different from the native word size
81
limitations of SIMD extensions compared to vector instructions
number of data operands is encoded in the opcode, no sophisticated addressing modes (strided, scatter-gather), no mask registers
82
GPU
Graphics Processing Unit
83
GPU main tasks
graphics rendering and GPGPU (General-Purpose computing on GPUs, available on almost all modern GPUs, including mobile)
84
GPGPU basic idea
also known as GPU computing; basic idea: heterogeneous execution model (CPU host, GPU device), develop a C-like programming language for GPU, programming model is "Single Instruction Multiple Thread (SIMT)"
85
two dominant programming languages for GPGPU
CUDA (NVIDIA) and OpenCL (industry standard)
86
benefits to GPU
1. accessibility (almost every computer has one) 2. interoperability with display API 3. small form factor 4. scalability (both HW and SW are designed for scalability)
87
hardware and scalability
an array of independent compute units
88
software and scalability
an array of thread blocks (work-groups)
89
basic idea of vector architectures
1. read sets of data elements scattered in memory into "large sequential vector registers" 2. operate on these registers 3. disperse the results back into memory
90
registers are controlled by
compiler
91
registers are used to
hide memory latency, leverage memory bandwidth
92
GPU main tasks
1. graphics rendering | 2. GPGPU
93
basic idea of GPGPU
heterogeneous execution model, develop a C-like programming language for the GPU, programming model is Single Instruction Multiple Thread (SIMT)
94
ILP on ? core, TLP on ? core, DLP on ? core
single, multi, many
95
benefits to GPU
accessibility, interoperability, small form factor, scalability
96
program optimizations
algorithm level (biggest effect), data representation level, coding level, compiler level, hardware level
97
loop unrolling
replicate the loop body multiple times; decreases the number of branch-related instructions (less control hazard) and enables better instruction scheduling; disadvantage: increases code size
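A minimal C sketch of 4x unrolling (function and array names are illustrative):

```c
/* baseline: one branch per element */
long sum(const int *a, long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* unrolled 4x: one branch per four elements, more scheduling freedom, larger code */
long sum_unrolled4(const int *a, long n) {
    long s = 0, i;
    for (i = 0; i + 3 < n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s += a[i];
    return s;
}
```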
98
gdb
text-based GNU debugging tool
99
gprof
text-based GNU profiling tool; useful for determining which parts of the program code are time-consuming