CSCI 223 Virtual Memory, Instruction Pipelining, Parallelism Flashcards

1
Q

used in “simple” systems like embedded microcontrollers in devices like cars, elevators, and digital picture frames

A

a system using physical addressing

2
Q

used in all modern servers, desktops, laptops, and mobile processors; one of the great ideas in computer science; relies on an address translation process

A

a system using virtual addressing

3
Q

why virtual memory? (4 major benefits)

A
  1. uses main memory efficiently
  2. simplifies memory management
  3. isolates address spaces (memory protection/security)
  4. makes programming so much easier
4
Q

virtual memory as a tool for caching

A

VM is conceptually an array of N contiguous bytes stored on disk (the address space that compiled programs use); the contents of the array on disk are cached in physical memory (DRAM), and these cache blocks are called pages

5
Q

unit of data movement between main memory and disk

A

page

6
Q

consequences of the enormous miss penalty for DRAM cache (page) organization

A

large page size (typically 4-8KB), fully associative (any VP can be placed in any PP), highly sophisticated and expensive replacement algorithms (too complicated and open-ended to be implemented in hardware), write-back rather than write-through

7
Q

page fault penalty is

A

enormous b/c disk is ~10,000x slower than DRAM

8
Q

page table

A

an array of page table entries (PTEs) that maps virtual pages to physical pages (a per-process kernel data structure in DRAM)

9
Q

page hit

A

reference to VM word that is in physical memory

10
Q

page fault

A

reference to VM word that is not in physical memory

11
Q

handling a page fault

A

causes an exception handled by the system: the page fault handler selects a victim page to evict, pages in the needed page, and the offending instruction is restarted, now yielding a page hit

12
Q

virtual memory works because of

A

locality

13
Q

working set

A

the set of active virtual pages that a program tends to access at any point in time (programs with better temporal locality have smaller working sets)

14
Q

if (working set size < main memory size)

A

good performance for one process after compulsory misses

15
Q

if (SUM(working set sizes) > main memory size)

A

thrashing: performance meltdown where pages are swapped in and out continuously

16
Q

VM as a tool for memory management

A

key idea: each process has its own virtual address space (it can view memory as a simple linear array; mapping function scatters addresses through physical memory)
memory allocation: each virtual page can be mapped to any physical page; a virtual page can be stored in different physical pages at different times
sharing code and data among processes: map virtual pages to the same physical page

17
Q

address translation: page hit

A
  1. processor sends virtual address to MMU
  2-3. MMU fetches PTE from page table in memory
  4. MMU sends physical address to cache/memory
  5. cache/memory sends data word to processor
18
Q

address translation: page fault

A
  1. processor sends virtual address to MMU
  2-3. MMU fetches PTE from page table in memory
  4. valid bit is zero, so MMU triggers page fault exception
  5. handler identifies victim (and, if dirty, pages it out to disk)
  6. handler pages in new page and updates PTE in memory
  7. handler returns to original process, restarting faulting instruction (translation sketched in C below)
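
A minimal C sketch of the translation in the two cards above: split the virtual address into VPN and offset, look up the PTE, fault if the valid bit is zero, then form the physical address. The page size, PTE layout, and the names PAGE_SHIFT, pte_t, and handle_page_fault are illustrative assumptions, not the course's actual hardware description.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                          /* assume 4 KB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

typedef struct {
    bool     valid;                            /* is the virtual page resident in DRAM? */
    bool     dirty;                            /* written since it was paged in? */
    uint64_t ppn;                              /* physical page number */
} pte_t;

/* Hypothetical OS handler: evicts a victim (paging it out if dirty),
   pages in the needed page, and updates the PTE. */
extern void handle_page_fault(pte_t *page_table, uint64_t vpn);

/* What the MMU does for one access, expressed in software. */
uint64_t translate(pte_t *page_table, uint64_t vaddr)
{
    uint64_t vpn    = vaddr >> PAGE_SHIFT;     /* virtual page number */
    uint64_t offset = vaddr & (PAGE_SIZE - 1); /* page offset passes through unchanged */

    if (!page_table[vpn].valid)                /* valid bit is zero: page fault */
        handle_page_fault(page_table, vpn);    /* on return the PTE is valid (page hit) */

    return (page_table[vpn].ppn << PAGE_SHIFT) | offset;   /* physical address */
}
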
19
Q

speeding up translation with a

A

translation lookaside buffer (TLB)

20
Q

TLB contains

A

complete page table entries for a small number of pages

21
Q

TLB hit

A

eliminates a memory access (the PTE fetch)

22
Q

TLB miss

A

incurs an additional memory access (the PTE)

this is rare b/c locality
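
A rough C sketch of why a TLB hit eliminates the PTE memory access: a small table of cached translations is checked first, and only a miss falls back to walking the page table in memory. The table size, direct-mapped indexing, and the name page_table_walk are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64                       /* assumed size of a tiny TLB */

typedef struct {
    bool     valid;
    uint64_t vpn;                            /* tag: which virtual page is cached */
    uint64_t ppn;                            /* cached translation */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Hypothetical fallback: fetch the PTE from the page table in memory. */
extern uint64_t page_table_walk(uint64_t vpn);

uint64_t lookup_ppn(uint64_t vpn)
{
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];

    if (e->valid && e->vpn == vpn)           /* TLB hit: no memory access for the PTE */
        return e->ppn;

    uint64_t ppn = page_table_walk(vpn);     /* TLB miss: one extra memory access */
    e->valid = true;                         /* cache the translation for next time */
    e->vpn   = vpn;
    e->ppn   = ppn;
    return ppn;
}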

23
Q

VM as a tool for memory protection

A

extends PTEs with permission bits; the MMU checks these bits on each access and, if they are violated, the process is sent SIGSEGV (segmentation fault)
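
A small Linux/POSIX illustration of permission bits in action (illustrative only; error handling omitted and a 4 KB page size assumed): after mprotect removes write permission, the store below violates the page's permissions and the process receives SIGSEGV.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Get one anonymous page with read/write permission. */
    size_t page = 4096;
    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    p[0] = 'x';                        /* fine: the page is writable */

    mprotect(p, page, PROT_READ);      /* drop write permission on the page */

    p[0] = 'y';                        /* protection violation: process gets SIGSEGV */

    printf("never reached\n");
    return 0;
}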

24
Q

instruction flow

A

read instruction at address specified by PC, process it through the stages, update the program counter (hardware executes instructions sequentially)

25
Q

execution stages

A
  1. fetch (read instruction from instruction memory)
  2. decode (understand instruction and registers)
  3. execute (compute value or address)
  4. memory (read or write data)
  5. write back (write program registers)
26
Q

stage computation: arithmetic/logical operations

A

formulate instruction execution as sequence of simple steps, use same general form for all instructions

fetch: read instruction byte, read register byte, compute next PC
decode: read operand A, read operand B
execute: perform ALU operation, set condition code register
memory: no data memory access for arithmetic/logical instructions
write back: write back result

27
Q

instruction execution limitations

A

too slow to be practical; hardware units only active for fraction of clock cycle

28
Q

solution to instruction execution limitations

A

instruction pipelining

29
Q

idea of instruction pipelining

A

divide the process into independent stages; move objects through the stages in sequence; at any given time, multiple objects are being processed

30
Q

limitation of instruction pipelining

A

nonuniform delays (throughput limited by slowest stage; other stages sit idle for much of the time; challenging to partition system into balanced stages)

31
Q

pipeline speedup

A

if all stages are balanced (i.e., take the same time),

ETpipelined = (ETnonpipelined / # stages) + time for filling and exiting

32
Q

max speedup

A

number of stages

33
Q

speedup due to

A

increased throughput

34
Q

execution time of each instruction (decreases/increases) and why

A

increases slightly due to stage imbalance and increased hardware complexity

35
Q

ETunpipelined/ETpipelined =

A

(ETinst * # inst) / [(ETinst * # inst) / # stages + Tfilling&exiting] ≈ # stages (when # inst is large)
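
A worked example with illustrative numbers (not from the course): 1,000 instructions, 5 balanced stages, 200 ps per stage.

ETunpipelined = 1,000 * (5 * 200 ps) = 1,000,000 ps
ETpipelined = (5 + 999) * 200 ps = 200,800 ps (the extra 4 cycles are the filling time)
ETunpipelined / ETpipelined = 1,000,000 / 200,800 ≈ 4.98 ≈ # stages

As # inst grows, the speedup approaches the number of stages (5 here).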

36
Q

hazards

A

situations that prevent starting the next instruction in the next cycle

37
Q

3 types of hazards

A
  1. structural hazard
  2. data hazard
  3. control hazard
38
Q

structural hazard

A

a required resource does not exist or is busy

39
Q

data hazard

A

need to wait for previous instruction to complete its data read/write

40
Q

control hazard

A

deciding on the control action (which instruction to fetch next) depends on the outcome of a previous instruction

41
Q

cause and solution of structural hazard

A

cause: conflict for use (or lack) of a hardware resource
solution: add more hardware

42
Q

cause and solution of data hazard

A

cause: data dependence (an instruction depends on the completion of data access by a previous instruction)
solution: forwarding, code scheduling/reordering

43
Q

4 possible data relations

A

RAW (read after write; true data dependence)
WAW (write after write; output, name data dependence)
WAR (write after read; anti, name data dependence)
RAR (read after read; not a dependence)
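
Tiny C fragments showing the first three relations on a single variable (variable names are arbitrary); RAR imposes no ordering, which is why it is not a dependence.

void dependences(void)
{
    int a = 1, b = 2, c = 3;

    a = b + c;    /* write a                                                  */
    b = a + 1;    /* RAW: b reads a after the write above (true dependence)   */

    a = b + c;    /* write a                                                  */
    a = c * 2;    /* WAW: a written again (output, a name dependence)         */

    b = a + 1;    /* read a                                                   */
    a = c * 2;    /* WAR: a written after the read above (anti, name dep.)    */

    (void)a; (void)b; (void)c;   /* silence unused-value warnings */
}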

44
Q

forwarding (AKA bypassing)

A

use a result as soon as it is computed; don't wait for it to be stored in a register; requires extra connections in the hardware datapath

45
Q

load-use data hazard

A

forwarding can't always avoid stalls (if the value has not been computed when it is needed, we can't forward back in time)

46
Q

code scheduling/reordering to avoid stalls

A

reorder code to avoid using a load result in the immediately following instruction; can be done by the compiler or the hardware; lengthens the distance between the load and its use (see the sketch below)
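
A hedged C-level picture of the idea; in reality the compiler or hardware reorders machine instructions, not source lines. Moving an independent load between a load and its use gives the first load time to complete.

/* Before scheduling: the use of x follows its load immediately,
   so the pipeline would stall on the load-use hazard.            */
void before(int *p, int *q, int *sum1, int *sum2)
{
    int x = p[0];          /* load */
    *sum1 = x + 1;         /* use of x right after the load -> stall */
    int y = q[0];          /* independent load */
    *sum2 = y + 1;
}

/* After scheduling: the independent load is moved between the
   first load and its use, hiding the load latency.              */
void after(int *p, int *q, int *sum1, int *sum2)
{
    int x = p[0];          /* load */
    int y = q[0];          /* independent load fills the gap */
    *sum1 = x + 1;         /* x is ready by now */
    *sum2 = y + 1;
}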

47
Q

cause and solution of control hazard

A

cause: control flow instruction (branch determines flow of control)
solution: predict the branch outcome and fetch along the predicted path; if the prediction is wrong, fetch the correct instruction and pay a penalty (use the branch's history/past outcomes to predict)

48
Q

stall on branch

A

wait until branch outcome determined before fetching next instruction

49
Q

branch prediction

A

longer pipelines can’t readily determine branch outcome early; control hazard stall penalty becomes unacceptable
predict outcome of branch; only stall if prediction is wrong

50
Q

static branch prediction

A

done by compiler before execution; based on typical branch behavior; ex: loop and if-statement branches

51
Q

dynamic branch prediction

A

hardware measures actual branch behavior (e.g., record recent history of each branch); assume future behavior will continue the trend (when wrong, stall while re-fetching and update history)

52
Q

dynamic branch prediction idea

A

a branch prediction buffer (branch history table) is introduced to keep track of recent branch behavior
an l-bit branch predictor uses l bits per branch to keep track of its recent outcomes
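
A C sketch of one common design, a table of 2-bit saturating counters indexed by part of the branch PC; the table size and indexing are illustrative assumptions, not necessarily the predictor covered in class.

#include <stdint.h>
#include <stdbool.h>

#define BHT_ENTRIES 1024                     /* assumed branch history table size */

/* 2-bit counters: 0,1 = predict not taken; 2,3 = predict taken. */
static uint8_t bht[BHT_ENTRIES];

bool predict(uint64_t pc)
{
    return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
}

/* After the branch resolves, nudge the counter toward the actual outcome. */
void update(uint64_t pc, bool taken)
{
    uint8_t *c = &bht[(pc >> 2) % BHT_ENTRIES];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}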

53
Q

pipelining improves performance by

A

increasing instruction throughput (executes multiple instructions in parallel; each instruction has the same latency)

54
Q

to achieve better performance than with single-issue processors (IPC <= 1), we need to ?, so the solution is

A

complete multiple instructions per clock cycle; multiple issue processors

55
Q

two types of multiple issue processors (same goal but different approaches)

A

superscalar processors and VLIW (very long instruction word) processors

56
Q

superscalar processor types

A

statically scheduled - @ compile time

dynamically scheduled - @ run time (more is known at run time, so scheduling can be better)

57
Q

superscalar processors

A

issue varying # instructions per clock cycle, execute either in-order or out-of-order (OOO), hardware takes care of all dependences, complex hardware

58
Q

example of superscalar processors

A

most desktop/laptop processors

59
Q

VLIW processors

A

issue a fixed # of instructions formatted as one large instruction bundle scheduled by the compiler; compiler checks dependences and groups instructions into bundles; complex compiler but simpler hardware

60
Q

example of VLIW processors

A

Intel Itanium, some GPUs and DSPs

61
Q

3 types of parallelisms

A

ILP (Instruction Level Parallelism) - instruction pipelining, branch prediction, multiple issue, etc.
TLP (Thread (or Task) Level Parallelism) - multi-core, etc.
DLP (Data Level Parallelism) - vector processor, GPUs, etc.

62
Q

Flynn’s Taxonomy

A
  1. SISD (single instruction single data): uniprocessor
  2. SIMD (single instruction multiple data): vector processor, multimedia extension, GPU
  3. MISD (multiple instruction single data): no commercial processor
  4. MIMD (multiple instruction multiple data): multicore
63
Q

TLP motivation

A

power and silicon cost grew faster than performance in exploiting ILP in 2000-2005

64
Q

Thread-Level Parallelism

A

has multiple program counters, uses the MIMD model, targeted for tightly coupled shared-memory multiprocessors; for n processors, need n threads

65
Q

two types of processors for TLP

A

symmetric multiprocessors (SMP), distributed shared memory (DSM)

66
Q

symmetric multiprocessors (SMP)

A

small number of cores, share a single memory with uniform memory latency, UMA architecture (uniform memory access)

67
Q

distributed shared memory (DSM)

A

memory distributed among processors, processors connected via direct (switched) and nondirect (multi-hop) interconnection networks, NUMA architecture (nonuniform memory access)

68
Q

the term shared memory associated with both symmetric multiprocessors (SMP) and distributed shared memory (DSM) refers to the fact that

A

address space is shared

69
Q

in both symmetric multiprocessors (SMP) and distributed shared memory (DSM), communication between threads occurs through

A

a shared memory space

70
Q

in contrast to symmetric multiprocessors (SMP) and distributed shared memory (DSM), clusters and warehouse-scale computers use ? to communicate data among processes

A

message-passing protocols (MPI: message passing interface)

71
Q

TLP multicore processors

A

can execute multiple threads simultaneously; threads are created by programmers or software systems; MIMD; shared memory system (virtual/physical memory) through a shared address space
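
A minimal POSIX-threads sketch of the idea (array size and thread count are illustrative): two threads created by the programmer run on separate cores and communicate their results through the shared address space. Compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int data[N];          /* shared: both threads see the same address space */

struct range { int lo, hi; long sum; };

static void *partial_sum(void *arg)
{
    struct range *r = arg;
    for (int i = r->lo; i < r->hi; i++)
        r->sum += data[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++) data[i] = 1;

    struct range a = {0, N / 2, 0}, b = {N / 2, N, 0};
    pthread_t t1, t2;

    pthread_create(&t1, NULL, partial_sum, &a);   /* one thread per core (MIMD) */
    pthread_create(&t2, NULL, partial_sum, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("%ld\n", a.sum + b.sum);   /* results communicated via shared memory */
    return 0;
}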

72
Q

DLP

A

comes from simultaneous operations across large sets of data

73
Q

SIMD architectures can exploit significant data-level parallelism for

A

matrix-oriented scientific computing, media-oriented image and sound processors

74
Q

SIMD is (less/more) energy efficient than MIMD because

A

more; SIMD only needs to fetch one instruction per data operation, which makes it attractive for personal mobile devices

75
Q

SIMD allows programmer to continue to think

A

sequentially

76
Q

3 variations of DLP architecture

A
  1. vector architectures
  2. SIMD extensions
  3. Graphics Processor Units (GPUs)
77
Q

Intel MMX (1996), Streaming SIMD Extensions (SSE) (1999), Advanced Vector Extensions (AVX) (2010); operands must be in consecutive and aligned memory locations

A

SIMD Extensions implementations
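
A short example using one of the families named above, SSE, via compiler intrinsics; it assumes an x86 CPU with SSE and uses the unaligned loads for safety (the aligned variants _mm_load_ps/_mm_store_ps require 16-byte-aligned operands, matching the alignment note on this card).

#include <xmmintrin.h>     /* SSE intrinsics */

/* c[i] = a[i] + b[i], four floats at a time (assumes n is a multiple of 4). */
void add_sse(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);              /* load 4 consecutive floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));     /* one SIMD add does 4 adds */
    }
}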

78
Q

basic idea of vector architectures

A
  1. read sets of data elements scattered in memory into “large sequential vector registers”
  2. operate on those registers
  3. disperse the results back into memory
79
Q

vector architectures: dozens of register-register operations on independent data elements; registers are controlled by ? to ?

A

compiler; hide memory latency, leverage memory bandwidth

80
Q

why SIMD extensions?

A

media applications operate on data types different from the native word size

81
Q

limitations of SIMD extensions compared to vector instructions

A

number of data operands encoded in the opcode; no sophisticated addressing modes (strided, scatter-gather); no mask registers

82
Q

GPU

A

Graphics Processing Unit

83
Q

GPU main tasks

A

graphics rendering and GPGPU (General-Purpose computing on GPUs, available on almost all modern GPUs, including mobile)

84
Q

GPGPU basic idea

A

also known as GPU computing; basic idea:
heterogeneous execution model (CPU host, GPU device), develop a C-like programming language for GPU, programming model is “Single Instruction Multiple Thread (SIMT)”

85
Q

two dominant programming languages for GPGPU

A

CUDA (NVIDIA) and OpenCL (industry standard)

86
Q

benefits to GPU

A
  1. accessibility (almost every computer has one)
  2. interoperability with display API
  3. small form factor
  4. scalability (both HW and SW are designed for scalability)
87
Q

hardware and scalability

A

an array of independent compute units

88
Q

software and scalability

A

an array of thread blocks (work-groups)

89
Q

basic idea of vector architectures

A
  1. read sets of data elements scattered in memory into “large sequential vector registers”
  2. operate on these registers
  3. disperse the results back into memory
90
Q

registers are controlled by

A

compiler

91
Q

registers are used to

A

hide memory latency, leverage memory bandwidth

92
Q

GPU main tasks

A
  1. graphics rendering

2. GPGPU

93
Q

basic idea of GPGPU

A

heterogeneous execution model, develop a C-like programming language for the GPU, programming model is Single Instruction Multiple Thread (SIMT)

94
Q

ILP on ? core, TLP on ? core, DLP on ? core

A

single, multi, many

95
Q

benefits to GPU

A

accessibility, interoperability, small form factor, scalability

96
Q

program optimizations

A

algorithm level (biggest effect), data representation level, coding level, compiler level, hardware level

97
Q

loop unrolling

A

replicate the loop body multiple times (sketched in C below)
decreases # of branch-related instructions
fewer control hazards
better instruction scheduling

disadvantage: increases code size
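
A C sketch of the card above, unrolling by 4 (the factor and the assumption that n is a multiple of 4 are illustrative): the unrolled version executes one loop test and branch per four elements instead of per element.

/* Original loop: one add plus one loop test/branch per element. */
long sum(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: one loop test/branch per 4 elements, more room for scheduling. */
/* Assumes n is a multiple of 4; a real version needs a cleanup loop.            */
long sum_unrolled(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return s;
}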

98
Q

gdb

A

text-based GNU debugging tool

99
Q

gprof

A

text-based GNU profiling tool; useful for determining which parts of the program code are time-consuming