CSCI 223 Virtual Memory, Instruction Pipelining, Parallelism Flashcards
used in “simple” systems like embedded microcontrollers in devices like cars, elevators, and digital picture frames
a system using physical addressing
used in all modern servers, desktops, laptops, mobile processors; one of the great ideas in computer science; address translation process
a system using virtual addressing
why virtual memory? (4 major benefits)
- uses main memory efficiently
- simplifies memory management
- isolates address spaces (memory protection/security)
- makes programming so much easier
virtual memory as a tool for caching
VM is an array of N contiguous bytes as seen by compiled programs; the array is stored on disk, and its contents are cached in physical memory (DRAM); these cache blocks are called pages
unit of data movement between main memory and disk
page
consequences of the large miss penalty for DRAM cache (VM) organization
large page size (typically 4-8KB), fully associative (any VP can be placed in any PP), highly sophisticated and expensive replacement algorithms (too complicated and open-ended to be implemented in hardware), write-back rather than write-through
page fault penalty is
enormous b/c disk is ~10,000x slower than DRAM
page table
an array of page table entries (PTEs) that maps virtual pages to physical pages (a per-process kernel data structure in DRAM)
page hit
reference to VM word that is in physical memory
page fault
reference to VM word that is not in physical memory
handling a page fault
causes an exception handled by the OS: the page fault handler selects a victim page to evict, pages in the referenced page, and the offending instruction is restarted, now yielding a page hit
virtual memory works because of
locality
working set
the set of virtual pages a program is actively accessing at any point in time (programs with better temporal locality have smaller working sets)
if (working set size < main memory size)
good performance for one process after compulsory misses
if (SUM(working set sizes) > main memory size)
thrashing: performance meltdown where pages are swapped in and out continuously
VM as a tool for memory management
key idea: each process has its own virtual address space (it can view memory as a simple linear array; mapping function scatters addresses through physical memory)
memory allocation: each virtual page can be mapped to any physical page; a virtual page can be stored in different physical pages at different times
sharing code and data among processes: map virtual pages to the same physical page
address translation: page hit
1. processor sends virtual address to MMU
2-3. MMU fetches PTE from page table in memory
4. MMU sends physical address to cache/memory
5. cache/memory sends data word to processor
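A minimal C sketch of what the MMU does in hardware on a hit, assuming 4 KB pages and a single flat page table; page_table, pte_t, and PTE_VALID are illustrative names, not a real kernel's layout:

    #include <stdint.h>

    #define PAGE_SIZE  4096u                 /* assumed 4 KB pages */
    #define PAGE_SHIFT 12                    /* log2(PAGE_SIZE)    */
    #define PTE_VALID  0x1u                  /* hypothetical valid bit */

    typedef struct {
        uint64_t ppn;                        /* physical page number        */
        uint64_t flags;                      /* valid/dirty/permission bits */
    } pte_t;

    extern pte_t page_table[];               /* hypothetical flat table, indexed by VPN */

    /* translate a virtual address to a physical address (steps 2-4 above) */
    uint64_t translate(uint64_t vaddr) {
        uint64_t vpn    = vaddr >> PAGE_SHIFT;      /* virtual page number     */
        uint64_t offset = vaddr & (PAGE_SIZE - 1);  /* page offset (unchanged) */
        pte_t pte = page_table[vpn];                /* fetch PTE from memory   */
        if (!(pte.flags & PTE_VALID))
            return (uint64_t)-1;                    /* would trigger a page fault */
        return (pte.ppn << PAGE_SHIFT) | offset;    /* physical address */
    }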
address translation: page fault
1. processor sends virtual address to MMU
2-3. MMU fetches PTE from page table in memory
4. valid bit is zero, so MMU triggers page fault exception
5. handler identifies victim page (and, if dirty, pages it out to disk)
6. handler pages in new page and updates PTE in memory
7. handler returns to original process, restarting the faulting instruction
speeding up translation with a
translation lookaside buffer (TLB)
TLB contains
complete page table entries for a small number of pages
TLB hit
eliminates a memory access (the PTE comes from the TLB instead of memory)
TLB miss
incurs an additional memory access (the PTE)
this is rare b/c of locality
VM as a tool for memory protection
extends PTEs with permission bits; the hardware checks these on every access and, if they are violated, the kernel sends the process SIGSEGV (segmentation fault)
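A tiny illustration of a protection violation, assuming a typical Linux/x86-64 build where string literals are mapped read-only; the store hits a page whose PTE lacks write permission, so the process receives SIGSEGV:

    char *msg = "hello";       /* literal typically lives in a read-only page */

    int main(void) {
        msg[0] = 'H';          /* write to a page without write permission    */
        return 0;              /* usually never reached: SIGSEGV is delivered */
    }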
instruction flow
read the instruction at the address specified by the PC, process it through the stages, update the program counter (hardware executes instructions sequentially)
execution stages
- fetch (read instruction from instruction memory)
- decode (understand instruction and registers)
- execute (compute value or address)
- memory (read or write data)
- write back (write program registers)
stage computation: arithmetic/logical operations
formulate instruction execution as sequence of simple steps, use same general form for all instructions
fetch: read instruction byte, read register byte, compute next PC
decode: read operand A, read operand B
execute: perform ALU operation, set condition code register
memory: read or write data
write back: write back result
instruction execution limitations
too slow to be practical; hardware units only active for fraction of clock cycle
solution to instruction execution limitations
instruction pipelining
idea of instruction pipelining
divide the process into independent stages; move objects through the stages in sequence; at any given time, multiple objects are being processed
limitation of instruction pipelining
nonuniform delays (throughput limited by slowest stage; other stages sit idle for much of the time; challenging to partition system into balanced stages)
pipeline speedup
if all stages are balanced (i.e., take the same time),
ETpipelined ≈ ETunpipelined / # stages + time for filling and exiting the pipeline
max speedup
number of stages
speedup due to
increased throughput
execution time of each instruction (decreases/increases) and why
increases slightly due to stage imbalance and increased hardware complexity
ETunpipelined/ETpipelined =
(ETinst * # inst) / [(ETinst * # inst) / # stages + Tfilling&exiting] ≈ # stages (when # inst is large)
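Worked example with assumed numbers: a 1000 ns nonpipelined instruction split into 5 balanced 200 ns stages. For 1,000,000 instructions, ETunpipelined = 10^6 * 1000 ns = 1 s, while ETpipelined ≈ (10^6 * 1000 ns) / 5 + 4 * 200 ns ≈ 0.2 s, so the speedup ≈ 5 = # stages.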
hazards
situations that prevent starting the next instruction in the next cycle
3 types of hazards
- structural hazard
- data hazard
- control hazard
structural hazard
a required resource does not exist or is busy
data hazard
need to wait for previous instruction to complete its data read/write
control hazard
deciding on control action depends on previous instruction
cause and solution of structural hazard
cause: conflict for use (or lack) of a hardware resource
solution: add more hardware
cause and solution of data hazard
cause: data dependence (an instruction depends on the completion of data access by a previous instruction)
solution: forwarding, code scheduling/reordering
4 possible data relations
RAW (read after write; true data dependence)
WAW (write after write; output, name data dependence)
WAR (write after read; anti, name data dependence)
RAR (read after read; not a dependence)
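A C-level illustration of the four relations (variable names are arbitrary; real hazards arise between the compiled instructions, but the dependences are the same):

    /* each comment names the relation with an earlier statement */
    void deps(int b, int c, int e) {
        int a = b + c;   /* writes a                                               */
        int d = a + e;   /* reads the a written above -> RAW (true dependence)     */
        a = e - 1;       /* rewrites a -> WAW (output), and WAR vs. the read above */
        int f = b + c;   /* re-reads b and c -> RAR (not a dependence)             */
        (void)d; (void)f;
    }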
forwarding (AKA bypassing)
use the result as soon as it is computed; don't wait for it to be stored in a register; requires extra connections in the hardware circuit
load-use data hazard
can’t always avoid stalls by forwarding (if value not computed when needed; can’t forward back in time)
code scheduling/reordering to avoid stalls
reorder code to avoid using a load result in the next instruction; can be done by the compiler or by hardware; lengthens the distance between the load and its use
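A hedged source-level sketch of the idea (the actual reordering happens on the compiled load and use instructions): moving independent work between a load and its use hides the load latency.

    /* before scheduling: the load result is used in the very next statement */
    int before(const int *p, int a, int b, int *z) {
        int x = *p;        /* load                                 */
        int y = x + 1;     /* uses x immediately -> load-use stall */
        *z = a * b;        /* independent work                     */
        return y;
    }

    /* after scheduling: the independent multiply separates load and use */
    int after(const int *p, int a, int b, int *z) {
        int x = *p;        /* load                         */
        *z = a * b;        /* fills the load delay         */
        int y = x + 1;     /* x is ready by now; no stall  */
        return y;
    }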
cause and solution of control hazard
cause: control flow instruction (branch determines flow of control)
solution: predict the branch outcome, fetch along the predicted path, and fetch the correct path if the prediction is wrong > misprediction penalty (use history/past outcomes to predict)
stall on branch
wait until branch outcome determined before fetching next instruction
branch prediction
longer pipelines can’t readily determine branch outcome early; control hazard stall penalty becomes unacceptable
predict outcome of branch; only stall if prediction is wrong
static branch prediction
done by compiler before execution; based on typical branch behavior; ex: loop and if-statement branches
dynamic branch prediction
hardware measures actual branch behavior (e.g., record recent history of each branch); assume future behavior will continue the trend (when wrong, stall while re-fetching and update history)
dynamic branch prediction idea
branch prediction buffer (history table) is introduced to keep track of recent branch behavior
an n-bit branch predictor uses n bits per entry to keep track of each branch's recent outcomes (e.g., 1-bit or 2-bit saturating counters)
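A minimal sketch of a dynamic predictor, assuming a table of 2-bit saturating counters indexed by low PC bits (table size and index hash are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define BHT_ENTRIES 1024u                    /* assumed history-table size  */

    static uint8_t bht[BHT_ENTRIES];             /* 2-bit counters, values 0..3 */

    /* predict taken when the counter is in a "taken" state (2 or 3) */
    bool predict_taken(uint64_t pc) {
        return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
    }

    /* update the counter with the actual outcome after the branch resolves */
    void update_predictor(uint64_t pc, bool taken) {
        uint8_t *ctr = &bht[(pc >> 2) % BHT_ENTRIES];
        if (taken  && *ctr < 3) (*ctr)++;
        if (!taken && *ctr > 0) (*ctr)--;
    }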
pipelining improves performance by
increasing instruction throughput (executes multiple instructions in parallel; each instruction has the same latency)
to achieve better performance than with single-issue processors (IPC <= 1), we need to ?; the solution is
complete multiple instructions per clock cycle; multiple issue processors
two types of multiple issue processors (same goal but different approaches)
superscalar processors and VLIW (very long instruction word) processors
superscalar processor types
statically scheduled - @ compile time
dynamically scheduled - @ run time (more is known at run time, so scheduling can be better)
superscalar processors
issue varying # instructions per clock cycle, execute either in-order or out-of-order (OOO), hardware takes care of all dependences, complex hardware
example of superscalar processors
most desktop/laptop processors
VLIW processors
issue a fixed # of instructions formatted as one large instruction bundle scheduled by the compiler; compiler checks dependences and groups instructions into bundles; complex compiler but simpler hardware
example of VLIW processors
Intel Itanium, some GPUs and DSPs
3 types of parallelism
ILP (Instruction Level Parallelism) - instruction pipelining, branch prediction, multiple issue, etc.
TLP (Thread (or Task) Level Parallelism) - multi-core, etc.
DLP (Data Level Parallelism) - vector processor, GPUs, etc.
Flynn’s Taxonomy
- SISD (single instruction single data): uniprocessor
- SIMD (single instruction multiple data): vector processor, multimedia extension, GPU
- MISD (multiple instruction single data): no commercial processor
- MIMD (multiple instruction multiple data): multicore
TLP motivation
power and silicon cost grew faster than performance in exploiting ILP in 2000-2005
Thread-Level Parallelism
have multiple program counters, uses the MIMD model, targeted at tightly coupled shared-memory multiprocessors; for n processors, need n threads
two types of processors for TLP
symmetric multiprocessors (SMP), distributed shared memory (DSM)
symmetric multiprocessors (SMP)
small number of cores, share a single memory with uniform memory latency, UMA architecture (uniform memory access)
distributed shared memory (DSM)
memory distributed among processors, processors connected via direct (switched) and nondirect (multi-hop) interconnection networks, NUMA architecture (nonuniform memory access)
the term shared memory associated with both symmetric multiprocessors (SMP) and distributed shared memory (DSM) refers to the fact that
address space is shared
in both symmetric multiprocessors (SMP) and distributed shared memory (DSM), communication between threads occurs through
a shared memory space
in contrast to symmetric multiprocessors (SMP) and distributed shared memory (DSM), clusters and warehouse-scale computers use ? to communicate data among processes
message-passing protocols (MPI: message passing interface)
TLP multicore processors
can execute multiple threads simultaneously; threads are created by programmers or software systems; MIMD; shared memory system (virtual/physical memory) through a shared address space
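A minimal shared-memory TLP sketch using POSIX threads; the thread count and the work split are illustrative:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4                 /* e.g., one thread per core     */

    static long partial[NTHREADS];     /* results live in shared memory */

    static void *worker(void *arg) {
        long id = (long)arg, sum = 0;
        for (long i = id; i < 1000000; i += NTHREADS)   /* interleaved share of the work */
            sum += i;
        partial[id] = sum;
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        long total = 0;
        for (long i = 0; i < NTHREADS; i++) {
            pthread_join(tid[i], NULL);
            total += partial[i];
        }
        printf("total = %ld\n", total);
        return 0;
    }

Compile and link with -pthread (e.g., gcc -pthread).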
DLP
comes from simultaneous operations across large sets of data
SIMD architectures can exploit significant data-level parallelism for
matrix-oriented scientific computing, media-oriented image and sound processing
SIMD is (less/more) energy efficient than MIMD because
more; it only needs to fetch one instruction per data operation, which makes SIMD attractive for personal mobile devices
SIMD allows programmer to continue to think
sequentially
3 variations of DLP architecture
- vector architectures
- SIMD extensions
- Graphics Processor Units (GPUs)
SIMD Extensions implementations
Intel MMX (1996), Streaming SIMD Extensions (SSE) (1999), Advanced Vector Extensions (AVX) (2010); operands must be in consecutive and aligned memory locations
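A hedged example using the SSE intrinsics shipped with these extensions (the function name is illustrative; assumes n is a multiple of 4 and the arrays are 16-byte aligned, matching the alignment requirement above):

    #include <xmmintrin.h>                       /* SSE intrinsics */

    /* add two float arrays four elements at a time */
    void add_f32(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);      /* load 4 consecutive, aligned floats */
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb));   /* 4 adds in one instruction */
        }
    }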
basic idea of vector architectures
- read sets of data elements scattered in memory into “large sequential vector registers”
- operate on those registers
- disperse the results back into memory
vector architectures: dozens of register-register operations on independent data elements; registers are controlled by ? to ?
compiler; hide memory latency, leverage memory bandwidth
why SIMD extensions?
media applications operate on data types different from the native word size
limitations of SIMD extensions compared to vector instructions
number of data operands is encoded in the opcode, no sophisticated addressing modes (strided, scatter-gather), no mask registers
GPU
Graphics Processing Unit
GPU main tasks
graphics rendering and GPGPU (General-Purpose computing on GPUs, available on almost all modern GPUs, including mobile)
GPGPU basic idea
also known as GPU computing; heterogeneous execution model (CPU host, GPU device); a C-like programming language for the GPU; programming model is "Single Instruction Multiple Thread" (SIMT)
two dominant programming languages for GPGPU
CUDA (NVIDIA) and OpenCL (industry standard)
benefits to GPU
- accessibility (almost every computer has one)
- interoperability with display API
- small form factor
- scalability (both HW and SW are designed for scalability)
hardware and scalability
an array of independent compute units
software and scalability
an array of thread blocks (work-groups)
ILP on ? core, TLP on ? core, DLP on ? core
single-core, multi-core, many-core
program optimizations
algorithm level (biggest effect), data representation level, coding level, compiler level, hardware level
loop unrolling
- replicate the loop body multiple times (see the sketch below)
- decreases the # of branch-related instructions (fewer control hazards)
- enables better instruction scheduling
- disadvantage: increases code size
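A sketch of unrolling by a factor of 4 (assumes n is a multiple of 4 for brevity):

    /* original loop: one compare-and-branch per element */
    long sum(const int *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* unrolled by 4: one compare-and-branch per 4 elements, more scheduling freedom */
    long sum_unrolled(const int *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i += 4) {
            s += a[i];
            s += a[i + 1];
            s += a[i + 2];
            s += a[i + 3];
        }
        return s;
    }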
gdb
text-based GNU debugging tool
gprof
text-based GNU profiling tool; useful for determining which parts of the program code are time consuming
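Typical usage: compile and link with -pg (e.g., gcc -pg -o prog prog.c), run ./prog once to produce gmon.out, then run gprof prog gmon.out to see the flat profile and call graph.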