Final Flashcards
What is the CPE and what does it represent?
Cycles per element, it represents how many clock cycles are required to complete a given round of a function
What are the Latency, Throughput, Capacity for addition and multiplication for Integer and Floating point
Integer:
Addition: Latency 1, Capacity 4, Throughput 0.5
Multiplication: Latency 3, Capacity 1, Throughput 1
Floating Point
Addition: Latency 3, Capacity 1, Throughput 1
Multiplication: 5, Capacity 2, Throughput 0.5
What is the difference between a throughput bound and a latency bound?
A latency bound can be overcome with parallelism and multiple functional units executing at once; However, a throughput bound cannot be overcome as it has to do with how fast memory can be transferred.
How do you draw a instruction sequence diagram?
First write out all the registers that will be used at the top and bottom in boxes in the same order. Then write out the assembly in boxes in order, remembering if they have a statement like (%rbx) or (%rbx, %rcx, 8) that will require a load function. Then, connect the registers to the functions and the functions that feed into each other through arrows. For registers that are referenced in the code but not updated make sure their arrow connects from their top box all the way to their bottom box, with an arrow jutting off where it gets used. For registers that are referenced in the code but do not have their starting value used, make sure they only have an arrow from the function that update their value, not their starting value. (Reference Assignment 1)
To then flatten this into the other form, find the registers that are updated during the code, and thus have an arrow that feeds into the function and come out of the function, but does connect from top to bottom. Write all such registers out. a decent chunk apart from each other. Next, write all relevant functions which those registers connect to and link them together.
How do you find a critical path from the diagram?
First, check for parallelism which could make the obvious critical path incorrect. After parallelism is accounted for, draw a line through the functions that each iteration of the program will have to wait for to proceed.
Usually the critical path is just whatever function everything feeds into
What is k x 1, k x k, and k x 1a loop unrolling and what are its benefits?
k x 1 loop unrolling is where you take k steps in the loop and make them into one single step
So for example in 3x1 loop unrolling where you need to multiply each entry in a by the corresponding entry in b and add them together you would have
a[i] * b[i] + a[i+1] * b[i+1] + a[i+2] * b[i+2]
kxk loop unrolling is where you take k steps in the loop and assign them to k accumulators while are later summed together
kx1a loop unrolling is where you place parenthesis around each individual operation which is being performed then summed in most cases such that they can be performed in parallel before addition. So going back to the kx1 example you would write the same thing except (a[i] * b[i]) + (a[i+1] * b[i+1]) + (a[i+2] * b[i+2])
How do you calculate all the T’s for a disk storage?
T_maxrotation = 1/RPM * 60 sec/minutes
T_avgrotation = T_maxrotation/2
T_avgseek = Usually Given/Can’t be determined without experiment
T_transfer = 1/RPM * 1/avgsectorspertrack * 60sec/minutes
T_access = T_avgrotation + T_avgseek + T_transfer
T_headpos = equation will be given but T_Avgseek + T_avgrotation to be safe
Capacity = bytes/sector * veragesectors/track * tracks/surface * surface/platter * platter/disk
T_bestcasefileread = T_headpos + (SizeOfFile * 1/BlockSize * Blocks/Sector * Sector/Track) * T_maxrotation
T_randomcasefileread = (SizeOfFile * 1/BlockSize * Blocks/Sector) * T_headpos
What is the general structure of a disk?
Sectors are the place where blocks of info are held, sectors are put onto a track (one track takes one rotation to go through) tracks are stored on platters, platters are stored on the disk.
What addresses are memory address from right to left?
Cache Offset, Cache Index, Cache Tag
What are the elements of a cache table and how do you calculate each? What are the fundamental parameters as well?
C = B * E * S
t = m - (s+b)
s = log_2(S)
b = log_2(B)
S: Number of Sets
E: Number of Lines per Set
B: Block Size
m: Number of memory bits
What are the two types of optimizations?
Local optimizations which help a singular step
Global Optimizations that affect the entire workflow pipeline
What are the 12 compiler optimization steps and what they mean?
Fold constants: Any expression with constant inputs should have those inputs folded to be a constant value. So for example if long mask = 0xFF «_space;8 we fold that to 0xFF00
Inline Expressions: Move the body of code from the function that another function is calling into the function that is calling it. Allows for greater optimizations to be made.
Eliminate Common Expressions: Reuse common portions of expressions to calculate other expressions, rather than recalculating every time. For example if val = (i-1) * n and val2 = (i+1) * n, we would rearrange the code to do the expensive i*n calculation and then simply add or subtract n to get the final result. This can only be done for simpler cases for to eliminate variables used multiple times in close succession that don’t need to be recalculated each time.
Restructure loops, move code out of loops: Take constant operations outside the loop so they don’t need to be performed multiple times. So if I need to calculate ni for a step but ni doesn’t change throughout the entire loop, move it outside the loop and create aa variable ni.
Reduce control flow to gotos: Covert loops to goto
Eliminate dead code: Don’t emit any code that will never be executed, don’t emit any code that has its result immediately overwritten.
Reduce operation strength: Replace expensive operations with cheaper ones. Generally target multiplication and division with a shift. For example int a = b*5 goes to int a = (b «_space;2) + b
Select instructions, schedule instructions
Instruction Level Pipelining/Parallelism: Loop Unrolling, more generally scheduling operations one after another to utilize multiple functional units. Blocked by control dependency where outcome it determined by a jump or data dependency a future step relies on the result of a previous step.
Allocate registers
Emit assembly language
What are some optimization blockers?
Procedure calls: treated as a black box by the compiler
Memory aliasing: Counter by having local variables accumulate within loops
What are SIMD registers?
Single Instruction Multiple Data Sets
They are registers that can be 1, 2, 4, and 8 byte. What’s is important is that an operation can be performing on all SIMD registers as the same time.
What is the basic idea and stages of pipelining? what are the limits
The idea of pipelining is that while compiling code, rather than doing each step one at a time for all lines, you compile each line one at a time and immediately feed it into the next step while simultaneously executing the previous step on the next line. The stages are fetch, decode, execute, memory, writeback, and PC update.
You cannot pipeline at times due to data dependencies from function to function