final Flashcards
Fixed-overhead encoding
[mask][instruction list]
Mask is 8 bits, where 1 represents active instruction and 0 represents a nop
[10100111][A][B][C][D][E] => [A][nop][B][nop][nop][C][D][E]
+simple, but (-) has higher overhead
Templated-based encoding
○ Have templates to represent common patterns
○ (+) Relatively low overhead
○ (+) no penalty for short instruction
○ (-) limited set of templates, so still need some nops
○ Example:
§ [temp0][A][B][C]
§ Temp0 = (0->0). (1->2). (2->5). Chain.
§ Expands to:
§ [A][nop][B][nop][nop][C]
§ “Map instruction 0 to location 0”
How does VLIW handle branching?
- Two-Step Branching -
○ Separate comparison from branch instruction (comparison results are stored in a branch-condition register)
○ Between compare and branch, can fill with other instructions
○ Branch instruction requires 2 delays slots
how does VLIW handle predication?
- Partial predication:
○ Execute all individual operations on both sides of the branch
○ Select the final based on a condition
○ Slct R3 = B1, R4, R3
If B1, then R3 = R4, otherwise R3 = R3
How does VLIW handle speculation?
○ Control speculation: move instruction out of branch condition
Data speculation: check if LD/ST are to the same address after loading and storing
Compiler front-end steps
- Scanner - reads tokens
- Parser - gets tokens from scanner and checks syntax (tokens are in right place)
Semantic Analysis - does the sentence make sense? (“aaa” + 10)
register allocation steps
- Web: Determine live ranges for each value
- Interference: Determine overlapping ranges => graph
- Spill cost (how many definitions/uses exist in that range): Compute the benefit of keeping each web in a register (vs memory)
- Allocation: Decide which webs get registers
- Spilling and splitting: split webs if needed
- Assignment: assign actual registers to webs
Code generation
webs
○ All reachable uses and definitions are placed in the same web (i.e. if a def/use case for that variable is not reachable by the web, it is placed in a different web)
interference graph
○ Each node in the graph represents a variable/web
○ Each edge represents two variables that are live at the same time (i.e. A-B means that A and B are live at the same time)
graph coloring heuristic
- Remove nodes that have degree < N and push onto a stack
- For remaining nodes (degree >= N): repeatedly remove and push onto stack the node with the least spill cost
○ ==> result is a stack with the nodes ordered from highest to lowest spill cost to nodes with deg < N
○ This enables us to prioritize coloring nodes with the highest spill code first - To color:
- Pop a node off the top of the stack and color it differently than its neighbors that are already colored
- If no color available, node is assigned to memory
choosing which web to split
○ Has interference degree >= N
○ Has minimal spill cost
register coalescing
if r0 and r1 do not interfere, combine their webs and assign them one register
(-) might create a web that is not colorable
register targeting
○ Some variables need to be in specific registers at a certain time (args to function, return value, etc)
○ Pre-color those webs to eliminate copy instructions
Presplitting of webs
○ Some variables have large “dead” regions (i.e. regions between def and use where the variable is not used)
○ ==> Can presplit the web into two live ranges
Interprocedural register allocation
○ Store registers across procedural boundaries is expensive
○ Can customize procedural call by cloning the function to save loading and storing a register
§ Each clone function is only available for that call site
what is differential register allocation?
Instead of using instruction bits to id a register, use less bits to id the difference between the last and next regsister (i.e. r1, r2 => 1, r2, r5 => 3)
adjacency graph
used in differential register allocation. captures the order in which registers are accessed, An edge from r1 -> r2 with weight 2, means that r2 is accessed after r1 2 times
adjacency graph: Edge coverage
if an edge in the adjacency graph can be encoded with the given number of bits
adjacency graph: cost
the sums of the weights of the uncovered edges
differential remapping algorithm
○ Start with initial register allocation and compute cost
○ Iteratively swap a pair of registers and recompute the cost
§ Make sure to do all possible swappings
§ Pick the new remapping with the lowest cost
Repeat step 2 until there is no more cost reduction
what is the Simple Offset Problem?
- How to most efficiently assign variables on the stack using auto-increment/auto-decrement modes
- Assumes one address register
access graph
used for simple offset problem. Undirected, weighted graph. Each node is a variable. Each edge represents an access a–(3)–b (a and b are accessed after one another 3 times, direction a/b b/a doesnt matter)
access graph: weight
sum of all the edges
access graph: cover
all nodes are connected, but no cycles, each node can have at most 2 neighbors
access graph: cover cost
weight - cost of cover
General Offset Assignment Problem:
- Generalization of the simple offset problem
○ Have multiple address registers
Given N address registers, split the access graph into N covers
Relax Fixed Evaluation order
SOA and GOA assume a fixed-evaluation order => can we transform the access sequence to improve our assignments by swapping commutative and associative ops?
Least Cost Access Sequence (LCAS):
Must keep applying commutative and associative transforms on the access sequence until you get the least cost access sequence
pruning
used for finding the least cost access sequence: trying all possible combos is exhausting:
○ Start with first statement, generate all possible swaps
○ Prune the ones with max cost
○ Add second statement to subset
○ Prune the ones with max cost
Repeat
graph model for LD/ST parallelization
- Each load/store is a node
- Edge between nodes means they can be merged into a parallel LD/ST instruction
○ They can be merged if their operands are in different memories - Pick maximum number of edges that are disjoint (i.e. not sharing a node)
- Edge between nodes means they can be merged into a parallel LD/ST instruction
Moveable Boundary Problem
How can we rearrange the program order to get more parallelized LD/STs?
Pseudo-fixed boundary
○ Move stores as early as possible (assume other instructions are fixed): STORES CAN ONLY BE MOVED UP
Move loads as late as possible (assume other instructions are fixed): LOADS CAN ONLY BE MOVED DOWN
what is variable duplication?
copy a variable from one bank to another to improve parallelization
what is rematerialization?
recompute the value that you want duplicated and then store it into the other memory bank (instead of ld/st-ing)
What is the Dual-Bank Register Assignment Problem?
both operands must come from different banks, but this can lead to problems with other instructions with the same constraints
Dual-Bank Register Assignment is for what kind of instruction
ALU
Register Conflict Graph
For bank conflicts
- Conflict edge -
○ Edge between two registers indicating that they must be in different registers (i.e. both registers appear as source registers in an ALU instruction)
- Edge must exist in interference graph
Register conflict subgraph
Subgraph of the interference graph that only shows the conflict edges
what is a Bipartite Graph?
- Graph that can be divided into two sets such that:
○ There is no edge between nodes in a single set
what is the No conflict law?
A RCG is conflictless (and thus can be allocated) if it is bipartite and A graph is bipartite if there are no odd cycles in RCG
how to detect odd cycles
- Breadth-First Hierarchy:
○ Node level: the shortest path length from a node to the root
○ Parallel edge - if two nodes have the same node level and there is an edge between them, then this indicates an odd cycle
○ Can detect cycles of length 2k+1, where k is the node level
Live Range Splitting
method to split odd cycles: Split a live range into multiple variables to remove an edge from the graph