VLIW Theory Flashcards
What are the behaviors that the compiler cannot predict?
The cache and the branch behaviors
Can a bundle contain more than one branch instruction?
No, we can’t
What are the requirements of the fetch stage in a multiple-issue processor?
Higher bandwidth from the instruction cache
How is the decision of which instructions to issue taken in a multiple-issue processor?
Either by dynamic scheduling (in hardware) or by static scheduling (by the compiler)
What are the bundles in a VLIW architecture?
A bundle is a single issue packet representing one wide instruction with multiple independent operations
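As a rough illustration (a hypothetical C sketch, not a real instruction encoding), a 4-slot bundle can be pictured as one wide packet carrying four independent operations:

/* Hypothetical sketch of a 4-slot VLIW bundle: one wide instruction
   carrying four independent operations, one per functional-unit slot. */
typedef struct {
    unsigned opcode;      /* operation for this slot (NOP if unused) */
    unsigned dest;        /* destination register                    */
    unsigned src1, src2;  /* source registers                        */
} Operation;

typedef struct {
    Operation slot[4];    /* e.g. INT, INT, MEM, BRANCH slots        */
} Bundle;                 /* fetched and issued as a single packet   */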
What is the impact of having a shared multi-ported register file in a VLIW architecture?
It implies the need for enough ports to read and write, in parallel, the operands of the multiple operations issued by each very long instruction
Therefore, for a 4-slot bundle we need eight read ports to read eight source registers and four write ports to write four destination registers per cycle
What happens when there is not enough parallelism in the source code to fill the available operation slots?
NOPs are inserted
What is the impact of the number of functional units, compared to the number of slots in the bundle, on the architecture?
If each slot is assigned to a functional unit, the decode unit is a simple decoder; otherwise, if there are more parallel functional units than issue slots, the architecture must have a dispatch network to redirect each operation and the related source operands to the target FU
How does the VLIW processor solve RAW hazards?
During the scheduling phase, the compiler statically reorders instructions (not involved in the dependency) to fill the latency gap; otherwise the compiler introduces NOPs
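A minimal C sketch (hypothetical code, assuming a 2-cycle load latency) of how the compiler hides a RAW dependency by moving an independent operation into the gap instead of a NOP:

/* The load into x has a 2-cycle latency; the RAW consumer (x + 4)
   must not issue in the next cycle. The compiler fills the gap with
   the independent increment of count rather than inserting a NOP. */
int raw_example(const int *a, int count) {
    int x = a[0];          /* load: result ready after 2 cycles   */
    count = count + 1;     /* independent op scheduled in the gap */
    int sum = x + 4;       /* RAW consumer of x, now safely late  */
    return sum + count;
}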
What information must be exposed to the compiler in order to obtain correct instruction execution?
Operation latencies and data dependencies must be exposed to the compiler
How are WAR and WAW hazards solved in VLIW processors?
They are solved statically by the compiler, by correctly selecting the slots for the operations or by using register renaming
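A small C sketch (hypothetical example) of how renaming removes WAW/WAR hazards: the reused temporary t is split into t1 and t2, making the two computations independent and freely reorderable:

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3, d = 4;

    /* Before renaming (written with a single temporary):
       t = a + b; x = t;   t = c + d; y = t;
       The second write to t is a WAW with the first write, and a
       WAR with the read that feeds x. */

    /* After renaming: no false dependencies remain. */
    int t1 = a + b;
    int x  = t1;
    int t2 = c + d;
    int y  = t2;

    printf("x=%d y=%d\n", x, y);
    return 0;
}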
How are mispredicted branches solved in the VLIW architecture?
A mispredicted branch must be solved dynamically by the hardware, by flushing the speculative instructions from the pipeline
What must be done to avoid structural hazards on RF access and WAR/WAW hazards, and what is the effect?
It is necessary to keep in-order execution: the write-back phase of the parallel operations in a bundle must occur in the same clock cycle
This implies that each bundle is constrained to the latency of the longest-latency operation in it
What is the implication of doing out of order execution?
This way, we need to check, in the write-back phase, for RF write accesses (to avoid structural hazards) and for WAR/WAW hazards
What is register pressure and what causes it?
Register pressure is related to the occupation of the registers
It is caused by the multicycle latency of the operations together with the multiple issue of instructions, which increases the number of registers occupied over time
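A hedged C illustration (hypothetical loop) of where the pressure comes from: overlapping several multicycle loads forces several temporaries to be live at once, each needing its own register:

/* After 4x unrolling, four loads are in flight at the same time, so
   t0..t3 are all live simultaneously instead of a single temporary;
   that growth in concurrently live values is register pressure. */
long sum4(const int *a, int n) {   /* assumes n is a multiple of 4 */
    long s = 0;
    for (int i = 0; i < n; i += 4) {
        int t0 = a[i];
        int t1 = a[i + 1];
        int t2 = a[i + 2];
        int t3 = a[i + 3];
        s += t0 + t1 + t2 + t3;
    }
    return s;
}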
What events does the compiler not know about at static (compile) time?
Data cache misses and branch mispredictions
What are basic blocks?
A straight-line code sequence with no branches in or out except at the entry/exit point
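A small C example (hypothetical) with its basic blocks marked; each block is entered only at its top and left only at its bottom:

int max(int a, int b) {
    int m;
    if (a > b)        /* BB1: test, ends in a conditional branch */
        m = a;        /* BB2: taken path                         */
    else
        m = b;        /* BB3: fall-through path                  */
    return m;         /* BB4: join point, single entry and exit  */
}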
What is the relationship between basic blocks, data dependencies and the exploitable ILP in a program?
Beyond the fact that the amount of parallelism available within a basic block is small, data dependencies can further limit the amount of ILP we can exploit within a basic block.
Therefore, to obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
What are the main advantages of the VLIW architecture?
Good performance through extensive compiler optimizations (the compiler can analyze the code over a wider instruction window and has more time to analyze dependencies and extract parallelism)
Reduced hardware complexity
What are the challenges of VLIW architectures?
Need for strong compiler technology
Code size increase
Huge number of registers needed for register renaming
L1: lw $2, BASEA($4)
addi $2,$2,INC1
lw $3, BASEB($4)
addi $3,$3,INC2
addi $4, $4, 4
bne $4,$7, L1
Schedule the code above on a MIPS 5-stage pipelined scalar processor with forwarding and early branch resolution in the ID stage
See picture 4S in the ACA album
L1: lw $2, BASEA($4)
addi $2,$2,INC1
lw $3, BASEB($4)
addi $3,$3,INC2
addi $4, $4, 4
bne $4,$7, L1
Schedule the code above on a 4-issue VLIW with 4 fully pipelined FUs with:
- LD/ST latency 2
- INT latency 1
- INT to BR latency 2
See picture 5S in the ACA album
L1: lw $2, BASEA($4)
addi $2,$2,INC1
lw $3, BASEB($4)
addi $3,$3,INC2
addi $4, $4, 4
bne $4,$7, L1
Schedule the code above on a 2-issue VLIW with 2 fully pipelined FUs with:
- LD/ST latency 2
- INT latency 1
- INT to BR latency 2
See picture 6S in the ACA album
What is a dependence graph in the context of VLIW scheduling?
It is a graph that captures true, anti, and output dependencies between instructions
Each node represents an operation in the basic block
The edges between nodes represent the dependencies
Each node is annotated with the longest path to reach it and the node latency
What is the longest critical path in a dependence graph?
It defines the minimum execution time of the basic block
What is the relationship between scheduling and the dependence graph?
The scheduling problem consists of assigning an execution cycle to each operation in the graph
What is the ready state?
An operation is in the ready state if all of its predecessors have been scheduled and its operands are ready
How does the list-based scheduling algorithm work?
Before scheduling begins, the operations at the top of the graph (with no predecessors) are inserted in the ready set
Starting from the first cycle, for each cycle, try to schedule operations from the ready set that can fill the available resource slots
When several nodes are in the ready set, select the operation with the highest priority (longest path to the end of the graph)
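A minimal list-scheduling sketch in C (the dependence graph, latencies, and 2-slot machine below are hypothetical, not from the course material); the priority of a node is its longest latency path to the end of the graph:

#include <stdio.h>

#define N_OPS 5
#define SLOTS_PER_CYCLE 2           /* assumed machine: 2 generic FUs */

typedef struct {
    const char *name;
    int latency;
    int n_preds;
    int preds[N_OPS];               /* indices of predecessor ops */
} Op;

/* Example DAG: a and b feed c; c feeds d; b feeds e. */
static Op ops[N_OPS] = {
    { "a", 2, 0, {0} },
    { "b", 1, 0, {0} },
    { "c", 1, 2, {0, 1} },
    { "d", 1, 1, {2} },
    { "e", 1, 1, {1} },
};

static int start[N_OPS];            /* scheduled cycle, -1 if none */

/* Longest latency path from op i to the end of the graph. */
static int priority(int i) {
    int best = ops[i].latency;
    for (int j = 0; j < N_OPS; j++)
        for (int p = 0; p < ops[j].n_preds; p++)
            if (ops[j].preds[p] == i) {
                int v = ops[i].latency + priority(j);
                if (v > best) best = v;
            }
    return best;
}

/* Ready: unscheduled, and every predecessor's result is available. */
static int ready(int i, int cycle) {
    if (start[i] >= 0) return 0;
    for (int p = 0; p < ops[i].n_preds; p++) {
        int q = ops[i].preds[p];
        if (start[q] < 0 || start[q] + ops[q].latency > cycle)
            return 0;
    }
    return 1;
}

int main(void) {
    int scheduled = 0;
    for (int i = 0; i < N_OPS; i++) start[i] = -1;

    for (int cycle = 0; scheduled < N_OPS; cycle++) {
        for (int used = 0; used < SLOTS_PER_CYCLE; used++) {
            int best = -1;          /* highest-priority ready op */
            for (int i = 0; i < N_OPS; i++)
                if (ready(i, cycle) &&
                    (best < 0 || priority(i) > priority(best)))
                    best = i;
            if (best < 0) break;    /* nothing ready: slot gets a NOP */
            start[best] = cycle;
            scheduled++;
            printf("cycle %d: %s\n", cycle, ops[best].name);
        }
    }
    return 0;
}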
What can we do to expose more parallelism in static scheduling?
The compiler must try to enlarge basic blocks and schedule instructions across basic blocks
What is the difference between local and global scheduling?
Local scheduling techniques operate within a single basic block, while global scheduling techniques operate across basic blocks
What is loop unrolling, and to which class of techniques does it belong?
It is a local scheduling technique, where the compiler must test whether loop iterations are independent of each other
The loop body is replicated multiple times, depending on the unrolling factor, adjusting the loop termination code
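A hedged C sketch of 4x unrolling (hypothetical loop; assumes the trip count n is a multiple of 4, so no fix-up code is shown):

/* Original loop: one compare-and-branch per element. */
void scale(float *a, int n, float k) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] * k;
}

/* Unrolled by 4: one compare-and-branch per 4 elements, and four
   independent multiplies the scheduler can pack into bundles. */
void scale_unrolled(float *a, int n, float k) {
    for (int i = 0; i < n; i += 4) {   /* adjusted termination code */
        a[i]     = a[i]     * k;
        a[i + 1] = a[i + 1] * k;
        a[i + 2] = a[i + 2] * k;
        a[i + 3] = a[i + 3] * k;
    }
}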
What are the advantages and disadvantages of loop unrolling?
Advantages: loop overhead is minimized, and the length of basic blocks is extended, exposing more instructions that can be effectively scheduled to minimize NOP insertion
Disadvantages: it increases register pressure, code size, and cache misses
What are loop-carried dependences and loop-level analysis?
Loop-level analysis involves determining what data dependences exist among the operands across the iterations of a loop
Loop-carried dependence analysis, instead, determines whether data used in later iterations depend on data values produced in earlier iterations.
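Two small C loops (hypothetical) contrasting the cases:

/* Loop-carried dependence: iteration i reads a[i-1], written by the
   previous iteration, so iterations cannot run in parallel. */
void prefix_sum(int *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}

/* No loop-carried dependence: iterations are fully independent. */
void add_k(int *a, int n, int k) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] + k;
}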
What is loop peeling?
Loop peeling is a technique used by optimizing compilers and code transformations to improve the performance or simplify the logic of a loop. It involves extracting one or more iterations from the beginning or end of a loop and handling them separately from the main loop body.
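A hedged C sketch (hypothetical loop) of peeling the first iteration so the special case disappears from the loop body:

/* Original: every iteration tests a condition that is true only once. */
void init(int *a, int n, int x) {
    for (int i = 0; i < n; i++) {
        if (i == 0) a[i] = x;            /* special case       */
        else        a[i] = a[i - 1] + 1;
    }
}

/* After peeling the first iteration, the test disappears and the
   remaining loop has a single uniform body. */
void init_peeled(int *a, int n, int x) {
    if (n <= 0) return;
    a[0] = x;                            /* peeled iteration   */
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1;
}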
What is software pipelining?
It is a technique for reorganizing loops, such that each iteration of the software-pipelined code is made from instructions chosen from different iterations of the original loop
It interleaves instructions from different iterations without unrolling the loop
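A schematic C sketch (compilers actually do this on machine code; the loop is hypothetical): the steady-state body mixes the load stage of iteration i+1 with the compute/store stage of iteration i:

void scale_swp(const int *a, int *b, int n, int k) {
    if (n <= 0) return;
    int x = a[0];                 /* prologue: start iteration 0    */
    for (int i = 0; i < n - 1; i++) {
        int next = a[i + 1];      /* stage 1 of iteration i + 1     */
        b[i] = x * k;             /* stage 2 of iteration i         */
        x = next;                 /* pass value to the next round   */
    }
    b[n - 1] = x * k;             /* epilogue: finish the last one  */
}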
What is the global scheduling technique?
Global scheduling techniques aim at compacting a code fragment with internal control structure into the shortest possible sequence, preserving data and control dependencies
How do global scheduling techniques work?
They try to find parallelism across conditional branches through two steps:
first trace selection, and then trace compaction
What are super blocks?
A superblock is a group of basic blocks with a single entry point and multiple control exits
See picture 16 in the ACA album and draw its dependence graph
See picture 16S in the ACA album
Look at the dependence graph in picture 17 in the ACA album
Consider that the latencies for ALU, LS, and MUL operations are 1, 2, and 3 respectively, and:
b, c, e, g => ALU ops
f => LS ops
a, d => MUL ops
2 ALUs, 1 LS and 1 MUL
What is the result of using the ASAP scheduling algorithm?
See picture 17S
Look at the dependence graph in picture 17 in the ACA album
Consider that the latencies for ALU, LS, and MUL operations are 1, 2, and 3 respectively, and:
b, c, e, g => ALU ops
f => LS ops
a, d => MUL ops
1 ALU, 1 MUL and 1 LS
What is the result of using the list-based scheduling algorithm?
See picture 18S
What are the loop-carried dependences in picture 19 in the ACA album?
See picture 19S
What are the loop-carried dependences in picture 20 in the ACA album?
See picture 20S
What are the loop-carried dependences in picture 21 in the ACA album?
See picture 21S