Questions Flashcards
Explain the different instruction set architectures (stack, accumulator, register-memory, and register-register).
By internal storage:
- accumulator architecture - operands are implicitly in accumulator
- stack architecture - operands are implicitly on top of the stack
- general-purpose register architecture - only explicit operands
By accessing memory and registers:
- register-memory architecture - any instruction can access memory
- load-store architecture - load and store instructions to access memory
- memory-memory architecture - all operands are in memory; no longer used
A special hybrid is the extended accumulator or special-purpose register computer, which has more registers than just the accumulator.
General-purpose register architectures dominate today, since registers are faster than memory.
Give examples of addressing modes. Which are used in a typical RISC?
The actual address used is called the effective address.
Modes:
- register - value is in register
- immediate - constants; the operand is the actual value
- displacement - memory[constant + regValue]
- register indirect - memory[regValue]
- indexed - memory[reg1Value + reg2Value]
- direct/absolute - memory[constant]
- memory indirect - memory[memory[regValue]]
- autoincrement/autodecrement - memory[regValue], with the register incremented by d after (or decremented by d before) the access
- scaled - memory[constant + reg1Value + reg2Value*d]
regValue - value in register
reg1Value, reg2Value - values of two different registers
d - displacement
In a typical RISC (e.g. RISC-V):
- immediate and displacement addressing, both with 12-bit wide fields
- register indirect - obtained by placing 0 in the 12-bit displacement
- absolute addressing - obtained by using x0 as the base register
- byte-addressable memory
- load-store architecture
- aligned and unaligned memory accesses are supported, but unaligned accesses may run extremely slowly
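A minimal C sketch of how a few of these modes compute the effective address, assuming a toy machine with a register file `reg[]` (the names are hypothetical, for illustration only):

```c
#include <stdint.h>

uint32_t reg[32];        /* toy register file (hypothetical) */

/* displacement: EA = constant + value in the base register */
uint32_t ea_displacement(uint32_t base, int32_t disp) {
    return reg[base] + (uint32_t)disp;
}

/* register indirect: a displacement of 0 (how RISC-V synthesizes it) */
uint32_t ea_register_indirect(uint32_t base) {
    return ea_displacement(base, 0);
}

/* indexed: EA = sum of two register values */
uint32_t ea_indexed(uint32_t r1, uint32_t r2) {
    return reg[r1] + reg[r2];
}

/* scaled: EA = constant + reg1 + reg2 * d (d is typically the element size) */
uint32_t ea_scaled(int32_t c, uint32_t r1, uint32_t r2, uint32_t d) {
    return (uint32_t)c + reg[r1] + reg[r2] * d;
}
```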
Explain the simple 5-stage MIPS/RISC-V pipeline.
Consists of 5 stages - IF, ID, EX, MEM and WB.
In the first stage (IF) the instruction is fetched from instruction memory and the program counter is incremented by 4, ready for the next fetch.
The second stage (ID) reads the source operands from the register file and sign-extends the immediate operands.
The third stage (EX) is execute: the ALU either computes the effective address for a memory access or performs the arithmetic/logic operation.
A special case are branches and jumps, whose conditions are evaluated and target addresses computed in the ALU.
In the memory stage (MEM) the pipeline reads data from, or writes data to, memory.
The last stage (WB) writes the result back to the register file.
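A small C sketch (hypothetical, for illustration) that prints how independent instructions overlap in the five stages, one new instruction entering per cycle:

```c
#include <stdio.h>

int main(void) {
    const char *stages[] = { "IF", "ID", "EX", "MEM", "WB" };
    const int n_instr = 3, n_stages = 5;

    /* instruction i is in stage (cycle - i) during cycle 'cycle' */
    for (int i = 0; i < n_instr; i++) {
        printf("i%d: ", i + 1);
        for (int cycle = 0; cycle < n_instr + n_stages - 1; cycle++) {
            int s = cycle - i;
            printf("%-4s", (s >= 0 && s < n_stages) ? stages[s] : ".");
        }
        printf("\n");
    }
    return 0;
}
```

This prints the familiar staircase diagram: while i1 is in EX, i2 is in ID and i3 is in IF.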
What is a structural hazard? Could there be one in the 5-stage MIPS/RISC-V pipeline? How is it resolved?
structural hazards - the hardware cannot support all the combinations of instructions we want to overlap
- in modern systems they occur mainly in slow special-purpose units and can usually be scheduled around by the compiler
Yes: if the same memory were used for both instructions and data, an instruction in the fetch stage and another in the memory stage could both try to access it in the same cycle, which a single-ported memory cannot support. The hazard is resolved by using separate memories (or caches) for instructions and data.
What is a control hazard? Solutions to avoid/reduce the penalty?
Control hazards result from branches. Unconditional branches are easier to resolve, since the target address can simply be fetched. Conditional branches, however, may or may not be taken, so the pipeline may fetch the wrong instructions and must then be flushed and restarted from the correct target.
The penalty is the number of cycles until the branch is resolved, up to several pipeline stages.
The simplest schemes are freeze and flush: freeze stalls the pipeline until the branch is resolved, flush discards the wrongly fetched instructions.
Another solution is to predict the branch as always taken or always untaken.
The delayed branch strategy executes the instruction in the delay slot regardless of the outcome; the compiler tries to fill the slot with a useful instruction, e.g. a sequential successor or an instruction from the branch target.
More advanced solutions are branch predictors. They can be either static or dynamic.
Is there a dependency between pipelining and ISA?
Yes, instruction sets are influenced by pipelining; the handling of exceptions and branches in particular shapes the ISA.
Especially problematic are instructions that change processor state in the middle of execution, because they make precise exceptions difficult to implement.
What is a direct mapped, set associative, and fully associative cache?
Direct mapped cache - each block can be stored in exactly one place, determined by its address, usually (block address) mod (number of blocks in the cache). Leads to conflict misses.
Set associative - the cache is divided into sets, each holding several blocks. The set is selected as in a direct mapped cache, (block address) mod (number of sets), and within a set placement is fully associative. Conflict misses still occur, but less often than in a direct mapped cache.
Fully associative - a block can be placed anywhere in the cache; there are no conflict misses, but every entry must be checked on a lookup.
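A minimal C sketch of how a direct mapped or set associative cache splits an address into offset, index, and tag (block size and set count are assumed values for illustration):

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 64    /* bytes per block -> 6 offset bits */
#define NUM_SETS   256   /* sets in cache   -> 8 index bits  */

/* offset selects the byte within the block, index selects the set,
   tag identifies which block currently occupies that set */
void split_address(uint32_t addr) {
    uint32_t offset = addr % BLOCK_SIZE;
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);
    printf("addr=0x%08x tag=0x%x index=%u offset=%u\n",
           addr, tag, index, offset);
}

int main(void) {
    /* two addresses with the same index but different tags conflict
       in a direct mapped cache (one block per set) */
    split_address(0x00001040);
    split_address(0x00005040);
    return 0;
}
```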
What is the difference between write-through and write-back cache?
write through - information is written both to the cache and lower-level memory
write back - information is written only to the cache and is written to main memory when the block is replaced - uses a dirty bit
Write-back writes happen at the speed of the cache and use less memory bandwidth and less energy - attractive for embedded systems, servers, and multiprocessor systems.
Write-through is easier to implement - the next lower level always holds the current (clean) copy, so blocks need not be written back on replacement, which also simplifies data coherency. Multilevel caches make write-through more viable for the upper-level caches.
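A hypothetical C sketch contrasting the two policies on a store hit, using a dirty bit for write-back (structure and names are assumptions for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* stub standing in for a write to the next lower memory level */
static void memory_write(uint32_t addr, uint8_t byte) { (void)addr; (void)byte; }

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     dirty;        /* used only by write-back */
    uint8_t  data[64];
} CacheLine;

/* write-through: update the line and the lower level on every store */
void store_write_through(CacheLine *line, uint32_t addr, uint8_t byte) {
    line->data[addr % 64] = byte;
    memory_write(addr, byte);
}

/* write-back: update only the line and mark it dirty; the lower level
   sees the data later, when the line is evicted */
void store_write_back(CacheLine *line, uint32_t addr, uint8_t byte) {
    line->data[addr % 64] = byte;
    line->dirty = true;
}

void evict_write_back(CacheLine *line, uint32_t base_addr) {
    if (line->dirty) {     /* write back only if the block was modified */
        for (int i = 0; i < 64; i++)
            memory_write(base_addr + i, line->data[i]);
        line->dirty = false;
    }
    line->valid = false;
}
```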
What is loop unrolling? Who performs it?
- Loop unrolling
○ Replicates the loop body multiple times and adjusts the loop termination code
○ Usually uses more resources (e.g. more registers)
○ Can greatly increase performance of the code, but can lead to much larger size of the code
○ Scheduling of the unrolled loop increases the performance even more
○ Process of unrolling:
1. Determine that unrolling is actually useful, iterations must be independent
2. Use different registers to avoid unnecessary constraints
3. Eliminate extra tests and branches and adjust loop termination and iteration code
4. Determine if the loads and stores from different iterations are independent and can be interchanged
1) Requires analyzing whether the memory accesses refer to different locations
5. Schedule the code, preserving any dependences needed to yield the same result as the original code
○ Limits of improvement gain:
1. Decrease in the amount of overhead amortized with each unroll
2. Code size limitations
1) Can lead to more cache misses
2) Shortage of registers - register pressure
3. Compiler limitations
Loop unrolling is performed by the compiler.
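A minimal C sketch of a loop unrolled four times by hand, assuming the iteration count n is a multiple of 4 (hypothetical example; in practice the compiler performs this transformation):

```c
/* original loop */
void add_scalar(double *x, double s, int n) {
    for (int i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* unrolled by 4: three loop-overhead increments/branches eliminated
   per four element operations; assumes n % 4 == 0 */
void add_scalar_unrolled(double *x, double s, int n) {
    for (int i = 0; i < n; i += 4) {
        x[i]     = x[i]     + s;   /* independent iterations can be */
        x[i + 1] = x[i + 1] + s;   /* scheduled to hide latencies   */
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```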
How is the performance equation extended to include the cache? Are the parameters independent?
CPU time = IC × (CPI_execution + memory stall cycles per instruction) × clock cycle time, where memory stall cycles per instruction = memory accesses per instruction × miss rate × miss penalty.
The parameters are not independent: a faster clock shortens the cycle time but increases the miss penalty measured in cycles, and a larger cache lowers the miss rate but may lengthen the hit time and thus the cycle time.
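A small worked example with assumed numbers (base CPI 1.0, 1.5 memory accesses per instruction, 2% miss rate, 100-cycle miss penalty):
memory stall cycles per instruction = 1.5 × 0.02 × 100 = 3
effective CPI = 1.0 + 3.0 = 4.0
so in this example cache misses quadruple the CPI.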
Explain branch prediction and the branch history table.
The goal of branch prediction is to reduce the branch penalty: the predictor guesses whether a branch will be taken, so that the pipeline can keep fetching and stalls are reduced.
simplest version - branch-prediction buffer, or branch history table
- a small buffer indexed by the lower bits of the branch instruction's address, with a bit that indicates whether the branch was recently taken
- the entry may also have been set by a different branch that shares the same lower address bits (aliasing)
- if the prediction turns out to be wrong, the prediction bit is inverted
- an improvement is the 2-bit prediction scheme - a prediction must miss twice before it changes: 11 taken, 10 taken, 00 untaken, 01 untaken
- even more bits can be used, but practice shows that 2 bits are enough
Explain branch prediction and why one wants to use a 2-bit predictor.
The goal of branch prediction is to reduce the branch penalty: the predictor guesses whether a branch will be taken, so that the pipeline can keep fetching and stalls are reduced.
The main strengths of the 2-bit predictor are its accuracy and simplicity. A simple predictor might use only 1 bit of history, but then a loop branch is mispredicted twice per loop execution: once on the exit iteration and once again on the first iteration of the next execution. A 2-bit predictor requires two consecutive incorrect predictions before its prediction changes. The more general scheme is an n-bit saturating counter.
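A minimal C sketch of a 2-bit saturating counter predictor (table size and names are assumptions for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_SIZE 1024            /* entries, indexed by low PC bits */

/* states: 0,1 = predict not taken; 2,3 = predict taken */
static uint8_t bht[BHT_SIZE];    /* 2-bit counters, initially 0 */

bool predict(uint32_t pc) {
    return bht[(pc >> 2) % BHT_SIZE] >= 2;
}

/* saturating update: two consecutive mispredictions are needed to
   flip the prediction from one direction to the other */
void update(uint32_t pc, bool taken) {
    uint8_t *c = &bht[(pc >> 2) % BHT_SIZE];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```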
What is a branch target buffer?
- Branch-target buffers
○ Stores branch target addresses
○ Supplies the predicted target address to the PC before the instruction is even decoded
○ If the PC of the fetched instruction matches an entry in the buffer, the stored predicted PC is used as the next PC
○ If a matching entry is found, fetching from the predicted target begins immediately
○ The match must be correct, otherwise we fetch from a wrong PC and performance gets worse
○ Stores only predicted-taken branches
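A hypothetical C sketch of a direct mapped branch-target buffer (sizes and names are assumptions for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_SIZE 512

typedef struct {
    bool     valid;
    uint32_t branch_pc;    /* full PC of the branch (acts as the tag) */
    uint32_t target_pc;    /* predicted target address                */
} BTBEntry;

static BTBEntry btb[BTB_SIZE];

/* lookup during fetch: on a hit, fetch from the predicted target next;
   on a miss, simply fetch sequentially (pc + 4) */
uint32_t next_fetch_pc(uint32_t pc) {
    BTBEntry *e = &btb[(pc >> 2) % BTB_SIZE];
    if (e->valid && e->branch_pc == pc)
        return e->target_pc;
    return pc + 4;
}

/* insert or update an entry when a branch resolves as taken */
void btb_update(uint32_t pc, uint32_t target) {
    BTBEntry *e = &btb[(pc >> 2) % BTB_SIZE];
    e->valid = true;
    e->branch_pc = pc;
    e->target_pc = target;
}
```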
Sketch the hardware for and the data flow of the Tomasulo Algorithm.
Page 198