Now before we begin, we make the following assumptions for this example:
might be compiled into this code in DLX
Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
Loop: LW R5, 0(R2) | I | D | X | M | W |  |  |  |  |  |  |  |  |
LW R6, 0(R3) |  | I | D | X | M | W |  |  |  |  |  |  |  |
ADD R7, R6, R5 |  |  | I | D | s | X | M | W |  |  |  |  |  |
SW 0(R1), R7 |  |  |  | I | s | D | X | M | W |  |  |  |  |
ADDI R1, R1, #4 |  |  |  |  |  | I | D | X | M | W |  |  |  |
ADDI R2, R2, #4 |  |  |  |  |  |  | I | D | X | M | W |  |  |
ADDI R3, R3, #4 |  |  |  |  |  |  |  | I | D | X | M | W |  |
ADDI R4, R4, #4 |  |  |  |  |  |  |  |  | I | D | X | M | W |
The average CPI for this loop is 13 clock cycles / 8 instructions = 1.625. The stalls come from the order of the instructions. The third instruction, "ADD R7, R6, R5", has to stall for one cycle until the load of R6 has finished and the value can be forwarded. That stall also delays the store of R7 by one cycle. However, with a simple rescheduling we can remove these stalls.
Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
Loop: LW R5, 0(R2) | I | D | X | M | W |  |  |  |  |  |  |  |
LW R6, 0(R3) |  | I | D | X | M | W |  |  |  |  |  |  |
ADDI R2, R2, #4 |  |  | I | D | X | M | W |  |  |  |  |  |
ADD R7, R6, R5 |  |  |  | I | D | X | M | W |  |  |  |  |
ADDI R3, R3, #4 |  |  |  |  | I | D | X | M | W |  |  |  |
SW 0(R1), R7 |  |  |  |  |  | I | D | X | M | W |  |  |
ADDI R1, R1, #4 |  |  |  |  |  |  | I | D | X | M | W |  |
ADDI R4, R4, #4 |  |  |  |  |  |  |  | I | D | X | M | W |
The average CPI for the loop is now 12 clock cycles / 8 instructions = 1.5. The CPI has dropped, and there are no longer any data stalls. Now that we have seen how rescheduling can lower the CPI, let's try unrolling the loop.
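The two cycle counts can be reproduced with a small Python sketch. It is only a rough model, not a DLX simulator, and it rests on assumptions chosen to match the diagrams: a five-stage pipeline (I, D, X, M, W) with forwarding, where the only stall is a one-cycle bubble when an instruction reads a register loaded by the LW immediately before it. The loop branch is not modelled, since it is not shown in the diagrams. The tuple layout and function names are made up for this illustration.

```python
# Each entry: (text, destination register or None, source registers, is_load).
# The register lists are transcribed by hand from the DLX code in the diagrams.
ORIGINAL = [
    ("LW R5, 0(R2)",    "R5", ["R2"],       True),
    ("LW R6, 0(R3)",    "R6", ["R3"],       True),
    ("ADD R7, R6, R5",  "R7", ["R6", "R5"], False),
    ("SW 0(R1), R7",    None, ["R1", "R7"], False),
    ("ADDI R1, R1, #4", "R1", ["R1"],       False),
    ("ADDI R2, R2, #4", "R2", ["R2"],       False),
    ("ADDI R3, R3, #4", "R3", ["R3"],       False),
    ("ADDI R4, R4, #4", "R4", ["R4"],       False),
]

# Same instructions in the rescheduled order shown in the second diagram.
RESCHEDULED = [ORIGINAL[i] for i in (0, 1, 5, 2, 6, 3, 4, 7)]

PIPELINE_DEPTH = 5  # I, D, X, M, W


def count_load_use_stalls(program):
    """Count one stall for each instruction that reads the result of the
    load directly in front of it (assumed to be the only hazard)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        _, prev_dest, _, prev_is_load = prev
        _, _, cur_sources, _ = cur
        if prev_is_load and prev_dest in cur_sources:
            stalls += 1
    return stalls


def total_cycles(program):
    """Cycles until the last instruction leaves W: one issue slot per
    instruction, plus the pipeline fill, plus any stall cycles."""
    return len(program) + (PIPELINE_DEPTH - 1) + count_load_use_stalls(program)


def cpi(program):
    return total_cycles(program) / len(program)


for name, program in (("original", ORIGINAL), ("rescheduled", RESCHEDULED)):
    print(name, total_cycles(program), "cycles, CPI =", cpi(program))
```

Under these assumptions the original order gives 13 cycles (CPI 1.625) and the rescheduled order gives 12 cycles (CPI 1.5). Moving "ADDI R2, R2, #4" between the second load and the ADD is what removes the stall, because the ADD no longer directly follows the load it depends on.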