Now before we begin, we make the following assumptions for this example:
might be compiled into this code in DLX
Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
Loop: LW R5, 0(R2) | I | D | X | M | W | ||||||||
LW R6, 0(R3) | I | D | X | M | W | ||||||||
ADD R7, R6, R5 | I | D | s | X | M | W | |||||||
SW 0(R1), R7 | I | s | D | X | M | W | |||||||
ADDI R1, R1, #4 | I | D | X | M | W | ||||||||
ADDI R2, R2, #4 | I | D | X | M | W | ||||||||
ADDI R3, R3, #4 | I | D | X | M | W | ||||||||
ADDI R4, R4, #4 | I | D | X | M | W |
The average CPI for this is 13 clock cycles / 8 instuctions = 1.625. Several stalls happen due to this order. The third instruction "ADD 57, R6, R6" has to stall once for the R6 to finish loading to forward. This also adds a stall to the store for R7. However with a simple rescheduling we can remove these stalls.
Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
Loop: LW R5, 0(R2) | I | D | X | M | W | |||||||
LW R6, 0(R3) | I | D | X | M | W | |||||||
ADDI R2, R2, #4 | I | D | X | M | W | |||||||
ADD R7, R6, R5 | I | D | X | M | W | |||||||
ADDI R3, R3, #4 | I | D | X | M | W | |||||||
SW 0(R1), R7 | I | D | X | M | W | |||||||
ADDI R3, R3, #4 | I | D | X | M | W | |||||||
ADDI R4, R4, #4 | I | D | X | M | W |
The average CPI for the loop is now 12 clock cycles / 8 instructions = 1.5. The CPI has dropped and there are no longer any stalls from the data. Now that we have seen how rescheduling can lower the CPI, lets try unrolling the loop.
![]() | ![]() |
Prev | Next |