Now before we begin, we make the following assumptions for this example:

First lets look at a simple example. Recall the example from earlier.
for(int i=0;i<1000;i++)
a[i] = b[i] + c[i];

might be compiled into this code in DLX

-- assume R1 contains the base address for the 'a' array
and R2 has the base address for the 'b' array and
R3 has the base address for the 'c' array and R4 starts at 1000

Loop: LW R5, 0(R2) ; element of b
LW R6, 0(R3) ; element of c
ADD R7, R6, R5 ; make next a
SW 0(R1), R7 ;
ADDI R1, R1, #4 ;
ADDI R2, R2, #4 ; increment addresses
ADDI R3, R3, #4
SUBI R4, R4, #1 ; decrement loop var
BNEZ R4, Loop

Instruction 1 2 3 4 5 6 7 8 9 10 11 12 13
Loop: LW R5, 0(R2) I D X M W                
LW R6, 0(R3)   I D X M W              
ADD R7, R6, R5     I D s X M W          
SW 0(R1), R7       I s D X M W        
ADDI R1, R1, #4           I D X M W      
ADDI R2, R2, #4             I D X M W    
ADDI R3, R3, #4               I D X M W  
ADDI R4, R4, #4                 I D X M W

The average CPI for this is 13 clock cycles / 8 instuctions = 1.625. Several stalls happen due to this order. The third instruction "ADD 57, R6, R6" has to stall once for the R6 to finish loading to forward. This also adds a stall to the store for R7. However with a simple rescheduling we can remove these stalls.

Instruction 1 2 3 4 5 6 7 8 9 10 11 12
Loop: LW R5, 0(R2) I D X M W              
LW R6, 0(R3)   I D X M W            
ADDI R2, R2, #4     I D X M W          
ADD R7, R6, R5       I D X M W        
ADDI R3, R3, #4         I D X M W      
SW 0(R1), R7           I D X M W    
ADDI R3, R3, #4             I D X M W  
ADDI R4, R4, #4               I D X M W

The average CPI for the loop is now 12 clock cycles / 8 instructions = 1.5. The CPI has dropped and there are no longer any stalls from the data. Now that we have seen how rescheduling can lower the CPI, lets try unrolling the loop.


Prev Next