Now for the tough stuff...
Gee, this piece of code should look familiar:
    int i;
    double x[100];

    for (i = 0; i < 100; i++) {
        x[i] *= 2.0;
    }
This is the same code as the first loop, except things are just a little more complicated. We have just a few more lines of startup code, and a few minor modifications.
            addi    r1, r0, #1000      ; r1 = base address of x
            addi    r3, r0, #100       ; r3 = loop counter
            addi    r2, r0, #2
            movi2fp f0, r2             ; move the integer 2 into an FP register
            cvti2d  f0, f0             ; convert it to a double
    loop:   ld      f2, 0(r1)          ; load x[i]
            multd   f2, f0, f2         ; x[i] * 2.0
            sd      0(r1), f2          ; store the result
            addi    r1, r1, #8         ; advance to the next element
            subi    r3, r3, #1         ; decrement the loop counter
            bnez    r3, loop           ; loop again if counter != 0
            nop                        ; branch delay slot
Our multiply is going to take 5 cycles in the execute stage (see assumptions) instead of one. Now for the ugly part:
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| addi r1, r0, #1000 | I | D | X | M | W |
| addi r3, r0, #100 | | I | D | X | M | W |
| addi r2, r0, #2 | | | I | D | X | M | W |
| movi2fp f0, r2 | | | | I | D | X | M | W |
| cvti2d f0, f0 | | | | | I | D | X | M | W |
| loop: ld f2, 0(r1) | | | | | | I | D | X | M | W |
| multd f2, f0, f2 | | | | | | | I | D | S | X | X | X | X | X | M | W |
| sd 0(r1), f2 | | | | | | | | I | S | D | S | S | S | S | X | M | W |
| addi r1, r1, #8 | | | | | | | | | I | S | S | S | S | S | D | X | M | W |
| subi r3, r3, #1 | | | | | | | | | | | | | | | I | D | X | M | W |
| bnez r3, loop | | | | | | | | | | | | | | | | I | D | S | X | M | W |
| nop (branch delay slot) | | | | | | | | | | | | | | | | | I |
| next iteration: ld f2, 0(r1) | | | | | | | | | | | | | | | | | | | I |
We have a lot more stalls per loop iteration: one from the load/multd pair, four between the multiply and the store, and one before the branch. This gives the loop a 13-cycle loop initiation interval (6 instructions + 6 stalls + 1 branch delay slot) and a 16-cycle loop latency. Over the 7 instructions inside the loop, that is a per-loop CPI of 16/7 ≈ 2.3. Overall, we have 13*100+5+2 = 1307 cycles over 7*100+5 = 705 instructions, for a CPI of 1.85. Judging from all the stalls inside the loop, it can stand some improvement.
Unrolling and rescheduling the loop (details excluded), we have:
    loop4:  ld      f2, 0(r1)
            ld      f4, 8(r1)
            ld      f6, 16(r1)
            ld      f8, 24(r1)
            ld      f10, 32(r1)
            multd   f2, f0, f2
            multd   f4, f0, f4
            multd   f6, f0, f6
            multd   f8, f0, f8
            multd   f10, f0, f10
            subi    r3, r3, #5
            sd      0(r1), f2
            sd      8(r1), f4
            sd      16(r1), f6
            sd      24(r1), f8
            sd      32(r1), f10
            bnez    r3, loop4
            addi    r1, r1, #40        ; branch delay slot
This is better: we have 18 instructions per loop, iterated 20 times. Each multiply after the first must wait four cycles for the multiplier to free up, and the subtraction hides one of the last multiply's stall cycles. However, this late multiply will conflict in the MEM stage with the sds. This is the biggest obstacle to rescheduling with the multicycle pipeline: not only do the floating-point operations take longer, they cause many more stalls. As we will see in the next section, rescheduling becomes complicated very quickly...
Deja vu all over again: here is the other loop in a floating point version.
    int i;
    double x[100], y[100], z[100];

    for (i = 0; i < 100; i++) {
        z[i] = x[i] * y[i];
    }
Here is the basic loop:
            addi    r1, r0, #1000      ; r1 = address of array elt for x
            addi    r2, r0, #2000      ; r2 = address of array elt for y
            addi    r3, r0, #3000      ; r3 = address of array elt for z
            addi    r4, r0, #100       ; r4 = loop counter
    loop:   ld      f0, 0(r1)          ; load x[i]
            ld      f2, 0(r2)          ; load y[i]
            multd   f4, f0, f2         ; multiply
            sd      0(r3), f4          ; store result
            addi    r1, r1, #8         ; increment array pointers
            addi    r2, r2, #8
            addi    r3, r3, #8
            subi    r4, r4, #1         ; decrement loop counter
            bnez    r4, loop           ; loop again if counter != 0
            nop                        ; branch delay slot
Here's a quick rundown of the stalls involved: one between the second load and the multiply, four between the multiply and the store, and one between the subtract and the branch. Overall, this loop has a latency of 16 cycles for 10 instructions, for a per-loop CPI of 1.6. The loop initiation interval is also 16 cycles; at 100 iterations, plus 4 startup cycles and 2 extra cycles at the end, that comes to 1606 cycles total. The startup code has 4 instructions, and the loop contributes 10*100 more, so we have 1004 instructions, giving an overall CPI of 1.6.
After unrolling and minor rescheduling, the code becomes:
    loop3:  ld      f0, 0(r1)
            ld      f2, 0(r2)
            ld      f4, 8(r1)
            multd   f20, f0, f2
            ld      f6, 8(r2)
            ld      f8, 16(r1)
            multd   f22, f4, f6
            ld      f10, 16(r2)
            ld      f12, 24(r1)
            multd   f24, f8, f10
            ld      f14, 24(r2)
            ld      f16, 32(r1)
            multd   f26, f12, f14
            ld      f18, 32(r2)
            sd      0(r3), f20
            multd   f28, f16, f18
            sd      8(r3), f22
            sd      16(r3), f24
            sd      24(r3), f26
            sd      32(r3), f28
            addi    r1, r1, #40
            subi    r4, r4, #5
            addi    r2, r2, #40
            bnez    r4, loop3
            addi    r3, r3, #40        ; branch delay slot
``WAIT,'' you scream, ``WHAT DID YOU DO TO THE MULTIPLIES?!'' A multiply stalls for four cycles when it is paired directly with the load of an operand, the store of its result, or another multiply (the multiplier is busy for five cycles), so we include a few instructions between the multiplies. This reduces the number of stalls to two per multiply. We now have 25 instructions per loop, 2 stalls for each multiply after the first, and one stall before the last store. This takes 25+4*2+1=34 cycles per loop. So we have 20*34+4=684 cycles total, over 25*20+4=504 instructions, giving a CPI of 1.36.

What happened here? We still have a lot of stall cycles per instruction (9/25). This means that our reordering didn't work as well as it should have. Although we have decreased the number of instructions, and the CPI with it, the multiply stalls are still eating up more than a quarter of our cycles. So let's try changing our reordering, and we'll see what happens.
What are our obstacles to reordering? First, we need to make sure that the last multiply finishes before the branch. We can put its store in the branch delay slot, so it can finish as late as possible; the only stalls we won't be able to remove are those between the multiplies and the memory-stage stall from the last multiply. For a multiply to finish in time (without stalls), it must issue at least five instructions before its store. We can move the multiplies back further and interleave the other instructions, but we must take care not to add stalls from back-to-back loads and multiplies. With this in mind, we get:
    loop4:  ld      f0, 0(r1)
            ld      f2, 0(r2)
            ld      f4, 8(r1)
            multd   f20, f0, f2
            ld      f6, 8(r2)
            ld      f8, 16(r1)
            ld      f10, 16(r2)
            multd   f22, f4, f6
            ld      f12, 24(r1)
            ld      f14, 24(r2)
            subi    r4, r4, #5
            multd   f24, f8, f10
            ld      f16, 32(r1)
            ld      f18, 32(r2)
            sd      0(r3), f20
            multd   f26, f12, f14
            sd      8(r3), f22
            sd      16(r3), f24
            sd      24(r3), f26
            multd   f28, f16, f18
            addi    r1, r1, #40
            addi    r2, r2, #40
            addi    r3, r3, #40
            bnez    r4, loop4
            sd      -8(r3), f28        ; branch delay slot; r3 is already
                                       ; incremented, so old offset 32 is -8
Now we have eliminated all but 5 stalls: the contention for the memory stage after each multiply causes one stall. Over 25 instructions, that is 30 cycles per loop. Over 20 loops plus 4 startup cycles, this comes to 604 cycles, over 25*20+4 = 504 instructions, giving a new CPI of 1.2. Much, much better.