Now for the tough stuff...
Gee, this piece of code should look familiar:
    int i;
    double x[100];

    for (i = 0; i < 100; i++) {
        x[i] *= 2.0;
    }
This is the same code as the first loop, except things are just a little more complicated. We have just a few more lines of startup code, and a few minor modifications.
            addi    r1, r0, #1000      ; r1 = base address of x
            addi    r3, r0, #100       ; r3 = loop counter
            addi    r2, r0, #2
            movi2fp f0, r2             ; move the integer 2 into an FP register
            cvti2d  f0, f0             ; convert it to a double
    loop:   ld      f2, 0(r1)          ; load x[i]
            multd   f2, f0, f2         ; x[i] * 2.0
            sd      0(r1), f2          ; store the result
            addi    r1, r1, #8         ; advance to the next element
            subi    r3, r3, #1         ; decrement the loop counter
            bnez    r3, loop           ; loop again if counter != 0
            nop                        ; branch delay slot
Our multiply is going to take 5 cycles in the execute stage (see assumptions) instead of one. Now for the ugly part:
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| addi r1, r0, #1000 | I | D | X | M | W |
| addi r3, r0, #100 | | I | D | X | M | W |
| addi r2, r0, #2 | | | I | D | X | M | W |
| movi2fp f0, r2 | | | | I | D | X | M | W |
| cvti2d f0, f0 | | | | | I | D | X | M | W |
| loop: ld f2, 0(r1) | | | | | | I | D | X | M | W |
| multd f2, f0, f2 | | | | | | | I | D | S | X | X | X | X | X | M | W |
| sd 0(r1), f2 | | | | | | | | I | S | D | S | S | S | S | X | M | W |
| addi r1, r1, #8 | | | | | | | | | I | S | S | S | S | S | D | X | M | W |
| subi r3, r3, #1 | | | | | | | | | | | | | | | I | D | X | M | W |
| bnez r3, loop | | | | | | | | | | | | | | | | I | D | S | X | M | W |
| nop (branch delay slot) | | | | | | | | | | | | | | | | | I |
| next iteration: ld f2, 0(r1) | | | | | | | | | | | | | | | | | | | I |
We have a lot more stalls per loop iteration: one from the load/multd pair, four between the multiply and the store, and one before the branch. This gives the loop a 13-cycle loop initiation interval (6 instructions + 6 stalls + 1 branch delay slot) and a 16-cycle loop latency. Over the 7 instructions inside the loop, that is a per-loop CPI of 16/7 ≈ 2.3. Overall, we have 13*100+5+2 = 1307 cycles over 7*100+5 = 705 instructions, for a CPI of 1.85. Judging from all the stalls inside the loop, it can stand some improvement.
Unrolling and rescheduling the loop (details excluded), we have:
    loop4:  ld      f2, 0(r1)
            ld      f4, 8(r1)
            ld      f6, 16(r1)
            ld      f8, 24(r1)
            ld      f10, 32(r1)
            multd   f2, f0, f2
            multd   f4, f0, f4
            multd   f6, f0, f6
            multd   f8, f0, f8
            multd   f10, f0, f10
            subi    r3, r3, #5
            sd      0(r1), f2
            sd      8(r1), f4
            sd      16(r1), f6
            sd      24(r1), f8
            sd      32(r1), f10
            bnez    r3, loop4
            addi    r1, r1, #40        ; branch delay slot
This is better: we have 18 instructions per loop, iterated 20 times. Each multiply after the first must wait four cycles for the multiplier to free up, and the subtraction hides one of the last multiply's stall cycles. However, this late multiply will conflict in the MEM stage with the sds. This is the biggest obstacle to rescheduling with the multicycle pipeline: not only do the floating-point operations take longer, they cause many more stalls. As we will see in the next section, rescheduling becomes complicated very quickly...
Deja vu all over again: here is the other loop in a floating point version.
    int i;
    double x[100], y[100], z[100];

    for (i = 0; i < 100; i++) {
        z[i] = x[i] * y[i];
    }
Here is the basic loop:
            addi    r1, r0, #1000      ; r1 = address of array elt for x
            addi    r2, r0, #2000      ; r2 = address of array elt for y
            addi    r3, r0, #3000      ; r3 = address of array elt for z
            addi    r4, r0, #100       ; r4 = loop counter
    loop:   ld      f0, 0(r1)          ; load x[i]
            ld      f2, 0(r2)          ; load y[i]
            multd   f4, f0, f2         ; multiply
            sd      0(r3), f4          ; store result
            addi    r1, r1, #8         ; increment array pointers
            addi    r2, r2, #8
            addi    r3, r3, #8
            subi    r4, r4, #1         ; decrement loop counter
            bnez    r4, loop           ; loop again if counter != 0
            nop                        ; branch delay slot
Here's a quick rundown of the stalls involved: one between the second load and the multiply, four between the multiply and the store, and one between the subtract and the branch. Overall, this loop has a latency of 16 cycles for 10 instructions, for a per-loop CPI of 1.6. The loop initiation interval is also 16 cycles; at 100 iterations, plus 4 startup cycles and 2 extra cycles at the end, that comes to 1606 cycles total. The startup code has 4 instructions, and the loop contributes 10*100 more, so we have 1004 instructions, giving an overall CPI of 1.6.
After unrolling and minor rescheduling, the code becomes:
    loop3:  ld      f0, 0(r1)
            ld      f2, 0(r2)
            ld      f4, 8(r1)
            multd   f20, f0, f2
            ld      f6, 8(r2)
            ld      f8, 16(r1)
            multd   f22, f4, f6
            ld      f10, 16(r2)
            ld      f12, 24(r1)
            multd   f24, f8, f10
            ld      f14, 24(r2)
            ld      f16, 32(r1)
            multd   f26, f12, f14
            ld      f18, 32(r2)
            sd      0(r3), f20
            multd   f28, f16, f18
            sd      8(r3), f22
            sd      16(r3), f24
            sd      24(r3), f26
            sd      32(r3), f28
            addi    r1, r1, #40
            subi    r4, r4, #5
            addi    r2, r2, #40
            bnez    r4, loop3
            addi    r3, r3, #40        ; branch delay slot
``WAIT,'' you scream, ``WHAT DID YOU DO TO THE MULTIPLIES?!'' A multiply stalls for four cycles when it is paired directly with the load of an operand, the store of its result, or another multiply (the multiplier is busy for five cycles), so we include a few instructions between the multiplies. This reduces the number of stalls to two per multiply. We now have 25 instructions per loop, 2 stalls for each multiply after the first, and one stall before the last store. This takes 25+4*2+1=34 cycles per loop. So we have 20*34+4=684 cycles total, over 25*20+4=504 instructions, giving a CPI of 1.36.

What happened here? We still have a lot of stall cycles per instruction (9/25). This means that our reordering didn't work as well as it should have. Although we have decreased the number of instructions, and the CPI with it, the multiply stalls are still eating up more than a quarter of our cycles. So let's try changing our reordering, and we'll see what happens.
What are our obstacles to reordering? First, we need to make sure that the last multiply finishes before the branch. We can put its store in the branch delay slot, so it can finish as late as possible; the only stalls we won't be able to remove are those between the multiplies and the memory-stage stall from the last multiply. For a multiply to finish in time (without stalls), it must issue at least five instructions before its store. We can move the multiplies back further and interleave the other instructions, but we must take care not to add stalls from back-to-back loads and multiplies. With this in mind, we get:
    loop4:  ld      f0, 0(r1)
            ld      f2, 0(r2)
            ld      f4, 8(r1)
            multd   f20, f0, f2
            ld      f6, 8(r2)
            ld      f8, 16(r1)
            ld      f10, 16(r2)
            multd   f22, f4, f6
            ld      f12, 24(r1)
            ld      f14, 24(r2)
            subi    r4, r4, #5
            multd   f24, f8, f10
            ld      f16, 32(r1)
            ld      f18, 32(r2)
            sd      0(r3), f20
            multd   f26, f12, f14
            sd      8(r3), f22
            sd      16(r3), f24
            sd      24(r3), f26
            multd   f28, f16, f18
            addi    r1, r1, #40
            addi    r2, r2, #40
            addi    r3, r3, #40
            bnez    r4, loop4
            sd      -8(r3), f28        ; branch delay slot; r3 is already
                                       ; incremented, so old offset 32 is -8
Now we have eliminated all but 5 stalls: the contention for the memory stage after each multiply causes one stall. Over 25 instructions, that is 30 cycles per loop. Over 20 loops plus 4 startup cycles, this comes to 604 cycles, over 25*20+4 = 504 instructions, giving a new CPI of 1.2. Much, much better.