Instruction | Issue | Read operands | Execution | Write result |
LD F2,0(R1) | ||||
ADDD F6,F2,F4 | ||||
MULTD F8,F6,F0 | ||||
LD F10,-8(R1) | ||||
ADDD F12,F10,F4 | ||||
MULTD F14,F12,F0 | ||||
SUBI R1,R1,#16 | ||||
BNEZ R1,Loop |
Name | Busy | Op | Fi | Fj | Fk | Qj | Qk | Rj | Rk |
Int1 | No | ||||||||
Int2 | No | ||||||||
Add1 | No | ||||||||
Add2 | Yes | Add | F12 | F10 | F4 | | | No | No |
Mult1 | Yes | Mult | F8 | F6 | F0 | Add1 | | Yes | Yes |
Div1 | No |
F0 | F2 | F4 | F6 | F8 | F10 | F12 | F14 |
| | | | Mult1 | | Add2 | |
Note that even though the integer functional units are not busy, we can't issue any more instructions: issue is in program order, the next un-issued instruction is a multiply, and the only multiply unit is busy. The second add unit is almost finished; it is one clock cycle behind the first add unit, since the two could not both read the F4 register from the register file in the same cycle. Also, Rj and Rk are now both Yes for the multiply unit, which means it can read its operands on the next clock cycle.
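The issue and read-operand conditions described here can be sketched in a few lines of Python. This is only an illustration of the checks, not a full scoreboard simulator: the dictionary layout, the `kind` field, and the helper names `free_unit`, `next_issue`, and `can_read_operands` are all made up for this example, and only the table fields needed for the checks are modeled.

```python
# Snapshot of the functional-unit status table (only the fields used below).
# The "kind" field and this layout are assumptions made for the sketch.
fu_status = {
    "Int1":  {"kind": "int",  "busy": False, "Fi": None,  "Rj": None,  "Rk": None},
    "Int2":  {"kind": "int",  "busy": False, "Fi": None,  "Rj": None,  "Rk": None},
    "Add1":  {"kind": "add",  "busy": False, "Fi": None,  "Rj": None,  "Rk": None},
    "Add2":  {"kind": "add",  "busy": True,  "Fi": "F12", "Rj": False, "Rk": False},
    "Mult1": {"kind": "mult", "busy": True,  "Fi": "F8",  "Rj": True,  "Rk": True},
    "Div1":  {"kind": "div",  "busy": False, "Fi": None,  "Rj": None,  "Rk": None},
}

# Instructions not yet issued, in program order: (name, unit kind, destination).
pending = [
    ("MULTD F14,F12,F0", "mult", "F14"),
    ("SUBI R1,R1,#16",   "int",  "R1"),
    ("BNEZ R1,Loop",     "int",  None),
]

def free_unit(kind, dest):
    """Return the name of a unit the instruction could issue to, or None."""
    # WAW check: no busy unit may already be writing the same destination.
    if dest is not None and any(
        u["busy"] and u["Fi"] == dest for u in fu_status.values()
    ):
        return None
    # Structural check: a unit of the right kind must be idle.
    for name, u in fu_status.items():
        if u["kind"] == kind and not u["busy"]:
            return name
    return None

def next_issue(pending):
    """In-order issue: only the oldest un-issued instruction is considered."""
    if not pending:
        return None
    name, kind, dest = pending[0]
    unit = free_unit(kind, dest)
    return (name, unit) if unit else None

def can_read_operands(unit_name):
    """A busy unit may read its operands once Rj and Rk are both Yes."""
    u = fu_status[unit_name]
    return u["busy"] and u["Rj"] is True and u["Rk"] is True

stalled = next_issue(pending) is None      # MULTD blocks everything behind it
int_unit = free_unit("int", "R1")          # an integer unit is idle, but SUBI must wait its turn
mult_ready = can_read_operands("Mult1")    # both operands of Mult1 are available
```

Because `next_issue` looks only at the head of `pending`, the idle integer units cannot be used by SUBI or BNEZ until the multiply has issued, which is the stall described above.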
From the time the operands are read, assume loads take 4 cycles to complete, integer ALU operations 3 cycles, floating-point add 6 cycles, and floating-point multiply 11 cycles. Then the scoreboard machine will take 33 cycles to complete the above instructions. Note that for the most part, the scoreboard machine can execute two iterations of the loop in parallel, since there are no loop-carried dependences. The only thing limiting this is the lack of a second floating-point multiply unit.
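To see why the single multiply unit is the bottleneck, a rough lower bound on the cycle count can be computed from the latencies alone. The sketch below is not a cycle-accurate simulation: it assumes each iteration is the dependence chain LD, ADDD, MULTD, ignores the issue and bookkeeping cycles, and uses a hypothetical helper `lower_bound` that takes the number of multiply units as a parameter.

```python
# Latencies from the text, counted from operand read to completion.
LATENCY = {"load": 4, "int": 3, "fadd": 6, "fmul": 11}

ITERATIONS = 2  # the listed code contains two unrolled loop bodies

def lower_bound(num_mult_units):
    """Rough lower bound on cycles, ignoring issue and bookkeeping overhead."""
    # No multiply can start before a load and an add have completed.
    before_mult = LATENCY["load"] + LATENCY["fadd"]
    # Multiplies sharing one unit must run back to back (ceiling division).
    serial_mults = -(-ITERATIONS // num_mult_units)
    return before_mult + serial_mults * LATENCY["fmul"]

one_unit = lower_bound(1)   # 4 + 6 + 11 + 11
two_units = lower_bound(2)  # 4 + 6 + 11
```

With one multiply unit the bound is 32 cycles, just below the 33 cycles quoted above. With a second multiply unit (an assumption, not the machine in this problem) the bound drops to 21, since the two iterations would then overlap almost completely.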