One way to maximize the use of the pipeline, is to find an instruction that can be safely exeucted whether the branch is taken or not, and execute that instruction. So, when a branch instruction is encountered, the hardware puts the instruction following the branch into the pipe and begins executing it, just as in predict-not-taken. However, unlike in predict-not-taken, we do not need to worry about whether the branch is taken or not, we do not need to clear the pipe because no matter whether the branch is taken or not, we know the instruction is safe to execute. How do we know? The compiler promised it would be safe.
When the program was compiled, the compiler looked at each branch instruction,
and tried to find something that could be safely executed, whether we take the
branch or not. In the example code we have been using up to now:
LOOP:
LW R1, 0(R2)
SUB R3, R3, R1
BNEZ R3 LOOP
SW R4, 0(R3)
we couldn't execute the SW
R4, 0(R3) each time through the loop without causing any problems,
because the value of R3 is changing in our loop. But, suppose instead we had
the instruction SW
R3, 0(R4) following the BNEZ instruction. Now, we could execute this
instruction each time through the loop with no problems. We would simply be
writing repeatedly to the same memory location, and once we finally do exit
the loop, we will just continue on normally, no harm done. However, we are also
still wasting a cycle for each time we DO go through the loop.
A better strategy is for the compiler to find an instruction either from inside
the loop, or from before the loop, that can safely be rescheduled. For example,
given the following, slightly modified code fragment:
LOOP:
SUB R3, R3, R1
SW R3, 0(R5)
BNEZ R3 LOOP
SW R4, 0(R6)
We could still safely execute the SW
R4, 0(R3) each time through the loop, but that would provide marginal
improvement. A better solution is to move SW
R3, 0(R5) right after BNEZ. We will still execute it each time through
the loop, because it is in the branch-delay slot, the final result will not
be altered, and we get much better overall performance, whether we take the
loop or not. See below for an example run of this code fragment.
So, if this strategy offers improvements irregardless of whether we take or do not take the branch, what is the problem? The problem is trying to find an instruction that can both be safely executed whether the branch is taken or not, and will still improve performance. This is the compiler's job, and so using a branch delay slot makes compilers more complex to program. Also, Hennesy and Patterson mention that using this option does cause one shortcoming, if the hardware is changed so that a delay-branch slot is no longer used, all the old programs will no longer work. C programs would have to be recompiled, and assembly language programs and routines would have to be re-written. So, this method does put a lot more work into the hands of the system programmers.
Branch-Delay Example: |
LOOP: |
SUB R3, R3, R1
|
||||||||||||||||
|
||||||||||||||||||
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SUB R3,R3,R1 |
IF |
ID |
EX |
Mem |
WB |
|
|
|
|
|
|
|
|
|
|
|
|
|
BNEZ R3,Loop |
|
IF |
ID |
EX |
Mem |
WB |
|
|
|
|
|
|
|
|
|
|
|
|
SW R3, 0(R5) |
|
|
IF |
ID |
EX |
Mem |
WB |
|
|
|
|
|
|
|
|
|
|
|
SUB R3,R3,R1 |
|
|
|
IF |
ID |
EX |
Mem |
WB |
|
|
|
|
|
|
|
|
|
|
BNEZ R3,Loop |
|
|
|
|
IF |
ID |
EX |
Mem |
WB |
|
|
|
|
|
|
|
|
|
SW R3, 0(R5) |
|
|
|
|
|
IF |
ID |
EX |
Mem |
WB |
|
|
|
|
|
|
|
|
SW R4, 0(R6) |
|
|
|
|
|
|
IF |
ID |
EX |
Mem |
WB |
|
|
|
|
|
|
You can see from the above, that using this strategy we were able to eliminate all of the delays. Of course, this is assuming that the compiler is able to find a safe, relevant instruction. If the compiler cannot find anything relevant, then we may only get the same improvement as predict-taken or predict-not-taken. Or, if the compiler is unable to find any instruction that it is safe to execute, then it may fill the branch-delay slot with a no-op, and we are back to wasting one cycle through each iteration of the loop, just as in freezing. So, this method does provide for good use of the pipeline, but only in some cases. And it is dependent on the skill of the compiler to find those cases.