Okay, you say you want a longer example? Here it is. So far we've looked at unrolling and rescheduling on the simple 5-stage DLX pipeline. Now we will look at an example using the multi-cycle DLX.
- Assume the multi-cycle DLX pipeline.
- Assume forwarding is implemented.
- Ignore the BNEZ instruction when computing CPI.
- Ignore any stall cycles contributed by the branch.
- Use the latencies given in the book (Figure 3.43).
The DLX code is as follows:
- ADDI R4, R0, #5200 ; load the integer constant 5200
- MVI2FP F4, R4 ; move it into FP register F4
- CVTI2FP F4, F4 ; convert to float: F4 now holds the constant 5200.0
- ADD R1, R0, R0 ; init counter to 0
- Loop: LF F2, 100(R1) ; F2 is array element, R1 has offset of lowest unused array element
- LF F3, 500(R1) ; F3 holds array element
- SUBF F5, F3, F2 ; perform subtraction
- ADDF F5, F5, F4 ; perform addition of a constant
- SF 1000(R1), F5 ; store the results
- ADDI R1, R1, #4 ; increment pointer
- SUBI R5, R1, #400 ; check pointer
- BNEZ R5, Loop ; branch while not done
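If the assembly is hard to follow, here is a minimal Python sketch of what the loop computes. The names `run_loop`, `a`, `b` and `const` are illustrative and not part of the original code. The memory regions at offsets 100, 500 and 1000 are modeled as three separate lists, and the 100 iterations follow from R1 stepping from 0 to 400 in steps of 4.

```python
def run_loop(a, b, const=5200.0):
    """Model of the DLX loop: c[i] = (b[i] - a[i]) + const.

    `a` models the words read at 100(R1), `b` the words read at 500(R1),
    and the returned list models the words stored at 1000(R1).
    """
    c = []
    r1 = 0                      # ADD  R1, R0, R0
    while True:
        i = r1 // 4             # each element is 4 bytes wide
        f2 = a[i]               # LF   F2, 100(R1)
        f3 = b[i]               # LF   F3, 500(R1)
        f5 = f3 - f2            # SUBF F5, F3, F2
        f5 = f5 + const         # ADDF F5, F5, F4
        c.append(f5)            # SF   1000(R1), F5
        r1 += 4                 # ADDI R1, R1, #4
        if r1 - 400 == 0:       # SUBI R5, R1, #400 / BNEZ R5, Loop
            break
    return c


a = [float(i) for i in range(100)]
b = [2.0 * i for i in range(100)]
result = run_loop(a, b)
```

With these sample inputs every element is `b[i] - a[i] + 5200.0`, so the loop produces 100 results in total.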
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Loop: LF F2, 100(R1) | I | D | X | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |
| LF F3, 500(R1) |  | I | D | X | M | W |  |  |  |  |  |  |  |  |  |  |  |  |
| SUBF F5, F3, F2 |  |  | I | D | s | A1 | A2 | A3 | A4 | M | W |  |  |  |  |  |  |  |
| ADDF F5, F5, F4 |  |  |  | I | s | D | s | s | s | A1 | A2 | A3 | A4 | M | W |  |  |  |
| SF 1000(R1), F5 |  |  |  |  |  | I | s | s | s | D | s | s | s | X | M | W |  |  |
| ADDI R1, R1, #4 |  |  |  |  |  |  |  |  |  | I | s | s | s | D | X | M | W |  |
| SUBI R5, R1, #400 |  |  |  |  |  |  |  |  |  |  |  |  |  | I | D | X | M | W |
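The CPI quoted next can be read straight off the diagram: take the cycle in which the last instruction writes back and divide by the number of instructions. Here is a small Python sketch of that bookkeeping. The names `TIMELINE`, `loop_cpi` and `stall_cycles` are illustrative, the cycle numbers are transcribed from the diagram, and the stall count assumes the last instruction passes through the ordinary 5 integer stages.

```python
# (instruction, fetch cycle, write-back cycle), transcribed from the diagram
TIMELINE = [
    ("LF F2, 100(R1)",     1,  5),
    ("LF F3, 500(R1)",     2,  6),
    ("SUBF F5, F3, F2",    3, 11),
    ("ADDF F5, F5, F4",    4, 15),
    ("SF 1000(R1), F5",    6, 16),
    ("ADDI R1, R1, #4",   10, 17),
    ("SUBI R5, R1, #400", 14, 18),
]


def loop_cpi(timeline):
    """Cycles until the last write-back, divided by the instruction count."""
    total_cycles = max(wb for _, _, wb in timeline)
    return total_cycles / len(timeline)


def stall_cycles(timeline, depth=5):
    """Cycles beyond the n + (depth - 1) a stall-free pipeline would need."""
    total_cycles = max(wb for _, _, wb in timeline)
    return total_cycles - (len(timeline) + depth - 1)


print(round(loop_cpi(TIMELINE), 3))  # 2.571
print(stall_cycles(TIMELINE))        # 7
```

The 7 stall cycles match the diagram: one after the second load, three while ADDF waits for SUBF, and three while SF waits for ADDF.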
The average CPI for this loop is 18 clock cycles / 7 instructions = 2.571. Now let's see what happens when we unroll the loop 4 times and reschedule to avoid stalls. The new code is as follows.
- ADDI R4, R0, #5200 ; load the integer constant 5200
- MVI2FP F14, R4 ; move it into FP register F14
- CVTI2FP F14, F14 ; convert to float: F14 now holds the constant 5200.0
- ADD R1, R0, R0 ; init counter to 0
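- Loop: LF F2, 100(R1) ; copy 1: first operand
- LF F6, 500(R1) ; copy 1: second operand
- LF F3, 104(R1) ; copy 2: first operand
- LF F7, 504(R1) ; copy 2: second operand
- SUBF F10, F6, F2 ; copy 1: subtraction
- LF F4, 108(R1) ; copy 3: first operand
- SUBF F11, F7, F3 ; copy 2: subtraction
- LF F8, 508(R1) ; copy 3: second operand
- LF F5, 112(R1) ; copy 4: first operand
- LF F9, 512(R1) ; copy 4: second operand
- SUBF F12, F8, F4 ; copy 3: subtraction
- SUBF F13, F9, F5 ; copy 4: subtraction
- ADDI R1, R1, #16 ; advance pointer by four elements
- ADDF F10, F10, F14 ; copy 1: add the constant
- ADDF F11, F11, F14 ; copy 2: add the constant
- ADDF F12, F12, F14 ; copy 3: add the constant
- ADDF F13, F13, F14 ; copy 4: add the constant
- SUBI R5, R1, #400 ; check pointer
- SF 984(R1), F10 ; copy 1: store (offset is 1000 - 16 because R1 was already advanced)
- SF 988(R1), F11 ; copy 2: store
- SF 992(R1), F12 ; copy 3: store
- SF 996(R1), F13 ; copy 4: store
- BNEZ R5, Loop ; branch while not done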
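As a sanity check on the unrolling, here is a Python sketch that mirrors the rescheduled loop and compares it with the straightforward computation. The function names and list model are illustrative, not from the original. It assumes the second copy loads from 104(R1), as the pipeline diagram shows, and that the element count is a multiple of 4 (100 is).

```python
def reference(a, b, const=5200.0):
    """The original loop: one element per pass."""
    return [(y - x) + const for x, y in zip(a, b)]


def run_loop_unrolled(a, b, const=5200.0):
    """Mirror of the unrolled DLX loop: four elements per pass."""
    end = 4 * len(a)            # 400 in the DLX code (100 elements x 4 bytes)
    c = [0.0] * len(a)
    r1 = 0                      # ADD R1, R0, R0
    while True:
        i = r1 // 4
        f2, f6 = a[i], b[i]             # LF F2, 100(R1) / LF F6, 500(R1)
        f3, f7 = a[i + 1], b[i + 1]     # LF F3, 104(R1) / LF F7, 504(R1)
        f10 = f6 - f2                   # SUBF F10, F6, F2
        f4 = a[i + 2]                   # LF F4, 108(R1)
        f11 = f7 - f3                   # SUBF F11, F7, F3
        f8 = b[i + 2]                   # LF F8, 508(R1)
        f5, f9 = a[i + 3], b[i + 3]     # LF F5, 112(R1) / LF F9, 512(R1)
        f12 = f8 - f4                   # SUBF F12, F8, F4
        f13 = f9 - f5                   # SUBF F13, F9, F5
        r1 += 16                        # ADDI R1, R1, #16
        f10 += const                    # ADDF F10..F13, ..., F14
        f11 += const
        f12 += const
        f13 += const
        # SF 984..996(R1): R1 is already advanced, so 1000 - 16 = 984
        j = (r1 - 16) // 4
        c[j:j + 4] = [f10, f11, f12, f13]
        if r1 - end == 0:               # SUBI R5, R1, #400 / BNEZ R5, Loop
            break
    return c


a = [float(i) for i in range(100)]
b = [3.0 * i for i in range(100)]
same = run_loop_unrolled(a, b) == reference(a, b)
```

Because each element still goes through the same subtract-then-add sequence, the two versions produce identical results; only the instruction order and register names differ.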
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Loop: LF F2, 100(R1) | I | D | X | M | W |  |  |  |  |  |  |  |  |  |  |  |
| LF F6, 500(R1) |  | I | D | X | M | W |  |  |  |  |  |  |  |  |  |  |
| LF F3, 104(R1) |  |  | I | D | X | M | W |  |  |  |  |  |  |  |  |  |
| LF F7, 504(R1) |  |  |  | I | D | X | M | W |  |  |  |  |  |  |  |  |
| SUBF F10, F6, F2 |  |  |  |  | I | D | A1 | A2 | A3 | A4 | M | W |  |  |  |  |
| LF F4, 108(R1) |  |  |  |  |  | I | D | X | M | W |  |  |  |  |  |  |
| SUBF F11, F7, F3 |  |  |  |  |  |  | I | D | A1 | A2 | A3 | A4 | M | W |  |  |
| LF F8, 508(R1) |  |  |  |  |  |  |  | I | D | X | s | M | W |  |  |  |
| LF F5, 112(R1) |  |  |  |  |  |  |  |  | I | D | s | X | s | M | W |  |
| LF F9, 512(R1) |  |  |  |  |  |  |  |  |  | I | s | D | s | X | M | W |
| SUBF F12, F8, F4 |  |  |  |  |  |  |  |  |  |  |  | I | s | D | A1 | A2 |
| SUBF F13, F9, F5 |  |  |  |  |  |  |  |  |  |  |  |  |  | I | D | A1 |
| ADDI R1, R1, #16 |  |  |  |  |  |  |  |  |  |  |  |  |  |  | I | D |
| ADDF F10, F10, F14 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | I |
| Instruction | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SUBF F12, F8, F4 | A3 | A4 | M | W |  |  |  |  |  |  |  |  |  |  |  |  |
| SUBF F13, F9, F5 | A2 | A3 | A4 | M | W |  |  |  |  |  |  |  |  |  |  |  |
| ADDI R1, R1, #16 | X | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADDF F10, F10, F14 | D | A1 | A2 | A3 | A4 | M | W |  |  |  |  |  |  |  |  |  |
| ADDF F11, F11, F14 | I | D | A1 | A2 | A3 | A4 | M | W |  |  |  |  |  |  |  |  |
| ADDF F12, F12, F14 |  | I | D | A1 | A2 | A3 | A4 | M | W |  |  |  |  |  |  |  |
| ADDF F13, F13, F14 |  |  | I | D | A1 | A2 | A3 | A4 | M | W |  |  |  |  |  |  |
| SUBI R5, R1, #400 |  |  |  | I | D | X | s | s | s | M | W |  |  |  |  |  |
| SF 984(R1), F10 |  |  |  |  | I | D | s | s | s | X | M | W |  |  |  |  |
| SF 988(R1), F11 |  |  |  |  |  | I | s | s | s | D | X | M | W |  |  |  |
| SF 992(R1), F12 |  |  |  |  |  |  |  |  |  | I | D | X | M | W |  |  |
| SF 996(R1), F13 |  |  |  |  |  |  |  |  |  |  | I | D | X | M | W |  |
The average CPI for this loop is now 31 clock cycles / 22 instructions = 1.409. This is significantly less than the original 2.571 CPI. Also, the BNEZ is executed only 25% as often as in the original loop.
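To put the two versions side by side, here is a short Python sketch of the arithmetic. The names `summarize` and `ITERATIONS` are illustrative. The total-cycle figure is only a rough estimate: it charges every pass through the loop body the full length of its diagram, and ignores overlap between passes and the cost of the branch, in keeping with the assumptions listed at the start.

```python
ITERATIONS = 100  # R1 runs from 0 to 400 in steps of 4 in the original loop


def summarize(cycles_per_pass, instructions_per_pass, unroll_factor):
    """CPI, branch count and a rough total-cycle estimate for one version."""
    passes = ITERATIONS // unroll_factor
    return {
        "cpi": cycles_per_pass / instructions_per_pass,
        "branches": passes,                       # one BNEZ per pass
        "total_cycles": cycles_per_pass * passes,
        "instructions_per_element": instructions_per_pass / unroll_factor,
    }


original = summarize(18, 7, 1)
unrolled = summarize(31, 22, 4)

branch_ratio = unrolled["branches"] / original["branches"]
speedup = original["total_cycles"] / unrolled["total_cycles"]

print(round(original["cpi"], 3), round(unrolled["cpi"], 3))  # 2.571 1.409
print(branch_ratio)                                          # 0.25
print(round(speedup, 2))                                     # 2.32
```

The gain comes from two places: fewer stall cycles per instruction (the lower CPI) and fewer instructions per element (5.5 instead of 7), since the pointer update and test are shared by four elements.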