Solution to problem 1
Use register renaming to eliminate all but true dependencies the following loop:
Rewritten code:
Dependencies in original code::
True dependencies: S1 to S2 carried by F0, S1 to S6 carried by F0, S4 to S6 carried by F4
Output dependencies S2 to S4 carried by F4, antidependencies S2 to S3 carried by F2, S1 to S5 carried by R1, S3 to S6 carried by F8
Solution for problem 2
We are going to consider prediction schemes related to the loop S2. Assume that all prediction schemes are initialized to (Not) taken.
i=0
j=0
Loop 1: i++
Loop2: j++
Load b(i-1) in R1
Load c(j-1) in R2
Load a in R3
Multiply R1 by R2 add to R3
Store R3 in a
If j<4 branch to Loop2
If i<2 branch to Loop1
Assume we initially predict branch not taken.
i=0:
First time (j=1) through we mispredict S2 as not taken (new prediction = Taken). For j=1,2 we correctly predict as taken (new prediction = Taken). For j=3 we mispredict as taken since this is the last iteration and the branch is not taken (new prediction = Not Taken).
i=1
First time (j=1) we mispredict S2 as not taken (new prediction = Taken). For j=1,2 we correctly predict as taken (new prediction = Taken). For j=3 we mispredict as taken since this is the last iteration and the branch is not taken (new prediction = Not Taken).
We initially predict branch is not taken (predictor = (NT,NT)). We use the prediction associated with the second position of this tuple.
i=0
First time (j=1) through we mispredict S2 as not taken (new prediction = (T,NT)). For j=1 we mispredict as not taken (new prediction = (T,T), j=2 we correctly predict as taken (new prediction = (T,T)). For j=3 we mispredict as taken since this is the last iteration and the branch is not taken (new prediction = (NT,T)).
i=1
First time (j=1) we predict S2 as taken (new prediction = (T,T)). For j=1,2 we correctly predict as taken (new prediction = (T,T)). For j=3 we mispredict as taken since this is the last iteration and the branch is not taken (new prediction = NT,T)
Solution to Problem 3
3) Consider the following loop:
for(i=0; i
S1: for(j=0;j<8;j++)
{
x(8*j) = x(8*j) + y(4*j);
} }
Assume that each array element is 4 bytes and that memory addresses are interleaved between banks at the word level (e.g bank 0 gets bytes 0-3, bank 1 gets bytes 4-7, etc).
All elements of x accessed (x(0),x(8),..,x(56)) fall in the same memory bank. Thus, at best we could fetch a single element every 50ns. It will take no less than 400 ns to fetch all elements. All elements of y are assigned to two different memory banks. One memory bank gets y(0), y(8), y(16), y(24); the other memory bank gets y(4), y(12), y(20) and y(28). Thus we can fetch y(0) and y(4) simultaneously, y(8) and y(12) simultaneously, etc. Thus we could obtain all 8 elements of y in 4, 50ns memory cycles or 200ns. Depending on how x(0) and y(0) are assigned to memory banks, we might be able to overlap reading elements of x and y; at best then we could read all necessary elements of x and y in 400ns. For the reasons described above, it would take an additional 400ns to write elements of x. Consequently, the lower bound would be 800 ns.
You could rewrite x and y into temporary arrays so that consecutively accessed elements of x and y were assigned to adjacent array elements. (for instance, y(0), y(8), y(16) …) would be assigned t ytemp(0), ytemp(1), ytemp(2) etc.).