Solutions - More Sample Problems

Solution to problem 1

Use register renaming to eliminate all but true dependencies the following loop:

  1. LD F0,0(R1)
  2. ADDD F4,F0,F2
  3. MULTD F2,F6,F8
  4. ADDD F4,F10,F12
  5. SUBI R1,R1,8
  6. ADDD F8,F4,F0

Rewritten code:

  1. LD F0,0(R1)
  2. ADDD F4,F0,F2
  3. MULTD F32,F6,F8
  4. ADDD F44,F10,F12
  5. SUBI R51,R1,8
  6. ADDD F68,F44,F0

Dependencies in original code::

True dependencies: S1 to S2 carried by F0, S1 to S6 carried by F0, S4 to S6 carried by F4

Output dependencies S2 to S4 carried by F4, antidependencies S2 to S3 carried by F2, S1 to S5 carried by R1, S3 to S6 carried by F8

 

 

 Solution for problem 2

We are going to consider prediction schemes related to the loop S2. Assume that all prediction schemes are initialized to (Not) taken.

  1. S1 for(i=0;i<2;i++)
  2. S2 for(j=0;j<4;j++)
  3. {
  4. a += b(i)*c(j);
  5. }
  1. write assembly pseudocode corresponding to this loop
  2. i=0

    j=0

    Loop 1: i++

    Loop2: j++

    Load b(i-1) in R1

    Load c(j-1) in R2

    Load a in R3

    Multiply R1 by R2 add to R3

    Store R3 in a

    If j<4 branch to Loop2

    If i<2 branch to Loop1

     

  3. assume that we have a one bit prediction scheme - for which iterations do we mispredict S2?
  4.  

    Assume we initially predict branch not taken.

    i=0:

    First time (j=1) through we mispredict S2 as not taken (new prediction = Taken). For j=1,2 we correctly predict as taken (new prediction = Taken). For j=3 we mispredict as taken since this is the last iteration and the branch is not taken (new prediction = Not Taken).

    i=1

    First time (j=1) we mispredict S2 as not taken (new prediction = Taken). For j=1,2 we correctly predict as taken (new prediction = Taken). For j=3 we mispredict as taken since this is the last iteration and the branch is not taken (new prediction = Not Taken).

     

  5. assume that we have a two bit prediction scheme - for which iterations do we mispredict S2?

We initially predict branch is not taken (predictor = (NT,NT)). We use the prediction associated with the second position of this tuple.

 

i=0

First time (j=1) through we mispredict S2 as not taken (new prediction = (T,NT)). For j=1 we mispredict as not taken (new prediction = (T,T), j=2 we correctly predict as taken (new prediction = (T,T)). For j=3 we mispredict as taken since this is the last iteration and the branch is not taken (new prediction = (NT,T)).

i=1

First time (j=1) we predict S2 as taken (new prediction = (T,T)). For j=1,2 we correctly predict as taken (new prediction = (T,T)). For j=3 we mispredict as taken since this is the last iteration and the branch is not taken (new prediction = NT,T)

 

 

 

 

Solution to Problem 3

3) Consider the following loop:

for(i=0; i

S1: for(j=0;j<8;j++)

{

x(8*j) = x(8*j) + y(4*j);

} }

Assume that each array element is 4 bytes and that memory addresses are interleaved between banks at the word level (e.g bank 0 gets bytes 0-3, bank 1 gets bytes 4-7, etc).

  1. Assume that we have 8 independently accessible memory banks. Each bank can produce a datum every 50ns. What is a lower bound on the amount of time required to carry out an iteration of loop S1?
  2.  

     

    All elements of x accessed (x(0),x(8),..,x(56)) fall in the same memory bank. Thus, at best we could fetch a single element every 50ns. It will take no less than 400 ns to fetch all elements. All elements of y are assigned to two different memory banks. One memory bank gets y(0), y(8), y(16), y(24); the other memory bank gets y(4), y(12), y(20) and y(28). Thus we can fetch y(0) and y(4) simultaneously, y(8) and y(12) simultaneously, etc. Thus we could obtain all 8 elements of y in 4, 50ns memory cycles or 200ns. Depending on how x(0) and y(0) are assigned to memory banks, we might be able to overlap reading elements of x and y; at best then we could read all necessary elements of x and y in 400ns. For the reasons described above, it would take an additional 400ns to write elements of x. Consequently, the lower bound would be 800 ns.

  3. How could you rewrite the code to improve this situation? What is your new lower time bound?

 

You could rewrite x and y into temporary arrays so that consecutively accessed elements of x and y were assigned to adjacent array elements. (for instance, y(0), y(8), y(16) …) would be assigned t ytemp(0), ytemp(1), ytemp(2) etc.).