1.1 Amdahl's Law shows that overall speedup is achieved by speeding up some part (lets call that part y) by some amount (lets call that k). The whole program is x+y, so speedup is 1/(x+y/k). Note that the more frequent a bit of code is, it's corresponding y (or x) is bigger, therefore, assuming the "common case" is y, making the common case fast really improves the program. 1.2 YES! R0 is not a "write to" register, so the compiler should be smart enough to never use R0 as a destination, so there will not be any physical errors involving R0. RAR is not a hazard. RAW is not a hazard bacause there is no writing to R0, WAR is not a hazard for the same reason, and WAW is also not a hazard for the same reason. There can be a control hazard if BNEZ loop, R0 is executed and we're assuming taken or BEQZ loop, R0 assuming not taken. 1.3 Advantage: Using offset based addressing allows a rudimentary stack interface (if you want one) or array indexing. Of course, you are simulating these features through code, but you get them. And, you don't need to store an address in the instruction--merely a smaller offset to the PC address. Disadvantage: It's possible to have an alignment error. And, you are limited in the range of addresses that can be referenced with offsets by the length of the offset. 1.4 NO! I can write a microcoded turing machine that will implement any instruction set. More to the point, the pipeline need not be used for the DLX ISA. Therefore we know two hardware interfaces that implement DLX, so DLX doesn't uniquely determine it's hardware, and DLX is a valid ISA. 1.5 Accumulator machines use one register to do all of their work, so it must be at least (1,2) to get any useful work done (GPR machines can do useful work in (0,2). (NOTE: the (m,n) notation requires more than 1 register. The accumulator only has one register; so, it's not a GPR!!) Make the common case fast! If there is one register, and registers are fast, and all work must be done in registers, and memory is slower than the register (virtually guaranteed), the common case isn't fast for an accumulator. GPR machines do work in the fast registers. 1.6 Adding forwarding and bypassing allows some stalls to be removed. On average, at least 1 stall will be removed, so the CPI should be lower (Sp(overall) = pipe depth / (1 + stalls/instruction)). 1.7 Both program A and B run faster. Assuming overall speedup, A runs 2.5 times faster, and B runs 3.14 times faster. Therefore the execution times are SHORTER. However, no information about the relationship between the execution times of A and B can be inferred from this result.
2.1 It doesn't matter what K is, speedup can NOT be 4. If you're only enhancing half of the instructions, the upper bound of the speedup is 2. 2.2 time new == (x + y/K) Since x is the time spent in the unenhancable part and y is the part that can be improved, and K is the enhancement on y. 2.3 Speedup overall = T/Tho = x+y / (x + y/K) This differs from the usual Amdahl's Law equation since x+y is not yet normalized to be 1. let w = x/(x+y) and z = y/x+y), then SPoverall = 1/(w+z/K). 2.4 let x be unenhanceable time let 10y be old floating time let y be new time x + 10y = 50 x + y = 10 subtracting 9y = 40 so y = 40/9 <--- answer Picture: __________________________ before :| x | 10y | == 50 ________ after :| x |y | == 10 by the way, x = 50/9, and 50/9 + 40/9 = 90/9 = 10 50/9 + 400/9 = 450/9 = 50 so it checks too!
Unless you realize that you can actually hard code "0.5" into a memory location and then load it into an FP register, then tony is right. you can't divide by two, or, rather, multiply by one-half. --- mmh
3.1 Assume location of B is at R1; Assume location of C is at R2; Assume location of future X1 is at R3; Assume location of future X2 is at R4; Assume single precision Floating Point; Assume sqrt takes F0, sqrts it,and returns to Floating Point Register F0 as single precision; Assume LFI exists because we can't do a shift on a float and expect meaningful results, and division just plain doesn't work, and I can't come up with a way to generate 0.5. We need 0.5 to do division by 2. LF F1, 0(R1) // Get B, save to F1 LF F2, 0(R2) // Get C, save to F2 ADDI R10, R0, #-4 // Load -4 to R10 MOVI2FP F0, R10 // then put it in the FP regs CVTI2F F0, F0 // and finally turn it into a float MULTF F2, F0, F2 // C = -4C by using -4 calculated before MULTF F3, F1, F1 // Calculate B squared, save to F3 ADDF F0, F3, F2 // Add F3 and F2 aka B^2 + (-4C), save to F0 MOVI2FP F10, R0 // Fetch (int) 0 from the integer regs CVTI2F F10, F10 // and make it (float) 0 LTF F0, F10 // Compare F0 to F10, save result in // magical status register (aka is F0 // negative) BFPT noroots // if compare is true, then there's no roots JAL sqrt // Procedure call sqrt (see assumption) // Note that PC + 4 is saves in R31 for return ADDI R10, R0, #-1 // Load (int) -1 into R10 MOVI2FP F10, R10 // then put it in a FP reg. CVTI2F F10, F10 // and make it (float) -1 MULTF F1, F1, F10 // make B = (-1)*B ADDF F11, F1, F0 // make + numerator in F11, aka 2X1 SUBF F12, F1, F0 // make - numerator in F12, aka 2X2 LFI F10, #0.5 // See comment above MULTF F11, F11, F10 // F11 = F11 * 1/2, so F11 = X1 MULTF F12, F12, F10 // F12 = F12 * 1/2, so F12 = X2 noroots SF 0(R3), F11 // write back X1 SF 0(R4), F12 // write back X2 // Note that garbage was written back when // there were no roots. User's fault, he // should've checked // DONE! 3.2 LB R1, 64(R0) // load 1st 8 bits from 64 LB R2, 65(R0) // load 2nd 8 bits from 65 LB R3, 66(R0) // load 3rd 8 bits from 66 LB R4, 67(R0) // load 4th 8 bits from 67 LB R5, 68(R0) // load 5th 8 bits from 68 LB R6, 69(R0) // load 6th 8 bits from 69 LB R7, 70(R0) // load 7th 8 bits from 70 LB R8, 71(R0) // load 8th 8 bits from 71 SB R8, 64(R0) // store in reverse order SB R7, 65(R0) // " SB R6, 66(R0) // " SB R5, 67(R0) // " SB R4, 68(R0) // " SB R3, 69(R0) // " SB R2, 70(R0) // " SB R1, 71(R0) // " // DONE! 3.3 1 LW R5, 8(R0) // Can't use 10 due to byte alignment error 2 ADD R4, R0, R0 // Can't use Double Precision FP Add here 3 ADD R3, R4, R5 // Doesn't make sense to put displacement // here, ALU can only do one thing at a time 4 ADDI R5, R5, #1 // R0 can't be dest., ADDI requires an // immediate value 5 SW 104(R4), R5 // we don't have indexed mode. 6 correct 7 SW 100(R4), R5 // can't have 103 due to byte alignment 8 ADDI R5, R5, #-32760 // Immediate too big for 16 bits 9 SW 108(R4), R5 // Dest. comes before source
NOTE: This is a badly specified problem. Any one you see will have more information, and will require less speculation than this.
4.1 variables used: CPI = Average Clock Cycle Count Per instruction IC = Instruction count CC = Time for 1 clock cycle a) MFLOPS = Floating IC --------------------- Execution time * 10^6 assume a pure floating point instruction mix T = Execution Time = Floating IC ------------- MFLOPS * 10^6 b) MIPS = Total IC --------------------- Execution time * 10^6 T = Execution Time = Total IC ----------- MIPS * 10^6 4.2 Assume CPI of ALU, Load/Store, Branches == 1 a) CPU time = IC x CPI x 1 / CR CR = Clock rate, measurec in cycles per second (Hz) IC for compiler O = 5+1+1 = 7 * 10^6 IC for compiler HO = 10+1+1 = 12 * 10^6 CPI is unchanged, assume = 1 CR is unchanged, given as 100 MHz = 100 * 10^6 cycles / sec b) MIPS = IC / (CPU time * 10^6) = IC / (IC x CPI x 1/CR * 10^6) = 1 / (CPI x 1/CR * 10^6) // CPI and CR are constant = 1 / (1 x 1/(100 x 10^6) x 10^6) = 1 / (1/100) = 100 c) MFLOPS assumes that there's a floating point insruction, which there isn't, so MFLOPS is undefined ----------------------------------------------- Compiler O Compiler HO a) CPUtime: .007s .012 s b) MIPS: 100 100 c) MFLOPS: undefined undefined 4.3 The new compiler is bad, it produces longer code. The branches are not reduced, the loads are unchanged, but the ALU instructions doubled. That's bad. 4.4 If MIPS of O is higher than HO, I can assume that CPI != 1 for all instuctions, and that the code produced by O have more instructions with lower CPI (for example, software based floating point instead of hardware FP). The reversal of MFLOPS supports the theory above, implying that there are much fewer FP instruction in the O code. Based on these observations, MFLOPS and MIPS are horrible for use as predictors of execution time, since the FP mix is not constant.
5.1 Assume read and write on registers during 1 clock is allowed, write first, read second. Assume 1 read/write port on memory (so resource contention does exist) Control : None (there are no branches) Data : RAW hazard on line 3: LW R5, 20(R0) IF D EX MEM WB ADD R4, R0, R0 -- IF D EX MEM WB SW 100(R4), R5 -- -- IF D st st EX <-- RAW hazard solve by forwarding from LW R5, 20(R0)'s MEM to SW's EX Structural : Hazard on line 4, can't read and write at the same time. - Solve by separate memories or multiple ports contention for instruction and data memory on line 6. Error on line 7 1 LW R5, 20(R0) IF D EX MEM WB 2 ADD R4, R0, R0 -- IF D EX MEM WB 3 SW 100(R4), R5 -- -- IF D st st EX MEM WB 4 ADD R5, R5, R5 -- -- -- IF st st ID EX MEM WB 5 SW 104(R4), R5 -- -- -- -- -- -- IF ID st st EX MEM WB 6 ADD R5, R5, R5 -- -- -- -- -- -- -- IF st st ID 7 SW 106(R4), R5 <-------------------------- can't load word from location 106, structural hazard 8 ADD R5, R5, R5 9 SW 108(R4), R5 5.2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 1 IF D EX MEM WB 2 -- IF D EX mem WB 3 -- -- IF D st st EX MEM wb 4 -- -- -- IF st st ID EX mem WB 5 -- -- -- -- -- -- IF ID st st EX MEM wb 6 -- -- -- -- -- -- -- IF st st ID EX mem WB 7 -- -- -- -- -- -- -- -- -- -- IF ID st st EX MEM wb 8 -- -- -- -- -- -- -- -- -- -- -- IF st st ID EX mem WB 9 -- -- -- -- -- -- -- -- -- -- -- -- -- -- IF ID st st EX MEM wb NOTE: I modified the code in lines 7 and 9 to make it legal CPI = clocks / instrutions == 21/9 == 2.333 5.3 1 2 3 4 5 6 7 8 9 10 11 12 13 1 LW R5, 20(R0) IF D EX MEM WB 2 ADD R4, R0, R0 -- IF D EX mem WB 3 SW 100(R4), R5 -- -- IF D EX MEM wb (forward from 2.EX to 3.EX) 4 ADD R5, R5, R5 -- -- -- IF ID EX mem WB (forward from 1.MEM to 4.EX) 5 SW 104(R4), R5 -- -- -- -- IF ID EX MEM wb (forward from 4.EX to 5.MEM) 6 ADD R5, R5, R5 -- -- -- -- -- IF ID EX mem WB (forward from 4.EX to 5.EX) 7 SW 108(R4), R5 -- -- -- -- -- -- IF ID EX MEM wb (forward from 6.EX to 7.MEM) 8 ADD R5, R5, R5 -- -- -- -- -- -- -- IF ID EX mem WB (forward from 6.EX to 8.EX) 9 SW 112(R4), R5 -- -- -- -- -- -- -- -- IF ID EX MEM wb (forward from 8.EX to 9.MEM) CPI = 13/9 == 1.44 No data hazards.Back