Problem 1 --------- A. False. A RISC machine usually supports fewer register address modes that CISC machines (DLX supports 3). Also, more complex modes cause higher CPI. B. True. Data hazards can be solved using forwarding. C. False. See p.44 of textbook. D. False. This is loading a double word, which occupies 8 bytes in memory. 100 mod 8 = 4 and this causes a memory alignment error. E. False. There isn't a strict relationship between pipe depth and CPI. F. True. Rate = time/job, Performance = 1/Rate, Performance = job/time. G. False. See p.33 of textbook. H. True/False. True for a pipelined machine, false for a sequential one. I. False. Structural hazards are fixed by splitting the instruction and data memories (called a Harvard Architecture) or by adding more read/write ports. J. True. M is in memory, and (1,2) means that at most one operand to an ALU instruction can be in memory. K. False. Latency is the number of clock cycles between the first EX stage (M1) and the availability of the sum (M4). 4 - 1 = 3. L. False. Length of the clock cycle is determined by the slowest stage in the pipeline. M. False. The floating-point divide is not pipelined. Problem 2.1 ----------- CRnew = .8 Crold ICnew = .6 ICold Xeq = (CPI*IC)/CR CPInew = 1.3 CPIold Speedup Overall=XeqOld/ XeqNew =(CPIold*ICold*CRnew)/(CRold*CPInew*ICnew) Speedup Overall = .8/(.6*1.3) = 1.026 Very little speedup Problem 2.2 ----------- Old New Inst time inst time ALU .5 XeqOld ALU .25 XeqNew Load .2 XeqOld Load .1 XeqNew Other .3 XeqOld Other .65 XeqNew Time spent on Other inst. Old = Time spent on Other inst. New (Because the problems says other instructions are unaffected by the compiler.) .3 XeqOld = .65 XeqNew XeqNew = .3/.65 XeqOld = .46 XeqOld Speedup Overall = XeqOld/XeqNew = 1/.46 = 2.17 Problem 2.3 ----------- Old Inst. freq. (%) ALU .4 Loads .2 Stores .15 Other .25 = 1.0 A quarter of the ALU instructions cause a load, so the addition of the ALU-L eliminates a quarter of the ALUs ( or .1). The same number of loads must also be eliminated by the ALU-L, so the loads decrease by .1 also. The new frequencies are: Inst. Freq. (%) ALU .3 ALU-L .1 Loads .1 Stores .15 Other .25 =.9 We have to re-normalize^Å Inst. Freq. (%) ALU .3/.9 = .334 ALU-L .1/.9 = .111 Loads .1/.9 = .111 Stores .15/.9 = .167 Other .25/.9 = .277 = 1.0 Problem 3.1 ----------- DLX-LAX is a (0,3) RISC ISA, just like good old DLX. The changes made to the original DLX don't affect the format of the ALU instructions, which the (m,n) notation refers to. Problem 3.2 ----------- ADDI R1, R0, #100 // R1 IS FOR INDEXING LW R2, (R1) // R2 <- MEM[R1] (A) ADDI R1, R1, #4 // INCREMENT FOR NEXT WORD LW R3, (R1) // R3 <- MEM[R1] (B) ADDI R1,R1, #4 // R1 <- R1 +4 LW R4,(R1) // R4 <- MEM[R1] (C) ADDI R1, R1,#4 // R1 <- R1 +4 LW R5, (R1) // R5 <- MEM[R1] (D) ADD R6,R2,R3 // R6 = A + B ADD R7,R4,R5 // R7 = C + D ADD R8, R6,R7 // R8 = A + B + C + D ADDI R1,R0,#200 // R1 <- 200 SW (R1), R8 // MEM[R1] <- R8 Problem 3.3 ----------- The changes made to the DLX architecture to make it DLX-LAX do not affect the way we handle branches. We know an inst is a branch by the ID stage, and we have the necessary data to test by ID as well. So, we can test the branch condition in ID just as in the 5 stage DLX. Problem 3.4 ----------- Structural hazards - resource contention: if for example, ID and ALUWB occur simultaneously (register file can't handle both), or if IF1, IF2 and MEM1, MEM2 occur simultaneously (no shared memory). Data hazards - RAR: no such thing. WAR: may be possible. WAW: if ALUWB completes after a subsequent EX1/LWB, the old value may overwrite new one. RAW: ALU's don't get written until stage 8, but dependent ALU's need them in EX1. Control hazards - just as in 5 stage DLX (see 3.2) Problem 3.5 ----------- The clock cycle time for DLX-LAX would probably be lower, since the longer stages involving memory (fetch, mem) have been split up. CPI on DLX-LAX would probably be slightly higher also (but not conclusively) because the clock cycle time is shorter. IC would increase because the lack of any addressing modes besides register deferred means we need extra ALU operations to calculate address for loads and stores. Problem 4.1 ----------- Assume still no forwarding or [WB, ID], no structural hazards present (because of separation of instruction/data memories and multiple read/write ports). Stage A (Producers) - ALUWB, EX1/LWB Stage B (Consumers) - EX1/LWB, MEM1 1. (EX2, EX1/LWB) 1 2 3 4 5 6 7 8 9 10 11 ADD R3,R1,R1 F1 F2 ID M1 M2 EX1 EX2 WB ADD R4,R1,R3 F1 F2 ID M1 M2 S S EX1 EX2 WB 2 stalls needed 2. (EX2,MEM1) 1 2 3 4 5 6 7 8 9 10 11 ADDI R2,R0,#100 F1 F2 ID M1 M2 EX1 EX2 WB LW R4,(R2) F1 F2 ID S S S S M1 M2 EX1 4 stalls needed 3. (MEM2, EX1/LWB) 1 2 3 4 5 6 7 8 9 10 11 LW R2,(R1) F1 F2 ID M1 M2 EX1 EX2 WB ADD R4,R2,R3 F1 F2 ID M1 M2 EX1 EX2 WB No stalls needed 4. (MEM2, MEM1) 1 2 3 4 5 6 7 8 9 10 11 LW R2,(R3) F1 F2 ID M1 M2 EX1 EX2 WB LW R4,(R2) F1 F2 ID S S M1 M2 EX1 EX2 WB 2 stalls needed Problem 4.2 ----------- Speedup from pipeline = Xeq Unpiped/ Xeq Piped Xeq = CPI*IC*CC Piped Unpiped Same IC Same 8+1= 9ns CC 44ns from overhead sum of all stages 1 CPI 1 ideal because no stalls present note: 1 inst/CC = 1 CC/inst -> 1 CPI Xeq Unpiped = 1 * IC * 44 = 44IC Xeq Piped = 1 * IC * 9 = 9IC Speedup from pipeline = Xeq Unpiped/ Xeq Piped = 44/9 = 4.89 Note: this makes sense because in a perfect world^Å Speedup from pipeline goes like = Time Unpiped/# stages in pipeline = 44/8 = 5.5 the decrease from ideal performance comes from the 1ns overhead in each stage.Back