A)False. The PowerPC use (0,3) RISC ISA.
B) False. The term MIPS means million instructions per second. The term
MFLOPS means million float points instructions per second. The average
throughput means average tasks completed per unit time.
C) False. Not all GPR machines are RISC architectures.
D) False. Make the common case fast means that we should improve the
frequent
case rather than the rare case.
E) False. When we talked about (1,1) machine, we talked about GPR
machine. The accumulator machine has only 1 register. So, it isn't
a
GPR machine.
F) False. According to Amdahl's law,
1
Speedup_overall = ---------------
(1-FE)+ FE/SE , FE=Fraction of Enhancement,
SE=Speedup of the Enhancement. More computation needs to be done
to
compute the amount of time saved, and it is not proportional to 1/FE
or 1/SE, and there is no possible CONSTANT k.
G) False. The instruction set architecture is the key factor in determining
the
average number of clocks per instruction.
H) False. Cache is small fast memory close to CPU which stored the recently
used
code or data, it is introduced because the principle of locality of
reference
which is the data used recently is more likely used in the near future,
also because
of " Make the common case fast.
I) False. Stack machine require no operand in their ALU instruction.
J) False. The principle difference between RISC and CISC architectures
is the ISA.
From the name, RISC, reduced instruction set computer, CISC, complex
instruction
set computer.
K) False. Because of the pipeline hazard.
L) False. Pipeline is designed to make the CPU fast.
________________________________________________________________________
SOLUTION TO PROBLEM 2.1
We know,
Ins Count Clock Rate
MIPS = -------------
= ------------
XEQ * 10^6 CPI * 10^6
MIPS is easy to understand, however it does not always mesaure the
computer performance accurately. MIPS is dependent on ins. set, making
it
difficult to compare MIPS of computers with diff ins. set. More
importantly, MIPS can vary inversely to performance. On the other hand,
MFLOPS is not of much use when there is no floating point operation.
Execution time of CPU is the real measure of performance.
________________________________________________________________________
SOLUTION TO PROBLEM 2.2
Structural Hazard-- Since we have only one memory
port, we will have
the problem of our limited
resource when both "IF" and "MEM" stage
tries to access the memory
at the same time.
Date Hazard--
When some instruction code line is trying to
access the value of a specific
register, which is supposed to be
written after some previous
instruction is executed, BUT not yet
written at that stage, Data
Hazard will occur. Cause the current
instruction will either
read some wrong value or may be
stalled.
Control Hazard-- If there
is a jump or branch instruction, but
the previous instruction
is yet to set the value of PC register
or other appropriate registers,
execution may stall and Control
Hazard occurs.
In general, if hazards are ignored, you get the wrong answers,
rendering your machine useless and untrustworthy.
________________________________________________________________________
SOLUTION TO PROBLEM 2.3
An object of byte S at address A is byte aligned if
A mod S = 0.
In DLX words should be byte aligned. Memory is usually aligned on a
word
or double-word boundary. A misaligned memory access will take more
aligned memory references. So, misalignment will slow the program
execution. And misalignment causes hardware complications. It
may
also cause the system to crash, or for you to get wrong information.
________________________________________________________________________
SOLUTION TO PROBLEM 2.4
We know,
CPU = CPI * IC * CCT
When in design, clock cycle time is the most difficult to get, since it is very hard to get the clock cycle time when the design is not completed. Once the computer is built, it is very easy to get the clock cycle time. The average CPI is also hard to get because it depends on the detailed processor, it is even harder for modern processor using pipelining and memory hierarchies. The instruction count is the easiest, because at this stage, it is usuallysufficient to focus on static IC. The static IC can be computed using code analysis or simulations.
In an existing system, CCT is easiest, as the clock is there already. CPI can be estimated from running benchmarks, or simply code that exercises each instruction. The problem in an existing system is to predict the average dynamic code size to get the run-time IC.
________________________________________________________________________
SOLUTION TO PROBLEM 2.5
Lets assume, CPI, IC, and CCT stands for the original (initial) values of CPI, Ins. Count and Clock Cycle Time.
We know,
CPU TIME (old) = CPI * IC
* CCT
Now, CPI increased 5%, IC decreased by 20% and CLOCK rate is twice means Clock Cycle time is half of old. Remember CCT is inverse of Clock rate.
So we have,
CPU TIME (new) = (CPI + .5*CPI) * (IC- .2*IC) * (.5* CCT)
and we know,
XEQ (old) CPI * IC * CCT
SpeedUp = ----------- =
---------------------------
XEQ (new) (1.05*CPI) * (.8*IC) * (.5*CCT)
Solving the equation yields,
SpeedUp = 1/.42 = 2.580
________________________________________________________________________
SOLUTION TO PROBLEM 3.1
Clock Cycle Time (CCT) increased 10%. This also means Clock Rate decreased. Remember Clock Cycle Time is inverse of Clock Rate. When clock rate decreased, Computer speed will decrease. The computer will seem slower than before.
Again, CPU = CPI * IC * CCT and CPI = (Clock Cycles/Ins Count) An increase in CCT will increase CPU Time. Increase in Clock Cycle Time will take more time per Clock Cycle now. So, it should take less cycle to run a program, which will decrease CPI.
But after all, the performance may deteriorate because CPU time is increasing.
_______________________________________________________________________
SOLUTION TO PROBLEM 3.2
Still under debate--but there will be no such problem on the final--nothing as bizarre as this.
_______________________________________________________________________
Problem 4
4.1.1 Initiate stack
Assume: 1)4000 is the bottom of the stack address.
2)5000 is the start of the
stack address.
3)R31 point to the top of
ADDI R31,R0,#5000 ;R31 <- #5000
LOOP: ADDI R10,R0,#5000
;R10 <- #5000
LW 0(R10), R0
;PUT 0 INTO STACK
SUBI R10,R10,#4 ;CHANGE
POINTER TO THE NEXT STACK
SUBI R11,R10,#4000 ;IS IT THE BOTTOM OF THE
STACK?
BNEZ R11, LOOP ;
4.1.2
ADDI R1,R31,R0
;Get the pointer to the stack
SUBI R2,R1,#4000 ;If the
stack is full?
BENZ R2,STOP
;
LW R2,0(R30)
;Get the value to be pushed
SW 0(R1),R2
;push
SUBI R1,R1,#4
;changed the pointer
ADD R31,R1,R0
;store the pointer value in R31
STOP
4.1.3
Stack is used to store machine state during interrupts and procedure
calls--where the machine state includes the contents of registers, and
perhaps, for exception handling, portions of the pipeline stage
registers, and other internal data. Procedure calls in DLX jump to
the correct subroutine address (using JAL for jump and link). Register
R31 holds the return address of the subroutine, which is typically dumped
on a stack so that the called routine can have access to all the registers.
4.3
4.3.1
1 2 3 4 5 6
7 8 910 11 12 13 14 15 16 17 18 19 20 21 22 23
loop:
LW R2,100(R1) | F D X M W
LW R3,500(R1) | F D X
M W
SUBI R5,R3,R2 |
F S S D X M W
ADD R5,R5,R4 |
F S S D X M
W
SW 1000(R1),R5 |
F S S D X
M W
ADDI R1,R1,#4 |
F D X M W
SUBI R5,R1,#400 |
F S S D X M W
BNEZ R5,Loop |
F D X M W
-----------------------------------------------------------------------------
LW R2,100(R1) |
F D X M W
1) For one iteration, if ignoring BENZ command, one iteration takes
19 clock
cycles, and execute 7 instructions, so
CPI=19/7=2.71
2) The total CPI. if ignoring the contribution of the branch, it takes
17
cycle for one iteration if the branch is untaken, all branches are
untaken
except the last one. The last iteration will take 20 clock cycle.
Also for first iteration
ADDI R1,R0,R0 F D X M W
LW R2,100(R1) F S S D X M W
There need an extra 3 clock cycle to start the loop.
So, the total clock cycle= 17*99+20+3=1706
The total instruction number is 1+100*8=801
Thus we got the total average CPI=1706/801=2.13
4.3.2
1) Rewite the code operate on five pairs of array operands during loop
body.
ADD R1,R0,R0
;init counter to 0
Loop:
LW R2,100(R1)
;R1 has offset of first array element
LW R3,500(R1)
;R3 holds another array element
LW R4,104(R1)
;get the second element
LW R5,504(R1)
LW R6,108(R1)
;get the third element
LW R7,508(R1)
LW R8,112(R1)
;get the fouth element
LW R9,512(R1)
LW R10,116(R1)
;get the fifth element
LW R11,516(R1)
SUBI R12,R3,R2
;perform substraction
SUBI R13,R5,R4
SUBI R14,R7,R6
SUBI R15,R9,R8
SUBI R16,R11,R10
ADD R12,R12,R4
;perform addition of constant
ADD R13,R13,R4
ADD R14,R14,R4
ADD R15,R15,R4
ADD R16,R16,R4
SW 1000(R1),R12 ;store
the result
SW 1004(R1),R13
SW 1008(R1),R14
SW 1012(R1),R15
SW 1016(R1),R16
ADDI R1,R1,#20
;increment pointer
SUBI R20,R1,#400 ;check pointer
BENZ R20,LOOP
;branch while not done
STOP
2) Because in this case there is no data hazard, inside the loop, if
not considering
control hard, the average CPI would be 1.
The overall execution time is 20*28+1=561 clock time.
ADD R1,R0,R0 |
Loop: |
LW R2,100(R1) |F D X M W
LW R3,500(R1) | F D X M W
LW R4,104(R1) | F D X M W
LW R5,504(R1) | F D X M W
LW R6,108(R1) |
F D X M W
LW R7,508(R1) |
F D X M W
LW R8,112(R1) |
F D X M W
LW R9,512(R1) |
F D X M W
LW R10,116(R1) |
F D X M W
LW R11,516(R1) |
F D X M W
ADDI R1,R1,#20 |
F D X M W
SUBI R12,R3,R2 |
F D X M W
SUBI R13,R5,R4 |
F D X M W
SUBI R14,R7,R6 |
F D X M W
SUBI R15,R9,R8 |
F D X M W
SUBI R16,R11,R10|
F D X M W
SUBI R20,R1,#400|
F D X M W
ADD R12,R12,R4 |
F D X M W
ADD R13,R13,R4 |
F D X M W
ADD R14,R14,R4 |
F D X M W
ADD R15,R15,R4 |
F D X M W
ADD R16,R16,R4 |
F D X M W
SW 980(R1),R12 |
F D X M W
SW 984(R1),R13 |
F D X M W
SW 988(R1),R14 |
F D X M W
SW 992(R1),R15 |
F D X M W
BNEZ R20,Loop |
F D X M W
SW 994(R1),R16 |
F D X M W
STOP
|