CACHE QUESTIONS:
Question 1:
Pretend we switch from a cache which requires wait states during operation to a cache
which doesn't require wait states. Which variable from the cache performance formula
have we just reduced?
Question 2:
Now we've changed our main memory over to some incredible new high-speed DRAM
(B-DRAM). Our old DRAM took 60 ns to access, but this new, incredible B-DRAM only
has a 20 ns access time. Additionally, we've added a victim cache which managed to
cut down on misses by 15%. What variables from the cache performance formula have we
altered? How much faster is memory access with this configuration than with the old
one?
Question 3:
Pretend that we have just upgraded our machine from pipelined synchburst to pipelined ZBT
SRAM. Using the chart provided in the cache guide and the CPU performance equation,
tell me what kind of performance improvement we expect out of a READ-WRITE-READ
instruction combination (assuming that the memory has no other effect on the
machine).
Question 4:
You're a technician at Cyrix (one of Intel's big competitors) and your boss has
asked you to research the performance gain which can be achieved by switching to
daughterboard or integrated L2 cache (you're currently using motherboard cache).
Use the definitions in the cache guide along with the CPU performance equation to
explain to your boss the speedup daughterboard cache provides over motherboard
cache. Also, explain the speedup from using an integrated L2 cache over motherboard
cache. Assume a 66 MHz memory bus and a 166 MHz CPU speed (don't jump on me if I've
provided an impossible bus combination! This is totally pedagogical! :-P).
Don't fudge your answers -- the boss wants to see FORMULAS!
Question 5:
Hennessy and Patterson mention in section 5.6 that some of the Cray
supercomputers (such as the C-90) actually used SRAM instead of DRAM to provide
incredibly fast main memory. Let's figure out why this is completely psychotic.
Let's work with a price of $75 for a 256KB SRAM module (source -- prices probably
change quickly). Let's also assume that the SRAM module has an access time of 10 ns.
Also, we'll assume that DRAM is $87 for a 32 MB module (source -- prices probably
change quickly) with an access time of 60 ns. With everything else being equal, how
many times faster would our memory be if we replaced the DRAM with SRAM? How many
times more expensive would it be?
Question 6:
I didn't discuss split and unified caches (for a complete run-down, check the
Hell of Caches or hit da book), but that's not going to stop me from asking a question
about them. The basic idea in a unified cache is that both instructions and data
share the same cache. In a split cache, (you guessed it) one cache is used only for
instructions and one is used only for data. H&P mention in the example in the
book that unified caches are preferable under some circumstances while split caches
are preferable under others. They mention that unified caches have a lower miss rate
(read my explanation of higher associativity if you don't understand why -- the
concept is similar) and that split caches cut back on structural hazards because each
cache has its own read and write ports. In this kid's ever-so-humble opinion, these
are somewhat peripheral issues -- after all, if you compared two split caches of size
32k each to a unified cache of size 32k, your miss rate should be about the same
(their example always assumes each split cache is 1/2 the size of a comparable
unified cache to keep the two on an "even ground"). Additionally, what's really to
stop you from putting a second read/write port onto a unified cache and adding
whatever hardware's needed to do double accesses? So in trying to compare split and
unified caches on an "even ground", H&P have really confused the issue -- I believe
that the real benefit of a split cache over a unified cache is your ability to change
the relative sizes and associativities of the data and instruction caches as needed
to provide the most benefit (at the lowest cost) to both.
That's the impetus for this question -- we're going to do a speedup calculation to
see the benefit (or non-benefit) of making this change. Pretend that you have a
direct-mapped split cache (both halves of size 32k) with a hit time of 5 cycles for
instructions and 3 cycles for data. The miss rate is .39% for the instruction cache
and 4.82% for the data cache. The miss penalty is 50 clock cycles. Assume the
reference mix is 75% instruction references and 25% data references.
Now, we're going to increase the size of the data cache to 64k and change its
associativity to eight-way set-associative with LRU replacement. This causes the
data cache's miss rate to drop to 1.39%, but causes its hit time to increase from 3
to 4 cycles. Compute the speedup, if any.
Question 7:
Here are some quickies -- spit 'em out fast, and no fair peeking:
1) Victim caches were created to reduce what kind of misses?
a) conflict  b) compiler  c) compulsory  d) capacity
2) Prefetching reduces what kind of misses?
a) conflict  b) compiler  c) compulsory  d) capacity
3) In "Small and Simple Caches", H&P recommend using set-associative caches
whenever possible.
a) true  b) false  c) maybe
4) Critical word first allows the CPU to continue processing before an entire block
is filled.
a) true  b) false  c) maybe
5) Early restart is a technique for reducing compulsory misses when powering up the
machine.
a) true  b) false  c) maybe
AND HERE ARE THE ANSWERS ....
Answer 1:
Hit time has been reduced. Information can be cranked through the
cache quicker, so a "hit" takes less time than it used to.
Answer 2:
B-DRAM cuts down the miss penalty. Now, when we have a
"miss" and have to run out to main memory, it doesn't take nearly as long as it
used to. The victim cache cuts down on the miss rate. What's the
performance difference? Let's check it out:
SPEEDUP = MemoryAccessTime_old / MemoryAccessTime_new
        = (HitTime_old + (MissRate_old * MissPenalty_old)) /
          (HitTime_new + (MissRate_new * MissPenalty_new))
We haven't altered the hit time (and we weren't given one), so we'll treat it as
negligible next to the miss terms and drop it. We're now left with:
SPEEDUP = (MissRate_old * MissPenalty_old) / (MissRate_new * MissPenalty_new)
        = ((1.15 * MissRate_new) * (3 * MissPenalty_new)) /
          (MissRate_new * MissPenalty_new)
Cross out the MissRate and MissPenalty variables and you've got it -- memory access
on the new machine is 3.45 times faster than on the old one... total cake!
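If you want to double-check the arithmetic, here's a quick Python sketch. The 1.15
factor and the 3x penalty reduction come straight from the answer above; the actual
miss rate and penalty numbers are placeholders of my own, since they cancel out of
the ratio anyway:

# Answer 2 sanity check: only the ratios matter, not the absolute values.
def miss_term(miss_rate, miss_penalty_ns):
    # Miss contribution to average memory access time (hit time ignored,
    # as in the write-up above).
    return miss_rate * miss_penalty_ns

old = miss_term(miss_rate=0.0115, miss_penalty_ns=60)  # placeholder rate, 60 ns DRAM
new = miss_term(miss_rate=0.0100, miss_penalty_ns=20)  # 15% fewer misses, 20 ns B-DRAM

print(old / new)  # 3.45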
Answer 3:
SPEEDUP = (IC_old * CPI_old * CC_old) / (IC_new * CPI_new * CC_new)
This one's pretty easy -- everything crosses out except for CPI_old = 7 and
CPI_new = 5 (the cycle counts for a READ-WRITE-READ sequence from the chart in the
cache guide). 7/5 = 1.4, so the new memory is running 1.4 times faster when
confronted with this particular instruction sequence.
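A tiny Python version of the same calculation, just to show that IC and clock cycle
really do drop out (the 7- and 5-cycle figures are the ones quoted above from the
cache guide's chart):

# Answer 3: CPU performance equation with only CPI changing.
ic, cc = 1, 1               # instruction count and cycle time cancel in the ratio
cpi_old, cpi_new = 7, 5     # READ-WRITE-READ cycle counts (synchburst vs. ZBT)
speedup = (ic * cpi_old * cc) / (ic * cpi_new * cc)
print(speedup)  # 1.4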
Answer 4:
SPEEDUP = (IC_old * CPI_old * CC_old) / (IC_new * CPI_new * CC_new)
This isn't so bad, either -- again, we're only really manipulating one variable.
Here, we're manipulating the clock cycle time, so make sure to invert your MHz to
get CC back out of the clock rate. Everything else crosses out.
First, we'll do motherboard vs. daughterboard (remember that daughterboard cache
runs at 1/2 the clock rate of the CPU):
SPEEDUP = (1/66) / (2 * (1/166)) = 1.2575
So daughterboard cache is 1.2575 times faster than motherboard cache.
Now we'll do motherboard vs. integrated:
SPEEDUP = (1/66) / (1/166) = 2.5151
So integrated cache is 2.5151 times faster than motherboard cache -- not tough at all!
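Here's the same thing in a few lines of Python (the "half the CPU clock" figure for
daughterboard cache is the definition from the cache guide, restated above):

# Answer 4: compare clock cycle times; cycle time = 1 / clock rate.
bus_mhz, cpu_mhz = 66.0, 166.0

cc_motherboard   = 1 / bus_mhz         # L2 on the motherboard runs at bus speed
cc_daughterboard = 2 * (1 / cpu_mhz)   # runs at half the CPU clock
cc_integrated    = 1 / cpu_mhz         # runs at the full CPU clock

print(cc_motherboard / cc_daughterboard)  # ~1.2576
print(cc_motherboard / cc_integrated)     # ~2.5152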
Answer 5:
First, let's figure out how many 256KB modules are in 32 MB. There are four
256KB modules in a MB (1024 KB / 256 KB = 4), so we need to buy 128 of these modules
(32 * 4 = 128) to completely replace our DRAM. Total cost would be $9,600 (128 *
$75). SRAM would be just over 110 times more expensive ($9,600 / $87 = ~110x).
Now, let's figure out how much faster the memory would be. This is another simple
speedup ratio, so I'll skip the formalities. It should be fairly obvious that the
new RAM is 6 times faster than the old RAM (60 ns / 10 ns = 6). Not the kind of
money that the normal human being would be laying out for such a small performance
increase! But then "normal" people never did use Crays, did they......?
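And a Python sketch of the cost/speed arithmetic, using only the prices and access
times given in the question:

# Answer 5: cost of replacing 32 MB of DRAM with 256KB SRAM modules.
sram_price_per_256kb = 75.0    # $ per 256KB SRAM module
dram_price_per_32mb  = 87.0    # $ per 32 MB DRAM module

modules_needed = (32 * 1024) // 256          # 256KB modules in 32 MB -> 128
sram_cost = modules_needed * sram_price_per_256kb

print(sram_cost)                              # 9600.0
print(sram_cost / dram_price_per_32mb)        # ~110.3 times more expensive
print(60 / 10)                                # 6.0 times faster (60 ns vs 10 ns)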
Answer 6:
Old memory access time = [75% * (5 + (.0039 * 50))] + [25% * (3 + (.0482 * 50))] = 5.25 cycles
New memory access time = [75% * (5 + (.0039 * 50))] + [25% * (4 + (.0139 * 50))] = 5.07 cycles
5.25 / 5.07 = 1.036 ... so we're accessing memory about 3.6% faster than we used to.
Not exactly blazing through the Excel spreadsheets, but hopefully this was enough
to get across the idea that being able to modify the size and associativity of split
caches independently has some merit.
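For the skeptical, here's the same average-memory-access-time calculation in Python
(all the numbers are the ones given in Question 6):

# Answer 6: weighted average memory access time for a split cache.
def amat(inst_frac, i_hit, i_miss, d_hit, d_miss, penalty):
    # inst_frac of references go to the instruction cache, the rest to the data cache.
    data_frac = 1 - inst_frac
    return (inst_frac * (i_hit + i_miss * penalty)
            + data_frac * (d_hit + d_miss * penalty))

old = amat(0.75, i_hit=5, i_miss=0.0039, d_hit=3, d_miss=0.0482, penalty=50)
new = amat(0.75, i_hit=5, i_miss=0.0039, d_hit=4, d_miss=0.0139, penalty=50)

print(old, new, old / new)  # ~5.25, 5.07, speedup ~1.035 (the write-up rounds to 1.036)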
Answer 7:
1) a = conflict (I didn't get you with
that lame "compiler" answer, did I?)
2) c = compulsory
3) b = false (You're supposed to use
direct-mapped caches whenever possible -- they're the simplest)
4) a = true
5) b = false (It's a technique for reducing
miss penalty, not miss rate!)
This page (c) 1998, Brian Renn -- last modified: 12/13/98.
Permission is granted for reuse and distribution of the unmodified page for
academic purposes.
Send non-academic permission requests and corrections to brenn@glue.umd.edu.