CACHE QUESTIONS:
Question 1:
Pretend we switch from a cache which requires wait states during operation to a cache
which doesn't require wait states. Which variable from the cache performance formula
have we just reduced?
Question 2:
Now we've changed our main memory over to some incredible new high-speed DRAM
(B-DRAM). Our old DRAM took 60 ns to access, but this new, incredible B-DRAM only
has a 20 ns access time. Additionally, we've added a victim cache which managed to
cut down on misses by 15%. What variables from the cache performance formula have we
altered? How much faster is memory access with this configuration than with the old
one?
Question 3:
Pretend that we have just upgraded our machine from pipelined synchburst to pipelined ZBT
SRAM. Using the chart provided in the cache guide and the CPU performance equation,
tell me what kind of performance improvement we expect out of a READ-WRITE-READ
instruction combination (assuming that the memory has no other effect on the
machine).
Question 4:
You're a technician at Cyrix (one of Intel's big competitors) and your boss has
asked you to research the performance gain which can be achieved by switching to
daughterboard or integrated L2 cache (you're currently using motherboard cache).
Use the definitions in the cache guide along with the CPU performance equation to
explain to your boss the speedup daughterboard cache provides over motherboard
cache. Also, explain the speedup from using an integrated L2 cache over motherboard
cache. Assume a 66 MHz memory bus and a 166 MHz CPU speed (don't jump on me if I've
provided an impossible bus combination! This is totally pedagogical! :-P).
Don't fudge your answers -- the boss wants to see FORMULAS!
Question 5:
Hennessy and Patterson mention in section 5.6 that some of the Cray
supercomputers (such as the C-90) actually used SRAM instead of DRAM to provide
incredibly fast main memory. Let's figure out why this is completely psychotic.
Let's work with a price of $75 for a 256KB SRAM module (source -- prices probably
change quickly). Let's also assume that the SRAM module has an access time of 10 ns.
Also, we'll assume that DRAM is $87 for a 32 MB module (source -- prices probably
change quickly) with an access time of 60 ns. With everything else being equal, how
many times faster would our memory be if we replaced the DRAM with SRAM? How many
times more expensive would it be?
Question 6:
I didn't discuss split and unified caches (for a complete run-down, check the
Hell of Caches or hit da book), but that's not going to stop me from asking a question
about them. The basic idea in a unified cache is that both instructions and data
share the same cache. In a split cache, (you guessed it) one cache is used only for
instructions and one is used only for data. H&P mention in the example in the
book that unified caches are preferable under some circumstances while split caches
are preferable under others. They mention that unified caches have a lower miss rate
(read my explanation of higher associativity if you don't understand why -- the
concept is similar) and that split caches cut back on structural hazards because each
cache has its own read and write ports. In this kid's ever-so-humble opinion, these
are somewhat peripheral issues -- after all, if you compared two split caches of size
32k each to a unified cache of size 32k, your miss rate should be about the same
(their example always assumes each split cache is 1/2 the size of a comparable
unified cache to keep the two on an "even ground"). Additionally, what's really to
stop you from putting a second read/write port onto a unified cache and adding
whatever hardware's needed to do double accesses? So in trying to compare split and
unified caches on an "even ground", H&P have really confused the issue -- I believe
that the real benefit of a split cache over a unified cache is your ability to change
the relative sizes and associativities of the data and instruction caches as needed
to provide the most benefit (at the lowest cost) to both.
That's the impetus for this question -- we're going to do a speedup calculation to
see the benefit (or non-benefit) of making this change. Pretend that you have a
direct-mapped split cache (both halves of size 32k) with a hit time of 5 cycles for
instructions and 3 cycles for data. The miss rate is .39% for the instruction cache
and 4.82% for the data cache. The miss penalty is 50 clock cycles. Assume the
reference mix is 75% instruction references and 25% data references.
Now, we're going to increase the size of the data cache to 64k and change its
associativity to eight-way set-associative with LRU replacement. This causes the
data cache's miss rate to drop to 1.39%, but causes its hit time to increase from 3
to 4 cycles. Compute the speedup, if any.
Question 7:
Here are some quickies -- spit 'em out fast, and no fair peeking:
1) Victim caches were created to reduce what kind of misses?
a) conflict  b) compiler  c) compulsory  d) capacity
2) Prefetching reduces what kind of misses?
a) conflict  b) compiler  c) compulsory  d) capacity
3) In "Small and Simple Caches", H&P recommend using set-associative caches
whenever possible.
a) true  b) false  c) maybe
4) Critical word first allows the CPU to continue processing before an entire block
is filled.
a) true  b) false  c) maybe
5) Early restart is a technique for reducing compulsory misses when powering up the
machine.
a) true  b) false  c) maybe
AND HERE ARE THE ANSWERS ....
Answer 1:
Hit time has been reduced. Information can be cranked through the
cache quicker, so a "hit" takes less time than it used to.
Answer 2:
B-DRAM cuts down the miss penalty. Now, when we have a
"miss" and have to run out to main memory, it doesn't take nearly as long as it
used to. The victim cache cuts down on the miss rate. What's the
performance difference? Let's check it out:
SPEEDUP = MemoryAccessTime_old / MemoryAccessTime_new
        = (HitTime_old + (MissRate_old * MissPenalty_old)) /
          (HitTime_new + (MissRate_new * MissPenalty_new))
We haven't altered the hit time (and we weren't given one), so we'll treat it as
negligible next to the miss terms and drop it. We're now left with:
SPEEDUP = (MissRate_old * MissPenalty_old) / (MissRate_new * MissPenalty_new)
        = ((1.15 * MissRate_new) * (3 * MissPenalty_new)) /
          (MissRate_new * MissPenalty_new)
Cross out the MissRate and MissPenalty variables and you've got it -- memory access
on the new machine is 3.45 times faster than on the old one... total cake!
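If you want to double-check the arithmetic, here's a quick Python sketch. The 1.15
factor and the 3x penalty reduction come straight from the answer above; the actual
miss rate and penalty numbers are placeholders of my own, since they cancel out of
the ratio anyway:

# Answer 2 sanity check: only the ratios matter, not the absolute values.
def miss_term(miss_rate, miss_penalty_ns):
    # Miss contribution to average memory access time (hit time ignored,
    # as in the write-up above).
    return miss_rate * miss_penalty_ns

old = miss_term(miss_rate=0.0115, miss_penalty_ns=60)  # placeholder rate, 60 ns DRAM
new = miss_term(miss_rate=0.0100, miss_penalty_ns=20)  # 15% fewer misses, 20 ns B-DRAM

print(old / new)  # 3.45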
Answer 3:
SPEEDUP = (IC_old * CPI_old * CC_old) / (IC_new * CPI_new * CC_new)
This one's pretty easy -- everything crosses out except for CPI_old = 7 and
CPI_new = 5 (the cycle counts for a READ-WRITE-READ sequence from the chart in the
cache guide). 7/5 = 1.4, so the new memory is running 1.4 times faster when
confronted with this particular instruction sequence.
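A tiny Python version of the same calculation, just to show that IC and clock cycle
really do drop out (the 7- and 5-cycle figures are the ones quoted above from the
cache guide's chart):

# Answer 3: CPU performance equation with only CPI changing.
ic, cc = 1, 1               # instruction count and cycle time cancel in the ratio
cpi_old, cpi_new = 7, 5     # READ-WRITE-READ cycle counts (synchburst vs. ZBT)
speedup = (ic * cpi_old * cc) / (ic * cpi_new * cc)
print(speedup)  # 1.4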
Answer 4:
SPEEDUP = (IC_old * CPI_old * CC_old) / (IC_new * CPI_new * CC_new)
This isn't so bad, either -- again, we're only really manipulating one variable.
Here, we're manipulating the clock cycle time, so make sure to invert your MHz to
get CC back out of the clock rate. Everything else crosses out.
First, we'll do motherboard vs. daughterboard (remember that daughterboard cache
runs at 1/2 the clock rate of the CPU):
SPEEDUP = (1/66) / (2 * (1/166)) = 1.2575
So daughterboard cache is 1.2575 times faster than motherboard cache.
Now we'll do motherboard vs. integrated:
SPEEDUP = (1/66) / (1/166) = 2.5151
So integrated cache is 2.5151 times faster than motherboard cache -- not tough at all!
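Here's the same thing in a few lines of Python (the "half the CPU clock" figure for
daughterboard cache is the definition from the cache guide, restated above):

# Answer 4: compare clock cycle times; cycle time = 1 / clock rate.
bus_mhz, cpu_mhz = 66.0, 166.0

cc_motherboard   = 1 / bus_mhz         # L2 on the motherboard runs at bus speed
cc_daughterboard = 2 * (1 / cpu_mhz)   # runs at half the CPU clock
cc_integrated    = 1 / cpu_mhz         # runs at the full CPU clock

print(cc_motherboard / cc_daughterboard)  # ~1.2576
print(cc_motherboard / cc_integrated)     # ~2.5152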
Answer 5:
First, let's figure out how many 256KB modules are in 32 MB. There are four
256KB modules in a MB (1024 KB / 256 KB = 4), so we need to buy 128 of these modules
(32 * 4 = 128) to completely replace our DRAM. Total cost would be $9,600 (128 *
$75). SRAM would be just over 110 times more expensive ($9,600 / $87 = ~110x).
Now, let's figure out how much faster the memory would be. This is another simple
speedup ratio, so I'll skip the formalities. It should be fairly obvious that the
new RAM is 6 times faster than the old RAM (60 ns / 10 ns = 6). Not the kind of
money that the normal human being would be laying out for such a small performance
increase! But then "normal" people never did use Crays, did they......?
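And a Python sketch of the cost/speed arithmetic, using only the prices and access
times given in the question:

# Answer 5: cost of replacing 32 MB of DRAM with 256KB SRAM modules.
sram_price_per_256kb = 75.0    # $ per 256KB SRAM module
dram_price_per_32mb  = 87.0    # $ per 32 MB DRAM module

modules_needed = (32 * 1024) // 256          # 256KB modules in 32 MB -> 128
sram_cost = modules_needed * sram_price_per_256kb

print(sram_cost)                              # 9600.0
print(sram_cost / dram_price_per_32mb)        # ~110.3 times more expensive
print(60 / 10)                                # 6.0 times faster (60 ns vs 10 ns)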
Answer 6:
Old memory access time = [75% * (5 + (.0039 * 50))] + [25% * (3 + (.0482 * 50))] = 5.25 cycles
New memory access time = [75% * (5 + (.0039 * 50))] + [25% * (4 + (.0139 * 50))] = 5.07 cycles
5.25 / 5.07 = 1.036 ... so we're accessing memory about 3.6% faster than we used to.
Not exactly blazing through the Excel spreadsheets, but hopefully this was enough
to get across the idea that being able to modify the size and associativity of split
caches independently has some merit.
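For the skeptical, here's the same average-memory-access-time calculation in Python
(all the numbers are the ones given in Question 6):

# Answer 6: weighted average memory access time for a split cache.
def amat(inst_frac, i_hit, i_miss, d_hit, d_miss, penalty):
    # inst_frac of references go to the instruction cache, the rest to the data cache.
    data_frac = 1 - inst_frac
    return (inst_frac * (i_hit + i_miss * penalty)
            + data_frac * (d_hit + d_miss * penalty))

old = amat(0.75, i_hit=5, i_miss=0.0039, d_hit=3, d_miss=0.0482, penalty=50)
new = amat(0.75, i_hit=5, i_miss=0.0039, d_hit=4, d_miss=0.0139, penalty=50)

print(old, new, old / new)  # ~5.25, 5.07, speedup ~1.035 (the write-up rounds to 1.036)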
Answer 7:
1) a = conflict (I didn't get you with
that lame "compiler" answer, did I?)
2) c = compulsory
3) b = false (You're supposed to use
direct-mapped caches whenever possible -- they're the simplest)
4) a = true
5) b = false (It's a technique for reducing
miss penalty, not miss rate!)
This page (c) 1998, Brian Renn -- last modified: 12/13/98.
Permission is granted for reuse and distribution of the unmodified page for
academic purposes.
Send non-academic permission requests and corrections to brenn@glue.umd.edu.