1. What is caching and why do I care?
2. Couldn't I just make my entire memory system out of the fastest component?
3. What's a "miss"? "Hit"? What's the difference between an I-cache and a D-cache?
5. How does caching affect the simple 5 stage DLX pipeline model that I've learned?
6. How long are we talking, actually?
Caching is how we make large memories work. No matter what you make your memory out of, there's always going to be something faster and more expensive that you could make it out of instead. Making a large memory out of the fastest no-holds-barred component isn't realistic, so we make large memories from relatively inexpensive components that are "fast enough." We make them "fast enough" with caches.
Caches let us use small numbers of fast (more expensive) components to approximate the speed of a memory made entirely from the faster components.
Yeah, but it's a pretty dumb thing to do. Let's pretend we want to make a computer that has 32MB of memory, and instead of using common, relatively slow 60 or 70ns DRAM we buy 32MB of really expensive 12ns SRAM.
Just by doing this we've:
1. made a portable system impossible (SRAM has very low density, so instead of a single 8-chip SIMM you'd need somewhere between 64 and 256 chips)
2. really upped the power consumption of the system (each SRAM chip draws less than a DRAM, but there are a lot more of them)
3. made a good space heater/foot warmer (more power = more heat)
4. transformed a $120 memory bank into a $12,000 memory bank (SRAM is also much more expensive)
Here's an observation: SRAM isn't the fastest component around. We could do a lot better with registers. 32MB of registers is only 8 million additional 32-bit registers. Then we're really cooking (literally!). Good luck with that.
But if we go back to the first system and put a 256k SRAM L2 cache between the CPU and the 32MB of DRAM, we can get performance that is closer to that of a machine with all-SRAM memory than to that of a machine with only DRAM. In fact, most of the time we can get very close to the performance of a machine built entirely from the faster memory type. By enlarging the cache (say, to 1MB) or by adding further levels of cache (for instance, one or two small, very fast caches on the CPU, and then an external cache in the same package or attached to a fast external bus), the difference for most (not all) code becomes minor.
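To put rough numbers on that, here is a back-of-the-envelope average access time calculation, a minimal sketch in C. The 12ns and 70ns figures come from the SRAM/DRAM example above; the 95% hit rate is an assumption for illustration, not a measurement from any real system.

    /* Illustrative average access time for a cached memory.
       hit_time/miss_penalty from the example above; 95% hit rate assumed. */
    #include <stdio.h>

    int main(void) {
        double hit_time     = 12.0;  /* ns: the SRAM cache      */
        double miss_penalty = 70.0;  /* ns: the DRAM behind it  */
        double miss_rate    = 0.05;  /* 5% of accesses miss     */

        double avg = hit_time + miss_rate * miss_penalty;
        printf("average access time: %.1f ns\n", avg);  /* prints 15.5 ns */
        return 0;
    }

15.5ns is a lot closer to 12ns than to 70ns, and every improvement in the hit rate pushes it closer still.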
Caches work because most memory access by a CPU is local. A small, tight loop may fit entirely in the CPU's instruction cache. If so, it will execute entirely from the cache at maximum speed.
A cache miss describes a condition where the CPU is attempting to access a block of memory that is not in the cache and must be loaded from a lower level of the memory hierarchy (e.g., an external cache or main memory). A cache "hit" is the opposite -- when there is a "hit," the block is in the cache and does not need to be loaded. Cache design focuses on maximizing the likelihood of hits and minimizing the penalty incurred when a miss occurs (inevitably, misses will occur). Cache optimization in software focuses on writing cache-friendly code that accesses memory in such a way that hits occur as often as possible (by accessing memory linearly, for instance).
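As a concrete (if contrived) illustration of "accessing memory linearly," here is a small C sketch. Both functions compute the same sum, but the first walks the array in the order it is laid out in memory, while the second strides across it and, on a large enough array, misses far more often. The array size and function names are arbitrary choices for the example.

    #include <stdio.h>

    #define N 1024

    static double a[N][N];   /* C stores this row by row */

    /* Cache-friendly: consecutive iterations touch consecutive addresses. */
    double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Cache-hostile: consecutive iterations jump N*sizeof(double) bytes. */
    double sum_column_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_column_major());
        return 0;
    }

Same arithmetic, very different hit rates: the first loop gets several hits out of every block it pulls in, while the second touches one word per block and then moves on.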
"I-cache" refers to "instruction cache." D-cache refers to data cache. These refer to a split cache design where two small caches exist, one exclusively cachine instruction code and the other exclusively caching data. Compiled software binaries usually consist of two or more "segments" that seperate code from data (global and static variables, constants, etc.) so that if examined directly, there is a large spatial seperation between the actual instruction code and the hard-coded data, and dynamically allocated data is, of course, seperate (spatially, in memory) from the compiler-generated binary.
Additionally, "locality" tends to favor either code or data. By splitting the cache we attempt to create two pools of cache that are both very local for their respective content (instructions and data, respectively.)
Some other terms that will be used in the technical section are:
"block" -- a block refers to the minimum "chunk" of memory that can be stored in a cache location. Caches are broken up into blocks. A "block" may be 4 words, or 8 words, or so on.
"miss rate" -- the AVERAGE frequency of data accesses that access data not in the cache
"miss penalty" -- the amount of time necessary to service a miss (e.g., read or write to memory as requested)
"fill" -- means to load data into the cache
"LRU" -- LRU stands for "least recently used." It is a scheme for managing cache contents. A simple version of LRU might have a "referenced" bit on each block of cache that was cleared every 30 cycles but which was set ever time the block was accessed. If the CPU needed to discard one of the cache blocks (e.g., the cache is full but the CPU needs to load data not in the cache) it would check and see if any cache blocks did not have the referenced but set and discard one of those. This way, anything that was recently referenced would not be discarded since the referenced bit would be set.
"FIFO" -- FIFO stands for "First In, First Out." It's another replacement scheme. FIFO works like a queue. In a FIFO scheme, the oldest block is the one that is discarded. This may be sub-optimal. Consider the case of a loop -- the loop index may be referenced only at the end of a loop but MUST be referenced every cycle. However, it is initialized at the beginning of a loop and therefore would enter the cache before any iterations of the loop had executed. In a FIFO scheme, a loop which adds two arrays would discard the loop index before discarding the previous array values (which will never be referenced again!).
"Second chance FIFO" -- A modified FIFO policy which places discards into a third level of cache. If they are referenced again they are reloaded quickly into the cache. Many virtual memory systems use a second chance FIFO system to simulate an LRU policy without the overhead. (This is relevant to victim caches, see below.)
The cache hierarchy is described in terms of its distance from the CPU. There are some terminology conflicts about L0/L1 caches. An "L0" cache is typically a very, very small cache (on the order of 128 bits wide) that is accessed directly by the functional units. Some CPUs don't have an L0 cache. A "level 1" cache is typically the "on-chip split instruction and data caches" or "unified on-chip cache." This usually ranges from 4k to 32k. These are fast on-chip caches that run at the CPU clockspeed and can be accessed in a single cycle. A "level 2" cache is typically external and much larger -- 256k to 2MB, sometimes more. Accessing it takes some multiple of the CPU clock, since it is external to the CPU and runs at the board clockspeed instead of the CPU clockspeed (most CPUs run internally at some multiple of the external clockspeed -- a 120MHz Pentium runs at 60MHz externally).
There's some variation. Some CPUs are packaged in what's called a "multichip module" (MCM) or "dual cavity package," where two chips can be seated inside the large ceramic package that interfaces with the system board. These caches typically run at or close to the internal CPU clockspeed (e.g., at the full clockspeed, or at 1/2 of it) but are faster than the external clockspeed. By putting the L2 cache in the same package, chip designers can have a wider bus to the cache memory, clock it faster, and so on, while retaining the relatively less expensive memory type and without affecting their CPU chip's die size (the physical area of the chip). Sometimes these are called "L1.5" caches. The Pentium Pro from Intel and the Alpha 21164a have dual chip packages which contain an L2 cache (256k or 512k, and 96k, respectively). On the other hand, these are typically more expensive. The Pentium II from Intel uses a plug-in board that contains the CPU and some high-speed cache RAM instead of a dual-chip module.
There can be other levels of cache, L3, L4 and so on. Usually these get progressively slower the farther they get from the CPU and correspondingly larger. One way to look at "main memory" (e.g., the 32MB in the system above) is as a very, very large cache for the actual addressable memory (which is the storage on magnetic media allocated to the virtual memory paging file).
The five-stage DLX pipe is:
FETCH     | F | fetches the instruction from memory - instruction cache
DECODE    | D | reads values from registers, decodes the instruction
EXECUTE   | X | performs the operation or calculates an address
MEMORY    | M | reads from / writes to memory - data cache
WRITEBACK | W | writes the result to the destination register
The "5 stage pipeline" that H&P talk about for the first 3 chapters of their book is a big fat lie. Why? Because the MEMORY stage that fetches the operands from memory on a load or writes to memory on a store completes in one cycle. There is no RAM in the world that the CPU can read from in a single clock cycle.
As discussed in Chapter 3, a pipelined architecture is only interesting because it gives high instruction throughput -- ideally you complete a single instruction every cycle once the CPU has been "primed" (all of the stages are full), despite the fact that any individual instruction requires 5 cycles to complete. Since the time to complete a single instruction is relatively unimportant to the flow of a program, a CPU with high throughput approximates the performance of a much faster CPU that can complete an entire individual instruction in the same length of time (e.g., a 5 stage/5 clocks-per-instruction pipelined DLX running at 100MHz has roughly the throughput of a non-pipelined, 5 clocks-per-instruction DLX running at 500MHz).
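A quick sanity check of that equivalence, as a small sketch in C. The instruction count is an arbitrary assumption, and both machines are idealized (no stalls of any kind):

    #include <stdio.h>

    int main(void) {
        double n = 1e6;                /* instructions executed (arbitrary) */

        /* Pipelined DLX at 100MHz: 4 cycles to fill the pipe, then one
           instruction completes per 10ns cycle.                           */
        double t_pipelined = (n + 4.0) / 100e6;

        /* Non-pipelined DLX at 500MHz: every instruction occupies the
           machine for all 5 of its 2ns cycles before the next can start.  */
        double t_fast_serial = (n * 5.0) / 500e6;

        printf("pipelined 100MHz:     %.6f s\n", t_pipelined);    /* ~0.010000 s */
        printf("non-pipelined 500MHz: %.6f s\n", t_fast_serial);  /*  0.010000 s */
        return 0;
    }

Once the pipe is full, the two are essentially indistinguishable; the pipelined machine just gets there with a 100MHz clock instead of a 500MHz one.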
For the DLX introduction, they're assuming a cache that can be read from or written to in one cycle. The assumed cache is a split instruction/data cache, because otherwise structural conflicts would arise between the FETCH and MEMORY stages as they accessed the same resource. They're also assuming the cache never misses, so you never need to deal with a memory stall due to a miss.
In reality, when you have a cache miss, you stall that stage until the memory is ready and you can fetch/write the value from/to memory. This can be anywhere from four to a dozen clock cycles for data resident in the level 2 or level 3 caches, to dozens of cycles to load the data from main memory into the local caches, to millions or tens of millions of cycles if the data has been paged out of main memory to disk by the virtual memory system. The number of cycles wasted by a cache miss grows as the CPU's speed relative to main memory increases, and CPU speeds have increased much more rapidly than the speed of DRAM over the last fifteen years. For a 600MHz Alpha 21164a CPU, the wait on main memory can cost a hundred CPU cycles, during which the CPU stalls.
But since essentially nothing is happening while the CPU is stalled waiting for memory, the *model* of the pipeline is valid if one keeps in mind that the "clock count" obtained for a piece of code by analyzing its pipeline behavior is really a count of the "logical" clocks the instructions spent in the pipe. It represents "best case" performance. Actual performance for a small section of code will be much, much worse than the performance for a similar section of code that amortizes the cost of priming the caches (incurring a cache miss on the first data read from memory, but loading the caches in the process and avoiding future misses) over a lengthy stretch of code, since the latter will see benefit from caching.
Basically, the only change that adding memory-related stalls to a trace of some code through the DLX pipeline would make is to insert a dozen or so stall cycles into occasional MEMORY stages, but not all of them. Since instruction access is usually linear, the only time you'd see stalling on memory during FETCH is after a long jump (essentially, out of the cached block) or on block boundaries (compulsory misses, see below).
Well, according to H&P, the number of stall cycles induced by cache misses that must be serviced from main memory is given by:
Memory stall cycles = IC * AvgRefs * MR * MP
where IC is the instruction count for the executed code, AvgRefs is the average number of data references per instruction (which is really the fraction of loads & stores in the total instruction count), MR is the miss rate as defined above, and MP is the miss penalty.
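Plugging some made-up but plausible numbers into the formula gives a feel for the scale involved. The 30% loads/stores, 2% miss rate, and 50-cycle penalty below are assumptions for the sake of the example, not figures from H&P or from any particular machine:

    #include <stdio.h>

    int main(void) {
        double IC      = 1e9;    /* instructions executed            */
        double AvgRefs = 0.3;    /* data references per instruction  */
        double MR      = 0.02;   /* miss rate                        */
        double MP      = 50.0;   /* miss penalty, in CPU cycles      */

        double stall_cycles = IC * AvgRefs * MR * MP;
        printf("memory stall cycles: %.0f\n", stall_cycles);  /* 300 million */

        /* On an otherwise-ideal 1-CPI machine those stalls sit on top of
           1 billion useful cycles: a 30% slowdown from data misses alone. */
        return 0;
    }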
This doesn't tell us what we want to know, though, if what we care about is how many cycles a single cache miss actually costs us. We can derive it, though, by describing an instruction stream that consists of one instruction, a load or store, which we assert misses. Then
stalls = 1 * 1 * 1 * MP
stalls = MP
Still not very informative, but more concrete. The problem is that MP varies depending on how far down in the hierarchy the data resides. Typically we care whether it resides in main memory. MP may be anywhere from 12 to 100 cycles in that case, perhaps less on some aggressive systems (e.g., the POWER2 has a very low clockspeed and therefore suffers proportionally less, because the gap between CPU and memory speed is quite small). Note that this is still dependent on the speed of the CPU and the gap in speed between the CPU and main memory, which is variable.
Some CPUs use a variety of tricks to make writes to main memory execute in the background, allowing the CPU to continue processing and making a write miss a lot less disastrous, in terms of performance, than a read miss. These are discussed below.
There are also some architectures, called multithreaded architectures, which don't halt when they have an outstanding memory request. (Note that "thread" in this case doesn't refer to the high-level concept of threads in programming languages or to processes at the operating system level, but rather to low-level instruction streams coming either from multiple processes or dynamically extracted from a single instruction stream by the scheduler.) Instead of pausing the CPU until the block is made available in the cache, a multithreaded architecture switches to another instruction stream until that block has been read from memory. The performance of an individual code stream on a multithreaded architecture is no better than on a normal architecture, but the overall throughput should be higher, because the "dead cycles" lost to memory access can now be used by other instruction streams. MTAs have their own problems and are not covered in CMSC411.
A better answer to the question of how much it hurts us is, "A lot." That's why we needed caches in the first place.