1. What are some ways to reduce cache misses?
2. What are some ways to reduce the penalty when we miss?
3. What are some ways to reduce the time required to check for a cache hit?
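Each question targets one term of the average memory access time formula, AMAT = hit time + miss rate × miss penalty; the three tables below answer them in order.
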
method | purpose | pitfall |
--- | --- | --- |
larger block size | blocks contain more surrounding data, giving more hits per block for code with high locality | block replacement consumes more memory bandwidth, and too large a block size may increase misses by reducing the number of blocks in a given cache size |
higher associativity | reduces conflict discards | may increase hit check time |
victim caches | tiny caches that hold only discarded blocks, helping minimize the impact of conflict-induced discards | benefit primarily direct-mapped caches; with many misses a small victim cache cannot help |
pseudo-associativity | attempts to simulate set-associativity while keeping direct-mapped hit speed (e.g. by probing a second location after a primary miss) | complex and slower in implementation; hits in the secondary location take longer |
hardware prefetch | cache grabs extra blocks on a miss, under the assumption that subsequent blocks will be referenced soon, thereby avoiding a miss for each pre-fetched block | increases memory bandwidth consumed when servicing a miss, may interfere with demand (miss) fetches, lowering actual performance |
software prefetch | the compiler can instruct the CPU to preload data into a register or touch a memory block (load it into cache); comes in two varieties -- faulting and nonfaulting -- where a nonfaulting prefetch becomes a no-op instead of taking a virtual memory page fault (there is no point in prefetching through an expensive VM page fault). see the first sketch after this table | adds complexity to the compiler, and compilers cannot prefetch perfectly; also ties the binary more closely to the underlying hardware implementation |
compiler optimization | less code = fewer instruction misses; better data memory organization (e.g. loop blocking, shown in the second sketch after this table) = fewer data and conflict-induced misses, etc. | obvious compiler complexity, but portable and beneficial for more reasons than just cache interaction |
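
A minimal sketch of software prefetching, assuming a GCC/Clang-style compiler where `__builtin_prefetch` emits a nonfaulting prefetch instruction; the prefetch distance of 16 elements is an arbitrary illustrative choice, and a real compiler would tune it to the memory latency:

```c
/* Sketch of software prefetching using GCC/Clang's __builtin_prefetch,
 * a nonfaulting prefetch: if the address would fault, it is simply dropped.
 * The prefetch distance of 16 elements is an illustrative assumption. */
double sum(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);  /* start pulling a future line into cache */
        s += a[i];
    }
    return s;
}
```
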
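And a sketch of the kind of data-layout transformation a compiler (or programmer) can apply to cut data misses: blocking (tiling) a matrix transpose so each tile's working set stays cache-resident. The matrix and tile sizes here are illustrative assumptions:

```c
/* Blocking (tiling) a transpose so the working set of each tile fits in
 * cache. N and TILE are illustrative assumptions. */
#define N    1024
#define TILE 32   /* chosen so a TILE x TILE tile of doubles fits in cache */

void transpose_blocked(double dst[N][N], const double src[N][N])
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            /* process one TILE x TILE tile: the few cache lines it touches
             * in dst and src are reused before being evicted */
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}
```

Without the tiling, the column-order writes to dst touch N distinct cache lines per row and tend to evict one another before being reused; the blocked version keeps only TILE of those lines live at a time.
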
method | purpose | pitfall |
--- | --- | --- |
Priority Reads | service reads before servicing writes, since the read is what the CPU is waiting on to resume processing | typically implemented with write buffers (see previous section); the read must be checked against buffered writes (or the buffer drained) so it never returns stale data |
Sub-block placement | breaks blocks up into smaller sub-blocks, reducing the transfer overhead when only a portion of a block needs to be fetched or written back | the per-sub-block valid bits must be stored and checked, which can lengthen the hit check |
Early restart | when fetching, resume processing as soon as the required word of the missed block arrives, rather than waiting for the whole block | little benefit for small block sizes, since the remaining transfer is short compared to the memory access delay and setup time already paid for the first word |
Critical word first | requests the missed word from memory first, then fills in the rest of the block | same as early restart |
Deeper cache hierarchy | adds another layer of cache below the current level, making this cache's apparent access to memory faster (see the worked example after this table) | complicated performance analysis, adds to hardware complexity and expense, not really worth it for small main memories |
Non-blocking caches | services cache-hits while handling memory transfers in background so that cache misses do not block CPU | adds a great deal of complexity since the cache management must handle multiple outstanding memory requests while still servicing hits |
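
As a rough illustration of why a deeper hierarchy pays off (the cycle counts below are made-up, illustrative numbers, not figures from this text): with a 1-cycle L1 hit, a 5% L1 miss rate, and a 100-cycle memory access, AMAT = 1 + 0.05 × 100 = 6 cycles; inserting a 10-cycle L2 that catches 80% of L1 misses gives AMAT = 1 + 0.05 × (10 + 0.2 × 100) = 2.5 cycles.
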
method | purpose | pitfall |
--- | --- | --- |
small, simple cache design | simple (direct-mapped or low-associativity) caches make hit detection easy because there are few possible locations for a block, and small caches limit the number of blocks that must be checked; see the sketch after this table | obvious performance tradeoffs in terms of cache size and conflict-induced misses |
virtually addressed caches | index and tag the cache with virtual addresses, removing address translation (the TLB lookup) from the hit path | aliases/synonyms (two virtual addresses mapping to the same physical block), possible cache flushes on context switch unless tags carry a process identifier, and I/O uses physical addresses |
pipelined cache hardware | parallelize tag searches on associative caches, reducing the time necessary to search for a block | complicates tag memory design |
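
A small sketch, in software, of why direct-mapped hit detection is cheap: the address selects exactly one line, so a hit needs one index computation and one tag compare. The line count and block size below are illustrative assumptions, and real hardware does this with bit slicing and a comparator rather than C code:

```c
/* Direct-mapped hit check: one index, one tag compare. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES   256   /* assumed: 256 lines        */
#define BLOCK_BYTES 64    /* assumed: 64-byte blocks   */

struct line { bool valid; uint32_t tag; };
static struct line cache[NUM_LINES];

bool is_hit(uint32_t addr)
{
    uint32_t block = addr / BLOCK_BYTES;
    uint32_t index = block % NUM_LINES;   /* the only line this block can occupy */
    uint32_t tag   = block / NUM_LINES;   /* identifies which block is resident  */
    return cache[index].valid && cache[index].tag == tag;
}
```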