Although adding a pipeline stage improves frequency, it hurts CPI: the longer the pipeline, the more work the machine performs speculatively, and therefore the more work is thrown away on a branch misprediction. The additional pipeline stage cost the processor 5-6% in CPI performance.
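To make that trade-off concrete, the sketch below computes an effective CPI as a base CPI plus the branch-misprediction overhead, once with a shorter flush penalty and once with the penalty grown by one cycle for the extra stage. All of the inputs (base CPI, branch fraction, misprediction rate, flush penalties) are illustrative assumptions, not measured Pentium data, so the exact percentage will not match the 5-6% figure above.

```c
#include <stdio.h>

int main(void)
{
    /* All values are illustrative assumptions, not measured Pentium data. */
    double base_cpi      = 0.60;   /* assumed CPI with perfect branch prediction   */
    double branch_frac   = 0.20;   /* assumed fraction of instructions that branch */
    double mispredict    = 0.15;   /* assumed misprediction rate                   */
    double penalty_short = 4.0;    /* assumed flush penalty, shorter pipeline      */
    double penalty_long  = 5.0;    /* one extra stage -> one more flushed cycle    */

    double cpi_short = base_cpi + branch_frac * mispredict * penalty_short;
    double cpi_long  = base_cpi + branch_frac * mispredict * penalty_long;

    printf("effective CPI, shorter pipeline: %.3f\n", cpi_short);
    printf("effective CPI, longer pipeline:  %.3f\n", cpi_long);
    printf("CPI loss from the extra stage:   %.1f%%\n",
           100.0 * (cpi_long - cpi_short) / cpi_short);
    return 0;
}
```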
That performance loss was regained in three ways.
1. Improved branch prediction. The MMX Pentium uses new prediction algorithms that reduce the delays caused by branch misprediction. A Branch Target Buffer minimizes the delay before the instruction at a branch target can execute, and a Return Stack Buffer predicts call/return-type branches (see the return-stack sketch after this list).
2. Improved core/bus protocols, which gained back about 5%. The original Pentium assumed a 1:1 ratio between the core clock rate and the bus clock rate. The write buffers were combined into a single pool so that both pipes share the same hardware (see the write-buffer sketch after this list), the clock crossover mechanism was changed, and the DP protocol was completely redesigned to decouple the core and bus frequencies.
3. Larger caches and fully-associative Translation Lookaside Buffers (TLBs). In general, increasing cache size is the most cost-effective way to improve performance. The Pentium processor with MMX technology doubled the size of both caches from 8 Kbytes to 16 Kbytes and made them four-way set-associative (see the cache-lookup sketch after this list). Making the TLBs fully associative sped up address translation relative to the original TLB design and improved CPI to some extent. Together, the larger caches and fully-associative TLBs bought us about a 7-10% CPI performance improvement.
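As a rough illustration of the Return Stack Buffer mentioned in item 1, here is a minimal sketch of a small circular return stack: a call pushes its fall-through address, and a return is predicted by popping the most recent entry. The depth, overflow behavior, and interface are assumptions made for the example, not a description of the actual MMX Pentium hardware.

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Toy Return Stack Buffer (RSB): a small circular stack of predicted
 * return addresses.  Depth and overflow behavior are assumptions. */
#define RSB_DEPTH 4

static uint32_t rsb[RSB_DEPTH];
static int rsb_top;                       /* number of pushes minus pops */

/* On a call, remember the address of the instruction after the call. */
static void rsb_push(uint32_t return_addr)
{
    rsb[rsb_top % RSB_DEPTH] = return_addr;
    rsb_top++;
}

/* On a return, predict the target by popping the most recent entry. */
static uint32_t rsb_predict_return(void)
{
    if (rsb_top == 0)
        return 0;                         /* empty: no prediction */
    rsb_top--;
    return rsb[rsb_top % RSB_DEPTH];
}

int main(void)
{
    rsb_push(0x1004);                     /* call at 0x1000 returns to 0x1004 */
    rsb_push(0x2008);                     /* nested call returns to 0x2008    */

    printf("predicted return: 0x%" PRIx32 "\n", rsb_predict_return()); /* 0x2008 */
    printf("predicted return: 0x%" PRIx32 "\n", rsb_predict_return()); /* 0x1004 */
    return 0;
}
```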
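The write-buffer pooling in item 2 can be illustrated with a toy model: two pipes issue writes, buffered writes drain to the slower bus one at a time, and a write that finds no free buffer stalls its pipe. The trace, buffer counts, and drain rate below are assumptions chosen only to show why a shared pool helps when one pipe bursts while the other is idle.

```c
#include <stdio.h>

#define CYCLES 8

/* Per-cycle write requests from the U and V pipes: U bursts, V is idle. */
static const int writes_u[CYCLES] = { 1, 1, 1, 1, 1, 1, 0, 0 };
static const int writes_v[CYCLES] = { 0, 0, 0, 0, 0, 0, 0, 0 };

/* Dedicated buffers: each pipe may only use its own 'cap' entries. */
static int run_dedicated(int cap)
{
    int occ_u = 0, occ_v = 0, stalls = 0;
    for (int i = 0; i < CYCLES; i++) {
        if (i % 3 == 0) {                 /* bus drains one write every 3rd core cycle */
            if (occ_u > 0)      occ_u--;
            else if (occ_v > 0) occ_v--;
        }
        if (writes_u[i]) { if (occ_u < cap) occ_u++; else stalls++; }
        if (writes_v[i]) { if (occ_v < cap) occ_v++; else stalls++; }
    }
    return stalls;
}

/* Shared pool: both pipes draw from the same 'cap' entries. */
static int run_shared(int cap)
{
    int occ = 0, stalls = 0;
    for (int i = 0; i < CYCLES; i++) {
        if (i % 3 == 0 && occ > 0) occ--; /* same drain rate as above */
        if (writes_u[i]) { if (occ < cap) occ++; else stalls++; }
        if (writes_v[i]) { if (occ < cap) occ++; else stalls++; }
    }
    return stalls;
}

int main(void)
{
    printf("stalls with 2+2 dedicated buffers: %d\n", run_dedicated(2)); /* 3 */
    printf("stalls with 4 shared buffers:      %d\n", run_shared(4));    /* 1 */
    return 0;
}
```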
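For item 3, the sketch below shows how a 16-Kbyte, four-way set-associative cache decomposes an address into tag, set index, and line offset, and how a lookup probes all four ways of a set. The 32-byte line size and the trivial replacement policy are assumptions for the example, and the TLB is not modeled.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <inttypes.h>

/* Toy model of a 16-Kbyte, four-way set-associative cache.
 * The 32-byte line size and the replacement policy are assumptions. */
#define CACHE_BYTES 16384
#define LINE_BYTES  32
#define WAYS        4
#define SETS        (CACHE_BYTES / LINE_BYTES / WAYS)   /* 128 sets */

struct line { bool valid; uint32_t tag; };
static struct line cache[SETS][WAYS];

/* Address decomposition: offset within line, set index, tag. */
static uint32_t set_of(uint32_t addr) { return (addr / LINE_BYTES) % SETS; }
static uint32_t tag_of(uint32_t addr) { return  addr / LINE_BYTES  / SETS; }

/* Probe all four ways of the address's set; fill on a miss. */
static bool cache_access(uint32_t addr)
{
    uint32_t set = set_of(addr), tag = tag_of(addr);
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;                                 /* hit */
    for (int w = 0; w < WAYS; w++)                       /* fill first free way */
        if (!cache[set][w].valid) {
            cache[set][w] = (struct line){ true, tag };
            return false;
        }
    cache[set][0] = (struct line){ true, tag };          /* crude eviction, no LRU */
    return false;
}

int main(void)
{
    /* Four lines that map to the same set coexist in a four-way cache;
     * a direct-mapped cache of the same size would keep evicting them. */
    uint32_t stride = LINE_BYTES * SETS;                 /* same set, new tag */
    for (int pass = 0; pass < 2; pass++)
        for (uint32_t i = 0; i < 4; i++)
            printf("pass %d, addr 0x%04" PRIx32 ": %s\n", pass, i * stride,
                   cache_access(i * stride) ? "hit" : "miss");
    return 0;
}
```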
These three improvements led to a 15% better CPI despite the CPI loss due to the additional pipeline stage.