5. Improving Benchmarks

The usefulness of a benchmark depends on several factors, and these factors may change over time. A major one is the benchmark's ability to resist "cracking", that is, tuning the measured result through the choice of compilers, preprocessors, or benchmark-specific flags. Once a benchmark has become standardized, there is tremendous pressure to improve reported performance through targeted optimizations or aggressive interpretations of the rules for running the benchmark.

5.1 Optimizations that Affect Benchmark Performance

To illustrate why benchmarks become obsolete, it helps to look at why the SPEC92 benchmarks were replaced by SPEC95.

When the programs were adopted for the SPEC92 suite, they contained several flaws, but these had little effect because code optimization was difficult at that time. However, once a program became a benchmark, compiler authors became resourceful in optimizing specifically for it. The following are some of the optimizations that are used:

Many compilers perform loop unrolling. They duplicate the code of the loop body, generating larger basic blocks that other compilation techniques can optimize more easily. This is a common optimization that generally benefits programs with loops.
Some compilers transform the conditions checked by if statements into a logically equivalent form that can be compiled into more efficient code.
Some compilers optimize load instructions across several iterations of a loop. Instead of loading one 16-bit item at a time, the compiler generates load instructions for 32-bit or 64-bit words and keeps the extra items for use in later iterations of the loop. This is a legitimate optimization if implemented properly, but it benefits some programs more than others, namely those that operate on such small data types.
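The first and third optimizations can be sketched at source level. The example below is only an illustration written by hand in Python; a real compiler applies these transformations to machine code, and the function names, the unroll factor of four, and the little-endian layout are assumptions chosen for the sketch, not details taken from any SPEC benchmark or compiler.

```python
import struct


def sum_rolled(values):
    """Reference loop: one element per iteration."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_unrolled_by_4(values):
    """Same sum with the loop body duplicated four times per iteration."""
    total = 0
    n = len(values)
    limit = n - n % 4  # largest multiple of 4 not exceeding n
    for i in range(0, limit, 4):
        # Four copies of the original body form one larger straight-line block.
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
    for i in range(limit, n):  # remainder loop for the leftover elements
        total += values[i]
    return total


def sum_u16_narrow(buf):
    """Sum unsigned 16-bit little-endian items, one 16-bit load per item."""
    total = 0
    for offset in range(0, len(buf), 2):
        (item,) = struct.unpack_from("<H", buf, offset)
        total += item
    return total


def sum_u16_wide(buf):
    """Same sum, but each load fetches 32 bits, i.e. two items at once.

    Assumes len(buf) is even (a whole number of 16-bit items).
    """
    total = 0
    limit = len(buf) - len(buf) % 4
    for offset in range(0, limit, 4):
        (word,) = struct.unpack_from("<I", buf, offset)
        total += word & 0xFFFF          # item at the lower address
        total += (word >> 16) & 0xFFFF  # item the next iteration would have loaded
    for offset in range(limit, len(buf), 2):  # odd item count: one narrow load left
        (item,) = struct.unpack_from("<H", buf, offset)
        total += item
    return total
```

Both pairs of functions return the same result for the same input; only the number of loop iterations and the width of each load differ. The wide-load version pays off only when the data really consists of small items stored contiguously, which is why the benefit varies from program to program.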

5.2 Updating Benchmarks

Since technology is always improving, benchmarks that are used to measure the effect of changes in technology also need to be improved. There are several issues that have to be addressed.

Runtime. The running time of a benchmark is important: if it is too short, small changes or fluctuations in the measurements have a significant impact on the observed percentage improvements. SPEC therefore lengthened its benchmarks to allow for future performance gains.
Application size. Applications are growing in complexity and size, so benchmarks become less representative of what is run on current systems if they are not adapted to larger programs. It is also important to mix benchmarks that require large resources with smaller programs.
Application type. Just as applications are growing in size and complexity, the range of application types is widening. These additional types should be considered so that the benchmarks cover the variety and increased complexity of the workload.
Portability. It is also important that benchmarks and the tools used in benchmarking are independent of the operating system.
Pre-emptive multitasking. Most systems can handle many tasks at the same time, such as printing or switching between applications, by splitting processing time among them.
Increasing data sizes. Audio- and video-intensive applications are increasingly common, and they require more efficient I/O capabilities and higher bandwidth.
New system resources. New operating systems make it easier for developers to integrate newer capabilities into their applications. These include better 3D graphics engines, telephony protocols, and data types such as audio and video, which often require more computing power.
Moving target. Improvements in test performance are likely to become specific to that test only. Updating benchmarks may encourage general improvements and make test-specific optimizations less effective.
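The point made under Runtime can be made concrete with a little arithmetic. The sketch below assumes a simple model in which every timing may be off by up to a fixed amount (here called jitter); the function names and all of the numbers in the usage note are hypothetical and are not SPEC measurements.

```python
def relative_error_percent(runtime_s, jitter_s):
    """Worst-case relative measurement error, in percent, for one timed run."""
    return 100.0 * jitter_s / runtime_s


def observed_improvement_range(old_s, new_s, jitter_s):
    """Range of improvement (percent) that could be reported when each
    measurement may be off by up to +/- jitter_s seconds."""
    # Least favorable case: old run measured too short, new run too long.
    lowest = 100.0 * (1.0 - (new_s + jitter_s) / (old_s - jitter_s))
    # Most favorable case: old run measured too long, new run too short.
    highest = 100.0 * (1.0 - (new_s - jitter_s) / (old_s + jitter_s))
    return lowest, highest


if __name__ == "__main__":
    # Hypothetical numbers: a true 5% improvement, 0.05 s of jitter.
    print(observed_improvement_range(1.0, 0.95, 0.05))     # short benchmark
    print(observed_improvement_range(100.0, 95.0, 0.05))   # long benchmark
```

Under these assumed numbers, a true 5% improvement on a one-second benchmark could be reported as anything from roughly -5% to roughly 14%, while the same improvement on a 100-second benchmark stays within roughly 4.9% to 5.1%. As machines get faster, a fixed workload drifts toward the first case, which is the reason for lengthening the benchmarks.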