Frederic T. Chong, Shamik D. Sharma, Eric Brewer, Joel Saltz
Parallel Processing Letters, Volume 5, Number 4, pages 671-683, 1995.
University of Maryland Technical Report: CS-TR-3266 March 1994.
We examine multiprocessor runtime support for fine-grained, irregular directed acyclic
graphs (DAGs) such as those that arise from sparse-matrix triangular solves. We conduct
our experiments on the CM-5, whose lower latencies and active-message support allow us to
achieve unprecedented speedups for a general multiprocessor. Where as previous
implementations have maximum speedups of less than 4 on even simple banded matrices, we
are able to obtain scalable performance on extremely small and irregular problems. On a
matrix with only 5300 rows, we are able to achieve scalable performance with a speedup of
34 for 128 processors, resulting in an absolute performance of 54 million double-precision
floating point operations per second.
We achieve these speedups with non-matrix-specific methods which are applicable to any
DAG. We compare a range of run-time preprocessed and dynamic approaches on matrices from
the Harwell-Boeing benchmark set. Although precomputed data distributions and execution
schedules produce the best performance, we find that it is challenging to keep their cost
low enough to make them worthwhile on small, fine-grained problems.
Additionally, we find that a policy of frequent network polling can reduce
communication overhead by a factor of three over the standard CM-5 policies. We present a
detailed study of runtime overheads and demonstrate that send and receive processor
overhead still dominate these applications on the CM-5. We conclude that these
applications would highly benefit from architectural support for low-overhead
communication.
[Applications
| High Performance I/O | Compilers | Tools]