Publications


  • Titan: A High Performance Remote Sensing Database

    Chialin Chang, Bongki Moon, Anurag Acharya, Carter Shock, Alan Sussman and Joel Saltz

    August 1996

    University of Maryland Technical Report CS-TR-3689 and UMIACS-TR-96-67

    Submitted to 1997 International Conference on Data Engineering

    There are two major challenges for a high-performance remote-sensing database. First, it must provide low-latency retrieval of very large volumes of spatio-temporal data. Second, the order-of-magnitude reduction in data size due to post-processing makes it imperative, from a performance perspective, that the post-processing be done on the machine that holds the data. This paper describes the design, implementation and evaluation of Titan, a parallel shared-nothing database designed for remote sensing data. The computational platform for Titan is a 16-processor IBM SP-2 with four fast disks attached to each processor. Titan is currently operational and contains about 24 GB of data from the Advanced Very High Resolution Radiometer (AVHRR) on the NOAA-7 satellite. The experimental results show that Titan provides good performance for global queries and interactive response times for local queries.

  • Network-aware Mobile Programs

    M. Ranganathan, Anurag Acharya, Shamik Sharma, Joel Saltz

    June, 1996

    University of Maryland Technical Report: CS-TR-3659 and UMIACS-TR-96-46

    Submitted to Usenix 97 Conference

    In this paper, we investigate network-aware mobile programs, programs that can use mobility as a tool to adapt to variations in network characteristics. We present infrastructural support for mobility and network monitoring and show how adaptalk, a Java-based mobile Internet chat application, can take advantage of this support to dynamically place the chat server so as to minimize response time. Our conclusion was that on-line network monitoring and adaptive placement of shared data-structures can significantly improve the performance of distributed applications on the Internet.

  • Interleaved Parallel Hybrid Arnoldi Method for a Parallel Machine and a Network of Workstations

    Guy Edjlali, Serge Petiton, Nahid Emad

    June, 1996

    The data-parallel paradigm, which is effective for many numerical algebra computations, is no longer sufficient when programming heterogeneous machines. To utilize these new resources efficiently, we need to combine two paradigms: data parallelism and task parallelism. This leads us to define new algorithms.

    In this paper, we illustrate this point through an example: the computation of some eigenvalues of large non-structured sparse non-Hermitian matrices. We present a new algorithm, Interleaved Arnoldi, and compare its performance to the parallel Arnoldi algorithm. The different environments we consider are: a CM-5; a CM-5 and a Sparc 20; and a CM-5 and two Sparc 20s. We show that significant results can be obtained even when the heterogeneous machine consists of a CM-5 and two Sparc 20 workstations.

  • Interoperability of Data Parallel Runtime Libraries with Meta-Chaos

    Guy Edjlali and Alan Sussman and Joel Saltz

    May 1996

    University of Maryland Technical Report: CS-TR-3633 and UMIACS-TR-96-30

    This paper describes a framework for providing the ability to use multiple specialized data parallel libraries and/or languages within a single application. The ability to use multiple libraries is required in many application areas, such as multidisciplinary complex physical simulations and remote sensing image database applications. An application can consist of one program or multiple programs that use different libraries to parallelize operations on distributed data structures. The framework is embodied in a runtime library called Meta-Chaos that has been used to exchange data between data parallel programs written using High Performance Fortran, the Chaos and Multiblock Parti libraries developed at Maryland for handling various types of unstructured problems, and the runtime library for pc++, a data parallel version of C++ from Indiana University. Experimental results show that Meta-Chaos is able to move data between libraries efficiently, and that Meta-Chaos provides effective support for complex applications.

  • Scalability Analysis of Declustering Methods for Cartesian product files

    Bongki Moon and Joel H. Saltz

    April 1996

    University of Maryland Technical Report: CS-TR-3590 and UMIACS-TR-96-5

    To be published in: IEEE Transactions on Software Engineering

    Efficient storage and retrieval of multi-attribute datasets has become one of the essential requirements for many data-intensive applications. The Cartesian product file has been known as an effective multi-attribute file structure for partial-match and best-match queries. Several heuristic methods have been developed to decluster Cartesian product files over multiple disks to obtain high performance for disk accesses. Though the scalability of the declustering methods becomes increasingly important for systems equipped with a large number of disks, no analytic studies have been done so far. In this paper we derive formulas describing the scalability of two popular declustering methods, Disk Modulo and Fieldwise Xor, for range queries, which are the most common type of queries. These formulas disclose the limited scalability of the declustering methods, and are corroborated by extensive simulation experiments. From the practical point of view, the formulas given in this paper provide a simple measure which can be used to predict the response time of a given range query and to guide the selection of a declustering method under various conditions.
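    The two declustering functions under analysis are simple enough to state directly. A minimal sketch (hypothetical Python names, not code from the paper) of how each maps a bucket's tuple of field indices to a disk:

```python
def disk_modulo(indices, num_disks):
    """Disk Modulo (DM): sum the field indices, take the result mod the disk count."""
    return sum(indices) % num_disks

def fieldwise_xor(indices, num_disks):
    """Fieldwise Xor (FX): XOR the field indices, take the result mod the disk count."""
    acc = 0
    for i in indices:
        acc ^= i
    return acc % num_disks

# A 4x4 grid of buckets declustered over 4 disks: a row of buckets
# (a range query fixing one attribute) touches all four disks under DM.
row = [disk_modulo((0, j), 4) for j in range(4)]
print(row)  # -> [0, 1, 2, 3]
```

    FX is usually stated on the binary representations of the field values; with a power-of-two number of disks, as assumed here, the XOR-then-mod form above is equivalent.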

  • Analysis of the Clustering Properties of Hilbert Space-filling Curve

    Bongki Moon, H. V. Jagadish, Christos Faloutsos, Joel H. Saltz

    March 1996

    University of Maryland Department of Computer Science Technical Report, CS-TR-3611 and UMIACS-TR-96-20

    Submitted to IEEE Transactions on Knowledge and Data Engineering, March 1996

    Several schemes for linear mapping of multidimensional space have been proposed for many applications such as access methods for spatio-temporal databases, image compression and so on. In all these applications, one of the most desired properties of such linear mappings is clustering, which means the locality between objects in the multidimensional space is preserved in the linear space. It is widely believed that the Hilbert space-filling curve achieves the best clustering. In this paper we provide closed-form formulas for the number of clusters required by a given query region of arbitrary shape (e.g., polygons and polyhedra) for the Hilbert space-filling curve. Both the asymptotic solution for a general case and the exact solution for a special case generalize the previous work, and they agree with the empirical results that the number of clusters depends on the hyper-surface area of the query region and not on its hyper-volume. We have also shown that the Hilbert curve achieves better clustering than the z-curve. From the practical point of view, the formulas given in this paper provide a simple measure which can be used to predict the required disk access behavior and hence the total access time.

    Index Terms: locality-preserving linear mapping, range queries, multi-attribute access methods, data clustering, Hilbert curve, space-filling curves, fractals.
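    The clustering measure can be made concrete with a small sketch (assuming the standard bitwise Hilbert-index computation on a 2^k x 2^k grid; this is an illustration, not the paper's analysis): map every cell of a query rectangle to its Hilbert index and count runs of consecutive indices -- each run is one cluster, i.e., one contiguous sequence of disk blocks.

```python
def xy2d(n, x, y):
    """Hilbert index of cell (x, y) on an n x n grid (n a power of 2)."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:              # rotate the quadrant as the curve requires
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def clusters(n, x0, y0, x1, y1):
    """Number of contiguous Hilbert-index runs covering the rectangle."""
    ds = sorted(xy2d(n, x, y)
                for x in range(x0, x1 + 1)
                for y in range(y0, y1 + 1))
    return 1 + sum(1 for a, b in zip(ds, ds[1:]) if b != a + 1)

print(clusters(8, 0, 0, 7, 7))  # the whole grid is one cluster -> 1
```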

  • Eliminating Redundant Barrier Synchronizations in Rule-based Programs

    Anurag Acharya

    To be published in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, Pennsylvania, May 25-27, 1996

    A rule-based program consists of a set of if-then rules and a tuple-space. The rules are the code for the program and the tuple-space contains the data being processed by the program. Previous efforts to parallelize rule-based programs have achieved limited speedups. The main reason for these disappointing results is a high frequency of barrier synchronizations. Since little work is done between successive barrier synchronizations, the number of processors that can be effectively utilized is bounded. Even though required by language semantics, a large fraction of the barrier synchronizations are not necessary for most programs. This paper proposes a pair of simple language extensions that allow an implementation to efficiently detect and eliminate redundant barrier synchronizations. Simulation results based on a real implementation show that for a set of five benchmarks, this scheme is able to eliminate between 95.6% and 99.9% of the barrier synchronizations. This results in a multiplicative speedup of between 2.2 and 52.3 fold over and above the speedup achieved by a parallelizing compiler. For the programs studied, simulations indicate that speedups up to 115 fold relative to an optimized sequential version are possible.

  • Study of Scalable Declustering Algorithms for Parallel Grid Files

    Bongki Moon, Anurag Acharya, Joel Saltz

    April 1996

    To be published in Proceedings of the 1996 Parallel Processing Symposium

    University of Maryland Technical Report: CS-TR-3589 and UMIACS-TR-96-4 (extended version)

    Efficient storage and retrieval of large multidimensional datasets has been one of the key issues in large-scale scientific computations such as long-running time-dependent simulations which periodically generate snapshots of the state. The main challenge for efficiently handling such datasets is to minimize response time for multidimensional range queries. The grid file is one of the well known access methods for multidimensional and spatial data. We investigate effective and scalable declustering techniques for grid files with the primary goal of minimizing response time and the secondary goal of maximizing the fairness of data distribution. The main contributions of this paper are (1) the analytic and experimental evaluation of existing index-based declustering techniques and their extensions for grid files, and (2) the development of an improved proximity-based declustering algorithm called minimax, which has been experimentally shown to scale well and to consistently achieve better response time than all the other algorithms while maintaining perfect disk distribution.

  • Tuning the Performance of I/O-Intensive Parallel Applications

    Anurag Acharya, Mustafa Uysal, Robert Bennett, Assaf Mendelson, Michael Beynon, Jeff Hollingsworth, Joel Saltz, Alan Sussman

    To appear in: Fourth Annual Workshop on I/O in Parallel and Distributed Systems, May 27 1996, Philadelphia, Pennsylvania

    Getting good I/O performance from parallel programs is a critical problem for many application domains. In this paper, we report our experience tuning the I/O performance of four application programs from the areas of sensor data processing and linear algebra. After tuning, three of the four applications achieve effective I/O rates of over 100 MB/s on 16 processors. The total volume of I/O required by the programs ranged from about 75 MB to over 200 GB. We report the lessons learned in achieving high I/O performance from these applications, including the need for code restructuring, local disks on every node and overlapping I/O with computation. We also report our experience on achieving high performance on peer-to-peer configurations. Finally, we comment on the necessity of complex I/O interfaces like collective I/O and strided requests to achieve high performance.

  • An Interprocedural Framework for Placement of Asynchronous I/O Operations

    Gagan Agrawal, Anurag Acharya, Joel Saltz

    To be published in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, Pennsylvania, May 25-27, 1996, pages 358-365

    Overlapping memory accesses with computation is a standard technique for improving the performance of modern architectures, which have deep memory hierarchies. In this paper, we present a compiler technique for overlapping accesses to secondary memory (disks) with computation. We have developed an Interprocedural Balanced Code Placement (IBCP) framework, which performs analysis on arbitrary recursive procedures and arbitrary control flow and replaces synchronous I/O operations with a balanced pair of asynchronous operations. We demonstrate how this analysis is useful for applications which perform frequent and large accesses to secondary memory, including applications which snapshot or checkpoint their computations and out-of-core applications.
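    The effect of splitting a synchronous I/O operation into a balanced initiation/wait pair can be sketched as follows (hypothetical `read_chunk` and `compute_unrelated` functions, with a Python thread standing in for asynchronous I/O; the IBCP analysis that decides where the pair may legally be placed is not shown):

```python
from concurrent.futures import ThreadPoolExecutor

def compute_unrelated():
    # work that does not depend on the data being read
    return sum(i * i for i in range(100_000))

def read_chunk(path):
    with open(path, "rb") as f:
        return f.read()

# Synchronous version (no overlap):
#   data = read_chunk(path); r = compute_unrelated()
# Balanced asynchronous pair: initiate the read as early as possible,
# place the matching wait as late as possible, and overlap computation.
def overlapped(path):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_chunk, path)   # asynchronous initiation
        r = compute_unrelated()                   # overlapped computation
        data = pending.result()                   # matching wait
    return r, data
```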

  • Experimental Evaluation of Sparse Matrix Distributions

    M. Ujaldon Martinez, S. Sharma, J. Saltz, and E. Zapata

    To be published in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, Pennsylvania, May 25-27, 1996, pages 78-85

    Sparse matrix problems are difficult to parallelize efficiently on distributed memory machines, since non-zero elements are unevenly scattered and are accessed via multiple levels of indirection. Irregular distributions that achieve good load balance and locality are hard to compute, have high memory overheads, and also lead to further indirection in locating distributed data. This paper evaluates alternative semi-regular distribution strategies which trade off the quality of the load balance and locality for lower distribution overheads and efficient lookup. The proposed techniques are compared to an irregular sparse matrix partitioner and the relative merits of each distribution method are outlined.

  • Runtime Coupling of Data-Parallel Programs

    May 1996

    Mudumbai Ranganathan, Anurag Acharya, Guy Edjlali, Alan Sussman, Joel Saltz

    Published in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, Pennsylvania, pages 229-236, May 25-28, 1996

    University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3565 and UMIACS TR-95-116

    We consider the problem of efficiently coupling multiple data-parallel programs at runtime. We propose an approach that establishes a mapping between data structures in different data-parallel programs and implements a user-specified consistency model. Mappings are established at runtime and can be added and deleted while the programs being coupled are in execution. Mappings, or the identity of the processors involved, do not have to be known at compile-time or even link-time. Programs can be made to interact with different granularities of interaction without requiring any re-coding. A priori knowledge of consistency requirements allows buffering of data as well as concurrent execution of the coupled applications. Efficient data movement is achieved by pre-computing an optimized schedule. We describe our prototype implementation and evaluate its performance using a set of synthetic benchmarks. We examine the variation of performance with variation in the consistency requirement. We demonstrate that the cost of the flexibility provided by our coupling scheme is not prohibitive when compared with a monolithic program that performs the same computation.

  • Runtime Techniques for Parallelizing Sparse Matrix Applications

    Published in: Proceedings of the 1995 Workshop on Irregular Problems

    September, 1995

    Manuel Ujaldon Martinez, Shamik D. Sharma, Joel Saltz, Emilio Zapata

    Sparse matrix problems are difficult to parallelize efficiently on message-passing machines, since they access data through multiple levels of indirection. Inspector/executor strategies, which are typically used to parallelize such problems, impose significant preprocessing overheads. This paper describes the runtime support required by new compilation techniques for sparse matrices and evaluates their performance, highlighting optimizations and improvements over previous techniques.

  • Compiler and Runtime Support for Programming in Adaptive Parallel Environments

    Published as: University of Maryland Technical Report CS-TR-3510 and UMIACS-TR-95-83

    Submitted for Journal Publication

    Guy Edjlali, Gagan Agrawal, Alan Sussman, Jim Humphries, Joel Saltz

    For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at runtime. In this paper, we discuss runtime support for data parallel programming in such an adaptive environment. Executing programs in an adaptive environment requires redistributing data when the number of processors changes, and also requires determining new loop bounds and communication patterns for the new set of processors. We have developed a runtime library to provide this support. We discuss how the runtime library can be used by compilers of HPF-like languages to generate code for an adaptive environment. We present performance results for a Navier-Stokes solver on a multigrid template run on a network of workstations and an IBM SP-2. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant when compared to the time required for the actual computation. Overall, our work establishes the feasibility of compiling HPF for a network of non-dedicated workstations, which are likely to be an important resource for parallel programming in the future.

  • Object-Oriented Runtime Support for Complex Distributed Data Structures

    University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3438 and UMIACS-TR-95-35

    C. Chang, A. Sussman, J. Saltz

    March, 1995

    Object-oriented applications utilize language constructs such as pointers to synthesize dynamic complex data structures, such as linked lists, trees and graphs, with elements consisting of complex data types. Traditionally, however, applications executed on distributed memory parallel architectures in single-program multiple-data (SPMD) mode use distributed (multi-dimensional) data arrays. Good performance has been achieved by applying runtime techniques to such applications executing in a loosely synchronous manner. Existing runtime systems that rely solely on global indices are not always applicable to object-oriented applications, since no global names or indices are imposed upon dynamic complex data structures linked by pointers.

    We describe a portable object-oriented runtime library that has been designed to support applications that use dynamic distributed data structures, including both arrays and pointer-based data structures. In particular, CHAOS++ deals with complex data types and pointer-based data structures by providing two abstractions, mobile objects and globally addressable objects. CHAOS++ uses preprocessing techniques to analyze communication patterns, and provides data exchange routines to perform efficient transfers between processors. Results for applications taken from three distinct classes are also presented to demonstrate the wide applicability and good performance characteristics of the runtime library.

  • A Manual for the {CHAOS} Runtime Library

    University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3437 and UMIACS-TR-95-34

    March 1995

    J. Saltz, R. Ponnusamy, S. Sharma, B. Moon, Y.-S. Hwang, M. Uysal, R. Das

    Procedures are presented that are designed to help users efficiently program irregular problems (e.g., unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial differential equation solvers) on distributed memory machines. These procedures are also designed for use in compilers for distributed memory multiprocessors. The portable CHAOS procedures are designed to support dynamic data distributions and to automatically generate send and receive messages by capturing communication patterns at runtime.

  • Parallelization of Fine Grained, Irregular DAGs

    Published in: Proceedings of the International Conference on Supercomputing 1995

    F. Chong, S. Sharma, E. Brewer, J. Saltz

    December, 1994

    This paper examines parallelization of computations that can be characterized as a fine-grained, irregular, directed acyclic graph of tasks. Such computations typically arise in sparse matrix applications where loops may have loop-carried data dependencies. The fine grain of the computation and the irregular nature of the dependencies make such loops extremely hard to parallelize efficiently. We use runtime preprocessing to parallelize such computations. A preprocessing step analyzes the DAG and determines an optimized mapping of tasks to processors, an appropriate order of executing these tasks on each processor and a set of synchronization constraints. This information is used to distribute data across processors, and to schedule computation, communication and synchronization. We use sparse-matrix triangular solution as our example application and compare a range of runtime preprocessing techniques for its parallelization. Whereas previous studies have achieved maximum speedups of less than 4 on even simple banded matrices, we achieve a speedup of almost 35 on 128 processors of the CM-5 on an irregular matrix with 5300 rows. We find that send and receive overheads dominate the time of our application. Efficient parallelization of such loops requires even more support for low-latency communication.

  • Distributed Memory Compiler Design for Sparse Problems

    Published in: IEEE Transactions on Computers vol. 44, no.6, pg: 737, June 1995

    J. Wu, R. Das, J. Saltz, H. Berryman, S. Hiranandani

    This paper addresses the issue of compiling concurrent loop nests in the presence of complicated array references and irregularly distributed arrays. Arrays accessed within loops may contain accesses that make it impossible to precisely determine the reference pattern at compile time. This paper proposes a runtime support mechanism that is used effectively by a compiler to generate efficient code in these situations. The compiler accepts as input a Fortran 77 program enhanced with specifications for distributing data, and outputs a message passing program that runs on the nodes of a distributed memory machine. The runtime support for the compiler consists of a library of primitives designed to support irregular patterns of distributed array accesses and irregularly distributed array partitions. A variety of performance results on the Intel iPSC/860 are presented.

  • A Framework for Optimizing Parallel {I/O}

    University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3417 and UMIACS-TR-95-20

    R. Bennett, K. Bryant, A. Sussman, R. Das, J. Saltz

    January 1995

    There has been a great deal of recent interest in parallel I/O. This paper discusses issues in the design and implementation of a portable I/O library designed to optimize the performance of multiprocessor architectures that include multiple disks or disk arrays. The major emphasis of the paper is on optimizations that are made possible by the use of collective I/O, so that the I/O requests of multiple processors can be combined to improve performance. Performance measurements from benchmarking Jovian, our implementation of an I/O library that currently performs collective local optimizations, on three application templates are also presented.

  • Index Translation Schemes for Adaptive Computations on Distributed Memory Multicomputers

    Published in: Proceedings of the Ninth International Parallel Processing Symposium pg: 812-819, April 1995

    B. Moon, M. Uysal, J. Saltz

    Current research in parallel programming is focused on closing the gap between globally indexed algorithms and the separate address spaces of processors on distributed memory multicomputers. A set of index translation schemes have been implemented as a part of the CHAOS runtime support library, so that the library functions can be used for implementing a global index space across a collection of separate local index spaces. These schemes include two software-cached translation schemes aimed at adaptive irregular problems as well as a distributed translation table technique for statically irregular problems. To evaluate and demonstrate the efficiency of the software-cached translation schemes, experiments have been performed with an adaptive irregular loop kernel and a full-fledged 3D DSMC code from NASA Langley on the Intel Paragon and Cray T3D. This paper also discusses and analyzes the operational conditions under which each scheme can produce optimal performance.
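    A distributed translation table of the kind described can be sketched as follows (a single-process simulation with hypothetical names; not the CHAOS implementation): entry i of the table records the owner and local offset of global element i, and the table itself is block-partitioned, so the keeper of any entry is computable from the index alone.

```python
class TranslationTable:
    def __init__(self, global_to_owner, num_procs):
        # global_to_owner[i] is the processor owning global element i.
        # The table is block-partitioned over num_procs "processors",
        # simulated here as one page (list) per processor.
        n = len(global_to_owner)
        self.block = (n + num_procs - 1) // num_procs
        counts = [0] * num_procs
        self.pages = [[] for _ in range(num_procs)]
        for i, owner in enumerate(global_to_owner):
            self.pages[i // self.block].append((owner, counts[owner]))
            counts[owner] += 1

    def dereference(self, i):
        # one "message" to the keeper of entry i, then a local lookup
        keeper = i // self.block
        return self.pages[keeper][i % self.block]

# an irregular distribution of 8 elements over 2 processors
tt = TranslationTable([0, 1, 1, 0, 1, 0, 0, 1], num_procs=2)
print(tt.dereference(4))  # -> (1, 2): owned by proc 1 at local offset 2
```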

  • Data Parallel Programming in an Adaptive Environment

    Published in: Proceedings of the Ninth International Parallel Processing Symposium April 1995 pg: 827-832

    G. Edjlali, G. Agrawal, A. Sussman, J. Saltz

    An extended version also available as University of Maryland Technical Report: CS-TR-3350 and UMIACS-TR-94-109

    For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at runtime. In this paper, we discuss runtime support for data parallel programs in such an adaptive environment. Executing data parallel programs in an adaptive environment requires determining new loop bounds and communication patterns for the new set of processors. We have developed a runtime library to provide this support. We discuss how the runtime library can be used by compilers to generate code for an adaptive environment. We also present performance results for a multiblock Navier-Stokes solver run on a network of workstations using PVM for message passing. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant compared to the time required for the actual computations.

  • Runtime Support to Parallelize Adaptive Irregular Programs

    Published in: Proceedings of the Workshop on Environments and Tools for Parallel Scientific Computing

    May 1994

    Y.-S. Hwang, B. Moon, S. Sharma, R. Das, J. Saltz

    This paper describes how a runtime support library can be used as compiler runtime support in irregular applications. The CHAOS runtime support library carries out optimizations designed to reduce communication costs by performing software caching, communication coalescing and inspector/executor preprocessing. CHAOS also supplies special purpose routines to support specific types of irregular reduction and runtime support for partitioning data and work between processors. A number of adaptive irregular codes have been parallelized using the CHAOS library and performance results from these codes are also presented in this paper.
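    The inspector/executor preprocessing the abstract mentions can be illustrated with a minimal single-process sketch (own data layout and hypothetical names; not CHAOS itself): the inspector scans the indirection array once to build a gather schedule of off-processor elements, and the executor reuses that schedule on every iteration of the irregular loop.

```python
# Processor `me` executes y[i] += x[ia[i]], where x is block-distributed.

def inspector(ia, me, block):
    """Scan the indirection array once; return the sorted list of
    off-processor global indices to gather (the communication schedule)."""
    lo, hi = me * block, (me + 1) * block
    return sorted({g for g in ia if not (lo <= g < hi)})

def executor(x_local, ghost, ia, me, block, schedule):
    """Run the irregular loop using local data plus gathered ghost values."""
    lo = me * block
    pos = {g: k for k, g in enumerate(schedule)}
    y = [0] * len(ia)
    for i, g in enumerate(ia):
        y[i] += x_local[g - lo] if lo <= g < lo + block else ghost[pos[g]]
    return y

# two processors, 4 elements each; processor 0's view:
block, me = 4, 0
x_local = [10, 11, 12, 13]        # globals 0..3
ia = [0, 5, 2, 5]                 # references global 5 twice (off-processor)
sched = inspector(ia, me, block)  # built once, reused every iteration
ghost = [105]                     # value of global 5 fetched from proc 1
print(executor(x_local, ghost, ia, me, block, sched))  # -> [10, 105, 12, 105]
```

    The payoff in adaptive codes is that the schedule is recomputed only when the indirection array changes, which is exactly what the software caching described above exploits.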

  • Supporting Irregular Distributions in FORTRAN 90D/HPF Compilers

    University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3268 and UMIACS-TR-94-57

    May 1994

    R. Ponnusamy, Y.-S. Hwang, J. Saltz, A. Choudhary, G. Fox

    Also available in IEEE Parallel & Distributed Technology, Spring 1995

    We present methods that make it possible to efficiently support an important subclass of irregular problems using data parallel languages. The approach we describe involves the use of a portable, compiler-independent, runtime support library called CHAOS. The CHAOS runtime support library contains procedures that support static and dynamic distributed array partitioning, partition loop iterations and indirection arrays, remap arrays from one distribution to another, and carry out index translation, buffer allocation and communication schedule generation.

    The CHAOS runtime procedures are used by a prototype Fortran 90D compiler as runtime support for irregular problems. We present performance results of compiler-generated and hand-parallelized versions of two stripped-down application codes. The first code is derived from an unstructured mesh computational fluid dynamics flow solver and the second is derived from the molecular dynamics code CHARMM.

    A method is described that makes it possible to emulate irregular distributions in HPF by reordering elements of data arrays and renumbering indirection arrays. We present results that suggest that an HPF compiler could use reordering and renumbering extrinsic functions to obtain performance comparable to that achieved by a compiler for a language (such as Fortran 90D) that directly supports irregular distributions.

  • Interprocedural Partial Redundancy Elimination and Its Application to Distributed Memory Compilation

    Published in: Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation pg: 258-269

    June, 1995

    G. Agrawal, J. Saltz, R. Das

    University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3446 and UMIACS-TR-95-42

    Partial Redundancy Elimination (PRE) is a general scheme for suppressing partial redundancies which encompasses traditional optimizations like loop invariant code motion and redundant code elimination. In this paper we address the problem of performing this optimization interprocedurally. We use interprocedural partial redundancy elimination for placement of communication and communication preprocessing statements while compiling for distributed memory parallel machines.

  • Efficient Support for Irregular Applications on Distributed-Memory Machines

    Published in: Proceedings of the Fifth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming '95 pg: 68-79

    July 1995

    S. Mukherjee, S. Sharma, M. Hill, J. Larus, A. Rogers, J. Saltz

    ACM SIGPLAN Notices, vol. 30, no. 8

    Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive, and so would seem well suited to parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and adversely affects run-time performance.

    This paper explores three issues -- partitioning, mutual exclusion, and data transfer -- crucial to the efficient execution of irregular problems on distributed-memory machines. Unlike previous work, we studied the same programs running under three alternative software systems on the same hardware base (a Thinking Machines CM-5): the CHAOS irregular application library, transparent shared memory (TSM), and eXtensible shared memory (XSM). CHAOS and XSM performed equivalently for all three applications. Both systems are somewhat (13%) to significantly (991%) faster than TSM.

  • Interprocedural Compilation of Irregular Application for Distributed Memory Machines

    University of Maryland: Department of Computer Science Technical Report CS-TR-3447

    G. Agrawal, J. Saltz

    Published in: Proceedings of Supercomputing 95, San Diego, California, December 3-8, 1995.

    Data parallel languages like High Performance Fortran (HPF) are emerging as the architecture-independent mode of programming distributed memory parallel machines. In this paper, we present the interprocedural optimizations required for compiling applications that have irregular data access patterns, when coded in such data parallel languages. We have developed an Interprocedural Partial Redundancy Elimination (IPRE) algorithm for optimized placement of runtime preprocessing routines and collective communication routines inserted for managing communication in such codes. We also present three new interprocedural optimizations: placement of scatter routines, deletion of data structures, and use of coalescing and incremental routines. We then describe how program slicing can be used for further applying IPRE in more complex scenarios. We have done a preliminary implementation of the schemes presented here, using the Fortran D implementation system as the necessary infrastructure. We present experimental results from two codes compiled using our system to demonstrate the efficacy of the presented schemes.

  • Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures

    Published in: Journal of Parallel and Distributed Computing vol. 22, no. 3

    pg: 462-479, September 1994

    R. Das, M. Uysal, J. Saltz, Y.-S. Hwang

    Also available: University of Maryland Technical Report CS-TR-3163 and UMIACS-TR-93-109

    This paper describes a number of optimizations that can be used to support the efficient execution of irregular problems on distributed memory parallel machines. We describe software primitives that (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of off-processor data, (3) minimize interprocessor communication requirements and (4) support a shared name space. We present a detailed performance and scalability analysis of the communication primitives. This performance and scalability analysis is carried out using a workload generator, kernels from real applications and a large unstructured adaptive application (the molecular dynamics code CHARMM).

  • Runtime and Language Support for Compiling Adaptive Irregular Programs on Distributed Memory Machines

    Published in: Software Practice & Experience vol. 25, no. 6, pg. 597-621, June 1995

    Y.-S. Hwang, B. Moon, S. Sharma, R. Ponnusamy, R. Das, J. Saltz

    In many scientific applications, arrays containing data are indirectly indexed through indirection arrays. Such scientific applications are called irregular programs and are a distinct class of applications that require special techniques for parallelization.

    This paper presents a library called CHAOS, which helps users implement irregular programs on distributed memory machines, such as the Paragon, Delta, CM-5, and SP-1. The CHAOS library provides efficient runtime primitives for distributing data and computation over processors; it supports efficient translation mechanisms and provides users high-level mechanisms for optimizing communication. CHAOS subsumes the previous PARTI library and supports a larger class of applications. In particular, it provides efficient support for parallelization of adaptive irregular programs where indirection arrays are modified during the course of computation. To demonstrate the efficacy of CHAOS, two challenging real-life applications were parallelized using CHAOS primitives: a molecular dynamics code, CHARMM, and a particle-in-cell code, DSMC.

    Besides providing runtime support to users, CHAOS can also be used by compilers to automatically parallelize irregular applications. This paper shows how CHAOS can be used in such a framework. By embedding CHAOS primitives in the Syracuse Fortran 90D/HPF compiler, kernels taken from the CHARMM and DSMC codes have been parallelized automatically.
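The indirection-array access pattern that CHAOS targets is commonly handled with an inspector/executor scheme: scan the indirection array once to plan communication, then carry out the planned data movement. The sketch below is illustrative only; every function name is a hypothetical stand-in, not the actual CHAOS API.

```python
def inspector(indirection, owner, my_rank):
    """Scan the indirection array once and build a communication schedule:
    for each off-processor index, record which rank owns it."""
    schedule = {}  # owner rank -> list of global indices to fetch
    for g in indirection:
        r = owner(g)
        if r != my_rank:
            schedule.setdefault(r, []).append(g)
    return schedule

def executor(schedule, fetch):
    """Carry out the pre-planned communication, returning a local cache
    of off-processor values keyed by global index."""
    cache = {}
    for rank, indices in schedule.items():
        for g, v in zip(indices, fetch(rank, indices)):
            cache[g] = v
    return cache

# Toy run: 8-element global array block-distributed over 2 ranks (4 each).
data = list(range(100, 108))          # rank 0 owns 0-3, rank 1 owns 4-7
owner = lambda g: g // 4
fetch = lambda rank, idxs: [data[g] for g in idxs]  # stands in for messages

ia = [1, 5, 7, 2]                     # indirection array on rank 0
sched = inspector(ia, owner, my_rank=0)
cache = executor(sched, fetch)
print(sched)   # {1: [5, 7]} - only indices 5 and 7 are off-processor
print(cache)   # {5: 105, 7: 107}
```

The payoff in the real library is that the schedule, built once, is reused across many executor calls while the indirection array stays unchanged.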

  • Jovian: A Framework for Optimizing Parallel I/O

    Published in: Proceedings of the 1994 Scalable Parallel Libraries Conference pg: 10-20, October 1994

    R. Bennett, K. Bryant, A. Sussman, R. Das

    There has been a great deal of recent interest in parallel I/O. In this paper, we discuss the design and implementation of the Jovian library, which is intended to optimize the I/O performance of multiprocessor architectures that include multiple disks or disk arrays. We also present preliminary performance measurements from benchmarking the Jovian I/O library on the IBM SP1 distributed memory parallel machine for two application templates.

  • Value-based Distributions in Fortran D: A Preliminary Report

    Published: CRPC 1993, no. CRPC-TR-93-365-S

    December 1993

    R. von Hanxleden, K. Kennedy, J. Saltz

    Final version appears in: Journal of Programming Languages- Special Issue on Compiling and Run-Time Issues for Distributed Access Machines

    Vol. 2, no. 3

    Compiling irregular applications written in a data-parallel, High Performance Fortran-like language presents a challenging problem of growing importance. One principal difficulty with irregular problems is the general lack of access locality when distributing data and computation naively, based on simple functions of array and iteration indices. To address this problem, we extend classical, index-based data distributions to value-based distributions. This type of distribution allows the programmer to conveniently express, for example, the locality characteristics of an underlying physical problem. This gives the compiler an opportunity to improve both inter- and intra-processor locality, resulting in better performance and scalability, even when this locality is not inherent in the original data structures and access patterns.

    This paper reports on the experience gained from implementing value-based distributions in a Fortran 77D compiler prototype. We propose a natural extension of index-based distributions as already present in Fortran D and HPF. This paper addresses the compilation issues involved and makes a quantitative comparison of index and value-based distributions for a Molecular Dynamics application.
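The contrast between index-based and value-based distributions can be sketched as follows. This is illustrative Python, not Fortran D syntax; the single cut point at 0.5 is an assumed parameter.

```python
import bisect

def block_dist(n, nprocs):
    """Index-based BLOCK distribution: element i goes to processor i // blk."""
    blk = (n + nprocs - 1) // nprocs
    return [i // blk for i in range(n)]

def value_dist(coords, cuts):
    """Value-based distribution: element i goes to the partition whose
    coordinate interval contains coords[i] (cuts holds interval boundaries)."""
    return [bisect.bisect_right(cuts, c) for c in coords]

# Particle coordinates stored in arbitrary index order, but spatially
# clustered around x = 0.1-0.2 and x = 0.8-0.9:
x = [0.9, 0.1, 0.8, 0.2, 0.85, 0.15]
by_index = block_dist(6, 2)      # ignores where the particles actually are
by_value = value_dist(x, [0.5])  # co-locates spatially nearby particles
print(by_index)   # [0, 0, 0, 1, 1, 1]
print(by_value)   # [1, 0, 1, 0, 1, 0]
```

The index-based mapping scatters each spatial cluster across both processors; the value-based one keeps interacting neighbors together, which is the locality effect the paper measures.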

  • High Performance Computing for Land Cover Dynamics

    Published in Proceedings of the International Joint Conference on Pattern Recognition, IEEE Computer Society Press, pages 234-238, Jerusalem, Israel, October 9-13, 1994

    Rahul Parulekar, Larry Davis, Rama Chellappa, J. Saltz, Alan Sussman, John Townshend.

    We present the overall goals of our research program on the application of high performance computing to remote sensing applications, specifically, applications in land cover dynamics. This involves developing scalable and portable programs for a variety of image and map data processing applications, eventually integrated with new models for parallel I/O of large scale images and maps. After an overview of the multiblock PARTI run time support system, we explain extensions made to that system to support image processing applications, and then present an example involving multiresolution image processing. Results of running the parallel code on both a TMC CM5 and an Intel Paragon are discussed.

  • Parallelizing Molecular Dynamics Programs for Distributed Memory Machines: An Application of the CHAOS Runtime Support Library

    Published in: IEEE Computational Science and Engineering

    pg: 18-29, vol. 2, no. 2, July 1995

    Also published as: University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3374, UMIACS-TR-94-125

    Yuan-Shin Hwang, Raja Das, Joel Saltz, Bernard Brooks, Milan Hodoscek.

    CHARMM (Chemistry at Harvard Macromolecular Mechanics) is a program that is widely used to model and simulate macromolecular systems. CHARMM has been parallelized by using the CHAOS runtime support library on distributed memory architectures. This implementation distributes both data and computations over processors. This data-parallel strategy should make it possible to simulate very large molecules on large numbers of processors.

    In order to minimize communication among processors and to balance computational load, a variety of partitioning approaches are employed to distribute the atoms and computations over processors. In this implementation, atoms are partitioned based on geometrical positions and computational load by using unweighted or weighted recursive coordinate bisection. The experimental results reveal that taking computational load into account is essential. The performance of two iteration partitioning algorithms, atom decomposition and force decomposition, is also compared. A new irregular force decomposition algorithm is introduced and implemented.

    The CHAOS library is designed to facilitate parallelization of irregular applications. This library (1) couples partitioners to the application programs, (2) remaps data and partitions work among processors, and (3) optimizes interprocessor communications. This paper presents an application of CHAOS that can be used to support efficient execution of irregular problems on distributed memory machines.
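The weighted recursive coordinate bisection mentioned above can be sketched in a simplified, one-dimensional form. This is a hedged illustration of the idea, not the CHAOS partitioner itself: recursively cut along the coordinate so that each side carries roughly half the total weight.

```python
def rcb(points, weights, nparts):
    """Weighted recursive coordinate bisection on 1-D points: returns
    nparts groups of point indices with roughly balanced total weight."""
    idx = sorted(range(len(points)), key=lambda i: points[i])
    def split(ids, k):
        if k == 1:
            return [ids]
        half, acc = sum(weights[i] for i in ids) / 2, 0.0
        for cut, i in enumerate(ids):
            acc += weights[i]
            if acc >= half:
                break
        return split(ids[:cut + 1], k // 2) + split(ids[cut + 1:], k - k // 2)
    return split(idx, nparts)

pts = [0.1, 0.2, 0.3, 0.9]
wts = [3.0, 1.0, 1.0, 1.0]   # "heavy" atom at 0.1 costs three times as much
parts = rcb(pts, wts, 2)
print(parts)   # [[0], [1, 2, 3]] - weighted cut balances load (3.0 vs 3.0)
```

An unweighted cut (all weights equal) would have split 2/2 and left one processor with the expensive atom plus a neighbor, which is why the paper finds load weighting essential.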

  • Support for Distributed Dynamic Data Structures in C++

    University of Maryland: Department of Computer Science Technical Report CS-TR-3416 and UMIACS-TR-95-19

    January 1995

    Chialin Chang, Alan Sussman, Joel Saltz.

    Traditionally, applications executed on distributed memory architectures in single-program multiple-data (SPMD) mode use distributed (multi-dimensional) data arrays. Good performance has been achieved by applying runtime techniques to such applications executing in a loosely synchronous manner. However, many applications utilize language constructs such as pointers to synthesize dynamic complex data structures, such as linked lists, trees and graphs, with elements consisting of complex composite data types. Existing runtime systems that rely on global indices cannot be used for these applications, as no global names or indices are imposed upon the elements of these data structures.

    A portable object-oriented runtime library is presented to support applications that use dynamic distributed data structures, including both arrays and pointer-based data structures. In particular, CHAOS++ deals with complex data types and pointer-based data structures by providing mobile objects and globally addressable objects. Preprocessing techniques are used to analyze communication patterns, and data exchange primitives are provided to carry out efficient data transfer. Performance results for applications taken from three distinct classes are also included to demonstrate the wide applicability of the runtime library.

  • Runtime Support and Compilation Methods for User-Specified Data Distributions

    Published in: IEEE Transactions on Parallel and Distributed Systems

    vol 6, no. 8, pg: 815-831
    Also published as: University of Maryland: Department of Computer Science Technical Report CS-TR-3194 and UMIACS-TR-93-135

    Ravi Ponnusamy, Joel Saltz, Alok Choudhary, Yuan-Shin Hwang, Geoffrey Fox.

    This paper describes two new ideas by which an HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user-specified mapping procedure via a set of compiler directives. The directives allow use of program arrays to describe graph connectivity, spatial location of array elements and computational load. The second mechanism is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed information from inspectors (e.g. communication schedules, loop iteration partitions, information that associates off-processor data copies with on-processor buffer locations). We present performance results for these mechanisms from a Fortran 90D compiler implementation.
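The schedule-reuse idea, rebuilding inspector results only when the underlying indirection data may have changed, can be sketched as follows. The version counter is a hypothetical stand-in for the compiler's conservative analysis; none of these names come from the actual system.

```python
# Sketch of inspector-result reuse: rebuild the communication schedule only
# when the indirection array it was derived from may have been modified.
builds = 0

def build_schedule(ia):
    """Expensive inspector step; the counter records how often it runs."""
    global builds
    builds += 1
    return tuple(ia)      # stand-in for a real communication schedule

_cache = {}

def get_schedule(ia, version):
    """Reuse the cached schedule unless the array's version has changed."""
    if _cache.get("version") != version:
        _cache["version"] = version
        _cache["sched"] = build_schedule(ia)
    return _cache["sched"]

ia = [2, 0, 1]
for step in range(5):         # 5 time steps, indirection array unchanged
    get_schedule(ia, version=0)
print(builds)                 # 1 - the inspector ran only once
get_schedule(ia, version=1)   # array (possibly) modified: rebuild
print(builds)                 # 2
```

The conservative element is that any *possible* modification bumps the version, so the cached schedule is never stale, at worst it is rebuilt unnecessarily.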

  • Parallelizing Molecular Dynamics Codes using the {Parti} Software Primitives

    Published in: Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing

    pg:187-192, March 1993

    Raja Das, Joel Saltz.

    This paper is concerned with the implementation of a molecular dynamics code, CHARMM, on massively parallel distributed memory computer architectures using a data-parallel approach. The implementation is carried out by creating a set of software tools, which provide an interface between the parallelization issues and the sequential code. Large practical MD problems are solved on the Intel iPSC/860 hypercube. The overall solution efficiency is compared with that obtained when the implementation is done using data replication.

  • Adaptive Runtime Support for Direct Simulation Monte Carlo Methods on Distributed Memory Architectures

    Published in: Proceedings of the Scalable High Performance Computing Conference 1994

    pg: 176-183, May 1994

    Bongki Moon, Joel Saltz.

    In highly adaptive irregular problems such as many Particle-In-Cell (PIC) codes and Direct Simulation Monte Carlo (DSMC) codes, data access patterns may vary from time step to time step. This fluctuation may hinder efficient utilization of distributed memory parallel computers because of the resulting overhead for data redistribution and dynamic load balancing. It may also limit the benefit of runtime pre-processing, because the pre-processing requirements are sensitive to perturbations in the data access patterns. To efficiently parallelize such adaptive irregular problems on distributed memory parallel computers, several issues such as effective methods for domain partitioning, efficient index dereferencing and fast data transportation must be addressed. This paper presents efficient runtime support methods for such problems. These new runtime support primitives have recently been implemented and added to the CHAOS library. A new domain partitioning algorithm is introduced: a simple one-dimensional domain partitioning method is implemented and compared with unstructured mesh partitioners such as recursive coordinate bisection and recursive inertial bisection. A remapping decision policy has been investigated for dynamic load balancing on 3-dimensional DSMC codes. Performance results are presented.
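The simple one-dimensional domain partitioning compared in this paper can be sketched as a greedy prefix-sum split of per-cell workloads into contiguous chains. This is an illustration of the general technique, not the CHAOS implementation, and the particle counts are made up.

```python
def chain_partition(loads, nprocs):
    """Split an ordered sequence of cell workloads into nprocs contiguous
    chains of near-equal load; close a chain once its prefix sum reaches
    the next ideal share. Returns the chain boundary positions."""
    ideal = sum(loads) / nprocs
    bounds, acc, closed = [], 0.0, 0
    for i, w in enumerate(loads):
        acc += w
        if acc >= ideal * (closed + 1) and closed < nprocs - 1:
            bounds.append(i + 1)   # chain boundary after cell i
            closed += 1
    return bounds

loads = [4, 1, 1, 2, 4, 2, 1, 1]   # per-cell particle counts (illustrative)
bounds = chain_partition(loads, 2)
print(bounds)   # [4] -> cells 0-3 on one processor, 4-7 on the other (8 vs 8)
```

Because the partition is just a list of boundary positions over an already-ordered domain, recomputing and remapping it each time step is cheap, which is the appeal for adaptive codes where access patterns drift.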

  • Applying the {CHAOS/PARTI} Library to Irregular Problems in Computational Chemistry and Computational Aerodynamics

    Published in: Proceedings of the Scalable Parallel Libraries Conference 1993

    pg: 45-56

    October 1993

    Raja Das, Y. Hwang, M. Uysal, J. Saltz, A. Sussman.

    This paper describes a number of optimizations that can be used to support the efficient execution of irregular problems on distributed memory parallel machines. We describe software primitives that (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of off-processor data, (3) minimize interprocessor communication requirements and (4) support a shared name space. The performance of the primitives is characterized by examination of kernels from real applications and from a full implementation of a large unstructured adaptive application (the molecular dynamics code CHARMM).

  • {PARTI} Primitives for Unstructured and Block Structured Problems

    Published in: Computing Systems in Engineering vol. 3 num. 1-4 pg: 73-86

    A. Sussman, J. Saltz, R. Das, S. Gupta, D. Mavriplis, R. Ponnusamy, K. Crowley

    This paper describes a set of primitives (PARTI) developed to efficiently execute unstructured and block structured problems on distributed memory parallel machines. We present experimental data from a 3-D unstructured Euler solver run on the Intel Touchstone Delta to demonstrate the usefulness of our methods.

  • Slicing Analysis and Indirect Access to Distributed Arrays

    Published in Proceedings of the 6th Workshop on Languages and Compilers for Parallel Computing

    pg: 152-168

    August, 1993

    Also available: University of Maryland Technical Report: CS-TR-3076 and UMIACS-TR-93-42

    Raja Das, Joel Saltz, Reinhard v. Hanxleden.

    An increasing fraction of the applications targeted by parallel computers makes heavy use of indirection arrays for indexing data arrays. Such irregular access patterns make it difficult for a compiler to generate efficient parallel code. A limitation of existing techniques addressing this problem is that they are only applicable for a single level of indirection. However, many codes using sparse data structures access their data through multiple levels of indirection.

    This paper presents a method for transforming programs using multiple levels of indirection into programs with at most one level of indirection, thereby broadening the range of applications that a compiler can parallelize efficiently. A central concept of our algorithm is to perform program slicing on the subscript expressions of the indirect array accesses. Such slices peel off the levels of indirection, one by one, and create opportunities for aggregated data prefetching in between. A slice graph eliminates redundant preprocessing and gives an ordering in which to compute the slices. We present our work in the context of High Performance Fortran; an implementation in the Fortran D prototype compiler is in progress.
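The slicing transformation can be illustrated on a two-level access y[i] = x[ia[ib[i]]]. The sketch below is hypothetical Python standing in for compiler-generated code: the inner indirection is "sliced" into its own single-level gather, leaving the outer access with one level of indirection so each phase can be preprocessed and prefetched in aggregate.

```python
def two_level(x, ia, ib):
    """Original loop: two levels of indirection in one subscript expression."""
    return [x[ia[ib[i]]] for i in range(len(ib))]

def sliced(x, ia, ib):
    """Transformed version: each phase has at most one level of indirection."""
    # Phase 1 (the slice of the subscript expression): gather the inner
    # indirection first - a single-level access to ia.
    t = [ia[ib[i]] for i in range(len(ib))]
    # Phase 2: the outer access now uses only one level of indirection, so
    # t can be handed to an inspector for aggregated data prefetching.
    return [x[t[i]] for i in range(len(ib))]

x = [10, 20, 30, 40]
ia = [3, 2, 1, 0]
ib = [0, 0, 2, 3]
assert two_level(x, ia, ib) == sliced(x, ia, ib)
print(sliced(x, ia, ib))   # [40, 40, 20, 10]
```

Each extra level of indirection in the original program adds one more such phase, which is the peeling the slice graph orders.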

  • Compiler Analysis for Irregular Problems in {Fortran-D}

    Published in: Proceedings of 5th Workshop on Languages and Compilers for Parallel Computing

    August, 1992

    R. v. Hanxleden, K. Kennedy, C. Koelbel, R. Das, J. Saltz.

    We developed a dataflow framework which provides a basis for rigorously defining strategies to make use of runtime preprocessing methods for distributed memory multiprocessors.

    In many programs, several loops access the same off-processor memory locations. Our runtime support gives us a mechanism for tracking and reusing copies of off-processor data. A key aspect of our compiler analysis strategy is to determine when it is safe to reuse copies of off-processor data. Another crucial function of the compiler analysis is to identify situations which allow runtime preprocessing overheads to be amortized. This dataflow analysis will make it possible to effectively use the results of interprocedural analysis in our efforts to reduce interprocessor communication and the need for runtime preprocessing.

  • Multiprocessor Runtime Support for Fine-Grained, Irregular {DAGs}

    University of Maryland: Department of Computer Science Technical Report, CS-TR-3266

    March, 1994

    Frederick T. Chong, Shamik D. Sharma, Eric Brewer, Joel Saltz.

    We present multiprocessor runtime support for fine-grained, irregular directed acyclic graphs (DAGs) such as those that arise from sparse-matrix triangular solves. We conduct our experiments on the CM-5, whose lower latencies and active-message support allow us to achieve unprecedented speedups for a general multiprocessor. Whereas previous implementations have maximum speedups of less than 4 on even simple banded matrices, we are able to obtain scalable performance on extremely small and irregular problems. On a matrix with only 5300 rows, we are able to achieve scalable performance with a speedup of almost 35 on 128 processors.

    We achieve these speedups with non-matrix-specific methods which are applicable to any DAG. We compare a range of run-time preprocessed and dynamic approaches on matrices from the Harwell-Boeing benchmark set. Although precomputed data distributions and execution schedules produce the best performance, we find that it is challenging to keep their cost low enough to make them worthwhile on small, fine-grained problems.

    Additionally, we find that a policy of frequent network polling can reduce communication overhead by a factor of three over the standard CM-5 policies. We present a detailed study of runtime overheads and demonstrate that send and receive processor overhead still dominate these applications on the CM-5. We conclude that these applications would highly benefit from architectural support for low-overhead communication.

  • The Dybbuk Runtime System

    Published in: Proceedings of the IEEE Compcon 1993 pg: 361-370, February 1993

    R. Ponnusamy, R. Das, J. Saltz, D. Mavriplis

    Over the past few years, we have developed methods that make it possible for a compiler to efficiently map many sparse and unstructured scientific problems to scalable multiprocessor architectures. These methods are implemented using compiler-embeddable runtime support procedures. These procedures can be thought of as supporting a type of weakly coherent distributed shared memory that can be closely coupled to distributed memory compilers. These procedures (1) coordinate inter-processor data movement, (2) manage the storage of and access to copies of off-processor data, (3) support a shared name space, and (4) couple runtime data and workload partitioners to compilers. We are in the process of implementing prototype High Performance Fortran compilers which use this runtime support to handle unstructured computations.

  • Runtime Support and Dynamic Load Balancing Strategies for Structured Adaptive Applications

    Published in:
    Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing pg: 575-580

    Bongki Moon, Gopal Patnaik, Robert Bennett, David Fyfe, Alan Sussman, Craig Douglas, K. Kailasanath, Joel Saltz

    One class of scientific and engineering applications involves structured meshes. One example of a code in this class is a flame modelling code developed at the Naval Research Laboratory (NRL). The numerical model used in the NRL flame code is predominantly based on structured finite volume methods. The chemistry of the reactive flow is modeled by a system of ordinary differential equations which is solved independently at each grid point. Thus, though the model uses a structured mesh, the workload at each grid point can vary considerably. It is this feature that requires the use of both structured and unstructured methods in the same code. We have applied the Multiblock PARTI and CHAOS runtime support libraries to parallelize the NRL flame code with minimal changes to the sequential code. We have also developed parallel algorithms to carry out dynamic load balancing. It has been observed that the overall performance scales reasonably well up to 256 Paragon processors and that the total runtime on a 256-node Paragon is about half that of a single-processor Cray C90.

  • Runtime-Compilation Techniques for Data Partitioning and Communication Schedule Reuse

    Published in Proceedings Supercomputing '93 pg: 361-370, November 1993

    Also available: University of Maryland and UMIACS Technical Report: CS-TR-3055 and UMIACS-TR-93-32

    Ravi Ponnusamy, Joel Saltz, Alok Choudhary.

    In this paper, we describe two new ideas by which an HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user-specified mapping procedure via a set of compiler directives. The directives allow the user to use program arrays to describe graph connectivity, spatial location of array elements and computational load. The second is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed results from inspectors (e.g. communication schedules, loop iteration partitions, information that associates off-processor data copies with on-processor buffer locations). We present performance results for these mechanisms from a Fortran 90D compiler implementation.

  • On Efficient Runtime Support for Multiblock and Multigrid Applications: Regular Section Analysis

    October 1993

    University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3140, UMIACS-TR-93-92

    Gagan Agrawal, Alan Sussman, Joel Saltz.

    Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid or adaptive codes) and/or irregularly coupled (called Irregularly Coupled Regular Meshes). We have designed and implemented a runtime library for parallelizing this general class of applications on distributed memory parallel machines in an efficient and machine independent manner. One important communication primitive supported by this library is the regular section move, which copies a regular section from a distributed array to another distributed array, potentially involving changes in offset, stride and rotation of dimensions. In this paper we discuss the regular section analysis which is required for efficiently generating schedules for this kind of communication. We discuss the details of the analysis required when the distributions of arrays may be block or cyclic.
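A regular section move can be sketched as an index mapping between strided sections of two arrays. This sketch is illustrative only (plain nested lists, single address space); the actual library analyzes these index expressions to build an inter-processor communication schedule for the distributed case.

```python
def regular_section_move(src, s_start, s_stride, dst, d_start, d_stride, shape):
    """Copy a 2-D regular section of src into a regular section of dst,
    allowing different offsets and strides on each side."""
    for i in range(shape[0]):
        for j in range(shape[1]):
            dst[d_start[0] + i * d_stride[0]][d_start[1] + j * d_stride[1]] = \
                src[s_start[0] + i * s_stride[0]][s_start[1] + j * s_stride[1]]

src = [[r * 10 + c for c in range(4)] for r in range(4)]
dst = [[0] * 4 for _ in range(4)]
# Copy the 2x2 strided section src[0:4:2, 0:4:2] into the dense dst[1:3, 1:3].
regular_section_move(src, (0, 0), (2, 2), dst, (1, 1), (1, 1), (2, 2))
print(dst[1][1], dst[1][2], dst[2][1], dst[2][2])   # 0 2 20 22
```

When src and dst are block- or cyclic-distributed, each (offset, stride, extent) triple determines which processors own which elements of the section, which is exactly the analysis the paper develops for schedule generation.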

  • Compiler and Runtime Support for Structured and Block Structured Applications

    Gagan Agrawal, Alan Sussman, Joel Saltz.

    Published in Proceedings of Supercomputing '93 pg. 578-587

    November 1993

    Also available as University of Maryland Technical Report: CS-TR-3052 and UMIACS-TR-93-29

    Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid or adaptive codes) and/or irregularly coupled (called Irregularly Coupled Regular Meshes). We have designed and implemented a runtime library for parallelizing this general class of applications on distributed memory parallel machines in an efficient and machine independent manner. In this paper we present how this runtime library can be integrated with compilers for High Performance Fortran (HPF) style parallel programming languages. We discuss how we have integrated this runtime library with the Fortran 90D compiler being developed at Syracuse University and provide experimental data on a block structured Navier-Stokes solver template and a small multigrid example parallelized using this compiler and run on an Intel iPSC/860. We show that the compiler parallelized code performs within 20% of the code parallelized by inserting calls to the runtime library manually.

  • Parallel Monte Carlo Simulation of Three-Dimensional Flow over a Flat Plate

    R. P. Nance, R. G. Wilmouth, B. Moon, H. A. Hassan, J. Saltz

    July 1995

    Published in: AIAA Journal of Thermophysics and Heat Transfer

    Vol. 9, no. 3, pg: 471-477

    University of Maryland Technical Report: CS-TR-3425 and UMIACS-TR-95-25

    This paper describes a parallel implementation of the direct simulation Monte Carlo method. Runtime library support is used for scheduling and execution of communication between nodes, and domain decomposition is performed dynamically to maintain a favorable load balance. Performance tests are conducted using the code to evaluate various remapping and remapping-interval policies, and it is shown that a one-dimensional chain-partitioning method works best for the problems considered. The parallel code is then used to simulate the Mach 20 nitrogen flow over a finite-thickness flat plate. It is shown that the parallel algorithm produces results which are very similar to previous DSMC results, despite the increased resolution available. Moreover, it yields significantly faster execution times than the scalar code, as well as very good load-balance and scalability characteristics.

  • An Integrated Runtime and Compile-Time Approach for Parallelizing Structured and Block Structured Applications

    Published in: IEEE Parallel and Distributed Systems July 1995, vol. 6, number 7, pg: 747

    Also available: University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3143, UMIACS-TR-93-94

    Gagan Agrawal, Alan Sussman, Joel Saltz.

    Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present a combined runtime and compile-time approach for parallelizing these applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which can be used to port these applications to distributed memory machines. The library is currently implemented on several different systems. Since the design of the library is machine independent, it can be easily ported to other distributed memory machines and environments which support message passing. To further ease the task of application programmers, we have developed methods for integrating this runtime library with compilers for HPF-like parallel programming languages. We discuss how we have integrated this runtime library with the Fortran 90D compiler being developed at Syracuse University. We present experimental results to demonstrate the efficacy of our approach. We have experimented with a multiblock Navier-Stokes solver template and a multigrid code. Our experimental results show that our primitives have low runtime communication overheads. Further, the compiler parallelized codes perform within 20% of the code parallelized by manually inserting calls to the runtime library.

  • A Manual for the Multiblock {PARTI} Runtime Primitives, Revision 4.1

    University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3070.1, UMIACS-TR-93-36.1

    December 1993

    Alan Sussman, Gagan Agrawal, Joel Saltz.

    There exists a large class of scientific applications that are composed of irregularly coupled regular mesh (ICRM) computations. These problems are often referred to as block structured, or multiblock, problems and include the block structured Navier-Stokes solver developed at NASA Langley called TLNS3D.

    Primitives are presented that are designed to help users efficiently program such problems on distributed memory machines. These primitives are also designed for use by compilers for distributed memory multiprocessors. Communication patterns are captured at runtime, and the appropriate send and receive messages are automatically generated. The primitives are also useful for parallelizing regular computations, since block structured computations also require all the runtime support necessary for regular computations.

  • The Design and Implementation of a Parallel Unstructured {E}uler Solver

    Published in AIAA Journal vol. 32, no. 3, pg: 489-496

    March 1994

    Raja Das, D. J. Mavriplis, J. Saltz, S. Gupta, R. Ponnusamy.

    This paper is concerned primarily with the implementation of a three-dimensional unstructured-grid Euler solver on massively parallel distributed memory computer architectures. The goal is to minimize solution time by achieving high computational rates with a numerically efficient algorithm. An unstructured multigrid algorithm with an edge-based data structure has been adopted, and a number of optimizations have been devised and implemented in order to accelerate the parallel computational rates. The implementation is carried out by creating a set of software tools, which provide an interface between the parallelization issues and the sequential code, while providing a basis for future automatic run-time support. Large practical unstructured grid problems are solved on the Intel iPSC/860 hypercube and the Intel Touchstone Delta machine. The quantitative effects of the various optimizations are demonstrated, and we show that their combined effect leads to roughly a factor of three performance improvement. The overall solution efficiency is compared with that obtained on the Cray Y-MP vector supercomputer.

  • Parallel {DSMC} Solution of Three-Dimensional Flow over a Finite Flat Plane

    Published in: Proceedings of 6th {AIAA/ASME} Joint Thermophysics and Heat Transfer Conference

    June 1994

    R. Nance, R. Wilmoth, B. Moon, H. Hassan, J. Saltz.

    This paper describes a parallel implementation of the direct simulation Monte Carlo (DSMC) method. Runtime library support is used for scheduling and execution of communication between nodes, and domain decomposition is performed dynamically to maintain a good load balance. Performance tests are conducted using the code to evaluate various remapping and remapping-interval policies, and it is shown that a one-dimensional chain-partitioning method works best for the problems considered. The parallel code is then used to simulate Mach 20 nitrogen flow over a finite-thickness flat plate. It is shown that the parallel algorithm produces results which compare well with experimental data. Moreover, it yields significantly faster execution times than the scalar code, as well as very good load-balance characteristics.

  • Run-time and Compile-time Support for Adaptive Irregular Problems

    Published in Supercomputing 1994 published by IEEE Press, pg: 97-106

    November 1994

    Shamik D. Sharma, Ravi Ponnusamy, Bongki Moon, Yuan-Shin Hwang, Raja Das, Joel Saltz.

    In adaptive irregular problems the data arrays are accessed via indirection arrays, and data access patterns change during computation. Implementing such problems on distributed memory machines requires support for dynamic data partitioning, efficient preprocessing and fast data migration. This research presents efficient runtime primitives for such problems. This new set of primitives is part of the CHAOS library. It subsumes the previous PARTI library which targeted only static irregular problems. To demonstrate the efficacy of the runtime support, two real adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a particle-in-cell code (DSMC). The paper also proposes extensions to Fortran D which can allow compilers to generate more efficient code for adaptive problems. These language extensions have been implemented in the Syracuse Fortran 90D/HPF prototype compiler. The performance of the compiler parallelized codes is compared with the hand parallelized versions.



    Questions about the system or webserver: webmaster@cs.umd.edu
    Problems with publications homepage: wes@cs.umd.edu