Robert Bennet, Kelvin Bryant, Alan Sussman, Raja Das, Joel Saltz.
There has been a great deal of recent interest in paralell I/O. This paper discussed issues in the design and implementation of a portable I/O library designed to optimize the performance of multiprocessor architechtures that include multiple disks or disk arrays. The major emphasis of the paper is on optimizations that are made possible by the use of collective I/O, so that I/O requests for multiple processors can be combined to improve performance. Performance measurements from benchmarking our implementation of an I/O library that currently performs collective local optimizations , called Jovian, on three application templates are also presented.
Rahul Parulekar, Larry Davis, Rama Chellappa, J. Saltz, Alan Sussman, John Townshend.
We present the overall goals of our research program on the aplication of high performance computiong to remote sensing applications, specifically applications in land cover dynamics. This involves developing scalable and portable programs for a variety of image and map data processing applications, eventually integrated with new models for parallel I/O of large scale images and maps. After an overview of the multiblock PARTI run time support system, we explain extensions made to that system to support image processing applications, and then present an example involving multiresolution image processing. Results of running the parallel code on both a TMC CM5 and an Intel Paragon are discussed.
Yuan-Shin Hwang, Raja Das, Joel Saltz, Bernard Brooks, Milan Hodoscek.
CHARMM (Chemistry at Harvard Macromolecular Mechanics) is a program that is widely used to model and simulate macromolecular systems. CHARMM has been parallelized by using the CHAOS runtime support library on distributed memory architechtures. This implementation distributes both data and computations over processors. This data-parallel strategy should make it possible to simulate very large molecules on large numbers of processors.
In order to minimize communication among processors and to balance computational load, a variety of partitioning approaches are employed to distribute the atoms and computations over processors. In this implementation, atoms are partitioned based on geometrical positions and computational load by using unweighted or weighted recursive coordinate bisection. The experimental results reveal that taking computational load into account is essential. The performance of two iteration partitioning algorithms, atom decompositions and force decomposition, is also compared. A new irregular force decompositional algorithm is introduced and implemented.
The CHAOS library is designed to facilitate parallelization of irregular applications. This library (1) couples partitioners to the application programs, (2) remaps data and partitions work among processors, and (3) optimizes interprocessor communications. This paper presents and application of CHAOS that can be used to support efficient execution of irregular problems on distributed memory machines.
Chialin Chang, Alan Sussman, Joel Saltz.
Traditionally, applications executed on distributed memory architectures in single-program multiple-data (SPMD) mode use distributed (multi-dimensional) data arrays. Good performance has been achieved by applying runtime techniques to such applications executing in a loosely synchronous manner. However, many applications utilize language constructs such as pointers to synthesize dynamic complex data structures, such as linked lists, trees and graphs, with elements consisting of complex composite data types. Existing runtime systems that rely on global indicies cannot be used for these applications, as no global names or indicies are imposed upon the elements of these data structures.
A portable object-oriented runtime library is presented to support applications that use dynamic distributed data structures, including both arrays and pointer-based data structures. In particular, CHAOS++ deals with complex data types and pointer-based data structures by providing mobile objects and globally addressable objects . Preprocessing techniques are used to analyze communication patterns, and data exchange primitives are provided to carry out efficient data transfer. Performance results for applications taken from three distinct classes are also included to demonstrate the wide applicability of the runtime library.
Ravi Ponnusamy, Joel Saltz, Alok Choudhary, Yuan-Shin Hwang, Geoffrey Fox.
This paper describes two new ideas by which an HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user specified mapping procedure via a set of compiler directives. The directives allow use of program arrays to describe graph connectivity, spatial location of array elements and computational load. The second mechanism is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed information from inspectors (e.g. communication schedules, loop iteration partitions, information that associates off-processor data copies with on-processor buffer locations). We present performance results for these mechanisms from a Fortran 90D compiler implementation.
R. Das, J. Saltz.
No Abstract Available...
Bongki Moon, Joel Saltz.
In highly adaptive irregular problems such as many Particle-In-Cell (PIC) codes and Direct Simulation Monte Carlo (DSMC) codes, data access patterns may vary from time step to time step. This fluctuation may hinder efficient utilization of distributed memory parallel computers because of the resulting overhead for data redistribution and dynamic load balancing. This may hinder efficient utilization of runtime pre-processing because the pre-processing requirements are sensitive to perturbations in the data access patterns. To efficiently parallelize such adaptive irregular problems on distributed memory parallel computers, several issues such as effective methods for domain partitioning, efficient index dereferencing and fast data transportation must be addressed. This paper presents efficient runtime support methods for such problems. These new runtime support primitives have recently been implemented and added to the CHAOS library. A new domain partitioning algorithm is introduced A simple one-dimensional domain partitioning method is implemented and compared with unstructured mesh partitioners such as recursive coordinate bisection and recursive inertial bisection. A remapping decision policy has been investigated for dynamic load balancing on 3-dimensional DSMC codes. Performance results are presented.
R. Das, Y. Hwang, M. Uysal, J. Saltz, A. Sussman.
This paper describes a number of optimizations that can be used to support the efficient execution of irregular problems on distributed memory parallel machines. We describe software primitives that (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of off-processor data, (3) minimize interprocessor communication requirements and (4) support a shared name space. The performance of the primitives is characterized by examination of kernels from real applications and from a full implementation of a large unstructured adaptive application (the molecular dynamics code CHARMM).
Raja Das, Mustafa Uysal, Joel Saltz, Yuan-Shin Hwang.
This paper describes a number of optimizations that can be used to support the efficient execution of irregular problems on distributed memory parallel machines. We describe software primitives that (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of off-processor data, (3) minimize interprocessor communication requirements and (4) support a shared name space. We present a detailed performance and scalability analysis of the communication primitives. This performance and scalability analysis is carried out using a workload genera tor, kernels from real applications and a large unstructured adaptive application (the molecular dynamics code CHARM M).
A. Sussman, J. Saltz, R. Das, S. Gupta, D. Mavriplis, R. Ponnusamy.
This paper describes a set of primitives (PARTI) developed to efficiently execute unstructured and block structured problems on distributed memory parallel machines. We present experimental data from a 3-D unstructured Euler solver run on the Intel Touchstone Delta to demonstrate the usefulness of out methods.
Raja Das, Joel Saltz, Reinhard v. Hanxleden.
An increasing fraction of the applications targeted by paralle l computers makes heavy use of indirection arrays for indexing data arrays. Suc h irregular access patterns make it difficult for a compiler to generat e efficient parallel code. A limitation of existing techniques addressing this problem is that they are only applicable for a single level of indirection. How ever, many codes using sparse data structures across their data through mult iple levels of indirection.
This paper presents a method for transforming progr ams using multiple levels of indirection into programs with at most one level of indirection, thereby broadening the range of applications that a compiler can p arallelize efficiently. a central concept of our algorithm is to perform p rogram slicing on the subscript expressions of the indirect array accesses. Such slices peel off the levels of indirection, one by one, and create opportunities for aggregated data prefetching on between. A slice graph elim inates redundant preprocessing and gives an ordering in which to compute the sli ces. We present our work in the context of High Performance Fortran, an impleme ntation in Fortran D prototype compiler is in progress.
R. v. Hanxleden, K. Kennedy, C. Koelbel, R. Das, J. Saltz.
We developed a dataflow framework which provides a basis for a rigorously defining strategies to make use of runtime preprocessing methods for distributed memory multiprocessors.
In many programs, several loops access the same of f-processor memory locations. Our runtime support gives as a mechanism for trac king and reusing copies of off-processor data. A key aspect of our compiler analysis strategy is to determine when it is safe to reuse copies of off-process or data. Another crucial function of the compiler analysis is to identify situatio ns which allow runtime preprocessing overheads to be amortized. This dataflow aanalysis will make it possible to effectively use the results of interprocedural analysis in our efforts to reduce interprocessor communication and the need for runtime preprocessing.
Ravi Ponnusamy, Yuan-Shin Hwang, Joel Saltz, Alok Choudhary, Geoffrey Fox.
We present methods that make it possible to efficiently support an important subclass of irregular problems using data parallel languages. The approach we describe involves the use of a portable, compiler-independent, runtime support library called CHAOS. The CHAOS runtime support library contains procedures that support static and dynamic distributed array partitioning, partition loop iterations and indirection arrays, remap arrays from one distribution to another, and carry out index translation, buffer allocation and communication schedule generation.
The CHAOS runtime procedures are used by a prototype Fortran 90D compiler as runtime support for irregular problems. We present performance results of compiler-generated and hand-parallelized versions of two stripped down applications codes. The first code is derived from an unstructured mesh computational fluid dynamics flow solver and the second is derived from the molecular dynamics code CHARMM.
A method is described that makes it possible to emulate irregular distributions in HPF by reordering elements of data arrays and renumbering indirection arrays. We present results that suggest that an HPF compiler could use reordering and renumbering extrinsic functions to obtain performance comparable to that achieved by a compiler for a language (such as Fortran 90D) that directly supports irregular distributions.
Frederick T. Chong, Shamik D. Sharma, Eric Brewer, Joel Saltz.
mine multiprocessor runtime support for fine-grained, ir regular directed acyclic graphs (DAGs) such as those that arise from sparse-matrix triangular solves. We conduct our experiments on the CM-5, whose lower latencies and active-message support allow us to achieve unprecedented speedups for a general multiprocessor. Where as previous implementations have maximum speedups of less than 4 on even simple banded matrices, we are able to obtain scalable performance onextremely small and irregular problems. On a matrix with only 5300 rows, we are able to achieve scalable performance with a speedup of almost 35 for 128 processors.
We achieve these speedups with non-matrix-specific methods which are applicable to any DAG. We compare a range of run-time preprocessed and dynamic approaches on matrices from the Harwell-Boeing benchmark set. Although precomputed data distributions and execution schedules produce the best performance, we find that it is challenging to keep their cost low enough to make them worthwhile on small, fine-grained problems.
Additionally, we find that a policy of frequent net work polling can reduce communication overhead by a factor of three over the standard CM-5 policies. We present a detailed study of runtime overheads and demonstrate that send and receive processor overhead still dominate these applications on the CM-5. We conclude that these applications would highly benefit from architectural support for low-overhead communication.
R. Ponnusamy, R. Das, J. Saltz, D. Mavriplis, Alok Choudhary.
No abstract available...
Ravi Ponnusamy, Joel Saltz, Alok Choudhary.
In this paper, we describe two new ideas by which HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user specifies mapping procedure via a set of compiler directives. The directives allow the user to use program arrays to describe graph connectivity, spatial location of array elements and computational load. The second is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed results from inspectors(e.g. communication schedules, loop iteration partitions, information that associates off-processor data copies with on-processor buffer locations). We present performance results for these mechanisms from a Fortran 90D compiler implementation.
Gagan Agrawal, Alan Sussman, Joel Saltz.
Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid or adaptive codes) and/or irregularly coupled (called Irregularly Coupled Regular Meshes). We have designed and implemented a runtime library for parallelizing this general class of applications on distributed memory parallel machines in an efficient and machine independent manner. One important communication primitive supported by this library is the regular ssection move, which copies a regular section from a distributed array to another distributed array, potentially involving changes in offset, stride and rotation of dimensions. In this paper we discuss the regular section analysis which is required for efficiently generating schedules for this kind of communication. We discuss the details of the analysis required when the distributions of arrays may be block or cyclic.
Gagan Agrawal, Alan Sussman, Joel Saltz.
Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid or adaptive codes) and/or irregularly coupled (called Irregularly Coupled Regular Meshes). We have designed and implemented a runtime library for parallelizing this general class of applications on distributed memory parallel machines in an efficient and machine independent manner. In this paper we present how this runtime library can be integrated with compilers for High Performance Fortran (HPF) style parallel programming languages. We discuss how we have integrated this runtime library with the Fortran 90D compiler being developed at Syracuse University and provide experimental data on a block structured Navier-Stokes solver template and a small multigrid example parallelized using this compiler and run on an Intel iPSC/860. We show that the compiler parallelized code performs within 20% of the code parallelized by inserting calls to the runtime library manually.
Gagan Agrawal, Alan Sussman, Joel Saltz.
Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present a combined runtime and compile-time approach for parallelizing these applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which can be used to port these applications on distributed memory machines. The library is currently implemented on several different systems. Since the design of the library is machine independent, it can be easily ported on other distributed memory machines and environments which support message passing. To further ease the task of application programmers, we have developed methods for integrating this runtime library with compilers for HPF-like parallel programming languages. We discuss how we have integrated this runtime library with the Fortran 90D compiler being developed at Syracuse University. We present experimental results to demonstrate the efficacy of our approach. We have experimented with a multiblock Navier-Stokes solver template and a multigrid code. Our experimental results show that our primitives have low runtime communication overheads. Further, the compiler parallelized codes perform within 20% of the code parallelized by manually inserting calls to the runtime library.
Alan Sussman, Gagan Agrawal, Joel Saltz.
There exists a large class of scientific applications that are composed of irregularly coupled regular mesh (ICRM) computations. These problems are often referred to as block structured or multiblock, problems and include the Block Structured Navier-Stokes solver developed as NASA Langley called TLNS3D.
Primitives are presented that are designed to help users to efficiently program such problems on distributed memory machines. These primitives are also designed for use by compilers for distributed memory multiprocessors. Communications patterns are captured at runtime, and the appropriate send and received messages are automatically generated. The primitives are also useful for parallelizing regular computations, since block structured computations also require all the runtime support necessary for regular computations.
R. Das, D. J. Mavriplis, J. Saltz, S. Gupta, R. Ponnusamy.
No abstract available...
R. Nance, R. Wilmoth, B. Moon, H. Hassan, J. Saltz.
No Abstract Available...
Shamik D. Sharma, Ravi Ponnusamy, Bongki Moon, Yuan-Shin Hwang, Raja Das, Joel Saltz.
In adaptive irregular problems the data arrays are accessed via indirection arrays, and data access patterns change during computation. Implementing such problems on distributed memory machines requires support for dynamic data partitioning, efficient preprocessing and fast data migration. This research presents efficient runtime primitives for such problems. This new set of primitives is part of the CHAOS library. It subsumes the previous PARTI library which targeted only static irregular problems. To demonstrate the efficacy of the runtime support, two real adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a particle-in-cell code (DSMC). The paper also proposes extensions to Fortran D which can allow compilers to generate more efficient code for adaptive problems. These language extensions have been implemented in the Syracuse Fortran 90D/HPF prototype compiler. The performance of the compiler parallelized codes is compared with the hand parallelized versions.