Published as: University of Maryland Technical Report CS-TR-3510 and UMIACS-TR-95-83
For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at runtime. In this paper, we discuss runtime support for data parallel programming in such an adaptive environment. Excecuting programs in an adaptive environment requires redistributing data when the number of processors changes, and also requires determining new loop bounds and communication patterns for the new set of processors. We have developed a runtime library to provide this support. We discuss how the runtime library can be used by compilers of HPF-like languages to generate code for an adaptive environment. We present performace results for Navier-Stokes solver on a multigrid template run on a network of workstations and an IBM SP-2. Our experiments show that ifthe number of processors is not varies frequently, the cost of data distribution is not significant when compared to the time required for the actual computation. Overall, our work establishes the feasability of compiling HPF for a network on non-dedicated workstations, which are likely to be an important resource for parallel programming in the future.
Published in: Conference on Programming Languange Design and Implementation '95
Gagan Agrawal, Joel Saltz, Raja Das
Partial Redundancy Elimination (PRE) is a general scheme for suppressing partial redundancies which encompasses traditional optimizations like loop invariant code motion and redundant code elimination. In this paper, we address the problem of performing this optimization interprocedurally. We use interprocedural partial redundancy elimination for placement of communication and communication preprocessing statements while compiling for distributed memory parallel machines.
Published in: SuperComputing '95
Department of Computer Science Technical Reports CR-TR- 3447
Gagan Agrawal, Joel Saltz
Data parallel languages like High Performance Fortran (HPF) are emerging as the architecture independent mode of programming distributed memory parallel machines. In this paper, we present the interprocedural optimizations required for compiling applications that have irregular data access patterns, when coded in such data parallel languages. We have developed an Interprocedural Partial Redundancy Elimination (PRE) algorithm for optimized placement of runtime preprocessing routine and collective communication routines inserted for managing communication in such codes. We also present three new interprocedural optimizations: placement of scatter routines, deletion of data structures and use of coalescing and incremental routines. We then describe how program slicing can be used for further applying IPRE in more complex scenarios. We have done a preliminary implementation of the schemes presented here using the Fortran D implemenation system as the nessecary infrastracture. We present experimental results from two codes compiled using our system to demonstrate the efficacy of the presented schemes.
Gagan Agrawal, Joel Saltz
Gagan Agrawal, Joel Saltz
Published in:IEEE Transactions on Computers vol. 44, no.6, 1995
J. Wu, Raja Das, Joel Saltz, H. Berryman, S. Hiranandani
This paper addresses the issue of compiling concurrent loop nests in the presence of complicated array references and irregularly distributed arrays. Arrays accessed within loops may contain accessed that make it impossible to precisely determine the reference pattern at compile time. This paper proposes a run time support mechanism that is used effectively by a compiler to generate efficient codes in these situations. The compiler accepts as input a Fortran 77 program enhanced with specifications for distributing data, and outputs a message passing program that runs on the nodes of a distributed memory machine. The runtime support for the compiler consists of a library of primitives designed to support irregular patterns of distributed array accesses and irregularly distributed array partitions. A variety of performance results on the Intel iPSC/860 are presented.
Published in:
We developed a dataflow framework which provides a basis for a rigorously defining strategies to make use of runtime preprocessing methods for distributed memory multiprocessors. In many programs, several loops access the same of f-processor memory locations. Our runtime support gives as a mechanism for trac king and reusing copies of off-processor data. A key aspect of our compiler analysis strategy is to determine when it is safe to reuse copies of off-process or data. Another crucial function of the compiler analysis is to identify situatio ns which allow runtime preprocessing overheads to be amortized. This dataflow aanalysis will make it possible to effectively use the results of interprocedural analysis in our efforts to reduce interprocessor communication and the need for runtime pre-processing.
Published in: Proceedings of the 6th Workshop on Languages and Compilers for Parallelism Workshop, August 1993, pg: 152-168.
An increasing fraction of the applications targeted by parallel computers makes heavy use of indirection arrays for indexing data arrays. Such irregular access patterns make it difficult for a compiler to generate efficient parallel code. A limitation of existing techniques addressing this problem is that they are only applicable for a single level of indirection. How ever, many codes using sparse data structures across their data through multiple levels of indirection. This paper presents a method for transforming programs using multiple levels of indirection into programs with at most one level of indirection, thereby broadening the range of applications that a compiler can parallelize efficiently. a central concept of our algorithm is to perform program slicing on the subscript expressions of the indirect array accesses. Such slices peel off the levels of indirection, one by one, and create opportunities for aggregated data prefetching on between. A slice graph eliminates redundant preprocessing and gives an ordering in which to compute the slices. We present our work in the context of High Performance Fortran, an impleme ntation in Fortran D prototype compiler is in progress.
This paper describes two new ideas by which an HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user specified mapping procedure via a set of compiler directives. The directives allow use of program arrays to describe graph connectivity, spatial location of array elements and computational load. The second mechanism is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed information from inspectors (e.g. communication schedules, loop iteration partitions, information that associates off-processor data copies with on-processor buffer locations). We present performance results for these mechanisms from a Fortran 90D compiler implementation.
Compiling irregular applications written in a data-parallel, High Performance Fortran-like language presents a challenging problem of growing importance. One principal difficulty with irregular problems is the general lack of access locality when distributing data and computation naively, based on simple fuctions of array and iteration indices. To address this problem, we extend classical, index-based data distributions to value-based distributions. This type of distribution allows the programmer toconveniently express for example the locality characteristics of an underlying physical problem. This gives the compiler an oppurtunity to improve both inter- and intra- processor locality, resulting in better performance and scalability, even when this locality is not inherent in the original data structures and access patterns. This paper reports on the experience gained from implementing value-based distributions in a Fortran 77D compiler prototype. We propose a natural extension of index-based distributions as already present in Fortran D and HPF. This paper addresses the compilation issues involved and makes a quantitative comparison of index and value-based distributions for a Molecular Dynamics application.
Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present a combined runtime and compile-time approach for parallelizing these applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which can be used to port these applications on distributed memory machines. The library is currently implemented on several different systems. Since the design of the library is machine independent, it can be easily ported on other distributed memory machines and environments which support message passing. To further ease the task of application programmers, we have developed methods for integrating this runtime library with compilers for HPF-like parallel programming languages. We discuss how we have integrated this runtime library with the Fortran 90D compiler being developed at Syracuse University. We present experimental results to demonstrate the efficacy of our approach. We have experimented with a multiblock Navier-Stokes solver template and a multigrid code. Our experimental results show that our primitives have low runtime communication overheads. Further, the compiler parallelized codes perform within 20% of the code parallelized by manually inserting calls to the runtime library.
Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid or adaptive codes) and/or irregularly coupled (called Irregularly Coupled Regular Meshes). We have designed and implemented a runtime library for parallelizing this general class of applications on distributed memory parallel machines in an efficient and machine independent manner. In this paper we present how this runtime library can be integrated with compilers for High Performance Fortran (HPF) style parallel programming languages. We discuss how we have integrated this runtime library with the Fortran 90D compiler being developed at Syracuse University and provide experimental data on a block structured Navier-Stokes solver template and a small multigrid example parallelized using this compiler and run on an Intel iPSC/860. We show that the compiler parallelized code performs within 20% of the code parallelized by inserting calls to the runtime library manually.
In this paper, we describe two new ideas by which HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user specifies mapping procedure via a set of compiler directives. The directives allow the user to use program arrays to describe graph connectivity, spatial location of array elements and computational load. The second is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed results from inspectors(e.g. communication schedules, loop iteration partitions, information that associates off-processor data copies with on-processor buffer locations). We present performance results for these mechanisms from a Fortran 90D compiler implementation.
Questions about the system or webserver:
webmaster@cs.umd.edu
Problems with publications homepage:
wes@cs.umd.edu