Chialin Chang, Anurag Acharya, Alan Sussman,
and Joel Saltz
ACM SIGMOD Record, pages 58-66, Volume
27, Number 1, March 1998
University of Maryland Technical Report
CS-TR-3867 and UMIACS-TR-98-04
As computational power and storage capacity
increase, processing and analyzing large volumes
of multi-dimensional datasets play an
increasingly important part in many domains of
scientific research. Several database research
groups and vendors have developed
object-relational database systems to provide
some support for managing and/or visualizing
multi-dimensional datasets. These systems,
however, provide little or no support for
analyzing or processing these datasets -- the
assumption is that this is too
application-specific to warrant common support.
As a result, applications that process these
datasets are usually decoupled from data storage
and management, resulting in inefficiency due to
copying and loss of locality. Furthermore, every
application developer has to implement complex
support for managing and scheduling the
processing.
Our study of a large set of scientific
applications over the past three years indicates
that the processing for such datasets is often
highly stylized and shares several important
characteristics. Usually, both the input dataset and the result being computed have underlying multi-dimensional grids. The basic
processing step usually consists of transforming
individual input items, mapping the transformed
items to the output grid and computing output
items by aggregating, in some way, all the
transformed input items mapped to the
corresponding grid point. In this paper, we
present the design of T2, a customizable parallel
database that integrates storage, retrieval and
processing of multi-dimensional datasets. T2
provides support for common operations including
index generation, data retrieval, memory
management, scheduling of processing across a
parallel machine and user interaction. It
achieves its primary advantage from the ability
to seamlessly integrate data retrieval and
processing for a wide variety of applications and
from the ability to maintain and jointly process
multiple datasets with different underlying
grids.
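
The transform-map-aggregate pattern described in this abstract can be sketched in a few lines. The Python fragment below is our minimal illustration of that processing structure, not T2's interface; every name in it is hypothetical.

```python
# Minimal sketch of the transform-map-aggregate pattern described above.
# All names here are hypothetical illustrations, not T2's API.

def process(dataset, out_shape, transform, to_cell, combine, initial):
    """Aggregate transformed input items onto an output grid."""
    # Initialize every output grid point with the identity element.
    out = {(i, j): initial
           for i in range(out_shape[0]) for j in range(out_shape[1])}
    for item in dataset:
        t = transform(item)           # transform an individual input item
        cell = to_cell(t, out_shape)  # map it to a point on the output grid
        out[cell] = combine(out[cell], t)  # aggregate into that grid point
    return out

# Example: average the values of items falling on each cell of a 2x2 grid.
items = [(0.1, 0.2, 5.0), (0.9, 0.8, 7.0), (0.2, 0.9, 3.0)]
sums = process(items, (2, 2),
               transform=lambda it: it,
               to_cell=lambda it, s: (int(it[0] * s[0]), int(it[1] * s[1])),
               combine=lambda acc, it: (acc[0] + it[2], acc[1] + 1),
               initial=(0.0, 0))
averages = {c: (s / n if n else None) for c, (s, n) in sums.items()}
print(averages)
```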
Renato Ferreira, Bongki Moon, Jim Humphries,
Alan Sussman, Joel Saltz, Robert Miller, and
Angelo Demarzo
Proceedings of the 1997 AMIA Annual Fall
Symposium, Nashville, Tennessee,
October 1997
University of Maryland Technical Report:
CS-TR-3777 and UMIACS-TR-97-35
We present the design of the Virtual
Microscope, a software system employing a
client/server architecture to provide a realistic
emulation of a high power light microscope.
We discuss several technical challenges related
to providing the performance necessary to achieve
rapid response time, mainly in dealing with the
enormous amounts of data (tens to hundreds of
gigabytes per slide) that must be retrieved from
secondary storage and processed. To
effectively implement the data server, the system
design relies on the computational power and high
I/O throughput available from an appropriately
configured parallel computer.
Yuan-Shin Hwang and Joel Saltz
Proceedings of the 10th International
Workshop on Languages and Compilers for Parallel
Computing, August 1997
Pointer analysis is essential for optimizing
and parallelizing compilers. It examines
pointer assignment statements and estimates
pointer-induced aliases among pointer variables
or possible shapes of dynamic recursive data
structures. However, previously proposed techniques perform pointer analysis without knowledge of the traversal patterns of the dynamic recursive data structures being constructed. This paper presents an
algorithm to identify the traversal patterns of
recursive data structures and propagate this
information back to those statements that define
the data structures. This approach
recognizes the DEF/USE relationships between the
statements that define and traverse dynamic
recursive data structures. The outcome of
this technique will be useful for pointer
analysis and parallelization. Algorithms to perform pointer analysis and dependence testing using the knowledge of traversal patterns will also be presented.
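
As a toy illustration of the DEF/USE relationship the paper analyzes (this example is ours, not taken from the paper): the first loop below is a DEF site that constructs a recursive structure through pointer assignments, and the second is a USE site whose traversal pattern, following only the next field, reveals that the DEF site builds a linear list.

```python
# Toy illustration (ours, not from the paper) of DEF and USE sites for a
# dynamic recursive data structure.

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None  # pointer field linking the recursive structure

# DEF site: pointer assignments here construct the structure.
head = None
for v in [3, 1, 4, 1, 5]:
    node = Node(v)
    node.next = head   # the only pointer field ever written is 'next'
    head = node

# USE site: the traversal follows only the 'next' field, which tells an
# analysis that the DEF site above builds a linear list, not a graph.
total = 0
p = head
while p is not None:
    total += p.value
    p = p.next
print(total)  # 14
```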
M. Uysal, A. Acharya and J. Saltz
University of Maryland Technical
Report: CS-TR-3802 and
UMIACS-TR-97-49
May 1997
I/O intensive parallel programs have emerged
as one of the leading consumers of cycles on
parallel machines. This change has been
driven by two trends. First, parallel
scientific applications are being used to process
larger datasets that do not fit in memory.
Second, a large number of parallel machines are
being used for non-scientific applications.
Efficient execution of these applications
requires high-performance I/O systems which have
been designed to meet their I/O
requirements. In this paper, we examine the
I/O requirements for data-intensive parallel
applications and the implications of these
requirements for the design of I/O systems for
parallel machines. We attempt to answer the
following questions. First, what is the
steady-state as well as peak I/O rate
required? Second, what spatial patterns, if
any, occur in the sequence of I/O requests for
individual applications? Third, what is the
degree of intra-processor and inter-processor
locality in I/O accesses? Fourth, does the
application structure allow programmers to
disclose future I/O requests to the I/O
system? Fifth, what patterns, if any, exist
in the sequence of inter-arrival times of I/O
requests? To address these questions, we
have analyzed I/O request traces for a diverse
set of I/O-intensive parallel applications.
This set includes seven scientific applications
and four non-scientific applications.
M. Uysal, A. Acharya, R. Bennett, and J.
Saltz
International Parallel Processing Symposium
'97, April 1997
University of Maryland Technical Report
CS-TR-3690 and UMIACS-TR-96-68 (extended
version)
We present a customizable simulator called netsim
for high-performance point-to-point workstation
networks that is accurate enough to be used for
application-level performance analysis yet is
easy enough to customize for multiple
architectures and software configurations.
Customization is accomplished without using any
proprietary information, using only publicly
available hardware specifications and information
that can be readily determined using a suite of
test programs. We customized netsim for two platforms: a 16-node IBM SP-2 and a DEC Alpha Farm. Netsim achieves less than 10% error for most test cases on both platforms. It achieves this accuracy at
the cost of a 7-36 fold simulation slowdown with
respect to the SP-2 and a 3-8 fold slowdown with
respect to the Alpha Farm.
Chialin Chang, Bongki Moon, Anurag Acharya,
Carter Shock, Alan Sussman, and Joel Saltz
Proceedings of the 13th International
Conference on Data Engineering, Birmingham,
United Kingdom, April 1997
University of Maryland Technical Report
CS-TR-3689 and UMIACS-TR-96-67
- There are two major challenges for a high-performance remote-sensing database. First, it must provide low-latency retrieval of very large volumes of spatio-temporal data. Second, the order-of-magnitude reduction in data size due to post-processing makes it imperative, from a performance perspective, that the post-processing be done on the machine that holds the data. This paper describes the design, implementation and evaluation of Titan, a parallel shared-nothing database designed for remote-sensing data. The computational platform for Titan is a 16-processor IBM SP-2 with four fast disks attached to each processor. Titan is currently operational and contains about 24 GB of data from the Advanced Very High Resolution Radiometer (AVHRR) on the NOAA-7 satellite. The experimental results show that Titan provides good performance for global queries and interactive response times for local queries.
Bongki Moon and Joel Saltz
To appear in: IEEE Transactions on
Knowledge and Data Engineering
1997
- Efficient storage and retrieval of
multi-attribute datasets have become one
of the essential requirements for many
data-intensive applications. The
Cartesian product file has been known as
an effective multi-attribute file
structure for partial-match and
best-match queries. Several
heuristic methods have been developed to
decluster Cartesian product files across
multiple disks to obtain high performance
for disk accesses. Though the
scalability of the declustering methods
becomes increasingly important for
systems equipped with a large number of
disks, no analytic studies have been done
so far. In this paper we derive
formulas describing the scalability of two popular declustering methods, Disk Modulo and Fieldwise Xor, for range queries, which are the most common type
of queries. These formulas disclose
the limited scalability of the
declustering methods and are corroborated
by extensive simulation
experiments. From the practical
point of view, the formulas given in this
paper provide a simple measure which can
be used to predict the response time of a
given range query and to guide the
selection of a declustering method under
various conditions.
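
For concreteness, the two declustering functions being analyzed have standard one-line definitions; the sketch below is our illustration of those definitions from the literature, not code from the paper.

```python
# Illustrative definitions of the two declustering methods analyzed above.
from functools import reduce

def disk_modulo(indices, num_disks):
    """Disk Modulo (DM): sum of bucket indices modulo the number of disks."""
    return sum(indices) % num_disks

def fieldwise_xor(indices, num_disks):
    """Fieldwise Xor (FX): bitwise XOR of bucket indices, modulo the number
    of disks (FX is usually used with a power-of-two disk count)."""
    return reduce(lambda a, b: a ^ b, indices) % num_disks

# A 4x4 Cartesian product file declustered across 4 disks:
for method in (disk_modulo, fieldwise_xor):
    print(method.__name__)
    for i in range(4):
        print([method((i, j), 4) for j in range(4)])
```

A range query touches a rectangle of buckets; its response time is governed by the largest number of those buckets placed on any single disk, which is the quantity the scalability formulas bound.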
Guy Edjlali, Alan Sussman, and Joel
Saltz
International Parallel Processing Symposium
1997, April 1997
- This paper describes a framework for providing the ability to use multiple specialized data parallel libraries and/or languages within a single application. The ability to use multiple libraries is required in many application areas, such as multidisciplinary complex physical simulations and remote sensing image database applications. An application can consist of one program or multiple programs that use different libraries to parallelize operations on distributed data structures. The framework is embodied in a runtime library called Meta-Chaos that has been used to exchange data between data parallel programs written using High Performance Fortran, the Chaos and Multiblock Parti libraries developed at Maryland for handling various types of unstructured problems, and the runtime library for pc++, a data parallel version of C++ from Indiana University. Experimental results show that Meta-Chaos is able to move data between libraries efficiently, and that Meta-Chaos provides effective support for complex applications.
Guy Edjlali, Gagan Agrawal, Alan Sussman, Jim
Humphries, and Joel Saltz
Scientific Programming, January
1997
University of Maryland Technical Report
CS-TR-3510 and UMIACS-TR-95-83
- For better utilization of computing
resources, it is important to consider
parallel programming environments in
which the number of available processors
varies at runtime. In this paper, we
discuss runtime support for data parallel
programming in such an adaptive
environment. Executing programs in an
adaptive environment requires
redistributing data when the number of
processors changes, and also requires
determining new loop bounds and
communication patterns for the new set of
processors. We have developed a runtime
library to provide this support. We
discuss how the runtime library can be
used by compilers of HPF-like languages
to generate code for an adaptive
environment. We present performance results for a Navier-Stokes solver on a multigrid template run on a network of workstations and an IBM SP-2. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant when compared to the time required for the actual computation. Overall, our work establishes the feasibility of compiling HPF for a network of non-dedicated workstations, which are likely to be an important resource for parallel programming in the future.
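
A minimal sketch of the bookkeeping such a runtime library must perform (our illustration; none of these names come from the Maryland library): when the processor count changes, new block bounds are computed, and each new owner gathers the pieces of its block from the old owners.

```python
# Minimal sketch (not the actual library API) of recomputing block
# distribution bounds when the number of available processors changes.

def block_bounds(n, nprocs):
    """Return [lo, hi) index bounds of a block distribution of n elements."""
    base, extra = divmod(n, nprocs)
    bounds, lo = [], 0
    for p in range(nprocs):
        hi = lo + base + (1 if p < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds

n = 1000
old = block_bounds(n, 8)   # distribution before the change
new = block_bounds(n, 5)   # three processors were reclaimed
# Each new owner must fetch the pieces of its block from the old owners:
for p, (lo, hi) in enumerate(new):
    pieces = [(q, max(lo, qlo), min(hi, qhi))
              for q, (qlo, qhi) in enumerate(old)
              if max(lo, qlo) < min(hi, qhi)]
    print(f"proc {p} gathers {pieces}")
```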
M. Ranganathan, Anurag Acharya, Shamik Sharma,
and Joel Saltz
Proceedings of the Usenix 1997 Annual Technical Conference, Anaheim, California, January 1997
University of Maryland Technical Report:
CS-TR-3659 and UMIACS-TR-96-46
- In this paper, we investigate network-aware
mobile programs, programs that can use
mobility as a tool to adapt to variations
in network characteristics. We present
infrastructural support for mobility and
network monitoring and show how adaptalk,
a Java-based mobile Internet chat
application can take advantage of this
support to dynamically place the chat
server so as to minimize response time.
Our conclusion was that on-line network
monitoring and adaptive placement of
shared data-structures can significantly
improve performance of distributed
applications on the Internet.
Guy Edjlali, Serge Petiton, and Nahid Emad
Proceedings of the International
Conference on Information Systems, Analysis and
Synthesis, June 1996
- The data-parallel paradigm, which is effective for many numerical algebra computations, is no longer sufficient when programming heterogeneous machines. To efficiently utilize these new resources we need to combine two paradigms: data parallelism and task parallelism. This leads us to define new algorithms.
- In this paper, we illustrate this point through a series of examples. The chosen example is the computation of some eigenvalues of large non-structured sparse non-Hermitian matrices. We present a new algorithm, Interleaved Arnoldi, and compare its performance to the parallel Arnoldi algorithm. The different environments we consider are: a CM5, a CM5 and a Sparc 20, and a CM5 and 2 Sparc 20s. We show that significant results can be obtained even when the heterogeneous machine consists of a CM5 and 2 Sparc 20 workstations.
Guy Edjlali, Alan Sussman, and Joel
Saltz
University of Maryland Technical Report:
CS-TR-3633 and UMIACS-TR-96-30
May 1996
- This paper describes a framework for providing the ability to use multiple specialized data parallel libraries and/or languages within a single application. The ability to use multiple libraries is required in many application areas, such as multidisciplinary complex physical simulations and remote sensing image database applications. An application can consist of one program or multiple programs that use different libraries to parallelize operations on distributed data structures. The framework is embodied in a runtime library called Meta-Chaos that has been used to exchange data between data parallel programs written using High Performance Fortran, the Chaos and Multiblock Parti libraries developed at Maryland for handling various types of unstructured problems, and the runtime library for pc++, a data parallel version of C++ from Indiana University. Experimental results show that Meta-Chaos is able to move data between libraries efficiently, and that Meta-Chaos provides effective support for complex applications.
M. Ujaldon Martinez, S. Sharma, J. Saltz, and E. Zapata
Proceedings of the ACM International
Conference of Supercomputing, pages
78-85, Philadelphia, Pennsylvania, May 25-27,
1996
- Sparse matrix problems are difficult to parallelize efficiently on distributed memory machines since non-zero elements are unevenly scattered and are accessed via multiple levels of indirection. Irregular distributions that achieve good load balance and locality are hard to compute, have high memory overheads and also lead to further indirection in locating distributed data. This paper evaluates alternative semi-regular distribution strategies which trade off the quality of the load balance and locality for lower distribution overheads and efficient lookup. The proposed techniques are compared to an irregular sparse matrix partitioner and the relative merits of each distribution method are outlined.
Gagan Agrawal, Anurag Acharya, and Joel
Saltz
Proceedings of the ACM International
Conference of Supercomputing, pages 358-365,
Philadelphia, Pennsylvania, May 25-27, 1996
- Overlapping memory accesses with
computations is a standard technique for
improving performance of modern
architectures, which have deep memory
hierarchies. In this paper, we present a
compiler technique for overlapping
accesses to secondary memory (disks) with
computation. We have developed an
Interprocedural Balanced Code Placement
(IBCP) framework, which performs analysis
on arbitrary recursive procedures and
arbitrary control flow and replaces
synchronous I/O operations with a
balanced pair of asynchronous operations.
We demonstrate how this analysis is useful for applications which perform frequent and large accesses to secondary memory, including applications which snapshot or checkpoint their computations and out-of-core applications.
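
The transformation itself is easy to picture. The sketch below is our illustration of the idea, not the compiler's output: a blocking read is split into an asynchronous issue, placed as early as the data dependences allow, and a matching wait placed just before the first use, so the disk access overlaps the intervening computation.

```python
# Our illustration of the IBCP idea (not the compiler's actual output):
# split a blocking read into an asynchronous issue/wait pair so the disk
# access overlaps the computation between the two points.
from concurrent.futures import ThreadPoolExecutor

def read_chunk(path, offset, size):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

with open("data.bin", "wb") as f:       # dummy input file for the demo
    f.write(b"\0" * (1 << 20))

pool = ThreadPoolExecutor(max_workers=1)

# Before: data = read_chunk("data.bin", 0, 1 << 20)   (blocks here)
future = pool.submit(read_chunk, "data.bin", 0, 1 << 20)  # issue early
busy = sum(i * i for i in range(10 ** 6))  # independent work overlaps the I/O
data = future.result()                     # wait, just before the first use
print(len(data), busy > 0)
```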
Anurag Acharya, Mustafa Uysal, Robert Bennett,
Assaf Mendelson, Michael Beynon, Jeff
Hollingsworth, Joel Saltz, and Alan Sussman
Fourth Annual Workshop on I/O in Parallel
and Distributed Systems, Philadelphia,
Pennsylvania, May 27, 1996
- Getting good I/O performance from
parallel programs is a critical problem
for many application domains. In this
paper, we report our experience tuning
the I/O performance of four application
programs from the areas of sensor data
processing and linear algebra. After tuning, three of the four applications achieve effective I/O rates of over 100 MB/s on 16 processors. The total volume of I/O required by the programs ranged from about 75 MB to over 200 GB. We
report the lessons learned in achieving
high I/O performance from these
applications, including the need for code
restructuring, local disks on every node
and overlapping I/O with computation. We
also report our experience on achieving
high performance on peer-to-peer
configurations. Finally, we comment on
the necessity of complex I/O interfaces
like collective I/O and strided requests
to achieve high performance.
Anurag Acharya
Proceedings of the ACM International
Conference of Supercomputing, pages 325-332,
Philadelphia, Pennsylvania, May 1996
- A rule-based program consists of a set of
if-then rules and a tuple-space. The
rules are the code for the program and
the tuple-space contains the data being
processed by the program. Previous
efforts to parallelize rule-based
programs have achieved limited speedups.
The main reason for these disappointing
results is a high frequency of barrier
synchronizations. Since little work is
done between successive barrier
synchronizations, the number of
processors that can be effectively
utilized is bounded. Even though required
by language semantics, a large fraction
of the barrier synchronizations are not
necessary for most programs. This paper
proposes a pair of simple language
extensions that allow an implementation
to efficiently detect and eliminate
redundant barrier synchronizations.
Simulation results based on a real
implementation show that for a set of
five benchmarks, this scheme is able to
eliminate between 95.6% and 99.9% of the
barrier synchronizations. This results in
a multiplicative speedup of between 2.2
and 52.3 fold over and above the speedup
achieved by a parallelizing compiler. For
the programs studied, simulations
indicate that speedups up to 115 fold
relative to an optimized sequential
version are possible.
Mudumbai Ranganathan, Anurag Acharya, Guy
Edjlali, Alan Sussman, and Joel Saltz
Proceedings of the ACM International
Conference of Supercomputing, pages 229-236,
Philadelphia, Pennsylvania, May 25-27, 1996
University of Maryland: Department of Computer
Science and UMIACS Technical Reports CS-TR-3565
and UMIACS TR-95-116
May 1996
- We consider the problem of efficiently coupling multiple data-parallel programs at runtime. We propose an approach that
establishes a mapping between data
structures in different data-parallel
programs and implements a user-specified
consistency model. Mappings are
established at runtime and can be added
and deleted while the programs being
coupled are in execution. Mappings, or
the identity of the processors involved,
do not have to be known at compile-time
or even link-time. Programs can be made
to interact with different granularities
of interaction without requiring any
re-coding. A priori knowledge of consistency requirements allows buffering of data as well as concurrent execution
of the coupled applications. Efficient
data movement is achieved by
pre-computing an optimized schedule. We
describe our prototype implementation and
evaluate its performance using a set of
synthetic benchmarks. We examine the
variation of performance with variation
in the consistency requirement. We
demonstrate that the cost of the
flexibility provided by our coupling
scheme is not prohibitive when compared
with a monolithic program that performs
the same computation.
Bongki Moon, Anurag Acharya, and Joel
Saltz
Proceedings of the 1996 International Parallel Processing Symposium, April 1996
University
of Maryland Technical Report: CS-TR-3589 and
UMIACS-TR-96-4 (extended version)
- Efficient storage and retrieval of large
multidimensional datasets has been one of
the key issues in large-scale scientific
computations such as long running
time-dependent simulations periodically
generating snapshots of the state. The
main challenge for efficiently handling
such datasets is to minimize response
time for multidimensional range queries.
The grid file is one of the well known
access methods for multidimensional and
spatial data. We investigate effective
and scalable declustering techniques for
grid files with the primary goal of
minimizing response time and the
secondary goal of maximizing the fairness
of data distribution. The main contributions of this paper are (1) the
analytic and experimental evaluation of
existing index-based declustering
techniques and their extensions for grid
files, and (2) the development of an
improved proximity-based declustering
algorithm called minimax which has been
experimentally shown to scale well and
consistently achieve better response time
compared to all the other algorithms
while maintaining perfect disk
distribution.
Bongki Moon and Joel H. Saltz
IEEE Transactions on Software Engineering,
April 1996
University of Maryland Technical Report:
CS-TR-3590 and UMIACS-TR-96-5
- Efficient storage and retrieval of
multi-attribute datasets has become one
of the essential requirements for many
data-intensive applications. The
Cartesian product file has been known as
an effective multi-attribute file
structure for partial-match and
best-match queries. Several heuristic
methods have been developed to decluster
Cartesian product files over multiple
disks to obtain high performance for disk
accesses. Though the scalability of the
declustering methods becomes increasingly
important for systems equipped with a
large number of disks, no analytic
studies have been done so far. In this paper we derive formulas describing the scalability of two popular declustering methods, Disk Modulo and Fieldwise Xor, for range queries, which are the most common type of queries. These formulas disclose
the limited scalability of the
declustering methods and are corroborated
by extensive simulation experiments. From
the practical point of view, the formulas
given in this paper provide a simple
measure which can be used to predict the
response time of a given range query and
to guide the selection of a declustering
method under various conditions.
Bongki Moon, H. V. Jagadish, Christos
Faloutsos, and Joel H. Saltz
IEEE Transactions on Knowledge and
Data Engineering, March 1996
University of Maryland Department of Computer
Science Technical Report, CS-TR-3611 and
UMIACS-TR-96-20
- Several schemes for linear mapping of
multidimensional space have been proposed
for many applications such as access
methods for spatio-temporal databases,
image compression and so on. In all these
applications, one of the most desired
properties from such linear mappings is
clustering, which means the locality
between objects in the multidimensional
space is preserved in the linear space.
It is widely believed that the Hilbert space-filling curve achieves the best clustering. In this paper we provide closed-form formulas for the number of clusters required by a given query region of an arbitrary shape (e.g., polygons and polyhedra) for the Hilbert space-filling curve. Both the asymptotic solution for a
general case and the exact solution for a
special case generalize the previous
work, and they agree with the empirical
results that the number of clusters
depends on the hyper-surface area of the
query region and not on its hyper-volume.
We have also shown that the Hilbert curve achieves better clustering than the z curve.
From the practical point of view, the
formulas given in this paper provide a
simple measure which can be used to
predict the required disk access
behaviors and hence the total access
time.
Index Terms: locality-preserving linear
mapping, range queries, multi-attribute access
methods, data clustering, Hilbert curve,
space-filling curves, fractals.
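
The clustering notion can be made concrete with a short experiment (our illustration; the coordinate-to-index conversion below is the standard Hilbert curve bit-manipulation algorithm, not code from the paper): map every cell of a query rectangle to its Hilbert index, sort, and count the contiguous runs; each run is one cluster, i.e. one sequential disk access.

```python
# Our illustration: count Hilbert-curve clusters covered by a query rectangle.

def xy2d(n, x, y):
    """Standard conversion of (x, y) to its index on an n x n Hilbert curve
    (n must be a power of two)."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def clusters(n, x0, x1, y0, y1):
    """Number of contiguous curve segments (clusters) in a query rectangle."""
    ds = sorted(xy2d(n, x, y) for x in range(x0, x1) for y in range(y0, y1))
    return 1 + sum(1 for a, b in zip(ds, ds[1:]) if b != a + 1)

# Count clusters for a 3x3 query on a 16x16 grid:
print(clusters(16, 5, 8, 5, 8))
```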
G. Agrawal, J. Saltz
Proceedings of Supercomputing 95, San
Diego, California, December 3-8, 1995.
University of Maryland: Department of Computer
Science Technical Report CS-TR-3447
- Data parallel languages like High
Performance Fortran (HPF) are emerging as
the architecture independent mode of
programming distributed memory parallel
machines. In this paper, we present the
interprocedural optimizations required
for compiling applications that have
irregular data access patterns, when
coded in such data parallel languages. We
have developed an Interprocedural Partial
Redundancy Elimination (PRE) algorithm
for optimized placement of runtime
preprocessing routines and collective
communication routines inserted for
managing communication in such codes. We
also present three new interprocedural
optimizations: placement of scatter
routines, deletion of data structures and
use of coalescing and incremental
routines. We then describe how program
slicing can be used for further applying
IPRE in more complex scenarios. We have
done a preliminary implementation of the
schemes presented here using the Fortran
D implementation system as the necessary
infrastructure. We present experimental
results from two codes compiled using our
system to demonstrate the efficacy of the
presented schemes.
Manuel Ujaldon, Shamik D. Sharma, Joel Saltz,
and Emilio Zapata
Proceedings of the 1995 Workshop on
Irregular Problems, pages 78-85, September
1995
- Sparse matrix problems are difficult to
parallelize efficiently on
message-passing machines, since they
access data through multiple levels of
indirection. Inspector/executor strategies, which are typically used to parallelize such problems, impose significant preprocessing overheads.
This paper describes the runtime support
required by new compilation techniques
for sparse matrices and evaluates their
performance, highlighting optimizations
and improvements over previous
techniques.
Ravi Ponnusamy, Joel Saltz, Alok Choudhary,
Yuan-Shin Hwang, and Geoffrey Fox
IEEE Transactions on Parallel and Distributed Systems, pages 815-831, Volume 6, Number 8, August 1995
University of Maryland: Department of Computer
Science Technical Report CS-TR-3194 and
UMIACS-TR-93-135
- This paper describes two new ideas by
which an HPF compiler can deal with
irregular computations effectively. The
first mechanism invokes a user specified
mapping procedure via a set of compiler
directives. The directives allow use of
program arrays to describe graph
connectivity, spatial location of array
elements and computational load. The
second mechanism is a simple conservative
method that in many cases enables a
compiler to recognize that it is possible
to reuse previously computed information
from inspectors (e.g. communication
schedules, loop iteration partitions,
information that associates off-processor
data copies with on-processor buffer
locations). We present performance
results for these mechanisms from a
Fortran 90D compiler implementation.
R. P. Nance, R. G. Wilmouth, B. Moon, H.
A. Hassan, and J. Saltz
AIAA Journal of Thermophysics and Heat
Transfer, pages 471-477, Volume 9, Number 3,
July-September 1995
University of Maryland Technical Report:
CS-TR-3425 and UMIACS-TR-95-25
- This paper describes a parallel implementation of the direct simulation Monte Carlo method. Runtime library support is used for scheduling and execution of communication between nodes, and domain decomposition is performed dynamically to maintain a favorable load balance. Performance tests are conducted using the code to evaluate various remapping and remapping-interval policies, and it is shown that a one-dimensional chain-partitioning method works best for the problems considered. The parallel code is then used to simulate the Mach 20 nitrogen flow over a finite-thickness flat plate. It is shown that the parallel algorithm produces results which are very similar to previous DSMC results, despite the increased resolution available. However, it yields significantly faster execution times than the scalar code, as well as very good load-balance and scalability characteristics.
S. Mukherjee, S. Sharma, M. Hill, J. Larus, A.
Rogers, and J. Saltz
Proceedings of the Fifth ACM SIGPLAN
Symposium on Principles & Practice of
Parallel Programming '95, pages
68-79, Santa Barbara, California, July 19-21,
1995
ACM SIGPLAN Notices: Volume 30, Number
8
- Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive and would therefore seem appropriate candidates for parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and adversely affects run-time performance.
- This paper explores three issues -- partitioning, mutual exclusion, and data transfer -- crucial to the efficient execution of irregular problems on distributed-memory machines. Unlike previous work, we studied the same programs running under three alternative software systems on the same hardware base (a Thinking Machines CM-5): the CHAOS irregular application library, transparent shared memory (TSM), and eXtensible shared memory (XSM). CHAOS and XSM performed equivalently for all three applications. Both systems were somewhat (13%) to significantly (991%) faster than TSM.
Gagan Agrawal, Alan Sussman, and Joel
Saltz.
IEEE Transactions on Parallel and Distributed Systems, pages 747-754, Volume 6, Number 7, July 1995
University of Maryland: Department of Computer
Science and UMIACS Technical Reports CS-TR-3143,
UMIACS-TR-93-94
- Scientific and engineering applications
often involve structured meshes. These
meshes may be nested (for multigrid
codes) and/or irregularly coupled (called
multiblock or irregularly coupled regular
mesh problems). In this paper, we present
a combined runtime and compile-time
approach for parallelizing these
applications on distributed memory
parallel machines in an efficient and
machine-independent fashion. We have designed and implemented a runtime library which can be used to port these applications to distributed memory machines. The library is currently implemented on several different systems. Since the design of the library is machine independent, it can easily be ported to other distributed memory machines and environments which support message passing. To further ease the task
of application programmers, we have
developed methods for integrating this
runtime library with compilers for
HPF-like parallel programming languages.
We discuss how we have integrated this
runtime library with the Fortran 90D
compiler being developed at Syracuse
University. We present experimental
results to demonstrate the efficacy of
our approach. We have experimented with a
multiblock Navier-Stokes solver template
and a multigrid code. Our experimental
results show that our primitives have low
runtime communication overheads. Further,
the compiler parallelized codes perform
within 20% of the code parallelized by
manually inserting calls to the runtime
library.
Yuan-Shin Hwang, Raja Das, Joel Saltz, Bernard
Brooks, and Milan Hodoscek
IEEE Computational Science and Engineering,
pages 18-29, Volume 2, Number 2, Summer
1995
University of Maryland: Department of Computer
Science and UMIACS Technical Reports CS-TR-3374,
UMIACS-TR-94-125
- CHARMM (Chemistry at Harvard
Macromolecular Mechanics) is a program
that is widely used to model and simulate
macromolecular systems. CHARMM has been
parallelized by using the CHAOS runtime
support library on distributed memory
architectures. This implementation
distributes both data and computations
over processors. This data-parallel
strategy should make it possible to
simulate very large molecules on large
numbers of processors.
- In order to minimize communication among
processors and to balance computational
load, a variety of partitioning
approaches are employed to distribute the
atoms and computations over processors.
In this implementation, atoms are
partitioned based on geometrical
positions and computational load by using
unweighted or weighted recursive
coordinate bisection. The experimental
results reveal that taking computational
load into account is essential. The
performance of two iteration-partitioning algorithms, atom decomposition and force decomposition, is also compared. A new irregular force decomposition algorithm is introduced and implemented.
- The CHAOS library is designed to
facilitate parallelization of irregular
applications. This library (1) couples
partitioners to the application programs,
(2) remaps data and partitions work among
processors, and (3) optimizes
interprocessor communications. This paper presents an application of CHAOS that can be used to support efficient execution of irregular problems on distributed memory machines.
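
Weighted recursive coordinate bisection, one of the partitioners mentioned above, is compact enough to sketch (our illustration, not the CHAOS implementation): split the atoms along the widest coordinate so that each side carries about half the computational load, then recurse.

```python
# Sketch of weighted recursive coordinate bisection (our illustration, not
# the CHAOS implementation): split along the coordinate with the largest
# spread so each side carries roughly half the load, then recurse.

def rcb(atoms, nparts):
    """atoms: list of (position_tuple, weight); returns nparts partitions."""
    if nparts == 1:
        return [atoms]
    dims = len(atoms[0][0])
    axis = max(range(dims), key=lambda d: max(a[0][d] for a in atoms)
                                        - min(a[0][d] for a in atoms))
    atoms = sorted(atoms, key=lambda a: a[0][axis])
    half, acc = sum(w for _, w in atoms) / 2, 0.0
    for i, (_, w) in enumerate(atoms):
        acc += w
        if acc >= half:           # cut where the cumulative load reaches half
            cut = i + 1
            break
    left = nparts // 2
    return rcb(atoms[:cut], left) + rcb(atoms[cut:], nparts - left)

# 4x4 grid of atoms; diagonal atoms carry twice the computational weight.
atoms = [((x, y), 1.0 + (x == y)) for x in range(4) for y in range(4)]
for part in rcb(atoms, 4):
    print(len(part), sum(w for _, w in part))
```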
Y.-S. Hwang, B. Moon, S. Sharma, R. Ponnusamy,
R. Das, and J. Saltz
Software Practice & Experience,
pages 597-621, Volume 25, Number 6, June
1995
In many scientific applications, arrays
containing data are indirectly indexed through
indirection arrays. Such scientific applications
are called irregular programs and are a distinct
class of applications that require special
techniques for parallelization.
- This paper presents a library called CHAOS, which helps users implement irregular programs on distributed-memory message-passing machines, such as the Paragon, Delta, CM-5, and SP-1. The CHAOS library provides efficient runtime primitives for distributing data and computation over processors; it supports efficient translation mechanisms and provides users with high-level mechanisms for optimizing communication. CHAOS subsumes the previous PARTI library and supports a larger class of applications. In particular, it provides efficient support for parallelization of adaptive irregular programs where indirection arrays are modified during the course of
computation. To demonstrate the efficacy
of CHAOS, two challenging real-life
applications were parallelized using
CHAOS primitives: a molecular dynamics
code, CHARMM, and a particle-in-cell
code, DSMC.
- Besides providing runtime support to
users, CHAOS can also be used by
compilers to automatically parallelize
irregular applications. This paper shows
how CHAOS can be used in such a
framework. By embedding CHAOS primitives in the Syracuse Fortran 90D/HPF compiler,
kernels taken from CHARMM and DSMC codes
have been parallelized automatically.
G. Agrawal, J. Saltz, and R. Das
Proceedings of the SIGPLAN '95 Conference
on Programming Language Design and
Implementation, pages 258-269, La Jolla,
California, June 18-21, 1995
University of Maryland: Department of Computer
Science and UMIACS Technical Reports CS-TR-3446
and UMIACS-TR-95-42
- Partial Redundancy Elimination (PRE) is a
general scheme for suppressing partial
redundancies which encompasses
traditional optimizations like loop
invariant code motion and redundant code
elimination. In this paper we address the
problem of performing this optimization
interprocedurally. We use interprocedural
partial redundancy elimination for
placement of communication and
communication preprocessing statements
while compiling for distributed memory
parallel machines.
J. Wu, R. Das, J. Saltz, H. Berryman, and S.
Hiranandani
IEEE Transactions on Computers, pages
737-753, Volume 44, Number 6, June 1995
- This paper addresses the issue of
compiling concurrent loop nests in the
presence of complicated array references
and irregularly distributed arrays.
Arrays accessed within loops may contain accesses that make it impossible to precisely determine the reference pattern at compile time. This paper proposes a runtime support mechanism that is used effectively by a compiler to generate efficient code in these situations. The
compiler accepts as input a Fortran 77
program enhanced with specifications for
distributing data, and outputs a message
passing program that runs on the nodes of
a distributed memory machine. The runtime
support for the compiler consists of a
library of primitives designed to support
irregular patterns of distributed array
accesses and irregularly distributed
array partitions. A variety of
performance results on the Intel iPSC/860
are presented.
G. Edjlali, G. Agrawal, A. Sussman, and J.
Saltz
Proceedings of the Ninth International
Parallel Processing Symposium, pages 827-832,
Santa Barbara, California, April 24-28 1995
University of Maryland Technical Report:
CS-TR-3350 and UMIACS-TR-94-109 (extended version)
- For better utilization of computing
resources, it is important to consider
parallel programming environments in
which the number of available processors
varies at runtime. In this paper, we
discuss runtime support for data parallel
programs in such an adaptive environment.
Executing data parallel programs in an
adaptive environment requires determining
new loop bounds and communication
patterns for the new set of processors.
We have developed a runtime library to
provide this support. We discuss how the
runtime library can be used by compilers
to generate code for an adaptive
environment. We also present performance
results for a multiblock Navier-Stokes
solver run on a network of workstations
using PVM for message passing. Our
experiments show that if the number of
processors is not varied frequently, the
cost of data redistribution is not
significant compared to the time required
for the actual computations.
B. Moon, M. Uysal, and J. Saltz
Proceedings of the Ninth International
Parallel Processing Symposium, pages
812-819, Santa Barbara, California, April 24-28,
1995
- Current research in parallel programming
is focused on closing the gap between
globally indexed algorithms and the
separate address spaces of processors on
distributed memory multicomputers. A set
of index translation schemes has been implemented as part of the CHAOS runtime support library, so that the library
functions can be used for implementing a
global index space across a collection of
separate local index spaces. These
schemes include two software-cached
translation schemes aimed at adaptive
irregular problems as well as a
distributed translation table technique
for statically irregular problems. To
evaluate and demonstrate the efficiency
of the software-cached translation
schemes, experiments have been performed with an adaptive irregular loop kernel and a full-fledged 3D DSMC code from NASA Langley on the Intel Paragon and Cray
T3D. This paper also discusses and
analyzes the operational conditions under
which each scheme can produce optimal
performance.
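
A distributed translation table, the scheme used for statically irregular problems, can be sketched as follows (our illustration, not the CHAOS interface): the table mapping each global index to its owner and local offset is itself block-distributed, so a translation costs one round-trip to the processor holding that block of entries.

```python
# Sketch of a distributed translation table (our illustration, not the
# CHAOS interface): map a global index to (owner, local offset) when the
# data distribution is irregular.

def build_table(global_to_owner_local, nprocs):
    """Entry for global index g is stored on processor g // block."""
    n = len(global_to_owner_local)
    block = (n + nprocs - 1) // nprocs
    table = [dict() for _ in range(nprocs)]
    for g, (owner, local) in global_to_owner_local.items():
        table[g // block][g] = (owner, local)
    return table, block

def translate(table, block, g):
    # In a real system this is a message to the processor holding entry g.
    return table[g // block][g]

# Irregular distribution of 8 elements over 2 processors:
dist = {0: (1, 0), 1: (0, 0), 2: (0, 1), 3: (1, 1),
        4: (1, 2), 5: (0, 2), 6: (0, 3), 7: (1, 3)}
table, block = build_table(dist, 2)
print(translate(table, block, 5))   # (0, 2): element 5 lives on proc 0
```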
C. Chang, A. Sussman, and J. Saltz
University of Maryland: Department of Computer
Science and UMIACS Technical Reports CS-TR-3438
and UMIACS-TR-95-35
March, 1995
- Object-oriented applications utilize
language constructs such as pointers to
synthesize dynamic complex data
structures, such as linked lists, trees
and graphs, with elements consisting of
complex data types. Traditionally, however, applications executed on distributed memory parallel architectures in single-program multiple-data (SPMD) mode use distributed (multi-dimensional) data arrays. Good performance has been achieved by applying runtime techniques to such applications executing in a loosely synchronous manner. Existing runtime
systems that rely solely on global
indices are not always applicable to
object-oriented applications, since no
global names or indices are imposed upon
dynamic complex data structures linked by
pointers.
- We describe a portable object-oriented
runtime library that has been designed to
support applications that use dynamic
distributed data structures, including
both arrays and pointer-based data
structures. In particular, CHAOS++ deals
with complex data types and pointer-based
data structures, by providing two
abstractions, mobile objects and globally
addressable objects. CHAOS++ uses
preprocessing techniques to analyze
communication patterns, and provides data
exchange routines to perform efficient
transfers between processors. Results for
applications taken from three distinct
classes are also presented to demonstrate
the wide applicability and good
performance characteristics of the
runtime library.
J. Saltz, R. Ponnusammy, S. Sharma, B. Moon,
Y.-S. Hwang, M. Uysal, and R. Das
University of Maryland: Department of Computer
Science and UMIACS Technical Reports CS-TR-3437
and UMIACS-TR-95-34
March 1995
- Procedures are presented that are
designed to help users efficiently
program irregular problems (e. g.
unstructured mesh sweeps, sparse matrix
codes, adaptive mesh partial differential
equations solvers) on distributed memory
machines. These procedures are also
designed for use in compilers for
distributed memory multiprocessors. The
portable CHAOS procedures are designed to
support dynamic data distributions and to
automatically generate send and receive messages by capturing communication patterns at runtime.
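
The inspector/executor idea behind these procedures can be sketched in a few lines (a schematic of the general technique, not the CHAOS interface): an inspector scans the indirection array once to build a communication schedule, and the executor then runs the loop using local data plus the prefetched off-processor copies. The schedule is reused as long as the indirection array does not change.

```python
# Schematic of the inspector/executor technique underlying libraries like
# CHAOS (a sketch of the general idea, not the CHAOS interface).

def inspector(indirection, owner, myrank):
    """Scan the indirection array once and record which global indices
    must be fetched from other processors (the communication schedule)."""
    schedule = {}
    for i in indirection:
        p = owner(i)
        if p != myrank:
            schedule.setdefault(p, set()).add(i)
    return {p: sorted(idxs) for p, idxs in schedule.items()}

def executor(local, fetched, indirection, owner, myrank):
    """Execute the loop using local data plus the prefetched copies."""
    return [local[i] if owner(i) == myrank else fetched[i]
            for i in indirection]

# Toy run: 8 elements block-distributed over 2 processors; rank 0 owns 0-3.
owner = lambda i: i // 4
indirection = [0, 5, 2, 7, 1]
sched = inspector(indirection, owner, myrank=0)   # {1: [5, 7]}
fetched = {i: 10 * i for idxs in sched.values() for i in idxs}  # stand-in
local = {0: 0, 1: 10, 2: 20, 3: 30}
print(executor(local, fetched, indirection, owner, myrank=0))
```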
Bongki Moon, Gopal Patnaik, Robert Bennett, David Fyfe, Alan Sussman, Craig Douglas, K. Kailasanath, and Joel Saltz
Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, pages 575-580, February 1995
One class of scientific and engineering
applications involves structured meshes. One
example of a code in this class is a flame
modelling code developed at the Naval Research
Laboratory (NRL). The numerical model used in the
NRL flame code is predominantly based on
structured finite volume methods. The chemistry
process of the reactive flow is modeled by a
system of ordinary differential equations which
is solved independently at each grid point. Thus,
though the model uses a mesh structure, the
workload at each grid point can vary
considerably. It is this feature that requires
the use of both structured and unstructured
methods in the same code. We have applied the
Multiblock PARTI and CHAOS runtime support
libraries to parallelize the NRL flame code with
minimal changes to the sequential code. We have
also developed parallel algorithms to carry out
dynamic load balancing. It has been observed that
the overall performance scales reasonably up to
256 Paragon processors and that the total runtime
on a 256-node Paragon is about half that of a
single processor Cray C90.
Chialin Chang, Alan Sussman, and Joel
Saltz
University of Maryland: Department of Computer
Science Technical Report CS-TR-3416 and
UMIACS-TR-95-19
January 1995
- Traditionally, applications executed on
distributed memory architectures in
single-program multiple-data (SPMD) mode
use distributed (multi-dimensional) data
arrays. Good performance has been
achieved by applying runtime techniques
to such applications executing in a
loosely synchronous manner. However, many
applications utilize language constructs
such as pointers to synthesize dynamic
complex data structures, such as linked
lists, trees and graphs, with elements
consisting of complex composite data
types. Existing runtime systems that rely
on global indices cannot be used for
these applications, as no global names or
indices are imposed upon the elements of
these data structures.
- A portable object-oriented runtime
library is presented to support
applications that use dynamic distributed
data structures, including both arrays
and pointer-based data structures. In
particular, CHAOS++ deals with complex
data types and pointer-based data
structures by providing mobile objects
and globally addressable objects.
Preprocessing techniques are used to
analyze communication patterns, and data
exchange primitives are provided to carry
out efficient data transfer. Performance
results for applications taken from three
distinct classes are also included to
demonstrate the wide applicability of the
runtime library.
R. Bennett, K. Bryant, A. Sussman, R. Das, J.
Saltz
University of Maryland: Department of Computer
Science and UMIACS Technical Reports CS-TR-3417
and UMIACS-TR-95-20
January 1995
- There has been a great deal of recent interest in parallel I/O. This paper discusses issues in the design and implementation of a portable I/O library designed to optimize the performance of multiprocessor architectures that include multiple disks or disk arrays. The major emphasis of the paper is on optimizations that are made possible by the use of collective I/O, so that the I/O requests from multiple processors can be combined to improve performance. Performance measurements from benchmarking our implementation of the library, called Jovian, which currently performs collective local optimizations, on three application templates are also presented.
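
The central collective-I/O optimization, combining the small requests of many processes into a few large contiguous ones, can be sketched as follows (our schematic of the general technique, not Jovian's interface).

```python
# Schematic of collective I/O request coalescing (not Jovian's interface):
# merge the (offset, size) requests of all processes into large reads.

def coalesce(requests, gap=0):
    """Merge overlapping or adjacent (offset, size) requests from all
    processes into a minimal list of larger contiguous requests."""
    merged = []
    for off, size in sorted(requests):
        if merged and off <= merged[-1][1] + gap:
            merged[-1][1] = max(merged[-1][1], off + size)
        else:
            merged.append([off, off + size])
    return [(lo, hi - lo) for lo, hi in merged]

# Interleaved requests from 4 processes reading two stripes of an array:
reqs = [(p * 4096 + stripe * 32768, 4096)
        for stripe in range(2) for p in range(4)]
print(coalesce(reqs))   # two 16 KB reads instead of eight 4 KB reads
```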
F. Chong, S. Sharma, E. Brewer, and J.
Saltz
International Conference on Supercomputing
1995, December 1994
- This paper examines parallelization of
computations that can be characterized as
a fine-grained, irregular, directed
acyclic graph of tasks. Such computations
typically arise in sparse matrix
applications where loops may have loop
carried data dependencies. The fine grain
of the computation and the irregular
nature of the dependencies make such
loops extremely hard to parallelize
efficiently. We use runtime preprocessing
to parallelize such computations. A
preprocessing step analyzes the DAG and
determines an optimized mapping of tasks
to processors, an appropriate order of
executing these tasks on each processor
and a set of synchronization constraints.
This information is used to distribute
data across processors, and to schedule
computation, communication and
synchronization. We use sparse-matrix
triangular solution as our example
application and compare a range of
runtime preprocessed techniques for its
parallelization. Whereas previous studies
have achieved maximum speedups of less
that 4 on even simple banded matrices, we
achieve a speedup of almost 35 on 128
processors of the CM-5 on an irregular
matrix with 5300 rows. We find that and
receive overheads dominate the time of
out application. Efficient
parallelization of such loops requires
even more support for low latency
communication.
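
The preprocessing step can be illustrated with the classic level-scheduling computation for DAGs of tasks (our sketch of the standard technique, not the paper's exact implementation): each task's level is one more than the deepest task it depends on, and all tasks in a level can execute in parallel.

```python
# Sketch of level scheduling for a DAG of tasks (the standard preprocessing
# for sparse triangular solves; not the paper's exact implementation).

def levels(deps):
    """deps[t] lists the tasks t depends on; return tasks grouped by level."""
    level = {}
    def lvl(t):
        if t not in level:
            level[t] = 1 + max((lvl(d) for d in deps[t]), default=-1)
        return level[t]
    for t in deps:
        lvl(t)
    out = {}
    for t, l in level.items():
        out.setdefault(l, []).append(t)
    return [sorted(out[l]) for l in sorted(out)]

# Lower-triangular system: row i depends on the rows of its off-diagonal
# nonzeros.  Rows 0 and 1 are independent; 2 needs 0; 3 needs 1 and 2.
deps = {0: [], 1: [], 2: [0], 3: [1, 2]}
print(levels(deps))   # [[0, 1], [2], [3]]
```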
Shamik D. Sharma, Ravi Ponnusamy, Bongki Moon,
Yuan-Shin Hwang, Raja Das, and Joel Saltz
Supercomputing 1994, pages 97-106,
Washington, District of Columbia, November 14-18,
1994
- In adaptive irregular problems the data
arrays are accessed via indirection
arrays, and data access patterns change
during computation. Implementing such
problems on distributed memory machines
requires support for dynamic data
partitioning, efficient preprocessing and
fast data migration. This research
presents efficient runtime primitives for
such problems. This new set of primitives
is part of the CHAOS library. It subsumes
the previous PARTI library which targeted
only static irregular problems. To
demonstrate the efficacy of the runtime
support, two real adaptive irregular
applications have been parallelized using
CHAOS primitives: a molecular dynamics
code (CHARMM) and a particle-in-cell code
(DSMC). The paper also proposes
extensions to Fortran D which can allow
compilers to generate more efficient code
for adaptive problems. These language
extensions have been implemented in the
Syracuse Fortran 90D/HPF prototype
compiler. The performance of the compiler
parallelized codes is compared with the
hand parallelized versions.
R. Bennett, K. Bryant, A. Sussman, and R.
Das
Proceedings of the 1994 Scalable Parallel
Libraries Conference, pages 10-20,
Mississippi State, Mississippi, October 12-14,
1994
- There has been a great deal of recent
interest in parallel I/O. In this paper,
we discuss the design and implementation
of the Jovian library, which is intended
to optimize the I/O performance of
multiprocessor architectures that include
multiple disks or disk arrays. We also
present preliminary performance
measurements from benchmarking the Jovian
I/O library on the IBM SP1 distributed
memory parallel machine for two
application templates.
Rahul Parulekar, Larry Davis, Rama Chellappa,
Joel Saltz, Alan Sussman, and John
Townshend
Proceedings of the International Joint
Conference on Pattern Recognition, IEEE
Computer Society Press, pages 234-238, Jerusalem,
Israel, October 9-13, 1994
- We present the overall goals of our
research program on the application of
high performance computing to remote
sensing applications, specifically,
applications in land cover dynamics. This
involves developing scalable and portable
programs for a variety of image and map
data processing applications, eventually
integrated with new models for parallel
I/O of large scale images and maps. After
an overview of the multiblock PARTI run
time support system, we explain
extensions made to that system to support
image processing applications, and then
present an example involving
multiresolution image processing. Results
of running the parallel code on both a
TMC CM5 and an Intel Paragon are
discussed.
R. Das, M. Uysal, J. Saltz, and Y.-S.
Hwang
Journal of Parallel and Distributed
Computing, pages 462-479, Volume 22, Number
3, September 1994
University of Maryland Technical Report
CS-TR-3163 and UMIACS-TR-93-109
- This paper describes a number of
optimizations that can be used to support
the efficient execution of irregular
problems on distributed memory parallel
machines. We describe software primitives
that (1) coordinate interprocessor data
movement, (2) manage the storage of, and
access to, copies of off-processor data,
(3) minimize interprocessor communication
requirements and (4) support a shared
name space. We present a detailed
performance and scalability analysis of
the communication primitives. This
performance and scalability analysis is
carried out using a workload generator,
kernels from real applications and a
large unstructured adaptive application
(the molecular dynamics code CHARMM).
R. Nance, R. Wilmoth, B. Moon, H. Hassan, and
J. Saltz.
Proceedings of the 6th AIAA/ASME Joint Thermophysics and Heat Transfer Conference,
Colorado Springs, Colorado, June 20-23,
1994
- This paper describes a parallel
implementation of the direct simulation
Monte Carlo (DSMC) method. Runtime
library support is used for scheduling
and execution of communication between
nodes, and domain decomposition is
performed dynamically to maintain a good
load balance. Performance tests are
conducted using the code to evaluate
various remapping and remapping-interval
policies, and it is shown that a one-dimensional chain-partitioning method works best for the problems considered. The parallel code is then used to simulate Mach 20 nitrogen flow over a finite-thickness flat plate. It is shown that the parallel algorithm produces results which compare well with experimental data. Moreover, it yields significantly faster execution times than the scalar code, as well as very good load-balance characteristics.
Ravi Ponnusamy, Yuan-Shin Hwang, Joel Saltz,
Alok Choudhary, and Geoffrey Fox
IEEE Parallel & Distributed Technology,
Spring 1995 (revised version)
University of Maryland: Department of Computer
Science and UMIACS Technical Reports CS-TR-3268
and UMIACS-TR-94-57
May 1994
- We present methods that make it possible
to efficiently support an important
subclass of irregular problems using data
parallel languages. The approach we
describe involves the use of a portable,
compiler-independent, runtime support
library called CHAOS. The CHAOS runtime
support library contains procedures that
support static and dynamic distributed
array partitioning, partition loop
iterations and indirection arrays, remap
arrays from one distribution to another,
and carry out index translation, buffer
allocation and communication schedule
generation.
- The CHAOS runtime procedures are used by
a prototype Fortran 90D compiler as
runtime support for irregular problems.
We present performance results of
compiler-generated and hand-parallelized
versions of two stripped down
applications codes. The first code is
derived from an unstructured mesh
computational fluid dynamics flow solver
and the second is derived from the
molecular dynamics code CHARMM.
- A method is described that makes it
possible to emulate irregular
distributions in HPF by reordering
elements of data arrays and renumbering
indirection arrays. We present results
that suggest that an HPF compiler could
use reordering and renumbering extrinsic
functions to obtain performance
comparable to that achieved by a compiler
for a language (such as Fortran 90D) that
directly supports irregular
distributions.
S. Sharma, R. Ponnusamy, B. Moon, Y. Hwang, R.
Das, and J. Saltz
University of Maryland and UMIACS Technical
Report: CS-TR-3266 and UMIACS-TR-94-55
May 1994
A revised version appears in Proceedings of
Supercomputing '94
- In adaptive irregular problems, data
arrays are accessed via indirection
arrays, and data access patterns change
during computation. Parallelizing such problems on distributed memory
machines requires support for dynamic
data partitioning, efficient
preprocessing and fast data
migration. This paper describes
CHAOS, a library of efficient runtime
primitives that provides such
support. To demonstrate the
effectiveness of the runtime support, two
adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a code for simulating gas flows (DSMC). We have also
proposed minor extensions to Fortran D
which would enable compilers to
parallelize irregular forall loops in
such adaptive applications by embedding
calls to primitives provided by a
runtime library. We have
implemented our proposed extensions in
the Syracuse Fortran 90D/HPF prototype
compiler, and have used the compiler to
parallelize kernels from two adaptive
applications.
Bongki Moon and Joel Saltz.
Proceedings of the Scalable High
Performance Computing Conference 1994, pages
176-183, Knoxville, Tennessee, May 23-25,
1994
- In highly adaptive irregular problems
such as many Particle-In-Cell (PIC) codes
and Direct Simulation Monte Carlo (DSMC)
codes, data access patterns may vary from
time step to time step. This fluctuation
may hinder efficient utilization of
distributed memory parallel computers
because of the resulting overhead for
data redistribution and dynamic load
balancing. This may hinder efficient
utilization of runtime pre-processing
because the pre-processing requirements
are sensitive to perturbations in the
data access patterns. To efficiently
parallelize such adaptive irregular
problems on distributed memory parallel
computers, several issues such as
effective methods for domain
partitioning, efficient index
dereferencing and fast data
transportation must be addressed. This
paper presents efficient runtime support
methods for such problems. These new
runtime support primitives have recently
been implemented and added to the CHAOS
library. A new domain partitioning algorithm is introduced: a simple one-dimensional domain partitioning method is implemented and compared with unstructured mesh partitioners such as recursive coordinate bisection and recursive inertial bisection. A remapping
decision policy has been investigated for
dynamic load balancing on 3-dimensional
DSMC codes. Performance results are
presented.
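
A one-dimensional chain partitioner of the kind evaluated here is easy to sketch (our illustration, not the CHAOS routine): given per-cell workloads, cut the one-dimensional ordering into contiguous chains of roughly equal total weight.

```python
# Sketch of one-dimensional chain partitioning (our illustration, not the
# CHAOS routine): cut a 1-D array of cell workloads into nprocs contiguous
# chains with roughly equal total weight.

def chain_partition(weights, nprocs):
    total = sum(weights)
    target = total / nprocs
    bounds, acc, cut = [0], 0.0, 1
    for i, w in enumerate(weights):
        acc += w
        # Greedy cut: close the current chain once its share is reached,
        # leaving at least one cell for each remaining processor.
        if (cut < nprocs and acc >= cut * target
                and len(weights) - i - 1 >= nprocs - cut):
            bounds.append(i + 1)
            cut += 1
    bounds.append(len(weights))
    return [(bounds[p], bounds[p + 1]) for p in range(nprocs)]

# Workload per cell fluctuates from step to step in DSMC; repartitioning
# just recomputes the cuts over the new weights:
weights = [1, 5, 2, 2, 8, 1, 1, 4]
print(chain_partition(weights, 4))
```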
Y.-S. Hwang, B. Moon, S. Sharma, R. Das, and
J. Saltz
Proceedings of the Workshop on Environments
and Tools for Parallel Scientific Computing,
pages 19-32, May 1994
- This paper describes how a runtime
support library can be used as compiler
runtime support in irregular
applications. The CHAOS runtime support
library carries out optimizations
designed to reduce communication costs by
performing software caching,
communication coalescing and
inspector/executor preprocessing. CHAOS
also supplies special purpose routines to
support specific types of irregular
reduction and runtime support for
partitioning data and work between
processors. A number of adaptive
irregular codes have been parallelized
using the CHAOS library and performance
results from these codes are also
presented in this paper.
Frederick T. Chong, Shamik D. Sharma, Eric
Brewer, and Joel Saltz
Parallel Processing Letters,
pages 671-683, Volume 5, Number 4, 1995
University of Maryland: Department of Computer
Science Technical Report, CS-TR-3266
March 1994
- We present multiprocessor runtime
support for fine-grained, irregular
directed acyclic graphs (DAGs), such as
those that arise from sparse-matrix
triangular solves. We
conduct our experiments on the CM-5,
whose lower latencies and active-message
support allow us to achieve unprecedented
speedups for a general multiprocessor.
Whereas previous implementations have
achieved maximum speedups of less than 4
on even simple banded matrices, we are
able to
obtain scalable performance on extremely
small and irregular problems. On a matrix
with only 5300 rows, we are able to
achieve scalable performance with a
speedup of almost 35 for 128 processors.
- We achieve these speedups with
non-matrix-specific methods which are
applicable to any DAG. We compare a range
of run-time preprocessed and dynamic
approaches on matrices from the
Harwell-Boeing benchmark set. Although
precomputed data distributions and
execution schedules produce the best
performance, we find that it is
challenging to keep their cost low enough
to make them worthwhile on small,
fine-grained problems.
- Additionally, we find that a policy of
frequent network polling can reduce
communication overhead by a factor of
three over the standard CM-5 policies. We
present a detailed study of runtime
overheads and demonstrate that send and
receive processor overhead still dominate
these applications on the CM-5. We
conclude that these applications would
highly benefit from architectural support
for low-overhead communication.
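One standard, matrix-independent way to
expose the DAG parallelism of a triangular
solve is level (wavefront) scheduling: a
row joins a level once every row it depends
on sits in an earlier level, and rows
within a level can be solved concurrently.
A sketch for a lower-triangular matrix in
compressed sparse row form, illustrating
the scheduling idea rather than the paper's
runtime system:

    /* Assign each row of a lower-triangular CSR matrix to a level:
       rows with no dependences are level 0; a row's level is one
       more than the deepest row it depends on. */
    void level_schedule(int n, const int *rowptr, const int *colidx,
                        int *level)
    {
        for (int i = 0; i < n; i++) {
            int lvl = 0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
                int j = colidx[k];        /* row i reads x[j]    */
                if (j < i && level[j] + 1 > lvl)
                    lvl = level[j] + 1;   /* must wait for row j */
            }
            level[i] = lvl;
        }
    }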
Raja Das, D. J. Mavriplis, J. Saltz, S. Gupta,
and R. Ponnusamy
Journal of the American Institute of
Aeronautics and Astronautics (AIAA), pages
489-496, Volume 32, Number 3, March 1994
- This paper is concerned primarily with
the implementation of a three-dimensional
unstructured-grid Euler-solver on
massively parallel distributed memory
computer architectures. The goal is to
minimize solution time by achieving high
computational rates with a numerically
efficient algorithm. An unstructured
multigrid algorithm with an edge-based
data-structure has been adopted, and a
number of optimizations have been
devised and implemented in order to
accelerate the parallel computational
rates. The implementation is carried out
by creating a set of software tools,
which provide an interface between the
parallelization issues and the sequential
code, while providing a basis for future
automatic run-time support. Large
practical unstructured grid problems are
solved on the Intel iPSC/860 hypercube
and the Intel Touchstone Delta machine.
The quantitative effects of the various
optimizations are demonstrated, and we
show that their combined effect leads to
roughly a factor of three performance
improvement. The overall solution
efficiency is compared
with that obtained on the CRAY-YMP vector
supercomputer.
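The edge-based data structure mentioned
above stores, for each edge, its two
endpoint vertices; a flux is computed per
edge and scattered to both endpoints. A
generic sketch of such a kernel, with a toy
flux in place of the solver's actual
physics:

    /* Edge-based kernel: compute a flux per edge and scatter the
       contributions to both endpoint vertices.  The flux formula
       here is a placeholder. */
    typedef struct { int v0, v1; } edge_t;

    void accumulate_fluxes(const edge_t *edges, int nedges,
                           const double *u, double *residual)
    {
        for (int e = 0; e < nedges; e++) {
            double flux = 0.5 * (u[edges[e].v0] - u[edges[e].v1]);
            residual[edges[e].v0] -= flux;  /* irregular scatter:    */
            residual[edges[e].v1] += flux;  /* this access pattern   */
        }                                   /* is what the runtime   */
    }                                       /* support optimizes     */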
Alan Sussman, Gagan Agrawal, and Joel
Saltz
University of Maryland: Department of Computer
Science and UMIACS Technical Reports
CS-TR-3070.1, UMIACS-TR-93-36.1
December 1993
- There exists a large class of scientific
applications that are composed of
irregularly coupled regular mesh (ICRM)
computations. These problems are often
referred to as block structured, or
multiblock, problems and include the
block structured Navier-Stokes solver
TLNS3D developed at NASA Langley.
- Primitives are presented that are
designed to help users to efficiently
program such problems on distributed
memory machines. These primitives are
also designed for use by compilers for
distributed memory multiprocessors.
Communication patterns are captured at
runtime, and the appropriate send and
receive messages are automatically
generated. The primitives are also useful
for parallelizing regular computations,
since block structured computations also
require all the runtime support necessary
for regular computations.
Gagan Agrawal, Alan Sussman, and Joel
Saltz.
Proceedings of Supercomputing '93,
pages 578-587, Portland Oregon, November 15-19,
1993
University of Maryland Technical Report:
CS-TR-3052 and UMIACS-TR-93-29
- Scientific and engineering applications
often involve structured meshes. These
meshes may be nested (for multigrid or
adaptive codes) and/or irregularly
coupled (called Irregularly Coupled
Regular Meshes). We have designed and
implemented a runtime library for
parallelizing this general class of
applications on distributed memory
parallel machines in an efficient and
machine independent manner. In this paper
we present how this runtime library can
be integrated with compilers for High
Performance Fortran (HPF) style parallel
programming languages. We discuss how we
have integrated this runtime library with
the Fortran 90D compiler being developed
at Syracuse University and provide
experimental data on a block structured
Navier-Stokes solver template and a small
multigrid example parallelized using this
compiler and run on an Intel iPSC/860. We
show that the compiler parallelized code
performs within 20% of the code
parallelized by inserting calls to the
runtime library manually.
Ravi Ponnusamy, Joel Saltz, and Alok
Choudhary
Proceedings Supercomputing '93,
pages 361-370, Portland, Oregon, November 15-19,
1993
University of Maryland and UMIACS Technical
Report: CS-TR-3055 and UMIACS-TR-93-32
- In this paper, we describe two new ideas
by which an HPF compiler can deal with
irregular computations effectively. The
first mechanism invokes a user-specified
mapping procedure via a set of compiler
directives. The directives allow the user
to use program arrays to describe graph
connectivity, spatial location of array
elements and computational load. The
second is a simple conservative method
that in many cases enables a compiler to
recognize that it is possible to reuse
previously computed results from
inspectors (e.g., communication schedules,
loop iteration partitions, information
that associates off-processor data copies
with on-processor buffer locations). We
present performance results for these
mechanisms from a Fortran 90D compiler
implementation.
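The schedule-reuse idea can be pictured as
a cache that is invalidated only when the
indirection array may have been written
since the inspector last ran. A
self-contained toy version in C; the
version counter and the interface are
invented for illustration, with the
compiler's conservative analysis playing
the role of bumping the counter:

    #include <stdlib.h>

    typedef struct { int n; int *idx; } schedule_t;  /* off-proc refs */

    static schedule_t *cached = NULL;
    static int cached_version = -1;  /* bumped whenever ia is written */

    schedule_t *get_schedule(const int *ia, int n, int lo, int hi,
                             int version)
    {
        if (cached && version == cached_version)
            return cached;                /* common case: cheap reuse */

        if (cached) { free(cached->idx); free(cached); }
        cached = malloc(sizeof *cached);  /* rare case: rebuild       */
        cached->idx = malloc(n * sizeof *cached->idx);
        cached->n = 0;
        for (int i = 0; i < n; i++)       /* inspector pass over ia   */
            if (ia[i] < lo || ia[i] >= hi)
                cached->idx[cached->n++] = ia[i];
        cached_version = version;
        return cached;
    }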
Raja Das, Y. Hwang, M. Uysal, J. Saltz, and A.
Sussman
Proceedings of the Scalable Parallel
Libraries Conference 1993, pages 45-56,
Starkville, Mississippi, October 6-8, 1993
- This paper describes a number of
optimizations that can be used to support
the efficient execution of irregular
problems on distributed memory parallel
machines. We describe software primitives
that (1) coordinate interprocessor data
movement, (2) manage the storage of, and
access to, copies of off-processor data,
(3) minimize interprocessor communication
requirements and (4) support a shared
name space. The performance of the
primitives is characterized by
examination of kernels from real
applications and from a full
implementation of a large unstructured
adaptive application (the molecular
dynamics code CHARMM).
Gagan Agrawal, Alan Sussman, and Joel
Saltz
University of Maryland: Department of Computer
Science and UMIACS Technical Reports CS-TR-3140,
UMIACS-TR-93-92
October 1993
- Scientific and engineering applications
often involve structured meshes. These
meshes may be nested (for multigrid or
adaptive codes) and/or irregularly
coupled (called Irregularly Coupled
Regular Meshes). We have designed and
implemented a runtime library for
parallelizing this general class of
applications on distributed memory
parallel machines in an efficient and
machine independent manner. One important
communication primitive supported by this
library is the regular section move,
which copies a regular section from a
distributed array to another distributed
array, potentially involving changes in
offset, stride and rotation of
dimensions. In this paper we discuss the
regular section analysis which is
required for efficiently generating
schedules for this kind of communication.
We discuss the details of the analysis
required when the distributions of arrays
may be block or cyclic.
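For the block case, the core of the
analysis is index arithmetic: intersect the
section with each processor's block and
round the endpoints onto the section's
stride lattice. A sketch of that
computation for one dimension; the
library's actual analysis also covers
cyclic distributions and rotation of
dimensions:

    /* For a section lo:hi:stride of a 1-D array distributed in
       equal blocks of size 'block', compute the first and last
       section elements owned by processor p.  *first > *last
       signals an empty intersection. */
    void section_on_proc(int lo, int hi, int stride, int block,
                         int p, int *first, int *last)
    {
        int blo = p * block;              /* p owns [blo, blo+block) */
        int bhi = blo + block - 1;
        int start = blo > lo ? blo : lo;
        /* round up to the lattice lo, lo+stride, lo+2*stride, ... */
        start = lo + ((start - lo + stride - 1) / stride) * stride;
        int end = bhi < hi ? bhi : hi;
        end = lo + ((end - lo) / stride) * stride;   /* round down */
        *first = start;
        *last  = end;
    }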
Raja Das, Joel Saltz, and Reinhard v.
Hanxleden.
Proceedings of the 6th Workshop on
Languages and Compilers for Parallel Computing,
pages 152-168, Portland Oregon, August 12-14,
1993
University of Maryland Technical Report:
CS-TR-3076 and UMIACS-TR-93-42
- An increasing fraction of the
applications targeted by parallel
computers makes heavy use of indirection
arrays for indexing data arrays. Such irregular
access patterns make it difficult for
a compiler to generate efficient parallel
code. A limitation of existing
techniques addressing this problem is
that they are only applicable for a
single level of indirection. However,
many codes using sparse data structures
access their data through multiple
levels of indirection.
This paper presents a method for transforming
programs using multiple levels of indirection
into programs with at most one level of
indirection, thereby broadening the range of
applications that a compiler can parallelize
efficiently. A central concept of our algorithm
is to perform program slicing on the
subscript expressions of the indirect array
accesses. Such slices peel off the levels of
indirection, one by one, and create opportunities
for aggregated data prefetching on between. A slice
graph eliminates redundant preprocessing and
gives an ordering in which to compute the slices.
We present our work in the context of High
Performance Fortran; an implementation in the
Fortran D prototype compiler is in progress.
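Concretely, a doubly indirect access such
as y[i] = x[ia[ib[i]]] can be peeled into
two passes, each with at most one level of
indirection through a distributed array, so
the data each pass needs can be gathered in
bulk beforehand. A shared-memory sketch of
the transformed loop, with invented names:

    /* The doubly indirect access y[i] = x[ia[ib[i]]] split into
       two passes.  ib[i] is a direct read, so pass 1 has one level
       of indirection (through ia) and pass 2 has one (through x);
       in a distributed run, each pass is preceded by one
       aggregated gather of the array it indexes. */
    void two_phase(const double *x, const int *ia, const int *ib,
                   double *y, int n, int *tmp)
    {
        for (int i = 0; i < n; i++)   /* pass 1: peel inner level   */
            tmp[i] = ia[ib[i]];
        for (int i = 0; i < n; i++)   /* pass 2: single indirection */
            y[i] = x[tmp[i]];
    }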
Raja Das and Joel Saltz
Proceedings of the Sixth SIAM Conference on
Parallel Processing for Scientific Computing,
pages 187-192, Norfolk, Virginia, March,
1993
- This paper is concerned with the
implementation of a molecular dynamics
code, CHARMM, on massively parallel
distributed memory computer architectures
using a data-parallel approach.
The implementation is carried out by
creating a set of software tools, which
will provide an interface between the
parallelization issues and the sequential
code. Large practical MD problems are
solved on the Intel iPSC/860 hypercube.
The overall solution efficiency is
compared with that obtained when the
implementation uses data replication.
R. Ponnusamy, R. Das, J. Saltz, and D.
Mavriplis
Proceedings of COMPCON 1993, pages
205-212, San Francisco, California, February
23-26, 1993
- Over the past few years, we have
developed methods that make it possible
for a compiler to efficiently map many
sparse and unstructured scientific
problems to scalable multiprocessor
architectures. These methods are
implemented using compiler embeddable
runtime support procedures. These
procedures can be thought of as supporting
a type of weakly coherent distributed
shared memory that can be closely coupled
to distributed memory compilers.
These procedures (1) coordinate
inter-processor data movement, (2) manage
the storage of and access to copies of
off-processor data, (3) support a shared
name space, and (4) couple runtime data
and workload partitioners to compilers.
We are in the process of implementing
prototype High Performance Fortran
compilers which use this runtime support
to handle unstructured computations.
R. von Hanxleden, K. Kennedy, and J.
Saltz
Journal of Programming Languages, Special
Issue on Compiling and Run-Time Issues for
Distributed Access Machines, pages 259-282,
Volume 2, Number 3, December 1993
CRPC 1993, no. CRPC-TR93365-S
December 1993
- Compiling irregular applications written
in a data-parallel, High Performance
Fortran-like language presents a
challenging problem of growing
importance. One principal difficulty with
irregular problems is the general lack of
access locality when distributing data
and computation naively, based on simple
functions of array and iteration indices.
To address this problem, we extend
classical, index-based data distributions
to value-based distributions. This type
of distribution allows the programmer to
conveniently express for example the
locality characteristics of an underlying
physical problem. This gives the compiler
an opportunity to improve both inter- and
intra-processor locality, resulting in
better performance and scalability, even
when this locality is not inherent in the
original data structures and access
patterns.
- This paper reports on the experience
gained from implementing value-based
distributions in a Fortran 77D compiler
prototype. We propose a natural extension
of index-based distributions as already
present in Fortran D and HPF. This paper
addresses the compilation issues involved
and makes a quantitative comparison of
index and value-based distributions for a
Molecular Dynamics application.
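The contrast between the two distribution
styles fits in a few lines. In the toy
sketch below, owner_by_index is a classical
BLOCK mapping, while owner_by_value bins a
particle by its coordinate so that
spatially close particles share a
processor; the Fortran 77D prototype
expresses the latter through directives
rather than explicit code:

    /* Index-based vs. value-based placement of a particle array. */
    int owner_by_index(int i, int n, int nproc)
    {
        return i / ((n + nproc - 1) / nproc);  /* classical BLOCK */
    }

    int owner_by_value(double x, double xmin, double xmax, int nproc)
    {
        int p = (int)((x - xmin) / (xmax - xmin) * nproc);
        return p < nproc ? p : nproc - 1;      /* clamp x == xmax */
    }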
A. Sussman, J. Saltz, R. Das, S. Gupta, D.
Mavriplis, R. Ponnusamy, and K. Crowley
Computing Systems in Engineering, pages
73-86, Volume 3, Number 1-4, 1992
- This paper describes a set of primitives
(PARTI) developed to efficiently execute
unstructured and block structured
problems on distributed memory parallel
machines. We present experimental data
from a 3-D unstructured Euler solver run
on the Intel Touchstone Delta to
demonstrate the usefulness of our
methods.
R. v. Hanxleden, K. Kennedy, C. Koelbel, R.
Das, and J. Saltz
Proceedings of 5th Workshop on Languages
and Compilers for Parallel Computing, pages
97-111, New Haven, Connecticut, August 3-5,
1992
- We developed a dataflow framework which
provides a basis for rigorously
defining strategies to make use of
runtime preprocessing methods for
distributed memory multiprocessors.
- In many programs, several loops access
the same off-processor memory locations.
Our runtime support gives us a mechanism
for tracking and reusing copies of
off-processor data. A key aspect of our
compiler analysis strategy is to
determine when it is safe to reuse copies
of off-processor data. Another crucial
function of the compiler analysis is to
identify situations which allow runtime
preprocessing overheads to be amortized.
This dataflow analysis will make it
possible to effectively use the results
of interprocedural analysis in our
efforts to reduce interprocessor
communication and the need for runtime
preprocessing.