These publications are available as PostScript or PDF files.
- Scalability Analysis of Declustering Methods for
Multidimensional Range Queries
Bongki Moon and Joel Saltz
IEEE Transactions on Knowledge and Data Engineering, pages 310-327, Volume 10, Number 2, March/April 1998
- Efficient storage and retrieval of multi-attribute datasets has become one of the
essential requirements for many data-intensive applications. The Cartesian product
file has been known as an effective multi-attribute file structure for partial-match and
best-match queries. Several heuristic methods have been developed to decluster
Cartesian product files across multiple disks to obtain high performance for disk
accesses. Though the scalability of the declustering methods becomes increasingly
important for systems equipped with a large number of disks, no analytic studies have been
done so far. In this paper we derive formulas describing the scalability of two
popular declustering methods, Disk Modulo and Fieldwise Xor, for range queries, which are
the most common type of queries. These formulas disclose the limited scalability of
the declustering methods and are corroborated by extensive simulation experiments.
From the practical point of view, the formulas given in this paper provide a simple
measure which can be used to predict the response time of a given range query and to guide
the selection of a declustering method under various conditions.
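To make the two methods concrete, here is a minimal sketch (our illustration, not the paper's code) of Disk Modulo and Fieldwise Xor assignment for a two-dimensional Cartesian product file, with a range query's response time taken as the number of blocks on the busiest disk; the disk count M and the query are invented for the example.

```python
M = 8  # number of disks (assumed; a power of two so Fieldwise Xor applies cleanly)

def disk_modulo(i, j):
    # Disk Modulo: block (i, j) is placed on disk (i + j) mod M
    return (i + j) % M

def fieldwise_xor(i, j):
    # Fieldwise Xor: bitwise XOR of the field indices, reduced modulo M
    return (i ^ j) % M

def response_time(assign, rows, cols):
    """Blocks served by the busiest disk for the given range query."""
    load = [0] * M
    for i in rows:
        for j in cols:
            load[assign(i, j)] += 1
    return max(load)

rows, cols = range(0, 5), range(3, 10)   # a hypothetical 5 x 7 range query
print("DM:", response_time(disk_modulo, rows, cols))
print("FX:", response_time(fieldwise_xor, rows, cols))
```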
- Infrastructure for Building Parallel Database Systems for
Multi-dimensional Data
Chialin Chang, Alan Sussman, and Joel Saltz
University of Maryland Technical Report CS-TR-3894 and UMIACS-TR-98-24
March 1998
Submitted to VLDB'98
As computational power and storage capacity increase, processing and analyzing large
volumes of multi-dimensional datasets play an increasingly important part in many domains
of scientific research. Our study of a large set of scientific applications over the past
three years indicates that the processing for such datasets is often highly stylized and
shares several important characteristics. Usually, both the input dataset as well as the
result being computed have underlying multi-dimensional grids. The basic processing step
usually consists of transforming individual input items, mapping the transformed items to
the output grid and computing output items by aggregating, in some way, all the
transformed input items mapped to the corresponding grid point. In this paper, we present
the design of T2, a customizable parallel database that integrates storage,
retrieval and processing of multi-dimensional datasets. T2 provides support for common
operations including index generation, data retrieval, memory management, scheduling of
processing across a parallel machine and user interaction. It achieves its primary
advantage from the ability to seamlessly integrate data retrieval and processing for a
wide variety of applications and from the ability to maintain and jointly process multiple
datasets with different underlying grids. We also present some preliminary performance
results comparing the implementation of a remote-sensing image database using the T2
services with a custom-built integrated implementation.
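The stylized processing pattern the abstract describes (transform each input item, map it to the output grid, aggregate per grid point) fits in a few lines. The sketch below is our paraphrase of that pattern, not T2's actual interface, and the toy sensor readings are invented.

```python
from collections import defaultdict

def process(items, transform, map_to_grid, aggregate, initial):
    """Canonical loop: transform each input item, map it to an output grid
    cell, and fold it into that cell's running aggregate."""
    out = defaultdict(lambda: initial)
    for item in items:
        t = transform(item)
        cell = map_to_grid(t)
        out[cell] = aggregate(out[cell], t)
    return dict(out)

# Toy example: composite the maximum reading falling into each 1-D grid cell.
readings = [(0.12, 3.0), (0.17, 5.0), (0.74, 2.0)]    # (coordinate, value)
print(process(readings,
              transform=lambda r: r,
              map_to_grid=lambda r: int(r[0] * 10),
              aggregate=lambda acc, r: max(acc, r[1]),
              initial=float("-inf")))                  # {1: 5.0, 7: 2.0}
```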
- Digital Dynamic Telepathology - The Virtual Microscope
Asmara Afework, Michael D. Beynon, Fabian Bustamante, Angelo Demarzo,
Renato Ferreira, Robert Miller, Mark Silberman, Joel Saltz, Alan Sussman, and Hubert Tsang
University of Maryland Technical Report CS-TR-3892 and UMIACS-TR-98-23
March 1998
To appear in: Proceedings of the 1998 AMIA Fall Symposium
The Virtual Microscope is being designed as an integrated computer hardware and
software system that generates a highly realistic digital simulation of analog, mechanical
light microscopy. We present our work over the past year in meeting the challenges in
building such a system. The enhancements we made are discussed, as well as the planned
future improvements. Performance results are provided that show that the system scales
well, so that many clients can be adequately serviced by an appropriately configured data
server.
- T2: A Customizable Parallel Database for
Multi-dimensional Data
Chialin Chang, Anurag Acharya, Alan Sussman, and Joel Saltz
ACM SIGMOD Record,
pages 58-66, Volume 27, Number 1, March 1998
University of Maryland Technical Report CS-TR-3867 and UMIACS-TR-98-04
As computational power and storage capacity increase, processing and analyzing large
volumes of multi-dimensional datasets play an increasingly important part in many domains
of scientific research. Several database research groups and vendors have developed
object-relational database systems to provide some support for managing and/or visualizing
multi-dimensional datasets. These systems, however, provide little or no support for
analyzing or processing these datasets -- the assumption is that this is too
application-specific to warrant common support. As a result, applications that process
these datasets are usually decoupled from data storage and management, resulting in
inefficiency due to copying and loss of locality. Furthermore, every application developer
has to implement complex support for managing and scheduling the processing.
Our study of a large set of scientific applications over the past three years indicates
that the processing for such datasets is often highly stylized and shares several
important characteristics. Usually, both the input dataset as well as the result being
computed have underlying multi-dimensional grids. The basic processing step usually
consists of transforming individual input items, mapping the transformed items to the
output grid and computing output items by aggregating, in some way, all the transformed
input items mapped to the corresponding grid point. In this paper, we present the design
of T2, a customizable parallel database that integrates storage, retrieval and processing
of multi-dimensional datasets. T2 provides support for common operations including index
generation, data retrieval, memory management, scheduling of processing across a parallel
machine and user interaction. It achieves its primary advantage from the ability to
seamlessly integrate data retrieval and processing for a wide variety of applications and
from the ability to maintain and jointly process multiple datasets with different
underlying grids.
- The Virtual Microscope
Renato Ferreira, Bongki Moon, Jim Humphries, Alan Sussman, Joel Saltz, Robert Miller,
and Angelo Demarzo
Proceedings of the 1997 AMIA Annual Fall Symposium, pages 449-453,
Nashville, Tennessee, October 1997
University of Maryland Technical Report: CS-TR-3777 and UMIACS-TR-97-35
We present the design of the Virtual Microscope, a software system employing a
client/server architecture to provide a realistic emulation of a high power light
microscope. We discuss several technical challenges related to providing the
performance necessary to achieve rapid response time, mainly in dealing with the enormous
amounts of data (tens to hundreds of gigabytes per slide) that must be retrieved from
secondary storage and processed. To effectively implement the data server, the
system design relies on the computational power and high I/O throughput available from an
appropriately configured parallel computer.
- Semantic Indexing for Complex Patient Grouping
Kilian Stoffel, Joel Saltz, Jim Hendler, Jim Dick, William Merz, and Robert Miller
Proceedings of the 1997 AMIA Fall
Symposium, page 891, Nashville, Tennessee, October 27-29, 1997
A joint effort between computer scientists at the University of Maryland and clinicians
at Johns Hopkins Hospital focuses on using high performance computing technology in
support of medical applications. One focus is on providing tools that support medical data
warehousing. One particular project focuses on providing database support to
micro-biologists and infectious disease specialists.
The taxonomy used to describe micro-biological data is complex; this terminology also
changes with time. A problem we have encountered is the difficulty clinicians and
researchers face in formulating queries that capture the kinds of questions they wish to
pose.
As a means for overcoming this sort of difficulty, we are looking at how to efficiently
integrate ``semantic'' knowledge, stored in the form of thesauri (or more technically,
ontologies) in ways that support efficient indexing of large databases. In particular, we
are working to provide medical specialists with the ability to express more complex
queries without becoming experts on the underlying data model. In this paper we show how
we can support this efficient indexing, facilitate complex data access, and support
high-level querying for users untrained in the details of the underlying database forms.
- Identifying DEF/USE Information of Statements that
Construct and Traverse Dynamic Recursive Data Structures
Yuan-Shin Hwang and Joel Saltz
Proceedings of the 10th International Workshop on Languages and Compilers for
Parallel Computing, August 1997
Pointer analysis is essential for optimizing and parallelizing compilers. It
examines pointer assignment statements and estimates pointer-induced aliases among pointer
variables or possible shapes of dynamic recursive data structures. However,
previously proposed techniques perform pointer analysis without the knowledge of traversal
patterns of dynamic recursive data structures to be constructed. This paper presents
an algorithm to identify the traversal patterns of recursive data structures and propagate
this information back to those statements that define the data structures. This
approach recognizes the DEF/USE relationships between the statements that define and
traverse dynamic recursive data structures. The outcome of this technique will be
useful for pointer analysis and parallelization. Algorithms to perform pointer
analysis and dependence tests using the knowledge of traversal patterns will also be
presented.
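As a toy illustration of the DEF/USE relationship the paper analyzes (our example; the paper works on pointer programs, not Python), the loop that constructs a linked list defines the link field that the later traversal uses:

```python
class Node:
    def __init__(self, val):
        self.val, self.next = val, None

head = None
for v in [3, 1, 2]:           # construction loop: DEF of the 'next' links
    n = Node(v)
    n.next = head
    head = n

p = head
while p is not None:          # traversal loop: USE of the 'next' links
    print(p.val)
    p = p.next
```

Knowing that the traversal only follows 'next' tells an analyzer which definitions matter when parallelizing the traversal loop.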
- Requirements of I/O Systems for Parallel Machines:
An Application-driven Study
M. Uysal, A. Acharya and J. Saltz
University of Maryland Technical Report: CS-TR-3802 and
UMIACS-TR-97-49
May 1997
I/O intensive parallel programs have emerged as one of the leading consumers of cycles
on parallel machines. This change has been driven by two trends. First,
parallel scientific applications are being used to process larger datasets that do not fit
in memory. Second, a large number of parallel machines are being used for
non-scientific applications. Efficient execution of these applications requires
high-performance I/O systems which have been designed to meet their I/O
requirements. In this paper, we examine the I/O requirements for data-intensive
parallel applications and the implications of these requirements for the design of I/O
systems for parallel machines. We attempt to answer the following questions.
First, what is the steady-state as well as peak I/O rate required? Second, what spatial
patterns, if any, occur in the sequence of I/O requests for individual applications?
Third, what is the degree of intra-processor and inter-processor locality in I/O
accesses? Fourth, does the application structure allow programmers to disclose
future I/O requests to the I/O system? Fifth, what patterns, if any, exist in the
sequence of inter-arrival times of I/O requests? To address these questions, we have
analyzed I/O request traces for a diverse set of I/O-intensive parallel
applications. This set includes seven scientific applications and four
non-scientific applications.
- A Customizable Simulator for Workstation Networks
M. Uysal, A. Acharya, R. Bennett, and J. Saltz
International Parallel Processing Symposium '97, April 1997
University of Maryland Technical Report CS-TR-3690 and UMIACS-TR-96-68 (extended
version)
We present a customizable simulator called netsim for high-performance
point-to-point workstation networks that is accurate enough to be used for
application-level performance analysis yet is easy enough to customize for multiple
architectures and software configurations. Customization is accomplished without
using any proprietary information, using only publicly available hardware specifications
and information that can be readily determined using a suite of test programs. We
customized netsim for two platforms: a 16-node IBM SP-2 and a DEC Alpha Farm. Netsim
achieves less than 10% error on the Alpha Farm for most test cases. It achieves this accuracy at the cost
of a 7-36 fold simulation slowdown with respect to the SP-2 and a 3-8 fold slowdown with
respect to the Alpha Farm.
- Titan: A High Performance Remote-Sensing Database
Chialin Chang, Bongki Moon, Anurag Acharya, Carter Shock, Alan Sussman, and Joel
Saltz
Proceedings
of the 13th International Conference on Data Engineering, pages 375-384,
Birmingham, United Kingdom, April 1997
University of Maryland Technical Report CS-TR-3689 and UMIACS-TR-96-67
- There are two major challenges for a high-performance remote-sensing database. First, it
must provide low-latency retrieval of very large volumes of spatio-temporal data. Second,
the order of magnitude reduction in data size achieved by post-processing makes it
imperative, from a performance perspective, that the post-processing be done on the machine
that holds the data. This paper describes the design, implementation and evaluation of Titan,
a parallel shared-nothing database designed for remote sensing data. The computational
platform for Titan is a 16-processor IBM SP-2 with four fast disks attached to each
processor. Titan is currently operational and contains about 24 GB of data from the
Advanced Very High Resolution Radiometer (AVHRR) on the NOAA-7 satellite. The
experimental results show that Titan provides good performance for global queries and
interactive response times for local queries.
- Interoperability of Data Parallel Runtime Libraries
Guy Edjlali, Alan Sussman, and Joel Saltz
International Parallel Processing Symposium 1997, April 1997
- This paper describes a framework that provides the ability to use multiple specialized
data parallel libraries and/or languages within a single application. The ability to use
multiple libraries is required in many application areas, such as multidisciplinary
complex physical simulations and remote-sensing image database applications. An
application can consist of one program or multiple programs that use different
libraries to parallelize operations on distributed data structures. The framework is
embodied in a runtime library called Meta-Chaos, which has been used to exchange data
between data parallel programs written using High Performance Fortran, the Chaos and
Multiblock Parti libraries developed at Maryland for handling various types of
unstructured problems, and the runtime library for pc++, a data parallel version of C++
from Indiana University. Experimental results show that Meta-Chaos is able to move data
between libraries efficiently, and that Meta-Chaos provides effective support for complex
applications.
- Compiler and Runtime Support for Programming in Adaptive
Parallel Environments
Guy Edjlali, Gagan Agrawal, Alan Sussman, Jim Humphries, and Joel Saltz
Scientific Programming, January 1997
University of Maryland Technical Report CS-TR-3510 and UMIACS-TR-95-83
- For better utilization of computing resources, it is important to consider parallel
programming environments in which the number of available processors varies at runtime. In
this paper, we discuss runtime support for data parallel programming in such an adaptive
environment. Executing programs in an adaptive environment requires redistributing data
when the number of processors changes, and also requires determining new loop bounds and
communication patterns for the new set of processors. We have developed a runtime library
to provide this support. We discuss how the runtime library can be used by compilers of
HPF-like languages to generate code for an adaptive environment. We present performance
results for a Navier-Stokes solver on a multigrid template run on a network of workstations
and an IBM SP-2. Our experiments show that if the number of processors is not varied
frequently, the cost of data redistribution is not significant when compared to the time
required for the actual computation. Overall, our work establishes the feasibility of
compiling HPF for a network of non-dedicated workstations, which are likely to be an
important resource for parallel programming in the future.
- A Study of Internet Round-Trip Delay
Anurag Acharya and Joel Saltz
University of Maryland Technical Report CS-TR-3736 and UMIACS-TR-96-97
December 1996
We present the results of a study of Internet round-trip delay. The links chosen
include links to frequently accessed commercial hosts as well as well-known academic and
foreign hosts. Each link was studied for a 48-hour period. We attempt to answer the
following questions: (1) how rapidly and in what manner does the delay change - in this
study, we focus on medium-grain (seconds/minutes) and coarse-grain time-scales (tens of
minutes/hours); (2) what does the frequency distribution of delay look like and how
rapidly does it change; (3) what is a good metric to characterize the delay for the
purpose of adaptation. Our conclusions are: (a) there is large temporal and spatial
variation in round-trip time (RTT); (b) the RTT distribution is usually unimodal and
asymmetric, with a long tail on the right-hand side; (c) RTT observations in most time
periods are tightly clustered around the mode; (d) the mode is a good characteristic value
for RTT distributions; (e) RTT distributions change slowly; (f) persistent changes in RTT
occur slowly, and sharp changes are undone very shortly; (g) jitter in RTT observations is
small; and (h) the inherent RTT occurs frequently.
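Conclusion (d) suggests a simple adaptation metric. A minimal sketch (ours, not the study's tooling) estimates the mode of recent RTT samples from a coarse histogram:

```python
from collections import Counter

def modal_rtt(samples_ms, bin_width=5):
    """Center of the most populated RTT histogram bin, in milliseconds."""
    bins = Counter(int(s // bin_width) for s in samples_ms)
    b, _ = bins.most_common(1)[0]
    return b * bin_width + bin_width / 2

print(modal_rtt([42, 44, 43, 120, 41, 45, 43]))   # -> 42.5
```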
- A High Performance Image Database System for Remotely
Sensed Imagery
C. T. Shock, C. Chang, L. Davis, S. Goward, J. Saltz, and A. Sussman
Proceedings of Euro-Par'96, Springer, pages 109--122, Volume 2, August
1996
As computational power and storage capacity increase, processing and
analyzing large volumes of multidimensional datasets play an increasingly important part
in many domains of scientific research. Our study of a large set of scientific
applications over the past three years indicates that the processing for such datasets is
often highly stylized and shares several important characteristics. Usually, both the
input dataset as well as the result being computed have underlying multidimensional
grids. The basic processing step usually consists of transforming individual input items,
mapping the transformed items to the output grid and computing output items by
aggregating, in some way, all the transformed input items mapped to the corresponding grid
point. In this paper, we present the design of T2, a customizable parallel database that
integrates storage, retrieval and processing of multidimensional datasets. T2 provides
support for common operations including index generation, data retrieval, memory
management, scheduling of processing across a parallel machine and user interaction. It
achieves its primary advantage from the ability to seamlessly integrate data retrieval and
processing for a wide variety of applications and from the ability to maintain and jointly
process multiple datasets with different underlying grids. We also present some
preliminary performance results comparing the implementation of a remote-sensing image
database using the T2 services with a custom-built integrated implementation.
- Network-aware Mobile Programs
M. Ranganathan, Anurag Acharya, Shamik Sharma, and Joel Saltz
Proceedings of the Usenix 1997
Technical Conference, Anaheim, California, June 1996
University of Maryland Technical Report: CS-TR-3659 and UMIACS-TR-96-46
- In this paper, we investigate network-aware mobile programs, programs that can
use mobility as a tool to adapt to variations in network characteristics. We present
infrastructural support for mobility and network monitoring and show how adaptalk, a
Java-based mobile Internet chat application, can take advantage of this support to
dynamically place the chat server so as to minimize response time. Our conclusion was that
on-line network monitoring and adaptive placement of shared data-structures can
significantly improve performance of distributed applications on the Internet.
- Interleaved Parallel Hybrid Arnoldi Method for a Parallel
Machine and a Network of Workstations
Guy Edjlali, Serge Petiton, and Nahid Emad
Proceedings of the International Conference on Information Systems, Analysis
and Synthesis, June 1996
- The data-parallel paradigm, which is effective for many numerical algebra computations, is
no longer sufficient when programming heterogeneous machines. To efficiently utilize these
new resources we need to combine the two paradigms: data parallelism and task
parallelism. This leads us to define new algorithms.
- In this paper, we illustrate this point through an example: the computation of some
eigenvalues of large non-structured sparse non-Hermitian matrices. We present a new
algorithm, Interleaved Arnoldi, and compare its performance with the parallel Arnoldi
algorithm. The different environments we consider are: a CM5, a CM5 and a Sparc 20, and a
CM5 and two Sparc 20s. We show that significant results can be obtained even when the
heterogeneous machine consists of a CM5 and two Sparc 20 workstations.
- Interoperability of Data Parallel Runtime Libraries with
Meta-Chaos
Guy Edjlali, Alan Sussman, and Joel Saltz
University of Maryland Technical Report: CS-TR-3633 and UMIACS-TR-96-30
May 1996
- This paper describes a framework that provides the ability to use multiple specialized
data parallel libraries and/or languages within a single application. The ability to use
multiple libraries is required in many application areas, such as multidisciplinary
complex physical simulations and remote-sensing image database applications. An
application can consist of one program or multiple programs that use different
libraries to parallelize operations on distributed data structures. The framework is
embodied in a runtime library called Meta-Chaos, which has been used to exchange data
between data parallel programs written using High Performance Fortran, the Chaos and
Multiblock Parti libraries developed at Maryland for handling various types of
unstructured problems, and the runtime library for pc++, a data parallel version of C++
from Indiana University. Experimental results show that Meta-Chaos is able to move data
between libraries efficiently, and that Meta-Chaos provides effective support for complex
applications.
- Experimental Evaluation of Sparse Matrix Distributions
M. Ujaldon, S. Sharma, J. Saltz, and E. Zapata
Proceedings of the ACM International Conference of Supercomputing, pages
78-85, Philadelphia, Pennsylvania, May 25-27, 1996
- Sparse matrix problems are difficult to parallelize efficiently on distributed memory
machines, since non-zero elements are unevenly scattered and are accessed via multiple
levels of indirection. Irregular distributions that achieve good load balance and locality
are hard to compute, have high memory overheads, and also lead to further indirection in
locating distributed data. This paper evaluates alternative semi-regular distribution
strategies which trade off the quality of the load balance and locality for lower
distribution overheads and efficient lookup. The proposed techniques are compared to an
irregular sparse matrix partitioner and the relative merits of each distribution method
are outlined.
- An Interprocedural Framework for Placement of
Asynchronous I/O Operations
Gagan Agrawal, Anurag Acharya, and Joel Saltz
Proceedings of the ACM International Conference of Supercomputing, pages
358-365, Philadelphia, Pennsylvania, May 25-27, 1996
- Overlapping memory accesses with computations is a standard technique for improving
performance of modern architectures, which have deep memory hierarchies. In this paper, we
present a compiler technique for overlapping accesses to secondary memory (disks) with
computation. We have developed an Interprocedural Balanced Code Placement (IBCP)
framework, which performs analysis on arbitrary recursive procedures and arbitrary control
flow and replaces synchronous I/O operations with a balanced pair of asynchronous
operations. We demonstrate how this analysis is useful for applications which perform
frequent and large accesses to secondary memory, including applications which snapshot
or checkpoint their computations and out-of-core applications.
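The transformation is easiest to see in miniature. The sketch below is our stand-in, using a worker thread rather than the paper's compiler-inserted asynchronous I/O calls, and the file name is a placeholder: a blocking read is split into an issue half hoisted above independent computation and a wait half placed where the data is needed.

```python
from concurrent.futures import ThreadPoolExecutor

with open("data.bin", "wb") as f:          # create a placeholder input file
    f.write(b"\0" * (1 << 20))

def read_chunk(path, offset, size):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def independent_computation():
    return sum(i * i for i in range(1_000_000))

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(read_chunk, "data.bin", 0, 1 << 20)  # issue (was: blocking read)
    partial = independent_computation()                        # overlapped work
    data = pending.result()                                    # balanced wait
```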
- Tuning the Performance of I/O-Intensive Parallel
Applications
Anurag Acharya, Mustafa Uysal, Robert Bennett, Assaf Mendelson, Michael Beynon, Jeff
Hollingsworth, Joel Saltz, and Alan Sussman
Fourth Annual Workshop on I/O in Parallel and Distributed Systems, Philadelphia,
Pennsylvania, May 27, 1996
- Getting good I/O performance from parallel programs is a critical problem for many
application domains. In this paper, we report our experience tuning the I/O performance of
four application programs from the areas of sensor data processing and linear algebra.
After tuning, three of the four applications achieve effective I/O rates of over 100 MB/s
on 16 processors. The total volume of I/O required by the programs ranged from about 75 MB
to over 200 GB. We report the lessons learned in achieving high I/O performance from these
applications, including the need for code restructuring, local disks on every node and
overlapping I/O with computation. We also report our experience on achieving high
performance on peer-to-peer configurations. Finally, we comment on the necessity of
complex I/O interfaces like collective I/O and strided requests to achieve high
performance.
- Eliminating Redundant Barrier Synchronizations in
Rule-based Programs
Anurag Acharya
Proceedings of the ACM International Conference of Supercomputing, pages
325-332, Philadelphia, Pennsylvania, May 1996
- A rule-based program consists of a set of if-then rules and a tuple-space. The rules are
the code for the program and the tuple-space contains the data being processed by the
program. Previous efforts to parallelize rule-based programs have achieved limited
speedups. The main reason for these disappointing results is a high frequency of barrier
synchronizations. Since little work is done between successive barrier synchronizations,
the number of processors that can be effectively utilized is bounded. Even though required
by language semantics, a large fraction of the barrier synchronizations are not necessary
for most programs. This paper proposes a pair of simple language extensions that allow an
implementation to efficiently detect and eliminate redundant barrier synchronizations.
Simulation results based on a real implementation show that for a set of five benchmarks,
this scheme is able to eliminate between 95.6% and 99.9% of the barrier synchronizations.
This results in a multiplicative speedup of between 2.2 and 52.3 fold over and above the
speedup achieved by a parallelizing compiler. For the programs studied, simulations
indicate that speedups up to 115 fold relative to an optimized sequential version are
possible.
- Runtime Coupling of Data-Parallel Programs
Mudumbai Ranganathan, Anurag Acharya, Guy Edjlali, Alan Sussman, and Joel Saltz
Proceedings of the ACM International Conference of Supercomputing, pages
229-236, Philadelphia, Pennsylvania, May 25-27, 1996
University of Maryland: Department of Computer Science and UMIACS Technical Reports
CS-TR-3565 and UMIACS TR-95-116
May 1996
- We consider the problem of efficiently coupling multiple data-parallel programs at
runtime. We propose an approach that establishes a mapping between data structures in
different data-parallel programs and implements a user-specified consistency model.
Mappings are established at runtime and can be added and deleted while the programs being
coupled are in execution. Mappings, or the identity of the processors involved, do not
have to be known at compile-time or even link-time. Programs can be made to interact with
different granularities of interaction without requiring any re-coding. A priori knowledge
of consistency requirements allows buffering of data as well as concurrent execution of
the coupled applications. Efficient data movement is achieved by pre-computing an
optimized schedule. We describe our prototype implementation and evaluate its performance
using a set of synthetic benchmarks. We examine the variation of performance with
variation in the consistency requirement. We demonstrate that the cost of the flexibility
provided by our coupling scheme is not prohibitive when compared with a monolithic program
that performs the same computation.
- Study of Scalable Declustering Algorithms for Parallel
Grid Files
Bongki Moon, Anurag Acharya, and Joel Saltz
Proceedings of the 1996 Parallel Processing Symposium, April 1996
University of Maryland
Technical Report: CS-TR-3589 and UMIACS-TR-96-4 (extended version)
- Efficient storage and retrieval of large multidimensional datasets has been one of the
key issues in large-scale scientific computations such as long running time-dependent
simulations periodically generating snapshots of the state. The main challenge for
efficiently handling such datasets is to minimize response time for multidimensional range
queries. The grid file is one of the well known access methods for multidimensional and
spatial data. We investigate effective and scalable declustering techniques for grid files
with the primary goal of minimizing response time and the secondary goal of maximizing the
fairness of data distribution. The main contributions of this paper are (1) the analytic
and experimental evaluation of existing index-based declustering techniques and their
extensions for grid files, and (2) the development of an improved proximity-based
declustering algorithm called minimax which has been experimentally shown to scale well
and consistently achieve better response time compared to all the other algorithms while
maintaining perfect disk distribution.
- Scalability Analysis of Declustering Methods for
Cartesian product files
Bongki Moon and Joel H. Saltz
Submitted to IEEE Transactions on Knowledge and Data Engineering, April
1996
University of Maryland Technical Report: CS-TR-3590 and UMIACS-TR-96-5
- Efficient storage and retrieval of multi-attribute datasets has become one of the
essential requirements for many data-intensive applications. The Cartesian product file
has been known as an effective multi-attribute file structure for partial-match and
best-match queries. Several heuristic methods have been developed to decluster Cartesian
product files over multiple disks to obtain high performance for disk accesses. Though the
scalability of the declustering methods becomes increasingly important for systems
equipped with a large number of disks, no analytic studies have been done so far. In this
paper we derive formulas describing the scalability of two popular declustering methods,
Disk Modulo and Fieldwise Xor, for range queries, which are the most common type of
queries. These formulas disclose the limited scalability of the declustering methods and
are corroborated by extensive simulation experiments. From the practical point of view,
the formulas given in this paper provide a simple measure which can be used to predict the
response time of a given range query and to guide the selection of a declustering method
under various conditions.
- Analysis of the Clustering Properties of Hilbert
Space-filling Curve
Bongki Moon, H. V. Jagadish, Christos Faloutsos, and Joel H. Saltz
Submitted to IEEE Transactions on Knowledge and Data Engineering, March
1996
University of Maryland Department of Computer Science Technical Report, CS-TR-3611 and
UMIACS-TR-96-20
- Several schemes for linear mapping of multidimensional space have been proposed for many
applications such as access methods for spatio-temporal databases, image compression and
so on. In all these applications, one of the most desired properties from such linear
mappings is clustering, which means the locality between objects in the multidimensional
space is preserved in the linear space. It is widely believed that Hilbert space-filling
curve achieves the best clustering. In this paper we provide closed-form formulas of
the number of clusters required by a given query region of an arbitrary shape (e.g.,
polygons and polyhedra) for Hilbert space-filling curve. Both the asymptotic solution for
a general case and the exact solution for a special case generalize the previous work, and
they agree with the empirical results that the number of clusters depends on the
hyper-surface area of the query region and not on its hyper-volume. We have also shown
that Hilbert curve achieves better clustering than z-curve. From the practical point of
view, the formulas given in this paper provide a simple measure which can be used to
predict the required disk access behaviors and hence the total access time.
Index Terms: locality-preserving linear mapping, range queries, multi-attribute access
methods, data clustering, Hilbert curve, space-filling curves, fractals.
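The quantity the formulas predict can also be measured by brute force. The sketch below (ours, not the paper's derivation) maps each cell of a rectangular query to its Hilbert index using the standard Hilbert index recurrence and counts runs of consecutive indices, i.e., clusters:

```python
def xy2d(n, x, y):
    """Hilbert index of cell (x, y) on an n x n grid, n a power of two."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip quadrant so the curve connects
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def clusters(n, x0, x1, y0, y1):
    """Number of runs of consecutive Hilbert indices covering the rectangle."""
    idx = sorted(xy2d(n, x, y)
                 for x in range(x0, x1 + 1) for y in range(y0, y1 + 1))
    return 1 + sum(1 for a, b in zip(idx, idx[1:]) if b != a + 1)

print(clusters(16, 3, 8, 2, 9))   # clusters for a 6 x 8 query on a 16 x 16 grid
```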
- Interprocedural Compilation of Irregular Applications for
Distributed Memory Machines
G. Agrawal, J. Saltz
Proceedings of Supercomputing 95, San Diego, California, December 3-8,
1995.
University of Maryland: Department of Computer Science Technical Report CS-TR-3447
- Data parallel languages like High Performance Fortran (HPF) are emerging as the
architecture independent mode of programming distributed memory parallel machines. In this
paper, we present the interprocedural optimizations required for compiling applications
that have irregular data access patterns, when coded in such data parallel languages. We
have developed an Interprocedural Partial Redundancy Elimination (PRE) algorithm for
optimized placement of runtime preprocessing routines and collective communication routines
inserted for managing communication in such codes. We also present three new
interprocedural optimizations: placement of scatter routines, deletion of data structures
and use of coalescing and incremental routines. We then describe how program slicing can
be used for further applying IPRE in more complex scenarios. We have done a preliminary
implementation of the schemes presented here using the Fortran D implementation system as
the necessary infrastructure. We present experimental results from two codes compiled
using our system to demonstrate the efficacy of the presented schemes.
- Runtime Techniques for Parallelizing Sparse Matrix
Applications
Manuel Ujaldon, Shamik D. Sharma, Joel Saltz, and Emilio Zapata
Proceedings of the 1995 Workshop on Irregular Problems, pages 78-85, September
1995
- Sparse matrix problems are difficult to parallelize efficiently on message-passing
machines, since they access data through multiple levels of indirection.
Inspector/executor strategies, which are typically used to parallelize such problems,
impose significant preprocessing overheads. This paper describes the runtime support
required by new compilation techniques for sparse matrices and evaluates their
performance, highlighting optimizations and improvements over previous techniques.
- Runtime Support and Compilation Methods for
User-Specified Irregular Data Distributions
Ravi Ponnusamy, Joel Saltz, Alok Choudhary, Yuan-Shin Hwang, and Geoffrey Fox
IEEE Transactions on Parallel and Distributed
Systems, pages 815-831, Volume 6, Number 8, August 1995
University of Maryland: Department of Computer Science Technical Report CS-TR-3194 and
UMIACS-TR-93-135
- This paper describes two new ideas by which an HPF compiler can deal with irregular
computations effectively. The first mechanism invokes a user specified mapping procedure
via a set of compiler directives. The directives allow use of program arrays to describe
graph connectivity, spatial location of array elements and computational load. The second
mechanism is a simple conservative method that in many cases enables a compiler to
recognize that it is possible to reuse previously computed information from inspectors
(e.g. communication schedules, loop iteration partitions, information that associates
off-processor data copies with on-processor buffer locations). We present performance
results for these mechanisms from a Fortran 90D compiler implementation.
- Parallel Monte Carlo Simulation of Three-Dimensional Flow
over a Flat Plate
R. P. Nance, R. G. Wilmoth, B. Moon, H. A. Hassan, and J. Saltz
AIAA Journal of Thermophysics and Heat Transfer, pages 471-477, Volume 9, Number
3, July-September 1995
University of Maryland Technical Report: CS-TR-3425 and UMIACS-TR-95-25
- This paper describes a parallel implementation of the direct simulation Monte Carlo
method. Runtime library support is used for scheduling and execution of communication
between nodes, and domain decomposition is performed dynamically to maintain a favorable
load balance. Performance tests are conducted using the code to evaluate various remapping
and remapping-interval policies, and it is shown that a one-dimensional
chain-partitioning method works best for the problems considered. The parallel code is then
used to simulate the Mach 20 nitrogen flow over a finite-thickness flat plate. It is
shown that the parallel algorithm produces results which are very similar to previous DSMC
results, despite the increased resolution available. Furthermore, it yields
significantly faster execution times than the scalar code, as well as very good
load-balance and scalability characteristics.
- Efficient Support for Irregular Applications on
Distributed-Memory Machines
S. Mukherjee, S. Sharma, M. Hill, J. Larus, A. Rogers, and J. Saltz
Proceedings of the Fifth ACM SIGPLAN Symposium on Principles & Practice of
Parallel Programming '95, pages 68-79, Santa Barbara, California, July
19-21, 1995
ACM SIGPLAN Notices: Volume 30, Number 8
- Irregular computation problems underlie many important scientific applications.
Although these problems are computationally expensive and would seem well suited to
parallel machines, their irregular and unpredictable run-time behavior makes this type
of parallel program difficult to write and adversely affects run-time performance.
- This paper explores three issues -- partitioning, mutual exclusion, and data transfer --
crucial to the efficient execution of irregular problems on distributed-memory machines.
Unlike previous work, we studied the same programs running on three alternative software
systems on the same hardware base (a Thinking Machines CM-5): the CHAOS irregular
application library, transparent shared memory (TSM), and eXtensible shared memory
(XSM). CHAOS and XSM performed equivalently for all three applications. Both systems are
somewhat (13%) to significantly (991%) faster than TSM.
- An Integrated Runtime and Compile-Time Approach for
Parallelizing Structured and Block Structured Applications
Gagan Agrawal, Alan Sussman, and Joel Saltz.
IEEE Transactions on Parallel and Distributed Systems, pages 747-754, Volume 6, Number 7, July
1995
University of Maryland: Department of Computer Science and UMIACS Technical Reports
CS-TR-3143, UMIACS-TR-93-94
- Scientific and engineering applications often involve structured meshes. These meshes
may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or
irregularly coupled regular mesh problems). In this paper, we present a combined runtime
and compile-time approach for parallelizing these applications on distributed memory
parallel machines in an efficient and machine-independent fashion. We have designed and
implemented a runtime library which can be used to port these applications to distributed
memory machines. The library is currently implemented on several different systems. Since
the design of the library is machine independent, it can easily be ported to other
distributed memory machines and environments which support message passing. To further
ease the task of application programmers, we have developed methods for integrating this
runtime library with compilers for HPF-like parallel programming languages. We discuss how
we have integrated this runtime library with the Fortran 90D compiler being developed at
Syracuse University. We present experimental results to demonstrate the efficacy of our
approach. We have experimented with a multiblock Navier-Stokes solver template and a
multigrid code. Our experimental results show that our primitives have low runtime
communication overheads. Further, the compiler parallelized codes perform within 20% of
the code parallelized by manually inserting calls to the runtime library.
- Parallelizing Molecular Dynamics Programs for Distributed
Memory Machines: An Application of the CHAOS Runtime Support Library
Yuan-Shin Hwang, Raja Das, Joel Saltz, Bernard Brooks, and Milan Hodoscek
IEEE Computational Science and Engineering, pages 18-29, Volume 2, Number
2, Summer 1995
University of Maryland: Department of Computer Science and UMIACS Technical Reports
CS-TR-3374, UMIACS-TR-94-125
- CHARMM (Chemistry at Harvard Macromolecular Mechanics) is a program that is widely used
to model and simulate macromolecular systems. CHARMM has been parallelized by using the
CHAOS runtime support library on distributed memory architectures. This implementation
distributes both data and computations over processors. This data-parallel strategy should
make it possible to simulate very large molecules on large numbers of processors.
- In order to minimize communication among processors and to balance computational load, a
variety of partitioning approaches are employed to distribute the atoms and computations
over processors. In this implementation, atoms are partitioned based on geometrical
positions and computational load by using unweighted or weighted recursive coordinate
bisection. The experimental results reveal that taking computational load into account is
essential. The performance of two iteration partitioning algorithms, atom decomposition
and force decomposition, is also compared. A new irregular force decomposition algorithm
is introduced and implemented.
- The CHAOS library is designed to facilitate parallelization of irregular applications.
This library (1) couples partitioners to the application programs, (2) remaps data and
partitions work among processors, and (3) optimizes interprocessor communications. This
paper presents an application of CHAOS that can be used to support efficient execution of
irregular problems on distributed memory machines.
- Runtime and Language Support for Compiling Adaptive
Irregular Programs on Distributed Memory Machines
Y.-S. Hwang, B. Moon, S. Sharma, R. Ponnusamy, R. Das, and J. Saltz
Software Practice & Experience, pages 597-621, Volume 25, Number 6, June
1995
In many scientific applications, arrays containing data are indirectly indexed through
indirection arrays. Such scientific applications are called irregular programs and are a
distinct class of applications that require special techniques for parallelization.
- This paper presents a library called CHAOS, which helps users implement irregular
programs on distributed memory machines, such as the Paragon, Delta, CM-5, and SP-1.
The CHAOS library provides efficient runtime primitives for distributing data and
computation over processors; it supports efficient translation mechanisms and provides
users high-level mechanisms for optimizing communication. CHAOS subsumes the previous PARTI
library and supports a larger class of applications. In particular, it provides efficient
support for parallelization of adaptive irregular programs where indirection arrays are
modified during the course of computation. To demonstrate the efficacy of CHAOS, two
challenging real-life applications were parallelized using CHAOS primitives: a molecular
dynamics code, CHARMM, and a particle-in-cell code, DSMC.
- Besides providing runtime support to users, CHAOS can also be used by compilers to
automatically parallelize irregular applications. This paper shows how CHAOS can be used
in such a framework. By embedding CHAOS primitives in the Syracuse Fortran 90D/HPF compiler,
kernels taken from CHARMM and DSMC codes have been parallelized automatically.
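A schematic of the inspector/executor pattern underlying this style of runtime support (our sketch; the function names and block distribution are illustrative, not the CHAOS API) for an irregular loop x[i] += y[ind[i]] over a distributed array y:

```python
def inspector(ind, block):
    """Inspect the indirection array once and build a communication
    schedule: owning processor -> sorted list of needed global indices."""
    sched = {}
    for g in set(ind):
        sched.setdefault(g // block, []).append(g)   # block distribution
    return {p: sorted(gs) for p, gs in sched.items()}

def executor(x, ind, sched, fetch):
    """Reuse the schedule on every iteration while ind is unchanged.
    fetch(p, gs) stands in for one message retrieving y[gs] from processor p."""
    ybuf = {}
    for p, gs in sched.items():
        for g, v in zip(gs, fetch(p, gs)):
            ybuf[g] = v
    for i, g in enumerate(ind):
        x[i] += ybuf[g]

y = list(range(100))                       # pretend y is block-distributed
ind = [5, 17, 5, 63]
sched = inspector(ind, block=25)           # preprocessing, done once
x = [0.0] * len(ind)
executor(x, ind, sched, lambda p, gs: [y[g] for g in gs])
```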
- Interprocedural Partial Redundancy Elimination and Its
Application to Distributed Memory Compilation
G. Agrawal, J. Saltz, and R. Das
Proceedings of the SIGPLAN '95 Conference on Programming Language Design and
Implementation, pages 258-269, La Jolla, California, June 18-21, 1995
University of Maryland: Department of Computer Science and UMIACS Technical Reports
CS-TR-3446 and UMIACS-TR-95-42
- Partial Redundancy Elimination (PRE) is a general scheme for suppressing partial
redundancies which encompasses traditional optimizations like loop invariant code motion
and redundant code elimination. In this paper we address the problem of performing this
optimization interprocedurally. We use interprocedural partial redundancy elimination for
placement of communication and communication preprocessing statements while compiling for
distributed memory parallel machines.
- Distributed Memory Compiler Design for Sparse Problems
J. Wu, R. Das, J. Saltz, H. Berryman, and S. Hiranandani
IEEE Transactions on Computers, pages 737-753, Volume 44, Number 6, June
1995
- This paper addresses the issue of compiling concurrent loop nests in the presence of
complicated array references and irregularly distributed arrays. Arrays accessed within
loops may contain accesses that make it impossible to precisely determine the reference
pattern at compile time. This paper proposes a runtime support mechanism that is used
effectively by a compiler to generate efficient codes in these situations. The compiler
accepts as input a Fortran 77 program enhanced with specifications for distributing data,
and outputs a message passing program that runs on the nodes of a distributed memory
machine. The runtime support for the compiler consists of a library of primitives designed
to support irregular patterns of distributed array accesses and irregularly distributed
array partitions. A variety of performance results on the Intel iPSC/860 are presented.
- Data Parallel Programming in an Adaptive Environment
G. Edjlali, G. Agrawal, A. Sussman, and J. Saltz
Proceedings of the Ninth International Parallel Processing Symposium, pages
827-832, Santa Barbara, California, April 24-28 1995
University of Maryland Technical Report: CS-TR-3350 and UMIACS-TR-94-109 (extended version)
- For better utilization of computing resources, it is important to consider parallel
programming environments in which the number of available processors varies at runtime. In
this paper, we discuss runtime support for data parallel programs in such an adaptive
environment. Executing data parallel programs in an adaptive environment requires
determining new loop bounds and communication patterns for the new set of processors. We
have developed a runtime library to provide this support. We discuss how the runtime
library can be used by compilers to generate code for an adaptive environment. We also
present performance results for a multiblock Navier-Stokes solver run on a network of
workstations using PVM for message passing. Our experiments show that if the number of
processors is not varied frequently, the cost of data redistribution is not significant
compared to the time required for the actual computations.
- Index Translation Schemes for Adaptive Computations on
Distributed Memory Multicomputers
B. Moon, M. Uysal, and J. Saltz
Proceedings of the Ninth International Parallel Processing Symposium, pages
812-819, Santa Barbara, California, April 24-28, 1995
University of Maryland: Department of Computer Science and UMIACS Technical Reports
CS-TR-3428 and UMIACS-TR-95-28
- Current research in parallel programming is focused on closing the gap between globally
indexed algorithms and the separate address spaces of processors on distributed memory
multicomputers. A set of index translation schemes has been implemented as part of the
CHAOS runtime support library, so that the library functions can be used for implementing
a global index space across a collection of separate local index spaces. These schemes
include two software-cached translation schemes aimed at adaptive irregular problems as
well as a distributed translation table technique for statically irregular problems. To
evaluate and demonstrate the efficiency of the software-cached translation schemes,
experiments have been performed with an adaptively irregular loop kernel and a
full-fledged 3D DSMC code from NASA Langley on the Intel Paragon and Cray T3D. This paper
also discusses and analyzes the operational conditions under which each scheme can produce
optimal performance.
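For the statically irregular case, the distributed translation table is the key structure: each global index is mapped to its owner and local offset. The sketch below (ours, not the library's interface) builds such a table centrally; in the real scheme the table itself is block-distributed across processors.

```python
class TranslationTable:
    def __init__(self, global_to_proc, nprocs):
        """global_to_proc[g] names the processor owning global index g."""
        self.entries = {}                  # g -> (owner, local offset)
        counters = [0] * nprocs
        for g, p in enumerate(global_to_proc):
            self.entries[g] = (p, counters[p])
            counters[p] += 1

    def dereference(self, g):
        # In the distributed version this lookup may require a message to
        # the processor holding the table block that contains g.
        return self.entries[g]

# Irregular distribution of 8 elements over 2 processors:
tt = TranslationTable([0, 1, 1, 0, 1, 0, 0, 1], nprocs=2)
print(tt.dereference(4))                   # -> (1, 2): proc 1, local slot 2
```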
- Object-Oriented Runtime Support for Complex Distributed
Data Structures
C. Chang, A. Sussman, and J. Saltz
University of Maryland: Department of Computer Science and UMIACS Technical Reports
CS-TR-3438 and UMIACS-TR-95-35
March, 1995
- Object-oriented applications utilize language constructs such as pointers to synthesize
dynamic complex data structures, such as linked lists, trees and graphs, with elements
consisting of complex data types. Traditionally, however, applications executed on
distributed memory parallel architectures in single-program multiple-data (SPMD) mode use
distributed (multi-dimensional) data arrays. Good performance has been achieved by
applying runtime techniques to such applications executing in a loosely synchronous
manner. Existing runtime systems that rely solely on global indices are not always
applicable to object-oriented applications, since no global names or indices are imposed
upon dynamic complex data structures linked by pointers.
- We describe a portable object-oriented runtime library that has been designed to support
applications that use dynamic distributed data structures, including both arrays and
pointer-based data structures. In particular, CHAOS++ deals with complex data types and
pointer-based data structures, by providing two abstractions, mobile objects and globally
addressable objects. CHAOS++ uses preprocessing techniques to analyze communication
patterns, and provides data exchange routines to perform efficient transfers between
processors. Results for applications taken from three distinct classes are also presented
to demonstrate the wide applicability and good performance characteristics of the runtime
library.
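The mobile-object idea can be caricatured in a few lines: a class supplies pack/unpack hooks that flatten its pointer-based members into a message and rebuild them on the receiver. This is our Python stand-in, not CHAOS++'s actual C++ interface.

```python
import json

class MobileTreeNode:
    """A pointer-based structure made 'mobile' via pack/unpack hooks."""
    def __init__(self, val, children=None):
        self.val = val
        self.children = children or []

    def pack(self):
        # Flatten the pointer structure into a contiguous, sendable message.
        return json.dumps({"val": self.val,
                           "children": [json.loads(c.pack())
                                        for c in self.children]})

    @classmethod
    def unpack(cls, msg):
        d = json.loads(msg) if isinstance(msg, str) else msg
        return cls(d["val"], [cls.unpack(c) for c in d["children"]])

root = MobileTreeNode(1, [MobileTreeNode(2), MobileTreeNode(3)])
wire = root.pack()                     # what would cross address spaces
clone = MobileTreeNode.unpack(wire)    # rebuilt copy on the 'receiver'
```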
- A Manual for the CHAOS Runtime Library
J. Saltz, R. Ponnusamy, S. Sharma, B. Moon, Y.-S. Hwang, M. Uysal, and R. Das
University of Maryland: Department of Computer Science and UMIACS Technical Reports
CS-TR-3437 and UMIACS-TR-95-34
March 1995
- Procedures are presented that are designed to help users efficiently program irregular
problems (e.g., unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial
differential equations solvers) on distributed memory machines. These procedures are also
designed for use in compilers for distributed memory multiprocessors. The portable CHAOS
procedures are designed to support dynamic data distributions and to automatically
generate send and receive messages by capturing communication patterns at runtime.
- Runtime Support and Dynamic Load Balancing Strategies for
Structured Adaptive Applications
Bongki Moon, Gopal Patnaik, Robert Bennett, David Fyfe, Alan Sussman, Craig Douglas, K.
Kailasanath, and Joel Saltz
Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific
Computing, pages 575-580, February 1995.
One class of scientific and engineering applications involves structured meshes. One
example of a code in this class is a flame modelling code developed at the Naval Research
Laboratory (NRL). The numerical model used in the NRL flame code is predominantly based on
structured finite volume methods. The chemistry process of the reactive flow is modeled by
a system of ordinary differential equations which is solved independently at each grid
point. Thus, though the model uses a mesh structure, the workload at each grid point can
vary considerably. It is this feature that requires the use of both structured and
unstructured methods in the same code. We have applied the Multiblock PARTI and CHAOS
runtime support libraries to parallelize the NRL flame code with minimal changes to the
sequential code. We have also developed parallel algorithms to carry out dynamic load
balancing. It has been observed that the overall performance scales reasonably well up to
256 Paragon processors and that the total runtime on a 256-node Paragon is about half that
of a single-processor Cray C90.
- Support for Distributed Dynamic Data Structures in C++
Chialin Chang, Alan Sussman, and Joel Saltz
University of Maryland: Department of Computer Science Technical Report CS-TR-3416 and
UMIACS-TR-95-19
January 1995
- Traditionally, applications executed on distributed memory architectures in
single-program multiple-data (SPMD) mode use distributed (multi-dimensional) data arrays.
Good performance has been achieved by applying runtime techniques to such applications
executing in a loosely synchronous manner. However, many applications utilize language
constructs such as pointers to synthesize dynamic complex data structures, such as linked
lists, trees and graphs, with elements consisting of complex composite data types.
Existing runtime systems that rely on global indices cannot be used for these
applications, as no global names or indices are imposed upon the elements of these data
structures.
- A portable object-oriented runtime library is presented to support applications that use
dynamic distributed data structures, including both arrays and pointer-based data
structures. In particular, CHAOS++ deals with complex data types and pointer-based data
structures by providing mobile objects and globally addressable objects.
Preprocessing techniques are used to analyze communication patterns, and data exchange
primitives are provided to carry out efficient data transfer. Performance results for
applications taken from three distinct classes are also included to demonstrate the wide
applicability of the runtime library.
- A Framework for Optimizing Parallel I/O
R. Bennett, K. Bryant, A. Sussman, R. Das, and J. Saltz
University of Maryland: Department of Computer Science and UMIACS Technical Reports
CS-TR-3417 and UMIACS-TR-95-20
January 1995
- There has been a great deal of recent interest in parallel I/O. This paper discusses
issues in the design and implementation of a portable I/O library designed to optimize the
performance of multiprocessor architectures that include multiple disks or disk arrays. The
major emphasis of the paper is on optimizations that are made possible by the use of
collective I/O, so that the I/O requests for multiple processors can be combined to
improve performance. Performance measurements from benchmarking Jovian, our implementation
of an I/O library that currently performs collective local optimizations, on three
application templates are also presented.
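One optimization collective I/O enables is request coalescing: with all processes' requests in hand, the library can merge adjacent and overlapping ranges into a few large reads. A minimal sketch (ours, not Jovian's interface):

```python
def coalesce(requests):
    """requests: (offset, length) pairs gathered from all processes;
    returns merged (offset, length) ranges to issue against the disk."""
    merged = []
    for off, ln in sorted(requests):
        if merged and off <= merged[-1][1]:        # touches or overlaps last range
            merged[-1][1] = max(merged[-1][1], off + ln)
        else:
            merged.append([off, off + ln])
    return [(a, b - a) for a, b in merged]

reqs = [(0, 4096), (4096, 4096), (16384, 4096), (2048, 4096)]
print(coalesce(reqs))    # -> [(0, 8192), (16384, 4096)]
```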
- Parallelization of Fine Grained, Irregular DAGs
F. Chong, S. Sharma, E. Brewer, and J. Saltz
International Conference on Supercomputing 1995, December 1994
- This paper examines parallelization of computations that can be characterized as a
fine-grained, irregular, directed acyclic graph of tasks. Such computations typically
arise in sparse matrix applications where loops may have loop carried data dependencies.
The fine grain of the computation and the irregular nature of the dependencies make such
loops extremely hard to parallelize efficiently. We use runtime preprocessing to
parallelize such computations. A preprocessing step analyzes the DAG and determines an
optimized mapping of tasks to processors, an appropriate order of executing these tasks on
each processor and a set of synchronization constraints. This information is used to
distribute data across processors, and to schedule computation, communication and
synchronization. We use sparse-matrix triangular solution as our example application and
compare a range of runtime preprocessed techniques for its parallelization. Whereas
previous studies have achieved maximum speedups of less than 4 on even simple banded
matrices, we achieve a speedup of almost 35 on 128 processors of the CM-5 on an irregular
matrix with 5300 rows. We find that send and receive overheads dominate the time of our
application. Efficient parallelization of such loops requires even more support for
low-latency communication.
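- One common form such preprocessing takes is computing topological levels (wavefronts) of the task DAG: tasks on the same level have no mutual dependencies and can execute concurrently. The sketch below is a generic illustration of that step, not the paper's exact mapping algorithm.

```cpp
// Computing topological levels (wavefronts) of a task DAG.
#include <cstdio>
#include <queue>
#include <vector>

std::vector<int> dag_levels(int n, const std::vector<std::vector<int>>& succ) {
    std::vector<int> indeg(n, 0), level(n, 0);
    for (int u = 0; u < n; ++u)
        for (int v : succ[u]) ++indeg[v];
    std::queue<int> ready;
    for (int u = 0; u < n; ++u)
        if (indeg[u] == 0) ready.push(u);        // roots form level 0
    while (!ready.empty()) {
        int u = ready.front(); ready.pop();
        for (int v : succ[u]) {
            if (level[u] + 1 > level[v]) level[v] = level[u] + 1;
            if (--indeg[v] == 0) ready.push(v);  // all predecessors done
        }
    }
    return level;
}

int main() {
    // Tiny diamond DAG: 0 -> 1 -> 3 and 0 -> 2 -> 3, as in a banded solve.
    std::vector<std::vector<int>> succ = {{1, 2}, {3}, {3}, {}};
    for (int lv : dag_levels(4, succ)) std::printf("%d ", lv);  // 0 1 1 2
    std::printf("\n");
}
```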
- Run-time and Compile-time Support for Adaptive Irregular
Problems
Shamik D. Sharma, Ravi Ponnusamy, Bongki Moon, Yuan-Shin Hwang, Raja Das, and Joel
Saltz
Supercomputing 1994, pages 97-106, Washington, District of Columbia, November
14-18, 1994 (revised version)
University of Maryland and UMIACS Technical Report: CS-TR-3266 and UMIACS-TR-94-55
- In adaptive irregular problems the data arrays are accessed via indirection arrays, and
data access patterns change during computation. Implementing such problems on distributed
memory machines requires support for dynamic data partitioning, efficient preprocessing
and fast data migration. This research presents efficient runtime primitives for such
problems. This new set of primitives is part of the CHAOS library. It subsumes the
previous PARTI library which targeted only static irregular problems. To demonstrate the
efficacy of the runtime support, two real adaptive irregular applications have been
parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a
particle-in-cell code (DSMC). The paper also proposes extensions to Fortran D which can
allow compilers to generate more efficient code for adaptive problems. These language
extensions have been implemented in the Syracuse Fortran 90D/HPF prototype compiler. The
performance of the compiler parallelized codes is compared with the hand parallelized
versions.
- Jovian: A Framework for Optimizing Parallel I/O
R. Bennett, K. Bryant, A. Sussman, and R. Das
Proceedings of the 1994 Scalable Parallel Libraries Conference, pages 10-20,
Mississippi State, Mississippi, October 12-14, 1994
- There has been a great deal of recent interest in parallel I/O. In this paper, we
discuss the design and implementation of the Jovian library, which is intended to optimize
the I/O performance of multiprocessor architectures that include multiple disks or disk
arrays. We also present preliminary performance measurements from benchmarking the Jovian
I/O library on the IBM SP1 distributed memory parallel machine for two application
templates.
-
- High Performance Computing for Land Cover Dynamics
Rahul Parulekar, Larry Davis, Rama Chellappa, Joel Saltz, Alan Sussman, and John
Townshend
Proceedings of the International Joint Conference on Pattern Recognition, IEEE
Computer Society Press, pages 234-238, Jerusalem, Israel, October 9-13, 1994
- We present the overall goals of our research program on the application of high
performance computing to remote sensing applications, specifically, applications in land
cover dynamics. This involves developing scalable and portable programs for a variety of
image and map data processing applications, eventually integrated with new models for
parallel I/O of large-scale images and maps. After an overview of the multiblock PARTI
runtime support system, we explain extensions made to that system to support image processing
applications, and then present an example involving multiresolution image processing.
Results of running the parallel code on both a TMC CM5 and an Intel Paragon are discussed.
-
- Communication Optimizations for Irregular Scientific
Computations on Distributed Memory Architectures
R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang
Journal of Parallel and Distributed Computing, pages 462-479, Volume 22, Number
3, September 1994
University of Maryland Technical Report CS-TR-3163 and UMIACS-TR-93-109
- This paper describes a number of optimizations that can be used to support the efficient
execution of irregular problems on distributed memory parallel machines. We describe
software primitives that (1) coordinate interprocessor data movement, (2) manage the
storage of, and access to, copies of off-processor data, (3) minimize interprocessor
communication requirements and (4) support a shared name space. We present a detailed
performance and scalability analysis of the communication primitives. This performance and
scalability analysis is carried out using a workload generator, kernels from real
applications and a large unstructured adaptive application (the molecular dynamics code
CHARMM).
- Parallel DSMC Solution of Three-Dimensional Flow over a
Finite Flat Plate
R. Nance, R. Wilmoth, B. Moon, H. Hassan, and J. Saltz.
Proceedings of the 6th AIAA/ASME Joint Thermophysics and Heat Transfer Conference,
Colorado Springs, Colorado, June 20-23, 1994
- This paper describes a parallel implementation of the direct simulation Monte Carlo
(DSMC) method. Runtime library support is used for scheduling and execution of
communication between nodes, and domain decomposition is performed dynamically to maintain
a good load balance. Performance tests are conducted using the code to evaluate various
remapping and remapping-interval policies, and it is shown that a one-dimensional chain
partitioning method works best for the problems considered. The parallel code is then used
to simulate Mach 20 nitrogen flow over a finite-thickness flat plate. It is shown that the
parallel algorithm produces results which compare well with experimental data. Moreover,
it yields significantly faster execution times than the scalar code, as well as very good
load-balance characteristics.
- Supporting Irregular Distributions in FORTRAN 90D/HPF
Compilers
Ravi Ponnusamy, Yuan-Shin Hwang, Joel Saltz, Alok Choudhary, and Geoffrey Fox
IEEE Parallel & Distributed Technology, Spring 1995 (revised version)
University of Maryland: Department of Computer Science and UMIACS Technical Reports
CS-TR-3268 and UMIACS-TR-94-57
May 1994
- We present methods that make it possible to efficiently support an important subclass of
irregular problems using data parallel languages. The approach we describe involves the
use of a portable, compiler-independent, runtime support library called CHAOS. The CHAOS
runtime support library contains procedures that support static and dynamic distributed
array partitioning, partition loop iterations and indirection arrays, remap arrays from
one distribution to another, and carry out index translation, buffer allocation and
communication schedule generation.
- The CHAOS runtime procedures are used by a prototype Fortran 90D compiler as runtime
support for irregular problems. We present performance results of compiler-generated and
hand-parallelized versions of two stripped-down application codes. The first code is
derived from an unstructured mesh computational fluid dynamics flow solver and the second
is derived from the molecular dynamics code CHARMM.
- A method is described that makes it possible to emulate irregular distributions in HPF
by reordering elements of data arrays and renumbering indirection arrays. We present
results that suggest that an HPF compiler could use reordering and renumbering extrinsic
functions to obtain performance comparable to that achieved by a compiler for a language
(such as Fortran 90D) that directly supports irregular distributions.
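- The sketch below illustrates the reordering and renumbering step under simplifying assumptions: given a placement computed by a partitioner, the data array is reordered and every indirection array is renumbered so that the irregular accesses still reach the same values; a plain BLOCK distribution of the reordered array then matches the irregular partition. All names are hypothetical.

```cpp
// Reordering a data array and renumbering its indirection array.
#include <cstdio>
#include <vector>

// newpos[i] gives the new index of element i, as chosen by a partitioner.
void reorder_data(const std::vector<int>& newpos, std::vector<double>& x) {
    std::vector<double> tmp(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) tmp[newpos[i]] = x[i];
    x.swap(tmp);
}

void renumber_indirection(const std::vector<int>& newpos, std::vector<int>& ia) {
    for (int& idx : ia) idx = newpos[idx];   // point at the moved elements
}

int main() {
    std::vector<double> x   = {10, 20, 30, 40};
    std::vector<int>    ia  = {3, 0, 0, 2};     // irregular accesses x[ia[i]]
    std::vector<int> newpos = {2, 0, 3, 1};     // partitioner's placement
    reorder_data(newpos, x);
    renumber_indirection(newpos, ia);
    for (int i : ia) std::printf("%g ", x[i]);  // same values as before: 40 10 10 30
    std::printf("\n");
}
```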
- Adaptive Runtime Support for Direct Simulation Monte
Carlo Methods on Distributed Memory Architectures
Bongki Moon and Joel Saltz.
Proceedings of the Scalable High Performance Computing Conference 1994, pages
176-183, Knoxville, Tennessee, May 23-25, 1994
University of Maryland and UMIACS Technical Report: CS-TR-3427 and UMIACS-TR-95-27
February 1995
- In highly adaptive irregular problems such as many Particle-In-Cell (PIC) codes and
Direct Simulation Monte Carlo (DSMC) codes, data access patterns may vary from time step
to time step. This fluctuation may hinder efficient utilization of distributed memory
parallel computers because of the resulting overhead for data redistribution and dynamic
load balancing. It may also limit the benefit of runtime pre-processing because
the pre-processing requirements are sensitive to perturbations in the data access
patterns. To efficiently parallelize such adaptive irregular problems on distributed
memory parallel computers, several issues such as effective methods for domain
partitioning, efficient index dereferencing and fast data transportation must be
addressed. This paper presents efficient runtime support methods for such problems. These
new runtime support primitives have recently been implemented and added to the CHAOS
library. A new domain partitioning algorithm is introduced. A simple one-dimensional domain
partitioning method is implemented and compared with unstructured mesh partitioners such
as recursive coordinate bisection and recursive inertial bisection. A remapping decision
policy has been investigated for dynamic load balancing on 3-dimensional DSMC codes.
Performance results are presented.
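- A minimal sketch of one-dimensional chain partitioning appears below: an ordered sequence of per-cell workloads is cut into contiguous chains of roughly equal load using a greedy prefix-sum rule. The actual partitioning algorithm and remapping decision policy in the paper are more involved.

```cpp
// Greedy one-dimensional chain partitioning by prefix sums.
#include <cstdio>
#include <vector>

// Returns cut points: chain p owns cells [cuts[p], cuts[p+1]).
std::vector<int> chain_partition(const std::vector<double>& load, int nparts) {
    double total = 0;
    for (double w : load) total += w;
    std::vector<int> cuts(nparts + 1, (int)load.size());
    cuts[0] = 0;
    double prefix = 0;
    int p = 1;
    for (int i = 0; i < (int)load.size() && p < nparts; ++i) {
        prefix += load[i];
        if (prefix >= total * p / nparts)   // crossed the next ideal boundary
            cuts[p++] = i + 1;
    }
    return cuts;
}

int main() {
    std::vector<double> load = {1, 1, 8, 1, 1, 1, 8, 1};  // per-cell particle counts
    std::vector<int> cuts = chain_partition(load, 4);
    for (int p = 0; p < 4; ++p)
        std::printf("chain %d: [%d, %d)\n", p, cuts[p], cuts[p + 1]);
}
```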
- Runtime Support to Parallelize Adaptive Irregular
Programs
Y.-S. Hwang, B. Moon, S. Sharma, R. Das, and J. Saltz
Proceedings of the Workshop on Environments and Tools for Parallel Scientific
Computing, pages 19-32, May 1994
- This paper describes how a runtime support library can be used as compiler runtime
support in irregular applications. The CHAOS runtime support library carries out
optimizations designed to reduce communication costs by performing software caching,
communication coalescing and inspector/executor preprocessing. CHAOS also supplies special
purpose routines to support specific types of irregular reduction and runtime support for
partitioning data and work between processors. A number of adaptive irregular codes have
been parallelized using the CHAOS library and performance results from these codes are
also presented in this paper.
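- The single-node sketch below illustrates the inspector/executor pattern that CHAOS automates: the inspector scans the indirection array once and deduplicates off-processor references into a schedule (software caching and communication coalescing), and the executor then runs the loop out of local storage. The fetch_remote stand-in replaces real interprocessor communication and is purely an assumption.

```cpp
// A minimal single-node sketch of inspector/executor preprocessing.
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Schedule {
    std::vector<int> remote;                 // distinct off-processor indices
    std::unordered_map<int, int> local_slot; // global index -> buffer slot
};

// Inspector: one pass over the indirection array, coalescing duplicates.
Schedule inspect(const std::vector<int>& ia, int lo, int hi) {
    Schedule s;
    for (int g : ia)
        if ((g < lo || g >= hi) && !s.local_slot.count(g)) {
            s.local_slot[g] = (int)s.remote.size();
            s.remote.push_back(g);
        }
    return s;
}

int main() {
    // This processor owns global indices [0, 4); the rest must be fetched.
    std::vector<double> x_owned = {1, 2, 3, 4};
    auto fetch_remote = [](int g) { return 100.0 + g; };  // stand-in for comm

    std::vector<int> ia = {1, 5, 7, 5, 2};   // 5 appears twice: fetched once
    Schedule s = inspect(ia, 0, 4);

    std::vector<double> buf(s.remote.size());
    for (std::size_t k = 0; k < s.remote.size(); ++k)
        buf[k] = fetch_remote(s.remote[k]);  // one gather per schedule

    double sum = 0;                          // executor: pure local accesses
    for (int g : ia)
        sum += (g >= 0 && g < 4) ? x_owned[g] : buf[s.local_slot[g]];
    std::printf("sum = %g\n", sum);          // 2 + 105 + 107 + 105 + 3 = 322
}
```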
- Multiprocessor Runtime Support for Fine-Grained,
Irregular DAGs
Frederick T. Chong, Shamik D. Sharma, Eric Brewer, and Joel Saltz
Parallel Processing Letters, pages 671-683, Volume 5, Number 4,
1995
University of Maryland: Department of Computer Science Technical Report,
CS-TR-3267
March, 1994
- We present multiprocessor runtime support for fine-grained, irregular directed acyclic
graphs (DAGs) such as those that arise from sparse-matrix triangular solves. We conduct our
experiments on the CM-5, whose lower latencies and active-message support allow us to
achieve unprecedented speedups for a general multiprocessor. Whereas previous
implementations have achieved maximum speedups of less than 4 on even simple banded
matrices, we are able to obtain scalable performance on extremely small and irregular
problems. On a matrix with only 5300 rows, we achieve a speedup of almost 35 on 128
processors.
- We achieve these speedups with non-matrix-specific methods which are applicable to any
DAG. We compare a range of run-time preprocessed and dynamic approaches on matrices from
the Harwell-Boeing benchmark set. Although precomputed data distributions and execution
schedules produce the best performance, we find that it is challenging to keep their cost
low enough to make them worthwhile on small, fine-grained problems.
- Additionally, we find that a policy of frequent network polling can reduce communication
overhead by a factor of three over the standard CM-5 policies. We present a detailed study
of runtime overheads and demonstrate that send and receive processor overhead still
dominate these applications on the CM-5. We conclude that these applications would
benefit greatly from architectural support for low-overhead communication.
- The Design and Implementation of a Parallel Unstructured
Euler Solver Using Software Primitives
Raja Das, D. J. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy
Journal of the American Institute of Aeronautics and Astronautics (AIAA), pages
489-496, Volume 32, Number 3, March 1994
- This paper is concerned primarily with the implementation of a three-dimensional
unstructured-grid Euler-solver on massively parallel distributed memory computer
architectures. The goal is to minimize solution time by achieving high computational rates
with a numerically efficient algorithm. An unstructured multigrid algorithm with an
edge-based data structure has been adopted, and a number of optimizations have been
devised and implemented in order to accelerate the parallel computational rates. The
implementation is carried out by creating a set of software tools, which provide an
interface between the parallelization issues and the sequential tools, while providing a
basis for future automatic run-time support. Large practical unstructured grid problems
are solved on the Intel iPSC/860 hypercube and the Intel Touchstone Delta machine. The
quantitative effects of the various optimizations are demonstrated, and we show that the
combined effect of these optimizations leads to roughly a factor of three performance
improvement. The overall solution efficiency is compared with that obtained on the
CRAY-YMP vector supercomputer.
-
- A Manual for the Multiblock PARTI Runtime Primitives,
Revision 5
Alan Sussman, Gagan Agrawal, and Joel Saltz
University of Maryland: Department of Computer Science and UMIACS Technical Reports
CS-TR-3070.1, UMIACS-TR-93-36.1
December 1993
- There exists a large class of scientific applications that are composed of irregularly
coupled regular mesh (ICRM) computations. These problems are often referred to as block
structured, or multiblock, problems and include the block structured Navier-Stokes solver
developed at NASA Langley called TLNS3D.
- Primitives are presented that are designed to help users to efficiently program such
problems on distributed memory machines. These primitives are also designed for use by
compilers for distributed memory multiprocessors. Communication patterns are captured at
runtime, and the appropriate send and receive messages are automatically generated. The
primitives are also useful for parallelizing regular computations, since block structured
computations also require all the runtime support necessary for regular computations.
- Compiler and Runtime Support for Structured and Block
Structured Applications
Gagan Agrawal, Alan Sussman, and Joel Saltz.
Proceedings of Supercomputing '93, pages 578-587, Portland, Oregon,
November 15-19, 1993
University of Maryland Technical Report: CS-TR-3052 and UMIACS-TR-93-29
- Scientific and engineering applications often involve structured meshes. These meshes
may be nested (for multigrid or adaptive codes) and/or irregularly coupled (called
Irregularly Coupled Regular Meshes). We have designed and implemented a runtime library
for parallelizing this general class of applications on distributed memory parallel
machines in an efficient and machine independent manner. In this paper we present how this
runtime library can be integrated with compilers for High Performance Fortran (HPF) style
parallel programming languages. We discuss how we have integrated this runtime library
with the Fortran 90D compiler being developed at Syracuse University and provide
experimental data on a block structured Navier-Stokes solver template and a small
multigrid example parallelized using this compiler and run on an Intel iPSC/860. We show
that the compiler parallelized code performs within 20% of the code parallelized by
inserting calls to the runtime library manually.
- Runtime-Compilation Techniques for Data Partitioning and
Communication Schedule Reuse
Ravi Ponnusamy, Joel Saltz, and Alok Choudhary
Proceedings Supercomputing '93, pages 361-370, Portland, Oregon, November
15-19, 1993
University of Maryland and UMIACS Technical Report: CS-TR-3055 and
UMIACS-TR-93-32
- In this paper, we describe two new ideas by which an HPF compiler can deal with irregular
computations effectively. The first mechanism invokes a user-specified mapping procedure
via a set of compiler directives. The directives allow the user to use program arrays to
describe graph connectivity, spatial location of array elements and computational load.
The second is a simple conservative method that in many cases enables a compiler to
recognize that it is possible to reuse previously computed results from inspectors (e.g.,
communication schedules, loop iteration partitions, information that associates
off-processor data copies with on-processor buffer locations). We present performance
results for these mechanisms from a Fortran 90D compiler implementation.
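- The sketch below illustrates one simple, conservative reuse test of the kind described above: each indirection array carries a version stamp, and a cached inspector result is rebuilt only when the array may have changed since the schedule was computed. The mechanism in the paper is a compile-time analysis and is more general; all names here are hypothetical.

```cpp
// Conservative reuse of inspector results via version stamps.
#include <cstdio>
#include <vector>

struct StampedArray {
    std::vector<int> data;
    long version = 0;
    void write(int i, int v) { data[i] = v; ++version; }  // any write invalidates
};

struct CachedSchedule {
    long built_for = -1;    // version the schedule was computed against
    std::vector<int> schedule;
};

const std::vector<int>& get_schedule(const StampedArray& ia, CachedSchedule& c) {
    if (c.built_for != ia.version) {    // conservative: rebuild on any change
        c.schedule = ia.data;           // placeholder for real inspector work
        c.built_for = ia.version;
        std::printf("inspector ran\n");
    }
    return c.schedule;
}

int main() {
    StampedArray ia{{3, 1, 2}, 0};
    CachedSchedule cache;
    get_schedule(ia, cache);   // inspector runs
    get_schedule(ia, cache);   // reused, no inspector cost
    ia.write(0, 5);
    get_schedule(ia, cache);   // inspector runs again
}
```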
- Applying the CHAOS/PARTI Library to Irregular Problems in
Computational Chemistry and Computational Aerodynamics
Raja Das, Y. Hwang, M. Uysal, J. Saltz, and A. Sussman
Proceedings of the Scalable Parallel Libraries Conference 1993, pages 45-56,
Starkville, Mississippi, October 6-8, 1993
- This paper describes a number of optimizations that can be used to support the efficient
execution of irregular problems on distributed memory parallel machines. We describe
software primitives that (1) coordinate interprocessor data movement, (2) manage the
storage of, and access to, copies of off-processor data, (3) minimize interprocessor
communication requirements and (4) support a shared name space. The performance of the
primitives is characterized by examination of kernels from real applications and from a
full implementation of a large unstructured adaptive application (the molecular dynamics
code CHARMM).
- On Efficient Runtime Support for Multiblock and Multigrid
Applications: Regular Section Analysis
Gagan Agrawal, Alan Sussman, and Joel Saltz
University of Maryland: Department of Computer Science and UMIACS Technical Reports
CS-TR-3140, UMIACS-TR-93-92
October 1993
- Scientific and engineering applications often involve structured meshes. These meshes
may be nested (for multigrid or adaptive codes) and/or irregularly coupled (called
Irregularly Coupled Regular Meshes). We have designed and implemented a runtime library
for parallelizing this general class of applications on distributed memory parallel
machines in an efficient and machine independent manner. One important communication
primitive supported by this library is the regular section move, which copies a regular
section from a distributed array to another distributed array, potentially involving
changes in offset, stride and rotation of dimensions. In this paper we discuss the regular
section analysis which is required for efficiently generating schedules for this kind of
communication. We discuss the details of the analysis required when the distributions of
arrays may be block or cyclic.
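- The sketch below shows, on a single node, what a regular section move does logically: a strided rectangular section of one array is copied into a possibly differently placed, differently strided section of another. The real primitive performs this across block- or cyclic-distributed arrays and generates the message schedule at runtime.

```cpp
// Logical behavior of a regular section move on a single node.
#include <cstdio>
#include <vector>

// Copy src[s0 + i*sstep][s1 + j*sstep] -> dst[d0 + i*dstep][d1 + j*dstep]
// for an n0 x n1 section. Dimension rotation would swap i and j on one side.
void section_move(const std::vector<std::vector<double>>& src,
                  std::vector<std::vector<double>>& dst,
                  int n0, int n1,
                  int s0, int s1, int sstep,
                  int d0, int d1, int dstep) {
    for (int i = 0; i < n0; ++i)
        for (int j = 0; j < n1; ++j)
            dst[d0 + i * dstep][d1 + j * dstep] = src[s0 + i * sstep][s1 + j * sstep];
}

int main() {
    std::vector<std::vector<double>> a(6, std::vector<double>(6, 1.0));
    std::vector<std::vector<double>> b(6, std::vector<double>(6, 0.0));
    // Move a 2x2 section read with stride 2 from a's corner into b at (3,3).
    section_move(a, b, 2, 2, /*src*/ 0, 0, 2, /*dst*/ 3, 3, 1);
    std::printf("b[3][3]=%g b[4][4]=%g b[0][0]=%g\n", b[3][3], b[4][4], b[0][0]);
}
```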
- Slicing Analysis and Indirect Access to Distributed
Arrays
Raja Das, Joel Saltz, and Reinhard v. Hanxleden.
Proceedings of the 6th Workshop on Languages and Compilers for Parallel Computing,
pages 152-168, Portland, Oregon, August 12-14, 1993
University of Maryland Technical Report: CS-TR-3076 and UMIACS-TR-93-42
- An increasing fraction of the applications targeted by parallel computers makes heavy
use of indirection arrays for indexing data arrays. Such irregular access patterns
make it difficult for a compiler to generate efficient parallel code. A limitation of
existing techniques addressing this problem is that they are only applicable for a single
level of indirection. However, many codes using sparse data structures access their data
through multiple levels of indirection.
- This paper presents a method for transforming programs using multiple levels of
indirection into programs with at most one level of indirection, thereby broadening the
range of applications that a compiler can parallelize efficiently. A central concept of
our algorithm is to perform program slicing on the subscript expressions of the
indirect array accesses. Such slices peel off the levels of indirection, one by one, and
create opportunities for aggregated data prefetching in between. A slice graph eliminates
redundant preprocessing and gives an ordering in which to compute the slices. We present
our work in the context of High Performance Fortran; an implementation in the Fortran D
prototype compiler is in progress.
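- The sketch below illustrates the effect of the transformation: a doubly indirect access x[ia[ib[i]]] is peeled into two phases, each containing a single level of indirection, so each phase can be handled by a one-level inspector/executor that prefetches data in aggregate.

```cpp
// Peeling a two-level indirect access into two single-level phases.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> x  = {10, 20, 30, 40};
    std::vector<int>    ia = {2, 0, 3, 1};
    std::vector<int>    ib = {1, 3, 0};

    // Phase 1 (slice): materialize t[i] = ia[ib[i]] -- one level of
    // indirection, so the needed elements of ia can be gathered in one step.
    std::vector<int> t(ib.size());
    for (std::size_t i = 0; i < ib.size(); ++i) t[i] = ia[ib[i]];

    // Phase 2: the original loop now has a single level of indirection
    // through t, so x can likewise be prefetched in aggregate.
    double sum = 0;
    for (std::size_t i = 0; i < t.size(); ++i) sum += x[t[i]];
    std::printf("sum = %g\n", sum);   // x[ia[1]] + x[ia[3]] + x[ia[0]] = 10+20+30
}
```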
- Parallelizing Molecular Dynamics Codes using the PARTI
Software Primitives
Raja Das and Joel Saltz
Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific
Computing, pages 187-192, Norfolk, Virginia, March, 1993
- This paper is concerned with the implementation of a molecular dynamics code, CHARMM, on
massively parallel distributed memory computer architectures using a data-parallel approach.
The implementation is carried out by creating a set of software tools, which will provide
an interface between the parallelization issues and the sequential code. Large practical
MD problems are solved on the Intel iPSC/860 hypercube. The overall solution efficiency is
compared with that obtained when the implementation is done using data replication.
- The Dybbuk Runtime System
R. Ponnusamy, R. Das, J. Saltz, and D. Mavriplis
Proceedings of COMPCON 1993, pages 205-212, San Francisco, California, February
23-26, 1993
- Over the past few years, we have developed methods that make it possible for a compiler
to efficiently map many sparse and unstructured scientific problems to scalable
multiprocessor architectures. These methods are implemented using compiler embeddable
runtime support procedures. These procedures can be thought of as supporting a type of
weakly coherent distributed shared memory that can be closely coupled to distributed
shared memory compilers. These procedures (1) coordinate inter-processor data movement,
(2) manage the storage of and access to copies of off-processor data, (3) support a shared
name space, and (4) couple runtime data and workload partitioners to compilers. We are in
the process of implementing prototype High Performance Fortran compilers which use this
runtime support to handle unstructured computations.
- Value-based Distributions in Fortran D: A Preliminary
Report
R. von Hanxleden, K. Kennedy, and J. Saltz
Journal of Programming Languages, Special Issue on Compiling and Run-Time Issues
for Distributed Address Space Machines, pages 259-282, Volume 2, Number 3, December 1993
CRPC 1993, no. CRPC-TR93365-S
December 1993
- Compiling irregular applications written in a data-parallel, High Performance
Fortran-like language presents a challenging problem of growing importance. One principal
difficulty with irregular problems is the general lack of access locality when
distributing data and computation naively, based on simple functions of array and
iteration indices. To address this problem, we extend classical, index-based data
distributions to value-based distributions. This type of distribution allows the
programmer to conveniently express, for example, the locality characteristics of an
underlying physical problem. This gives the compiler an opportunity to improve both inter-
and intra-processor locality, resulting in better performance and scalability, even when
this locality is not inherent in the original data structures and access patterns.
- This paper reports on the experience gained from implementing value-based distributions
in a Fortran 77D compiler prototype. We propose a natural extension of index-based
distributions as already present in Fortran D and HPF. This paper addresses the
compilation issues involved and makes a quantitative comparison of index and value-based
distributions for a Molecular Dynamics application.
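- The sketch below contrasts an index-based BLOCK mapping with a value-based mapping under simplifying assumptions: ownership is decided by the coordinate value an element holds rather than by its array index, so spatially close particles land on the same processor. The Fortran D mechanism is richer; this shows only the idea.

```cpp
// Index-based vs. value-based ownership for a particle array.
#include <cstdio>
#include <vector>

// Index-based BLOCK mapping: owner determined purely by position i.
int owner_by_index(int i, int n, int nprocs) {
    int blk = (n + nprocs - 1) / nprocs;
    return i / blk;
}

// Value-based mapping: owner determined by the coordinate stored at i.
int owner_by_value(double xcoord, double xmin, double xmax, int nprocs) {
    int p = (int)((xcoord - xmin) / (xmax - xmin) * nprocs);
    return p < nprocs ? p : nprocs - 1;   // clamp the right edge
}

int main() {
    // Particle x coordinates stored in arbitrary index order.
    std::vector<double> x = {0.9, 0.1, 0.8, 0.2, 0.15, 0.85};
    int nprocs = 2;
    for (int i = 0; i < (int)x.size(); ++i)
        std::printf("i=%d x=%.2f index-owner=%d value-owner=%d\n",
                    i, x[i], owner_by_index(i, (int)x.size(), nprocs),
                    owner_by_value(x[i], 0.0, 1.0, nprocs));
    // The value-based owners group 0.1, 0.2, 0.15 together and 0.9, 0.8,
    // 0.85 together, matching physical locality; the index-based ones do not.
}
```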
-
- PARTI Primitives for Unstructured and Block Structured
Problems
A. Sussman, J. Saltz, R. Das, S. Gupta, D. Mavriplis, R. Ponnusamy, and K.
Crowley
Computing Systems in Engineering, pages 73-86, Volume 3, Number 1-4,
1992
- This paper describes a set of primitives (PARTI) developed to efficiently execute
unstructured and block structured problems on distributed memory parallel machines. We
present experimental data from a 3-D unstructured Euler solver run on the Intel Touchstone
Delta to demonstrate the usefulness of our methods.
- Compiler Analysis for Irregular Problems in Fortran-D
R. v. Hanxleden, K. Kennedy, C. Koelbel, R. Das, and J. Saltz
Proceedings of 5th Workshop on Languages and Compilers for Parallel Computing,
pages 97-111, New Haven, Connecticut, August 3-5, 1992
- We developed a dataflow framework which provides a basis for rigorously defining
strategies to make use of runtime preprocessing methods for distributed memory
multiprocessors.
- In many programs, several loops access the same off-processor memory locations. Our
runtime support gives us a mechanism for tracking and reusing copies of off-processor
data. A key aspect of our compiler analysis strategy is to determine when it is safe to
reuse copies of off-processor data. Another crucial function of the compiler analysis is
to identify situations which allow runtime preprocessing overheads to be amortized. This
dataflow analysis will make it possible to effectively use the results of interprocedural
analysis in our efforts to reduce interprocessor communication and the need for runtime
preprocessing.