Active Data Repository Project
High Performance Multidimensional Databases
Over the next few years, increasingly large datasets will be generated
by detailed scientific simulations in three or more dimensions and by sensors attached to
satellites, aircraft, telescopes, microscopes and similar instruments. These datasets describe computed or
measured attributes associated with spatial regions. Portions of datasets may be rendered
and directly viewed, or datasets can be used to couple applications. For instance, in one
of our applications (joint work with Mary Wheeler at UT Austin), a fluids code is used to
compute circulation of water in a bay. The output of the fluids code is stored as a
dataset and portions of the dataset are postprocessed and used as inputs to a chemistry
code in order to carry out disaster planning and pollution remediation scenario
simulations. A single dataset may be used either sequentially or simultaneously by a
number of different applications. Applications typically do not make use of all possible
sensor data available in a dataset; instead, the applications select data associated with
a spatio-temporal region. Sensor datasets frequently consist of a number of different
attributes (e.g. telemetry bands in satellite data). A given application may only use
particular sensor data attributes. The application may also combine existing attributes to
synthesize new attributes. Since the same physical entity may be described in a
complementary manner by different types of sensor datasets, applications may need to
generate new datasets by joining preexisting datasets.
We have developed a software tool, the Active Data Repository (ADR), that optimizes
storage, retrieval and processing of very large multi-dimensional datasets. In ADR,
datasets can be described by a multidimensional coordinate system. In some cases datasets
may be viewed as structured or unstructured grids; in other cases (e.g. multiscale or
multiresolution problems), datasets are hierarchical, with varying levels of coarse or fine
meshes describing the same spatial region. Processing takes place in one or several steps
during which new datasets are created, preexisting datasets are transformed or particular
data are output. Each step of processing can be formulated by specifying mappings between
dataset coordinate systems. Results are computed by aggregating, with a user-defined
procedure, all the items mapped to particular sets of coordinates. ADR is designed to
make it possible to carry out data aggregation on processors that are tightly coupled to
disks. Since the output of a data aggregation is typically much smaller than the input,
use of ADR can significantly reduce the overhead associated with obtaining postprocessed
results from large datasets.
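The map-then-aggregate formulation described above can be illustrated with a minimal sketch. It is not ADR code: the names `Item`, `CoarsenMap`, `aggregate` and `coarsen_4x4_by_sum` are hypothetical, and the sketch assumes a 2-D input grid with non-negative coordinates that is projected onto a coarser output grid and combined with a user-supplied function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// One input item: a value measured or computed at a 2-D grid coordinate.
// (Hypothetical type, for illustration only.)
struct Item {
    int x;
    int y;
    double value;
};

// Hypothetical mapping between coordinate systems: project a fine-grid
// coordinate onto a coarser output grid by integer division, so that each
// output cell covers factor x factor input cells. Assumes x, y >= 0.
struct CoarsenMap {
    int factor;
    int out_width;
    std::size_t operator()(const Item& it) const {
        return static_cast<std::size_t>((it.y / factor) * out_width +
                                        (it.x / factor));
    }
};

// Aggregate every input item into the output cell it maps to, combining
// values with a user-defined function.
std::vector<double> aggregate(
    const std::vector<Item>& input,
    std::size_t out_cells,
    double initial,
    const std::function<std::size_t(const Item&)>& map,
    const std::function<double(double, double)>& combine) {
    std::vector<double> out(out_cells, initial);
    for (const Item& it : input) {
        std::size_t cell = map(it);
        assert(cell < out_cells);  // the mapping must stay inside the output
        out[cell] = combine(out[cell], it.value);
    }
    return out;
}

// Example: reduce a 4x4 input grid of ones to a 2x2 grid of per-cell sums.
std::vector<double> coarsen_4x4_by_sum() {
    std::vector<Item> input;
    for (int y = 0; y < 4; ++y) {
        for (int x = 0; x < 4; ++x) {
            input.push_back(Item{x, y, 1.0});
        }
    }
    CoarsenMap map{2, 2};
    return aggregate(input, 4, 0.0, map,
                     [](double acc, double v) { return acc + v; });
}
```

In this example sixteen input items are reduced to four output values, which is the property the text relies on: when aggregation runs next to the disks, only the smaller output has to be moved.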
The Active Data Repository can be categorized as a type of database; correspondingly,
retrieval and processing operations may be thought of as queries. ADR provides support for
common operations including index generation, data retrieval, memory management,
scheduling of processing across a parallel machine and user interaction. ADR assumes a
distributed-memory (or, in database terminology, shared-nothing) architecture consisting of
one or more I/O devices attached to each of one or more processors. Datasets are
partitioned and stored on the disks. An application implemented using ADR consists of one
or more clients, front-end processes, and a parallel backend. A client program,
implemented for a specific domain, generates requests. The ADR front-end processes
translate these requests into ADR queries and perform flow control, prioritization and
scheduling of the resulting queries.
An ADR query consists of: (1) references to two datasets, A and B, (2) a range query
that specifies a particular spatial region in dataset A or dataset B, (3) a reference to a
projection function that maps elements of dataset A to dataset B, (4) a reference to an
aggregation function that describes how elements of dataset A are to be combined and
accumulated into elements of dataset B, and (5) a specification of what to do with the
output (e.g. update a currently existing dataset, send the data over the network to
another application, or create a new dataset). An ADR query also specifies a reference to an
index, which ADR uses to quickly locate the data items of interest. Before
processing a query, ADR formulates a query plan that defines the order in which the data items are
retrieved and processed, and how the output items are generated.
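The five query components and the role of the index can be sketched as follows. This is an illustrative sketch under stated assumptions, not the ADR interface: `Query`, `Chunk`, `Box`, `OutputAction`, `execute` and `run_example` are hypothetical names, the index is reduced to one bounding box per chunk of dataset A, the range query is applied to dataset A, and only the "update an existing dataset" output action is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct Point { double x; double y; };

// Axis-aligned box, used for the range query and for the per-chunk index.
struct Box {
    double xmin, ymin, xmax, ymax;
    bool contains(const Point& p) const {
        return p.x >= xmin && p.x <= xmax && p.y >= ymin && p.y <= ymax;
    }
    bool intersects(const Box& o) const {
        return xmin <= o.xmax && o.xmin <= xmax &&
               ymin <= o.ymax && o.ymin <= ymax;
    }
};

struct Element { Point pos; double value; };

// A chunk of dataset A plus the bounding box an index would record for it.
struct Chunk { Box bounds; std::vector<Element> elements; };

enum class OutputAction { UpdateExisting, SendOverNetwork, CreateNew };

// Hypothetical query descriptor mirroring the five components in the text.
struct Query {
    const std::vector<Chunk>* input;                   // (1) dataset A
    std::vector<double>* output;                       // (1) dataset B
    Box range;                                         // (2) range query on A
    std::function<std::size_t(const Point&)> project;  // (3) maps A to B
    std::function<double(double, double)> aggregate;   // (4) combine values
    OutputAction action;                               // (5) output handling
};

// Processes the query and returns how many chunks had to be read.
std::size_t execute(const Query& q) {
    assert(q.input != nullptr && q.output != nullptr);
    // Assumption: this sketch only updates an existing output in place.
    assert(q.action == OutputAction::UpdateExisting);
    std::size_t chunks_read = 0;
    for (const Chunk& c : *q.input) {
        if (!c.bounds.intersects(q.range)) continue;  // pruned via the index
        ++chunks_read;
        for (const Element& e : c.elements) {
            if (!q.range.contains(e.pos)) continue;
            std::size_t cell = q.project(e.pos);
            assert(cell < q.output->size());
            (*q.output)[cell] = q.aggregate((*q.output)[cell], e.value);
        }
    }
    return chunks_read;
}

// Example: two chunks, a range that overlaps only the first, max-aggregation
// of every selected element into a single output cell.
std::size_t run_example(std::vector<double>& out) {
    Chunk first{Box{0.0, 0.0, 1.0, 1.0}, {}};
    first.elements.push_back(Element{Point{0.2, 0.2}, 3.0});
    first.elements.push_back(Element{Point{0.8, 0.8}, 5.0});
    Chunk second{Box{2.0, 2.0, 3.0, 3.0}, {}};
    second.elements.push_back(Element{Point{2.5, 2.5}, 7.0});
    std::vector<Chunk> a;
    a.push_back(first);
    a.push_back(second);

    out.assign(1, 0.0);
    Query q{&a, &out, Box{0.0, 0.0, 1.0, 1.0},
            [](const Point&) { return std::size_t{0}; },
            [](double acc, double v) { return acc > v ? acc : v; },
            OutputAction::UpdateExisting};
    return execute(q);
}
```

A real query plan would also decide the order in which chunks are retrieved and how work is spread across processors; the sketch simply visits chunks in storage order on one processor.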
The current ADR infrastructure has targeted a small but carefully selected set of
high-end scientific and medical applications, and is currently customized for each application
through the use of C++ class inheritance. We are now working on efforts that
should lead to a significant extension of ADR functionality. Our plan is to: (1) develop
the compiler and runtime support needed to make it possible for programmers to use a
high-level representation to specify ADR datasets, queries and computations, (2) target
additional challenging data-intensive driving applications to motivate extension of ADR
functionality, and (3) generalize ADR functionality to include queries and computations
that combine datasets resident at different physical locations.
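Customization through C++ class inheritance, as mentioned above, might look roughly like the following sketch. The class and function names (`Customization`, `PairwiseMax`, `process`) are hypothetical and do not reproduce ADR's actual class hierarchy; the sketch only shows the general pattern of a framework driver calling application-supplied virtual functions for projection and aggregation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical base class: the framework calls these hooks, and each
// application supplies domain-specific behaviour by overriding them.
class Customization {
public:
    virtual ~Customization() = default;
    virtual std::size_t output_size() const = 0;  // number of output cells
    virtual double initial() const = 0;           // starting value of a cell
    virtual std::size_t project(std::size_t input_index) const = 0;
    virtual double combine(double acc, double v) const = 0;
};

// Generic driver: knows nothing about the application beyond the hooks.
std::vector<double> process(const std::vector<double>& input,
                            const Customization& app) {
    std::vector<double> out(app.output_size(), app.initial());
    for (std::size_t i = 0; i < input.size(); ++i) {
        std::size_t cell = app.project(i);
        assert(cell < out.size());  // the projection must stay in range
        out[cell] = app.combine(out[cell], input[i]);
    }
    return out;
}

// Example application: keep the maximum of each pair of adjacent samples.
class PairwiseMax : public Customization {
public:
    explicit PairwiseMax(std::size_t n_inputs) : n_out_((n_inputs + 1) / 2) {}
    std::size_t output_size() const override { return n_out_; }
    double initial() const override {
        return std::numeric_limits<double>::lowest();
    }
    std::size_t project(std::size_t i) const override { return i / 2; }
    double combine(double acc, double v) const override {
        return std::max(acc, v);
    }

private:
    std::size_t n_out_;
};
```

For example, `process(samples, PairwiseMax(samples.size()))` on five samples yields three output cells. The planned high-level representation would presumably let programmers state the projection and aggregation without writing such subclasses by hand.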
Related Information:
- Software distribution
- Publication List
- Crisis Management in Bays and Estuaries, NPACI enVision, Volume 15, Number 1, January - March 1999
- Compiling for ADR, NPACI enVision, Volume 15, Number 1, January - March 1999
- Active Data Repository Accelerates Access to Large Data Sets, NPACI enVision, Volume 14, Number 2, April - June 1998
Presentations:
- NPACI All Hands Meeting Tutorial, San Diego Supercomputer Center, San Diego, CA, Feb. 9, 2000
- Databases and Systems Software for Multi-Scale Problems, NSF/DOE Workshop on Large-Scale Visualization and Data Management, Computational Steering and Interactive Visualization, University of Utah, Salt Lake City, May 20-22, 1999
- Large Irregular Datasets and the Computational Grid, Workshop on Programming Environments, Clusters, and Computational Grids for Scientific Computing, Walland, Tennessee, September 2-4, 1998
- NPACI Programming Tools and Environments, Pasadena, California, August 1998