Active Data Repository
High Performance Multidimensional Databases

Over the next few years, increasingly large datasets will be generated by detailed three (or more) dimensional scientific simulations and by sensors attached to satellites, aircraft, telescopes, microscopes and similar instruments. These datasets describe computed or measured attributes associated with spatial regions. Portions of datasets may be rendered and directly viewed, or datasets can be used to couple applications. For instance, in one of our applications (joint work with Mary Wheeler at UT Austin), a fluids code is used to compute the circulation of water in a bay. The output of the fluids code is stored as a dataset, and portions of the dataset are postprocessed and used as inputs to a chemistry code in order to carry out disaster planning and pollution remediation scenario simulations.

A single dataset may be used either sequentially or simultaneously by a number of different applications. Applications typically do not make use of all the sensor data available in a dataset; instead, they select data associated with a spatio-temporal region. Sensor datasets frequently consist of a number of different attributes (e.g. telemetry bands in satellite data). A given application may use only particular sensor data attributes, and it may also combine existing attributes to synthesize new ones. Since the same physical entity may be described in a complementary manner by different types of sensor datasets, applications may need to generate new datasets by joining preexisting datasets.

We have developed a software tool, the Active Data Repository (ADR), that optimizes storage, retrieval and processing of very large multidimensional datasets. In ADR, datasets can be described by a multidimensional coordinate system. In some cases datasets may be viewed as structured or unstructured grids; in other cases (e.g. multiscale or multiresolution problems), datasets are hierarchical, with varying levels of coarse or fine meshes describing the same spatial region. Processing takes place in one or several steps during which new datasets are created, preexisting datasets are transformed or particular data are output. Each step of processing can be formulated by specifying mappings between dataset coordinate systems. Results are computed by aggregating, with a user-defined procedure, all the items mapped to particular sets of coordinates.

ADR is designed to make it possible to carry out data aggregation on processors that are tightly coupled to disks. Since the output of a data aggregation is typically much smaller than the input, use of ADR can significantly reduce the overhead associated with obtaining postprocessed results from large datasets. The Active Data Repository can be categorized as a type of database; correspondingly, retrieval and processing operations may be thought of as queries. ADR provides support for common operations including index generation, data retrieval, memory management, scheduling of processing across a parallel machine and user interaction.

ADR assumes a distributed memory (or, in database terminology, shared-nothing) architecture consisting of one or more I/O devices attached to each of one or more processors. Datasets are partitioned and stored on the disks. An application implemented using ADR consists of one or more clients, front-end processes, and a parallel backend. A client program, implemented for a specific domain, generates requests. ADR front-end processes translate these requests into ADR queries and perform flow control, prioritization and scheduling of the resulting queries.
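The processing model described above (map each input item into an output coordinate system, then aggregate everything that lands on the same coordinates, close to where the data is stored) can be illustrated with a small C++ sketch. Every name in it (`Item`, `Cell`, `project`, `aggregate_local`, `combine`) is hypothetical and not part of ADR; the grid binning, the choice of maximum as the user-defined aggregation, and the two-phase local/global structure are assumptions made only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <utility>
#include <vector>

// Hypothetical types for illustration; none of these names come from ADR.
struct Item {
    double x, y;   // position in the input coordinate system
    double value;  // measured or computed attribute
};

typedef std::pair<int, int> Cell;           // coordinates in the output dataset
typedef std::map<Cell, double> Accumulator; // one running aggregate per output cell

// Mapping between coordinate systems: bin each point into a coarse grid.
Cell project(const Item& item, double cell_size) {
    assert(cell_size > 0.0);
    return Cell(static_cast<int>(std::floor(item.x / cell_size)),
                static_cast<int>(std::floor(item.y / cell_size)));
}

// User-defined aggregation (assumed here): keep the maximum value per cell.
void aggregate(Accumulator& acc, const Cell& cell, double value) {
    std::pair<Accumulator::iterator, bool> result =
        acc.insert(std::make_pair(cell, value));
    if (!result.second && value > result.first->second)
        result.first->second = value;
}

// Phase 1: each node aggregates only the items held in its own partition.
Accumulator aggregate_local(const std::vector<Item>& local_items,
                            double cell_size) {
    Accumulator acc;
    for (const Item& item : local_items)
        aggregate(acc, project(item, cell_size), item.value);
    return acc;
}

// Phase 2: merge the per-node partial results into the final output.
Accumulator combine(const std::vector<Accumulator>& partials) {
    Accumulator result;
    for (const Accumulator& partial : partials)
        for (const Accumulator::value_type& entry : partial)
            aggregate(result, entry.first, entry.second);
    return result;
}
```

In this sketch only the per-node accumulators, which hold one value per output cell, would need to leave a node. That is the sense in which aggregating next to the disks can reduce data movement when the output is much smaller than the input. Merging partial results with the same function works here because maximum is associative and commutative; an aggregation without those properties would need a separate merge step.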
An ADR query consists of: (1) references to two datasets, A and B, (2) a range query that specifies a particular spatial region in dataset A or dataset B, (3) a reference to a projection function that maps elements of dataset A to dataset B, (4) a reference to an aggregation function that describes how elements of dataset A are to be combined and accumulated into elements of dataset B, and (5) a specification of what to do with the output (i.e. update a currently existing dataset, send the data over the network to another application, create a new dataset, etc.). An ADR query also specifies a reference to an index, which ADR uses to quickly locate the data items of interest. Before processing a query, ADR formulates a query plan that defines the order in which the data items are retrieved and processed, and how the output items are generated.

The current ADR infrastructure has targeted a small but carefully selected set of high-end scientific and medical applications, and is currently customized for each application through the use of C++ class inheritance. We are now working to extend ADR functionality significantly. Our plan is to: (1) develop the compiler and runtime support needed to make it possible for programmers to use a high-level representation to specify ADR datasets, queries and computations, (2) target additional challenging data-intensive driving applications to motivate extension of ADR functionality, and (3) generalize ADR functionality to include queries and computations that combine datasets resident at different physical locations.
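As a rough illustration of the five query components listed above, and of customization through C++ class inheritance, the following sketch models a query as a plain struct with abstract `Projection` and `Aggregation` base classes. All class, field and function names are hypothetical and do not reflect the actual ADR interface. The "index" is assumed to be nothing more than a bounding box per stored chunk, the range query is assumed to apply to dataset A, coordinates are assumed non-negative, and the output action is recorded but not acted on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// All names below are hypothetical and chosen for illustration only.
struct Point { double x, y; };
struct Box {
    Point lo, hi;
    bool contains(const Point& p) const {
        return p.x >= lo.x && p.x <= hi.x && p.y >= lo.y && p.y <= hi.y;
    }
    bool intersects(const Box& o) const {
        return lo.x <= o.hi.x && o.lo.x <= hi.x &&
               lo.y <= o.hi.y && o.lo.y <= hi.y;
    }
};
struct Item { Point pos; double value; };

// A chunk is the unit of storage; its bounding box serves as the index entry.
struct Chunk { Box bounds; std::vector<Item> items; };

// (3) Projection: maps an element of dataset A to a slot in dataset B.
class Projection {
public:
    virtual ~Projection() {}
    virtual std::size_t map(const Item& a) const = 0;
};

// (4) Aggregation: folds an element of A into an element of B.
class Aggregation {
public:
    virtual ~Aggregation() {}
    virtual void accumulate(double& b, const Item& a) const = 0;
};

// (5) What to do with the output (not acted on in this sketch).
enum class OutputAction { UpdateExisting, SendOverNetwork, CreateNew };

struct Query {
    const std::vector<Chunk>* input;  // (1) dataset A, chunked with bounds
    std::vector<double>* output;      // (1) dataset B
    Box region;                       // (2) range query over dataset A
    const Projection* projection;     // (3)
    const Aggregation* aggregation;   // (4)
    OutputAction action;              // (5)
};

// Application-specific customization through inheritance.
class ColumnProjection : public Projection {
public:
    explicit ColumnProjection(double width) : width_(width) {}
    std::size_t map(const Item& a) const override {
        assert(a.pos.x >= 0.0 && width_ > 0.0);  // sketch assumes x >= 0
        return static_cast<std::size_t>(a.pos.x / width_);
    }
private:
    double width_;
};

class SumAggregation : public Aggregation {
public:
    void accumulate(double& b, const Item& a) const override { b += a.value; }
};

// Processes the query and returns the number of chunks actually read.
std::size_t execute(const Query& q) {
    std::size_t chunks_read = 0;
    for (const Chunk& chunk : *q.input) {
        if (!chunk.bounds.intersects(q.region)) continue;  // skipped via index
        ++chunks_read;
        for (const Item& item : chunk.items) {
            if (!q.region.contains(item.pos)) continue;
            std::size_t slot = q.projection->map(item);
            assert(slot < q.output->size());
            q.aggregation->accumulate((*q.output)[slot], item);
        }
    }
    return chunks_read;
}
```

A caller would fill in a `Query` with pointers to a chunked input, a pre-sized output vector, a region, and instances of the derived classes, then call `execute`. The linear scan over chunk bounding boxes stands in for a real spatial index, and chunks are visited in storage order rather than according to a query plan.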