Anurag Acharya, Joel Saltz, Alan Sussman
Department of Computer Science
The goals of this project were two-fold: to understand the I/O requirements of algorithms for data product generation and to develop techniques that help meet these requirements on suitably configured machines. Towards the first goal, we have analyzed a variety of existing data product generation programs and have successfully parallelized two of them, pathfinder and climate, which generate the Level 2 and Level 3 data products for the AVHRR Land Pathfinder effort. We have also developed a parallel I/O library suitable for parallel data product generation programs. We describe our experiences in Section 2. We also present our suggestions regarding the structure of EOSDIS data product generation programs, the organization of the files used to store the data products, and the runtime support needed for effective parallelization of data product generation applications.
Based on our understanding of the I/O and processing requirements of these applications, we have developed several techniques to help meet them on suitably configured machines. These techniques deal with (1) declustering multi-dimensional datasets on large disk farms, (2) partitioning satellite data products for efficient retrieval, (3) overlapping I/O, computation and communication to perform data retrieval and processing on the same processors, and (4) interprocedural analysis to automate the placement of asynchronous I/O operations. We describe these techniques in Sections 4 and 5. Based on these techniques, we have developed Titan, a high-performance database for remote-sensing data. The computational platform for Titan is a 16-processor IBM SP-2 with four fast disks attached to each processor. Titan is currently operational and contains about 24 GB of AVHRR data from the NOAA-7 satellite. Titan supports interactive queries over its data, including full-globe queries as well as localized queries. Experimental results show that Titan provides good performance for global queries, and interactive response times for localized queries. We describe the design and evaluation of Titan in Section 3.
Based on our experience with Titan, we are currently in the process of developing an extensible framework for managing extremely large multi-dimensional datasets. We plan to implement this framework both as a stand-alone system for efficient storage, retrieval and processing of large data repositories and as a database extender that allows multi-dimensional datasets to be integrated with commercial relational databases which store other forms of data, in particular the metadata associated with the datasets.
Despite significant differences in the science algorithms and the organization of the code, all these applications have a common structure. Programs that generate Level 2 products process Level 1 files, each of which contains information from a single satellite orbit, and generate a single file that contains a composited multi-band image for the area of interest. Before composition, individual values are corrected for various distortions and are navigated to the desired projection and resolution. The composition operation is a complex max-reduction operation: the specific predicate used to determine when one pixel is preferable to another depends on the program and the dataset. The reduction operation is performed by (1) creating a temporary image; (2) processing all the inputs with a fixed chunk size; (3) correcting and navigating the IFOVs in each chunk; and (4) performing the max-reduction. Given the similarities, we selected pathfinder and climate as the prototypical programs for the generation of Level 2 and Level 3 data products.
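A minimal C sketch of this common structure is shown below. All names (better_pixel, read_chunk, correct_and_navigate) are hypothetical, and the temporary output image is assumed to fit in memory for simplicity; in the real programs the composition is out-of-core and the preference predicate is program- and dataset-specific.

#include <stddef.h>

typedef struct { float value; int x, y; } Pixel;    /* hypothetical IFOV record */

/* Hypothetical predicate: nonzero if the candidate pixel is preferable to the
   current one. The real predicate depends on the program and the dataset. */
int better_pixel(const Pixel *candidate, const Pixel *current);

/* Hypothetical routines: read the next fixed-size chunk of IFOVs from a Level 1
   orbit file, and correct/navigate one IFOV into output-image coordinates. */
size_t read_chunk(int orbit_fd, Pixel *buf, size_t max_pixels);
void correct_and_navigate(Pixel *p);

void composite_orbit(int orbit_fd, Pixel *output, int width, int height)
{
    enum { CHUNK = 4096 };
    Pixel chunk[CHUNK];
    size_t n;

    /* (2) process all the inputs with a fixed chunk size */
    while ((n = read_chunk(orbit_fd, chunk, CHUNK)) > 0) {
        for (size_t i = 0; i < n; i++) {
            /* (3) correct and navigate each IFOV in the chunk */
            correct_and_navigate(&chunk[i]);
            if (chunk[i].x < 0 || chunk[i].x >= width ||
                chunk[i].y < 0 || chunk[i].y >= height)
                continue;
            /* (4) max-reduction into the temporary image */
            Pixel *out = &output[chunk[i].y * width + chunk[i].x];
            if (better_pixel(&chunk[i], out))
                *out = chunk[i];
        }
    }
}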
We parallelized pathfinder by partitioning the output image into equal-sized horizontal strips. Each processor is responsible for all processing needed to generate its partition of the output image. We chose to partition the output image (instead of the input data) because this allows all combination operations to be local to individual processors; no inter-processor communication is needed. We chose a horizontal partitioning scheme to take advantage of the row-major storage format used in all files (input, ancillary, and output). Horizontal striping allows I/O to be performed in large contiguous blocks.
Each processor computes the map from the input data set to the output image by subsampling (one scan line per chunk) all input files. It then reads the chunks that intersect with its partition of the output image. For each chunk, it maps each input pixel into the output image. Pixels that map into its partition are processed further; others are ignored. The individual partitions of the output image are themselves too large to be stored in main memory, so the composition operation is still out-of-core. Once all processing is completed, the final result is produced by concatenating the individual partitions.
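The strip-ownership computation itself is simple. The sketch below (hypothetical names; strips are nearly equal when the image height is not divisible by the number of processors) shows the partitioning and the test each processor applies to a navigated pixel.

/* Hypothetical description of one processor's horizontal strip of the output image. */
typedef struct { int first_row, last_row; } Strip;   /* inclusive row bounds */

/* Divide output_rows rows into nprocs horizontal strips of (nearly) equal size. */
Strip strip_for(int rank, int nprocs, int output_rows)
{
    int base = output_rows / nprocs, extra = output_rows % nprocs;
    Strip s;
    s.first_row = rank * base + (rank < extra ? rank : extra);
    s.last_row  = s.first_row + base + (rank < extra ? 1 : 0) - 1;
    return s;
}

/* A pixel that navigates to output row 'row' is processed locally only if it
   falls inside this processor's strip; otherwise it is ignored. */
int owns(const Strip *s, int row)
{
    return row >= s->first_row && row <= s->last_row;
}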
In climate, the mapping between the pixels of the input image and those of the output image is data-independent and can be computed a priori. The amount of computation to be done is proportional to the amount of input data. We parallelized climate by horizontally partitioning the output image. Each processor reads the data that maps to its partition of the output image. Load balance is achieved by ensuring that all processors read approximately equal amounts of data.
The total I/O performed by pathfinder is over 28 GB and the total I/O performed by climate is 75.5 MB. The original version of pathfinder ran for 18,800 seconds on a single processor of an IBM SP-2. Of this, about 13,600 seconds (76% of the time) were spent waiting for I/O. The final version took 1200 seconds using 12 processors, of which 10-15% was spent waiting for I/O; pathfinder is now computation-bound. The maximum aggregate application-level I/O rate was 644 MB/s. For climate, the execution time was reduced from 200 seconds to 32 seconds (on eight processors), of which 4-5% was spent waiting for I/O. The maximum aggregate application-level I/O bandwidth for climate was 36 MB/s. These experiments were conducted on an IBM SP-2 configured with 16 thin nodes, two Fast/Wide SCSI adaptors per node, and three IBM Starfire 7200 disks (7 MB/s application-level I/O bandwidth) per SCSI adaptor. More details about this tuning and evaluation effort can be found in our paper titled "Tuning the Performance of I/O-intensive Parallel Applications", which appeared in the Fourth Annual Workshop on I/O in Parallel and Distributed Systems (IOPADS'96).
Titan partitions its data set into coarse-grained data blocks and uses a simplified R-tree to index these blocks. This index is stored at the front-end, which uses it to build a plan for the retrieval and processing of the required data blocks. The size of this index for 24 GB of AVHRR data is 11.6 MB, which is small enough to be held in primary memory.
Titan queries specify four constraints: (1) temporal bounds (a range in Coordinated Universal Time), (2) spatial bounds (a quadrilateral on the surface of the globe), (3) sensor type and number, and (4) resolution of the output image. The result of a query is a multi-band image. Each pixel in the result image is generated by composition over all the sensor readings for the corresponding area on the earth's surface.
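The C sketch below illustrates what such a query and an index entry for a coarse-grained data block might look like, together with an intersection test of the kind used during query planning. The type and field names are ours, not Titan's, and the query quadrilateral is approximated by its bounding box for simplicity.

/* Hypothetical query descriptor: the four constraints listed above. */
typedef struct {
    double t_begin, t_end;            /* temporal bounds (UTC) */
    double lat[4], lon[4];            /* spatial bounds: quadrilateral on the globe */
    int    sensor_type, sensor_num;   /* sensor type and number */
    int    out_width, out_height;     /* resolution of the output image */
} Query;

/* Hypothetical index entry for one coarse-grained data block. The simplified
   R-tree stores a bounding box per block plus the block's location. */
typedef struct {
    double t_begin, t_end;            /* temporal extent of the block */
    double lat_min, lat_max;          /* spatial bounding box */
    double lon_min, lon_max;
    int    node, disk;                /* back-end node and disk holding the block */
    long   offset, length;            /* position of the block on that disk */
} BlockEntry;

/* Conservative intersection test: the query quadrilateral is assumed to have been
   reduced to the bounding box (q_lat_min .. q_lon_max) before this call. */
int block_intersects(const BlockEntry *b, const Query *q,
                     double q_lat_min, double q_lat_max,
                     double q_lon_min, double q_lon_max)
{
    if (b->t_end < q->t_begin || b->t_begin > q->t_end) return 0;
    if (b->lat_max < q_lat_min || b->lat_min > q_lat_max) return 0;
    if (b->lon_max < q_lon_min || b->lon_min > q_lon_max) return 0;
    return 1;
}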
When the front-end receives a query, it searches the index for all data blocks that intersect with the query. It uses the location information for each block (which is stored in the index) to determine the set of data blocks to be retrieved by each back-end node. In addition, the front-end partitions the output image among all the back-end nodes. Currently, the output image is evenly partitioned by blocks of rows and columns, assigning each back-end node approximately the same number of output pixels. Under this partitioning scheme, data blocks residing on the disks of a node may be processed by other nodes; each back-end node processes the data blocks corresponding to its partition of the output image. The front-end distributes the data block requests and output image partitions to all back-end nodes.
Each back-end node computes a schedule for retrieving the blocks from its disks. This schedule tries to balance the needs of all nodes that will process these data blocks. As soon as a data block arrives in primary memory, it is dispatched to all nodes that will process it. Once a data block is available for processing (either retrieved from local disk or forwarded by another node), a simple quadrature scheme is used to search for sensor readings that intersect with the local partition of the output image. After all data blocks have been processed, the output image can either be returned to the front-end for forwarding to the querying client, or it can be stored in a file for later retrieval.
Data layout decisions in Titan were motivated by the format of AVHRR data and the common query patterns identified by NASA researchers and our collaborators in the University of Maryland Geography Department. We distributed the AVHRR data on a large disk farm. We used the declustering algorithms described in Section 3 to compute the data distribution.
Titan is currently operational on a 16-processor IBM SP-2 with four IBM Starfire 7200 disks attached to each processor. It contains about 24 GB of AVHRR data from the NOAA-7 satellite.
We have run a sequence of experiments on Titan to evaluate our techniques for partitioning the images into chunks, declustering the chunks over a large disk farm, and placing the chunks assigned to individual disks. Experimental results show that Titan provides good performance for global queries, and interactive response times for local queries. A global query for a 10-day composite of normalized vegetation index takes less than 100 seconds; similar queries for Australia and the United Kingdom take 4 seconds and 1.5 seconds respectively. Our data distribution techniques improved disk parallelism (the number of disks active for individual queries) by 48 to 70 percent. The total estimated retrieval time was reduced by between 8 and 33 percent. We also evaluated schemes for placement of data blocks assigned to a single disk. We found that the average length of a read (without an intervening seek) can be improved by about a factor of two. The design, implementation and evaluation of Titan are described in our paper titled "Titan: A High-Performance Remote-sensing Database", which appeared in the International Conference on Data Engineering, 1997.
Satellite data generation programs are relatively easy to parallelize: Given the common structure of different data product generation programs, we believe that the parallelization scheme described in this report should be suitable for most, if not all, data product generation programs. Since communication between peers is needed only for putting together the final output, this scheme should work as well on shared-memory machines as on distributed-memory machines.
Proper code restructuring is important:
As far as possible, I/O should be done in the outermost nest of a nested loop. Embedding I/O calls in the inner nests of a loop usually results in a sequence of small requests interleaved with seeks. It is usually possible to restructure the loop nests so that the I/O is performed in the outermost nest and only computation is done in the inner nests. This restructuring is illustrated by the following example, which is based on the composition module of pathfinder. For the applications we studied, this was not a difficult operation.
Original code:
for (i = 0; i < num_input_pixels; i++)
    for (j = 0; j < num_of_bands; j++)
        map input pixel to output image
        seek to output pixel in this band
        write pixel value for this band
Restructured code:
determine the bounding box of output pixels involved
for (j = 0; j < num_bands; j++)
    read in the bounding box for this band
    for (i = 0; i < num_input_pixels; i++)
        map input pixel to output image
        update output image pixel for this band
    write out the bounding box for this band
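For concreteness, a runnable rendering of the restructured pattern might look like the following. This is a sketch, not pathfinder's code: the band count, bounding-box size, and the helper routines map_to_box_offset and corrected_value are hypothetical, and the composition predicate is reduced to a plain maximum.

#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

enum { NUM_BANDS = 5, BOX_PIXELS = 1024 * 1024 };    /* hypothetical sizes */

/* Hypothetical helpers: map an input pixel to an offset within the bounding box
   of output pixels it touches, and produce its corrected value for a band. */
long  map_to_box_offset(long input_pixel);
short corrected_value(long input_pixel, int band);

void composite_chunk(int out_fd, long num_input_pixels, off_t box_file_offset)
{
    short *box = malloc(BOX_PIXELS * sizeof *box);

    for (int band = 0; band < NUM_BANDS; band++) {
        off_t band_offset = box_file_offset + (off_t)band * BOX_PIXELS * sizeof *box;

        /* One large read per band instead of many small seek+write pairs. */
        pread(out_fd, box, BOX_PIXELS * sizeof *box, band_offset);

        for (long i = 0; i < num_input_pixels; i++) {
            long  o = map_to_box_offset(i);
            short v = corrected_value(i, band);
            if (v > box[o])                  /* max-reduction in memory */
                box[o] = v;
        }

        /* One large write per band. */
        pwrite(out_fd, box, BOX_PIXELS * sizeof *box, band_offset);
    }
    free(box);
}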
Information about future requests is usually available: In the parallelization scheme described above, processors subsample the input files in the partitioning phase. At the end of this phase, every processor has complete information about its future requests for input reads. For the modified version of the out-of-core max-reduction (where the modification consisted of a pair of simple loop-splitting and loop-reordering transformations), information about updates to all frequency bands of the output image is known before any updates are performed.
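Because the list of future reads is known in advance, the retrieval of the next chunk can be overlapped with the processing of the current one. The following sketch uses POSIX asynchronous I/O with double buffering; the chunk size and the process_chunk routine are hypothetical, and error handling is largely omitted.

#include <aio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

enum { CHUNK_BYTES = 1 << 20 };                     /* hypothetical chunk size */

void process_chunk(const char *buf, size_t len);    /* hypothetical */

/* Read and process the chunks at the given file offsets, prefetching chunk i+1
   asynchronously while chunk i is being processed. */
void process_with_prefetch(int fd, const off_t *offsets, int nchunks)
{
    if (nchunks <= 0)
        return;

    char *buf[2] = { malloc(CHUNK_BYTES), malloc(CHUNK_BYTES) };
    struct aiocb cb[2];
    memset(cb, 0, sizeof cb);

    cb[0].aio_fildes = fd;                          /* issue the first read */
    cb[0].aio_buf    = buf[0];
    cb[0].aio_nbytes = CHUNK_BYTES;
    cb[0].aio_offset = offsets[0];
    aio_read(&cb[0]);

    for (int i = 0; i < nchunks; i++) {
        int cur = i & 1, nxt = 1 - cur;

        if (i + 1 < nchunks) {                      /* prefetch the next chunk */
            cb[nxt].aio_fildes = fd;
            cb[nxt].aio_buf    = buf[nxt];
            cb[nxt].aio_nbytes = CHUNK_BYTES;
            cb[nxt].aio_offset = offsets[i + 1];
            aio_read(&cb[nxt]);
        }

        const struct aiocb *list[1] = { &cb[cur] }; /* wait for the current chunk */
        aio_suspend(list, 1, NULL);
        if (aio_error(&cb[cur]) == 0) {
            ssize_t got = aio_return(&cb[cur]);
            if (got > 0)
                process_chunk(buf[cur], (size_t)got);
        }
    }
    free(buf[0]);
    free(buf[1]);
}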
It is possible to partition the intermediate data so that each processor reads from and writes to its own local disk(s). Bandwidths for local disk access are substantially higher than the bandwidths for non-local accesses. In addition, local accesses are guaranteed not to interfere with I/O requests from other processors. This increases the utility of the file cache and makes the overall behavior of the application more predictable.
stdio is usually not suitable for satellite data processing: Many programs use the fread/fwrite interface for I/O, which introduces an extra level of buffering and requires one extra copy. Since individual requests are usually large, the buffering performed by fread/fwrite provides no performance advantage, and the lower-level read/write interface is likely to perform better.
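As a small illustration (the request size is arbitrary), the same large band read expressed through both interfaces:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

enum { BAND_BYTES = 8 * 1024 * 1024 };   /* one large request */

/* stdio: the request may be staged through FILE's internal buffer and then
   copied into band, adding a level of buffering and an extra copy. */
size_t read_band_stdio(FILE *fp, char *band)
{
    return fread(band, 1, BAND_BYTES, fp);
}

/* read(): the data is transferred directly into the application buffer. */
ssize_t read_band_unix(int fd, char *band)
{
    return read(fd, band, BAND_BYTES);
}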
Geo-location information should be placed at the top of the file: In the absence of information that can be used to map the IFOVs contained in a file, our parallelization scheme is forced to have each processor subsample all the files. This is inefficient and limits scalability. Providing geo-location information at the beginning of each file would allow each processor to read an amount of data proportional to the number of files rather than to the total input size.
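A hypothetical header of this kind (the structure and field names below are ours, not part of any existing data format) would let a processor decide with a single small read whether a file can intersect its partition at all:

#include <unistd.h>
#include <sys/types.h>

/* Hypothetical geo-location header stored at the top of each Level 1 file. */
typedef struct {
    double lat_min, lat_max;    /* bounding box of all IFOVs in the file */
    double lon_min, lon_max;
    long   num_scan_lines;
} GeoHeader;

/* Returns 1 if the file's IFOVs can possibly fall inside the latitude band
   assigned to this processor, based only on the header at offset 0. */
int file_may_intersect(int fd, double part_lat_min, double part_lat_max)
{
    GeoHeader h;
    if (pread(fd, &h, sizeof h, 0) != (ssize_t)sizeof h)
        return 1;               /* no usable header: fall back to subsampling */
    return !(h.lat_max < part_lat_min || h.lat_min > part_lat_max);
}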
Diskful machines are important: Diskful machines (machines with local disks) allow problems to be partitioned such that most of the I/O requests are satisfied by local disks. As noted above, local disk accesses have a higher application-level bandwidth with the associated benefit of guaranteed lack of contention for the disk and the file cache. In combination with code restructuring to exploit locality, diskful machines can improve both the I/O performance and the overall execution time for out-of-core applications.
Complex I/O interfaces are not required: After code restructuring, most requests in the studied applications were large. For large requests, the choice of I/O interface is usually less important. Small strided requests were a recurrent pattern in the original versions of pathfinder and climate. However, we found that these patterns were caused by the embedding of small I/O requests in the innermost loops. Relatively straightforward loop restructuring, including loop splitting, interchanging the order of nested loops and fusing multiple requests, was sufficient to coalesce these requests into large block I/O requests. None of the applications studied required collective I/O. This is not surprising given the size of the requests after code restructuring. All of the applications are parallelized in SPMD fashion. In our earth-science applications, all processes are independent (apart from initial and possibly final synchronization). Independent I/O requests were able to utilize the servers when they would have been idle in a collective-I/O model.
Compiler-directed placement of I/O operations can eliminate I/O waiting time: Our experiments showed that complete overlap of the write operations with computation can be achieved through flow-sensitive interprocedural analysis. Note that almost no overlap would have been possible if the analysis had been restricted to individual procedures.
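The effect of such placement can be sketched with POSIX asynchronous I/O (hypothetical routines, not our compiler's output): the write is issued in one procedure and its completion is awaited much later, in another, so the intervening computation hides the write latency.

#include <aio.h>
#include <string.h>
#include <sys/types.h>

static struct aiocb band_cb;    /* request state shared across procedures */

/* Issued at the point chosen by the analysis: start the write and return. */
void start_band_write(int fd, char *band, size_t len, off_t offset)
{
    memset(&band_cb, 0, sizeof band_cb);
    band_cb.aio_fildes = fd;
    band_cb.aio_buf    = band;
    band_cb.aio_nbytes = len;
    band_cb.aio_offset = offset;
    aio_write(&band_cb);
}

/* Called from a different procedure, just before the buffer is reused:
   only here does the program wait for the write to complete. */
void finish_band_write(void)
{
    const struct aiocb *list[1] = { &band_cb };
    aio_suspend(list, 1, NULL);
    aio_return(&band_cb);
}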
The declustering algorithm mentioned above scales well: As the number of disks is increased, it consistently achieves a better response time than all the other algorithms (with a few exceptions when the number of disks is small). It also achieves perfect data balance and maximizes disk space utilization. Furthermore, it rarely maps buckets that are close in the data space to the same disk, indicating that the distributions it generates are probably quite close to the optimal distribution.