Notes from the 5/13/96 and 5/14/96 SIO Technical Meeting, "High-level" API Breakout Session

Jeff Hollingsworth
Manuel Ujaldon
Alan Sussman

Summary: The starting point for the discussion was the use of MPI-IO as the SIO project's "high-level" API. There seemed to be agreement that MPI-IO was fine for MPI programs, but there was discussion about whether it was appropriate for higher-level libraries and for non-MPI programs. The general conclusion at the end of the meeting was that, with the addition of several wrappers and helper functions, MPI-IO was a reasonable and technically feasible choice for the SIO high-level API.

The majority of these notes are structured as a series of issues, each with the discussion and the conclusion reached (if any). A set of milestones appears at the end of the document.

Issue: The MPI-IO interface is too complicated for non-MPI programmers.

Discussion: Some application programmers reported how hard the current version of MPI-IO was to use compared to other established interfaces. Two separate examples were addressed:

  1. Compared to the UNIX I/O interface, some common calls take an excessive number of parameters. MPIO_Open, for example, has 8 parameters, which is too complex for a non-MPI programmer (it was also noted that this may be too complex for some MPI programmers). Two options were discussed for providing a simpler interface for the "common" case while preserving the full feature set. The first option is to provide a wrapper for MPIO_Open that defaults the communicator to MPI_COMM_WORLD, the etype and filetype to MPI_BYTE, the disp to 0, and the hints to none; a sketch of this wrapper appears after this list. This would reduce the number of parameters on open from 8 to 3 in the simple case. The suggestion was made that MPIO_Open be the simplified three-parameter version and that the current MPIO_Open be renamed MPIO_Openx. The second option discussed was to simplify MPIO_Open and then use an extended version of MPIO_file_control.

  2. The representative from Cray reported that a quick check of the MPI-IO and MPI documents revealed that only 7 MPI-like calls would be required to provide a minimal standalone interface for non-MPI programs. These calls are required to provide the MPI communicator and type construction functionality.
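
As an illustration of the first option, here is a minimal sketch of the proposed three-parameter MPIO_Open implemented as a wrapper around the full call (renamed MPIO_Openx, as suggested above). The parameter order of MPIO_Openx and the MPIO_File and MPIO_Offset types are assumptions made for the sketch, not the draft's actual prototypes:

    #include <stddef.h>
    #include <mpi.h>

    typedef int  MPIO_File;     /* placeholder for the draft's file-handle type */
    typedef long MPIO_Offset;   /* placeholder for the draft's offset type      */

    /* Full open, renamed MPIO_Openx as suggested above; the parameter
       order here is assumed, not taken from the draft. */
    int MPIO_Openx(MPI_Comm comm, const char *filename, int amode,
                   MPI_Datatype etype, MPI_Datatype filetype,
                   MPIO_Offset disp, void *hints, MPIO_File *fh);

    /* Simplified three-parameter open: fill in the defaults listed above. */
    int MPIO_Open(const char *filename, int amode, MPIO_File *fh)
    {
        return MPIO_Openx(MPI_COMM_WORLD,  /* default communicator  */
                          filename, amode,
                          MPI_BYTE,        /* default etype         */
                          MPI_BYTE,        /* default filetype      */
                          0,               /* default displacement  */
                          NULL,            /* no hints              */
                          fh);
    }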

Conclusion: There was agreement that providing two interfaces, a simple one for common cases and a more complete one for complex I/O applications, would simplify programming for MPI and non-MPI programmers alike.

Issue: The relationship between MPI and MPI-IO.

Discussion: It was emphasized that the current MPI-IO proposal is the starting point for including I/O in the MPI standard. However, being the starting point guarantees nothing: PVM was the starting point for MPI, and in the end nothing was taken from PVM. Moreover, MPI-IO has that name because of the analogy between I/O and message passing, NOT because it has a direct relationship with, or requires the use of, MPI.

It was also pointed out that I/O is more complex than message passing, mainly because of the persistence of files, and so may lead to a more difficult standardization process.

There was some consensus that the less dependent the interface is on MPI, the more acceptable MPI-IO will be. To achieve this goal, several actions were suggested:

  1. Make the MPI and MPI-IO documents fully independent: Include in the MPI-IO document text about MPI communicators and data types, perhaps inserted directly from the final MPI-1 document.

  2. MPI-IO requires MPI datatypes to describe the data distribution in files and user buffers. If MPI-IO is to be independent of MPI, it would need its own way to describe such distributions, perhaps with some simple calls for common datatypes (or a library of datatypes for common patterns); a sketch of what one such library entry might look like appears after this list.

  3. Make MPI-IO calls more friendly to non-MPI users, by providing wrappers and/or convenience functions for simple operations.
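
To make the role of datatypes concrete, the following sketch uses standard MPI-1 calls to build a derived datatype describing one column of a row-major matrix of doubles, the kind of common pattern a convenience library could package for non-MPI users (the helper name is illustrative only):

    #include <mpi.h>

    /* Describe one column of an nrows x ncols row-major matrix of doubles:
       nrows blocks of one double each, separated by a stride of ncols
       doubles.  A small library of helpers like this is what item 2 above
       has in mind. */
    static MPI_Datatype make_column_type(int nrows, int ncols)
    {
        MPI_Datatype column;
        MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);
        return column;
    }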

Conclusion: It remains to be seen how much of MPI-IO will be accepted and included in MPI, and in what form. The next MPI meeting will not address the details of MPI-IO, and there is still plenty of time to modify MPI-IO so that the two efforts can converge on a standard I/O interface.

Issue: Do non-overlapping I/O write requests from different processes result in the file being written in a sequentially consistent manner?

Discussion: There are two cases to consider. First, both processes are members of the same MPI_COMM_WORLD. In this case, there was a strong feeling that consistency must be provided. The second case is when the two processes are not members of the same MPI_COMM_WORLD. This case is similar to two sequential processes opening the same file. Although sequential consistency was deemed useful here, it was felt that requiring it might be too restrictive for implementors.

Conclusion: The MPI-IO document should be revised to require sequential consistency for non-overlapping I/O requests from processes of the same MPI_COMM_WORLD.

Issue: How is data read or written by programs that do not use MPI-IO?

Discussion: This point was discussed extensively. Most people felt that files written by MPI-IO had to be readable by non-MPI applications. The example given was a parallel program that generates time-step data that would later be visualized by a sequential process on another machine. Since MPI-IO is a specification for an API to provide persistent storage, rather than a filesystem or file-format specification, it is not possible to require that a file written with MPI-IO map one-to-one to a single POSIX file. The counterexamples given included networks of workstations that implement MPI-IO as a collection of UNIX files, and the need to store meta-data in the POSIX file that is not visible in the MPI-IO file.

If the implementation of MPI-IO is a single POSIX-visible file, then implementors should be encouraged to ensure that the linear byte stream visible from MPI-IO (i.e., a single process reading the entire file as MPI_BYTE) is the same as the byte ordering seen by POSIX calls.

Simple programs to convert data into and out of MPI-IO could be written using MPI-IO calls and POSIX I/O calls; a sketch of such a converter appears below. For systems where MPI-IO files are not directly readable with POSIX calls, implementors should provide standalone conversion routines. It should also be possible to perform such conversions on the fly (from a runtime library, within an application) to avoid space problems when transforming huge files.
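
The following is a minimal sketch of such a standalone converter, copying an MPI-IO file to a plain POSIX file by reading it as a linear stream of MPI_BYTE. The MPIO_* names, types, constants, and call signatures are placeholders assumed for the sketch (the three-parameter open follows the simplification proposed earlier); only the POSIX calls and the structure of the copy loop are meant literally:

    #include <fcntl.h>
    #include <unistd.h>
    #include <mpi.h>

    /* Placeholder declarations; the real names and signatures are whatever
       the final interface defines. */
    typedef int MPIO_File;
    #define MPIO_RDONLY 0    /* placeholder access-mode constant */
    int  MPIO_Open(const char *filename, int amode, MPIO_File *fh);
    long MPIO_Read(MPIO_File fh, void *buf, long count, MPI_Datatype etype);
    int  MPIO_Close(MPIO_File *fh);

    int convert_to_posix(const char *mpiio_name, const char *posix_name)
    {
        MPIO_File fh;
        char buf[65536];
        long nread;
        int fd;

        MPIO_Open(mpiio_name, MPIO_RDONLY, &fh);
        fd = open(posix_name, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        /* Read the whole file as a linear stream of MPI_BYTE and copy it. */
        while ((nread = MPIO_Read(fh, buf, (long) sizeof buf, MPI_BYTE)) > 0)
            write(fd, buf, nread);

        MPIO_Close(&fh);
        close(fd);
        return 0;
    }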

There was also considerable discussion about the need to provide interoperability between different MPI-IO implementations on the same machine. It was concluded that it would be impossible to fully and portably define the meta-data required to describe how MPI-IO files would be stored by all implementations. Instead, it was felt that "conversion filters" could be provided as a value-added feature by specific implementations (the analogy given was that different word processors provide import filters for competitors' products on PCs). To facilitate this, it was felt that MPI-IO should provide a function that returns a unique implementation and version string (e.g., NASA_REF_0.5.1). There was some discussion about how this string could be based on a similar proposal in MPI-2 for returning a version number. In addition, the MPIO_Open call could be passed the version string to indicate that the file being opened is in a different format; the syntax for this is already defined in MPI-IO as a prefix of the form "<version>:/", as sketched below. Any implementation is free to return failure if it is unable to open the file in the passed format.
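
A hedged illustration of the prefix convention follows; MPIO_Open (in its proposed three-parameter form), MPIO_RDONLY, MPIO_SUCCESS, and the MPIO_File type are placeholder names, the file name is made up, and only the version string comes from the text above:

    #include <stdio.h>

    char name[1024];
    MPIO_File fh;                       /* placeholder handle type */

    /* Ask this implementation to open a file written by another one,
       using the "<version>:/" filename prefix. */
    sprintf(name, "%s:/%s", "NASA_REF_0.5.1", "timestep.dat");
    if (MPIO_Open(name, MPIO_RDONLY, &fh) != MPIO_SUCCESS) {
        /* this implementation cannot read the NASA_REF_0.5.1 format */
    }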

Conclusion: MPI-IO files that map one-to-one to a single POSIX-visible file should be compatible with the POSIX byte ordering. Implementations whose files are not directly readable by POSIX calls should provide standalone conversion routines. To permit interoperability, MPI-IO should provide a way to return a unique vendor/version string. The "prefix" syntax for MPIO_Open filenames can be used to provide access to vendor-supplied conversion routines (if implemented).

Issue: Conversion of types between heterogeneous machines.

Discussion: Automatic conversion of types would be nice. For example, a file written using one byte ordering and floating point format could be automatically converted to another format. It was felt that providing this feature would be difficult. It could also have serious performance implications. However, the rich type system that MPI-IO inherits from MPI could be used as a basis to provide the required information to make such a conversion feasible.

Conclusion: This issue will not be addressed now, but we should be aware of it and avoid doing anything that would preclude extensions to permit automatic data type conversion.

Issue: File block pre-allocation.

Discussion: File block pre-allocation is a critical feature for applications that want to ensure disk space will be available before they execute a long computation. After looking at the MPI-IO document, it was agreed that MPIO_Resize can be used for file block pre-allocation. However, the current document does not state this clearly. It was also noted that on systems without native support for file block pre-allocation, the operation might be very slow, since the implementation would have to write zeros to "pre-allocate" the space (a sketch of such a fallback appears below). However, if users really want to ensure that there will be space for their data, there is no alternative. A note to implementors and users should indicate that this feature may be slow, and an implementation's release notes should say whether pre-allocation is slow.
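
For concreteness, here is what such a zero-writing fallback might look like in plain POSIX terms on a system with no native pre-allocation support (the function name is illustrative); every block of the requested region is materialized by writing zeros, which is exactly why the operation can be slow:

    #include <string.h>
    #include <unistd.h>

    /* Fallback pre-allocation: write zeros out to the requested size so
       the blocks are actually reserved (truncating to the new length alone
       need not allocate them).  Returns -1 if the space is not available. */
    static int preallocate(int fd, long nbytes)
    {
        char zeros[8192];
        long done = 0;

        memset(zeros, 0, sizeof zeros);
        while (done < nbytes) {
            long chunk = nbytes - done;
            if (chunk > (long) sizeof zeros)
                chunk = (long) sizeof zeros;
            if (write(fd, zeros, chunk) != chunk)
                return -1;          /* out of space or I/O error */
            done += chunk;
        }
        return 0;
    }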

Conclusion: The MPI-IO document should be revised to reflect that MPIO_Resize can be used for pre-allocation and that there are potential performance implications.

Issue: Dynamic process and thread creation for non-MPI programmers.

Discussion: Communicators are used in MPI-IO both to specify which processes will participate in collective I/O and to support shared file pointers. As currently defined in MPI-IO, when a communicator is passed to an MPIO_Open call, the members of that communicator are "copied" into the file handle. Hence the processes that will participate in collective and shared operations are bound for the lifetime of the open file. However, if a program is not using MPI, then communicators are not well defined. It was suggested that one way to deal with this problem is to let a single process create a "communicator" with a specific number of participants associated with it. Different processes in the application could then call MPIO_Open with this new "communicator" the specified number of times; how the "communicator" would be passed among the processes would be left to the application (a sketch of this idea appears below). This approach is very similar to the low-level API's notion of collective communication. The primary difference is that the "communicator" lifetime would be the duration of the open file, not a single collective I/O operation.
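
A purely hypothetical sketch of this suggestion follows; none of the names below exist in the current draft, and all types and signatures are invented for illustration:

    /* One process declares a pseudo-communicator with a fixed number of
       participants.  How its handle (or a name for it) reaches the other
       processes is left to the application. */
    MPIO_PseudoComm pc;
    MPIO_PseudoComm_create("myjob-io-group", 4 /* participants */, &pc);

    /* Each of the 4 participating processes then opens the file with it.
       The pseudo-communicator is bound to the resulting file handle for
       the lifetime of the open file, just as an MPI communicator would be. */
    MPIO_File fh;
    MPIO_Open_group(pc, "results.dat", MPIO_RDWR, &fh);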

Conclusion: Using this extended "communicator" abstraction would be useful for non-MPI programs, and could be useful for MPI programs too.

Issue: MPI-IO for shared-memory and multi-threaded systems.

Discussion: Saying that MPI-IO calls must be thread-safe is not enough; a clean mechanism for specifying the participants in a collective I/O operation must be designed. The issue comes down to which objects can participate: threads or processes. One suggestion was that whatever entities participate in the MPIO_Open, whether threads or processes, each receive a file handle, and those file handles specify the participants in collective operations. At least one of the participants may be required to specify how many callers there will be to MPIO_Open, since open is a collective operation and it may otherwise be difficult to determine when the open completes.

Conclusion: There was consensus that MPI-IO cannot be successful if it does not efficiently handle multiple threads in a single address space. It does not appear difficult to specify a clean syntax and semantics for MPI-IO to work in a multi-threaded environment, but this issue must be addressed in the final interface.

Issue: Semantics of blocking during collective I/O operations.

Discussion: First, it was noted that collective I/O operations (with the exception of those using shared file pointers) do not imply synchronization, and that a collective I/O operation may return in one process before the other processes executing the same collective operation have completed it. However, there was some interest in a function that would let a process find out whether a specific collective I/O operation had been completed by all processes.

Conclusion: No consensus was reached on this point.

Issue: Not all file modes are likely to be used equally, and it would be helpful to know which ones are most common in order to permit optimization.

Discussion: MPI-IO file access modes can be viewed as a 2x3x2 space (blocking/non-blocking, explicit offset/independent file pointer/shared file pointer, collective/non-collective). Not all of these 12 combinations will be used equally, and it would be useful to know which modes are "popular". It was suggested that the "fast" modes would become the popular ones. Others felt that it was too early to identify "popular" modes.

Conclusion: It would be nice to identify what the "popular" modes are, but it is too early to do this. Release notes should document any modes that are known to be particularly slow.

Issue: Inquiry functions and hints are not as complete as desired.

Discussion: Additional hints and inquiry functions would be helpful for creating portable, high-performance applications. The concern is that, without additional hints being specified, each vendor would create its own incompatible hints. The SIO low-level API group has been working on additional hints and inquiry functions. Since we would like the two APIs to remain as compatible as possible, it is desirable to find out what the low-level API group has defined and try to incorporate those definitions into the high-level API.

Conclusion: It is not possible to get the low-level group's revised hints/inquiry functions and incorporate them in the MPI-IO document by the June MPI meeting, but they should be available soon after that meeting.

Issue: Implementing MPI-IO on top of the SIO low-level API.

Discussion: Everyone agreed this was a good idea. No one has volunteered to implement MPI-IO, with the SIO extensions, on top of the SIO low-level API (but there are rumors that Dan Reed's group at Illinois might do it).

Conclusion: No definitive answer was reached.

Milestones

June '96: The MPI Forum begins its parallel I/O discussion; a modified MPI-IO draft is provided as the starting point.

Sept '96: Maryland will develop MPI-IO versions of several I/O kernels.

Nov '96: A draft SIO high-level API (compatible with the I/O chapter of MPI at that time), including hints and inquiry functions and features for non-MPI programmers, will be ready for discussion at the Supercomputing '96 roundtable.

Nov '96: Have an application (to be named later) running on the NASA reference implementation of MPI-IO on a Convex/HP Exemplar.

April '97: NASA reference implementation running on IBM SP-2, Intel Paragon, SGI Power Challenge, Cray J-90. HP will provide a benchmark application to be run on top of this, and Illinois will provide more applications.

Oct '97: Final SIO API document based on experience with applications using the reference implementations.
