Next: Estimation of Performance Cost Up: PetaSIM - A Performance Previous: Introduction and Motivation

Operation of PetaSIM

PetaSIM defines a general framework in which the user specifies the computer and problem architectures and the primitive costs of I/O, communication and computation. The computer and problem can in principle be expressed at any level of granularity; typically the problem is divided into aggregates that fit into the lowest interesting level of the memory hierarchy exposed in the user-specified computer model. The user is responsible for deciding on the "lowest interesting level", and the same problem/machine mapping can be studied at different levels, depending on which parts of the memory are exposed for user (PetaSIM) control and which (lower) parts are assumed to be under automatic (cache) machine control. The computer and problem can both be described hierarchically, and PetaSIM will support both numeric and data-intensive applications. Further, both distributed and shared memory architectures, and various combinations of those architectures, can be modeled.
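As an illustration of what a hierarchical computer description might look like, the following minimal sketch expands a list of exposed memory levels into individual memory units, each labeled by its level and a multi-dimensional position. The names `MemoryLevel`, `expand_cells`, `units_per_parent` and the three-level example machine are hypothetical and are not part of PetaSIM's actual input format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MemoryLevel:
    """One exposed level of the memory hierarchy (hypothetical format)."""
    name: str
    units_per_parent: int  # units of this level under each unit of the level above
    has_cpu: bool = False  # True if a CPU is attached to units at this level


def expand_cells(levels: List[MemoryLevel]) -> List[Tuple[str, Tuple[int, ...], bool]]:
    """Return one (level name, position index, has_cpu) triple per memory unit."""
    cells = []
    parents = [()]  # position indices of the units one level up
    for level in levels:
        children = [parent + (i,)
                    for parent in parents
                    for i in range(level.units_per_parent)]
        cells.extend((level.name, pos, level.has_cpu) for pos in children)
        parents = children
    return cells


# Assumed example: one disk, four node memories, two CPU-attached local
# memories per node. The lowest exposed level is the one carrying CPUs.
machine = [
    MemoryLevel("disk", 1),
    MemoryLevel("node_memory", 4),
    MemoryLevel("local_memory", 2, has_cpu=True),
]
cells = expand_cells(machine)
cpu_cells = [c for c in cells if c[2]]
print(len(cells), len(cpu_cells))  # prints "13 8"
```

Studying the same machine at a coarser level would amount to dropping the last entry of `machine` and attaching the CPUs to `node_memory` instead.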

We assume the HLAM model shown in Figure 2, where an application is viewed as a meta-problem that is first divided into modules that are task parallel. Each module is either sequential or data parallel and is further divided into aggregates defined by the memory structure of the target machine. The multi-level module structure in HLAM will be treated conventionally as an event-driven simulation in PetaSIM. However, the large-scale data parallel parts (collections of aggregates) will be treated differently, by a novel mechanism that exploits the loosely synchronous SPMD structure of this part of the problem.
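The decomposition just described (meta-problem, task-parallel modules, aggregates) could be represented roughly as follows. This is only a sketch of the data structure; the class and field names (`MetaProblem`, `Module`, `Aggregate`, `size_bytes`) and the example application are assumptions made for illustration, not HLAM or PetaSIM definitions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ModuleKind(Enum):
    SEQUENTIAL = "sequential"
    DATA_PARALLEL = "data_parallel"


@dataclass
class Aggregate:
    """Smallest unit of the problem, e.g. a block of grid points."""
    ident: int
    size_bytes: int  # assumed attribute, used to size messages


@dataclass
class Module:
    """A task-parallel module; either sequential or data parallel."""
    name: str
    kind: ModuleKind
    aggregates: List[Aggregate] = field(default_factory=list)


@dataclass
class MetaProblem:
    """Top-level application viewed as a collection of modules."""
    modules: List[Module]

    def data_parallel_modules(self) -> List[Module]:
        # These are the parts that would get the special loosely
        # synchronous treatment rather than plain event-driven simulation.
        return [m for m in self.modules if m.kind is ModuleKind.DATA_PARALLEL]


# Hypothetical application: a sequential setup module and a data-parallel solver.
app = MetaProblem(modules=[
    Module("setup", ModuleKind.SEQUENTIAL, [Aggregate(0, 1024)]),
    Module("solver", ModuleKind.DATA_PARALLEL,
           [Aggregate(i, 4096) for i in range(8)]),
])
print([m.name for m in app.data_parallel_modules()])  # prints "['solver']"
```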

Synchronous and loosely synchronous data parallel modules (and the typically easier embarrassingly parallel ones) will be supported. As shown in Figure 2, these are divided into phases; in each phase all aggregates are computed, with synchronization (typically I/O and/or interprocessor communication) at the end of each phase. Modules must have a fixed distribution into aggregates during each phase but can be redistributed at phase boundaries. Aggregates are the smallest units into which we divide the problem and could be, for example, a block of grid points or particles, depending on the application.

The basic idea in PetaSIM is to use a sophisticated spreadsheet. Each cell of the spreadsheet represents a memory unit of the computer, and the cells are generated from the hierarchical machine description described in Section 2.1. For each phase of the module simulation, the user provides a strategy, called the execution script in Figure 1, that moves the module aggregates through the cells. The simulator accumulates the corresponding communication and computation costs. Some special cells (e.g. those corresponding to the lowest memory hierarchy level exposed) have CPUs attached to them, and aggregates are computed when they land on a CPU cell. The phase terminates when all aggregates have been computed. Note that a "real computer" would determine this cell stepping strategy from the software supplied by the user and the action of the hardware. The simulator will supply a framework, but the user is responsible for understanding both the problem and the machine well enough to supply this cell stepping execution script. We expect to gradually automate parts of this process in future activities.

In general, there are many more memory cells than aggregate objects. Usually the number of module aggregates is greater than or equal to the number of cells with CPUs. A distributed memory machine is just a special case with no memory hierarchy at each node. For that configuration, the numbers of cells, CPUs and module aggregates are all equal, and the cell stepping strategy involves one step in which all objects are computed. In this case, PetaSIM will reproduce results like the simple "back of the envelope" results quoted earlier.
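The cell stepping idea can be sketched as a small loop over a user-supplied execution script. Everything here is a hypothetical illustration rather than PetaSIM's actual algorithm or interface: the function name `simulate_phase`, the representation of a script as a list of steps, and in particular the accumulation rule, which assumes that moves within a step run concurrently (so a step costs as much as its slowest move) and that steps run one after another.

```python
from typing import Callable, List, Set, Tuple

Move = Tuple[int, str, str]  # (aggregate id, source cell, destination cell)


def simulate_phase(script: List[List[Move]],
                   cpu_cells: Set[str],
                   comm_cost: Callable[[str, str], float],
                   compute_cost: Callable[[int], float],
                   num_aggregates: int) -> float:
    """Accumulate the cost of one phase under the assumptions stated above."""
    total = 0.0
    computed = set()
    for step in script:
        step_costs = []
        for agg, src, dst in step:
            # Staying in place costs nothing; otherwise charge the user's estimate.
            cost = comm_cost(src, dst) if src != dst else 0.0
            if dst in cpu_cells:
                # An aggregate is computed when it lands on a CPU cell.
                cost += compute_cost(agg)
                computed.add(agg)
            step_costs.append(cost)
        total += max(step_costs, default=0.0)  # assumed: moves in a step overlap
    if len(computed) != num_aggregates:
        raise ValueError("phase ended before every aggregate was computed")
    return total


# Distributed memory special case: 4 cells, 4 CPUs, 4 aggregates, one step.
flat_script = [[(i, f"cpu{i}", f"cpu{i}") for i in range(4)]]
flat_time = simulate_phase(flat_script, {f"cpu{i}" for i in range(4)},
                           lambda s, d: 0.5, lambda a: 2.0, 4)

# Two CPU cells fed from one shared memory cell: 4 aggregates need two steps.
staged_script = [[(0, "mem", "cpu0"), (1, "mem", "cpu1")],
                 [(2, "mem", "cpu0"), (3, "mem", "cpu1")]]
staged_time = simulate_phase(staged_script, {"cpu0", "cpu1"},
                             lambda s, d: 0.5, lambda a: 2.0, 4)
print(flat_time, staged_time)  # prints "2.0 5.0"
```

The constant cost functions are placeholders; in practice they would be the user's estimates for the machine and problem under study.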

We will provide both C and Java versions of the simulator, whereas the user interface will be developed as a Java applet. The visualization of the results will use a set of Java applets based on extensions to NPAC's current Java interface to the Pablo performance monitoring system from the University of Illinois [18].

It is possible to use any level of fidelity, but the user is responsible for estimating communication costs and the compute cost when an aggregate arrives at a CPU cell. These estimates require determining the effects of lower memory levels in the hierarchy, which are not modeled directly. As described earlier and illustrated in Figure 2, modules and computers are both specified as a set of units, each labeled by a hierarchy level and a position (a one- or multi-dimensional index) within that level. The user must specify the linkage between these units with an appropriate associated function (or simply a weight) that calculates the communication performance between units in the computer model and the message traffic required by the problem model. Section 2.4 gives more details on the estimation of performance cost functions for the different components needed by PetaSIM. The initial system will support "flat" problems, but general hierarchies for problems and computers are essential and will be implemented in the second year of the project.
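A linkage between units that carries either a function or a simple weight might be sketched as below. The latency-plus-bandwidth form, the reading of a bare weight as seconds per byte, and the names `latency_bandwidth` and `link_time` are all assumptions chosen for illustration; the numeric values are arbitrary and not measurements of any machine.

```python
from typing import Callable, Dict, Tuple, Union

Unit = Tuple[int, Tuple[int, ...]]           # (hierarchy level, position index)
LinkCost = Union[float, Callable[[int], float]]


def latency_bandwidth(latency_s: float,
                      bandwidth_bytes_per_s: float) -> Callable[[int], float]:
    """Build an assumed cost function: fixed latency plus size over bandwidth."""
    def cost(message_bytes: int) -> float:
        return latency_s + message_bytes / bandwidth_bytes_per_s
    return cost


def link_time(links: Dict[Tuple[Unit, Unit], LinkCost],
              src: Unit, dst: Unit, message_bytes: int) -> float:
    """Estimated transfer time over the user-specified link from src to dst."""
    link = links[(src, dst)]  # raises KeyError if the user declared no linkage
    if callable(link):
        return link(message_bytes)
    return link * message_bytes  # assumed: a bare weight means seconds per byte


node0: Unit = (1, (0,))
node1: Unit = (1, (1,))
disk: Unit = (0, (0,))

links: Dict[Tuple[Unit, Unit], LinkCost] = {
    (node0, node1): latency_bandwidth(0.5, 4.0),  # function-valued link
    (disk, node0): 0.25,                          # weight-valued link
}
print(link_time(links, node0, node1, 8))  # prints "2.5"
print(link_time(links, disk, node0, 8))   # prints "2.0"
```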



Wes Stevens
Fri Jul 11 15:07:44 EDT 1997