HERMES: A Heterogeneous Reasoning and Mediator System



2.4. The Mediator Programming Environment (MPE)

This section discusses the three toolkits that can be used to assist the mediator author in constructing a mediator. The first of these is used in the domain integration phase, while the latter two are for use in the actual writing of mediating rules.

2.4.1. The Domain Integration Toolkit

Domain integration enables a mediator to execute the functions defined in the domain being integrated, and to convert the output of those functions into a form intelligible to the mediator compiler.

The various steps involved in executing a function call of the form d:f(a1,...,an) are described below.

(Domain Opening) When executing the above call, the software package named d must first be opened. If, for example, this software package is PARADOX, then it may be opened through a C programming interface. If no C programming interface is available, a shell script or a batch file may be used instead. In general, the C programming interface is preferable because it is faster. A shell script, on the other hand, involves less coding and may suffice if the domain in question is seldom used. The domain integration toolkit makes both options available to the mediator author. When the mediator author interacts with the toolkit, directions are given on where the code should be placed, whether it accesses the C programming interface or the shell script.
(Domain Execution) The next step involves developing facilities to handle output from run-time calls to different domain functions. Specifically, this involves creating the data structures in the output-format file referred to above. Recall that a call of the form d:f(a1,...,an) returns a set of objects from the data source. The type of this set can be determined at compile-time by examining the output-format file of the function f. There are two ways in which this set of objects may be stored and used by the mediator. The current implementation simply places them in a flat ASCII file. The output format of the function f declared in the domain catalog for the domain d is used to parse this flat file. Thus, for instance, if the output of a function such as range in the spatial domain is a set of points, then the output file may be a flat ASCII file containing:
31 5 27 11 19 26
Assuming that each point is a pair representing an x and a y coordinate, this file can be parsed to obtain 3 points -- (31,5), (27,11) and (19,26). It is the responsibility of the person performing the domain integration to implement the necessary parse routines, and he/she will be appropriately instructed by the domain integration toolkit.
An alternative to the file approach, currently under development, would maintain the set of solutions in a dynamic data structure in main memory. For example, we may use a simple linked list, with insertion and deletion algorithms to manage it. Each node in the linked list contains a record whose form corresponds to the output format specification for d:f (as specified in the output format file described in Section 2.3.2), and a pointer to the next element in the linked list. The insertion algorithms themselves would share a common skeleton that is independent of what d and f are. However, the node structure would differ because the first field of a node would be a pointer to a structure of type d:f. Writing the insertion routines requires some work on the part of the programmer, but it should involve only minor modifications to skeleton code (e.g. replacing calls to d0:f0, say, by calls to d:f).
The main advantage of the file approach is that the results of a function call can easily be stored on disk. Thus, the results of domain computations can be cached (either in core or on disk). When domain functions return large amounts of data, or when their computation is time-intensive, caching can yield substantial savings by avoiding recomputation. On the other hand, storing results to files itself incurs considerable overhead.
(Domain Completion) The third step involves closing the software package once the computation of the function calls is complete. This requires writing code for detecting when an operation has completed its execution, and can be done only on a case-by-case basis. The mediator author is responsible for writing this code.
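
The three phases above can be illustrated with a minimal sketch in Python. It is not the HERMES implementation: the function names call_domain, parse_points, open_spatial, run_range and close_spatial are hypothetical, and the domain itself is replaced by a stand-in that returns the flat ASCII text from the example. The sketch assumes that the output format of the function is "set of (x, y) points".

```python
from typing import Callable, List, Tuple

Point = Tuple[int, int]


def parse_points(text: str) -> List[Point]:
    """Parse a flat ASCII result whose declared output format is a set of (x, y) points."""
    values = [int(token) for token in text.split()]
    if len(values) % 2 != 0:
        raise ValueError("odd number of coordinates in point file")
    return [(values[i], values[i + 1]) for i in range(0, len(values), 2)]


def call_domain(
    open_domain: Callable[[], object],
    execute: Callable[[object], str],
    close_domain: Callable[[object], None],
    parse: Callable[[str], list],
) -> list:
    """Run one call d:f(...) through opening, execution and completion."""
    handle = open_domain()                 # domain opening
    try:
        return parse(execute(handle))      # domain execution, then parsing
    finally:
        close_domain(handle)               # domain completion, even on error


# Hypothetical stand-ins for a spatial domain; a real integration would
# call a C interface or a shell script here.
events: List[str] = []


def open_spatial() -> object:
    events.append("open")
    return "spatial"


def run_range(handle: object) -> str:
    events.append("execute")
    return "31 5 27 11 19 26\n"


def close_spatial(handle: object) -> None:
    events.append("close")


points = call_domain(open_spatial, run_range, close_spatial, parse_points)
```

Placing the close call in a finally clause is a design choice of this sketch: it guarantees that the package is closed even if the parse routine rejects the output.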

The mediator programming environment (MPE) features a graphical interface to assist in the following five-step systematic integration of domains.

First the name of the domain being integrated is requested by the MPE in order to create a catalog file for this domain.
It then asks for a list of functions in this domain that will be accessed by the mediator.
For each function in the above list, the mediator author specifies the input and output format. This specification is written into the catalog file for the domain as described in the domain execution phase.
The mediator author then implements the code necessary for performing the domain opening, execution, and completion phases.
For each body of data accessed within a domain, the mediator author is asked whether the data contains time-stamp information and, if necessary, is asked to implement the code that converts the time-stamp in the domain to a time-stamp consistent with the standard clock maintained by HERMES. For each body of data, the toolkit automatically generates a data() declaration as described earlier. Note that this declaration also contains a time-stamp obtained from HERMES.

2.4.2. The Conflict Resolution (CR) Toolkit

The purpose of the conflict resolution toolkit is to assist the mediator author in dealing with semantic conflicts that might arise during semantic integration. A menu of commonly used strategies is available to the mediator author. For each predicate defined in the mediator, the mediator author has the option of choosing to apply one of these conflict resolution strategies. Each strategy selected will cause the CR Toolkit to generate a set of rules that becomes a part of the mediator. Alternatively, the mediator author may define a new strategy by writing a set of rules. The learning module (cf. Section 2.4.4) will then attempt to determine whether the rules embody a strategy that should be developed further for future use. Below, we specify some conflict resolution strategies currently implemented in the toolkit, and describe how these facilities are incorporated into the mediator compiler.

We assume a background set of integrity constraints that expresses invariant semantic relationships between different data sources. The syntactic form of these constraints is not important for our discussion.

Latest Data Preference Strategy: Very simply, this strategy says that preference is given to temporally more recent data. To accommodate this, the mediator accesses a special domain implemented in HERMES called HTIME (for HERMES time). HTIME contains a domain operation called timefind that takes a relation name, say r, and a tuple t as input. It first checks if the relation r has associated time-stamp information. Recall that during domain integration, this information may either be explicitly included by the domain integrator, or implicitly generated by the mediator compiler and stored in the data() declaration. If a time-stamp exists, a temporal conversion function supplied by the domain integrator is invoked with r as input. The value returned is the time, normalized to a global clock maintained internally by HERMES. If relation r has no associated time-stamp information, then timefind examines the appropriate data declarations in the mediator to determine the time at which relation r was introduced into the mediator system. This time is used as the time at which the data was created. Using the results returned by timefind, the most recent data is selected whenever conflicting data arises.
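
The behaviour of timefind and the resulting selection can be sketched as follows. This is an illustration under stated assumptions, not the HTIME code: the dictionaries converters and declared_at, the function latest, and the sample relations and times are all hypothetical, and the conversion function is assumed to receive both the relation name and the tuple.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[object, ...]
Converter = Callable[[str, Row], float]


def timefind(
    relation: str,
    row: Row,
    converters: Dict[str, Converter],
    declared_at: Dict[str, float],
) -> float:
    """Return the time of `row`, normalized to the global clock."""
    convert = converters.get(relation)
    if convert is not None:
        # The relation carries its own time-stamp: use the integrator's converter.
        return convert(relation, row)
    # Otherwise fall back to the time recorded in the data() declaration.
    return declared_at[relation]


def latest(
    candidates: List[Tuple[str, Row]],
    converters: Dict[str, Converter],
    declared_at: Dict[str, float],
) -> Tuple[str, Row]:
    """Among conflicting (relation, tuple) pairs, prefer the most recent."""
    return max(
        candidates,
        key=lambda c: timefind(c[0], c[1], converters, declared_at),
    )


# Hypothetical data: part1 stores a time-stamp in its last field, part2 does not.
converters = {"part1": lambda relation, row: float(row[-1])}
declared_at = {"part1": 100.0, "part2": 250.0}
conflict = [("part1", ("bolt", "red", 300.0)), ("part2", ("bolt", "blue"))]

winner = latest(conflict, converters, declared_at)
```

Here the part1 tuple is preferred because its own time-stamp (300.0) is later than the time at which part2 was introduced (250.0).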

Predicate Preference Strategy: Many databases may define concepts that are essentially the same. Consider a PARADOX relation called part1 and a DBASE relation called part2. Suppose part1 has a field color, and part2 has a field col. Though part1.color and part2.col are syntactically distinct, they in fact refer to the same semantic concept, viz. color. The CR-Toolkit allows a mediator author to express that if PARADOX:part1.name=DBASE:part2.name, then part1 and part2 describe the same part, and hence their colors should be identical, i.e. PARADOX:part1.color=DBASE:part2.col. The system, upon recognizing this integrity constraint, will generate all possible violations. For each violation, it asks the mediator author to indicate a preference between the two relations. Hence the mediator author may say, for instance, that the PARADOX database's information is to be preferred with respect to color, but that the DBASE value is to be preferred with respect to price. Once the indications are made, rules are generated by the CR-Toolkit to reflect these preferences.
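
A minimal sketch of the effect of such preferences is given below. The tables FIELD_PAIRS and PREFERRED and the function resolve are hypothetical names introduced for illustration, and the pairing of part1.price with part2.cost is an assumption of the sketch; the rules actually generated by the CR-Toolkit are mediator rules, not Python.

```python
from typing import Dict, Tuple

# concept -> (field of PARADOX:part1, field of DBASE:part2)
FIELD_PAIRS: Dict[str, Tuple[str, str]] = {
    "color": ("color", "col"),
    "price": ("price", "cost"),
}

# Preferences indicated by the mediator author, one per concept.
PREFERRED: Dict[str, str] = {"color": "PARADOX", "price": "DBASE"}


def resolve(part1: Dict[str, object], part2: Dict[str, object]) -> Dict[str, object]:
    """Merge two records for the same part, applying the per-concept preference."""
    merged: Dict[str, object] = {}
    for concept, (paradox_field, dbase_field) in FIELD_PAIRS.items():
        v1, v2 = part1[paradox_field], part2[dbase_field]
        if v1 == v2:
            merged[concept] = v1          # integrity constraint holds
        elif PREFERRED[concept] == "PARADOX":
            merged[concept] = v1          # violation: PARADOX wins
        else:
            merged[concept] = v2          # violation: DBASE wins
    return merged


# Hypothetical records with the same name but conflicting color and price.
p1 = {"name": "bolt", "color": "red", "price": 10}
p2 = {"name": "bolt", "col": "blue", "cost": 12}
merged = resolve(p1, p2)
```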

Object Preference Strategy: The idea here is similar to predicate preference except that preference is indicated over individual objects. Hence it relies on the assumption that certain databases have more reliable knowledge of the attributes of certain objects than other databases. For example, a mediator author may note that what is called ``dog'' in one database is identical to what is called ``chien'' in another. Furthermore, a preference may be expressed that values of all attributes associated with ``dog'' in a database override values stored for the same attributes in other databases. The CR-Toolkit allows the mediator author to interactively express such preferences, and automatically generates the appropriate rules that will be placed in the mediator.

Value-Based Preference Strategy: Value-based conflict resolution may be appropriate for certain business applications. In this strategy, the user can specify that when two numeric fields disagree, either the higher or the lower value ought to be picked. Hence for instance, the integrity constraint

PARADOX:part1.name=DBASE:part2.name => PARADOX:part1.price=DBASE:part2.cost.

says (assuming identical currencies) that a part should have the same price in both the PARADOX and DBASE databases. However, should this constraint fail, the mediator author who is writing the mediator on behalf of the company that produces these products may choose to quote the higher price. This is an example of a value-based preference strategy. The CR-Toolkit allows the user to interactively express such preferences. Again, appropriate rules will be generated by the toolkit and placed in the mediator.
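
As a small illustration of this strategy, the following sketch picks the higher or the lower of two disagreeing numeric values. The function name quote_price and its prefer parameter are hypothetical and are not part of the CR-Toolkit.

```python
def quote_price(paradox_price: float, dbase_cost: float, prefer: str = "higher") -> float:
    """Resolve a price conflict between part1.price and part2.cost."""
    if prefer not in ("higher", "lower"):
        raise ValueError("prefer must be 'higher' or 'lower'")
    if paradox_price == dbase_cost:
        return paradox_price              # constraint holds, nothing to resolve
    if prefer == "higher":
        return max(paradox_price, dbase_cost)
    return min(paradox_price, dbase_cost)
```

For example, quote_price(10.0, 12.5) selects the higher value, which corresponds to the company quoting the higher price.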

Criticality Number Strategy: Recall that when creating the mediator, the mediator author includes various declarations such as the data declarations described earlier. Similarly, we have a special declaration called a criticality declaration of the form

criticality(filename):Crit,

where Crit is a real number. The higher it is, the more ``reliable'' is the data in that domain. The Criticality Number Strategy allows the user to use the rather coarse indication that all data in a domain of a specified criticality c1 is ``more reliable'' than all data in a domain of criticality c2<c1. This broad, sweeping strategy may be used as the default in cases where, for example, the mediator author does not truly believe that conflicts will arise.
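
The strategy amounts to choosing the value from the source with the highest criticality number. The sketch below assumes a hypothetical table CRITICALITY keyed by source name, with made-up numbers; the treatment of ties (the first source encountered wins) is a choice of the sketch, since the text does not specify it.

```python
from typing import Dict

# Hypothetical criticality declarations, one real number per source.
CRITICALITY: Dict[str, float] = {"PARADOX": 0.9, "DBASE": 0.6}


def resolve_by_criticality(
    candidates: Dict[str, object],
    criticality: Dict[str, float],
) -> object:
    """Return the conflicting value that comes from the most reliable source."""
    best_source = max(candidates, key=lambda source: criticality[source])
    return candidates[best_source]


# Two sources disagree on the color of the same part.
chosen = resolve_by_criticality({"DBASE": "blue", "PARADOX": "red"}, CRITICALITY)
```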

2.4.3. The Information Pooling (IP) Toolkit

Information pooling is a complicated matter. It is further complicated when the sources from which the information is pooled differ vastly from one another, both in structure and in content. To facilitate this process, the IP-Toolkit consists of a number of sub-toolkits that allow integration within a given domain, integration across different but similar domains, and integration across multiple, dissimilar domains. For example, suppose integration is to be performed over eight databases (call them DB1,...,DB8). Suppose moreover that DB1 and DB2 are PARADOX databases, DB3 and DB4 are DBASE databases, DB5 and DB6 are text databases, and DB7 and DB8 are pictorial databases. Figure 3 shows how the IP-Toolkit in HERMES would accomplish the desired integration. As can be seen from Figure 3, the IP-Toolkit contains many ``sub-toolkits'' that need to be independently developed. Each of these sub-toolkits allows the generation of some rules (but not all) within the mediator. Using the sub-toolkits leads to a stepwise procedure. For instance, in Figure 3, the mediator author could define certain rules in the mediator by using the PARADOX IP Toolkit; similarly, he may define certain rules in the mediator by using the DBASE IP toolkit; he may then define new mediator rules based on these intermediate definitions, perhaps in conjunction with the relational IP toolkit.

Intra-Domain Sub-Toolkits: Often, there may be relatively efficient ways of integrating information within a given domain. For example, if a number of different databases are all developed using PARADOX, then as they are all structurally similar, it would be easier to perform integration over these databases first whenever possible, before doing integration with data sources implemented in other database management systems.

As an example, consider the integration of the two PARADOX databases DB1 and DB2 in Figure 3. These databases may have relations called emp_comp1 and emp_comp2, respectively, showing the employee records of two companies. It may turn out that a particular employee obtains money from both; in such cases, the mediator author may wish to write a rule of the form:

salary(X,S):[1,R] <-: in(Sal1,PARADOX:project(select=('db1.emp_comp1',"name",X)"salary")) &
in(Sal2,PARADOX:project(select=('db2.emp_comp2',"nom",X)"sal")) &
S = Sal1 + Sal2.

Note that the name (resp. salary) field of the emp_comp1 relation in DB1 denotes the same concept as the nom (resp. sal) field of the emp_comp2 relation in DB2.
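
To make the meaning of this rule concrete, the following sketch evaluates it over two small in-memory tables. The table contents and the helper functions select_eq and project are hypothetical stand-ins for the PARADOX domain calls; the annotation [1,R] on the rule head is not modelled.

```python
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical contents of the two PARADOX relations.
DB: Dict[str, List[Row]] = {
    "db1.emp_comp1": [
        {"name": "smith", "salary": 30000},
        {"name": "jones", "salary": 42000},
    ],
    "db2.emp_comp2": [
        {"nom": "smith", "sal": 12000},
    ],
}


def select_eq(relation: str, field: str, value: object) -> List[Row]:
    """Rows of `relation` whose `field` equals `value`."""
    return [row for row in DB[relation] if row[field] == value]


def project(rows: List[Row], field: str) -> List[object]:
    """Values of `field` in the given rows."""
    return [row[field] for row in rows]


def salary(x: str) -> List[int]:
    """All S such that salary(x, S) follows from the rule."""
    return [
        sal1 + sal2
        for sal1 in project(select_eq("db1.emp_comp1", "name", x), "salary")
        for sal2 in project(select_eq("db2.emp_comp2", "nom", x), "sal")
    ]
```

Note that under this reading an employee who appears in only one of the two relations yields no answer, because both in(...) conditions in the rule body must succeed.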

In this example, the PARADOX Intra-Domain Sub-Toolkit would present, for each database in the PARADOX domain, relation-attribute pairs of the form R.A and ask the mediator author for assistance in grouping together semantically ``equivalent'' R.A pairs. In the preceding example, this involves grouping together the pairs (db1.emp_comp1).salary and (db2.emp_comp2).sal in one group, and (db1.emp_comp1).name and (db2.emp_comp2).nom in another.

Within any given domain, the mediator author must specify the structures of interest with respect to grouping. Once this specification is written, the mediator author would need to implement code that accomplishes this grouping. Note, however, that this code needs to be implemented only the first time a new domain is integrated -- for example, if we merely wish to add a new PARADOX database, say DB9, then there is no need to re-implement this grouping code.
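
One possible way to implement such grouping code is with a union-find structure over R.A pairs, as sketched below. This is only an illustration: the function group_equivalent is hypothetical, and the text does not prescribe any particular algorithm.

```python
from typing import Dict, Iterable, List, Set, Tuple

Attr = Tuple[str, str]  # (relation, attribute), i.e. an R.A pair


def group_equivalent(pairs: Iterable[Tuple[Attr, Attr]]) -> List[Set[Attr]]:
    """Partition R.A pairs into groups, given pairwise equivalences."""
    parent: Dict[Attr, Attr] = {}

    def find(a: Attr) -> Attr:
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving keeps trees shallow
            a = parent[a]
        return a

    for left, right in pairs:
        parent[find(left)] = find(right)   # merge the two groups

    groups: Dict[Attr, Set[Attr]] = {}
    for attr in list(parent):
        groups.setdefault(find(attr), set()).add(attr)
    return list(groups.values())


# The equivalences indicated by the mediator author in the example above.
groups = group_equivalent([
    (("db1.emp_comp1", "salary"), ("db2.emp_comp2", "sal")),
    (("db1.emp_comp1", "name"), ("db2.emp_comp2", "nom")),
])
```

Because equivalence is treated as transitive, adding a further database only requires stating how its attributes relate to one member of an existing group.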

Sub-Toolkits for Integrating ``Similar'' Domains:

Formally, DBASE and PARADOX are two different domains; however, these two domains share several characteristics. For example, as they are both relational database management systems, the operations of select, project and join are common to both of them. So are many aggregate operations, including SUM, MIN, MAX, and COUNT.

Therefore, it may be advantageous to create a sub-toolkit that accesses the predicates that have been defined in the mediator using the PARADOX and DBASE IP sub-toolkits, and then uses them to define other parts of the mediator. This is the purpose of the sub-toolkits for integrating similar domains -- the mediator author, though, is left to decide which domains are similar.

Just as in the case of the intra-domain sub-toolkits, the sub-toolkits for integrating similar domains are used to group structures together. For example, in the case of, say, PARADOX and DBASE, this involves grouping together pairs (R1.A1) from the PARADOX domain with pairs (R2.A2) from the DBASE domain. Thus, for instance, the mediator author may specify that (db2.emp_comp2).sal in the PARADOX domain is the same as (db3.employee).income in the DBASE domain.

Sub-Toolkits for Integrating Diverse Domains: At this point in our system, this integration must be done by the mediator author. The GUI for the MPE merely serves as a tool for expressing the mediation rules.

2.4.4. Learning in HERMES

Though small, the learning facility is an important aspect of the HERMES mediator programming environment. When a mediator author develops a mediator, strategies other than the ones available in the CR-Toolkit may be necessary. In that case, the mediator author needs to specify such a strategy through a set of mediator rules. This set, appropriately generalized, may be useful across a sufficiently large number of mediation tasks that storing it as a new strategy in the CR-Toolkit for later use is worthwhile.

The decision on whether the new strategy will be useful in the future lies with the mediator author, while the appropriate generalization to the set of rules may often be generated automatically. In certain cases the mediator author may need to manually adjust the strategy thus generated.

A similar learning component may be developed for the Information Pooling Toolkit.
