This small demonstrator illustrates several Semantic Web technologies assembled to solve the problem of querying a heterogeneous set of databases. The search terms supplied by the user may not match the table or attribute names in each database, because the user’s mental model of the schema may differ from the schema as implemented. A thesaurus relating the terms used in the databases allows the user query to be converted into queries suitable for each database, and the answers are then integrated into a single table.
It is common, when organisations merge or even within stable organisations, for users to require information derived from a combination of databases that have been constructed in different ways, that is, heterogeneous databases. The common problems of heterogeneous database access arise where databases share some elements yet differ in any of several ways, such as:
Best practice database design methods include rich design models that make these issues explicit, although these models are not stored as part of the databases, are often lost, and are often themselves written using different representations or notations (e.g. entity relationship diagrams or UML). Even when the models used in developing the databases are available, it is still a considerable effort for end users to resort to them to develop queries that can be issued across multiple heterogeneous databases. There have been many theoretical solutions to this problem, and many demonstrations of solutions, yet there is no generally adopted solution.
The Semantic Web technologies allow such distributed rich models to be stored on-line and integrated into enterprise systems for heterogeneous database access when required. This small demonstrator illustrates a solution to the simplest of the heterogeneous database problems: the user choosing the wrong term for an element in a query. However, the modelling approaches available through RDF, RDF Schema and OWL, as implemented in the toolkits that support them, can be used to address all of the problems listed above incrementally, which minimises the up-front investment and gives some payback for each increase in distributed modelling. The current demonstrator only matches directly equivalent terms within a single model, using pre-existing models. More complex mappings and the consequent transformations can be added using the richer model representations that the underlying Semantic Web technologies support.
The small demonstrator shows how a single query can be broken down into a set of queries aimed at different heterogeneous databases. The breakdown process maps the query terms to the terms used in each target database, using a thesaurus of synonyms and other standard thesaurus relationships. The demonstration uses a multilingual thesaurus to overcome the last two (and simplest) semantic heterogeneity problems listed above.
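The term mapping described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class `SynonymLookupSketch`, the method `equivalents` and the hard-coded term groups are hypothetical stand-ins, with the terms chosen to match the column names of the example tables shown later on this page. The demonstrator itself holds its thesaurus in RDF rather than in memory.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical in-memory stand-in for the thesaurus (the real one is held in RDF). */
final class SynonymLookupSketch {

    // Each inner list is one group of directly equivalent terms.
    // Assumed data, chosen to match the example column names.
    private static final List<List<String>> EQUIVALENT_TERMS = List.of(
            List.of("price", "prix", "preis"),
            List.of("name", "nom", "namen"));

    /** Returns the query term followed by its equivalents, or an empty set if the term is unknown. */
    static Set<String> equivalents(String queryTerm) {
        String wanted = queryTerm.trim().toLowerCase(Locale.ROOT);
        Set<String> result = new LinkedHashSet<>();
        for (List<String> group : EQUIVALENT_TERMS) {
            if (group.contains(wanted)) {
                result.add(wanted);     // keep the user's own term first
                result.addAll(group);   // then the remaining terms of the group
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(equivalents("Prix")); // prints [prix, price, preis]
    }
}
```

An empty result corresponds to the case where the user's term is not in the thesaurus, which the query application assumes does not happen.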
The demonstrator uses a thesaurus server to host the thesaurus. The thesaurus server holds thesauri represented in RDF according to an RDF Schema that describes the structure of thesauri. The thesaurus is accessed via a Java class library built using the Jena RDF toolkit from HP Labs. The programming interface to the Thesaurus class library is exposed as a Web Service endpoint accessed via SOAP, implemented using the Apache Axis toolkit.
The query application, implemented in Java, can be run as a standalone application or called from a Java servlet. In addition, the query application can use the Thesaurus library directly.

The query application assumes that the term entered will be in the thesaurus, and that any table or attribute names in the database applications will also be in the thesaurus. For generic thesauri the first is a reasonable assumption if the thesaurus gives reasonable coverage of a language. The second can be assumed to hold if the database development method used in an organisation requires either the use of existing thesaurus terms or the addition of the terms used to a thesaurus as part of the data modelling. If the terms used in database construction are not entered into the thesaurus, the example approach will not work, but nor would any other approach if there is no mapping from the terms used to some other set of terms.
The query application maps the query term to synonyms in the thesaurus. It then searches the list of SQL databases it knows about for those terms, issues queries to the matching databases through JDBC, integrates the answers, and presents them to the user as an HTML page.
So that you can use this simple demonstrator as the basis of larger scale solutions:
The technical background to the problem and the technical implementation details are each explained in formal deliverables to the European Union for the Question How project in which this demonstrator was produced.
The simple demonstration problem is to issue a query to three SQL databases simultaneously where the schemas are different. Each of the three databases consists of a single table which, apart from an ID column, has two attributes describing the name and price of vehicles. The attribute names are different in each database, as they are intended to represent databases from different countries: the UK, France and Germany.
English | CREATE TABLE TEST_EN ( ID INTEGER NOT NULL, NAME VARCHAR(250), PRICE NUMERIC(10) )
French | CREATE TABLE TEST_FR ( ID INTEGER NOT NULL, NOM VARCHAR(250), PRIX NUMERIC(10) )
German | CREATE TABLE TEST_DE ( ID INTEGER NOT NULL, NAMEN VARCHAR(250), PREIS NUMERIC(10) )
The user can query the databases by supplying the term corresponding to price in English, French or German. The Thesaurus library is used to determine which databases contain attributes relevant to the user query, and to create an appropriate SQL query for each database.
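As a rough sketch of this step, the Java below picks the tables that have a column matching one of the equivalent terms and builds one SQL statement per table. The class `QueryRewriteSketch`, the `SCHEMAS` catalogue and the choice to select only the matched column are assumptions made for illustration; the table and column names are copied from the three CREATE TABLE statements above, and the statements the demonstrator actually issues may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical query rewriter over a fixed catalogue of the three example tables. */
final class QueryRewriteSketch {

    // Table name -> column names, taken from the CREATE TABLE statements.
    static final Map<String, List<String>> SCHEMAS = new LinkedHashMap<>();
    static {
        SCHEMAS.put("TEST_EN", List.of("ID", "NAME", "PRICE"));
        SCHEMAS.put("TEST_FR", List.of("ID", "NOM", "PRIX"));
        SCHEMAS.put("TEST_DE", List.of("ID", "NAMEN", "PREIS"));
    }

    /** Builds one SELECT for each column that matches any of the equivalent terms. */
    static List<String> rewrite(Set<String> equivalentTerms) {
        List<String> queries = new ArrayList<>();
        for (Map.Entry<String, List<String>> table : SCHEMAS.entrySet()) {
            for (String column : table.getValue()) {
                if (containsIgnoreCase(equivalentTerms, column)) {
                    // Identifiers come from the fixed catalogue, never from
                    // raw user input, so plain concatenation is acceptable here.
                    queries.add("SELECT " + column + " FROM " + table.getKey());
                }
            }
        }
        return queries;
    }

    private static boolean containsIgnoreCase(Set<String> terms, String candidate) {
        for (String term : terms) {
            if (term.equalsIgnoreCase(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        rewrite(Set.of("price", "prix", "preis")).forEach(System.out::println);
    }
}
```

Each generated string could then be passed to a JDBC `Statement` opened on the corresponding database.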
The results are integrated into a single table and returned as an HTML page. The structure of the results varies with the query term used: the results for the original query term always appear at the top of the integrated data. In this example it would be easy to remove the labels from the non-query term subtables, or even to translate the reply into another language supported by the thesaurus. However, the current presentation makes the structure of the process evident.
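One possible way to produce such an ordering is sketched below in Java. It assumes the per-database answers have already been collected into a map from column label to values; the class `ResultIntegrationSketch`, the method names and the exact HTML layout are hypothetical and are not taken from the demonstrator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical integration step: labelled sub-results rendered as one HTML table. */
final class ResultIntegrationSketch {

    /** Renders one table; the sub-result whose label equals the query term comes first. */
    static String toHtml(String queryTerm, Map<String, List<String>> valuesByLabel) {
        List<String> labels = new ArrayList<>(valuesByLabel.keySet());
        // List.sort is stable: the label matching the query term moves to the
        // front and the other labels keep their original relative order.
        labels.sort((a, b) -> Boolean.compare(
                !a.equalsIgnoreCase(queryTerm), !b.equalsIgnoreCase(queryTerm)));

        StringBuilder html = new StringBuilder("<table>\n");
        for (String label : labels) {
            html.append("<tr><th>").append(escape(label)).append("</th></tr>\n");
            for (String value : valuesByLabel.get(label)) {
                html.append("<tr><td>").append(escape(value)).append("</td></tr>\n");
            }
        }
        return html.append("</table>\n").toString();
    }

    /** Minimal escaping so that data values cannot break the markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Placeholder values only; they are not taken from the demonstrator's data.
        Map<String, List<String>> answers = new LinkedHashMap<>();
        answers.put("PRICE", List.of("1"));
        answers.put("PRIX", List.of("2"));
        answers.put("PREIS", List.of("3"));
        System.out.print(toHtml("prix", answers));
    }
}
```

With the query term "prix", the PRIX rows are emitted first, followed by PRICE and PREIS in their original order. Dropping the header rows for the non-query labels would give the unlabelled variant mentioned above.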
Run the demonstration system
Typical output from the program:
Search Term from user is price
There are 3 tables matching the translated search term
This demonstrator was produced as part of the Question How project partly supported by the European Union under the IST programme. The implementation was made by Ian Johnson at CCLRC's Rutherford Appleton Laboratory in the UK. Other demonstrations of W3C technology produced in the Question-How project are also available.
This activity was partly supported by grant IST-2000-28767 from the European Union's Information Society Programme to the Question How project.