Many sustainable development applications require information derived from heterogeneous, multi-source data. This information is inherently distributed, and the data is created under different projects by various organizations. Integrating all the relevant data for timely and correct decision making is challenging because of syntactic, schematic and semantic heterogeneities. Ontologies support collaboration and help resolve data integration and interoperability issues while retaining the integrity of the original data. This paper deals with methodologies that use ontologies to integrate data that differ in format, vocabulary, schema, semantics and representation. An ontology-driven information system is developed and the practical issues in its implementation are discussed. A case study in geomorphology explores ways of representing knowledge as ontologies and then using these ontologies for services such as seamless querying across multi-source datasets, transforming data to another classification and providing user-specific information.
1. Introduction
A number of applications necessitate the integration of data from various disciplines. For example, sustainable development encompasses soil erosion, deforestation, water resources and disaster management, among other topics. It is often difficult to obtain the timely information required for suitable decision making from the available data. This is because all the necessary data or information is rarely present in a single database, and data is created using different vocabularies, taxonomies and content standards.
Current data and information systems at data providers’ sites can, at best, provide online ordering of and access to data and standard products; they cannot provide user-specific information and knowledge, and they cannot create a product on demand. Thus, although geospatial data is available, up-to-date geospatial information and knowledge is not. Currently, converting geospatial data into information and knowledge requires users at their local sites to have significant domain knowledge for extracting information from raw data, as well as knowledge of geospatial data formats and processing. It also requires a significant amount of computer hardware and software resources. As a result, using geospatial data is expensive and requires a lot of effort.
The first step towards data integration is interoperability, which exists at multiple levels. At the level of data representation, there are XML and GML. Depending on the application, interoperability may be required at several levels at once. Ontologies, together with other technologies such as web services, help in resolving these issues: interoperability for functional processing can be achieved through web services, while semantic heterogeneity can be managed using ontologies.
Data integration is useful and essential in many applications. In this paper, Section 2 discusses the challenges in data integration, Section 3 describes the architecture of the ontology-driven information system, Section 4 covers representing knowledge using ontologies, and Section 5 shows how the ontology-driven information system can be used for data integration and interoperability. Some tasks are explored specifically with geomorphology: (1) seamless visualization, (2) uniform query across datasets, (3) reclassification into another schema and (4) providing user-specific information through integrated modeling of data from multiple projects (such as suitability of site locations and ground water prospects).
2. Heterogeneities in Spatial Data
A large volume of geospatial data has been created under different projects by different organizations over the last 20 years. Some of the national projects include National Natural Resource Management System (NNRMS), National (Natural) Resource Information System (NRIS), National Resource Database (NRDB), National Agriculture Technology Project (NATP), Rajiv Gandhi National Drinking Water Mission (RGNDWM), National Resource Census (NRC) and National Resource Data Management System (NRDMS). The purpose of each project varies and the area of study changes correspondingly. Projects utilize different vocabularies, taxonomies, content standards and representations. Data is available in multiple formats at different scales and includes several themes.
Solutions have emerged for handling syntactic differences across spatial data sources. Other spatial interoperability issues, such as differences in classification schemes and semantic heterogeneities, have received less attention. Driven by practical challenges, the need for collaboration among data creators and the need to integrate data from multiple sources, the focus of interoperability research is moving to models and technologies for seamless, automatic, on-demand spatial data integration and modeling.
2.1 Data Formats
Geospatial datasets exist in different formats such as shapefiles, coverages and geodatabases. These in turn may have to be linked with other data, such as tabular or report data, that exist in their own formats. In addition to having sufficient knowledge of the domain, the user needs to know the data formats and how to use the software that reads them.
2.2 Vocabulary
Vocabulary can differ across organizations and is sometimes not quantitative. This is a challenge when data has to be interpreted by people other than its creators, such as non-technical decision makers, or by computers for automatic processing. For example, volcanic rocks may also be called igneous extrusive rocks.
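One simple way to make such vocabulary differences machine-readable is a lookup from alternative terms to a preferred term. The sketch below is only illustrative: the table holds just the example from the text, and the name `preferred_term` is hypothetical.

```python
# Illustrative synonym table (alternative term -> preferred term).
# Only the example from the text is included; a real table would be
# maintained by domain experts.
SYNONYMS = {
    "igneous extrusive rocks": "volcanic rocks",
}


def preferred_term(term: str) -> str:
    """Return the preferred vocabulary term, ignoring case and extra spaces."""
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, key)


print(preferred_term("Igneous Extrusive Rocks"))
print(preferred_term("Volcanic rocks"))
```

Both calls resolve to the same preferred term, so records created with either wording can be compared directly.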
2.3 Syntax, Attribute Names and Values
A class can be represented differently in multiple projects. For example, a single landform “Structural Hill” in geomorphology has the following representations in different projects (shown in figure 1):
Figure 1: Part of classification schemes of two projects
Table 1: Representation of structural hill in different projects
2.4 Semantic Heterogeneity
A single class may mean different things to different people. Dense forest, large structural hill, highly dissected structural hill – these are ambiguous natural language terms (Table 1). Interestingly, this may also vary based on the terrain of the study area like plains or hilly areas. This information must be specified explicitly if the data has to be used correctly by others or for automatic on-demand integration and modeling.
Different classification schemes may be adopted based on the purpose of the project. Some standardization has happened over the years, such as the NRIS standards and, more recently, the NNRMS standards. However, depending on the area of study and the scale, deviations are seen: lower-level classes are ignored, additional classes are included and terms are used ambiguously.
3. System Architecture
An architecture that handles interoperability and integration at multiple levels while giving good performance is suggested. There are several initiatives for interoperability at the level of data representation, such as XML and GML. These improve interoperability among systems but are inefficient for retrieval and processing of large datasets. A hybrid approach is used in this paper: the data itself resides in its original form, and metadata together with ontologies are used to achieve interoperability. A knowledge base consisting of domain ontologies and task-specific ontologies is created in a form that can be processed automatically. Data and project-specific information is stored in data ontologies. User-specific information is generated by dynamically linking the data ontologies with the knowledge base using a reasoner. By providing the results as GML through web-enabled services such as WMS and WFS, interoperability, integration and performance are achieved.
3.1 Web Ontology Language (OWL)
Ontologies are used to capture knowledge about some domain of interest. An ontology describes the concepts in the domain and also the relationships that hold between those concepts. The most recent development in standard ontology languages is OWL from the World Wide Web Consortium (W3C). OWL ontologies may be categorized into three species or sub-languages: OWL-Lite, OWL-DL and OWL-Full. A defining feature of each sub-language is its expressiveness. At present, automated reasoning on the ontology is possible only with OWL-DL. An OWL ontology consists of Individuals, Properties, and Classes. Reasoning in OWL (Description Logics) is based on open world reasoning (OWR). There are tools that allow creating and editing ontologies and help in visualizing relationships. However, these are still primitive.
3.2 Ontology Alignment and Reasoners
Ontology alignment is the task of merging, aligning or forming relationships between classes across ontologies. These relationships include subclass, equivalence and disjointness. Some tools offer GUIs that make it easier to define relationships between ontologies.
Reasoners can infer new relationships between classes from existing ontologies, for example by computing subsumption relationships between classes and detecting inconsistent classes. Depending on the application, a reasoner may be used once initially for merging or dynamically at run-time. Aligning once is itself challenging because of the differences in classifications driven by the purpose and scale of each project. When a reasoner is used dynamically at run-time to infer relationships automatically or semi-automatically, different sets of classes from various ontologies may have to be aligned based on requirements. This requires more comprehensive information to be stored in the ontologies and is more difficult. Reasoners that can be integrated with the rest of the application and that are callable at run-time are a must in such cases.
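The idea of inferring subsumption across aligned ontologies can be illustrated with a toy transitive closure over asserted subclass and equivalence axioms. This is a simplified stand-in for a description logic reasoner, not the reasoner used in the system; the class names and the function `infer_subsumptions` are hypothetical.

```python
from collections import defaultdict


def infer_subsumptions(subclass_of, equivalent):
    """Return every (sub, super) pair implied by the asserted axioms."""
    # An equivalence A = B is treated as two subclass axioms, A <= B and B <= A.
    parents = defaultdict(set)
    for sub, sup in subclass_of:
        parents[sub].add(sup)
    for a, b in equivalent:
        parents[a].add(b)
        parents[b].add(a)

    inferred = set()
    for start in list(parents):
        stack = list(parents[start])
        seen = set()
        while stack:
            cls = stack.pop()
            if cls in seen or cls == start:
                continue
            seen.add(cls)
            stack.extend(parents.get(cls, ()))
        inferred.update((start, sup) for sup in seen)
    return inferred


# Hypothetical classes from two ontologies "a" and "b", aligned by one equivalence.
asserted = [("a:StructuralHill", "a:Hill"), ("b:DissectedHill", "b:Hill")]
equivalences = [("a:Hill", "b:Hill")]
closure = infer_subsumptions(asserted, equivalences)
print(("a:StructuralHill", "b:Hill") in closure)
```

The single alignment axiom between the two `Hill` classes is enough to place a class of one ontology under a class of the other, which is the kind of relationship a run-time reasoner would be asked to derive.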
4. Representing Knowledge using Ontologies
Using technologies like geo-ontologies, one can explore how domain expertise of experts can be represented as knowledge that can be understood by automated or semi-automated tools. Ontologies aid in storing knowledge about the domain (domain ontologies) and project-specific information about the data (data ontologies) in a form understandable by both humans and machines.
It is important to note that there are differences between OWL languages, such as the Unique Name Assumption (UNA). The scope and purpose of the ontology must also be decided before creating it. There are multiple ways of defining a class and representing information about it. Issues to be decided include whether the ontology will be merged or aligned with other ontologies, and whether the alignment happens once and is static or has to happen dynamically in multiple ways based on requirements. To enable dynamic linking, the task for which the ontology is created must be decided beforehand. This is because reasoners currently work only on OWL-DL and not on OWL-Full. It is necessary to represent some information as attributes of a class, some as subclasses and some as a cover for a partition. Knowing the scope and purpose helps decide which representation is used to define a class. For example, a dense forest can be stored as a forest with the value ‘High’ under ‘Density’ or as a subclass of forest. Proper care must be taken of these issues for dynamic linking to work correctly and give complete results.
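The dense forest example can be made concrete with a small sketch of the two representations. The record layout and the names `DenseForest` and `is_dense_forest` are assumptions made for illustration only.

```python
# Option A: density stored as an attribute value of the class Forest.
record_a = {"class": "Forest", "Density": "High"}

# Option B: dense forest stored as its own subclass of Forest.
SUBCLASS_OF = {"DenseForest": "Forest"}
record_b = {"class": "DenseForest"}


def is_forest(record):
    """True for Forest itself and for any direct subclass of Forest."""
    cls = record["class"]
    return cls == "Forest" or SUBCLASS_OF.get(cls) == "Forest"


def is_dense_forest(record):
    """Recognise a dense forest under either representation."""
    if record["class"] == "DenseForest":
        return True
    return record["class"] == "Forest" and record.get("Density") == "High"


print(is_dense_forest(record_a), is_dense_forest(record_b))
```

A query written for only one of the two representations would silently miss records stored in the other, which is why the choice has to be fixed, or both have to be handled, before dynamic linking is attempted.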
4.1 Domain Ontologies
Experts’ knowledge of the domain, common vocabulary and terminology, and other implicit information must be stored as domain ontologies. A single ontology describing the complete information of the domain may not be possible, and the opinions of experts may differ. Multiple ontologies can co-exist and can be used to store the opinions of different experts. These can be aligned or merged based on need, or a user can work with the one he or she prefers. In this way, ontologies help in collaboration.
4.2 Task Specific Ontologies
Tasks can be categorized into common tasks, tasks that are clearly defined and tasks that are not needed often. These may require combining information from multiple domains, and the related data may be available from different datasets. The task also defines the level of detail to which information must be stored. The purpose and mode in which the ontology will be used help in deciding the representation of the information. An example of a task is finding suitable sites for wells that provide drinking water. This requires finding sites that fulfill certain conditions, and information from multiple layers pertaining to the ground water regime, such as lithology, geomorphology, geological structures and hydrology, has to be combined.
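A task-specific ontology can be pictured, in a very reduced form, as a list of the layers a task needs, which is then matched against the themes each dataset offers. The sketch below uses the well-siting layers named in the text; the dataset catalogue and the function `plan_task` are hypothetical.

```python
# Layers required by a task, taken from the well-siting example in the text.
TASK_ONTOLOGY = {
    "drinking_water_well_siting": [
        "lithology", "geomorphology", "geological structures", "hydrology",
    ],
}


def plan_task(task, dataset_themes):
    """Map each required layer to the datasets providing it, and list gaps."""
    sources, missing = {}, []
    for layer in TASK_ONTOLOGY[task]:
        providers = sorted(
            name for name, themes in dataset_themes.items() if layer in themes
        )
        if providers:
            sources[layer] = providers
        else:
            missing.append(layer)
    return sources, missing


# Hypothetical catalogue: dataset name -> themes it contains.
catalogue = {
    "dataset_a": {"geomorphology", "lithology"},
    "dataset_b": {"hydrology", "geomorphology"},
}
sources, missing = plan_task("drinking_water_well_siting", catalogue)
print(sources)
print(missing)
```

Reporting the missing layers explicitly lets the system tell the user that the task can only be answered partially, rather than returning an incomplete result without comment.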
4.3 Data Ontologies
Subjective information such as the scope and purpose of the project, the extent to which standards were followed, assumptions made, explicit definitions of the terms used and other relevant information such as the area of study can be stored explicitly as data ontologies. Data ontologies also include project-specific details such as data format, projection, source, classification schema and the representation of each class in the dataset, that is, the column holding the class name and the codes or values corresponding to the class (e.g. the class ‘Structural Hill’ is represented as the value ‘0101010000’ in the column ‘GU-CODE’ in the RGNDWM project). This information aids reuse of the data by a larger group of people, minimizes errors during reuse and keeps the data relevant for longer periods of time. Storing information in this form also enables automatic processing by systems, allowing integrated data analysis and modeling with dynamic linking.
Some national projects such as NATP and RGNDWM have been studied, and the ontologies required for processing and analysis of their data have been created. Some interesting observations were made. While newer projects follow standards to a large extent, there are some implicit assumptions that are specific to each project or dataset. If these are not stored explicitly, automated systems processing the data could be misled. For example, a class found under a higher-level class need not necessarily be a subset of the higher-level class. It could be that it is merely discernible within the higher-level class at a particular scale.
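The class representation stored in a data ontology can be sketched as a small record type. The sketch uses only the RGNDWM example given in the text; the type `DataOntology` and its method names are hypothetical, and a real data ontology would carry the remaining project details (format, projection, source and so on) as well.

```python
from dataclasses import dataclass, field


@dataclass
class DataOntology:
    """Project-specific description of how classes are stored in a dataset."""
    project: str
    class_column: str                                  # column holding the class code
    class_codes: dict = field(default_factory=dict)    # class name -> stored value

    def representation(self, class_name):
        """Return the (column, value) pair that encodes a class in this dataset."""
        return self.class_column, self.class_codes[class_name]

    def matches(self, row, class_name):
        """True if a feature's attribute row belongs to the given class."""
        column, value = self.representation(class_name)
        return row.get(column) == value


rgndwm = DataOntology(
    project="RGNDWM",
    class_column="GU-CODE",
    class_codes={"Structural Hill": "0101010000"},
)
print(rgndwm.representation("Structural Hill"))
```

With this information stored explicitly, a request phrased in domain terms ("Structural Hill") can be turned into the project-specific filter without the user knowing the column name or the code.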
5. Using Ontologies for Different Applications
5.1 Data Consistency and Validation
Semantic rules can be defined for a project and then checked against the actual data. For example, streams must not cross a reservoir or pond, a state cannot be contained in a district, and a settlement cannot be found in water.
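Rules of this kind can be sketched as a set of forbidden combinations of feature type, relation and feature type. In the sketch below the spatial relations are assumed to have been computed already by GIS software; the feature identifiers, the relation names and the function `find_violations` are hypothetical.

```python
# Forbidden (subject type, relation, object type) combinations from the text.
FORBIDDEN = {
    ("Stream", "crosses", "Reservoir"),
    ("Stream", "crosses", "Pond"),
    ("State", "contained_in", "District"),
    ("Settlement", "within", "Water"),
}


def find_violations(relations, feature_types):
    """Return the observed relations whose type pattern is forbidden."""
    return [
        (subj, rel, obj)
        for subj, rel, obj in relations
        if (feature_types[subj], rel, feature_types[obj]) in FORBIDDEN
    ]


# Hypothetical features and previously computed spatial relations.
feature_types = {"s1": "Stream", "r1": "Reservoir", "d1": "District", "st1": "State"}
relations = [("s1", "crosses", "r1"), ("d1", "contained_in", "st1")]
print(find_violations(relations, feature_types))
```

Only the stream crossing the reservoir is reported; a district contained in a state is the valid direction of the containment rule.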
5.2 Seamless Query Processing
Seamless query processing does not require any modification of the original data. The steps are data discovery, query rewriting and execution. An example workflow in our system is as follows (Figure 2). The user types a request such as “Select all Piedmont Polygons”. This is translated to “Select all polygons with Class = ‘Piedmont’”. The datasets to which the query must be applied are obtained through data discovery. Using the domain and data ontologies, the query is expanded and rewritten for each dataset. The location and data format of each dataset are obtained from its metadata. The query is then sent to the corresponding processing software or web service. The results are compiled, merged and sent to the user.
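The workflow above can be sketched in a few lines, with in-memory attribute tables standing in for the processing software or web services. All dataset names, column names and codes below are hypothetical, as are the functions `rewrite_query` and `run_query`.

```python
def rewrite_query(class_name, data_ontologies):
    """Expand one class-level request into a per-dataset attribute filter."""
    filters = {}
    for dataset, onto in data_ontologies.items():
        if class_name in onto["classes"]:      # simple data discovery
            filters[dataset] = (onto["column"], onto["classes"][class_name])
    return filters


def run_query(class_name, data_ontologies, tables):
    """Apply each rewritten filter to its dataset and merge the results."""
    merged = []
    rewritten = rewrite_query(class_name, data_ontologies)
    for dataset, (column, value) in rewritten.items():
        for row in tables[dataset]:
            if row.get(column) == value:
                merged.append({"dataset": dataset, "id": row["id"], "class": class_name})
    return merged


# Hypothetical data ontologies and attribute tables.
data_ontologies = {
    "dataset_a": {"column": "LANDFORM", "classes": {"Piedmont": "Piedmont"}},
    "dataset_b": {"column": "GM_CODE", "classes": {"Piedmont": "P01"}},
    "dataset_c": {"column": "SOIL", "classes": {}},
}
tables = {
    "dataset_a": [{"id": 1, "LANDFORM": "Piedmont"}, {"id": 2, "LANDFORM": "Valley"}],
    "dataset_b": [{"id": 7, "GM_CODE": "P01"}],
    "dataset_c": [{"id": 9, "SOIL": "Clay"}],
}
results = run_query("Piedmont", data_ontologies, tables)
print(results)
```

The original tables are never modified: each dataset keeps its own column and coding, and only the filter is adapted per dataset before the results are merged under the common class name.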
Layers may overlap. In overlapping areas, a ranking system is followed: the system assigns a default ranking, which the user can override.
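A ranking with a user override can be sketched as follows. The dataset names, the rank values and the function `resolve_overlap` are hypothetical; the paper does not specify how ranks are represented.

```python
# Default ranking assigned by the system (lower number = higher priority).
# Dataset names and ranks are hypothetical.
DEFAULT_RANK = {"dataset_a": 1, "dataset_b": 2}


def resolve_overlap(candidates, user_rank=None):
    """Pick the feature from the highest-ranked dataset in an overlap area."""
    rank = {**DEFAULT_RANK, **(user_rank or {})}
    return min(candidates, key=lambda c: rank.get(c["dataset"], float("inf")))


overlap = [
    {"dataset": "dataset_a", "class": "Piedmont"},
    {"dataset": "dataset_b", "class": "Pediment"},
]
print(resolve_overlap(overlap)["dataset"])
print(resolve_overlap(overlap, {"dataset_b": 0})["dataset"])
```

Datasets without any rank are treated as lowest priority, so an unranked layer never displaces a ranked one.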
Figure 2: Example workflow for Seamless Query Processing
5.3 Remapping or Reclassification
Each project is carried out to meet a particular objective, and these objectives vary across projects. Standards are sometimes followed, but only to a certain extent. Deviations happen either because the project necessitates them or because of deadline pressure. Another issue is the ambiguity of the terms used, for example high, moderate or less dense forests and highly, moderately or less dissected hills. These definitions vary across projects and could also vary with the geographic area: “high dissection” may be defined differently in plains and in hilly terrain. All this information is essential for automated processing and for consumption by people other than the data creators.
5.4 Aligning Classification Schemes
Data from the RGNDWM project has been taken as an example, and the feasibility of reusing it for the NRCensus project has been examined. This involves two steps: aligning the schemas of the two projects and transforming the underlying data accordingly. Comparing the classification schemas makes it immediately apparent that the purpose and scope of the two projects are very different. While the purpose of RGNDWM is to locate suitable sites for drinking water wells, NRCensus depicts changes and modifications in the country’s natural resources such as land, water, soils and forests.
Correlations such as equivalence, subset and partition were successfully established for some of the classes. Two main issues were found. One arose when moving from more generalized to less generalized classes, and the other when a single class comes under multiple parents in the classification scheme. For the first, we explored ways of deriving the additional information required by analyzing other layers. For example, a structural hill has to be classified as a highly, moderately or less dissected hill; this information was derived from the drainage and slope layers. For deciding the parent of a class, we explored multiple approaches, such as computing a similarity measure for the candidate parents and analyzing the corresponding area on the map.
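The refinement of a structural hill into a dissection class can be sketched as a rule over values taken from the drainage and slope layers, with terrain-specific definitions. The thresholds below are placeholders invented for illustration and are not the values used in either project; the rule that both measures must exceed their thresholds, and the function name `dissection_class`, are likewise assumptions.

```python
# Placeholder thresholds per terrain: (min drainage density, min slope in degrees).
# These numbers are NOT from the projects; they only illustrate the mechanism.
THRESHOLDS = {
    "plains": {"highly": (2.0, 10.0), "moderately": (1.0, 5.0)},
    "hilly": {"highly": (4.0, 30.0), "moderately": (2.0, 15.0)},
}


def dissection_class(drainage_density, slope_deg, terrain):
    """Refine 'Structural Hill' into a dissection class using two other layers."""
    for level in ("highly", "moderately"):
        min_density, min_slope = THRESHOLDS[terrain][level]
        if drainage_density >= min_density and slope_deg >= min_slope:
            return f"{level} dissected structural hill"
    return "less dissected structural hill"


print(dissection_class(2.5, 12.0, "plains"))
print(dissection_class(2.5, 12.0, "hilly"))
```

The same measurements fall into different classes in plains and in hilly terrain, which is why the terrain-dependent definitions need to be stored explicitly for the reclassification to be reproducible.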
5.5 Providing User-Specific Information
This requires processing, modeling and on-the-fly integration of relevant data at multiple levels – domain knowledge, data ontologies and user specific information. In order to provide what the user requires in the form he / she requires, data access services, data transformation services and high level services for modeling and analysis must be combined in an appropriate way dynamically. User-specific information can be provided in many cases. In some cases, a transformation that brings the data in a form closest to the user’s requirements can be accomplished.
5.6 Practical Issues
Data must be correct and complete to the extent possible. Some obvious errors that are easily spotted by the human eye cannot be detected by automated systems. Considerable care has to be taken to ensure the correctness, completeness and integrity of data that is used for automatic processing. Methods for data validation and for detecting potential errors in the data must be included in the technological solutions provided. The success of automated processing depends partly on capturing the appropriate information as ontologies. Classes having multiple parents in some classification schemas can confuse automated reasoners and must be resolved before submission for automated processing. Another issue faced was that, although the schemas were successfully aligned, polygons of higher levels were not digitized.
6. Conclusion
Ontology-driven information systems enable faster and better decision making by facilitating on-the-fly integration and interoperability of data across multiple disciplines. Ontologies aid in the correct usage of data, make data usable by a larger community and keep the data relevant for longer periods of time. Total automation of all analysis and modeling tasks may require further technological advances. However, if current technology is properly understood and applied in the right way, dynamic on-the-fly integrated analysis and modeling is possible to a large extent.
Acknowledgements
Shri. K. Babu Govindraj, Scientist, NRSC is gratefully acknowledged for his help in providing the required inputs from the NAS database of Applications area for the study. Director, ADRIN, Director, NRSC and Deputy Director (RS&GIS-AA), NRSC are gratefully acknowledged for giving permission to carry out the joint research project.