Mapping of extensible markup language-to-ontology representation for effective data integration

Extensible markup language (XML) is well known as the standard for data exchange over the internet. It is flexible and highly expressive in representing the relationships between stored data. Yet, structural complexity and semantic relationships are not well expressed. Ontology, on the other hand, models structural, semantic and domain knowledge effectively. By combining ontology with visualization, one can obtain a closer view tailored to the respective user requirements. In this paper, we propose several mapping rules for the transformation of XML into ontology representation. Subsequently, we show how the ontology is constructed based on the proposed rules, using the University of Wisconsin-Milwaukee (UWM) and mondial datasets as sample domains. We also look at the schemas, query workload, and evaluation to derive extended knowledge from the existing ontology. The correctness of the ontology representation is demonstrated by its support for various types of complex queries in the SPARQL Protocol and RDF Query Language (SPARQL).


INTRODUCTION
Extensible markup language (XML) has been widely used as the data exchange format over the internet [1], [2]. Big data analytics is a trend in various industries to boost industrial performance, and the XML data format usually forms the basis of the data streams used in the analytical process [3], [4]. However, XML represents data only at the syntactic level. Ontology, on the other hand, is a knowledge representation that establishes a shared vocabulary and conceptualization and models domain knowledge for various applications [5]-[8]. Ontology is often expressed in the Web Ontology Language (OWL) format.
Several ontology generation (also known as ontology mapping) techniques exist to bridge the gap between the syntactic XML and the semantic OWL representation [9]-[11]. Ontology enrichment is also an objective of the transformation [12]. It extends the ontology by adding elements and constructors (classes, object attributes, data types, concept relations, axioms, properties). In addition, the ontology population process adds individuals from the XML data, or attributes of existing individuals, to the ontology representation.
In general, the mapping approaches from XML to ontology representation can be grouped into two main categories: the instance approach and the validation approach [13]. The instance approach converts XML documents directly to ontology representation without using schema knowledge. Most of these approaches generate new ontologies from the XML content alone, using the XML path language (XPath), a query language based on path expressions, to navigate each path to the respective node in the XML. Klein presented the first mapping tool to translate XML documents directly to an ontology language such as resource description framework (RDF) or OWL [14]. He proposed a method to transform ambiguous XML data into RDF statements on a one-way mapping basis. Bohring and Auer [15] proposed a framework to map XML into OWL, which starts from the XML instance document, generates an XML schema definition (XSD) where needed, and finally transforms it into OWL. O'Connor and Das [16] proposed a domain-specific language called XML Master, which is developed using OWL syntax and the XPath query language.
The validation approach generates the ontology from the schema. XSD and document type definition (DTD) are the two major schema languages in use today. However, DTDs are not written in XML format and do not support namespaces, whereas XSD provides more advanced features. These approaches take advantage of XSD, which contains the defined elements and the XML types of simple or complex data. Ferdinand et al. [17] proposed a semi-automated approach named OWL mapping (OWLMAP), which is built on mapping rules that handle complex type, simple type, attribute, element, element type, substitution group and so on.
Int J Artif Intell, ISSN: 2252-8938. Mapping of extensible markup language-to-ontology representation for … (Su-Cheng Haw)
The XML Schema to OWL (XS2OWL) [18] approach aims to support interoperability between the XML and OWL environments. The tool automatically transforms the input XSD into: i) a main ontology, which directly reflects the defined transformation rules and ii) a mapping ontology. The mapping ontology keeps the rdf:IDs of the OWL constructs of the main ontology that cannot be generated directly from the main ontology. The mapping ontology has four classes, including complex type info type, element info type and data type property info type. Bedini et al. [19] developed a prototype named Janus, which consists of 40 transformation rules to map XSD constructs to ontology (OWL2-RL) constructs. Their approach managed to minimize the information loss during the transformation process. As an example, the construct 'restriction', derived from the restriction of a simple type, allows the creation of several simple types from the predefined simple types in XSD. However, the transformation is designed around one application domain, namely statistical analysis in the business-to-business (B2B) domain, which constrains this approach. An efficient XML to OWL converter (EXCO) [20] is a tool that manages both enrichment and population from XML documents into OWL by covering both internal and external references. Thuy et al. [21] proposed s-trans, which transforms XML healthcare data into ontology by extracting the XML schema and adding a description of the semantic knowledge. Subsequently, in another study, Thuy et al. [22] proposed reducing the data redundancy that results from duplicate elements in the XML schema by measuring the similarity between these duplicates before the transformation process.
More recently, Shapkin and Shumsky [23] proposed modularizing the transformation from XML to ontologies using designed templates, which are constructed based on class and property values. Singapogu et al. [24] proposed a mapping that examines the XML schema elements to automatically structure and represent them in a first draft of the ontology. Subsequently, a part-of-speech tagging method is employed to extract domain knowledge and refine the ontology. Jounaidi and Bahaj [25] formulated rules for mapping the XML schema into ontology representation. This mapping also covers the relationships between the nodes to ensure that the structure is maintained. They also proposed a canonical data model (CDM) to transform XML Schema into OWL ontology [26]. Hacherouf et al. [27] proposed patterns identification for XSD conversion to OWL (PIXCO), a method based on formal concept analysis (FCA) to model the transformation patterns. Several processes are involved, including the construction of the XML schema, the identification of the transformation patterns and the OWL modelling.
From the review, we observed that the EXCO [20] tool is stable and produces enriched information. Nevertheless, EXCO can be further improved to support some advanced operators and restrictions. Our proposed framework extends EXCO with new functionalities, as described in the next section. Figure 1 depicts the overall framework of our proposed approach. The proposed solution uses the validation approach. At first, if neither an XSD nor a DTD is available as input, one is generated automatically from the XML documents. The generation of the target OWL is composed of a few stages, as elaborated next.

Stage 1: initial transformation step
The Trang application programming interface (API) is used in the transformation between an XML document and an XML schema to define the restrictions on the XML structure. Trang is an open-source API for working with XML schema files that converts a schema into XSD format. In addition, Trang is capable of inferring a schema from the XML document itself if no schema is present.
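As a rough illustration of what inferring a schema from an instance document involves, the following Python sketch records, for every element name, which child elements occur and the minimum and maximum number of times each child appears under one parent. It is a simplified stand-in written with the standard library only and does not reproduce Trang or its output format. The function name `infer_structure` and the sample document (including the `course` and `title` values) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def infer_structure(xml_text):
    """Return {parent_tag: {child_tag: (min_occurs, max_occurs)}} as observed."""
    root = ET.fromstring(xml_text)
    observed = {}  # parent tag -> one Counter of child tags per parent instance
    for parent in root.iter():
        counts = Counter(child.tag for child in parent)
        observed.setdefault(parent.tag, []).append(counts)

    structure = {}
    for parent_tag, instances in observed.items():
        child_tags = set()
        for counts in instances:
            child_tags.update(counts)
        # A Counter returns 0 for a missing key, so an absent child gives min 0.
        structure[parent_tag] = {
            tag: (min(c[tag] for c in instances), max(c[tag] for c in instances))
            for tag in sorted(child_tags)
        }
    return structure


# Hypothetical sample data; element values are invented for illustration.
SAMPLE = """<root>
  <course_listing>
    <course>C-001</course><title>Sample title</title>
    <section_listing/><section_listing/>
  </course_listing>
  <course_listing>
    <course>C-002</course>
    <section_listing/>
  </course_listing>
</root>"""

structure = infer_structure(SAMPLE)
print(structure["course_listing"])
```

The observed bounds correspond to what a schema would express with minOccurs and maxOccurs; a real inference tool also has to decide on data types and content models, which this sketch leaves out.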

Stage 2: resolving the conflict
Next, to resolve the internal and external references, the consolidation mapping is adopted from the method in [19]. It has three steps: i) collecting schema files: the XSD schemas are collected and the references to their locations are stored in a hash table, with the namespaces and references saved to avoid duplication; ii) merging schema files: the namespace prefixes of the saved schemas are unified so that they can be merged into the main schema file; and iii) reorganizing the schema: the internal references within the main schema file are reorganized. The referred elements are simply appended into the referring node, and this is done hierarchically through the descendants. Finally, unused elements are eliminated.
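The first two consolidation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of [19]: the schemas are held in an in-memory dictionary (`SCHEMA_FILES`, hypothetical) instead of being read from disk, all schemas already use the same `xs` prefix so no prefix unification is performed, and the function names `collect_schemas` and `merge_schemas` are invented for this example.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
REFERENCE_TAGS = (XS + "include", XS + "import")

# Hypothetical in-memory "files" standing in for schemas on disk.
SCHEMA_FILES = {
    "main.xsd": """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:include schemaLocation="common.xsd"/>
        <xs:include schemaLocation="types.xsd"/>
        <xs:element name="root" type="xs:string"/>
    </xs:schema>""",
    "common.xsd": """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:include schemaLocation="types.xsd"/>
        <xs:element name="hours" type="xs:string"/>
    </xs:schema>""",
    "types.xsd": """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:element name="credits" type="xs:string"/>
    </xs:schema>""",
}


def collect_schemas(start, files):
    """Step i: follow include/import references, storing each schema once."""
    table = {}  # hash table keyed by schema location
    pending = [start]
    while pending:
        location = pending.pop()
        if location in table:
            continue  # already collected, avoids duplication and cycles
        schema = ET.fromstring(files[location])
        table[location] = schema
        for child in schema:
            if child.tag in REFERENCE_TAGS and child.get("schemaLocation"):
                pending.append(child.get("schemaLocation"))
    return table


def merge_schemas(start, table):
    """Step ii: move the declarations of referenced schemas into the main one."""
    main = table[start]
    for child in list(main):
        if child.tag in REFERENCE_TAGS:
            main.remove(child)  # the reference is replaced by the merged content
    for location, schema in table.items():
        if location == start:
            continue
        for child in schema:
            if child.tag not in REFERENCE_TAGS:
                main.append(child)
    return main


table = collect_schemas("main.xsd", SCHEMA_FILES)
merged = merge_schemas("main.xsd", table)
names = sorted(e.get("name") for e in merged.findall(XS + "element"))
print(names)
```

Keying the table by location is what prevents `types.xsd`, which is referenced twice, from being merged twice.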

Stage 3: automated transformation
The automatic transformation is handled by an algorithm, developed from the transformation model, that maps the constructs of the consolidated output into OWL description logics (OWL-DL) constructs. The process is done without any user intervention and eases the user's work by constructing the initial mapping.

Stage 4: refinement stage
In this stage, the generated ontology and the mapping bridges are refined. Invalid mappings can be cleaned and reconstructed. Ontology constructs that cannot be generated directly from the XML schema can be mapped manually using the mapping ontology, which keeps the rdf:IDs of the OWL constructs.

METHOD
Translation on UWM dataset
The University of Wisconsin-Milwaukee (UWM) XML document from the University of Washington (UW) database group [28] is used as an example. The UWM data are course data derived from the UWM website. Figure 2(a) shows a partial view of the UWM XML document, which contains a series of <course_listing> records, each of which contains the elements and values of the record. The next step of the ontology generation process is the conversion of the XML document to an XML Schema (XSD). The generated XML schema is depicted in Figure 2(b). The generated XSD consists of the elements, sub-elements, and property restrictions such as the type of cardinality, as well as the operators for class combinations (union of, complement of, intersection of).
The rules for generating OWL constructs are shown in Figure 3. An OWL class is created from each xsd:complexType and from each xsd:element that is an independent entity. An OWL data type property is created from each element that contains only a literal and has no attributes, as well as from each XML attribute. Occurrence constraints in the XML schema, such as minOccurs and maxOccurs, are mapped to the cardinality constraints owl:minCardinality and owl:maxCardinality. Inheritance (the is-a relationship) derived from the XML Schema is mapped to rdfs:subClassOf in OWL and, similarly for elements, to rdfs:subPropertyOf. The compositors that combine elements (sequence, all and choice) are mapped to class-combination operators such as owl:intersectionOf and owl:unionOf.
Table 1 shows the OWL classes and properties generated: Table 1(a) lists the object type properties, while Table 1(b) lists the data type properties. During the transformation, when two elements have the same name but occur at different levels, the prefix "has" is added to the name of the owl:ObjectProperty. In addition, an rdf:ID is generated for each instance of a class, in order. The generated OWL ontology is shown in Figure 5. It comprises seven complex types, since seven OWL classes are created: root, course_listing, restriction, A, section_listing, hours, and bldg_and_rm. The course_listing, section_listing, hours and bldg_and_rm classes further contain their respective properties, as shown in Figure 5.
Next, for ontology population, the method we adopted takes the XML instances and XSD files as inputs. The XSD documents act as the reference for translating the XML instances into the OWL ontology: XML elements are transformed into instances according to the OWL model, and their literal values are represented through data type properties.
The extracted model of the constructed OWL ontology is composed of: i) classes for concept definition; ii) object properties for relationships between objects; and iii) data type properties for relationships between objects and data values.
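To make the population step more concrete, here is a minimal Python sketch that turns XML elements into individuals and property assertions, represented as plain (subject, predicate, object) tuples. It is an illustration only and not the authors' implementation: it works from the XML instance alone without consulting the XSD, the identifier scheme (`tag_N`, numbered in document order) and the `has_` prefix on every object property are assumptions, and the sample values are invented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def populate(xml_text):
    """Turn XML elements into (subject, predicate, object) triples."""
    triples = []
    counters = defaultdict(int)

    def visit(element):
        # One individual per element, numbered per class in document order.
        counters[element.tag] += 1
        individual = f"{element.tag}_{counters[element.tag]}"
        triples.append((individual, "rdf:type", element.tag))
        for child in element:
            if len(child) == 0 and not child.attrib:
                # Literal-only element: data type property assertion.
                triples.append((individual, child.tag, (child.text or "").strip()))
            else:
                # Structured element: separate individual linked by an object property.
                linked = visit(child)
                triples.append((individual, "has_" + child.tag, linked))
        return individual

    visit(ET.fromstring(xml_text))
    return triples


# Hypothetical sample data; element values are invented for illustration.
SAMPLE = """<root>
  <course_listing>
    <title>Sample course</title>
    <credits>3</credits>
    <section_listing><instructor>Sample instructor</instructor></section_listing>
  </course_listing>
</root>"""

triples = populate(SAMPLE)
for triple in triples:
    print(triple)
```

In a schema-driven population, the decision between a data type property and an object property would come from the XSD rather than from the shape of each instance, which keeps the result consistent when an element is empty in one record and populated in another.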
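The three kinds of constructs and the cardinality rules can be illustrated with a short Python sketch that walks a small XSD and lists the OWL constructs each declaration would give rise to. It is a simplified reading of the rules under stated assumptions, not the transformation algorithm itself: only locally declared elements with anonymous complex types are handled, nested compositors and type inheritance are ignored, the results are plain tuples instead of OWL syntax, and the schema fragment (including the `section` attribute) is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XS = "{http://www.w3.org/2001/XMLSchema}"
COMPOSITORS = {XS + "sequence", XS + "all", XS + "choice"}

# Hypothetical schema fragment used only for illustration.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="course_listing">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="section_listing" minOccurs="1" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="instructor" type="xs:string" minOccurs="0"/>
            </xs:sequence>
            <xs:attribute name="section" type="xs:string"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def map_element(element, owner, out):
    name = element.get("name")
    complex_type = element.find(XS + "complexType")
    if complex_type is None:
        kind = "owl:DatatypeProperty"  # literal-only element
    else:
        out["owl:Class"].append(name)  # complex type becomes a class
        kind = "owl:ObjectProperty"    # and links its owner to that class
    if owner is not None:
        out[kind].append((owner, name))
        # minOccurs and maxOccurs both default to 1 in XML Schema.
        out["owl:minCardinality"].append((owner, name, int(element.get("minOccurs", "1"))))
        max_occurs = element.get("maxOccurs", "1")
        if max_occurs != "unbounded":
            out["owl:maxCardinality"].append((owner, name, int(max_occurs)))
    if complex_type is not None:
        for part in complex_type:
            if part.tag in COMPOSITORS:
                for child in part.findall(XS + "element"):
                    map_element(child, name, out)
            elif part.tag == XS + "attribute":
                out["owl:DatatypeProperty"].append((name, part.get("name")))


def map_schema(xsd_text):
    out = defaultdict(list)
    for element in ET.fromstring(xsd_text).findall(XS + "element"):
        map_element(element, None, out)
    return dict(out)


constructs = map_schema(XSD)
for kind, items in constructs.items():
    print(kind, items)
```

An unbounded maxOccurs produces no maximum cardinality restriction, since there is no upper limit to state.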

Translation on mondial dataset
In addition, the same XML-OWL transformation is applied to the mondial XML document. Table 2 and Table 3 show the OWL classes and properties generated. Table 3(a) lists the object type properties, while Table 3(b) lists the data type properties. The generated mondial OWL ontology is depicted in Figure 6.
Figure 6. The generated UWM OWL ontology

RESULTS AND DISCUSSION
The proposed approach has been applied to several XML datasets from different domains, including the UWM and mondial datasets. The final ontology is evaluated by comparing the semantics captured in the manually defined ontologies with the semantics captured by the automatic generation. Based on this comparison, the semantics captured manually is the same as the result of the automatic transformation. The correctness of the ontology representation is verified by checking the query results against manual verification. Several designed test queries were executed against the constructed UWM and mondial ontologies using the SPARQL playground (a standalone multi-platform web application) [29].

Query results on UWM dataset
Two queries were executed to check that the number of results returned on the UWM ontology representation is correct when compared with the same query on the XML dataset itself. Figure 7 shows the first query, query 1, which lists the course_listing records with a credit of 7. Figure 8 depicts query 2, which lists the number of sections of each course, grouped by credit. The number of returned results indicates that the ontology constructed via our mapping scheme is correct.
For the mondial dataset, Figure 9 shows the first query, query 1, which lists the countries at latitude 50.3 together with their respective religions. Figure 10 depicts query 2, which shows the cities with a population of more than 10000. Again, the number of returned results indicates that the ontology constructed via our mapping scheme is correct.
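The count comparison behind this check can be sketched in Python as follows. The sketch answers a question in the spirit of UWM query 1 twice, once directly on the XML and once on a small triple set derived from it, and compares the two counts. It uses only the standard library, so the triple matching is a hand-written stand-in and not a SPARQL engine. The `QUERY` string is a hypothetical SPARQL formulation that is not executed and is not the query from Figure 7; the `ex:` namespace, the `credits` element name and the sample records are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; titles and credit values are invented.
SAMPLE = """<root>
  <course_listing><title>Course A</title><credits>7</credits></course_listing>
  <course_listing><title>Course B</title><credits>3</credits></course_listing>
  <course_listing><title>Course C</title><credits>7</credits></course_listing>
</root>"""

# Hypothetical SPARQL for the same question; shown for orientation, not executed.
QUERY = """
PREFIX ex: <http://example.org/uwm#>
SELECT ?course WHERE { ?course a ex:course_listing ; ex:credits "7" . }
"""

root = ET.fromstring(SAMPLE)

# Count directly on the XML document.
xml_count = sum(
    1 for course in root.findall("course_listing")
    if course.findtext("credits") == "7"
)

# Derive a small set of triples from the same document.
triples = set()
for index, course in enumerate(root.findall("course_listing"), start=1):
    subject = f"course_listing_{index}"
    triples.add((subject, "rdf:type", "course_listing"))
    for child in course:
        triples.add((subject, child.tag, child.text))


def subjects_with(predicate, obj):
    """Return the subjects that have the given predicate and object."""
    return {s for s, p, o in triples if p == predicate and o == obj}


# Count on the triples: individuals of the class whose credits value is "7".
ontology_count = len(
    subjects_with("rdf:type", "course_listing") & subjects_with("credits", "7")
)

print(xml_count, ontology_count)
```

Equal counts are a necessary condition for a faithful mapping, not a sufficient one, so a check of this kind is best combined with inspection of the returned values.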

CONCLUSION
In this paper, we proposed a set of transformation rules to translate XML documents into OWL ontology representation. An evaluation against the manual transformation approach found that the generated ontology accurately defines the semantics of the XML document. In future work, we will focus on the generation of OWL for the unsupported constructs of the validation schema.

Palanichamy Naveen
joined the Faculty of Computing and Informatics, Multimedia University after receiving her Ph.D. from Curtin University, Malaysia. She received her Bachelor of Engineering (CSE) and Master of Engineering (CSE) from Anna University, India. Her research interests include smart grid, cloud computing, machine learning, deep learning and recommender systems. She is involved in multiple research projects funded by Multimedia University. She can be contacted at email: p.naveen@mmu.edu.my.