An improvement in measuring the semantic similarity between RDF ontologies
Abstract— RDF (Resource Description Framework) ontologies have been playing an important role in many knowledge applications because they provide a source of precisely defined terms. However, the widespread use of RDF ontologies creates a demand for an automatic way of assessing their similarity. In this paper, we present a novel method to measure the semantic similarity between elements in different RDF ontologies. The measure is designed to extract the information encoded in RDF element descriptions and to take into account each element's relationships with its ancestors and children. We evaluate the proposed measures in the context of matching two RDF ontologies to determine the number of matches between them, and then compare the results with human estimation and related methods. The experimental results show that our similarity values are better than those of other approaches with regard to the accuracy of semantic and structural similarities.
[Equations (8) and (10), which define DSim1(C1, C2) and DSim2(C1, C2), and the equation that defines NbSim(C1, C2) from SpSim(C1, C2) and ChSim(C1, C2), are not legible in this copy.]

Table 1. RDF data type compatibility by equation (9)

Pham Thi Thu Thuy, Nguyen Dang Tien

4.2.1. Super Similarity

Super entities are the set of super classes defined by rdfs:subClassOf, together with the properties of those classes. For instance, the super entities of element SportCar in Fig. 3 are Vehicle, power, and registeredTo. The super entity of an element within an RDF Schema document usually contains several elements, so the super similarity between two elements C1 and C2 is the average similarity of their two super element lists. Suppose the super elements of C1 are SC1 = [C11, C12, ..., C1k] and the super elements of C2 are SC2 = [C21, C22, ..., C2t], where k and t are the numbers of super elements of C1 and C2, respectively. If k ≥ t, we compare each element in SC1 with each element in SC2. Otherwise, if k < t, we compare each element in SC2 with each element in SC1. For each element of the longer list, the highest similarity value is chosen. The super similarity (SpSim) of two elements C1 and C2 is presented as the following matrices (11) and (12):

\[
SpSim(C_1, C_2) =
\begin{bmatrix}
DcSim(C_{11}, C_{21}) & \cdots & DcSim(C_{11}, C_{2t}) \\
\vdots & \ddots & \vdots \\
DcSim(C_{1k}, C_{21}) & \cdots & DcSim(C_{1k}, C_{2t})
\end{bmatrix},
\quad k \ge t \tag{11}
\]

\[
SpSim(C_2, C_1) =
\begin{bmatrix}
DcSim(C_{21}, C_{11}) & \cdots & DcSim(C_{21}, C_{1k}) \\
\vdots & \ddots & \vdots \\
DcSim(C_{2t}, C_{11}) & \cdots & DcSim(C_{2t}, C_{1k})
\end{bmatrix},
\quad k < t \tag{12}
\]

Here DcSim is the description similarity between each super element of C1 and each super element of C2. It is determined by equation (2). The super similarity of two elements C1 and C2 presented in matrices (11) and (12) is determined by the following equations (13) and (14), respectively.
\[
SpSim(C_1, C_2) = \frac{\sum_{i=1}^{k} \max_{j=1}^{t} DcSim(C_{1i}, C_{2j})}{k} \tag{13}
\]

\[
SpSim(C_2, C_1) = \frac{\sum_{i=1}^{t} \max_{j=1}^{k} DcSim(C_{2i}, C_{1j})}{t} \tag{14}
\]

Here max is the maximum similarity value of each row in the matrix. If neither C1 nor C2 has any super element (that is, both are root elements), then SpSim(C1, C2) = 1. If exactly one of the two compared elements is a root element, then SpSim(C1, C2) = 0.

4.2.2. Children Similarity

The children of an element C are the properties of C, all subclasses of C, and the corresponding properties of those subclasses. Similar to the super computation, in order to calculate the children similarity of two elements C1 in RDFS1 and C2 in RDFS2, we collect all children of C1 and C2 and then compare the description similarity of each pair of children. Assuming that m and n are the numbers of children of C1 and C2, respectively, the children similarity (ChSim) between C1 and C2 can be presented as the following matrices (15) and (16):

\[
ChSim(C_1, C_2) =
\begin{bmatrix}
DcSim(C_{11}, C_{21}) & \cdots & DcSim(C_{11}, C_{2n}) \\
\vdots & \ddots & \vdots \\
DcSim(C_{1m}, C_{21}) & \cdots & DcSim(C_{1m}, C_{2n})
\end{bmatrix},
\quad m \ge n \tag{15}
\]

\[
ChSim(C_2, C_1) =
\begin{bmatrix}
DcSim(C_{21}, C_{11}) & \cdots & DcSim(C_{21}, C_{1m}) \\
\vdots & \ddots & \vdots \\
DcSim(C_{2n}, C_{11}) & \cdots & DcSim(C_{2n}, C_{1m})
\end{bmatrix},
\quad m < n \tag{16}
\]

Here DcSim is the description similarity between each child element of C1 and each child element of C2. The children similarity of two elements C1 and C2 in matrices (15) and (16) is determined by the following equations (17) and (18), respectively:

\[
ChSim(C_1, C_2) = \frac{\sum_{i=1}^{m} \max_{j=1}^{n} DcSim(C_{1i}, C_{2j})}{m} \tag{17}
\]

\[
ChSim(C_2, C_1) = \frac{\sum_{i=1}^{n} \max_{j=1}^{m} DcSim(C_{2i}, C_{1j})}{n} \tag{18}
\]

If one of the elements C1 and C2 is a leaf node (that is, it contains no child node), their children similarity is 0.
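The row-maximum averaging of equations (13), (14), (17), and (18) can be illustrated with a minimal Python sketch. This is not the C# implementation described in this paper. The function names (row_max_average, sp_sim, ch_sim) are hypothetical, and toy_dc_sim is only a stand-in for the description similarity DcSim of equation (2), which is not reproduced here. The case where both elements are leaf nodes is not specified in the text; the sketch assumes a children similarity of 0 in that case.

```python
from typing import Callable, Sequence

DcSim = Callable[[str, str], float]


def row_max_average(a: Sequence[str], b: Sequence[str], dc_sim: DcSim) -> float:
    """Average, over the longer list, of each element's best match in the shorter list."""
    # The longer list supplies the matrix rows, as in matrices (11)/(12) and (15)/(16).
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    row_maxima = [max(dc_sim(x, y) for y in shorter) for x in longer]
    return sum(row_maxima) / len(longer)


def sp_sim(supers1: Sequence[str], supers2: Sequence[str], dc_sim: DcSim) -> float:
    """Super similarity following equations (13) and (14)."""
    if not supers1 and not supers2:
        return 1.0  # both elements are root elements
    if not supers1 or not supers2:
        return 0.0  # exactly one element is a root element
    return row_max_average(supers1, supers2, dc_sim)


def ch_sim(children1: Sequence[str], children2: Sequence[str], dc_sim: DcSim) -> float:
    """Children similarity following equations (17) and (18)."""
    if not children1 or not children2:
        # A leaf node is involved. Assumption: this also covers two leaf nodes.
        return 0.0
    return row_max_average(children1, children2, dc_sim)


def toy_dc_sim(x: str, y: str) -> float:
    """Hypothetical stand-in for DcSim: 1.0 for identical names, else 0.0."""
    return 1.0 if x == y else 0.0


# Super entities of SportCar (from the example above) against a made-up second element.
supers_c1 = ["Vehicle", "power", "registeredTo"]
supers_c2 = ["Vehicle", "owner"]
print(sp_sim(supers_c1, supers_c2, toy_dc_sim))
```

With k = 3 and t = 2, the row maxima in this toy example are 1, 0, and 0, so the sketch averages them over k = 3 rows. Any real use would replace toy_dc_sim with an implementation of equation (2).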
Depending on the expected similarity value (threshold value), the semantic similarity between two elements C1 and C2 (R2Sim) can be divided into two groups, high similarity and low similarity, and the matching and integrating strategies for those elements are then applied accordingly. In this paper, we set the threshold value to 0.7. Therefore, if the value of R2Sim is greater than or equal to 0.7, the two elements are considered highly similar.

V. EXPERIMENTAL EVALUATION

We perform experiments to answer two questions.
1. How much advantage does R2Sim provide compared to other approaches?
2. How effective is each similarity factor in measuring the semantic similarity between elements in different RDF Schemas?
To answer these questions, we select the data set and set up the implementation as follows.

5.1. Data Set and Setup

The semantic similarity between elements in different RDF Schemas (R2Sim) is implemented in C#. To compute the name similarity (NSim) in the description measurement, we integrate WordNet and its .NET API, provided by Simpson et al. [21], into our implementation.

We evaluate the proposed measures in the context of matching two RDF Schemas to determine the number of matches between them, and then compare the results with other approaches. The criteria for evaluating the quality of a matching system are precision and recall, which originate from information retrieval [22] and were adapted to ontology matching [18].

To examine the performance of R2Sim, we download about 20 RDF Schemas from [20] as source schemas and then modify each source schema to generate a corresponding destination schema. This paper presents the test results for five RDFS sources from [20] and their modified schemas. The characteristics of the five RDF Schemas are presented in Table 2.

Table 2. The characteristics of the tested schemas

#  Schema name     File size (KB)  # classes (source/destination)  # properties (source/destination)
1  GeneOntology    4642            7853/7000                       40107/40200
2  RealEstateData  2311            2925/3050                       19687/15600
3  ACM-Computing   231             312/312                         1146/1200
4  MovieDatabase   86              96/80                           379/300
5  Educational     8               17/30                           26/40

In Table 2, the destination schema of GeneOntology is obtained by increasing the number of properties and decreasing the number of classes of the source schema. In contrast, we decrease the number of properties and increase the number of classes of RealEstateData. For ACM-Computing, we keep the same number of classes and increase the number of properties. For MovieDatabase, we decrease both numbers, whereas we increase both in the Educational schema. The experimental results are presented in the next section.

5.2. Experiment Results

Since our approach focuses on the similarity between RDF Schema elements, we compare our method to similar works such as Leme et al. [17], Do et al. [18], and Algergawy et al. [12]. The precision, recall, and F-measure values of R2Sim and the related work are presented in Fig. 5, Fig. 6, and Fig. 7. Note that in this paper the threshold values are chosen between 0.3 and 1, since element pairs with similarity values lower than 0.3 are mostly different and are easy for a human observer to identify.

The comparison results in Fig. 5, Fig. 6, and Fig. 7 show that R2Sim significantly outperforms the other methods at thresholds above 0.5, followed by the methods of Algergawy, Leme, and Do. At thresholds of 0.5 or lower, Algergawy's method outperforms R2Sim. The main reason is that the data type similarity values of Algergawy's method are very high and are based on the user's judgment. For high threshold values, however, Algergawy's method yields less accurate similarity values.
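The evaluation criteria used in this comparison can be illustrated with a short Python sketch. It assumes the standard information-retrieval definitions of precision, recall, and F-measure (the harmonic mean of the two), and it treats a pair of elements as a match when its similarity value reaches the threshold. The function name evaluate_matches, the element names, and the similarity values are hypothetical and are not taken from the experiments reported here.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def evaluate_matches(
    scores: Dict[Pair, float], reference: Set[Pair], threshold: float = 0.7
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) of the pairs scoring at or above the threshold."""
    predicted = {pair for pair, score in scores.items() if score >= threshold}
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up similarity values for four candidate pairs.
scores = {
    ("Car", "Automobile"): 0.90,
    ("Car", "Person"): 0.75,
    ("Owner", "Holder"): 0.60,
    ("Price", "Cost"): 0.80,
}
# Made-up reference alignment, standing in for the manually determined matches.
reference = {("Car", "Automobile"), ("Owner", "Holder"), ("Price", "Cost")}

for threshold in (0.5, 0.7):
    print(threshold, evaluate_matches(scores, reference, threshold))
```

In this toy data, lowering the threshold from 0.7 to 0.5 admits one more correct pair, which raises recall; the same mechanism is why the curves in a threshold sweep differ across methods.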
The measures of Do and Leme have poor results since they are based simply on the string similarity of element names, but Leme's method is better than Do's since Leme's approach also considers the data type similarity.

Further, in order to determine the most important factor that affects the similarity values, we evaluate the five similarity factors (NSim, DfSim, DtSim, SpSim, and ChSim) separately and compare them with their full combination (R2Sim). The result is presented in Fig. 8. The columns in Fig. 8 show that DtSim has the lowest measure quality. Its F-measure value is only 62%, compared with 66% for NSim and DfSim and about 80% for SpSim and ChSim. The reason is that the data type similarity measure applies mostly to property elements, whereas the number of class elements in RDFS is very high, so DtSim alone cannot differentiate the semantic similarity of RDF elements. Among the five measuring factors, ChSim gives the highest similarity value. However, regarding the best quality achieved, we observe that the combination of all similarity factors outperforms ChSim. Therefore, it is better to use multiple similarity measures than a single measure.

Fig. 5. Precision among R2Sim and related approaches
Fig. 6. Recall among R2Sim and related approaches
Fig. 7. F-measure among R2Sim and related work
Fig. 8. Quality of R2Sim, NSim, DfSim, DtSim, SpSim, and ChSim

VI. CONCLUSIONS

This paper proposes a novel similarity measuring technique for RDF elements. We present a semantic similarity measurement method that computes both description and neighborhood resemblances. The experimental evaluation demonstrates that our method outperforms human judgment and related approaches, and that it achieves its best results when processing complex RDF documents. Further, the combination of all measuring factors provides important information for deriving the correct similarity values. We hope that this research has established a foundation to help the integration of different RDF Schemas. If this method is popularized, a large amount of RDF Schema data on the current Web will be integrated into a useful ontology for the Semantic Web and its applications. Our future research will focus on computing the similarity of RDF individuals based on the RDF Schema's relatedness.

REFERENCES

[1] Frank Manola, Eric Miller, W3C, 2004.
[2] Doan A. H., Madhavan J., Domingos P., Ontologies Matching: A Machine Learning Approach, Handbook on Ontologies in Inf. Systems, Springer-Verlag, 2003.
[3] Ehrig M., Sure Y., Ontology Mapping – an integrated approach, 1st European Semantic Web Symposium, 2004.
[4] Oundhankar S., K. Verma, Sivashanugam K., Discovery of web services in a Multi-Ontologies and Federated Registry Environment, International Journal of Web Services Research, 1, 3, 2005.
[5] Ronald M., Thomas H., Rene P., Enterprise Knowledge Infrastructures, 2nd edition, Springer, 2009.
[6] D Vint Productions, XML Schema - Data Types Quick Reference, 2003.
[7] E. Pyshkin, A. Kuznetsov, Approaches for Web Search User Interfaces: How to improve the search quality for various types of information, JoC, Vol. 1, No. 1, pp. 1-8.
[8] Roman Y Shtykh, Qun Jin, A human-centric integrated approach to web information search and sharing, HCIS 2011, 1:2 (22 November 2011).
[9] Vitaly Klyuev, Ai Yokoyama, Web Query Expansion: A Strategy Utilising Japanese WordNet, JoC, Vol. 1, No. 1, pp. 23-28.
[10] Princeton University, WordNet: A lexical database for English.
[11] R. Nayak, T. Tran, A progressive clustering algorithm to group the XML data by structural and semantic similarity, Pattern Recognition & Artificial Intelligence 21(4) (2007) 723-743.
[12] Alsayed Algergawy, Richi Nayak, Gunter Saake, Element similarity measures in XML schema matching, Journal of Information Sciences, pp. 4975-4998, 2010.
[13] Sergey Melnik, Bridging the gap between RDF and XML, 1999.
[14] M. Ferdinand, C. Zirpins, and D. Trastour, Lifting XML Schema to OWL, Web Engineering – 4th International Conference, ICWE, pp. 354-358, 2004.
[15] Tim Berners-Lee, A strawman unstriped syntax for RDF in XML, W3C, March 2007.
[16] Jonathan Boden, Simplified XML syntax for RDF.
[17] Leme, L. A. P. P.; Casanova, M. A.; Breitman, K. K. & Furtado, A. L., Evaluation of similarity measures and heuristics for simple RDF schema matching, Technical Report 44/08, Dept. Informatics, PUC-Rio 14, 2008.
[18] Hong-Hai Do and Erhard Rahm, COMA - A System for Flexible Combination of Schema Matching Approaches, Proceedings of the Very Large Data Bases conference (VLDB), pp. 610–621, 2002.
[19] Dongqiang D. Yang and David M.W. Powers, Measuring Semantic Similarity in the Taxonomy of WordNet, The 28th Australasian Computer Science Conference (ACSC2005), Australia, pp. 315-322, 2005.
[20] A. Maganaraki and L. Sidirourgos, RDF Schema Registry, 2004.
[21] Troy Simpson, Crowe M., WordNet.Net, 2005.
[22] Wikipedia, Precision and recall.
[23] Pham Thi Thu Thuy, Young-Koo Lee, and Sungyoung Lee, "R2Sim: A Novel Semantic Similarity Measure for Matching between RDF Schemas", The 2012 FTRA International Conference on Advanced IT, engineering and Management (FTRA AIM 2012), Seoul, Korea, February 6-8, 2012.
[24] Roberto De Virgilio, Antonio Maccioni, Riccardo Torlone, "A similarity measure for approximate querying over RDF data", EDBT'13 Proceedings of the Joint EDBT/ICDT 2013 Workshops, pp. 205-213, 2013.
[25] Samur Araujo, Jan Hidders, Daniel Schwabe, Arjen P. de Vries, "SERIMI – Resource Description Similarity, RDF Instance Matching and Interlinking", OM, volume 814 of CEUR Workshop Proceedings, CEUR-WS.org, 2011.
[26] Marcelo Schiessl, Rita Berardi, and Marisa Brascher, "Similarity between text and RDF", Information Services and Use, 34 (2014), 325-330.
[27] Mehwish Alam and Amedeo Napoli, "An Approach Towards Classifying and Navigating RDF data based on Pattern Structures", Proceedings of the International Workshop on Formal Concept Analysis and Applications 2015, co-located with the 13th International Conference on Formal Concept Analysis, Nerja, Spain, June 2015, 1434, pp. 33-48.

AN IMPROVEMENT IN MEASURING THE SEMANTIC SIMILARITY BETWEEN RDF DOCUMENTS

Phạm Thị Thu Thúy, Nguyễn Đăng Tiến

ABSTRACT— RDF currently plays an important role in knowledge applications because it provides a set of terms that allow data to be described precisely. However, the growth of RDF creates a need to assess the similarity between related documents. This paper presents an improvement in comparing the semantic correlation between elements in RDF documents. The measurement formulas focus on the information described in RDF elements and on the relationships between parent and child elements. The proposed formulas are tested by mapping RDF documents to each other to determine the number of matches, and the results are compared with the objective judgment of users. The experimental results show that our method yields more accurate similarity values than the related methods.

Keywords— Similarity, RDF documents, measurement.