An improvement in measuring the semantic similarity between RDF ontologies

Abstract— RDF (Resource Description Framework) ontologies play an important role in many knowledge applications because they provide a source of precisely defined terms. However, the widespread use of RDF ontologies creates a demand for an automatic way of assessing their similarity. In this paper, we present a novel method to measure the semantic similarity between elements in different RDF ontologies. The measure is designed to extract the information encoded in RDF element descriptions and to take into account each element's relationships with its ancestors and children. We evaluate the proposed measures by matching two RDF ontologies to determine the number of matches between them, and then compare the results with human estimation and related methods. The experimental results show that our similarity values are more accurate than those of other approaches with regard to both semantic and structural similarity.

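[Equations (8) and (10): the definitions of DSim1(C1, C2), DSim2(C1, C2), and NbSim(C1, C2) are not recoverable from the extracted text. Surviving fragments: NbSim(C1, C2) is built from SpSim(C1, C2) and ChSim(C1, C2) together with a constant 2; DSim1(C1, C2) and DSim2(C1, C2) each compare C1.cf with C2.cf and involve a max(n, n) term.]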
Table 1. RDF data type compatibility by equation (9) 
Pham Thi Thu Thuy, Nguyen Dang Tien 27 
4.2.1. Super Similarity 
Super entities are the set of super classes defined by rdfs:subClassOf, together with the properties of those classes. For instance, the super entities of element SportCar in Fig. 3 are Vehicle, power, and registeredTo. Usually, an element within an RDF Schema document has several super entities; therefore, the super similarity between two elements C1 and C2 is the average similarity of their two lists of super elements.
For instance, let the super elements of an element C1 be SC1 = [C11, C12, ..., C1k] and the super elements of an element C2 be SC2 = [C21, C22, ..., C2t], where k and t are the numbers of super elements of C1 and C2, respectively. If k ≥ t, we compare each element in SC1 with each element in SC2. Otherwise, if k < t, we compare each element in SC2 with each element in SC1. For each compared element, the highest similarity value is chosen. The super similarity (SpSim) of two elements C1 and C2 is presented as the following matrices (11) and (12):
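$$
SpSim(C_1, C_2) =
\begin{bmatrix}
DcSim(C_{11}, C_{21}) & \cdots & DcSim(C_{11}, C_{2t}) \\
\vdots & \ddots & \vdots \\
DcSim(C_{1k}, C_{21}) & \cdots & DcSim(C_{1k}, C_{2t})
\end{bmatrix}, \quad k \ge t \tag{11}
$$

$$
SpSim(C_2, C_1) =
\begin{bmatrix}
DcSim(C_{21}, C_{11}) & \cdots & DcSim(C_{21}, C_{1k}) \\
\vdots & \ddots & \vdots \\
DcSim(C_{2t}, C_{11}) & \cdots & DcSim(C_{2t}, C_{1k})
\end{bmatrix}, \quad k < t \tag{12}
$$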
where DcSim is the description similarity between each super element of element C1 and each super element of element C2, as determined by equation (2). The super similarity of two elements C1 and C2 presented in matrices (11) and (12) is determined by the following equations (13) and (14), respectively.
$$
SpSim(C_1, C_2) = \frac{\sum_{i=1}^{k} \max_{j=1}^{t} DcSim(C_{1i}, C_{2j})}{k} \tag{13}
$$

$$
SpSim(C_2, C_1) = \frac{\sum_{i=1}^{t} \max_{j=1}^{k} DcSim(C_{2i}, C_{1j})}{t} \tag{14}
$$
where max is the maximum similarity value of each row in the matrix.
If neither C1 nor C2 has any super element (that is, both are root elements), then SpSim(C1, C2) = 1. If only one of the two compared elements is a root element, then SpSim(C1, C2) = 0.
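The row-wise computation in equations (13) and (14), together with the two root-element cases, can be sketched as follows. This is a minimal illustration, not the authors' C# implementation: the names `row_max_average`, `sp_sim`, and `toy_dcsim` are hypothetical, and because DcSim (equation (2)) is not reproduced in this section, a hand-written lookup table of made-up scores stands in for it.

```python
from typing import Callable, Dict, Sequence, Tuple

DcSim = Callable[[str, str], float]


def row_max_average(rows: Sequence[str], cols: Sequence[str], dcsim: DcSim) -> float:
    """Average, over all rows, of the highest DcSim value found in each row."""
    return sum(max(dcsim(r, c) for c in cols) for r in rows) / len(rows)


def sp_sim(supers1: Sequence[str], supers2: Sequence[str], dcsim: DcSim) -> float:
    """Super similarity of two elements, given their lists of super elements."""
    if not supers1 and not supers2:
        return 1.0  # both elements are roots
    if not supers1 or not supers2:
        return 0.0  # exactly one element is a root
    if len(supers1) >= len(supers2):
        return row_max_average(supers1, supers2, dcsim)  # k >= t, equation (13)
    return row_max_average(supers2, supers1, dcsim)      # k < t, equation (14)


# Hypothetical DcSim scores, used only to make the example runnable.
SCORES: Dict[Tuple[str, str], float] = {
    ("Vehicle", "Car"): 0.8, ("Vehicle", "owner"): 0.1,
    ("power", "Car"): 0.2, ("power", "owner"): 0.3,
    ("registeredTo", "Car"): 0.1, ("registeredTo", "owner"): 0.9,
}


def toy_dcsim(a: str, b: str) -> float:
    """Symmetric lookup in the made-up score table; unknown pairs score 0."""
    return SCORES.get((a, b), SCORES.get((b, a), 0.0))


result = sp_sim(["Vehicle", "power", "registeredTo"], ["Car", "owner"], toy_dcsim)
print(round(result, 3))
```

With these made-up scores the row maxima are 0.8, 0.3, and 0.9, so the sketch averages them over k = 3 rows. Swapping the two argument lists gives the same value, because the longer list always supplies the rows.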
4.2.2. Children Similarity 
The children of an element C are the properties of C, all subclasses of C, and the corresponding properties of those subclasses. As in the super similarity computation, to calculate the children similarity of two elements C1 in RDFS1 and C2 in RDFS2, we collect all children of C1 and C2 and then compare the description similarity of each pair of children. Assuming that m and n are the numbers of children of C1 and C2, respectively, the children similarity (ChSim) between C1 and C2 can be presented as the following matrices (15) and (16):
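$$
ChSim(C_1, C_2) =
\begin{bmatrix}
DcSim(C_{11}, C_{21}) & \cdots & DcSim(C_{11}, C_{2n}) \\
\vdots & \ddots & \vdots \\
DcSim(C_{1m}, C_{21}) & \cdots & DcSim(C_{1m}, C_{2n})
\end{bmatrix}, \quad m \ge n \tag{15}
$$

$$
ChSim(C_2, C_1) =
\begin{bmatrix}
DcSim(C_{21}, C_{11}) & \cdots & DcSim(C_{21}, C_{1m}) \\
\vdots & \ddots & \vdots \\
DcSim(C_{2n}, C_{11}) & \cdots & DcSim(C_{2n}, C_{1m})
\end{bmatrix}, \quad m < n \tag{16}
$$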
where DcSim is the description similarity between each child element of element C1 and each child element of element C2. The children similarity of two elements C1 and C2 in matrices (15) and (16) is determined by the following equations (17) and (18), respectively:
$$
ChSim(C_1, C_2) = \frac{\sum_{i=1}^{m} \max_{j=1}^{n} DcSim(C_{1i}, C_{2j})}{m} \tag{17}
$$

$$
ChSim(C_2, C_1) = \frac{\sum_{i=1}^{n} \max_{j=1}^{m} DcSim(C_{2i}, C_{1j})}{n} \tag{18}
$$
If one of the elements C1 and C2 is a leaf node (that is, it contains no child node), their children similarity is 0.
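As a worked illustration of equation (17), suppose C1 has m = 3 children and C2 has n = 2 children. The DcSim values below are hypothetical, chosen only to show the arithmetic, and are not taken from the experiments. Taking the maximum of each row and averaging over the m rows gives:

$$
\begin{bmatrix}
0.9 & 0.2 \\
0.4 & 0.6 \\
0.1 & 0.3
\end{bmatrix}
\;\Rightarrow\;
ChSim(C_1, C_2) = \frac{0.9 + 0.6 + 0.3}{3} = 0.6
$$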
Depending on the expected similarity value (threshold value), the semantic similarity between two elements C1 and C2 (R2Sim) can be divided into two groups, high similarity and low similarity, and the matching and integrating strategies for those elements are then applied accordingly. In this paper, we set the threshold value to 0.7. Therefore, if the value of R2Sim is greater than or equal to 0.7, the two elements are highly similar.
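The grouping rule above amounts to a single comparison against the threshold. A minimal sketch follows, assuming the R2Sim values have already been computed elsewhere; the function name `similarity_group` and the sample pairs and scores are hypothetical.

```python
from typing import Dict, Tuple

THRESHOLD = 0.7  # threshold value assigned in the paper


def similarity_group(r2sim: float, threshold: float = THRESHOLD) -> str:
    """Return "high" when R2Sim reaches the threshold, otherwise "low"."""
    return "high" if r2sim >= threshold else "low"


# Made-up R2Sim values for three element pairs, for illustration only.
sample_scores: Dict[Tuple[str, str], float] = {
    ("SportCar", "RacingCar"): 0.82,
    ("Vehicle", "Transport"): 0.70,
    ("power", "owner"): 0.35,
}

groups = {pair: similarity_group(score) for pair, score in sample_scores.items()}
for pair, group in groups.items():
    print(pair, group)
```

A value exactly equal to the threshold falls into the high-similarity group, matching the "greater than or equal to" wording.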
V. EXPERIMENTAL EVALUATION 
We perform experiments to answer two questions. 
1. How much advantage does the R2Sim provide, compared to other approaches? 
2. How effective is each similarity factor in measuring the semantic similarity between elements in different 
RDF Schemas? 
To answer these questions, we select the data set and set up the implementation as follows. 
5.1. Data Set and Setup 
The semantic similarity between elements in different RDF Schemas (R2Sim) is implemented in C#. To compute the name similarity (NSim) in the description measurement, we integrate WordNet and its .NET API, provided by Simpson et al. [21], into our implementation. We evaluate the proposed measures by matching two RDF Schemas to determine the number of matches between them, and then compare the results with other approaches. The criteria for evaluating the quality of a matching system are precision and recall, which originate from information retrieval [22] and were adapted to ontology matching [18].
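The evaluation criteria can be sketched as follows. This is an illustrative sketch rather than the evaluation code used in the experiments: the function name `precision_recall_f` and the sample match sets are hypothetical, and the F-measure is assumed to be the balanced harmonic mean of precision and recall, since its exact definition is not given here.

```python
from typing import Set, Tuple

Match = Tuple[str, str]


def precision_recall_f(found: Set[Match], reference: Set[Match]) -> Tuple[float, float, float]:
    """Precision, recall, and balanced F-measure of a set of found matches."""
    correct = len(found & reference)  # matches that are both found and expected
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Made-up matches: what a matcher returned versus what a human expert expects.
found_matches = {("Car", "Automobile"), ("power", "engine"), ("owner", "title")}
reference_matches = {("Car", "Automobile"), ("power", "engine"), ("registeredTo", "owner")}

p, r, f = precision_recall_f(found_matches, reference_matches)
print(round(p, 3), round(r, 3), round(f, 3))
```

In this example two of the three returned matches are correct and two of the three expected matches are found, so precision and recall coincide.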
To examine the performance of R2Sim, we download about 20 RDF Schemas from [20] as source schemas and then modify each source schema to generate a corresponding destination schema. This paper presents the test results for five RDFS sources from [20] and their modified schemas. The characteristics of the five RDF Schemas are presented in Table 2.
Table 2. The characteristics of the tested schemas
#   Schema name      File size (KB)   # classes (source/destination)   # properties (source/destination)
1   GeneOntology     4642             7853/7000                        40107/40200
2   RealEstateData   2311             2925/3050                        19687/15600
3   ACM-Computing    231              312/312                          1146/1200
4   MovieDatabase    86               96/80                            379/300
5   Educational      8                17/30                            26/40
In Table 2, the destination schema of GeneOntology is obtained by increasing the number of properties and decreasing the number of classes of the source schema. In contrast, we decrease the number of properties and increase the number of classes of RealEstateData. For ACM-Computing, we keep the same number of classes and increase the number of properties. For MovieDatabase, we decrease both numbers, whereas we increase both in the Educational schema. The results are presented in the next section.
5.2. Experiment Results 
Since our approach focuses on the similarity between RDF Schema elements, we compare our method to similar works, namely Leme et al. [17], Do et al. [18], and Algergawy et al. [12]. The precision, recall, and F-measure values of R2Sim and the related work are presented in Fig. 5, Fig. 6, and Fig. 7. Note that in this paper the threshold values are chosen between 0.3 and 1, since element pairs with similarity values lower than 0.3 are mostly different and are easy for a human observer to identify.
The comparison results in Fig. 5, 6, and 7 show that R2Sim achieves the best overall results, followed by the methods of Algergawy, Leme, and Do, although Algergawy's method outperforms R2Sim when the threshold is 0.5 or lower. The main reason for this is that the data type similarity values of Algergawy's method are very high and are based on the user's judgment. For high threshold values, however, Algergawy's method yields less accurate similarity values. The measures of Do and Leme give poor results because they are based simply on the string similarity of element names; Leme's method is better than Do's because it also considers the data type similarity.
Further, in order to determine the most important factor that affects the similarity values, we separate five 
similarity factors (NSim, DfSim, DtSim, SpSim, and ChSim) and compare with the whole combination of them (R2Sim). 
The result is presented in Fig. 8. 
The columns in Fig. 8 show that DtSim has the lowest measure quality. Its F-measure is only 62%, compared with 66% for NSim and DfSim and about 80% for SpSim and ChSim. The reason is that the data type similarity measure applies mostly to property elements, whereas the number of class elements in RDFS is very high, so DtSim alone cannot differentiate the semantic similarity of RDF elements. Among the five measuring factors, ChSim gives the highest value. However, regarding the best quality achieved, we observe that the combination of all similarity factors outperforms ChSim. Therefore, it is better to use multiple similarity measures than a single measure.
VI. CONCLUSIONS 
This paper proposes a novel similarity measuring technique for RDF elements. We present a semantic similarity measurement method that computes both description and neighborhood resemblances. The experimental evaluation demonstrates that our method outperforms human judgment and related approaches, and that it obtains its best results when processing complex RDF documents. Further, the combination of all measuring factors provides important information for deriving the correct similarity values.
We hope that this research has established a foundation to help the integration of different RDF Schemas. If this method is popularized, a large amount of RDF Schema data on the current Web will be integrated into a useful ontology for the Semantic Web and its applications. Our future research will focus on computing the similarity of RDF individuals based on the RDF Schema's relatedness.
Fig. 5. Precision among R2Sim and related approaches
Fig. 6. Recall among R2Sim and related approaches
Fig. 7. F-measure among R2Sim and related work
Fig. 8. Quality of R2Sim, NSim, DfSim, DtSim, SpSim, and ChSim
REFERENCES 
[1] Frank Manola, Eric Miller, W3C, 2004,  
[2] Doan A. H., Madhavan J., Domingos P., Ontologies Matching: A Machine Learning Approach, Handbook on Ontologies in Inf. Systems, Springer-Verlag, 2003.
[3] Ehrig M., Sure Y., Ontology Mapping – an integrated approach, 1st European Semantic Web Symposium, 2004.
[4] Oundhankar S., K. Verma, Sivashanugam K., Discovery of web services in a Multi-Ontologies and Federated Registry 
Environment, International Journal of Web Services Research, 1, 3, 2005. 
[5] Ronald M., Thomas H., Rene P., Enterprise Knowledge Infrastructures, 2nd edition, Springer, 2009. 
[6] D Vint Productions, XML Schema - Data Types Quick Reference,  2003. 
[7] E. Pyshkin, A. Kuznetsov, Approaches for Web Search User Interfaces: How to improve the search quality for various types of 
information, JoC, Vol.1, No.1, pp.1-8. 
[8] Roman Y Shtykh, Qun Jin, A human-centric integrated approach to web information search and sharing, HCIS 2011, 1:2 (22 
November 2011) 
[9] Vitaly Klyuev, Ai Yokoyama, Web Query Expansion: A Strategy Utilising Japanese WordNet, JoC, Vol.1, No.1, pp.23-28. 
[10] Princeton University, WordNet_ A lexical database for English,  
[11] R. Nayak, T. Tran, A progressive clustering algorithm to group the XML data by structural and semantic similarity, Pattern Recognition & Artificial Intelligence 21(4) (2007) 723-743.
[12] Alsayed Algergawy, Richi Nayak, Gunter Saake, Element similarity measures in XML schema matching, Journal of 
Information Sciences, pp. 4975-4998, 2010. 
[13] Sergey Melnik, Bridging the gap between RDF and XML, 1999,  
[14] M. Ferdinand, C. Zirpins, and D. Trastour, Lifting XML Schema to OWL, Web Engineering – 4th International Conference, 
ICWE, pp. 354-358, 2004. 
[15] Tim Berners-Lee, A strawman unstriped syntax for RDF in XML, W3C, March 2007. 
[16] Jonathan Boden, Simplified XML syntax for RDF,  
[17] Leme, L. A. P. P.; Casanova, M. A.; Breitman, K. K. & Furtado, A. L. Evaluation of similarity measures and heuristics for 
simple RDF schema matching. Technical Report 44/08, Dept. Informatics, PUC-Rio 14, 2008. 
[18] Hong-Hai Do, and Erhard Rahm, COMA - A System for Flexible Combination of Schema Matching Approaches, Proceedings 
of the Very Large Data Bases conference (VLDB), pp 610–621, 2002. 
[19] Dongqiang D. Yang and David M.W. Powers, Measuring Semantic Similarity in the Taxonomy of WordNet, The 28th 
Australasian Computer Science Conference (ACSC2005), Australia, pp. 315-322, 2005. 
[20] A. Maganaraki and L. Sidirourgos, RDF Schema Registry,  2004 
[21] Troy Simpson, Crowe M., WordNet.Net  2005 
[22] Wikipedia, Precision and recall,  
[23] Pham Thi Thu Thuy, Young-Koo Lee, and Sungyoung Lee, "R2Sim: A Novel Semantic Similarity Measure for Matching 
between RDF Schemas", The 2012 FTRA International Conference on Advanced IT, engineering and Management (FTRA 
AIM 2012), Seoul, Korea, February 6-8, 2012. 
[24] Roberto De Virgilio, Antonio Maccioni, Riccardo Torlone, “A similarity measure for approximate querying over RDF data”, 
EDBT’13 Proceedings of the Joint EDBT/ICDT 2013 Workshops, pp. 205-213, 2013. 
[25] Samur Araujo, Jan Hidders, Daniel Schwabe, Arjen P. de Vries, “SERIMI – Resource Description Similarity, RDF Instance 
Matching and Interlinking”, OM, volume 814 of CEUR Workshop Proceedings, CEUR-WS.org, 2011. 
[26] Marcelo Schiessl, Rita Berardi, and Marisa Brascher, “Similarity between text and RDF”, Information Services and Use, 34 
(2014), 325-330. 
[27] Mehwish Alam and Amedeo Napoli, “An Approach Towards Classifying and Navigating RDF data based on Pattern 
Structures”, Proceedings of the International Workshop on Formal Concept Analysis and Applications 2015 co-located with 
13th International Conference on Formal Concept Analysis., Jun 2015, Nerja, Spain. 1434, pp.33-48, 2015. 
AN IMPROVEMENT IN MEASURING THE SEMANTIC SIMILARITY BETWEEN RDF DOCUMENTS
Phạm Thị Thu Thúy, Nguyễn Đăng Tiến
ABSTRACT— RDF currently plays an important role in knowledge applications because it provides a set of terms that allow data to be described precisely. However, the growth of RDF creates a need to assess the similarity between related documents. This paper presents an improvement in comparing the semantic correlation between elements in RDF documents. The measurement formulas focus on the information described in RDF elements and on the relationships between parent and child elements. The proposed formulas are evaluated by mapping RDF documents to one another to determine the number of correspondences, and the results are compared with users' objective judgment. The experimental results show that our method yields more accurate similarity values than related methods.
Keywords— Similarity, RDF documents, measurement.