On rectifying the mapping between articles and institutions in bibliometric databases

Abstract: Today, bibliometric databases are indispensable sources for researchers and research institutions. The main role of these databases is to find research articles and estimate the performance of researchers and institutions. When evaluating the research performance of an organization, accurately determining the institutions of the authors of articles is decisive. However, popular bibliometric databases such as Scopus and Web of Science have not addressed this point adequately. To this end, we propose an approach to revise the authors’ affiliation information of articles in bibliometric databases. We build a model that classifies articles to institutions with high accuracy by combining bag-of-words and n-gram techniques to extract features from affiliation strings. These features are then weighted to determine their importance to each institution. Affiliation strings of articles are transformed into the new feature space by integrating the weights of features and the local characteristics of the words and phrases that make up the strings. Finally, a support vector classifier is trained on this feature space to learn a predictive model. Our experimental results show that the proposed model’s accuracy is about 99.1%.


Summary of the document content: On rectifying the mapping between articles and institutions in bibliometric databases

 
University, Hanoi”, the 2-gram phrases are “Vietnam National” and “National University”. The phrase “University, Hanoi” is considered meaningless and is ignored.
Figure 3. An example of the preprocessing steps. 
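The phrase extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_ngrams`, the punctuation set, and the lower-casing are assumptions, and the example string "Vietnam National University, Hanoi" is inferred from the phrases quoted in the text. The idea is to split on punctuation first, so that no n-gram spans two segments.

```python
import re

def extract_ngrams(affiliation, max_n=2):
    """Return lower-cased 1..max_n-grams that do not cross punctuation."""
    features = []
    # Split on punctuation first so that no phrase spans two segments
    # (the punctuation set here is an assumption).
    for segment in re.split(r"[,;:()]", affiliation.lower()):
        words = segment.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                features.append(" ".join(words[i:i + n]))
    return features

features = extract_ngrams("Vietnam National University, Hanoi")
print(features)
```

With this splitting rule, "vietnam national" and "national university" are produced, while a phrase bridging the comma, such as "university hanoi", is never generated.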
When transforming affiliation strings into the new feature space, we try to capture both local and global characteristics. The local characteristic of an affiliation string s estimates how much each extracted word or phrase contributes to s. The global characteristic, in turn, measures the contribution (importance) of each extracted word or phrase to an institution within the set of institutions.
N.K. Tuan et al. / VNU Journal of Science: Comp. Science & Com. Eng, Vol. 36, No. 2 (2020) 12-21
The local characteristic is quantified by the frequency with which a word or phrase appears in an affiliation string. We assume that the more frequently a word or phrase occurs, the more important it is to the institution. The local characteristic is determined by IF:
$$\mathrm{IF}(t, s) = \mathrm{freq}(t, s) \tag{1}$$
where t is a feature representing a word or a phrase and freq(t, s) is the frequency of t in s.
The global characteristic is evaluated by the inverse institution frequency (IIF) of the word or phrase. We treat each institution as the set of words and phrases retrieved in the preceding feature extraction step; the IIF then shows how common a word or phrase is across all institutions.
Table 1. Examples of IF-IIF of words and phrases
Institution                   Written affiliation                     Top words or phrases       IF-IIF
Vietnam Natl. Univ. Hanoi     Department of Electronics and           university of engineering  0.357
                              Telecommunications, VNU University of   vnu university             0.320
                              Engineering and Technology, Viet Nam    vnu                        0.294
Ton Duc Thang Univ.           Faculty of Applied Sciences, Ton Duc    duc thang university       0.270
                              Thang University, Tan Phong Ward,       ton duc thang              0.242
                              District 7, Ho Chi Minh City, Viet Nam  tan phong ward             0.222
Vietnam Aca. of Sci. & Tech.  Institute of Biotechnology, VAST, 18,   vast                       0.346
                              Hoang Quoc Viet Road, Cau Giay,         18                         0.285
                              Hanoi, Viet Nam                         quoc viet road             0.265
This metric is calculated by dividing the total number of institutions by the number of institutions that contain the word or phrase. The closer the value is to 1.0, the more common the word or phrase is. The global characteristic is formulated as follows.
$$\mathrm{IIF}(t) = \frac{|C|}{|C_t|} \tag{2}$$
where C denotes the set of institutions and C_t is the set of institutions containing t.
An affiliation string is thus represented by a feature vector of weighted values that capture both the local and the global characteristics of the words and phrases decomposed from the original string. These feature values are obtained as follows.
$$\text{IF-IIF}(t, s) = \mathrm{IF}(t, s) \times \mathrm{IIF}(t) \tag{3}$$
Table 1 shows the words or phrases with the highest IF-IIF for three institutions: Vietnam National University, Hanoi; Vietnam Academy of Science and Technology; and Ton Duc Thang University. The results show that the important words or phrases of the affiliation strings have high IF-IIF values. These words or phrases therefore represent the corresponding institution well, and the classifier can use them to predict accurately.
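The weighting can be illustrated with a short sketch. It rests on stated assumptions rather than on the authors' code: IF is taken as the raw count of a feature in the string, IIF as the number of institutions divided by the number of institutions containing the feature, and IF-IIF as their product. The function name `if_iif` and the toy vocabularies are hypothetical. Because the source does not specify any normalization, the values produced here are not expected to match Table 1.

```python
from collections import Counter

def if_iif(features_of_s, institution_vocab):
    """Weight each feature t of one affiliation string s by IF(t, s) * IIF(t).

    features_of_s: list of words/phrases extracted from s.
    institution_vocab: dict mapping institution name -> set of features.
    """
    n_institutions = len(institution_vocab)
    freq = Counter(features_of_s)  # IF(t, s): raw count of t in s (assumed)
    weights = {}
    for t, count in freq.items():
        # |C_t|: number of institutions whose feature set contains t
        containing = sum(1 for vocab in institution_vocab.values() if t in vocab)
        if containing == 0:
            continue  # feature unseen in every institution; skipped in this sketch
        weights[t] = count * (n_institutions / containing)
    return weights

# Toy vocabularies (hypothetical, not taken from the paper's data)
vocab = {
    "Vietnam Natl. Univ. Hanoi": {"vnu", "university", "hanoi"},
    "Vietnam Aca. of Sci. & Tech.": {"vast", "institute", "hanoi"},
    "Ton Duc Thang Univ.": {"ton duc thang", "university"},
}
w = if_iif(["vnu", "university", "hanoi"], vocab)
print(w)
```

In this toy setting, "vnu" occurs in only one institution's vocabulary and therefore receives a larger weight than "university" or "hanoi", which are shared by two institutions.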
2.3. An SVM Model for Affiliation String Classification
To learn a predictive model, we use the Support Vector Classifier (SVC) [16]. In addition, the Radial Basis Function (RBF) kernel is used to map the data to a higher-dimensional space before learning the classifier f_k of class k.
$$f_k(x) = w_k \ast \Phi(x, x') + b_k \tag{4}$$
where w_k is the weight vector, b_k is the bias term, and Φ(x, x') is the RBF function, defined as follows.
$$\Phi(x, x') = \exp\left(-\gamma \, \lVert x - x' \rVert^2\right) \tag{5}$$
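As a small numerical illustration of the RBF function, the sketch below evaluates exp(−γ‖x − x'‖²) for two pairs of vectors. The function name `rbf_kernel` and the example vectors are hypothetical; γ = 0.01 is the value the authors report selecting in Section 3.

```python
import math

def rbf_kernel(x, x_prime, gamma):
    """RBF kernel exp(-gamma * ||x - x'||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_distance)

# Identical vectors give the maximum similarity of 1.0;
# the value decreases towards 0 as the vectors move apart.
same = rbf_kernel([0.3, 0.0, 0.5], [0.3, 0.0, 0.5], gamma=0.01)
far = rbf_kernel([0.3, 0.0, 0.5], [0.0, 0.9, 0.0], gamma=0.01)
print(same, far)
```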
The training step optimises a convex cost function. The probability that an affiliation string x is classified to an institution k is formulated as follows.
$$p(k \mid x) = \frac{1}{1 + \exp\left(A f_k(x) + B\right)} \tag{6}$$
where A and B are estimated by minimizing the negative log-likelihood of the training data (using their labels and decision values).
The approach has several benefits. First, the model depends only on the most informative patterns (the support vectors). Second, the learning process is not complicated because there are no false local minima.
After learning the model using SVC with the RBF kernel, we set a heuristic threshold of 0.6 for classifying affiliation strings to institutions. In equation (6), x is classified as k only if p(k|x) ≥ 0.6; otherwise, the label k is rejected.
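The rejection rule can be sketched as below. The sketch assumes the common Platt-style sigmoid, 1 / (1 + exp(A·f + B)), for turning a decision value into a probability, which matches the description of A and B in the text. The function names, the values of A and B, and the choice to return `None` on rejection are illustrative assumptions; the source does not say what happens to a rejected string.

```python
import math

THRESHOLD = 0.6  # heuristic threshold stated in the text

def platt_probability(decision_value, a, b):
    """Map an SVM decision value to a probability with a fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

def classify_with_reject(probabilities, threshold=THRESHOLD):
    """Return the most probable label, or None if it is below the threshold."""
    label = max(probabilities, key=probabilities.get)
    return label if probabilities[label] >= threshold else None

# Hypothetical class probabilities for two affiliation strings
confident = {"VAST": 0.91, "VNU Hanoi": 0.06, "OTHER": 0.03}
ambiguous = {"VAST": 0.45, "VNU Hanoi": 0.40, "OTHER": 0.15}
print(classify_with_reject(confident))
print(classify_with_reject(ambiguous))
```

The first string is accepted as "VAST" because its top probability exceeds 0.6; the second is rejected because no label reaches the threshold.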
Figure 4. The number of affiliation strings of each institution.
3. Experimental Evaluation 
This section presents the experimental results of our method on a data set of affiliations collected from Scopus. To build the data set, we first obtained the metadata of articles published in 2016 and 2017 that belong to at least one Vietnamese institution. We then extracted the affiliation strings of Vietnamese institutions. The data set consists of 12704 affiliation strings labeled with 36 classes: 35 classes represent 35 predetermined institutions and one class (OTHER) covers all other institutions. Figure 4 shows the distribution of affiliation strings across institutions. It can be seen that the data set is unbalanced.
The data set of affiliations is preprocessed by the steps described above. Features represented by Bag of Words and 1-3 grams are weighted using the IF-IIF function. The feature space has 24383 dimensions. The data set is then split into a training set and a testing set at an 80/20 ratio, with 10163 and 2541 affiliation strings, respectively. In the training step, 5-fold cross-validation is used to fit the model. In addition, we tuned the hyper-parameters of the SVC model with 4 different kernels: Linear, Polynomial, Radial Basis Function (RBF) and Sigmoid. The parameter γ was varied from $10^{-5}$ to $10^{-2}$, while the parameter C, the penalty for misclassifying a data point, was varied from $10^{-3}$ to $10^{3}$. Finally, we decided on the SVC model with the RBF kernel, $10^{-2}$ for γ and $10^{2}$ for C.
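The search space described above can be written down explicitly. The sketch below only enumerates candidate settings. The ranges and the four kernels come from the text, while the power-of-ten step between the endpoints and the kernel identifiers ("linear", "poly", "rbf", "sigmoid", the names used by scikit-learn's `SVC`) are assumptions, since the source names neither the step nor the library.

```python
from itertools import product

# Ranges stated in the text; the power-of-ten step is an assumption.
gammas = [10.0 ** e for e in range(-5, -1)]    # 1e-5 ... 1e-2
penalties = [10.0 ** e for e in range(-3, 4)]  # 1e-3 ... 1e3
kernels = ["linear", "poly", "rbf", "sigmoid"]

# Dictionary form, one list of candidate values per hyper-parameter
param_grid = {"kernel": kernels, "gamma": gammas, "C": penalties}

# Explicit list of every combination in the grid
candidates = [
    {"kernel": k, "gamma": g, "C": c}
    for k, g, c in product(kernels, gammas, penalties)
]
print(len(candidates))  # 4 kernels * 4 gammas * 7 penalties

# Configuration the authors report selecting
chosen = {"kernel": "rbf", "gamma": 1e-2, "C": 1e2}
```

If scikit-learn is used, a dictionary shaped like `param_grid` is the form that `GridSearchCV` accepts, and `cv=5` would mirror the 5-fold cross-validation mentioned in the text.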
The testing data set is used to measure the performance of our model and of models based on other well-known classification methods, including Random Forest (RF) [17], Logistic Regression (LR) and K-Nearest Neighbor (KNN) [18]. The results are shown in Table 2.
Table 2. Accuracy of models
Model   Precision   Recall   Macro-F1
RF      0.6693      0.7665   0.7152
LR      0.9589      0.9595   0.9591
KNN     0.9601      0.9551   0.9575
SVM     0.9914      0.9913   0.9913
The experimental results show that our SVM model outperforms the other models. However, the distribution in Figure 4 shows that the sample sizes of the classes are imbalanced; in particular, the number of "OTHER" samples is large compared to the rest, which may lead to an inaccurate evaluation. Accordingly, we further assessed the accuracy of each label separately instead of aggregating all labels. The F1-score of each label ranges between 0.96 and 1.00, and most labels reach the highest score of 1.00. These statistics indicate that no class lags behind in F1-score; all classes have values very close to the Macro-F1 measured over all classes. Besides, compared with the model proposed by Pascal Cuxac and his colleagues [19] (trained on their own data set), the Macro-F1 score of our model (0.99) is better than that of their model (0.93). The accuracy of our model is very high (approximately 1.0) in all three measures on the testing data set. This result prompts us to apply the model to a practical problem.
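The per-label check described above can be reproduced with a few lines of code. The sketch below computes the F1-score of each label and their unweighted mean (Macro-F1) on a small imbalanced toy example. The labels and predictions are hypothetical and unrelated to the paper's data, and the function names are illustrative.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one label at a time."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1-scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

# Toy imbalanced example: "OTHER" dominates, one "VAST" string is misclassified
y_true = ["OTHER"] * 6 + ["VAST"] * 2 + ["VNU"] * 2
y_pred = ["OTHER"] * 6 + ["VAST", "OTHER"] + ["VNU", "VNU"]
scores = per_class_f1(y_true, y_pred)
print(scores, macro_f1(y_true, y_pred))
```

Because every label contributes equally to the mean, a weak minority class (here "VAST") lowers Macro-F1 even when the dominant class is classified almost perfectly, which is why inspecting the per-label scores is informative on an unbalanced data set.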
We also applied our model to verify the mapping of articles to institutions in Scopus. From Scopus, we collected the metadata of all articles published by at least one Vietnamese institution during the period from 1/2014 to 6/2019. By classifying the affiliation strings of each article, we can check whether Scopus mapped the article to institutions correctly. The result is shown in Table 3. The first column lists the institutions. The second column is the number of articles published by the corresponding institution according to Scopus, obtained as the cardinality of Scopus’s article set. The third column is the number of articles of each institution according to the A2I tool, which is based on our approach. The fourth column is the number of articles that Scopus counts for the corresponding institution but our tool does not; it is the cardinality of the set difference between the Scopus and A2I sets. Conversely, the fifth column is the number of articles of the corresponding institution that were missed by Scopus but found by A2I. The number in parentheses is the result of manually checking each difference set: it is the number of articles in that set that are correctly assigned to the institution. For example, for the Vietnam Academy of Science and Technology, the number of articles recognized by Scopus is 3931. Our tool shows that this number should be 4519. The tool also indicates that 5 articles counted by Scopus do not actually belong to this institution. By checking manually (i.e. looking at the affiliation strings of the articles), we confirm that all 5 articles are wrongly counted by Scopus. Meanwhile, our tool found 593 more articles (in Scopus) that belong to the institution. The manual check shows that only 592 of these 593 actually belong to the institution; that is, our tool misassigned one article. Regarding Ton Duc Thang University, all 3955 papers indicated by Scopus actually belong to this university (i.e. there is no false positive). Our tool indicates that 40 articles were missed by Scopus.
Table 3. The number of articles of each institution
Institution                     |Scopus|   |A2I|   |Scopus \ A2I| (manual check)   |A2I \ Scopus| (manual check)
Ton Duc Thang Univ.             3955       3995    0 (0)                           40 (37)
Vietnam Aca. of Sci. & Tech.    3931       4519    5 (0)                           593 (592)
Vietnam Natl. Univ. Hanoi       2639       3132    599 (0)                         1092 (1092)
Hanoi Univ. of Sci. & Tech.     3052       2530    572 (0)                         50 (48)
Vietnam Natl. Univ. HCM         1839       4734    154 (0)                         3049 (3038)
Duy Tan Univ.                   1789       1789    2 (1)                           2 (2)
Hue Univ.                       624        923     1 (0)                           300 (295)
Hanoi Univ. of Edu.             744        774     1 (0)                           31 (31)
Can Tho Univ.                   964        941     55 (0)                          32 (26)
Univ. of Da Nang                790        868     19 (3)                          97 (96)
Although the correct number is 37 (obtained by manual check), our tool proves effective, especially in finding miscounted articles for Vietnam National University Hanoi and Vietnam National University HCM.
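The two difference columns of Table 3 are plain set differences over article identifiers. The sketch below shows the computation on toy identifiers (hypothetical, as is the function name `compare_mappings`) and checks the identity that ties the columns together, |A2I| = |Scopus| − |Scopus \ A2I| + |A2I \ Scopus|, against the counts reported for the Vietnam Academy of Science and Technology.

```python
def compare_mappings(scopus_ids, a2i_ids):
    """Compare two sets of article IDs attributed to the same institution."""
    scopus_only = scopus_ids - a2i_ids  # counted by Scopus, rejected by the tool
    a2i_only = a2i_ids - scopus_ids     # found by the tool, missing in Scopus
    return scopus_only, a2i_only

# Toy article IDs (hypothetical)
scopus = {"a1", "a2", "a3", "a4"}
a2i = {"a2", "a3", "a4", "a5", "a6"}
scopus_only, a2i_only = compare_mappings(scopus, a2i)
print(sorted(scopus_only), sorted(a2i_only))

# Identity linking the columns of Table 3
assert len(a2i) == len(scopus) - len(scopus_only) + len(a2i_only)

# The same identity holds for the reported counts of the
# Vietnam Academy of Science and Technology row.
assert 4519 == 3931 - 5 + 593
```

The articles in each difference set are the ones worth checking by hand, which is what the numbers in parentheses in Table 3 report.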
4. Conclusions
In this work, we study the problem that bibliometric databases such as Scopus and Web of Science have in identifying authors’ institutions. We propose a method for mapping affiliation strings (as written in papers) to authors’ institutions. Our method uses only basic techniques from NLP and machine learning. We evaluated the method on papers of Vietnamese institutions in Scopus. The experimental results show the effectiveness of our method. An implication of the results is that Scopus’s current approach to mapping papers to institutions needs improvement.
Acknowledgments 
This work has been supported by Vietnam 
National University, Hanoi (VNU) under 
project QG.18.62. 
References 
[1] S.B. Shereen Hanafi, Discover the data behind the Times Higher Education World University Rankings, Elsevier Connect.
[2] M. Dobrota, M. Bulajic, L. Bornmann, V. Jeremic, A new approach to the QS university ranking using the composite I-distance indicator: Uncertainty and sensitivity analyses, JASIST 67 (2016) 200-211.
[3] A.-P. Pavel, Global university rankings - a comparative analysis, Procedia Economics and Finance 26 (2015) 54-63. https://doi.org/10.1016/S2212-5671(15)00838-2.
[4] Web of Science databases, Clarivate Analytics.
[5] J.F. Burnham, Scopus database: a review, Biomedical Digital Libraries 3.
[6] F. Franceschini, D. Maisano, L. Mastrogiacomo, A novel approach for estimating the omitted-citation rate of bibliometric databases with an application to the field of bibliometrics, Journal of the American Society for Information Science and Technology 64 (2013) 2149-2156. https://doi.org/10.1002/asi.22898.
[7] F. Franceschini, D. Maisano, L. Mastrogiacomo, Scientific journal publishers and omitted citations in bibliometric databases: Any relationship?, Journal of Informetrics 8(3) (2014) 751-765. https://doi.org/10.1016/j.joi.2014.07.003.
[8] R. Buchanan, Accuracy of cited references: The role of citation databases, College & Research Libraries 67.
[9] J. Valderrama-Zurián, R. Aguilar-Moya, D. Melero-Fuentes, R. Aleixandre-Benavent, A systematic analysis of duplicate records in Scopus, Journal of Informetrics 9 (2015) 570-576. https://doi.org/10.1016/j.joi.2015.05.002.
[10] J. Zhu, G. Hu, W. Liu, DOI errors and possible solutions for Web of Science, Scientometrics 118(2) (2019) 709-718.
[11] S. Xu, L. Hao, X. An, D. Zhai, H. Pang, Types of DOI errors of cited references in Web of Science with a cleaning method, Scientometrics 120(3) (2019) 1427-1437. https://doi.org/10.1007/s11192-019-03162-4.
[12] E. Krauskopf, Missing documents in Scopus: the case of the journal Enfermeria Nefrologica, Scientometrics 119(1) (2019) 543-547. https://doi.org/10.1007/s11192-019-03040-z.
[13] W. Liu, G. Hu, L. Tang, Missing author address information in Web of Science - an explorative study, Journal of Informetrics 12(3) (2018) 985-997. https://doi.org/10.1016/j.joi.2018.07.008.
[14] E. Krauskopf, Standardization of the institutional address, Scientometrics 94(3) (2013) 1313-1315.
[15] E. Krauskopf, Call for caution in the use of bibliometric data, J. Assoc. Inf. Sci. Technol. 68(8) (2017) 2029-2032.
[16] M. Awad, R. Khanna, Support Vector Machines for Classification, Apress, Berkeley, CA, 2015, pp. 39-66.
[17] L. Breiman, Random forests, Machine Learning 45(1) (2001) 5-32. https://doi.org/10.1023/A:1010933404324.
[18] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theor. 13(1) (1967) 21-27.
[19] P. Cuxac, J.-C. Lamirel, V. Bonvallot, Efficient supervised and semi-supervised approaches for affiliations disambiguation, Scientometrics 97(1) (2013) 47-58.