On rectifying the mapping between articles and institutions in bibliometric databases
Abstract: Today, bibliometric databases are indispensable sources for researchers and research
institutions. The main role of these databases is to find research articles and estimate the
performance of researchers and institutions. Regarding the evaluation of the research performance
of an organization, the accuracy in determining institutions of authors of articles is decisive.
However, current popular bibliometric databases such as Scopus and Web of Science have not
addressed this point efficiently. To this end, we propose an approach to revise the authors’
affiliation information of articles in bibliometric databases. We build a model to classify articles to
institutions with high accuracy by assembling the bag of words and n-grams techniques for
extracting features of affiliation strings. After that, these features are weighted to determine their
importance to each institution. Affiliation strings of articles are transformed into the new feature
space by integrating weights of features and local characteristics of words and phrases contributing
to the sequences. Finally, on the feature space, the support vector classifier method is applied to
learn a predictive model. Our experimental result shows that the proposed model’s accuracy is
about 99.1%.
N.K. Tuan et al. / VNU Journal of Science: Comp. Science & Com. Eng, Vol. 36, No. 2 (2020) 12-21

University, Hanoi”, the 2-gram phrases are “Vietnam National” and “National University”. The phrase “University, Hanoi” is considered meaningless and is ignored.

Figure 3. An example of the preprocessing steps.

When transforming affiliation strings into the new feature space, we try to capture both local and global characteristics. The local characteristic of an affiliation string s estimates how much each extracted word or phrase contributes to s. The global characteristic captures the importance of each extracted word or phrase to an institution within the set of institutions.

The local characteristic is quantified by the frequency of the word or phrase in an affiliation string. The importance of a word or phrase is assumed to be proportional to its frequency: the higher the frequency, the more important the word or phrase is to the institution. The local characteristic is determined by IF in Eq. (1), where t is a feature representing a word or a phrase and freq(t, s) is the frequency of t in s.

The global characteristic is evaluated by the inverse institution frequency (IIF) of the word or phrase. We assume that each institution is a set of words and phrases retrieved in the prior feature extraction step; the characteristic shows how commonly a word or phrase appears across all institutions. This metric can be calculated by dividing the total number of institutions by the number of institutions that contain the word or phrase. The closer it is to 1.0, the more common the word is. The formulation for the global characteristic is given in Eq. (2), where C denotes the set of institutions and C_t is the set of institutions containing t.

An affiliation string is thus represented by a feature vector of weighted values that capture both the local and the global characteristics of the words and phrases decomposed from the original string. These feature values are obtained as in Eq. (3).

Table 1. Examples of IF-IIF of words and phrases

Institution: Vietnam Natl. Univ. Hanoi
Written affiliation: Department of Electronics and Telecommunications, VNU University of Engineering and Technology, Viet Nam
Top words or phrases (IF-IIF): university of engineering (0.357), vnu university (0.320), vnu (0.294)

Institution: Ton Duc Thang Univ.
Written affiliation: Faculty of Applied Sciences, Ton Duc Thang University, Tan Phong Ward, District 7, Ho Chi Minh City, Viet Nam
Top words or phrases (IF-IIF): duc thang university (0.270), ton duc thang (0.242), tan phong ward (0.222)

Institution: Vietnam Aca. of Sci. & Tech.
Written affiliation: Institute of Biotechnology, VAST, 18, Hoang Quoc Viet Road, Cau Giay, Hanoi, Viet Nam
Top words or phrases (IF-IIF): vast (0.346), 18 (0.285), quoc viet road (0.265)

Table 1 shows the words or phrases with the highest IF-IIF for three institutions: Vietnam National University in Hanoi, Ton Duc Thang University, and Vietnam Academy of Science and Technology. The results show that the important words or phrases of the affiliation strings have high IF-IIF values. These words or phrases can therefore represent the corresponding institution efficiently, and the classifier can use them to predict accurately.

2.3. An SVM Model for Affiliation String Classification

To learn a predictive model, we use the Support Vector Classifier (SVC) [16]. In addition, the Radial Basis Function (RBF) kernel is used to map the data to a higher-dimensional space before learning the classifier f_k of class k.

$f_k(x) = \mathbf{w}_k \ast \Phi(x, x') + b_k \quad (4)$

where $\mathbf{w}_k$ is the weight vector, $b_k$ is the bias term, and $\Phi(x, x')$ is the RBF function defined as follows.

$\Phi(x, x') = \exp(-\gamma \lVert x - x' \rVert^2) \quad (5)$

The training step optimises a convex cost function.
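As an illustration of the feature weighting in Section 2.2 and the kernel in Eq. (5), the following is a minimal sketch rather than the authors' implementation. It assumes that IF is the raw count freq(t, s), that IIF is the plain ratio |C| / |C_t| described in the text, that the two are multiplied, and that n-grams never cross a comma or semicolon. The function names and the two-institution toy data are hypothetical.

```python
import math
import re
from collections import Counter


def extract_terms(affiliation, n_max=3):
    """Return the 1..n_max-grams of an affiliation string.

    Assumption: n-grams are built inside comma/semicolon-separated
    segments only, so "University, Hanoi" is never produced.
    """
    terms = []
    for segment in re.split(r"[,;]", affiliation.lower()):
        tokens = segment.split()
        for n in range(1, n_max + 1):
            for i in range(len(tokens) - n + 1):
                terms.append(" ".join(tokens[i:i + n]))
    return terms


def if_iif_vector(affiliation, institution_terms):
    """Map an affiliation string to a sparse {term: weight} vector.

    Assumption: weight = freq(t, s) * |C| / |C_t|, where C_t is the set
    of institutions whose term set contains t.
    """
    freq = Counter(extract_terms(affiliation))
    n_institutions = len(institution_terms)
    vector = {}
    for term, count in freq.items():
        n_containing = sum(
            1 for terms in institution_terms.values() if term in terms
        )
        if n_containing:  # skip terms unseen in every institution
            vector[term] = count * (n_institutions / n_containing)
    return vector


def rbf_kernel(u, v, gamma=1e-2):
    """RBF kernel exp(-gamma * ||u - v||^2) on sparse dict vectors."""
    keys = set(u) | set(v)
    squared_distance = sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys)
    return math.exp(-gamma * squared_distance)


# Hypothetical toy data: each institution is the set of terms from one string.
institutions = {
    "VNU": set(extract_terms("Vietnam National University, Hanoi")),
    "TDTU": set(extract_terms("Ton Duc Thang University, Ho Chi Minh City")),
}
vnu_vector = if_iif_vector("Vietnam National University, Hanoi", institutions)
tdtu_vector = if_iif_vector(
    "Ton Duc Thang University, Ho Chi Minh City", institutions
)
similarity = rbf_kernel(vnu_vector, tdtu_vector)
```

In this toy setting the shared word "university" receives the lowest weight, while terms unique to one institution are weighted higher, which is the behaviour the IF-IIF description aims for.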
The probability that an affiliation string x is classified to an institution k is formulated as follows.

$p(k \mid x) = \frac{1}{1 + \exp(A f_k(x) + B)} \quad (6)$

where A and B are estimated by minimizing the negative log-likelihood of the training data (using their labels and decision values). The approach has two main benefits. First, the model depends only on the most informative patterns (the support vectors). Second, the learning process is not complicated because there are no false local minima. After learning the model using SVC with the RBF kernel, we set a heuristic threshold of 0.6 for classifying affiliation strings to institutions. In Eq. (6), x is classified as k only if p(k|x) ≥ 0.6; otherwise the label k is rejected.

Figure 4. The number of affiliation strings of each institution.

3. Experimental Evaluation

This section presents the experimental result of our method on a data set of affiliations collected from Scopus. We first obtained the metadata of articles published in 2016 and 2017 that belong to at least one Vietnamese institution. We then extracted the affiliation strings of Vietnamese institutions. The data set consists of 12704 affiliation strings labeled with 36 classes: 35 classes represent 35 predetermined institutions and one class (OTHER) covers all other institutions. Figure 4 shows the distribution of affiliation strings across institutions. It can be seen that the data set is unbalanced.

The data set of affiliations is preprocessed by the steps described above. Features represented by Bag of Words and 1-3 grams are weighted using the IF-IIF function. The feature space has 24383 dimensions. The data set is then split into a training set and a testing set at an 80/20 ratio, with 10163 and 2541 affiliation strings, respectively. In the training step, 5-fold cross-validation is used to obtain a well-fitted model.
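The probability calibration in Eq. (6) and the 0.6 rejection rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the sigmoid form is the standard Platt scaling that matches the description of A and B, the values of A and B below are made up, and returning None for a rejected string is a hypothetical choice because the text does not say what happens to rejected labels.

```python
import math


def platt_probability(decision_value, a, b):
    """Platt scaling: p = 1 / (1 + exp(a * f + b)).

    a and b would normally be fitted on training data; with a < 0,
    larger decision values give larger probabilities.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def classify_with_reject(probabilities, threshold=0.6):
    """Return the most probable institution, or None if it is below threshold.

    probabilities: dict mapping institution label -> p(k | x).
    """
    label, probability = max(probabilities.items(), key=lambda item: item[1])
    return label if probability >= threshold else None


# Hypothetical decision values and calibration parameters.
confident = classify_with_reject({"VNU": 0.70, "OTHER": 0.30})
uncertain = classify_with_reject({"VNU": 0.55, "OTHER": 0.45})
midpoint = platt_probability(0.0, a=-1.0, b=0.0)
```

A threshold above 0.5 trades coverage for precision: a string whose best label is only weakly supported is left unassigned instead of being counted for an institution.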
In addition, we tuned the hyper-parameters of the SVC model with four different kernels: Linear, Polynomial, Radial Basis Function (RBF) and Sigmoid. The parameter γ was varied from $10^{-5}$ to $10^{-2}$, while the parameter C, the penalty for misclassifying a data point, was varied from $10^{-3}$ to $10^{3}$. Finally, we selected the SVC model with the RBF kernel, $\gamma = 10^{-2}$ and $C = 10^{2}$.

The testing data set is used to measure the performance of our model and of models based on other well-known classification methods, including Random Forest (RF) [17], Logistic Regression (LR) and K-Nearest Neighbor (KNN) [18]. The results are given in Table 2.

Table 2. Accuracy of models

Model  Precision  Recall  Macro-F1
RF     0.6693     0.7665  0.7152
LR     0.9589     0.9595  0.9591
KNN    0.9601     0.9551  0.9575
SVM    0.9914     0.9913  0.9913

The experimental result shows that our SVM model outperforms the other models. However, the distribution in Figure 4 shows that the class sizes are imbalanced; in particular, the number of "OTHER" samples is large compared to the rest, which may lead to an inaccurate evaluation. Accordingly, we further assessed the accuracy of each label separately instead of aggregating them. The F1-score of each label ranges between 0.96 and 1.00, and most labels reach the highest score of 1.00. Hence no class lags behind in F1-score; all of them are very close to the overall Macro-F1. Besides, compared to the model proposed by Pascal Cuxac and his colleagues [19] (trained on their own data set), the Macro-F1 score of our model (0.99) is better than that of their model (0.93). The accuracy of our model is very high (approximately 1.0) on all three measures on the testing data set.
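The per-label and macro-averaged scores discussed above can be computed as in the following sketch. It assumes that the macro scores are unweighted means of the per-class precision, recall and F1, which the text does not state explicitly. The labels and predictions are made-up toy data, not results from the paper.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, predicted in zip(y_true, y_pred):
        if truth == predicted:
            tp[truth] += 1
        else:
            fp[predicted] += 1  # predicted label gained a wrong sample
            fn[truth] += 1      # true label lost a sample
    scores = {}
    for label in set(y_true) | set(y_pred):
        predicted_total = tp[label] + fp[label]
        actual_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / actual_total if actual_total else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of (precision, recall, f1) over all classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy example with two hypothetical labels.
y_true = ["VNU", "VNU", "OTHER", "OTHER"]
y_pred = ["VNU", "OTHER", "OTHER", "OTHER"]
scores = per_class_scores(y_true, y_pred)
macro_precision, macro_recall, macro_f1 = macro_average(scores)
```

Because every class contributes equally to a macro average, a large class such as OTHER cannot mask a poorly classified small class, which is why inspecting the per-label scores is a useful check on an unbalanced data set.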
This result prompts us to apply the model to a practical problem: verifying the mapping of articles to institutions in Scopus. From Scopus, we collected the metadata of all articles published by at least one Vietnamese institution from 1/2014 to 6/2019. By classifying the affiliation strings of each article, we can check whether Scopus mapped them to institutions correctly. The result is shown in Table 3. The first column lists the institutions. The second column is the number of articles published by the corresponding institution, obtained as the cardinality of Scopus’s article set. The third column is the number of articles of each institution according to the A2I tool, which is based on our approach. The fourth column is the number of articles that Scopus counts for the corresponding institution but our tool does not, calculated as the cardinality of the set difference between the Scopus and A2I sets. Conversely, the fifth column is the number of articles of the corresponding institution that were missed by Scopus but found by A2I. The number in parentheses is the result of manually checking each difference set: it is the number of articles in that set that are correctly assigned to the institution.

For example, for the Vietnam Academy of Science and Technology, the number of articles recognized by Scopus is 3931. Our tool shows that this number should be 4519. The tool also indicates that 5 articles do not actually belong to this institution but are still counted by Scopus. By checking manually (i.e. looking at the affiliation strings of the articles), we confirm that all 5 articles are wrongly counted by Scopus. Meanwhile, our tool found 593 more articles (in Scopus) that belong to the institution. The manual check shows that only 592 of these 593 actually belong to the institution, so our tool misclassified one article.
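The fourth and fifth columns described above are plain set differences, which can be sketched as follows. The article identifiers and the function name are hypothetical; only the four counts for the Vietnam Academy of Science and Technology in the final check come from the text.

```python
def compare_mappings(scopus_ids, a2i_ids):
    """Return (Scopus minus A2I, A2I minus Scopus) for one institution.

    scopus_ids: set of article IDs that Scopus maps to the institution.
    a2i_ids:    set of article IDs that the classifier maps to it.
    """
    only_scopus = scopus_ids - a2i_ids  # counted by Scopus, rejected by the tool
    only_a2i = a2i_ids - scopus_ids     # found by the tool, missed by Scopus
    return only_scopus, only_a2i


# Hypothetical article identifiers for a single institution.
scopus_ids = {"a1", "a2", "a3", "a4"}
a2i_ids = {"a2", "a3", "a4", "a5", "a6"}
only_scopus, only_a2i = compare_mappings(scopus_ids, a2i_ids)

# The sizes always satisfy |A2I| = |Scopus| - |Scopus \ A2I| + |A2I \ Scopus|.
sizes_consistent = (
    len(a2i_ids) == len(scopus_ids) - len(only_scopus) + len(only_a2i)
)

# The same identity holds for the counts quoted in the text: 3931, 5, 593, 4519.
vast_consistent = 3931 - 5 + 593 == 4519
```

The identity gives a quick sanity check on each row of the comparison: if the four counts for an institution do not satisfy it, one of them was transcribed or computed incorrectly.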
Regarding Ton Duc Thang University, all 3955 papers indicated by Scopus actually belong to this university (i.e. there is no false positive). Our tool suggests that 40 further articles were missed by Scopus. Although the correct number is 37 (obtained by manual check), our tool shows its effectiveness, especially in finding missed articles for Vietnam National University Hanoi and Vietnam National University HCM.

Table 3. The number of affiliation strings of each institution

Institution                   |Scopus|  |A2I|  |Scopus \ A2I| (manual check)  |A2I \ Scopus| (manual check)
Ton Duc Thang Univ.           3955      3995   0 (0)                          40 (37)
Vietnam Aca. of Sci. & Tech.  3931      4519   5 (0)                          593 (592)
Vietnam Natl. Univ. Hanoi     2639      3132   599 (0)                        1092 (1092)
Hanoi Univ. of Sci. & Tech.   3052      2530   572 (0)                        50 (48)
Vietnam Natl. Univ. HCM       1839      4734   154 (0)                        3049 (3038)
Duy Tan Univ.                 1789      1789   2 (1)                          2 (2)
Hue Univ.                     624       923    1 (0)                          300 (295)
Hanoi Univ. of Edu.           744       774    1 (0)                          31 (31)
Can Tho Univ.                 964       941    55 (0)                         32 (26)
Univ. of Da Nang              790       868    19 (3)                         97 (96)

4. Conclusions

In this work, we study how bibliometric databases such as Scopus and Web of Science identify authors’ institutions. We propose a method for mapping affiliation strings (as written in papers) to authors’ institutions. Our method uses only basic techniques from NLP and machine learning. We evaluated the method on papers of Vietnamese institutions in Scopus. The experimental result shows the effectiveness of our method. An implication of the result is that Scopus's current approach to mapping papers to institutions needs improvement.

Acknowledgments

This work has been supported by Vietnam National University, Hanoi (VNU) under project QG.18.62.

References

[1] S.B. Shereen Hanafi, Discover the data behind the times higher education world university rankings, Elsevier Connect.
[2] M. Dobrota, M. Bulajic, L. Bornmann, V. Jeremic, A new approach to the qs university ranking using the composite i-distance indicator: Uncertainty and sensitivity analyses, JASIST 67 (2016) 200-211.
[3] A.-P. Pavel, Global university rankings - a comparative analysis, Procedia Economics and Finance 26 (2015) 54-63. https://doi.org/10.1016/S2212-5671(15)00838-2.
[4] Web of science databases, Clarivate Analytics.
[5] J.F. Burnham, Scopus database: a review, Biomedical Digital Libraries 3.
[6] F. Franceschini, D. Maisano, L. Mastrogiacomo, A novel approach for estimating the omitted-citation rate of bibliometric databases with an application to the field of bibliometrics, Journal of the American Society for Information Science and Technology 64 (2013) 2149-2156. https://doi.org/10.1002/asi.22898.
[7] F. Franceschini, D. Maisano, L. Mastrogiacomo, Scientific journal publishers and omitted citations in bibliometric databases: Any relationship?, Journal of Informetrics 8(3) (2014) 751-765. https://doi.org/10.1016/j.joi.2014.07.003.
[8] R. Buchanan, Accuracy of cited references: The role of citation databases, College Research Libraries 67.
[9] J. Valderrama-Zurián, R. Aguilar-Moya, D. Melero-Fuentes, R. Aleixandre-Benavent, A systematic analysis of duplicate records in scopus, Journal of Informetrics 9 (2015) 570-576. https://doi.org/10.1016/j.joi.2015.05.002.
[10] J. Zhu, G. Hu, W. Liu, Doi errors and possible solutions for web of science, Scientometrics 118(2) (2019) 709-718.
[11] S. Xu, L. Hao, X. An, D. Zhai, H. Pang, Types of doi errors of cited references in web of science with a cleaning method, Scientometrics 120(3) (2019) 1427-1437. https://doi.org/10.1007/s11192-019-03162-4.
[12] E. Krauskopf, Missing documents in scopus: the case of the journal enfermeria nefrologica, Scientometrics 119(1) (2019) 543-547. https://doi.org/10.1007/s11192-019-03040-z.
[13] W. Liu, G. Hu, L. Tang, Missing author address information in web of science-an explorative study, Journal of Informetrics 12(3) (2018) 985-997. https://doi.org/10.1016/j.joi.2018.07.008.
[14] E. Krauskopf, Standardization of the institutional address, Scientometrics 94(3) (2013) 1313-1315.
[15] E. Krauskopf, Call for caution in the use of bibliometric data, J. Assoc. Inf. Sci. Technol. 68(8) (2017) 2029-2032.
[16] M. Awad, R. Khanna, Support Vector Machines for Classification, Apress, Berkeley, CA, 2015, pp. 39-66.
[17] L. Breiman, Random forests, Machine Learning 45(1) (2001) 5-32. https://doi.org/10.1023/A:1010933404324.
[18] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theor. 13(1) (2006) 21-27.
[19] L.J.-C.B. Cuxac, P., Efficient supervised and semi-supervised approaches for affiliations disambiguation, Scientometrics 97(1) (2013) 47-58.