Optimizing Confusion of Authors’ Names in Persian Articles Using Random Forest Algorithm

Document Type: Research Paper

Authors

1 Assistant Professor, Design and System Operations Department, Regional Information Center for Science and Technology, Shiraz, Iran.

2 Faculty Member, Evaluation and Resource Development Department, Regional Information Center for Science and Technology, and PhD Student in Library and Information Science, Shiraz, Iran.

Abstract

Purpose:
An author's name is a key factor in distinguishing one author from another. In academic databases that store information on papers, searching by author name is one of the most important means of increasing visibility, and it underpins quantitative scientometric studies, including citation counts. Variation in how names are written creates challenges across scientific fields. In Persian, the lack of writing standards, the lack of standard keyboards and character codes, and the habit of simplified writing all contribute to author name ambiguity. Spelling mistakes made by writers also produce different written forms of a single name. Given the importance of resolving this confusion, this paper proposes a framework to address the confusion and dispersion of authors' names in Persian articles, which has led to fragmented and incomplete information retrieval.
Methodology: This is an applied scientometric study carried out by the documentary method, with data collected from the ISC database. The initial statistical population is 913 records from the period 2015 to 2017. The proposed framework consists of three stages: searching, matching, and grouping. After initial pre-processing and feature extraction, the search stage finds records that are potentially identical. The method extracts two types of features, internal and external. Internal features come from the author's own information: first name, last name, affiliation, email, and co-authors. External features come from the author's scientific history, such as articles and research interests. For the search stage, we propose a new method called Farsi-Soundex, inspired by the well-known Soundex algorithm, to group names that potentially refer to the same person. The input of the matching stage is therefore a group of records that the Farsi-Soundex algorithm has flagged as potentially the same. A random forest classifier is then applied to these records to decide whether they are in fact the same. Finally, in the grouping stage, all records that the random forest has identified as the same are placed in one group by a hash-based algorithm.
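The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the Farsi-Soundex code table, so the sketch uses the standard American Soundex on romanized last names as a stand-in; the matching rule (`is_match`, equal email) is a hypothetical placeholder for the trained random forest; and the record fields and sample data are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

# Standard American Soundex digit classes (a stand-in, NOT the paper's Farsi-Soundex table).
_CODES = {c: d for d, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
}.items() for c in letters}

def soundex(name: str) -> str:
    """Return the 4-character Soundex key of a romanized name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        d = _CODES.get(c, "")
        if d and d != prev:
            code += d
        prev = d  # a vowel resets prev, so a repeated code after it is kept
    return (code + "000")[:4]

def block(records):
    """Search stage: bucket records whose last names share a phonetic key."""
    blocks = defaultdict(list)
    for r in records:
        blocks[soundex(r["last"])].append(r)
    return blocks

def is_match(a, b) -> bool:
    """Matching stage placeholder (assumption): the paper uses a random forest here."""
    return a["email"].lower() == b["email"].lower()

def group(records):
    """Grouping stage: merge matched pairs with a dictionary-based union-find."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for candidates in block(records).values():
        for a, b in combinations(candidates, 2):
            if is_match(a, b):
                parent[find(a["id"])] = find(b["id"])

    groups = defaultdict(list)
    for r in records:
        groups[find(r["id"])].append(r["id"])
    return sorted(groups.values())

# Hypothetical sample records, for illustration only.
RECORDS = [
    {"id": 1, "last": "Mohammadi", "email": "a.m@example.org"},
    {"id": 2, "last": "Mohamadi", "email": "A.M@example.org"},
    {"id": 3, "last": "Muhammadi", "email": "other@example.org"},
    {"id": 4, "last": "Hosseini", "email": "h@example.org"},
]

if __name__ == "__main__":
    print(group(RECORDS))
```

Blocking first keeps the number of pairwise comparisons small: only records that share a phonetic key are ever passed to the classifier, which is the role the search stage plays in the framework.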
Findings: The internal features email address, last name, and first name are the most significant features for resolving name-writing confusion. The results also show that the external features main subject and sub-subject are the least effective for solving the author name disambiguation problem in the academic database. In addition, a random forest used as the classifier in the matching stage reaches an accuracy of over 99% and can resolve the confusion in the writing of authors' names.
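One common way a random forest ranks features is by how much each split reduces Gini impurity. The abstract does not say which importance measure was used, so the sketch below only illustrates that general idea on invented toy data: a feature that separates matching from non-matching record pairs perfectly (here a hypothetical "email matches" flag) yields a large impurity decrease, while an uninformative one (a hypothetical "same main subject" flag) yields none.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (1 = same author, 0 = different)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def impurity_decrease(feature, labels):
    """Impurity reduction obtained by splitting the pairs on one binary feature."""
    left = [y for x, y in zip(feature, labels) if x]
    right = [y for x, y in zip(feature, labels) if not x]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Toy record pairs (assumed data, not from the study).
labels        = [1, 1, 1, 1, 0, 0, 0, 0]
email_match   = [1, 1, 1, 1, 0, 0, 0, 0]  # perfectly informative
subject_match = [1, 0, 1, 0, 1, 0, 1, 0]  # uninformative

if __name__ == "__main__":
    print("email:", impurity_decrease(email_match, labels))
    print("subject:", impurity_decrease(subject_match, labels))
```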
Conclusion: The results show that, measured by accuracy, recall, and F value, our framework unifies names more effectively than the support vector machine, nearest neighbor, and genetic-based methods. The proposed method can be applied to scientific databases to standardize authors' names. In future work, we plan to investigate the efficiency of the framework in a non-stationary environment, in which the distribution of the data may change over time.
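For reference, the evaluation criteria named above can be computed from pairwise match decisions as follows. This is a minimal sketch: the function name `pairwise_scores` and the label vectors are assumptions made for illustration, precision is included because the F value is derived from it, and none of the numbers are results from the study.

```python
def pairwise_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary same-author decisions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Invented labels: 1 = the two records belong to the same author.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

if __name__ == "__main__":
    print(pairwise_scores(y_true, y_pred))
```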
