Optimizing Confusion of Authors' Names in Persian Articles Using the Random Forest Method

Mozafari, Niloofar; Vara, Narjes

doi:10.22070/rsci.2021.13393.1449

Article type: Research article

Authors

¹ Assistant Professor, Systems Design and Operations Research Group, Regional Information Center for Science and Technology, Shiraz, Iran.

² Faculty member, Resource Evaluation and Development Research Group, Regional Information Center for Science and Technology, and PhD student in Knowledge and Information Science, Shiraz, Iran.

Abstract

Purpose: To present a framework for resolving the confusion and dispersion of authors' names in Persian articles, which has led to fragmentation and a lack of comprehensiveness in information retrieval.
Methodology: This is an applied scientometric study carried out using the documentary method. The statistical population consists of 913 records of author names of Persian articles, taken from the Islamic World Science Citation Database (ISC) and covering the period 1395 to 1397 (Solar Hijri calendar). The proposed framework consists of three stages: searching, matching, and grouping. After initial preprocessing and feature extraction, the search operation is performed to find records that are potentially identical. Identical records are then found through further examination in the matching stage, which is based on random forest.
Findings: Email address, last name, and first name are among the most important features for resolving confusion in the writing of names. Using random forest as the classifier in the matching stage can resolve the confusion in writing authors' names with an accuracy above 99 percent.
Conclusion: The results indicate the high efficiency of this method in unifying names, in terms of accuracy, recall, and F value, compared with support vector machine, nearest neighbor, and genetic classifiers.

DOR: 20.1001.1.24233773.1401.8.16.9.7

Article title [English]

Optimizing Confusion of Authors’ Names in Persian Articles Using Random Forest Algorithm

Authors [English]

Niloofar Mozafari ¹
Narjes Vara ²

¹ Assistant Professor, Design and System Operations Department, Regional Information Center for Science and Technology, Shiraz, Iran.

² Faculty member, Evaluation and Resource Development Department, Regional Information Center for Science and Technology, and PhD Student in Library and Information Science, Shiraz, Iran.

Abstract [English]

Purpose: An author's name is a key factor in distinguishing authors. In academic databases that store information on papers, searching by author name is one of the most important elements in increasing visibility and in quantitative scientometric studies, including citation counts. Variation in how names are written creates challenges across scientific fields. The lack of writing standards for the Persian language, the lack of standard keyboards and character codes, and the habit of simplified writing all contribute to author name ambiguity. Spelling mistakes made when writing a name also produce different written forms of a single name. Given the importance of resolving the confusion of authors' names in Persian articles, this paper proposes a framework to solve the problem of confusion and dispersion of authors' names in Persian articles, which has led to fragmentation and a lack of comprehensiveness in information retrieval.
Methodology: The present study is applied scientometric research carried out by the documentary method, and the required data were collected from the ISC database. The initial statistical population is 913 records from the period 2015 to 2017. The proposed framework consists of three stages: searching, matching, and grouping. First, initial pre-processing and feature extraction are performed. Our method extracts two types of features, internal and external. The internal features are extracted from the author's information, such as first name, last name, affiliation, email, and co-authors. The external features use the scientific history of authors, such as articles and research interests. Next, in the search stage, records that are potentially the same are identified. For this stage we propose a new method called Farsi-Soundex, inspired by the well-known Soundex algorithm, to group names that potentially refer to the same author. The records that are in fact the same are then found through further examination in the matching stage, which is based on random forests. The input of the matching stage is therefore a group of records that the Farsi-Soundex algorithm has flagged as the same, and a random forest algorithm is applied to them to decide whether they really are the same. Finally, in the grouping stage, all records that the random forest has identified as the same are placed in one group by a hash-based algorithm.
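The search and grouping stages described above can be illustrated with a small sketch. The abstract gives neither the Farsi-Soundex coding table nor the details of the hash-based grouping, so everything below is an assumption made for illustration: the `SOUND_CLASSES` table (Persian letters that are pronounced alike), the names `phonetic_key`, `block_by_key`, and `group_matches`, and the use of a dictionary-backed union-find for grouping. The matching stage is left out; `group_matches` simply takes the record pairs that a classifier has already accepted.

```python
from collections import defaultdict

# ASSUMPTION: illustrative sound classes of Persian letters that are pronounced alike.
# This is NOT the coding table of the paper's Farsi-Soundex, which the abstract does not give.
SOUND_CLASSES = [
    "\u0632\u0630\u0636\u0638",  # zeh, zal, zad, zah: all /z/
    "\u0633\u0635\u062b",        # sin, sad, seh: all /s/
    "\u062a\u0637",              # teh, tah: both /t/
    "\u062d\u0647",              # hah, heh: both /h/
    "\u0642\u063a",              # qaf, ghain: same sound in Iranian Persian
]
CODE = {ch: str(i + 1) for i, group in enumerate(SOUND_CLASSES) for ch in group}

# Arabic code points that are often typed in place of the Persian letters.
NORMALIZE = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0649": "\u06cc",  # alef maksura -> Persian yeh
    "\u0626": "\u06cc",  # yeh with hamza -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
})
SKIPPED = {"\u200c", "\u0640", " ", "-"}  # ZWNJ, tatweel, space, hyphen


def phonetic_key(name: str) -> str:
    """Soundex-style key: spellings that differ only in like-sounding letters share a key."""
    out = []
    for ch in name.translate(NORMALIZE):
        if ch in SKIPPED or "\u064b" <= ch <= "\u0652":  # also skip short-vowel marks
            continue
        code = CODE.get(ch, ch)  # letters outside the table stand for themselves
        if not out or out[-1] != code:  # collapse repeated sounds
            out.append(code)
    return "".join(out)


def block_by_key(records):
    """Search stage: bucket (record_id, surname) pairs by phonetic key."""
    blocks = defaultdict(list)
    for rec_id, surname in records:
        blocks[phonetic_key(surname)].append(rec_id)
    return dict(blocks)


def group_matches(matched_pairs):
    """Grouping stage: merge record ids linked by accepted pairs (dict-backed union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for x in parent:
        groups[find(x)].add(x)
    return sorted(sorted(g) for g in groups.values())
```

Only records that fall in the same block need to be compared pairwise in the matching stage, which is the usual reason for a phonetic search step; the pairs the classifier accepts are then passed to `group_matches`.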
Findings: The internal features email address, last name, and first name are the most significant features for resolving name-writing confusion. The results also show that the external features main subject and sub-subject are the least effective features for solving the author name disambiguation problem in the academic database. In addition, using a random forest as the classifier in the matching stage can solve the problem of confusion in writing authors' names with an accuracy of over 99%.
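As an illustration of how internal features might be turned into classifier input, the sketch below compares two author records and returns one similarity value per feature. The record fields, the names `pair_features` and `jaccard`, and the choice of similarity measures (exact match for email, `difflib.SequenceMatcher` ratio for names, Jaccard overlap for co-authors) are assumptions; the abstract names the features but not how they are compared. A classifier such as a random forest would be trained on vectors like these, which is not shown here.

```python
from difflib import SequenceMatcher


def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def string_sim(x: str, y: str) -> float:
    """Similarity in [0, 1] between two strings (case-insensitive)."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def pair_features(r1: dict, r2: dict) -> dict:
    """Feature vector for one candidate pair of records (hypothetical field names)."""
    same_email = bool(r1["email"]) and r1["email"].lower() == r2["email"].lower()
    return {
        "email_equal": float(same_email),  # a missing email never counts as a match
        "last_name_sim": string_sim(r1["last_name"], r2["last_name"]),
        "first_name_sim": string_sim(r1["first_name"], r2["first_name"]),
        "coauthor_jaccard": jaccard(r1["coauthors"], r2["coauthors"]),
    }
```

Keeping one value per feature makes it straightforward to inspect which features a trained classifier relies on most, which is the kind of comparison the findings report.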
Conclusion: The results show the high efficiency of our framework in unifying names, according to the criteria of accuracy, recall, and F value, compared with support vector machine, nearest neighbor, and genetic classifiers. The proposed method can be applied to scientific databases to standardize authors' names. In future work, we will investigate the efficiency of the framework in a non-stationary environment in which the data distribution may change over time.
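For readers unfamiliar with the evaluation criteria, the sketch below shows one common way to compute them over pairwise match decisions. It assumes that the "accuracy" criterion corresponds to precision, which the abstract does not state explicitly, and that the F value is the balanced F1 score; the function name `precision_recall_f1` and the pair-based formulation are likewise illustrative assumptions, not the paper's evaluation code.

```python
def precision_recall_f1(predicted_pairs, true_pairs):
    """Pairwise precision, recall, and F1 for same-author decisions.

    Each pair is an unordered pair of record ids; (1, 2) and (2, 1) count as the same pair.
    """
    predicted = {frozenset(p) for p in predicted_pairs}
    actual = {frozenset(p) for p in true_pairs}
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```

Precision falls when records of different authors are merged, and recall falls when records of the same author are left apart, so reporting both shows which kind of error a classifier makes.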

Keywords [English]

Name ambiguity
Article authors
Persian articles
Random forest algorithm
Name Authority
Farsi-Soundex algorithm

Volume 8 (Issue 2, Autumn and Winter), Serial Number 16
Mehr 1401 (September-October 2022)
Pages 203-220
