A survey of author name disambiguation techniques: 2010–2016

  • Ijaz Hussain and Sohail Asghar

The content and quality of service of digital libraries are badly affected by author name ambiguity in citations, which is considered one of the hardest problems facing digital library researchers. Several techniques have been proposed in the literature to address it. In this paper, we review recently presented author name disambiguation techniques and outline challenges and future research directions. We analyze recent advancements in the field and classify the techniques as supervised, unsupervised, semi-supervised, graph-based or heuristic-based, according to the problem formulation each mainly uses for author name disambiguation. A few earlier surveys have reviewed techniques for author name disambiguation, but they highlighted only the methodology adopted and did not critically review the shortcomings of the techniques. This survey provides a detailed review of the author name disambiguation techniques available in the literature, compares them at an abstract level and discusses their limitations.

The Knowledge Engineering Review
  • ISSN: 0269-8889
  • EISSN: 1469-8005