In-depth analysis of the impact of OCR errors on named entity recognition and linking

Ahmed Hamdi; Elvys Linhares Pontes; Nicolas Sidere; Mickaël Coustaty; Antoine Doucet

doi:10.1017/S1351324922000110

Published online by Cambridge University Press: 18 March 2022

Ahmed Hamdi*: Affiliation:
University of La Rochelle, Laboratoire L3i, Faculté des Sciences et Technologies, Bâtiment Pascal, Avenue Michel Crépeau, 17042 La Rochelle, France
Elvys Linhares Pontes: Affiliation:
University of La Rochelle, Laboratoire L3i, Faculté des Sciences et Technologies, Bâtiment Pascal, Avenue Michel Crépeau, 17042 La Rochelle, France
Nicolas Sidere: Affiliation:
University of La Rochelle, Laboratoire L3i, Faculté des Sciences et Technologies, Bâtiment Pascal, Avenue Michel Crépeau, 17042 La Rochelle, France
Mickaël Coustaty: Affiliation:
University of La Rochelle, Laboratoire L3i, Faculté des Sciences et Technologies, Bâtiment Pascal, Avenue Michel Crépeau, 17042 La Rochelle, France
Antoine Doucet: Affiliation:
University of La Rochelle, Laboratoire L3i, Faculté des Sciences et Technologies, Bâtiment Pascal, Avenue Michel Crépeau, 17042 La Rochelle, France
*Corresponding author. E-mail: ahmed.hamdi@univ-lr.fr

Abstract

Named entities (NEs) are among the most relevant types of information that can be used to properly index digital documents and thus easily retrieve them. It has long been observed that NEs are key to accessing the contents of digital library portals, as they are contained in most user queries. However, most digitized documents are indexed through their optical character recognition (OCR) output, which includes numerous errors. Although OCR engines have improved considerably over the last few years, OCR errors still substantially impact document access. Previous work evaluated the impact of OCR errors on named entity recognition (NER) and named entity linking (NEL) techniques separately. In this article, we experiment with a variety of OCRed documents with different levels and types of OCR noise to assess in depth the impact of OCR on named entity processing. We provide a deep analysis of the OCR errors that impact the performance of NER and NEL. We then present the resulting exhaustive study and subsequent recommendations on the adequate documents, the OCR quality levels, and the post-OCR correction strategies required to perform reliable NER and NEL.

Keywords

Named entity recognition and linking; Document indexing; Information retrieval; Optical character recognition; Neural networks

Type: Article
Information: Natural Language Engineering, Volume 29, Issue 2, March 2023, pp. 425–448

DOI: https://doi.org/10.1017/S1351324922000110
Copyright: © The Author(s), 2022. Published by Cambridge University Press
