Arabic spelling error detection and correction

  • MOHAMMED ATTIA (a1) (a2), PAVEL PECINA (a3), YOUNES SAMIH (a4), KHALED SHAALAN (a2) and JOSEF VAN GENABITH (a1)...
Abstract

A spelling error detection and correction application is typically based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system. We develop our dictionary of 9.2 million fully-inflected Arabic words (types) from a morphological transducer and a large corpus, validated and manually revised. We improve the error model by analyzing error types and creating an edit distance re-ranker. We also improve the language model by analyzing the level of noise in different data sources and selecting an optimal subset to train the system on. Testing and evaluation experiments show that our system significantly outperforms Microsoft Word 2013, OpenOffice Ayaspell 3.4 and Google Docs.
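To make the three-component architecture concrete, the following is a minimal sketch, not the authors' system. It assumes a tiny in-memory dictionary of transliterated toy words, plain Levenshtein distance standing in for the error model, and unigram counts standing in for the language model. The names `edit_distance`, `detect`, `correct`, `DICTIONARY` and `COUNTS` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def detect(word: str, dictionary: set) -> bool:
    """Component 1 (dictionary): a word is flagged if it is not listed."""
    return word not in dictionary


def correct(word: str, dictionary: set, counts: Counter,
            max_distance: int = 2) -> str:
    """Return the best candidate, or the word itself if none is found."""
    if not detect(word, dictionary):
        return word
    candidates = []
    for cand in dictionary:
        d = edit_distance(word, cand)  # Component 2 (error model), simplified
        if d <= max_distance:
            # Component 3 (language model), simplified to unigram frequency:
            # among equally distant candidates, prefer the more frequent one.
            candidates.append((d, -counts.get(cand, 0), cand))
    return min(candidates)[2] if candidates else word


# Toy data, assumed purely for illustration (transliterated, not real counts).
DICTIONARY = {"kitab", "katib", "maktab"}
COUNTS = Counter({"kitab": 50, "katib": 10, "maktab": 5})

print(correct("kitb", DICTIONARY, COUNTS))   # kitab
print(correct("katab", DICTIONARY, COUNTS))  # kitab (tie on distance, higher count)
```

The sketch scans the whole dictionary linearly, which would not scale to a word list of millions of types. It also ranks by raw distance first and frequency second, which is only one of many possible ways to combine an error model with a language model.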

Footnotes

We are grateful to our anonymous reviewers whose comments and suggestions have helped us to improve the paper considerably. This research is funded by the Irish Research Council for Science Engineering and Technology (IRCSET), the UAE National Research Foundation (NRF) (Grant No. 0514/2011), the Czech Science Foundation (grant no. P103/12/G084), DFG Collaborative Research Centre 991: The Structure of Representations in Language, Cognition, and Science (http://www.sfb991.uni-duesseldorf.de/sfb991), and the Science Foundation Ireland (Grant No. 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at Dublin City University.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering