Arabic spelling error detection and correction

  • MOHAMMED ATTIA (a1) (a2), PAVEL PECINA (a3), YOUNES SAMIH (a4), KHALED SHAALAN (a2) and JOSEF VAN GENABITH (a1)...
Abstract

A spelling error detection and correction application is typically based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system. We develop our dictionary of 9.2 million fully inflected Arabic words (types) from a morphological transducer and a large corpus, validated and manually revised. We improve the error model by analyzing error types and creating an edit distance re-ranker. We also improve the language model by analyzing the level of noise in different data sources and selecting an optimal subset to train the system on. Testing and evaluation experiments show that our system significantly outperforms Microsoft Word 2013, OpenOffice Ayaspell 3.4 and Google Docs.
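The interplay of the three components named in the abstract can be illustrated with a toy sketch. This is not the authors' system: the four-word `DICTIONARY`, the `UNIGRAM_COUNTS` frequencies, and the function names `edit_distance` and `correct` are all hypothetical, the error model is reduced to a plain Damerau-Levenshtein distance (optimal string alignment variant) rather than the re-ranker the paper describes, and the language model is reduced to unigram counts.

```python
from collections import Counter

# Hypothetical toy data; the paper's dictionary has 9.2 million word types.
DICTIONARY = {"كتاب", "كتب", "كاتب", "مكتب"}
UNIGRAM_COUNTS = Counter({"كتاب": 50, "كتب": 30, "كاتب": 10, "مكتب": 5})


def edit_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance, optimal string alignment variant.

    Counts insertions, deletions, substitutions and transpositions of
    adjacent characters, each with cost 1.
    """
    prev2, prev = None, list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            best = min(prev[j] + 1,         # deletion
                       cur[j - 1] + 1,      # insertion
                       prev[j - 1] + cost)  # substitution or match
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                best = min(best, prev2[j - 2] + 1)
            cur.append(best)
        prev2, prev = prev, cur
    return prev[-1]


def correct(word, dictionary, unigram_counts, max_distance=2):
    """Return the word unchanged if the dictionary accepts it; otherwise
    return the closest dictionary entry, breaking distance ties by frequency."""
    # Detection: the dictionary decides whether the word is an error.
    if word in dictionary:
        return word
    # Error model: keep dictionary entries within max_distance edits.
    candidates = [(edit_distance(word, w), w) for w in dictionary]
    candidates = [(d, w) for d, w in candidates if d <= max_distance]
    if not candidates:
        return word
    # Language model: among equally close candidates, prefer the more frequent.
    return min(candidates, key=lambda dw: (dw[0], -unigram_counts[dw[1]]))[1]


print(correct("كتا", DICTIONARY, UNIGRAM_COUNTS))
```

Scanning the whole dictionary for every misspelling is only workable at toy scale; with millions of entries a real system would need an indexed or finite-state candidate lookup, which this sketch does not attempt.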

Footnotes

We are grateful to our anonymous reviewers, whose comments and suggestions have helped us to improve the paper considerably. This research is funded by the Irish Research Council for Science, Engineering and Technology (IRCSET), the UAE National Research Foundation (NRF) (Grant No. 0514/2011), the Czech Science Foundation (Grant No. P103/12/G084), DFG Collaborative Research Centre 991: The Structure of Representations in Language, Cognition, and Science (http://www.sfb991.uni-duesseldorf.de/sfb991), and the Science Foundation Ireland (Grant No. 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at Dublin City University.


Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering
