A survey of automatic Arabic diacritization techniques

AQIL M. AZMI; REHAM S. ALMAJED

doi:10.1017/S1351324913000284

Published online by Cambridge University Press: 10 October 2013

AQIL M. AZMI and REHAM S. ALMAJED
Affiliation: Department of Computer Science, King Saud University, Riyadh 11543, Saudi Arabia
E-mails: aqil@ksu.edu.sa, reham.imamu@gmail.com

Abstract

Modern Standard Arabic texts are typically written without diacritical marks. Diacritics are important for clarifying the sense and meaning of words, and their absence may lead to ambiguity even for native speakers. Native speakers can often disambiguate the meaning from context; however, many Arabic applications, such as machine translation, text-to-speech, and information retrieval, are vulnerable to the lack of diacritics. The process of automatically restoring diacritical marks is called diacritization or diacritic restoration. In this paper we discuss the properties of the Arabic language and the issues related to the lack of diacritical marks. This is followed by a survey of recent algorithms developed to solve the diacritization problem. We also look into future trends for researchers working in this area.

Information

Type: Articles
Information: Natural Language Engineering, Volume 21, Issue 3, May 2015, pp. 477–495

DOI: https://doi.org/10.1017/S1351324913000284
Copyright: © Cambridge University Press 2013
