
A survey of diacritic restoration in abjad and alphabet writing systems

Published online by Cambridge University Press:  20 November 2017

FRANKLIN ỌLÁDIÍPỌ̀ ASAHIAH
Affiliation:
Department of Computer Science and Engineering, Obafemi Awolowo University, Ile-Ife, Nigeria e-mails: sobusola@oauife.edu.ng, oodejobi@oauife.edu.ng, eadagun@oauife.edu.ng
ỌDẸ́TÚNJÍ ÀJÀDÍ ỌDẸ́JỌBÍ
Affiliation:
Department of Computer Science and Engineering, Obafemi Awolowo University, Ile-Ife, Nigeria e-mails: sobusola@oauife.edu.ng, oodejobi@oauife.edu.ng, eadagun@oauife.edu.ng
EMMANUEL RÓTÌMÍ ADÁGÚNODÒ
Affiliation:
Department of Computer Science and Engineering, Obafemi Awolowo University, Ile-Ife, Nigeria e-mails: sobusola@oauife.edu.ng, oodejobi@oauife.edu.ng, eadagun@oauife.edu.ng

Abstract

A diacritic is a mark placed near or through a character to alter its original phonetic or orthographic value. Many languages around the world use diacritics in their orthography, whatever writing system the orthography is based on. In many languages, diacritics are ignored, either by convention or as a matter of convenience. For readers who are not familiar with the text domain, the absence of diacritics is known to cause mild to serious readability and comprehension problems. For natural language processing systems, however, the absence of diacritics causes near-intractable problems. This situation has led to extensive research on diacritization. Several techniques have been applied to diacritic restoration (or diacritization), but the existing surveys of these techniques are restricted to a few languages and hence leave gaps for practitioners to fill. Our survey examined diacritization from the angle of the resources deployed and the various formulations employed. It concluded by recommending that (a) any proposed technique for diacritization should consider the language features and the purpose served by diacritics, and (b) evaluation metrics need to be more rigorously defined so that the performance of models can be compared easily.

Type
Survey Paper
Copyright
Copyright © Cambridge University Press 2017 

