Unsupervised Arabic dialect segmentation for machine translation

Wael Salloum; Nizar Habash

doi:10.1017/S1351324920000455

Published online by Cambridge University Press: 23 September 2020

Wael Salloum*: AI Research Department, Mendel.ai, San Jose, CA, USA
Nizar Habash: AI Research Department, Mendel.ai, San Jose, CA, USA
*Corresponding author. E-mail: wael@ccls.columbia.edu

Abstract

Resource-limited and morphologically rich languages pose many challenges to natural language processing tasks. Their highly inflected surface forms inflate the vocabulary size and increase sparsity in an already scarce data situation. In this article, we present an unsupervised learning approach to vocabulary reduction through morphological segmentation. We demonstrate its value in the context of machine translation for dialectal Arabic (DA), the primarily spoken, orthographically unstandardized, morphologically rich, and yet resource-poor variants of Standard Arabic. Our approach exploits the existence of monolingual and parallel data. We show comparable performance to state-of-the-art supervised methods for DA segmentation.

Keywords

Machine translation; Morphology; Arabic dialects; Unsupervised learning

Information

Type: Article
Information: Natural Language Engineering, Volume 28, Issue 2, March 2022, pp. 223–248

DOI: https://doi.org/10.1017/S1351324920000455
Copyright: © The Author(s), 2020. Published by Cambridge University Press
