Effective multi-dialectal Arabic POS tagging

Kareem Darwish; Mohammed Attia; Hamdy Mubarak; Younes Samih; Ahmed Abdelali; Lluís Màrquez; Mohamed Eldesouki; Laura Kallmeyer

doi:10.1017/S1351324920000078


Published online by Cambridge University Press: 14 April 2020


Kareem Darwish: Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Mohammed Attia: Google Inc., New York, NY, USA
Hamdy Mubarak: Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Younes Samih: Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Ahmed Abdelali*: Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Lluís Màrquez: Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Mohamed Eldesouki: Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Laura Kallmeyer: Computational Linguistics Department, Heinrich-Heine-University Düsseldorf, 40204 Düsseldorf, Germany
*Corresponding author. E-mail: aabdelali@hbku.edu.qa


Abstract

This work introduces robust multi-dialectal part-of-speech (POS) tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two sequence tagging approaches. The first uses conditional random fields (CRFs); the second combines word- and character-based representations in a deep neural network with stacked convolutional and recurrent layers and a CRF output layer. We exploit a variety of features that help our models generalize, such as Brown clusters and stem templates. We also develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging between 90.2% and 95.4%. We obtained these results using a train/dev/test split of 70/10/20 on a data set of 350 tweets per dialect.
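To make the data arithmetic concrete, the following is a minimal sketch of a 70/10/20 split over 350 tweets for each of the four dialects, pooled into one joint training set. The abstract does not say how the split was drawn, so the per-dialect shuffling, the fixed seed, the placeholder tweet strings, and the helper names `split_dialect` and `build_joint_splits` are all assumptions made for illustration, not the authors' procedure.

```python
import random

# Dialect groups and data set size as stated in the abstract.
DIALECTS = ["Egyptian", "Levantine", "Gulf", "Maghrebi"]
TWEETS_PER_DIALECT = 350
SPLIT_PERCENT = (70, 10, 20)  # train / dev / test


def split_dialect(tweets, percents=SPLIT_PERCENT, seed=0):
    """Shuffle one dialect's tweets and cut them into train/dev/test.

    Hypothetical helper: random shuffling with a fixed seed is an assumption.
    """
    items = list(tweets)
    random.Random(seed).shuffle(items)
    n_train = len(items) * percents[0] // 100
    n_dev = len(items) * percents[1] // 100
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # remainder goes to test
    return train, dev, test


def build_joint_splits(corpus):
    """Pool the per-dialect splits so one joint model can see every dialect."""
    joint = {"train": [], "dev": [], "test": []}
    for dialect, tweets in corpus.items():
        parts = split_dialect(tweets)
        for name, part in zip(("train", "dev", "test"), parts):
            joint[name].extend((dialect, tweet) for tweet in part)
    return joint


# Placeholder strings stand in for the annotated tweets.
corpus = {
    d: [f"{d}_tweet_{i}" for i in range(TWEETS_PER_DIALECT)] for d in DIALECTS
}
joint = build_joint_splits(corpus)
sizes = {name: len(part) for name, part in joint.items()}
print(sizes)
```

Under these assumptions each dialect contributes 245 training, 35 development, and 70 test tweets, so a joint model trains on 980 tweets while a uni-dialectal tagger trains on 245.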

Keywords

Part-of-speech tagging; Arabic dialects; Deep neural network; Brown clusters

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 6: Natural Language Processing for Similar Languages, Varieties, and Dialects, November 2020, pp. 677–690

DOI: https://doi.org/10.1017/S1351324920000078
Copyright: © Cambridge University Press 2020


