
Effective multi-dialectal Arabic POS tagging

Published online by Cambridge University Press: 14 April 2020

Kareem Darwish
Affiliation:
Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Mohammed Attia
Affiliation:
Google Inc., New York, NY, USA
Hamdy Mubarak
Affiliation:
Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Younes Samih
Affiliation:
Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Ahmed Abdelali*
Affiliation:
Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Lluís Màrquez
Affiliation:
Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Mohamed Eldesouki
Affiliation:
Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Laura Kallmeyer
Affiliation:
Computational Linguistics Department, Heinrich-Heine-University Düsseldorf, 40204 Düsseldorf, Germany
*Corresponding author. E-mail: aabdelali@hbku.edu.qa

Abstract

This work introduces robust multi-dialectal part-of-speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network that stacks convolutional and recurrent layers beneath a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. We also develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging between 90.2% and 95.4%. We obtained these results using a train/dev/test split of 70/10/20 on a data set of 350 tweets per dialect.
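
To make the stated data split concrete, the following minimal Python sketch works out the partition sizes implied by a 70/10/20 split of 350 tweets per dialect. The abstract does not say how the split was drawn, so the per-dialect shuffling, the fixed seed, and the helper name `split_indices` are assumptions made purely for illustration, not the authors' procedure.

```python
import random

# Dialect groups and data set size as stated in the abstract.
DIALECTS = ["Egyptian", "Levantine", "Gulf", "Maghrebi"]
TWEETS_PER_DIALECT = 350


def split_indices(n, train_pct=70, dev_pct=10, seed=0):
    """Shuffle range(n) and cut it into train/dev/test index lists.

    Hypothetical helper: the shuffle and the seed are assumptions;
    only the 70/10/20 proportions come from the abstract.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n * train_pct // 100
    n_dev = n * dev_pct // 100
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]
    return train, dev, test


# One independent split per dialect (an assumption, see above).
splits = {d: split_indices(TWEETS_PER_DIALECT) for d in DIALECTS}
sizes = {d: tuple(len(part) for part in parts) for d, parts in splits.items()}

for dialect, (n_train, n_dev, n_test) in sizes.items():
    print(f"{dialect}: train={n_train} dev={n_dev} test={n_test}")
```

Under these assumptions each dialect contributes 245 training, 35 development, and 70 test tweets, for 1,400 tweets in total across the four dialects.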

Type
Article
Copyright
© Cambridge University Press 2020

