Effective multi-dialectal Arabic POS tagging

  • Kareem Darwish (a1), Mohammed Attia (a2), Hamdy Mubarak (a1), Younes Samih (a1), Ahmed Abdelali (a1), Lluís Màrquez (a1), Mohamed Eldesouki (a1) and Laura Kallmeyer (a3)...

Abstract

This work introduces robust multi-dialectal part-of-speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network with stacked layers of convolutional and recurrent networks and a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. We also develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging between 90.2% and 95.4%. We obtained these results using a train/dev/test split of 70/10/20 for a data set of 350 tweets per dialect.
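To make the split sizes concrete, the following is a minimal sketch of a 70/10/20 train/dev/test partition of 350 tweets per dialect. The abstract does not say how the tweets were assigned to each portion, so the random shuffle, the fixed seed, the per-dialect splitting, and the name `split_70_10_20` are all assumptions made for illustration; integer tweet indices stand in for the annotated tweets.

```python
import random

DIALECTS = ["Egyptian", "Levantine", "Gulf", "Maghrebi"]
TWEETS_PER_DIALECT = 350  # figure stated in the abstract


def split_70_10_20(items, seed=0):
    """Shuffle items and cut them into train/dev/test portions.

    Hypothetical helper: the shuffle and the seed are assumptions,
    not the procedure described by the authors.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    # Integer arithmetic avoids float rounding such as int(350 * 0.7) == 244.
    n_train = n * 70 // 100
    n_dev = n * 10 // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder, 20% here
    return train, dev, test


# Placeholder data: tweet indices instead of real annotated tweets.
splits = {d: split_70_10_20(range(TWEETS_PER_DIALECT)) for d in DIALECTS}
sizes = {d: tuple(len(part) for part in parts) for d, parts in splits.items()}

for dialect, (n_train, n_dev, n_test) in sizes.items():
    print(f"{dialect}: train={n_train} dev={n_dev} test={n_test}")
```

Under these assumptions each dialect contributes 245 training, 35 development, and 70 test tweets, or 980, 140, and 280 tweets when the four dialects are pooled for a joint model.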

Corresponding author

*Corresponding author. E-mail: aabdelali@hbku.edu.qa
