Article contents
Effective multi-dialectal arabic POS tagging
Published online by Cambridge University Press: 14 April 2020
Abstract
This work introduces robust multi-dialectal part of speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network with stacked layers of convolutional and recurrent networks with a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. Also, we develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per dialect results ranging between 90.2% and 95.4%. We obtained the results using a train/dev/test split of 70/10/20 for a data set of 350 tweets per dialect.
- Type
- Article
- Information
- Natural Language Engineering , Volume 26 , Issue 6: Natural Language Processing for Similar Languages, Varieties, and Dialects , November 2020 , pp. 677 - 690
- Copyright
- © Cambridge University Press 2020
References
- 2
- Cited by