Part of speech tagging for Arabic

SANDRA KÜBLER and EMAD MOHAMED
Abstract

This paper presents an investigation of part of speech (POS) tagging for Arabic as it occurs naturally, i.e. as unvocalized text (without diacritics). We also do not assume any prior tokenization, although previous work has used tokenization as a basis for POS tagging. Arabic is a morphologically complex language, i.e. words carry a high number of inflections, and the tagset is larger than the typical tagset for English. Both factors, the second being partly dependent on the first, increase the number of word/tag combinations for which the POS tagger needs to find estimates, and thus they contribute to data sparseness. We present a novel approach to Arabic POS tagging that does not require any pre-processing, such as segmentation or tokenization: whole word tagging. In this approach, the complete word is assigned a complex POS tag, which includes morphological information. A competing approach investigates the effect of segmentation and vocalization on POS tagging to alleviate data sparseness and ambiguity. In the segmentation-based approach, we first automatically segment words and then POS tag the segments. The complex tagset encompasses 993 POS tags, whereas the segment-based tagset encompasses only 139 tags. However, segments are also more ambiguous, so there are more possible combinations of segment tags. In realistic situations, in which we have no information about segmentation or vocalization, whole word tagging reaches the highest accuracy, 94.74%. If gold standard segmentation or vocalization is available, including this information improves POS tagging accuracy. However, while our automatic segmentation and vocalization modules reach state-of-the-art performance, their output is not reliable enough for POS tagging and actually impairs POS tagging performance. Finally, we investigate whether a reduction of the complex tagset to the Extra-Reduced Tagset, as suggested by Habash and Rambow (Habash, N., and Rambow, O. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, USA, pp. 573–80), will alleviate the data sparseness problem. While the POS tagging accuracy increases due to the smaller tagset, a closer look shows that using the complex tagset for POS tagging and then converting the resulting annotation to the smaller tagset results in a higher accuracy than tagging with the smaller tagset directly.
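
The tag-then-convert evaluation mentioned at the end of the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tag strings are invented toy examples, and `reduce_tag` is a hypothetical stand-in that simply drops morphological features; it is not the Extra-Reduced Tagset mapping of Habash and Rambow (2005). The sketch only shows why accuracy measured after conversion cannot be lower than accuracy on the complex tags: errors that differ only in features the smaller tagset ignores disappear. Whether conversion also beats tagging with the smaller tagset directly is an empirical question that this example does not address.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on empty input")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def reduce_tag(complex_tag: str) -> str:
    """Hypothetical reduction: keep only the core category of each segment.

    Segments are joined by '+', features are attached with '_'.
    This is a toy stand-in, NOT the actual Extra-Reduced Tagset mapping.
    """
    return "+".join(part.split("_")[0] for part in complex_tag.split("+"))


# Invented whole-word tags: one complex tag per word.
gold_complex = ["CONJ+NOUN_GEN", "VERB_PERF", "PREP+NOUN_GEN", "NOUN_NOM"]
pred_complex = ["CONJ+NOUN_ACC", "VERB_PERF", "PREP+NOUN_GEN", "VERB_PERF"]

# Accuracy on the complex tagset: the case error and the category error both count.
complex_acc = accuracy(gold_complex, pred_complex)

# Convert both annotations to the smaller tagset, then evaluate again.
gold_reduced = [reduce_tag(tag) for tag in gold_complex]
pred_reduced = [reduce_tag(tag) for tag in pred_complex]
converted_acc = accuracy(gold_reduced, pred_reduced)

print(f"complex tagset accuracy:   {complex_acc:.2f}")
print(f"accuracy after conversion: {converted_acc:.2f}")
```

In this toy data, the first word is wrong only in its case feature, so the error vanishes after conversion, while the category error on the last word remains.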

References
Adler, M., and Elhadad, M. 2006. An unsupervised morpheme-based HMM for Hebrew morphological disambiguation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, pp. 665–72.
Aha, D., Kibler, D., and Albert, M. K. 1991. Instance-based learning algorithms. Machine Learning 6: 37–66.
Bar-Haim, R., Sima'an, K., and Winter, Y. 2008. Part-of-speech tagging of Modern Hebrew text. Natural Language Engineering 14 (2): 223–51.
Beesley, K. 1990. Finite-state description of Arabic morphology. In Proceedings of the Second Cambridge Conference on Bilingual Computing in Arabic and English, Cambridge, UK.
Beesley, K. 1996. Arabic finite-state morphological analysis and generation. In Proceedings of the 16th International Conference on Computational Linguistics (COLING), Copenhagen, Denmark, pp. 89–94.
Bies, A., and Maamouri, M. 2003. Penn Arabic Treebank guidelines. Technical report, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA.
Brants, T. 2000. TnT–a statistical part-of-speech tagger. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics and the 6th Conference on Applied Natural Language Processing (ANLP/NAACL), Seattle, WA, pp. 224–31.
Buckwalter, T. 2002. Arabic Morphological Analyzer Version 1.0. Philadelphia, PA: Linguistic Data Consortium.
Canavan, A., Zipperlen, G., and Graff, D. 1997. CALLHOME Egyptian Arabic Speech. Philadelphia, PA: Linguistic Data Consortium.
Chiang, D., Diab, M., Habash, N., Rambow, O., and Sharif, S. 2006. Parsing Arabic dialects. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Trento, Italy, pp. 369–76.
Daelemans, W., and van den Bosch, A. 2005. Memory Based Language Processing. Cambridge, UK: Cambridge University Press.
Daelemans, W., van den Bosch, A., and Zavrel, J. 1999. Forgetting exceptions is harmful in language learning. Machine Learning 34 (1–3): 11–43 (special issue on Natural Language Learning).
Daelemans, W., Zavrel, J., Berck, P., and Gillis, S. 1996. MBT: a memory-based part of speech tagger-generator. In Proceedings of the 4th Workshop on Very Large Corpora, Copenhagen, Denmark, pp. 14–27.
Daelemans, W., Zavrel, J., van der Sloot, K., and van den Bosch, A. 2007. TiMBL: Tilburg memory based learner – version 6.1 – reference guide. Technical Report ILK 07-07, Induction of Linguistic Knowledge, Computational Linguistics, Tilburg University, Tilburg, Netherlands.
Darwish, K. 2002. Building a shallow Arabic morphological analyzer in one day. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Philadelphia, PA.
Diab, M. 2007. Towards an optimal POS tag set for Arabic processing. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, pp. 157–61.
Diab, M. 2009. Second generation AMIRA tools for Arabic processing: fast and robust tokenization, POS tagging, and base phrase chunking. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, pp. 285–88.
Diab, M., Hacioglu, K., and Jurafsky, D. 2004. Automatic tagging of Arabic text: from raw text to base phrase chunks. In Proceedings of the 5th Meeting of the North American Chapter of the Association for Computational Linguistics and Human Language Technologies Conference (HLT-NAACL), Boston, MA, pp. 149–52.
Gal, Y. 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Philadelphia, PA.
Giménez, J., and Màrquez, L. 2004. SVMTool: a general POS tagger generator based on support vector machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal, pp. 43–6.
Green, S., and Manning, C. D. 2010. Better Arabic parsing: baselines, evaluations, and analysis. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), Beijing, China, pp. 394–402.
Habash, N., Gabbard, R., Rambow, O., Kulick, S., and Marcus, M. 2007. Determining case in Arabic: learning complex linguistic behavior requires complex linguistic features. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 1084–92.
Habash, N., and Rambow, O. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, pp. 573–80.
Habash, N., and Rambow, O. 2006. MAGEAD: a morphological analyzer and generator for the Arabic dialects. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, pp. 681–8.
Habash, N., and Rambow, O. 2007. Arabic diacritization through full morphological tagging. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Rochester, NY, pp. 53–6.
Habash, N., Rambow, O., and Kiraz, G. 2005. Morphological analysis and generation for Arabic dialects. In Proceedings of the ACL Workshop on Semitic Languages, Ann Arbor, MI, pp. 17–24.
Habash, N., Rambow, O., and Roth, R. 2009. MADA+TOKAN: a toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, pp. 102–9.
Haertel, R., McClanahan, P., and Ringger, E. 2010. Automatic diacritization for low-resource languages using a hybrid word and consonant CMM. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Los Angeles, CA, pp. 519–27.
Karttunen, L., and Beesley, K. 1992. Two-level rule compiler. Technical Report ISTL-92-2, Xerox PARC, Palo Alto, CA, USA.
Kirchhoff, K., Bilmes, J., Henderson, J., Schwartz, R., Noamany, M., Schone, P., Ji, G., Das, S., Egan, M., He, F., Vergyri, D., Liu, D., and Duta, N. 2002. Novel speech recognition models for Arabic – final report of the JHU summer workshop. Technical Report, Johns Hopkins University, Baltimore, MD, USA.
Kübler, S., and Mohamed, E. 2008. Memory-based vocalization of Arabic. In Proceedings of the LREC Workshop on HLT and NLP within the Arabic World, Marrakech, Morocco, pp. 58–62.
Lee, Y.-S., Papineni, K., Roukos, S., Emam, O., and Hassan, H. 2003. Language model-based Arabic word segmentation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, Japan, pp. 399–406.
Marton, Y., Habash, N., and Rambow, O. 2010. Improving Arabic dependency parsing with inflectional and lexical morphological features. In Proceedings of the NAACL Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL), Los Angeles, CA, pp. 13–21.
Mohamed, E., and Kübler, S. 2009. Diacritization for real-world Arabic texts. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, pp. 251–7.
Mohamed, E., and Kübler, S. 2010a. Arabic part of speech tagging. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Valletta, Malta, pp. 2537–43.
Mohamed, E., and Kübler, S. 2010b. Is Arabic part of speech tagging feasible without word segmentation? In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Los Angeles, CA, pp. 705–8.
Nelken, R., and Shieber, S. 2005. Arabic diacritization using weighted finite-state transducers. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Ann Arbor, MI, pp. 79–86.
Ratnaparkhi, A. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of the Empirical Methods in Natural Language Processing Conference (EMNLP), Philadelphia, PA, pp. 133–42.
Roth, R., Rambow, O., Habash, N., Diab, M., and Rudin, C. 2008. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of ACL-08: HLT, Short Papers, Columbus, OH, pp. 117–20.
Segal, E. 2000. Hebrew Morphological Analyzer for Hebrew Undotted Text. Master's thesis, Technion, Haifa, Israel (in Hebrew).
Shaalan, K., Abo Bakr, H., and Ziedan, I. 2009. A hybrid approach for building Arabic diacritizer. In Proceedings of the EACL Workshop on Computational Approaches to Semitic Languages, Athens, Greece, pp. 27–35.
Shacham, D., and Wintner, S. 2007. Morphological disambiguation of Hebrew: a case study in classifier combination. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 439–47.
Sima'an, K., Itai, A., Winter, Y., Altmann, A., and Nativ, N. 2001. Building a tree-bank of Modern Hebrew text. Traitement Automatique des Langues 42: 347–80.
Tsarfaty, R. 2006. Integrated morphological and syntactic disambiguation for Modern Hebrew. In Proceedings of the COLING/ACL 2006 Student Research Workshop, Sydney, Australia, pp. 49–54.
Tsarfaty, R., and Goldberg, Y. 2008. Word-based or morpheme-based? Annotation strategies for Modern Hebrew clitics. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco, pp. 1421–7.
van den Bosch, A., Marsi, E., and Soudi, A. 2007. Memory-based morphological analysis and part-of-speech tagging of Arabic. In Soudi, A., van den Bosch, A., and Neumann, G. (eds.), Arabic Computational Morphology, pp. 203–19. Berlin: Springer.
Zavrel, J., and Daelemans, W. 1997. Memory-based learning: using similarity for smoothing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and the 8th Conference of the European Chapter of the ACL (ACL-EACL), Madrid, Spain, pp. 436–43.
Zitouni, I., Sorensen, J., and Sarikaya, R. 2006. Maximum entropy based restoration of Arabic diacritics. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, pp. 577–84.
Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering