Part of speech tagging for Arabic

Published online by Cambridge University Press:  06 December 2011

SANDRA KÜBLER
Affiliation:
Department of Linguistics, Indiana University, Bloomington, IN 47405, USA. E-mail: skuebler@indiana.edu
EMAD MOHAMED
Affiliation:
Computer Science Program, Carnegie Mellon University in Qatar, Education City, Doha, Qatar. E-mail: emohamed@qatar.cmu.edu

Abstract

This paper presents an investigation of part of speech (POS) tagging for Arabic as it occurs naturally, i.e. unvocalized text (without diacritics). We also do not assume any prior tokenization, although this was used previously as a basis for POS tagging. Arabic is a morphologically complex language, i.e. there is a high number of inflections per word, and the tagset is larger than the typical tagset for English. Both factors, the second one being partly dependent on the first, increase the number of word/tag combinations for which the POS tagger needs to find estimates, and thus they contribute to data sparseness. We present a novel approach to Arabic POS tagging that does not require any pre-processing, such as segmentation or tokenization: whole word tagging. In this approach, the complete word is assigned a complex POS tag, which includes morphological information. A competing approach investigates the effect of segmentation and vocalization on POS tagging to alleviate data sparseness and ambiguity. In the segmentation-based approach, we first automatically segment words and then POS tag the segments. The complex tagset encompasses 993 POS tags, whereas the segment-based tagset encompasses only 139 tags. However, segments are also more ambiguous, so there are more possible combinations of segment tags. In realistic situations, in which we have no information about segmentation or vocalization, whole word tagging reaches the highest accuracy of 94.74%. If gold standard segmentation or vocalization is available, including this information improves POS tagging accuracy. However, while our automatic segmentation and vocalization modules reach state-of-the-art performance, their performance is not reliable enough for POS tagging and actually impairs POS tagging performance. Finally, we investigate whether a reduction of the complex tagset to the Extra-Reduced Tagset as suggested by Habash and Rambow (Habash, N., and Rambow, O. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, USA, pp. 573–80) will alleviate the data sparseness problem. While the POS tagging accuracy increases due to the smaller tagset, a closer look shows that using a complex tagset for POS tagging and then converting the resulting annotation to the smaller tagset results in a higher accuracy than tagging using the smaller tagset directly.
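
The final comparison in the abstract, tagging with the complex tagset and then converting the output to a smaller tagset before scoring, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tag strings, the `TOY_REDUCTION` table, and the four-token example are hypothetical and do not reproduce the Extra-Reduced Tagset, the paper's data, or its tagger. The sketch only shows how the two accuracy figures would be computed.

```python
from typing import Dict, List, Sequence

# Hypothetical toy mapping from complex whole-word tags to a reduced tagset.
# This is NOT the Extra-Reduced Tagset; it only illustrates the conversion step.
TOY_REDUCTION: Dict[str, str] = {
    "DET+NOUN+CASE_DEF_NOM": "N",
    "DET+NOUN+CASE_DEF_GEN": "N",
    "PV+PVSUFF_SUBJ:3MS": "V",
    "PREP": "P",
}


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of positions where predicted and gold tags agree."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def reduce_tags(tags: Sequence[str], mapping: Dict[str, str]) -> List[str]:
    """Convert each complex tag to its reduced counterpart."""
    return [mapping[tag] for tag in tags]


# Invented gold and predicted annotations for four words.
gold_complex = [
    "DET+NOUN+CASE_DEF_NOM",
    "PV+PVSUFF_SUBJ:3MS",
    "PREP",
    "DET+NOUN+CASE_DEF_GEN",
]
pred_complex = [
    "DET+NOUN+CASE_DEF_GEN",  # wrong case feature, correct core category
    "PV+PVSUFF_SUBJ:3MS",
    "PREP",
    "DET+NOUN+CASE_DEF_GEN",
]

# Score on the complex tagset, then convert both sides and score again.
complex_acc = accuracy(gold_complex, pred_complex)
reduced_acc = accuracy(
    reduce_tags(gold_complex, TOY_REDUCTION),
    reduce_tags(pred_complex, TOY_REDUCTION),
)

print(f"complex tagset accuracy: {complex_acc:.2f}")
print(f"accuracy after conversion: {reduced_acc:.2f}")
```

In this toy case the only error concerns a morphological feature that the reduced tagset does not distinguish, so the error disappears after conversion. Whether tagging and then converting beats tagging directly with the smaller tagset is an empirical question that the paper answers on real data; the sketch does not demonstrate it.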

Type: Articles
Copyright © Cambridge University Press 2011

References

Adler, M., and Elhadad, M. 2006. An unsupervised morpheme-based HMM for Hebrew morphological disambiguation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, pp. 665–72.
Aha, D., Kibler, D., and Albert, M. K. 1991. Instance-based learning algorithms. Machine Learning 6: 37–66.
Bar-Haim, R., Sima'an, K., and Winter, Y. 2008. Part-of-speech tagging of Modern Hebrew text. Natural Language Engineering 14 (2): 223–51.
Beesley, K. 1990. Finite-state description of Arabic morphology. In Proceedings of the Second Cambridge Conference on Bilingual Computing in Arabic and English, Cambridge, UK.
Beesley, K. 1996. Arabic finite-state morphological analysis and generation. In Proceedings of the 16th International Conference on Computational Linguistics (COLING), Copenhagen, Denmark, pp. 89–94.
Bies, A., and Maamouri, M. 2003. Penn Arabic Treebank guidelines. Technical report, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA.
Brants, T. 2000. TnT–a statistical part-of-speech tagger. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics and the 6th Conference on Applied Natural Language Processing (ANLP/NAACL), Seattle, WA, pp. 224–31.
Buckwalter, T. 2002. Arabic Morphological Analyzer Version 1.0. Philadelphia, PA: Linguistic Data Consortium.
Canavan, A., Zipperlen, G., and Graff, D. 1997. CALLHOME Egyptian Arabic Speech. Philadelphia, PA: Linguistic Data Consortium.
Chiang, D., Diab, M., Habash, N., Rambow, O., and Sharif, S. 2006. Parsing Arabic dialects. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Trento, Italy, pp. 369–76.
Daelemans, W., and van den Bosch, A. 2005. Memory Based Language Processing. Cambridge, UK: Cambridge University Press.
Daelemans, W., van den Bosch, A., and Zavrel, J. 1999. Forgetting exceptions is harmful in language learning. Machine Learning 34 (1–3): 11–43 (special issue on Natural Language Learning).
Daelemans, W., Zavrel, J., Berck, P., and Gillis, S. 1996. MBT: a memory-based part of speech tagger-generator. In Proceedings of the 4th Workshop on Very Large Corpora, Copenhagen, Denmark, pp. 14–27.
Daelemans, W., Zavrel, J., van der Sloot, K., and van den Bosch, A. 2007. TiMBL: Tilburg memory based learner – version 6.1 – reference guide. Technical Report ILK 07-07, Induction of Linguistic Knowledge, Computational Linguistics, Tilburg University, Tilburg, Netherlands.
Darwish, K. 2002. Building a shallow Arabic morphological analyzer in one day. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Philadelphia, PA.
Diab, M. 2007. Towards an optimal POS tag set for Arabic processing. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, pp. 157–61.
Diab, M. 2009. Second generation AMIRA tools for Arabic processing: fast and robust tokenization, POS tagging, and base phrase chunking. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, pp. 285–88.
Diab, M., Hacioglu, K., and Jurafsky, D. 2004. Automatic tagging of Arabic text: from raw text to base phrase chunks. In Proceedings of the 5th Meeting of the North American Chapter of the Association for Computational Linguistics and Human Language Technologies Conference (HLT-NAACL), Boston, MA, pp. 149–52.
Gal, Y. 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Philadelphia, PA.
Giménez, J., and Màrquez, L. 2004. SVMTool: a general POS tagger generator based on support vector machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal, pp. 43–6.
Green, S., and Manning, C. D. 2010. Better Arabic parsing: baselines, evaluations, and analysis. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), Beijing, China, pp. 394–402.
Habash, N., Gabbard, R., Rambow, O., Kulick, S., and Marcus, M. 2007. Determining case in Arabic: learning complex linguistic behavior requires complex linguistic features. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 1084–92.
Habash, N., and Rambow, O. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, pp. 573–80.
Habash, N., and Rambow, O. 2006. MAGEAD: a morphological analyzer and generator for the Arabic dialects. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, pp. 681–8.
Habash, N., and Rambow, O. 2007. Arabic diacritization through full morphological tagging. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Rochester, NY, pp. 53–6.
Habash, N., Rambow, O., and Kiraz, G. 2005. Morphological analysis and generation for Arabic dialects. In Proceedings of the ACL Workshop on Semitic Languages, Ann Arbor, MI, pp. 17–24.
Habash, N., Rambow, O., and Roth, R. 2009. MADA+TOKAN: a toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, pp. 102–9.
Haertel, R., McClanahan, P., and Ringger, E. 2010. Automatic diacritization for low-resource languages using a hybrid word and consonant CMM. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Los Angeles, CA, pp. 519–27.
Karttunen, L., and Beesley, K. 1992. Two-level rule compiler. Technical Report ISTL-92-2, Xerox PARC, Palo Alto, CA, USA.
Kirchhoff, K., Bilmes, J., Henderson, J., Schwartz, R., Noamany, M., Schone, P., Ji, G., Das, S., Egan, M., He, F., Vergyri, D., Liu, D., and Duta, N. 2002. Novel speech recognition models for Arabic – final report of the JHU summer workshop. Technical Report, Johns Hopkins University, Baltimore, MD, USA.
Kübler, S., and Mohamed, E. 2008. Memory-based vocalization of Arabic. In Proceedings of the LREC Workshop on HLT and NLP within the Arabic World, Marrakech, Morocco, pp. 58–62.
Lee, Y.-S., Papineni, K., Roukos, S., Emam, O., and Hassan, H. 2003. Language model-based Arabic word segmentation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, Japan, pp. 399–406.
Marton, Y., Habash, N., and Rambow, O. 2010. Improving Arabic dependency parsing with inflectional and lexical morphological features. In Proceedings of the NAACL Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL), Los Angeles, CA, pp. 13–21.
Mohamed, E., and Kübler, S. 2009. Diacritization for real-world Arabic texts. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, pp. 251–7.
Mohamed, E., and Kübler, S. 2010a. Arabic part of speech tagging. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Valletta, Malta, pp. 2537–43.
Mohamed, E., and Kübler, S. 2010b. Is Arabic part of speech tagging feasible without word segmentation? In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Los Angeles, CA, pp. 705–8.
Nelken, R., and Shieber, S. 2005. Arabic diacritization using weighted finite-state transducers. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Ann Arbor, MI, pp. 79–86.
Ratnaparkhi, A. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of the Empirical Methods in Natural Language Processing Conference (EMNLP), Philadelphia, PA, pp. 133–42.
Roth, R., Rambow, O., Habash, N., Diab, M., and Rudin, C. 2008. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of ACL-08: HLT, Short Papers, Columbus, OH, pp. 117–20.
Segal, E. 2000. Hebrew Morphological Analyzer for Hebrew Undotted Text. Master's thesis, Technion, Haifa, Israel (in Hebrew).
Shaalan, K., Abo Bakr, H., and Ziedan, I. 2009. A hybrid approach for building Arabic diacritizer. In Proceedings of the EACL Workshop on Computational Approaches to Semitic Languages, Athens, Greece, pp. 27–35.
Shacham, D., and Wintner, S. 2007. Morphological disambiguation of Hebrew: a case study in classifier combination. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 439–47.
Sima'an, K., Itai, A., Winter, Y., Altmann, A., and Nativ, N. 2001. Building a tree-bank of Modern Hebrew text. Traitement Automatique des Langues 42: 347–80.
Tsarfaty, R. 2006. Integrated morphological and syntactic disambiguation for Modern Hebrew. In Proceedings of the COLING/ACL 2006 Student Research Workshop, Sydney, Australia, pp. 49–54.
Tsarfaty, R., and Goldberg, Y. 2008. Word-based or morpheme-based? Annotation strategies for Modern Hebrew clitics. In Proceedings of the Sixth International Language Resources and Evaluation (LREC), Marrakech, Morocco, pp. 1421–7.
van den Bosch, A., Marsi, E., and Soudi, A. 2007. Memory-based morphological analysis and part-of-speech tagging of Arabic. In Soudi, A., van den Bosch, A., and Neumann, G. (eds.), Arabic Computational Morphology, pp. 203–19. Berlin: Springer.
Zavrel, J., and Daelemans, W. 1997. Memory-based learning: using similarity for smoothing. In Proceedings of the 35th Annual Meeting of the Association of Computational Linguistics and the 8th Conference of the European Chapter of the ACL (ACL-EACL), Madrid, Spain, pp. 436–43.
Zitouni, I., Sorensen, J., and Sarikaya, R. 2006. Maximum entropy based restoration of Arabic diacritics. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, pp. 577–84.