Part of speech tagging for Arabic

SANDRA KÜBLER; EMAD MOHAMED

doi:10.1017/S1351324911000325

Published online by Cambridge University Press: 06 December 2011

SANDRA KÜBLER: Department of Linguistics, Indiana University, Bloomington, IN 47405, USA; e-mail: skuebler@indiana.edu
EMAD MOHAMED: Computer Science Program, Carnegie Mellon University in Qatar, Education City, Doha, Qatar; e-mail: emohamed@qatar.cmu.edu

Abstract

This paper presents an investigation of part of speech (POS) tagging for Arabic as it occurs naturally, i.e. unvocalized text (without diacritics). We also do not assume any prior tokenization, although this was used previously as a basis for POS tagging. Arabic is a morphologically complex language, i.e. there is a high number of inflections per word, and the tagset is larger than the typical tagset for English. Both factors, the second one being partly dependent on the first, increase the number of word/tag combinations for which the POS tagger needs to find estimates, and thus they contribute to data sparseness. We present a novel approach to Arabic POS tagging that does not require any pre-processing, such as segmentation or tokenization: whole word tagging. In this approach, the complete word is assigned a complex POS tag, which includes morphological information. A competing approach investigates the effect of segmentation and vocalization on POS tagging to alleviate data sparseness and ambiguity. In the segmentation-based approach, we first automatically segment words and then POS tag the segments. The complex tagset encompasses 993 POS tags, whereas the segment-based tagset encompasses only 139 tags. However, segments are also more ambiguous, so there are more possible combinations of segment tags. In realistic situations, in which we have no information about segmentation or vocalization, whole word tagging reaches the highest accuracy of 94.74%. If gold standard segmentation or vocalization is available, including this information improves POS tagging accuracy. However, while our automatic segmentation and vocalization modules reach state-of-the-art performance, their performance is not reliable enough for POS tagging and actually impairs POS tagging performance. Finally, we investigate whether a reduction of the complex tagset to the Extra-Reduced Tagset as suggested by Habash and Rambow (2005) will alleviate the data sparseness problem. While the POS tagging accuracy increases due to the smaller tagset, a closer look shows that using a complex tagset for POS tagging and then converting the resulting annotation to the smaller tagset results in a higher accuracy than tagging using the smaller tagset directly.
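
The final comparison in the abstract, tagging with the complex tagset and then converting the output to a smaller tagset before scoring, can be illustrated with a minimal Python sketch. The tag strings, the `reduce_tag` rule, and the toy data are invented for illustration only; they are not the Extra-Reduced Tagset, the Penn Arabic Treebank tagset, or the authors' evaluation code.

```python
from typing import Callable, Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def reduce_tag(complex_tag: str) -> str:
    """Hypothetical reduction rule (an assumption, not the published mapping).

    Assumes whole-word tags look like 'CONJ+VERB_PERF', where '+' joins the
    tags of the word's segments and '_' introduces morphological features.
    The rule keeps only the core category of each segment.
    """
    return "+".join(part.split("_", 1)[0] for part in complex_tag.split("+"))


def accuracy_after_reduction(
    gold: Sequence[str],
    predicted: Sequence[str],
    reduce: Callable[[str], str] = reduce_tag,
) -> float:
    """Map both gold and predicted complex tags to the small tagset, then score."""
    return accuracy([reduce(t) for t in gold], [reduce(t) for t in predicted])


# Invented toy data: one complex tag per whole word.
gold = ["NOUN_DEF_GEN", "CONJ+VERB_PERF", "PREP+NOUN_INDEF_GEN", "ADJ_DEF_NOM"]
pred = ["NOUN_DEF_NOM", "CONJ+VERB_PERF", "PREP+NOUN_INDEF_GEN", "NOUN_DEF_NOM"]

print(accuracy(gold, pred))                  # 0.5
print(accuracy_after_reduction(gold, pred))  # 0.75
```

In this toy case the first word is wrong only in its case feature, so the error disappears once both tag sequences are reduced, while the category error on the last word remains. The sketch shows only the scoring procedure; it says nothing about how a tagger trained directly on the smaller tagset would perform.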

Type: Articles
Information: Natural Language Engineering, Volume 18, Issue 4, October 2012, pp. 521–548

DOI: https://doi.org/10.1017/S1351324911000325
Copyright: © Cambridge University Press 2011

References

Adler, M., and Elhadad, M. 2006. An unsupervised morpheme-based HMM for Hebrew morphological disambiguation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, pp. 665–72.

Aha, D., Kibler, D., and Albert, M. K. 1991. Instance-based learning algorithms. Machine Learning 6: 37–66.

Bar-Haim, R., Sima'an, K., and Winter, Y. 2008. Part-of-speech tagging of Modern Hebrew text. Natural Language Engineering 14 (2): 223–51.

Beesley, K. 1990. Finite-state description of Arabic morphology. In Proceedings of the Second Cambridge Conference on Bilingual Computing in Arabic and English, Cambridge, UK.

Beesley, K. 1996. Arabic finite-state morphological analysis and generation. In Proceedings of the 16th International Conference on Computational Linguistics (COLING), Copenhagen, Denmark, pp. 89–94.

Bies, A., and Maamouri, M. 2003. Penn Arabic Treebank guidelines. Technical report, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA.

Brants, T. 2000. TnT – a statistical part-of-speech tagger. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics and the 6th Conference on Applied Natural Language Processing (ANLP/NAACL), Seattle, WA, pp. 224–31.

Buckwalter, T. 2002. Arabic Morphological Analyzer Version 1.0. Philadelphia, PA: Linguistic Data Consortium.

Canavan, A., Zipperlen, G., and Graff, D. 1997. CALLHOME Egyptian Arabic Speech. Philadelphia, PA: Linguistic Data Consortium.

Chiang, D., Diab, M., Habash, N., Rambow, O., and Sharif, S. 2006. Parsing Arabic dialects. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Trento, Italy, pp. 369–76.

Daelemans, W., and van den Bosch, A. 2005. Memory Based Language Processing. Cambridge, UK: Cambridge University Press.

Daelemans, W., van den Bosch, A., and Zavrel, J. 1999. Forgetting exceptions is harmful in language learning. Machine Learning 34 (1–3): 11–43 (special issue on Natural Language Learning).

Daelemans, W., Zavrel, J., Berck, P., and Gillis, S. 1996. MBT: a memory-based part of speech tagger-generator. In Proceedings of the 4th Workshop on Very Large Corpora, Copenhagen, Denmark, pp. 14–27.

Daelemans, W., Zavrel, J., van der Sloot, K., and van den Bosch, A. 2007. TiMBL: Tilburg memory based learner – version 6.1 – reference guide. Technical Report ILK 07-07, Induction of Linguistic Knowledge, Computational Linguistics, Tilburg University, Tilburg, Netherlands.

Darwish, K. 2002. Building a shallow Arabic morphological analyzer in one day. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Philadelphia, PA.

Diab, M. 2007. Towards an optimal POS tag set for Arabic processing. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, pp. 157–61.

Diab, M. 2009. Second generation AMIRA tools for Arabic processing: fast and robust tokenization, POS tagging, and base phrase chunking. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, pp. 285–88.

Diab, M., Hacioglu, K., and Jurafsky, D. 2004. Automatic tagging of Arabic text: from raw text to base phrase chunks. In Proceedings of the 5th Meeting of the North American Chapter of the Association for Computational Linguistics and Human Language Technologies Conference (HLT-NAACL), Boston, MA, pp. 149–52.

Gal, Y. 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Philadelphia, PA.

Giménez, J., and Màrquez, L. 2004. SVMTool: a general POS tagger generator based on support vector machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal, pp. 43–6.

Green, S., and Manning, C. D. 2010. Better Arabic parsing: baselines, evaluations, and analysis. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), Beijing, China, pp. 394–402.

Habash, N., Gabbard, R., Rambow, O., Kulick, S., and Marcus, M. 2007. Determining case in Arabic: learning complex linguistic behavior requires complex linguistic features. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 1084–92.

Habash, N., and Rambow, O. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, pp. 573–80.

Habash, N., and Rambow, O. 2006. MAGEAD: a morphological analyzer and generator for the Arabic dialects. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, pp. 681–8.

Habash, N., and Rambow, O. 2007. Arabic diacritization through full morphological tagging. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Rochester, NY, pp. 53–6.

Habash, N., Rambow, O., and Kiraz, G. 2005. Morphological analysis and generation for Arabic dialects. In Proceedings of the ACL Workshop on Semitic Languages, Ann Arbor, MI, pp. 17–24.

Habash, N., Rambow, O., and Roth, R. 2009. MADA+TOKAN: a toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, pp. 102–9.

Haertel, R., McClanahan, P., and Ringger, E. 2010. Automatic diacritization for low-resource languages using a hybrid word and consonant CMM. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Los Angeles, CA, pp. 519–27.

Karttunen, L., and Beesley, K. 1992. Two-level rule compiler. Technical Report ISTL-92-2, Xerox PARC, Palo Alto, CA, USA.

Kirchhoff, K., Bilmes, J., Henderson, J., Schwartz, R., Noamany, M., Schone, P., Ji, G., Das, S., Egan, M., He, F., Vergyri, D., Liu, D., and Duta, N. 2002. Novel speech recognition models for Arabic – final report of the JHU summer workshop. Technical Report, Johns Hopkins University, Baltimore, MD, USA.

Kübler, S., and Mohamed, E. 2008. Memory-based vocalization of Arabic. In Proceedings of the LREC Workshop on HLT and NLP within the Arabic World, Marrakech, Morocco, pp. 58–62.

Lee, Y.-S., Papineni, K., Roukos, S., Emam, O., and Hassan, H. 2003. Language model-based Arabic word segmentation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, Japan, pp. 399–406.

Marton, Y., Habash, N., and Rambow, O. 2010. Improving Arabic dependency parsing with inflectional and lexical morphological features. In Proceedings of the NAACL Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL), Los Angeles, CA, pp. 13–21.

Mohamed, E., and Kübler, S. 2009. Diacritization for real-world Arabic texts. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, pp. 251–7.

Mohamed, E., and Kübler, S. 2010a. Arabic part of speech tagging. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Valletta, Malta, pp. 2537–43.

Mohamed, E., and Kübler, S. 2010b. Is Arabic part of speech tagging feasible without word segmentation? In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Los Angeles, CA, pp. 705–8.

Nelken, R., and Shieber, S. 2005. Arabic diacritization using weighted finite-state transducers. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Ann Arbor, MI, pp. 79–86.

Ratnaparkhi, A. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of the Empirical Methods in Natural Language Processing Conference (EMNLP), Philadelphia, PA, pp. 133–42.

Roth, R., Rambow, O., Habash, N., Diab, M., and Rudin, C. 2008. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of ACL-08: HLT, Short Papers, Columbus, OH, pp. 117–20.

Segal, E. 2000. Hebrew Morphological Analyzer for Hebrew Undotted Text. Master's thesis, Technion, Haifa, Israel (in Hebrew).

Shaalan, K., Abo Bakr, H., and Ziedan, I. 2009. A hybrid approach for building Arabic diacritizer. In Proceedings of the EACL Workshop on Computational Approaches to Semitic Languages, Athens, Greece, pp. 27–35.

Shacham, D., and Wintner, S. 2007. Morphological disambiguation of Hebrew: a case study in classifier combination. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 439–47.

Sima'an, K., Itai, A., Winter, Y., Altmann, A., and Nativ, N. 2001. Building a tree-bank of Modern Hebrew text. Traitement Automatique des Langues 42: 347–80.

Tsarfaty, R. 2006. Integrated morphological and syntactic disambiguation for Modern Hebrew. In Proceedings of the COLING/ACL 2006 Student Research Workshop, Sydney, Australia, pp. 49–54.

Tsarfaty, R., and Goldberg, Y. 2008. Word-based or morpheme-based? Annotation strategies for Modern Hebrew clitics. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco, pp. 1421–7.

van den Bosch, A., Marsi, E., and Soudi, A. 2007. Memory-based morphological analysis and part-of-speech tagging of Arabic. In Soudi, A., van den Bosch, A., and Neumann, G. (eds.), Arabic Computational Morphology, pp. 203–19. Berlin: Springer.

Zavrel, J., and Daelemans, W. 1997. Memory-based learning: using similarity for smoothing. In Proceedings of the 35th Annual Meeting of the Association of Computational Linguistics and the 8th Conference of the European Chapter of the ACL (ACL-EACL), Madrid, Spain, pp. 436–43.

Zitouni, I., Sorensen, J., and Sarikaya, R. 2006. Maximum entropy based restoration of Arabic diacritics. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, pp. 577–84.
