Part-of-speech tagging of Modern Hebrew text

ROY BAR-HAIM; KHALIL SIMA'AN; YOAD WINTER

doi:10.1017/S135132490700455X

Published online by Cambridge University Press: 01 April 2008

ROY BAR-HAIM: Affiliation:
Dept. of Computer Science, Bar-Ilan University, Ramat-Gan 52900, Israel; e-mail: barhair@cs.biu.ac.il
KHALIL SIMA'AN: Affiliation:
Institute for Logic, Language and Computation, Universiteit van Amsterdam, Amsterdam, The Netherlands; e-mail: simaan@science.uva.nl
YOAD WINTER: Affiliation:
Dept. of Computer Science, Technion, Haifa 32000, Israel; e-mail: winter@cs.technion.ac.il; and Netherlands Institute for Advanced Study, Meijboomlaan 1, 2242 PR Wassenaar, The Netherlands

Abstract

Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a part-of-speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level, the output tags must be complex, and represent both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the different alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable, and, as we aim to show, beneficial. To the best of our knowledge, the problem of tokenization for POS tagging of Semitic languages has not been addressed before in full generality. In this paper, we study this problem for the purpose of POS tagging of Modern Hebrew texts. After extensive error analysis of the two simple tokenization models, we propose a novel, linguistically motivated, intermediate tokenization model that performs better for Hebrew than either of the two initial architectures. Our study is based on the well-known hidden Markov models (HMMs). We start out from a manually devised morphological analyzer and a very small annotated corpus, and describe how to adapt an HMM-based POS tagger for both tokenization architectures. We present an effective technique for smoothing the lexical probabilities using an untagged corpus, and a novel transformation for casting the segment-level tagger in terms of a standard, word-level HMM implementation.
The results obtained using our model are on par with the best published results on Modern Standard Arabic, despite the much smaller annotated corpus available for Modern Hebrew.
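
To make the contrast between the two tokenization levels concrete, the following minimal Python sketch shows how the same morphological analyses could be presented to a tagger in each architecture. It is an illustration only, not the authors' implementation: the toy lexicon, the placeholder words "abc" and "de", the tag names, and the function names are all hypothetical, and no probabilities or HMM decoding are modelled.

```python
from itertools import product
from typing import Dict, List, Tuple

Segment = Tuple[str, str]   # (segment string, POS tag)
Analysis = List[Segment]    # one segmentation-plus-tagging of a word

# Hypothetical analyzer output for two placeholder words (not real Hebrew data).
LEXICON: Dict[str, List[Analysis]] = {
    "abc": [[("a", "PREP"), ("bc", "NOUN")], [("abc", "VERB")]],
    "de": [[("de", "NOUN")]],
}


def word_level_candidates(word: str) -> List[str]:
    """Word-level view: the token is the whole word, and each candidate
    is a single complex tag that encodes segmentation and POS tags."""
    return ["+".join(tag for _, tag in analysis) for analysis in LEXICON[word]]


def segment_level_paths(words: List[str]) -> List[List[Segment]]:
    """Segment-level view: the tokens are segments, so the input must
    encode every alternative segmentation; each path is one flattened
    sequence of (segment, simple POS tag) pairs for the sentence."""
    paths = []
    for combo in product(*(LEXICON[w] for w in words)):
        paths.append([seg for analysis in combo for seg in analysis])
    return paths


if __name__ == "__main__":
    print(word_level_candidates("abc"))
    for path in segment_level_paths(["abc", "de"]):
        print(path)
```

In the word-level view the tag set grows with every distinct combination of segment tags, whereas in the segment-level view the tag set stays small but candidate paths differ in length, which is one reason a direct comparison of the two architectures is not trivial.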

Information

Type: Papers
Information: Natural Language Engineering, Volume 14, Issue 2, April 2008, pp. 223–251

DOI: https://doi.org/10.1017/S135132490700455X
Copyright: © Cambridge University Press 2007
