The Penn Chinese TreeBank: Phrase structure annotation of a large corpus

NAIWEN XUE; FEI XIA; FU-DONG CHIOU; MARTA PALMER

doi:10.1017/S135132490400364X

Abstract

With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a large bracketed corpus since late 1998. The first two installments of the corpus, 250 thousand words of data, fully segmented, POS-tagged and syntactically bracketed, have been released to the public via LDC (www.ldc.upenn.edu). In this paper, we discuss several Chinese linguistic issues and their implications for our treebanking efforts and how we address these issues when developing our annotation guidelines. We also describe our engineering strategies to improve speed while ensuring annotation quality.

