Dropped personal pronoun recovery in Chinese SMS*

  • CHRIS GIANNELLA (a1), RANSOM WINDER (a1) and STACY PETERSEN (a1)
Abstract

In written Chinese, personal pronouns are commonly dropped when they can be inferred from context. This practice is particularly common in informal genres such as Short Message Service (SMS) messages sent via cell phones. Restoring dropped personal pronouns can be a useful preprocessing step for information extraction. Dropped personal pronoun recovery can be divided into two subtasks: (1) detecting dropped personal pronoun slots and (2) determining the identity of the pronoun for each slot. We address a simpler version of the task in which only the person number of each dropped pronoun is identified. After applying a word segmenter, we used a linear-chain conditional random field to predict which words were at the start of an independent clause. Then, using the independent clause start information, as well as lexical and syntactic information, we applied a conditional random field or a maximum-entropy classifier to predict whether a dropped personal pronoun immediately preceded each word and, if so, the person number of the dropped pronoun. We conducted a series of experiments using a manually annotated corpus of Chinese SMS messages. Our approaches substantially outperformed a rule-based approach based partially on rules developed by Chung and Gildea (2010, Effects of Empty Categories on Machine Translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. pp. 636–45). Our approaches also outperformed, though by a considerably smaller margin, a machine-learning approach based closely on work by Yang, Liu, and Xue (2015, Recovering Dropped Pronouns from Chinese Text Messages. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics. pp. 309–13). Features derived from parsing largely did not help our approaches. We conclude that, given independent clause start information, the parse information we used was largely superfluous for identifying dropped personal pronouns.
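
The second stage described above can be pictured as per-word labeling: each word receives either a "no dropped pronoun" label or a person-number label for the pronoun dropped immediately before it. The following is a minimal sketch of that data flow only, not the authors' implementation. The label inventory, the feature names, the `<pro:...>` placeholder, and the example message are all assumptions made for illustration, and the labels are hand-written stand-ins for what a trained conditional random field or maximum-entropy classifier would output.

```python
from typing import Dict, List, Union

# Hypothetical label set: "NONE" means no dropped pronoun precedes the word;
# the other labels are assumed person-number tags (e.g. "2SG" = second person singular).
LABELS = ["NONE", "1SG", "2SG", "3SG", "1PL", "2PL", "3PL"]


def token_features(
    words: List[str], clause_start: List[bool], i: int
) -> Dict[str, Union[str, bool]]:
    """Build an illustrative feature dictionary for the word at position i.

    clause_start[i] stands in for the first-stage prediction of whether
    word i begins an independent clause.
    """
    return {
        "word": words[i],
        "prev_word": words[i - 1] if i > 0 else "<BOS>",
        "next_word": words[i + 1] if i + 1 < len(words) else "<EOS>",
        "clause_start": clause_start[i],
    }


def restore_pronouns(words: List[str], labels: List[str]) -> List[str]:
    """Insert a placeholder before every word labeled with a dropped pronoun."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    restored: List[str] = []
    for word, label in zip(words, labels):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        if label != "NONE":
            restored.append(f"<pro:{label}>")
        restored.append(word)
    return restored


if __name__ == "__main__":
    # Made-up segmented message, roughly "(you) eaten yet?"
    words = ["吃饭", "了", "吗"]
    clause_start = [True, False, False]   # stand-in for stage-one output
    labels = ["2SG", "NONE", "NONE"]      # stand-in for stage-two output
    print([token_features(words, clause_start, i) for i in range(len(words))])
    print(restore_pronouns(words, labels))
```

Treating "no pronoun" as one more label lets a single classifier handle both detection and person-number identification, which matches the joint prediction described in the abstract.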

Footnotes
1 Also affiliated with the Department of Linguistics, Georgetown University, 3700 O Street NW, Washington, DC, USA.

* We are thankful for the assistance provided by our MITRE colleagues. Dr Sichu Li annotated, in efficient and professional fashion, a large subset of the SMS we downloaded from the National University of Singapore. Dr John Prange, Mr Rob Case, and Mr Rod Holland provided valuable feedback on a presentation we gave describing our preliminary research findings. We are also thankful for the assistance provided by our colleagues at other institutions. Professor Nianwen (Bert) Xue at Brandeis University, Boston, USA shared his thoughts and expertise on Chinese dropped pronoun detection at an early stage of our research. Professor Derek F. Wong and Mr Junwen Xing at the University of Macau, Macau, SAR PRC applied their word segmenter to the National University of Singapore corpus.

References
Baran E., Yang Y., and Xue N. 2012. Annotating dropped pronouns in Chinese newswire text. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC). European Language Resources Association (ELRA). pp. 2795–9.
Cai S., Chiang D., and Goldberg Y. 2011. Language-independent parsing with empty elements. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 212–6.
Chen C., and Ng V. 2013. Chinese zero pronoun resolution: some recent advances. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 1360–5.
Chen T., and Kan M.-Y. 2013. Creating a live, public short message service corpus: the NUS SMS corpus. Language Resources and Evaluation 47 (2): 299–335. doi: 10.1007/s10579-012-9197-9.
Chen C., and Ng V. 2014. Chinese zero pronoun resolution: an unsupervised approach combining ranking and integer linear programming. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, Palo Alto, CA USA, Association for the Advancement of Artificial Intelligence Press. pp. 1622–8.
Chung T., and Gildea D. 2010. Effects of empty categories on machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 636–45.
Edington E., and Onghena P. 2007. Randomization Tests, 4th ed. Boca Raton, FL, USA: CRC Press, Taylor & Francis Group. ISBN: 978-1-58488-589-4.
Grosz B., Joshi A., and Weinstein S. 1995. Centering: a framework for modeling the local coherence of discourse. Computational Linguistics 21 (2): 203–25.
Huang C. T. J. 1989. Pro-drop in Chinese: a generalized control theory. In Jaeggli O. and Safir K. (eds.), Studies in Natural Language and Linguistic Theory: The Null Subject Parameter, vol. 15, pp. 185–214. Netherlands: Springer. doi: 10.1007/978-94-009-2540-3_6.
Kawahara D., and Kurohashi S. 2005. Zero pronoun resolution based on automatically constructed case frames and structural preference of antecedents. In Su K.-Y., Tsujii J., Lee J.-L., and Kwong O. Y. (eds.), Lecture Notes in Computer Science, vol. 3248, pp. 12–21. Berlin Heidelberg: Springer. doi: 10.1007/978-3-540-30211-7_2.
Kong F., and Zhou G. 2010. A tree kernel-based unified framework for Chinese zero anaphora resolution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 882–91.
Kong F., and Zhou G. 2013. A clause-level hybrid approach to Chinese empty element recovery. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), Palo Alto, CA USA, Association for the Advancement of Artificial Intelligence Press. pp. 2113–9.
Lafferty J., McCallum A., and Pereira F. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), Burlington, MA USA, Morgan Kaufmann. pp. 282–9.
Levy R., and Andrew G. 2006. Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC). European Language Resources Association (ELRA). pp. 2231–4.
Manning C., Surdeanu M., Bauer J., Finkel J., Bethard S. J., and McClosky D. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 55–60.
McCallum A. 2002. MALLET: a machine learning for language toolkit. Accessed July 16, 2013. http://mallet.cs.umass.edu.
Rahman A., and Ng V. 2012. Translation-based projection for multilingual coreference resolution. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Palo Alto, CA USA, Association for Computational Linguistics. pp. 1051–60.
Rao S., Ettinger A., Daume H. III, and Resnik P. 2015. Dialogue focus tracking for zero pronoun resolution. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Palo Alto, CA USA, Association for Computational Linguistics. pp. 494–502.
Sasano R., and Kurohashi S. 2011. A discriminative approach to Japanese zero anaphora resolution with large-scale lexicalized case frames. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 758–66.
Seki K., Fujii A., and Ishikawa T. 2002. A probabilistic method for analyzing Japanese anaphora integrating zero pronoun detection and resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 1–7.
Wang L., Wong D., Chao L., and Xing J. 2012. CRFs-based Chinese word segmentation for micro-blog with small-scale data. In Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing, Stroudsburg, PA USA, Association for Computational Linguistics. pp. 51–7.
Xue N., and Yang Y. 2011. Chinese sentence segmentation as comma classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 631–5.
Xue N., and Yang Y. 2013. Dependency-based empty category detection via phrase structure trees. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 1051–60.
Xue N., Xia F., Huang S., and Kroch A. 2000. The bracketing guidelines for the Penn Chinese Treebank (3.0). Technical Report No. IRCS-00-08, University of Pennsylvania Institute for Research in Cognitive Science. http://repository.upenn.edu/ircs_reports/39/.
Yang W., Dai R., and Cui X. 2008. Zero pronoun resolution in Chinese using machine learning plus shallow parsing. In Proceedings of the IEEE International Conference on Information and Automation, New York, NY USA, Institute of Electrical and Electronics Engineers. pp. 905–10.
Yang Y. 2014. Reading between the lines: recovering implicit information from Chinese texts. Ph.D. Dissertation, Department of Computer Science, Brandeis University.
Yang Y., Liu Y., and Xue N. 2015. Recovering dropped pronouns from Chinese text messages. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 309–13.
Yang Y., and Xue N. 2010. Chasing the ghost: recovering empty categories in the Chinese Treebank. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), Beijing, P.R. CHINA, Tsinghua University Press. pp. 1382–90.
Yeh A. 2000. More accurate tests for the statistical significance of result differences. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 947–53.
Yeh C.-L., and Chen Y.-C. 2007. Zero anaphora resolution in Chinese with shallow parsing. Journal of Chinese Language and Computing 17 (1): 41–56.
Zhao S., and Ng H.T. 2007. Identification and resolution of Chinese zero pronouns: a machine learning approach. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 541–50.
Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering