
Morphosyntactic annotation of CHILDES transcripts*

Published online by Cambridge University Press:  25 March 2010

KENJI SAGAE*
Affiliation:
Institute for Creative Technologies, University of Southern California
ERIC DAVIS
Affiliation:
Language Technologies Institute, Carnegie Mellon University
ALON LAVIE
Affiliation:
Language Technologies Institute, Carnegie Mellon University
BRIAN MACWHINNEY
Affiliation:
Department of Psychology, Carnegie Mellon University
SHULY WINTNER
Affiliation:
Department of Computer Science, University of Haifa, Israel
* Address for correspondence: Kenji Sagae, USC Institute for Creative Technologies, 13274 Fiji Way, Marina del Rey, CA 90292. e-mail: sagae@usc.edu

Abstract

Corpora of child language are essential for research in child language acquisition and psycholinguistics. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe a project whose goal is to annotate the English section of the CHILDES database with grammatical relations in the form of labeled dependency structures. We have produced a corpus of over 18,800 utterances (approximately 65,000 words) with manually curated gold-standard grammatical relation annotations. Using this corpus, we have developed a highly accurate data-driven parser for the English CHILDES data, which we used to automatically annotate the remainder of the English section of CHILDES. We have also extended the parser to Spanish, and are currently working on supporting more languages. The parser and the manually and automatically annotated data are freely available for research purposes.
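
As an illustration of what a labeled dependency structure can look like as data, here is a minimal Python sketch. The `dependent|head|LABEL` triple notation, the label names (DET, SUBJ, ROOT, OBJ, JCT), the example sentence, and every function name are assumptions made for this illustration; they are not presented as the project's annotation scheme or its tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Relation:
    """One labeled dependency: a dependent word attached to a head word."""
    dependent: int  # 1-based position of the dependent word
    head: int       # 1-based position of the head word; 0 marks the root
    label: str      # grammatical relation label (illustrative names only)


def parse_relations(tier: str) -> List[Relation]:
    """Parse whitespace-separated 'dependent|head|LABEL' triples (assumed notation)."""
    relations = []
    for item in tier.split():
        dependent, head, label = item.split("|")
        relations.append(Relation(int(dependent), int(head), label))
    return relations


def describe(words: List[str], relations: List[Relation]) -> List[str]:
    """Render each relation as 'LABEL(head word, dependent word)'."""
    lines = []
    for r in relations:
        head_word = "ROOT" if r.head == 0 else words[r.head - 1]
        lines.append(f"{r.label}({head_word}, {words[r.dependent - 1]})")
    return lines


def labeled_accuracy(gold: List[Relation], predicted: List[Relation]) -> float:
    """Fraction of words whose predicted head and label both match the gold annotation."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


if __name__ == "__main__":
    words = ["the", "dog", "eats", "cookies"]
    gold = parse_relations("1|2|DET 2|3|SUBJ 3|0|ROOT 4|3|OBJ")
    for line in describe(words, gold):
        print(line)
```

Comparing a parser's output with manually curated annotations, as `labeled_accuracy` does here, is one simple way to quantify what "highly accurate" means for such a parser; the measure actually reported in the article may differ.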

Type
Articles
Copyright
Copyright © Cambridge University Press 2010


Footnotes

[*]

We thank Marina Fedner for help with annotation of the English data, and Bracha Nir for help with annotation of the Hebrew data. This research was supported in part by Grant No. 2007241 from the United States–Israel Binational Science Foundation (BSF) and by the National Science Foundation (NSF) under grant IIS-0414630.
