
Datasets for generic relation extraction*

  • B. HACHEY (a1), C. GROVER (a2) and R. TOBIN (a2)


A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g. PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational database or RDF triplestore that can be more effectively used for querying and automated reasoning. A number of resources have been developed for training and evaluating automatic systems for relation extraction in different domains. However, comparative evaluation is impeded by the fact that these corpora use different markup formats and different notions of what constitutes a relation. We describe the preparation of corpora for comparative evaluation of relation extraction across domains based on the publicly available ACE 2004, ACE 2005 and BioInfer data sets. We present a common document type that uses token standoff and includes detailed linguistic markup, while maintaining all information in the original annotation. The subsequent reannotation process normalises the two data sets so that they comply with a notion of relation that is intuitive, simple and informed by the semantic web. For the ACE data, we describe a process that automatically converts many relations involving nested, nominal entity mentions to relations involving non-nested, named or pronominal entity mentions. For example, the first entity is mapped from ‘one’ to ‘Amidu Berry’ in the membership relation described in ‘Amidu Berry, one half of PBS’. Moreover, we describe a comparably reannotated version of the BioInfer corpus that flattens nested relations, maps part-whole to part-part relations and maps n-ary to binary relations. Finally, we summarise experiments that compare approaches to generic relation extraction, a knowledge discovery task that uses minimally supervised techniques to achieve maximally portable extractors. These experiments illustrate the utility of the corpora.
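As a rough illustration of the ideas summarised above, the following Python sketch shows one way token standoff annotation, the nominal-to-named remapping and an n-ary-to-binary expansion might look. It is not the authors' implementation: the class names, the `entity_id` coreference field, the "closest named or pronominal mention of the same entity" heuristic and the all-pairs expansion are assumptions made only for this example.

```python
from dataclasses import dataclass
from itertools import combinations

# Token standoff: the text is tokenised once and all annotation
# refers to tokens by index rather than embedding markup in the text.
TOKENS = ["Amidu", "Berry", ",", "one", "half", "of", "PBS"]


@dataclass(frozen=True)
class Mention:
    entity_id: str  # hypothetical identifier shared by coreferent mentions
    kind: str       # "NAM" (named), "NOM" (nominal) or "PRO" (pronominal)
    start: int      # index of the first token
    end: int        # index one past the last token

    def text(self, tokens):
        return " ".join(tokens[self.start:self.end])


@dataclass(frozen=True)
class Relation:
    label: str
    arg1: Mention
    arg2: Mention


def remap_nominal(mention, mentions):
    """Swap a nominal mention for the closest named or pronominal
    mention of the same entity (an assumed heuristic)."""
    if mention.kind != "NOM":
        return mention
    candidates = [m for m in mentions
                  if m.entity_id == mention.entity_id
                  and m.kind in ("NAM", "PRO")]
    if not candidates:
        return mention  # nothing to map to, so keep the original mention
    return min(candidates, key=lambda m: abs(m.start - mention.start))


def normalise(relation, mentions):
    """Apply the remapping to both arguments of a binary relation."""
    return Relation(relation.label,
                    remap_nominal(relation.arg1, mentions),
                    remap_nominal(relation.arg2, mentions))


def nary_to_binary(label, args):
    """Expand an n-ary relation into all unordered argument pairs
    (one simple assumed strategy)."""
    return [Relation(label, a, b) for a, b in combinations(args, 2)]


berry = Mention("E1", "NAM", 0, 2)  # "Amidu Berry"
one = Mention("E1", "NOM", 3, 4)    # "one", simplified to a single token
pbs = Mention("E2", "NAM", 6, 7)    # "PBS"
mentions = [berry, one, pbs]

original = Relation("membership", one, pbs)
fixed = normalise(original, mentions)
print(fixed.arg1.text(TOKENS), "->", fixed.arg2.text(TOKENS))
# prints "Amidu Berry -> PBS"
```

Because mentions only store token indices, the original annotation can be kept alongside the normalised relations without altering the underlying text.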



