Exploring the effectiveness of linguistic knowledge for biographical relation extraction

  • MARCOS GARCIA and PABLO GAMALLO
Abstract

Machine learning techniques have been used to extract instances of semantic relations using diverse features based on linguistic knowledge, such as tokens, lemmas, PoS-tags, or dependency paths. However, little work has examined which of these features work best in the relation extraction task, and even less has done so for languages other than English. In this paper, various features representing different levels of linguistic knowledge are systematically evaluated for biographical relation extraction. The effectiveness of these features is measured by training several supervised classifiers that differ only in the type of linguistic knowledge used to define their features. The experiments show that some basic linguistic knowledge (provided by lemmas and their combination in bigrams) performs better than more complex features, such as those based on syntactic analysis. Furthermore, some feature combinations using different levels of analysis are proposed in order (i) to avoid feature overlap and (ii) to evaluate the use of computationally inexpensive and widespread tools such as tokenization and lemmatization. This paper also describes two new freely available corpora for biographical relation extraction in Portuguese and Spanish, built by means of a distant-supervision strategy. Experiments were performed on these corpora with five semantic relations and two languages.
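
To make the notion of lemma-based features concrete, the following is a minimal sketch of how lemma unigrams and lemma bigrams might be turned into a feature bag for a supervised classifier. It is an illustration under stated assumptions, not the authors' implementation: the function name `lemma_features`, the feature-name prefixes, and the example lemma sequence are all hypothetical, and the input is assumed to be already tokenized and lemmatized by an external tool.

```python
from collections import Counter


def lemma_features(lemmas):
    """Build a bag of lemma unigrams and lemma bigrams.

    `lemmas` is assumed to be the already lemmatized context of a
    candidate relation instance (hypothetical input format).
    """
    unigrams = [f"lemma={lemma}" for lemma in lemmas]
    # Adjacent pairs of lemmas, joined with an underscore.
    bigrams = [f"bigram={a}_{b}" for a, b in zip(lemmas, lemmas[1:])]
    return Counter(unigrams + bigrams)


# Hypothetical lemmatized Spanish context between a person and a place.
example = ["nacer", "en", "el", "ciudad", "de"]
features = lemma_features(example)
```

A feature bag of this kind could then be vectorized and passed to any supervised learner; which learner and which feature encoding the paper actually uses is not stated in the abstract.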

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering