
Exploring the effectiveness of linguistic knowledge for biographical relation extraction


Machine learning techniques have been used to extract instances of semantic relations using diverse features based on linguistic knowledge, such as tokens, lemmas, PoS-tags, or dependency paths. However, little work has examined which of these features work best in the relation extraction task, and even less has done so for languages other than English. In this paper, various features representing different levels of linguistic knowledge are systematically evaluated for biographical relation extraction. The effectiveness of these features was measured by training several supervised classifiers that differ only in the type of linguistic knowledge used to define their features. The experiments show that some basic linguistic knowledge (provided by lemmas and their combination in bigrams) performs better than more complex features, such as those based on syntactic analysis. Furthermore, some feature combinations using different levels of analysis are proposed in order (i) to avoid feature overlap and (ii) to evaluate the use of computationally inexpensive and widespread tools such as tokenization and lemmatization. This paper also describes two new freely available corpora for biographical relation extraction in Portuguese and Spanish, built by means of a distant-supervision strategy. Experiments were performed with five semantic relations and two languages, using these corpora.
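
As a rough illustration of the kind of shallow feature the abstract refers to (lemmas and their combination in bigrams), the following minimal Python sketch counts lemma unigrams and bigrams for one candidate sentence. It is not the authors' implementation: the function name `lemma_ngram_features`, the `uni=`/`bi=` feature prefixes, and the example lemma sequence are assumptions made here for illustration, and the input is assumed to be already tokenized and lemmatized by an external tool.

```python
from collections import Counter
from typing import List


def lemma_ngram_features(lemmas: List[str]) -> Counter:
    """Count lemma unigrams and bigrams for one candidate sentence.

    The feature naming scheme ("uni=" / "bi=") is a hypothetical
    convention chosen for this sketch, not taken from the paper.
    """
    features = Counter()
    # Unigram features: one count per lemma occurrence.
    for lemma in lemmas:
        features["uni=" + lemma] += 1
    # Bigram features: one count per pair of adjacent lemmas.
    for left, right in zip(lemmas, lemmas[1:]):
        features["bi=" + left + "_" + right] += 1
    return features


# Hypothetical lemmatized context between a person mention and a place mention.
example = ["nacer", "en", "el", "ciudad", "de"]
features = lemma_ngram_features(example)
```

A sparse count dictionary of this kind could then be vectorized and passed to any supervised classifier; which classifier and which feature weighting the paper actually used is not stated in the abstract.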

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering