
Learning keyphrases from corpora and knowledge models

Published online by Cambridge University Press: 10 September 2019

R. Silveira*
Affiliation:
Eixo de Informação e Comunicação, Instituto Federal de Educação do Ceará (IFCE), Brazil
V. Furtado
Affiliation:
Programa de Pós-graduação em Informática Aplicada, Universidade de Fortaleza (UNIFOR), Brazil.
V. Pinheiro
Affiliation:
Programa de Pós-graduação em Informática Aplicada, Universidade de Fortaleza (UNIFOR), Brazil.
*Corresponding author. Email: raquel.vsilveira@hotmail.com

Abstract

Keyphrase extraction systems traditionally rely on classification algorithms and ignore the fact that some keyphrases may not occur in the text, which limits the accuracy of such algorithms a priori. In this work, we propose to improve the accuracy of these systems with inferential mechanisms that use a knowledge representation model, comprising symbolic knowledge bases and distributional semantics, to expand the set of keyphrase candidates submitted to the classification algorithm with terms that do not occur in the text (not-in-text terms). Our basic assumption is that not-in-text terms are semantically related to terms that are in the text. To capture this relationship, we define two new features that serve as input to the classification algorithms. The first feature is the discriminative power of an inferred not-in-text term. The intuition is that good keyphrase candidates are those inferred from several textual terms in a specific document and rarely inferred in other documents. The second feature is the descriptive strength of a not-in-text candidate. We argue that not-in-text keyphrases must have a strong semantic relationship with the text and that the strength of this relationship can be measured in a way similar to popular metrics such as TFxIDF. We compared the proposed method with state-of-the-art systems on five corpora, and the results show that it significantly improves automatic keyphrase extraction by addressing the limitation that keyphrases absent from the text cannot otherwise be extracted.
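The intuition behind the first feature can be illustrated with a small sketch. The abstract does not give the formula, so everything below is an assumption made for illustration: the scoring rule is a plain TFxIDF analogue (number of in-text terms that lead to a candidate, weighted by the log-inverse of the number of documents in which that candidate is inferred), the `related` dictionary is a hypothetical stand-in for the knowledge representation model, and the function names are invented here. It should not be read as the authors' actual feature definition.

```python
import math
from collections import Counter


def inferred_candidates(doc_terms, related):
    """Count, per not-in-text candidate, how many distinct in-text terms infer it."""
    in_text = set(doc_terms)
    counts = Counter()
    for term in in_text:
        for candidate in related.get(term, ()):
            if candidate not in in_text:  # keep only not-in-text terms
                counts[candidate] += 1
    return counts


def discrimination_scores(corpus, related):
    """TFxIDF-style score (an assumed analogue, not the paper's formula)."""
    per_doc = [inferred_candidates(doc, related) for doc in corpus]
    # Number of documents in which each candidate is inferred at least once.
    doc_freq = Counter()
    for counts in per_doc:
        doc_freq.update(counts.keys())
    n_docs = len(corpus)
    return [
        {cand: k * math.log(n_docs / doc_freq[cand]) for cand, k in counts.items()}
        for counts in per_doc
    ]


# Hypothetical toy knowledge model: in-text term -> semantically related terms.
related = {
    "neuron": ["deep learning", "biology"],
    "backpropagation": ["deep learning"],
    "gradient": ["deep learning", "optimization"],
    "cell": ["biology"],
}

# Toy corpus of three tokenized documents (invented for this example).
corpus = [
    ["neuron", "backpropagation", "gradient"],
    ["neuron", "cell"],
    ["cell", "membrane"],
]

scores = discrimination_scores(corpus, related)
best_for_first_doc = max(scores[0], key=scores[0].get)
print(best_for_first_doc, scores[0])
```

In this toy setting, a candidate inferred from several terms of one document and from few other documents ranks highest for that document, while a candidate inferred in every document receives a score of zero, which mirrors the intuition stated in the abstract.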

Information

Type
Article
Copyright
© Cambridge University Press 2019
