
TNT-KID: Transformer-based neural tagger for keyword identification

Published online by Cambridge University Press:  10 June 2021

Matej Martinc*
Affiliation:
Jožef Stefan Institute, Department of Knowledge Technologies, Jamova 39, 1000 Ljubljana, Slovenia Jožef Stefan International Postgraduate School, Department of Knowledge Technologies, Jamova 39, 1000 Ljubljana, Slovenia
Blaž Škrlj
Affiliation:
Jožef Stefan Institute, Department of Knowledge Technologies, Jamova 39, 1000 Ljubljana, Slovenia Jožef Stefan International Postgraduate School, Department of Knowledge Technologies, Jamova 39, 1000 Ljubljana, Slovenia
Senja Pollak
Affiliation:
Jožef Stefan Institute, Department of Knowledge Technologies, Jamova 39, 1000 Ljubljana, Slovenia
*Corresponding author. E-mail: matej.martinc@ijs.si

Abstract

With growing amounts of available textual data, the development of algorithms capable of automatic analysis, categorization, and summarization of these data has become a necessity. In this research, we present a novel algorithm for keyword identification, that is, the extraction of single- or multi-word phrases representing key aspects of a given document, called Transformer-Based Neural Tagger for Keyword IDentification (TNT-KID). By adapting the transformer architecture to the specific task at hand and leveraging language model pretraining on a domain-specific corpus, the model overcomes deficiencies of both supervised and unsupervised state-of-the-art approaches to keyword extraction, offering competitive and robust performance on a variety of datasets while requiring only a fraction of the manually labeled data required by the best-performing systems. This study also offers a thorough error analysis with valuable insights into the inner workings of the model, as well as an ablation study measuring the influence of specific components of the keyword identification workflow on the overall performance.

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press

Figure 1. TNT-KID’s architecture overview. (a) Model architecture. (b) The attention mechanism.


Figure 2. Encoding of the input text “The advantage of this is to introduce distributed interactions between the UDDI clients.” with keywords distributed interactions and UDDI. In the first step, the text is converted into a numerical sequence, which serves as input to the model. The model is trained to map this numerical sequence to a sequence of zeros and ones, where the ones indicate the positions of keywords.
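The label encoding described in the caption can be sketched as a simple sequence-tagging conversion; this is a minimal illustration of the idea (exact-match, lowercased token spans), not the paper's actual preprocessing code, and the function name is hypothetical.

```python
def tag_keywords(tokens, keyphrases):
    """Mark each token with 1 if it is part of a known keyphrase, else 0."""
    tokens_lower = [t.lower() for t in tokens]
    labels = [0] * len(tokens)
    for phrase in keyphrases:
        phrase_tokens = phrase.lower().split()
        n = len(phrase_tokens)
        # Slide over the token sequence and mark every exact phrase match.
        for i in range(len(tokens) - n + 1):
            if tokens_lower[i:i + n] == phrase_tokens:
                for j in range(i, i + n):
                    labels[j] = 1
    return labels

# The sentence from Figure 2, tokenized on whitespace:
tokens = ("The advantage of this is to introduce distributed "
          "interactions between the UDDI clients").split()
print(tag_keywords(tokens, ["distributed interactions", "UDDI"]))
```

Here the positions of “distributed”, “interactions”, and “UDDI” receive label 1, and all other tokens receive label 0, matching the target sequence shown in the figure.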


Table 1. Datasets used for the empirical evaluation of keyword extraction algorithms. No. docs stands for the number of documents; Avg. doc. length for the average document length in the corpus (in words, obtained by splitting the text on whitespace); Avg. kw. for the average number of keywords per document in the corpus; % present kw. for the percentage of a document’s keywords that appear in the text of that document; and Avg. present kw. for the average number of keywords per document that actually appear in the text of the specific document
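The statistics in Table 1 can be reproduced with straightforward counting; the sketch below assumes a dataset represented as (text, keyword list) pairs and uses a simple lowercased substring check for keyword presence, which may differ slightly from the paper's exact matching procedure.

```python
def dataset_stats(docs):
    """Compute the Table 1 statistics for a list of (text, keywords) pairs."""
    n = len(docs)
    total_kw = sum(len(kws) for _, kws in docs)
    # Per-document count of keywords that actually occur in the document text.
    present = [sum(1 for kw in kws if kw.lower() in text.lower())
               for text, kws in docs]
    return {
        "avg_doc_length": sum(len(text.split()) for text, _ in docs) / n,
        "avg_kw": total_kw / n,
        "pct_present_kw": 100.0 * sum(present) / total_kw,
        "avg_present_kw": sum(present) / n,
    }

# Hypothetical one-document corpus for illustration:
docs = [("uddi clients use distributed interactions", ["uddi", "soap"])]
print(dataset_stats(docs))
```

On this toy corpus, the document is 5 words long, carries 2 assigned keywords, and only “uddi” appears in the text, so 50% of keywords are present and the average number of present keywords is 1.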


Table 2. Empirical evaluation of state-of-the-art keyword extractors. Results marked with * were obtained by our implementation or reimplementation of the algorithm; results without * are reported from the related work
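The F1@10 score used to compare the extractors can be sketched as follows; this is a common formulation (exact match on lowercased keyphrases over the top-k predictions), and the function name and example data are illustrative, not taken from the paper.

```python
def f1_at_k(predicted, gold, k=10):
    """F1 over the top-k predicted keyphrases against the gold set."""
    topk = [p.lower() for p in predicted[:k]]
    gold_set = {g.lower() for g in gold}
    tp = sum(1 for p in topk if p in gold_set)
    if tp == 0:
        return 0.0
    precision = tp / len(topk)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)

# Hypothetical predictions and gold keywords:
score = f1_at_k(["nlp", "deep learning", "transformer"],
                {"nlp", "keyword extraction"})
print(round(score, 2))  # 1 true positive: P = 1/3, R = 1/2, F1 = 0.4
```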


Figure 3. Critical distance diagram showing the results of the Nemenyi test. Two keyword extraction approaches are statistically significantly different in terms of F1@10 if the difference between their ranks (shown in brackets next to the keyword extraction approach name) is larger than the critical distance (CD). If two approaches are connected with a horizontal line, the test did not detect a statistically significant difference between them. For the Nemenyi test, $\alpha = 0.05$ was used.
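For reference, the critical distance in the Nemenyi test is $CD = q_\alpha \sqrt{k(k+1)/(6N)}$, where $k$ is the number of compared approaches and $N$ the number of datasets. The sketch below uses the standard tabulated $q_{0.05}$ values (Demšar, 2006); the numbers of approaches and datasets in the example are hypothetical, not taken from the paper.

```python
import math

# Tabulated q values for the Nemenyi test at alpha = 0.05 (Demsar, 2006),
# indexed by the number of compared approaches k.
Q_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728,
         6: 2.850, 7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def critical_distance(k, n_datasets):
    """CD = q_alpha * sqrt(k * (k + 1) / (6 * N)) for alpha = 0.05."""
    return Q_005[k] * math.sqrt(k * (k + 1) / (6 * n_datasets))

# Hypothetical setting: 5 approaches compared across 10 datasets.
print(round(critical_distance(5, 10), 3))  # -> 1.929
```

Any two approaches whose average ranks differ by more than this CD are declared significantly different; the horizontal bars in the diagram group approaches whose rank differences fall below it.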


Figure 4. (a) Relation between the average number of present keywords per document for each test dataset and the difference in performance ($F1@10_{\textrm{TNT-KID}} - F1@10_{\textrm{CatSeqD}}$). (b) Relation between the percentage of keywords that appear in the train set for each test dataset and the difference in performance ($F1@10_{\textrm{TNT-KID}} - F1@10_{\textrm{CatSeqD}}$).


Figure 5. Performance of the CatSeqD model trained on KP20k, fine-tuned on the SemEval, Krapivin, and Inspec validation sets, and tested on the corresponding test sets, as a function of the length of fine-tuning in terms of the number of training steps. Zero training steps means that the model was not fine-tuned.


Figure 6. Average attention for each token position in the SemEval corpus across eight attention heads. Distinct peaks can be observed for tokens appearing at the beginning of the document in all eight attention heads.


Figure 7. Number of keywords for each token position in the SemEval corpus. Distinct peaks can be observed for positions at the beginning of the document.


Figure 8. Attention-colored tokens. Underlined phrases were identified as keywords by the system, and bold font indicates that the identification was correct (i.e., the keyphrase appears in the gold standard). Lower color transparency indicates stronger attention for a token, and the color itself designates whether the token was correctly identified as a keyword (green), incorrectly identified as a keyword (red), or not identified as a keyword by the system (blue).


Table 3. Results of the ablation study. Column LM+BPE+BiLSTM represents the results for the model that was used for comparison with other methods from the related work in Section 4.4