
Do transformer-based token classification methods solve the problem of terminology extraction?

Published online by Cambridge University Press:  15 August 2025

Małgorzata Marciniak
Affiliation:
Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Piotr Rychlik
Affiliation:
Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Agnieszka Mykowiecka*
Affiliation:
Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Corresponding author: Agnieszka Mykowiecka; Email: agn@ipipan.waw.pl

Abstract

Results obtained by transformer-based token classification models are now considered a benchmark for the Automatic Terminology Extraction (ATE) task. However, the unsatisfactory results (they rarely exceed an F1 score of 0.7) raise the question of whether this approach is correct, and of what text features are remembered or inferred by a model trained on this type of annotation. In this paper, we describe a number of experiments using a fine-tuned RoBERTa base model on the ACTER data, RD-TEC, and three Wikipedia articles, which showed that the results of the ATE task obtained by such models depend considerably on the type of texts being processed and their relationship to the training data. While the results are relatively good for some texts with highly specialized vocabulary, poor results seem to correlate with a high frequency (in general English texts) of tokens that are part of terms in a particular domain. Another property that affects the results is the degree of overlap between the vocabulary of the test data and the vocabulary of terms from the training data. Words that have been labeled as terms in the training data are usually labeled as terms in other, unrelated domains as well. Moreover, we show that the results obtained by these models are unstable: models trained on more data do not include all the items identified by models trained on a smaller dataset and can show substantially lower performance.
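In the token classification setup evaluated here, the model labels each token as beginning a term ('B'), continuing one ('I'), or falling outside any term ('O'), and the extracted term list is then read off the label sequence. A minimal decoding sketch (the example tokens and labels are illustrative, not drawn from the ACTER data):

```python
def decode_terms(tokens, labels):
    """Group B/I/O-labelled tokens into multi-word term spans.

    'B' opens a new term, 'I' continues the current one,
    and 'O' (or a stray 'I' with no open term) closes it.
    """
    terms, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:                      # close the previous term
                terms.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            if current:
                terms.append(" ".join(current))
            current = []
    if current:                              # flush a term ending the sentence
        terms.append(" ".join(current))
    return terms

tokens = ["heart", "failure", "is", "a", "clinical", "syndrome"]
labels = ["B", "I", "O", "O", "B", "I"]
print(decode_terms(tokens, labels))  # ['heart failure', 'clinical syndrome']
```

Evaluation at the term level (the t, p, and tp counts reported in the tables below) compares the decoded spans against the manually annotated ones.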

Information

Type
Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Table 1. The results of ATE evaluated on the list of terms of the ACTER-HTFL dataset in English. The first three results are for models trained on English ACTER data, while the last two are for models trained on multilingual ACTER data


Table 2. ACTER data statistics: number of tokens, number of annotated terms, number of different terms, annotated named entities (NE), and different NE


Table 3. Other data statistics


Table 4. Results obtained by our models. The upper part of the table shows models that were trained on three parts of the ACTER corpus (the removed part is noted in the rows as subtracted, e.g. ACTER-CORP) and tested on the fourth part (listed in the first column). The models were trained in two variants: using only term annotations, or using both term and named entity annotations. The lower part contains the results obtained for the RD-TEC corpus and three Wikipedia entries by the model trained on the entire ACTER corpus. Notation: t – number of terms annotated in the corpus, p – number of terms predicted by the models, tp – number of correct predictions, P – precision, R – recall


Table 5. Comparison of results for different models tested on the ACTER corpus. The first column contains the names of the corpus parts on which the models were tested. The second contains the names of the models being tuned: RoBERTa base (R-B), RoBERTa large (R-L), BERT base/large cased/uncased (B-B-C, B-B-U, B-L-C, B-L-U), DeBERTa base (D-B), and MPNet base (M-B). All these models were trained on the remaining three parts of ACTER. The remaining columns contain precision, recall, and F1 for the models trained and tested on data with and without named entities, respectively. The best result for each data configuration is underlined, the worst is highlighted with a dashed line


Table 6. Comparison of results for different models tested on the RD-TEC corpus and three Wikipedia articles. The names of the models are explained in Table 5. These models were trained on the entire ACTER corpus both on data with and without labeled named entities


Figure 1. Incremental analysis for ACTER corpora. On the left are the results for precision (P), recall (R), and F1 score, and on the right are the numbers of true, predicted, and true predicted terms after examining consecutive sentences.


Table 7. Comparison of F1 scores, rounded to two decimal places, obtained by cased (C) and uncased (U) models tested on the same datasets. If these scores differ by at least 0.1 for the same data, they are highlighted


Table 8. Results for terms with a specific POS sequence, using the original training data and data extended by the electron entry. Abbreviations used: N – noun, A – adjective, $_{P}$N – proper noun. The number of true predicted terms and the different syntactic patterns they represent are given in columns 2 and 3


Table 9. Average frequency of non-punctuation tokens, counted excluding stop words. The first column gives the number of non-punctuation tokens including stop words, while the second gives the number excluding them


Table 10. Average frequency of the annotated terms in the corpora, counted without stop words


Figure 2. Frequency (according to TWC) of manually annotated different terms of ACTER and RD-TEC data. The left graph displays the frequency according to the number of terms, while the right graph displays the frequency according to the percentage of terms in the datasets.


Table 11. Percentage of terms (columns overlap) containing a token tagged as a fragment of a term from another domain. The columns headed all no. give the total number of terms. Statistics are given for terms annotated manually, for all terms predicted by the model, and for those predicted correctly and incorrectly by the model


Table 12. Results obtained by models on the EQUI data. The best results are in bold. Notation: t – number of terms annotated in the corpus, p – number of terms predicted by the models, tp – number of correct predictions. P – precision, R – recall


Table 13. Comparison of the term lists obtained by models trained on two datasets with the model trained on all datasets except EQUI. The columns headed all give the number of all terms; the columns headed extra give the number of extracted terms that are not common to compared lists; the column headed common gives the number of common terms recognized by both compared models; p – numbers of predicted terms; tp – numbers of true predicted terms


Table 14. The number of tokens recognized as a term component (labels: ‘B’, ‘I’) by the model trained on a given dataset but not recognized by the model trained on CORP + HTFL + WIND. The column headed all recognized gives the number of tokens recognized as a term component by each dataset's model. The next column gives the number of tokens recognized by the smaller-dataset model but not recognized as term components by the CORP + HTFL + WIND model. The last column gives the number of tokens within column 3 that were manually annotated as a term component


Figure 3. Precision (P), recall (R), and F1 score obtained by nine models tested on the EQUI dataset and trained on nine sets successively increased by 10% of the data from each of the other parts of the ACTER corpus for four experiments: with English (en-1, en-2), French (fr), and multilingual (en+fr+nl) texts.


Table 15. Comparison of the lists of correctly predicted terms excluding named entities (NE) obtained by models tested with and without NE terms. The columns headed common give the number of correctly recognized terms by both models, and the columns headed extra give the number of extracted terms that are not common to both lists


Table 16. The number of tokens recognized as a term component by the model trained without named entities (NE) but not recognized by the model trained with NE. The columns headed all recognized give the number of tokens recognized as a term component by the models without NE. The next column gives the number of tokens recognized by the model without NE but not recognized as term components by the model with NE. The last column gives the number of tokens within column 3 that were manually annotated as a term component


Table 17. Examples of terms from the ACTER corpus along with the number of their recognized (+) and unrecognized (−) occurrences in sentences