
Focusing on potential named entities during active label acquisition

Published online by Cambridge University Press:  06 June 2023

Ali Osman Berk Şapcı
Affiliation:
Faculty of Engineering and Natural Sciences, Sabancı University, Istanbul, Turkey
Hasan Kemik
Affiliation:
Faculty of Engineering and Natural Sciences, Sabancı University, Istanbul, Turkey
Reyyan Yeniterzi*
Affiliation:
Faculty of Engineering and Natural Sciences, Sabancı University, Istanbul, Turkey
Oznur Tastan*
Affiliation:
Faculty of Engineering and Natural Sciences, Sabancı University, Istanbul, Turkey
*
Corresponding author: Reyyan Yeniterzi, Oznur Tastan; E-mails: reyyan.yeniterzi@sabanciuniv.edu, otastan@sabanciuniv.edu

Abstract

Named entity recognition (NER) aims to identify mentions of named entities in unstructured text and classify them into predefined named entity classes. While deep learning-based pre-trained language models achieve good predictive performance in NER, many domain-specific NER applications still require a substantial amount of labeled data. Active learning (AL), a general framework for the label acquisition problem, has been applied to NER tasks to minimize annotation cost without sacrificing model performance. However, the heavily imbalanced class distribution of tokens makes it challenging to design effective AL querying methods for NER. We propose several AL sentence query evaluation functions that pay more attention to potential positive tokens, and we evaluate these functions with both sentence-based and token-based cost evaluation strategies. We also propose a data-driven normalization approach that penalizes sentences that are too long or too short. Our experiments on three datasets from different domains show that the proposed approach reduces the number of annotated tokens while achieving prediction performance better than or comparable to that of conventional methods.
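The core idea of the abstract — scoring a sentence for querying by aggregating per-token uncertainties, with or without length normalization and with or without restricting attention to predicted-positive tokens — can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the function names (`total`, `norm`, `total_pos`) and the use of predictive entropy as the token uncertainty measure are our assumptions.

```python
import numpy as np

def token_entropy(probs):
    """Predictive entropy per token; probs has shape (n_tokens, n_classes)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=1)

def sentence_scores(probs, positive_mask):
    """Aggregate token uncertainties into sentence-level query scores.

    Three illustrative aggregations (names are ours, not the paper's):
      total     - sum of token uncertainties over the whole sentence
      norm      - length-normalized (mean) uncertainty
      total_pos - sum restricted to tokens predicted to be named entities
    """
    u = token_entropy(probs)
    return {
        "total": float(u.sum()),
        "norm": float(u.mean()),
        "total_pos": float(u[positive_mask].sum()),
    }

# A toy three-token sentence with two classes (entity / non-entity):
probs = np.array([[0.9, 0.1],
                  [0.5, 0.5],   # highly uncertain token
                  [0.2, 0.8]])
mask = np.array([False, True, True])  # tokens predicted positive
scores = sentence_scores(probs, mask)
```

Restricting the sum to `positive_mask` is the "focus on potential named entities" idea: uncertainty on likely non-entity tokens no longer dominates the sentence score.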

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Table 1. Details and statistics of the datasets.


Figure 1. Summary of the proposed active learning framework for the NER task. The main active learning loop consists of three steps (model evaluation, training, and active learning query) and continues until the stopping criterion is satisfied. For each query, uncertainties of unlabeled sentences are estimated with function $\Phi$. Then, the most uncertain sentences are sent to an annotator to expand the labeled training set. To compute $\Phi$, we propose to focus on tokens that are predicted to have positive annotations. Our approach for predicting positive tokens consists of computing semi-supervised embeddings of tokens and density-based clustering.
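The query loop described in the Figure 1 caption can be sketched as below. This is a minimal skeleton, not the authors' code: `phi` stands in for the sentence uncertainty function $\Phi$, and annotation is simulated by moving queried sentences into the labeled set; model retraining and evaluation are left as a comment.

```python
import numpy as np

def active_learning_loop(phi, pool, n_rounds=3, batch=2):
    """Skeleton of the active learning query loop (a sketch, not the
    authors' implementation). `phi` scores a sentence's uncertainty;
    each round the `batch` most uncertain sentences are queried and
    moved to the labeled set."""
    labeled, unlabeled = [], list(pool)
    for _ in range(n_rounds):
        if not unlabeled:
            break
        scores = np.array([phi(s) for s in unlabeled])
        order = np.argsort(scores)[::-1]          # most uncertain first
        picked = [int(i) for i in order[:batch]]
        labeled.extend(unlabeled[i] for i in picked)
        unlabeled = [s for i, s in enumerate(unlabeled) if i not in picked]
        # ... retrain the NER model on `labeled`, then re-evaluate ...
    return labeled, unlabeled
```

With an identity scoring function on a toy pool, the loop simply pulls out the largest values first, which makes the selection logic easy to verify.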


Table 2. Uncertainty-based querying methods and their abbreviations used throughout the text.


Table 3. Proposed aggregated uncertainty measures for each token uncertainty measure.


Figure 2. An example sentence and corresponding computations of different query evaluation functions$^{*}$.


Table 4. Comparison of approaches for extracting BERT embeddings based on passive learning $F_1$-scores.${}^{a}$ The first header row indicates the dimension of the embeddings; the second header row indicates the strategy used to obtain embeddings from BERT.


Table 5. The $F_1$-scores of the baseline methods (first three rows) and average $F_1$-scores of different sentence score aggregation strategies in the last four iterations before convergence. The percentages of sentences queried are provided for the corresponding iterations under the iteration number. The reported deviations are the corresponding standard error of the mean.


Figure 3. Average $F_1$-scores of RS, LSS, and PAS methods with respect to the total number of annotated tokens.


Figure 4. Average $F_1$-scores of total and total-pos methods with respect to the total number of annotated tokens.


Figure 5. Average $F_1$-scores of norm and dnorm-pos methods with respect to the total number of annotated tokens.


Figure 6. Average $F_1$-scores of total-pos and dnorm-pos methods with respect to the total number of annotated tokens.


Table 6. Comparison of aggregation methods for each uncertainty measure.${}^{a}$ For each measure, the method that achieves the indicated $F_1$-score with the fewest annotated tokens is reported. Proposed methods are italicized. In each column, the method that requires the fewest sentences to achieve the indicated $F_1$-score is shown in bold. The abbreviations are listed in Table 2 and Table 3.


Figure 7. Average $F_1$-scores of total and dnorm-pos methods with respect to the total number of annotated tokens.


Figure 8. BERT embeddings of CoNLL-03 reduced to two dimensions by semi-supervised UMAP, with 2% of the data labeled.

Supplementary material: PDF

Şapcı et al. supplementary material


Download Şapcı et al. supplementary material (PDF)
PDF 181.3 KB