
Statistical dataset evaluation: A case study on named entity recognition

Published online by Cambridge University Press: 06 September 2024

Chengwen Wang
Affiliation:
School of International Cultural Exchange, Central University of Finance and Economics, Beijing, China
Qingxiu Dong
Affiliation:
MOE Key Lab of Computational Linguistics, School of Computer Science, Peking University, Beijing, China
Xiaochen Wang
Affiliation:
MOE Key Lab of Computational Linguistics, School of Computer Science, Peking University, Beijing, China
Zhifang Sui*
Affiliation:
MOE Key Lab of Computational Linguistics, School of Computer Science, Peking University, Beijing, China
*
Corresponding author: Zhifang Sui; Email: szf@pku.edu.cn

Abstract

Datasets serve as crucial training resources and model performance trackers. However, existing datasets have exposed a plethora of problems, inducing biased models and unreliable evaluation results. In this paper, we propose a model-agnostic framework for automatic dataset quality evaluation. We examine the statistical properties of datasets along three fundamental dimensions, reliability, difficulty, and validity, following Classical Test Theory (CTT). Taking named entity recognition (NER) datasets as a case study, we introduce nine statistical metrics across these dimensions. Specifically, we investigate the reliability of a NER dataset with three metrics: Redundancy, Accuracy, and Leakage Ratio. We assess dataset difficulty through four metrics: Unseen Entity Ratio, Entity Ambiguity Degree, Entity Density, and Model Differentiation. For validity, we introduce the Entity Imbalance Degree and Entity-Null Rate to evaluate how effectively a dataset assesses language model performance. Experimental results validate that our evaluation framework effectively assesses various aspects of dataset quality. Furthermore, we study how a dataset's scores on our statistical metrics affect model performance, and we call for dataset quality evaluation or targeted dataset improvement before training or testing models.
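Two of the abstract's simpler statistics, Entity-Null Rate and Entity Density, can be sketched directly from token-level BIO annotations. The exact formulas below are assumptions (the paper's definitions may differ in detail); only the metric names come from the abstract.

```python
# Hedged sketch: Entity-Null Rate and Entity Density over a toy
# BIO-tagged NER dataset. Formulas are illustrative assumptions.

def entity_null_rate(dataset):
    """Fraction of sentences containing no entity mention at all."""
    null = sum(1 for sent in dataset if all(t == "O" for t in sent["tags"]))
    return null / len(dataset)

def entity_density(dataset):
    """Entity mentions per token; a B- tag marks the start of a mention."""
    mentions = sum(t.startswith("B-") for s in dataset for t in s["tags"])
    tokens = sum(len(s["tags"]) for s in dataset)
    return mentions / tokens

toy = [
    {"tokens": ["John", "lives", "in", "Paris"],
     "tags": ["B-PER", "O", "O", "B-LOC"]},
    {"tokens": ["It", "rained", "today"],
     "tags": ["O", "O", "O"]},
]
print(entity_null_rate(toy))  # 0.5 (one of two sentences has no entity)
print(entity_density(toy))    # 2 mentions / 7 tokens ≈ 0.286
```

Both statistics need only the gold annotations, which is what makes the framework model-agnostic.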

Information

Type
Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. Our statistical dataset evaluation framework based on Classical Test Theory. We introduce nine quality evaluation metrics from three dimensions: reliability, difficulty, and validity. A dataset's scores on these metrics have a significant impact on models trained on it in many respects, such as average performance and out-of-domain robustness.


Table 1. Statistical evaluation of ten NER datasets


Table 2. Standard named entity recognition dataset statistics.


Figure 2. Evaluation results of WNUT16, CoNLL03, Resume, and MSRA under different dimensions and metrics. The abbreviations and corresponding full names of the metrics are presented in Sec. 4.


Table 3. Results of metrics (except Leakage Ratio) under the reliability dimension of the NER datasets.


Figure 3. Leakage Ratio values of WikiAnn and Weibo. We observe that 0.13 (13 percent) and 0.17 (17 percent) of the instances in the test sets of WikiAnn and Weibo, respectively, also appear in their corresponding training or development sets.
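The Leakage Ratio described in Figure 3 can be sketched as the fraction of test-set instances that appear verbatim in the training or development set. The exact matching criterion used in the paper is an assumption here (whole-sentence token match); variable names are illustrative.

```python
# Hedged sketch of Leakage Ratio: share of test sentences that also
# occur verbatim in the training or development set.

def leakage_ratio(test_set, train_set, dev_set):
    seen = {tuple(s) for s in train_set} | {tuple(s) for s in dev_set}
    leaked = sum(tuple(s) in seen for s in test_set)
    return leaked / len(test_set)

train_set = [["John", "lives", "in", "Paris"], ["Hello", "world"]]
dev_set = [["Good", "morning"]]
test_set = [["John", "lives", "in", "Paris"], ["A", "new", "sentence"]]
print(leakage_ratio(test_set, train_set, dev_set))  # 0.5
```

Hashing sentences as tuples keeps the check linear in dataset size, which matters for corpora the size of WikiAnn.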


Table 4. Chinese NER model replication results.


Table 5. English NER model replication results.


Figure 4. Model performance on NER datasets when the proportion of unseen entities (UnSeenEn) in the test set is 0.80 (UnSeenEn_8) and 0.20 (UnSeenEn_2), respectively.
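The unseen-entity proportion varied in Figure 4 can be sketched as the share of test-set entity mentions whose surface form never occurs as an entity in the training set. The definition below is an assumption based on the metric's name in the abstract.

```python
# Hedged sketch of Unseen Entity Ratio: fraction of test-set entity
# mentions whose surface form is absent from the training-set entities.

def unseen_entity_ratio(test_entities, train_entities):
    seen = set(train_entities)
    unseen = sum(e not in seen for e in test_entities)
    return unseen / len(test_entities)

# "Berlin" and "IBM" are unseen; "Paris" (counted twice) is seen.
print(unseen_entity_ratio(["Paris", "Berlin", "IBM", "Paris"],
                          ["Paris", "ACME"]))  # 0.5
```

Counting mention tokens rather than unique surface forms weights frequent entities more heavily; either convention is defensible, and which one the paper uses is not stated in this excerpt.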


Figure 5. Model performance on NER datasets when the proportion of ambiguous entities in the test set is 0.80 (EnAmb_8) and 0.20 (EnAmb_2), respectively.


Table 6. Model performance when the proportion of leaked samples in the test set is 80 percent and 20 percent, respectively.


Table 7. Model performance on English datasets when the proportion of samples without entities in the training set and development set is 0.80 (80 percent), 0.20 (20 percent), 0.00 (0 percent), and original, respectively.


Table 8. Model performance on Chinese datasets when the proportion of samples without entities in the training set and development set is 0.80 (80 percent), 0.20 (20 percent), 0.00 (0 percent), and original, respectively.


Table 9. Standard named entity recognition dataset construction method.


Figure 6. Mislabeled examples randomly selected from CLUNER. Red indicates missing entities that were not assigned entity labels; green indicates entities labeled with the wrong entity type.


Table 10. Hyperparameter settings for various NER models.