
Assessment of the E3C corpus for the recognition of disorders in clinical texts

Published online by Cambridge University Press:  18 July 2023

Roberto Zanoli
Affiliation:
Center for Digital Health & Wellbeing, Fondazione Bruno Kessler, Via Sommarive, 18 - Povo, 38123, Trento, Italy
Alberto Lavelli
Affiliation:
Center for Digital Health & Wellbeing, Fondazione Bruno Kessler, Via Sommarive, 18 - Povo, 38123, Trento, Italy
Daniel Verdi do Amarante
Affiliation:
Department of Math & Computer Science, University of Richmond, 410 Westhampton Way University of Richmond, 23173, Richmond, VA, USA
Daniele Toti*
Affiliation:
Faculty of Mathematical, Physical and Natural Sciences, Catholic University of the Sacred Heart, viale Garzetta 48, 25133 Brescia, Italy
*
Corresponding author: Daniele Toti; Email: daniele.toti@unicatt.it

Abstract

Disorder named entity recognition (DNER) is a fundamental task of biomedical natural language processing which has attracted considerable attention. This task consists of extracting named entities of disorders, such as diseases, symptoms, and pathological functions, from unstructured text. The European Clinical Case Corpus (E3C) is a freely available multilingual corpus (English, French, Italian, Spanish, and Basque) of semantically annotated clinical case texts. The entities of type disorder in the clinical cases are annotated at both mention and concept level. At mention level, the annotation identifies the entity text spans, for example, abdominal pain. At concept level, the entity text spans are associated with their concept identifiers in the Unified Medical Language System (UMLS), for example, C0000737. This corpus can be exploited as a benchmark for training and assessing information extraction systems. Within the context of the present work, multiple experiments were conducted to test the appropriateness of the mention-level annotation of the E3C corpus for training DNER models. In these experiments, traditional machine learning models such as conditional random fields and more recent multilingual pre-trained models based on deep learning were compared with standard baselines. The multilingual pre-trained models were fine-tuned (i) on each language of the corpus to test per-language performance, (ii) on all languages to test multilingual learning, and (iii) on all languages except the target language to test cross-lingual transfer learning. Results show the appropriateness of the E3C corpus for training a system capable of mining disorder entities from clinical case texts. Researchers can use these results as baselines for this corpus against which to compare their own models. The implemented models have been made available through the European Language Grid platform for quick and easy access.
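To illustrate the mention-level annotation described in the abstract, the following is a minimal sketch (with hypothetical tokens and offsets, not taken from the corpus) of converting character-offset entity spans such as "abdominal pain" into token-level BIO tags, the standard input format for both CRF and transformer-based NER models:

```python
def spans_to_bio(tokens, spans):
    """Convert character-offset entity spans to token-level BIO tags.

    tokens: list of (text, start, end) tuples with character offsets.
    spans:  list of (start, end) character offsets of disorder mentions.
    """
    tags = ["O"] * len(tokens)
    for s_start, s_end in spans:
        inside = False  # becomes True once the first token of a span is tagged
        for i, (_, t_start, t_end) in enumerate(tokens):
            if t_start >= s_start and t_end <= s_end:
                tags[i] = "I-DISORDER" if inside else "B-DISORDER"
                inside = True
    return tags

# Hypothetical sentence: "The patient reported abdominal pain."
tokens = [("The", 0, 3), ("patient", 4, 11), ("reported", 12, 20),
          ("abdominal", 21, 30), ("pain", 31, 35), (".", 35, 36)]
spans = [(21, 35)]  # the mention "abdominal pain"
print(spans_to_bio(tokens, spans))
# ['O', 'O', 'O', 'B-DISORDER', 'I-DISORDER', 'O']
```

The resulting tag sequence is what a token classifier is trained to predict; at concept level, each recovered span would additionally be mapped to a UMLS CUI.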

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Figure 1. Number of published case reports per year in PubMed extracted with the query “Case Reports[Publication Type]”.


Figure 2. Number of publications per year in PubMed about entity recognition extracted with the query “Entity Recognition[Title/Abstract]”.


Table 1. Layer 1: document, sentence, and token counts; source type distribution; entities by type and language.


Table 2. Number of documents and disorder entities in the training and test partitions of Layer 1.


Figure 3. Example of an annotated clinical case text. Each clinical entity is annotated with its mention span and associated with its corresponding CUI in UMLS. For a discontinuous entity span, the annotation does not specify which words inside the span are actually part of the entity.


Table 3. Number of annotated entities before and after data preprocessing.


Table 4. Deep learning models and hyper-parameters used in the setup.


Table 5. $F_1$ measure of the machine learning models and baselines (dictionary lookup, CoNLL-2003) on the mention-level test set.


Algorithm 1. Pseudocode for concept normalization. pred_entities are the entities recognized by the XLM-RoBERTa model; gold_entities and UMLS are dictionaries in which disorder entities are associated with their concept unique identifiers (CUIs) in UMLS.
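The lookup step of Algorithm 1 can be sketched as follows. This is a hedged approximation, not the authors' implementation: it assumes the dictionaries map lowercased entity strings to CUIs, that the training (gold) dictionary is consulted before UMLS, and that mentions found in neither are left unlinked.

```python
def normalize_concepts(pred_entities, gold_entities, umls):
    """Assign a UMLS CUI to each predicted disorder mention.

    pred_entities: list of entity strings recognized by the NER model.
    gold_entities, umls: dicts mapping entity strings to CUIs.
    The gold dictionary is consulted first, with UMLS as a fallback.
    """
    linked = {}
    for entity in pred_entities:
        key = entity.lower()
        if key in gold_entities:
            linked[entity] = gold_entities[key]
        elif key in umls:
            linked[entity] = umls[key]
        else:
            linked[entity] = None  # no CUI found for this mention
    return linked

# Hypothetical dictionaries and predictions, for illustration only
gold = {"abdominal pain": "C0000737"}
umls = {"fever": "C0015967"}
print(normalize_concepts(["abdominal pain", "fever", "xyz"], gold, umls))
# {'abdominal pain': 'C0000737', 'fever': 'C0015967', 'xyz': None}
```

Consulting the gold dictionary first reflects the intuition that corpus-specific surface forms are more reliable than a generic UMLS string match.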


Table 6. Precision, recall, and $F_1$ measure of the machine learning models and baselines (dictionary lookup, CoNLL-2003) on the English mention-level test set.


Table 7. Precision, recall, and $F_1$ measure of the machine learning models and baselines (dictionary lookup, CoNLL-2003) on the French and Italian mention-level test sets.


Table 8. Precision, recall, and $F_1$ measure of the machine learning models and baselines (dictionary lookup, CoNLL-2003) on the Spanish and Basque mention-level test sets.


Table 9. Precision, recall, and $F_1$ measure of XLM-RoBERTa-ML fine-tuned on the concatenation of documents of Layer 1 and Layer 2 compared to XLM-RoBERTa-ML fine-tuned using documents of Layer 1.


Table 10. $F_1$ measure of entity linking in cascade with XLM-RoBERTa-ML (the approach of the present work) and baselines (dictionary lookup and CoNLL-2003), calculated on the concept-level annotation of the corpus. For each column, the highest value among the three approaches is displayed in bold.


Figure 4. Running the pipeline for DNER from its page in the ELG platform.


Figure 5. Running the pipeline for DNER from the user’s shell.