Clinical information extraction for lower-resource languages and domains with few-shot learning using pretrained language models and prompting

Published online by Cambridge University Press: 31 October 2024

Phillip Richter-Pechanski*
Affiliation:
Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, Germany; Department of Computational Linguistics, Heidelberg University, Heidelberg, Germany; Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany; German Center for Cardiovascular Research – Partner Site Heidelberg/Mannheim, Heidelberg, Germany; Informatics for Life, Heidelberg, Germany
Philipp Wiesenbach
Affiliation:
Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, Germany; Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany; German Center for Cardiovascular Research – Partner Site Heidelberg/Mannheim, Heidelberg, Germany; Informatics for Life, Heidelberg, Germany
Dominic Mathias Schwab
Affiliation:
Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany
Christina Kiriakou
Affiliation:
Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany
Nicolas Geis
Affiliation:
Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany; Informatics for Life, Heidelberg, Germany
Christoph Dieterich
Affiliation:
Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, Germany; Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany; German Center for Cardiovascular Research – Partner Site Heidelberg/Mannheim, Heidelberg, Germany; Informatics for Life, Heidelberg, Germany
Anette Frank
Affiliation:
Department of Computational Linguistics, Heidelberg University, Heidelberg, Germany
*Corresponding author: Phillip Richter-Pechanski; Email: phillip.richter-pechanski@med.uni-heidelberg.de

Abstract

A vast amount of clinical data is still stored in unstructured text. Automatic extraction of medical information from these data poses several challenges: high costs of clinical expertise, restricted computational resources, strict privacy regulations, and limited interpretability of model predictions. Recent domain-adaptation and prompting methods using lightweight masked language models have shown promising results with minimal training data and allow for the application of well-established interpretability methods. We are the first to present a systematic evaluation of advanced domain-adaptation and prompting methods in a lower-resource medical domain task, performing multi-class section classification on German doctor’s letters. We evaluate a variety of models, model sizes, (further-)pretraining, and task settings, and conduct extensive class-wise evaluations supported by Shapley values to validate the quality of small-scale training data and to ensure interpretability of model predictions. We show that in few-shot learning scenarios, a lightweight, domain-adapted pretrained language model, prompted with just 20 shots per section class, outperforms a traditional classification model, increasing accuracy from $48.6\%$ to $79.1\%$. By using Shapley values for model selection and training data optimization, we could further increase accuracy up to $84.3\%$. Our analyses reveal that pretraining of masked language models on general-language data is important to support successful domain transfer to medical language, so that further-pretraining of general-language models on domain-specific documents can outperform models pretrained on domain-specific data only. Our evaluations show that applying prompting based on general-language pretrained masked language models combined with further-pretraining on medical-domain data achieves significant improvements in accuracy beyond traditional models with minimal training data. Further performance improvements and interpretability of results can be achieved using interpretability methods such as Shapley values. Our findings highlight the feasibility of deploying powerful machine learning methods in clinical settings and can serve as a process-oriented guideline for lower-resource languages and domains, such as clinical information extraction projects.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. Challenges for MIE projects in clinics: Our proposed solutions to main challenges for MIE projects in a clinical setting. Abbreviation: “par.” refers to parameters.

Table 1. Distribution of section classes: Number of samples per section class per corpus split. English translation in round brackets.

Figure 2. PET workflow: Three main steps: (a) Apply pattern function P(x) to all few-shot training instances X. Fine-tune a PLM M using a language model objective on each pattern. The output of the PLM is mapped using a verbalizer function v(y). (b) An ensemble of M trained on each pattern is used to annotate an unlabeled dataset D with soft labels. (c) A classifier C with a classification head is trained on D.
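
The pattern, verbalizer, and soft-labeling steps named in this caption can be illustrated with a minimal Python sketch. The two German cloze patterns, the label-to-token mapping, the uniform averaging of per-pattern scores, and the `temperature` parameter are assumptions made for illustration, not the templates or ensemble weighting used in the article. The token scores would normally come from the fine-tuned PLMs; here they are made-up numbers.

```python
import math
from typing import Dict, List

MASK = "[MASK]"

# Hypothetical pattern functions P(x): each turns a paragraph into a cloze text.
def pattern_a(x: str) -> str:
    return f"{x} Abschnitt: {MASK}"

def pattern_b(x: str) -> str:
    return f"{MASK}: {x}"

# Hypothetical verbalizer v(y): maps each section label to one vocabulary token.
VERBALIZER: Dict[str, str] = {
    "Anamnese": "Anamnese",
    "Zusammenfassung": "Zusammenfassung",
}

def label_scores(token_scores: Dict[str, float]) -> Dict[str, float]:
    """Read off the score of each label's verbalizer token at the mask position."""
    return {label: token_scores[token] for label, token in VERBALIZER.items()}

def soft_label(scores_per_pattern: List[Dict[str, float]],
               temperature: float = 1.0) -> Dict[str, float]:
    """Average per-pattern label scores uniformly, then apply a softmax."""
    labels = list(scores_per_pattern[0])
    n = len(scores_per_pattern)
    mean = {y: sum(s[y] for s in scores_per_pattern) / n for y in labels}
    top = max(mean.values())  # subtract the maximum for numerical stability
    exp = {y: math.exp((mean[y] - top) / temperature) for y in labels}
    total = sum(exp.values())
    return {y: exp[y] / total for y in labels}

# Step (a): build one cloze input per pattern for a single paragraph.
paragraph = "Der Patient berichtet über Brustschmerzen."
cloze_inputs = [pattern_a(paragraph), pattern_b(paragraph)]

# Step (b): combine the scores of the pattern-specific models into a soft label.
scores = [
    label_scores({"Anamnese": 2.0, "Zusammenfassung": 0.5}),  # made-up, pattern A
    label_scores({"Anamnese": 1.0, "Zusammenfassung": 1.5}),  # made-up, pattern B
]
soft = soft_label(scores)
```

In step (c), soft labels of this kind would serve as training targets for the final classifier C.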

Figure 3. Pretrained language models: We use two publicly available PLMs: gbert and medbertde. We evaluate base and large gbert models. Four pretraining methods are used: (a) publicly available, (b) task-adapted, (c) domain-adapted, and (d) task- and domain-adapted combined.

Table 2. Contextualized paragraphs: A sample annotated as AllergiesIntolerancesRisks with three different context types, each separated by the [SEP] token. English translation in italics.
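
How a paragraph might be joined with its neighbours using the [SEP] token can be sketched as follows. The caption only states that three context types exist and that their parts are separated by [SEP]; the mode names (`nocontext`, `prev`, `prev_next`), the segment order, and the example paragraphs are assumptions for illustration.

```python
from typing import List

SEP = "[SEP]"

def contextualize(paragraphs: List[str], i: int, mode: str = "nocontext") -> str:
    """Return paragraph i, optionally joined with its neighbours via [SEP]."""
    target = paragraphs[i]
    # Missing neighbours at the document boundaries become empty segments.
    prev = paragraphs[i - 1] if i > 0 else ""
    nxt = paragraphs[i + 1] if i + 1 < len(paragraphs) else ""
    if mode == "nocontext":
        parts = [target]
    elif mode == "prev":
        parts = [prev, target]
    elif mode == "prev_next":
        parts = [prev, target, nxt]
    else:
        raise ValueError(f"unknown context mode: {mode}")
    return f" {SEP} ".join(parts)

# Hypothetical three-paragraph letter; the middle paragraph is the target.
letter = [
    "Diagnosen: Koronare Herzkrankheit",
    "Allergien: Penicillin",
    "Medikation: ASS 100 mg",
]
with_context = contextualize(letter, 1, "prev_next")
```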

Figure 4. Section classification baseline results (lower/upper bound): We show accuracy scores in percentage per pretraining method (public, task-adapted, domain-adapted, and combination of both) per model: gbert-base and medbertde-base. (a) Lower bound: used in zero-shot prompting. (b) Upper bound: full training set.

Figure 5. Accuracy scores in percentage for core experiments and lower/upper bound: comparing prompting using PET vs. SC, few-shot sizes $10$ to $400$, and pretraining methods using base BERT models. For reference, we include lower-bound PET baselines trained with zero shots (ZERO) and upper-bound SC models trained on the complete training set (FULL).

Figure 6. Core experiments: primary class $F1$-scores in percentage and selected Shapley values: (a) $F1$-scores per few-shot size for primary classes using gbert-base-comb nocontext. (b) Shapley value analysis for gbert-base-comb nocontext with respect to Anamnese and Zusammenfassung prediction. First column: true label of the sample, second column: predicted label including label probability, third column: selected Shapley values. We used 20 training shots. For readability, we grouped some token sequences. For further details, see Suppl. Fig. S7. Legend: Blue: positive contribution, Red: negative contribution.

Figure 7. Model size: (a) Accuracy scores in percentage for gbert-comb nocontext PLMs using all templates on four few-shot sizes. (b) $F1$-scores in percentage for primary classes for gbert-comb nocontext PLMs using all templates on various few-shot sizes.

Figure 8. Additional experiments (context) – primary classes $F1$-scores and selected Shapley values: (a) $F1$-scores in percentage per few-shot size for primary classes with nocontext and context using gbert-base-comb, compared to gbert-base-comb trained on full training data with nocontext and context. (b) Shapley value analysis for gbert-base-comb nocontext and gbert-base-comb context. First column: true label of the sample, second column: predicted label including label probability, third column: selected Shapley values. We used 20 training shots. For readability, we grouped some token sequences. For further details, see Suppl. Fig. S10. Legend: Blue: positive contribution, Red: negative contribution.

Table 3. Combining and evaluating best-performing methods: Accuracy scores in percentage for gbert-large-comb context evaluated on few-shot sizes [$20,50,100,400$] with base vs. large model sizes in context vs. nocontext settings using PET. Comparison to corresponding SC model fine-tuned on full training set.

Figure 9. Additional experiments (combined methods) – primary classes $F1$-scores and selected Shapley values: (a) $F1$-scores in percentage per few-shot size for primary classes with nocontext and context using gbert-large-comb, compared to gbert-large-comb trained on full training data with context. (b) Shapley value analysis for gbert-base-comb context and gbert-large-comb context. First column: true label of the sample, second column: predicted label including label probability, third column: selected Shapley values. We used 20 training shots. For readability, we grouped some token sequences. For more detailed results, see Suppl. Fig. S13. Legend: Blue: positive contribution, Red: negative contribution.

Supplementary material

Richter-Pechanski et al. supplementary material 1 (File, 3.4 MB)

Richter-Pechanski et al. supplementary material 2 (File, 10.7 KB)