
Lightweight transformers for clinical natural language processing

Published online by Cambridge University Press:  12 January 2024

Omid Rohanian*
Affiliation:
Department of Engineering Science, University of Oxford, Oxford, UK NLPie Research, Oxford, UK
Mohammadmahdi Nouriborji
Affiliation:
NLPie Research, Oxford, UK Sharif University of Technology, Tehran, Iran
Hannah Jauncey
Affiliation:
Infectious Diseases Data Observatory (IDDO), University of Oxford, Oxford, UK
Samaneh Kouchaki
Affiliation:
Department of Electrical and Electronic Engineering, University of Surrey, Guildford, UK
Farhad Nooralahzadeh
Affiliation:
University of Zürich, Zürich, Switzerland University Hospital of Zürich, Zürich, Switzerland
Lei Clifton
Affiliation:
Nuffield Department of Population Health, University of Oxford, Oxford, UK
Laura Merson
Affiliation:
ISARIC, Pandemic Sciences Institute, University of Oxford, Oxford, UK
David A. Clifton
Affiliation:
Department of Engineering Science, University of Oxford, Oxford, UK Oxford-Suzhou Centre for Advanced Research, Suzhou, China
ISARIC Clinical Characterisation Group
Affiliation:
ISARIC, Pandemic Sciences Institute, University of Oxford, Oxford, UK
*
Corresponding author: Omid Rohanian; Email: omid.rohanian@eng.ox.ac.uk

Abstract

Specialised pre-trained language models are becoming more common in Natural Language Processing (NLP) since they can potentially outperform models trained on generic texts. BioBERT (Lee et al., BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), pp. 1234–1240, 2020) and BioClinicalBERT (Alsentzer et al., Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78, 2019) are two examples of such models that have shown promise in medical NLP tasks. Many of these models are overparametrised and resource-intensive, but thanks to techniques like knowledge distillation, it is possible to create smaller versions that perform almost as well as their larger counterparts. In this work, we specifically focus on the development of compact language models for processing clinical texts (e.g. progress notes and discharge summaries). Using knowledge distillation and continual learning, we developed a number of efficient lightweight clinical transformers with parameter counts ranging from 15 million to 65 million. These models performed comparably to larger models such as BioBERT and ClinicalBioBERT and significantly outperformed other compact models trained on general or biomedical data. Our extensive evaluation covered several standard datasets and a wide range of clinical text-mining tasks, including natural language inference, relation extraction, named entity recognition and sequence classification. To our knowledge, this is the first comprehensive study specifically focused on creating efficient and compact transformers for clinical NLP tasks. To promote reproducibility, the models and code used in this study are available on our Hugging Face profile at https://huggingface.co/nlpie and on our GitHub page at https://github.com/nlpie-research/Lightweight-Clinical-Transformers, respectively.
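The knowledge distillation mentioned in the abstract can be illustrated with a minimal sketch of the standard soft-label objective (Hinton-style temperature-scaled KL divergence between teacher and student output distributions). This is an illustrative toy implementation, not the authors' actual training code; the function names and the temperature value are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients keep a consistent magnitude across T.
    Hypothetical sketch of the soft-label distillation objective."""
    p = softmax(teacher_logits, temperature)  # teacher "soft labels"
    q = softmax(student_logits, temperature)  # student predictions
    return temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q)
    )
```

When the student's logits match the teacher's, the loss is zero; it grows as the two output distributions diverge, which is what drives a compact student model toward the behaviour of its larger teacher.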

Information

Type
Article
Creative Commons
Creative Commons Licence - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Table 1. Training samples from the MedNLI dataset


Table 2. i2b2-2010 samples taken from the dataset’s guideline, Uzuner et al. (2011). The concept pairs for which a relationship should be predicted are displayed in boldface. Following the pre-processing used in the BLUE benchmark, the concepts are replaced with tags and then passed to the model as shown in the second column


Figure 1. Samples from the i2b2-2012 dataset.


Figure 2. Samples from i2b2-2014. Names have been anonymised for privacy. Labels are ‘PA’ for ‘Patient’, ‘PR’ for ‘Professional’, ‘OR’ for ‘Organisation’ and ‘DO’ for ‘Doctor’. See Appendix A for the complete list of labels.


Table 3. Sample clinical notes along with their annotations from the ICN dataset


Table 4. The results of the baselines (above the double line) and our pre-trained models on clinical downstream tasks. The metrics used for reporting scores are accuracy for the MedNLI, micro-averaged F1 for i2b2-2010 (RE), macro-averaged F1 for ICN, and Exact F1 for the others. Bold numbers denote the best performance and underlined numbers denote the second-best performance


Table 5. The effect of different initialisations on the continual learning of compact models


Table 6. Comparing the efficiency of the proposed models with ClinicalBioBERT. $\downarrow$ denotes that lower is better for that particular metric


Table 7. Hyperparameters used for pre-training models on MIMIC-III


Table 8. Hyperparameters used for fine-tuning on downstream tasks


Figure 3. (A) Training loss of the ClinicalDistilBERT and the ClinicalMobileBERT when optimised for the MLM objective. (B) Training loss of the DistilClinicalBERT on the distillation objective as described in Section 3.1.1. (C) Training loss of TinyClinicalBERT on the distillation objective, as explained in Section 3.1.2. (D) Training loss of ClinicalMiniALBERT on the distillation objective, as introduced in Nouriborji et al. (2022).


Figure 4. (A) and (E) represent the confusion matrices for BioBERT on the test set and the corner cases (Section 6.3.1), respectively. (B) and (F) refer to the confusion matrices for ClinicalBioBERT on the test set and corner cases. (C) and (G) denote the confusion matrices for ClinicalDistilBERT on the test set and corner cases. ‘No M’ indicates ‘No Malignancy’, ‘P M’ represents ‘Possible Malignancy’ and ‘M’ signifies ‘Malignancy’.