MHeTRep: A multilingual semantically tagged health terms repository

Published online by Cambridge University Press:  25 February 2022

Jorge Vivaldi*
Affiliation:
Universitat Pompeu Fabra, Barcelona, Spain
Horacio Rodríguez
Affiliation:
Universidad Politécnica de Catalunya, Barcelona, Spain
*Corresponding author. E-mail: jorge.vivaldi@upf.edu

Abstract

This paper presents MHeTRep, a multilingual medical terminology, and the methodology followed for its compilation. The terminology is organised into one vocabulary per language. All terms in the collection are semantically tagged with a tagset corresponding to the top categories of the Snomed-CT ontology. Where possible, individual terms are linked to their equivalents in the other languages. Even though many NLP resources and tools claim to be domain independent, their application to specific tasks is often restricted to specific domains; outside them, their performance degrades notably. Because the accuracy of NLP resources drops heavily when they are applied in environments different from those for which they were built, they must be tuned to the new environment. A domain terminology usually facilitates and accelerates the adaptation of general-domain NLP applications to a new domain. This is particularly important in medicine, a domain undergoing rapid expansion. The proposed method takes Snomed-CT as its starting point. From there, using 13 multilingual resources that cover the most relevant medical concepts, such as drugs, anatomy, clinical findings and procedures, we built a large resource covering seven languages and totalling more than two million semantically tagged terms. The resulting collection has been extensively evaluated in several ways across the languages and domain categories involved. Our hypothesis is that MHeTRep can be used advantageously over the original resources for a number of NLP use cases and can likely be extended to other languages.
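
The organisation described in the abstract (one vocabulary per language, terms tagged with a semantic category, and links between equivalent terms across languages) can be pictured with a minimal sketch. This is not the authors' implementation or data format: the class names `TermEntry` and `Repository`, the field layout, the lower-casing of lookup keys and the language codes are all assumptions made purely for illustration. The tag "Body structure" and the term "Liver" are taken from the captions listed further down; the Spanish equivalent is supplied only as an example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    """One term in one language's vocabulary (hypothetical record layout)."""
    term: str
    lang: str
    tags: set = field(default_factory=set)             # semantic tags
    sources: set = field(default_factory=set)          # resources providing the term
    translations: dict = field(default_factory=dict)   # lang -> set of equivalents


class Repository:
    """One vocabulary per language, keyed by the lower-cased term (assumption)."""

    def __init__(self):
        self.vocab = defaultdict(dict)

    def add(self, term, lang, tag, source):
        """Insert a term, or merge tag and source into an existing entry."""
        entry = self.vocab[lang].setdefault(term.lower(), TermEntry(term, lang))
        entry.tags.add(tag)
        entry.sources.add(source)
        return entry

    def link(self, term_a, lang_a, term_b, lang_b):
        """Record that two existing terms are equivalents, in both directions."""
        a = self.vocab[lang_a][term_a.lower()]
        b = self.vocab[lang_b][term_b.lower()]
        a.translations.setdefault(lang_b, set()).add(b.term)
        b.translations.setdefault(lang_a, set()).add(a.term)

    def lookup(self, term, lang):
        """Exact, case-insensitive lookup; returns None when the term is absent."""
        return self.vocab.get(lang, {}).get(term.lower())


# Illustrative usage with invented entries
repo = Repository()
repo.add("Liver", "en", "Body structure", "Snomed-CT")
repo.add("hígado", "es", "Body structure", "Snomed-CT")
repo.link("Liver", "en", "hígado", "es")
entry = repo.lookup("liver", "en")
```

Keeping a set of sources per entry makes it straightforward to count how many resources support a given term, and storing links symmetrically lets a lookup in either language reach the equivalent in the other.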

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2022. Published by Cambridge University Press
Table 1. Notation used in this paper

Table 2. Snomed-CT semantic categories

Table 3. Resources used in the work

Table 4. Fragment of the entry angiosarcoma within DO ontology

Figure 1. Compilation methodology.

Table 5. Top categories of RadLex mapped into our tagset

Table 6. MeSH descriptor categories

Table 7. MeSH heading examples: fragment of the taxonomy headed by nervous system

Table 8. Size of vocabularies for each language and category after the initial step

Figure 2. Sizes of first step overall vocabularies per semantic class.

Figure 3. Overall sizes (per language) of the vocabulary after the first step.

Table 9. Size of the resulting vocabulary for each language and category after seven iterations

Figure 4. Total size of vocabularies after each iteration step.

Table 10. Information available for the English term Liver

Figure 5. Histogram of atomic terms length in tokens.

Figure 6. Type/token distribution of the atomic terms.

Figure 7. Vocabulary representation in our approximate matching system.

Figure 8. Histogram of sources per term for English Body structure.

Table 11. Distribution of sources per term for all terms, languages and categories in MHeTRep

Table 12. Result of MIMIC experiments using EXACT matching

Table 13. Result of MIMIC experiments using APPROX matching

Table 14. Results obtained on BARR test

Table 15. Results using the medical terms defined in the CLEF eHealth competition

Table 16. Results using the medical terms collected from medical discharge reports

Table 17. Results on SemEval 2017 test (BLE: Bilingual extension, BC: Best count)

Table 18. Official results of the task and improved version

Table 19. Examples of approximate matching results (all are English terms from DDI, except examples 11 and 12, which are Spanish, and example 13, which is French)

Table 20. Coverage of Snomed-CT classes of English terms from different sources

Table 21. Summary of the results obtained by extrinsic evaluation