Semantic search helper: A tool based on the use of embeddings in multi-item questionnaires as a harmonization opportunity for merging large datasets – A feasibility study

Published online by Cambridge University Press:  20 January 2025

Karl Gottfried
Affiliation:
Institute of Applied Medical Informatics, University Hospital Center Hamburg-Eppendorf, Hamburg, Germany
Karina Janson
Affiliation:
Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Baden-Württemberg, Germany; Institute of Medical Psychology and Medical Sociology, University Medical Center Schleswig-Holstein, Kiel University, Preußerstraße 1-9, Kiel, Schleswig-Holstein, Germany
Nathalie E. Holz
Affiliation:
Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Baden-Württemberg, Germany; German Center for Mental Health (DZPG), Partnersite Mannheim-Heidelberg-Ulm, Germany
Olaf Reis
Affiliation:
Department of Child and Adolescent Psychiatry, Neurology, Psychosomatics and Psychotherapy, Rostock University Medical Centre, Rostock, Germany
Johannes Kornhuber
Affiliation:
Department of Psychiatry and Psychotherapy, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
Anna Eichler
Affiliation:
Department of Child and Adolescent Mental Health, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
Tobias Banaschewski
Affiliation:
Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Baden-Württemberg, Germany; German Center for Mental Health (DZPG), Partnersite Mannheim-Heidelberg-Ulm, Germany
Frauke Nees*
Affiliation:
Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Baden-Württemberg, Germany; Institute of Medical Psychology and Medical Sociology, University Medical Center Schleswig-Holstein, Kiel University, Preußerstraße 1-9, Kiel, Schleswig-Holstein, Germany
*Corresponding author: Frauke Nees; Email: nees@med-psych.uni-kiel.de

Abstract

Background

Recent advances in natural language processing (NLP) have opened new avenues in semantic data analysis. A promising application of NLP is data harmonization in questionnaire-based cohort studies, where it can serve as an additional method, particularly when a single construct has been measured with different instruments, and for evaluating potentially new construct constellations. The present article therefore explores the potential of embedding models to detect opportunities for semantic harmonization.

Methods

Using models like SBERT and OpenAI’s ADA, we developed a prototype application (“Semantic Search Helper”) that facilitates harmonization by detecting semantically similar items within extensive health-related datasets. The feasibility and applicability of the approach were evaluated through a use case analysis involving four large cohort studies whose heterogeneous data were obtained with different sets of instruments for common constructs.

Results

With the prototype, we effectively identified potential harmonization pairs, which significantly reduced manual evaluation efforts. Expert ratings of semantic similarity candidates showed high agreement with model-generated pairs, confirming the validity of our approach.

Conclusions

This study demonstrates the potential of embeddings for matching semantically similar items, as a promising add-on tool to assist the harmonization of multiple data sets and instruments that differ in form but share similar content, within and across studies.

Information

Type
Research Article
Creative Commons
Creative Commons License: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of European Psychiatric Association

Figure 1. Workflow concept for detecting semantic similarities within multi-item questionnaires. The process involves: (1) inputting sentence data, including variable names and text; (2) converting text into vectors using models like SBERT or ADA; (3) building similarity pairs to identify semantic matches and generating scores; (4) selecting and downloading pairs for further analysis. Color coding indicates manual, automatic, and semi-automatic processes.
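Steps (2) to (4) of this workflow can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are hand-written toy stand-ins for real SBERT or ADA embeddings, the variable names are hypothetical, and the threshold of 0.8 is an arbitrary value chosen for illustration. Cosine similarity is used as the score, in line with the cosine similarity distributions reported in Figure 4.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_similarity_pairs(item_vectors, threshold):
    """Score every unordered pair of items and keep those at or above
    the threshold, sorted from most to least similar."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(item_vectors.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Toy vectors standing in for embeddings of questionnaire item texts.
# Real embeddings would come from an embedding model and have hundreds
# of dimensions; these names and numbers are invented for illustration.
toy_item_vectors = {
    "study_a_sleep_1": [0.9, 0.1, 0.0],
    "study_b_sleep_3": [0.8, 0.2, 0.1],
    "study_a_mood_2": [0.1, 0.9, 0.2],
}

# Arbitrary illustrative threshold; a suitable value depends on the model.
candidate_pairs = build_similarity_pairs(toy_item_vectors, threshold=0.8)

for name_a, name_b, score in candidate_pairs:
    print(f"{name_a} <-> {name_b}: {score:.3f}")
```

In this toy example only the two sleep items exceed the threshold, so a single candidate pair would be passed on for manual review in step (4).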

Table 1. Overview of IMAC-mind cohort studies

Table 2. Overview of questionnaires used in the studies

Figure 2. Mean score responses and model-based similarity scores. Scatter plots show the relationship between mean score responses (x-axis) and the similarity scores generated by the embedding models (y-axis), with one plot each for SBERT (green) and ADA (red). Linear regression lines indicate the overall trend between evaluative scores and model-derived similarity metrics.

Table 3. Comparison of semantic similarity scores for questionnaire item pairs using SBERT and ADA algorithms

Figure 3. ‘Semantic Search Helper’ application interface. The user interface (UI) with different stages of the harmonization process: A bar chart overview of metadata distribution and a scatter plot for visualizing data points in semantic dimensions (Bottom Section). A filter tree for selecting specific data items and a filtered table displaying data based on applied filters (Right Section). Table with survey questions and semantic similarity scores, and a bar chart showing semantic coverage percentages (Top Section). A network graph visualizing semantic connections between questionnaire items (Middle Section). This interface facilitates the comparison of semantic similarities across survey questions, streamlining the data harmonization workflow for researchers.

Table 4. Spearman correlations and 95% confidence intervals
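For readers unfamiliar with the statistic, a Spearman correlation is the Pearson correlation of rank-transformed values, with tied values receiving their average rank. The sketch below shows only that calculation. The ratings and scores are invented for illustration and are not data from the study, and the confidence interval is omitted because the method used to compute it is not stated in this section.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example values (hypothetical, not taken from the article).
expert_ratings = [1, 2, 2, 4, 5]
model_scores = [0.10, 0.35, 0.30, 0.70, 0.90]

rho = spearman(expert_ratings, model_scores)
print(f"Spearman rho = {rho:.3f}")
```

Because the statistic depends only on ranks, it is unaffected by the two models producing similarity scores on different scales.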

Figure 4. Cosine similarity distributions for SBERT (light blue) and ADA (green). SBERT has a mean of 0.19 (blue line) and ADA has a mean of 0.76 (yellow line).

Supplementary material: File

Gottfried et al. supplementary material
