ItemComplex: A Python-based visualization framework for ex-post organization and integration of large language-based datasets

Published online by Cambridge University Press:  26 May 2025

Karina Janson
Affiliation:
Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany Institute of Medical Psychology and Medical Sociology, University Medical Center Schleswig-Holstein, Kiel University, Kiel, Germany
Karl Gottfried
Affiliation:
Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
Olaf Reis
Affiliation:
Department of Child and Adolescent Psychiatry, Neurology, Psychosomatics and Psychotherapy, Rostock University Medical Centre, Gehlsheimer Strasse 20, Rostock 18147, Germany
Johannes Kornhuber
Affiliation:
Department of Psychiatry and Psychotherapy, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen 91054, Germany
Anna Eichler
Affiliation:
Department of Child and Adolescent Mental Health, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen 91054, Germany
Michael Deuschle
Affiliation:
Department of Psychiatry and Psychotherapy, Medical Faculty Mannheim, Central Institute of Mental Health, Heidelberg University, Mannheim, Germany
Tobias Banaschewski
Affiliation:
Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
Frauke Nees*
Affiliation:
Department of Child and Adolescent Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany Institute of Medical Psychology and Medical Sociology, University Medical Center Schleswig-Holstein, Kiel University, Kiel, Germany
*Corresponding author: Frauke Nees; Email: nees@med-psych.uni-kiel.de

Abstract

Background

Researchers and clinicians must handle increasingly large datasets, particularly in the context of mental health data. Sophisticated tools for visualizing information from item-based instruments, such as questionnaires, digital applications, or clinical documentation, are still lacking. This holds especially for tools that integrate data at multiple levels and support both data organization and the construction of datasets that can be used validly in subsequent analyses.

Methods

Here, we introduce ItemComplex, a Python-based framework for ex-post visualization of large datasets. The method exploits the comprehensive recognition of instrument alignments and the identification of new content networks and graphs based on item similarities and shared versus differential conceptual bases within and across data and studies.

Results

The ItemComplex framework was evaluated using four existing large datasets from four different cohort studies and demonstrated successful data visualization across multi-item instruments within and across studies. ItemComplex enables researchers and clinicians to navigate through big datasets reliably, informatively, and quickly. Moreover, it facilitates the extraction of new insights into construct representations and concept identifications within the data.

Conclusions

The ItemComplex app is an efficient tool for big data management and analysis. It addresses the growing complexity of modern datasets and helps harness the potential hidden within these extensive collections of information. It is also easily adjustable to individual datasets and user preferences, in both research and clinical settings.

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of European Psychiatric Association

Figure 1. Code snippet for sunburst plot creation. The code generates a sunburst visualization displaying hierarchical data organized by levels, constructs, questionnaires, and cohorts utilizing Python libraries.
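
The snippet in Figure 1 is not reproduced here. As a minimal sketch of how such a hierarchy could be assembled, the example below counts items per node for a chosen sequence of layers and produces the `ids`, `labels`, `parents`, and `values` lists that a sunburst trace expects. The field names `cohort`, `construct`, and `questionnaire`, the sample rows, and both helper functions are hypothetical, and the optional plotting step assumes the Plotly library is installed; none of this is taken from the authors' code.

```python
from collections import Counter

# Hypothetical item records; the field names are assumptions for illustration.
ROWS = [
    {"cohort": "Cohort A", "construct": "Depression", "questionnaire": "BDI"},
    {"cohort": "Cohort A", "construct": "Depression", "questionnaire": "BDI"},
    {"cohort": "Cohort A", "construct": "Depression", "questionnaire": "CES-D"},
    {"cohort": "Cohort B", "construct": "Depression", "questionnaire": "CES-D"},
    {"cohort": "Cohort B", "construct": "Anxiety", "questionnaire": "STAI"},
]


def build_hierarchy(rows, layers):
    """Count items per node for every prefix of `layers` (root to leaf).

    Node ids are slash-joined paths, so labels must not contain "/".
    """
    counts = Counter()
    for row in rows:
        path = []
        for layer in layers:
            path.append(row[layer])
            counts["/".join(path)] += 1  # each ancestor also counts the item
    ids = sorted(counts)
    labels = [node.rsplit("/", 1)[-1] for node in ids]
    parents = [node.rsplit("/", 1)[0] if "/" in node else "" for node in ids]
    values = [counts[node] for node in ids]
    return ids, labels, parents, values


def plot_sunburst(rows, layers):
    """Render the hierarchy as a sunburst; assumes Plotly is installed."""
    import plotly.graph_objects as go

    ids, labels, parents, values = build_hierarchy(rows, layers)
    fig = go.Figure(go.Sunburst(ids=ids, labels=labels, parents=parents,
                                values=values, branchvalues="total"))
    fig.show()


# The layer order is free: here constructs form the innermost ring.
ids, labels, parents, values = build_hierarchy(
    ROWS, ["construct", "questionnaire", "cohort"])
```

Reordering the `layers` list (for example, putting `cohort` first) changes which attribute sits at the center of the plot without touching the data.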

Figure 2. Data frame used for exploration and visualization steps.

Figure 3. Importing libraries required for item similarity analysis.

Figure 4. Data preprocessing for item similarity analysis.
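
The caption does not state which preprocessing steps are applied. The sketch below shows one typical combination (lowercasing, tokenization, and stop-word removal) using only the standard library. The stop-word list and the `preprocess` helper are hypothetical, and the authors' pipeline may include further steps such as lemmatization.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "i", "in", "is", "my", "of", "or", "the", "to"}


def preprocess(item_text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", item_text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


# Hypothetical questionnaire item.
tokens = preprocess("I feel sad and hopeless most of the time.")
```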

Figure 5. TF-IDF scores calculation.
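
To make the TF-IDF step concrete, here is a minimal standard-library sketch. It assumes one common smoothed variant of the weighting, with relative term frequency and inverse document frequency computed as ln((1 + N) / (1 + df)) + 1; the exact variant used in ItemComplex is not stated in the caption. The `tfidf` helper and the sample items are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {token: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Relative term frequency times IDF; empty documents yield {}.
        vectors.append({tok: (n / len(doc)) * idf[tok]
                        for tok, n in counts.items()})
    return vectors


# Hypothetical, already preprocessed items.
DOCS = [["feel", "sad"], ["feel", "nervous"], ["sleep", "poorly"]]
vectors = tfidf(DOCS)
```

Tokens that occur in fewer items receive larger weights, so `sad` outweighs `feel` in the first item.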

Figure 6. Generated cosine similarity matrix.
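
A cosine similarity matrix of this kind can be sketched as follows, assuming each item is represented as a sparse `{token: weight}` dictionary. The helper names and the example weights are hypothetical and chosen only so that the result is easy to check by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {token: weight}."""
    dot = sum(w * v[tok] for tok, w in u.items() if tok in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty items
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Symmetric matrix of pairwise cosine similarities."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Hypothetical weights for three items.
VECTORS = [
    {"feel": 1.0, "sad": 1.0},
    {"feel": 1.0, "down": 1.0},
    {"sleep": 1.0},
]
matrix = similarity_matrix(VECTORS)
```

Here the first two items share one of their two tokens, so their similarity is about 0.5, while the third item shares no tokens with either and scores 0.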

Figure 7. Network graph generation.

Figure 8. Workflow of the ItemComplex framework, a Python-based visualization methodology for the organization, structuring, and analysis of large and diverse psychometric datasets. Note: On the left, various large-scale cohort datasets are shown being collected and organized, emphasizing their diverse structures. These datasets are then processed through ItemComplex, which applies a suite of Python algorithms for data structuring according to content topics, along with visualization tools, including sunburst and TreeMap plots. These tools help to visually represent the hierarchical structure, overlap, and assignment of constructs/subscales within and across different studies, allowing for navigation through various levels of data granularity.
Structured overview (top right): This part of the figure illustrates the structured overview process applied to large psychometric datasets. The ItemComplex framework allows for the organization and structuring of data on multiple levels, such as the questionnaire level or construct level. At the questionnaire level, the framework can identify identical or similar questionnaires across different studies, aiding in the harmonization process. At the construct level, it shows where the most information has been assessed, highlighting the overlaps and gaps in the data across studies. This hierarchical structuring provides a comprehensive understanding of the data landscape, making it easier to compare and integrate data from diverse sources.
Visualization pipelines and data search algorithms (middle right): This part highlights the use of interactive data exploration pipelines within the ItemComplex framework. These pipelines are equipped with intuitive, interactive charts such as sunburst and TreeMap plots, which let users navigate through the data easily.
Item-level semantic network (bottom right): The final part of the figure focuses on the item-level semantic network analysis performed by the ItemComplex framework. Using NLP techniques and cosine similarity measurements, the framework identifies semantic similarities between individual survey items across different studies. The results are visualized in network graphs that represent the relationships between items.

Figure 9. Interface of ItemComplex framework. Note: 1 – Data upload and preview: In this step, users can upload an Excel file containing multi-item instrument data (e.g., questionnaires such as BDI, CES-D, etc.). The data preview table displays the uploaded dataset, showing columns like “questionnaire,” “scale,” “item text,” and “construct.” 2 – Plot type selection: Users can choose from multiple plot types to visualize the data, such as “Sunburst,” “Treemap,” “Sankey,” or “Item Similarity Network.” 3 – Column selection for plot layers: Users can select different columns from the dataset to define the hierarchical layers for the visualization. For example, “cohorts” could be selected as the first layer, followed by “construct,” “questionnaire,” and optionally “scale.” 4 – Generate plot: After selecting the plot type and columns, the “Generate Plot” button is used to create the interactive visualization. The system then processes the data and displays the sunburst plot, which visually represents the hierarchical structure of the selected attributes.

Figure 10. Sunburst plots of overlapping constructs across different studies. The plot represents different layers that provide information about overlapping questionnaires and constructs/subscales across different studies. The adaptable nature of the visualization also allows for a personalized sequencing of the layers (for example, starting with cohort-specific information within the central circle).

Figure 11. TreeMap plots of overlapping constructs across different studies. The plot showcases hierarchical layers that illustrate the overlap of questionnaires and constructs/subscales across various studies. The adaptable design of the visualization allows for customizable sequencing of the layers, such as beginning with cohort-specific information at the central node, enabling tailored insights into the data’s structure and relationships.

Figure 12. Network visualization for item similarity identification. A spring layout algorithm positions items with higher similarity in closer proximity (cosine similarity cut-off value = .50). Colors correspond to the associated construct, so that clusters of related items sharing the same construct are visually discernible. The hover tooltip provides contextual information about the element being hovered over, such as selected tokens, the actual text of the survey item, the originating questionnaire, the associated scale or subscale, and the corresponding cohort. The values in the first line of the tooltip give the weight of the edge, which quantifies the similarity between connected survey items.
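
The thresholding step behind such a network can be sketched as follows. The example keeps only item pairs whose similarity reaches the cut-off of .50 and stores the similarity as the edge weight. Whether the cut-off is inclusive is an assumption here, and the item labels, similarity values, and both helper functions are hypothetical. The optional layout step assumes the NetworkX library is installed.

```python
def edges_above_cutoff(names, matrix, cutoff=0.50):
    """List (item_a, item_b, similarity) for pairs at or above the cut-off."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle: skip self-pairs
            if matrix[i][j] >= cutoff:  # inclusive cut-off (an assumption)
                edges.append((names[i], names[j], matrix[i][j]))
    return edges


def spring_positions(names, edges, seed=0):
    """Compute 2-D spring-layout positions; assumes NetworkX is installed."""
    import networkx as nx

    graph = nx.Graph()
    graph.add_nodes_from(names)  # keep isolated items in the graph
    graph.add_weighted_edges_from(edges)
    # Heavier edges pull their endpoints closer together.
    return nx.spring_layout(graph, weight="weight", seed=seed)


# Hypothetical item labels and similarity values.
NAMES = ["bdi_1", "cesd_3", "stai_7"]
MATRIX = [
    [1.00, 0.62, 0.10],
    [0.62, 1.00, 0.49],
    [0.10, 0.49, 1.00],
]
edges = edges_above_cutoff(NAMES, MATRIX)
```

With these values only the pair `bdi_1` and `cesd_3` is connected; the pair at 0.49 falls just below the cut-off and stays unlinked.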

Figure 13. Dimensionality reduction of TF-IDF features. The PCA scatter plot (above) displays items projected onto the first two principal components derived from the TF-IDF feature matrix of preprocessed text data. These components capture the greatest variance in the high-dimensional space. Each data point represents an individual item and is color coded according to its associated construct (e.g., theoretical category). The dispersion of points reflects overall variability in semantic content, while clustering suggests underlying similarities among items with the same construct label. The LDA scatter plot (below) illustrates a supervised projection of the same TF-IDF features, with the axes representing the first two linear discriminants. This analysis seeks to maximize the separation between the predefined construct classes. Each point in the plot corresponds to an item, and the colors denote its construct label. Compared to PCA, LDA emphasizes class differences by concentrating on between-class variance, thereby highlighting the distinct clusters corresponding to the different constructs.
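
The contrast between the two projections can be stated compactly with the standard textbook objectives. The notation is introduced here for illustration and is not taken from the article: x_i is the TF-IDF vector of item i, μ the overall mean, μ_c the mean of construct class c, and n_c the number of items in that class.

```latex
\begin{align}
  % PCA: direction of maximal total variance (labels are ignored)
  \mathbf{w}_{\mathrm{PCA}}
    &= \arg\max_{\|\mathbf{w}\| = 1} \mathbf{w}^{\top} \mathbf{S}_T \mathbf{w},
  & \mathbf{S}_T
    &= \sum_{i} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^{\top} \\
  % LDA: direction maximizing between-class relative to within-class scatter
  \mathbf{w}_{\mathrm{LDA}}
    &= \arg\max_{\mathbf{w}}
       \frac{\mathbf{w}^{\top} \mathbf{S}_B \mathbf{w}}
            {\mathbf{w}^{\top} \mathbf{S}_W \mathbf{w}},
  & \mathbf{S}_B
    &= \sum_{c} n_c (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{\top} \\
  & & \mathbf{S}_W
    &= \sum_{c} \sum_{i \in c} (\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^{\top}
\end{align}
```

Because the total scatter decomposes as S_T = S_B + S_W, PCA treats both sources of variance alike, whereas LDA rewards between-class scatter and penalizes within-class scatter, which is why construct clusters appear more separated in the lower panel.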
