
Uncovering the limits of visual-language models in engineering knowledge representation

Published online by Cambridge University Press: 27 August 2025

Marco Consoloni*
Affiliation:
University of Pisa, Italy; Business Engineering for Data Science (B4DS) research group, Italy
Vito Giordano
Affiliation:
University of Pisa, Italy; Business Engineering for Data Science (B4DS) research group, Italy
Federico Andrea Galatolo
Affiliation:
University of Pisa, Italy
Mario Giovanni Cosimo Antonio Cimino
Affiliation:
University of Pisa, Italy
Gualtiero Fantoni
Affiliation:
University of Pisa, Italy; Business Engineering for Data Science (B4DS) research group, Italy

Abstract:

Visual-Language (VL) models offer potential for advancing Engineering Design (ED) by integrating text and visuals from technical documents. We review VL applications across ED phases, highlighting three key challenges: (i) understanding how functional and structural information is complementarily expressed by text and images, (ii) creating large-scale multimodal design datasets and (iii) improving VL models’ ability to represent ED knowledge. Using patents, we developed a dataset of 1.5 million text-image pairs and an evaluation dataset for cross-modal information retrieval. By fine-tuning and testing the CLIP base model on these datasets, we identified significant limitations in VL models’ capacity to capture the fine-grained technical details required for precision-driven ED tasks. Based on these findings, we propose future research directions to advance VL models for ED applications.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
Copyright
© The Author(s) 2025
Figure 1. Visual-language models for engineering design tasks

Figure 2. Two alternative uses of text and images to express functional and structural ED concepts

Table 1. Experimental dataset for patent citation retrieval task

Figure 3. Retrieval Workflow

Table 2. Results of patent citation retrieval tasks for CLIP base and fine-tuned model

Figure 4. Performance of retrieval strategies: Avg. P@k, R@k and F1@k for k values 1-30

Figure 5. Distribution of IPC classes of retrieved patents for each citing patent