
Are large pre-trained vision language models effective construction safety inspectors?

Published online by Cambridge University Press:  06 April 2026

Xuezheng Chen
Affiliation:
The University of British Columbia – Vancouver Campus, Canada
Zhengbo Zou*
Affiliation:
Civil Engineering and Engineering Mechanics, Columbia University, USA
*
Corresponding author: Zhengbo Zou; Email: zhengbo.zou@columbia.edu

Abstract

Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful vision language models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets to comprehensively evaluate and further fine-tune VLMs for construction safety inspection. Current applications of VLMs use small, supervised datasets, limiting their applicability to tasks they are not directly trained for. In this article, we propose ConstructionSite 10k, featuring 10,000 construction site images with annotations for three interconnected tasks: image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings, while additional training is needed to make them applicable to actual construction sites. This dataset allows researchers to train and evaluate their own VLMs with new architectures and techniques, providing a valuable benchmark for construction safety inspection.
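To make the three annotation types concrete, the sketch below shows what a single combined annotation record could look like. The field names and values are hypothetical illustrations, not the dataset's actual schema.

# Hypothetical record combining the three task annotations for one image.
# Field names and values are illustrative only, not the dataset's actual schema.
annotation = {
    "image": "site_00042.jpg",
    "caption": "Two workers in hard hats tie rebar next to an excavator.",
    "safety_vqa": [
        {
            "rule": "Workers must wear hard hats on site.",
            "violated": False,
            "reasoning": "Both visible workers are wearing hard hats.",
            "violation_bbox": None,  # [x_min, y_min, x_max, y_max] when a violation is visible
        }
    ],
    "visual_grounding": [
        {"object": "excavator", "bbox": [120, 80, 540, 360]},
    ],
}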

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices
Open data
Copyright
© The Author(s), 2026. Published by Cambridge University Press

Figure 1. High-level schematic of the three tasks employed in this article. The VLM receives a construction site image and prompts for three tasks: image captioning, safety rule violation VQA, and construction element visual grounding.

Table 1. Recent works applying VLMs to construction site inspection

Table 2. Comparison of our dataset with popular construction-related computer vision datasets

Table 3. Statistics of the annotations

Figure 2. Construction site images from the dataset, arranged in a grid, each annotated with three labels. For example, the bottom-left image is labeled as “sparse info,” “night,” and “short distance.” The images are uniformly scaled for demonstration purposes, resulting in slightly altered appearances. These images and labels showcase representative samples and do not depict the actual distribution of the dataset.

Table 4. Four safety rules used in the dataset

Figure 3. Distribution of images in the dataset across four features. These statistics illustrate the diversity of the construction site imagery.

Figure 4. Breakdown of the time spent preparing the dataset. The times shown in the figure are approximate and include the time to create the annotation software.

Figure 5. Word frequency of prevalent terms across topic categories in the reference image captions. Synonymous and plural noun forms (e.g., concrete truck and concrete mixers) are consolidated under a single canonical term. Verb occurrences are aggregated at the lemma level. For each term, the bar shows the total frequency, with occurrences from the test split on the left and training split on the right, separated by a black vertical marker.
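As a rough illustration of the term counting behind Figure 5 (not the authors' actual pipeline), the sketch below folds synonyms and plural forms into a canonical term before counting. The synonym map and term list are made up for the example; a real pipeline would also lemmatize verbs with a tool such as spaCy or NLTK.

from collections import Counter

# Fold synonyms and plural forms into one canonical term, then count occurrences.
# CANONICAL and TERMS are illustrative placeholders, not the dataset's vocabulary.
CANONICAL = {"concrete mixer": "concrete truck", "hardhat": "hard hat"}
TERMS = ["concrete truck", "hard hat", "excavator", "rebar", "worker"]

def count_terms(captions):
    counts = Counter()
    for caption in captions:
        text = caption.lower()
        for synonym, canonical in CANONICAL.items():
            text = text.replace(synonym, canonical)
        for term in TERMS:
            counts[term] += text.count(term)
    return counts

print(count_terms(["Two concrete mixers wait near the excavator.",
                   "A worker in a hardhat ties rebar."]))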

Figure 6. The overall workflow of GPT-assisted image caption annotation.

Figure 7. Workflow of the image captioning task. The image and the system and user prompts are given to the VLMs as inputs, and the models generate an image caption as output. The example prompts are used to provide examples to the VLMs in few-shot settings. The generated candidate caption is then evaluated against the human-labeled reference caption, or against the image itself, using automatic metrics.
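A minimal sketch of the zero-shot captioning query in Figure 7, assuming an OpenAI-compatible chat API; the model name, file name, and prompt wording are placeholders rather than the article's exact setup.

import base64
from openai import OpenAI  # assumes an OpenAI-compatible chat API; other VLM backends differ

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("site_image.jpg", "rb") as f:  # placeholder file name
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "You are a construction safety inspector. Describe the site image in one short paragraph."},
        {"role": "user",
         "content": [
             {"type": "text", "text": "Caption this construction site image."},
             {"type": "image_url",
              "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
         ]},
    ],
)
candidate_caption = response.choices[0].message.content
print(candidate_caption)

In a few-shot setting, example image and caption pairs would be appended as additional user/assistant turns before the final query.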

Figure 8. A three-stage query and evaluation workflow for the safety rule VQA test. The images and prompts are given to the VLMs in Stage 1. The VLMs generate choices, reasoning, and bounding boxes as outputs. If the VLM selects the correct violations in Stage 2, the reasoning and bounding boxes are evaluated in Stage 3. The safety rules depicted in the figure are simplified for clarity, with solid font indicating rules relevant to the image and grayed-out font indicating irrelevant rules. In this example, the VLM selects Rule 1 and Rule 2, but only Rule 1 is violated in the image. The reasoning for the violation of the correctly chosen safety rule (Rule 1) is then provided to Llama 3 as the candidate reasoning, alongside the reference reasoning. For visual grounding, the candidate bounding box, marking the location of the violation, is compared with the reference bounding box. This comparison yields an IoU score.
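The Stage 3 bounding-box comparison reduces to a standard intersection-over-union computation; a minimal sketch follows, assuming boxes are given as [x_min, y_min, x_max, y_max] in pixel coordinates (the article does not specify its exact box format).

# IoU between a candidate and a reference bounding box, [x_min, y_min, x_max, y_max].
def iou(box_a, box_b):
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([100, 100, 300, 300], [150, 150, 350, 350]))  # ~0.391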

Table 5. The reported resource usage reflects the consumption per image after completing one round of image captioning (5-shot), VQA (5-shot), and object grounding (for three objects, as per our test)

Figure 9. The figure displays examples of candidate captions generated by the GPT-4 model, alongside reference captions and their evaluations (all scores are in %) for the image captioning task. For all evaluation metrics except the one assessing the reference caption, higher scores indicate a “better” candidate caption. While the five types of image captioning metrics assign different scores to the same caption, they generally exhibit a monotonic relationship. This means that if one metric assigns a higher score to a specific caption, it is likely that the others will also assign higher scores.
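As one small example of such a reference-based metric, the sketch below computes a smoothed BLEU score between a candidate and a reference caption using NLTK; BLEU is used here only as a common illustration and may not match the article's exact metric set.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Smoothed sentence-level BLEU between one candidate caption and one reference.
reference = "a worker in a hard hat stands next to an excavator".split()
candidate = "a worker wearing a hard hat stands beside an excavator".split()

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {100 * score:.1f}%")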

Table 6. The evaluation results (in %) for image description with automatic metrics

Table 7. The results (in %) of safety rule violation detection

Table 8. The table presents reasoning results for safety rule violation VQA

Figure 10. Examples of safety rule violation reasoning. The figure includes the candidate reasoning, the reference reasoning, and their evaluations.

Table 9. The IoU results (in %) of object detection

Figure 11. The image displays visual grounding examples from the GPT model. Each row corresponds to a different object category: the first row shows excavators, the second row rebar, and the third row workers with white hard hats. The model excels at detecting larger objects but struggles with irregular shapes, such as rebar piles, and specific constraints, such as workers with white hard hats.
