Multi-class sewer defect detection with vision-language models

Published online by Cambridge University Press: 25 March 2026

Riccardo Taormina*
Affiliation:
Delft University of Technology, Netherlands
Job Augustijn van der Werf
Affiliation:
Delft University of Technology, Netherlands
* Corresponding author: Riccardo Taormina; Email: r.taormina@tudelft.nl

Abstract

Proactive sewer asset management requires accurate condition assessment, yet CCTV inspections remain costly because interpretation is manual. We evaluated 18 vision-language models (VLMs) in a zero-shot setting for automated classification of six sewer defect types using a curated dataset. Each model produced a defect label, a short explanation, and a confidence score. OpenAI proprietary models outperformed open-source ones. GPT-4.1 mini achieved the highest macro-F1 score (0.50), outperforming much larger models, especially for surface damage and cracks/breaks. Some open-source models, such as LLaMa 4 (16x17B) and Qwen2.5-VL (32B), performed above random guessing but remained behind the proprietary models. All models failed to detect production errors, the most difficult class, and performed poorly on deformations. Confidence scores were generally unreliable, with little distinction between correct and incorrect predictions. Textual-output analysis showed that models sometimes described defects accurately even when the assigned label was wrong, although major hallucinations remained. We conclude that VLMs show some promise for sewer asset management, but they are not ready for deployment. Future work should focus on adding asset metadata to prompts and fine-tuning open-source models, especially since larger, newer, and more expensive OpenAI models did not outperform smaller ones, although confirmation requires a more thorough statistical analysis.
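For readers unfamiliar with the headline metric, the following is a minimal sketch of how a macro-F1 score can be computed from true and predicted labels. It is not the authors' evaluation code. The class names `A`, `B` and `C` and the toy labels are hypothetical placeholders, and scoring an absent class as 0 is an assumed convention.

```python
def per_class_f1(y_true, y_pred, classes):
    """Return {class: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class that never occurs and is never
        # predicted gets F1 = 0 instead of raising a division error.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)


# Hypothetical toy example with three placeholder classes.
classes = ["A", "B", "C"]
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
print(round(macro_f1(y_true, y_pred, classes), 3))  # 0.656
```

Because every class contributes equally to the mean, a class that is never detected pulls the macro-F1 down by the same amount regardless of how rare it is.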

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press
Figure 1. Macro-F1 scores across model families and parameter settings for open-weight models (a), OpenAI GPT variants without reasoning (b) and o-series models with reasoning (c). Identical colour/marker combinations denote results for the same model families across different settings. Transparency encodes either the temperature or the reasoning effort: the most transparent points correspond to temperature 0.0 or low reasoning effort, while the least transparent correspond to temperature 1.0 or high reasoning effort.

Figure 2. Model performance benchmarked against a population of 500,000 random guess models.
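A random-guess benchmark of this kind can be sketched as follows. This is an illustration only: the caption does not say how the random models draw their labels, so uniform sampling over the classes is an assumption, as are the placeholder class names, the balanced toy ground truth, and the much smaller population (500 instead of 500,000) used to keep the example fast.

```python
import random


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1; an absent class scores 0 (assumed convention)."""
    total = 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0
    return total / len(classes)


def random_guess_scores(y_true, classes, n_models=500, seed=0):
    """Macro-F1 of n_models guessers that pick a label uniformly at random."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_models):
        y_pred = [rng.choice(classes) for _ in y_true]
        scores.append(macro_f1(y_true, y_pred, classes))
    return scores


def fraction_at_or_above(scores, observed):
    """Share of random models scoring at least as high as the observed model."""
    return sum(s >= observed for s in scores) / len(scores)


# Hypothetical setup: six placeholder classes, 16 images each.
classes = ["c1", "c2", "c3", "c4", "c5", "c6"]
y_true = [c for c in classes for _ in range(16)]
scores = random_guess_scores(y_true, classes)
print(fraction_at_or_above(scores, 0.50))
```

The returned fraction acts as an empirical p-value: the smaller it is, the less plausible it becomes that a model's macro-F1 was reached by chance.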

Table 1. Performance of the best-performing model combination (GPT-4.1 mini, temperature = 0)

Figure 3. Distributions of the confidence levels reported by the top four models, the model with the highest statistically significant difference (Qwen2.5) and OpenAI’s o3 with high reasoning. For each model, the shaded green distribution (top row) shows the correctly labelled data and the red distribution (bottom row) the wrongly labelled data. Boxplot layouts follow Tukey’s original definition. A bold label on the y-axis indicates a statistically significant difference (KS test, $p \le 0.05$).
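The two-sample Kolmogorov-Smirnov (KS) comparison named in the caption can be sketched in plain Python as below. This is not the authors' implementation: the confidence values are invented for illustration, and the p-value uses a standard asymptotic approximation, so a library routine such as `scipy.stats.ks_2samp` may give slightly different numbers for small samples.

```python
import math


def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Move both pointers past every value <= x, then compare the ECDFs.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_p_value(d, n, m, terms=100):
    """Asymptotic two-sided p-value (approximation, with a small-sample correction)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        # The series converges slowly here; the true p-value is essentially 1.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical confidence scores for correct and incorrect predictions.
conf_correct = [0.90, 0.85, 0.95, 0.80, 0.90, 0.88, 0.92, 0.87]
conf_wrong = [0.85, 0.90, 0.80, 0.95, 0.86, 0.91, 0.89, 0.84]
d = ks_statistic(conf_correct, conf_wrong)
p = ks_p_value(d, len(conf_correct), len(conf_wrong))
print(d, p, "significant" if p <= 0.05 else "not significant")
```

A non-significant result means the confidence scores of correct and incorrect predictions cannot be told apart, which is the situation in which reported confidence carries little information.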

Table 2. Overview of image labels and prevalence of the qualitative labels (calculated as a percentage per image label type)

Figure 4. Improvement in the F1 score when the qualitative assessment of the textual output is included, relative to the score based on the labels alone. The Gemma 3 (27b) model and the o3 models are highlighted: o3 shows no sensitivity to reasoning level, and Gemma 3 (27b) shows consistent improvement with no clear sensitivity to temperature.

Table 3. Illustrative examples of model predictions and generated explanations for different sewer defect classes

Author comment: Multi-class sewer defect detection with vision-language models — R0/PR1

Comments

No accompanying comment.

Review: Multi-class sewer defect detection with vision-language models — R0/PR2

Conflict of interest statement

Reviewer declares none.

Comments

While I think that the paper has its merits and is a suitable candidate for publication in Cambridge Prisms Water, there are a few weaknesses that must be addressed before the paper becomes ready for publication. Please find below some detailed comments and suggestions on how to improve the paper.

Strengths:

- The paper investigates a highly relevant application domain, where AI could have a significant impact in practice. In this context, the paper does a good job in investigating the potential (and limitations) of a large number of different VLMs. I think this paper is of interest to the research community as well as practitioners that consider the application of AI for operating urban drainage networks.

- The paper is well written and easy to read.

Weaknesses:

- The paper claims interpretability of the outputs (e.g., “Natural-language rationales generated by VLMs foster trust”), but gives no evidence for it. I think this is critical, since it constitutes one of the paper’s main claims about why VLMs might be useful. I see several possible ways to resolve this: (1) simply remove the statement/claim, since I think the paper has enough other contributions that make it interesting; (2) back up the statement with literature, as studies may exist for other domains; (3) evaluate the interpretability by, for instance, running a user study or devising some other quantitative evaluation measure/metric, which would be the most interesting but also the most challenging way of dealing with the claim; or (4) keep the statement but discuss it more critically, avoiding false expectations and potential harm in practice.

Personally, I am a bit sceptical about the faithfulness of such explanations since (V)LLMs are often overly confident in their outputs, and there is no guarantee that the explanations indeed explain the model’s internal reasoning.

- The empirical evaluation is performed on a relatively small dataset (“only” 100 images), limiting the generalizability of the empirical results. I suggest using a larger dataset to get statistically more reliable results. Even if this is not computationally feasible, a proper statistical analysis of the results is required to ensure scientific rigour. I think just reporting F1 scores (see Figure 1 and Table 1) is not enough. I suggest performing some form of statistical significance analysis (as was done for the confidence analysis) or, at a bare minimum, including some notion of variance by, for instance, performing experiments on different subsets. Furthermore, the small number of test images should be discussed and acknowledged as a limitation of the paper.
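One possible way to attach a notion of variance to a reported score, in the spirit of the suggestion above, is a percentile bootstrap over the test images. The sketch below is an assumption about how this could be done, not a description of what the authors did. The function names, the toy labels and the use of plain accuracy as the metric are hypothetical, and any metric with the same signature (for example a macro-F1 function) could be passed in instead.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_interval(metric, y_true, y_pred,
                       n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(y_true, y_pred)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample image indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical 100-image test set on which half the predictions are right.
y_true = ["A", "B"] * 50
y_pred = ["A", "A"] * 50
lower, upper = bootstrap_interval(accuracy, y_true, y_pred)
print(accuracy(y_true, y_pred), lower, upper)
```

With only 100 images the resulting interval is wide, which illustrates why differences of a few hundredths between models are hard to interpret without such an analysis.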

Minor:

- Consider including some illustrative examples from the Sewer-ML dataset. This would make the paper more accessible to readers who are not familiar with this dataset.

Review: Multi-class sewer defect detection with vision-language models — R0/PR3

Conflict of interest statement

Reviewer declares none.

Comments

The manuscript “Multi-Class Sewer Defect Detection with Vision-Language Models” is an interesting study that nicely shows both the opportunities and the remaining shortcomings of VLMs for sewer defect detection from CCTV data. I have no strong criticism of the paper, which is well written and argued, but rather some questions on which I would welcome the authors’ reflections:

• The proprietary models perform slightly better than the open-source ones. The dataset used is the openly available Sewer-ML dataset. How likely is it that this dataset was part of the training data of one of the existing VLMs?

• One recurring theme in discussions with utilities and in the literature is the conservatism of the water field. In the end, the use case here would currently involve copying critical infrastructure condition data into a proprietary system. This could also be a hindrance from a security and legal point of view. How do you see possible changes to adapt to this? Self-hosted open-source models, as has been mentioned? Or other business models?

• It would be interesting to see the performance on a “real system”, where the defects may be very unbalanced. Also, I am a bit less optimistic about the mapping between Sewer-ML and EN 13508-2, but I look forward to being proven wrong. This study shows the potential for classification of defects, but additional information would be needed (crack width, for example). How do you see the potential to obtain this information?

• The mentioned autonomous inspections (by robots or drones or…) are intriguing but the mentioned “sufficiently compact yet capable models” seem vague. How compact can you get with such models?

Recommendation: Multi-class sewer defect detection with vision-language models — R0/PR4

Comments

This paper focuses on testing the potential of vision language models to detect defects in sewer pipes. The authors compare different models and reflect on the performance. Both reviewers are positive about the paper, and provide a few important suggestions for the authors to address, including avoiding overstating the scope of the work and providing additional discussion about the broad impact and generalizability of this work. I look forward to receiving the revised paper.

Decision: Multi-class sewer defect detection with vision-language models — R0/PR5

Comments

No accompanying comment.

Author comment: Multi-class sewer defect detection with vision-language models — R1/PR6

Comments

No accompanying comment.

Recommendation: Multi-class sewer defect detection with vision-language models — R1/PR7

Comments

No accompanying comment.

Decision: Multi-class sewer defect detection with vision-language models — R1/PR8

Comments

No accompanying comment.