
MVFD-Net: multi-view fusion detection network for occluded underwater dam cracks

Published online by Cambridge University Press:  24 June 2025

Yukai Wu
Affiliation:
School of Mechanical and Electrical Engineering, Henan University of Technology, Zhengzhou, PR China
Xiaochen Qin
Affiliation:
School of Mechanical and Electrical Engineering, Henan University of Technology, Zhengzhou, PR China
Lei Cai*
Affiliation:
School of Artificial Intelligence, Henan Institute of Science and Technology, Xinxiang, PR China
*
Corresponding author: Lei Cai; Email: cailei2014@126.com

Abstract

Detecting cracks in underwater dams is crucial for ensuring dam quality and safety. However, underwater dam cracks are easily obscured by aquatic plants. Traditional single-view visual inspection methods cannot effectively extract the feature information of occluded cracks, whereas multi-view crack images can recover occluded target features through feature fusion. At the same time, underwater turbulence causes nonuniform diffusion of suspended sediments, so images from different viewpoints are flooded by noise to different degrees, which degrades the fusion result. To address these issues, this paper proposes a multi-view fusion detection network (MVFD-Net) for occluded underwater dam crack detection. First, we propose a feature reconstruction interaction encoder (FRI-Encoder), which exchanges the multi-scale local features extracted by a convolutional neural network with the global features extracted by a transformer encoder and performs feature reconstruction at the end of the encoder, enhancing feature extraction while suppressing the interference of nonuniform scattering noise. Subsequently, a multi-scale gated adaptive fusion module is introduced between the encoder and the decoder to perform gated feature fusion, which further complements and recovers detail information flooded by noise. Additionally, this paper designs a multi-view feature fusion module that fuses multi-view image features to restore occluded crack features and achieve occluded crack detection. Extensive experimental evaluations show that MVFD-Net achieves excellent performance compared with current mainstream algorithms.

Information

Type
Research Article
Copyright
© The Author(s), 2025. Published by Cambridge University Press

1. Introduction

Dams are prone to cracking due to prolonged immersion, temperature fluctuations, chemical corrosion by water, and hydraulic fracturing [Reference Mucolli, Krupinski, Maurelli, Mehdi and Mazhar1, Reference Chen, Huang and Kang2]. The number of cracks increases the longer a dam remains in service. Some cracks may extend into the interior of an embankment dam, affecting the dam’s structure and load-carrying capacity. Thus, underwater dam crack detection is crucial to ensure the proper functioning and safety of the dam structure.

Dam cracks can be obscured by aquatic vegetation during crack detection. A traditional single-view image cannot completely capture the characteristic information of a crack, whereas multi-view crack detection effectively avoids repeated occlusion of the same crack region: the feature information of the occluded region can be complemented by fusing the multi-view images. In recent years, numerous methods for enhancing feature extraction in underwater object detection have been proposed [Reference Jian, Yang, Tao, Zhi and Luo3]. Convolutional neural networks (CNNs) are widely used in feature recognition, but they have limitations in modeling long-range dependencies. Transformer networks can effectively capture global dependencies through positional encoding and attention mechanisms [Reference Wu, Li, Zhang, Li, Li and Zhang4, Reference Beyene, Tran, Maru, Kim and Park5]. However, the computational complexity of the transformer increases significantly, especially when computing inter-positional correlations, where the required resources grow rapidly with the number of positions. As a result, most transformer-based methods can only run on high-performance servers [Reference Zhang, Bai and Kpalma6]. Gating mechanisms [Reference Yang, Zhu, Wu and Yang7] can alleviate the problem of excessive information. To balance segmentation accuracy and computational complexity, many studies have attempted to combine CNNs with transformers, achieving promising results [Reference Wang, Zeng, Sharma, Alfarraj, Tolba, Zhang and Wang8]. Fusing coarse-grained global information with fine-grained local information also helps the network capture features of different sizes [Reference Zhu, Huang, Xie, Meng, Wang and Zhou9]. Underwater turbulence leads to an uneven distribution of suspended sediments, which produces varying levels of scattering noise in the images, so different levels of feature information are obscured from different perspectives. Such differential noise not only hinders the fusion of homogeneous feature information across multi-view images but also degrades the completeness and accuracy of crack detection.

To address these challenges, this study introduces a novel multi-view fusion network (MVFD-Net) designed for crack detection in underwater dams. The primary innovations of this network are encapsulated in the following three aspects:

  1. The FRI-Encoder is introduced, which facilitates interaction between the multi-scale local features extracted by the CNN encoder and the global features extracted by the transformer encoder. This interaction is achieved through the interaction feature module (IFM) and the feature reconstruction module (FRM). These design choices enhance the model’s ability to capture crack texture features and effectively suppress background noise.

  2. The MGAF module is proposed to enable cross-level feature fusion between the encoder and decoder. This module compensates for the semantic loss in low-level features while recovering the details in high-level features, thereby improving the continuity of segmentation results.

  3. The MVFF module is proposed to guide the scale-space construction of the SIFT algorithm. This is achieved through multi-view crack masks and unobscured crack masks with dimension-enhancing descriptions. The module effectively mitigates image alignment issues caused by homogeneous feature variability, which is induced by underwater non-uniform scattering noise. Additionally, through adaptive guidance provided by the occlusion masks, the MVFF module restores masked crack features, further improving crack segmentation performance.

2. Related work

2.1. Occluded object detection

Traditional techniques for detecting dam cracks include embedded sensors, ground penetrating radar, and ultrasonic testing. While these methods are effective in traditional environments, their performance is significantly degraded in underwater scenarios due to occlusion, which results in missing target information and poses challenges in feature extraction and crack detection. With the rapid development of deep learning technologies, researchers have been exploring neural networks for hidden object detection. Various approaches have been proposed to address the problem of information loss caused by occlusions. Ke et al. [Reference Ke, Tai and Tang10] introduced a two-layer convolutional network (BCNet), which is characterized by its two-layer structure: the upper layer detects occluders, while the lower layer infers the occluded parts of the target. This method separates the boundaries of occluders and occluded targets through mask regression, providing an effective solution to the occlusion problem. In the field of multi-object detection, Yuan et al. [Reference Yuan, Kortylewski, Sun and Yuille11] proposed a generative model that leverages the activation of neural features to accurately localize occluders. By classifying targets based on free areas, the model ensures high detection accuracy. To address the problem of sub-segmentation, Zhang et al. [Reference Zhang, Dai, Song, Zhao and Zhang12] developed OSLPNet, which mitigates the impact of occlusion on feature extraction through multi-scale receptive fields. Additionally, the network leverages the contextual topological relationships of target features to further optimize occluded object detection. Gan et al. [Reference Gan, Menegon, Sun, Scollo, Jiang, Xue and Norton13] improved a two-stage segmentation network by introducing boundary expansion boxes that guide non-modal instance segmentation networks to generate clearer target boundaries. Meanwhile, Wang et al. [Reference Wang, Zhu, Chen, Li and Cai14] proposed OccludedInst, a query-based instance segmentation method. By integrating data augmentation techniques and an occlusion correction module, this approach enables robust learning in covert scenarios. For more precise restoration of the appearance of occluded targets, Yan et al. [Reference Yan, Wang, Liu, Yu, He and Pan15] developed an iterative multitasking framework. This framework uses a dual-path structure that includes a 3D model pool and coupled discriminators, which significantly improves the accuracy of target recovery and detection.

2.2. Multi-view object detection

When detecting occluded objects, a single viewpoint image may not accurately identify the target. Multi-view approaches utilize information from multiple perspectives to compensate for the loss of information caused by occlusion in a single view [Reference Guo, Yu, Xie, Ma, Cao and Luo16]. The biggest challenge in multi-view object detection is effectively merging information from different viewpoints, especially when it comes to occlusions and viewpoint variations. To address these problems, many studies have proposed solutions based on multi-view information fusion [Reference Dong, Yan, Tang, Tang and Zhang17, Reference Xia, Liao, Di, Zhao, Liang and Xiong18], generative adversarial networks (GANs) [Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair and Bengio19, Reference Isola, Zhu, Zhou and Efros20], and optical flow learning. Zhou et al. [Reference Zhou, Tulsiani, Sun, Malik and Efros21] introduced an appearance flow-based method that learns the image flow relationship between the source and target views to reconstruct occluded regions to improve the robustness of multi-view object detection [Reference Yang, Wang, Wang, Dou, Yang and Shen22]. Choy et al. [Reference Choy, Xu, Gwak, Chen and Savarese23] proposed a recursive neural network framework that recursively merges multi-view information to generate 3D object models with minimal occlusion, thereby mitigating the problems caused by occlusions. Yang et al. [Reference Yang, Zheng, Yang, Chen and Tian24] developed a Spatiotemporal Graph Convolutional Network (ST-GCN) that integrates both temporal and spatial features, enabling effective video re-identification of pedestrians even in occlusion. Zhang et al. [Reference Zhang, Zheng, Gao, Zhang, Yang and Chua25] proposed a Multi-View Consistency Generative Adversarial Network (MVCGAN) that successfully generates images from multiple viewpoints through geometric constraints and optimization models and processes complex multi-object scenes using a “decomposition and composition” approach. Overall, these methods cleverly fuse multi-view information and generative models to not only address occlusion problems but also improve the performance of multi-view object detection in complex scenarios, thereby improving the robustness and accuracy of detection systems. Arooj et al. [Reference Arooj, Altaf, Ahmad, Mahmoud and Mohamed26] introduced an improved detection network that combines CNNs and SIFT and uses SIFT to extract important feature points from images under different lighting conditions, guiding the network to learn effectively and achieve promising results. Ma et al. [Reference Xu, Yuan and Ma27] proposed a multi-graph matching fusion mechanism that implements a coarse-to-fine matching process and attempts to improve local texture information while preserving the original scene content during the fusion phase.

Figure 1. MVFD-Net Network Framework. The “Conv layer” represents the convolution operations on each layer, “Patch Embed” refers to the embedding layer, and the “transformer block” refers to the transformer blocks used for feature extraction. “ESPCN” indicates the subpixel convolution used for upsampling, while “IDSC” represents depth-wise separable convolutions (PW + DW). Finally, the output provides the predicted results.

3. Proposed method

In this paper, we propose MVFD-Net, whose network structure is shown in Figure 1. MVFD-Net consists of three key components: the feature reconstruction interaction encoder (FRI-Encoder), the multi-scale gated adaptive fusion module (MGAF), and the multi-view feature fusion module (MVFF). First, the FRI-Encoder exchanges the multi-scale local features extracted by the CNN with the global features extracted by the transformer, and two modules, the interaction feature module (IFM) and the feature reconstruction module (FRM), are designed at the middle layers and at the end, respectively, to address the difficulty of feature extraction caused by underwater non-uniform scattering noise. Second, the MGAF performs feature fusion between the encoder and decoder via a pyramid network to further complement the lost feature details. Finally, the designed MVFF introduces new-perspective features for fusion-based repair to solve the problem of dam cracks obscured by aquatic plants. The following sections provide a detailed analysis of each module and explain how they synergistically improve the overall performance of the MVFD-Net architecture.

3.1. Feature reconstruction interactive encoder (FRI-encoder)

This paper introduces the FRI-Encoder, a solution designed to address the challenges of feature extraction in underwater environments, particularly those caused by non-uniform scattering noise. The FRI-Encoder follows a two-branch architecture. One branch is a lightweight CNN encoder that uses ResNet34 as the backbone and replaces standard convolutional layers with depthwise separable convolution (DSC) [Reference Hua, Huang, Li and Cao28]. The fundamental unit of this branch is the Conv block, which comprises DSC, batch normalization (BN), and the GELU activation function. By stacking Conv blocks, the FRI-Encoder effectively extracts local features at each layer, maintaining strong feature extraction capability while minimizing network depth and computational load. The CNN encoder comprises five layers, with the number of channels doubling after every two downsampling operations and the feature map becoming progressively smaller. The second branch, the transformer encoder, consists of four layers. The input image is processed sequentially by the transformer modules, with the feature map size reduced to 1/4 after the embedding layer. To ensure consistent feature map sizes during fusion, features extracted from the second layer of the CNN encoder are merged with those from the first layer of the transformer encoder through the designed IFM. This process is repeated layer by layer, as illustrated in Figure 2.
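The following minimal PyTorch sketch illustrates the Conv block structure described above (DSC followed by BN and GELU); the channel counts, stride, and layer names are illustrative assumptions and do not reproduce the exact FRI-Encoder configuration.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Depthwise separable convolution + BatchNorm + GELU, stacked as in the CNN branch."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise convolution filters each input channel independently.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise (1x1) convolution mixes channels and changes the channel count.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Example: one downsampling stage of the CNN branch (channel count and stride are illustrative).
stage = nn.Sequential(ConvBlock(64, 64), ConvBlock(64, 128, stride=2))
y = stage(torch.randn(1, 64, 256, 256))   # -> (1, 128, 128, 128)
```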

Figure 2. Illustration of IFM, $T$ denotes the features that are reintroduced into the transformer encoder after fusion, and $T'$ indicates the features input into the MGAF module after fusion.

Specifically, the IFM aggregates the interlayer features of the two encoder branches and further divides them into two subbranches. One subbranch is connected to the decoder via skip connections. First, the feature shape is adjusted (from $C \times H \times W$ to $H \times W \times C$ ), followed by global average pooling (GAP) (reducing to $1 \times 1 \times C$ ) to calculate the weight of each channel. The weight is then multiplied with the features to adjust the influence of each channel. Finally, the feature shape is restored (back to $C \times H \times W$ ). After the processed features are multiplied and merged, they are concatenated with the original features to obtain the merged features. Information aggregation is then performed via a convolution module and residual connections, resulting in the complementary fusion features $T'$ that improve feature correlation. The other subbranch generates the fusion feature $T$ , which is fed back into the transformer encoder. Through this interaction, the local feature information extracted by the CNN encoder can be integrated into the transformer encoder, improving its ability to perceive local details such as edges, shapes, and textures. In this process, the transformer features, whose dimensions are inconsistent with the CNN features, are first reshaped to align with the CNN architecture and summed element-wise as input. Then, a pointwise ($1\times 1$) convolution, enhanced by both local and global attention mechanisms, is applied to assign weights and obtain the weighted fused features. The weight calculation process is shown in Table I.

Table I. Local and global attention mechanisms.

Through the synergy of local and global attention modules, weights can be adaptively assigned to fuse both local and global features. These fused features are then reintroduced into the transformer encoder in an interactive manner, thereby enhancing the transformer’s attention to local feature information and generating more representative features. This process can be summarized and represented by the following equation:

(1a) \begin{equation} T = W(A \oplus B) \otimes A \oplus (1 - W(A \oplus B)) \otimes B \end{equation}

where $ \oplus$ represents feature integration and we use element-by-element summation, $ \otimes$ denotes element-wise multiplication, $ A$ denotes the feature maps of the CNN encoder, $ B$ denotes the feature maps of the transformer encoder, $ T$ represents the fused output features, with $ A, B, T \in \mathbb{R}^{C \times H \times W}$ . $ W$ refers to the processing step described in the pseudo-code in Table I. As illustrated in Figure 2, the dotted line indicates $ 1 - W(A \oplus B)$ . It is important to note that the fusion weight $ W(A \oplus B)$ consists of real values between 0 and 1, as does $ 1 - W(A \oplus B)$ , which enables the network to perform a weighted average between $ A$ and $ B$ .
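A compact PyTorch sketch of the fusion rule in Eq. (1a) is given below. The weight branch is a simplified stand-in for the local and global attention of Table I, whose exact layers are not reproduced here; the reduction ratio and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Sketch of Eq. (1a): T = W(A+B) * A + (1 - W(A+B)) * B.
    The weight branch below only approximates the local/global attention of Table I."""
    def __init__(self, channels: int):
        super().__init__()
        self.local = nn.Sequential(                      # local attention: 1x1 convs per position
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1))
        self.global_ = nn.Sequential(                    # global attention: pooled channel weights
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1))

    def forward(self, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        s = A + B                                            # element-wise integration in Eq. (1a)
        W = torch.sigmoid(self.local(s) + self.global_(s))   # fusion weights in (0, 1)
        return W * A + (1.0 - W) * B                         # weighted average of the two branches

A = torch.randn(1, 128, 64, 64)   # CNN-branch features
B = torch.randn(1, 128, 64, 64)   # transformer-branch features (already reshaped/aligned)
T = WeightedFusion(128)(A, B)
```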

Although the two-branch coding approach effectively mitigates the noise flooding problem caused by underwater nonuniform noise, it also introduces significant information redundancy. To filter and reconstruct the feature information generated by the FRI-Encoder for fusion, we design the FRM module at the end of the FRI-Encoder. The FRM module is a symmetrical structure, and for ease of presentation, we show only one side of the module as shown in Figure 3.

The FRM module adjusts the shape of the feature maps $ f_1$ and $ f_2$ extracted by the CNN encoder and transformer encoder from $ (B, C, H, W)$ to $ (B, C, H \times W)$ . Next, the global mean and global maximum are computed. After processing by a convolutional layer and the ReLU activation function, effective global feature representations $ a_1$ and $ a_2$ are obtained. This step substantially reduces the subsequent computation. The detailed steps are provided in Table II.

In Table II, $i$ takes the value 1 or 2, corresponding to the features $f_1$ and $f_2$ extracted by the CNN encoder and the transformer encoder, respectively. “DSC” refers to depthwise separable convolution. The reconstructed feature maps $a_1$ and $a_2$ are obtained from the CNN and transformer features, respectively, through feature reconstruction. The cross-attention mechanism is then used to reweight $f_1$ and $f_2$ by $a_1$ and $a_2$ to obtain the feature maps $f_{i\text{-cross}}$ . The cross-attention weighting process is defined by the following equations:

(2a) \begin{align}&\,\,\, f_{1\text{-cross}} = \text{softmax}(a_{1} \times a_{2}^T) \times f_1 \end{align}
(3a) \begin{align}& f_{2\text{-cross}} = \text{softmax}((a_{1} \times a_{2}^T)^T) \times f_2 \end{align}

The cross-attention mechanism captures the correlations between different feature maps, enabling a more accurate representation of information. Next, $ f_{i\text{-cross}}$ is reshaped from $ (B,C,H\times W)$ to the original shape $ (B,C,H,W)$ and fed into the convolutional layer for spatial feature fusion. This process adaptively adjusts the importance of each pixel, preserving richer and more effective spatial structural information. The details are as shown in Table III. $ a_{i\text{-spatial}}$ represents the spatial feature weights, and $ f_i'$ denotes the feature map obtained after $ f_i$ is weighted by $ a_{i\text{-spatial}}$ . The weighted and fused feature maps, $ f_1'$ and $ f_2'$ , are adjusted in dimensions and then element-wise superimposed. The reshaped feature maps, with shape $ (B,C,H,W)$ , are used as the final output of the encoder.
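The cross-attention reweighting of Eqs. (2a) and (3a) can be sketched as below; the descriptor dimension `d` and the tensor shapes are illustrative, and the construction of $a_1$ and $a_2$ from Table II is not reproduced.

```python
import torch
import torch.nn.functional as F

def frm_cross_attention(f1, f2, a1, a2):
    """Sketch of Eqs. (2a)-(3a). f1, f2: flattened features of shape (B, C, H*W);
    a1, a2: reconstructed global descriptors of shape (B, C, d), d being an assumed size."""
    attn = torch.bmm(a1, a2.transpose(1, 2))                            # (B, C, C) branch correlation
    f1_cross = torch.bmm(F.softmax(attn, dim=-1), f1)                   # reweight CNN features, Eq. (2a)
    f2_cross = torch.bmm(F.softmax(attn.transpose(1, 2), dim=-1), f2)   # transformer features, Eq. (3a)
    return f1_cross, f2_cross

B_, C, H, W, d = 1, 128, 32, 32, 16
f1 = torch.randn(B_, C, H * W)
f2 = torch.randn(B_, C, H * W)
a1 = torch.randn(B_, C, d)
a2 = torch.randn(B_, C, d)
f1c, f2c = frm_cross_attention(f1, f2, a1, a2)   # both (B, C, H*W), reshaped to (B, C, H, W) downstream
```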

Table II. Feature reconstruction.

Figure 3. Illustration of FRM, the feature maps $f_i$ represent the features from the CNN encoder or transformer, while $ f_i^{\prime}$ denotes the reconstructed features.

Table III. Spatial information integration.

FRI-Encoder achieves deep fusion of feature reconstruction and interaction, which not only improves the effectiveness of feature representation but also enhances the model’s ability to capture key features of the target. This provides strong support for handling target detection and feature extraction tasks in complex underwater environments [Reference Pan, Jia and Cai29].

3.2. Multi-scale gated adaptive fusion module (MGAF)

Cracks exhibit complex topological structures, irregular boundaries, and a very small pixel ratio in images. Some of the feature details in cracks often contain important structural information. Nonuniform underwater noise further exacerbates the obscurity of these detailed features. In this article, we propose the MGAF module, which can merge functions from different learning stages, effectively supplementing the detailed information of high-level functions. The MGAF network framework is shown in Figure 4.

Figure 4. Illustration of MGAF. $ T_i'$ and $ T_i''$ represent the input features and the features processed by the MGAF module, respectively. Gate represents the gating mechanism, and $\theta$ denotes the embedding weights that manage the channel weights prior to normalization. The gating weights and bias ( $\eta$ and $\lambda$ ) progressively adjust the input feature proportions $x$ across the channels.

The MGAF module also uses DSC instead of standard convolution to process the FRI-Encoder middle-layer fusion features $ T'$ . DSC effectively adjusts and aligns features at different scales to achieve more efficient multi-scale feature fusion. These features are then concatenated along the channel dimension to obtain the fully merged features. Next, the fused feature map is fed into the gating mechanism. We introduce the operator $ \boldsymbol{\theta }$ to embed the global context and control the weights of each channel before normalization. Then, the gated adaptation operators $ \boldsymbol{\eta }$ and $ \boldsymbol{\lambda }$ are introduced: they adjust the input features channel by channel based on the normalized output, while $ \boldsymbol{\theta }$ adjusts the embedding output. The gating weights $ \boldsymbol{\eta }$ and bias $ \boldsymbol{\lambda }$ control the activation of the weight coefficients and determine the behavior of the gating mechanism in each channel. In this paper, let $ \mathbf{x} \in \mathbb{R}^{C \times H \times W}$ represent an activation feature in a convolutional network, where $ H$ and $ W$ are the spatial height and width, and $ C$ is the number of channels. The overall equation of the gating mechanism is given as follows:

(4a) \begin{equation} \hat {\mathbf{x}} = F(\mathbf{x} \mid \boldsymbol{\theta }, \boldsymbol{\eta }, \boldsymbol{\lambda }), \quad \boldsymbol{\theta }, \boldsymbol{\eta }, \boldsymbol{\lambda } \in \mathbb{R}^C \end{equation}

where $\hat {\mathbf{x}}$ represents the result processed by the adaptive gating mechanism. For each channel, the receptive field of the convolutional neural network is enhanced by designing a global context embedding module. This enables the network to better aggregate and utilize global information. Given the embedding weight operator $\boldsymbol{\theta } = [\theta _1, \ldots , \theta _C]$ , the approach avoids ambiguities that may arise from a limited receptive field. Let $\mathbf{x} = [x_1, \ldots , x_C]$ , where $x_c = \left [ x_c^{(i,j)} \right ]_{H \times W} \in \mathbb{R}^{H \times W}$ with $c \in \{1, 2, \ldots , C\}$ . The global feature representation $g_c$ is then obtained as follows, as shown in the equation:

(5a) \begin{equation} g_c = \theta _c \|x_c\|_2 = \theta _c \left ( \left [ \sum _{i=1}^{H} \sum _{j=1}^{W} (x_c^{(i, j)})^2 \right ] + \epsilon \right )^{1/2} \end{equation}

where $ x_c$ represents the feature map of each channel in $ x$ . In this paper, $ x_c$ is $ \ell _2$ normalized to retain more detailed feature information compared to GAP. $ \theta _c$ represents the trainable parameters that control the weight of each channel. The proposed channel normalization method employs $ \ell _2$ normalization across channels to enhance the training stability and model performance while significantly reducing computational complexity. Let $ G = [g_1, \ldots , g_C]$ represent the normalized feature representations, as shown in the following equation:

(6a) \begin{equation} \hat {g}_c = \frac {\sqrt {C} s_c}{\| G \|_2} = \frac {\sqrt {C} g_c}{\left ( \sum \limits _{c=1}^{C} g_c^2 + \epsilon \right )^{1/2}} \end{equation}

where $ \epsilon$ is a small positive constant, and $ \sqrt {C}$ is introduced to prevent $ \hat {g}_c$ from becoming too small when $ C$ ( the number of channels) is large. To further control the feature expression of each channel, this study employs a gating mechanism to dynamically regulate the control gates on the channels by designing trainable gating weights $ \boldsymbol{\eta } = [\eta _1, \ldots , \eta _C]$ and biases $ \boldsymbol{\lambda } = [\lambda _1, \ldots , \lambda _C]$ to adjust the activations. The equation is as follows:

(7a) \begin{equation} \hat {x}_c = x_c \left [1 + \tanh \left (\eta _c\hat {g}_c + \lambda _c \right )\right ] \end{equation}

where $ \eta _c$ and $ \lambda _c$ represent the gating weight and bias of the $ c$ -th channel, respectively.
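A minimal PyTorch sketch of the gating defined by Eqs. (5a)-(7a) is shown below; the parameter initialization and the surrounding multi-scale fusion of the MGAF module are simplified assumptions.

```python
import torch
import torch.nn as nn

class GatedChannelNorm(nn.Module):
    """Sketch of Eqs. (5a)-(7a): global context embedding (theta), cross-channel
    l2 normalization, and a tanh gate with per-channel weight (eta) and bias (lambda)."""
    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        self.theta = nn.Parameter(torch.ones(1, channels, 1, 1))    # embedding weights
        self.eta = nn.Parameter(torch.zeros(1, channels, 1, 1))     # gating weights
        self.lam = nn.Parameter(torch.zeros(1, channels, 1, 1))     # gating bias
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Eq. (5a): g_c = theta_c * ||x_c||_2 over the spatial dimensions.
        g = self.theta * torch.sqrt(x.pow(2).sum(dim=(2, 3), keepdim=True) + self.eps)
        # Eq. (6a): l2 normalization across channels, scaled by sqrt(C).
        C = x.size(1)
        g_hat = (C ** 0.5) * g / torch.sqrt(g.pow(2).sum(dim=1, keepdim=True) + self.eps)
        # Eq. (7a): x_c * [1 + tanh(eta_c * g_hat_c + lambda_c)].
        return x * (1.0 + torch.tanh(self.eta * g_hat + self.lam))

x = torch.randn(2, 64, 64, 64)
x_hat = GatedChannelNorm(64)(x)     # same shape, channel-wise gated
```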

Following the MGAF module processing, more complex dependencies are captured, and global context normalization is introduced through efficient feature fusion between the encoder and decoder. By tuning the trainable parameters, the model adaptively optimizes global features and enhances feature representation. This approach is particularly well-suited for small and imbalanced datasets, as it not only reduces computation from excessive fused information but also improves segmentation performance and model generalization.

3.3. Multi-view feature fusion (MVFF)

This paper proposes the MVFF module to address the occlusion problem, as shown in Figure 5.

Figure 5. Multi-view feature fusion module.

First, the feature point description level of the SIFT algorithm is improved by extending the traditional 128-dimensional descriptor to 180 dimensions. Specifically, a circular area with a radius of 30 pixels is defined around each key point and divided into 15 concentric rings. After computing the Gaussian-weighted gradients of each ring, the main direction of the feature point is determined, exploiting the rotation invariance of the circular region; this not only increases computational efficiency but also simplifies the operation. To ensure the uniqueness of the feature points, the gradients in the circular area are divided into 12 directions and the values in each direction are accumulated. Finally, the gradient sums of the rings are arranged in an inside-out order, forming a 180-dimensional feature vector. By expanding the descriptor dimensions, this method encodes the local region details of the image more precisely and captures more subtle features in complex underwater environments with uneven noise. In addition, to mitigate the impact of gray-level variations, the generated feature vector is normalized.
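The following NumPy sketch illustrates how such a 180-dimensional ring descriptor (15 rings $\times$ 12 orientation bins) could be computed; the Gaussian weighting, orientation normalization, and boundary handling are simplified relative to the full improved SIFT pipeline described above.

```python
import numpy as np

def ring_descriptor(gray: np.ndarray, kp: tuple, radius: int = 30,
                    n_rings: int = 15, n_bins: int = 12) -> np.ndarray:
    """Sketch of the extended 180-D descriptor: 15 concentric rings x 12 orientation bins,
    Gaussian-weighted gradient magnitudes, normalized against gray-level changes."""
    y0, x0 = kp
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)          # gradient orientation in [0, 2*pi)

    desc = np.zeros((n_rings, n_bins))
    sigma = radius / 2.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r = np.hypot(dx, dy)
            if r > radius:
                continue
            y, x = y0 + dy, x0 + dx
            if not (0 <= y < gray.shape[0] and 0 <= x < gray.shape[1]):
                continue
            ring = min(int(r / radius * n_rings), n_rings - 1)       # which concentric ring
            b = int(ang[y, x] / (2 * np.pi) * n_bins) % n_bins       # which orientation bin
            w = np.exp(-(r ** 2) / (2 * sigma ** 2))                 # Gaussian weighting
            desc[ring, b] += w * mag[y, x]

    vec = desc.reshape(-1)                                # inside-out ring order -> 180-D vector
    return vec / (np.linalg.norm(vec) + 1e-8)             # normalization step

gray = np.random.rand(256, 256)
d = ring_descriptor(gray, kp=(128, 128))
print(d.shape)   # (180,)
```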

However, expanding the descriptor also increases computational complexity, especially in feature matching and storage, which may introduce additional burden. To balance precision with computational efficiency, and to account for the differences in homogeneous features caused by the different noise levels in underwater images from different viewpoints, this paper builds on the encoder-decoder structure designed in Sections 3.1 and 3.2. This network performs semantic segmentation of both unoccluded and occluded cracks in the dam body and generates the corresponding masks. The location information provided by the unoccluded crack mask is used to guide the construction of the scale space, allowing the network to quickly focus on regions with significant crack features. This enables efficient feature point detection and significantly reduces unnecessary computation. Feature point matching is then performed using the improved SIFT descriptors, and, based on the matching relationships, a homography transformation is applied to the new viewpoint image. The specific calculation is as follows:

(8a) \begin{equation} p' = H \cdot p \end{equation}

where $ p = [x, y, 1]^T$ represents a point in the original view, expressed in homogeneous coordinates (with the third element being 1), the transformed point is denoted as $ p' = [x', y', 1]^T$ . The relationship between the original and transformed points is described by the homography matrix $ H$ , a $ 3 \times 3$ matrix that governs the perspective transformation between the two views.
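A small sketch of applying Eq. (8a) to a point is shown below; the homography values are placeholders standing in for a matrix estimated from the matched feature points.

```python
import numpy as np

def warp_point(H: np.ndarray, p_xy: tuple) -> tuple:
    """Apply Eq. (8a), p' = H . p, in homogeneous coordinates;
    the result is divided by its third component."""
    p = np.array([p_xy[0], p_xy[1], 1.0])
    q = H @ p
    return (q[0] / q[2], q[1] / q[2])

# An assumed homography obtained from the matched keypoints; the values are placeholders.
H = np.array([[1.02, 0.01, 5.0],
              [0.00, 0.98, -3.0],
              [1e-5, 0.00, 1.0]])
print(warp_point(H, (120.0, 80.0)))
```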

Finally, under the guidance of the occluded-crack mask, an adaptive weighted multi-feature fusion of the occluded features in the original view is performed using the new views. Specifically, there are $ N$ views, where $ V_1$ represents the original view and $ V_i$ ( $ i \in [2, N]$ ) represents the new views. In each view, if the pixel at position $ p$ is occluded, the occlusion mask is set to $ M_i(p) = 1$ ; otherwise $ M_i(p) = 0$ . If the position $ p$ in $ V_1$ is occluded ( $ M_1(p) = 1$ ), we use the unoccluded regions of the other views to perform the recovery. This means that during the weighted averaging only the feature values at corresponding positions in other views that are not occluded are involved in the recovery, as expressed in the following equation:

(9a) \begin{equation} V_i(p) = \begin{cases} V_i(p), & \text{if } M_i(p) = 0 \\ 0, & \text{if } M_i(p) = 1 \end{cases} \end{equation}

where $ V_i(p)$ represents the feature value at position $ p$ in the $ i$ -th view and $ W_i$ is the weight assigned to each view. The less occlusion there is in the new view, the greater its contribution to restoring the original view. Therefore, we introduce a weighting coefficient $ W_i(p)$ , which is defined as follows:

(10a) \begin{equation} W_i(p) = \frac {1 - M_i(p)}{\sum _{j=1}^N (1 - M_j(p))} \end{equation}

where $(1 - M_i(p))$ ensures that only the views in which the position $ p$ is not obscured contribute to the weighting. This method effectively solves the problem that hidden and non-hidden features cannot be merged. Finally, for all views, the uncovered feature values at position $ p$ are weighted and averaged point by point to generate the repaired feature $ V$ . The final repair result can be expressed as follows:

(11a) \begin{equation} V(p) = \begin{cases} V_1(p), & \text{if } M_1(p) = 0 \\ \sum _{i=2}^N W_i(p) \cdot V_i(p) \cdot (1 - M_i(p)), & \text{if } M_1(p) = 1 \end{cases} \end{equation}

where $ V(p)$ represents the repaired feature value at position $ p$ . In particular, for the visible area in the original view $ V_1$ (where $ M_1(p) = 0$ ), the original feature is retained. For the occluded region (where $ M_1(p) = 1$ ), the occluded feature is adaptively repaired by averaging the unoccluded features from other viewpoints using a weighted fusion approach.
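The occlusion-guided weighted fusion of Eqs. (9a)-(11a) can be sketched as follows; the array shapes, the handling of pixels occluded in every view, and the random masks in the usage example are illustrative assumptions.

```python
import numpy as np

def fuse_views(views: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Sketch of Eqs. (9a)-(11a). views: (N, H, W) aligned feature maps, view 0 is the
    original; masks: (N, H, W) with 1 = occluded, 0 = visible. Pixels occluded in every
    view keep their original values in this sketch."""
    visible = 1.0 - masks[1:]                                  # Eq. (9a): drop occluded values in new views
    denom = visible.sum(axis=0)                                # sum over views of (1 - M_j(p))
    weights = np.where(denom > 0, visible / np.maximum(denom, 1e-8), 0.0)   # Eq. (10a)
    repaired = (weights * views[1:] * visible).sum(axis=0)                  # Eq. (11a), occluded case
    fixable = (masks[0] > 0) & (denom > 0)                     # occluded in V1 and visible elsewhere
    return np.where(fixable, repaired, views[0])               # keep original features otherwise

views = np.random.rand(3, 64, 64)                              # original + two new viewpoints
masks = (np.random.rand(3, 64, 64) > 0.8).astype(float)        # random occlusion masks for illustration
fused = fuse_views(views, masks)
```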

This paper proposes a multi-view image feature fusion method that improves the SIFT algorithm by expanding the descriptor dimension to achieve more precise encoding of local details. Combined with a semantic segmentation network, the method generates occlusion masks to optimize the detection and assignment of feature points. After aligning the views using the homography matrix, the method adaptively repairs the occluded regions using non-occluded features and finally generates a fused feature representation. This approach effectively solves the problem of merging homogeneous feature information affected by different noise levels from different viewpoints, thereby enabling hidden feature recovery.

3.4. Loss function

To ensure that the algorithm can be trained effectively, selecting an appropriate loss function is crucial. In the case of an unbalanced dataset, using Dice Loss can result in significant fluctuations. Prediction errors with small targets can result in sharp loss value changes, leading to severe gradient instability. Given that cracks in images are typically slender and elongated, making them challenging to segment as small targets, a combination of BCE_Loss and Dice_Loss was chosen. This combination enables the network to focus more effectively on foreground regions while mitigating the instability of Dice_Loss during training. The specific formulas are as follows:

(12a) \begin{align}& {BCE\_Loss} = -(1 - m) \log (1 - t) - m \log (t) \end{align}
(13a) \begin{align}&\qquad\quad\,\, {Dice\_Loss} = 1 - \frac {2y \hat {s} + 1}{y + \hat {s} + 1} \end{align}
(14a) \begin{align}&\qquad\,\, \text{Loss} = {BCE\_Loss} + {Dice\_Loss} \end{align}

where $m$ represents the ground truth labels and $t$ is the predicted probability in the BCE_Loss function. BCE_Loss measures the discrepancy between the predicted values ( $t$ ) and the true labels ( $m$ ). In Dice_Loss, $\hat {s}$ is the model’s estimated probability, and $y$ is the label. Dice_Loss assesses the overlap between the predicted and actual regions. Since binary classification is a problem involving 0s and 1s, the final predicted values also fall between 0 and 1. To categorize predictions into two classes, we need to establish a threshold. Since crack segmentation focuses solely on differentiating between crack areas and the background, this represents a binary classification challenge. We utilize Dice_Loss with a threshold of 1 to facilitate this distinction. The combined loss function we constructed takes into account both the accuracy of the model’s predictions and the precision of the predicted regions, further enhancing segmentation performance.
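A minimal PyTorch sketch of the combined loss in Eqs. (12a)-(14a) is given below, assuming the network outputs logits; the reduction over the batch is an implementation choice.

```python
import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """Sketch of Eqs. (12a)-(14a): binary cross-entropy plus Dice loss with +1 smoothing."""
    def __init__(self):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        prob = torch.sigmoid(logits)
        bce = self.bce(logits, target)                                # Eq. (12a)
        inter = (prob * target).sum(dim=(1, 2, 3))
        dice = 1.0 - (2.0 * inter + 1.0) / (prob.sum(dim=(1, 2, 3))
                                            + target.sum(dim=(1, 2, 3)) + 1.0)   # Eq. (13a)
        return bce + dice.mean()                                      # Eq. (14a)

criterion = BCEDiceLoss()
loss = criterion(torch.randn(4, 1, 256, 256),
                 torch.randint(0, 2, (4, 1, 256, 256)).float())
```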

4. Experiment setting

The experimental setup in this paper consists of three key components: evaluation metrics, experimental datasets, and model implementation details. First, the evaluation metrics used to assess the performance of the proposed model are outlined. Second, the datasets employed in the experiments are discussed. Lastly, an explanation of the model implementation is provided, including the hardware and software used during the experiments. We adopt the standard evaluation metrics for semantic segmentation [Reference Diao, Su, Yang, Zhu, Xiang, Chen and Shi30], including mean intersection over union (mIoU), recall (Re), accuracy (Acc), and F1 score (F1). In addition, to evaluate computational complexity, this paper compares the proposed algorithm with the leading models in terms of the number of floating-point operations (FLOPs) and the number of parameters (Params) to comprehensively evaluate the performance of the algorithm.

4.1. Dataset

The self-constructed dataset used in this paper consists of crack images of a submerged dam in a reservoir in Zhejiang Province. The images present challenges such as complex underwater non-uniform noise and occlusion caused by aquatic plants, making the dataset a highly relevant evaluation benchmark. To facilitate analysis, the dataset adopts a multi-view setup for each crack, comprising three images: one original view and two additional perspectives. These three images are captured from different viewpoints of the same dam crack. The dataset is then categorized into three underwater cases by filtering and collating the images: the Weak Non-Uniform Noise Underwater Occlusion Dataset (UDODW), the Strong Non-Uniform Noise Underwater Occlusion Dataset (UDODS), and the Aquatic Plant Occlusion Dataset (UDPOD). The first two datasets, UDODW and UDODS, use randomly generated black blocks to simulate occlusion from various virtual perspectives, while the UDPOD dataset contains images of underwater dam cracks occluded by actual aquatic plants. Finally, after filtering, the three datasets, with 360 image sets in total, are organized into underwater multi-view occlusion datasets. A visual representation of these datasets is shown in Figure 6.

4.2. Experimental environment

The dataset used in this study has a resolution of 256 $\times$ 256. The network model is built using the PyTorch deep learning framework and trained on a machine with an AMD EPYC 7282 processor, 250 GB of memory, and an NVIDIA A100 80 GB PCIe GPU; training is carried out on the GPU. The Adam optimizer [Reference Liu, Wang, Chen, Huangliang and Zhang31] is employed with an initial learning rate of 0.001, and a cosine annealing strategy with warm restarts is applied to dynamically adjust the learning rate: the first restart, at epoch 20, restores the initial learning rate, and each subsequent restart period doubles the previous one. The network is trained for a total of 120 epochs with a batch size of 4.
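The optimizer and learning-rate schedule described above can be configured as in the following sketch; the `model` here is a placeholder module standing in for MVFD-Net, and the training loop body is omitted.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# Adam with initial learning rate 0.001 and cosine annealing with warm restarts:
# first restart at epoch 20, each subsequent period doubled (T_mult=2), 120 epochs in total.
model = torch.nn.Conv2d(3, 1, 3, padding=1)     # placeholder for the actual network
optimizer = Adam(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=20, T_mult=2)

for epoch in range(120):
    # ... forward pass, loss computation, optimizer.step() for each batch of size 4 ...
    scheduler.step()                            # advance the cosine schedule once per epoch
```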

5. Experiments and results

5.1. Comparative experiments

In this section, the UDPOD dataset is randomly divided into a training set, a validation set, and a test set in the ratio of 7:2:1. The proposed algorithm is compared with five other algorithms, namely CrackFormer-II [Reference Liu, Yang, Miao, Mertz and Kong32], CarNet [Reference Li, Yang, Ma, Wang and Wang33], SegNet [Reference Badrinarayanan, Kendall and Cipolla34], OSLPNet [Reference Zhang, Dai, Song, Zhao and Zhang12], and UISS-Net [Reference He, Cao, Luo, Xu, Tang, Xu and Chen35]. These compared algorithms are all state-of-the-art segmentation networks in their respective fields, with UISS-Net [Reference He, Cao, Luo, Xu, Tang, Xu and Chen35] being a segmentation network designed for underwater scenarios, and OSLPNet [Reference Zhang, Dai, Song, Zhao and Zhang12] being an occlusion-resistant segmentation network. We also use the recommended parameter settings and run the source code provided by the authors to achieve the best results for each method. Comparative experiments are conducted on the UDODW, UDODS, and UDPOD datasets to demonstrate the superiority of the proposed network model. Additionally, ablation experiments are performed on the UDPOD dataset to assess the contribution of each module. The results of these experiments are as follows:

Table IV. Comparison of segmentation models on the UDODW dataset.

Figure 6. Illustration of the images in the three datasets.

5.1.1 Comparative experimental results analysis

Quantitative analysis: From the comparative evaluation results, it is evident that the algorithm proposed in this paper performs exceptionally well across all three datasets, achieving significant improvements in several key evaluation metrics. In Table IV, although the accuracy (Acc) of the proposed algorithm on the UDODW dataset is slightly lower than that of OSLPNet, it surpasses OSLPNet by 2.04% in mean intersection over union (mIoU). This indicates that the proposed algorithm is better at distinguishing between cracked and non-cracked regions. Additionally, the F1 score of the proposed algorithm is 1.29% higher than the second-ranked algorithm, demonstrating its improved ability to identify cracked regions while minimizing false positives for non-cracked areas. Although the proposed algorithm requires 0.26 G more FLOPs than OSLPNet, its total of only 1.07 G is still far lower than that of the other compared algorithms. Additionally, the number of parameters is just 4.09 M, which is suitable for deployment on mobile devices such as underwater robots. These results demonstrate that the proposed network can be efficiently deployed on mobile devices, offering low computational complexity while ensuring excellent recognition performance under challenges such as occlusion and noise. On the UDODS dataset, shown in Table V, the proposed algorithm improves mIoU by 1.45% and the F1 score by 0.93%. These results indicate that the algorithm exhibits strong robustness in complex environments with significant noise, effectively handling interference from highly noisy scenes. Table VI further confirms the exceptional performance of the proposed algorithm in environments obscured by aquatic plants. All evaluation metrics of the proposed algorithm outperform those of the other five compared algorithms. Specifically, the accuracy (Acc) and mIoU are improved by 2.32% and 2.15%, respectively, over the second-place algorithm, while the F1 score reaches 95.57%. This highlights the algorithm’s excellent balance between segmentation precision and recall. These results provide strong evidence that the proposed network excels in tackling challenges such as occlusion and noise.

Table V. Comparison of segmentation models on the UDODS dataset.

Table VI. Comparison of segmentation models on the UDPOD dataset.

Qualitative analysis: Figures 7 and 8 show that the proposed network improves consistently throughout the training process, with all evaluation metrics outperforming those of the compared algorithms. The loss decreases smoothly without significant fluctuations, indicating that the network exhibits strong learning ability and convergence. As shown in Figure 9, the proposed algorithm demonstrates robust performance on both the UDODW and UDODS datasets, accurately segmenting the target region. Although the OSLPNet and UISS-Net algorithms also perform well, their segmentation masks display noticeable breakpoints and under-segmentation in regions where cracks are obscured. In contrast, the segmentation masks of the SegNet and CrackFormer-II algorithms are generally wider than the actual crack, exhibiting significant noise and over-segmentation. This results in a higher number of false positives (FPs), where non-cracked regions are misclassified as cracked regions. For the UDPOD dataset, the real hydrilla occlusion better reflects the semantic correlation between objects, demanding stronger semantic reasoning and context-awareness from the algorithm. As shown in Figure 9, the proposed algorithm still effectively segments the crack features in the occluded areas. In contrast, the other algorithms can only rely on the visible crack region to predict the occluded crack, resulting in noticeable breakpoints in the segmentation mask and an inability to accurately segment the occluded cracks.

Figure 7. Illustration of the changes in loss during the training stage.

Figure 8. Illustration of the changes in mIoU, Acc, Re, and F1 during the training stage.

Figure 9. Illustration of segmentation results of comparative experiments on different datasets.

As shown in Figure 9(g), in this crack scene, the homogenization of the background and crack features is even more pronounced due to over-illumination, low image contrast, and the inherent noise of the underwater environment. This makes it difficult for the algorithm to accurately distinguish between the occluded and unoccluded parts. Compared with the other algorithms, the proposed algorithm effectively avoids excessive noise and over-segmentation and performs better. Despite a few disconnected under-segmentations in the occluded areas, the algorithm still captures the overall morphology of the cracks more effectively. This suggests that, in addition to improving the recognition algorithm, the simultaneous optimization of the accompanying underwater imaging equipment is also crucial and worthy of further consideration.

In summary, the algorithm proposed in this paper not only demonstrates excellent segmentation performance but also maintains good robustness in complex environments where dam cracks are occluded by aquatic plants, effectively addressing the occlusion problem in the recognition of dam cracks with different morphologies.

5.2. Ablation experiments

To evaluate the effectiveness of the FRI-Encoder, MGAF module, and MVFF module in segmenting occluded cracks in submerged dams, this paper conducts ablation experiments on the UDPOD dataset. The ablation experiments are designed as follows: Baseline, Proposed + w/o FRI-Encoder, Proposed + w/o MGAF, and Proposed + w/o MVFF.

Table VII. Ablation experimental results on UDPOD dataset.

5.2.1 Ablation experimental results analysis

The results of the ablation experiments are presented in Table VII. Since the baseline model is not specifically optimized for the occlusion problem under underwater non-uniform scattering conditions, the algorithm proposed in this paper addresses the target loss issue in occluded areas by incorporating new viewpoint features. Consequently, compared to the baseline, the proposed algorithm demonstrates substantial improvements across all evaluation metrics. Notably, the mean Intersection over Union (mIoU) improves by 4.39%, and the F1 score increases by 5.53%, which underscores the enhanced segmentation stability of the model.

Further analysis of the impact of removing each design module reveals a significant decline in performance metrics compared to the full model. First, when the multi-view feature fusion (MVFF) module is removed, mIoU and accuracy decrease by 1.99% and 2.82%, respectively. This suggests that the MVFF module plays a crucial role in effectively fusing multi-view feature information and enhancing segmentation performance under occlusion. Second, the removal of the FRI-Encoder results in a marked decrease in evaluation metrics, with mIoU and F1 dropping by 4.29% and 3.86%, respectively. This indicates that the FRI-Encoder significantly mitigates the challenges of feature extraction caused by underwater nonuniform noise, thus improving the model’s robustness. Finally, the removal of the MGAF module leads to a decrease of 0.81% in mIoU and 1.16% in F1, further demonstrating the contribution of the MGAF module in enhancing model accuracy. Visual results presented in Figure 10 corroborate these findings. When the MVFF module is omitted, the segmentation results shown in Figure 10(e) reveal poor crack identification in the occluded region. Due to the lack of feature information from the occluded area, the network attempts to infer features based on crack connectivity but is limited in its ability to perform effective recognition. This highlights the critical role of the MVFF module in addressing the occlusion problem by compensating for missing information through multi-view feature fusion. In contrast, the segmentation of local details (e.g., texture information such as edges and shapes) improves significantly when the FRI-Encoder module is introduced, as shown in Figure 10(d) and (e). In Figure 10(c), where the FRI-Encoder module is absent, the crack boundaries are noticeably under-segmented, underscoring the encoder’s role in alleviating feature extraction difficulties caused by underwater non-uniform noise. When only the MGAF module is removed, the segmentation results shown in Figure 10(d) reveal obvious breakpoints in the crack boundaries, indicating the loss of information during the up-sampling stage. In contrast, the comparisons in Figure 10(f) and Figure 10(c) demonstrate that the MGAF module effectively compensates for information loss during the up-sampling process, mitigating segmentation discontinuities and improving both segmentation accuracy and coherence.

Overall, the visual and quantitative results of the ablation experiments clearly demonstrate the effectiveness of the FRI encoder, MGAF module, and MVFF module. Furthermore, they highlight that the proposed MVFD-Net model offers a significant advantage in comprehensive segmentation capability, effectively addressing occlusion and noise challenges, and providing stable and accurate segmentation results in complex underwater environments.

Figure 10. Illustration of segmentation results of ablation experiments on the UDPOD dataset.

5.3. Practical application and result analysis

5.3.1 Introduction to remotely operated vehicles

In this paper, the BlueROV2 underwater robot, shown in Figure 11, is employed for practical application validation. The specific parameters of the robot are provided in Table VIII.

Table VIII. Specific parameters of the BlueROV2 underwater robot.

Figure 11. BlueROV2 underwater robot.

5.3.2 Application validation settings

In this section, we deploy the optimal model weights—trained on the UDPOD dataset—onto an underwater robot that is tethered via cable to an offshore mobile display unit. Under full-power, uniform illumination, the robot conducts field tests along the embankment of Dingguo Lake in Xinxiang City, Henan Province. To evaluate the algorithm’s performance, we selected three crack scenarios (Test 1, Test 2, and Test 3). During each scenario, the robot sequentially captures images of the same occluded crack from three positions (P0, P1, and P2), yielding three distinct viewpoints (Image 1, Image 2, and Image 3). As illustrated in Figure 12, we annotate the robot’s acquisition positions corresponding to each image across all test scenarios.

Figure 12. Illustration of relative positions of the underwater robot. The red circle denotes the location of the dam crack and the yellow guide line represents the direction of the robot’s view.

In this study, the underwater robot is remotely operated from an offshore unit and maneuvered to a position at a distance $r$ from the embankment crack area. At this location, the robot autonomously calculates multiple image acquisition viewpoints to capture crack images. To mitigate the impact of uneven viewpoint distribution on testing accuracy, a viewpoint constraint model is established to ensure that the acquisition points for each crack are both accurate and reasonably distributed.

Due to the highly nonlinear motion behavior of the robot in complex underwater environments, the robot’s depth and orientation in the vertical direction are adaptively adjusted according to the acquisition task requirements and terrain constraints. This ensures that the camera’s optical axis remains focused on the embankment cracks. On the horizontal plane, the multi-viewpoint positions are uniformly distributed by applying a viewpoint distribution constraint model.

Specifically, the camera position of the underwater robot in space is treated as the origin, while the observable range of the crack is abstracted as a semicircular region centered at the crack center with radius $r$ . A total of $n$ acquisition points are deployed along this semicircular arc to evenly cover the visible range of the crack. Let the initial azimuth angles of the acquisition points satisfy:

(15a) \begin{equation} \phi _S \lt \phi _0(0) \lt \phi _1(0) \lt \phi _2(0) \lt \cdots \lt \phi _n(0) \lt \phi _E, \quad (3 \leq n \in \mathbb{N}^*) \end{equation}

where $\phi _S$ and $\phi _E$ denote the start and end boundaries of the arc, respectively.

In this section, for illustration purposes, three angle ranges are selected: the first quadrant $[0, \frac {\pi }{2}]$ , the second quadrant $[\frac {\pi }{2}, \pi ]$ , and the combined first and second quadrants $[0, \pi ]$ , which correspond to the three test scenarios, Test 1, Test 2, and Test 3 shown in Figure 12. The distribution of the multi-viewpoint acquisition points along the semicircular arc with $\phi _S = 0$ and $\phi _E = \pi$ is illustrated in Figure 13.

Figure 13. Schematic layout of underwater robotic imaging positions centered on dam cracks.

The discrete evolution equation for the deployment of multi-viewpoint acquisition points is defined as follows:

(16a) \begin{align}& \phi _i(k+1) = \phi _i(k) + u_i(k) \end{align}
(17a) \begin{align}&\quad u_i(k) = -t \cdot \frac {\partial T(k)}{\partial \phi _i(k)} \end{align}

where $k$ denotes the iteration step, $\phi _i(k)$ is the azimuth angle of the underwater robot at acquisition point $P_i$ in step $k$ , and $t$ is the learning rate, set to $0.1$ . $T(k)$ represents the cost function, and $u_i(k)$ is the negative gradient-based control law used to update the robot’s acquisition point based on $T(k)$ (detailed later in this section).

Let $P_{i+1}$ denote the left neighbor of acquisition point $P_i$ , and let $L_i$ denote the angular distance (arc length) between $P_i$ and $P_{i+1}$ along the clockwise direction. This distance is computed as

(18a) \begin{equation} L_i(k) = \phi _{i+1}(k) - \phi _i(k), \quad 0 \leq i \leq n - 1 \end{equation}

To ensure both uniform distribution and sufficient angular coverage of the embankment crack from multiple viewpoints, we introduce two constraint metrics: consistency, quantified by the sum of absolute differences between neighboring distances, and coverage, defined by the deviation of the minimum neighbor distance from the ideal uniform spacing.

Accordingly, the cost function for the entire multi-viewpoint distribution is defined as:

(19a) \begin{equation} T(k) = \alpha \cdot \sum _{i=0}^{n-1} \left | L_{i+1}(k) - L_i(k) \right | + \beta \left ( \frac {\phi _E - \phi _S}{n - 1} - \min \left ( L_i(k) \right ) \right ) \end{equation}

where $T(k) \geq 0$ , and $\alpha$ and $\beta$ are weighting coefficients corresponding to distribution uniformity and angular coverage, respectively. In this study, both are set to $0.5$ . When $T(k)$ tends to 0, the viewpoints are optimally and uniformly distributed along the semicircular arc and achieve complete angular coverage. Taking into account environmental constraints, a practical convergence threshold of $T(k) = 0.2$ is used to terminate the iteration and determine the final configuration of the acquisition point.
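The viewpoint-distribution update of Eqs. (16a)-(19a) can be sketched with a numerical gradient as below; the neighbor-distance indexing and the finite-difference gradient are simplifications of the analytical control law stated above.

```python
import numpy as np

def cost(phi: np.ndarray, phi_S: float, phi_E: float,
         alpha: float = 0.5, beta: float = 0.5) -> float:
    """Sketch of Eq. (19a): uniformity of neighbor gaps plus angular coverage.
    phi holds the acquisition-point azimuths; indexing is simplified relative to the paper."""
    L = np.diff(np.sort(phi))                               # neighbor angular distances, Eq. (18a)
    uniformity = np.abs(np.diff(L)).sum()                   # sum of |L_{i+1} - L_i|
    coverage = (phi_E - phi_S) / (len(phi) - 1) - L.min()   # ideal spacing minus smallest gap
    return alpha * uniformity + beta * coverage

def optimize_viewpoints(phi0: np.ndarray, phi_S: float, phi_E: float,
                        t: float = 0.1, tol: float = 0.2, max_iter: int = 500) -> np.ndarray:
    """Negative-gradient update of Eqs. (16a)-(17a), using a finite-difference gradient."""
    phi = phi0.astype(float).copy()
    h = 1e-4
    for _ in range(max_iter):
        if cost(phi, phi_S, phi_E) <= tol:                  # practical convergence threshold T(k) = 0.2
            break
        grad = np.zeros_like(phi)
        for i in range(len(phi)):                           # dT/dphi_i by central differences
            d = np.zeros_like(phi); d[i] = h
            grad[i] = (cost(phi + d, phi_S, phi_E) - cost(phi - d, phi_S, phi_E)) / (2 * h)
        phi = np.clip(phi - t * grad, phi_S, phi_E)         # phi_i(k+1) = phi_i(k) - t * dT/dphi_i
    return np.sort(phi)

phi0 = np.array([0.3, 0.5, 2.0])                            # three initial azimuths in [0, pi]
print(optimize_viewpoints(phi0, 0.0, np.pi))
```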

Meanwhile, the underwater robot is equipped with a high-precision inertial measurement unit, a sonar-based localization and obstacle-avoidance module to realize safe and precise movement from one point to the next. Navigation control is divided into two steps: Path generation and Position revision.

Path generation: The current acquisition angle is $\phi _i$ and the next is $\phi _{i+1}$ , so the planar coordinates at $P_i$ are given by $x_i = r\cos \phi _i$ and $y_i = r\sin \phi _i$ . We divide the angular distance $L_i = \phi _{i+1} - \phi _i$ into $I = L_i / \sigma$ segments, where $\sigma$ is the robot’s minimum response step. Discrete trajectory points are then:

(20a) \begin{equation} O_j = \bigl ((1-\mu _j)x_i + \mu _j\,x_{i+1},\;(1-\mu _j)y_i + \mu _j\,y_{i+1}\bigr ) \end{equation}

For $j=0,1,\ldots ,I$ with $\mu _j = j/I$ , yielding a smooth sequence $\{O_0,\ldots ,O_I\}$ from $O_0=P_i$ to $O_I=P_{i+1}$ . Between each $O_j$ and $O_{j+1}$ , the robot advances at constant speed.

Position revision: After reaching $O_I=(x_{i+1},y_{i+1})$ , the sonar localization provides the actual position $(x_{\textrm{real}},y_{\textrm{real}})$ and the errors:

(21a) \begin{equation} e_x = x_{i+1}-x_{\textrm{real}}, \quad e_y = y_{i+1}-y_{\textrm{real}} \end{equation}

The correction step is $\Delta \mathbf{p} = K_p\,(e_x, e_y)^{\mathsf{T}}$ with $K_p=5$ . If $\sqrt {e_x^2+e_y^2} \gt \varepsilon _p$ ( $\varepsilon _p$ denotes the position error threshold, which is set to 0.05 m), the robot iteratively applies $\Delta \mathbf{p}$ until $\sqrt {e_x^2+e_y^2}\le \varepsilon _p$ . Once positional accuracy is met, the robot adjusts its attitude so the camera’s optical axis points precisely at the crack center and completes image acquisition.
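The two navigation steps can be sketched as follows; `read_position` and `move` are hypothetical stand-ins for the sonar localization and thruster interfaces, and the actuator model in the usage example is illustrative.

```python
import numpy as np

def waypoints(phi_i: float, phi_next: float, r: float, sigma: float) -> np.ndarray:
    """Sketch of Eq. (20a): linear interpolation between the planar coordinates of two
    acquisition points, split into segments of the robot's minimum response step sigma."""
    p_i = np.array([r * np.cos(phi_i), r * np.sin(phi_i)])
    p_n = np.array([r * np.cos(phi_next), r * np.sin(phi_next)])
    I = max(int(np.ceil((phi_next - phi_i) / sigma)), 1)
    mu = np.linspace(0.0, 1.0, I + 1)[:, None]              # mu_j = j / I
    return (1.0 - mu) * p_i + mu * p_n                       # O_0 = P_i, ..., O_I = P_{i+1}

def position_revision(target: np.ndarray, read_position, move,
                      K_p: float = 5.0, eps_p: float = 0.05, max_steps: int = 50) -> bool:
    """Sketch of Eq. (21a) and the proportional correction loop; read_position and move
    are hypothetical hardware interfaces (sonar localization and thruster commands)."""
    for _ in range(max_steps):
        e = target - read_position()                         # (e_x, e_y)
        if np.hypot(*e) <= eps_p:                            # within the 0.05 m error threshold
            return True
        move(K_p * e)                                        # delta_p = K_p * (e_x, e_y)^T
    return False

# Usage with a crude simulated actuator: the robot drifts and the correction loop pulls it back.
pos = np.array([2.9, 0.1])
ok = position_revision(np.array([3.0, 0.0]),
                       read_position=lambda: pos,
                       move=lambda dp: pos.__iadd__(0.01 * dp))
print(waypoints(0.0, np.pi / 2, r=3.0, sigma=0.1).shape, ok)
```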

Finally, the multi-view dam crack images (Image 1, Image 2, and Image 3) are fused and processed by the pretrained weights deployed on the underwater robot to recognize dam cracks. The entire recognition pipeline operates at 25 fps, satisfying the real-time detection requirements. The recognition results are then transmitted to an offshore mobile computing device via a wired connection.

5.3.3 Application validation results and analysis

Figure 14 illustrates the real-time segmentation results produced by the proposed algorithm for three test scenarios. Figure 14(d) presents the crack labels corresponding to Figure 14(a), while Figure 14(e) displays the predicted segmentation masks generated by the proposed algorithm for Figure 14(a). Figure 14(f) shows the superimposed results of these predicted masks on the original image in Figure 14(a).

Figure 14. Illustration of segmentation results of the proposed algorithm in three test scenarios.

In Test 1, which contains cracks with relatively complex shapes, the predicted mask is compared with the crack labels corresponding to Figure 14(a). Although minor under-segmentation occurs in regions with complex crack geometry, the overall crack structure is segmented with high accuracy, and cracks in the occluded regions are well recognized. In Test 2, the algorithm accurately segments fine cracks, demonstrating strong robustness in handling fine-grained detail. In Test 3, where the cracks are heavily obscured by aquatic plants, the predicted segmentation masks exhibit minor noise (i.e., some non-crack areas are misclassified as cracks). Nonetheless, the final segmentation results effectively capture the overall morphology of the cracks.

Experimental results demonstrate that the algorithm proposed in this paper exhibits strong accuracy and robustness in segmenting underwater occluded dam cracks in practical application scenarios.

6. Conclusions

In underwater dam crack detection, cracks are often obscured by aquatic plants, and underwater turbulence causes nonuniform diffusion of suspended sediments, resulting in varying degrees of feature submergence across different viewpoints. To address these challenges, we propose MVFD-Net, a multi-view fusion detection network for occluded underwater dam crack detection. First, the FRI-Encoder integrates multi-scale local features extracted by a CNN with global representations from a transformer encoder and performs feature reconstruction fusion at the encoder output to suppress nonuniform scattering noise. Second, we introduce the MGAF module, which employs a pyramid structure to perform gated feature fusion between the encoder and decoder, thereby recovering lost details. Finally, within the segmentation network, we design an MVFF module that incorporates features from additional viewpoints to repair occluded regions, enhancing crack integrity and recognition accuracy. We validate MVFD-Net on a self-constructed dataset, demonstrating its superior generalization and significantly improved segmentation performance under aquatic plant occlusion. Future work will focus on optimizing the device functions that accompany the algorithm and developing quantitative crack analysis methods to provide a more scientific basis for crack repair.

Author contributions

Yukai Wu conceived the study, performed the core experiments, and drafted the manuscript; Xiaochen Qin supervised the experimental validation and optimized the article content; Lei Cai guided the direction of the paper, provided the experimental equipment platform, and finalized the academic interpretation.

Funding

This work was supported by Henan Provincial focus on research and development Project (231111220700).

Financial support

This research received no specific grant from any funding agency, commercial, or not-for-profit sectors.

Competing interests

The authors declare that no conflicts of interest exist.
