
NLCA-Net: a non-local context attention network for stereo matching

Published online by Cambridge University Press:  07 July 2020

Zhibo Rao
Affiliation:
Northwestern Polytechnical University, Xi'an 710129, China
Mingyi He*
Affiliation:
Northwestern Polytechnical University, Xi'an 710129, China
Yuchao Dai
Affiliation:
Northwestern Polytechnical University, Xi'an 710129, China
Zhidong Zhu
Affiliation:
Northwestern Polytechnical University, Xi'an 710129, China
Bo Li
Affiliation:
Northwestern Polytechnical University, Xi'an 710129, China
Renjie He
Affiliation:
Northwestern Polytechnical University, Xi'an 710129, China; Nanyang Technological University, Singapore 639798, Singapore
*
Corresponding author: Mingyi He. Email: myhe@nwpu.edu.cn

Abstract

Accurate disparity prediction is an active topic in computer vision, and efficiently exploiting contextual information is the key to improving performance. In this paper, we propose a simple yet effective non-local context attention network that exploits global context information for stereo matching by using attention mechanisms and semantic information. First, we develop a 2D geometry feature learning module to obtain a more discriminative representation by taking advantage of multi-scale features, and form them into a variance-based cost volume. Then, we construct a non-local attention matching module using the non-local block and hierarchical 3D convolutions, which can effectively regularize the cost volume and capture the global contextual information. Finally, we adopt a geometry refinement module to refine the disparity map and further improve performance. Moreover, we add a warping loss to help the model learn the matching rule of non-occluded regions. Our experiments show that (1) our approach achieves competitive results on the KITTI and SceneFlow datasets in terms of end-point error and the fraction of erroneous pixels $({D_1})$; and (2) our method performs particularly well in reflective regions and occluded areas.
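The variance-based cost volume mentioned above can be illustrated with a minimal sketch. This is an assumption-laden toy version in NumPy (the paper's model operates on learned multi-scale CNN features): for each candidate disparity, the right feature map is shifted horizontally and the per-pixel variance across the two views forms the matching cost.

```python
import numpy as np

def variance_cost_volume(left_feat, right_feat, max_disp):
    """Toy variance-based cost volume (illustrative sketch, not the paper's code).

    left_feat, right_feat: arrays of shape (H, W, F).
    Returns a volume of shape (max_disp, H, W, F): slice d holds the variance
    of the left feature and the right feature shifted right by d pixels.
    """
    H, W, F = left_feat.shape
    volume = np.zeros((max_disp, H, W, F), dtype=left_feat.dtype)
    for d in range(max_disp):
        shifted = np.zeros_like(right_feat)
        if d == 0:
            shifted = right_feat
        else:
            shifted[:, d:, :] = right_feat[:, :-d, :]
        stacked = np.stack([left_feat, shifted], axis=0)
        volume[d] = stacked.var(axis=0)  # variance over the two views
    return volume
```

For two views the variance reduces to $(a-b)^2/4$, so a perfect match yields zero cost, which is why the cost minimum indicates the true disparity.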

Information

Type
Original Paper
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s), 2020. Published by Cambridge University Press in association with APSIPA Transactions on Signal and Information Processing.

Fig. 1. Our end-to-end deep stereo regression architecture, NLCA-Net (Non-Local Context Attention network). Our model consists of three modules: the 2D geometry feature learning (GFL) module, the non-local attention matching (NLAM) module, and the geometry refinement (GR) module.


Fig. 2. The 2D geometry feature learning module (GFL). $x \times x, s, f$ denote the convolution kernel size, stride, and number of convolution filters, respectively. ${\times} n$ denotes that the block is repeated $n$ times.


Fig. 3. The non-local attention matching module (NLAM). The NLAM module consists of a feature matching part and a scale recovery part. Note that feature maps are shown by their dimensions, e.g. $D \times H \times W \times F$ means a feature map with disparity number $D$, height $H$, width $W$, and feature number $F$. Here, $L\ast$ denotes different scale levels of the feature maps.
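The non-local block at the heart of the NLAM module can be sketched as follows. This is a simplified NumPy illustration under strong assumptions: positions are flattened into one axis, and the learned $1{\times}1{\times}1$ projection convolutions (theta/phi/g in the non-local literature) are replaced with the identity, leaving only the embedded-Gaussian attention plus the residual connection.

```python
import numpy as np

def non_local_block(x):
    """Simplified non-local operation (illustrative sketch).

    x: array of shape (N, F) -- N flattened positions with F features.
    Each output position aggregates ALL positions, weighted by softmax
    similarity, then adds the input back (residual connection). This is
    what lets the block capture global context in a single step.
    """
    scores = x @ x.T                                # pairwise similarity (N, N)
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)         # softmax over all positions
    return x + attn @ x                             # residual + global aggregation
```

Because every position attends to every other position regardless of distance, the receptive field is global after one block, unlike stacked 3D convolutions whose receptive field grows only linearly with depth.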


Fig. 4. The encoder–decoder architecture. The pink blocks represent the encoding process and the green blocks the decoding process.


Fig. 5. Geometry refinement module (GR). The initial disparity map, the left image, and the semantic features are fed to the GR module, which outputs the refined disparity map. Here, a blue block denotes 32 convolution filters of size $3 \times 3$, and a green block denotes a single convolution filter of size $3 \times 3$.
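The warping loss mentioned in the abstract, which supervises matching in non-occluded regions, can be sketched in the same spirit. This is a hedged toy version with assumed simplifications: grayscale images, nearest-neighbour sampling instead of bilinear interpolation, and validity masking only at the image border rather than true occlusion masking.

```python
import numpy as np

def warping_loss(left_img, right_img, disparity):
    """Toy photometric warping loss (illustrative sketch).

    left_img, right_img: (H, W) intensity images.
    disparity: (H, W) predicted left-view disparities.
    The right image is warped into the left view via the disparity; the
    mean absolute photometric difference over validly warped pixels is
    returned. A correct disparity map drives this loss toward zero.
    """
    H, W = left_img.shape
    xs = np.arange(W)[None, :] - disparity          # source columns in right image
    valid = (xs >= 0) & (xs <= W - 1)               # pixels that map inside the image
    xs_clipped = np.clip(xs, 0, W - 1)
    rows = np.repeat(np.arange(H)[:, None], W, axis=1)
    warped = right_img[rows, np.round(xs_clipped).astype(int)]
    return np.abs(left_img - warped)[valid].mean()
```

Since the loss needs no ground-truth disparity, it acts as a self-supervised signal that teaches the network the left–right matching rule wherever both views observe the same surface.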


Table 1. Evaluation of NLCA-Net with different settings.


Fig. 6. SceneFlow test data qualitative results. From left: left stereo input image, ground-truth, disparity prediction.


Table 2. Influence of weight values for ${\lambda _1}$, ${\lambda _2}$, $\alpha$, and $\beta$ on three-pixel-error.


Table 3. Influence of the different numbers of the non-local blocks on the model.


Fig. 7. KITTI 2012 test data qualitative results. We compare our approach with state-of-the-art methods (HD3-S and GwcNet) and highlight our advantage in the error maps. Note that in the error maps, deeper red pixels indicate a higher error rate in occluded regions, and white pixels denote an error of ≥5 pixels in non-occluded regions.


Fig. 8. KITTI 2015 test data qualitative results. From left: left stereo input image, disparity prediction, error map. Note that in the error maps, correct estimates (<3 px or <5% error) are depicted in blue and wrong estimates in red tones.


Table 4. Results on KITTI 2012 stereo benchmark.


Table 5. Results on KITTI 2015 stereo benchmark.


Table 6. Comparisons of different state-of-the-art methods in the reflective regions.


Fig. 9. Partial zoom-in of the error maps in the occluded region. From left: original image, HD3-Stereo, GwcNet, and ours. The results show that our method notably reduces the error rate in occluded areas and handles large textureless regions well.


Table 7. Summary of our non-local context attention network, NLCA-Net.