
DSNet: an efficient CNN for road scene segmentation

Published online by Cambridge University Press:  26 November 2020

Ping-Rong Chen
Affiliation:
National Chiao Tung University, Hsinchu, Taiwan
Hsueh-Ming Hang*
Affiliation:
National Chiao Tung University, Hsinchu, Taiwan
Sheng-Wei Chan
Affiliation:
Industrial Technology Research Institute, Hsinchu, Taiwan
Jing-Jhih Lin
Affiliation:
Industrial Technology Research Institute, Hsinchu, Taiwan
*
Corresponding author: Hsueh-Ming Hang Email: hmhang@nctu.edu.tw

Abstract

Road scene understanding is a critical component in an autonomous driving system. Although deep learning-based road scene segmentation can achieve very high accuracy, its computational complexity is also very high, which hinders real-time applications. It is challenging to design a neural network with both high accuracy and low computational complexity. To address this issue, we investigate the advantages and disadvantages of several popular convolutional neural network (CNN) architectures in terms of speed, storage, and segmentation accuracy. We start from the fully convolutional network (FCN) with a VGG backbone, and then study ResNet and DenseNet. Through detailed experiments, we select the favorable components from the existing architectures and, in the end, construct a light-weight network architecture based on DenseNet. Our proposed network, called DSNet, demonstrates real-time testing (inference) capability on a popular GPU platform while maintaining an accuracy comparable with most previous systems. We evaluate our system on several datasets, including the challenging Cityscapes dataset (at 1024 × 512 resolution), achieving a mean Intersection over Union (mIoU) of about 69.1% and a runtime of 0.0147 s/image on a single GTX 1080Ti. We also design a more accurate model, at the price of slower speed, which achieves an mIoU of about 72.6% on the CamVid dataset.
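The mIoU figures quoted in the abstract follow the standard per-class Intersection-over-Union metric, averaged over classes. The following is a minimal illustrative sketch of that metric, not the authors' evaluation code; `gt` and `pred` are assumed to be flat lists of per-pixel class labels.

```python
# Minimal sketch (assumption, not the paper's code): mean Intersection over
# Union (mIoU) over flat lists of ground-truth and predicted class labels.
def mean_iou(gt, pred, num_classes):
    ious = []
    for c in range(num_classes):
        inter = sum(1 for g, p in zip(gt, pred) if g == c and p == c)
        union = sum(1 for g, p in zip(gt, pred) if g == c or p == c)
        if union > 0:              # skip classes absent from both label maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Example: 2 classes over 4 pixels; class 0 has IoU 1/2, class 1 has IoU 2/3
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], 2)
```

Benchmark suites such as Cityscapes compute the same quantity from accumulated confusion matrices over the whole test set rather than per image.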

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s), 2020. Published by Cambridge University Press

Fig. 1. The architecture of the fast dense segmentation network (DSNet-fast). The encoder is built from the deep dense units described in Fig. 3.

Fig. 2. The architecture of the proposed network, DSNet-accurate. It differs from DSNet-fast (Fig. 1) mainly in two ways: it removes the down-sampling operation in the initial block, and it removes the skip connection to Block 2, so the concatenation layer has 96 channels rather than the 128 channels in DSNet-fast.

Fig. 3. The deep dense units, modified by inserting an additional convolutional layer (red dotted block). (a) Non-bottleneck architecture. (b) Bottleneck architecture.
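In a DenseNet-style block such as the deep dense unit of Fig. 3, each unit's input is the concatenation of all earlier feature maps, so the channel count grows by the growth rate at every unit. The following sketch only does that channel bookkeeping for a generic dense block; the specific numbers (64 input channels, growth rate 32, 4 units) are illustrative assumptions, not the paper's configuration.

```python
# Sketch (assumption): channel bookkeeping in a DenseNet-style dense block.
# Each unit consumes the concatenation of all previous outputs and appends
# growth_rate new feature channels, so the input width grows linearly.
def dense_block_channels(c_in, growth_rate, num_units):
    widths = []                  # input channel count seen by each unit
    c = c_in
    for _ in range(num_units):
        widths.append(c)
        c += growth_rate         # concatenate this unit's new features
    return widths, c             # per-unit input widths, block output width

widths, c_out = dense_block_channels(64, 32, 4)
# widths == [64, 96, 128, 160]; c_out == 192
```

The bottleneck variant in Fig. 3(b) inserts a 1 × 1 convolution to shrink each unit's input before the 3 × 3 convolution, which keeps the cost of this growing input width manageable.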

Fig. 4. The architecture of FCN-ResNet. The encoder backbone is ResNet, and an FCN-based decoder is attached to fuse feature maps by summation.

Table 1. Results of FCN-VGG16 and FCN-ResNet50 on CamVid test set and Cityscapes validation set.

Fig. 5. An output sample that shows the importance of the receptive field. (a) Input image, (b) Ground truth, (c) FCN-VGG16 output, (d) FCN-ResNet50 output.

Fig. 6. The architecture of FCN-VGG-ED (ED: early down-sampling). Similar to Fig. 4, the decoder is an FCN-based structure, but the encoder backbone is VGG.

Table 2. Results of FCN-VGG16 and FCN-VGG-ED on CamVid test set (training from scratch).

Fig. 7. The architecture of FCN-DenseNet and FCN-DenseNet-D ("D" denotes the deep dense unit described in Fig. 3). Similar to Figs 4 and 6, an FCN-based decoder is employed. The encoder is a customized structure based on DenseNet.

Table 3. Results of FCN-DenseNet and FCN-DenseNet-D on the Cityscapes validation set.

Fig. 8. The architecture of FCN-DenseNet-D with the wide decoder and the narrow decoder. Unlike Fig. 7, the decoder employs concatenation instead of summation. Two channel numbers are listed for the skip connections and decoder layers: the left number (blue) is for the wide decoder and the right one (red) is for the narrow decoder.
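The two fusion options compared in the decoder (summation in Figs 4, 6, and 7 versus concatenation in Fig. 8) differ in how they combine a skip-connection feature with an up-sampled decoder feature. The following is a generic illustration on per-pixel channel vectors, not the authors' implementation; the toy vectors are assumptions.

```python
# Sketch (assumption): the two feature-fusion options for a decoder skip
# connection, shown on per-pixel channel vectors (plain Python lists).
def fuse_sum(a, b):
    # Element-wise summation requires matching channel counts and
    # keeps the channel count unchanged.
    assert len(a) == len(b)
    return [x + y for x, y in zip(a, b)]

def fuse_concat(a, b):
    # Concatenation has no matching requirement, but the channel
    # count grows to len(a) + len(b), making later layers wider.
    return a + b

skip, up = [1.0, 2.0], [3.0, 4.0]
summed = fuse_sum(skip, up)       # still 2 channels
stacked = fuse_concat(skip, up)   # now 4 channels
```

This channel growth is why Fig. 8 lists two channel numbers per layer: the wide and narrow decoders trade off how many channels survive the concatenation.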

Table 4. Fusion methods in the decoder (Cityscapes validation set at 1024 × 512 resolution).

Fig. 9. Variations of the decoder. From top to bottom: (a) Model-1, (b) Model-2, (c) Model-3, (d) Model-4.

Table 5. Results of four decoders on CamVid test set.

Table 6. Comparison of DSNet and other schemes on CamVid test set.

Table 7. The speed of DSNet running on 480 × 360 resolution with 11 categories (CamVid dataset).

Fig. 10. The results of DSNet on CamVid test set. From left to right: (a) Input image, (b) Ground truth, (c) DSNet-fast output, (d) DSNet-accurate output.

Fig. 11. Results of DSNet on Cityscapes validation set. From left to right: (a) Input image, (b) Ground truth, (c) DSNet-fast output.

Table 8. The results of DSNet-fast and other methods on the Cityscapes test set. Results for the other methods are taken from the online leaderboard and their reference papers (Cityscapes webpage).