Deep learning model to reconstruct 3D cityscapes by generating depth maps from omnidirectional images and its application to visual preference prediction

Published online by Cambridge University Press: 11 November 2020

Atsushi Takizawa*
Affiliation:
Housing and Environmental Design Course, Graduate School of Human Life Science, Osaka City University, Osaka, Japan
Hina Kinugawa
Affiliation:
Housing and Environmental Design Course, Graduate School of Human Life Science, Osaka City University, Osaka, Japan
*Corresponding author: Atsushi Takizawa, takizawa@osaka-cu.ac.jp

Abstract

We developed a method to generate omnidirectional depth maps from the corresponding omnidirectional images of cityscapes by training pix2pix on pairs of an omnidirectional image and a depth map created by computer graphics. Models trained on different series of images, shot under different site and sky conditions, were applied to street view images to generate depth maps. The validity of the generated depth maps was then evaluated quantitatively and visually. In addition, we conducted experiments in which multiple participants evaluated Google Street View images. We then constructed models that predict the preference label of these images, with and without the generated depth maps, using deep convolutional neural network classifiers for general rectangular images and for omnidirectional images. The results demonstrate the extent to which the generalization performance of the cityscape preference prediction model changes depending on the type of convolutional model and the presence or absence of generated depth maps.
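
The two operations the abstract relies on, scoring a generated depth map against a reference and attaching depth to an RGB image as a fourth channel, can be illustrated with a minimal, dependency-free sketch. It assumes depth maps are grayscale images stored as rows of pixel values and that the error is measured in pixel units; the function names `depth_rmse` and `stack_rgbd` are hypothetical and are not taken from the article.

```python
import math


def depth_rmse(generated, reference):
    """Root mean squared error between two equally sized grayscale depth maps.

    Both arguments are lists of rows; each row is a list of pixel values.
    Assumption: errors are measured in pixel units.
    """
    if len(generated) != len(reference):
        raise ValueError("depth maps must have the same height")
    squared_error = 0.0
    count = 0
    for gen_row, ref_row in zip(generated, reference):
        if len(gen_row) != len(ref_row):
            raise ValueError("depth maps must have the same width")
        for g, r in zip(gen_row, ref_row):
            squared_error += (g - r) ** 2
            count += 1
    if count == 0:
        raise ValueError("depth maps must not be empty")
    return math.sqrt(squared_error / count)


def stack_rgbd(rgb, depth):
    """Append the depth value to each (R, G, B) pixel, giving (R, G, B, D)."""
    if len(rgb) != len(depth):
        raise ValueError("image and depth map must have the same height")
    rgbd = []
    for rgb_row, depth_row in zip(rgb, depth):
        if len(rgb_row) != len(depth_row):
            raise ValueError("image and depth map must have the same width")
        rgbd.append([(*pixel, d) for pixel, d in zip(rgb_row, depth_row)])
    return rgbd


# Toy 2x2 example with made-up values (not data from the article).
generated = [[10, 20], [30, 40]]
reference = [[10, 22], [30, 44]]
error = depth_rmse(generated, reference)

rgb = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (9, 9, 9)]]
rgbd = stack_rgbd(rgb, generated)
```

In practice an array library would replace the explicit loops; the plain-list version is used here only to keep the sketch self-contained.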

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is included and the original work is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use.
Copyright
© The Author(s), 2020. Published by Cambridge University Press

Figure 1. Example of two-dimensional isovists.

Figure 2. An isovist, finite visual lines for creating its approximate isovist and its projection on a sphere with cameras.

Figure 3. Omnidirectional image (left) and corresponding depth map (right) of a cityscape in a computer graphics model (©NONECG).

Figure 4. Framework of the proposed method.

Figure 5. Two city models for training pix2pix (©NONECG).

Figure 6. Shooting points in each city model. Red points are for training, green for validation, and purple for testing (©NONECG).

Figure 7. An omnidirectional image of the local city model (top) and its four-way images at a normal angle of view (©NONECG).

Figure 8. Comparison of shades of depth maps at different maximum distances (©NONECG).

Figure 9. Outline of pix2pix process.

Table 1. Nine models used for training. In each cell, M__ denotes a model name and the values below it denote the number of training/validation/test data.

Figure 10. An example of an omnidirectional image of GSV (top) and its four-way images at a normal angle of view (©Google, 2020).

Figure 11. Filtering operation of a generated depth map for sky area using SS with the image of GSV (©Google, 2020).

Figure 12. Modified ResNet-50 for RGB/RGBD images. The first and last parts, noted in red, are modified from the original ResNet-50.

Figure 13. Mesh convolution of UGSCNN.

Figure 14. UGSCNN used in this study.

Figure 15. Example of convergence process of loss functions (M2c).

Table 2. RMSE of test data generated by each pix2pix model.

Figure 16. Example of a depth map generated by M2c for test data (©NONECG), RMSE = 4.37.

Figure 17. Comparison of depth maps of the same GSV image (©Google, 2020) generated by each model.

Figure 18. Example of depth maps of GSV (©Google, 2020) generated by M2c and their filtered depth maps.

Table 3. Basic statistics of preference score of 100 GSV images.

Figure 19. Histogram for each mean preference score of 100 GSV images.

Figure 20. Examples from GSV (©Google, 2020) preference scoring experiment in Osaka. The values are the mean/std of every 10 subjects’ scores.

Figure 21. Example of the convergence process of loss functions of ResNet-50 with RGBD.

Figure 22. Example of the convergence process of loss functions of UGSCNN with RGBD.

Figure 23. Distribution of F1 score of 10-fold cross validation for each CNN; X denotes the mean.

Table 4. Descriptive statistics of F1 score of 10-fold cross validation for each CNN.

Table 5. Decision limit of analysis of means of F1 score for sets of CNNs, significance level = 0.05.

Figure 24. Mean absolute error in pixel units between generated depth maps and correct images, and number of pixels, for each distance.

Figure 25. Example of a depth map of GSV (©Google, 2020).

Table A1. Learning settings of the pix2pix model.

Table A2. Learning settings of the ResNet-50 model.

Table A3. Learning settings of the UGSCNN model.

Figure A1. Structure of semantic segmentation models.

Figure A2. Type of depthwise convolution.

Supplementary material: PDF

Takizawa and Kinugawa supplementary material

Figures S1-S4
