Impact Statement
As technology advances, building operations increasingly rely on Building Information Modeling (BIM) and digital twins, yet accurate three-dimensional (3D) models are often unavailable or outdated, making manual reconstruction costly and time-consuming. To address this, we present an autonomous drone system for indoor navigation and 3D mapping, generating semantically rich point clouds. These semantic point clouds enable the automatic extraction of building components. By bridging the gap between autonomous robotics and scan-to-BIM methodologies, this study introduces a novel approach to automated indoor mapping, reducing manual efforts while improving accuracy, and paving the way for more scalable, cost-effective BIM generation in complex indoor environments. This innovation has broad implications for construction, facility management, and heritage preservation, offering a transformative solution for industries that depend on high-fidelity digital building models.
1. Introduction
In the past decade, Building Information Modeling (BIM) has gained extensive utilization within the architecture, engineering, construction, and facility operations (AECO) industry, owing to its enhancements in project productivity, efficiency, and construction quality (Migilinskas et al., Reference Migilinskas, Popov, Juocevicius and Ustinovichius2013). BIM has transitioned the construction industry standards from two-dimensional (2D) drawings and associated information systems to three-dimensional (3D) object-oriented information systems (Mihindu and Arayici, Reference Mihindu and Arayici2008). Considering the extended life cycle of building and infrastructure projects, the maintenance, renovation, and demolition phases are increasingly relying on BIM (Xu et al., Reference Xu, Ding and Peter2017; Park et al., Reference Park, Kim, Lee, Jeong, Lee, Kim and Hong2022). According to Akcamete et al. (Akcamete et al., Reference Akcamete, Akinci and Garrett2010), the operational phase of building projects accounts for up to 60% of the life-cycle costs, and BIM has the potential to reduce these costs by providing functions such as simulation, automation, and information sharing (Czerniawski and Leite, Reference Czerniawski and Leite2020).
Unfortunately, the creation and sustained maintenance of BIMs throughout a building’s lifespan are challenging due to their extensive informational scope and diversity, as well as the semantic, temporal, and spatial complexities inherent in buildings. Scan-to-BIM is currently the major method to create as-built BIM models, which relies heavily on manual operations. The process can be summarized into three steps: (1) data acquisition and preprocessing, (2) segmentation of 3D data based on semantic classes defined by a BIM taxonomy, and (3) development of 3D models based on geometric and semantic information of the captured space using a BIM authoring software. Currently, as-built point clouds are generally collected using laser scanners (Bassier et al., Reference Bassier, Vergauwen and Van Genechten2017; Son and Kim, Reference Son and Kim2017; Hong et al., Reference Hong, Nguyen and Choi2018) or image-based geometric approaches (Dai et al., Reference Dai, Rashidi, Brilakis and Vela2012; Klein et al., Reference Klein, Li and Becerik-Gerber2012; Lu and Lee, Reference Lu and Lee2017; Xue et al., Reference Xue, Lu and Chen2018). Laser scanners can collect point clouds with the best quality, but they are costly and labor-intensive to operate. On the other hand, existing image-based geometric approaches perform poorly in an indoor environment since large weak-texture areas are present (e.g., white walls), which provide no reliable image features for stereo matching. Next, the collected point clouds need to be assigned semantic labels according to real-world building information. To avoid manual labeling, researchers investigated automated semantic segmentation using model-driven methods (Faber and Fisher, Reference Faber and Fisher2002; Xiong and Huber, Reference Xiong and Huber2010; Dore and Murphy, Reference Dore and Murphy2013) and data-driven methods (Guo et al., Reference Guo, Bennamoun, Sohel, Lu and Wan2014; Armeni et al., Reference Armeni, Sax, Zamir and Savarese2017; Czerniawski and Leite, Reference Czerniawski and Leite2020; Perez-Perez et al., Reference Perez-Perez, Golparvar-Fard and El-Rayes2021). Lastly, the segmented point clouds can be converted into surfaces or meshes through a manual process (Tang et al., Reference Tang, Huber, Akinci, Lipman and Lytle2010) or predefined modeling algorithms (Xiong et al., Reference Xiong, Adan, Akinci and Huber2013; Jung et al., Reference Jung, Stachniss, Ju and Heo2018; Wang et al., Reference Wang, Yin, Luo, Jack and Wang2021; Park et al., Reference Park, Kim, Lee, Jeong, Lee, Kim and Hong2022). Despite past efforts, scan-to-BIM remains an ongoing area of research, lacking established best practices.
Recent advancements in robotics, including unmanned ground vehicles (UGVs) and unmanned aerial vehicles (UAVs), have opened up new avenues for gathering building data through automation. For example, past studies (Chen and Cho, Reference Chen and Cho2022; Hu et al., Reference Hu, Vincent and Yin2023) developed methods that use UGVs to autonomously navigate the built environment and collect point clouds with LiDAR scanners for BIM generation, proving the feasibility of integrating robotics into scan-to-BIM. However, the traversability of UGVs in complex construction and indoor environments with unknown terrain (e.g., stones and stairs) is limited, the field of view of a LiDAR mounted on a ground platform can be restricted, and carrying expensive laser scanners exposes them to the risk of costly damage.
Compared to UGVs, UAVs can navigate without terrain constraints and provide better coverage given additional degrees of freedom. Past studies have explored the usage of UAVs in inspection and monitoring, including bridge inspection (Bolourian and Hammad, Reference Bolourian and Hammad2020), progress monitoring (Huang and Xu, Reference Huang and Xu2022), boiler inspection (Ortega et al., Reference Ortega, Muñoz, McGee, Choudhuri and Flores-Abad2020), and building facade inspection (Chen et al., Reference Chen, Reichard and Xu2019). Most of these studies focused on outdoor navigation using GPS localization, and no prior study has applied indoor UAVs to scan-to-BIM tasks. In this study, we attempt to automate the first two steps of scan-to-BIM (data acquisition and semantic segmentation) using a UAV system and a Gaussian splatting-based reconstruction method that we propose: the ICON drone and depth-regularized planar-based Gaussian splatting reconstruction (D-PGSR).
The ICON drone is designed to navigate complex indoor spaces and create semantic-rich point clouds of them. Our method avoids high-cost LiDAR scanners, navigating and reconstructing the scene using only cameras. Taking inputs from an IMU and a stereo camera, the UAV localizes itself using visual-inertial odometry (VIO) and finds its next destination using a frontier-based exploration algorithm. Next, the recorded RGB videos are fed to our proposed reconstruction algorithm to generate point clouds of the explored areas. To address the poor accuracy of image-based reconstruction methods in weak-texture areas, we use monocular depth estimation to supervise the optimization of these areas. Lastly, a neural network assigns semantic information to the point clouds based on their corresponding building component and building material classes. To validate the proposed pipeline, we performed experiments in three different scenes: the classroom, the lobby, and the lounge. The main contributions of this study are fourfold:
1. We developed an autonomous UAV system for indoor navigation using off-the-shelf, affordable components, and we successfully tested the system in multiple real-world scenarios.
2. We propose a method for high-quality indoor reconstruction, focusing on improving weak-texture areas.
3. We propose a method to recover the absolute scale of point clouds generated using image-based 3D reconstruction methods.
4. We propose a semantic segmentation model on both building components and building materials for scan-to-BIM.
2. Background
2.1. Simultaneous localization and mapping (SLAM), exploration, and trajectory-planning
For robots to navigate autonomously in an unknown environment, there are three key modules: localization, path planning, and exploration. Given that most indoor environments have no GPS signals, SLAM is the most suitable approach for localization. In a SLAM problem, the robot observes the surrounding environment using sensors like ranging sensors (e.g., LiDARs) and RGB cameras, and estimates its current location based on past observations.
SLAM systems can be roughly divided into two categories: LiDAR-based SLAM and vision-based SLAM. LiDAR-based SLAM is uncommon on UAV systems given the UAV's limited payload. Therefore, we focus on vision-based SLAM, which only needs lighter and cheaper sensors (i.e., cameras). MonoSLAM (Davison et al., Reference Davison, Reid, Molton and Stasse2007) was the first SLAM system using only a monocular camera, which opened the possibility of visual-only SLAM. The ORB-SLAM series (Mur-Artal et al., Reference Mur-Artal, Montiel and Tardós2015; Mur-Artal and Tardós, Reference Mur-Artal and Tardós2017; Campos et al., Reference Campos, Elvira, Rodríguez, José and Tardós2021) further improved the performance of visual-only SLAM. However, visual-only SLAM does not provide absolute-scale odometry, and the system easily fails when the visual input changes dramatically. To address these issues, researchers integrated IMUs into the camera setup, an approach commonly known as visual-inertial odometry (VIO).
Once the UAV can localize itself, we can consider its exploration strategy. Given our interest in point cloud acquisition, we focus on coverage path planning (CPP), which aims to find a path that avoids all obstacles and maximizes area coverage while optimizing travel time, processing time, energy cost, number of turns, and other parameters (e.g., reconstruction quality). Navigation in unknown environments requires online CPP, which updates the exploration strategy while the UAV is navigating and observing new information (e.g., RGB images and depth images). In other words, the algorithm updates the viewpoints and a suitable path based on real-time information. In this article, we use frontier-based methods. First introduced by Yamauchi (Yamauchi, Reference Yamauchi1997), frontiers are computed in areas that are unoccupied by obstacles, and the robot is prompted to move toward them. To extend the original method, Shen et al. developed a method to detect frontiers in 3D space (Shen et al., Reference Shen, Michael and Kumar2012). However, classic frontier-based methods make decisions greedily and do not consider the dynamics of the UAV, which leads to unnecessary moves and insufficient coverage. To address this issue, Zhou et al. (Zhou et al., Reference Zhou, Zhang, Chen and Shen2021) proposed an incremental frontier information structure and a hierarchical exploration planning strategy to achieve better exploration efficiency. This is the method we use in our exploration pipeline; the details are explained in the methodology section.
To move the drone from one viewpoint to the next, a smooth and obstacle-free trajectory is required. To achieve this, recent studies (Oleynikova et al., Reference Oleynikova, Burri, Taylor, Nieto, Siegwart and Galceran2016; Gao et al., Reference Gao, Lin and Shen2017; Usenko et al., Reference Usenko, von Stumberg, Pangercic and Cremers2017) mostly apply nonlinear optimization over several objectives to generate a trajectory that maximizes overall performance. Commonly considered objectives are the smoothness of the trajectory, travel time, dynamic feasibility, and obstacle avoidance. Notably, Usenko et al. (Usenko et al., Reference Usenko, von Stumberg, Pangercic and Cremers2017) and Zhou et al. (Zhou et al., Reference Zhou, Gao, Pan and Shen2020) adopted B-splines for UAV trajectory generation. A B-spline is a smooth piecewise-polynomial curve defined by control points and knots. The convex hull property of uniform B-splines can be exploited to incorporate gradient information from obstacles and dynamic constraints when generating optimal trajectories.
2.2. Image-based 3D reconstruction
Here, 3D reconstruction refers to capturing the 3D appearance of an object and generating the 3D shape of the surface through computer-aided modeling (Moons et al., Reference Moons, Van Gool and Vergauwen2010). Due to the limited sensors on the drone, we focus on image-based 3D reconstruction. Traditional image-based sparse reconstruction methods, such as structure from motion (SFM) (Schönberger and Frahm, Reference Schönberger and Frahm2016), extract geometric features from images, match similar features across different views, and estimate the 3D locations of the extracted features and cameras to create a 3D reconstruction. With known camera parameters, multi-view stereo (MVS) (Bleyer et al., Reference Bleyer, Rhemann and Rother2011; Schönberger et al., Reference Schönberger, Zheng, Frahm, Pollefeys, Leibe, Matas, Sebe and Welling2016) can be used to create dense reconstructions by matching stereo pixels in neighboring images. The AECO community has adopted image-based 3D reconstruction in various downstream tasks, such as construction progress monitoring (Han and Golparvar-Fard, Reference Han and Golparvar-Fard2015), site material management (Kamari and Ham, Reference Kamari and Ham2021), and facade inspection (Chen et al., Reference Chen, Reichard and Xu2019). However, MVS methods perform poorly in weak-texture areas, since the matching process becomes unreliable when too many similar pixels are present. Xu et al. (Xu et al., Reference Xu, Kong, Tao and Pollefeys2023) alleviate this issue by introducing a planar prior in these regions, but their method is still sensitive to noise.
As deep learning became a popular area of research, researchers developed innovative deep-learning approaches for 3D reconstruction using images. Neural radiance field (NeRF)-based methods (Mildenhall et al., Reference Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi and Ng2020; Yariv et al., Reference Yariv, Gu, Kasten and Lipman2021; Müller et al., Reference Müller, Evans, Schied and Keller2022) store the scene implicitly in a neural network. However, implicit representations suffer from inefficient per-pixel volume rendering during optimization. In contrast, 3D Gaussian splatting (3DGS) (Kerbl et al., Reference Kerbl, Kopanas, Leimkühler and Drettakis2023) models the environment explicitly using a set of Gaussian primitives with learnable parameters. 3DGS leverages splatting to project 3D Gaussians into 2D image space given camera poses, which is much faster than the volume rendering used in NeRF, leading to shorter training times and real-time high-quality scene rendering. These deep-learning methods can render high-quality images of the scene, but their geometric accuracy in weak-texture areas remains problematic.
Leveraging large datasets and complex neural networks, researchers have also pursued monocular depth estimation using deep learning. Recent models (Ke et al., Reference Ke, Obukhov, Huang, Metzger, Daudt and Schindler2024; Wang et al., Reference Wang, Xu, Dai, Xiang, Deng, Tong and Yang2024) have achieved impressive estimation quality by training on massive amounts of Red Green Blue-Depth (RGB-D) images captured in various environments. Taking another step forward, DUSt3R (Wang et al., Reference Wang, Leroy, Cabon, Chidlovskii and Revaud2023) can estimate both camera poses and depth from multiple input images and generate a dense multi-view reconstruction, removing the need for SFM initialization. In this study, we combine the advantages of 3DGS and monocular depth estimation to improve the reconstruction quality in weak-texture areas.
2.3. Semantic segmentation
Semantic segmentation provides a label for each pixel in the image or each point in a point cloud, allowing algorithms to identify different objects and understand the scene. Automated semantic segmentation can be classified into two categories: model-driven (Faber and Fisher, Reference Faber and Fisher2002; Dore and Murphy, Reference Dore and Murphy2013; Di Angelo and Di Stefano, Reference Di Angelo and Di Stefano2015) and data-driven (Ronneberger et al., Reference Ronneberger, Fischer, Brox, Navab, Hornegger, Wells and Frangi2015; Qi et al., Reference Qi, Su, Mo and Guibas2017; Lin et al., Reference Lin, Chen, Wang, Wu, Li, Lin, Liu and He2023). Model-driven methods achieve segmentation by aligning data points with various preestablished models of shapes and identifying the model that best conforms to the data. However, model-driven methods are not robust in complicated scenes with multiple classes closely oriented. Therefore, we focus on data-driven methods in this study, which can learn contextual features in the training datasets through iterations and provide predictions after training. In the past decade, numerous studies have attempted to improve the accuracy of semantic segmentation through more complex models (Dosovitskiy et al., Reference Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit and Houlsby2021), larger and better-quality datasets (Zhou et al., Reference Zhou, Zhao, Puig, Fidler, Barriuso and Torralba2017), and novel training techniques (Loshchilov and Hutter, Reference Loshchilov and Hutter2019). Researchers have also applied semantic segmentation in various applications for buildings and infrastructures, including scan-to-BIM (Park et al., Reference Park, Kim, Lee, Jeong, Lee, Kim and Hong2022), crack detection (Zhang et al., Reference Zhang, Rajan and Story2019), quality assurance (Mirzaei et al., Reference Mirzaei, Arashpour, Asadi, Masoumi, Mahdiyar and Gonzalez2023), and robot navigation (Yu et al., Reference Yu, Liu, Liu, Xie, Yang, Wei and Fei2018).
Despite being emphasized as a crucial step toward scan-to-BIM (Czerniawski and Leite, Reference Czerniawski and Leite2020; Ma et al., Reference Ma, Czerniawski and Leite2020; Perez-Perez et al., Reference Perez-Perez, Golparvar-Fard and El-Rayes2021), most existing studies only focused on the semantic segmentation of building components. According to references (Yüksek, Reference Yüksek2015; Kim et al., Reference Kim, Jeong, Hong, Lee and Lee2023), material information of building components is important due to its usage in the operation and maintenance phase of a building’s life cycle for cost analysis and energy simulation. Therefore, to generate BIM with a higher level of detail, we also need material information for the point clouds. For this study, we focus on indoor 3D reconstruction with semantic annotations of building components and materials. The model we implemented has a Swin Transformer (Liu et al., Reference Liu, Lin, Cao, Hu, Wei, Zhang, Lin and Guo2021) as the backbone and an Upernet (Xiao et al., Reference Xiao, Liu, Zhou, Jiang and Sun2018) head as the decoder. This architecture has been proven effective by achieving state-of-the-art performance in various datasets.
2.4. UAV applications in indoor environments
UAVs have been adopted for various applications in infrastructure inspection, but few of them focus on indoor applications. The main challenge is the absence of GPS signals indoors. Furthermore, the layout of indoor environments can be unknown, confined, and continuously evolving over time (Razafimahazo et al., Reference Razafimahazo, De Saqui-Sannes, Vingerhoeds, Baron, Soulax and Mège2021). Past studies have attempted to address these issues from different angles. For example, Gao et al. (Gao et al., Reference Gao, Wang, Wang, Zhao, Zhai, Chen and Chen2023) proposed an explore-then-exploit UAV system for autonomous indoor facility data collection and scene reconstruction. Their method required one drone to first explore and reconstruct a coarse map of the environment, which was then used to plan a finer path for another drone to complete the scene reconstruction. The method ensured safe trajectories, but two steps were needed, and semantic information of the scene was not included in the reconstruction. Similarly, Li et al. focused on generating collision-free trajectories in buildings using the $ {A}^{\ast } $ algorithm and distance transformation; however, their method lacked real-world experiments to prove its effectiveness. On the other hand, Eiris et al. developed a flight behavior visualization platform to study how to visually represent human decision-making processes during UAV-based indoor building inspection (Eiris et al., Reference Eiris, Albeaino, Gheisari, Benda and Faris2021). In terms of more systematic applications, Cheng et al. proposed an endoscopic system for fusion reactor inspection (Cheng et al., Reference Cheng, Xu, Liu, Zheng, Zuo, Ji and Qin2024). They successfully tested the system inside a fusion reactor; however, human operation was still needed to control the drone. In summary, the level of automation for UAVs in indoor environments is still limited. Our proposed system explores the possibility of applying a fully automated UAV system in indoor building environments to perform a semantic reconstruction task. The system is also suitable for modification to support other building-related tasks involving complex layouts, architectural designs, and obstacles.
3. Methodology
Our method enables (1) autonomous UAV navigation, which allows the drone to explore unknown environments; (2) 3D reconstruction, which creates point clouds of the explored environments using RGB images; and (3) semantic segmentation, which annotates collected images and projects annotations onto the 3D reconstruction.
3.1. ICON drone system design
3.1.1. Mechanical design
Figure 1 shows the configuration of the ICON drone, which is built on a 250 mm quadcopter frame. We use a Pixhawk 6C Mini flight controller running the PX4 autopilot to control four T-Motor F40PRO motors through four EMAX 45A ESCs. For autonomous navigation, we mounted an Intel NUC computer, a D435i RGB-D camera, and a Wheeltec M100 IMU on the drone for environment perception and real-time processing. The entire drone is powered by a 2,200-mAh LiPo battery. Finally, a GoPro camera is mounted on top to record high-quality RGB videos for 3D reconstruction.

Figure 1. Overview of the components of the ICON drone.
Table 1. Metric definitions. $ P $ and $ {P}^{\ast } $ are the point clouds sampled from the predicted and ground truth meshes, respectively.
3.1.2. Mechatronic and software system
As shown in Figure 2, the drone's control system includes two parts: the flight controller module and the navigation module. The flight controller sends commands to the motors to control the attitude of the drone. It is connected to the onboard computer through a USB cable and communicates via the Robot Operating System (ROS). The navigation module runs on the onboard computer. The VIO takes in IMU measurements and stereo images from the RGB-D camera and outputs odometry. The odometry and the depth images from the depth camera are sent to the exploration module to generate a B-spline trajectory. The trajectory is output in the form of a position command, which consists of position, velocity, acceleration, and orientation. Lastly, a controller processes the odometry, the position command, and the IMU measurements from the flight controller to output the final thrust and attitude command back to the flight controller.

Figure 2. The mechatronic pipeline of the ICON drone.
3.2. VIO, exploration, and trajectory planning
The ICON drone uses a VIO for localization. A VIO aims to estimate the ego-motion of a robot given visual and inertial measurement inputs. To elaborate, given a series of images $ {image}_{0:N} $ and the corresponding inertial measurements $ {IMU}_{0:N} $ in a time period $ {T}_{0:N} $, we want to estimate the translation and rotation at each time interval $ t $, where $ t\in \left[0:N\right] $.
The estimation process of the VIO can be summarized as follows: (1) visual feature matching and IMU pre-integration, (2) pose optimization, and (3) marginalization. In the first step, we preprocess input data to formulate the optimization problem. The visual features in the images are extracted using the KLT (Kanade–Lucas–Tomasi) sparse optical flow algorithm. Similar features in different frames are matched. These matched feature points $ \left[{p}_t,\dots, {p}_i\right] $ are used to estimate camera poses. On the other hand, the IMU data we collected have a higher frequency than the image data. To optimize poses with consistent time intervals, we need to integrate the IMU data between two image frames, which is known as IMU pre-integration.
Combining both sensors, we formulate a tightly coupled nonlinear least squares problem:

$$ \underset{\mathcal{X}}{\min }\sum \limits_{\left(t,k\right)\in S}{\left\Vert {z}_t^k-{h}_t^k\left(\mathcal{X}\right)\right\Vert}_{{\Omega}_t^k}^2 $$

where $ S $ is the set of measurements coming from different sensors, $ h $ is the sensor model, and $ \mathcal{X} $ is all the states (i.e., translation and rotation) we need to estimate. We assume the uncertainty of sensor observation is Gaussian-distributed $ \mathcal{N}\left({z}_t^k,{\Omega}_t^k\right) $. For image data, $ z $ is the image feature (i.e., $ {p}_t $) and $ h $ is the projection model that projects features observed in other images onto the current image. For IMU data, $ z $ is the IMU measurement at $ t $ and $ h $ is the IMU model that estimates the IMU measurement at $ t $ from $ t-1 $. A visualization of the optimization process is shown in Figure 3.

Figure 3. Visualization of visual inertial odometry localization.
Lastly, to limit the computational cost, the number of states in the optimization needs to be bounded. This is achieved by marginalization: when a new keyframe is detected, one of the old keyframes is removed from the optimization window, so the computation time of the VIO remains bounded and real-time updates can be achieved.
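To make the structure of this optimization concrete, the sketch below evaluates the tightly coupled cost for a given set of states. It is a minimal illustration with hypothetical function names and data layouts; the actual VINS-style implementation solves this problem incrementally with a nonlinear solver and performs the marginalization described above.

```python
import numpy as np

def mahalanobis_sq(residual, covariance):
    """Squared Mahalanobis norm of a residual, i.e., r^T Omega^{-1} r."""
    return float(residual @ np.linalg.solve(covariance, residual))

def vio_cost(states, visual_terms, imu_terms):
    """Evaluate the tightly coupled sliding-window cost.

    states       : current estimate of all poses/velocities in the window
    visual_terms : list of (z, h, Omega) tuples for feature observations
    imu_terms    : list of (z, h, Omega) tuples for pre-integrated IMU factors
    Each sensor model h maps the states to a predicted measurement.
    """
    cost = 0.0
    for z, h, omega in list(visual_terms) + list(imu_terms):
        residual = z - h(states)          # measurement minus prediction
        cost += mahalanobis_sq(residual, omega)
    return cost

# Toy usage with a single scalar "sensor": the state x should match the measurement 1.0.
terms = [(np.array([1.0]), lambda s: np.array([s["x"]]), np.array([[0.1]]))]
print(vio_cost({"x": 0.8}, terms, []))
```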
To enable the ICON drone to navigate unknown environments autonomously, we implement a frontier-based exploration algorithm called Fast UAV Exploration (FUEL). Frontier-based exploration operates on a voxel grid map, as depicted in Figure 4. The voxel grid map contains three basic types of cells: free space, occupied space, and unknown space. Free cells can be freely navigated, occupied cells contain obstacles and cannot be accessed, and unknown cells have not yet been observed by the sensor and hence cannot be identified as either free or occupied. The goal of exploration is to eliminate as many unknown cells as possible. To achieve this, the concept of a frontier is introduced: frontiers are the boundaries between unknown space and free space, and frontier cells within an area are grouped to form a frontier cluster. In later computations, each cluster is treated as one frontier entity instead of many individual cells to save computation. The algorithm generates an optimal path to visit all the frontier clusters. While the UAV is moving, newly observed information updates the map, and the path is replanned accordingly.
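As an illustration of frontier detection and clustering (a minimal sketch assuming a simple integer encoding of voxel states; FUEL maintains frontiers incrementally rather than recomputing them over the entire map as done here):

```python
import numpy as np
from scipy import ndimage

FREE, OCCUPIED, UNKNOWN = 0, 1, 2   # assumed voxel-state encoding

def frontier_clusters(grid):
    """Return labeled frontier clusters for a 3D occupancy grid.

    A frontier cell is a FREE voxel with at least one UNKNOWN 6-neighbor.
    Adjacent frontier cells are grouped into clusters so the planner can
    treat each cluster as a single exploration target.
    """
    unknown = grid == UNKNOWN
    structure = ndimage.generate_binary_structure(3, 1)          # 6-connectivity
    near_unknown = ndimage.binary_dilation(unknown, structure=structure)
    frontier = (grid == FREE) & near_unknown
    labels, n_clusters = ndimage.label(frontier, structure=structure)
    return labels, n_clusters

# Toy map: everything unknown except a small free slab.
grid = np.full((20, 20, 5), UNKNOWN, dtype=np.uint8)
grid[5:15, 5:15, 2] = FREE
labels, n = frontier_clusters(grid)
print(f"{n} frontier cluster(s), {np.count_nonzero(labels)} frontier cells")
```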

Figure 4. Visualization of the exploration problem and the hierarchical approach.
FUEL generates exploration motions for UAVs through three hierarchical coarse-to-fine steps: (1) finding the shortest path that covers all the frontiers, (2) refining a local segment of the global path, and (3) generating a safe minimum-time trajectory to the next viewpoint. To reduce redundant computation, the length of the path handled by each step decreases from coarse to fine. Since the trajectories are updated whenever the grid map is updated with new observations, there is no need to compute a costly B-spline trajectory for the whole map at each update.
In the first step, the algorithm formulates an asymmetric traveling salesman problem to ensure that all frontiers will be covered with minimum traveling distance. To elaborate on the problem, given a set $ V $ of $ n $ frontiers and a distance function $ d $, find a cycle $ C $ of minimum cost that contains all the points in $ V $. The cost of a cycle $ C=\left({e}_1,{e}_2,\dots, {e}_n\right) $ is defined to be $ {\sum}_{e\in C}d(e) $. A cost function is assigned to all the paths between frontiers. In the second step, the algorithm selects the optimum viewpoint at each frontier to optimize time lower bounds and motion consistency. Given discrete viewpoints, the UAV needs a continuous trajectory for smooth navigation. In the last step, FUEL adopts a smooth, safe, and dynamically feasible B-spline trajectory generation method from Fast Planner. To achieve this, we optimize a trajectory with $ {N}_b+1 $ control points $ X={x}_0,{x}_1,\dots {x}_{N_b} $, where $ {x}_i $ contains an $ x,y,z $ coordinate and a viewpoint angle, and $ \delta {t}_b $ is the knot span, following the constraints below:

$$ \underset{X,\delta {t}_b}{\min }{f}_s+{\omega}_tT+{\lambda}_c{f}_c+{\lambda}_d\left({f}_v+{f}_a\right)+{\lambda}_{bs}{f}_{bs} $$

where $ {f}_s $ is the elastic band's smoothness cost to ensure a smooth trajectory, $ T $ is the total trajectory time, $ {f}_c $ is the penalty that keeps the trajectory away from the closest obstacle, $ {f}_v $ and $ {f}_a $ penalize infeasible velocity and acceleration, $ {f}_{bs} $ ensures smooth transitions between viewpoints, and $ {\omega}_t $, $ {\lambda}_c $, $ {\lambda}_d $, and $ {\lambda}_{bs} $ are predefined weights. Details of the calculation can be found in the original paper (Zhou et al., Reference Zhou, Zhang, Chen and Shen2021).
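The sketch below evaluates this weighted objective for a set of control points. The weighting structure follows the equation above, but each individual term is a simplified finite-difference stand-in for the corresponding cost in FUEL, so the values are only illustrative.

```python
import numpy as np

def trajectory_cost(ctrl_pts, dt, obstacles, v_max, a_max,
                    w_t=1.0, lam_c=10.0, lam_d=1.0, lam_bs=1.0, clearance=0.5):
    """Toy evaluation of f_s + w_t*T + lam_c*f_c + lam_d*(f_v + f_a) + lam_bs*f_bs.

    ctrl_pts : (N+1, 3) B-spline control points (x, y, z); yaw is omitted here.
    dt       : knot span delta_t_b.
    """
    vel = np.diff(ctrl_pts, axis=0) / dt            # derivatives from control points
    acc = np.diff(vel, axis=0) / dt                 # (convex-hull property of B-splines)
    jerk = np.diff(acc, axis=0) / dt
    f_s = np.sum(jerk ** 2)                         # smoothness (elastic band)
    T = dt * (len(ctrl_pts) - 1)                    # total trajectory time
    d = np.linalg.norm(ctrl_pts[:, None, :] - obstacles[None, :, :], axis=-1)
    f_c = np.sum(np.maximum(clearance - d.min(axis=1), 0.0) ** 2)   # obstacle penalty
    f_v = np.sum(np.maximum(np.linalg.norm(vel, axis=1) - v_max, 0.0) ** 2)
    f_a = np.sum(np.maximum(np.linalg.norm(acc, axis=1) - a_max, 0.0) ** 2)
    f_bs = np.sum((ctrl_pts[1] - ctrl_pts[0]) ** 2)  # placeholder for the viewpoint-transition term
    return f_s + w_t * T + lam_c * f_c + lam_d * (f_v + f_a) + lam_bs * f_bs

# Example: straight-line control points passing near a single obstacle.
pts = np.stack([np.linspace(0, 5, 8), np.zeros(8), np.full(8, 1.5)], axis=1)
obs = np.array([[2.5, 0.3, 1.5]])
print(trajectory_cost(pts, dt=0.5, obstacles=obs, v_max=2.0, a_max=2.0))
```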
3.3. Reconstruction: D-PGSR
The reconstruction method we propose is a 3DGS-based method. 3DGS (Kerbl et al., Reference Kerbl, Kopanas, Leimkühler and Drettakis2023) represents the appearance and geometry of a static 3D scene with a set of 3D Gaussian primitives. Each Gaussian primitive encompasses a set of attributes, including a 3D position $ \mu $, an opacity $ \alpha $, and a covariance matrix $ \Sigma $. A 3D Gaussian $ G $ can be represented as:

$$ G(x)=\exp \left(-\frac{1}{2}{\left(x-\mu \right)}^{\top }{\Sigma}^{-1}\left(x-\mu \right)\right) $$
The covariance is defined as:

$$ \Sigma =RS{S}^{\top }{R}^{\top } $$

where $ R $ is the rotation matrix and $ S $ is the scaling vector. For fast image rendering, the 3D Gaussians are splatted into 2D Gaussians $ {G}_{2D}\left(p;{\mu}_{2D},{\Sigma}_{2D}\right) $ on the image plane utilizing an affine approximation. Finally, the color of a pixel $ p $ is computed by $ \alpha $-compositing the color values of all the projected 2D Gaussians that intersect with the pixel $ p $ using the following equation:

$$ C(p)=\sum \limits_{i=1}^{N}{c}_i{\alpha}_i\prod \limits_{j=1}^{i-1}\left(1-{\alpha}_j\right) $$

where $ c $ represents the view-dependent color obtained by combining the SH coefficients of $ G $ with the viewing direction, and $ {\alpha}_i $ is the opacity of the $ i $-th Gaussian modulated by its 2D Gaussian value at $ p $.
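The following sketch illustrates the splatting-and-compositing step for a single pixel, assuming the Gaussians have already been projected to 2D and sorted by depth; the view-dependent SH color is collapsed to a constant RGB value for brevity.

```python
import numpy as np

def composite_pixel(p, gaussians):
    """Front-to-back alpha compositing of splatted 2D Gaussians at pixel p.

    gaussians: list of (mu2d, cov2d, opacity, color), assumed already sorted
    by depth and projected to the image plane. Returns the rendered RGB value.
    """
    color = np.zeros(3)
    transmittance = 1.0
    for mu, cov, opacity, c in gaussians:
        d = p - mu
        # Evaluate the 2D Gaussian G_2D(p; mu, Sigma) and modulate the opacity.
        g = np.exp(-0.5 * d @ np.linalg.solve(cov, d))
        alpha = opacity * g
        color += c * alpha * transmittance
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-4:        # early termination once nearly opaque
            break
    return color

gaussians = [(np.array([10.0, 10.0]), np.eye(2) * 4.0, 0.8, np.array([1.0, 0.2, 0.2])),
             (np.array([11.0, 10.0]), np.eye(2) * 9.0, 0.6, np.array([0.2, 0.2, 1.0]))]
print(composite_pixel(np.array([10.5, 10.0]), gaussians))
```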
Even though 3DGS excels at novel view synthesis, its ability to capture geometry is limited, especially in weak-texture areas. To address this issue, we introduce a modified version of Gaussian splatting with reference to PGSR (Chen et al., Reference Chen, Li, Ye, Wang, Xie, Zhai, Wang, Liu, Bao and Zhang2024). PGSR introduces multi-view regularization, which uses pairs of reference and neighboring images to ensure the global consistency of the geometry structure. A pixel $ {p}_r $ in the reference frame can be mapped to a pixel $ {p}_n $ in the neighboring frame through the homography matrix $ {H}_{rn} $:

$$ {\tilde{p}}_n\sim {H}_{rn}{\tilde{p}}_r,\quad {H}_{rn}={K}_n\left({R}_{rn}-\frac{T_{rn}{n}_r^{\top }}{d_r}\right){K}_r^{-1} $$

where $ \tilde{p} $ is the homogeneous coordinate of $ p $, $ {K}_r $ and $ {K}_n $ are the camera intrinsics, $ {n}_r $ and $ {d}_r $ are the plane normal and plane-to-camera distance rendered in the reference frame, and $ {R}_{rn} $ and $ {T}_{rn} $ represent the relative transformation from the reference frame to the neighboring frame. Similarly, for the pixel $ {p}_n $ in the neighboring frame, we can obtain the normal $ {n}_n $ and the distance $ {d}_n $ to compute the homography matrix $ {H}_{nr} $. The pixel $ {p}_r $ undergoes forward and backward projections between the reference frame and the neighboring frame through $ {H}_{rn} $ and $ {H}_{nr} $. Minimizing the forward and backward projection error constitutes the multi-view geometric consistency regularization:

$$ {L}_{mvgeo}=\frac{1}{\left|V\right|}\sum \limits_{p_r\in V}\phi \left({p}_r\right) $$

where $ \phi \left({p}_r\right)=\parallel {p}_r-{H}_{nr}{H}_{rn}{p}_r\parallel $ is the forward and backward projection error of $ {p}_r $. When $ \phi \left({p}_r\right) $ exceeds a certain threshold, the pixel is considered occluded or subject to significant geometric error, and it is not considered in the loss.
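A minimal sketch of the forward and backward projection check is shown below; the plane-induced homography form and the toy camera parameters are illustrative assumptions, and in practice the plane normal and distance are rendered from the Gaussians of each frame.

```python
import numpy as np

def plane_homography(K_src, K_dst, R, t, n, d):
    """Plane-induced homography H = K_dst (R - t n^T / d) K_src^{-1}."""
    return K_dst @ (R - np.outer(t, n) / d) @ np.linalg.inv(K_src)

def fwd_bwd_error(p, H_rn, H_nr):
    """phi(p) = || p - H_nr H_rn p || using homogeneous pixel coordinates."""
    q = np.array([p[0], p[1], 1.0])
    q = H_rn @ q; q /= q[2]          # reference -> neighbor
    q = H_nr @ q; q /= q[2]          # neighbor -> reference
    return np.linalg.norm(q[:2] - p)

# Toy check: a fronto-parallel plane at depth 2 m and a small lateral baseline.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R = np.eye(3); t = np.array([0.2, 0.0, 0.0])
n = np.array([0.0, 0.0, 1.0]); d = 2.0
H_rn = plane_homography(K, K, R, t, n, d)
H_nr = np.linalg.inv(H_rn)   # in PGSR this comes from the neighbor's rendered plane
print(fwd_bwd_error(np.array([300.0, 200.0]), H_rn, H_nr))   # ~0 for a consistent plane
```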
Inspired by MVS methods, photometric multi-view consistency constraints based on plane patches are also incorporated in this method. Specifically, we map an $ 11\times 11 $ pixel patch $ {P}_r $ centered at $ {p}_r $ in the reference image $ {I}_r $ to the corresponding patch $ {P}_n $ in the neighboring image $ {I}_n $ using the homography matrix $ {H}_{rn} $. To emphasize geometric details, color images are converted to grayscale. Multi-view photometric regularization enforces consistency between $ {P}_r $ and $ {P}_n $. We utilize the normalized cross-correlation (NCC) (Yoo and Han, Reference Yoo and Han2009) of patches from the reference frame and the neighboring frame to measure photometric consistency:

$$ {L}_{mvrgb}=\frac{1}{\left|V\right|}\sum \limits_{p_r\in V}\left(1-\mathrm{NCC}\left({I}_r\left({P}_r\right),{I}_n\left({H}_{rn}{P}_r\right)\right)\right) $$

where $ V $ represents the set of all pixels in the reference image $ {I}_r $, excluding those with significant errors.
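For reference, the NCC of two grayscale patches can be computed as follows (a standard implementation; patch extraction and homography warping are omitted):

```python
import numpy as np

def ncc(patch_a, patch_b, eps=1e-8):
    """Normalized cross-correlation of two grayscale patches (e.g., 11x11)."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean(); b -= b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Identical patches give NCC ~ 1, so the per-pixel penalty (1 - NCC) is ~ 0.
patch = np.random.rand(11, 11)
print(1.0 - ncc(patch, patch))
```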
The major drawback of the multi-view constraints mentioned above is that they omit the pixels where the errors exceed a threshold. These errors are commonly present in areas with weak texture. To address this limitation, we leverage recent advances in monocular depth estimation, which works particularly well on weak-texture surfaces. To achieve this, we first extract the low-texture areas in all the images using a gradient filter:

$$ \nabla I=\sqrt{{\left(\frac{\partial I}{\partial x}\right)}^2+{\left(\frac{\partial I}{\partial y}\right)}^2} $$

where $ \frac{\partial I}{\partial x} $ and $ \frac{\partial I}{\partial y} $ are the x and y gradients of the image. Pixels with gradients below a threshold are considered weak-texture areas, which are constrained using the following loss:

$$ {L}_{depth}=\frac{1}{\left|W\right|}\sum \limits_{p\in W}\left|a\overline{D}(p)+b-\hat{D}(p)\right| $$

where $ W $ is the set of weak-texture pixels, and $ a,b $ are scaling and shift parameters determined using sampled rays to align the nonmetric depth $ \overline{D} $ with the metric rendering depth $ \hat{D} $.
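The sketch below shows the two ingredients of this regularization under simple assumptions: a gradient-based weak-texture mask and a least-squares fit of the scale and shift $ a,b $ that aligns the monocular depth with the rendered depth over the masked pixels (the actual method samples rays rather than using all masked pixels).

```python
import numpy as np

def weak_texture_mask(gray, threshold=5.0):
    """Mask of pixels whose image-gradient magnitude falls below a threshold."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.sqrt(gx ** 2 + gy ** 2) < threshold

def align_depth(d_mono, d_render, mask):
    """Least-squares scale a and shift b such that a*d_mono + b matches d_render
    on the selected pixels; returns (a, b)."""
    x = d_mono[mask].ravel()
    y = d_render[mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

# Toy check: the rendered depth is exactly 2*d_mono + 0.5, so we recover (2.0, 0.5).
d_mono = np.random.rand(48, 64)
d_render = 2.0 * d_mono + 0.5
print(align_depth(d_mono, d_render, np.ones_like(d_mono, dtype=bool)))
```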
The reconstruction quality is assessed using the metrics defined in Table 1.
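As a hedged illustration of how such point-cloud metrics are typically computed (the exact definitions and distance threshold used in Table 1 follow common practice; the threshold value below is only an example):

```python
import numpy as np
from scipy.spatial import cKDTree

def reconstruction_metrics(pred, gt, tau=0.05):
    """Common point-cloud metrics: accuracy/completeness are mean nearest-neighbor
    distances, precision/recall are the fractions of points within threshold tau,
    and F1 is their harmonic mean. pred and gt are (N, 3) arrays."""
    d_pred_to_gt, _ = cKDTree(gt).query(pred)     # predicted -> ground truth
    d_gt_to_pred, _ = cKDTree(pred).query(gt)     # ground truth -> predicted
    acc = d_pred_to_gt.mean()
    comp = d_gt_to_pred.mean()
    prec = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    f1 = 2 * prec * recall / (prec + recall + 1e-12)
    return acc, comp, prec, recall, f1
```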
Since we use Colmap (Schönberger and Frahm, Reference Schönberger and Frahm2016) for initialization, the point cloud generated using D-PGSR is not on an absolute scale. To recover the true scale, we utilize odometry data from VINS-Fusion (Visual-Inertial Navigation System), which fuses a stereo camera and an IMU sensor and therefore provides absolute-scale odometry.
To calculate the scale factor, we first calculate the transformation $ T $ between the first camera pose in the odometry, $ {P}_{VINS,0} $, and the first camera pose from Colmap, $ {P}_{Colmap,0} $, and apply the transformation $ T $ to all $ N $ camera poses from Colmap following Equation 3.10. This step aligns the two sets of cameras to the same orientation, as shown in Figure 5. Next, we compute the positional range of the camera poses along the three axes and the scaling factors $ {x}_{scale},{y}_{scale},{z}_{scale} $ for each axis following Equation 3.11, where $ \mathrm{range}\left(\cdot \right) $ denotes the difference between the maximum and minimum coordinate values over all camera positions. Finally, we compute the scaling factor applied to the dense point cloud by taking the mean of the scales from the three axes.

$$ T={P}_{VINS,0}{P}_{Colmap,0}^{-1},\quad {P}_{Colmap,i}^{\prime }=T{P}_{Colmap,i},\quad i=0,\dots, N\kern2em \left(3.10\right) $$

$$ {x}_{scale}=\frac{\mathrm{range}\left({x}_{VINS}\right)}{\mathrm{range}\left({x}_{Colmap}^{\prime }\right)},\quad {y}_{scale}=\frac{\mathrm{range}\left({y}_{VINS}\right)}{\mathrm{range}\left({y}_{Colmap}^{\prime }\right)},\quad {z}_{scale}=\frac{\mathrm{range}\left({z}_{VINS}\right)}{\mathrm{range}\left({z}_{Colmap}^{\prime }\right)},\quad s=\frac{x_{scale}+{y}_{scale}+{z}_{scale}}{3}\kern2em \left(3.11\right) $$
Figure 5. Aligning the orientation of the camera poses from Colmap and camera poses from VINS.
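A compact sketch of this scale-recovery procedure, assuming camera poses are given as 4x4 camera-to-world matrices and that the trajectory spans all three axes, is shown below.

```python
import numpy as np

def recover_scale(colmap_poses, vins_poses):
    """Align Colmap camera poses to the VINS odometry frame and estimate a
    single metric scale factor. Both inputs are lists of 4x4 pose matrices."""
    T = vins_poses[0] @ np.linalg.inv(colmap_poses[0])     # Eq. (3.10): align first poses
    aligned = np.array([(T @ P)[:3, 3] for P in colmap_poses])
    vins_xyz = np.array([P[:3, 3] for P in vins_poses])
    ranges_vins = vins_xyz.max(axis=0) - vins_xyz.min(axis=0)
    ranges_colmap = aligned.max(axis=0) - aligned.min(axis=0)
    per_axis = ranges_vins / ranges_colmap                 # Eq. (3.11): per-axis scales
    return per_axis.mean(), per_axis                       # mean scale applied to the cloud
```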
3.4. Semantic segmentation
The semantic segmentation model we implemented combines a Swin transformer (Liu et al., Reference Liu, Lin, Cao, Hu, Wei, Zhang, Lin and Guo2021) backbone with an Upernet (Xiao et al., Reference Xiao, Liu, Zhou, Jiang and Sun2018) head decoder. The model architecture is shown in Figure 6. We used two sets of weights for the segmentation of building components and building materials separately. The segmentation classes are shown in Table 2.

Figure 6. Semantic segmentation architecture using Swin transformer and Uperhead.
Table 2. Semantic segmentation classes

For building component segmentation, we directly used the weights pretrained on the ADE20K dataset (Zhou et al., Reference Zhou, Zhao, Puig, Fidler, Barriuso and Torralba2017). The ADE20K dataset contains 150 classes; for our purpose, we only kept the classes shown in Table 2, and the remaining classes were combined into Others. For building material segmentation, we trained the model with a dataset of 1,008 images and 12 classes. The dataset was collected from 15 buildings at the authors' university using an iPhone XR and labeled using Roboflow (Dwyer et al., Reference Dwyer, Nelson and Hansen2024). Since we have a limited number of test scenes and not all the material classes are present, only the performance on the classes shown in Table 2 is evaluated. Mean intersection over union (mIoU) and mean accuracy are used as the evaluation metrics in this study and are defined as follows:

$$ \mathrm{mIoU}=\frac{1}{N}\sum \limits_{i=1}^{N}\frac{TP_i}{TP_i+{FP}_i+{FN}_i} $$

$$ \mathrm{mAcc}=\frac{1}{N}\sum \limits_{i=1}^{N}\frac{TP_i}{TP_i+{FN}_i} $$

where $ N $ is the number of classes in the segmentation task, and $ TP, FP, TN, FN $ represent true positive, false positive, true negative, and false negative, respectively.
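Equivalently, both metrics can be computed from a per-class confusion matrix, as in the following sketch:

```python
import numpy as np

def miou_macc(confusion):
    """confusion[i, j] counts pixels of ground-truth class i predicted as class j."""
    tp = np.diag(confusion).astype(np.float64)
    fp = confusion.sum(axis=0) - tp          # predicted as class i but belong elsewhere
    fn = confusion.sum(axis=1) - tp          # belong to class i but predicted elsewhere
    iou = tp / (tp + fp + fn + 1e-12)
    acc = tp / (tp + fn + 1e-12)
    return iou.mean(), acc.mean()

conf = np.array([[50, 2, 3],
                 [4, 40, 6],
                 [1, 5, 44]])
print(miou_macc(conf))
```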
To map the semantic information onto the point cloud, we first back-project the point cloud into each camera view to form a depth map. Next, the generated label map is overlaid on the depth map. Finally, we project the depth maps and the semantic label maps back into 3D space using the known camera parameters.
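The sketch below illustrates the projection step under simple assumptions (known intrinsics and world-to-camera transforms); the depth-based visibility check described above is omitted for brevity.

```python
import numpy as np

def label_points(points, K, T_wc, label_map):
    """Assign each visible 3D point the semantic label of the pixel it projects to.

    points    : (N, 3) world-frame points
    K         : (3, 3) camera intrinsics
    T_wc      : (4, 4) world-to-camera transform
    label_map : (H, W) per-pixel class ids for this view
    Returns an (N,) label array, with -1 for points outside the image or behind the camera.
    """
    h, w = label_map.shape
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_wc @ pts_h.T).T[:, :3]                 # world -> camera frame
    labels = np.full(len(points), -1, dtype=np.int64)
    in_front = cam[:, 2] > 0
    uv = (K @ cam[in_front].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).round().astype(int)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    idx = np.flatnonzero(in_front)[valid]
    labels[idx] = label_map[uv[valid, 1], uv[valid, 0]]
    return labels
```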
4. Evaluation and discussion
4.1. Autonomous navigation
We tested our UAV in three different spaces in an educational building at the authors' university: the classroom, the lobby, and the lounge. The exploration paths are shown in Figure 7. All scans were completed in under 30 s. The red line shows the path of the UAV, and the remaining points are detected obstacles. The colors of the obstacles represent different altitudes: colder colors represent lower altitudes (e.g., the ground in green) and warmer colors represent higher altitudes (e.g., the upper walls in purple). We observed that the exploration achieved good room coverage, with all floors and walls covered, which provided a solid foundation for 3D reconstruction.

Figure 7. Exploration paths in three testing spaces.
4.2. 3D reconstruction
Figure 8 shows the reconstruction results using our proposed method. To evaluate the accuracy of the reconstruction, we collected ground truth point clouds using a high-fidelity terrestrial LiDAR scanner (Trimble TX8), shown in Figure 9. Quantitative reconstruction results are shown in Table 3. Our method achieves superior performance compared to the state-of-the-art multi-view stereo method, ACMMP (Xu et al., Reference Xu, Kong, Tao and Pollefeys2023), and the Gaussian splatting method, PGSR (Chen et al., Reference Chen, Li, Ye, Wang, Xie, Zhai, Wang, Liu, Bao and Zhang2024). We also visualize the cloud-to-cloud distance between the reconstruction and the ground truth using CloudCompare (Cloudcompare, 2024), as shown in Figure 10. Colder colors (e.g., blue) represent smaller distances, indicating parts that adhere closely to the ground truth, while warmer colors (e.g., red) represent larger distances and hence larger reconstruction errors. We identified three major sources of error. The first is incorrectly reconstructed points, which are inevitable since noise inherent in the input images conveys incorrect information about the environment. The second comes from the limited coverage of the LiDAR data. Given limited resources, we were only able to acquire one LiDAR scan for each scene; therefore, some of the areas reconstructed by the UAV are missing in the LiDAR scan, introducing errors in the evaluation. The last source of error comes from scaling the point clouds to absolute scale, which stems mainly from the VIO system. To evaluate this error, we also scaled the point clouds by manually matching feature points between the reconstructed and ground truth point clouds, which can be considered the ground truth scale. Table 4 shows that the error is relatively small.

Figure 8. Point clouds generated with Colmap + D-PGSR.

Figure 9. Point clouds collected using a LiDAR scanner.
Table 3. Comparison of different methods across multiple scenes. Lower values are better for Accuracy (Acc) and Completeness (Comp), while higher values are better for Precision (Prec), Recall, and F1-score


Figure 10. Cloud-to-cloud distances of the reconstructed point clouds and the point clouds from a LiDAR scanner.
Table 4. Cloud-to-cloud distance between 3D reconstructed point clouds and reference point clouds collected via terrestrial LiDAR

4.3. Semantic segmentation
To evaluate the segmentation results, we manually labeled the scenes with ground truth building components and materials. The quantitative results are shown in Table 5 and visualization is shown in Figures 11 and 12.
Table 5. Semantic segmentation results


Figure 11. Visualization of the building components segmentation.

Figure 12. Visualization of the building materials segmentation.
The overall accuracy shows that roughly 80% of the points are correctly labeled in both tasks, demonstrating that the models are effective overall. In Figures 11 and 12, we observed that major classes like walls, ceilings, and floors are correctly labeled, as are their corresponding materials, which lays a strong foundation for scan-to-BIM. On the other hand, mIoU is a stricter evaluation metric requiring precise segmentation. The mIoU values achieved for the two tasks are 0.588 and 0.629, reflecting room for improvement in segmentation details. Specifically, we identified the following sources of error. First, visual similarities between classes can lead to incorrect segmentation by the model. For example, in building component segmentation, the IoU for windows is particularly low, at only 0.065, which significantly lowers the overall performance. This is due to the visual features of windows being similar to those of other classes, like walls and doors, making them hard to distinguish. Furthermore, the limited number of testing scenes can introduce strong biases during evaluation. In conclusion, we focused on testing the feasibility of the proposed pipeline, and the experimental results above validate its effectiveness.
5. Conclusion
In this study, we propose ICON drone and D-PGSR, a novel robotic approach to reconstruct semantic-rich point clouds for scan-to-BIM. Our approach consists of three modules: an autonomous navigation system based on VIO and frontier-based exploration, a Gaussian splatting-based 3D reconstruction method, and automated segmentation of building components and materials.
We tested the effectiveness and robustness of our system in three different scenes: the classroom, the lobby, and the lounge. Our UAV successfully explored all three scenes. The average F1-score between the reconstructions and the point clouds collected using a LiDAR scanner is 0.6870, and the mIoU values for building component and material segmentation are 0.588 and 0.629, respectively. These results demonstrate that the ICON drone is capable of collecting building information in an unknown environment, which can help relieve the labor shortage in retrofitting and inspections for building owners and facility managers.
However, we acknowledge several key limitations that currently constrain practical deployment. First, the system is not suited for low-light or completely dark environments due to its reliance on vision-based sensors. Second, it cannot detect or reconstruct non-visible infrastructure elements, such as HVAC (Heating, Ventilation, and Air Conditioning) ducts, wall cavities, or plumbing, which are crucial in BIM. Third, the UAV’s ability to navigate is impaired in cluttered or occluded areas and requires open pathways between rooms (e.g., open doors and minimal obstruction), which may not always be available in real-world scenarios. To address the above limitations, we need the drone to “see more” by incorporating additional sensors (e.g., LiDAR and radar); “think more,” by employing more advanced perception and decision-making algorithms; and “do more,” by adding the ability to interact with the environments through multi-robot collaboration with ground robots or robot arms.
Moreover, while our segmentation models demonstrate reasonable performance within our test environments, they were trained on limited datasets that do not reflect the full diversity of real-world buildings. Materials and components vary significantly across regions, architectural styles, and usage contexts, posing challenges for generalization. To achieve more robust and accurate segmentation, potential future work includes: (1) expanding the dataset to include a wider range of materials and layouts; (2) exploring domain adaptation, few-shot learning, and open-vocabulary approaches to improve performance on unseen environments; (3) embedding geometric priors to reduce semantic errors in 3D space (e.g., mislabeled surfaces); and (4) conducting large-scale field tests across varied building types and operational conditions.
Finally, we recognize safety as a critical concern. The use of UAVs indoors must account for risks to occupants and property. To address this, we plan to reduce UAV weight, implement passive safety mechanisms (e.g., protective cages or soft propeller guards), and design adaptive navigation strategies that account for the presence of dynamic obstacles like humans or movable furniture.
In conclusion, while ICON and D-PGSR demonstrate promising capabilities, real-world deployment demands continued research into robustness, generalization, and operational safety. With these improvements, we believe our system can become a valuable asset for autonomous building modeling and digital twin generation.
Data availability statement
The ScanNet data that support the findings of this study are available from http://www.scan-net.org/ (Dai et al., Reference Dai, Chang, Savva, Halber, Funkhouser and Nießner2017). Restrictions apply to the availability of these data, which were used under license for this study. The rest of the data are original and openly available in Zenodo at https://zenodo.org/records/15311846 (Zhang, Reference Zhang2025).
Acknowledgments
The authors would like to thank Haibo Feng, Zhifan Liu, and Rojini Kathiravel at the Sustainable Built Environment Lab at UBC for providing support on the LiDAR scanner.
Author contribution
H.X.Z.: Conceptualization (equal), methodology (equal), investigation (equal), validation (lead), writing—original draft (lead), writing—review and editing (lead). Y.Y.: Conceptualization (supporting), methodology (equal), investigation (equal). Z.Z.: Conceptualization (equal), supervision (lead), funding acquisition (lead), writing—original draft (supporting), writing—review and editing (supporting).
Funding statement
This work received no specific grant from any funding agency, commercial, or not-for-profit sectors.
Competing interests
The authors declare none.
Ethical standard
The research meets all ethical guidelines, including adherence to the legal requirements of the study country.
Appendix A. ScanNet reconstruction
To further evaluate the reconstruction performance of the proposed method, we conduct experiments on four ScanNet scenes (Dai et al., Reference Dai, Chang, Savva, Halber, Funkhouser and Nießner2017) following the protocol in reference (Guo et al., Reference Guo, Peng, Lin, Wang, Zhang, Bao and Zhou2022). ScanNet is an RGB-D video dataset for evaluating indoor scene reconstruction. The visualization of the reconstruction and the ground truth is shown in Figure A1, and the quantitative results are shown in Table A1. The results demonstrate significant improvement compared to existing methods. However, we observe that our proposed method performs worse on scene 0580 than ACMMP. By inspecting the input images and the reconstructed point clouds, we attribute this to the lack of structural cues in the ceilings of the scene, which consist mostly of white surfaces with poor overlap between nearby frames. As shown in Figure A2, the edges between the walls and the ceiling are blurry under complex lighting and shadows, leading to poor reconstruction. Gaussian splatting-based methods reconstruct the scene by optimizing the rendered images; with limited viewpoints, the reconstruction can still produce accurate renderings at these viewpoints, but the geometry might be incorrect or not proportional to the rest of the scene. Adding geometric constraints improves the results but still fails in extreme cases, pointing to the need for future work.

Figure A1. Visualization of ScanNet reconstructions.
Table A1. Comparison of different methods across multiple scenes. Lower values are better for Accuracy (Acc) and Completeness (Comp), while higher values are better for Precision (Prec), Recall, and F1-score


Figure A2. Images corresponding to the incorrectly constructed areas.
Appendix B. Navigation in complex environments
To validate the proposed system's ability to navigate in complex environments, we conducted experiments in a lab environment containing clusters of furniture and equipment, shown in Figure B2. Figure B1 presents the trajectory and the obstacles detected during flight. Our drone successfully explored the space and avoided all obstacles, demonstrating its ability to navigate in complex environments. The reconstruction of the lab is shown in Figure B3. Due to occlusion by the obstacles, the completeness of the reconstruction is lower than in the other experiments. To address this issue, additional cameras could be mounted in the future to increase the field of view.

Figure B1. Exploration path in a complex lab environment.

Figure B2. Lab environment.

Figure B3. 3D reconstruction of the lab.