1. Introduction
Simultaneous localization and mapping (SLAM) technology enables real-time self-localization and map construction in unknown environments, and it is a key technology in the fields of robotics, autonomous driving, and augmented reality [Reference Cho, Jo and Kim1–Reference Mulubika and Schreve5]. Compared with traditional laser SLAM, visual SLAM (VSLAM) has lower cost and richer environment-sensing capability, giving it great potential in applications such as robot navigation, augmented reality (AR), virtual reality (VR), and intelligent surveillance. In recent years, with the rapid development of computer vision and machine learning, the performance and robustness of VSLAM have improved significantly. However, most existing VSLAM algorithms are based on the static environment assumption, which limits their performance in complex dynamic environments.
Current mainstream VSLAM methods, such as ORB-SLAM2 [Reference Mur-Artal and Tardós6], ORB-SLAM3 [Reference Campos, Elvira, Rodríguez, Montiel and Tardós7], LSD-SLAM [Reference Engel, Schöps and Cremers8], and DSO [Reference Engel, Koltun and Cremers9], demonstrate good accuracy and stability in static scenes. However, in dynamic environments, moving objects (such as pedestrians and vehicles) introduce mismatched feature points that disrupt the system's state estimation. This interference may significantly reduce the robustness of the visual SLAM system or even cause it to fail. Therefore, effective detection and rejection of dynamic objects are crucial for the accuracy and robustness of visual SLAM systems in dynamic scenes.
Recently, methods such as DS-SLAM [Reference Yu, Liu, Liu, Xie, Yang, Wei and Fei10], Dyna-SLAM [Reference Bescos, Fácil, Civera and Neira11], and DE-SLAM [Reference Xing, Zhu and Dong12] have attempted to improve system performance by using semantic segmentation and object detection networks to obtain semantic information about dynamic elements, and by combining geometric constraints to remove these dynamic feature points. However, the strong dependence of these methods on semantic information substantially increases the computational overhead, posing a greater challenge to real-time performance. In addition, these methods rely on traditional indirect VSLAM pipelines built on hand-crafted features, which limits the overall robustness of the system and its ability to handle complex dynamic scenes. Meanwhile, completely eliminating the feature points in a dynamic region may leave the pose estimation under-constrained and degrade the localization accuracy. Therefore, how to balance the detection of dynamic objects with the preservation of static features in dynamic environments becomes a key research issue.
To cope with the above challenges, this paper proposes the GAF-SLAM algorithm, which introduces the YOLO-Point [Reference Backhaus, Luettel and Wuensche13] deep learning network into the visual SLAM system to replace the traditional ORB feature extraction method. By combining gray area feature point screening with static probability calculation in dynamic environments, the system analyzes the feature points in dynamic regions, effectively identifies and removes dynamic feature points while retaining more static feature points, and thereby improves localization accuracy and stability. After the static probability is calculated for all static feature points, the weighted static probability and spatio-temporal constraints are fused to perform dynamic pose estimation, and a more accurate pose is solved. The specific contributions of this paper are as follows:
1. This paper introduces the concept of “gray area feature points” and designs a dynamic feature point screening and static probability calculation framework that integrates deep learning with geometric optimization. By incorporating the YOLO-Point network, the system achieves dynamic object detection and feature point extraction, and accurately identifies gray area feature points in dynamic regions using reprojection error and epipolar geometry constraints. In addition, a static probability calculation method is proposed, which assigns static weights to gray area feature points based on reprojection distance, epipolar distance, and observation state, thereby enhancing the robustness and accuracy of pose estimation in dynamic environments.
2. A dynamic pose estimation algorithm fusing weighted static probability and spatio-temporal constraints is proposed, which calculates the static probability of feature points and dynamically adjusts their weights in the optimization process. In addition, the algorithm further combines temporal continuity and spatial smoothness constraints to effectively optimize the weight allocation strategy of static feature points, thus enhancing the stability and robustness of feature point matching. Ultimately, the weighted sum of the reprojection error and the temporal consistency constraint is minimized by nonlinear least squares optimization, which achieves high-precision pose estimation in dynamic environments.
3. We integrated the method into the front-end of ORB-SLAM2 and evaluated the method on the TUM RGB-D datasets and Bonn RGB-D datasets, as well as tested it in real-world scenarios. The results show that GAF-SLAM achieves high localization accuracy and robust performance in various dynamic environments.
The paper is organized as follows. Section 2 reviews related work. Section 3 describes the main theoretical model and algorithm design of the proposed method. Section 4 presents a comparative analysis of the experimental results. Conclusions are presented in Section 5.
2. Related work
2.1. Static SLAM
VSLAM algorithms can be categorized into two main types: direct methods and indirect methods. Direct methods rely on the assumption of pixel intensity invariance, utilizing photometric information directly to minimize errors in pose estimation. Representative techniques such as LSD-SLAM and DSO are favored for their fast computational speed and adaptability to texture-scarce environments. However, these methods lack loop closure detection modules, which can lead to the accumulation of errors, and they exhibit insufficient robustness under varying lighting conditions.
In contrast, indirect methods estimate the camera pose by extracting and matching features. Mono-SLAM [Reference Davison, Reid, Molton and Stasse14] and ORB-SLAM2 are typical examples of this approach. Although they are somewhat slower in processing speed, they demonstrate greater robustness in scenarios with lighting changes and rapid camera motion. In ref. [Reference Liu, Wen and Zhang15], the authors significantly improved the system's matching accuracy by introducing line features and utilizing IMU-assisted optical flow tracking to predict these line features. Liu et al. [Reference Liu, Wen, Zhao, Qiu and Zhang16] proposed a lightweight SLAM method based on pyramid IMU-predicted optical flow tracking, aiming to reduce the computational cost of feature tracking while enhancing the system's processing speed.
However, most existing VSLAM techniques assume that the external environment is static. In real-time applications, moving objects are prevalent, and this dynamic characteristic can significantly affect the localization accuracy and tracking performance of traditional VSLAM systems, which in turn seriously threatens the stability and accuracy of the system.
2.2. Dynamic SLAM
To cope with the effects of dynamic scenes, current VSLAM systems mainly adopt two types of methods to recognize and reject dynamic feature points: geometric information methods and semantic information methods.
Geometric information methods utilize geometric constraints to detect and reject dynamic points. Such methods usually recognize dynamic features by detecting the motion consistency or geometric properties of the feature points. For example, Zou et al. [Reference Zou and Tan17] projected feature points from the previous frame to the current frame and classified static and dynamic feature points according to the magnitude of the 2D reprojection error. Wang et al. [Reference Wang, Wan, Wang and Di18] combined polar constraints and RGB-D depth clustering information to identify outliers in neighboring frames to detect moving targets. Dai et al. [Reference Dai, Zhang, Li, Fang and Scherer19] succeeded in distinguishing dynamic targets from static backgrounds by analyzing the correlation of map points, effectively reducing the influence of dynamic objects on position estimation. Song et al. [Reference Song, Yuan, Ying, Yang, Song and Zhou20] employ density-based spatial clustering of applications with noise (DBSCAN) [Reference Hahsler, Piekenbrock and Doran21] in conjunction with geometric consistency and epipolar constraints to remove dynamic feature points. However, these geometric methods rely on localized feature motion variations and are poorly adapted to large-scale dynamic scenes, which may affect the reliability of position estimation and map construction accuracy.
The semantic information approach extracts semantic information from images with the help of deep learning models to remove potential dynamic objects, which provides a new solution for the application of SLAM in dynamic environments. In recent years, the combination of deep learning and SLAM algorithms has made significant progress. For example, Dyna-SLAM combines Mask-R-CNN [Reference He, Gkioxari, Dollár and Girshick22] and a multi-view geometry approach in the ORB-SLAM2 framework to effectively remove dynamic points. Yang et al. [Reference Yang, Yuan, Zhu, Chi, Li and Liao23] used faster R-CNN [Reference Ren, He, Girshick and Sun24] to detect dynamic objects and further confirmed them by geometric matching. DS-SLAM combines Seg-Net [Reference Yu, Chen, Chang and Ti25] with motion consistency detection for accurate recognition of dynamic objects. Wen et al. [Reference Wen, Li, Liu, Li, Tao, Long and Qiu26] combined semantic information with pixel spatial motion features to effectively improve localization accuracy. The OVD-SLAM [Reference He, Li, Wang and Wang27] utilizes pixel-level dynamic object segmentation to distinguish foreground from background, and recovers static points on moving objects by minimizing reprojection errors, thereby mitigating the negative impact of dynamic points on system performance. These methods perform well in dynamic environments, but their strong dependence on semantic information significantly increases the computational overhead and poses a serious challenge to real-time performance.
For this reason, other approaches have begun to try to optimize the use of semantic information. For example, YOLO-SLAM [Reference Wu, Guo, Gao, You, Liu and Chen28] combines a YOLO target detection network with VSLAM to cull out feature points in dynamic regions using real-time dynamic object detection, thus improving robustness and accuracy in dynamic environments. COEB-SLAM [Reference Min, Wu, Li, Wang and Liu29] proposed a real-time dynamic SLAM algorithm based on deep learning, extracting semantic information from the scene using object detection networks and combining optical flow techniques to remove dynamic feature points, significantly reducing localization errors in dynamic environments. SG-SLAM [Reference Cheng, Sun, Zhang and Zhang30] further enhances the system's robustness by combining semantic information with geometric information, enabling more accurate identification and elimination of dynamic point interference. GGC-SLAM [Reference Sun, Liu, Zou, Xu and Li31] calculates the fundamental matrix distance and uses object detection results to eliminate dynamic feature points, effectively reducing the impact of dynamic scenes on the SLAM system. Islam et al. [Reference Islam, Ibrahim, Chin, Lim, Abdullah and Khozaei32] improved traditional VSLAM by combining object detection to filter dynamic objects and focusing on static points. Fu et al. [Reference Fu, Yang, Ma and Zhang33] combined deep learning with probabilistic filtering methods, significantly improving the robustness of VSLAM in dynamic environments. MPOC-SLAM [Reference Wu, Zhang, Zhang, Song, Wang and Yuan34] utilizes object category and motion probability modeling to significantly improve localization and map-building capabilities in highly dynamic environments.
Although these semantic approaches have achieved good results in dynamic environments, the complete elimination of semantic information from dynamic objects may lead to insufficient feature points and affect the matching performance of the system. Moreover, these methods still mainly rely on the traditional indirect SLAM framework, which uses hand-designed feature point detection and description methods and fails to fully exploit the potential of deep learning models for efficient joint feature extraction. This limitation provides a research direction to further improve the performance of SLAM in dynamic environments.
3. Improved VSLAM system
In most feature-based VSLAM methods, the camera rotation R and translation t are estimated by minimizing the reprojection error between the key points
$x_{i}=(u_{i},v_{i})^{T}$
and their corresponding 3D points
$X_{i}=(x,y,z)^{T}.$
\begin{align} \left\{R^{*},t^{*}\right\}&=\underset{R,t}{\arg \min } \frac{1}{2}\sum _{i=1}^{n}\left\| x_{i}-\pi \left(RX_{i}+t\right)\right\| _{2}^{2}\nonumber\\ &=\underset{R,t}{\arg \min } \sum _{i=1}^{n}\frac{1}{2}\left\| x_{i}-\pi \left(\left[\begin{array}{c@{\quad}c} R & t\\ 0 & 1 \end{array}\right]\left[\begin{array}{c} X_{i}\\ 1 \end{array}\right]\right)\right\| _{2}^{2}\\[-6pt]\nonumber \end{align}
In this context,
$\pi (\cdot )$
denotes the camera transformation model that converts 3D coordinates to pixel coordinates,
$R^{*}$
and
$t^{*}$
represent the optimized camera pose,
$f_{x}$
and
$f_{y}$
are the camera focal lengths, and
$c_{x}$
and
$ c_{y}$
are the coordinates of the camera's principal point.
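For concreteness, the camera transformation model π(·) and the per-point residual of Eq. (1) can be written out as a short sketch. The snippet below is an illustrative Python/NumPy example under the standard pinhole assumption; the function names are ours and are not part of the GAF-SLAM implementation.

```python
import numpy as np

def project(X_cam, fx, fy, cx, cy):
    """Pinhole projection pi(.): a camera-frame 3D point -> pixel coordinates."""
    x, y, z = X_cam
    return np.array([fx * x / z + cx, fy * y / z + cy])

def reprojection_residual(x_obs, X_world, R, t, fx, fy, cx, cy):
    """Residual x_i - pi(R X_i + t) that Eq. (1) sums over all matched points."""
    X_cam = R @ X_world + t
    return x_obs - project(X_cam, fx, fy, cx, cy)
```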
In general, in a static environment, all extracted feature points participate in the optimization process. However, feature points from dynamic elements interfere with this optimization. Specifically, due to the lack of observable motion information, dynamic feature points cannot be matched to the camera transformation model without the support of other sensors. This mismatch negatively affects the optimization of Eq. (1), which increases the camera pose error. Therefore, in order to significantly improve the adaptability of the system in dynamic scenes, it is necessary to exclude dynamic feature points from the optimization while introducing weights for the static feature points, so that a more accurate pose can be solved.
GAF-SLAM is implemented on the basis of the ORB-SLAM2 framework, a traditional feature-point-based SLAM method. As shown in Figure 1, image frames are first fed into the YOLO-Point network, whose output includes deep feature points, descriptors, and the associated object detection boxes. Next, the feature points inside the detection boxes are re-evaluated using the reprojection error and epipolar geometry constraints to filter out gray area feature points. Subsequently, based on our proposed static probability algorithm, these gray area feature points are discriminated and low-probability feature points are eliminated, so that as many static feature points as possible are retained. Finally, the preserved static feature points, together with their static weights and the spatio-temporal constraints, are used for pose estimation.

Figure 1. To enhance the robustness and accuracy of ORB-SLAM2 in dynamic environments, a gray area feature point recognition module and a static probability calculation module are introduced. Traditional ORB feature points are replaced with YOLO-Point deep learning features, and the resulting feature points and descriptors are formatted in ORB style for seamless integration. Gray area screening and static probability calculation dynamically adjust feature point weights in pose estimation, optimizing the final pose for accurate tracking and localization.
3.1. YOLO-Point
Super-Point [Reference DeTone, Malisiewicz and Rabinovich35] is a multi-task neural network that realizes the tight integration of key point detection and descriptor generation by sharing the feature output of the backbone network. It is designed to be able to accomplish both tasks in a single forward propagation, which dramatically improves the computational efficiency and is particularly suitable for real-time scenarios. In addition, in recent years, there has been a trend to incorporate YOLO series of deep learning algorithms into SLAM systems. These algorithms provide a better solution for SLAM in dynamic scenes by accurately detecting dynamic objects and effectively reducing their interference with the localization and mapping system.
YOLO-Point proposes a unified framework that fuses key point detection with object detection. Unlike traditional methods that rely on separate feature extraction and post-processing modules, YOLO-Point is able to achieve multi-task learning in a single forward propagation, which significantly reduces computational complexity and improves real-time performance. Meanwhile, the feature points extracted by deep learning show stronger robustness in complex scenes such as lighting changes, view angle changes, and motion blurring. With the object detection capability, YOLO-Point is able to radically reduce the accumulation of localization errors in dynamic scenes, whereas traditional SLAM methods usually lack effective dynamic feature point processing strategies in dynamic environments.
The core design concept of YOLO-Point is to share the backbone network for fast and efficient prediction. In a single forward propagation, YOLO-Point not only performs key point detection, descriptor generation, and target bounding box prediction simultaneously, but also improves adaptability and robustness in dynamic environments through multi-task co-design. Compared with hand-designed feature points such as traditional ORB, deep feature points provide significantly better perception in complex scenes, as well as higher accuracy in pose estimation. Despite the higher computational complexity of deep feature points, YOLO-Point achieves a balance between high accuracy and real-time performance by sharing the feature extraction module.
In model training, YOLO-Point is first pretrained on the synthetic shapes dataset and the COCO dataset [Reference Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár and Zitnick36], combined with single-response adaptive and mosaic data augmentation methods to reduce the error caused by padding, thus improving the accuracy of key point detection. Subsequent staged fine-tuning on the KITTI dataset [Reference Geiger, Lenz, Stiller and Urtasun37] further optimized the model's performance on new data and new object classes by freezing and unfreezing the weights of different layers. In addition, to further enhance the loop closure detection capability, we use the OpenLORIS dataset [Reference Shi, Li, Zhao, Tian, Tian, Long, Zhu, Song, Qiao, Song and Guo38] to generate a vocabulary of deep features and achieve efficient loop detection through FBoW [Reference Munoz-Salinas and Medina-Carnicer39] search, which effectively reduces the cumulative error in pose estimation. This training strategy not only ensures the accuracy of the model but also enhances its robustness in complex dynamic environments.
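As Figure 1 notes, the keypoints and descriptors produced by the network are repackaged in ORB style so that the downstream ORB-SLAM2 modules can consume them unchanged. The snippet below is a minimal sketch of such a conversion in Python with OpenCV; the array shapes and the sign-based binarization of the descriptors are illustrative assumptions rather than the exact GAF-SLAM implementation.

```python
import numpy as np
import cv2

def to_orb_style(kpts_xy, scores, descriptors, size=31):
    """Wrap network outputs so ORB-SLAM2-style code can consume them.

    kpts_xy:     (N, 2) pixel coordinates from the keypoint head.
    scores:      (N,)   detection confidences, stored as KeyPoint.response.
    descriptors: (N, D) float descriptors; binarized to packed uint8 here as an
                 illustrative stand-in for an ORB-compatible binary format.
    """
    keypoints = [cv2.KeyPoint(float(x), float(y), size, -1, float(s))
                 for (x, y), s in zip(kpts_xy, scores)]
    desc_bits = (descriptors > 0).astype(np.uint8)   # assumption: sign binarization
    return keypoints, np.packbits(desc_bits, axis=1)
```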
3.2. Reprojection error
Reprojection error is a metric used in visual SLAM to measure the positional error of a 3D point projected onto an image. Specifically, the reprojection error measures the deviation between a 3D point mapped into the image coordinate system and the actually detected 2D feature point. A smaller reprojection error indicates that the 3D point is more consistent with the detected point in the image, which can usually be taken as an indicator that the point belongs to the static environment, while a larger reprojection error may mean that the point is affected by a dynamic object.
As shown in Figure 2, assume that a 3D point
$X_{i}=(X_{i},Y_{i},Z_{i})$
in the local map is projected onto the image plane through the camera’s projection matrix P, resulting in the projected 2D coordinates
$p_{i}=(u_{i},v_{i})$
. Ideally, this point should coincide with the position of the feature point in the image frame. If the actual detected position of the feature point is
$p_{i}^{'}$
, then the reprojection error can be expressed as:
\begin{equation}e_{\textit{reproj}}=\left\| p_{i}-p_{i}^{'}\right\| =\sqrt{\left(u_{i}-u_{i}^{'}\right)^{2}+\left(v_{i}-v_{i}^{'}\right)^{2}}\end{equation}
where
$u_{i},v_{i}$
represents the coordinates of the 3D point after projection to the image and
$u_{i}^{\mathrm{'}}$
,
$v_{i}^{\mathrm{'}}$
are the coordinates of the image where the point is actually detected.

Figure 2. Schematic diagram of reprojection error.
In dynamic environments, static feature points usually originate from fixed scene elements, and thus, the reprojection error between frames is small. Dynamic feature points, on the other hand, originate from moving objects, such as pedestrians or vehicles, and the reprojection error is usually large. The specific process is as follows: First, feature point detection and matching are carried out between the current frame and the previous frame. Then, using the estimated camera pose, 3D points from the local map are projected onto the current frame to calculate the reprojection error
$e_{\textit{reproj}}$
for each matched feature point. By setting a threshold, if
$e_{\textit{reproj}}\gt \epsilon$
, the feature point is marked as a dynamic feature point; otherwise, it is classified as a static feature point.
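The test described above can be summarized in a few lines of Python/NumPy. The threshold value eps is an illustrative placeholder for the threshold ε; the code assumes the map points are expressed in world coordinates and the intrinsic matrix K is known.

```python
import numpy as np

def classify_by_reprojection(X_world, x_obs, R, t, K, eps=2.0):
    """Flag each matched point as dynamic if its reprojection error exceeds eps.

    X_world: (N, 3) 3D map points, x_obs: (N, 2) detected pixel positions,
    R, t:    estimated camera pose, K: 3x3 intrinsic matrix.
    Returns a boolean array where True marks a (provisionally) dynamic point.
    """
    X_cam = (R @ X_world.T).T + t           # transform into the camera frame
    proj = (K @ X_cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]       # perspective division -> pixels
    e_reproj = np.linalg.norm(proj - x_obs, axis=1)
    return e_reproj > eps
```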
3.3. Epipolar geometry constraint
Using the epipolar geometry constraint, we can classify the current motion state of an object. The feature points of dynamic objects do not satisfy the epipolar geometry constraint because they are not accurately located on the corresponding epipolar lines. Therefore, we measure the distance between feature points and their corresponding epipolar lines, and treat distances exceeding a specific threshold as outliers.
The process of epipolar geometry constraints can be divided into three steps. First, the pyramid-based Lucas-Kanade optical flow algorithm [Reference Baker and Matthews40] is used to calculate matching feature points in two adjacent images. Next, the fundamental matrix is employed to compute the epipolar lines for each matching feature point in the current frame. Finally, the distances between the feature points and their corresponding epipolar lines are calculated. Based on the comparison of these distances with a predetermined threshold, we can determine the state of the feature points: if the distance exceeds the threshold, the feature point is considered to be in a moving state; otherwise, it is classified as static.
Figure 3 illustrates the epipolar geometry constraint between the previous frame image I1 and the current frame image I2. The camera observes the same spatial point P from different angles. In a dynamic scene, the point P moves to P'. The optical centers O1 and O2, together with the spatial point P, define the epipolar plane. P1 and P2 are the feature points projected from the spatial point P in the previous and current frames, respectively. The intersection lines L1 and L2 of the epipolar plane with the two image planes are referred to as the epipolar lines. We denote the matched feature points in the previous frame and the current frame as:
\begin{equation}p_{1}=\left[u_{1},v_{1}\right]^{T},\quad p_{2}=\left[u_{2},v_{2}\right]^{T}\end{equation}

Figure 3. The epipolar geometry constraint between the previous frame I1 and the current frame I2. L represents the epipolar line.
Among them, u and v are pixel coordinate values, while the homogeneous coordinates of
$p_{1}$
and
$p_{2}$
can be expressed as:
\begin{equation}P_{1}=\left[u_{1},v_{1},1\right]^{T},\quad P_{2}=\left[u_{2},v_{2},1\right]^{T}\end{equation}
The epipolar line L2 in the current frame can be determined from the fundamental matrix F using Eq. (6):
\begin{equation}L_{2}=FP_{1}=F\left[\begin{array}{c} u_{1}\\ v_{1}\\ 1 \end{array}\right]\end{equation}
where
$P_{1}$
represents the feature point in the previous frame.
$u_{1}$
and
$v_{1}$
represent the horizontal and vertical coordinates of the point. The epipolar constraint is represented as follows:
\begin{equation}P_{2}^{T}FP_{1}=0\end{equation}
where
$P_{2}$
represents the feature point in the current frame. The epipolar line corresponding to the feature point
$P_{1}$
of the previous frame in the current frame is l:
\begin{equation}l=FP_{1}=F\left[\begin{array}{c} u_{1}\\ v_{1}\\ 1 \end{array}\right]=\left[\begin{array}{c} X\\ Y\\ Z \end{array}\right]\end{equation}
where
$X,Y$
, and
$Z$
are the coefficients of the epipolar line l. Then, the distance D from the feature point
$P_{2}$
of the current frame to the epipolar line l is:
\begin{equation}D=\frac{\left| P_{2}^{T}FP_{1}\right| }{\sqrt{X^{2}+Y^{2}}}\end{equation}
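A compact sketch of this point-to-epipolar-line test is given below (Python/NumPy); the distance threshold is an illustrative placeholder rather than the value used in our experiments.

```python
import numpy as np

def epipolar_distance(F, p1, p2):
    """Distance from p2 (current frame) to the epipolar line l = F @ [u1, v1, 1]^T."""
    X, Y, Z = F @ np.array([p1[0], p1[1], 1.0])   # epipolar line coefficients
    return abs(X * p2[0] + Y * p2[1] + Z) / np.sqrt(X**2 + Y**2)

def is_dynamic_by_epipolar(F, p1, p2, thresh=1.0):
    """A point violating the epipolar constraint beyond thresh is treated as dynamic."""
    return epipolar_distance(F, p1, p2) > thresh
```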
3.4. Gray area feature points recognition strategy
As shown in Figure 4, gray area feature points are screened from the image frames that contain object detection boxes. The core of the algorithm is to evaluate the dynamic state of every feature point inside a detection box in order to generate the gray area feature point set M.

Figure 4. Schematic diagram of filtering gray area feature points.
At the beginning of the algorithm, dynamic feature point labels are stored based on the reprojection error and epipolar geometry constraints. Reprojection error is a criterion used to assess the dynamics of a point by comparing the difference between the actually observed point and the predicted point calculated through the camera model. While traversing all feature points, if the reprojection error exceeds the predetermined threshold, the point is labeled as a dynamic feature point and added to set S; otherwise, it is labeled as a static feature point.
Next, a second round of assessment is conducted on the same feature points using the epipolar geometry constraint. This process also traverses all points, and if a point violates the epipolar constraint, that is, its epipolar distance exceeds the threshold, it is labeled as a dynamic feature point and added to set F. Through these two assessments, sets S and F each contain labels indicating the dynamics of the corresponding feature points. After completing the dynamic feature point assessment, the algorithm proceeds to the filtering stage. By comparing the labels in sets S and F, the algorithm determines which points are gray area feature points. Specifically, if a point has inconsistent labels in sets S and F, or is labeled as a static feature point in set S, it is saved to the final gray area feature point set M. Algorithm 1 summarizes the specific steps for gray area feature point filtering.
Algorithm 1 Gray area feature point filtering

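Since the pseudocode of Algorithm 1 is not reproduced here, the selection rule can be sketched in a few lines of Python. The boolean convention (True meaning dynamic) is our own; the logic follows the description above: a point whose two labels disagree, or which the reprojection test considers static, is kept as a gray area point.

```python
def select_gray_area(points, dyn_reproj, dyn_epipolar):
    """Collect the gray area feature point set M from the two per-point labels.

    dyn_reproj[i]   -- True if point i was flagged dynamic by the reprojection test (set S).
    dyn_epipolar[i] -- True if point i was flagged dynamic by the epipolar test (set F).
    """
    M = []
    for p, s_dyn, f_dyn in zip(points, dyn_reproj, dyn_epipolar):
        if (s_dyn != f_dyn) or (not s_dyn):
            M.append(p)
    return M
```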
3.5. Gray area feature point static probability calculation
For each gray area feature point to be evaluated, the static probability is calculated from its motion and geometric relationships, and the result determines whether the feature point is retained.
First, the first probability value is calculated based on the motion estimation distance: the movement distance
$Dis_{a}$
between the back-projected point
$P_{a}$
and the corresponding map point
$x_{a}$
is computed as
\begin{equation}Dis_{a}=\left\| P_{a}-\left[X_{a}^{'},Y_{a}^{'},Z_{a}^{'}\right]^{T}\right\| _{2}\end{equation}
where
$[X_{a}^{\mathrm{'}},Y_{a}^{\mathrm{'}},Z_{a}^{\mathrm{'}}]^{T}$
are the 3D coordinates of the map point
$x_{a}$
.
Based on Eq. (10), the distances between the corresponding points in the reference frame and the key points on the map are calculated, yielding a series of distances, from which the mean
$\mu _{d}$
and standard deviation
$S_{d}$
are computed:
\begin{equation}\mu _{d}=\frac{1}{N_{a}}\sum _{a=1}^{N_{a}}Dis_{a}\end{equation}
\begin{equation}S_{d}=\sqrt{\sum _{a=1}^{N_{a}}\left(Dis_{a}-\mu _{d}\right)^{2}/N_{a}}\end{equation}
where
$N_{a}$
is the total number of points. Using the mean
$\mu _{d}$
and standard deviation
$S_{d}$
obtained from Eqs. (11) and (12), we can calculate the static observation weight
$W_{a}$
based on motion estimation for each feature point in the current frame, as follows:
where
$\beta$
is a constant used to adjust the sensitivity of the weight calculation formula.
Meanwhile, if the feature point is a static feature point, the number of times it is judged as a static feature point will be very large. Therefore, we further compute the observation count
$V_{st}(p_{a})$
for each feature point in the current frame using the reprojection error calculation method, with the initial value set to 0. Specifically, from the first frame to the current frame, utilizing the discriminative method outlined in Eq. (3), if a feature point
$p_{a}$
is observed in a frame and determined to be a static feature point through the reprojection error method, its count value is updated as follows:
If the feature point
$p_{a}$
is observed but is not classified as a static feature point, then the count value,
$V_{st}(p_{a})$
of the feature point
$p_{a}$
is updated as follows:
If the feature point
$p_{a}$
is not observed,
$V_{st}(p_{a})$
is not updated. Then, the mean
$\mu _{v}$
and standard deviation
$S_{v}$
of
$V_{st}(p_{a})$
for the current frame are calculated:
\begin{equation}\mu _{v}=\frac{1}{N_{v}}\sum _{a=1}^{N_{v}}V_{st}\left(p_{a}\right)\end{equation}
\begin{equation}S_{v}=\sqrt{\sum _{a=1}^{N_{v}}\left(V_{st}\left(p_{a}\right)-\mu _{v}\right)^{2}/N_{v}}\end{equation}
where
$N_{v}$
is the number of feature points in the current frame.
By using the mean
$\mu _{v}$
and standard deviation
$S_{v}$
of static observations, we can calculate the static observation weight of each feature point in the current frame, as follows:
where
$\beta _{v}$
is a constant greater than 0.
The static weight of the static point
$p_{a}$
based on the reprojection error is represented as follows:
where
$\alpha _{st}$
is a real number greater than 0.
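Because Eqs. (10)–(18) are not reproduced above, the sketch below illustrates one plausible realization of the reprojection-based static weight in Python/NumPy. The exponential and logistic forms, the per-frame ±1 update of the observation count, and the way α_st blends the two terms are assumptions made for illustration only; they are not the exact formulas of GAF-SLAM.

```python
import numpy as np

def update_count(v_st, judged_static, observed):
    """Assumed per-frame update of the static observation count V_st: +1 when a
    point is observed and judged static, -1 when observed but judged dynamic,
    unchanged when not observed (floored at zero)."""
    delta = np.where(observed, np.where(judged_static, 1, -1), 0)
    return np.maximum(v_st + delta, 0)

def motion_weight(dis, beta=1.0):
    """Assumed form: down-weight points whose motion distance Dis_a deviates
    from the per-frame mean by many standard deviations."""
    mu_d, s_d = dis.mean(), dis.std() + 1e-6
    return np.exp(-beta * np.maximum(dis - mu_d, 0.0) / s_d)

def observation_weight(v_st, beta_v=1.0):
    """Assumed form: reward points that were judged static in many past frames
    (logistic function of the z-score of the observation count)."""
    mu_v, s_v = v_st.mean(), v_st.std() + 1e-6
    return 1.0 / (1.0 + np.exp(-beta_v * (v_st - mu_v) / s_v))

def w_bre(dis, v_st, alpha_st=0.5):
    """Assumed blending of the two terms into the reprojection-based weight W_bre."""
    return alpha_st * motion_weight(dis) + (1.0 - alpha_st) * observation_weight(v_st)
```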
Second, the second probability value is calculated based on the epipolar geometry constraint. According to Figure 3, if the feature point P is static, its projected point in the current frame should lie on the epipolar line L2. Conversely, if the point is moving, it will not be located on the epipolar line. A similar calculation is performed to determine the epipolar distance from the feature points in the current frame to the corresponding epipolar line:
\begin{equation}H=\frac{\left| Au_{2}+Bv_{2}+C\right| }{\sqrt{A^{2}+B^{2}}}\end{equation}
where
$H$
represents the distance from the feature point to the epipolar line L2 in the current frame, and
$A,B,C$
are the parameters of the epipolar line equation,
$u_{2}$
and
$v_{2}$
are the coordinates of the current feature point.
Then, the mean
$\mu _{H}$
and standard deviation
$S_{H}$
of the epipolar distances
$H_{a}$
of all feature points are calculated, and a static weight is then computed:
\begin{equation}\mu _{H}=\frac{1}{N_{a}}\sum _{a=1}^{N_{a}}H_{a}\end{equation}
\begin{equation}S_{H}=\sqrt{\sum _{a=1}^{N_{a}}\left(H_{a}-\mu _{H}\right)^{2}/N_{a}}\end{equation}
where
$N_{a}$
is the total number of points,
$\beta$
is a constant greater than 0.
Similarly, based on the discrimination results of the epipolar geometry constraint, the observation frequency
$V_{st}'(p_{a})$
is updated, and a probability value is then calculated. The formulas are as follows:
\begin{equation}\mu _{v}=\frac{1}{N_{v}}\sum _{a=1}^{N_{v}}V_{st}'\left(p_{a}\right)\end{equation}
\begin{equation}S_{v}=\sqrt{\sum _{a=1}^{N_{v}}\left(V_{st}'\left(p_{a}\right)-\mu _{v}\right)^{2}/N_{v}}\end{equation}
where
$\mu _{v}$
is the mean of the observation frequency for static points,
$S_{v}$
is the standard deviation of the observation frequency for static points, and
$N_{v}$
is the number of samples,
$\beta$
is a constant greater than 0.
The static weight representation
$W_{egc}$
of static points based on the epipolar geometry constraint is as follows:
Among them,
$\beta _{st}$
is a real number greater than 0.
Finally, by combining the results
$W_{bre}$
and
$W_{egc}$
obtained from Eqs. (18) and (28), the final static weight of each feature point is calculated as follows:
where
$\varphi$
and
$\omega$
are real numbers greater than 0.
Finally, the resulting static weight values are used to decide whether each gray area feature point should be eliminated or retained. By relying on the static probability, more static feature points are preserved. The specific steps are given in Algorithm 2:
Algorithm 2 Gray area feature point static probability calculation

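Algorithm 2 itself is likewise only summarized in prose above, so the following sketch shows the overall decision logic: the two weights are fused into a final static weight and low-probability gray area points are discarded. The linear combination with φ and ω, the normalization, and the retention threshold are illustrative assumptions rather than the published formulas.

```python
import numpy as np

def static_probability(w_bre, w_egc, phi=0.5, omega=0.5):
    """Assumed fusion of the reprojection-based and epipolar-based weights into W_st,
    normalized here so that the result stays in [0, 1]."""
    return (phi * np.asarray(w_bre) + omega * np.asarray(w_egc)) / (phi + omega)

def filter_gray_points(gray_points, w_bre, w_egc, keep_thresh=0.5):
    """Keep the gray area points whose fused static probability clears the threshold."""
    w_st = static_probability(w_bre, w_egc)
    kept = [p for p, w in zip(gray_points, w_st) if w >= keep_thresh]
    return kept, w_st
```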
3.6. Fusion of weighted static probabilities and spatio-temporal constraints for dynamic position estimation
In VSLAM systems, pose estimation is one of the core steps for achieving accurate localization and map building. However, moving objects in dynamic environments affect the reliability of the feature points, thus reducing the accuracy and robustness of the pose estimation. To improve the adaptability of pose estimation, this paper proposes a dynamic pose estimation method that incorporates weighted static probabilities and spatio-temporal constraints. The weights are dynamically adjusted by integrating the motion velocity variations and spatial consistency characteristics of the feature points, which in turn improves the robustness of the pose estimation.
Specifically, a dynamic weight model based on the static probability and velocity change is first established: the combined static probability
$W_{st}$
and dynamic velocity information
$\| \Delta v\|$
are weighted to form the dynamic weight
$W_{st}^{'}$
. Second, temporal continuity and spatial smoothness constraints are introduced into the pose estimation. Finally, an improved objective function is constructed to minimize the weighted sum of the reprojection error and the spatio-temporal consistency error, and the optimal pose matrix is solved iteratively using the Gauss-Newton method. The detailed steps are as follows:
Static probabilities are calculated for all feature points in the current frame and used as weights, while the motion information (the velocity change of each feature point) is combined with the static probability weights to dynamically adjust the weight allocation. The adaptive weighting formula incorporating velocity changes is:
where
$\alpha$
and
$\beta$
denote the adaptive parameters that determine the relative weights of static probabilities and dynamic information. To ensure the optimality of the parameter settings, Bayesian optimization is used to automatically adjust
$\alpha$
and
$\beta$
, which enables the dynamic weights to adaptively optimize the weight assignment of feature points in different scenarios,
$\| \Delta v\| _{max}$
is the maximum value of the velocity change.
$\Delta v$
represents the inter-frame velocity change at the feature point, calculated as:
where
$p_{i}^{t}$
and
$p_{i}^{t-1}$
denote the positions of feature points in neighboring frames and
$\Delta t$
denotes the time interval between neighboring frames.
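The adaptive weight W'_st is not written out above, so the sketch below shows one plausible form in which the static probability is blended with a normalized velocity term; the linear blend and the default values standing in for the Bayesian-optimized parameters α and β are assumptions for illustration only.

```python
import numpy as np

def velocity_change(p_t, p_tm1, dt):
    """Per-point inter-frame velocity magnitude from pixel displacement and frame interval."""
    return np.linalg.norm(p_t - p_tm1, axis=1) / dt

def adaptive_weight(w_st, dv_norm, alpha=0.7, beta=0.3):
    """Assumed form of W'_st: the static probability W_st is blended with a term
    that decays as the inter-frame velocity change grows relative to its maximum."""
    dv_max = np.max(dv_norm) + 1e-6
    return alpha * np.asarray(w_st) + beta * (1.0 - np.asarray(dv_norm) / dv_max)
```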
In dynamic environments, motion interference can lead to inconsistencies in feature point matching, which negatively impacts pose estimation accuracy. To address this issue, we introduce a spatio-temporal consistency constraint that minimizes the spatial error between feature points in consecutive frames, thereby enhancing the robustness of pose estimation. By computing the spatial consistency error between feature points in consecutive frames, we incorporate a spatio-temporal smoothing constraint
$\| \Delta p_{i}\|$
into the objective function, optimizing the pose estimation results and solving for the optimal pose matrix
$T_{cw}$
. The formula is as follows:
where K is the internal parameter matrix of the camera,
$P_{i}$
denotes the 3D map point corresponding to
$p_{i}$
of the feature point in the current frame,
$W_{st}^{\mathrm{'}}$
is the static probability weight of the feature point, n is the number of valid static map points in the current frame,
$p_{i}^{t}$
refers to the projection position of the feature point in the current frame,
$T_{cw}^{t-1}$
is the pose transformation matrix of the previous frame,
$\gamma $
is the spatio-temporal smoothing coefficient, which is used for adjusting the influence of the reprojection error and the spatio-temporal consistency error,
$T_{cw}$
is the pose transformation matrix to be solved. The optimal transformation
$T_{cw}$
can be expressed as:
\begin{equation}T_{cw}=\left[\begin{array}{c@{\quad}c} R & t\\ 0 & 1 \end{array}\right]\end{equation}
where R is the rotation matrix of the current frame and t is the translation vector.
The pose matrix
$T_{cw}$
is optimized with the Gauss-Newton method, taking the previous frame pose
$T_{cw}^{t-1}$
as the initial value of the current frame pose and constructing the residual function
$r_{i}$
to represent the residual value at each point:
The Jacobian matrix is constructed by taking the derivatives with respect to the rotation R and the translation t, where the derivative with respect to the rotation matrix R is shown in Eq. (37) and the derivative with respect to the translation vector t is shown in Eq. (38); the Jacobian matrix obtained by combining the rotational and translational derivatives can be expressed as Eq. (40), in which
\begin{equation}\left[P_{cam}\right]_{\times }=\left[\begin{array}{c@{\quad}c@{\quad}c} 0 & -z & y\\ z & 0 & -x\\ -y & x & 0 \end{array}\right]\end{equation}
where, in the Jacobian matrix
$J_{i}$
of the i-th feature point,
$ \partial r_{i,x}/\partial R$
and
$\partial r_{i,y}/\partial R$
are the derivatives of the residuals in the x/y directions with respect to the rotation matrix;
$\partial r_{i,x}/\partial t$
and
$\partial r_{i,y}/\partial t$
are the derivatives of the residuals in the x/y directions with respect to the translation vector.
$K_{1x}$
and
$K_{1y}$
represent the parameters of the corresponding rows in the camera intrinsic matrix.
$[P_{cam}]_{\times }$
represents the skew-symmetric matrix of the vector
$P_{cam}$
.
Combined with the spatio-temporal consistency constraint, we further update the Jacobian so as to construct a complete Jacobian matrix J that contains both the reprojection error and the spatio-temporal consistency error. The derivative of the spatio-temporal consistency error term
$\| \Delta P_{i}\|$
is:
Here,
$\Delta P_{i}$
represents the positional difference of the feature point between two consecutive frames, and
$\| \Delta P_{i}\|$
represents the magnitude of the change.
The corresponding Jacobian matrix
$J_{\Delta P}$
is:
where
$\Delta P$
represents the position change vector of a feature point between two consecutive frames,
$\| \Delta P\|$
represents the magnitude of the vector
$\Delta P$
, and
$\Delta P_{x}$
,
$\Delta P_{y}$
,
$\Delta P_{z}$
are the components of
$\Delta P$
along the x, y, z directions, respectively.
The complete Jacobian matrix J combining the reprojection error and the spatio-temporal consistency error is denoted as:
\begin{equation}J=\left[\begin{array}{c@{\quad}c} \partial r_{i,x}/\partial R & \partial r_{i,x}/\partial t\\ \partial r_{i,y}/\partial R & \partial r_{i,y}/\partial t\\ \partial \left\| \Delta P_{i}\right\| /\partial R & \partial \left\| \Delta P_{i}\right\| /\partial t \end{array}\right]\end{equation}
where
$r_{i,x}$
and
$r_{i,y}$
represent the reprojection errors of the i-th feature point in the x and y directions of the image plane, respectively. The
$\partial r_{i,x}/\partial R$
and
$\partial r_{i,x}/\partial t$
,
$\partial r_{i,y}/\partial R$
, and
$\partial r_{i,y}/\partial t$
represent the derivatives of the residuals in the x and y directions with respect to the camera pose rotation R and translation t.
$\| \Delta P_{i}\|$
is the spatio-temporal position difference of the i-th feature point between consecutive frames.
$\partial \| \Delta P_{i}\| /\partial R$
and
$\partial \| \Delta P_{i}\| /\partial t$
represent the derivatives of the corresponding error with respect to rotation and translation, used to constrain the feature point’s smoothness and matching stability.
Then, iterations are performed with the Gauss-Newton method to solve for the parameter increment
$\delta$
:
\begin{equation}J^{T}J\,\delta =-J^{T}r\end{equation}
where
$J$
is the Jacobian matrix,
$ J^{T}$
is the transpose of the Jacobian matrix
$J$
, and
$r$
is the residual vector, representing the reprojection error and spatio-temporal consistency error under the current estimated pose.
The pose matrix is then updated:
\begin{equation}T_{cw}\leftarrow \exp \left(\delta \right)\cdot T_{cw}\end{equation}
where
$T_{cw}$
is the pose transformation matrix of the current frame, and
$exp(\delta )$
is the exponential map of the increment
$\delta$
. By iterating up to the maximum number of iterations, the optimal pose between the current frame and the map points is finally obtained, thus realizing accurate inter-frame feature matching and pose estimation.
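For completeness, the sketch below shows one way the weighted optimization described in this subsection can be realized in Python/NumPy. It uses a left-multiplicative SE(3) update and finite-difference Jacobians instead of the analytic expressions in Eqs. (37)–(43), and it replaces the spatio-temporal consistency term with a simple smoothness residual toward the previous pose. It is an illustration of the idea rather than the GAF-SLAM implementation.

```python
import numpy as np

def skew(w):
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])

def se3_exp(delta):
    """Exponential map of a 6-vector (rotation, translation) to a 4x4 transform."""
    w, v = delta[:3], delta[3:]
    th, W = np.linalg.norm(w), skew(delta[:3])
    if th < 1e-8:
        R, V = np.eye(3) + W, np.eye(3) + 0.5 * W          # small-angle approximation
    else:
        R = np.eye(3) + np.sin(th) / th * W + (1 - np.cos(th)) / th**2 * W @ W
        V = (np.eye(3) + (1 - np.cos(th)) / th**2 * W
             + (th - np.sin(th)) / th**3 * W @ W)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T

def project(K, T, P):
    """Project world points P (N,3) with pose T and intrinsics K to pixels (N,2)."""
    Pc = (T[:3, :3] @ P.T).T + T[:3, 3]
    uv = (K @ Pc.T).T
    return uv[:, :2] / uv[:, 2:3]

def residuals(T, P, p_obs, w, K, T_prev, gamma):
    """Weighted reprojection residuals plus a simple smoothness residual toward the
    previous pose (a stand-in for the spatio-temporal consistency term)."""
    r_rep = (np.sqrt(w)[:, None] * (p_obs - project(K, T, P))).ravel()
    r_smooth = gamma * (project(K, T, P) - project(K, T_prev, P)).ravel()
    return np.concatenate([r_rep, r_smooth])

def gauss_newton_pose(P, p_obs, w, K, T_prev, gamma=0.1, iters=10):
    """Refine the camera pose, starting from the previous frame's pose T_prev."""
    T = T_prev.copy()
    for _ in range(iters):
        r0 = residuals(T, P, p_obs, w, K, T_prev, gamma)
        J = np.zeros((r0.size, 6))
        for k in range(6):                                  # finite-difference Jacobian
            d = np.zeros(6)
            d[k] = 1e-6
            J[:, k] = (residuals(se3_exp(d) @ T, P, p_obs, w, K, T_prev, gamma) - r0) / 1e-6
        delta = np.linalg.solve(J.T @ J + 1e-9 * np.eye(6), -J.T @ r0)
        T = se3_exp(delta) @ T                              # left-multiplicative update
        if np.linalg.norm(delta) < 1e-8:
            break
    return T
```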
4. Experimental results
To evaluate the accuracy of our system, we conducted experiments using the publicly available TUM RGB-D dataset [Reference Sturm, Engelhard, Endres, Burgard and Cremers41]. The TUM dataset was created by the Technical University of Munich and captures data using a Kinect sensor at a rate of 30 Hz, with an image resolution of 640 × 480. Simultaneously, a high-precision motion capture system, VICON, equipped with an inertial measurement system, was used to obtain camera position and orientation data, which can be approximated as the true position data of the RGB-D camera.
This paper focuses on experiments using four high-dynamic scene sequences and one low-dynamic sequence from the TUM RGB-D dataset. In the high-dynamic sequence, two people walk in front of or around a table, while in the low-dynamic sequence, two people sit in chairs, engaging in conversation and making slight gestures. For each type of dataset series, the camera motion was also categorized into four states: static (where the camera remains stationary), xyz (where the camera moves along the spatial X–Y–Z axes), rpy (where the camera rotates with roll, pitch, and yaw angles), and hemispherical (where the camera moves along a trajectory of a hemisphere with a 1 m diameter). Figure 5 shows the effect of the implementation of the algorithm and comparison.

Figure 5. Visual comparison display of algorithm effects. (a) Real scenes. (b) ORB-SLAM2. (c) ORB-SLAM2 under YOLO. (d) GAF-SLAM.
Experiments were conducted on a computer system equipped with an Intel i5 CPU, Nvidia GeForce RTX 4060 Ti, 32 GB of RAM, and running the Ubuntu 18.04 operating system.
4.1. Performance evaluation of TUM RGB-D dataset
Absolute trajectory error (ATE) and relative pose error (RPE) are commonly used metrics for evaluating the localization accuracy of VSLAM systems. ATE measures the overall discrepancy between the estimated trajectory and the ground-truth trajectory, while RPE focuses on assessing rotational and translational drift between consecutive frames. To assess the performance improvement of our proposed GAF-SLAM over ORB-SLAM2 and ORB-SLAM3, we conducted comparative experiments on the TUM RGB-D dataset. The results report the root mean square error (RMSE), mean error (Mean), and standard deviation (Std) for both ATE and RPE.
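For reference, the ATE RMSE used below can be reproduced from a timestamp-associated estimated trajectory and its ground truth with a short NumPy routine: the trajectories are rigidly aligned with the standard SVD (Kabsch/Umeyama) step and the RMSE of the translational differences is taken. This snippet is our own illustration, not the evaluation scripts used for the reported tables.

```python
import numpy as np

def align_umeyama(est, gt):
    """Least-squares rigid alignment (rotation + translation) of est onto gt; both (N, 3)."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    H = (est - mu_e).T @ (gt - mu_g)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    return (R @ est.T).T + t

def ate_rmse(est, gt):
    """Root mean square of the translational errors after alignment."""
    err = align_umeyama(est, gt) - gt
    return np.sqrt((np.linalg.norm(err, axis=1) ** 2).mean())
```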
As shown in Tables I, II, III, and Figure 6, GAF-SLAM achieves significantly better accuracy and robustness than ORB-SLAM2 and ORB-SLAM3, particularly under scenarios with substantial dynamic interference. Even in low-dynamic settings – such as when a person remains nearly stationary in a chair – our method demonstrates noticeable improvements. As illustrated in Figure 7, where the estimated trajectory is shown in black, the ground-truth trajectory in blue, and the deviation in red, GAF-SLAM consistently aligns well with the real trajectory, confirming its strong adaptability to highly dynamic environments.
Table I. Results of the metric absolute trajectory error (ATE)[M].

Table II. Results of the metric relative translation error (RTE)[M/S].

Table III. Results of the metric relative rotation error (RRE)[DEG/S].


Figure 6. ATE results of ORB-SLAM2, ORB-SLAM3, and our proposed system across five dynamic scene sequences. (a–c) Fr3_s_static; (d–f) Fr3_w_half; (g–i) Fr3_w_xyz; (j–l) Fr3_w_static; (m–o) Fr3_w_rpy.

Figure 7. Visualization of the differences between the estimated and ground-truth trajectories for ORB-SLAM2, ORB-SLAM3, and our proposed system across five dynamic scene sequences. (a–c) Fr3_s_static; (d–f) Fr3_w_half; (g–i) Fr3_w_xyz; (j–l) Fr3_w_static; (m–o) Fr3_w_rpy.
As shown in Table IV, to further evaluate the effectiveness of the proposed algorithm, GAF-SLAM is compared with several state-of-the-art dynamic visual SLAM systems, including Dyna-SLAM, YOLO-SLAM, SG-SLAM, and MPOC-SLAM. Since some of these methods are not open-sourced or reproducible under a unified hardware platform, the reported results are obtained from the corresponding original publications. Although differences in hardware settings may introduce certain deviations in performance metrics, this comparison is intended to provide a general reference for accuracy trends across different methods. As shown in the results, GAF-SLAM consistently demonstrates superior performance among all evaluated approaches, achieving the lowest trajectory error in highly dynamic sequences such as "rpy" and "static," and maintaining competitive accuracy in the other scenarios. Specifically, the bolded data in the table represent the best performance achieved in each sequence.
Table IV. Comparison of the absolute trajectory error (ATE)[M]

4.2. Performance evaluation of Bonn RGB-D dataset
The Bonn RGB-D Dynamic dataset [Reference Palazzolo, Behley, Lottes, Giguere and Stachniss42], released by the University of Bonn in 2019, aims to evaluate the performance of RGB-D SLAM and contains 24 dynamic sequences. In order to test the generalization ability of the dynamic feature rejection algorithm, we conducted further experiments on this dataset and selected seven representative sequences for analysis, including three sequences of the "crowd" series, two sequences of the "person" series, and two sequences of the "synchronous" series. The "crowd" sequences show three people moving freely in a room; the "person" sequences mainly show the camera following a walking person; and the "synchronous" sequences show several people repeatedly jumping in the same direction.
To systematically evaluate the localization accuracy and robustness of GAF-SLAM, we conducted comparative experiments with two classical SLAM frameworks: ORB-SLAM2 and ORB-SLAM3. For SG-SLAM, the results were directly cited from the original publication. As summarized in Table V, GAF-SLAM achieves the best performance across all seven test sequences, significantly outperforming the compared methods. These results strongly demonstrate the robustness, localization accuracy, and generalization capability of GAF-SLAM in complex dynamic environments.
Table V. Comparison of the absolute trajectory error (ATE) [M].

4.3. Ablation experiment
To validate the effectiveness of each module, we performed ablation experiments. As shown in Table VI, the ATE results for the different module configurations show that each module plays an important role in improving system performance. In the experiments, we set up three configurations to evaluate the modules step by step: First, Ours(Y) uses the YOLO-Point network instead of the traditional ORB feature extraction method, performing deep learning feature extraction only at the front-end. Second, based on YOLO-Point, the static probability calculation of gray area feature points is further introduced and the static weights are incorporated into the pose estimation optimization, but without combining the optimized static weights with the spatio-temporal constraints. Finally, Ours denotes our complete method, which combines YOLO-Point feature extraction with static probability computation, screens static feature points in the detection boxes by reprojection error and epipolar geometry constraints while excluding dynamic feature points, and ultimately combines the static weights with spatio-temporal constraints for optimized pose estimation.
Table VI. Results of metric ATE.

The experimental results show that the addition of each module significantly reduces the ATE value, indicating that feature screening and static probability computation effectively improve the localization accuracy of the system in dynamic environments. In particular, the complete method (Ours) performs best, as it maximizes the rejection of dynamic feature interference while preserving static features, thus achieving accurate pose estimation in highly dynamic scenes. Overall, the results of the ablation experiments validate the effectiveness of the individual modules and confirm the advantages of our proposed method for processing dynamic features in dynamic SLAM systems.
4.4. Time analysis
Real-time performance is also an important evaluation metric for SLAM systems; therefore, we tested the time consumption of the system and compared it with five other algorithms, as shown in Table VII. Dyna-SLAM uses Mask-R-CNN for pixel-level semantic segmentation, so its average processing time per frame is very high, whereas YOLO-SLAM, SG-SLAM, and MPOC-SLAM meet the real-time requirements while improving accuracy. GAF-SLAM consumes only 52.19 ms per frame, so it can meet the real-time requirements of mobile robots while improving localization accuracy.
Table VII. Time evaluation.

In addition, the run-time performance of individual modules within the GAF-SLAM framework was evaluated, as summarized in Table VIII. Specifically, Module A represents the YOLO-Point module, which simultaneously generates keypoints and object detection bounding boxes. Module B denotes the gray area feature recognition module, while Module C is responsible for static probability estimation. Module D performs dynamic point removal and pose estimation, and Module E manages keyframe insertion and local mapping. The results confirm that GAF-SLAM satisfies real-time processing requirements, even in dynamic environments characterized by dense motion interference.
Table VIII. The average run time of different modules.

4.5. Real environment experiment
To further evaluate the practicality of our system, we conducted experiments using a monocular camera in a real-world environment. During the experiments, we performed camera panning and rotation while asking the person in front of the camera to perform actions such as standing up, walking around the chair, and leaving the camera’s field of view.
Figure 8 demonstrates the effectiveness of our algorithm in removing dynamic feature points while retaining more static feature points in real-world scenarios. It is evident that we successfully preserved the static feature points within the feature frame. The experimental results indicate that the presence of dynamic objects leads to significant differences in the estimated trajectory lengths. As shown in Figure 9, we compared the trajectory estimation results of the GAF-SLAM system with those of ORB-SLAM2. When the surrounding environment is static, both systems perform well. However, when dynamic objects are present and moving in the environment (highlighted areas in Figure 9), ORB-SLAM2 experiences considerable jitter, while our system maintains consistency with the ground-truth trajectory and remains unaffected by the dynamic objects.

Figure 8. The effect of preserving static feature points in real scenes.

Figure 9. The contrast of trajectories obtained from ORB-SLAM2 and our system in real environment.
5. Conclusion and future work
In this paper, we introduce GAF-SLAM, an optimization method for visual SLAM systems in dynamic environments. Based on the ORB-SLAM2 framework, we realize the effective fusion of dynamic object detection, feature point screening, and weighted pose estimation by integrating the YOLO-Point deep learning network with the static probability computing framework. Specifically, we propose a method for selecting gray area feature points based on reprojection errors and epipolar geometry constraints, which enables the system to retain potential static points within the dynamic detection region. In addition, we develop a new static probability calculation method for gray area feature points that further improves the determination accuracy of static feature points through a static probability scoring mechanism, in order to enhance the retention of static information. Finally, the proposed pose estimation algorithm with weighted static probabilities and spatio-temporal constraints effectively reduces the interference of dynamic points on pose estimation, thus significantly improving the pose accuracy and robustness of the system. Experimental results on the TUM RGB-D dataset and the Bonn RGB-D Dynamic dataset show that our method is significantly more accurate in highly dynamic scenes.
Despite these encouraging results, one of the current limitations of our method lies in its reliance on a static probability model that depends heavily on multi-frame visual consistency. In environments with drastic illumination changes – such as sudden light switching, dynamic lighting conditions, or natural light interference – the same spatial location may exhibit significant appearance variations across frames. This can lead to the misclassification of dynamic regions as static, ultimately introducing noise into the map and degrading SLAM performance. To address this issue, future work will explore illumination-disentangled modeling techniques inspired by neural radiance fields (NeRF). By decoupling structural and appearance information, especially under varying lighting conditions, we aim to achieve more robust static-dynamic region separation and improve the reliability of feature selection in complex and unconstrained environments. In addition, we plan to extend the system to accommodate nonrigid motion patterns and improve the generalization ability of GAF-SLAM in more realistic and diverse scenarios.
Author contributions
Huilin Liu and Lunqi Yu conceived the method, built the framework, and conducted the theoretical study. Lunqi Yu designed and performed the experiments and wrote the article. Huilin Liu and Shenghui Zhao analyzed experimental data and assisted in revising the paper.
Financial support
This work was supported by the Scientific Research Foundation for High-level Talents of Anhui University of Science and Technology [grant number 2024yjrc52], the open Foundation of Anhui Engineering Research Center of Intelligent Perception and Elderly Care [grant number 2022OPB01], the National Natural Science Foundation of China [grant number 52374155] and the Anhui Provincial Natural Science Foundation [grant number 2308085MF218].
Competing interests
The authors declare no competing interests exist.
Ethical approval
None.