Introduction
In agroecosystems, weeds are considered a major problem: they compete with crops for nutrients, water, and sunlight and provide a habitat for pests that can cause plant diseases, reducing crop yield and quality. Site-specific weed management (SSWM) is seen as a viable solution that controls weeds by precisely limiting weed growth at specific locations (Rai et al. 2023). Precise weed control methods such as spot spraying of herbicides can reduce the quantity of herbicides used in the field and prevent pesticide residues (Gerhards et al. 2022).
Accurate, real-time detection of weeds without damaging crops is essential for realizing SSWM. Unmanned aerial vehicles (UAVs) are an ideal platform for weed detection because they can acquire weed imagery without crop damage, efficiently provide information on weed location, and adapt to the spatial and temporal heterogeneity of weed distribution (Valente et al. 2022). Crop and weed morphology, which can vary substantially with genetics and environment, poses great challenges for weed detection algorithms (Hu et al. 2023). Moreover, weeds occupy far fewer pixels in aerial images than in proximal remote sensing, making their detection more difficult. It is therefore essential to develop an accurate real-time weed detection model that can capture the characteristics of small targets in UAV images.
Initial weed detection algorithms were based on traditional machine learning techniques, which required manual feature extraction based on the morphological and textural characteristics of weeds and were therefore influenced by the prior knowledge of researchers (Reedha et al. 2022). An object-based image analysis algorithm produced a three-class weed density map by processing multispectral UAV data from maize (Zea mays L.) fields, effectively quantifying the spatial distribution of weed coverage (Peña et al. 2013). The random forest and k-nearest neighbors algorithms demonstrated effective detection performance when applied to calibrated and stitched UAV-derived orthophotos of weeds in chili (Capsicum annuum L.) fields (Islam et al. 2021). A comparative assessment of four approaches for weed classification in oat (Avena sativa L.) fields found that the automatic object-based classification method performed best, with 89% accuracy (Gašparović et al. 2020). These results indicate that weeds can be identified using traditional machine learning methods, but the detection pipelines involve cumbersome steps, and most of them estimate area-based weed density with low detection accuracy.
With the development of computer vision, deep learning methods have become widely used in agriculture (Li et al. 2023; Lin et al. 2023; Miho et al. 2024). In a soybean [Glycine max (L.) Merr.] field weed detection task, the object-based Faster regions with convolutional neural networks (R-CNN) achieved 65% accuracy, 68% recall, and a 66% F1 score (the harmonic mean of precision and recall), outperforming the patch-based convolutional neural network model (Veeranampalayam Sivakumar et al. 2020). A benchmark study of seven you only look once (YOLO) versions for cotton (Gossypium hirsutum L.) field weed detection indicated that YOLOv4 exhibited the best detection capability with the highest mean average precision at a 0.5 intersection over union threshold (mAP0.5), whereas the YOLOv3-tiny model had low detection accuracy (Dang et al. 2023). An enhanced YOLOv7 developed for weed detection in chicory (Cichorium intybus L.) fields achieved 56.6% mAP0.5, 62.1% recall, and 61.3% precision, improving on baseline models (Gallo et al. 2023). Integrating the convolutional block attention module (CBAM) mechanism into YOLOv5 improved its capacity to detect weeds on a multi-granularity buffalobur (Solanum rostratum Dunal) field weed dataset (Wang et al. 2022). In rice (Oryza sativa L.) paddy weed detection research using mobile platforms, RetinaNet combined with the SmoothL1 loss improved recognition accuracy, achieving 94.1% mAP0.5 while maintaining inference speed (Peng et al. 2022). Although existing studies demonstrate that deep learning methods offer superior recognition accuracy and robustness to complex backgrounds compared with conventional machine learning methods, current models inadequately address the challenge of detecting small-target weeds in UAV-captured imagery.
Current deep learning object detection models can be categorized into two-stage detectors, represented by the R-CNN series (Girshick et al. 2014; He et al. 2017; Ren et al. 2017), and one-stage detectors, represented by the YOLO series (Bochkovskiy et al. 2020; Jocher 2020; Redmon et al. 2016; Redmon and Farhadi 2018; Ultralytics 2023) and the detection transformer (DETR) series (Carion et al. 2020; Zhang et al. 2022; Zhu et al. 2021), depending on whether region proposals must be generated. Compared with two-stage detectors, one-stage detectors are more computationally efficient, have faster inference, and are widely used for real-time detection. Nevertheless, the YOLO series requires the non-maximum suppression (NMS) hyperparameters to be selected empirically, which strongly affects detection accuracy and speed. DETR employs the Transformer (Vaswani et al. 2017) encoder–decoder architecture, which uses bipartite matching and a set-based global loss to predict targets directly, avoiding the hand-designed NMS and anchor-generation steps. RT-DETR (real-time detection transformer) achieves real-time end-to-end detection through a redesigned model architecture and outperforms the YOLO series in both accuracy and inference speed on the COCO 2017 dataset (Lv et al. 2023).
Currently, the main challenges faced by SSWM are the lack of weed detection datasets acquired using UAVs and the insufficient ability of models to detect small-target weeds (Khan et al. 2021). Motivated by this, we developed a weed detection model based on DETR, with end-to-end detection properties, to address the large number of small-target weeds in UAV-captured weed imagery.
Materials and Methods
Data Acquisition
The study site is located in Anlong County (25.04°N, 105.25°E), Guizhou Province, China, as shown in Figure 1. The GZWeed dataset was collected on November 21, 2023, by a DJI Phantom 4 RTK (DJI, Shenzhen, China) UAV carrying a DJI FC6310R camera. Images were captured at a nadir (vertical) angle. The undulating mountainous terrain caused the altitude of the UAV above ground level to vary from 2.42 to 3.79 m, with a mean altitude of 3.09 m corresponding to a ground coverage area of 13.92 m². In addition, manual planting irregularities during cultivation resulted in uneven Chinese cabbage [Brassica rapa subsp. chinensis (L.) Hanelt] spacing, with average row and plant spacings of 45 cm and 35 cm, respectively. We performed quality screening after image acquisition, resulting in 108 images of weeds in Chinese cabbage fields.

Figure 1. Location of the study site: Anlong County, Guizhou Province, China.
Image Preprocessing
The dataset was labeled with Roboflow (Roboflow 2025) to annotate weed and Chinese cabbage locations and to generate the corresponding label files for training, as shown in Figure 2A. The original images have a resolution of 5,472 × 3,648 pixels. To avoid the loss of detail caused by downscaling the full-resolution images at the input of the detection model, each original image was cropped into a 4 × 4 grid, yielding 1,728 images, as shown in Figure 2B. The dataset was divided into a training set (1,382 images), a validation set (173 images), and a test set (173 images) according to a ratio of 8:1:1; the number of instances is shown in Table 1. Weed species were not distinguished because of the small size of the weed targets in the images.
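As a minimal sketch of this tiling step (not the authors' preprocessing script; file paths and naming are hypothetical), each 5,472 × 3,648 pixel original can be split into a 4 × 4 grid of 1,368 × 912 pixel tiles:

```python
from pathlib import Path
from PIL import Image

def tile_image(src: Path, out_dir: Path, rows: int = 4, cols: int = 4) -> None:
    """Split one UAV image into a rows x cols grid of non-overlapping tiles."""
    img = Image.open(src)
    w, h = img.size                        # 5472 x 3648 for the DJI FC6310R
    tile_w, tile_h = w // cols, h // rows  # 1368 x 912 per tile
    out_dir.mkdir(parents=True, exist_ok=True)
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            img.crop(box).save(out_dir / f"{src.stem}_r{r}c{c}.jpg")

# Hypothetical usage: 108 originals -> 108 x 16 = 1,728 tiles
for path in Path("raw_images").glob("*.JPG"):
    tile_image(path, Path("tiles"))
```

Note that the bounding-box annotations must be offset and clipped to each tile accordingly.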

Figure 2. Flowchart of dataset preprocessing: (A) Dataset label with Roboflow, (B) image crop, and (C) augmented image.
Table 1. The number of images of object classes in the GZWeed dataset.

To enhance the robustness of the model in the field environment, data augmentation methods, including proportional scaling, panning, horizontal mirroring, contrast enhancement, saturation enhancement, and brightness adjustment, were used to augment the images online. Examples of the augmented images are shown in Figure 2C.
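A minimal sketch of such an online augmentation pipeline is given below, here using the Albumentations library; the specific magnitudes and probabilities are illustrative assumptions, not the values used in this study:

```python
import albumentations as A

# Geometric transforms (scaling, panning, mirroring) update the bounding boxes
# automatically; photometric transforms adjust brightness, contrast, and saturation.
augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=0, p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
        A.HueSaturationValue(hue_shift_limit=0, sat_shift_limit=30, val_shift_limit=0, p=0.5),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
)

# Applied on the fly inside the data loader, e.g.:
# out = augment(image=image, bboxes=boxes, class_labels=labels)
```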
The photographed weeds vary in illumination intensity and viewing angle. In addition, the complex background of large quantities of dry rice straw and wet soil presents a challenge for weed detection. As can be seen from Figure 3, there are a large number of small-target weeds, and some weeds are obscured by the crops, both of which make detection more challenging.

Figure 3. Representative samples from the weed dataset.
Experimental Configuration
The following experiments were performed on the GZWeed dataset. The experimental parameters used for model training are shown in Table 2.
Table 2. Experimental configuration.

To achieve a fair comparison of model performance, all models were trained from scratch for 250 epochs with a batch size of 16. The bounding box regression uses the GIoU (generalized intersection over union) loss (Rezatofighi et al. 2019), which is formulated as follows:

$$L_{\mathrm{GIoU}} = 1 - \mathrm{IoU}(x, \tilde x) + \frac{|u| - |x \cup \tilde x|}{|u|}, \qquad \mathrm{IoU}(x, \tilde x) = \frac{|x \cap \tilde x|}{|x \cup \tilde x|}$$

where $x$ represents the ground truth box, $\tilde x$ represents the prediction box, and $u$ is the smallest bounding box that contains both $x$ and $\tilde x$.
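For reference, a minimal PyTorch sketch of this loss (assuming axis-aligned boxes in (x1, y1, x2, y2) format) is:

```python
import torch

def giou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean GIoU loss for two (N, 4) tensors of boxes in (x1, y1, x2, y2) format."""
    # Intersection of prediction and ground truth
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-7)

    # Smallest enclosing box u
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)

    giou = iou - (enclose - union) / enclose.clamp(min=1e-7)
    return (1.0 - giou).mean()
```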
Considering the graphics processing unit memory, the input images were scaled to 640 × 640 during training, and each model was tested using the model weights with the highest mAP0.5 on the validation set. All models were trained using the AdamW optimizer (Loshchilov and Hutter 2019) with a 0.0001 base learning rate, 0.0001 weight decay, and 2,000 warm-up iterations.
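A minimal sketch of this optimizer configuration, assuming a simple linear warm-up over iterations (the exact schedule is not specified here), is:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Conv2d(3, 16, 3)  # placeholder module standing in for WeedDETR
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

warmup_iters = 2000
# Linearly ramp the learning rate from 0 to the base value over the warm-up period.
scheduler = LambdaLR(optimizer, lr_lambda=lambda it: min(1.0, (it + 1) / warmup_iters))

for it in range(10):      # stand-in for the real training loop
    optimizer.step()      # would follow loss.backward() in practice
    scheduler.step()
```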
Performance Metrics
To validate the detection performance of the proposed model, mAP0.5 and mAP0.5:0.95 are used as performance evaluation metrics. Precision is the ratio of the number of correctly detected positive samples to the total number of samples detected as positive by the model, as shown in the formula:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall is the ratio of the number of positive samples correctly detected by the model to the actual number of positive samples. It is calculated by the following formula:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively. The average precision (AP) is equal to the area under the precision-recall (PR) curve and is calculated as shown:

$$AP = \int_{0}^{1} P(R)\,dR$$

Mean average precision (mAP) is the mean of the AP values over all object categories, with the formula shown below:

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where $N$ is the number of categories. Intersection over union (IoU) denotes the ratio of the intersection to the union of the prediction box and the ground truth box. The mAP0.5 denotes the mAP when the IoU threshold of the detection model is set to 0.5, and mAP0.5:0.95 denotes the mAP when the IoU threshold ranges from 0.5 to 0.95 (taking values at intervals of 0.05). AP0.5Cabbage and AP0.5Weed represent the AP0.5 for the Chinese cabbage and weed categories, respectively.
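As a generic illustration (not the authors' evaluation code), AP can be computed as the area under the precision-recall points using the standard all-point interpolation, and mAP0.5:0.95 then averages the per-class AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05:

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the PR curve with a monotonically decreasing precision envelope."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall increases
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP0.5:0.95 averages AP over these IoU thresholds and then over categories.
iou_thresholds = np.arange(0.50, 0.96, 0.05)
```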
The number of model parameters, floating-point operations (FLOPs), and frames per second (FPS) are used to compare the computational complexity of the models. Additionally, Grad-CAM (Selvaraju et al. 2020) is used to generate model detection heat maps.
WeedDETR
We designed WeedDETR based on RT-DETR and tailored it to small-target weeds; the structure of the model is shown in Figure 4. WeedDETR contains a fine-grained feature extraction backbone (RepCBNet), the feature complement fusion module (FCFM) encoder for efficient fusion of multilevel features, and a transformer decoder module. Designed on the basis of re-parameterization, RepCBNet provides multilevel weed features through multiple branches. FCFM achieves intra-scale interaction and cross-scale fusion of features through the complementary feature integration (CFI) module. In addition, varifocal loss is used to make the model focus on difficult-to-detect small-target weed samples, improving weed detection performance. These three components address the small-target weed problem from the perspectives of feature extraction, feature fusion, and loss computation, respectively. The transformer decoder module is from DINO (Zhang et al. 2022), which introduces a denoising training method to accelerate the convergence of DETR. WeedDETR achieves efficient real-time end-to-end weed detection through the design of a holistic model architecture. Each component is described in detail in the following sections.

Figure 4. The structure of WeedDETR. P1–P5 represent different levels of feature maps. CFI, complementary feature integration; IoU, intersection over union; TEncoder, transform encoder.
RepCBNet
The structure of RepCBNet is shown in Figure 5A; feature downsampling is performed by setting the strides of the ConvNL and RepCBlock layers to 2 in the last layer. It is generally accepted that deeper networks extract stronger image features and therefore detect objects better. However, when the network is too deep, the features of small-target weeds tend to be lost. The PadConv block adopts a two-branch structure; the branch operated by padding provides diverse features, enriching the information flow of the feature extraction network (Figure 5C). The RepCBlock structure is shown in Figure 5D. One branch uses stacked PadConv blocks to deepen the network, and the other branch uses only one ConvNL layer for better gradient propagation and to avoid weed feature loss.

Figure 5. The structure of RepCBNet. (A) The structure of RepCBNet. P1–P5 represent different levels of feature maps. (B) The structure of ConvNL. k = 1/3 represents the size of the convolution kernel. (C) The structure of PadConv. (D) The structure of RepCBlock. BN, batch normalization.
A multi-branch structure is structurally stable and easy to train, but its inference is slow and its memory consumption is significant; a single-branch structure infers quickly and saves memory, but its feature extraction capability is comparatively limited. By decoupling the model structure between training and inference, Ding et al. (2021) obtained both the high performance of the multi-branch structure and the speed advantage of the single-branch structure. As shown in Figure 6A, we re-parameterized the PadConv block according to this concept. The formulation of the re-parameterized PadConv block is detailed in the Supplementary Material. Based on this operation, we transform the structure of the PadConv block into a succinct single-branch structure during inference, which saves computational resources and accelerates inference.
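The core of such re-parameterization is folding each branch's batch normalization into its convolution and then merging the resulting parallel convolutions. A minimal sketch of the BN-folding step (a generic RepVGG-style operation, not the paper's exact PadConv transformation) is:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution for inference."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    w = conv.weight
    b = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # per-channel BN scale
    with torch.no_grad():
        fused.weight.copy_(w * scale.reshape(-1, 1, 1, 1))
        fused.bias.copy_((b - bn.running_mean) * scale + bn.bias)
    return fused

# After fusing, parallel branches with the same kernel geometry can be merged by
# summing their fused weights and biases, leaving a single conv at inference time.
conv, bn = nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32)
bn.eval()
x = torch.randn(1, 16, 64, 64)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True
```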

Figure 6. Re-parameterization of the PadConv block. (A) Perspective of structure. (B) Perspective of parameter. BN, batch normalization.
Feature Complement Fusion Module (FCFM)
Current mainstream feature fusion structures such as FPN (Lin et al. 2017a), PAN (Liu et al. 2018), and others often use only the last three scales of features (P3–P5) for fusion. Relying on these deep features alone tends to cause shallow features to be lost, which is not conducive to the detection of small-target weeds. We propose the FCFM, which utilizes the transform encoder (TEncoder) module for intra-scale feature interaction and the CFI module for cross-scale fusion of features; its structure is shown in Figure 7A.

Figure 7. The structure of the feature complement fusion module (FCFM) and its components: (A) the structure of FCFM, with P1–P5 representing different levels of features; (B) the structure of the Fusion module; and (C) the structure of the transform encoder (TEncoder) module. CFI, complementary feature integration; TEncoder, transform encoder; BN, batch normalization.
In the FCFM, the Fusion module is used for efficient information fusion, and its structure is shown in Figure 7B. One branch of the Fusion module increases network depth and efficiently represents features through three RepCBlocks, while the other branch effectively avoids gradient explosion. Similar to the PadConv block, RepCBlock is re-parameterized to speed up inference.
The TEncoder module implements the self-attention operation by converting its input into a sequence, capturing long-range dependencies between objects. To balance accuracy and computational effort, only the last layer (P5), which contains rich semantics, is processed. The goal of self-attention is to capture the interactions among all entities by encoding each entity based on global contextual information, as described in the Supplementary Material. TEncoder enables intra-scale feature interactions to obtain connections between targets in the image for subsequent weed detection, and its structure is shown in Figure 7C.
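A minimal sketch of this idea (assuming a channel width of 256 and 8 attention heads; positional encoding omitted for brevity) flattens the P5 map into a token sequence, applies one transformer encoder layer, and reshapes it back:

```python
import torch
import torch.nn as nn

class P5SelfAttention(nn.Module):
    """Apply one transformer encoder layer to the deepest feature map (P5)."""

    def __init__(self, channels: int = 256, heads: int = 8, ff_dim: int = 1024):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=ff_dim,
            dropout=0.0, batch_first=True)

    def forward(self, p5: torch.Tensor) -> torch.Tensor:
        b, c, h, w = p5.shape
        tokens = p5.flatten(2).permute(0, 2, 1)   # (B, H*W, C) token sequence
        tokens = self.encoder(tokens)             # intra-scale self-attention
        return tokens.permute(0, 2, 1).reshape(b, c, h, w)

# A 640 x 640 input downsampled 32x yields a 20 x 20 P5 map.
out = P5SelfAttention()(torch.randn(1, 256, 20, 20))
print(out.shape)  # torch.Size([1, 256, 20, 20])
```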
An FPN-like structure does not make full use of shallow features and is prone to shallow feature loss, which degrades the detection of small-target weeds. To address this problem, we propose the CFI module, which fuses shallow features carrying positional information with neighboring mesoscale features and transmitted mesoscale features for cross-scale feature interaction, making full use of the rich information in the shallow features. The structure of the CFI modules is shown in Figure 8A.

Figure 8. The structure of two types of complementary feature integration (CFI) modules: (A) the structure of CFI-A, and (B) the structure of CFI-B.
Before the shallow features are input, their channel count is adjusted by a 1 × 1 convolution to match that of the transmitted mesoscale features. Subsequently, the shallow features are downsampled using a hybrid structure of maximum pooling and average pooling, which helps to retain the high-resolution detail and diversity of the weed images. Finally, the transmitted mesoscale features are concatenated in the channel dimension with the neighboring mesoscale features and the downsampled shallow features. As shown in Figure 7A, the CFI-A module is used for feature fusion at the T4 and T3 feature layers, respectively, which increases the richness of local features and prevents the loss of small-target feature information. Taking the CFI-A used in the T4 feature layer as an example, the operation is expressed by the following equations:


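Since the exact formulation is not reproduced here, the following hedged PyTorch sketch illustrates a CFI-A-style fusion as described above; the channel widths, the 2× resolution gap between levels, and the additive combination of the two pooling paths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CFIA(nn.Module):
    """Sketch of CFI-A-style fusion: align shallow-feature channels with a 1x1 conv,
    downsample with a max + average pooling hybrid, then concatenate with the
    neighboring mesoscale features and the transmitted mesoscale features."""

    def __init__(self, shallow_ch: int, mid_ch: int):
        super().__init__()
        self.align = nn.Conv2d(shallow_ch, mid_ch, kernel_size=1)
        self.max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, shallow, neighbor, transmitted):
        s = self.align(shallow)                  # match the mesoscale channel count
        s = self.max_pool(s) + self.avg_pool(s)  # hybrid downsampling to mesoscale size
        return torch.cat([s, neighbor, transmitted], dim=1)  # channel-wise splice

# Hypothetical shapes: shallow map 160x160x128, mesoscale maps 80x80x256.
fused = CFIA(128, 256)(torch.randn(1, 128, 160, 160),
                       torch.randn(1, 256, 80, 80),
                       torch.randn(1, 256, 80, 80))
print(fused.shape)  # torch.Size([1, 768, 80, 80])
```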
We also considered upsampling deep features to the mesoscale feature size to supplement semantic information, as shown in Figure 8B, but this proved less effective than CFI-A, as analyzed in detail in the experimental section below.
Varifocal Loss
The IoU-Aware Classification Score (IACS) loss function varifocal loss (VFL) is used to focus model training on small-target samples (Zhang et al. 2021). Varifocal loss was proposed on the basis of research on focal loss (FL) (Lin et al. 2017b). In this dataset, weeds account for only a small portion of each image, while most of the area is background (negative samples). The large number of negative samples degrades model training. The focal loss balances the proportion of positive and negative samples by giving greater weight to the hard-to-detect samples, as shown in the following equation:

$$FL(P, y) = \begin{cases} -(1 - P)^{\gamma}\log(P), & y = 1 \\ -P^{\gamma}\log(1 - P), & y = -1 \end{cases}$$

where y ∈ {−1, +1}, y = 1 represents the ground truth class, and P ∈ [0, 1] denotes the predicted probability of the foreground class. The terms $(1 - P)^{\gamma}$ and $P^{\gamma}$ are the modulating factors of the foreground and background classes, respectively.
The formula for varifocal loss is shown below:

$$VFL(P, q) = \begin{cases} -q\big(q\log(P) + (1 - q)\log(1 - P)\big), & q > 0 \\ -\alpha P^{\gamma}\log(1 - P), & q = 0 \end{cases}$$

where P is the predicted IACS, q is the target score, and α is a weighting factor for the negative samples. For the foreground class, the value of q is the IoU between the prediction box and the ground truth box, and for the background class, q is zero. The varifocal loss scales the loss by the coefficient $P^{\gamma}$ only for negative samples (q = 0), thereby reducing their loss contribution. Positive samples with a large q value make a larger loss contribution; thus, the model is able to focus on high-quality weed samples during training, improving the detection accuracy of small-target weeds.
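A minimal PyTorch sketch of this loss (with typical default values α = 0.75 and γ = 2.0 assumed; the target score q is the IoU of the matched ground truth for foreground predictions and 0 for background) is:

```python
import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits: torch.Tensor, target_score: torch.Tensor,
                   alpha: float = 0.75, gamma: float = 2.0) -> torch.Tensor:
    """Varifocal loss: positives are weighted by their target score q; negatives
    are down-weighted by alpha * p^gamma, so easy background is suppressed."""
    p = pred_logits.sigmoid()
    weight = torch.where(target_score > 0,
                         target_score,                   # q for positives
                         alpha * p.detach().pow(gamma))  # down-weight negatives
    bce = F.binary_cross_entropy_with_logits(pred_logits, target_score,
                                             reduction="none")
    return (weight * bce).sum()

# Toy example: two foreground predictions (q = IoU) and two background predictions.
logits = torch.tensor([2.0, -1.0, 0.5, -2.0])
q = torch.tensor([0.9, 0.0, 0.6, 0.0])
print(varifocal_loss(logits, q))
```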
Results and Discussion
Comparison of Backbone Networks
The comparison results of WeedDETR using different backbone networks are shown in Table 3. The RepCBNet collaboratively mines weed edges and texture details in images through a two-branch structure consisting of a deep feature extraction branch and a shallow gradient retention branch. Through this synergistic design, the model comprehensively captures spatial and semantic information of small-target weeds, thereby effectively improving detection accuracy. The AP0.5Weed of the RepCBNet is improved by 1.8%, 1.5%, 3.0%, 3.2%, and 5.0% compared with ResNet-34 (He et al. 2016), MobileNetv3-L (Howard et al. 2019), Swin Transformer-Tiny (Liu et al. 2021), HGNetv2-L (the backbone of RT-DETR) (Lv et al. 2023), and ConvNeXtV2-Atto (Woo et al. 2023), respectively.
Table 3. Comparison of WeedDETR with different backbone networks.

a mAP0.5, mean average precision at 0.5 intersection over union threshold; AP0.5Cabbage, average precision at 0.5 intersection over union threshold for Chinese cabbage categories; AP0.5Weed, average precision at 0.5 intersection over union threshold for weed categories.
b mAP0.5:0.95, the mean average precision computed across intersection over union thresholds from 0.5 to 0.95 with 0.05 intervals.
c FLOPs, floating-point operations.
d FPS, frames per second.
The PadConv block in RepCBNet extends the context-awareness range through padding operations, which enhance the detailed discrimination of weeds while maintaining parameter efficiency. RepCBNet has only 54.7%, 62.1%, and 66.2% of the parameters of Swin Transformer-Tiny, HGNetv2-L, and ResNet-34, respectively, achieving the best detection performance with a lighter architecture. Although MobileNetv3-L and ConvNeXtV2-Atto have fewer parameters, their lightweight design sacrifices some feature extraction capability, resulting in an inadequate ability to detect small-target weeds. These results show that using RepCBNet as the backbone can effectively extract fine-grained information, improve detection accuracy, and achieve a balance between computation and accuracy.
Effectiveness of the CFI Module
We compared the detection performance of the model when using different types of CFI modules, and the results are shown in Table 4. Compared with the model without the CFI module, the model increased AP0.5Weed by 1.1% and 2.7% with the CFI-B and the CFI-A, respectively. The results indicate that the CFI structure effectively fuses shallow features with richer detail information, achieving full integration of shallow and deep features. Compared with the CFI-B, the CFI-A is able to detect small-target weeds more accurately while using similar computational effort. This phenomenon may be due to the fact that direct access to the underlying information, rather than using deeper information for upsampling, is more conducive to feature complementarity and avoiding feature confusion (Wang et al. 2023). The experimental results demonstrate that the CFI-A module mitigates the problem of small-target weed information being ignored in the deep network through feature complementary fusion, hence its use in composing the FCFM.
Table 4. Comparison of WeedDETR with different complementary feature integration (CFI) modules.

a CFI-A, mode A of the CFI module; CFI-B, mode B of the CFI module.
b mAP0.5, mean average precision at 0.5 intersection over union threshold; AP0.5Cabbage, average precision at 0.5 intersection over union threshold for Chinese cabbage categories; AP0.5Weed, average precision at 0.5 intersection over union threshold for weed categories.
c mAP0.5:0.95, the mean average precision computed across intersection over union thresholds from 0.5 to 0.95 with 0.05 intervals.
d FLOPs, floating-point operations.
Comparison of Loss Functions
The loss function affects only the computation of losses during model training; it does not increase the parameters or the FLOPs. The experimental results are shown in Table 5, where the AP0.5Weed increased by 0.6% and 1.2% after using FL and VFL in training, respectively. VFL is more capable than FL of focusing on hard-to-detect weed samples, thus effectively improving model detection performance (Du and Jiao 2022; Peng et al. 2022).
Table 5. Comparison of the results between focal loss (FL) and varifocal loss (VFL).

a mAP0.5, mean average precision at 0.5 intersection over union threshold; AP0.5Cabbage, average precision at 0.5 intersection over union threshold for Chinese cabbage categories; AP0.5Weed, average precision at 0.5 intersection over union threshold for weed categories.
b mAP0.5:0.95, the mean average precision computed across intersection over union thresholds from 0.5 to 0.95 with 0.05 intervals.
c FLOPs, floating-point operations.
Ablation Experiment
The three improvements enhance detection performance to varying degrees, as shown by the ablation experiment results in Table 6. RepCBNet reduces the number of parameters while acquiring feature representations at a finer granularity. The FCFM incorporates multi-layered low-level features, which effectively improves the accuracy of weed detection, resulting in a 2.7% improvement in AP0.5Weed. By introducing VFL in the loss calculation, the loss weights of complex samples are increased, and weed detection accuracy is improved. The FCFM provides rich weed features as discriminative guidance for VFL during sample reweighting, while VFL compels the model to prioritize learning the critical spatial features of difficult samples captured by the FCFM. Their synergistic interaction achieves a 3.0% improvement in AP0.5Weed. The experimental results showed that WeedDETR improved small-target weed detection by 2.4% in mAP0.5 and 4.5% in AP0.5Weed compared with RT-DETR.
Table 6. Results of ablation experiment.

a FCFM, feature complement fusion module.
b VFL, varifocal loss.
c mAP0.5, mean average precision at 0.5 intersection over union threshold; AP0.5Cabbage, average precision at 0.5 intersection over union threshold for Chinese cabbage categories; AP0.5Weed, average precision at 0.5 intersection over union threshold for weed categories.
d mAP0.5:0.95, the mean average precision computed across intersection over union thresholds from 0.5 to 0.95 with 0.05 intervals.
e FLOPs, floating-point operations.
Heat maps for WeedDETR and RT-DETR are shown in Figure 9. The darker red areas in the heat maps indicate the areas of the feature maps that the models focus on. RT-DETR has insufficient perception of small-target weeds and is prone to missing them, whereas WeedDETR focuses more comprehensively on small-target weeds and has better weed detection performance.

Figure 9. Heat map comparison of RT-DETR and WeedDETR. The darker red areas in the heat maps indicate the areas of the feature maps that the models focus on.
Re-parameterization Experiment
The re-parameterization operation is applied only during inference, merging the training-stage multi-branch structures into a single-branch equivalent to eliminate computational redundancy while preserving the original training model architecture (Zhang and Wan 2024). As shown in Table 7, the operation reduces the parameters by 16.9% and the FLOPs by 16.8% during inference, which improves efficiency while maintaining detection accuracy. This efficiency–accuracy decoupling strategy enhances the model’s computational efficiency, thereby facilitating its deployment on agricultural edge devices with limited memory and computational resources.
Table 7. Parameters and floating-point operations (FLOPs) change during training and inference.

a mAP0.5, mean average precision at 0.5 intersection over union threshold; AP0.5Cabbage, average precision at 0.5 intersection over union threshold for Chinese cabbage categories; AP0.5Weed, average precision at 0.5 intersection over union threshold for weed categories.
b mAP0.5:0.95, the mean average precision computed across intersection over union thresholds from 0.5 to 0.95 with 0.05 intervals.
Comparison of Results with Other Detection Models
The performance of WeedDETR was comprehensively compared with state-of-the-art detection models, including Faster R-CNN (Ren et al. 2017), SSD (Liu et al. 2016), RetinaNet (Lin et al. 2017b), and YOLO series models represented by YOLOv3-SPP (Redmon and Farhadi 2018), YOLOv5-L (Jocher 2020), YOLOv6-M (Li et al. 2022), and YOLOv8-L (Ultralytics 2023), all of which were trained from scratch; the results are shown in Table 8. Faster R-CNN and RetinaNet performed poorly at weed detection, while the YOLO models achieved better detection results. Compared with YOLOv5-L, the best performer in the YOLO series, WeedDETR shows a 1.9% improvement in mAP0.5, a 3.5% improvement in AP0.5Weed, and a 1.6% improvement in mAP0.5:0.95. WeedDETR achieves dual efficiency in parameters and computational complexity, with 19.92 M parameters and 58.20 G FLOPs, while attaining the highest real-time detection speed (76.28 FPS) among the compared models.
Table 8. Comparison of detection results of different detection models.

a YOLO, You Only Look Once; Faster R-CNN, faster regions with convolutional neural networks; SSD, single-shot detector.
b mAP0.5, mean average precision at 0.5 intersection over union threshold; AP0.5Cabbage, average precision at 0.5 intersection over union threshold for Chinese cabbage categories; AP0.5Weed, average precision at 0.5 intersection over union threshold for weed categories.
c mAP0.5:0.95, the mean average precision computed across intersection over union thresholds from 0.5 to 0.95 with 0.05 intervals.
d FLOPs, floating-point operations.
e FPS, frames per second.
The PR curves for the four models with the highest detection accuracy are illustrated in Figure 10. The PR curve of WeedDETR encloses a larger area than those of YOLOv5-L, YOLOv6-M, and YOLOv8-L, indicating that the proposed model exhibits higher detection accuracy.

Figure 10. Comparison of precision-recall (PR) curves.
Visualization of Prediction Results with Other Mainstream Detection Models
A comparison of detection results among the four most accurate models is presented in Figures 11 and 12. The detection results of the models under three complex backgrounds (shadow occlusion, rice straw occlusion, and waterbody interference) are presented in Figure 11. All models accurately detected Chinese cabbage in both shadow-obscured and straw-obscured backgrounds. However, in the waterbody-interference background, YOLOv5-L and YOLOv6-M produced false detections of marginal Chinese cabbage leaves, as shown in Figure 11C. As illustrated in Figure 11A and 11B, WeedDETR captures shaded or obscured weeds more accurately than the other models, demonstrating its detection robustness in complex environments. RepCBNet accurately extracts information about the differences between weeds and background, allowing WeedDETR to efficiently detect weeds obscured by shadows or straw.

Figure 11. Comparison of detection results by different models in complex backgrounds. The three scenarios are: (A) shadow occlusion, (B) rice straw occlusion, and (C) waterbody interference. Red boxes represent detected Chinese cabbage, brown boxes represent detected weeds, and yellow boxes represent missed weeds.

Figure 12. Comparison of detection results by different models in small-target weeds scenarios. Red boxes represent detected Chinese cabbage, brown boxes represent detected weeds, and yellow boxes represent missed weeds.
All models accurately detected Chinese cabbage as an obvious target, but for small-target weeds, all models except WeedDETR missed some weeds, as shown in Figure 12. The limited multilevel utilization of shallow features in YOLO series models might lead to progressive degradation of small-target weed representations during deep network propagation, potentially contributing to suboptimal weed detection performance, particularly in dense small-target weed scenarios, as shown in Figure 12B (Zhang 2023; Zhang et al. 2024). To address this phenomenon, WeedDETR effectively mitigates the loss of small-target features and achieves enhanced weed detection accuracy through the collaboration of shallow feature information supplementation and cross-scale global semantic feature fusion.
Comparative results show that WeedDETR exhibits better performance in accurately detecting small-target weeds in complex backgrounds. Based on the conducted analysis, the proposed model is able to accurately detect small-target weeds in UAV-captured images, effectively mitigating the phenomenon of underdetection of small-target weeds and accelerating the inference speed through re-parameterization convolution. Compared with other detection models, WeedDETR detects weeds with higher accuracy and faster inference, which can meet the field deployment requirements of UAVs for weed detection applications. Additionally, we have developed a weed imagery detection system built upon WeedDETR, showcasing its ability to detect weeds in high-resolution drone-captured images, with implementation details provided in the Supplementary Material.
Conclusions
To address the lack of UAV-based weed detection datasets and the limited ability of existing models to detect small-target weeds, we constructed a high-quality field UAV weed detection dataset and proposed WeedDETR based on the characteristics of small-target weeds. WeedDETR achieved 73.9% and 91.8% AP0.5 in the weed and Chinese cabbage categories at 76.28 FPS, outperforming existing state-of-the-art detection models. In addition, the proposed model establishes a highly reliable algorithmic foundation for intelligent weeding equipment. The weed density heat maps generated by the model can further guide variable spraying systems to achieve weed-targeted precision spraying, thereby reducing herbicide usage (Xu et al. 2025). Furthermore, the model can be extended to dynamic monitoring of herbicide-resistant weeds by analyzing the spatial dispersion patterns of specific weed populations across continuous multi-season data, thereby providing data-driven support for optimizing crop rotation systems and herbicide rotation strategies (Vasileiou et al. 2024).
While WeedDETR demonstrated robust performance on the GZWeed dataset, its generalizability is limited by the single-crop scenario of the current dataset, in which weed species are not distinguished. Moreover, the PyTorch-based weights of WeedDETR can be further converted to lightweight inference frameworks such as TensorFlow Lite and ONNX (Open Neural Network Exchange) to enhance computational efficiency on agricultural terminals. In our next work, we will construct a multi-crop field weed dataset using UAV platforms and systematically evaluate the model’s generalization capability for cross-crop weed detection. In addition, we aim to improve the inference efficiency of the model by compressing model parameters and computational overhead through knowledge distillation and model pruning, thereby advancing the implementation of SSWM.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/wsc.2025.10035
Data availability statement
The dataset with annotation is accessible to the public and thoroughly documented on GitHub: https://github.com/sxyang4399/GZWeed.
Funding statement
This study was supported by the National Key Research and Development Program of China (2021YFE0113700), the National Natural Science Foundation of China (32360705; 31960555), the Guizhou Provincial Science and Technology Program (CXTD[2025]041; GCC[2023]070; HZJD[2022]001), and the Program for Introducing Talents to Chinese Universities (111 Program; D20023).
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.