
A dual-stage system for real-time license plate detection and recognition on mobile security robots

Published online by Cambridge University Press:  02 January 2025

Amir Ismail*
Affiliation:
Ecole Nationale d’Ingénieurs de Sousse, LATIS-Laboratory of Advanced Technology and Intelligent Systems, Université de Sousse, Sousse, Tunisia; Enova Robotics S.A., Novation City Technopole de Sousse, Sousse, Tunisia
Maroua Mehri
Affiliation:
Ecole Nationale d’Ingénieurs de Sousse, LATIS-Laboratory of Advanced Technology and Intelligent Systems, Université de Sousse, Sousse, Tunisia
Anis Sahbani
Affiliation:
Enova Robotics S.A., Novation City Technopole de Sousse, Sousse, Tunisia; Institute for Intelligent Systems and Robotics (ISIR), CNRS, Sorbonne Université, Paris, France
Najoua Essoukri Ben Amara
Affiliation:
Ecole Nationale d’Ingénieurs de Sousse, LATIS-Laboratory of Advanced Technology and Intelligent Systems, Université de Sousse, Sousse, Tunisia
*
Corresponding author: Amir Ismail; Email: amir.ismail@eniso.u-sousse.tn

Abstract

Automatic license plate recognition (ALPR) systems are increasingly used to address surveillance and security tasks. However, these systems typically assume constrained recognition scenarios, which restricts their practical use. In this article, we therefore address the challenge of recognizing vehicle license plates (LPs) from the video feeds of a mobile security robot by proposing an efficient two-stage ALPR system. Our ALPR system combines the off-the-shelf YOLOv7x model with a novel LP recognition model, called vision transformer-based LP recognizer (ViTLPR). ViTLPR relies on the self-attention mechanism to read character sequences on LPs. To ease the deployment of our ALPR system on mobile security robots and improve its inference speed, we also propose an optimization strategy. As an additional contribution, we provide an ALPR dataset, named PGTLP-v2, collected from surveillance robots patrolling several plants. The PGTLP-v2 dataset offers diverse characteristics that chiefly cover the in-the-wild scenario. To evaluate the effectiveness of our ALPR system, experiments are carried out on the PGTLP-v2 dataset and five benchmark ALPR datasets collected from different countries. Extensive experiments demonstrate that our proposed ALPR system outperforms state-of-the-art baselines.
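The two-stage design described in the abstract (detection first, then recognition on each cropped plate) can be sketched as follows. This is a minimal illustration of the control flow only: `detect_plates` and `recognize_plate` are hypothetical stand-ins for the trained YOLOv7x detector and the ViTLPR recognizer, not the authors' actual inference code.

```python
import numpy as np

def detect_plates(frame):
    # Hypothetical stand-in for the YOLOv7x detection stage: returns
    # license plate bounding boxes as (x, y, w, h) tuples in pixels.
    h, w = frame.shape[:2]
    return [(int(0.4 * w), int(0.6 * h), int(0.2 * w), int(0.1 * h))]

def recognize_plate(crop):
    # Hypothetical stand-in for the ViTLPR recognition stage: maps a
    # cropped plate image to its character sequence.
    return "ABC1234"

def alpr_pipeline(frame):
    """Two-stage ALPR: detect plates in the frame, then read each crop."""
    results = []
    for (x, y, w, h) in detect_plates(frame):
        crop = frame[y:y + h, x:x + w]          # stage 1 output feeds stage 2
        results.append(((x, y, w, h), recognize_plate(crop)))
    return results

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # e.g., one 1920x1080 video frame
print(alpr_pipeline(frame))
```

Keeping the two stages behind separate functions mirrors the paper's modularity: either model can be swapped or optimized for on-robot deployment without touching the other.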

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. PGuard robot scope. PGuard is equipped with advanced features that enable it to effectively patrol and secure specific plants, either autonomously or through remote control. It streams real-time video and audio for monitoring and video analytics.


Figure 2. Pipeline of the proposed automatic license plate recognition system deployed on PGuard.


Figure 3. Vision transformer-based LP recognizer architecture. Raw license plate images are partitioned into square patches and transformed into a sequence of vectors. After adding positional information, the vectors are passed through a stack of $L$ vanilla transformer encoders. Finally, the feature sequence is fed to the prediction head for character recognition.
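The front end described in the Figure 3 caption (square patches, flattening, and positional information) can be sketched in NumPy. This is a generic vanilla-ViT patch embedding under assumed shapes, not the authors' implementation; the projection and positional matrices are randomly initialized here purely for illustration, where real ViTLPR weights would be learned.

```python
import numpy as np

def patchify(img, p):
    """Split an H x W x C image into non-overlapping p x p patches,
    each flattened to a vector of length p*p*C."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image dims must be divisible by patch size"
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)          # (num_patches, p*p*C)

def embed(img, p, d, rng):
    """Linear patch projection plus positional embeddings, producing the
    vector sequence that the stacked transformer encoders consume."""
    x = patchify(img, p)                           # (N, p*p*C)
    W_proj = rng.standard_normal((x.shape[1], d)) * 0.02  # stand-in for learned weights
    pos = rng.standard_normal((x.shape[0], d)) * 0.02     # stand-in for positional info
    return x @ W_proj + pos                        # (N, d) sequence of patch tokens

rng = np.random.default_rng(0)
img = rng.random((32, 96, 3))        # a small plate-shaped image (H=32, W=96)
seq = embed(img, p=16, d=192, rng=rng)
print(seq.shape)                     # (12, 192): 2 x 6 patches of 16 x 16
```

A 32 x 96 crop with 16-pixel patches yields a 12-token sequence, which the encoder stack then contextualizes before the prediction head decodes characters.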


Table I. Specifications of the six benchmarks used in our experiments to evaluate the performance of the proposed automatic license plate recognition (ALPR) system.


Figure 4. Image samples in PGTLP-v2. From top to bottom, the first row presents images ($1,920 \times 1,080$) captured at an entrance checkpoint. The second row contains $180^{\circ}$ panoramic-view images ($2,560 \times 1,024$) captured at a restricted-access plant.


Figure 5. Image samples from the six datasets used in our experiments and their respective license plates with respect to the ground-truth annotations.


Table II. An overview of the number of images used for training, testing, and validation in each dataset.


Table III. Architectural parameters of vision transformer-based LP recognizer.


Table IV. Selected hyperparameters for training the two modules of the proposed automatic license plate recognition system (YOLOv7x and vision transformer-based LP recognizer (ViTLPR)).


Table V. Detection and recognition results on PGTLP-v2.


Figure 6. Qualitative results of vision transformer-based LP recognizer on image samples from benchmarks used in our experiments (PGTLP-v2, UFPR-ALPR, CCPD, AOLP-RP, LSV-LP, and RodoSol-ALPR). Best viewed in color and zoomed in.


Table VI. Detection and recognition results on LSV-LP.


Table VII. Detection and recognition results on RodoSol-ALPR.


Table VIII. Detection and recognition results on UFPR-ALPR.


Table IX. Detection and recognition results on CCPD.


Table X. Detection and recognition results on AOLP.


Table XI. Performance of the proposed automatic license plate recognition system on the PGTLP-v2 test set.


Table XII. Recall and license plate recognition rates following the Try-One-Dataset-Out (*) validation protocol.


Table XIII. License plate recognition rates (LP-RR) and inference times achieved without a deblurring step (w/o) and with a deblurring module applied using LaKDNet [60] (ViTLPR w/ LaKDNet) and NAFNet [61] (ViTLPR w/ NAFNet).


Figure 7. Qualitative results of the detected license plates without a deblurring step (w/o) and with a deblurring module applied using LaKDNet [60] (ViTLPR w/ LaKDNet) and NAFNet [61] (ViTLPR w/ NAFNet). Best viewed zoomed in.