
RT-SCNNs: real-time spiking convolutional neural networks for a novel hand gesture recognition using time-domain mm-wave radar data

Published online by Cambridge University Press:  09 January 2024

Ahmed Shaaban*
Affiliation:
Institute for Electronics Engineering, University of Erlangen-Nuremberg, Erlangen, Germany Infineon Technologies AG, Munich, Germany
Maximilian Strobel
Affiliation:
Infineon Technologies AG, Munich, Germany
Wolfgang Furtner
Affiliation:
Infineon Technologies AG, Munich, Germany
Robert Weigel
Affiliation:
Institute for Electronics Engineering, University of Erlangen-Nuremberg, Erlangen, Germany
Fabian Lurz
Affiliation:
Institute for Electronics Engineering, University of Erlangen-Nuremberg, Erlangen, Germany Chair of Integrated Electronic Systems, Otto-von-Guericke-University Magdeburg, Magdeburg, Germany
*
Corresponding author: Ahmed Shaaban; Email: ahmed.shaaban@fau.de

Abstract

This study introduces a novel approach to radar-based hand gesture recognition (HGR), addressing the challenges of energy efficiency and reliability by performing real-time gesture recognition at the frame level. Our solution bypasses the computationally expensive preprocessing steps, such as 2D fast Fourier transforms (FFTs), traditionally employed to generate range-Doppler information. Instead, we operate directly on time-domain radar data and harness the energy-efficient capabilities of spiking neural network (SNN) models, recognized for their sparsity and spike-based communication, thus optimizing the overall energy efficiency of the proposed solution. Experimental results confirm the effectiveness of our approach, with classification on the test dataset reaching a peak mean accuracy of 99.75%. To further validate the reliability of our solution, individuals who did not participate in the dataset collection conducted real-time live testing, demonstrating the consistency of our theoretical findings. Real-time inference exhibits a substantial degree of spike sparsity, ranging from 75% to 97% depending on whether a gesture is being performed. By eliminating the computational burden of preprocessing and leveraging the power of SNNs, our solution presents a promising alternative that enhances the performance and usability of radar-based HGR systems.

Information

Type
EuMW 2022 Special Issue
Creative Commons
CC BY-NC
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial licence (http://creativecommons.org/licenses/by-nc/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2024. Published by Cambridge University Press in association with the European Microwave Association.
Figure 1. (a) FMCW radar block diagram. (b) Infineon’s FMCW radar chipset with a transmitter antenna and three L-shaped receiving antennas.

Table 1. Radar operating parameters

Figure 2. A visual representation showcasing the execution of the recorded gesture.

Figure 3. Configuration of the radar recording setup, illustrating the range of recording positions, angles, and height adjustments of the radar holder.

Figure 4. Outline of the gesture frame detection process: (a) Preprocessing on each frame involves a first FFT on the fast-time dimension to generate the range profile (RProfile). Smoothing and refinement locate the first local maxima as RGesture, representing the range bin of the hand. A Doppler FFT on RGesture produces the Doppler profile (DProfile). The peak signal amplitude in DProfile is designated as PGesture. (b) Frame refinement: Using RGesture and PGesture values for each frame, a refinement process is performed across all frames. Frames with a PGesture value below the predetermined threshold are discarded as they are considered not to contain any gesture. From the remaining frames, the frame closest to the radar, determined by the nearest RGesture index, is identified as FGesture, indicating the frame where the hand performed the gesture and was closest to the radar.
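The per-frame preprocessing and frame-refinement steps described in the caption can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the smoothing window of 3, and the simple local-maximum search are all assumptions for illustration.

```python
import numpy as np

def frame_features(frame):
    """Hypothetical per-frame preprocessing (RGesture/PGesture as in Fig. 4).
    frame: 2D array, shape (n_chirps, n_samples), time-domain radar data."""
    # First FFT along the fast-time (sample) axis -> range information
    range_fft = np.fft.fft(frame, axis=1)
    # Range profile (RProfile), averaged over chirps and lightly smoothed
    r_profile = np.abs(range_fft).mean(axis=0)
    r_profile = np.convolve(r_profile, np.ones(3) / 3, mode="same")
    # First local maximum -> range bin of the hand (RGesture)
    interior = (r_profile[1:-1] > r_profile[:-2]) & (r_profile[1:-1] > r_profile[2:])
    peaks = np.flatnonzero(interior) + 1
    r_gesture = int(peaks[0]) if peaks.size else int(np.argmax(r_profile))
    # Doppler FFT over slow time at RGesture -> Doppler profile (DProfile)
    d_profile = np.abs(np.fft.fft(range_fft[:, r_gesture]))
    # Peak Doppler amplitude (PGesture)
    p_gesture = float(d_profile.max())
    return r_gesture, p_gesture

def select_gesture_frame(frames, p_threshold):
    """Frame refinement: discard frames with PGesture below the threshold,
    then pick the frame whose hand range bin is nearest the radar (FGesture)."""
    feats = [frame_features(f) for f in frames]
    kept = [(i, r) for i, (r, p) in enumerate(feats) if p >= p_threshold]
    if not kept:
        return None  # no frame contains a gesture
    return min(kept, key=lambda ir: ir[1])[0]
```

The threshold value itself is a tuning parameter; the caption only states that sub-threshold frames are treated as containing no gesture.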

Figure 5. Illustration of gesture frame detection for a SwipeLeft gesture. (a) Conventional range spectrogram of a SwipeLeft gesture. (b) Conventional Doppler spectrogram. (c) and (d) RProfile and DProfile for all 100 frames within the SwipeLeft gesture, respectively. Frames in the gray areas are discarded due to thresholding during FGesture estimation. These frames are not clearly visible in panels (a) and (b), confirming that they represent frames with gesture-accompanying noise rather than those with actual gesture execution. The green dotted line highlights the estimated FGesture, indicating the hand’s closest position to the radar during the gesture. The results from panels (c) and (d) are corroborated by the conventional spectrograms in panels (a) and (b).

Figure 6. The architecture of the SCNN model is presented, with each convolutional layer annotated with its respective input channels and kernel size. The initial convolutional layer comprises three input channels, indicating that the network receives a single frame (composed of chirps or samples) from a single gesture at a time, with the input channels corresponding to the three antennas. Max-pooling layers with a stride of 2 are utilized. The first layer of LIF spiking neurons converts the output of the initial max-pooled convolutional layer into spike representations, which are then propagated to the subsequent layers of the network. The membrane potential of the final LIF spiking neurons layer is stacked for each frame, denoted by N, indicating the number of frames in the processed gesture.

Figure 7. The LIF spiking neuron operating principle. (a) Spikes as inputs to the LIF neuron over time. (b) The membrane potential integrates over input spikes over time with a decay rate of beta. (c) An output spike is generated only when the membrane potential exceeds the spiking threshold (Vth).
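The LIF dynamics in the caption can be sketched in a few lines of Python. This is a toy sketch: the default beta, threshold, and the subtract-on-spike reset are illustrative choices, not the paper's exact parameters.

```python
import numpy as np

def lif_neuron(spikes, beta=0.9, v_th=1.0):
    """Simulate a leaky integrate-and-fire (LIF) neuron over a spike train.
    spikes: 1D binary input spike train
    beta:   membrane potential decay factor per time step
    v_th:   spiking threshold (Vth)
    Returns the output spike train and the membrane potential trace."""
    v = 0.0
    out, trace = [], []
    for s in spikes:
        v = beta * v + s      # leaky integration of input spikes
        if v >= v_th:         # threshold crossing -> emit an output spike
            out.append(1)
            v -= v_th         # reset by subtracting the threshold
        else:
            out.append(0)
        trace.append(v)
    return np.array(out), np.array(trace)
```

With sub-threshold input the potential merely decays and no spikes are emitted, which is the source of the sparsity the abstract reports.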

Table 2. Optuna hyperparameter search range

Table 3. Hyperparameters overview

Table 4. Comparison of time-domain frame-based and record-based approaches

Table 5. Average confusion matrix

Table 6. Average refined confusion matrix

Figure 8. Prediction results of the SCNN model for three gestures (300 frames). The predicted labels show a high level of agreement with the ground truth; minor shifts between predictions and labels are observed in certain cases, which illustrates the model's frame-level behavior.

Supplementary material: File

Shaaban et al. supplementary material (File, 1 MB)