
Voice-enabled human-robot interaction: adaptive self-learning systems for enhanced collaboration

Published online by Cambridge University Press: 23 May 2025

Indra Kishor
Affiliation:
Department of CSE, Poornima Institute of Engineering & Technology, Jaipur, Rajasthan, India
Udit Mamodiya
Affiliation:
Faculty of Engineering & Technology, Poornima University, Jaipur, Rajasthan, India
Sumit Saini*
Affiliation:
Department of Electrical Engineering, Central University of Haryana, Mahendergarh, Haryana, India
Badre Bossoufi
Affiliation:
LIMAS Laboratory, Faculty of Sciences Dhar El Mahraz, Sidi Mohammed Ben Abdellah University, Fez, Morocco
*
Corresponding author: Sumit Saini; Email: drsumiteed@cuh.ac.in

Abstract

This research proposes an adaptive human-robot interaction (HRI) system that combines voice recognition, emotional context detection, decision-making, and self-learning. The aim is to overcome challenges in dynamic and noisy environments while achieving real-time, scalable performance. The architecture is a three-stage HRI pipeline: voice input acquisition, feature extraction, and adaptive decision-making. For voice recognition, modern pre-processing techniques and mel-frequency cepstral coefficients (MFCCs) are used to recognize commands robustly. Emotional context detection is performed by a neural network classifier operating on pitch, energy, and jitter features. Decision-making uses reinforcement learning: after each action, the user is prompted for feedback, which serves as the basis for re-evaluation. Iterative self-learning mechanisms increase adaptability by dynamically updating stored patterns and policies. The experimental results show substantial improvements in recognition accuracy, task success rate, and emotional detection. The proposed system achieved 95% recognition accuracy and a 96% task success rate, even under challenging noise conditions, and emotional detection reached an F1-score of 92%. Real-world validation showed the system's ability to adapt dynamically, reducing latency by 15% through self-learning. The proposed system has potential applications in assistive robotics, interactive learning systems, and smart environments, addressing scalability and adaptability for real-world deployment. Its novel contribution to adaptive HRI arises from the integration of voice recognition, emotional context detection, and self-learning mechanisms. The findings bridge theoretical advancements and practical system improvements in human-robot collaboration.
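To make the voice-processing stage concrete, the sketch below shows a minimal MFCC feature-extraction step of the kind the abstract describes, assuming the librosa library; the function name and parameter values are illustrative, not taken from the paper.

import librosa
import numpy as np

def extract_command_features(wav_path, n_mfcc=13):
    """Return a time-averaged MFCC vector for one utterance (illustrative)."""
    signal, sr = librosa.load(wav_path, sr=16000)      # resample to 16 kHz
    signal = librosa.effects.preemphasis(signal)       # simple pre-processing step
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                           # one feature vector per utterance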

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. Methodology flowchart of the proposed system: a detailed representation of the workflow, highlighting key stages (initialization, feature analysis, decision-making, and task execution) with integrated self-learning and feedback mechanisms.


Figure 2. System architecture diagram. This diagram shows a detailed representation of the self-learning and data-matching processes, encompassing input, feature analysis, decision-making, and output sections with integrated components.


Algorithm 1: Voice Recognition Workflow
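Algorithm 1 itself is only referenced here as a figure. As a rough, hypothetical illustration of such a workflow (not the authors' actual algorithm), a template-matching recognizer with a simple self-learning update might look like this:

import numpy as np

def recognize_command(features, pattern_store, threshold=0.8):
    """Match a feature vector against stored command templates by cosine similarity.

    Hypothetical sketch of a recognize-then-adapt loop, not the paper's Algorithm 1.
    """
    best_cmd, best_score = None, -1.0
    for cmd, template in pattern_store.items():
        score = float(np.dot(features, template) /
                      (np.linalg.norm(features) * np.linalg.norm(template)))
        if score > best_score:
            best_cmd, best_score = cmd, score
    if best_score < threshold:
        return None   # low confidence: prompt the user for feedback instead
    # self-learning: nudge the stored template toward the new observation
    pattern_store[best_cmd] = 0.9 * pattern_store[best_cmd] + 0.1 * features
    return best_cmd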


Figure 3. Representation of voice signals: (a) Original clean voice signal, (b) noisy voice signal with added background interference, and (c) spectrogram of the noisy voice signal showing time-frequency energy distribution for further processing.
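A time-frequency view like panel (c) can be produced with a short-time Fourier transform; a minimal sketch using scipy, with the sampling rate, test tone, and noise level all assumed for illustration:

import numpy as np
from scipy import signal as sps
import matplotlib.pyplot as plt

fs = 16000                                    # assumed sampling rate (Hz)
t = np.arange(fs) / fs                        # one second of samples
clean = np.sin(2 * np.pi * 440 * t)           # stand-in for a clean voice signal
noisy = clean + 0.5 * np.random.randn(fs)     # added background interference

f, tt, Sxx = sps.spectrogram(noisy, fs=fs, nperseg=512)
plt.pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12))   # energy distribution in dB
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()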


Figure 4. Mood detection probabilities. Probabilities of mood categories (Happy, Neutral, Sad, Angry) predicted by the emotional context detection module for a given voice input.


Figure 5. Heatmap of action probabilities, showing the robot's decision-making probabilities across states.
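State-action probabilities of the kind shown in this heatmap are commonly obtained by applying a softmax to learned Q-values; a minimal sketch, with the toy Q-table and temperature chosen purely for illustration:

import numpy as np

def action_probabilities(q_table, temperature=1.0):
    """Softmax over each state's Q-values: one probability row per state."""
    z = q_table / temperature
    z = z - z.max(axis=1, keepdims=True)      # subtract row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

q = np.array([[1.2, 0.3, 0.5],                # toy Q-table: 2 states x 3 actions
              [0.1, 0.9, 0.4]])
print(action_probabilities(q))                # each row sums to 1; plot as a heatmap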


Figure 6. Experimental setup of the proposed system: (1) Hardware integration with robotic components for voice acquisition and task execution, (2) Testing and interface debugging with voice signal visualization, (3) Prototype robot design for interaction and navigation, and (4) Microcontroller and sensor setup for real-time operation and processing.


Figure 7. Confusion matrix for voice recognition performance, illustrating the system’s accuracy in command classification.


Table I. Comparison of voice recognition accuracy under moderate noise conditions.


Figure 8. Voice recognition accuracy at different noise levels, comparing performance with and without self-learning. The graph shows that self-learning mechanisms yield substantial accuracy improvements, especially under high and extreme noise levels.


Figure 9. Performance analysis of the proposed system. The 3D surface plot shows how noise levels and feedback iterations jointly affect the system's overall accuracy. The 2D line plot shows the progressive improvement over successive iterations, demonstrating the system's self-learning and adaptability.


Figure 10. Performance metrics for emotional detection, showing precision, recall, and F1-score for each emotion classification task.


Figure 11. Emotional state recognition confidence. This figure demonstrates the impact of different voice features (pitch, energy, jitter) on emotion recognition confidence across multiple emotion categories.
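The pitch, energy, and jitter features named in this figure can be estimated from a raw waveform; a rough sketch assuming librosa's pyin pitch tracker, with the frequency bounds and the jitter formula (mean absolute difference of consecutive pitch periods over the mean period) as illustrative choices:

import numpy as np
import librosa

def prosodic_features(wav_path):
    """Rough pitch/energy/jitter estimates for emotion classification (illustrative)."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[voiced]                            # keep voiced frames only
    periods = 1.0 / f0                         # pitch periods in seconds
    pitch = float(np.mean(f0))                 # mean fundamental frequency (Hz)
    energy = float(np.mean(y ** 2))            # mean signal power
    jitter = float(np.mean(np.abs(np.diff(periods))) / np.mean(periods))
    return pitch, energy, jitter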


Figure 12. Impact of self-learning on recognition accuracy, showcasing the improvement over 20 interaction iterations compared to the baseline without self-learning.
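The iterative improvement shown here comes from feedback-driven learning; one standard way to realize it is a tabular Q-learning update with user feedback as the reward signal. The sketch below is a generic illustration under that assumption, with the table sizes, learning rate, and feedback distribution invented for the example:

import numpy as np

def q_update(q, state, action, reward, alpha=0.1):
    """One tabular update: move Q(state, action) toward the feedback reward."""
    q[state, action] += alpha * (reward - q[state, action])

q = np.zeros((4, 3))                           # toy table: 4 states x 3 actions
rng = np.random.default_rng(0)
for it in range(20):                           # 20 interaction iterations, as in Figure 12
    s, a = it % 4, rng.integers(3)
    feedback = 1.0 if rng.random() < 0.7 else -1.0   # simulated user feedback
    q_update(q, s, a, feedback)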


Table II. Performance comparison of the system with and without self-learning, highlighting improvements in recognition accuracy and task success rate. The self-learning mechanism demonstrates its impact by adapting to dynamic environments and refining system performance over iterative interactions.


Figure 13. Diagram of self-learning effectiveness, showing the relationship among learning iterations, task complexity, and accuracy. The system's adaptive capabilities and self-learning efficiency improve performance over time.


Figure 14. Heatmap of action probabilities during decision-making, highlighting the system's adaptability in choosing optimal actions for various states.


Figure 15. (a, b) Real-world demonstration of the proposed robot system interacting with users. The figure highlights the robot’s ability to process voice commands, adapt to dynamic scenarios, and refine task execution through iterative learning.


Figure 16. Command accuracy improvement over five iterations of self-learning. The visual demonstrates the system’s ability to adapt and refine task execution in a real-world noisy environment.


Figure 17. Performance comparison of models. This visualization compares the accuracy of different AI models under varying test conditions. The proposed model demonstrates superior performance, particularly in challenging scenarios.


Figure 18. Latency analysis across processing stages. The surface plot illustrates system response time across various voice input lengths and processing stages, ensuring optimized real-time execution.


Figure 19. Multi-panel visualization showcasing energy consumption trends: (a) High-Performance Domain for high workload scenarios, (b) Optimized Efficiency Domain for energy savings, and (c) Energy Trade-Off Analysis highlighting critical points (A, B, C) in task complexity and energy consumption balance.