
Manipulate as human: learning task-oriented manipulation skills by adversarial motion priors

Published online by Cambridge University Press:  11 June 2025

Ziqi Ma
Affiliation:
ParisTech Elite Institute of Technology, Shanghai Jiao Tong University, Shanghai, P.R. China
Changda Tian
Affiliation:
Department of Automation, Shanghai Jiao Tong University, Shanghai, P.R. China
Yue Gao*
Affiliation:
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, P.R. China. Shanghai Innovation Institute, Shanghai, P.R. China
*
Corresponding author: Yue Gao; Email: yuegao@sjtu.edu.cn

Abstract

In recent years, there has been growing interest in developing robots and autonomous systems that can interact with humans in a more natural and intuitive way. One of the key challenges in achieving this goal is to enable these systems to manipulate objects and tools in a manner similar to that of humans. In this paper, we propose a novel approach for learning human-style manipulation skills by using adversarial motion priors, which we name HMAMP. The approach leverages adversarial networks to model the complex dynamics of tool and object manipulation together with the aim of the manipulation task. The discriminator is trained on a combination of real-world data and simulation data generated by the agent, and it drives the training of a policy that produces realistic motion trajectories matching the statistical properties of human motion. We evaluated HMAMP on one challenging manipulation task, hammering, and the results indicate that HMAMP is capable of learning human-style manipulation skills that outperform current baseline methods. Additionally, we demonstrate the potential of HMAMP for real-world applications by performing the hammering task on a real robot arm. Overall, HMAMP represents a significant step towards developing robots and autonomous systems that interact with humans in a more natural and intuitive way by learning to manipulate tools and objects in a manner similar to how humans do.

Information

Type
Research Article
Copyright
© The Author(s), 2025. Published by Cambridge University Press

1. Introduction

Manipulating tools with robot arms is a long-standing area of study in the field of robot intelligence. To effectively manipulate an object with a tool and achieve a specific goal, robots must develop a comprehensive understanding of the environment based on sensor data and then perform intricate physical interactions with targets. Several prior efforts have focused on data-driven methods that aim to learn reliable tool representations for manipulation tasks. In particular, the application of end-to-end deep neural networks has gained popularity for acquiring such representations [Reference Ainetter and Fraundorfer1–Reference Kalashnikov, Irpan, Pastor, Ibarz, Herzog, Jang, Quillen, Holly, Kalakrishnan, Vanhoucke and Levine3]. These methods generate latent tool-object representations through end-to-end neural networks and extensive datasets, eliminating the need to manually craft staged pipelines and define object features. However, these black-box methods often lack interpretability and compactness, and they may not fully account for subsequent manipulation processes, as they do not consider actions beyond the successful tool grasp.

Some works use keypoints to define the tool and the environment in order to formulate the manipulation task mathematically. Qin et al. [Reference Qin, Fang, Zhu, Fei-Fei and Savarese4] and Manuelli et al. [Reference Manuelli, Gao, Florence and Tedrake5] express the tool and the task in keypoints and devise optimization algorithms to generate manipulation actions. By acquiring tool keypoints through supervised or reinforcement learning and defining task keypoints in the environment, these approaches formulate Quadratic Programming problems to generate robot movement trajectories within the task context. Another recent study by Turpin et al. [Reference Turpin, Wang, Tsogkas, Dickinson and Garg6] also adopts keypoints to represent tool objects; however, it employs reinforcement learning to predict tool affordances using an intricately designed reward function.

Figure 1. Difference in hammering between humans and robots. When humans hammer a nail, they swing the hammer in the direction opposite to the strike in order to store energy, while robots focus only on achieving the task and ignore this important action.

Nevertheless, previous studies focus on achieving task completion in complex manipulation scenarios and often overlook whether the manipulation trajectory of the robot is similar to that of a human being. For instance, when faced with intricate tasks such as hammering a nail, shown in Figure 1, these methods typically produce a straightforward policy: after grasping a hammer, they position it directly above the nail and strike down until contact occurs. However, human hammering involves a distinct action: the buildup of kinetic energy by swinging the hammer in the opposite direction before striking above the nail's position. This natural human-style approach stores energy in the hammer during the swing and releases it upon impact with the nail. Such nuanced behavior is challenging to learn using conventional optimization methods because encoding the swing action as constraints in an objective function proves difficult. The works in refs. [Reference Edmonds, Gao, Liu, Xie, Qi, Rothrock, Zhu, Wu, Lu and Zhu7–Reference Zhang, Jiao, Wang, Zhu, Zhu and Liu9] use various sensors to capture the effects that humans exert on the tool and on the target objects in order to generalize tool use from humans to robots. These methods offer a preliminary way of learning human movement through its physical effects; however, their inputs depend on tactile data, which are expensive to acquire in everyday settings and limit the spread of the method.

Recent advances have seen a surge in deep reinforcement learning algorithms for manipulation tasks [Reference Johns10–Reference Zorina, Carpentier, Sivic and Petrík12], owing in part to the ease of data collection, highlighting their potential for teaching robots tool-based tasks. Imitation learning has also progressed, with supervised training using human-teleoperated demonstrations [Reference Zhang, McCarthy, Jow, Lee, Chen, Goldberg and Abbeel13] or hand-manipulated trajectories [Reference Johns10]. The research in ref. [Reference Zorina, Carpentier, Sivic and Petrík12] introduces a robot tool manipulation strategy based on human manipulation videos. They create a simulation environment aligned with the guidance video and compute robot states with guided policy samples and trajectory optimization. Although effective, this approach involves solving optimization problems for each aligned environment, which consumes considerable time and computing resources.

In order to teach a robot human manipulation skills in a more natural and intuitive way, we combine adversarial motion priors (AMP) with a reinforcement learning problem. AMP [Reference Peng, Ma, Abbeel, Levine and Kanazawa14] is a cutting-edge approach that first appeared in computer graphics; it uses an adversarial network to learn a "style" from a reference motion dataset. The reward function consists of a style reward that encourages the agent to replicate trajectories similar to those in the dataset and a task reward that assesses whether the agent achieves the task while mimicking the motion style. We adapt these style and task rewards to teach a robot arm human-like tool manipulation skills using demonstration video clips. Our approach involves competitive training between a policy network and an adversarial network. The policy network is trained using both task-specific and adversarial rewards to generate a policy that accomplishes the task in a human style. The adversarial network acts as a discriminator, determining the origin of state transitions and providing a reward that effectively motivates the training of the agent. Because AMP guides the robot to learn human-like manipulation skills, we name the method HMAMP. The contributions of our work are:

  • We introduce an implementation of task-oriented reinforcement learning combined with style in the manipulation domain and evaluate its performance on the hammering task.

  • We provide an approach, whose training data are easy to acquire, for learning a robot tool-manipulation policy in a human style.

  • We construct a tool-manipulation environment in simulation and verify that HMAMP is also useful in the real world.

2. Related work

2.1. Manipulation of tools

Tool use has been a fundamental issue in cognitive science studies that seek to comprehend the nature of intelligence [Reference Sanz, Call and Boesch15–Reference Van Lawick-Goodall17]. To enable robots to perform complex tasks that require the use of tools, many advanced studies have focused on recognizing affordance-specific features on tool objects [Reference Chen, Liang, Chen, Sun and Zhang18,Reference Xu, Chu, Tang, Liu and Vela19], which describe the potential physical interaction between object and manipulator and associate the relevant regions with planned sequential actions. These approaches have been successful in equipping robots with the capability to understand how objects may serve different purposes. In addition to recognizing the features of tool objects, learning and planning are essential components in the manipulation of tools, as demonstrated in various studies [Reference Qin, Fang, Zhu, Fei-Fei and Savarese4,Reference Turpin, Wang, Tsogkas, Dickinson and Garg6,Reference Murali, Liu, Marino, Chernova and Gupta20]. Researchers have also explored methods of incorporating real-time feedback and environmental factors to improve the accuracy and precision of tool grasping [Reference Al-Shanoon and Lang21,Reference Ribeiro, de Queiroz Mendes and Grassi22]. Some studies aim to identify suitable tools for a given task by learning, with a DNN model, an embedding that relates the grasped tool, the desired action, and the target goal [Reference Saito, Ogata, Funabashi, Mori and Sugano23,Reference Sun and Gao24].

2.2. Learning from human videos

Plenty of recent research has explored utilizing human videos to improve the efficiency of RL in robots. Some works use data from an egocentric view to enable robots to learn human skills [Reference Nair, Rajeswaran, Kumar, Finn and Gupta25,Reference Xiong, Fu, Zhang, Bao, Zhang, Huang, Xu, Garg and Lu26]. However, because of the variety of data sources and slight differences in viewpoint, it is difficult to use the pre-trained representation in a specific manipulation task. To bridge the domain gap, another line of work utilizes in-domain human demonstrations, where the sequence of human poses is recorded by motion capture [Reference Taheri, Ghorbani, Black and Tzionas27] or from a third-person view [Reference Xiong, Li, Chen, Bharadhwaj, Sinha and Garg28]. Data of this kind have a narrower disparity between the human and robot domains, which makes it possible to construct efficient reward functions for training imitation learning algorithms. Instead of extracting and re-targeting the whole human pose from the video, we focus on the motion of a few important joints and the motion of the tool, which allows a flexible transfer from human morphology to robot morphology. By combining the task-oriented approach with an AMP, we enable our system to learn robot arm movements from unstructured motion data.

2.3. Generative adversarial imitation learning

Generative adversarial imitation learning (GAIL) [Reference Ho and Ermon29] is inspired by the idea developed for generative adversarial networks (GAN) [Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville and Bengio30]. It aims to train generators that learn policies matching the trajectory distribution of the dataset, while the discriminators serve as reward functions that judge whether the generated behaviors look like the demonstrations. Using a small number of demonstrations from experts, GAIL learns both the policy and the reward function of the unknown environment. Although these methods have demonstrated success in low-dimensional domains [Reference Ho and Ermon29], their performance in high-dimensional tasks is less impressive. Recently, Peng et al. [Reference Peng, Ma, Abbeel, Levine and Kanazawa14] introduced AMP, which integrates task goals with generative adversarial imitation learning. This allows simulated agents to perform high-level tasks by learning to imitate behaviors from extensive motion datasets. Escontrela et al. [Reference Escontrela, Peng, Yu, Zhang, Iscen, Goldberg and Abbeel31] also apply this adversarial technique with a limited number of reference motion clips to learn locomotion skills for legged robots. In the manipulation domain, we apply this technique to guide robots to interpret and perform the behavior shown in demonstrations, and we show that the agents have more flexibility to perform more natural and feasible behaviors.

3. Learning tool manipulation policy

We model the problem of learning human-like tool-manipulation skills as a Markov Decision Process. The goal of reinforcement learning is to find the parameters $\theta$ that optimize the policy $\pi _{\theta }$ such that the expected discounted return $J(\theta )=\mathbb{E}_{\pi _{\theta }}\left [\sum _{t=0}^{T-1} \gamma ^{t} r_{t}\right ]$ is maximized. We then propose a mathematical abstraction of the tool manipulation task and introduce the design of the reward function.
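
As a minimal illustration of this objective, the discounted return of a single rollout can be estimated as in the following sketch (the reward terms $r_t$ themselves are defined in Section 3.2):

```python
from typing import Sequence

def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Compute sum_{t=0}^{T-1} gamma^t * r_t for one episode."""
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret

# J(theta) is approximated by averaging this quantity over many episodes
# sampled from the current policy pi_theta.
```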

3.1. Tool keypoint definition

For each manipulation task with a tool, we assume that the tool object interacts with an environment containing one or more target objects. Inspired by ref. [Reference Qin, Fang, Zhu, Fei-Fei and Savarese4], in the HMAMP framework, we manually define a set of keypoints for each task, consisting of a set of keypoints on the tool $\boldsymbol{K}_{\boldsymbol{o}}$ and a set of keypoints in the environment $\boldsymbol{K}_{\boldsymbol{e}}$ . Specifically, we consider $\boldsymbol{K}_{\boldsymbol{o}} =\left [ \boldsymbol{x_g}, \boldsymbol{x_f}, \boldsymbol{x_m} \right ]$ , where $\boldsymbol{x}_{\boldsymbol{g}}$ characterizes the grasping position on the tool, $\boldsymbol{x}_{\boldsymbol{f}}$ characterizes the functional part that makes contact with the target object, and $\boldsymbol{x}_{\boldsymbol{m}}$ represents an auxiliary point that helps determine the orientation of the tool. Environment keypoints are represented as $\boldsymbol{K}_{\boldsymbol{e}} =\left [ \boldsymbol{x}_{\boldsymbol{c}}\right ]$ , where $\boldsymbol{x}_{\boldsymbol{c}}$ denotes the position where the target interacts with the tool. The keypoints of the tool and environment are mainly used to determine the goal reward to be optimized in the reinforcement learning problem.
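
A minimal sketch of how these keypoints could be represented in code; the class names and the hammering example values are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ToolKeypoints:          # K_o = [x_g, x_f, x_m]
    x_g: np.ndarray           # grasping position on the tool
    x_f: np.ndarray           # functional point that contacts the target
    x_m: np.ndarray           # auxiliary point fixing the tool's orientation

@dataclass
class EnvKeypoints:           # K_e = [x_c]
    x_c: np.ndarray           # position where the target interacts with the tool

# Hammering example (positions in meters, purely illustrative):
hammer = ToolKeypoints(x_g=np.array([0.0, 0.0, 0.0]),
                       x_f=np.array([0.0, 0.0, 0.25]),   # hammer head
                       x_m=np.array([0.05, 0.0, 0.25]))  # side of the head
nail = EnvKeypoints(x_c=np.array([0.4, 0.1, 0.02]))      # nail head on the table
```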

3.2. Rewards design

As mentioned in ref. [Reference Peng, Ma, Abbeel, Levine and Kanazawa14], the total reward $r_t$ combines a goal reward $r^g_t$ , which is task-specific and describes the completion of the task, and a style reward $r^s_t$ , which evaluates whether the behaviors produced by the agent are similar to behaviors drawn from the distribution of the reference motion set. The more similar the behaviors are, the higher the style reward. The proportion between the style reward and the goal reward is adjusted manually before training:

(1) \begin{equation} r_t = \alpha ^g r^g_t + \beta ^s r^s_t \end{equation}

3.2.1. Goal reward about task

The goal reward is task-specific and must be designed accordingly. For tool manipulation tasks that involve contact with the environment, we design the goal reward $r^g_t$ as the sum of two terms at each instant $t$ :

(2) \begin{equation} \begin{aligned} r^g_t &= \omega ^f r^f_t + \omega ^d r^d_t \\ r^f_t &= \begin{cases} \lVert F^{s}_t(\boldsymbol{x}_f,\boldsymbol{x}_c)\rVert /F^d & \lVert F^{s}_t\rVert \leq F^d \\ 1 & \lVert F^{s}_t\rVert \gt F^d \end{cases} \\ r^d_t &= 1- \tanh \left ( \lVert \boldsymbol{x}_f - \boldsymbol{x}_c\rVert _t\right ) \end{aligned} \end{equation}

The first term $r_t^f$ encourages task completion; it is defined via the contact force exerted by the tool on the target object, captured by a force sensor in the simulation. The detected force is encouraged to converge towards the desired force $F^d$ . This reward takes nonzero values only when contact occurs, and once a contact force is detected, the episode terminates. The second term $r_t^d$ is defined between the tool function point $\boldsymbol{x}_f$ and the environment target point $\boldsymbol{x}_c$ to guide the policy towards minimizing the distance between tool and target. The $\tanh$ function bounds the reward to $[0, 1]$ . These terms are weighted with manually specified coefficients $\omega ^f$ and $\omega ^d$ ; note that the magnitude of the second term is much lower than that of the first.
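
A sketch of the goal reward computation under these definitions; the function signature and sensor interface are assumptions, and the default weights follow the values given later in Section 4.2:

```python
import numpy as np

def goal_reward(contact_force: np.ndarray, x_f: np.ndarray, x_c: np.ndarray,
                F_d: float = 100.0, w_f: float = 1e5, w_d: float = 1.0) -> float:
    """Goal reward r^g_t = w_f * r^f_t + w_d * r^d_t from Eq. (2)."""
    # Force term: ratio of sensed contact force to the desired force, capped at 1,
    # and nonzero only when contact actually occurs.
    f_norm = float(np.linalg.norm(contact_force))
    r_f = min(f_norm / F_d, 1.0) if f_norm > 0.0 else 0.0
    # Distance term: drive the tool function point towards the target point.
    r_d = 1.0 - np.tanh(np.linalg.norm(x_f - x_c))
    return w_f * r_f + w_d * r_d
```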

3.2.2. Style reward with motion prior

As proposed in ref. [Reference Peng, Ma, Abbeel, Levine and Kanazawa14], we define a discriminator $D_{\phi }$ as a neural network with parameters $\phi$ . The discriminator is trained to distinguish whether a transition $(s, s')$ is a fake one produced by the agent or a true one sampled from the real motion distribution $d^{\mathcal{M}}$ .

The objective of the discriminator is formulated as:

(3) \begin{equation} \begin{aligned} \underset {\phi }{\operatorname {argmin}}\quad &\mathbb{E}_{d^{\mathcal{M}}(s,s')}\left [\left (D_{\phi }(s,s')-1\right )^2\right ] \\+&\mathbb{E}_{d^\pi (s,s')}\left [\left (D_{\phi }\left (s,s'\right )+1\right )^2\right ] \\ +&\dfrac {w^{gp}}{2} \mathbb{E}_{d^{\mathcal{M}}(s,s')} \left [\lVert \nabla _{(s,s')}D_{\phi }\left (s,s'\right )\rVert ^2\right ] \end{aligned} \end{equation}

The first two terms encourage the discriminator to differentiate between state transitions produced by the policy and those drawn from the reference motion data. They were proposed in LSGAN [Reference Mao, Li, Xie, Lau and Wang32] to address the vanishing-gradient problem caused by the standard GAN objective, which usually uses a sigmoid cross-entropy loss. LSGAN instead optimizes the $\chi ^2$ divergence between the reference distribution and the policy distribution, which may alleviate the mode collapse problem and lead to more stable training [Reference Mao, Li, Xie, Lau, Zhen and Smolley33]. The last term in (3) is a gradient penalty that penalizes nonzero gradients on real samples, which helps avoid oscillations and improves training stability. The coefficient $w^{gp}$ in the formula is adjusted manually.
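
A PyTorch-style sketch of this objective, assuming `disc` maps a batch of concatenated transitions $(s, s')$ to scalar scores; this is an illustration under those assumptions, not the authors' exact implementation:

```python
import torch

def discriminator_loss(disc: torch.nn.Module,
                       real_trans: torch.Tensor,   # (s, s') sampled from reference motions
                       fake_trans: torch.Tensor,   # (s, s') produced by the current policy
                       w_gp: float = 1.0) -> torch.Tensor:
    """LSGAN loss with a gradient penalty on real samples, as in Eq. (3)."""
    # Least-squares targets: +1 for reference transitions, -1 for policy transitions.
    loss_real = ((disc(real_trans) - 1.0) ** 2).mean()
    loss_fake = ((disc(fake_trans) + 1.0) ** 2).mean()

    # Gradient penalty: penalize nonzero gradients of D w.r.t. the real inputs.
    real = real_trans.detach().requires_grad_(True)
    scores = disc(real)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=real,
                                create_graph=True)[0]
    grad_pen = (grads.norm(2, dim=-1) ** 2).mean()

    return loss_real + loss_fake + 0.5 * w_gp * grad_pen
```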

The style reward is then defined by:

(4) \begin{equation} r^s_t = \max \left [0, 1-\gamma ^d (D_{\phi }(s,s')-1)^2 \right ] \end{equation}

With the additional offset and scale, the style reward is bounded in $[0, 1]$ .
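
The style reward of Eq. (4) then follows directly from the discriminator score, as in this small sketch (with $\gamma^d = 0.25$ as listed in Section 4.2):

```python
import torch

def style_reward(disc_score: torch.Tensor, gamma_d: float = 0.25) -> torch.Tensor:
    """r^s_t = max(0, 1 - gamma_d * (D(s, s') - 1)^2), which lies in [0, 1]."""
    return torch.clamp(1.0 - gamma_d * (disc_score - 1.0) ** 2, min=0.0)
```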

Figure 2. Framework of HMAMP. From human manipulation video clips, we extract the keypoints of the human arm and the manipulation tool. We then align keypoints between the robot arm in simulation and the real-world human motion clips. The AMP discriminator judges whether an action sequence is a real human expert motion or one generated by the policy network. The AMP reward and the task reward for the manipulation task are summed to form the total reward for RL training.

The training process of the policy and the discriminator is shown in Figure 2. The agent interacts with the environment and produces a state transition $(s,s')$ ; the observation from the environment is used to calculate the goal reward $r^g_t$ . The discriminator takes state transitions from the simulated environment and from the reference motion clips to calculate the style reward $r^s_t$ . Finally, the combined reward is used to optimize the policy and the discriminator competitively. The training details are given in Algorithm 1.
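
A simplified sketch of one such training iteration is given below (the full procedure is in Algorithm 1; the method names `sample_action`, `score`, `ppo_update`, and `update` are assumed interfaces, and PPO and buffer details are omitted):

```python
def hmamp_iteration(env, policy, disc, ref_dataset,
                    alpha_g=0.6, beta_s=0.4, gamma_d=0.25):
    """One iteration: roll out the policy, mix task and style rewards,
    then update the policy (PPO) and the discriminator (Eq. (3))."""
    rollout = []
    s, done = env.reset(), False
    while not done:
        a = policy.sample_action(s)
        s_next, r_goal, done, _ = env.step(a)               # env supplies r^g_t
        d = disc.score(s, s_next)                           # discriminator output D(s, s')
        r_style = max(0.0, 1.0 - gamma_d * (d - 1.0) ** 2)  # Eq. (4)
        rollout.append((s, a, alpha_g * r_goal + beta_s * r_style, s_next))  # Eq. (1)
        s = s_next
    policy.ppo_update(rollout)                              # maximize the combined reward
    disc.update(real=ref_dataset.sample_transitions(),      # LSGAN loss with grad. penalty
                fake=[(s0, s1) for (s0, _, _, s1) in rollout])
```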

Algorithm 1. HMAMP: Learning Human-like Manipulation Skills by Adversarial Motion Prior

4. Training

4.1. Data preprocess

The raw data used for HMAMP consist of video clips capturing manipulation skills from a third-person perspective. This type of data offers several advantages: it minimizes the domain gap between simulation and reality, and it is relatively easy to acquire, making it a practical choice for training. In this work, we create a dataset that records the human skill of hammering. The duration of the human motion in each video clip is less than one second, and we collected five video clips of the hammering movement performed by two persons. We then use the popular keypoint detection algorithm BlazePose [Reference Bazarevsky, Grishchenko, Raveendran, Zhu, Zhang and Grundmann34] to detect human joints including the hip, elbow, wrist, and hand, and we use a CV algorithm to extract time series of tool keypoints by manually marking them on the tool.
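
As an illustration, the human joint trajectories could be extracted from each clip with MediaPipe's BlazePose implementation roughly as follows (a sketch assuming MediaPipe's Python API; the hand is approximated here by the index-finger landmark, since BlazePose does not output a single "hand" point, and tool keypoints are tracked separately):

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
JOINTS = {  # joints used for retargeting; RIGHT_INDEX stands in for the hand
    "hip": mp_pose.PoseLandmark.RIGHT_HIP,
    "elbow": mp_pose.PoseLandmark.RIGHT_ELBOW,
    "wrist": mp_pose.PoseLandmark.RIGHT_WRIST,
    "hand": mp_pose.PoseLandmark.RIGHT_INDEX,
}

def extract_joint_trajectories(video_path: str):
    """Return a list of per-frame {joint_name: (x, y, z)} dictionaries."""
    trajectories = []
    cap = cv2.VideoCapture(video_path)
    with mp_pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks is None:
                continue  # skip frames where no person is detected
            lm = result.pose_landmarks.landmark
            trajectories.append({name: (lm[idx].x, lm[idx].y, lm[idx].z)
                                 for name, idx in JOINTS.items()})
    cap.release()
    return trajectories
```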

To retarget human motion to robot motion, it is necessary to construct an effective transfer function that maps the human world space to the robot world space. Numerous studies [Reference Geng, Lee and Hülse35–Reference Suárez, Rosell and García37] have explored motion retargeting between these two domains, and any established retargeting method can be applied in this process. In our work, we adopt a simple and straightforward approach: direct mapping. While there are significant differences in topology between the human arm and the robot arm, they share corresponding joints, such as the human elbow, wrist, and hand, which align with certain robot joints and the end-effector. By mapping key human joints to their robotic counterparts, we achieve a rough but effective transfer of motion from the human domain to the robot domain.
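
A minimal sketch of the direct-mapping idea; the correspondence table and the world-space transform are illustrative assumptions, not the exact mapping used in the paper:

```python
import numpy as np

# Hypothetical correspondence between selected human joints and robot frames.
HUMAN_TO_ROBOT = {
    "elbow": "forearm_link",
    "wrist": "wrist_link",
    "hand": "end_effector",
}

def retarget_frame(human_joints: dict, T_world: np.ndarray) -> dict:
    """Map human joint positions (human/camera frame) to robot-frame targets.

    T_world is a 4x4 homogeneous transform from the human world space to the
    robot world space (e.g. estimated from the table plane plus a scale factor).
    """
    targets = {}
    for joint, robot_frame in HUMAN_TO_ROBOT.items():
        p = np.append(np.asarray(human_joints[joint]), 1.0)  # homogeneous coords
        targets[robot_frame] = (T_world @ p)[:3]
    return targets
```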

4.2. Model details

The policy used in HMAMP is PPO; the hidden layers of the policy network have sizes [512, 256, 128] with exponential linear unit (ELU) activations. The policy outputs a distribution, represented by a mean and standard deviation, from which the target joint positions are sampled. The target joint positions are then fed to our customized PD controllers to compute the motor torques. The policy is trained on an observation $o_t$ derived from the state, which contains environment information such as the hammer and nail positions and robot information such as joint angles, joint velocities, end-effector orientation, and previous actions. The discriminator is an MLP with hidden layers of size [1024, 512] and ELU activations. It takes the orientations of the robot joints and of the tool as input. The values of all manually determined parameters are: $\alpha ^g = 0.6,{} \beta ^s = 0.4, \gamma ^d = 0.25, \omega ^f = 10^5, \omega ^d = 1, w^{gp} = 1, F^d = 100$ .
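
For instance, the PD controller stage that turns sampled joint targets into torques could look like the following sketch (gain values are placeholders; the actual gains are tuned per joint):

```python
import numpy as np

def pd_torques(q_target: np.ndarray, q: np.ndarray, qd: np.ndarray,
               kp: np.ndarray, kd: np.ndarray) -> np.ndarray:
    """Convert target joint positions sampled from the policy into motor torques."""
    return kp * (q_target - q) - kd * qd

# Example for a 7-DoF arm (placeholder gains):
kp = np.full(7, 40.0)
kd = np.full(7, 2.0)
tau = pd_torques(q_target=np.zeros(7), q=np.zeros(7), qd=np.zeros(7), kp=kp, kd=kd)
```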

4.3. Simulation

The robot we use to train the policy is a Kinova Gen3 with a 2F-85 gripper; the corresponding joint mapping between the Gen3 and a human arm is shown in Figure 3.

Figure 3. Direct mapping between human and robot arm. Some joints and the gripper of a Kinova Gen3 are mapped to act as human hip, elbow, wrist, and hand.

We selected Isaac Gym [38] as our simulation platform due to its ability to accelerate the RL training process using GPU resources. The policy was trained in parallel across 2048 agents, utilizing a single NVIDIA RTX 3090 GPU. The entire training process required 11 h of wall-clock time and spanned 60,000 training epochs. Each RL episode lasted a maximum of 152 steps, corresponding to 3 s of simulated time, and terminated early if the termination criteria were met. The policy operated at a control frequency of 50 Hz during the simulation.

4.4. Termination

An episode terminates and the next one starts when the robot satisfies the termination criteria. These include a task-completion signal when the hammer knocks the nail, and a collision signal when a force is detected between robot components, between the robot and the table, or between the hammer and the table.

4.5. Domain randomization and training process

In order to improve the robustness of the HMAMP policy and facilitate the transfer of the learned policy from simulation to the real world, we apply domain randomization during training. In detail, we randomize the coefficient of friction applied to the hammer and nail within $[0.5, 1.25]$ and scale the joint-level PD gains by a factor in $[0.9, 1.1]$ . In addition, the same observation noise as in ref. [Reference Zorina, Carpentier, Sivic and Petrík12] is added during the training phase, where the Cartesian position observation noise is $\pm 0.01\,{\rm m}$ and the joint position observation noise is $\pm 0.02\,{\rm rad}$ .
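
A sketch of how these randomizations could be drawn, with the ranges above; how the sampled values are applied to the simulator is omitted and depends on the simulator API:

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize_episode() -> dict:
    """Physical parameters drawn once per episode."""
    return {
        "friction": rng.uniform(0.5, 1.25),      # hammer/nail friction coefficient
        "pd_gain_scale": rng.uniform(0.9, 1.1),  # multiplier on joint-level PD gains
    }

def noisy_observation(cart_pos: np.ndarray, joint_pos: np.ndarray):
    """Observation noise added at every step during training."""
    cart = cart_pos + rng.uniform(-0.01, 0.01, size=cart_pos.shape)       # +-0.01 m
    joints = joint_pos + rng.uniform(-0.02, 0.02, size=joint_pos.shape)   # +-0.02 rad
    return cart, joints
```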

The training curves in Figure 4 show the confrontation and balance between the style reward and the goal reward. In the early stage, the goal reward has a strong guiding effect, while in the late stage the AMP discriminator converges quickly, giving the robot's movements a human style.

Figure 4. Training curves of HMAMP. The left figure shows the evolution of the reach reward and the knock force reward, while the right figure shows the discriminator loss and gradient during training. The two figures show the confrontation and balance between the style reward and the goal reward. In the early stage, the goal reward has a strong guiding effect, while in the late stage the AMP discriminator converges quickly, giving the trajectory of the robot a human style.

5. Experiment

In this section, we present the experimental setup and results to demonstrate the effectiveness of HMAMP in learning task-oriented, human-like manipulation skills. To evaluate the performance of our method, we conducted a series of comparative experiments against two baseline approaches: a direct path-planning control policy and a reinforcement learning (RL) approach without AMP. The experimental results were recorded and analyzed quantitatively to highlight the contributions of HMAMP. The evaluation focused on three key aspects: the quality of the learned manipulation skills, task completion efficiency, and the similarity of the robot’s movements to human behavior.

5.1. Comparative experiment in simulation

5.1.1. Task definition

The chosen manipulation task for the experiments involves knocking a nail with a hammer. The experiments are conducted on the Isaac Gym simulation platform, using the model of a 7-DoF Kinova Gen3 arm with a 2F-85 gripper to complete the task. At the start of the task, the hammer is securely grasped by the robotic arm, which begins in its initial home position, with the gripper oriented perpendicular to the platform (see Figure 6). The nail is placed arbitrarily on the manipulation platform. The objective of the task is to successfully hammer the nail while replicating a human style of movement.

5.1.2. Baseline methods

We compare HMAMP against the following baseline methods:

  • Direct Path-Planning Control Policy (DPPCP): This baseline approach determines manipulation actions using predefined path-planning strategies. In this method, proportional-derivative (PD) control is employed to generate a planned trajectory guiding the hammer from its initial position to the nail.

  • Reinforcement Learning without AMP (RL-noAMP): This baseline approach uses a standard reinforcement learning method for the agent to acquire manipulation skills. The configuration of this approach is identical to that of HMAMP, except that the AMP component is removed. The training process follows the same procedure as in HMAMP.

5.1.3. Evaluation metrics

To quantitatively evaluate the performance of each approach, we define the following criteria (a computational sketch of these metrics follows the list):

  • Knock Impulse: A measure of the knock effect received by the nail, calculated as $I = \int F_{nail}(t) \, dt$ . A large impulse means the nail receives a large accumulated force during the strike.

  • Energy Efficiency: The energy used by the arm is $E = \sum _{i=1}^{n} \int \tau _i(t) \cdot \omega _i(t) \, dt$ , where $n$ is the number of arm joints. Energy efficiency is the ratio of the knock impulse received by the nail to the energy expended by the arm: $\eta = I/E$

  • Vertical Force Ratio: The vertical force ratio reflects how efficiently the force is applied in the vertical direction, which is critical for tasks involving hammering. Higher ratios indicate more effective and efficient nail hammering: $\text{Vertical Force Ratio} = \frac {F_{\text{vertical, nail}}}{F_{\text{total, nail}}}$

  • Frechet Distance: A quantitative measure of motion similarity between human and robot arm manipulation [Reference Aronov, Har-Peled, Knauer, Wang and Wenk39]. The smaller the Frechet distance between two trajectories is, the more similar their shapes are.
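
The sketch below shows how these metrics could be computed from logged simulation time series (array shapes and time step `dt` are assumptions; the discrete Fréchet distance is implemented directly rather than with any particular library):

```python
import numpy as np

def knock_impulse(force: np.ndarray, dt: float) -> float:
    """I = integral of F_nail(t) dt, approximated by a Riemann sum."""
    return float(np.sum(force) * dt)

def energy(torques: np.ndarray, joint_vels: np.ndarray, dt: float) -> float:
    """E = sum_i integral of tau_i(t) * omega_i(t) dt; arrays have shape (T, n)."""
    return float(np.sum(torques * joint_vels) * dt)
    # Energy efficiency is then eta = knock_impulse(...) / energy(...)

def vertical_force_ratio(f_vertical: np.ndarray, f_total: np.ndarray) -> float:
    """Share of the nail force applied along the vertical axis."""
    return float(np.sum(f_vertical) / np.sum(f_total))

def discrete_frechet(P: np.ndarray, Q: np.ndarray) -> float:
    """Discrete Frechet distance between two trajectories of shape (T, 3)."""
    n, m = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    ca = np.zeros((n, m))
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d[i, j])
    return float(ca[-1, -1])
```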

5.1.4. Results in simulation

We implemented the two baseline control strategies, DPPCP and RL-noAMP, as described in Section 5.1.2, within the simulation environment. Comparative experiments were conducted to evaluate their performance against our proposed method. Each method was tested across 10 trials, and the average values for each evaluation criterion were calculated to ensure stable and reliable performance measurements.

Table I. Comparison results of HMAMP with baselines.

Table I presents a comprehensive comparison of the approaches, highlighting key performance metrics. HMAMP consistently outperforms both DPPCP and RL-noAMP in all evaluated aspects. Notably, it has an overwhelming advantage in the impulse received by the nail, which is the most critical measure for hammering nails. The table also shows that HMAMP hammers more efficiently, since both its energy efficiency and its vertical force ratio exceed those of the other methods. In addition, HMAMP yields the smallest Frechet distance to the human manipulation trajectories, which reflects the effectiveness of our method in learning human motion styles. This can be seen more intuitively in Figure 5.

The experimental results clearly demonstrate the effectiveness of the proposed HMAMP framework in learning task-oriented, human-like manipulation skills through the use of AMP. The integration of AMP significantly improves task completion efficiency, knock impulse, and energy efficiency, resulting in superior performance compared to both the direct path-planning approach and traditional reinforcement learning (RL) methods.

Figure 5. Movement trajectory of the end of the robotic arm in Cartesian space. The end-of-arm motion trajectory obtained by HMAMP is the most similar to the human expert trajectory.

5.2. Real robot arm experiment

Figure 6. Experiment in simulation and the real world. The first row shows the human knocking motion clips that we used as motion priors. The second row shows the HMAMP policy in simulation; the robot successfully completes the task with the desired manipulation trajectory. The third row shows HMAMP implemented in the real world on a Kinova Gen3, and the fourth row shows details of hammering a nail in the real world.

We employed the same arm and the same task as in simulation for the real-world experiments. The environment setup closely mirrored the simulated scenario to ensure the applicability of our approach in practical settings.

Our proposed approach, which incorporates AMP into reinforcement learning (RL), was integrated into the control system of the robotic arm. The parameters and settings were optimized based on the training results obtained in the simulation environment. The robotic arm was tasked with performing the manipulation task using the learned skills.

Figure 6 showcases the manipulation effect of HMAMP on the real robot arm. The sequence of images illustrates the robotic arm successfully completing the manipulation task with precision and human-like motion. The trajectory followed by the arm demonstrates smoothness, accuracy, energy efficiency, and human manipulation skills, confirming the benefits of incorporating AMPs.

6. Conclusion

In this paper, we presented a novel approach named HMAMP to enable robotic arms to perform tool manipulation with human-like skills. Our method integrates AMP with deep reinforcement learning to capture complex manipulation dynamics. By leveraging both real-world motion data and synthetic motion data generated through simulation, we demonstrated the ability of our approach to surpass existing techniques in learning human-style manipulation behaviors. The evaluation of the challenging hammering task highlighted the effectiveness of our method and its potential for real-world applications. This research bridges the gap between robotic and human capabilities, paving the way for more intuitive and natural human-robot interactions. The proposed framework serves as a foundation for future research aimed at developing robots with advanced manipulation skills, envisioning a future where machines seamlessly mimic human manipulation.

Supplementary material

The supplementary material for this article can be found at http://doi.org/10.1017/S0263574725001444.

Author contribution

Ziqi Ma, Changda Tian, and Yue Gao designed the study. Ziqi Ma and Changda Tian wrote the code. Ziqi Ma conducted the experiments and data gathering. Ziqi Ma and Changda Tian performed statistical analyses. Ziqi Ma wrote the article.

Financial support

This work was supported by the National Natural Science Foundation of China (Grant No. 92248303 and No. 62373242) and the Shanghai Municipal Science and Technology Major Project (Grant No. 2021SHZDZX0102).

Competing interests

The authors declare that no conflicts of interest exist.

Ethical approval

Not applicable.

References

Ainetter, S. and Fraundorfer, F., "End-to-End Trainable Deep Neural Network for Robotic Grasp Detection and Semantic Segmentation from RGB," 2021 IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2021) pp. 13452–13458.
Fang, K., Zhu, Y., Garg, A., Kurenkov, A., Mehta, V., Fei-Fei, L. and Savarese, S., "Learning task-oriented grasping for tool manipulation from simulated self-supervision," Int. J. Robot. Res. 39(2-3), 202–216 (2020).
Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V. and Levine, S., "Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation," arXiv preprint arXiv:1806.10293 (2018).
Qin, Z., Fang, K., Zhu, Y., Fei-Fei, L. and Savarese, S., "Keto: Learning Keypoint Representations for Tool Manipulation," 2020 IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2020) pp. 7278–7285.
Manuelli, L., Gao, W., Florence, P. and Tedrake, R., "kpam: Keypoint Affordances for Category-Level Robotic Manipulation," Robotics Research: The 19th International Symposium ISRR (Springer, 2022) pp. 132–157.
Turpin, D., Wang, L., Tsogkas, S., Dickinson, S. and Garg, A., "Gift: Generalizable interaction-aware functional tool affordances without labels," arXiv preprint arXiv:2106.14973 (2021).
Edmonds, M., Gao, F., Liu, H., Xie, X., Qi, S., Rothrock, B., Zhu, Y., Wu, Y. N., Lu, H. and Zhu, S.-C., "A tale of two explanations: Enhancing human trust by explaining robot behavior," Sci. Robot. 4(37), eaay4663 (2019).
Liu, H., Zhang, C., Zhu, Y., Jiang, C. and Zhu, S.-C., "Mirroring Without Overimitation: Learning Functionally Equivalent Manipulation Actions," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33 (2019) pp. 8025–8033.
Zhang, Z., Jiao, Z., Wang, W., Zhu, Y., Zhu, S.-C. and Liu, H., "Understanding physical effects for effective tool-use," IEEE Robot. Autom. Lett. 7(4), 9469–9476 (2022).
Johns, E., "Coarse-to-Fine Imitation Learning: Robot Manipulation from a Single Demonstration," 2021 IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2021) pp. 4613–4619.
Wang, C., Fan, L., Sun, J., Zhang, R., Fei-Fei, L., Xu, D., Zhu, Y. and Anandkumar, A., "Mimicplay: Long-horizon imitation learning by watching human play," (2023).
Zorina, K., Carpentier, J., Sivic, J. and Petrík, V., "Learning to manipulate tools by aligning simulation to video demonstration," (2021).
Zhang, T., McCarthy, Z., Jow, O., Lee, D., Chen, X., Goldberg, K. and Abbeel, P., "Deep Imitation Learning for Complex Manipulation Tasks from Virtual Reality Teleoperation," 2018 IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2018) pp. 5628–5635.
Peng, X. B., Ma, Z., Abbeel, P., Levine, S. and Kanazawa, A., "Amp: Adversarial motion priors for stylized physics-based character control," ACM Trans. Graphics (TOG) 40(4), 1–20 (2021).
Sanz, C., Call, J. and Boesch, C., Tool Use in Animals: Cognition and Ecology (Cambridge University Press, 2013).
St Amant, R. and Horton, T. E., "Revisiting the definition of animal tool use," Anim. Behav. 75(4), 1199–1208 (2008).
Van Lawick-Goodall, J., "Tool-Using in Primates and Other Vertebrates," In: Advances in the Study of Behavior, vol. 3 (Academic Press, 1971) pp. 195–249.
Chen, W., Liang, H., Chen, Z., Sun, F. and Zhang, J., "Learning 6-dof task-oriented grasp detection via implicit estimation and visual affordance," (2022).
Xu, R., Chu, F.-J., Tang, C., Liu, W. and Vela, P. A., "An affordance keypoint detection network for robot manipulation," IEEE Robot. Autom. Lett. 6(2), 2870–2877 (2021).
Murali, A., Liu, W., Marino, K., Chernova, S. and Gupta, A., "Same object, different grasps: Data and semantic knowledge for task-oriented grasping," CoRR abs/2011.06431 (2020).
Al-Shanoon, A. and Lang, H., "Robotic manipulation based on 3-d visual servoing and deep neural networks," Robot. Auton. Syst. 152(C), 104041 (2022).
Ribeiro, E. G., de Queiroz Mendes, R. and Grassi, V., "Real-time deep learning approach to visual servo control and grasp detection for autonomous robotic manipulation," Robot. Auton. Syst. 139(C), 103757 (2021).
Saito, N., Ogata, T., Funabashi, S., Mori, H. and Sugano, S., "How to select and use tools? Active perception of target objects using multimodal deep learning," CoRR abs/2106.02445 (2021).
Sun, M. and Gao, Y., "Gater: Learning grasp-action-target embeddings and relations for task-specific grasping," IEEE Robot. Autom. Lett. 7(1), 618–625 (2022).
Nair, S., Rajeswaran, A., Kumar, V., Finn, C. and Gupta, A., "R3m: A Universal Visual Representation for Robot Manipulation," 6th Annual Conference on Robot Learning (2022).
Xiong, H., Fu, H., Zhang, J., Bao, C., Zhang, Q., Huang, Y., Xu, W., Garg, A. and Lu, C., "Robotube: Learning Household Manipulation from Human Videos with Simulated Twin Environments," 6th Annual Conference on Robot Learning (2022).
Taheri, O., Ghorbani, N., Black, M. J. and Tzionas, D., "GRAB: A dataset of whole-body human grasping of objects," CoRR abs/2008.11200 (2020).
Xiong, H., Li, Q., Chen, Y., Bharadhwaj, H., Sinha, S. and Garg, A., "Learning by watching: Physical imitation of manipulation skills from human videos," CoRR abs/2101.07241 (2021).
Ho, J. and Ermon, S., "Generative Adversarial Imitation Learning," Advances in Neural Information Processing Systems 29 (2016).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y., "Generative adversarial networks," Commun. ACM 63(11), 139–144 (2020).
Escontrela, A., Peng, X. B., Yu, W., Zhang, T., Iscen, A., Goldberg, K. and Abbeel, P., "Adversarial Motion Priors make Good Substitutes for Complex Reward Functions," 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE, 2022) pp. 25–32.
Mao, X., Li, Q., Xie, H., Lau, R. Y. K. and Wang, Z., "Multi-class generative adversarial networks with the L2 loss function," CoRR abs/1611.04076 (2016).
Mao, X., Li, Q., Xie, H., Lau, R., Zhen, W. and Smolley, S., "On the effectiveness of least squares generative adversarial networks," IEEE Trans. Pattern Anal. Mach. Intell. 41(12), 2947–2960 (2018).
Bazarevsky, V., Grishchenko, I., Raveendran, K., Zhu, T., Zhang, F. and Grundmann, M., "Blazepose: On-device real-time body pose tracking," arXiv preprint arXiv:2006.10204 (2020).
Geng, T., Lee, M. and Hülse, M., "Transferring human grasping synergies to a robot," Mechatronics 21(1), 272–284 (2011).
Gioioso, G., Salvietti, G., Malvezzi, M. and Prattichizzo, D., "Mapping synergies from human to robotic hands with dissimilar kinematics: An approach in the object domain," IEEE Trans. Robot. 29(4), 825–837 (2013).
Suárez, R., Rosell, J. and García, N., "Using Synergies in Dual-Arm Manipulation Tasks," 2015 IEEE International Conference on Robotics and Automation (ICRA) (2015) pp. 5655–5661.
NVIDIA, "Isaac Gym – preview release: Physics simulation environment for reinforcement learning research," (2023). https://developer.nvidia.com/isaac-gym.
Aronov, B., Har-Peled, S., Knauer, C., Wang, Y. and Wenk, C., "Fréchet Distance for Curves, Revisited," Algorithms–ESA 2006: 14th Annual European Symposium, Zurich, Switzerland, September 11-13, 2006, Proceedings 14 (Springer, 2006) pp. 52–63.