Meta, CMU and ETH Zurich joint research: Dexterous grasping and object manipulation with simulated humanoid robots
The authors describe a method for controlling a simulated humanoid to grasp an object and move it along an object trajectory. Because controlling a humanoid with dexterous hands is challenging, previous methods often used a disembodied hand and only considered vertical lifts or short trajectories. This limited scope hinders their applicability to the object manipulation required for animation and simulation. To close this gap, the authors learn a controller that can pick up a large number (more than 1,200) of objects and carry them along randomly generated trajectories. The key insight is to leverage a humanoid motion representation that provides human-like motor skills and greatly speeds up training. Using only simple reward, state, and object representations, the method scales well to diverse objects and trajectories. Training does not require a paired dataset of full-body motion and object trajectories. At test time, only the object mesh and the desired trajectories for grasping and transporting are needed. To demonstrate the method's capabilities, the authors report state-of-the-art success rates in following object trajectories and generalizing to unseen objects. The authors will release the code and models.
1. Introduction
Despite rapid progress in artificial intelligence and robotics, enabling humanoid robots to flexibly grasp and manipulate a wide variety of objects remains a major challenge. This capability is essential for human-computer interaction in animation and virtual/augmented reality, and it is also expected to carry over to real-world humanoid robots in the future.
As shown in Figure 1, given an object mesh, the authors' goal is to control a simulated humanoid robot equipped with two dexterous hands to pick up the object and move it along a plausible trajectory. This capability can be widely applied to human-object interaction in animation and AR/VR, with the potential to extend to real humanoid robots. However, achieving precise object manipulation with a simulated humanoid that has two dexterous hands raises many difficulties:
1. Balance control: Bipedal humanoid robots need to maintain overall balance while performing delicate hand movements.
2. Flexible grip: The robot needs to be able to adapt to objects of different shapes to form a stable grip.
3. High degree of freedom control: Humanoid robots have a highly complex structure, which increases the difficulty of control.
4. Whole body coordination: Hand movements need to be coordinated with the movements of the whole body.
5. Trajectory following: Robots need to be able to move objects along a variety of complex trajectories, not just a simple vertical lift.
6. Generalization ability: The control system needs to be able to cope with a variety of unseen object shapes and movement trajectories.
Previous research has often been limited to standalone manipulators or to a single preset interaction sequence, which makes it difficult to achieve object manipulation as flexible and varied as that of humans. In this study, the authors propose a new method called Omnigrasp, which uses reinforcement learning to train a humanoid controller that coordinates the whole body. The controller can pick up a wide variety of objects and move them along a wide variety of trajectories.
This research has made three important contributions in the field of dexterous grasping and object manipulation of humanoid robots:
(1) Dexterous and universal humanoid motion representation: The authors designed a dexterous and universal humanoid motion representation that significantly improves sample efficiency and makes it possible to learn grasping with a simple and effective state and reward design.
(2) Grasping policy learning from synthetic data: The authors demonstrate that, with this motion representation, a grasping policy can be learned from synthetic grasp poses and trajectories, without any paired full-body and object motion data.
(3) High-performance humanoid controller: The authors demonstrate the feasibility of training a humanoid controller that achieves high success rates in grasping objects, follows complex trajectories, scales to diverse training objects, and generalizes to unseen objects.
2. Omnigrasp: Grasping diverse objects and following object trajectories
To solve the challenging problem of picking up objects and moving them along different trajectories, the authors first obtain a universal representation of dexterous humanoid motion (Section 2.1). Using this motion representation, they design a hierarchical RL framework (Section 2.2) that learns to grasp objects with a simple state and reward design guided by pre-grasps. The authors' pipeline is shown in Figure 2.
2.1 PULSE-X: Physics-based universal dexterous humanoid motion representation
PULSE-X adds articulated fingers to the team's previous work PULSE (ICLR 2024 Spotlight), extending it to a humanoid with two dexterous hands. The authors then use a variational information bottleneck (similar to a VAE) to distill the motion imitator into a motion representation.
Data augmentation: Whole-body motions from the AMASS dataset are randomly paired with hand motions from the GRAB and Re:InterHand datasets to create a full-body motion dataset that includes finger motion.
PHC-X: Humanoid motion imitation with articulated fingers: The PHC approach is extended to include finger joints, and the imitator is trained with reinforcement learning.
Learning the motion representation through online distillation: In PULSE, an encoder, a decoder, and a prior learn to compress motor skills into a latent representation. The encoder computes the latent code distribution from the current input state, the decoder produces joint actuation actions from the latent code, and the prior defines a Gaussian distribution conditioned on proprioception. This learned prior replaces the unit Gaussian used in VAEs and guides downstream task learning. The encoder and prior distributions are both modeled as diagonal Gaussians.
To train the model, the authors use online distillation similar to DAgger: the encoder-decoder is rolled out in simulation, and the imitator is queried to obtain action labels.
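As an illustration of the diagonal Gaussian encoder and prior described above, a VAE-style objective typically regularizes the encoder distribution toward the prior with a KL divergence, which has a closed form for diagonal Gaussians. The sketch below computes that term. The function and argument names are hypothetical, and whether PULSE-X weights or combines this term exactly this way is an assumption rather than something stated in the text.

```python
import math

def kl_diag_gaussians(mu_e, sigma_e, mu_p, sigma_p):
    """Closed-form KL( N(mu_e, diag(sigma_e^2)) || N(mu_p, diag(sigma_p^2)) ).

    mu_e, sigma_e: encoder mean and standard deviation per latent dimension.
    mu_p, sigma_p: learned prior mean and standard deviation per latent dimension.
    All names are illustrative, not taken from the authors' code.
    """
    total = 0.0
    for me, se, mp, sp in zip(mu_e, sigma_e, mu_p, sigma_p):
        # Per-dimension KL between two univariate Gaussians.
        total += math.log(sp / se) + (se ** 2 + (me - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total
```

With a unit Gaussian prior (mean 0, standard deviation 1) this reduces to the usual VAE regularizer; a learned, proprioception-conditioned prior simply supplies different values for mu_p and sigma_p at every step.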
2.2 Pre-grasp guided object manipulation
Using hierarchical reinforcement learning together with PULSE-X's trained decoder D_PULSE-X and prior P_PULSE-X, the action space of the object manipulation policy becomes the latent motion representation. Because this action space serves as a strong human-like motion prior, the authors can use simple state and reward designs and do not need any paired object and human motion data to learn the grasping policy. Only the hand pose just before grasping (the pre-grasp), obtained either from a generative method or from MoCap, is used to train the policy.
State. To provide the task policy π_Omnigrasp with information about the object and the desired object trajectory, the authors define a goal state.
It contains the difference between the current object state and the reference object poses along the reference trajectory for the next φ frames. All values are normalized with respect to the humanoid's heading. Note that the state does not contain full-body poses, grasp guidance, or phase variables, which allows the method to be applied directly to unseen objects and reference trajectories at test time.
Action. As in the downstream task policies of PULSE, the authors form the action space as a residual relative to the prior mean: the policy output is added to the prior mean, and the decoder maps the resulting latent code to the PD targets.
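The residual action described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the decoder is passed in as a plain callable, and `toy_decoder` is a hypothetical stand-in for the trained D_PULSE-X network so that the example runs on its own.

```python
def pd_targets_from_residual(policy_output, prior_mean, decoder, proprioception):
    """Add the policy's residual to the prior mean, then decode to PD targets.

    policy_output: residual in latent space produced by the task policy.
    prior_mean: mean of the proprioception-conditioned prior for this step.
    decoder: callable (latent, proprioception) -> PD targets (assumed interface).
    """
    latent = [r + m for r, m in zip(policy_output, prior_mean)]
    return decoder(latent, proprioception)

def toy_decoder(latent, proprioception):
    """Hypothetical stand-in for a trained decoder; it just scales the latent code."""
    return [2.0 * z for z in latent]
```

Because the policy only outputs an offset from the prior mean, a zero residual already decodes to a motion the prior considers likely, which is one way to read the claim that the latent space acts as a human-like motion prior.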
Reward. Although the policy does not take any grasp guidance or reference body trajectory as input, the authors use pre-grasp guidance in the reward. A pre-grasp is defined as a single-frame hand pose consisting of the hand translation and rotation. The reward is piecewise: an approach reward applies while the hand is still far from the object; a pre-grasp reward applies once the hand is close to the predefined pre-grasp pose; and a trajectory-following reward guides the object along the desired trajectory after it has been grasped.
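The three-stage structure of this reward can be sketched as a simple switch. Everything numeric here is an assumption made for illustration: the 0.2 m switching distance, the exponential shaping, the `sharpness` constant, and the use of hand-to-pre-grasp distance as the stage test are not given in the text, and the rotation part of the pre-grasp pose is omitted.

```python
import math

def staged_reward(hand_pos, pregrasp_pos, obj_pos, ref_obj_pos, grasped,
                  near_threshold=0.2, sharpness=10.0):
    """Return (stage_name, reward) for one time step.

    Positions are 3D tuples in meters. `grasped` says whether the object
    has already been picked up. Thresholds and shaping are hypothetical.
    """
    if grasped:
        # Stage 3: keep the object close to the reference trajectory.
        err = math.dist(obj_pos, ref_obj_pos)
        return "trajectory", math.exp(-sharpness * err)
    d = math.dist(hand_pos, pregrasp_pos)
    if d > near_threshold:
        # Stage 1: coarse reward for moving the hand toward the object.
        return "approach", 1.0 / (1.0 + d)
    # Stage 2: precise matching of the pre-grasp hand position.
    return "pregrasp", math.exp(-sharpness * d)
```

The point of the staging is that each phase only needs a simple distance-style term, while the motion prior in the action space is left to supply plausible whole-body behavior.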
3D trajectory generator. Because the number of real-world object trajectories is limited (whether collected from MoCap or authored by animators), the authors designed a 3D object trajectory generator that creates trajectories with varying velocities and directions, which improves generalization to unseen trajectories. It extends the 2D trajectory generator used in PACER to 3D. Given an initial object pose, it generates a plausible sequence of reference object motions. The z coordinate of the trajectory is limited to between 0.03 m and 1.8 m, while the x and y directions are left unrestricted.
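A minimal sketch of such a generator is shown below, assuming a random-acceleration walk sampled at 30 Hz. Only the z limits (0.03 m and 1.8 m) come from the text; the speed and acceleration bounds, the sampling scheme, and all names are assumptions, and object rotation is left out.

```python
import random

Z_MIN, Z_MAX = 0.03, 1.8  # height limits stated in the text (meters)

def generate_trajectory(start, num_steps, dt=1.0 / 30.0,
                        max_speed=1.0, max_accel=2.0, seed=None):
    """Generate a list of 3D reference positions starting at `start`.

    Velocity follows a bounded random walk; z is clamped to [Z_MIN, Z_MAX],
    while x and y are unrestricted. Bounds other than z are hypothetical.
    """
    rng = random.Random(seed)
    pos = [float(c) for c in start]
    vel = [0.0, 0.0, 0.0]
    trajectory = [tuple(pos)]
    for _ in range(num_steps):
        for axis in range(3):
            vel[axis] += rng.uniform(-max_accel, max_accel) * dt
            vel[axis] = max(-max_speed, min(max_speed, vel[axis]))
            pos[axis] += vel[axis] * dt
        if pos[2] < Z_MIN or pos[2] > Z_MAX:
            # Clamp the height and stop vertical motion at the limit.
            pos[2] = min(max(pos[2], Z_MIN), Z_MAX)
            vel[2] = 0.0
        trajectory.append(tuple(pos))
    return trajectory
```

For example, `generate_trajectory((0.0, 0.0, 0.5), 300, seed=0)` yields a 10-second reference path at 30 Hz that never leaves the allowed height range.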
Training phase. The training process is shown in Algorithm 1. One of the main sources of performance improvement in motion imitation is hard negative mining, where the policy is evaluated regularly to find failed sequences to train on. Therefore, instead of an object curriculum, the authors use a simple hard negative mining process to select difficult objects for training. Specifically, let s_j be the number of failures on object j across all previous runs. The probability of selecting object j among all objects is P(j) = s_j / Σ_i s_i.
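Sampling objects in proportion to their failure counts can be sketched as follows. The uniform fallback when no failures have been recorded yet is an assumption added so the example is well defined; the text does not say how that case is handled, and the function names are illustrative.

```python
import random

def selection_probabilities(failure_counts):
    """Turn per-object failure counts into sampling probabilities."""
    total = sum(failure_counts)
    if total == 0:
        # Assumed fallback: sample uniformly before any failure is recorded.
        return [1.0 / len(failure_counts)] * len(failure_counts)
    return [count / total for count in failure_counts]

def sample_object(failure_counts, rng=random):
    """Pick an object index, favoring objects that failed more often."""
    probs = selection_probabilities(failure_counts)
    return rng.choices(range(len(failure_counts)), weights=probs, k=1)[0]
```

An object that has never failed gets probability zero once other objects have failures, so training effort concentrates on the objects the policy still drops.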
Object and humanoid initial state randomization. Because objects can have diverse initial positions and orientations relative to the humanoid, the policy must be exposed to diverse initial object states. Given the object dataset and the provided initial states (from motion capture or from placing the object in simulation), the authors perturb each state by adding a randomly sampled yaw rotation and adjusting the position components. The pitch and roll of the object's initial pose are left unchanged, since some poses may not be valid in simulation. For the humanoid, if paired data is available (e.g., the GRAB dataset), the initial state from the dataset is used; otherwise, a standing T-pose is used.
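The object perturbation described above can be sketched with an Euler-angle pose. The text does not say which position components are adjusted or by how much, so the choice of x and y and the 0.1 m range are assumptions; the parts the example is meant to show are the random yaw and the untouched pitch and roll.

```python
import math
import random

def perturb_object_state(position, roll, pitch, yaw, max_offset=0.1, rng=random):
    """Randomize an object's initial state.

    Adds a random yaw rotation and a small horizontal offset (hypothetical
    range), and leaves roll, pitch, and height unchanged.
    """
    x, y, z = position
    new_position = (x + rng.uniform(-max_offset, max_offset),
                    y + rng.uniform(-max_offset, max_offset),
                    z)
    new_yaw = (yaw + rng.uniform(-math.pi, math.pi)) % (2.0 * math.pi)
    return new_position, roll, pitch, new_yaw
```

Keeping roll, pitch, and height fixed means an object that rests stably on its support surface in the dataset still rests stably after the perturbation.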
In the inference phase, only the object latent code, a random initial object pose, and the desired object trajectory are required; neither pre-grasps nor paired kinematic poses are needed.
3. Experiment
The GRAB, OakInk, and OMOMO datasets were used to study grasping of both small and large objects. The experiments were carried out in the Isaac Gym simulator, with the policy running at 30 Hz. A 6-layer MLP served as the main network architecture, and a GRU-based recurrent policy was introduced for the grasping task. Training took about 3 days on an Nvidia A100 GPU, collecting about [...] samples. The evaluation metrics include position error, rotation error, acceleration error, and velocity error, as well as grasp success rate and trajectory target achievement rate. The main experiments focus on grasping and trajectory following, with cross-dataset evaluation on GRAB and OakInk, and each experiment was repeated 10 times to ensure reliable results. The novelty of the study lies in exploring grasping with a full-body simulated humanoid.
The authors use diverse datasets for training and evaluation and report a comprehensive set of metrics. For the object trajectory following task, they report position error (mm), rotation error (radian), and physics-based metrics such as acceleration error (mm/frame²) and velocity error (mm/frame). Following previous studies of full-body humanoid grasping, they also report the grasp success rate and the trajectory target achievement rate (TTR). A grasp is counted as successful if the object is held continuously for at least 0.5 seconds in the physics simulation without being dropped. TTR measures the fraction of time steps in a trajectory at which the object reaches the target position (less than 12 cm from the target), and is computed only on successful trajectories. To measure success over the complete trajectory, the authors also report a trajectory success rate: trajectory following is considered to have failed if the object is more than 25 cm away from the reference trajectory at any point in time.
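The three success-style metrics defined above can be sketched directly from their definitions. The thresholds (0.5 s, 12 cm, 25 cm) and the 30 Hz rate come from the text; the function names, the per-frame boolean grasp signal, and the use of positions in meters are assumptions made for the example.

```python
import math

def grasp_success(held_flags, fps=30, min_seconds=0.5):
    """True if the object is held continuously for at least min_seconds."""
    needed = math.ceil(fps * min_seconds)
    run = 0
    for held in held_flags:
        run = run + 1 if held else 0
        if run >= needed:
            return True
    return False

def ttr(obj_positions, ref_positions, threshold=0.12):
    """Fraction of time steps where the object is within 12 cm of the target."""
    hits = sum(1 for p, r in zip(obj_positions, ref_positions)
               if math.dist(p, r) < threshold)
    return hits / len(ref_positions)

def trajectory_success(obj_positions, ref_positions, max_deviation=0.25):
    """False if the object is ever more than 25 cm from the reference."""
    return all(math.dist(p, r) <= max_deviation
               for p, r in zip(obj_positions, ref_positions))
```

Note that a trajectory can count as successful while still having a TTR below 1: a deviation between 12 cm and 25 cm misses the target at that time step without failing the whole trajectory.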
3.1 Grasping and trajectory following
This experiment was performed on the GRAB and OakInk datasets and compared against the method of Braun et al., AMP, and PHC. All experiments were repeated 10 times and averaged to smooth out slight differences in results caused by factors such as floating-point error in the simulator. The experiment mainly evaluated grasp success rate and trajectory following accuracy, and included cross-dataset tests.
On the GRAB dataset, Omnigrasp outperformed the best existing methods and baselines across all metrics, especially in success rate and trajectory following. Compared to the method of Braun et al., Omnigrasp achieves a high success rate in both object lifting and trajectory following. Using a motion imitator (PHC) directly achieves only a low success rate even when it is given ground-truth kinematic poses, indicating that the imitator's error (30 mm on average) is too large for precise object grasping. AMP yields a lower trajectory success rate, which shows the importance of using a motion prior in the action space. Omnigrasp is able to track MoCap trajectories accurately, with an average error of 28 mm.
On the OakInk dataset, the authors scaled the grasping policy to more than 1,000 objects and tested its ability to generalize to unseen objects. The results showed that 1,272 out of 1,330 objects were successfully picked up, and the full lifting process also had a high success rate. Similar results were observed on the test set. Failed objects are usually either too large or too small to allow a stable grasp. The policy trained on both GRAB and OakInk showed the highest success rate: because the GRAB dataset includes two-handed pre-grasps, the policy learned to use both hands, which significantly improved the success rate on some larger objects.
3.2 Ablation experiments and analysis
Ablation experiments show that using PULSE-X in the action space significantly increases the success rate and produces human-like motion. Pre-grasp guidance is essential for learning stable grasps. The dexterous AMASS dataset is important for trajectory following: without it, the policy can learn to pick up objects but has difficulty following trajectories. Object position randomization and hard negative mining are essential for learning a robust and successful policy.
The visualization results show that, depending on the shape of the object, the policy uses a variety of grasping strategies to hold the object during trajectory following. Depending on the trajectory and the initial pose of the object, Omnigrasp discovers different grasp poses for the same object, which demonstrates the advantage of using simulation and the laws of physics for grasp generation. For larger objects, the policy resorts to a two-handed, non-prehensile transport strategy, a behavior learned from the two-handed pre-grasps in GRAB.
1. Omnigrasp is superior to existing methods in grasping and trajectory following tasks.
2. It can generalize well to a large number of objects (> 1000).
3. The universal motion prior (PULSE-X) significantly improves performance.
4. The diversity of the training data and pre-grasp guidance are critical to success.
5. The policy adapts its grasping strategy to the characteristics of the object and the trajectory.
4. Summary
Omnigrasp demonstrates the feasibility of controlling a simulated humanoid robot to grasp a variety of objects and move them along a full trajectory, but many limitations remain. The rotation error needs further improvement, precise in-hand manipulation is not supported, the trajectory following success rate can still be improved, and specific types of grasps cannot yet be requested. Achieving human-level dexterity remains challenging, even in a simulated environment.
Omnigrasp is a humanoid controller capable of grasping more than 1,200 objects and following object trajectories. It shows that a pre-trained universal humanoid motion representation makes learning possible with a simple reward and state design. Future work includes increasing the trajectory following success rate, improving grasp diversity, supporting more object categories, and improving the humanoid motion representation. In addition, developing effective object representations that do not depend on a canonical object pose and that generalize to vision-based systems is an important research direction.
Source: CAAI Committee on Cognitive Systems and Information Processing