Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation

1 AI Thrust, The Hong Kong University of Science and Technology (Guangzhou)
2 Computer Science and Engineering, Sun Yat-sen University
3 CSE, The Hong Kong University of Science and Technology
† Indicates Corresponding Author
The code, data, and models will be released upon acceptance of the paper.


Learning generalizable visual dynamic representations across different embodied environments is crucial for real-world robotic manipulation. Because the scale and diversity of robot demonstration data are limited, recent works have turned to large-scale pre-training on human data. However, the morphological differences between humans and robots introduce a significant human-robot domain discrepancy, which hinders the generalization of these human-data pre-trained models to downstream manipulation tasks. To address this, we propose a novel adaptation paradigm that uses readily available paired human-robot video data to bridge the discrepancy. Following this paradigm, our method exploits a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robotic domain in a parameter-efficient manner. Experiments demonstrate significant improvements on 25 tasks across three benchmarks, covering both single-task and language-conditioned multi-task settings and evaluating two different pre-trained models. On the large RLBench benchmark, our adaptation method achieves an average improvement of 8.9% in success rate over the pre-trained R3M model across multiple tasks.

Contribution #1

We highlight the human-robot domain discrepancy in visual pre-training for robotic manipulation, and provide a new adaptation paradigm that simultaneously alleviates the domain discrepancy and maintains the versatility of pre-trained models.


Contribution #2

We propose a Human-Robot Semantic Alignment method, which adapts pre-trained models with a parameter-efficient design and exploits a human-robot contrastive alignment loss to effectively mitigate the domain discrepancy.


Contribution #3

We evaluate the effectiveness of our method in three different environments, covering 7 tasks in the single-task setting and 18 tasks in the language-conditioned multi-task setting, as well as pre-trained models built with different pre-training methodologies.
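
To make the idea of a human-robot contrastive alignment loss more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the loss takes the common symmetric InfoNCE form over a batch of paired human and robot video embeddings. The function name `contrastive_alignment_loss`, the `temperature` value, and the toy embeddings are all hypothetical choices made for illustration; the exact formulation used in the paper is not specified in this section.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_alignment_loss(human_feats, robot_feats, temperature=0.1):
    """Symmetric InfoNCE loss (an assumed form, for illustration only).

    human_feats[i] and robot_feats[i] are embeddings of a paired
    human/robot video showing the same task; all other combinations
    in the batch act as negatives.
    """
    if not human_feats or len(human_feats) != len(robot_feats):
        raise ValueError("need a non-empty batch of paired embeddings")

    h = [l2_normalize(v) for v in human_feats]
    r = [l2_normalize(v) for v in robot_feats]
    n = len(h)

    # Cosine similarities scaled by the temperature: logits[i][j] = <h_i, r_j> / t
    logits = [[dot(h[i], r[j]) / temperature for j in range(n)] for i in range(n)]
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]

    # Human-to-robot and robot-to-human directions; the positive is on the diagonal.
    h2r = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    r2h = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (h2r + r2h)


# Toy 2-D embeddings: correctly paired videos vs. deliberately mismatched pairs.
human = [[1.0, 0.0], [0.0, 1.0]]
robot_aligned = [[1.0, 0.0], [0.0, 1.0]]
robot_swapped = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_alignment_loss(human, robot_aligned)
swapped_loss = contrastive_alignment_loss(human, robot_swapped)
```

In this toy example the loss is close to zero when each human embedding matches its paired robot embedding, and large when the pairs are mismatched. In a parameter-efficient setup, a loss of this kind would typically be minimized by updating only a small set of added parameters while the pre-trained backbone stays frozen; that training loop is omitted here.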


The existing learning paradigm of visual pre-training on human data for robotic manipulation suffers from the human-robot domain discrepancy. Our work makes a preliminary attempt to solve this challenging problem. We contribute a new adaptation paradigm that leverages existing semantically aligned human-robot video data, and we propose an efficient semantic alignment method. In this way, existing human-data pre-trained models can be efficiently and explicitly adapted to the robot domain without being tailored to each downstream robotic environment. Experiments on 25 robotic manipulation tasks across three environments and different pre-trained models demonstrate the efficacy of our proposed method.


        title={Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation},
        author={Zhou, Jiaming and Ma, Teli and Lin, Kun-Yu and Qiu, Ronghe and Wang, Zifan and Liang, Junwei},
        journal={arXiv preprint arXiv:2406.14235},