Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation

CVPR 2025

¹AI Thrust, The Hong Kong University of Science and Technology (Guangzhou) ²Computer Science and Engineering, Sun Yat-sen University ³CSE, The Hong Kong University of Science and Technology
† Indicates Corresponding Author

Abstract

Learning generalizable visual representations across different embodied environments is essential for effective robotic manipulation in real-world scenarios. However, the limited scale and diversity of robot demonstration data pose a significant challenge. Recent research has explored leveraging large-scale human activity data for pre-training, but the substantial morphological differences between humans and robots introduce a significant human-robot domain discrepancy, hindering the generalization of these models to downstream manipulation tasks. To overcome this, we propose a novel adaptation paradigm that leverages readily available paired human-robot video data to bridge the domain gap. Our method employs a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robot domain in a parameter-efficient manner. Experiments on 20 simulated tasks across two different benchmarks and five real-world tasks demonstrate significant improvements. These results span both single-task and language-conditioned multi-task settings, evaluated using two different pre-trained models. Compared to existing pre-trained models, our adaptation method improves the average success rate by over 7% across multiple tasks on both simulated benchmarks and real-world evaluations.

Motivation

We highlight the human-robot domain discrepancy in visual pre-training for robotic manipulation, and provide a new adaptation paradigm that simultaneously alleviates the domain discrepancy and maintains the versatility of pre-trained models.

Method

We propose a Human-Robot Semantic Alignment method that adapts pre-trained models through a parameter-efficient design and uses a human-robot contrastive alignment loss to mitigate the domain discrepancy.
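To make the idea of a contrastive alignment loss concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes a symmetric InfoNCE-style objective over a batch of paired features, where row i of `human_feats` and row i of `robot_feats` come from a human video and a robot video showing the same task. The function names, the temperature value, and the symmetric form are illustrative assumptions; this page does not specify the exact formulation.

```python
import math


def l2_normalize(v):
    # Scale a feature vector to unit length (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(human_feats, robot_feats, temperature):
    # Pairwise cosine similarities between human and robot features,
    # divided by a temperature (hypothetical hyperparameter).
    h = [l2_normalize(v) for v in human_feats]
    r = [l2_normalize(v) for v in robot_feats]
    return [[sum(a * b for a, b in zip(hv, rv)) / temperature for rv in r]
            for hv in h]


def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the paired video. Uses the log-sum-exp trick for stability.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_alignment_loss(human_feats, robot_feats, temperature=0.1):
    # Symmetric InfoNCE-style loss: human-to-robot plus robot-to-human.
    logits = cosine_logits(human_feats, robot_feats, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))


# Toy features: two human clips and their paired robot clips.
human = [[1.0, 0.0], [0.0, 1.0]]
robot_aligned = [[2.0, 0.1], [0.1, 2.0]]
robot_shuffled = [[0.1, 2.0], [2.0, 0.1]]

loss_aligned = contrastive_alignment_loss(human, robot_aligned)
loss_shuffled = contrastive_alignment_loss(human, robot_shuffled)
```

In this toy example the correctly paired batch yields a lower loss than the shuffled one, which is the signal such an objective would provide when training the lightweight adaptation parameters while the pre-trained backbone stays frozen.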

Experiments on real-world tasks

Conclusion

The existing paradigm of visual pre-training on human data for robotic manipulation suffers from the human-robot domain discrepancy. Our work makes a preliminary attempt to address this challenging problem. We contribute a new adaptation paradigm that leverages existing semantically aligned human-robot video data, together with an efficient semantic alignment method. In this way, existing models pre-trained on human data can be efficiently and explicitly adapted to the robot domain without being tailored to each downstream robotic environment. Experiments on 25 robotic manipulation tasks across different environments and different pre-trained models demonstrate the efficacy of the proposed method.

BibTeX

@article{zhou2024mitigating,
  title={Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation},
  author={Zhou, Jiaming and Ma, Teli and Lin, Kun-Yu and Qiu, Ronghe and Wang, Zifan and Liang, Junwei},
  journal={arXiv preprint arXiv:2406.14235},
  year={2024}
}