Learning generalizable visual dynamics representations across different embodied environments is crucial for real-world robotic manipulation.
As the scale and diversity of robot demonstration data are limited, recent works have turned to large-scale pre-training using human data.
However, the morphological differences between humans and robots introduce a significant human-robot domain discrepancy, which hinders the generalization of models pre-trained on human data to downstream manipulation tasks.
To address this, we propose a novel adaptation paradigm that utilizes readily available paired human-robot video data to bridge the discrepancy.
Following this paradigm, our method exploits a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robotic domain in a parameter-efficient manner.
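The abstract does not specify the exact form of this alignment objective, but a minimal sketch of what such a human-robot contrastive alignment could look like is given below, assuming an InfoNCE-style loss over paired human/robot clip embeddings produced by a frozen pre-trained encoder with a small trainable adapter (to reflect the parameter-efficient adaptation). All module and function names here (Adapter, human_robot_alignment_loss) are illustrative, not the paper's.

```python
# Hedged sketch (not the paper's exact formulation): an InfoNCE-style
# contrastive alignment between paired human and robot video embeddings.
# The pre-trained visual encoder stays frozen; only a small adapter is
# trained, mirroring the parameter-efficient adaptation described above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Small residual bottleneck applied on top of frozen encoder features."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # residual adaptation of the frozen features


def human_robot_alignment_loss(z_human: torch.Tensor,
                               z_robot: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: the i-th human clip and i-th robot clip form a
    positive pair; all other clips in the batch serve as negatives."""
    z_human = F.normalize(z_human, dim=-1)
    z_robot = F.normalize(z_robot, dim=-1)
    logits = z_human @ z_robot.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(z_human.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Stand-ins for clip-pooled features from a frozen pre-trained encoder
    # (e.g. R3M); in practice these would come from the actual encoder.
    batch, dim = 8, 512
    frozen_human_feats = torch.randn(batch, dim)
    frozen_robot_feats = torch.randn(batch, dim)

    adapter = Adapter(dim)  # only these weights receive gradients
    loss = human_robot_alignment_loss(adapter(frozen_human_feats),
                                      adapter(frozen_robot_feats))
    loss.backward()
    print(f"alignment loss: {loss.item():.4f}")
```

In this sketch, pulling each paired human and robot clip together while pushing apart mismatched pairs is one plausible way to align the semantics of the two domains without updating the frozen pre-trained backbone.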
Experiments demonstrate significant improvements on 25 tasks across three different benchmarks, covering both single-task and language-conditioned multi-task settings and evaluating two different pre-trained models.
On the large RLBench benchmark, our adaptation method achieves an average success-rate improvement of 8.9% over the pre-trained R3M model across multiple tasks.