
Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization

HKUST(GZ), HKU, SYSU, HKUST

This work introduces AGNOSTOS, a cross-task generalization benchmark for robotic manipulation, and a novel Vision-Language-Action method that shows promising cross-task generalization capabilities.

Abstract

The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.

AGNOSTOS Benchmark

We present AGNOSTOS, a simulation benchmark built in RLBench to rigorously evaluate zero-shot, cross-task generalization for robotic manipulation. It features 23 unseen test tasks, distinct from typical training distributions.

These tasks are categorized into two difficulty levels:
  • Level-1: 13 unseen tasks that share partial semantics with seen tasks (e.g., similar objects such as "cups" or motions such as "put").
  • Level-2: 10 unseen tasks that introduce entirely novel scenarios with no overlapping objects or actions.
We benchmark three broad categories of VLA models:
  • Foundation VLAs: trained on large-scale real-world cross-embodiment robotic data or built upon LLMs or VLMs, including OpenVLA, RDT, π0, LLARVA, SAM2Act, 3D-LOTUS++, and VoxPoser.
  • Human-video VLAs: pre-trained on large-scale human action videos to capture rich human-object interactions for downstream robotic fine-tuning, including R3M, D4R, R3M-Align, and D4R-Align.
  • In-domain VLAs: trained from scratch on RLBench's 18 seen tasks with task-specific model architectures. These serve as strong baselines without domain mismatch, including PerAct, RVT, RVT2, Sigma-Agent, and Instant Policy.

X-ICM Method

To push the boundaries of cross-task zero-shot generalization in vision-language-action (VLA) models, we propose Cross-Task In-Context Manipulation (X-ICM). Leveraging the cross-task generalization capabilities of LLMs, X-ICM utilizes demonstrations from seen tasks as in-context examples, and the dynamic characteristics of these examples are used to prompt the LLM to predict action sequences for unseen tasks. A central challenge in this setting is that the choice of in-context demonstrations strongly affects generalization performance. To address this, we design a dynamics-guided sample selection module that measures similarities between dynamic representations of seen and unseen tasks to guide the selection process, resulting in improved cross-task generalization.
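
The selection rule is not spelled out on this page, but a minimal sketch of dynamics-guided retrieval, assuming each demonstration is summarized by a fixed-length dynamics feature vector and similarity is measured with cosine distance (function and variable names below are illustrative, not the released implementation), could look as follows:

    import numpy as np

    def select_demonstrations(unseen_dyn, seen_dyns, k=5):
        """Rank seen-task demonstrations by cosine similarity between dynamics
        representations and return the indices of the top-k matches.

        unseen_dyn : (d,) array   -- dynamics feature of the unseen-task observation
        seen_dyns  : (n, d) array -- dynamics features of n seen-task demonstrations
        """
        # Normalize so that a dot product equals cosine similarity.
        q = unseen_dyn / (np.linalg.norm(unseen_dyn) + 1e-8)
        d = seen_dyns / (np.linalg.norm(seen_dyns, axis=1, keepdims=True) + 1e-8)
        scores = d @ q                  # (n,) similarity to the unseen task
        return np.argsort(-scores)[:k]  # k most dynamically relevant demonstrations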

X-ICM Method Overview

Overview of the X-ICM framework. For an unseen task, dynamically relevant demonstrations from seen tasks are retrieved to prompt an LLM for action sequence prediction.
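
As a rough illustration of the prompting step (not the exact prompt format used in the paper; the dictionary fields and the `llm.generate` call are placeholders), the retrieved demonstrations could be serialized into an in-context prompt like this:

    def build_prompt(query_instruction, demos):
        """Serialize retrieved seen-task demonstrations into an in-context prompt,
        ending with the unseen-task instruction for the LLM to complete."""
        blocks = []
        for demo in demos:
            blocks.append(
                f"Task: {demo['instruction']}\n"
                f"Actions: {demo['action_sequence']}"
            )
        blocks.append(f"Task: {query_instruction}\nActions:")
        return "\n\n".join(blocks)

    # Hypothetical usage: `llm.generate` stands in for whichever LLM interface is
    # used; its text output would then be parsed into an executable action sequence.
    # prompt = build_prompt("put the cup on the shelf", retrieved_demos)
    # action_text = llm.generate(prompt)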

Benchmarking Results

  • X-ICM (7B) and X-ICM (72B) achieve average success rates of 23.5% and 30.1%, respectively, outperforming all existing VLA models.
  • All prior models completely fail on at least eight of the 23 tasks. In contrast, X-ICM (7B) fails on only two, and X-ICM (72B) succeeds on all.

BibTeX

@article{zhou2025exploring,
  title     = {Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization},
  author    = {Jiaming Zhou and Ke Ye and Jiayi Liu and Teli Ma and Zifan Wang and Ronghe Qiu and Kun-Yu Lin and Zhilin Zhao and Junwei Liang},
  journal   = {arXiv preprint},
  year      = {2025},
}