This work introduces AGNOSTOS, a cross-task generalization benchmark for robotic manipulation, and a novel Vision-Language-Action method that shows promising cross-task generalization capabilities.
The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.
We present AGNOSTOS, a simulation benchmark built in RLBench to rigorously evaluate zero-shot cross-task generalization for robotic manipulation. It features 23 unseen test tasks, distinct from typical training distributions, organized into two levels of generalization difficulty: the first level shares partial semantics (e.g., similar objects like "cups" or motions like "put") with seen tasks, while the second comprises entirely novel scenarios with no overlapping objects or actions.
To push the boundaries of cross-task zero-shot generalization in vision-language-action (VLA) models, we propose Cross-Task In-Context Manipulation (X-ICM). Leveraging the cross-task generalization capabilities of LLMs, X-ICM uses demonstrations from seen tasks as in-context examples, whose dynamics information prompts the LLM to predict action sequences for unseen tasks. A central challenge in this setting is that the choice of in-context demonstrations strongly affects generalization performance. To address this, we design a dynamics-guided sample selection module that measures similarities between dynamics representations of seen and unseen tasks to guide the selection process, resulting in improved cross-task generalization.
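For intuition, here is a minimal sketch (not the official X-ICM implementation) of dynamics-guided demonstration selection followed by in-context prompting. It assumes dynamics representations are fixed-length vectors compared with cosine similarity, and names such as `Demonstration`, `select_demonstrations`, and `build_prompt` are illustrative; the embedding extractor and the LLM call are stubbed out.

```python
"""Sketch of dynamics-guided in-context demonstration selection (illustrative only)."""
from dataclasses import dataclass
import numpy as np


@dataclass
class Demonstration:
    task_name: str            # seen-task identifier
    dynamics_emb: np.ndarray  # hypothetical dynamics embedding of the demo
    prompt_text: str          # textual rendering of observation -> action sequence


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def select_demonstrations(query_emb: np.ndarray, demos: list[Demonstration], k: int = 5) -> list[Demonstration]:
    """Rank seen-task demos by dynamics similarity to the unseen task and keep the top k."""
    ranked = sorted(demos, key=lambda d: cosine_similarity(query_emb, d.dynamics_emb), reverse=True)
    return ranked[:k]


def build_prompt(instruction: str, selected: list[Demonstration]) -> str:
    """Concatenate retrieved demos as in-context examples, then append the unseen-task query."""
    examples = "\n\n".join(d.prompt_text for d in selected)
    return (
        "You control a robot arm. Given the examples, output an action sequence.\n\n"
        f"{examples}\n\n"
        f"Task: {instruction}\nActions:"
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demos = [
        Demonstration(f"seen_task_{i}", rng.normal(size=128),
                      f"Task: seen_task_{i}\nActions: <pose_1> ... <pose_T>")
        for i in range(10)
    ]
    query_emb = rng.normal(size=128)  # stand-in for the unseen task's dynamics embedding
    top = select_demonstrations(query_emb, demos, k=3)
    prompt = build_prompt("put the red cup on the shelf", top)
    print(prompt)  # this prompt would then be passed to an LLM for action prediction
```

In this sketch the retrieved prompt would be sent to an LLM whose output is parsed into an executable action sequence; the actual dynamics representation and prompt format follow the paper.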
Overview of the X-ICM framework. For an unseen task, dynamically relevant demonstrations from seen tasks are retrieved to prompt an LLM for action sequence prediction.
@article{zhou2025exploring,
title = {Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization},
author = {Jiaming Zhou and Ke Ye and Jiayi Liu and Teli Ma and Zifan Wang and Ronghe Qiu and Kun-Yu Lin and Zhilin Zhao and Junwei Liang},
journal = {arXiv preprint},
year = {2025},
note = {Replace with actual publication details when available}
}