This work introduces AGNOSTOS, a cross-task generalization benchmark for robotic manipulation, and a novel Vision-Language-Action method that shows promising cross-task generalization capabilities.
The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.
We present AGNOSTOS, a simulation benchmark built in RLBench to rigorously evaluate zero-shot cross-task generalization for robotic manipulation. It features 23 unseen test tasks, distinct from typical training distributions, organized into two levels of generalization difficulty: the first level shares partial semantics (e.g., similar objects like "cups" or motions like "put") with seen tasks, while the second comprises entirely novel scenarios with no overlapping objects or actions.
To push the boundaries of cross-task zero-shot generalization in vision-language-action (VLA) models, we propose Cross-Task In-Context Manipulation (X-ICM). Leveraging the cross-task generalization capabilities of LLMs, X-ICM uses demonstrations from seen tasks as in-context examples, whose dynamics information prompts the LLM to predict action sequences for unseen tasks. A central challenge in this setting is that the choice of in-context demonstrations strongly affects generalization performance. To address this, we design a dynamics-guided sample selection module that measures similarities between dynamics representations of seen and unseen tasks to guide the selection process, resulting in improved cross-task generalization.
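For intuition, here is a minimal sketch (not the official X-ICM implementation) of dynamics-guided demonstration selection followed by in-context prompting. It assumes dynamics representations are fixed-length vectors compared with cosine similarity, and names such as `Demonstration`, `select_demonstrations`, and `build_prompt` are illustrative; the embedding extractor and the LLM call are stubbed out.

```python
"""Sketch of dynamics-guided in-context demonstration selection (illustrative only)."""
from dataclasses import dataclass
import numpy as np


@dataclass
class Demonstration:
    task_name: str            # seen-task identifier
    dynamics_emb: np.ndarray  # hypothetical dynamics embedding of the demo
    prompt_text: str          # textual rendering of observation -> action sequence


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def select_demonstrations(query_emb: np.ndarray, demos: list[Demonstration], k: int = 5) -> list[Demonstration]:
    """Rank seen-task demos by dynamics similarity to the unseen task and keep the top k."""
    ranked = sorted(demos, key=lambda d: cosine_similarity(query_emb, d.dynamics_emb), reverse=True)
    return ranked[:k]


def build_prompt(instruction: str, selected: list[Demonstration]) -> str:
    """Concatenate retrieved demos as in-context examples, then append the unseen-task query."""
    examples = "\n\n".join(d.prompt_text for d in selected)
    return (
        "You control a robot arm. Given the examples, output an action sequence.\n\n"
        f"{examples}\n\n"
        f"Task: {instruction}\nActions:"
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demos = [
        Demonstration(f"seen_task_{i}", rng.normal(size=128),
                      f"Task: seen_task_{i}\nActions: <pose_1> ... <pose_T>")
        for i in range(10)
    ]
    query_emb = rng.normal(size=128)  # stand-in for the unseen task's dynamics embedding
    top = select_demonstrations(query_emb, demos, k=3)
    prompt = build_prompt("put the red cup on the shelf", top)
    print(prompt)  # this prompt would then be passed to an LLM for action prediction
```

In this sketch the retrieved prompt would be sent to an LLM whose output is parsed into an executable action sequence; the actual dynamics representation and prompt format follow the paper.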
Overview of the X-ICM framework. For an unseen task, dynamically relevant demonstrations from seen tasks are retrieved to prompt an LLM for action sequence prediction.
@article{zhou2025exploring,
title = {Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization},
author = {Jiaming Zhou and Ke Ye and Jiayi Liu and Teli Ma and Zifan Wang and Ronghe Qiu and Kun-Yu Lin and Zhilin Zhao and Junwei Liang},
journal = {arXiv preprint},
year = {2025},
note = {Replace with actual publication details when available}
}