Published as a conference paper at ICLR 2026

Verifier-free Test-Time Sampling
for Vision-Language-Action Models

1KAIST, 2Seoul National University, 3RLWRLD
TL;DR

We propose a masking distribution‑guided test-time scaling method to improve the precision of VLAs in robotic manipulation.

Overview of MG-Select

Abstract

Vision-Language-Action models (VLAs) excel at robot control but their single-inference, greedy-decoding paradigm bottlenecks high-precision manipulation. Existing test-time scaling approaches rely on external verifiers that require additional training and fail to generalize to unseen conditions.

We propose Masking Distribution Guided Selection (MG-Select), a verifier-free test-time scaling framework that uses KL divergence from a condition-masking reference distribution as a Best-of-N confidence signal. The reference is produced by the same VLA with randomly masked states and language conditions. A joint training strategy that applies dropout to state and language conditions further sharpens the reference at test time.

Our experiments demonstrate that MG-Select consistently improves state-of-the-art VLAs across diverse simulation and real-world pick-and-place benchmarks, without any additional training or external module at inference.

Key Results

MG-Select achieves substantial gains over greedy decoding across simulation and real-world benchmarks, without any external verifier.

+168%
RoboCasa pick-and-place (30 demos)
+28%
Real-world in-distribution
+35%
Real-world out-of-distribution

Method

1. Motivation

VLAs have shown remarkable performance in robot control, yet they remain fundamentally limited on tasks that demand high precision. Even after extensive pre-training, they often fail on fine-grained manipulation such as grasping or object placement. This precision gap is especially problematic for real-world applications where millimeter-level accuracy can decide task success.

A natural remedy, inspired by Test-Time Scaling (TTS) in LLMs, is repeated sampling + Best-of-N selection. Prior work pairs sampling with an external verifier trained via reinforcement learning on robotic data, which introduces two significant drawbacks: (i) it adds substantial training cost and deployment complexity, and (ii) the learned verifiers fail to generalize to unseen task prompts or objects, limiting broader applicability.

Our goal is therefore a TTS framework that leverages the model's internal properties, with no extra training and no external modules.

2. Condition-Masking Distributional Confidence

Naive likelihood-based Best-of-N often fails because VLAs fine-tuned on the target task produce overly concentrated action token distributions, causing multiple samples to collapse to the same action. Instead, we compute a confidence score via the KL divergence between the predicted distribution and a reference distribution that represents uncertainty. Intuitively, actions that deviate most from an uncertainty-aware reference are the most confident.

We build that reference using the same VLA, but with specific input modalities masked, approximating failure modes where essential conditions are ignored. We consider three variants:

Text-masking: $\mathrm{KL}_{\text{text}} = \mathrm{KL}\big(\pi_\theta(\cdot \mid o_t, q_t, \varnothing, a_{<i}) \,\|\, \pi_\theta(\cdot \mid o_t, q_t, I, a_{<i})\big)$
State-masking: $\mathrm{KL}_{\text{state}} = \mathrm{KL}\big(\pi_\theta(\cdot \mid o_t, \varnothing, I, a_{<i}) \,\|\, \pi_\theta(\cdot \mid o_t, q_t, I, a_{<i})\big)$
Both: $\mathrm{KL}_{\text{both}} = \mathrm{KL}\big(\pi_\theta(\cdot \mid o_t, \varnothing, \varnothing, a_{<i}) \,\|\, \pi_\theta(\cdot \mid o_t, q_t, I, a_{<i})\big)$

Token-wise KL is aggregated into an action-level confidence $C_{\tilde{a}} = \sum_{i \in \mathcal{I}} \mathrm{KL}(Q_i \,\|\, P_i)$, where $Q_i$ is the condition-masked reference and $P_i$ the fully conditioned prediction at token $i$, and the final action is selected via Best-of-N:

$a^{\ast} = \arg\max_{\tilde{a}^{(n)} \in \tilde{\mathcal{A}}} C_{\tilde{a}^{(n)}}$

The optimal masking variant depends on the environment: state-masking works best on single-task pick-and-place benchmarks (e.g., SIMPLER-WidowX), while text-masking dominates in multi-task environments (e.g., RoboCasa) where instructions are indispensable.
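The scoring rule above can be sketched in a few lines. This is a minimal illustration, assuming the per-token probability distributions from the two forward passes have already been extracted into arrays; the names `ref_probs`/`pred_probs` and the `(N, T, V)` shapes are our assumptions, not the paper's API.

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """Token-wise KL(Q || P), summed over the vocabulary axis."""
    q = np.clip(q, eps, None)
    p = np.clip(p, eps, None)
    return np.sum(q * np.log(q / p), axis=-1)

def mg_select(ref_probs, pred_probs):
    """Best-of-N selection by summed token-wise KL confidence.

    ref_probs:  (N, T, V) condition-masked reference distributions Q
    pred_probs: (N, T, V) fully conditioned predicted distributions P
    Returns the index of the highest-confidence candidate and all scores.
    """
    token_kl = kl_divergence(ref_probs, pred_probs)  # (N, T)
    confidence = token_kl.sum(axis=-1)               # (N,)
    return int(np.argmax(confidence)), confidence
```

A candidate whose prediction matches the uncertainty-aware reference scores zero, while one that deviates sharply from it scores high and is selected.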

3. Joint Training Strategy

Standard VLAs are not trained to see masked inputs, so directly feeding a masked condition yields degenerate distributions. We augment fine-tuning with all four masking variants:

(i) all-condition $(q_t, I)$; (ii) text-masking $(q_t, \varnothing)$; (iii) state-masking $(\varnothing, I)$; (iv) both-masking $(\varnothing, \varnothing)$.

Dropout over $q_t$ and $I$ is applied during training. The resulting VLA, denoted MG-Select*, matches standard-fine-tuning performance and produces meaningful condition-masking references at test time, yielding a stronger confidence signal and further amplifying MG-Select's gains.
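The condition dropout can be sketched as below; the masking probability and the independent-Bernoulli scheme over the two conditions are illustrative assumptions, not values taken from the paper.

```python
import random

# Stand-in for the masked condition embedding (the ∅ symbol in the text).
MASK = None

def sample_conditions(state, instruction, p_mask=0.1, rng=random):
    """Independently drop the state q_t and instruction I with prob p_mask.

    Over many training steps this exposes the VLA to all four variants:
    (q, I), (q, ∅), (∅, I), and (∅, ∅), so condition-masked forward
    passes at test time produce meaningful reference distributions.
    """
    q = MASK if rng.random() < p_mask else state
    i = MASK if rng.random() < p_mask else instruction
    return q, i
```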

Qualitative Results

MG-Select produces high-precision actions at the critical moments of grasping and releasing, where the base policy often fails.

Grasping sponge from box
π0-FAST-DROID (base model)
π0-FAST-DROID + Ours
Releasing sponge to bowl
π0-FAST-DROID (base model)
π0-FAST-DROID + Ours

Real-world "Box to Bowl" task on the Franka Research 3.

Efficient Deployment

Since MG-Select generates $N$ candidate actions in parallel, a naive implementation repeats the expensive prefill step $N$ times. This is particularly critical for VLAs, which re-prefill at every timestep to condition on the current observation.

We design a single-prefill deployment that shares one prefill across all $N$ candidates before decoding. With $N=4$, this gives a 45% latency reduction compared to vanilla MG-Select, keeping inference time comparable to single-action inference across different candidate sizes.
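The single-prefill idea can be sketched with a toy example. The `prefill` and `decode` functions below are stand-ins, not a real VLA interface; the point is only that the expensive prefix pass runs once and its cache is shared read-only by all $N$ decode passes.

```python
import numpy as np

CALLS = {"prefill": 0}  # track how often the expensive step runs

def prefill(prefix):
    """Stand-in for the costly forward pass over observation/instruction
    tokens, producing a KV-cache-like object."""
    CALLS["prefill"] += 1
    return np.cumsum(prefix)

def decode(kv_cache, seed):
    """Stand-in for one stochastic action-decoding pass that reads the
    shared cache without mutating it."""
    rng = np.random.default_rng(seed)
    return kv_cache[-1] + rng.normal()

def best_of_n_single_prefill(prefix, n=4):
    cache = prefill(prefix)  # one prefill, shared across all N candidates
    return [decode(cache, seed=k) for k in range(n)]
```

A naive implementation would call `prefill` inside the loop, paying the prefix cost $N$ times per timestep; sharing the cache keeps the added cost of Best-of-N to the (cheap) decode passes only.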

Inference latency comparison on LIBERO-Object
Figure 3. Inference latency on LIBERO-Object. MG-Select with single prefill stays near-flat as $N$ grows, while vanilla MG-Select scales super-linearly.

Table 5 (b). Effect of candidate count $N$ on RoboCasa (100 demos). Gains saturate quickly, and $N=4$ already captures most of the improvement.

| $N$ | PnP | All |
|---|---|---|
| 1 | 27.6 | 43.8 |
| 2 | 30.0 | 46.2 |
| 4 | 31.0 | 48.1 |
| 8 | 30.0 | 46.9 |
| 16 | 30.7 | 46.1 |
| 32 | 31.0 | 46.6 |
| 64 | 33.3 | 48.4 |

Benchmark Results

MG-Select consistently improves state-of-the-art VLAs across simulation, real-world Franka experiments, a real-to-sim evaluation, and a long-horizon zero-shot benchmark, all without any external verifier.

(1) RoboCasa, pick-and-place precision

  • 8 pick-and-place tasks (of 24 kitchen tasks), trained with 30, 100, and 300 demonstrations.
  • MG-Select* yields the largest gains in the low-data regime, +168% relative on PnP at 30 demos.
| Model | PnP (30) | All (30) | PnP (100) | All (100) | PnP (300) | All (300) |
|---|---|---|---|---|---|---|
| GR00T N1 | 0.4 | 17.4 | 2.2 | 32.1 | 22.6 | 49.6 |
| π0-FAST† | 5.3 | 30.9 | 17.0 | 40.2 | 43.2 | 61.2 |
| + MG-Select | 7.2 | 32.0 | 22.6 | 43.7 | 46.5 | 61.3 |
| + MG-Select* | 14.2 | 34.6 | 31.0 | 48.1 | 46.9 | 62.9 |

* denotes additional joint training before test-time scaling.

(2) Real-World, Franka Research 3

  • 7-DoF Franka arm with π0-FAST-DROID, tested on in-distribution (seen objects) and out-of-distribution (unseen objects) pick-and-place tasks.
  • MG-Select* gives +28% on ID tasks and +35% on OOD tasks, confirming real-world gains beyond simulation.

Table 4. In-distribution: 60 demos per task, 24 trials (4 objects × 6 trials) per task.

| Model | Box→Bowl | Box→Plate | Basket→Bowl | Plate→Basket | Avg. |
|---|---|---|---|---|---|
| π0-FAST-DROID | 41.7 | 37.5 | 45.8 | 25.0 | 37.5 |
| + MG-Select* | 58.3 | 54.2 | 50.0 | 29.2 | 47.9 |

* denotes additional joint training before test-time scaling.

Table 3. Out-of-distribution: unseen objects, 16 trials per task.

| Model | Pick up Tape | Take Cup out of Bowl | Avg. |
|---|---|---|---|
| π0-FAST-DROID | 56.3 | 50.0 | 53.1 |
| + MG-Select | 68.8 | 75.0 | 71.9 |

(3) SIMPLER-WidowX, real-to-sim evaluation

  • π0-FAST trained on BridgeData V2, evaluated on 4 SIMPLER pick-and-place tasks in a real-to-sim setting.
  • MG-Select* raises average success from 46.9 to 50.3, the highest among all compared VLAs.
| Model | Spoon on Towel | Carrot on Plate | Stack Cubes | Eggplant in Basket | Avg. |
|---|---|---|---|---|---|
| RT-1-X | 0.0 | 4.2 | 0.0 | 0.0 | 1.1 |
| Octo | 12.5 | 8.3 | 0.0 | 43.1 | 16.0 |
| RoboVLM | 29.2 | 25.0 | 12.5 | 58.3 | 31.3 |
| SpatialVLA | 16.7 | 25.0 | 29.2 | 100.0 | 42.7 |
| π0-FAST† | 66.7 | 70.8 | 41.7 | 8.3 | 46.9 |
| + MG-Select* | 69.4 | 75.0 | 43.1 | 13.9 | 50.3 |

* denotes additional joint training before test-time scaling.

(4) CALVIN, long-horizon zero-shot generalization

  • π0-FAST trained on environments A, B, and C, zero-shot evaluated on novel environment D (ABC→D) over 1,000 instruction chains across 34 tasks.
  • MG-Select* improves success at every chain depth (1–5 steps), raising average chain length from 3.69 to 3.86.
Columns 1–5 report the percentage of tasks completed in a row.

| Model | Task | 1 | 2 | 3 | 4 | 5 | Avg. Len (↑) |
|---|---|---|---|---|---|---|---|
| π0-FAST† | ABC→D | 96.0 | 85.8 | 74.4 | 62.4 | 50.6 | 3.69 |
| + MG-Select* | ABC→D | 96.9 | 88.0 | 77.8 | 67.6 | 55.8 | 3.86 |

* denotes additional joint training before test-time scaling.

BibTeX

@inproceedings{jang2026verifierfree,
  title     = {Verifier-free Test-Time Sampling for Vision-Language-Action Models},
  author    = {Suhyeok Jang and Dongyoung Kim and Changyeon Kim and Youngsuk Kim and Jinwoo Shin},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=UD4Rw8MOEK}
}