Uncertainty Quantification for Flow-Based Vision-Language-Action Models

Römer, Ralf; Seeliger, Maximilian; Liu, Saida; Sturgis, Ben; Bagatella, Marco; Marta, Daniel; Krause, Andreas; Schoellig, Angela P.


2026

Ralf Römer¹, Maximilian Seeliger², Saida Liu¹, Ben Sturgis¹, Marco Bagatella²,³, Daniel Marta², Andreas Krause², Angela P. Schoellig¹

¹ Technical University of Munich ² ETH Zurich ³ MPI for Intelligent Systems, Tübingen


Abstract

Vision-language-action models (VLAs) combine vision-language backbones with expressive generative action heads trained via flow matching on large-scale robotic datasets. Despite their strong empirical performance in robotic manipulation, VLAs lack mechanisms to quantify confidence in their predictions and to detect when their actions may be unreliable. This presents a critical limitation for real-world deployment in non-stationary environments, where models inevitably encounter scenarios outside their pretraining distribution and may fail without warning. To address this, we derive an efficient method for quantifying epistemic uncertainty in flow-matching models by leveraging velocity-field disagreement (VFD) across a small ensemble. We successfully use this uncertainty estimate for failure detection during deployment and active fine-tuning of flow-based VLAs. To this end, we propose SAVE, a framework for uncertainty-guided active multitask fine-tuning that reduces the number of costly expert demonstrations required to adapt VLAs to new tasks. Through extensive experiments on the LIBERO benchmark, we demonstrate that VFD yields better-calibrated uncertainty estimates predictive of downstream performance, that VFD achieves strong performance in detecting failures, and that uncertainty-guided data acquisition with SAVE requires at least 22% fewer samples than baselines. In summary, our work shows that quantifying epistemic uncertainty in flow-based VLAs improves both failure awareness and adaptation.

Overview

Figure 1. Top: VFD quantifies epistemic uncertainty by measuring scaled differences between ensembled velocity fields. Bottom: SAVE prioritizes tasks by their mean VFD uncertainty and, for the most uncertain initial observations within each sampled task, requests an expert demonstration. The models are then fine-tuned using new and replay data, yielding data-efficient multitask adaptation.

Motivation

Modern VLAs pair a pre-trained vision-language backbone with an action expert trained via flow matching, achieving impressive multitask performance. Yet deploying them in the real world exposes them to non-stationarity: users repurpose robots for novel tasks, object appearances change, and environments evolve.

VLAs confidently execute erratic actions in out-of-distribution scenarios rather than abstaining or asking for help.
They cannot communicate what they do not know, preventing robust self-improvement and timely failure detection.
Robustly adapting VLAs to new domains currently requires collecting large numbers of costly expert demonstrations.

We address these challenges by quantifying epistemic uncertainty in flow-based VLAs through velocity-field disagreement (VFD), and use it both to detect failures at deployment and to guide data-efficient adaptation with SAVE (sample-efficient active fine-tuning via velocity-field epistemic uncertainty).

Velocity-Field Disagreement

VFD measures the disagreement between the velocity fields of a small ensemble of flow-matching models along their generative ODE trajectories. The estimator is mathematically grounded, computationally tractable, and naturally handles the high-dimensional, multimodal action distributions of VLAs. On a 2D toy problem, VFD is high precisely where inputs leave the training distribution, tracking the KL divergence between the learned and ground-truth conditional distributions.

Figure 2. Epistemic uncertainty estimation for a 2D generative modeling problem. Our velocity-field disagreement (VFD) uncertainty score is high for inputs far from the training distribution, similar to the KL divergence between the learned models' conditional distributions and the ground truth.
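The idea of measuring ensemble disagreement along the generative ODE can be illustrated with a minimal sketch. This is not the authors' implementation: the function `vfd_score`, the choice to integrate along the ensemble-mean velocity with Euler steps, the unscaled variance used as the disagreement measure, and the toy velocity fields are all assumptions made for illustration.

```python
import numpy as np

def vfd_score(velocity_fields, obs, action_dim, num_steps=10, seed=0):
    """Hypothetical VFD-style score: ensemble variance of the velocity,
    accumulated along one shared Euler trajectory from noise to action."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(action_dim)  # start from the noise prior
    dt = 1.0 / num_steps
    total = 0.0
    for k in range(num_steps):
        t = k * dt
        # Evaluate every ensemble member at the same (action, time, observation).
        v = np.stack([f(a, t, obs) for f in velocity_fields])  # (M, action_dim)
        v_mean = v.mean(axis=0)
        # Squared deviation from the ensemble mean, averaged over members.
        total += np.mean(np.sum((v - v_mean) ** 2, axis=1)) * dt
        # Assumption: follow the ensemble-mean field for the next step.
        a = a + dt * v_mean
    return total

def make_member(bias):
    """Toy velocity field; members agree at obs = 0 and diverge as |obs| grows."""
    def velocity(a, t, obs):
        target = (obs + bias * obs ** 2) * np.ones_like(a)
        return target - a
    return velocity

# A lightweight two-member ensemble.
ensemble = [make_member(0.0), make_member(0.2)]
in_dist = vfd_score(ensemble, obs=0.0, action_dim=2)
out_dist = vfd_score(ensemble, obs=3.0, action_dim=2)
print(in_dist, out_dist)
```

In this toy setup the score is zero where the two members coincide and grows as the observation moves away from that region, which mirrors the qualitative behavior described for the 2D problem.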

Calibration

On the LIBERO benchmark, VFD is more strongly correlated with per-task success rates than a wide range of uncertainty baselines, making it a reliable predictor of downstream performance. A lightweight two-member ensemble is sufficient, and VFD remains well-calibrated even when only the language prompt is varied.

**Table 1. Calibration analysis.** Negative Spearman rank (↑) and negative Pearson (↑) correlation between uncertainty estimates and per-task success rates, averaged across iterative fine-tuning rounds.
| Metric | Action-L2 | ACE | DECU | GU | Entropy | Perplexity | VFD (ours) |
|---|---|---|---|---|---|---|---|
| −Spearman | 0.50 ± 0.13 | 0.31 ± 0.12 | 0.31 ± 0.13 | 0.62 ± 0.00 | 0.10 ± 0.12 | −0.04 ± 0.09 | 0.71 ± 0.03 |
| −Pearson | 0.48 ± 0.09 | 0.36 ± 0.08 | 0.23 ± 0.15 | 0.65 ± 0.02 | 0.23 ± 0.21 | 0.02 ± 0.15 | 0.71 ± 0.02 |
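The calibration metrics reported above can be computed as in the following sketch. The helper names and the per-task numbers are made up for illustration and are not results from the paper. Spearman correlation is computed here as the Pearson correlation of average ranks.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def negative_correlations(uncertainty, success_rate):
    """Return (-Spearman, -Pearson); higher means uncertainty better
    predicts low per-task success."""
    neg_spearman = -pearson(average_ranks(uncertainty), average_ranks(success_rate))
    neg_pearson = -pearson(uncertainty, success_rate)
    return neg_spearman, neg_pearson

# Illustrative values only: one uncertainty score and one success rate per task.
uncertainty = [0.1, 0.4, 0.2, 0.9]
success_rate = [0.9, 0.5, 0.7, 0.1]
neg_spearman, neg_pearson = negative_correlations(uncertainty, success_rate)
print(neg_spearman, neg_pearson)
```

The sign is flipped because a well-calibrated uncertainty score should be high exactly where success rates are low, so a larger negative correlation indicates better calibration.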

Figure 3. The VLA ensemble size has little impact on calibration, allowing for a lightweight two-member ensemble.

Figure 4. VFD is also well-calibrated when varying only the language prompt.

Sample-Efficient Active Fine-Tuning

Using VFD to guide demonstration collection, SAVE reaches success-rate thresholds of 45% and above in fewer fine-tuning rounds than all other selection strategies and attains the highest final success rate (67.1%), reducing the required expert data by at least 22% compared to baselines.

**Table 2. Sample efficiency.** Number of active fine-tuning rounds (↓) to reach certain success rates (SR) and final SR (↑) for different demonstration selection strategies. Thresholds not reached within 15 rounds are marked as "—".
| SR Threshold | Random | Diversity | SAVE w/ Action-L2 | SAVE w/ GU | SAVE w/ VFD |
|---|---|---|---|---|---|
| ≥ 40% | 5.7 ± 1.2 | 4.7 ± 0.5 | 5.0 ± 0.8 | 5.0 ± 1.4 | 5.0 ± 0.8 |
| ≥ 45% | 6.7 ± 1.2 | 7.3 ± 0.9 | 6.0 ± 1.4 | 5.7 ± 2.4 | 5.3 ± 0.9 |
| ≥ 50% | 8.7 ± 2.1 | 9.7 ± 1.7 | 9.0 ± 2.4 | 7.3 ± 1.9 | 6.0 ± 0.8 |
| ≥ 55% | — | — | 9.0 ± 1.0 | 11.0 ± 1.6 | 8.0 ± 1.6 |
| ≥ 60% | — | — | — | 12.7 ± 1.7 | 10.0 ± 0.8 |
| ≥ 65% | — | — | — | — | 12.5 ± 0.5 |
| Final SR | 54.6 ± 0.9 | 54.9 ± 1.3 | 56.8 ± 7.6 | 64.0 ± 2.6 | 67.1 ± 3.2 |


Figure 5. Effect of the task-sampling temperature τ on SAVE. Larger τ biases expert queries toward higher-uncertainty tasks; τ = 0 corresponds to uniform task sampling. Left: Uncertainty guidance is beneficial for sampling both tasks and initial observations; legend entries containing "uniform" sample initial observations within a task uniformly rather than by uncertainty. Middle: Uncertainty-based sampling more rapidly reduces the fraction of uncertainty concentrated in the most uncertain task, indicating better allocation of demonstrations to underperforming tasks. Right: SAVE reduces the difference in prior knowledge about tasks; for τ ≤ 2.5, the temperature controls an exploration–exploitation trade-off between task coverage and final success rate.
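The two-level selection described above (sample tasks by mean uncertainty, then query the most uncertain initial observations within each sampled task) might look roughly like the sketch below. The softmax form of the task distribution, the normalization of the mean uncertainties, and the names `task_probabilities` and `select_queries` are assumptions; the source only states that τ = 0 gives uniform sampling and that larger τ favors higher-uncertainty tasks.

```python
import numpy as np

def task_probabilities(mean_uncertainty, tau):
    """Assumed form: softmax over normalized mean uncertainties.
    tau = 0 gives a uniform distribution; larger tau favors uncertain tasks.
    Assumes strictly positive uncertainties."""
    u = np.asarray(mean_uncertainty, dtype=float)
    u = u / u.sum()                      # make tau independent of score scale
    logits = tau * u
    logits = logits - logits.max()       # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def select_queries(task_uncertainties, tau, num_tasks, per_task, seed=0):
    """task_uncertainties maps task name -> per-initial-observation scores.
    Returns (task, indices of the most uncertain initial observations)."""
    rng = np.random.default_rng(seed)
    names = list(task_uncertainties)
    means = [np.mean(task_uncertainties[n]) for n in names]
    probs = task_probabilities(means, tau)
    # Sample distinct tasks; requires num_tasks <= number of tasks.
    sampled = rng.choice(len(names), size=num_tasks, replace=False, p=probs)
    queries = []
    for idx in sampled:
        scores = np.asarray(task_uncertainties[names[idx]])
        top = np.argsort(scores)[::-1][:per_task]  # highest uncertainty first
        queries.append((names[idx], top.tolist()))
    return queries

# Illustrative scores only (hypothetical task names).
scores = {
    "open_drawer": [0.2, 0.3, 0.1],
    "stack_bowls": [0.9, 0.4, 0.7],
    "turn_on_stove": [0.5, 0.6, 0.2],
}
print(task_probabilities([0.2, 0.67, 0.43], tau=0.0))
print(select_queries(scores, tau=2.5, num_tasks=2, per_task=2))
```

The selected initial observations would then be sent to the expert for demonstrations, and the ensemble would be fine-tuned on the new data together with replay data.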

Highlights

A mathematically grounded epistemic uncertainty estimator for flow-matching models based on velocity-field disagreement (VFD).
SAVE, a framework for uncertainty-guided active fine-tuning that uses VFD to prioritize tasks and initial states for expert demonstration collection.
VFD yields better-calibrated uncertainty estimates predictive of downstream task performance, with a lightweight two-member ensemble.
VFD detects deployment failures more reliably than baselines.
SAVE reduces the expert data required for multitask adaptation by at least 22%.

BibTeX

@article{romer2026uq_vla,
  title={Uncertainty Quantification for Flow-Based Vision-Language-Action Models},
  author={Ralf R{\"o}mer and Maximilian Seeliger and Saida Liu and Ben Sturgis and Marco Bagatella and Daniel Marta and Andreas Krause and Angela P. Schoellig},
  year={2026}
}