Failure Prediction at Runtime for Generative Robot Policies

Römer, Ralf; Kobras, Adrian; Worbis, Luca; Schoellig, Angela P.

NeurIPS 2025

Ralf Römer^*,1, Adrian Kobras^*,1, Luca Worbis¹, Angela P. Schoellig¹

¹ Learning Systems and Robotics Lab, Technical University of Munich
^*Equal Contribution

FIPER

Failure Prediction at Runtime

Abstract

We propose FIPER, a general framework for Failure Prediction at Runtime for generative imitation learning policies without requiring failure data. FIPER combines (i) random network distillation (RND) for out-of-distribution (OOD) detection in the policy’s observation embedding space and (ii) a novel action chunk entropy (ACE) score for quantifying uncertainty in the conditional action distribution. Both failure prediction scores are calibrated on a small set of successful rollouts using conformal prediction and aggregated over short time windows. A failure alarm is triggered when both indicators exceed their respective thresholds. We evaluate FIPER across five simulation and real-world environments involving diverse failure modes. Results show that FIPER better distinguishes actual failures from benign OOD situations and predicts failures more accurately and earlier than existing approaches. We thus consider FIPER an important step towards more interpretable and safer generative robot policies.

Motivation

Recent advances in imitation learning with generative policies (diffusion, flow matching) have enabled robots to solve increasingly complex tasks. However, safe deployment remains challenging:

Distribution shifts or compounding errors can cause unpredictable and erratic behavior, leading to task failures.
Existing methods have at least one of the following limitations:
- They flag any OOD observation or action, causing many false alarms
- They use VLMs to detect failures retrospectively, which is too late
- They rely on the existence of failure data, which is unsafe or impractical to collect
- They use only policy inputs or only policy outputs, missing crucial correlations between them
- They fail to account for the multi-modal action distributions of generative policies

We address these challenges with FIPER, a framework for failure prediction at runtime that provides early and accurate warnings without requiring failure examples.

Figure 1: Generative models are able to learn multi-modal action distributions. In contrast to other methods, FIPER is specifically designed to handle such multi-modal action distributions.

Method

The design of FIPER is motivated by the insight that failures typically involve unfamiliar observations and ambiguous actions. Therefore, FIPER predicts policy failures during runtime by combining two complementary indicators:

Observation-Based Score (RND-OE): Random Network Distillation applied in the policy’s own observation embedding space detects when current observations deviate from those seen in successful rollouts. Applying RND to the observation embeddings instead of raw observations improves robustness to irrelevant distribution shifts.
Action Chunk Entropy (ACE): The uncertainty in a batch of action chunks is quantified via a novel entropy score in end-effector space, capturing persistent ambiguity in the robot’s intended behavior. Our proposed score is both computationally lightweight and effective at handling multi-modal action distributions.
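
To make the two indicators more concrete, the following is a minimal pure-Python sketch, not the implementation from the paper. It assumes that observation embeddings are already available as lists of floats and that each sampled action chunk is a list of end-effector positions. The helper names (`make_target`, `fit_predictor`, `rnd_score`, `action_chunk_entropy`), the linear predictor, and the histogram binning with `bin_size` are all illustrative assumptions; the exact ACE definition is given in the paper.

```python
import math
import random
from collections import Counter


def make_target(dim_in, dim_out, rng):
    """Frozen random target network x -> tanh(W x). It is never trained."""
    w = [[rng.gauss(0.0, 1.0) for _ in range(dim_in)] for _ in range(dim_out)]
    return lambda x: [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]


def fit_predictor(target, embeddings, epochs=200, lr=0.1):
    """Fit a linear predictor by SGD to imitate the target on nominal embeddings.

    Assumption: a linear predictor keeps the sketch short; RND normally
    uses a second neural network here.
    """
    dim_in = len(embeddings[0])
    dim_out = len(target(embeddings[0]))
    p = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(epochs):
        for x in embeddings:
            y = target(x)
            for j in range(dim_out):
                err = sum(pj * xi for pj, xi in zip(p[j], x)) - y[j]
                for i in range(dim_in):
                    p[j][i] -= lr * err * x[i]
    return lambda x: [sum(pj * xi for pj, xi in zip(row, x)) for row in p]


def rnd_score(target, predictor, x):
    """Squared prediction error between the target and predictor outputs for x."""
    return sum((a - b) ** 2 for a, b in zip(target(x), predictor(x)))


def action_chunk_entropy(chunks, bin_size=0.05):
    """Mean Shannon entropy (in nats) of binned end-effector positions.

    chunks: batch of sampled action chunks, each a list of (x, y, z) positions.
    Assumption: positions are discretized into cubes of side bin_size and the
    entropy of the bin histogram is averaged over the chunk time steps.
    """
    horizon = len(chunks[0])
    total = 0.0
    for t in range(horizon):
        bins = Counter(
            tuple(math.floor(c / bin_size) for c in chunk[t]) for chunk in chunks
        )
        n = sum(bins.values())
        total -= sum((k / n) * math.log(k / n) for k in bins.values())
    return total / horizon
```

In this sketch, a batch of identical chunks has zero entropy, while a batch split evenly between two distinct modes has entropy log 2, so the score grows with the spread of sampled actions across bins. How such entropy is turned into a measure of persistent ambiguity, rather than benign multi-modality, is part of the method described in the paper and is not reproduced here.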

Both scores are aggregated over short moving windows and calibrated with conformal prediction using only a small number of successful trajectories. A failure alarm is raised only when both indicators exceed their respective thresholds, yielding early and robust prediction while avoiding false alarms on benign OOD states.
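
The calibration and alarm logic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: it assumes a causal moving average as the window aggregation, a single time-independent threshold per score, and the standard split conformal quantile computed from the peak windowed score of each successful calibration rollout. The function names and the parameters `window` and `alpha` are hypothetical.

```python
import math


def moving_average(scores, window):
    """Causal moving average over the last `window` steps (assumed aggregation)."""
    out = []
    for t in range(len(scores)):
        recent = scores[max(0, t - window + 1): t + 1]
        out.append(sum(recent) / len(recent))
    return out


def conformal_threshold(calib_rollouts, window, alpha):
    """Split conformal threshold from successful rollouts only.

    calib_rollouts: list of per-step score sequences, one per successful rollout.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest peak windowed score,
    or infinity if n is too small for the requested alpha.
    """
    peaks = sorted(max(moving_average(r, window)) for r in calib_rollouts)
    n = len(peaks)
    k = math.ceil((n + 1) * (1 - alpha))
    return peaks[k - 1] if k <= n else math.inf


def first_alarm(rnd_scores, ace_scores, rnd_thr, ace_thr, window):
    """Return the first step at which BOTH windowed scores exceed their thresholds."""
    rnd_w = moving_average(rnd_scores, window)
    ace_w = moving_average(ace_scores, window)
    for t, (r, a) in enumerate(zip(rnd_w, ace_w)):
        if r > rnd_thr and a > ace_thr:
            return t
    return None
```

Requiring both conditions means that a rollout with unfamiliar observations but unambiguous actions (high observation score, low entropy score) does not trigger an alarm in this sketch, which mirrors the intent of avoiding false alarms on benign OOD states.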

Evaluation

We evaluate FIPER across five diverse tasks using both diffusion- and flow-based policies. Calibration relies only on a small set of successful rollouts (50 in simulation, 10 in the real world). Across all environments, FIPER consistently predicts failures earlier and more accurately than baselines, highlighting its potential to enhance uncertainty quantification and safety of generative robot policies.

Examples

The videos below show example rollouts of our real-world rope manipulation task (shown at 10x speed). The left column shows successful rollouts where the robot manages to form a pretzel, while the right column shows failure cases where the robot fails to grasp the rope or the rope does not stay in place. As soon as FIPER predicts a failure, the frame around the video turns red.

Success Rollout

Failure Rollout

(rope does not stay in place)

Success Rollout

Failure Rollout

(misplacement and grasp failure)

PushT

Although OOD situations and failures are often correlated, accurate failure prediction requires differentiating between out-of-distribution (OOD) situations to which the policy can generalize and actual failures. In the PushT environment, OOD conditions are created by varying the scale and dimensions of the T-object. Notably, the policy can also fail under in-distribution (ID) conditions. Compared to baselines, our scores RND-OE and ACE better distinguish true failures from OOD situations the policy can generalize to.

Success Rollout

(ID conditions)

Success Rollout

(OOD conditions)

Failure Rollout

(ID conditions)

Failure Rollout

(OOD conditions)

Sorting

For the sorting task, OOD conditions are created by varying the size of the cubes. As soon as FIPER predicts a failure, the frame around the video turns red.

Success Rollout

(ID conditions)

Success Rollout

(OOD conditions)

Failure Rollout

(ID conditions)

Failure Rollout

(OOD conditions)

Highlights

FIPER predicts failures of generative robot policies at runtime without requiring failure data,
combines observation-based and action-based indicators for fast and robust failure prediction,
does not require access to the policy's training data and is calibrated using only a small number of successful rollouts,
uses conformal prediction to provide statistical guarantees on prediction performance,
and outperforms existing methods in distinguishing true failures from out-of-distribution situations the policy can generalize to.

For more, check out our paper.

BibTeX

@inproceedings{romer2025fiper,
  title={Failure Prediction at Runtime for Generative Robot Policies},
  author={Ralf R{\"o}mer and Adrian Kobras and Luca Worbis and Angela P. Schoellig},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}