FIPER
Failure Prediction at Runtime
Abstract
We propose FIPER, a general framework for Failure Prediction at Runtime for generative imitation learning policies without requiring failure data. FIPER combines (i) random network distillation (RND) for out-of-distribution (OOD) detection in the policy’s observation embedding space and (ii) a novel action chunk entropy (ACE) score for quantifying uncertainty in the conditional action distribution. Both failure prediction scores are calibrated on a small set of successful rollouts using conformal prediction and aggregated over short time windows. A failure alarm is triggered when both indicators exceed their respective thresholds. We evaluate FIPER across five simulation and real-world environments involving diverse failure modes. Results show that FIPER better distinguishes actual failures from benign OOD situations and predicts failures more accurately and earlier than existing approaches. We thus consider FIPER an important step towards more interpretable and safer generative robot policies.

Motivation
Recent advances in imitation learning with generative policies (diffusion, flow matching) have enabled robots to solve increasingly complex tasks. However, safe deployment remains challenging:
- Distribution shifts or compounding errors can cause unpredictable and erratic behavior, leading to task failures.
- Existing methods suffer from at least one of the following limitations:
  - They flag any OOD observation or action, which causes many false alarms.
  - They use VLMs to detect failures retrospectively, which is too late.
  - They rely on failure data, which is unsafe or impractical to collect.
  - They use only policy inputs or only policy outputs, missing crucial correlations between the two.
  - They fail to account for the multi-modal action distributions of generative policies.




Method

The design of FIPER is motivated by the insight that failures typically involve unfamiliar observations and ambiguous actions. Therefore, FIPER predicts policy failures during runtime by combining two complementary indicators:
- Observation-Based Score (RND-OE): Random Network Distillation applied in the policy’s own observation embedding space detects when current observations deviate from those seen in successful rollouts. Applying RND to the observation embeddings instead of raw observations improves robustness to irrelevant distribution shifts.
- Action Chunk Entropy (ACE): The uncertainty in a batch of action chunks is quantified via a novel entropy score in end-effector space, capturing persistent ambiguity in the robot’s intended behavior. Our proposed score is both computationally lightweight and effective at handling multi-modal action distributions.
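
To make the two indicators concrete, the following is a minimal NumPy sketch under stated assumptions, not the FIPER implementation. All function names (`make_rnd`, `fit_predictor`, `rnd_score`, `chunk_entropy`) and all sizes are hypothetical. The RND part keeps the core idea (a predictor is fitted to imitate a frozen random target network on embeddings from successful rollouts, and the prediction error serves as the OOD score), but for brevity it fits only a linear readout in closed form instead of training a network by gradient descent. The entropy part is a generic histogram entropy over sampled end-effector positions and only stands in for ACE, whose exact definition is not given on this page.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_rnd(embed_dim, hidden=128, out_dim=16):
    """Frozen random target network and frozen random predictor features."""
    target = (rng.normal(size=(embed_dim, out_dim)), rng.normal(size=out_dim))
    feats = (rng.normal(size=(embed_dim, hidden)), rng.normal(size=hidden))
    return target, feats


def _apply(x, layer):
    weight, bias = layer
    return np.tanh(x @ weight + bias)


def fit_predictor(embeddings, target, feats, ridge=1e-3):
    """Fit the predictor's linear readout to match the target on nominal data."""
    phi, y = _apply(embeddings, feats), _apply(embeddings, target)
    gram = phi.T @ phi + ridge * np.eye(phi.shape[1])
    return np.linalg.solve(gram, phi.T @ y)


def rnd_score(embeddings, target, feats, readout):
    """Per-sample prediction error; large values suggest unfamiliar embeddings."""
    error = _apply(embeddings, feats) @ readout - _apply(embeddings, target)
    return np.mean(error ** 2, axis=1)


def chunk_entropy(chunks, cell=0.05):
    """Mean per-step Shannon entropy (nats) of a batch of action chunks.

    chunks has shape (batch, horizon, 3): sampled end-effector positions.
    Positions are binned into cubes of side `cell` (an assumed resolution).
    """
    entropies = []
    for step in np.moveaxis(chunks, 1, 0):  # step has shape (batch, 3)
        cells = np.floor(step / cell).astype(int)
        _, counts = np.unique(cells, axis=0, return_counts=True)
        p = counts / counts.sum()
        entropies.append(-(p * np.log(p)).sum())
    return float(np.mean(entropies))


# Synthetic stand-ins for observation embeddings (dimension 4 is arbitrary).
target, feats = make_rnd(embed_dim=4)
nominal = 0.3 * rng.normal(size=(400, 4))
readout = fit_predictor(nominal, target, feats)
id_scores = rnd_score(0.3 * rng.normal(size=(200, 4)), target, feats, readout)
ood_scores = rnd_score(3.0 + 0.3 * rng.normal(size=(200, 4)), target, feats, readout)

# Synthetic action chunks: all samples agree vs. two equally likely modes.
agree = np.zeros((32, 8, 3))
bimodal = agree.copy()
bimodal[16:, :, 0] = 1.0
h_agree = chunk_entropy(agree)
h_bimodal = chunk_entropy(bimodal)
```

In this toy setup, shifted embeddings should receive a higher RND score than nominal ones, and a batch split between two modes yields a higher entropy than a batch in which all sampled chunks agree.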
Evaluation
We evaluate FIPER across five diverse tasks using both diffusion- and flow-based policies. Calibration relies only on a small set of successful rollouts (50 in simulation, 10 in the real world). Across all environments, FIPER consistently predicts failures earlier and more accurately than baselines, highlighting its potential to enhance uncertainty quantification and safety of generative robot policies.
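
The calibration and alarm logic can be sketched as follows. This is an illustrative sketch under assumptions, not the released code: the helper names (`window_mean`, `conformal_threshold`, `first_alarm`), the moving-average window, the use of each rollout's maximum windowed score as its nonconformity value, and the miscoverage level `alpha` are all hypothetical choices. The threshold is the standard split-conformal quantile computed from successful rollouts only, and an alarm is raised at the first step where both windowed scores exceed their thresholds.

```python
import math

import numpy as np


def window_mean(scores, w):
    """Causal moving average: entry t averages scores[max(0, t - w + 1) : t + 1]."""
    scores = np.asarray(scores, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(scores)))
    idx = np.arange(1, len(scores) + 1)
    start = np.maximum(idx - w, 0)
    return (csum[idx] - csum[start]) / (idx - start)


def conformal_threshold(rollouts, w, alpha):
    """Split-conformal threshold from per-rollout maxima of windowed scores."""
    maxima = np.sort([window_mean(r, w).max() for r in rollouts])
    n = len(maxima)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    return np.inf if k > n else float(maxima[k - 1])


def first_alarm(obs_scores, act_scores, obs_thr, act_thr, w):
    """First step at which BOTH windowed scores exceed their thresholds, else None."""
    both = (window_mean(obs_scores, w) > obs_thr) & (window_mean(act_scores, w) > act_thr)
    hits = np.flatnonzero(both)
    return int(hits[0]) if hits.size else None


# Synthetic calibration set: 10 successful rollouts with scores in [0, 1).
rng = np.random.default_rng(0)
calib_obs = [rng.uniform(0.0, 1.0, size=30) for _ in range(10)]
calib_act = [rng.uniform(0.0, 1.0, size=30) for _ in range(10)]
obs_thr = conformal_threshold(calib_obs, w=5, alpha=0.1)
act_thr = conformal_threshold(calib_act, w=5, alpha=0.1)

# Synthetic test rollout: the observation score jumps at t=10, the action score at t=15.
obs = np.r_[np.full(10, 0.5), np.full(20, 5.0)]
act = np.r_[np.full(15, 0.5), np.full(15, 5.0)]
alarm_t = first_alarm(obs, act, obs_thr, act_thr, w=5)
```

Requiring both scores to exceed their thresholds means that an unfamiliar observation alone does not trigger an alarm in this sketch; the alarm waits until the action score rises as well.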


Examples
The videos below show example rollouts of our real-world rope manipulation task (played at 10x speed). The left column shows successful rollouts where the robot manages to form a pretzel, while the right column shows failure cases where the robot fails to grasp the rope or the rope does not stay in place. As soon as FIPER predicts a failure, the frame around the video turns red.
[Video] Success Rollout | [Video] Failure Rollout (rope does not stay in place)
[Video] Success Rollout | [Video] Failure Rollout (misplacement and grasp failure)
Highlights
- FIPER predicts failures of generative robot policies at runtime without requiring failure data,
- combines observation-based and action-based indicators for fast and robust failure prediction,
- does not require access to the policy's training data and is calibrated using only a small number of successful rollouts,
- uses conformal prediction to provide statistical guarantees on prediction performance,
- outperforms existing methods in distinguishing true failures from out-of-distribution situations the policy can generalize to.
BibTeX
@inproceedings{romer2025fiper,
title={Failure Prediction at Runtime for Generative Robot Policies},
author={Ralf R{\"o}mer and Adrian Kobras and Luca Worbis and Angela P. Schoellig},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}