Ralf Römer1,*, Adrian Kobras1,*, Luca Worbis1, Angela P. Schoellig1
1Technical University of Munich
The official code repository for "Failure Prediction at Runtime for Generative Robot Policies," accepted to NeurIPS 2025.
FIPER is a general framework for predicting failures of generative robot policies across different tasks. The repository handles task initialization, dataset management, policy training, evaluation of failure prediction, and result visualization.
fiper/
├── configs/ # Configuration files for tasks, evaluation, and results
│ ├── default.yaml # Default pipeline configuration: Set methods and tasks to evaluate
│ ├── eval/ # Evaluation-specific configurations including method hyperparameters
│ └── task/ # Task-specific configurations including policy parameters
├── data/ # Directory for storing task-specific data (rollouts, models, etc.) and results
│ ├── {task}/ # Subdirectories for each task (e.g., push_t, pretzel)
│ └── results/ # Generated results
├── datasets/ # Data management
│ ├── __init__.py
│ └── rollout_datasets.py # ProcessedRolloutDataset class implementation
├── evaluation/ # Evaluation module
│ ├── __init__.py
│ ├── evaluation_manager.py # Class that manages the evaluation
│ ├── results_manager.py # Class that manages the results generation
│ └── method_eval_classes/ # Base and method-specific evaluation classes
├── scripts/ # Main scripts for running the pipeline and generating results
│ ├── run_fiper.py # Main pipeline script
│ └── results_generation.py # Script for generating summaries and visualizations of the results
├── shared_utils/ # Shared utility functions
└── rnd/ # Random Network Distillation (RND)-specific modules
# Create and activate the Conda environment
conda env create -f environment_clean.yml
conda activate fiper
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
The repository requires test and calibration rollouts generated by a generative policy. These rollouts must include all necessary data (e.g., action predictions, observation embeddings, states, RGB images, etc.) for the failure prediction methods.
Our calibration and test rollouts can be downloaded here. After downloading, place the extracted rollouts into the following directory structure:
fiper/data/{task}/rollouts
Replace {task} with the name of the respective task (e.g., push_t). After placing the rollouts, each task folder should have a subfolder rollouts with a test and a calibration subfolder.
Rollout Structure Details
Currently, it is assumed that each rollout is saved as an individual .pkl file with one of the following structures:
- Dictionary: A dictionary with two keys, metadata and rollout, where metadata is a dictionary containing the metadata of the rollout and rollout is a list whose k-th entry is a dictionary containing the necessary rollout data of the k-th rollout timestep.
- List: Only the rollout part of the Dictionary option. In this case, it is checked whether the first entry of the rollout list contains the rollout metadata.
It is recommended to provide task-specific metadata in the corresponding task configuration file. Additionally, basic information (success and rollout ID) can be extracted from the rollout filenames.
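Below is a minimal sketch of how a rollout in the Dictionary format could be created and saved. The per-timestep keys (action_pred, obs_embedding, state, rgb), the array shapes, and the filename pattern are placeholders for illustration, not the repository's exact conventions:

# Hypothetical sketch of the "Dictionary" rollout format (key names and shapes are placeholders).
import os
import pickle
import numpy as np

rollout_steps = []
for k in range(3):  # one dictionary per rollout timestep
    rollout_steps.append({
        "action_pred": np.zeros((8, 2)),               # predicted action sequence (placeholder shape)
        "obs_embedding": np.zeros(64),                  # observation embedding (placeholder shape)
        "state": np.zeros(5),                           # low-dimensional state (placeholder shape)
        "rgb": np.zeros((96, 96, 3), dtype=np.uint8),   # RGB image (placeholder shape)
    })

rollout = {
    "metadata": {"rollout_id": 0, "success": True},  # basic info; can also be encoded in the filename
    "rollout": rollout_steps,
}

# Illustrative filename; the actual naming scheme is defined by your rollout generation code.
path = "fiper/data/push_t/rollouts/calibration/rollout_0.pkl"
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "wb") as f:
    pickle.dump(rollout, f)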
Below is an overview of the key configuration components in the configs/ directory:
- default.yaml: Specifies the tasks and methods to evaluate.
- eval/: Contains the evaluation settings:
  - eval/base.yaml: Common evaluation settings.
  - eval/{method}.yaml: Method-specific hyperparameters.
- task/{task}.yaml: Contains task-specific and policy parameters, such as observation and action spaces.
- results/base.yaml: Defines how to process results and which plots to generate.
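Since the configurations are managed with Hydra, one way to inspect the composed configuration is Hydra's compose API. The sketch below assumes a recent Hydra version and uses the tasks and methods keys from default.yaml; the override values are illustrative:

# Sketch: compose and print the configuration (override values are illustrative).
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(version_base=None, config_path="fiper/configs"):
    cfg = compose(config_name="default", overrides=["tasks=[push_t]", "methods=[rnd_oe]"])
    print(OmegaConf.to_yaml(cfg))

If run_fiper.py is a standard Hydra entry point, the same overrides can typically be passed on the command line as well (e.g., python fiper/scripts/run_fiper.py tasks=[push_t]).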
Once the desired settings are configured, run FIPER:
python fiper/scripts/run_fiper.py
After the pipeline run is complete, you can generate various results and visualizations by adjusting the results/base.yaml configuration file and running:
python fiper/scripts/results_generation.py
During calibration, thresholds are calculated using Conformal Prediction (CP) based on the uncertainty scores of the calibration rollouts. During evaluation, a test rollout is flagged as failed if the uncertainty score at any step surpasses the threshold at that step.
Thresholds are controlled by:
- A quantile parameter, which defines the percentage of calibration rollouts flagged as successful.
- The window size, which indirectly influences the thresholds (see Moving Window Design).
We support the following threshold styles:
- Constant Thresholds: These thresholds are static and calculated based on the maximum uncertainty score of each calibration rollout.
  - ct_quantile: Threshold set to a specific quantile of the maximum scores across calibration rollouts. For example, the 95th percentile ensures that 95% of calibration rollouts are classified as successful.
- Time-Varying Thresholds: These thresholds vary over time and are calculated for each timestep in the calibration rollouts.
  - tvt_cp_band: A time-varying CP threshold.
  - tvt_quantile: Similar to ct_quantile, but applied at each timestep.
Extension of Time-Varying Thresholds
Since successful rollouts are typically shorter than failed ones, the calibration set may not provide thresholds for the entire length of the test rollouts. To address this, the time-varying thresholds are extended to match the maximum length of the test rollouts. This is implemented in two ways:
- Repeat Last Value (default): Use the last available threshold value for all remaining steps.
- Repeat Mean: Use the mean of the thresholds from the calibration rollouts for the remaining steps.
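The sketch below illustrates the quantile-based variants and the threshold extension described above. It is an illustrative reimplementation with assumed quantile handling, not the repository's code (in particular, tvt_cp_band is not covered):

# Sketch of quantile-based threshold calibration from calibration-rollout uncertainty scores.
import numpy as np

def ct_quantile_threshold(cal_scores, quantile=0.95):
    """Constant threshold: quantile of the per-rollout maximum scores."""
    max_scores = np.array([np.max(s) for s in cal_scores])
    return np.quantile(max_scores, quantile)

def tvt_quantile_thresholds(cal_scores, quantile=0.95, test_len=None, extend="repeat_last"):
    """Time-varying thresholds: per-timestep quantile, extended to the test-rollout length."""
    max_cal_len = max(len(s) for s in cal_scores)
    thresholds = np.array([
        np.quantile([s[t] for s in cal_scores if len(s) > t], quantile)
        for t in range(max_cal_len)
    ])
    if test_len is not None and test_len > max_cal_len:
        # Extension: repeat the last value (default) or the mean of the calibration thresholds.
        pad_value = thresholds[-1] if extend == "repeat_last" else thresholds.mean()
        thresholds = np.concatenate([thresholds, np.full(test_len - max_cal_len, pad_value)])
    return thresholds

# Example: three calibration rollouts of different lengths
cal_scores = [np.random.rand(30), np.random.rand(25), np.random.rand(40)]
print(ct_quantile_threshold(cal_scores))
print(tvt_quantile_thresholds(cal_scores, test_len=50).shape)  # (50,)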
Moving Window Design
The moving window aggregates uncertainty scores over a fixed number of past steps (defined by window_size), including the current step. This approach allows the failure predictor to consider past uncertainty scores and improves robustness by smoothing the thresholds and uncertainty scores and thus reducing sensitivity to outliers. For instance, for window_size = 5, the uncertainty score at step t is calculated as the aggregate of the scores from steps max(t-4, 0) to t.
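A minimal sketch of this windowed aggregation is shown below. The aggregation function is assumed to be the mean here; the actual aggregate used by the framework may differ:

# Sketch of the moving-window aggregation of uncertainty scores (mean aggregation assumed).
import numpy as np

def windowed_scores(scores, window_size=5):
    """Aggregate each step's score over the last `window_size` steps, including the current one."""
    scores = np.asarray(scores)
    return np.array([
        scores[max(t - window_size + 1, 0):t + 1].mean()
        for t in range(len(scores))
    ])

print(windowed_scores([0.1, 0.2, 0.9, 0.3, 0.2, 0.1], window_size=5))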
- Create a Task Configuration File: Add a task-specific configuration file in fiper/configs/task/{task_name}.yaml.
- Load Raw Rollouts: Place the raw rollouts for the task in fiper/data/{task_name}/rollouts/.
- Update the Default Configuration: Add the task to available_tasks and tasks in fiper/configs/default.yaml.
- Add a new evaluation class in fiper/evaluation/method_eval_classes/, inheriting from BaseEvalClass in fiper/evaluation/method_eval_classes/base_eval_class.py.
- Implement the calculate_uncertainty_score function to compute uncertainty scores for each rollout step based on the required elements.
- If the method requires model loading or preprocessing, implement the load_model and execute_preprocessing functions.
Naming Convention: The class name of a method evaluation class is given by f"{method_name.replace('_', '').upper()}Eval". For instance, the evaluation class of the rnd_oe method is RNDOEEval.
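The skeleton below illustrates the naming convention and the hooks mentioned above for a hypothetical method my_score. The exact signatures of these hooks are defined by BaseEvalClass and are only assumed here; the per-step key action_pred is a placeholder:

# Hypothetical skeleton: fiper/evaluation/method_eval_classes/my_score_eval.py
# (hook signatures assumed; consult base_eval_class.py for the actual interface).
import torch

from fiper.evaluation.method_eval_classes.base_eval_class import BaseEvalClass


class MYSCOREEval(BaseEvalClass):  # name follows f"{method_name.replace('_', '').upper()}Eval"

    def load_model(self):
        # Optional: load a trained model if the method needs one.
        pass

    def execute_preprocessing(self):
        # Optional: preprocess rollout data (e.g., normalization, embedding extraction).
        pass

    def calculate_uncertainty_score(self, rollout_data):
        # Required: return one uncertainty score per rollout timestep
        # (placeholder: the norm of the predicted actions at each step).
        return torch.stack([step["action_pred"].norm() for step in rollout_data])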
- Add a configuration file in fiper/configs/eval/{method_name}.yaml, inheriting from the base evaluation configuration file.
- Define the method-specific parameters in this file.
- Add the new method to the methods and implemented_methods lists in fiper/configs/default.yaml.
Workflow & Module Details
The pipeline is designed to evaluate failure prediction methods on generative robot policies. It consists of the following key stages and modules:
- Task Initialization: Use the TaskManager to load raw rollouts, extract metadata, and initialize the dataset.
- Dataset Management: Preprocess, normalize, and manage data using the ProcessedRolloutDataset class.
- Method Training: Train failure prediction models such as Random Network Distillation (RND) or entropy-based methods.
- Evaluation: Evaluate failure prediction methods using calibration and test rollouts.
- Result Management: Summarize, save, and visualize evaluation results using the ResultsManager.
The TaskManager is responsible for managing task-specific data and configurations.
- Responsibilities:
  - Loads raw rollouts from the specified directories.
  - Extracts metadata and converts rollouts into a standardized format.
  - Initializes and manages the ProcessedRolloutDataset.
- Key Features:
  - Supports task-specific configurations via Hydra.
  - Handles metadata extraction, including rollout labels (e.g., calibration, test, ID, OOD).
  - Converts raw rollouts into tensors compatible with the dataset class.
- Relevant Functions:
  - _load_and_convert_raw_rollouts: Loads and processes raw rollouts.
  - get_rollout_dataset: Initializes or updates the dataset with new rollouts.
The ProcessedRolloutDataset class manages the data required for training and evaluation.
- Responsibilities:
  - Stores and organizes rollout data as tensors.
  - Normalizes data using calibration rollouts.
  - Provides utilities for iterating over episodes and retrieving subsets of data.
- Key Features:
  - Supports metadata management, including episode indices and rollout labels.
  - Allows normalization of tensors for consistent preprocessing.
  - Enables filtering and iteration over specific subsets (e.g., calibration, test, ID, OOD).
- Relevant Functions:
  - init_dataset: Initializes the dataset with metadata and tensors.
  - normalize: Normalizes tensors using calibration rollouts.
  - iterate_episodes: Iterates over rollout episodes with optional filtering and history augmentation.
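To illustrate the kind of calibration-based normalization and filtered episode iteration described above, here is a simplified, self-contained stand-in. It is not the actual ProcessedRolloutDataset class and its method names are only loosely modeled on the ones listed:

# Simplified stand-in (not the actual ProcessedRolloutDataset) for illustration only.
import torch

class MiniRolloutDataset:
    def __init__(self, episodes, labels):
        self.episodes = episodes  # list of (T_i, D) tensors, one per rollout
        self.labels = labels      # e.g., "calibration" or "test" per rollout

    def normalize(self):
        # Compute statistics from calibration episodes only and apply them to all episodes.
        cal = torch.cat([e for e, l in zip(self.episodes, self.labels) if l == "calibration"])
        mean, std = cal.mean(0), cal.std(0).clamp_min(1e-8)
        self.episodes = [(e - mean) / std for e in self.episodes]

    def iterate_episodes(self, label=None):
        # Yield episodes, optionally filtered by label.
        for e, l in zip(self.episodes, self.labels):
            if label is None or l == label:
                yield e

ds = MiniRolloutDataset([torch.randn(30, 4), torch.randn(25, 4)], ["calibration", "test"])
ds.normalize()
for episode in ds.iterate_episodes(label="test"):
    print(episode.shape)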
Failure prediction methods are trained using the processed dataset.
- Responsibilities:
  - Trains models such as Random Network Distillation (RND) or entropy-based methods.
  - Prepares models for evaluation by calibrating thresholds or other parameters.
- Key Features:
  - Modular design allows easy integration of new training methods.
  - Supports training on various input modalities (e.g., observation embeddings, RGB images).
- Example:
  - RND-based methods use a distillation loss to train a predictor network on calibration rollouts.
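For intuition, here is a minimal, generic RND sketch: a predictor network is trained to match a frozen, randomly initialized target network on calibration data, and the prediction error later serves as an uncertainty score. The network sizes, training loop, and the use of observation embeddings as inputs are illustrative assumptions, not the repository's implementation:

# Generic Random Network Distillation (RND) sketch (illustrative sizes and data).
import torch
import torch.nn as nn

embed_dim, out_dim = 64, 32
target = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))
predictor = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))
for p in target.parameters():
    p.requires_grad_(False)  # the target network stays frozen

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
calibration_embeddings = torch.randn(256, embed_dim)  # placeholder for observation embeddings

for _ in range(100):
    # Distillation loss: match the frozen target's outputs on calibration data.
    loss = ((predictor(calibration_embeddings) - target(calibration_embeddings)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At test time, the per-step uncertainty score is the distillation error on the new embedding.
test_embedding = torch.randn(1, embed_dim)
score = ((predictor(test_embedding) - target(test_embedding)) ** 2).mean()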
The EvaluationManager handles the evaluation of failure prediction methods.
- Responsibilities:
  - Evaluates methods on calibration and test rollouts.
  - Computes evaluation metrics such as accuracy, detection time, and uncertainty scores.
- Key Features:
  - Supports multiple evaluation metrics and method-specific configurations.
  - Allows combining multiple methods for ensemble evaluations.
- Relevant Functions:
  - evaluate: Evaluates specified methods and returns results.
  - _get_method_eval_class: Loads method-specific evaluation classes.
The ResultsManager summarizes and visualizes evaluation results.
- Responsibilities:
  - Combines results from multiple tasks and methods.
  - Generates CSV summaries and visualizations of evaluation metrics.
- Key Features:
  - Creates uncertainty plots and other visualizations for analysis.
  - Supports configurable result processing via Hydra.
- Relevant Functions:
  - _create_complete_df: Combines results into a comprehensive DataFrame.
  - combine_results: Merges new results with existing ones for comparison.
To reproduce the results reported in the paper, execute the steps in Getting Started using the provided rollout datasets and the default settings (e.g., the configuration files as uploaded).
FIPER is released under the MIT License. If you find FIPER useful, please consider citing our work:
@inproceedings{romer2025fiper,
title={Failure Prediction at Runtime for Generative Robot Policies},
author={Ralf R{\"o}mer and Adrian Kobras and Luca Worbis and Angela P. Schoellig},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}