
FIPER: Failure Prediction at Runtime for Generative Robot Policies

Ralf Römer1,*, Adrian Kobras1,*, Luca Worbis1, Angela P. Schoellig1

1Technical University of Munich

The official code repository for "Failure Prediction at Runtime for Generative Robot Policies," accepted to NeurIPS 2025.

FIPER

Overview

FIPER is a general framework for predicting failures of generative robot policies across different tasks. The repository handles task initialization, dataset management, policy training, evaluation of failure prediction, and result visualization.

Repository Structure

fiper/
├── configs/                  # Configuration files for tasks, evaluation, and results
│   ├── default.yaml          # Default pipeline configuration: Set methods and tasks to evaluate
│   ├── eval/                 # Evaluation-specific configurations including method hyperparameters
│   └── task/                 # Task-specific configurations including policy parameters
├── data/                     # Directory for storing task-specific data (rollouts, models, etc.) and results
│   ├── {task}/               # Subdirectories for each task (e.g., push_t, pretzel)
│   └── results/              # Generated results
├── datasets/                 # Data management
│   ├── __init__.py
│   └── rollout_datasets.py   # ProcessedRolloutDataset class implementation
├── evaluation/               # Evaluation module
│   ├── __init__.py
│   ├── evaluation_manager.py # Class that manages the evaluation
│   ├── results_manager.py    # Class that manages the results generation
│   └── method_eval_classes/  # Base and method-specific evaluation classes
├── scripts/                  # Main scripts for running the pipeline and generating results
│   ├── run_fiper.py          # Main pipeline script
│   ├── results_generation.py # Script for generating summaries and visualizations of the results
├── shared_utils/             # Shared utility functions
├── rnd/                      # Random Network Distillation (RND)-specific modules

Getting Started

Installation 🛠️

# Create and activate the Conda environment
conda env create -f environment_clean.yml
conda activate fiper
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

Download Data 📁

The repository requires test and calibration rollouts generated by a generative policy. These rollouts must include all data required by the failure prediction methods, such as action predictions, observation embeddings, states, and RGB images.

Our calibration and test rollouts can be downloaded here. After downloading, place the extracted rollouts into the following directory structure:

fiper/data/{task}/rollouts

Replace {task} with the name of the respective task (e.g., push_t). After placing the rollouts, each task folder should have a subfolder rollouts with a test and a calibration subfolder.
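
For example, for the push_t task the expected layout is:

fiper/data/push_t/rollouts/
├── calibration/
└── test/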

Rollout Structure Details

Currently, it is assumed that each rollout is saved as an individual .pkl file with one of the following structures:

  • Dictionary: A dictionary with two keys, metadata and rollout, where metadata is a dictionary containing the rollout's metadata and rollout is a list whose k-th entry is a dictionary with the necessary rollout data for the k-th timestep.
  • List: Only the rollout part of the Dictionary option. In this case, it is checked whether the first entry of the rollout list contains the rollout metadata.

It is recommended to provide task-specific metadata in the corresponding task configuration file. Additionally, basic information (success and rollout ID) can be extracted from the rollout filenames.
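
For illustration, a minimal loader handling both structures might look as follows. The metadata check for the list variant and the per-step keys are assumptions; only the overall file layout is prescribed above.

import pickle
from pathlib import Path


def load_rollout(path: Path):
    """Load a single rollout .pkl and return (metadata, steps)."""
    with open(path, "rb") as f:
        raw = pickle.load(f)

    if isinstance(raw, dict):
        # Dictionary variant: explicit "metadata" and "rollout" keys.
        metadata, steps = raw["metadata"], raw["rollout"]
    elif isinstance(raw, list):
        # List variant: check whether the first entry holds the metadata
        # (the exact check used by FIPER may differ from this assumption).
        if raw and isinstance(raw[0], dict) and "metadata" in raw[0]:
            metadata, steps = raw[0]["metadata"], raw[1:]
        else:
            metadata, steps = {}, raw
    else:
        raise ValueError(f"Unsupported rollout format in {path}")

    return metadata, steps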

Adjusting the Pipeline Settings

Below is an overview of the key configuration components in the configs/ directory:

  • default.yaml: Specifies the tasks and methods to evaluate.
  • eval/: Contains evaluation settings:
    • eval/base.yaml: Common evaluation settings.
    • eval/{method}.yaml: Method-specific hyperparameters.
  • task/{task}.yaml: Contains task-specific and policy parameters, such as observation and action spaces.
  • results/base.yaml: Defines how to process results and which plots to generate.

Running the Pipeline

Once the desired settings are configured, run FIPER:

python fiper/scripts/run_fiper.py

Managing and Visualizing Results

After the pipeline run is complete, you can generate various results and visualizations by adjusting the results/base.yaml configuration file and running:

python fiper/scripts/results_generation.py

Evaluation Details 📊

Calibration & Threshold Design

During calibration, thresholds are calculated using Conformal Prediction (CP) based on the uncertainty scores of the calibration rollouts. During evaluation, a test rollout is flagged as failed if the uncertainty score at any step surpasses the threshold at that step.

Thresholds are controlled by:

  • A quantile parameter, which defines the percentage of calibration rollouts flagged as successful.
  • The window size, which indirectly influences the thresholds (see Moving Window Design).

We support the following threshold styles; a minimal calibration sketch follows the list:

  1. Constant Thresholds: These thresholds are static and calculated based on the maximum uncertainty score of each calibration rollout.

    • ct_quantile: Threshold set to a specific quantile of the maximum scores across calibration rollouts. For example, the 95th percentile ensures that 95% of calibration rollouts are classified as successful.
  2. Time-Varying Thresholds: These thresholds vary over time and are calculated for each timestep in the calibration rollouts.

    • tvt_cp_band: A time-varying CP threshold.
    • tvt_quantile: Similar to ct_quantile, but applied at each timestep.
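
For illustration, the two quantile-based styles could be computed roughly as follows. This is a minimal sketch: the finite-sample conformal correction shown for the constant threshold is an assumption, and FIPER's exact implementation may differ.

import numpy as np

def constant_threshold(cal_scores: list[np.ndarray], quantile: float = 0.95) -> float:
    """ct_quantile-style threshold: quantile of per-rollout maximum scores.

    cal_scores: one 1D array of uncertainty scores per calibration rollout.
    With quantile=0.95, roughly 95% of calibration rollouts fall below the
    threshold and are therefore classified as successful.
    """
    max_scores = np.array([scores.max() for scores in cal_scores])
    # Finite-sample conformal correction (an assumption; the repository's
    # exact correction may differ).
    n = len(max_scores)
    level = min(1.0, np.ceil((n + 1) * quantile) / n)
    return float(np.quantile(max_scores, level))

def time_varying_threshold(cal_scores: list[np.ndarray], quantile: float = 0.95) -> np.ndarray:
    """tvt_quantile-style threshold: the same quantile, taken per timestep."""
    horizon = max(len(s) for s in cal_scores)
    thresholds = np.full(horizon, np.nan)
    for t in range(horizon):
        # Only calibration rollouts that are still running at step t contribute.
        scores_t = np.array([s[t] for s in cal_scores if len(s) > t])
        thresholds[t] = np.quantile(scores_t, quantile)
    return thresholds
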
Extension of Time-Varying Thresholds

Since successful rollouts are typically shorter than failed ones, the calibration set may not provide thresholds for the entire length of the test rollouts. To address this, the time-varying thresholds are extended to match the maximum length of the test rollouts. This is implemented in two ways (a sketch follows the list):

  • Repeat Last Value (default): Use the last available threshold value for all remaining steps.
  • Repeat Mean: Use the mean of the thresholds from the calibration rollouts for the remaining steps.
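
A minimal sketch of the two extension strategies (the function name and signature are hypothetical):

import numpy as np

def extend_thresholds(thresholds: np.ndarray, target_len: int, mode: str = "repeat_last") -> np.ndarray:
    """Extend time-varying thresholds to the maximum test-rollout length.

    mode="repeat_last": repeat the last calibrated threshold (default).
    mode="repeat_mean": repeat the mean of the calibrated thresholds.
    """
    if target_len <= len(thresholds):
        return thresholds
    fill = thresholds[-1] if mode == "repeat_last" else thresholds.mean()
    pad = np.full(target_len - len(thresholds), fill)
    return np.concatenate([thresholds, pad])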


Moving Window Design

The moving window aggregates uncertainty scores over a fixed number of past steps (defined by window_size), including the current step. This lets the failure predictor take past uncertainty scores into account and improves robustness: smoothing both the thresholds and the uncertainty scores reduces sensitivity to outliers. For instance, with window_size = 5, the uncertainty score at step t is the aggregate of the scores from steps max(t-4, 0) to t.
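
A minimal sketch of this windowed aggregation, assuming the mean as the aggregation function (FIPER may use a different aggregate):

import numpy as np

def windowed_scores(scores: np.ndarray, window_size: int = 5, agg=np.mean) -> np.ndarray:
    """Aggregate each step's uncertainty score over the past window_size steps.

    For window_size=5, the value at step t aggregates steps max(t-4, 0)..t.
    """
    return np.array([agg(scores[max(t - window_size + 1, 0): t + 1])
                     for t in range(len(scores))])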

Adding Tasks

  1. Create a Task Configuration File: Add a task-specific configuration file in fiper/configs/task/{task_name}.yaml.

  2. Load Raw Rollouts: Place the raw rollouts for the task in fiper/data/{task_name}/rollouts/.

  3. Update the Default Configuration: Add the task to available_tasks and tasks in fiper/configs/default.yaml.

Adding Failure Prediction Methods

1. Create an Evaluation Class

  • Add a new evaluation class in fiper/evaluation/method_eval_classes/, inheriting from BaseEvalClass in fiper/evaluation/method_eval_classes/base_eval_class.py.

  • Implement the calculate_uncertainty_score function to compute uncertainty scores for each rollout step based on the required elements.

  • If the method requires model loading or preprocessing, implement the load_model and execute_preprocessing functions.

Naming Convention: The class name of a method evaluation class is given by f"{method_name.replace('_', '').upper()}Eval". For instance, the evaluation class of the rnd_oe method is RNDOEEval; a skeleton is sketched below.
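
Below is a skeleton for a hypothetical method named my_method. The class and function names (BaseEvalClass, calculate_uncertainty_score, load_model, execute_preprocessing) are those described above, but the exact method signatures, the structure of the per-step input, and the scoring rule shown here are assumptions.

# fiper/evaluation/method_eval_classes/my_method_eval_class.py (hypothetical)
import numpy as np

from fiper.evaluation.method_eval_classes.base_eval_class import BaseEvalClass


class MYMETHODEval(BaseEvalClass):
    """Evaluation class for a hypothetical method named my_method.

    Per the naming convention, "my_method".replace("_", "").upper() + "Eval"
    yields "MYMETHODEval".
    """

    def load_model(self):
        # Only needed if the method relies on a trained model.
        pass

    def execute_preprocessing(self):
        # Only needed if the method requires preprocessing of the rollouts.
        pass

    def calculate_uncertainty_score(self, rollout_step):
        # The structure of rollout_step and the key "action_pred" are
        # placeholders: here the step's sampled action predictions are
        # scored by their spread.
        actions = np.asarray(rollout_step["action_pred"])
        return float(actions.std(axis=0).mean())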

2. Create a Configuration File

  • Add a configuration file in fiper/configs/eval/{method_name}.yaml, inheriting from the base evaluation configuration file.
  • Define the method-specific parameters in this file.

3. Update the Default Configuration

  • Add the new method to the methods and implemented_methods lists in fiper/configs/default.yaml.

Workflow & Module Details

The pipeline is designed to evaluate failure prediction methods on generative robot policies. It consists of the following key stages and modules:

  1. Task Initialization:
    • Use the TaskManager to load raw rollouts, extract metadata, and initialize the dataset.
  2. Dataset Management:
    • Preprocess, normalize, and manage data using the ProcessedRolloutDataset class.
  3. Method Training:
    • Train failure prediction models such as Random Network Distillation (RND) or entropy-based methods.
  4. Evaluation:
    • Evaluate failure prediction methods using calibration and test rollouts.
  5. Result Management:
    • Summarize, save, and visualize evaluation results using the ResultsManager.

1. TaskManager

The TaskManager is responsible for managing task-specific data and configurations.

  • Responsibilities:

    • Loads raw rollouts from the specified directories.
    • Extracts metadata and converts rollouts into a standardized format.
    • Initializes and manages the ProcessedRolloutDataset.
  • Key Features:

    • Supports task-specific configurations via Hydra.
    • Handles metadata extraction, including rollout labels (e.g., calibration, test, ID, OOD).
    • Converts raw rollouts into tensors compatible with the dataset class.
  • Relevant Functions:

    • _load_and_convert_raw_rollouts: Loads and processes raw rollouts.
    • get_rollout_dataset: Initializes or updates the dataset with new rollouts.

2. ProcessedRolloutDataset

The ProcessedRolloutDataset class manages the data required for training and evaluation.

  • Responsibilities:

    • Stores and organizes rollout data as tensors.
    • Normalizes data using calibration rollouts.
    • Provides utilities for iterating over episodes and retrieving subsets of data.
  • Key Features:

    • Supports metadata management, including episode indices and rollout labels.
    • Allows normalization of tensors for consistent preprocessing.
    • Enables filtering and iteration over specific subsets (e.g., calibration, test, ID, OOD).
  • Relevant Functions:

    • init_dataset: Initializes the dataset with metadata and tensors.
    • normalize: Normalizes tensors using calibration rollouts.
    • iterate_episodes: Iterates over rollout episodes with optional filtering and history augmentation.

3. Method Training

Failure prediction methods are trained using the processed dataset.

  • Responsibilities:

    • Trains models such as Random Network Distillation (RND) or entropy-based methods.
    • Prepares models for evaluation by calibrating thresholds or other parameters.
  • Key Features:

    • Modular design allows easy integration of new training methods.
    • Supports training on various input modalities (e.g., observation embeddings, RGB images).
  • Example:

    • RND-based methods use a distillation loss to train a predictor network on calibration rollouts (see the sketch below).
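
A minimal, generic Random Network Distillation sketch (not the repository's exact architecture or training loop): a frozen, randomly initialized target network is distilled into a trainable predictor, and the prediction error serves as the uncertainty score.

import torch
import torch.nn as nn

def make_mlp(in_dim: int, out_dim: int) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

in_dim, out_dim = 64, 32                     # embedding sizes are placeholders
target = make_mlp(in_dim, out_dim)
predictor = make_mlp(in_dim, out_dim)
for p in target.parameters():                # the target network stays fixed
    p.requires_grad_(False)

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def train_step(obs_embeddings: torch.Tensor) -> float:
    """One distillation step on a batch of calibration observation embeddings."""
    loss = nn.functional.mse_loss(predictor(obs_embeddings), target(obs_embeddings))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def uncertainty_score(obs_embedding: torch.Tensor) -> torch.Tensor:
    """Per-sample prediction error; large errors indicate unfamiliar inputs."""
    with torch.no_grad():
        return ((predictor(obs_embedding) - target(obs_embedding)) ** 2).mean(dim=-1)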

4. Evaluation

The EvaluationManager handles the evaluation of failure prediction methods.

  • Responsibilities:

    • Evaluates methods on calibration and test rollouts.
    • Computes evaluation metrics such as accuracy, detection time, and uncertainty scores (a minimal sketch appears below).
  • Key Features:

    • Supports multiple evaluation metrics and method-specific configurations.
    • Allows combining multiple methods for ensemble evaluations.
  • Relevant Functions:

    • evaluate: Evaluates specified methods and returns results.
    • _get_method_eval_class: Loads method-specific evaluation classes.
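
For illustration, flagging a test rollout and computing a simple accuracy could look as follows; the exact metric definitions used by FIPER may differ.

import numpy as np

def predict_failure(scores: np.ndarray, thresholds: np.ndarray):
    """Flag a rollout as failed at the first step whose score exceeds the
    (possibly extended) threshold; return (predicted_failure, detection_step)."""
    exceed = scores > thresholds[: len(scores)]
    if exceed.any():
        return True, int(np.argmax(exceed))
    return False, None

def accuracy(predictions: list[bool], labels: list[bool]) -> float:
    """Fraction of test rollouts whose predicted outcome matches the label
    (True = failure)."""
    return float(np.mean(np.array(predictions) == np.array(labels)))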

5. ResultsManager

The ResultsManager summarizes and visualizes evaluation results.

  • Responsibilities:

    • Combines results from multiple tasks and methods.
    • Generates CSV summaries and visualizations of evaluation metrics (a minimal sketch appears below).
  • Key Features:

    • Creates uncertainty plots and other visualizations for analysis.
    • Supports configurable result processing via Hydra.
  • Relevant Functions:

    • _create_complete_df: Combines results into a comprehensive DataFrame.
    • combine_results: Merges new results with existing ones for comparison.
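
As a rough illustration (not the actual ResultsManager API), per-task and per-method metrics could be flattened into a DataFrame and written to CSV like this, assuming pandas is available:

import pandas as pd

def summarize(results: dict) -> pd.DataFrame:
    """Flatten {task: {method: {metric: value}}} into a DataFrame and write a CSV summary."""
    rows = [
        {"task": task, "method": method, **metrics}
        for task, per_method in results.items()
        for method, metrics in per_method.items()
    ]
    df = pd.DataFrame(rows)
    df.to_csv("results_summary.csv", index=False)
    return df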

Reproducing Results 🖥️

To reproduce the results reported in the paper, execute the steps in Getting Started using the provided rollout datasets and the default settings (e.g., the configuration files as uploaded).

Citation 🏷️

FIPER is released under the MIT License. If you find FIPER useful, please consider citing our work:

@inproceedings{romer2025fiper,
  title     = {Failure Prediction at Runtime for Generative Robot Policies},
  author    = {Ralf R{\"o}mer and Adrian Kobras and Luca Worbis and Angela P. Schoellig},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}
