Jiani Huang • Amish Sethi • Matthew Kuo • Mayank Keoliya • Neelay Velingker • JungHo Jung • Ser-Nam Lim • Ziyang Li • Mayur Naik
University of Pennsylvania · Johns Hopkins University · University of Central Florida
ESCA contextualizes task planners with grounded visual features represented as scene graphs, enabling more precise and context-aware embodied agent decision-making.
- [2025.10.28] 🎉 ESCA, which demonstrates the use of the LASER model in an embodied environment, has been accepted as a NeurIPS 2025 Spotlight!
- [2025.08.30] 🤗 We have open-sourced our scene graph generation model.
- [2025.08.30] 📊 We have open-sourced our training data.
We introduce ESCA (Embodied and Scene-Graph Contextualized Agent), a framework designed to contextualize Multi-modal Large Language Models (MLLMs) through open-domain scene graph generation. ESCA provides structured visual grounding, helping MLLMs make sense of complex and ambiguous sensory environments. At its core is SGClip, a CLIP-based model that captures semantic visual features, including entity classes, physical attributes, actions, interactions, and inter-object relations.
- 🛠️ Structured Scene Understanding: ESCA decomposes visual understanding into four modular stages: concept extraction, object identification, scene graph prediction, and visual summarization.
- 🎯 SGClip Model: A CLIP-based foundation model for structured scene understanding that supports open-domain concept coverage and probabilistic predictions.
- ⚡ Transfer Protocol: A general transfer protocol based on customizable prompt templates that enables ESCA to generalize across different downstream tasks.
- 🏹 ESCA-Video-87K Dataset: A large-scale dataset with 87K video clips, paired with natural language captions, object traces, and spatial-temporal programmatic specifications.
- 🔧 Neurosymbolic Learning: A model-driven, self-supervised learning pipeline that eliminates the need for manual scene graph annotations.
Note: you need to install three conda environments: one for EB-ALFRED and EB-Habitat, one for EB-Navigation, and one for EB-Manipulation. Please clone via SSH instead of HTTPS to avoid errors during git lfs pull.
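If you have already cloned the repository over HTTPS, you can switch the origin remote to SSH with a standard git command (same URL as the clone step below):
git remote set-url origin git@github.com:EmbodiedBench/EmbodiedBench.git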
Download repo
git clone git@github.com:EmbodiedBench/EmbodiedBench.git
cd EmbodiedBench
You have two options for installation: you can either use bash install.sh or manually run the provided commands. After completing the installation with bash install.sh, you will need to start the headless server and verify that each environment is properly set up.
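For reference, these are the headless-server and verification commands covered in detail below (run each with the matching conda environment activated):
# start the X server for headless rendering (in a separate tmux window)
python -m embodiedbench.envs.eb_alfred.scripts.startx 1
# verify EB-Manipulation and EB-Navigation
python -m embodiedbench.envs.eb_manipulation.EBManEnv
python -m embodiedbench.envs.eb_navigation.EBNavEnv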
1️⃣ Environment for Habitat and Alfred
conda env create -f conda_envs/environment.yaml
conda activate embench
pip install -e .
2️⃣ Environment for EB-Navigation
conda env create -f conda_envs/environment_eb-nav.yaml
conda activate embench_nav
pip install -e .
3️⃣ Environment for EB-Manipulation
conda env create -f conda_envs/environment_eb-man.yaml
conda activate embench_man
pip install -e .
- Install Coppelia Simulator
CoppeliaSim V4.1.0 is required for Ubuntu 20.04; other versions can be found at https://www.coppeliarobotics.com/previousVersions#
conda activate embench_man
cd embodiedbench/envs/eb_manipulation
wget https://downloads.coppeliarobotics.com/V4_1_0/CoppeliaSim_Pro_V4_1_0_Ubuntu20_04.tar.xz
tar -xf CoppeliaSim_Pro_V4_1_0_Ubuntu20_04.tar.xz
rm CoppeliaSim_Pro_V4_1_0_Ubuntu20_04.tar.xz
mv CoppeliaSim_Pro_V4_1_0_Ubuntu20_04/ /PATH/YOU/WANT/TO/PLACE/COPPELIASIM
- Add the following to your ~/.bashrc file:
export COPPELIASIM_ROOT=/PATH/YOU/WANT/TO/PLACE/COPPELIASIM
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$COPPELIASIM_ROOT
export QT_QPA_PLATFORM_PLUGIN_PATH=$COPPELIASIM_ROOT
Remember to source your bashrc (source ~/.bashrc) or zshrc (source ~/.zshrc) after this.
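A quick sanity check that the variables are visible in your current shell:
source ~/.bashrc
echo $COPPELIASIM_ROOT   # should print /PATH/YOU/WANT/TO/PLACE/COPPELIASIM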
- Install the PyRep and EB-Manipulation packages and the dataset:
git clone https://github.com/stepjam/PyRep.git
cd PyRep
pip install -r requirements.txt
pip install -e .
cd ..
pip install -r requirements.txt
pip install -e .
cp ./simAddOnScript_PyRep.lua $COPPELIASIM_ROOT
git clone https://huggingface.co/datasets/EmbodiedBench/EB-Manipulation
mv EB-Manipulation/data/ ./
rm -rf EB-Manipulation/
cd ../../..
Remember that whenever you reinstall PyRep, simAddOnScript_PyRep.lua will be overwritten; you should then copy it again.
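For example, after reinstalling PyRep (assuming the Lua file still lives under embodiedbench/envs/eb_manipulation, as in the steps above):
# run from the EmbodiedBench repo root
cp embodiedbench/envs/eb_manipulation/simAddOnScript_PyRep.lua $COPPELIASIM_ROOT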
- Run the following code to ensure EB-Manipulation is working correctly (start the headless server first if you have not already):
conda activate embench_man
export DISPLAY=:1
python -m embodiedbench.envs.eb_manipulation.EBManEnv
Note: EB-ALFRED, EB-Habitat, and EB-Manipulation require downloading large datasets from Hugging Face or GitHub repositories. Ensure Git LFS is properly initialized by running the following commands:
git lfs install
git lfs pull
Please run the startx.py script before running experiments on headless servers. The server should be started in a separate tmux window. We use X_DISPLAY id=1 by default.
python -m embodiedbench.envs.eb_alfred.scripts.startx 1
Download the dataset from Hugging Face.
# Create checkpoints directory
mkdir -p GroundingDINO/checkpoints
# Download the model checkpoint
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0/groundingdino_swint_ogc.pth -P GroundingDINO/checkpoints/
- Set the environment variable for the GroundingDINO path:
export GROUNDING_DINO_PATH=/path/to/your/GroundingDINO
- Clone the SAM2 repository inside the EmbodiedBench folder:
git clone https://github.com/facebookresearch/sam2.git
- Follow the instructions in SAM2's README to finish the setup.
- Set the environment variable for the SAM2 repo path:
export SAM_REPO_PATH=/path/to/your/SAM2
Run the following code to ensure the EB-Navigation environment is working correctly:
conda activate embench_nav
python -m embodiedbench.envs.eb_navigation.EBNavEnv
Move the downloaded model file into the EmbodiedBench models directory:
mv /path/to/downloaded/model_file.model /path/to/EmbodiedBench/models/
Please run the startx.py script before running experiments on headless servers. The server should be started in a separate tmux window. We use X_DISPLAY id=1 by default.
python -m embodiedbench.envs.eb_alfred.scripts.startx 1
EmbodiedBench now uses a task-based structure: evaluators are organized by task (alfred, navigation, habitat, manipulation), each with three variants:
- base: Baseline VLM implementation
- gd: Grounding DINO integration
- esca: Enhanced Scene Context Awareness (LASER + Grounding DINO)
Run the baseline evaluator:
conda activate embench
python -m embodiedbench.evaluator.alfred.base
Run with Grounding DINO:
conda activate embench
python -m embodiedbench.evaluator.alfred.gd
Run with ESCA (recommended):
conda activate embench
python -m embodiedbench.evaluator.alfred.esca
Run the baseline evaluator:
conda activate embench_nav
python -m embodiedbench.evaluator.navigation.base
Run with Grounding DINO:
conda activate embench_nav
python -m embodiedbench.evaluator.navigation.gd
Run with ESCA (recommended):
conda activate embench_nav
python -m embodiedbench.evaluator.navigation.esca
Run the Habitat baseline evaluator:
conda activate embench
python -m embodiedbench.evaluator.habitat.base
Run the baseline evaluator:
conda activate embench_man
python -m embodiedbench.evaluator.manipulation.base
Run with VLA:
conda activate embench_man
python -m embodiedbench.evaluator.manipulation.vla
You can also use the unified main.py interface:
conda activate embench
python -m embodiedbench.main env=eb-alf model_name=Qwen/Qwen2-VL-7B-Instruct model_type=local exp_name='baseline' tp=1
python -m embodiedbench.main env=eb-hab model_name=OpenGVLab/InternVL2_5-8B model_type=local exp_name='baseline' tp=1
All evaluators support various configuration options through command-line arguments or config files. Key parameters include:
- model_name: The MLLM to use (e.g., 'gpt-4o', 'gemini-2.0-flash')
- n_shots: Number of examples for few-shot learning
- detection_box: Enable/disable detection box visualization
- sg_text: Enable/disable scene graph text output
- gd_only: Use only Grounding DINO for object detection, without ESCA's scene graph generation
- top_k: Number of top predictions to consider
- aggr_thres: Aggregation threshold for predictions
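For example, a hypothetical invocation passing a few of these options through the unified interface, assuming they follow the same key=value syntax as the main.py examples above (check the config files for exact names and defaults):
conda activate embench
# values shown are illustrative only
python -m embodiedbench.main env=eb-alf model_name='gpt-4o' exp_name='esca_run' n_shots=1 detection_box=True sg_text=True top_k=5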
Our experiments are built on the EmbodiedBench benchmark; if you use ESCA or EmbodiedBench in your research, please consider citing the works below.
@inproceedings{huang2025esca,
title={ESCA: Contextualizing Embodied Agents via Scene-Graph Generation},
author={Jiani Huang and Amish Sethi and Matthew Kuo and Mayank Keoliya and Neelay Velingker and JungHo Jung and Ser-Nam Lim and Ziyang Li and Mayur Naik},
year={2025},
booktitle={The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
}
@inproceedings{yang2025embodiedbench,
title={EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents},
author={Rui Yang and Hanyang Chen and Junyu Zhang and Mark Zhao and Cheng Qian and Kangrui Wang and Qineng Wang and Teja Venkat Koripella and Marziyeh Movahedi and Manling Li and Heng Ji and Huan Zhang and Tong Zhang},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=DgGF2LEBPS}
}