🎬 Vid2Sim 🤖: Realistic and Interactive Simulation from Video for Urban Navigation

Ziyang Xie, Zhizheng Liu, Zhenghao Peng, Wayne Wu, Bolei Zhou

Vid2Sim is a novel framework that converts monocular videos into photorealistic and physically interactive simulation environments for training embodied agents with minimal sim-to-real gap.

🚧 Installation

# Clone the repository
git clone https://github.com/Vid2Sim/Vid2Sim.git --recursive
cd Vid2Sim

# Create a new environment
conda create -n vid2sim python=3.10
conda activate vid2sim

# Install dependencies
pip install -e .

# Install reconstruction dependencies
pip install -e submodules/vid2sim-rasterizer
pip install -e submodules/vid2sim-deva-segmentation
pip install -e submodules/simple-knn

# Install RL dependencies
pip install -r src/vid2sim_rl/requirements.txt
pip install -e submodules/ml-agents/ml-agents
# [Optional] Install r3m
pip install -e submodules/r3m
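
After the installation steps above, a quick import check can confirm that the editable installs are visible to the active environment. The following is a minimal sketch, not part of the Vid2Sim codebase: the import names `simple_knn`, `mlagents`, and `r3m` are assumptions inferred from the submodule directory names and may differ from the actual package names.

```python
import importlib.util

# Assumed import names, inferred from the submodule directories.
# They are NOT confirmed by this README; adjust them to match your install.
ASSUMED_MODULES = ["simple_knn", "mlagents", "r3m"]


def check_modules(names):
    """Return a dict mapping each module name to True if Python can locate it."""
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for malformed or partially initialized modules.
            status[name] = False
    return status


if __name__ == "__main__":
    for name, found in check_modules(ASSUMED_MODULES).items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A module reported as missing may simply use a different import name, so treat the result as a hint rather than a verdict. `r3m` is optional and can be ignored if it was not installed.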

🎥 Reconstruct the simulation envs from videos

Vid2Sim transforms monocular videos into simulation environments by reconstructing scene geometry and appearance. The generated environments preserve real-world diversity and visual fidelity, which keeps the sim-to-real gap small for agent training.

👉 To get started, follow the reconstruction guide in vid2sim_recon to reconstruct the simulation environment from video.

🤖 Train the Agent in Real-to-Sim Environments

After the environment is reconstructed, Vid2Sim converts it into an interactive environment with both realistic visual appearance and physical collision, so the agent can be trained in diverse situations.

👉 To set up the environment and launch RL training, refer to vid2sim_rl.

📦 Repository Structure

Vid2Sim/
├── data/ # Source data
├── src/
│   ├── vid2sim_recon/ # Reconstruct the simulation environment from video
│   └── vid2sim_rl/ # Train the agent in real-to-sim environments
├── tools/ # Utility scripts
└── README.md # This file

📚 Vid2Sim Dataset

The Vid2Sim dataset includes 30 high-quality real-to-sim simulation environments reconstructed from video clips sourced from 9 web videos. Each clip includes 15 seconds of forward-facing video recorded at 30 fps, providing 450 frames per scene for environment reconstruction and simulation.
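
The frame count follows directly from the clip length and frame rate. The short sketch below reproduces that arithmetic; the dataset-wide total is a derived figure that assumes exactly one 15-second clip per environment, which the description implies but does not state outright.

```python
# Figures taken from the dataset description above.
CLIP_SECONDS = 15   # length of each clip
FPS = 30            # recording frame rate
NUM_SCENES = 30     # number of reconstructed environments

# Frames available for reconstructing a single scene.
frames_per_scene = CLIP_SECONDS * FPS

# Assumption: one clip per scene, so the total is a simple product.
total_frames = NUM_SCENES * frames_per_scene

print(f"frames per scene: {frames_per_scene}")
print(f"total frames (assuming one clip per scene): {total_frames}")
```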

We provide the source video data and interactive Unity environments for agent training.

Citation 📝

If you find this work useful in your research, please consider citing:

@article{xie2024vid2sim,
  title={Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation},
  author={Xie, Ziyang and Liu, Zhizheng and Peng, Zhenghao and Wu, Wayne and Zhou, Bolei},
  journal={CVPR},
  year={2025}
}
