InSpatio-World

Requirements

  • Python 3.10
  • CUDA 12.1

1. Create conda environment:

conda env create -f environment.yml
conda activate inspatio_world

2. Install flash-attn:

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Model Weights

Download the following model checkpoints into the checkpoints/ directory:

Model | Purpose | Source
InSpatio-World-1.3B | v2v inference, 1.3B (Step 3) | HuggingFace
Wan2.1-T2V-1.3B | Text encoder + VAE + base model for 1.3B (Step 3) | HuggingFace
DA3 (Depth-Anything-3) | Depth estimation (Step 2) | HuggingFace
Florence-2-large | Video captioning (Step 1) | HuggingFace
TAEHV | Speed up (Optional) | GitHub

bash scripts/download.sh

Expected directory structure after downloading:

checkpoints/
├── InSpatio-World-1.3B/
│   └── InSpatio-World-1.3B.safetensors
├── Wan2.1-T2V-1.3B/
├── DA3/
├── Florence-2-large/
└── taehv/
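
Before running the pipeline it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the function name `missing_checkpoints` is hypothetical, and treating `taehv/` as optional is an assumption based on the "Speed up (Optional)" entry in the table.

```python
from pathlib import Path

# Entries taken from the directory listing above.
REQUIRED = [
    "InSpatio-World-1.3B/InSpatio-World-1.3B.safetensors",
    "Wan2.1-T2V-1.3B",
    "DA3",
    "Florence-2-large",
]
# Assumption: taehv/ is only needed when TAE is used for speed-up.
OPTIONAL = ["taehv"]


def missing_checkpoints(root="checkpoints", include_optional=False):
    """Return the expected entries that do not exist under root."""
    names = REQUIRED + (OPTIONAL if include_optional else [])
    return [name for name in names if not (Path(root) / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing under checkpoints/:", ", ".join(missing))
    else:
        print("All required checkpoints found.")
```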

Inference

The full pipeline runs in three steps:

  1. Step 1: Generate video captions using Florence-2.
  2. Step 2: Estimate depth with DA3, convert to inference format, and render point clouds.
  3. Step 3: Run InSpatio-World v2v inference.

All steps are wrapped in a single script:

bash run_test_pipeline.sh \
  --input_dir ./test/example \
  --traj_txt_path ./traj/x_y_circle_cycle.txt 

Quick Start

# 1. Place your .mp4 video(s) in a folder
mkdir -p my_videos
cp your_video.mp4 my_videos/

# 2. Run the full pipeline
bash run_test_pipeline.sh \
  --input_dir ./my_videos \
  --traj_txt_path ./traj/x_y_circle_cycle.txt

# 3. Results will be saved to ./output/my_videos/x_y_circle_cycle/
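
The output location in the example follows the documented default pattern ./output/<name>/<traj>. As a rough sketch, assuming <name> is the input folder's base name and <traj> is the trajectory file name without its extension (both inferred from the example above, not confirmed against the script), the default path can be predicted like this. The helper `default_output_dir` is hypothetical.

```python
from pathlib import Path


def default_output_dir(input_dir, traj_txt_path):
    """Predict the default output folder, assuming ./output/<name>/<traj>."""
    name = Path(input_dir).name       # e.g. "my_videos"
    traj = Path(traj_txt_path).stem   # e.g. "x_y_circle_cycle"
    return (Path("output") / name / traj).as_posix()


print(default_output_dir("./my_videos", "./traj/x_y_circle_cycle.txt"))
# prints "output/my_videos/x_y_circle_cycle"
```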

Trajectory Control

The --traj_txt_path argument controls the camera trajectory for novel-view synthesis. Predefined trajectories are provided in the traj/ directory:

File | Motion
x_y_circle_cycle.txt | Cyclic combined pitch + yaw orbit
zoom_out_in.txt | Dolly zoom out + dolly zoom in

Trajectory File Format

A trajectory file is a plain text file with 3 lines, each containing space-separated keyframe values that are automatically interpolated to match the output frame count:

<line 1>  pitch (degrees): positive = orbit up, negative = orbit down
<line 2>  yaw (degrees):   positive = orbit left, negative = orbit right
<line 3>  displacement:    relative camera displacement scale

Line 3 (displacement) is a relative scale multiplied by the scene's estimated foreground depth:

  • When pitch/yaw are non-zero, it controls the orbit radius (typically set to 1)
  • When both pitch and yaw are zero, it becomes a dolly zoom: positive = move forward (zoom in), negative = move backward (zoom out)
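
A custom trajectory can be written with a few lines of Python. The sketch below is illustrative only: `write_trajectory` and `resample` are hypothetical helpers, and the linear interpolation in `resample` is an assumption about how keyframes might be expanded to the output frame count (the pipeline's actual interpolation scheme is not specified here). The example also uses the same number of keyframes on every line, which may or may not be required.

```python
def write_trajectory(path, pitch, yaw, displacement):
    """Write three lines of space-separated keyframes: pitch, yaw, displacement."""
    with open(path, "w") as f:
        for row in (pitch, yaw, displacement):
            f.write(" ".join(f"{v:g}" for v in row) + "\n")


def resample(keyframes, num_frames):
    """Linearly interpolate keyframes to num_frames values (assumed scheme)."""
    if num_frames == 1 or len(keyframes) == 1:
        return [float(keyframes[0])] * num_frames
    last = len(keyframes) - 1
    out = []
    for i in range(num_frames):
        t = i * last / (num_frames - 1)   # position in keyframe space
        lo = min(int(t), last - 1)        # index of the left keyframe
        frac = t - lo                     # blend weight toward the right keyframe
        out.append(keyframes[lo] * (1 - frac) + keyframes[lo + 1] * frac)
    return out


# Yaw keyframes 0 -> 30 -> 0 degrees expanded to five frames.
print(resample([0, 30, 0], 5))  # prints [0.0, 15.0, 30.0, 15.0, 0.0]
```

For instance, `write_trajectory("my_yaw_sweep.txt", [0, 0, 0], [0, 30, 0], [1, 1, 1])` produces a file with zero pitch, a yaw sweep to 30 degrees and back, and an orbit radius scale of 1; the file name is arbitrary.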

All Arguments

Argument | Required | Default | Description
--input_dir | Yes | | Input folder containing .mp4 files
--traj_txt_path | Yes | | Trajectory file (e.g. ./traj/x_y_circle_cycle.txt)
--checkpoint_path | No | ./checkpoints/InSpatio-World/InSpatio-World.safetensors | InSpatio-World checkpoint
--config_path | No | configs/inference.yaml | Config file (inference_1.3b.yaml for 1.3B)
--da3_model_path | No | ./checkpoints/DA3 | DA3 depth model path
--florence_model_path | No | ./checkpoints/Florence-2-large | Florence-2 model path
--step1_gpus | No | 0 | GPU ID(s) for Step 1 (comma-separated for parallel)
--step2_gpus | No | 0 | GPU ID(s) for Step 2 (comma-separated for parallel)
--step3_gpus | No | 0 | GPU ID(s) for Step 3
--step3_nproc | No | 1 | Number of GPUs for Step 3
--output_folder | No | ./output/<name>/<traj> | Custom output directory
--master_port | No | 29513 | Master port for torchrun (Step 3)
--skip_step1 | No | false | Skip caption generation
--skip_step2 | No | false | Skip depth estimation
--skip_step3 | No | false | Skip v2v inference
--relative_to_source | No | false | Compose trajectory poses relative to initial view
--rotation_only | No | false | Only apply rotation from trajectory, ignore translation (tripod pan/tilt)
--disable_adaptive_frame | No | false | Disable adaptive frame expansion/subsampling (use original frame count as-is)
--freeze_repeat | No | 0 | Repeat a specific frame N extra times to create a time-freeze (pause) effect
--freeze_frame | No | middle frame | Frame index to freeze; defaults to the middle frame if not specified
--use_tae | No | false | Use Tiny Auto Encoder (TAE) instead of WanVAE
--tae_checkpoint_path | No | ./checkpoints/taehv/taew2_1.pth | Path to TAE checkpoint file (required when --use_tae is set)
--compile_dit | No | false | Apply torch.compile to the DiT model
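
When the pipeline is launched from Python, for example in a batch job, the command line can be assembled from the arguments listed above. This is a sketch under assumptions: `build_command` is a hypothetical helper, and it assumes that boolean options are passed as bare flags (as in the --skip_step1 and --use_tae examples in this README) while all other options take a single value.

```python
import shlex


def build_command(input_dir, traj_txt_path, **options):
    """Build argv for run_test_pipeline.sh.

    True adds a bare flag, False or None omits the option,
    and any other value is passed as the option's argument.
    """
    argv = [
        "bash", "run_test_pipeline.sh",
        "--input_dir", str(input_dir),
        "--traj_txt_path", str(traj_txt_path),
    ]
    for name, value in options.items():
        if value is None or value is False:
            continue
        argv.append(f"--{name}")
        if value is not True:
            argv.append(str(value))
    return argv


cmd = build_command(
    "./my_videos",
    "./traj/zoom_out_in.txt",
    use_tae=True,
    freeze_repeat=150,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.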

Skip Already-Completed Steps

If Step 1 or Step 2 outputs already exist, you can skip them:

bash run_test_pipeline.sh \
  --input_dir ./my_videos \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --skip_step1 --skip_step2

Generate Temporal Control Videos

bash run_test_pipeline.sh \
  --input_dir ./test/example \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --freeze_repeat 150 \
  --output_folder ./output/example_freeze_repeat_150 \
  --disable_adaptive_frame

Two parameters control the time-freeze effect: --freeze_frame selects the frame to freeze (the middle frame by default), and --freeze_repeat sets how many extra times that frame is repeated, which determines the length of the pause.
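
The effect of the two freeze parameters on the frame sequence can be illustrated as follows. This is only a sketch of the documented semantics, not the pipeline's code: `freeze_indices` is a hypothetical helper, and zero-based indexing and the choice of `num_frames // 2` as the middle frame are assumptions.

```python
def freeze_indices(num_frames, freeze_repeat=0, freeze_frame=None):
    """Return source-frame indices with one frame repeated freeze_repeat extra times."""
    if freeze_frame is None:
        freeze_frame = num_frames // 2  # assumed definition of "middle frame"
    if not 0 <= freeze_frame < num_frames:
        raise ValueError("freeze_frame is out of range")
    indices = list(range(num_frames))
    return (
        indices[: freeze_frame + 1]
        + [freeze_frame] * freeze_repeat
        + indices[freeze_frame + 1:]
    )


print(freeze_indices(5, freeze_repeat=2))  # prints [0, 1, 2, 2, 2, 3, 4]
```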

Autonomous Driving Applications

bash run_test_pipeline.sh \
  --input_dir ./test/example3 \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --relative_to_source \
  --rotation_only \
  --disable_adaptive_frame

Speed Up

bash run_test_pipeline.sh \
  --input_dir ./test/example \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --use_tae \
  --disable_adaptive_frame 

Switching from the VAE to TAE with --use_tae accelerates the pipeline. Adding --compile_dit speeds it up further, reaching 24 fps with the 1.3B model on an H-series NVIDIA GPU. Note that --compile_dit requires a relatively long warm-up the first time it is triggered, so it is best suited to service deployments where maximum speed matters.

License

This project is licensed under the Apache-2.0 License. This license applies only to the code in our library; its dependencies and submodules (Depth-Anything-3, Florence-2, TAEHV) are separate and individually licensed.


Citation

If you use InSpatio-World in your research, please use the following BibTeX entry.

@misc{inspatio-world,
    title={INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling},
    author={InSpatio Team},
    journal={arXiv preprint arXiv:2604.07209},
    year={2026}
}

Acknowledgement

InSpatio-World uses a backbone based on Wan2.1, and its training code references Self-Forcing. The TAE component used for inference speed-up is built upon TAEHV. We sincerely thank the Self-Forcing, Wan and TAEHV teams for their foundational work and open-source contributions. We also gratefully acknowledge Depth-Anything-3, Florence-2 and ReCamMaster for their excellent work, which inspired and supported this project.
