InSpatio-World

Requirements

Python 3.10
CUDA 12.1

1. Create conda environment:

conda env create -f environment.yml
conda activate inspatio_world

2. Install flash-attn:

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Model Weights

Download the following model checkpoints into the checkpoints/ directory:

Model	Purpose	Source
InSpatio-World-1.3B	v2v inference — 1.3B (Step 3)	HuggingFace
Wan2.1-T2V-1.3B	Text encoder + VAE + base model for 1.3B (Step 3)	HuggingFace
DA3 (Depth-Anything-3)	Depth estimation (Step 2)	HuggingFace
Florence-2-large	Video captioning (Step 1)	HuggingFace
TAEHV	Inference speed-up (optional)	GitHub

bash scripts/download.sh

Expected directory structure after downloading:

checkpoints/
├── InSpatio-World-1.3B/
│   └── InSpatio-World-1.3B.safetensors
├── Wan2.1-T2V-1.3B/
├── DA3/
├── Florence-2-large/
└── taehv/
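
Before running inference, it can help to confirm that the layout above is in place. The following is a minimal sketch and not part of the repository: the helper name `missing_checkpoints` is hypothetical, the path list is copied from the tree above, and `taehv/` is treated as optional on the assumption that it is only needed with --use_tae.

```python
from pathlib import Path

# Paths copied from the expected directory tree above.
EXPECTED = [
    "InSpatio-World-1.3B/InSpatio-World-1.3B.safetensors",
    "Wan2.1-T2V-1.3B",
    "DA3",
    "Florence-2-large",
]
# Assumption: taehv/ is only required when --use_tae is passed.
OPTIONAL = ["taehv"]


def missing_checkpoints(root="checkpoints", include_optional=False):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    names = EXPECTED + (OPTIONAL if include_optional else [])
    return [name for name in names if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing under checkpoints/:", ", ".join(missing))
    else:
        print("All required checkpoints found.")
```

This only checks that the paths exist; it does not verify file contents or versions.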

Inference

The full pipeline runs in three steps:

Step 1: Generate video captions with Florence-2.
Step 2: Estimate depth with DA3, convert it to the inference format, and render point clouds.
Step 3: Run InSpatio-World v2v inference.

All steps are wrapped in a single script:

bash run_test_pipeline.sh \
  --input_dir ./test/example \
  --traj_txt_path ./traj/x_y_circle_cycle.txt

Quick Start

# 1. Place your .mp4 video(s) in a folder
mkdir -p my_videos
cp your_video.mp4 my_videos/

# 2. Run the full pipeline
bash run_test_pipeline.sh \
  --input_dir ./my_videos \
  --traj_txt_path ./traj/x_y_circle_cycle.txt

# 3. Results will be saved to ./output/my_videos/x_y_circle_cycle/
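
The output location in the Quick Start example appears to follow the `./output/<name>/<traj>` pattern, where `<name>` is the input folder name and `<traj>` is the trajectory file name without its extension. The sketch below reproduces that mapping so that downstream scripts can locate results. It is inferred from the example only: the function `default_output_dir` is hypothetical and the pipeline script may derive the path differently.

```python
from pathlib import Path


def default_output_dir(input_dir, traj_txt_path, root="./output"):
    """Build <root>/<input folder name>/<trajectory file stem>.

    Assumption: this mirrors the default output layout shown in Quick Start.
    """
    name = Path(input_dir).name          # "./my_videos" -> "my_videos"
    traj = Path(traj_txt_path).stem      # "x_y_circle_cycle.txt" -> "x_y_circle_cycle"
    return Path(root) / name / traj


if __name__ == "__main__":
    print(default_output_dir("./my_videos", "./traj/x_y_circle_cycle.txt"))
```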

Trajectory Control

The --traj_txt_path argument controls the camera trajectory for novel-view synthesis. Predefined trajectories are provided in the traj/ directory:

File	Motion
`x_y_circle_cycle.txt`	Cyclic combined pitch + yaw orbit
`zoom_out_in.txt`	Dolly zoom out, then dolly zoom in

Trajectory File Format

A trajectory file is a plain text file with 3 lines, each containing space-separated keyframe values that are automatically interpolated to match the output frame count:

<line 1>  pitch (degrees): positive = orbit up, negative = orbit down
<line 2>  yaw (degrees):   positive = orbit left, negative = orbit right
<line 3>  displacement:    relative camera displacement scale

Line 3 (displacement) is a relative scale multiplied by the scene's estimated foreground depth:

When pitch/yaw are non-zero, it controls the orbit radius (typically set to 1)
When both pitch and yaw are zero, it becomes a dolly zoom: positive = move forward (zoom in), negative = move backward (zoom out)
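
To make the format concrete, here is a minimal sketch of how such a file could be read and expanded to a per-frame trajectory. It is not the repository's implementation. The text above says only that keyframes are "automatically interpolated", so linear interpolation is an assumption, as is evaluating the orbit/dolly rule per sample. The function names and the example keyframes are made up for illustration.

```python
def parse_trajectory(text):
    """Parse three lines of space-separated keyframes: pitch, yaw, displacement."""
    rows = [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]
    if len(rows) != 3:
        raise ValueError(f"expected 3 non-empty lines, got {len(rows)}")
    return dict(zip(("pitch", "yaw", "displacement"), rows))


def resample(keys, num_frames):
    """Linearly interpolate keyframes to num_frames evenly spaced samples.

    Assumption: the real pipeline may use a different interpolation scheme.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if len(keys) == 1 or num_frames == 1:
        return [keys[0]] * num_frames
    out = []
    for i in range(num_frames):
        pos = i * (len(keys) - 1) / (num_frames - 1)  # position in keyframe space
        lo = min(int(pos), len(keys) - 2)             # left keyframe index
        frac = pos - lo
        out.append(keys[lo] * (1 - frac) + keys[lo + 1] * frac)
    return out


def motion_kind(pitch_deg, yaw_deg, displacement):
    """Classify one sample using the displacement rules described above."""
    if pitch_deg == 0 and yaw_deg == 0:
        if displacement > 0:
            return "dolly_in"
        if displacement < 0:
            return "dolly_out"
        return "static"
    return "orbit"  # displacement acts as the orbit radius scale


if __name__ == "__main__":
    example = "0 15 0\n0 30 0\n1 1 1\n"  # made-up keyframes, not a shipped file
    traj = parse_trajectory(example)
    frames = {k: resample(v, 5) for k, v in traj.items()}
    for p, y, d in zip(frames["pitch"], frames["yaw"], frames["displacement"]):
        print(p, y, d, motion_kind(p, y, d))
```

Each line is resampled independently here, so the three lines need not have the same number of keyframes. Whether the actual loader allows that is not stated in this document.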

All Arguments

Argument	Required	Default	Description
`--input_dir`	Yes	—	Input folder containing `.mp4` files
`--traj_txt_path`	Yes	—	Trajectory file (e.g. `./traj/x_y_circle_cycle.txt`)
`--checkpoint_path`	No	`./checkpoints/InSpatio-World/InSpatio-World.safetensors`	InSpatio-World checkpoint
`--config_path`	No	`configs/inference.yaml`	Config file (`inference_1.3b.yaml` for 1.3B)
`--da3_model_path`	No	`./checkpoints/DA3`	DA3 depth model path
`--florence_model_path`	No	`./checkpoints/Florence-2-large`	Florence-2 model path
`--step1_gpus`	No	`0`	GPU ID(s) for Step 1 (comma-separated for parallel)
`--step2_gpus`	No	`0`	GPU ID(s) for Step 2 (comma-separated for parallel)
`--step3_gpus`	No	`0`	GPU ID(s) for Step 3
`--step3_nproc`	No	`1`	Number of GPUs for Step 3
`--output_folder`	No	`./output/<name>/<traj>`	Custom output directory
`--master_port`	No	`29513`	Master port for torchrun (Step 3)
`--skip_step1`	No	false	Skip caption generation
`--skip_step2`	No	false	Skip depth estimation
`--skip_step3`	No	false	Skip v2v inference
`--relative_to_source`	No	false	Compose trajectory poses relative to initial view
`--rotation_only`	No	false	Only apply rotation from trajectory, ignore translation (tripod pan/tilt)
`--disable_adaptive_frame`	No	false	Disable adaptive frame expansion/subsampling (use original frame count as-is)
`--freeze_repeat`	No	`0`	Repeat a specific frame N extra times to create a time-freeze (pause) effect
`--freeze_frame`	No	middle frame	Frame index to freeze; defaults to the middle frame if not specified
`--use_tae`	No	false	Use Tiny Auto Encoder (TAE) instead of WanVAE
`--tae_checkpoint_path`	No	`./checkpoints/taehv/taew2_1.pth`	Path to TAE checkpoint file (required when --use_tae is set)
`--compile_dit`	No	false	Apply torch.compile to the DiT model
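
When the pipeline is driven from another program, the arguments in the table can be assembled into a command list rather than a hand-written shell string. The sketch below is a hypothetical helper, not something shipped with the repository. It passes argument names through unchecked, so it is only as correct as the names given to it.

```python
import shlex


def build_command(input_dir, traj_txt_path, flags=(), **options):
    """Assemble an argv list for run_test_pipeline.sh.

    `flags` are boolean switches from the table (e.g. "use_tae").
    `options` are value-taking arguments (e.g. step2_gpus="0,1").
    """
    cmd = [
        "bash", "run_test_pipeline.sh",
        "--input_dir", str(input_dir),
        "--traj_txt_path", str(traj_txt_path),
    ]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    for name in flags:
        cmd.append(f"--{name}")
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "./my_videos",
        "./traj/x_y_circle_cycle.txt",
        flags=["skip_step1"],
        step2_gpus="0,1",
    )
    # Print the shell-quoted command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```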

Skip Already-Completed Steps

If Step 1 or Step 2 outputs already exist, you can skip them:

bash run_test_pipeline.sh \
  --input_dir ./my_videos \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --skip_step1 --skip_step2

Generate Temporal Control Videos

bash run_test_pipeline.sh \
  --input_dir ./test/example \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --freeze_repeat 150 \
  --output_folder ./output/example_freeze_repeat_150 \
  --disable_adaptive_frame

Two parameters control the time-freeze effect: --freeze_frame selects the frame to hold (default: the middle frame), and --freeze_repeat sets how many extra times that frame is repeated, which determines the length of the pause.
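
The effect of these two parameters on the frame sequence can be illustrated with a small index calculation. This is a sketch of the described behavior, not the repository's code. The function `freeze_indices` is hypothetical, and taking "middle frame" to mean index `num_frames // 2` is an assumption.

```python
def freeze_indices(num_frames, freeze_repeat=0, freeze_frame=None):
    """Return source-frame indices with one frame held for extra repeats."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if freeze_frame is None:
        # Assumption: "middle frame" is taken as floor(num_frames / 2).
        freeze_frame = num_frames // 2
    if not 0 <= freeze_frame < num_frames:
        raise ValueError("freeze_frame is out of range")
    head = list(range(freeze_frame + 1))             # frames up to and including the held one
    hold = [freeze_frame] * freeze_repeat            # the extra repeats
    tail = list(range(freeze_frame + 1, num_frames)) # remaining frames
    return head + hold + tail


if __name__ == "__main__":
    print(freeze_indices(5, freeze_repeat=2))
```

Under this reading, the output has `num_frames + freeze_repeat` frames in total.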

Autonomous Driving Applications

bash run_test_pipeline.sh \
  --input_dir ./test/example3 \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --relative_to_source \
  --rotation_only \
  --disable_adaptive_frame

Speed Up

bash run_test_pipeline.sh \
  --input_dir ./test/example \
  --traj_txt_path ./traj/x_y_circle_cycle.txt \
  --use_tae \
  --disable_adaptive_frame

You can switch from WanVAE to TAE with --use_tae to accelerate inference. Adding --compile_dit boosts speed further, reaching 24 fps with the 1.3B model on an H-series NVIDIA GPU. Note that torch.compile needs a relatively long warm-up the first time it runs, so this option is best suited to deployments that run as a long-lived service and prioritize maximum speed.

License

This project is licensed under the Apache-2.0 License. The license applies only to the code in this repository; its dependencies and submodules (Depth-Anything-3, Florence-2, TAEHV) are licensed separately under their own terms.

Citation

If you use InSpatio-World in your research, please use the following BibTeX entry.

@misc{inspatio-world,
    title={INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling},
    author={InSpatio Team},
    journal={arXiv preprint arXiv:2604.07209},
    year={2026}
}

Acknowledgement

InSpatio-World uses a backbone based on Wan2.1, and its training code draws on Self-Forcing. The TAE component for inference speed-up builds on TAEHV. We sincerely thank the Self-Forcing, Wan, and TAEHV teams for their foundational work and open-source contributions. We also gratefully acknowledge Depth-Anything-3, Florence-2, and ReCamMaster for their excellent work, which inspired and supported this project.
