
VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator

The official code for the paper: "Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator".

Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, Konrad Schindler.

ETH Zurich, Google

[Demo video: output_grid.mp4]

🔥 Highlights

VIST3A is a framework for text-to-3D generation that stitches a multi-view reconstruction network onto a video generator LDM (latent diffusion model).

  • Text → 3DGS in one LDM path. Generates high-quality, 3D-consistent Gaussian splats directly from text prompts, even for long and detailed descriptions, maintaining both semantic fidelity and visual realism (a conceptual sketch follows below).
  • Models. We release VIST3A-1.3B and VIST3A-14B, built on Wan 2.1-1.3B and Wan 2.1-14B respectively.

📦 Installation

TODO:

🚀 Quickstart

TODO:

🧠 Training

🩹 Model stitching

TODO:
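
Until the training code lands, here is a rough picture of what "stitching" refers to, based on the paper title and the highlights above: a pretrained multi-view reconstruction network is attached to the video generator through a learned linear mapping. The class name, dimensions, and fitting objective below are illustrative placeholders, not the project's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StitchingLayer(nn.Module):
    """Illustrative linear adapter between two pretrained networks.

    Maps activations from a chosen video-generator layer (dim d_gen)
    into the input feature space of the multi-view reconstruction
    network (dim d_rec). Dimensions are placeholders.
    """

    def __init__(self, d_gen: int = 1536, d_rec: int = 1024):
        super().__init__()
        self.proj = nn.Linear(d_gen, d_rec)

    def forward(self, h_gen: torch.Tensor) -> torch.Tensor:
        # (batch, tokens, d_gen) -> (batch, tokens, d_rec)
        return self.proj(h_gen)

def stitching_loss(adapter: StitchingLayer,
                   h_gen: torch.Tensor,
                   h_rec_target: torch.Tensor) -> torch.Tensor:
    # Regress the adapter's output onto the activations the
    # reconstruction network produces for matching frames.
    return F.mse_loss(adapter(h_gen), h_rec_target)
```

Since the adapter is a single linear layer, such a regression could also be solved in closed form (least squares); the MSE objective above is just the simplest way to write it down.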

🎯 Reward Alignment

TODO:
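
Generically, reward alignment fine-tunes a generator so that a differentiable reward computed on its (rendered) outputs increases. The sketch below shows one standard direct-reward update; all callables are placeholders and not this repository's API:

```python
import torch

def reward_alignment_step(generator, renderer, reward_fn, prompts, optimizer):
    """One generic direct-reward fine-tuning step (illustrative only).

    generator : callable, prompts -> 3D Gaussian parameters (differentiable)
    renderer  : callable, gaussians -> rendered views (differentiable)
    reward_fn : callable, (views, prompts) -> per-sample reward scores
    """
    optimizer.zero_grad()
    gaussians = generator(prompts)
    views = renderer(gaussians)
    # Gradient ascent on the reward == descent on its negative.
    loss = -reward_fn(views, prompts).mean()
    loss.backward()
    optimizer.step()
    return -loss.item()
```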

πŸ™ Acknowledgements

We build upon open-source implementations of video LDMs (Wan, CogVideoX, HunyuanVideo, SVD), multi-view reconstruction networks (AnySplat, VGGT, MVDust3R), and gsplat. Thanks to the respective authors and communities.
