The official code for the paper: "Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator".
Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, Konrad Schindler.
ETH Zurich, Google
[Demo video: `output_grid.mp4`]
VIST3A is a text-to-3D framework that stitches a multi-view reconstruction network onto a latent video diffusion model (LDM), so a single generative pass maps a text prompt to a 3D scene.
- Text → 3DGS in one LDM path. Generates high-quality, 3D-consistent Gaussian splats directly from text prompts, even with long and detailed descriptions, maintaining both semantic fidelity and visual realism.
- Models. We release VIST3A-1.3B and VIST3A-14B, built on Wan 2.1-1.3B and Wan 2.1-14B respectively.
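To illustrate the data flow described above, here is a toy, self-contained sketch: a (stand-in) video generator turns a prompt into multi-view frames, and a (stand-in) stitched reconstruction head maps each pixel to one 3D Gaussian's parameters. All names, shapes, and the 14-value Gaussian layout here are illustrative assumptions, not the released model's API.

```python
# Hypothetical sketch of the VIST3A pipeline shape, NOT the real model:
# text prompt -> video LDM frames -> stitched recon head -> per-pixel 3D Gaussians.
import numpy as np

N_VIEWS, H, W = 8, 32, 32          # illustrative resolution, not the paper's
GAUSSIAN_DIM = 3 + 3 + 4 + 1 + 3   # xyz, scale, quaternion, opacity, RGB (assumed layout)

def fake_video_generator(prompt: str, n_views: int = N_VIEWS) -> np.ndarray:
    """Stand-in for the video LDM: returns n_views RGB frames for a prompt."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((n_views, H, W, 3)).astype(np.float32)

def fake_recon_head(frames: np.ndarray) -> np.ndarray:
    """Stand-in for the stitched multi-view reconstruction network:
    maps each pixel of each frame to one Gaussian's parameter vector."""
    n, h, w, _ = frames.shape
    # A fixed linear map per pixel, just to produce the right output shape.
    weight = np.ones((3, GAUSSIAN_DIM), dtype=np.float32) / 3.0
    return frames.reshape(n * h * w, 3) @ weight

def text_to_3dgs(prompt: str) -> np.ndarray:
    frames = fake_video_generator(prompt)
    return fake_recon_head(frames)   # one Gaussian per pixel per view

splats = text_to_3dgs("a ceramic teapot on a wooden table")
print(splats.shape)
```

The point of the sketch is the stitching interface: the reconstruction head consumes whatever frames the generator emits, so the two halves only need to agree on the image tensor shape.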
TODO:
We build upon open-source implementations of video LDMs (Wan, CogVideoX, HunyuanVideo, SVD), multi-view reconstruction networks (AnySplat, VGGT, MVDust3R), and gsplat. We thank the respective authors and communities.