The official code for the paper: "Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator".
Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, Konrad Schindler.
ETH Zurich, Google
[Demo video: `output_grid.mp4`]
VIST3A is a text-to-3D framework that stitches a multi-view reconstruction network onto a latent video diffusion model (LDM), so a single generative pass maps a text prompt to a 3D scene.
- Text → 3DGS in one LDM path. Generates high-quality, 3D-consistent Gaussian splats directly from text prompts, even with long and detailed descriptions, maintaining both semantic fidelity and visual realism.
- Models. We release VIST3A-1.3B and VIST3A-14B, built on Wan 2.1-1.3B and Wan 2.1-14B respectively.
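To illustrate the data flow described above, here is a toy, self-contained sketch: a (stand-in) video generator turns a prompt into multi-view frames, and a (stand-in) stitched reconstruction head maps each pixel to one 3D Gaussian's parameters. All names, shapes, and the 14-value Gaussian layout here are illustrative assumptions, not the released model's API.

```python
# Hypothetical sketch of the VIST3A pipeline shape, NOT the real model:
# text prompt -> video LDM frames -> stitched recon head -> per-pixel 3D Gaussians.
import numpy as np

N_VIEWS, H, W = 8, 32, 32          # illustrative resolution, not the paper's
GAUSSIAN_DIM = 3 + 3 + 4 + 1 + 3   # xyz, scale, quaternion, opacity, RGB (assumed layout)

def fake_video_generator(prompt: str, n_views: int = N_VIEWS) -> np.ndarray:
    """Stand-in for the video LDM: returns n_views RGB frames for a prompt."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((n_views, H, W, 3)).astype(np.float32)

def fake_recon_head(frames: np.ndarray) -> np.ndarray:
    """Stand-in for the stitched multi-view reconstruction network:
    maps each pixel of each frame to one Gaussian's parameter vector."""
    n, h, w, _ = frames.shape
    # A fixed linear map per pixel, just to produce the right output shape.
    weight = np.ones((3, GAUSSIAN_DIM), dtype=np.float32) / 3.0
    return frames.reshape(n * h * w, 3) @ weight

def text_to_3dgs(prompt: str) -> np.ndarray:
    frames = fake_video_generator(prompt)
    return fake_recon_head(frames)   # one Gaussian per pixel per view

splats = text_to_3dgs("a ceramic teapot on a wooden table")
print(splats.shape)
```

The point of the sketch is the stitching interface: the reconstruction head consumes whatever frames the generator emits, so the two halves only need to agree on the image tensor shape.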
TODO:
We build upon open-source implementations of video LDMs (Wan, CogVideoX, HunyuanVideo, SVD), multi-view reconstruction networks (AnySplat, VGGT, MVDust3R), and gsplat. We thank the respective authors and communities.