Hao Zhang1,2 ·
Mohamed El Banani1 ·
Jen-Hao Cheng1 ·
Paul Zhang1
Yi Hua1 ·
Ben Mildenhall1 ·
Christoph Lassner1 ·
Narendra Ahuja2 ·
Gengshan Yang1
1World Labs 2University of Illinois Urbana-Champaign
Click the image above to play the demo reel inline (720p, ~11 MB).
Full 1440p / interactive version on the
project page.
Image-to-3D point cloud prediction via flow-matching diffusion over layered
geometry. A single forward pass produces L registered XYZ maps that
together cover the visible surface and the (partially) occluded surfaces
behind it, giving a richer 3D scaffold than a single mono-depth map.
The project page hosts the curated demo samples as an interactive 3D viewer. This repository ships the code that produced those samples plus the public model weights so you can reproduce them on any RGBA-friendly image of your own.
This is the inference-only release.
💛 If you find World Tracing useful, please consider ⭐ starring this repo and citing our paper — it really helps us prioritise future releases.
The released checkpoints are hosted on Hugging Face Hub:
| config name | task | image size | params | Hugging Face repo |
|---|---|---|---|---|
| r75b | object | 504 × 504 | 1.7 B | haoz19/object-model-6layer |
| r69l | scene | 840 × 840 | 1.5 B | haoz19/scene-model-6layer-840 |
| r76 | dynamic object (16 frames) | 336 × 336 | 2.1 B | haoz19/dynamic-model-16frame |
Scene model: r69l (840 × 840). The released scene model is the high-resolution r69l checkpoint (haoz19/scene-model-6layer-840, r69l_v2_evermotion_ithappy_840_opp). This is the same checkpoint that powers the Scene tab of the interactive demo. It must be run at 840 × 840 (the r69l config in wt/checkpoint.py handles this); feeding it 504-res inputs is heavily out-of-distribution.
Pass --ckpt <config-name> (e.g. --ckpt r75b) and wt will fetch the
weights from the Hub on first use and cache them under
~/.cache/huggingface/. You can also pass an hf:// URI or a local
.pt path -- see the checkpoint section below.
git clone https://github.com/haoz19/world-tracing.git
cd world-tracing
pip install -e ".[viz]"

Tested with Python ≥ 3.10 on Linux with CUDA 12. The base install pulls
in torch, numpy, Pillow, opencv-python, einops, safetensors,
huggingface_hub, structlog, beartype, jaxtyping. The viz extra
adds rerun-sdk (for the .rrd viewer) and
scipy.
Optional extras:
pip install -e ".[viz,bg]" # + BiRefNet-based foreground matting for RGB inputs
pip install -e ".[viz,flash]" # + flash-attn (auto-detected at runtime)
pip install -e ".[viz,textured-mesh]" # + helpers for image → textured GLB (see below)

The bg extra pulls in
ZhengPeng7/BiRefNet_HR
(MIT, SOTA dichotomous segmentation) so that infer_rgba.py can
auto-matte RGB inputs that don't already carry an alpha channel.
Without this extra, infer_rgba.py falls back to a fast
near-white-background heuristic and emits a warning.
The sample images we used for the demo video live in
examples/test_images/, pre-organised by mode (object/, scene/, dynamic/). All quickstart commands below use the same set so you can reproduce a demo on a fresh checkout without finding your own inputs.

Every example below runs a 4-seed sweep by default (seeds 42, 43, 44, 45) and writes one .rrd with the four samples laid out side-by-side along +X so you can pick the best. Pass --seed N to run a single deterministic seed instead.
python examples/infer_rgba.py \
--image examples/test_images/object/obj014_leather_briefcase.png \
--ckpt r75b \
--config r75b \
--out /tmp/wt_obj014.rrd
rerun /tmp/wt_obj014.rrd

If your input is an RGB image whose object is matted onto a near-white
background (common for SAM / Stable-Diffusion outputs), wt auto-derives
a binary alpha; pass --no-auto-alpha to disable that.
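
The near-white fallback can be pictured with the short NumPy sketch below. It is not the code that wt ships: the function name `near_white_alpha` and the threshold of 245 are assumptions chosen only to illustrate how a binary alpha can be derived from a white-matted RGB image.

```python
import numpy as np


def near_white_alpha(rgb: np.ndarray, threshold: int = 245) -> np.ndarray:
    """Return a binary uint8 alpha (0 or 255) for an HxWx3 uint8 image.

    A pixel is treated as background when all three channels are at or
    above `threshold`. The threshold is an illustrative assumption, not
    the value used by wt.
    """
    if rgb.ndim != 3 or rgb.shape[-1] != 3:
        raise ValueError("expected an HxWx3 RGB array")
    background = (rgb >= threshold).all(axis=-1)  # [H, W] bool
    return np.where(background, 0, 255).astype(np.uint8)
```

A heuristic of this kind misclassifies white parts of the object itself, which is one reason to prefer an input with a real alpha channel or the bg extra.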
Tip: pass --layer-timeline to log the prediction along a layer timeline. Scrubbing the timeline slider in Rerun adds one layer at a time and makes it obvious how each layer carves out the occluded geometry behind the previous one.
python examples/infer_scene.py \
--image examples/test_images/scene/scene_outdoor_14_brooklyn_apartment__seed61.png \
--ckpt r69l \
--config r69l \
--out /tmp/wt_scene.rrd

Scene mode treats the entire frame as foreground (no alpha mask) and
keeps the raw RGB (no background overwrite). The released scene model
was trained on indoor renders without sky, so for outdoor scenes with
large sky regions you should pre-mask the sky externally (any matting
tool you like — e.g. running the same wt.data.segment_foreground
on the inverted image, or your favourite ADE20K segmenter outside
the pipeline) before feeding the result into infer_scene.py.
# davis__camel/ holds 16 PNG frames
python examples/infer_video.py \
--image_dir examples/test_images/dynamic/davis__camel/ \
--ckpt r76 \
--config r76 \
--out /tmp/wt_camel.rrd

The resulting .rrd uses the frame timeline; scrub the slider in
Rerun to animate the predicted point cloud over time. All frames share
a single crop so the temporal-attention blocks can establish per-pixel
correspondences. Pass --frame_indices "0,2,4,6,8,10,12,14" to pick
a subset of the 16 supplied frames.
The default 4-seed sweep emits all four samples in a single .rrd,
spread along +X. To run a single deterministic seed, pass
--seed:
python examples/infer_rgba.py \
--image examples/test_images/object/obj063_trex_dinosaur.png \
--ckpt r75b \
--seed 7 \
--out /tmp/wt_obj063_seed7.rrd

--num-seeds K runs a custom sweep size starting at the default
base seed 42 (combine with --seed N to shift the base: --seed 100 --num-seeds 4 runs 100, 101, 102, 103). --num-seeds 1 is the
fastest single-sample mode.
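
The flag combinations above can be summarised in a few lines of Python. This is a sketch of the documented behaviour only; `sweep_seeds` is a hypothetical helper, not a function exported by wt.

```python
from typing import List, Optional

DEFAULT_BASE_SEED = 42  # base seed of the default sweep, per the text above
DEFAULT_NUM_SEEDS = 4   # size of the default sweep


def sweep_seeds(seed: Optional[int] = None,
                num_seeds: Optional[int] = None) -> List[int]:
    """Map (--seed, --num-seeds) to the list of seeds that would be run."""
    if seed is None and num_seeds is None:
        base, count = DEFAULT_BASE_SEED, DEFAULT_NUM_SEEDS
    elif num_seeds is None:
        # --seed N on its own selects one deterministic sample.
        base, count = seed, 1
    else:
        # --num-seeds K starts at the default base unless --seed shifts it.
        base = DEFAULT_BASE_SEED if seed is None else seed
        count = num_seeds
    if count < 1:
        raise ValueError("num_seeds must be at least 1")
    return [base + i for i in range(count)]
```

For instance, `sweep_seeds(seed=100, num_seeds=4)` yields the seeds 100 through 103, matching the example in the text.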
This example chains the released multilayer-geometry model with the public TRELLIS.2 image-to-3D pipeline: we skip TRELLIS.2's Stage-1 sparse-structure diffusion and feed it the voxel coords derived from our predicted XYZ. Stages 2-3 (shape SLat + texture SLat + mesh decode) then produce a textured GLB.
# 1. one-time TRELLIS.2 setup (only needed once; ~30 min of dep install)
git clone https://github.com/microsoft/TRELLIS.2
cd TRELLIS.2
bash setup.sh --new-env --basic --flash-attn --o-voxel \
--nvdiffrast --cumesh --flexgemm
conda activate trellis2
# 2. install wt in that same env
pip install -e "/path/to/world-tracing[viz,textured-mesh]"
# 3. run end-to-end (default 4-seed sweep -- writes obj014_seed{42,43,44,45}.glb)
python examples/infer_textured_mesh.py \
--image examples/test_images/object/obj014_leather_briefcase.png \
--ckpt r75b \
--out /tmp/wt_obj014.glb \
--rrd /tmp/wt_obj014.rrd \
--trellis2-path /path/to/TRELLIS.2

The --pipeline-type flag selects the TRELLIS.2 stage configuration
(1024_cascade is the default — best quality / time trade-off).
By default a 4-seed sweep writes <out_stem>_seed{42,43,44,45}.glb so
you can keep the best mesh; pass --seed N (or --num-seeds 1) to
run a single seed and write to the plain --out path. --rrd
additionally dumps the multilayer point cloud for sanity-check viewing
in Rerun.
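
The output naming rule can be sketched as follows. `glb_output_paths` is a hypothetical helper written for illustration; it follows the `<out_stem>_seed{N}.glb` pattern described above and is not taken from infer_textured_mesh.py.

```python
from pathlib import Path
from typing import List, Sequence


def glb_output_paths(out: str, seeds: Sequence[int]) -> List[Path]:
    """Return one output path per seed.

    A single seed writes to the plain --out path; a sweep inserts
    `_seed{N}` between the file stem and its suffix.
    """
    out_path = Path(out)
    if len(seeds) == 1:
        return [out_path]
    return [
        out_path.with_name(f"{out_path.stem}_seed{s}{out_path.suffix}")
        for s in seeds
    ]
```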
--ckpt accepts any of:
- Bare config name (--ckpt r75b): fetched from the default Hugging Face repo for that config (see the table at the top).
- HF shorthand (--ckpt hf://haoz19/object-model-6layer): uses model.pt from the given repo. Add a file path for a non-default filename: --ckpt hf://my-fork/object/model.pt.
- Local path (--ckpt /path/to/checkpoint.pt): useful for fine-tuned weights. Accepts both raw state_dict and {"model_state_dict": ..., "ema_state_dict": ...} formats; EMA is preferred when present.
Resolution and download happen in wt.checkpoint.resolve_ckpt_path --
the cached file lives under ~/.cache/huggingface/hub/ so subsequent
runs are instant.
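
To make the three accepted forms concrete, here is a minimal parser sketch. It is not the implementation of wt.checkpoint.resolve_ckpt_path: the names `CkptSpec` and `parse_ckpt_spec` are hypothetical, and using model.pt as the filename for bare config names is an assumption (the text only states it for the hf:// form).

```python
from dataclasses import dataclass
from typing import Optional

# Default repos copied from the checkpoint table in this README.
DEFAULT_REPOS = {
    "r75b": "haoz19/object-model-6layer",
    "r69l": "haoz19/scene-model-6layer-840",
    "r76": "haoz19/dynamic-model-16frame",
}
DEFAULT_FILENAME = "model.pt"  # assumed for bare config names


@dataclass(frozen=True)
class CkptSpec:
    kind: str                         # "config", "hf", or "local"
    repo_id: Optional[str] = None     # Hub repo, e.g. "owner/name"
    filename: Optional[str] = None    # file inside the Hub repo
    local_path: Optional[str] = None  # set only for local checkpoints


def parse_ckpt_spec(spec: str) -> CkptSpec:
    """Classify a --ckpt value as a config name, hf:// URI, or local path."""
    if spec in DEFAULT_REPOS:
        return CkptSpec("config", DEFAULT_REPOS[spec], DEFAULT_FILENAME)
    if spec.startswith("hf://"):
        parts = spec[len("hf://"):].strip("/").split("/")
        if len(parts) < 2:
            raise ValueError(f"expected hf://owner/repo[/file], got {spec!r}")
        repo_id = "/".join(parts[:2])
        filename = "/".join(parts[2:]) or DEFAULT_FILENAME
        return CkptSpec("hf", repo_id, filename)
    return CkptSpec("local", local_path=spec)
```

For the two Hub-backed kinds, the resulting `repo_id` and `filename` are the arguments one would hand to `huggingface_hub.hf_hub_download`, which also manages the cache directory mentioned above.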
The released models predict per-layer geometry only:
| name | shape | meaning |
|---|---|---|
| xyz_pred | [B, L, H, W, 3] | Per-layer XYZ in camera space (metric units for r75b / r76; relative scale for r69l median-log) |
The per-layer validity mask is taken from the input alpha (the model's output is unmasked geometry over the full grid); per-pixel colour is sampled from the input RGB at the corresponding location. No colour or visibility is predicted by the model.
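
As an illustration of that convention, the sketch below turns one batch element of xyz_pred into a coloured point cloud. `layers_to_points` is a hypothetical helper, and the alpha cutoff of 127 is borrowed from the background-overwrite rule described later in this README; treat both as assumptions rather than the wt API.

```python
import numpy as np


def layers_to_points(xyz_pred: np.ndarray, rgb: np.ndarray,
                     alpha: np.ndarray, alpha_threshold: int = 127):
    """Flatten layered XYZ maps into points, colours, and layer ids.

    xyz_pred: [L, H, W, 3] float, one batch element of the model output.
    rgb:      [H, W, 3] uint8 input image at the same resolution.
    alpha:    [H, W] uint8 input alpha; the same mask is reused per layer.
    """
    num_layers = xyz_pred.shape[0]
    valid = alpha > alpha_threshold                  # [H, W] bool
    points = xyz_pred[:, valid]                      # [L, N_valid, 3]
    num_valid = points.shape[1]
    # Every layer samples its colour from the same input pixel.
    colors = np.broadcast_to(rgb[valid], (num_layers, num_valid, 3))
    layer_ids = np.repeat(np.arange(num_layers), num_valid)
    return points.reshape(-1, 3), colors.reshape(-1, 3), layer_ids
```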
Camera intrinsics for the predicted point cloud can be recovered from
layer-0 with wt.solve_intrinsics_from_xyz; this lets
you turn the prediction into a textured mesh or render it through any
camera. No MoGe / pose estimator required at inference time.
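
Because the XYZ maps are pixel-aligned, a pinhole camera can be fitted to them by linear least squares. The sketch below shows one way to do that; it is not the implementation of wt.solve_intrinsics_from_xyz. The function name, the half-pixel centre convention, and the +X right / +Y down / +Z forward axis convention are all assumptions.

```python
import numpy as np


def fit_pinhole_intrinsics(xyz, mask=None):
    """Least-squares fit of K from a pixel-aligned [H, W, 3] XYZ map.

    Uses the pinhole relations u = fx * X/Z + cx and v = fy * Y/Z + cy,
    with pixel centres at (col + 0.5, row + 0.5).
    """
    h, w, _ = xyz.shape
    v, u = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    if mask is None:
        mask = np.ones((h, w), dtype=bool)
    mask = mask & (xyz[..., 2] > 1e-6)   # keep points in front of the camera
    z = xyz[..., 2][mask]
    x_n = xyz[..., 0][mask] / z          # normalised image-plane coordinates
    y_n = xyz[..., 1][mask] / z
    ones = np.ones_like(x_n)
    # Two independent 2-parameter fits: (fx, cx) and (fy, cy).
    (fx, cx), *_ = np.linalg.lstsq(np.stack([x_n, ones], axis=1), u[mask], rcond=None)
    (fy, cy), *_ = np.linalg.lstsq(np.stack([y_n, ones], axis=1), v[mask], rcond=None)
    return np.array([[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]])
```

Passing the input alpha as `mask` restricts the fit to foreground pixels of layer 0, which is where the prediction is meaningful for object inputs.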
The model's frozen image encoder reads the raw RGB pixels regardless of the validity mask. If your input has a coloured background, the encoder will treat it as valid content and the model will produce "ghost" geometry over it.
preprocess_rgba_for_model therefore overwrites the background (alpha ≤
127) region with a fixed RGB triple before the resize. The default is
black (0, 0, 0), which matches the training-set renders (Objaverse +
composite scenes) and the bg_randomize augmentation that ran during
training. Pass --bg-color none to keep the raw RGB (only useful for
scene mode or explicit ablations).
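
The overwrite step itself is small. The sketch below mirrors the rule described above (alpha ≤ 127 counts as background, default fill is black) but is not the body of preprocess_rgba_for_model; `overwrite_background` is a hypothetical name, and passing `None` stands in for --bg-color none.

```python
import numpy as np


def overwrite_background(rgba: np.ndarray, bg_color=(0, 0, 0),
                         alpha_threshold: int = 127) -> np.ndarray:
    """Return the RGB channels of an [H, W, 4] uint8 image with the
    background region replaced by a fixed colour.

    Pixels with alpha <= alpha_threshold are treated as background.
    bg_color=None keeps the raw RGB untouched.
    """
    rgb = rgba[..., :3].copy()  # do not modify the caller's array
    if bg_color is not None:
        rgb[rgba[..., 3] <= alpha_threshold] = bg_color
    return rgb
```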
wt/ ← installable Python package
├── model.py ← MultilayerXYZModel (configurable wrapper around MultilayerBackbone)
├── inference.py ← inference_diffusion / inference_diffusion_multiview / inference_video_diffusion
├── sampling.py ← Euler ODE flow-matching sampler (replaces FMLossWrapper)
├── data.py ← Image loaders (BiRefNet auto-matting), preprocessing, video clip
├── viz.py ← Rerun .rrd output helpers (single image, video timeline, multi-seed)
├── intrinsics.py ← Solve K from predicted XYZ (replaces MoGe at inference)
├── checkpoint.py ← Released model configs + checkpoint loader + HF Hub resolver
├── postproc.py ← Optional point-cloud cleanup (edge-flyer filter for dynamic outputs)
├── cli.py ← Shared CLI helpers
├── textured_mesh/ ← TRELLIS.2 bridge: ours_v4 voxelisation + stage-2/3 driver
│ ├── canon.py ← Camera ↔ TRELLIS canonical-frame transform
│ ├── voxelise.py ← expand_cloud_ray_xyz + v4_ray_fill
│ └── pipeline.py ← load_trellis2_pipeline + inject_coords_into_trellis2 + save_mesh_glb
└── _core/ ← Vendored deps (Wan2.1 layer init, MoGe backbone, ...)
examples/
├── infer_rgba.py ← Single RGBA image (object model; 4-seed sweep by default)
├── infer_scene.py ← Single scene RGB (r69l; 4-seed sweep by default)
├── infer_video.py ← Dynamic clip (r76; 4-seed sweep by default)
└── infer_textured_mesh.py ← Image → multilayer geometry → TRELLIS.2 stages 2+3 → textured GLB (4-seed sweep by default)
Tested on a single NVIDIA A100 / H100 (80 GB) with bfloat16 autocast.
| Config | Image size | Inference time (20 steps) |
|---|---|---|
| r75b | 504 × 504 | ~13 s / image |
| r69l | 840 × 840 | ~17 s / image |
| r76 | 336 × 336 × 8 frames | ~30 s / clip |
The default 4-seed sweep is therefore ~4× the single-seed numbers above.
Smaller GPUs work with reduced --num-steps or by sampling at a
smaller resolution.
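
The step count matters because sampling integrates an ODE with fixed Euler steps, so cost scales linearly with the number of model evaluations. The following is a generic sketch of Euler flow-matching sampling, not the code in wt/sampling.py; the time convention (noise at t = 0, data at t = 1) and the `velocity_fn(x, t)` signature are assumptions.

```python
from typing import Callable

import numpy as np


def euler_sample(velocity_fn: Callable[[np.ndarray, float], np.ndarray],
                 x0: np.ndarray, num_steps: int = 20) -> np.ndarray:
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1.

    x0 is the initial noise sample; each step costs one call to
    velocity_fn, which stands in for a forward pass of the model.
    """
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x
```

Halving `num_steps` halves the number of model evaluations, at the price of a coarser integration of the learned velocity field.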
- More published checkpoints. Updated r75b / r69l / r76 from later training rounds, and a single-image multi-view variant.
@misc{zhang2026worldtracinggenerativepixelaligned,
title = {World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible},
author = {Hao Zhang and Mohamed El Banani and Jen-Hao Cheng and Paul Zhang
and Yi Hua and Ben Mildenhall and Christoph Lassner
and Narendra Ahuja and Gengshan Yang},
year = {2026},
eprint = {2606.13652},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2606.13652}
}

CC BY-NC-ND 4.0
(Creative Commons Attribution-NonCommercial-NoDerivatives 4.0
International) — see LICENSE. The code, model weights, and demo are
released for non-commercial research use only; no derivatives may be
redistributed.
The model architecture borrows from:
We thank the authors for releasing their code.