NVlabs/4D-RGPT

[CVPR 2026 (Highlight)] 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Project Page | arXiv | Hugging Face Model | Hugging Face Dataset | Hugging Face Paper

This is the official repository for the CVPR'26 Highlight paper 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation.

Chiao-An Yang*, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
(*Work done during the internship at NVIDIA Research)


📖 Introduction

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Furthermore, existing 3D and 4D Video Question Answering (VQA) benchmarks emphasize static scenes and lack region-level prompting.

To tackle these issues, we introduce:

  • 4D-RGPT: A specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal and spatial perception.
  • Perceptual 4D Distillation (P4D): A training-only framework that transfers 4D representations (e.g., depth, optical flow) from a frozen expert model into 4D-RGPT for comprehensive 4D perception, without introducing any additional inference cost.
  • R4D-Bench: A rigorous benchmark for depth-aware dynamic scenes featuring region-level prompting, built via a hybrid automated and human-verified pipeline.

Our experiments demonstrate that 4D-RGPT achieves notable improvements over strong baselines on existing 3D/4D benchmarks (+5.3% on average across 6 benchmarks) as well as our proposed region-based R4D-Bench (+4.3%).

💥 News 💥

  • [Jun. 2026] 🔥 Training + inference code and 4D-RGPT-8B model weights released.
  • [Apr. 2026] 🔥🔥 4D-RGPT is selected as a Highlight paper at CVPR 2026! 🎉🎉
  • [Apr. 2026] R4D-Bench Dataset release.
  • [Feb. 2026] 🔥🔥 4D-RGPT is accepted to CVPR 2026! 🎉🎉
  • [Dec. 2025] Paper, Project Page, and Hugging Face page released.

Contact: Chiao-An Yang (yang2300@purdue.edu)


Built on VILA

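This repo is a fork of the official NVlabs/VILA and inherits the NVILA model code. The 4D-RGPT-specific additions are latent distillation (L_LD), explicit distillation (L_ED), and Timestamp Positional Encoding (TPE); see "Reusing the components in your own model".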

Note on transformers version compatibility

Our code runs against transformers==4.46 + torch==2.3 (see pyproject.toml). We initially built on transformers==5.0, but pivoted back to 4.x: with transformers==5.0, SDPA on Qwen2-family LLMs produces a ~8pt bf16 argmax drift on close-call MCQ tokens (R4D-Bench: 30 vs the v4 reference 38.9 on NVILA-Lite-8B). The drift is in matmul kernel selection, not in the model weights.

Practical consequence: a VILA checkpoint cannot be transferred between the v4 and v5 stacks as-is, because transformers 5 only emits the fast tokenizer.json, while transformers 4.46's AutoTokenizer(..., use_fast=False) expects vocab.json / merges.txt. llava/model/language_model/builder.py:118-121 handles this with a guarded slow→fast tokenizer fallback so most checkpoints load under either stack, but the saved artifacts still need to be re-emitted on the target stack.
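A guarded slow→fast fallback of the kind described above can be sketched as follows. This is a minimal illustration, not the code in llava/model/language_model/builder.py: the function name load_tokenizer_with_fallback, the injectable loader argument, and the set of caught exceptions are all assumptions made for the example.

```python
from typing import Any, Callable, Optional


def load_tokenizer_with_fallback(
    path: str,
    loader: Optional[Callable[..., Any]] = None,
) -> Any:
    """Try the slow tokenizer first, then fall back to the fast one.

    `loader` is a hypothetical injection point (handy for testing); by
    default it is transformers' AutoTokenizer.from_pretrained.
    """
    if loader is None:
        # Imported lazily so the sketch can be exercised without transformers.
        from transformers import AutoTokenizer

        loader = AutoTokenizer.from_pretrained
    try:
        # The v4 stack expects vocab.json / merges.txt for the slow tokenizer.
        return loader(path, use_fast=False)
    except (OSError, ValueError, TypeError):
        # Assumption: a checkpoint that only ships tokenizer.json fails the
        # slow path with one of these exception types.
        return loader(path, use_fast=True)
```

The fallback only makes a checkpoint loadable; as noted above, the saved artifacts still need to be re-emitted on the target stack.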

The pinned flash_attn==2.5.6+cu121torch2.3 wheel matters: later builds either fail with "GLIBC_2.32 not found" or change SDPA fallbacks in ways that shift accuracy.


Quick start

  1. Setup
    pip install uv && uv venv --python 3.11 && uv sync && source .venv/bin/activate
    cp .env.example .env        # fill in SLURM_ACCOUNT / DATASETS_ROOT
  2. Wire datasets: edit data/registry.yaml (eval) and llava/data/registry/datasets/<cluster>.yaml (train). See data/README.md.

Train

# 1 GPU / debugging: sft.sh auto-detects single node and shrinks the batch.
bash scripts/nvila/sft.sh [BASE_MODEL] [RUN_NAME] [DATA_MIXTURE]
# defaults: BASE_MODEL=NVILA-Lite-8B, MIXTURE=sat+vstibench+robofac+wolf

SLURM (multi-node): NUM_NODES=8 bash scripts/slurm/submit.sh scripts/nvila/sft.sh. See scripts/nvila/sft.sh for the full training-arg surface (TPE, L_LD/L_ED weights, region extractor, etc.).

Eval

Released checkpoint: huggingface.co/nvidia/4D-RGPT-8B was trained on transformers==4.46 (v4 stack). Load and run inference under the same stack. Moving the checkpoint to transformers==5.x is not transparent (tokenizer artifacts differ; see Note on transformers version compatibility above). To extend or retrain, stay on the v4 stack as well.
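One way to make the stack requirement explicit is a small guard that runs before the checkpoint is loaded. The sketch below is an assumption about how such a check could look and is not part of the released scripts; is_v4_stack and assert_v4_stack are hypothetical names.

```python
from importlib.metadata import PackageNotFoundError, version


def is_v4_stack(transformers_version: str) -> bool:
    """Return True if the version string belongs to the transformers 4.x line."""
    return transformers_version.split(".")[0] == "4"


def assert_v4_stack() -> None:
    """Fail early if the installed transformers is not a 4.x release."""
    try:
        installed = version("transformers")
    except PackageNotFoundError as exc:
        raise RuntimeError("transformers is not installed") from exc
    if not is_v4_stack(installed):
        raise RuntimeError(
            f"transformers=={installed} found; the released checkpoint "
            "expects the v4 stack (transformers==4.46)"
        )
```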

# 4D-RGPT / NVILA-family
MODEL_PATH=runs/train/4D-RGPT-8B-v4 \
  bash scripts/lmms_eval/tasks/r4d_bench.sh

# Qwen2.5-VL baseline
bash scripts/lmms_eval/models/qwen2_5_vl.sh r4d_bench    # default model

Available tasks: r4d_bench, vlm4d, vstibench (under scripts/lmms_eval/tasks/).

R4D-Bench is the headline benchmark. Eval reports a 12-column accuracy table:

| Col | Meaning |
| --- | --- |
| Avg, Sta, Dyn | overall + static / dynamic splits |
| VG, DM, SR | 3D Video Grounding · Dimensional Measurement · Spatial Relation |
| R, C, T | Rotational · Counting · Translational |
| FP, SA, DP | False Positives · Speed & Acceleration · Displacement & Path Length |

Reference numbers from this codebase (16 frames):

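| Model | # Frames | Avg | Sta | Dyn | VG | DM | SR | R | C | T | FP | SA | DP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVILA-Lite-8B (base) | 16 | 39.0 | 33.0 | 40.8 | 35.9 | 30.1 | 39.0 | 43.4 | 35.7 | 45.2 | 24.4 | 56.8 | 12.5 |
| 4D-RGPT-8B (ours) | 16 | 46.2 | 41.0 | 47.8 | 44.7 | 37.7 | 46.3 | 50.9 | 46.5 | 46.6 | 49.6 | 59.5 | 45.8 |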

Note: numbers differ from the paper. Post-publication, R4D-Bench was refreshed (test split re-curated) and the eval pipeline was migrated to lmms-eval for reproducibility.


Reusing the components in your own model

The three additions drop into any NVILA-shaped MLLM. Each lives in a single file; read the source, the code is short:

1. Latent distillation (L_LD)

Match a small student projector's per-layer features against a frozen L4P teacher's encoder taps (MSE, per-token).

Tuning note: ld_weight ~ 0.01 in our runs. hooks_idx (which teacher layers to distill from) must be the same list in the teacher and the projector β€” keep it in one config, not two.
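The shape of this loss can be illustrated with a framework-free sketch (plain Python lists instead of tensors, so it runs without PyTorch). It is not the repository's implementation: the function names, the dict-of-layers layout, and the averaging over hooked layers are assumptions. Only the per-token MSE, the shared hooks_idx list, and ld_weight ~ 0.01 come from the description above.

```python
from typing import Dict, List, Sequence

Tokens = List[List[float]]  # [num_tokens][feature_dim]


def per_token_mse(student: Tokens, teacher: Tokens) -> float:
    """Squared error averaged over features, then averaged over tokens."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher must have the same token count")
    total = 0.0
    for s_tok, t_tok in zip(student, teacher):
        if len(s_tok) != len(t_tok):
            raise ValueError("feature dimensions must match")
        total += sum((s - t) ** 2 for s, t in zip(s_tok, t_tok)) / len(s_tok)
    return total / len(student)


def latent_distillation_loss(
    student_feats: Dict[int, Tokens],
    teacher_feats: Dict[int, Tokens],
    hooks_idx: Sequence[int],
    ld_weight: float = 0.01,
) -> float:
    """Weighted MSE over the hooked layers.

    The same hooks_idx indexes both feature dicts, mirroring the advice to
    keep that list in a single config. Averaging across layers is an
    assumption of this sketch.
    """
    per_layer = [
        per_token_mse(student_feats[i], teacher_feats[i]) for i in hooks_idx
    ]
    return ld_weight * sum(per_layer) / len(per_layer)
```

Because both dicts are indexed by the same hooks_idx, a mismatch between teacher taps and projector outputs surfaces as a KeyError rather than as a silently misaligned loss.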

2. Explicit distillation (L_ED)

Decode the student's projected features through the teacher's task heads and match per-task predictions (independently weighted).

Tuning note: natural-scale losses diverge ~100× across tasks (flow is unbounded pixel displacement; depth is bounded log-meters), so equal weighting silently lets flow dominate. Our paper run uses depth=0.1, flow_2d_backward=0.001, dyn_mask=0.01.
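The effect of the weights can be checked with a few lines of arithmetic. In the sketch below, the weights are the ones quoted above, while the raw per-task loss values are made-up illustrative numbers chosen only to reproduce a ~100× scale gap; the function names are hypothetical.

```python
from typing import Dict, Mapping

# Weights quoted in the tuning note above.
PAPER_WEIGHTS = {"depth": 0.1, "flow_2d_backward": 0.001, "dyn_mask": 0.01}


def explicit_distillation_loss(
    task_losses: Mapping[str, float], weights: Mapping[str, float]
) -> float:
    """Independently weighted sum of per-task losses."""
    missing = set(weights) - set(task_losses)
    if missing:
        raise KeyError(f"missing task losses: {sorted(missing)}")
    return sum(weights[t] * task_losses[t] for t in weights)


def weighted_shares(
    task_losses: Mapping[str, float], weights: Mapping[str, float]
) -> Dict[str, float]:
    """Fraction of the total loss contributed by each task."""
    total = explicit_distillation_loss(task_losses, weights)
    return {t: weights[t] * task_losses[t] / total for t in weights}


# Illustrative raw losses (NOT measured values): flow is ~100x larger.
raw = {"depth": 0.5, "flow_2d_backward": 50.0, "dyn_mask": 0.5}
equal = {t: 1.0 for t in raw}

equal_shares = weighted_shares(raw, equal)          # flow takes ~98% of the loss
paper_shares = weighted_shares(raw, PAPER_WEIGHTS)  # flow and depth are balanced
```

With these illustrative inputs, equal weighting hands flow roughly 98% of the gradient signal, while the quoted weights bring flow and depth to the same order of magnitude.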

3. Timestamp Positional Encoding (TPE)

Sinusoidal encoding of per-frame timestamps, added to vision tokens before they enter the LLM. Independent of the spatial RoPE / 2D-PE already in the vision encoder. No learned params.
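A parameter-free timestamp encoding of this kind can be sketched as below. The interleaved sin/cos layout and the base of 10000 follow the standard Transformer positional encoding and are assumptions here; the repository may use a different frequency schedule or layout. The sketch uses plain Python lists so it runs without PyTorch, and the function names are hypothetical.

```python
import math
from typing import List


def timestamp_encoding(timestamp: float, dim: int, base: float = 10000.0) -> List[float]:
    """Sinusoidal encoding of one timestamp (interleaved sin/cos pairs)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc: List[float] = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        enc.append(math.sin(timestamp * freq))
        enc.append(math.cos(timestamp * freq))
    return enc


def add_tpe(
    frame_tokens: List[List[List[float]]], timestamps: List[float]
) -> List[List[List[float]]]:
    """Add the frame's timestamp encoding to every vision token of that frame.

    frame_tokens is indexed as [frame][token][feature]; no learned parameters
    are involved.
    """
    if len(frame_tokens) != len(timestamps):
        raise ValueError("need exactly one timestamp per frame")
    out = []
    for tokens, ts in zip(frame_tokens, timestamps):
        enc = timestamp_encoding(ts, len(tokens[0]))
        out.append([[x + e for x, e in zip(tok, enc)] for tok in tokens])
    return out
```

All tokens within a frame receive the same offset, so the encoding carries only temporal information and leaves the spatial encoding from the vision encoder untouched.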


Running on your own cluster

The scripts/slurm/ launchers and scripts/setups/ env helpers are written for our internal NVIDIA cluster (ADLR / NV SLURM). To adapt them:

  • scripts/slurm/submit.sh + scripts/slurm/interactive.sh call the submit_job CLI (NVIDIA-internal wrapper around sbatch). Replace the submit_job ... --autoresume_* --command ... invocation with a plain sbatch heredoc.
  • The partition list polar4,polar3,grizzly,polar in scripts/slurm/submit.sh:45 is cluster-specific; swap in your cluster's partitions.
  • SLURM_ACCOUNT + other env: copy .env.example to .env and fill in.
  • The pyxis container image is resolved by scripts/slurm/_startup.sh (resolve_image). Point this at your own enroot/pyxis image, or skip pyxis entirely and activate the .venv directly in your sbatch script.
  • The autoresume callback at llava/train/callbacks/autoresume_callback.py targets ADLR's AutoResume SDK. If not on that infra, remove the AutoResumeCallback(...) line in llava/train/train.py and drop the import.
  • Dataset paths are now in YAML registries, not in code: data/registry.yaml (eval) and llava/data/registry/datasets/<cluster>.yaml (train), both gitignored. See data/README.md.

Dataset Preparation

R4D-Bench

Please follow our HF dataset instructions here: https://huggingface.co/datasets/nvidia/R4D-Bench.

Standard 4D/Spatial Benchmarks

Please follow the official instructions of the datasets used in the paper: STI-Bench, VLM4D, OmniSpatial, MMSI-Bench, SAT, and VSTI-Bench.

Training mixture for 4D-RGPT-8B: VSTI-Bench (training split), Wolf (NuScenes), RoboFAC, SAT. R4D-Bench is curated from STI-Bench and VLM4D via SoM prompting + human verification. See data/README.md for the registry-yaml wiring.


Citation

If you find our work useful, please consider giving a star and citation:

@inproceedings{yang20264d,
  title={4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation},
  author={Yang, Chiao-An and Hachiuma, Ryo and Liu, Sifei and Radhakrishnan, Subhashree and Yeh, Raymond A and Wang, Yu-Chiang Frank and Chen, Min-Hung},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={31042--31053},
  year={2026}
}

Licenses

Copyright © 2026, NVIDIA Corporation. All rights reserved.

This work is made available under the NVIDIA Source Code License-NC. Click here to view a copy of this license.