- Institute of Automation, Chinese Academy of Sciences
- Beijing
Stars
Official codebase for Fast-WAM: Do World Action Models Need Test-time Future Imagination?
[ICLR 2026] The official implementation of "Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model"
StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing
Dexbotic: Open-Source Vision-Language-Action Toolbox
The implementation of our ICRA 2024 submission "Complementing Onboard Sensors with Satellite Map: A New Perspective for HD Map Construction"
[ICLR'23 Spotlight & ECCV'24 & IJCV'24] MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction
SPAgent, a foundation agent for understanding, reasoning over, and operating within the physical and spatial world.
Wan: Open and Advanced Large-Scale Video Generative Models
A critical analysis of the Cambrian-S model and VSI-Super benchmarks
[ICLR 2026] Streaming 4D Visual Geometry Transformer
A procedural Blender pipeline for photorealistic training image generation
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
[ICCV 2025] PartField: Learning 3D Feature Fields for Part Segmentation and Beyond
[NeurIPS 2025] 3DRS: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
[ICML 2024] LEO: An Embodied Generalist Agent in 3D World
[CVPR 2026] Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
The code for the paper "Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors"
A paper list for spatial reasoning
Collection of papers on human-object-interaction generation
[ICLR 2025] Official Implementation for 3D-AffordanceLLM: Harnessing Large Language Models for Open-Vocabulary Affordance Detection in 3D Worlds
Code release for "Masked-attention Mask Transformer for Universal Image Segmentation"
Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"