Stars
Open-Sora: Democratizing Efficient Video Production for All
State-of-the-art 2D and 3D Face Analysis Project
MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Implementation of Nougat: Neural Optical Understanding for Academic Documents
BoxMOT: Pluggable SOTA multi-object tracking modules for segmentation, object detection, and pose estimation models
Official repository of "SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory"
[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
Data processing for and with foundation models!
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
SAPIEN Manipulation Skill Framework, an open-source, GPU-parallelized robotics simulator and benchmark, led by Hillbot, Inc.
PyTorch implementation of MAR + DiffLoss (https://arxiv.org/abs/2406.11838)
RetinaFace: Deep Face Detection Library for Python
[ICLR & NeurIPS 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
SEED-Voken: A Series of Powerful Visual Tokenizers
Official Algorithm Implementation of ICML'23 Paper "VIMA: General Robot Manipulation with Multimodal Prompts"
[RSS 2025] Learning to Act Anywhere with Task-centric Latent Actions
RoboBrain 2.0: Advanced version of RoboBrain. See Better. Think Harder. Do Smarter.
Official repo and evaluation implementation of VSI-Bench
Low-level locomotion policy training in Isaac Lab
[RSS 2024 & RSS 2025] VLN-CE evaluation code of NaVid and Uni-NaVid
PyTorch implementation of the papers "ARTrack" and "ARTrackV2"
[CoRL 2025] Repository for "TrackVLA: Embodied Visual Tracking in the Wild"
Vision-Language Navigation Benchmark in Isaac Lab
RoboOS: A Universal Embodied Operating System for Cross-Embodied and Multi-Robot Collaboration
Embodied Reasoning Question Answer (ERQA) Benchmark